Article

Visible–Infrared Person Re-Identification via Global Feature Constraints Led by Local Features

1 School of Information Science and Technology, Nantong University, Nantong 226019, China
2 Zhongtian Smart Equipment Co., Ltd., Nantong 226010, China
3 School of Computer and Information Engineering, Nantong Institute of Technology, Nantong 226002, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(17), 2645; https://doi.org/10.3390/electronics11172645
Submission received: 16 July 2022 / Revised: 16 August 2022 / Accepted: 19 August 2022 / Published: 24 August 2022
(This article belongs to the Special Issue Pattern Recognition and Machine Learning Applications)

Abstract:
Smart security is needed for complex scenarios such as all-weather and multi-scene environments, and visible–infrared person re-identification (VI Re-ID) has become a key technique in this field. VI Re-ID is usually modeled as a pattern recognition issue, which faces the problems of inter-modality and intra-modality discrepancies. To alleviate these problems, we designed the Local Features Leading Global Features Network (LoLeG-Net), a representation learning network. Specifically, for cross-modality discrepancies, we employed a combination of ResNet50 and non-local attention blocks to obtain the modality-shareable features and convert the problem to a single-modality person re-identification (Re-ID) problem. For intra-modality variations, we designed global feature constraints led by local features. In this method, the identity loss and hetero-center loss were employed to alleviate intra-modality variations of local features. Additionally, hard sample mining triplet loss combined with identity loss was used, ensuring the effectiveness of global features. With this method, the final extracted global features were much more robust against the background environment, pose differences, occlusion and other noise. The experiments demonstrate that LoLeG-Net is superior to existing works. The result for SYSU-MM01 was Rank-1/mAP 51.40%/51.41%, and the result for RegDB was Rank-1/mAP 76.58%/73.36%.

1. Introduction

Person re-identification (Re-ID) is a technology used to intelligently track pedestrians. It is intended to recognize specified pedestrians among many pedestrian images taken by non-overlapping cameras [1,2,3]. Given a query set, Re-ID is the task of recognizing, within the gallery set, the images containing the same identity [4]. The query set is the set of images to be queried, and the gallery set is a large set of images collected from multiple cameras for recognition. For Re-ID, scholars have proposed supervised [5,6,7,8,9,10,11,12,13,14,15,16], unsupervised and weakly supervised learning schemes [17,18,19,20,21,22,23,24], as well as the recently proposed part-level deep feature learning [25,26,27,28,29], which can achieve a level of accuracy approaching human-eye discrimination [30]. As the performance of Re-ID tasks improves, many scholars have started working on cross-domain Re-ID problems [31,32], which will substantially expand the applicable scenarios of Re-ID technology and improve recognition efficiency.
Most early studies of Re-ID involved single-modality Re-ID, and the images they processed were mostly visible images captured during the daytime. These early studies assumed that the visible images were taken under adequate illumination. However, this assumption is too idealistic for real-world applications of Re-ID technology. For example, criminals generally choose to commit crimes in poorly lit environments or at night, where visible cameras have difficulty capturing high-quality images. Thus, infrared cameras were introduced to help Re-ID capture clear pedestrian images in poor illumination [33]. Currently, most surveillance devices are equipped with both a visible camera and an infrared camera: they can capture high-quality color pedestrian images during the daytime and clear infrared pedestrian images at night or in poor illumination, freeing Re-ID from lighting limitations. Therefore, visible–infrared person re-identification (VI Re-ID) has become a hot topic [34]. VI Re-ID eliminates the limitation of illumination, but encounters many other challenges.
One of the main problems of VI Re-ID is inter-modality discrepancies. Visible samples contain three channels, while the infrared samples contain one channel. Moreover, the differences in wavelength ranges between visible and infrared images create large discrepancies between them, posing a great challenge to the VI Re-ID task.
VI Re-ID also suffers from intra-modality variations, which also occur in the single-modality Re-ID task. In Figure 1, we can observe that one pedestrian has been captured by different cameras; differences in viewpoint, shooting angle, and lighting inevitably lead to images with different backgrounds, poses, occlusions, etc. This reduces the similarity between images of the same pedestrian while increasing the apparent similarity between images of different pedestrians, so that the query process may preferentially match pedestrians whose identities differ from that of the queried pedestrian.
Thus, VI Re-ID still presents a strong challenge. To address challenges in VI Re-ID, we propose the Local Features Leading Global Features Network (LoLeG-Net). The main idea is to combine the residual network ResNet50 and the non-local attention block to obtain modality-shareable features to alleviate inter-modality discrepancies, and we constrain local features through hetero-center loss and identity loss. Then, we use the local features to guide global features that are robust against complex environments, inevitable occlusion, and other noises, alleviating intra-modality variations.
Our main contributions can be found as follows:
  • To alleviate inter-modality discrepancies, LoLeG-Net is proposed to obtain modality-shareable features, converting cross-modality recognition to single-modality recognition;
  • To reduce intra-modality variations, the global feature constraints led by local features are proposed, enabling the final global features used in LoLeG-Net to have the advantages of local features, and enhancing the robustness of LoLeG-Net against background, occlusion and other noise.
The remainder of this paper is organized as follows: the related work on VI Re-ID is introduced in Section 2, LoLeG-Net is presented in Section 3, the experiments on LoLeG-Net are described in Section 4, and the conclusion is given in Section 5.

2. Related Work

Since 2017, deep learning has become popular in VI Re-ID tasks. Nguyen et al. [35] proposed a combined visible and infrared pedestrian image method for Re-ID, employing infrared images in Re-ID tasks for the first time, and they also contributed the RegDB dataset.
Wu et al. [36] disclosed the SYSU-MM01 dataset. They designed three network structures and a deep zero-padding method, which better alleviated cross-modality discrepancies. However, this method does not address the problem of intra-modality variations because it lacks the process of metric learning.
Ye et al. [37] designed a ranking loss based on a dual-constrained structure to reduce inter-modality discrepancies. However, this approach presupposes that training samples and testing samples have the same data distributions, while in VI Re-ID tasks, their data distributions are not the same most of the time.
Wu et al. [38] demonstrated different data distributions of training samples and testing samples in VI Re-ID by the t-SNE dimensionality reduction method. They converted the issue of extracting shared information in cross-modality recognition into the issue of preserving cross-modality similarity by constraining the similarity of samples across modalities. Additionally, they further alleviated cross-modality discrepancies by using modality gate nodes.
Ye et al. [39] built a baseline including non-local attention blocks, a specific average pooling method, and a weighted regularized triplet loss. This baseline enhanced the representation of modality-shared features by acquiring long-range information through the non-local attention block. However, the images contain considerable interference, such as complex backgrounds and occlusion, and this method obtains modality-shared features from global features only, which makes it less robust against noise. Moreover, the non-local attention block focuses only on global correlations and does not consider positional correlations, so noise is also aggregated globally, making it much more difficult for the model to learn discriminative global features.
The extent to which the main problems are resolved via the existing methods described above can be summarized as follows:
  • For inter-modality discrepancies, existing methods mostly extract the shared features of both modalities via global representation learning to convert the cross-modality problem into a single-modality problem. However, global features have poor robustness against noise such as background and occlusion. Local features, by contrast, divide the original features so that each block contains information about only its corresponding region; the noise in that region is likewise divided, which makes local features much more robust against image noise. However, since the cameras capture each pedestrian from different views, selecting an appropriate division strategy becomes a major difficulty for local features.
  • For intra-modality variations, existing schemes focus on metric learning. Triplet loss is one of these schemes to control the distance of the sample [40], while identity loss is employed to measure the classification precision of models. It is mutually beneficial for both of them to be used in representation learning [39]. However, the simple triplet loss will dominate the entire distance metric learning process. Moreover, the triplet loss does not ensure that the samples of the same class between visible and infrared modalities are as close as possible.
We find that the main problems are alleviated to varying degrees with existing schemes. However, robustness against background, occlusion, and other noises needs to be further enhanced. Therefore, a method is needed to enhance robustness against noise and better alleviate the main problems of VI Re-ID.

3. Proposed Method

We propose a representation learning network LoLeG-Net based on a two-stream structure. The structure of LoLeG-Net is shown in Figure 2. First, LoLeG-Net mainly includes three parts: the network for extracting modality-shared features, global features, and local features. Second, metric learning will be used to constrain LoLeG-Net. Specifically, LoLeG-Net is constrained via global feature constraints led by local features, including leading loss and global feature loss. Leading loss constrains local features through hetero-center loss and identity loss. Moreover, global feature loss includes identity loss and hard sample mining triplet loss, which is employed to ensure the effectiveness of global features.
Our LoLeG-Net takes the samples of the two modalities as its two inputs. We presume that we have $n_{RGB}$ visible samples and $n_{IR}$ infrared samples. $I_{RGB}^{i}$ denotes the $i$-th ($i \le n_{RGB}$) visible image and $I_{IR}^{j}$ denotes the $j$-th ($j \le n_{IR}$) infrared image, while $y_{RGB}^{i}$ and $y_{IR}^{j}$ denote the identities corresponding to $I_{RGB}^{i}$ and $I_{IR}^{j}$, respectively. $(I_{RGB}^{i}, y_{RGB}^{i})$ and $(I_{IR}^{j}, y_{IR}^{j})$ indicate the $i$-th visible sample and the $j$-th infrared sample, respectively. Next, we take a visible sample $(I_{RGB}^{i}, y_{RGB}^{i})$ and an infrared sample $(I_{IR}^{j}, y_{IR}^{j})$ of the same pedestrian ($y_{RGB}^{i} = y_{IR}^{j}$) as an example to introduce LoLeG-Net.

3.1. Extracting Shared Features

Before extracting shared features, we need to extract shallow modality-specific features from the visible and infrared images. Specifically, the shallow modality-specific features $f_{Specific}^{i,RGB}$ and $f_{Specific}^{j,IR}$ are extracted from the visible and infrared samples $(I_{RGB}^{i}, y_{RGB}^{i})$ and $(I_{IR}^{j}, y_{IR}^{j})$ via corresponding convolution layers. The two convolution layers of the two modalities have the same structure but different parameters, which represent the different modalities.
Then, the shallow specific features of the two modalities are fed into a network block with the same structure and shared parameters, yielding the modality-shareable features, marked $f_{Shared}^{i,RGB}$ and $f_{Shared}^{j,IR}$. The network block consists of ResNet50 and non-local attention blocks.
The second and third layers of ResNet50 are followed by non-local attention blocks, whose components can be seen in Figure 3. First, the input $X$ is separately reduced to half of the original channels through two $1 \times 1$ convolution operations to obtain $\theta(X)$ and $\phi(X)$. Then, the similarity of $X$ with respect to itself is measured by calculating the dot-product similarity, as shown in Equation (1).
$f(x_i, x_j) = \theta(x_i)^{T}\, \phi(x_j), \quad x_i, x_j \in X$  (1)
After that, the dot-product similarity matrix is multiplied with $g(X)$, which is obtained by reducing the channels of $X$ with another $1 \times 1$ convolution. The result is passed through a $1 \times 1$ convolution to restore the same number of channels as $X$ and then fed into a BN layer to obtain $W(y)$. Finally, the output $Z$ is obtained by adding $W(y)$ to $X$.
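The following PyTorch sketch illustrates one such block. The half-channel projections $\theta$, $\phi$ and $g$, the channel-restoring $1 \times 1$ convolution with BN, and the residual addition follow the description above; the softmax normalization of the dot-product similarity is an assumption (the embedded-Gaussian variant of the non-local block), not necessarily the exact normalization used in our implementation.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Sketch of the non-local attention block in Figure 3 (dot-product similarity)."""

    def __init__(self, in_channels: int):
        super().__init__()
        inter = in_channels // 2                       # theta/phi/g halve the channel count
        self.theta = nn.Conv2d(in_channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, inter, kernel_size=1)
        self.g = nn.Conv2d(in_channels, inter, kernel_size=1)
        # 1x1 convolution restores the original channel count, followed by BN -> W(y)
        self.w = nn.Sequential(
            nn.Conv2d(inter, in_channels, kernel_size=1),
            nn.BatchNorm2d(in_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c/2)
        phi = self.phi(x).flatten(2)                        # (b, c/2, hw)
        g = self.g(x).flatten(2).transpose(1, 2)            # (b, hw, c/2)
        attn = torch.softmax(theta @ phi, dim=-1)           # Equation (1), softmax-normalized (assumption)
        y = (attn @ g).transpose(1, 2).reshape(b, c // 2, h, w)
        return x + self.w(y)                                 # output Z = W(y) + X
```

For instance, a `NonLocalBlock(512)` instance could be appended after the second ResNet50 stage, whose output has 512 channels.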

3.2. Extracting Global Features

The global features finally adopted in our method are extracted from $f_{Shared}^{i,RGB}$ and $f_{Shared}^{j,IR}$.
First, the modality-shared features of the two modalities are pooled by global average pooling, and from the result of this operation we obtain the quasi-global features $f_{GlobalR}^{i,RGB}$ and $f_{GlobalR}^{j,IR}$.
Then, to avoid the vanishing-gradient problem, the quasi-global features $f_{GlobalR}^{i,RGB}$ and $f_{GlobalR}^{j,IR}$ are fed into a batch normalization layer so that their distribution is close to a normal distribution, and the final global features $f_{Global}^{i,RGB}$ and $f_{Global}^{j,IR}$ are obtained.
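A minimal sketch of this global branch, assuming the 2048-channel output of the ResNet50 backbone:

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Global average pooling followed by batch normalization (Section 3.2 sketch)."""

    def __init__(self, channels: int = 2048):           # 2048 assumed (ResNet50 final stage)
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, f_shared: torch.Tensor) -> torch.Tensor:
        f_quasi = self.pool(f_shared).flatten(1)         # quasi-global feature f_GlobalR
        return self.bn(f_quasi)                          # final global feature f_Global
```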

3.3. Extracting Local Features

From the modality-shareable features $f_{Shared}^{i,RGB}$ and $f_{Shared}^{j,IR}$, we obtain the local features.
First, the modality-shareable features $f_{Shared}^{i,RGB}$ and $f_{Shared}^{j,IR}$ are passed through a $1 \times 1$ convolution that decreases the channels to one quarter of the original, giving $f_{SharedQ}^{i,RGB}$ and $f_{SharedQ}^{j,IR}$. Then, the channel-reduced shared features are divided into four parts according to the height, and the local feature groups $f_{LocalR}^{i,RGB} = \{f_{LocalBlockR}^{i,RGB,p}\}_{p=1}^{4}$ and $f_{LocalR}^{j,IR} = \{f_{LocalBlockR}^{j,IR,q}\}_{q=1}^{4}$, each composed of four blocks, are obtained. Finally, the four blocks are passed through the batch normalization layer to extract $f_{Local}^{i,RGB} = \{f_{LocalBlock}^{i,RGB,p}\}_{p=1}^{4}$ and $f_{Local}^{j,IR} = \{f_{LocalBlock}^{j,IR,q}\}_{q=1}^{4}$.
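A sketch of this local branch is given below. The $1 \times 1$ channel reduction, the four-way split along the height, and the per-block batch normalization follow the description above; pooling each stripe into a vector before BN is an assumption, made so that each block can later be compared with the vector-valued global feature.

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """1x1 channel reduction, 4 horizontal stripes, per-stripe pooling and BN (Section 3.3 sketch)."""

    def __init__(self, channels: int = 2048, parts: int = 4):
        super().__init__()
        reduced = channels // 4                          # channels reduced to one quarter
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)
        self.bns = nn.ModuleList([nn.BatchNorm1d(reduced) for _ in range(parts)])
        self.parts = parts

    def forward(self, f_shared: torch.Tensor):
        f_q = self.reduce(f_shared)                      # f_SharedQ
        stripes = torch.chunk(f_q, self.parts, dim=2)    # split along the height axis
        # pool each stripe to a vector (assumption), then batch-normalize it
        return [bn(s.mean(dim=(2, 3))) for bn, s in zip(self.bns, stripes)]
```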

3.4. Global Feature Constraints Led by Local Features

LoLeG-Net is constrained via metric learning. Specifically, it is constrained via global feature constraints led by local features, including leading loss and a global feature loss.

3.4.1. Leading Loss Function

The leading loss implements the constraints on global features led by local ones, so that the global features can inherit the strong robustness of the local features against noise. First, we must ensure that the local features are valid, so that they provide correct guidance for the global features. Next, we must ensure that the distances between local features of one identity from the two modalities are as small as possible, to further reduce intra-modality variations. Finally, the global features are made as close as possible to the local features, so that the local features act as guides and increase the robustness of the global features against noise.
To ensure the validity of the local features, we employ identity loss to measure the classification accuracy of the model and extend its scope to the visible and infrared modalities. We presume that we randomly pick $P$ pedestrians and, for each of them, randomly select $K$ visible and $K$ infrared samples. In this way, $2PK$ samples are picked to build the mini-batch of training samples. The identity loss function is shown in Equation (2).
$L_{Identity}(f) = -\frac{1}{2PK} \sum_{i=1}^{2PK} \log p\left(y_i \mid f_i\right)$  (2)
In Equation (2), $f$ represents the features and $p(y_i \mid f_i)$ represents the probability that $f_i$ is identified as $y_i$. The identity loss is calculated separately for each block of the local features $f_{Local}^{i,RGB} = \{f_{LocalBlock}^{i,RGB,p}\}_{p=1}^{4}$ and $f_{Local}^{j,IR} = \{f_{LocalBlock}^{j,IR,q}\}_{q=1}^{4}$, as shown in Equation (3).
$L_{LocalId} = \sum_{n=1}^{4} L_{Identity}\left(f_{LocalBlock}^{i,RGB,n}, f_{LocalBlock}^{j,IR,n}\right)$  (3)
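A sketch of Equations (2) and (3) is shown below, assuming one hypothetical linear classifier per local block; PyTorch's cross-entropy implements the averaged negative log-likelihood of Equation (2).

```python
import torch
import torch.nn as nn

class LocalIdentityLoss(nn.Module):
    """Identity (cross-entropy) loss summed over the four local blocks (Equations (2)-(3) sketch)."""

    def __init__(self, feat_dim: int, num_classes: int, parts: int = 4):
        super().__init__()
        # one classifier per block; num_classes = number of pedestrian identities (assumed parameter)
        self.classifiers = nn.ModuleList([nn.Linear(feat_dim, num_classes) for _ in range(parts)])
        self.ce = nn.CrossEntropyLoss()                  # -(1/2PK) * sum log p(y_i | f_i)

    def forward(self, local_blocks, labels: torch.Tensor) -> torch.Tensor:
        # local_blocks: four (2PK, feat_dim) tensors holding the visible and infrared
        # samples of the mini-batch; labels: (2PK,) identity labels
        return sum(self.ce(clf(f), labels) for clf, f in zip(self.classifiers, local_blocks))
```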
After ensuring the validity of the local features, the intra-modality variations need to be further alleviated. We employ the hetero-center loss to make the distances between local features of the same identity from different modalities as small as possible. As for the identity loss on local features, we assume that $P$ pedestrians are randomly picked and that $K$ visible and $K$ infrared samples are randomly selected for each of them, so that $2PK$ samples build the mini-batch of training samples. The hetero-center loss is shown in Equation (4).
$L_{HC} = \frac{1}{4} \sum_{n=1}^{4} \left\| l_2\left(f_{LocalBlockR}^{i,RGB,n}\right) - l_2\left(f_{LocalBlockR}^{j,IR,n}\right) \right\|_2^2$  (4)
where $l_2(f)$ denotes L2-norm normalization. Assuming the feature is $f = (x_1, x_2, \dots, x_n)$, $l_2(f)$ can be expressed as Equation (5).
$l_2(f) = \frac{f}{\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}}$  (5)
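The sketch below follows Equations (4) and (5). Aggregating the same-identity samples of each modality into centers (in the spirit of the hetero-center loss of [25]) and summing over the identities in the mini-batch are assumptions for illustration; with $K = 1$ it reduces to the paired form of Equation (4).

```python
import torch
import torch.nn.functional as F

def hetero_center_loss(vis_blocks, ir_blocks, labels):
    """Hetero-center-style loss over the four local blocks (Equations (4)-(5) sketch).

    vis_blocks / ir_blocks: lists of four (P*K, d) tensors of local features;
    labels: (P*K,) identity labels, in the same order for both modalities (assumption).
    """
    loss = 0.0
    for f_v, f_i in zip(vis_blocks, ir_blocks):
        f_v = F.normalize(f_v, p=2, dim=1)               # l2(f), Equation (5)
        f_i = F.normalize(f_i, p=2, dim=1)
        for pid in labels.unique():
            mask = labels == pid
            c_v = f_v[mask].mean(dim=0)                  # visible center of this identity
            c_i = f_i[mask].mean(dim=0)                  # infrared center of this identity
            loss = loss + (c_v - c_i).pow(2).sum()       # squared distance between the centers
    return loss / 4.0                                    # average over the four blocks
```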
After ensuring the reliability of the local features and reducing intra-modality variations, the local features are employed to guide the global features, allowing the merits of the local features to be transferred to the global features. An L2-norm loss is employed to evaluate the correlation between global and local features by calculating their 2-norm distance, so that the global features can be guided by the local features. The L2-norm loss is expressed in Equation (6).
$L_{L2Loss} = \left\| f_{Local}^{i,RGB} - f_{Global}^{i,RGB} \right\|_2^2 + \left\| f_{Local}^{j,IR} - f_{Global}^{j,IR} \right\|_2^2$  (6)
In this case, the four blocks of the local features are stitched together to lead the global features.
Ultimately, the leading loss function is obtained, as shown in Equation (7).
$L_{Lead} = L_{LocalId} + L_{HC} + L_{L2Loss}$  (7)
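Putting the pieces together, a sketch of Equations (6) and (7) is given below; it reuses the hypothetical `LocalIdentityLoss` and `hetero_center_loss` helpers from the earlier sketches, and the concatenated local blocks are assumed to have the same dimension as the global feature.

```python
import torch

def l2_lead_loss(local_blocks, f_global):
    """Equation (6) sketch: squared L2 distance between the stitched local blocks
    and the global features of the same samples (batch averaging is an assumption)."""
    f_local = torch.cat(local_blocks, dim=1)              # stitch the four blocks together
    return (f_local - f_global).pow(2).sum(dim=1).mean()

def leading_loss(vis_blocks, ir_blocks, g_vis, g_ir, labels, id_loss, hc_loss):
    """Equation (7): L_Lead = L_LocalId + L_HC + L_L2Loss (sketch)."""
    l_id = id_loss(
        [torch.cat([v, i], dim=0) for v, i in zip(vis_blocks, ir_blocks)],
        torch.cat([labels, labels], dim=0),
    )
    l_hc = hc_loss(vis_blocks, ir_blocks, labels)
    l_l2 = l2_lead_loss(vis_blocks, g_vis) + l2_lead_loss(ir_blocks, g_ir)
    return l_id + l_hc + l_l2
```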

3.4.2. Global Feature Loss Function

To ensure the validity of the global features, we employ identity loss to measure the classification accuracy of the model and hard sample mining triplet loss to further alleviate the intra-class variations of the global features, extending both to the VI Re-ID task. As with the local features, we assume that $P$ pedestrians are randomly picked and that $K$ visible and $K$ infrared samples are randomly selected for each of them, so that $2PK$ samples build the mini-batch of training samples.
The hard sample mining triplet loss is shown in Equation (8).
$L_{TriVI}(f) = \frac{1}{2PK} \sum_{i=1}^{P} \sum_{a=1}^{2K} \left[ \max_{p=1,\dots,2K} D\left(f_a^i, f_p^i\right) - \min_{\substack{j=1,\dots,P;\ n=1,\dots,2K \\ j \neq i}} D\left(f_a^i, f_n^j\right) + \rho \right]_{+}$  (8)
Here, $f$ represents the features and $D(\cdot,\cdot)$ denotes the distance between two features. The anchor feature $f_a^i$ is selected from the samples of both modalities; $f_p^i$ shares the identity of $f_a^i$, while $f_n^j$ has a different identity. The hard sample mining triplet loss traverses all images in the mini-batch and, for each anchor, finds the hardest positive and the hardest negative sample, which are combined with the anchor to form a hard triplet. The triplet loss formed in this way effectively alleviates intra-class variation.
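A batch-hard sketch of Equation (8) is given below, assuming the Euclidean distance for $D$:

```python
import torch

def hard_mining_triplet_loss(features, labels, margin: float = 0.3):
    """Hard sample mining triplet loss over the pooled visible+infrared mini-batch (Equation (8) sketch).

    features: (2PK, d) global features; labels: (2PK,) identity labels.
    """
    dist = torch.cdist(features, features, p=2)                # pairwise distances D
    same = labels.unsqueeze(0) == labels.unsqueeze(1)           # same-identity mask, (2PK, 2PK)
    # hardest positive: the farthest sample sharing the anchor's identity
    d_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    # hardest negative: the closest sample with a different identity
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.relu(d_pos - d_neg + margin).mean()            # [.]_+ hinge, averaged over anchors
```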
Finally, the global feature loss function can be expressed as shown in Equation (9).
$L_{Global} = L_{TriVI}\left(\{f_{Global}^{i,RGB}\}_{i=1}^{K}, \{f_{Global}^{j,IR}\}_{j=1}^{K}\right) + L_{IdVI}\left(\{f_{Global}^{i,RGB}\}_{i=1}^{K}, \{f_{Global}^{j,IR}\}_{j=1}^{K}\right)$  (9)

3.4.3. Total Loss Function

Ultimately, the loss function proposed in this paper for global feature constraints led by local features can be expressed in the form of Equation (10).
$L_{LoLeG\text{-}Net} = L_{Lead} + L_{Global}$  (10)

4. Experiment

We evaluate LoLeG-Net by experiments. First, we concisely describe the datasets and evaluation metrics to be employed in the experiments. Second, we perform detailed comparison experiments and ablation experiments on LoLeG-Net and analyze the segmentation strategy of local features.

4.1. Datasets

The SYSU-MM01 [36] and RegDB [35] datasets of VI Re-ID tasks will be used in our experiments.

4.1.1. SYSU-MM01

The SYSU-MM01 dataset is the first published cross-modality person re-identification dataset. The pedestrian images are collected by four RGB and two near-infrared cameras. In total, the training set contains 22,258 visible and 11,909 infrared images of 395 identities, and the testing set includes 96 identities, with 3808 infrared images for the query and 301 visible images as the gallery set. There are two evaluation modes because the images are captured in both indoor and outdoor environments. In All-Search mode, the gallery set contains all visible images. In Indoor-Search mode, the gallery set contains only the visible images captured by the two indoor visible cameras. For both modes, we adopt the single-shot setting and repeat 10 evaluation trials with random splits of the gallery and probe sets to obtain reliable experimental results.

4.1.2. RegDB

This dataset includes 412 identities, each with 10 visible images and 10 infrared images. In the experiment, two retrieval modes are employed: retrieving infrared images with visible queries (visible-to-infrared, V2I) and retrieving visible images with infrared queries (infrared-to-visible, I2V). The training and testing sets are selected via 10 random splits, and the average accuracy is reported.

4.2. Evaluation Metrics

For fairness, following the approach of existing work, the cumulative matching characteristic (CMC) is employed in this experiment. Rank-k in the CMC indicates the probability that a correct match appears within the top-k retrieval results. We also employ the mean average precision (mAP), which captures the average retrieval performance of the method. In addition, the mean inverse negative penalty (mINP) is used to check the model's ability to recognize the most difficult samples, where the most difficult sample is the last correct match appearing in the ranked list of recognition results.
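For concreteness, the following sketch computes the Rank-1 hit, the average precision and the inverse negative penalty for a single query from its ranked gallery list; the function name and interface are illustrative rather than the actual evaluation code.

```python
import numpy as np

def evaluate_single_query(ranked_gallery_ids, query_id):
    """Rank-1 hit, AP and INP for one query, given the gallery identities sorted by similarity."""
    matches = (np.asarray(ranked_gallery_ids) == query_id).astype(np.int32)
    good_idx = np.where(matches == 1)[0]                 # positions of correct matches (0-indexed)
    if good_idx.size == 0:
        return None                                      # no ground truth in the gallery
    rank1_hit = bool(matches[0])                         # contributes to CMC Rank-1
    # average precision: precision at every correct position, averaged (contributes to mAP)
    precisions = np.cumsum(matches)[good_idx] / (good_idx + 1)
    ap = precisions.mean()
    # inverse negative penalty: share of correct results up to the hardest (last) match (contributes to mINP)
    inp = good_idx.size / (good_idx[-1] + 1)
    return rank1_hit, ap, inp
```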

4.3. Experimental Configuration

The batch size of this experiment is eight identities. For each identity in a batch, four images are randomly selected from each of the two modalities, giving 64 images per batch. We expand the channels of the infrared images to the same three channels as the visible images. The images are finally cropped to 256 × 128, and image normalization follows the normalization standards of ImageNet.
In this experiment, we employ $L_{LoLeG\text{-}Net}$ as the training constraint, where $K = 4$, $\rho = 0.3$, and $P = 8$. The initial learning rate is 0.01, and the model is trained for 80 epochs. The variation of the learning rate $Learning\_rate(epoch)$ with the epoch is shown in Equation (11).
$Learning\_rate(epoch) = \begin{cases} 0.01 \times epoch, & 0 < epoch \le 10 \\ 0.1, & 10 < epoch \le 20 \\ 0.01, & 20 < epoch \le 50 \\ 0.001, & 50 < epoch \le 80 \end{cases}$  (11)
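A direct transcription of Equation (11) as a helper function (epochs are assumed to be 1-indexed); in practice, the returned value would be assigned to the optimizer's learning rate at the start of each epoch.

```python
def learning_rate(epoch: int) -> float:
    """Warm-up followed by step decay, as in Equation (11)."""
    if epoch <= 10:
        return 0.01 * epoch          # linear warm-up
    if epoch <= 20:
        return 0.1
    if epoch <= 50:
        return 0.01
    return 0.001                     # 50 < epoch <= 80
```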

4.4. Comparison Experiments

We select some existing schemes for comparison with our method to verify the superiority of LoLeG-Net, consisting of zero-padding [36], one-stream [36], two-stream [36], BDTR [37], eBDTR [41], HCML [42], AlignGAN [43], HI-CMD [44], and AGW [39].

4.4.1. Performance on SYSU-MM01

The performance in the two modes on the SYSU-MM01 dataset is shown in Table 1 and Table 2.
HCML [42] learns modality-shareable features by improving the two-stream network. BDTR [37] and eBDTR [41] improve the loss function and distance metrics based on the two-stream network to optimize the distances between sample images in the feature space. AlignGAN [43] and Hi-CMD [44] use GANs to mitigate the differences between pedestrian images of different modalities. AGW [39] builds a scheme including attention blocks, a specific average pooling method, and the weighted regularized triplet loss to enhance the representation of modality-shared features. The experimental results show that LoLeG-Net performs better than BDTR [37]: Rank-1 and mAP increase by 24.08% and 24.09% in the All-Search mode, and by 24.71% and 22.83% in the Indoor-Search mode. This indicates that the non-local attention blocks allow the modality-shared features to capture global context.
In the comparison experiments in Table 1 and Table 2, the best method other than LoLeG-Net is AGW [39]. Compared with AGW, the Rank-1, mAP, and mINP of LoLeG-Net increase by 3.90%, 3.76%, and 3.43% in the All-Search mode, and by 2.46%, 1.72%, and 1.46% in the Indoor-Search mode.
For a further comparison with AGW, we test the visual recognition results of AGW and LoLeG-Net on SYSU-MM01. We randomly pick three samples as a query set to test the two methods, and the performance can be seen in Figure 4. A correct recognition is marked with a green border, and a wrong recognition is marked with a red border. AGW considers only global features and therefore has inferior robustness against image noise. Each block of the local features designed in this paper includes only the image knowledge of its corresponding region, which alleviates the effect of noise on the samples. By using global feature constraints led by local features, LoLeG-Net can be considered robust against noise.

4.4.2. Performance on RegDB

Table 3 and Table 4 show the comparison of two modes for RegDB.
Compared with AlignGAN [43], our method does not employ a GAN, which introduces extra noise, but instead proposes global feature constraints led by local features, so that the features obtained by LoLeG-Net are more robust against noise. In V2I mode, the Rank-1 and mAP of LoLeG-Net increase by 18.68% and 19.76%, respectively, while in I2V mode they increase by 18.20% and 18.48%, respectively.
In the results of Table 3 and Table 4, the best method other than our method is AGW [39]. In V2I mode, the Rank-1, the mAP, and the mINP of the method in this paper increase by 6.53%, 6.99%, and 12.09%, respectively, while they increase by 4.01%, 5.98%, and 8.71%, respectively, in I2V mode.
For further comparison with AGW, we test the visual recognition results of AGW and LoLeG-Net on RegDB. We randomly pick three examples for comparison, as shown in Figure 5 and Figure 6.
In Figure 5, the images to be queried contain rich knowledge on color and texture but also contain background, occlusion and other noise. The matching results of the AGW algorithm considering only global features do not perform well. In contrast, the matching results of LoLeG-Net are excellent, which further indicates that the global feature constraints led by the local features method used in LoLeG-Net have better robustness against background, occlusion, and other noise.
In Figure 6, the images to be queried are infrared images. The recognition difficulty is greatly increased because infrared images have no color information. In this mode, AGW pays attention to poses, which results in poor performance. LoLeG-Net is able to match the corresponding visible pedestrian images based on the limited pedestrian structure information, which further shows that LoLeG-Net builds an enhanced correlation between different modalities in the same pedestrian identity.
We find that LoLeG-Net outperforms AGW.

4.5. Ablation Experiments

We designed experiments to test the validity of each part of LoLeG-Net. The baseline used in this experiment is from the lightweight Re-ID library named Open-ReID, which was also used for comparison by Luo et al. [45]. Starting from this baseline, the modules proposed in this paper are added in turn and evaluated on the SYSU-MM01 dataset. In this way, the enhancement brought by each part to the VI Re-ID task can be clearly demonstrated, as shown in Table 5.
Model A is the baseline. In the VI Re-ID task, Model A performs representation learning by obtaining the modality-shared features of both modalities based on the ResNet50 structure and metric learning.
Model B utilizes the method of non-local attention blocks based on Model A. The blocks consider all details about the global correlation, and are able to obtain global-rich features. Table 5 shows that Model B outperforms Model A, indicating the validity of the non-local attention blocks.
Model C utilizes the method of global feature constraints led by local features based on Model A. Each local feature block includes only the image knowledge of the corresponding block, which alleviates the noise effect and uses global feature constraints led by local features to increase the robustness of the model against noise. Table 5 shows that Model C outperforms Model A, indicating the validity of global feature constraints led by local features.
Model D is LoLeG-Net, which adds both global feature constraints led by local features and non-local attention blocks on the basis of the baseline. Thus, compared to the baseline, LoLeG-Net is able to obtain global-rich modality-shareable features on the one hand, and absorb the advantages of local features which make LoLeG-Net robust against background, occlusion and other noise on the other. From Table 5, Model D outperforms the other three models, which shows the validity of the combination of non-local attention blocks and global feature constraints led by local features. Moreover, since Model D outperforms Model B, the global feature constraints led by local features improve the performance of the method based on non-local attention blocks, alleviating the problem that non-local attention blocks have no location correlation.
The local feature segmentation strategy used in this paper has four equal parts. To evaluate the validity of the segmentation strategy, strategies using one (i.e., no local features), two, four, and eight equal parts were evaluated, as shown in Table 6.
From Table 6, it can be observed that the four-equal-part performance was better than that of the others, so we finally adopted the four-equal-part strategy.

5. Conclusions

We reduced inter-modality discrepancies by employing non-local attention blocks to obtain modality-shared features with long-range dependencies. We reduced intra-modality variations with the hetero-center loss and the identity loss. In addition, we employed the hard sample mining triplet loss and the identity loss to ensure the effectiveness of the global features. Because of this, LoLeG-Net has strong robustness against the background environment, occlusion, and other noise. The experiments showed that LoLeG-Net is superior to existing schemes, with Rank-1 reaching 51.40% on the SYSU-MM01 dataset and 76.58% on the RegDB dataset. In future VI Re-ID studies, we will investigate the connection between local and global levels to further enhance performance, and start working on cross-domain tasks.

Author Contributions

Conceptualization, J.W., K.J. and T.Z.; methodology, J.W., K.J. and T.Z.; software, J.W., K.J., T.Z. and X.G.; validation, J.W., K.J., T.Z. and X.L.; formal analysis, J.W., K.J., X.G. and G.L.; investigation, J.W., K.J., G.L. and X.L.; resources, J.W., X.G. and G.L.; data curation, J.W., K.J. and T.Z.; writing—original draft preparation, J.W., K.J., T.Z., X.G., G.L. and X.L.; writing—review and editing, J.W., K.J., T.Z. and X.G.; visualization, J.W., K.J., T.Z. and G.L.; supervision, J.W.; project administration, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Nantong Science Foundation Grant (2022-193).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, Q.; Huang, H.; Zhong, Y.; Min, W.; Han, Q.; Xu, D.; Xu, C. Swin Transformer Based on Two-Fold Loss and Background Adaptation Re-Ranking for Person Re-Identification. Electronics 2022, 11, 1941. [Google Scholar] [CrossRef]
  2. Liu, H.; Cheng, J.; Wang, W.; Su, Y.; Bai, H. Enhancing the Discriminative Feature Learning for Visible-Thermal Cross-Modality Person Re-Identification. Neurocomputing 2020, 398, 11–19. [Google Scholar] [CrossRef] [Green Version]
  3. Wang, H.; Wu, L.; Chen, F.; Ding, Z.; Yin, Y.; Dai, C. Common-covariance based person re-identification model. Pattern Recognit. Lett. 2021, 146, 77–82. [Google Scholar] [CrossRef]
  4. Chen, Y.; Wan, L.; Li, Z.; Jing, Q.; Sun, Z. Neural Feature Search for RGB-Infrared Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 587–597. [Google Scholar]
  5. Xiao, T.; Li, H.; Ouyang, W.; Wang, X. Learning Deep Feature Representations with Domain Guided Dropout for Person Re-identification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 23–30 June 2016. [Google Scholar]
  6. Zhang, Z.; Lan, C.; Zeng, W.; Chen, Z. Densely Semantically Aligned Person Re-Identification. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 667–676. [Google Scholar]
  7. Zhao, Y.; Lin, J.; Xuan, Q.; Xi, X. HPILN: A feature learning framework for cross-modality person re-identification. Inst. Eng. Technol. 2019, 13, 2897–2904. [Google Scholar] [CrossRef]
  8. Zheng, M.; Karanam, S.; Wu, Z.; Radke, R.J. Re-Identification with Consistent Attentive Siamese Networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5728–5737. [Google Scholar]
  9. Paisitkriangkrai, S.; Shen, C.; Hengel, A. Learning to rank in person re-identification with metric ensembles. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1846–1855. [Google Scholar]
  10. Liao, S.; Hu, Y.; Zhu, X.; Li, S.Z. Person Re-identification by Local Maximal Occurrence Representation and Metric Learning. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2197–2206. [Google Scholar]
  11. Bai, S.; Tang, P.; Torr, P.H.; Latecki, L.J. Re-Ranking via Metric Fusion for Object Retrieval and Person Re-Identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 740–749. [Google Scholar]
  12. Zheng, Z.; Zheng, L.; Yang, Y. A Discriminatively Learned CNN Embedding for Person Reidentification. ACM Trans. Multimed. Comput. Commun. Appl. 2017, 14, 1–20. [Google Scholar] [CrossRef]
  13. Wang, Z.; Wang, Z.; Zheng, Y.; Chuang, Y.; Satoh, S. Learning to Reduce Dual-Level Discrepancy for Infrared-Visible Person Re-Identification. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 618–626. [Google Scholar]
  14. Dai, Z.; Chen, M.; Gu, X.; Zhu, S.; Tan, P. Batch DropBlock Network for Person Re-identification and Beyond. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
  15. Sun, Y.; Xu, Q.; Li, Y.; Zhang, C.; Li, Y.; Wang, S.; Sun, J. Perceive Where to Focus: Learning Visibility-Aware Part-Level Features for Partial Person Re-Identification. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 393–402. [Google Scholar]
  16. Wu, W.; Tao, D.; Li, H.; Yang, Z.; Cheng, J. Deep features for person re-identification on metric learning. Pattern Recognit. 2021, 110, 107424. [Google Scholar] [CrossRef]
  17. Song, Y.; Liu, S.; Yu, S.; Zhou, S. Adaptive Label Allocation for Unsupervised Person Re-Identification. Electronics 2022, 11, 763. [Google Scholar] [CrossRef]
  18. Wang, H.; Gong, S.; Xiang, T. Unsupervised Learning of Generative Topic Saliency for Person Re-identification. In Proceedings of the British Machine Vision Association, Nottingham, UK, 1–5 September 2014; pp. 1–11. [Google Scholar]
  19. Zhao, F.; Liao, S.; Xie, G.; Zhao, J.; Zhang, K.; Shao, L. Unsupervised Domain Adaptation with Noise Resistible Mutual-Training for Person Re-identification. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 526–544. [Google Scholar]
  20. Jin, X.; Lan, C.; Zeng, W.; Chen, Z. Global Distance-distributions Separation for Unsupervised Person Re-identification. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 735–751. [Google Scholar]
  21. Wang, J.; Zhu, X.; Gong, S.; Wei, L. Transferable Joint Attribute-Identity Deep Learning for Unsupervised Person Re-Identification. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 719–728. [Google Scholar]
  22. Yang, Q.; Yu, H.-X.; Zheng, W.S. Patch-based Discriminative Feature Learning for Unsupervised Person Re-identification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3633–3642. [Google Scholar]
  23. Yu, H.; Zheng, W.; Wu, A.; Guo, X.; Gong, S.; Lai, J. Unsupervised Person Re-Identification by Soft Multilabel Learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2143–2152. [Google Scholar]
  24. Meng, J.; Wu, S.; Zheng, W. Weakly Supervised Person Re-Identification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 760–769. [Google Scholar]
  25. Zhu, Y.; Yang, Z.; Wang, L.; Zhao, S.; Hu, X.; Tao, D. Hetero-Center Loss for Cross-Modality Person Re-Identification. Neurocomputing 2020, 386, 97–109. [Google Scholar] [CrossRef] [Green Version]
  26. Liu, H.; Chai, Y.; Tan, X.; Li, D.; Zhou, X. Strong but Simple Baseline with Dual-Granularity Triplet Loss for Visible-Thermal Person Re-Identification. IEEE Signal Process. Lett. 2021, 28, 653–657. [Google Scholar] [CrossRef]
  27. Zheng, F.; Deng, C.; Sun, X.; Jiang, X.; Guo, X.; Yu, Z.; Huang, F.; Ji, R. Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8506–8514. [Google Scholar]
  28. Chen, B.; Deng, W.; Hu, J. Mixed High-Order Attention Network for Person Re-Identification. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 371–381. [Google Scholar]
  29. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline). In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  30. Ye, M.; Shen, J.; Crandall, D.J.; Shao, L.; Luo, J. Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 229–247. [Google Scholar]
  31. Zhou, S.; Wang, Y.; Zhang, F.; Wu, J. Cross-view similarity exploration for unsupervised cross-domain person re-identification. Neural Comput. Appl. 2021, 33, 4001–4011. [Google Scholar] [CrossRef]
  32. Delussu, R.; Putzu, L.; Fumera, G.; Roli, F. Online Domain Adaptation for Person Re-Identification with a Human in the Loop. In Proceedings of the 25th International Conference on Pattern Recognition, Milan, Italy, 10–15 January 2021; pp. 3829–3836. [Google Scholar]
  33. Lu, Y.; Wu, Y.; Liu, B.; Zhang, T.; Li, B.; Chu, Q.; Yu, N. Cross-modality Person re-identification with Shared-Specific Feature Transfer. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13379–13389. [Google Scholar]
  34. Song, C.; Huang, Y.; Ouyang, W.; Wang, L. Mask-Guided Contrastive Attention Model for Person Re-identification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1179–1188. [Google Scholar]
  35. Nguyen, D.T.; Hong, H.G.; Kim, K.W.; Park, K.R. Person Recognition System Based on a Combination of Body Images from Visible Light and Thermal Cameras. Sensors 2017, 17, 605. [Google Scholar] [CrossRef] [PubMed]
  36. Wu, A.; Zheng, W.; Yu, H.; Gong, S.; Lai, J. RGB-Infrared Cross-Modality Person Re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5380–5389. [Google Scholar]
  37. Ye, M.; Wang, Z.; Lan, X.; Yuen, P. Visible Thermal Person Re-Identification via Dual-Constrained Top-Ranking. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 1092–1099. [Google Scholar]
  38. Wu, A.; Zheng, W.-S.; Gong, S.; Lai, J. RGB-IR Person Re-identification by Cross-Modality Similarity Preservation. Int. J. Comput. Vis. 2020, 128, 1765–1785. [Google Scholar] [CrossRef]
  39. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep Learning for Person Re-identification: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893. [Google Scholar] [CrossRef] [PubMed]
  40. Cai, X.; Liu, L.; Zhu, L.; Zhang, H. Dual-modality hard mining triplet-center loss for visible infrared person re-identification. Knowl. Based Syst. 2021, 215, 106772. [Google Scholar] [CrossRef]
  41. Ye, M.; Lan, X.; Wang, Z.; Yuen, P. Bi-Directional Center-Constrained Top-Ranking for Visible Thermal Person Re-Identification. IEEE Trans. Inf. Forens. Secur. 2020, 15, 407–419. [Google Scholar] [CrossRef]
  42. Ye, M.; Lan, X.; Li, J.; Yuen, P. Hierarchical Discriminative Learning for Visible Thermal Person Re-Identification. Assoc. Adv. Artif. Intell. 2018, 32, 7501–7508. [Google Scholar] [CrossRef]
  43. Wang, G.; Zhang, T.; Cheng, J.; Liu, S.; Yang, Y.; Hou, Z. RGB-Infrared Cross-Modality Person Re-Identification via Joint Pixel and Feature Alignment. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3622–3631. [Google Scholar]
  44. Choi, S.; Lee, S.; Kim, Y.; Kim, T.; Kim, C. Hi-CMD: Hierarchical Cross-Modality Disentanglement for Visible-Infrared Person Re-Identification. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10254–10263. [Google Scholar]
  45. Luo, H.; Jiang, W.; Gu, Y.; Liu, F.; Liao, X.; Lai, S.; Gu, J. A Strong Baseline and Batch Normalization Neck for Deep Person Re-Identification. IEEE Trans. Multimed. 2020, 22, 2597–2609. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Images taken by different color cameras of one pedestrian.
Figure 2. LoLeG-Net network structure.
Figure 3. The components of the non-local attention block.
Figure 4. Visual recognition performance of AGW and LoLeG-Net on SYSU-MM01.
Figure 5. Visual recognition performance of AGW and LoLeG-Net on RegDB in V2I mode.
Figure 6. Visual recognition performance of AGW and LoLeG-Net on RegDB in I2V mode.
Table 1. Comparison of results in All-Search mode.

Method | Venue | Rank-1 | Rank-10 | Rank-20 | mAP | mINP
One-stream [36] | ICCV-2017 | 12.04 | 49.68 | 66.74 | 13.67 | -
Two-stream [36] | ICCV-2017 | 11.65 | 47.99 | 65.50 | 12.85 | -
Zero-Padding [36] | ICCV-2017 | 14.80 | 54.12 | 71.33 | 15.95 | -
HCML [42] | AAAI-2018 | 14.32 | 53.16 | 69.17 | 16.16 | -
BDTR [37] | IJCAI-2018 | 27.32 | 66.96 | 81.07 | 27.32 | -
eBDTR [41] | TIFS-2019 | 27.82 | 67.34 | 81.34 | 28.42 | -
AlignGAN [43] | ICCV-2019 | 42.40 | 85.00 | 93.70 | 40.70 | -
Hi-CMD [44] | CVPR-2020 | 34.90 | 77.60 | - | 35.90 | -
AGW [39] | TPAMI-2021 | 47.50 | 84.39 | 92.14 | 47.65 | 35.30
Ours | - | 51.40 | 89.18 | 95.12 | 51.41 | 38.73
Table 2. Comparison of results in Indoor-Search mode.

Method | Venue | Rank-1 | Rank-10 | Rank-20 | mAP | mINP
One-stream [36] | ICCV-2017 | 16.94 | 63.55 | 82.10 | 22.95 | -
Two-stream [36] | ICCV-2017 | 15.60 | 61.18 | 81.02 | 21.49 | -
Zero-Padding [36] | ICCV-2017 | 20.58 | 68.38 | 85.79 | 26.92 | -
HCML [42] | AAAI-2018 | 24.52 | 73.25 | 86.73 | 30.08 | -
BDTR [37] | IJCAI-2018 | 31.92 | 77.18 | 89.28 | 41.86 | -
eBDTR [41] | TIFS-2019 | 32.46 | 77.42 | 89.62 | 42.46 | -
AlignGAN [43] | ICCV-2019 | 45.90 | 87.60 | 94.40 | 54.30 | -
AGW [39] | TPAMI-2021 | 54.17 | 91.14 | 95.98 | 62.97 | 59.23
Ours | - | 56.63 | 92.72 | 97.97 | 64.69 | 60.69
Table 3. Comparison results in V2I mode for RegDB.

Method | Venue | Rank-1 | Rank-10 | Rank-20 | mAP | mINP
Zero-Padding [36] | ICCV-2017 | 17.75 | 34.21 | 44.35 | 18.90 | -
HCML [42] | AAAI-2018 | 24.44 | 47.53 | 56.78 | 20.08 | -
BDTR [37] | IJCAI-2018 | 33.56 | 58.61 | 67.43 | 32.76 | -
eBDTR [41] | TIFS-2019 | 34.62 | 58.96 | 68.72 | 33.46 | -
AlignGAN [43] | ICCV-2019 | 57.90 | - | - | 53.60 | -
Hi-CMD [44] | CVPR-2020 | 70.93 | 86.39 | - | 66.04 | -
AGW [39] | TPAMI-2021 | 70.05 | 86.21 | 91.55 | 66.37 | 50.19
Ours | - | 76.58 | 89.60 | 94.07 | 73.36 | 62.28
Table 4. Comparison results in I2V mode for RegDB.

Method | Venue | Rank-1 | Rank-10 | Rank-20 | mAP | mINP
Zero-Pad [36] | ICCV-2017 | 16.63 | 34.68 | 44.25 | 17.82 | -
HCML [42] | AAAI-2018 | 21.70 | 45.02 | 55.58 | 22.24 | -
BDTR [37] | IJCAI-2018 | 32.92 | 58.46 | 68.43 | 31.96 | -
eBDTR [41] | TIFS-2019 | 34.21 | 58.74 | 68.64 | 32.49 | -
AlignGAN [43] | ICCV-2019 | 56.30 | - | - | 53.40 | -
AGW [39] | TPAMI-2021 | 70.49 | 87.12 | 91.84 | 65.90 | 51.24
Ours | - | 74.50 | 88.97 | 93.72 | 71.88 | 59.95
Table 5. Ablation experiments.

Method | Rank-1 | Rank-10 | Rank-20 | mAP | mINP
A: Baseline | 45.80 | 75.60 | 86.22 | 46.38 | 34.36
B: Baseline + Attention | 49.35 | 77.67 | 86.84 | 48.90 | 36.29
C: Baseline + Local Lead | 48.80 | 78.12 | 87.08 | 48.93 | 36.61
D: Baseline + Attention + Local Lead | 51.40 | 80.76 | 89.18 | 51.41 | 38.73
Table 6. Performance of different segmentation strategies.

Equal Parts | Rank-1 | Rank-5 | Rank-10 | Rank-20 | mAP | mINP
1 | 48.98 | 76.91 | 86.31 | 93.67 | 48.55 | 35.90
2 | 49.26 | 76.97 | 86.26 | 93.34 | 48.52 | 35.79
4 | 51.40 | 80.76 | 89.18 | 95.12 | 51.41 | 38.73
8 | 47.77 | 76.67 | 86.91 | 94.24 | 48.02 | 35.86
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
