Article

Contextual Patch-NetVLAD: Context-Aware Patch Feature Descriptor and Patch Matching Mechanism for Visual Place Recognition

1 Institute of Systems Science, National University of Singapore, Singapore 119615, Singapore
2 State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310027, China
3 Engineering Research Center for Design Engineering and Digital Twin of Zhejiang Province, School of Mechanical Engineering, Zhejiang University, Hangzhou 310027, China
4 Robotics Institute, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Sensors 2024, 24(3), 855; https://doi.org/10.3390/s24030855
Submission received: 7 November 2023 / Revised: 22 January 2024 / Accepted: 24 January 2024 / Published: 28 January 2024
(This article belongs to the Special Issue Vision Sensors: Image Processing Technologies and Applications)

Abstract:
The goal of visual place recognition (VPR) is to determine the location of a query image by identifying its place within a database of reference images. Visual sensor technologies are crucial for VPR, as they enable the precise identification and localization of query images within a database. Global descriptor-based VPR methods struggle to capture the specific local regions within a scene, which increases the probability of confusion during localization in such scenarios. To tackle the feature extraction and feature matching challenges in VPR, we propose a modified patch-NetVLAD strategy that includes two new modules: a context-aware patch descriptor and a context-aware patch matching mechanism. Firstly, we propose a context-driven patch feature descriptor to overcome the limitations of global and local descriptors in VPR. This descriptor aggregates features from each patch's surrounding neighborhood. Secondly, we introduce a context-driven feature matching mechanism that utilizes cluster and saliency context-driven weighting rules to assign higher weights to patches that are less similar to densely populated or locally similar regions, improving localization performance. We incorporate both modules into the patch-NetVLAD framework, resulting in a new approach called contextual patch-NetVLAD. Experimental results show that our proposed approach outperforms other state-of-the-art methods, achieving a Recall@10 score of 99.82% on Pittsburgh30k, 99.82% on FMDataset, and 97.68% on our benchmark dataset.

1. Introduction

Sensing techniques play a pivotal role in place recognition [1,2], where the goal is to estimate the location of input query data from a reference database. An important aspect of such a perception-based application is the selection of appropriate sensors based on their capabilities and the specificities of the application. The commonly used sensors for this application are cameras, LiDAR, and RADAR [3]. Camera-based image sensing technology is critical in place recognition due to its ability to capture detailed visual data [4,5]. Compared to other sensing methods like LiDAR and RADAR, image sensors can provide rich color information, which is crucial for distinguishing between similar structures or landscapes. They provide high precision in identifying and locating query images within a database. Moreover, they are also generally less expensive and easier to implement. For instance, in autonomous vehicles, image sensors contribute to the recognition of traffic signs, pedestrians, and other vehicles.
The two major challenges of visual place recognition (VPR) are feature description and feature matching. Firstly, in traditional VPR techniques, local features like SIFT [6], SURF [7], and ORB [8] are manually crafted and can be combined into a global descriptor, such as VLAD [9], or a convolutional neural network (CNN)-based NetVLAD [10]. Nevertheless, local features face challenges in handling significant changes in illumination. An alternative approach is to apply spatial pooling from the feature map of the backbone model to examine regions of interest. For example, a sliding window technique is utilized to generate patches, and patch descriptors are subsequently derived from NetVLAD [11]. Secondly, one common feature-matching approach is to exploit a two-stage retrieval strategy. This strategy involves performing a global retrieval step to retrieve a set of top candidates from the reference database for each query. These candidates are then refined in a subsequent step based on their local features for improved image ranking.
Patch-NetVLAD [11] exploits a two-stage retrieval strategy. It applies NetVLAD [10] to obtain global descriptors, retrieves the top 100 images most similar to the query image, and then re-ranks them through patch-level matching. Patch-NetVLAD+ [12] applies a fine-tuned NetVLAD to extract patch-level descriptors and assigns weights to patches according to the distances of the patch descriptors from the centroids in the description space. However, these methods suffer from two problems. Firstly, local descriptors emphasize spatial accuracy but may not fully represent the larger context. Global descriptors, on the other hand, are robust to appearance and lighting noise but struggle to identify minor local variations. Secondly, descriptors extracted from locally similar regions in the descriptor space often exhibit strong similarities and are challenging to distinguish.
To tackle these inherent challenges of VPR, we propose a new VPR approach in this paper called "contextual patch-NetVLAD". As its name indicates, it builds on the same strategy as patch-NetVLAD [11] and introduces two new modules: a context-aware patch descriptor and a context-aware patch matching mechanism. To perform the VPR task, the proposed approach employs a sliding window method to extract patch descriptors from NetVLAD. It leverages the use of local regions for matching by decomposing the image into patches.
The following is a summary of our contextual patch-NetVLAD’s motivations and contributions.
  • Local descriptors emphasize spatial accuracy but may not adequately represent the larger environment, whereas global descriptors are resilient to appearance and light changes but have difficulty identifying minute local variations. Motivated by this, a context-driven patch feature descriptor is proposed to aggregate the features from each patch’s neighborhood.
  • Descriptors taken from an area of a descriptor space that is heavily populated (e.g., generic building’s walls are all very similar) or a locally similar region (e.g., descriptors extracted from the smooth ceiling area) are deemed to be less distinctive. Inspired by this, to improve VPR performance by exploiting the distinctiveness of feature descriptors, a new context-driven feature matching mechanism is proposed. It consists of (i) a cluster context-driven weighting rule, which exploits the global information of all patch features extracted from the whole database to assign larger weights for patches far away from cluster centroids in the dataset, and (ii) a saliency context-driven weighting rule, which exploits local information of each patch by comparing it and its neighboring patches’ features and assigning larger weights for patches with a higher difference.
Furthermore, we propose integrating these two contributions in the patch-NetVLAD framework [11] for the VPR task in challenging indoor environments. We conduct experiments using our benchmark dataset to evaluate the effectiveness of this integration.
Unlike the standard VPR datasets (e.g., Pittsburgh30k [13]) that use outdoor images, we conduct an evaluation using our benchmark indoor dataset in this paper. As shown in Figure 1, our benchmark dataset is more challenging due to three main factors.
  • Scale: the standard Pittsburgh30k dataset primarily comprises outdoor locations, which are typically large and spacious. In contrast, our benchmark dataset consists of indoor locations that are smaller; therefore, distinctive key features are more susceptible to obstruction from obstacles (e.g., dense crowds).
  • Spectral Information: as shown in Figure 1, the spectral differences between images of different locations in the standard dataset are quite significant. In contrast, the spectral information of images from different indoor locations in our benchmark dataset is relatively similar, making descriptors harder to distinguish.
  • Lighting condition: indoor lighting in our dataset can be affected by reflections from objects like furniture and walls, leading to more intricate lighting and shadow effects than outdoor scenarios.
The rest of this paper is organized as follows. Section 2 provides a brief review of existing VPR research. Then, the proposed contextual patch-NetVLAD, including the proposed new feature description and patch matching, is presented in Section 3 and evaluated in extensive experiments in Section 4. Finally, this paper is concluded in Section 5.

2. Related Works

This section provides a brief overview of VPR works, with a focus on the two main challenges of VPR: feature description and feature matching.

2.1. Feature Description

The existing feature description approaches can be grouped into three classes: (i) global descriptor, (ii) local keypoint descriptor, and (iii) local patch descriptor.
Firstly, global descriptor methods primarily focus on global statistical features of the image, such as VLAD [9], BoW [14], WI-SURF [15], and the Fisher vector (FV) [16]. In recent VPR research, deep learning methods have made significant progress by utilizing features extracted from a backbone CNN [10] that is pretrained on image classification datasets. These features are then passed through a trainable aggregation layer, which transforms them into robust and compact representations. Examples of such methods include NetBoW [17] and NetFV [18], which improve BoW [14] and FV [16] using deep learning-based architectures, respectively. The attention-based pyramid aggregation network (APANet) [19] utilizes spatial pyramid pooling to aggregate multiscale information and attention blocks to highlight discriminative features. In [20], a hybrid image descriptor is proposed to aggregate salient visual information and complement it with appearance-based descriptions. To model the saliency of local features from different dimensions, the approach in [21] incorporates three attention modules that consider individual, spatial, and cluster dimensions. In the study [22], robust feature selection and matching processes are investigated to enhance the accuracy of place recognition; a BoW vocabulary is integrated with a feature matcher to adapt to varying environmental conditions. Sergi et al. explore the application of a CNN architecture to simultaneously detect and describe local features for image matching within the context of cultural heritage [23]. A novel approach termed MixVPR is introduced in [24]. It utilizes feature maps from pretrained backbones as a collection of global features and enriches them by incorporating a global relationship among elements within each feature map through a series of feature mixing stages.
Secondly, local keypoint descriptors primarily involve reordering the initial candidate list generated by global methods to obtain a more reasonable matching sequence [25,26,27,28]. These methods use traditional handcrafted local feature methods [6,7,8] or deep-learned local feature methods that learn discriminative and robust local features directly from images, such as LIFT [29], DeLF [30], and SuperPoint [31]. SAND features are proposed to provide hierarchical context information while extracting features [32]. However, most learning-based methods focus on enhancing nearest neighbor matching performance at the keypoint level.
Thirdly, local patch descriptors emphasize the specific local patch features within images instead of considering the whole image. A landmark-based VPR approach [33] combines edge boxes [34] for landmark detection with CNN features. A pre-trained CNN is used to re-rank the list of candidates [35], or RefineNet [36] is utilized to obtain local semantic features [37]. Patch-NetVLAD [11] generates patch descriptors from the NetVLAD framework. Patch-NetVLAD+ [12] fine-tunes NetVLAD with a triplet loss to improve patch descriptor extraction for differentiating dissimilar regions within the same scene. In the study [38], a hot-spot detector is developed specifically for a learned local key-patch descriptor.

2.2. Feature Matching

Feature matching methods re-rank matched patch candidates, which are obtained from a global feature retrieval step; therefore, they are crucial in obtaining the final retrieval results. In the patch-NetVLAD approach [11], candidate images are initially identified using NetVLAD. Next, to rank the candidate images and identify the best matching image, the patch match score is calculated for each pair of images. However, patch-NetVLAD treats all patches equally during the matching process. To address this, a patch weighting rule is proposed in [39] that applies the standard Hamming embedding method to evaluate descriptor distinctiveness. Patch-NetVLAD+ [12] evaluates the importance of patch features and selectively assigns less frequently occurring patches a more significant role in the matching process.

3. Proposed Contextual Patch-NetVLAD Framework

The proposed contextual patch-NetVLAD approach leverages the conventional patch-NetVLAD framework [11] with the two new components proposed in this paper. Our approach uses a sliding window technique to derive patch descriptors from NetVLAD [10]. The query image is passed through backbone feature extraction layers; the resulting features are assigned to a fixed set of clusters, and their respective residuals are computed. This is achieved by utilizing a VLAD aggregation layer, followed by a projection layer and principal component analysis (PCA), to create a feature map [10].
In our proposed approach, the patch features undergo two key steps. Firstly, they are processed by the context-driven patch feature descriptor, which replaces a patch within a feature map with the average of its eight neighboring patches. This step aims to capture the contextual information within the patches. Secondly, these patch features are further processed by the context-driven patch-matching mechanism, which assigns a score to each position based on its relevance to VPR. These scores are then used as weights to adjust the contributions of the patch-matching process, taking into account the information extracted from images in the database. In the end, match scoring is carried out to calculate the similarity score between the query image and each image in the database.

3.1. Proposed Context-Driven Patch Feature Descriptor

Given an input query image, we build its feature map $\mathbf{F} \in \mathbb{R}^{H \times W \times D}$ from the NetVLAD framework [10], where H and W denote the height and width of the feature map, respectively, and D is the length of the descriptor. For a patch $P_{r,c}$ centered at the r-th row and c-th column, its original descriptor is denoted as $\hat{f}_{r,c}$. Then, rather than using $\hat{f}_{r,c}$ directly, we propose to aggregate the features from its neighborhood (as a context) to obtain its context-driven patch feature descriptor as
$$f_{r,c} = \frac{1}{|\Omega_{r,c}|} \sum_{(m,n) \in \Omega_{r,c}} \hat{f}_{m,n}, \quad (1)$$
where $f_{r,c}$ is the replacement of the original feature $\hat{f}_{r,c}$, and $|\Omega_{r,c}|$ is the cardinality of the set of neighboring patches $\Omega_{r,c}$ centered at $(r,c)$. An 8-connected neighborhood is used in our approach; therefore, (1) boils down to $f_{r,c} = \frac{1}{8} \sum_{m=r-1}^{r+1} \sum_{n=c-1}^{c+1} \hat{f}_{m,n}$ with $(m,n) \neq (r,c)$.
By averaging a patch with its neighboring patches, we improve the representation of features within the feature map. This process effectively reduces noise and brings out more prominent features. In situations where the patch’s region extends beyond the boundaries of the feature map, automatic zero-padding is applied to fill the areas that exceed the feature map’s limits. This ensures consistent performance of the replacement operation and eliminates any boundary-related problems, allowing for uniform feature enhancement across the entire feature map.
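To make the aggregation concrete, the following is a minimal sketch of Equation (1), assuming the patch descriptors are arranged in a PyTorch tensor of shape (H, W, D); it is our illustration, not the authors' released implementation. It applies the 8-connected, zero-padded neighborhood averaging described above via a depthwise convolution.

```python
import torch
import torch.nn.functional as F

def context_driven_descriptors(feats: torch.Tensor) -> torch.Tensor:
    """Replace each patch descriptor with the mean of its 8-connected
    neighbors (Equation (1)); feats has shape (H, W, D)."""
    H, W, D = feats.shape
    x = feats.permute(2, 0, 1).unsqueeze(0)        # (1, D, H, W)
    # 3x3 kernel averaging the 8 neighbors; the center weight is zero.
    kernel = torch.full((D, 1, 3, 3), 1.0 / 8.0, dtype=feats.dtype)
    kernel[:, :, 1, 1] = 0.0
    # Depthwise convolution with zero padding handles the boundary patches.
    out = F.conv2d(x, kernel, padding=1, groups=D)
    return out.squeeze(0).permute(1, 2, 0)         # back to (H, W, D)
```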

3.2. Proposed Context-Driven Patch Matching

We introduce a context-driven patch-matching mechanism that incorporates two rules: (i) a cluster context-driven weighting rule, which utilizes global information from all patch features extracted from the entire database, and (ii) a saliency context-driven weighting rule, which utilizes local information of each patch by evaluating its features and those of its neighboring patches. The details of both rules are described below.
  • Firstly, a cluster context-driven weighting rule is proposed. We adopt the original NetVLAD [10] to extract the patch descriptor sets $\{f_i^{db} \in \mathbb{R}^{1 \times D}\}_{i=1}^{N_p^{db}}$ and $\{f_i^{q} \in \mathbb{R}^{1 \times D}\}_{i=1}^{N_p^{q}}$ from the database db and the query image q, respectively, where $N_p$ denotes the number of patches and f denotes a descriptor. We take into account the occurrences of patch descriptors and assign a weight to each patch descriptor based on its distance to the cluster centroids of the database in the feature space. First, a K-means method clusters the database descriptor set $\{f_i^{db}\}_{i=1}^{N_p^{db}}$ to obtain the K centroids $\{f_i^{c} \in \mathbb{R}^{1 \times D}\}_{i=1}^{K}$. Then, the distances from a patch descriptor f to the centroids are measured using the cosine distance $d_{\cos}$, and its weighting factor $w_c(f)$ is computed as [12]
$$w_c(f) = \sum_{i} \left\{ d_{\cos}(f, f_i^c) \right\}_{\min \alpha}, \quad (2)$$
    where $\{\cdot\}_{\min \alpha}$ denotes the subset of the $\alpha$ smallest items.
  • Secondly, a saliency context-driven weighting rule is proposed by assigning weights to patches within a feature map based on their variance. The underlying principle is that the variance within a patch's neighborhood indicates its significance or distinctiveness; patches with higher variance are likely to contain more salient information about the scene, enabling more accurate place recognition. The weighting factor $w_s(f_{r,c})$ of each patch feature is defined as
$$w_s(f_{r,c}) = \frac{1}{|\Omega_{r,c}|} \sum_{(m,n) \in \Omega_{r,c}} \left\| f_{m,n} - \bar{f} \right\|^2, \quad (3)$$
    where $\bar{f}$ represents the mean feature vector averaged over all neighboring features $f_{m,n}$ centered at position $(r,c)$, and $|\Omega_{r,c}|$ is the cardinality of the set of neighboring patches $\Omega_{r,c}$. By assigning higher weights to patch descriptors with high variance, the patch descriptors with greater significance contribute more to the place recognition task. A sketch of both weighting rules is given after this list.
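As referenced above, the following sketch illustrates how the two weighting rules of Equations (2) and (3) could be computed with NumPy and scikit-learn; it is our illustration. The number of clusters `k` and the parameter `alpha` are placeholder values, and border patches are clipped rather than zero-padded for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def cosine_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between rows of a (N, D) and b (M, D)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T

def cluster_weights(db_feats, feats, k=64, alpha=3):
    """Cluster context-driven weighting rule, Equation (2): sum of the
    alpha smallest cosine distances to the K database centroids."""
    centroids = KMeans(n_clusters=k, n_init=10).fit(db_feats).cluster_centers_
    d = np.sort(cosine_dist(feats, centroids), axis=1)   # (N, K), ascending
    return d[:, :alpha].sum(axis=1)                      # one weight per patch

def saliency_weights(feat_grid):
    """Saliency context-driven weighting rule, Equation (3): mean squared
    deviation over the 8-connected neighborhood; feat_grid is (H, W, D)."""
    H, W, D = feat_grid.shape
    w = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            nb = np.stack([feat_grid[m, n]
                           for m in range(max(r - 1, 0), min(r + 2, H))
                           for n in range(max(c - 1, 0), min(c + 2, W))
                           if (m, n) != (r, c)])
            w[r, c] = np.mean(np.sum((nb - nb.mean(axis=0)) ** 2, axis=1))
    return w
```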
It is important to highlight the difference between the proposed context-driven patch matching and the conventional approaches [11,12]. Firstly, unlike the conventional approach [11], where all patches are treated as equally important and assigned the same weights during patch-wise matching, the proposed approach automatically evaluates the distinctiveness of patch features and assigns different weights accordingly. Secondly, while the approach [12] only considers a cluster-based weight (i.e., (2)) to exploit the global information from the whole database, the proposed approach applies both a cluster context-driven weighting rule and a saliency context-driven weighting rule. This allows for the utilization of both global and local information to evaluate the distinctiveness of patch features.

3.3. Summary of the Proposed Contextual Patch-NetVLAD

A summary of the proposed contextual patch-NetVLAD approach is provided in this section and illustrated in Figure 2.
First, we adopt NetVLAD [10] to find the top 100 images most similar to the query image and obtain the initial retrieval image set. The image pair list $\mathcal{L}$ is described as [11]
$$\mathcal{L} = \left\{ (I_q, I_r) \right\}, \quad (4)$$
where $I_q$ and $I_r$ represent the query image and the candidate image, respectively. For each image pair in $\mathcal{L}$, the cosine distances (denoted as $d_{\cos}$) between the patch descriptors are calculated to generate the distance matrix $\mathbf{D}$ as [11]
$$\mathbf{D} = \begin{bmatrix} d_{\cos}(f_1^q, f_1^r) & d_{\cos}(f_1^q, f_2^r) & \cdots \\ d_{\cos}(f_2^q, f_1^r) & d_{\cos}(f_2^q, f_2^r) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}. \quad (5)$$
Then, we update $\mathbf{D}$ by applying the two proposed weighting mechanisms (2) and (3) to (5) to obtain
$$\mathbf{D}' = \begin{bmatrix} \frac{1}{w_c(f_1^q)\, w_s(f_1^q)} \cdot \frac{1}{w_c(f_1^r)\, w_s(f_1^r)} & \frac{1}{w_c(f_1^q)\, w_s(f_1^q)} \cdot \frac{1}{w_c(f_2^r)\, w_s(f_2^r)} & \cdots \\ \frac{1}{w_c(f_2^q)\, w_s(f_2^q)} \cdot \frac{1}{w_c(f_1^r)\, w_s(f_1^r)} & \frac{1}{w_c(f_2^q)\, w_s(f_2^q)} \cdot \frac{1}{w_c(f_2^r)\, w_s(f_2^r)} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \circ \mathbf{D}, \quad (6)$$
where ∘ is the Hadamard product.
Next, we employ the set of mutual nearest neighbors $\mathcal{P}$, which contains the pairs of query-image and database-image patches that are nearest neighbors of each other under (6), to perform patch-level matching of the patch descriptors [11]
$$\mathcal{P} = \left\{ (i, j) : i = \mathrm{NN}_r(f_j^q),\; j = \mathrm{NN}_q(f_i^r) \right\}, \quad (7)$$
where $\mathrm{NN}_r(f) = \operatorname{argmin}_i\, d_{\cos}(f, f_i^r)$ and $\mathrm{NN}_q(f) = \operatorname{argmin}_j\, d_{\cos}(f, f_j^q)$ retrieve the nearest-neighbor descriptor under the cosine distance in the reference and query image, respectively.
Finally, a spatial scoring method [11] is leveraged to compute the similarity score between a query/reference image pair, resulting in the final image retrieval results.
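A compact sketch of this matching step (Equations (5)-(7)) is given below; it is our NumPy illustration under the assumption that the per-patch weights of each image have already been combined as $w_c \cdot w_s$.

```python
import numpy as np

def weighted_mutual_matches(fq, fr, wq, wr):
    """Weighted patch matching of Section 3.3. fq: (Nq, D) query patch
    descriptors; fr: (Nr, D) reference patch descriptors; wq, wr: the
    combined per-patch weights w_c * w_s for each image."""
    # Equation (5): pairwise cosine-distance matrix.
    fq_n = fq / np.linalg.norm(fq, axis=1, keepdims=True)
    fr_n = fr / np.linalg.norm(fr, axis=1, keepdims=True)
    D = 1.0 - fq_n @ fr_n.T                      # (Nq, Nr)
    # Equation (6): Hadamard product with the inverse combined weights.
    D_w = D / np.outer(wq, wr)
    # Equation (7): keep only mutual nearest neighbors.
    nn_r = D_w.argmin(axis=1)   # best reference patch for each query patch
    nn_q = D_w.argmin(axis=0)   # best query patch for each reference patch
    return [(i, j) for i, j in enumerate(nn_r) if nn_q[j] == i]
```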

4. Experimental Results

4.1. Dataset

To conduct the performance evaluation, we utilized two public benchmark datasets, Pittsburgh30k [13] and FMDataset [40], as well as a benchmark dataset, UTown7, that we collected ourselves (the UTown7 dataset is available at https://doi.org/10.17632/8td4f55j2g.1 (accessed on 1 November 2023)). Our UTown7 dataset consists of images gathered from the Stephen Riady Centre, located on the campus of the National University of Singapore, as shown in Figure 3. This area was selected because it contains diverse indoor and outdoor environments that can create a rich image dataset for our study. We strategically chose seven unique locations across three different levels of the centre, including the flavor back door, flavor front door, gym, swimming pool entrance, bank, restroom, and convenience store, to ensure the diversity of our collected images. All images were collected under varying lighting conditions and cover different architectural elements. For each location, we collected 20 raw images. Then, we applied data augmentation to generate an enlarged image dataset. Table 1 presents the list of data augmentation techniques applied, which include random cropping (extracting random patches from the original images), random flipping (randomly mirroring images horizontally), random rotation (applying arbitrary rotations to the images), random scaling (altering the size of objects in the images), random adjustments to brightness, contrast, and saturation (varying the lighting conditions and color variations in the input data), and random noise (adding a degree of noise to the images). The rationale behind these augmentation parameters is to more accurately replicate the conditions under which people take photos in real-world situations. Leveraging these augmentation methods, we built a benchmark dataset with 1540 images to ensure that our model can be evaluated under different conditions, which is critical in the VPR performance evaluation.
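For illustration, the augmentations of Table 1 could be composed with torchvision as in the sketch below; the specific transform classes, their ordering, and the noise amplitude are our assumptions rather than the paper's stated implementation.

```python
import torch
from torchvision import transforms

# A possible composition of the Table 1 augmentations (illustrative only).
augment = transforms.Compose([
    transforms.RandomResizedCrop(size=(480, 640), scale=(0.6, 1.0)),  # random crop
    transforms.RandomHorizontalFlip(p=0.5),                           # random flip
    transforms.RandomRotation(degrees=50),                            # [-50, 50] deg
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),             # random scale
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
    # Additive Gaussian noise; the amplitude 0.02 is a placeholder.
    transforms.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0.0, 1.0)),
])
```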

4.2. Experimental Setup

In our experiments, all images were resized to a resolution of 640 × 480 pixels. Then, we extracted the patch-based features using a pre-trained patch-NetVLAD model [11] without any additional fine-tuning. Patches were defined with a size of 5 × 5 and a stride of 1. For the multiscale fusion, square patch sizes of 2, 5, and 8 were employed with associated weights of 0.45, 0.15, and 0.4, respectively. To provide a fair performance comparison, the same configuration was used for model inference across our entire dataset. All approaches were implemented on a workstation equipped with an NVIDIA RTX A4000 16 GB GPU and an Intel(R) Xeon(R) E5-2686 v4 CPU @ 2.30 GHz, using Python 3.9, PyTorch 1.12.1, and CUDA 11.3.
Various VPR approaches were evaluated using the Recall@N metric, where a query image is graded as correctly localized if at least one of the top N retrieved images falls within the ground-truth tolerance range [10,41]. The recall rate is calculated as the percentage of query images that are correctly localized. We followed the strategy of patch-NetVLAD [11] to re-rank the initial top 100 image set generated by NetVLAD [10]. In our experiments, we set N to 1, 5, and 10; therefore, the reported metrics are Recall@1, Recall@5, and Recall@10.
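The Recall@N computation described above can be sketched as follows; this is our illustrative code, and the argument names are hypothetical.

```python
def recall_at_n(ranked_db_ids, ground_truth_sets, n_values=(1, 5, 10)):
    """Recall@N: percentage of queries for which at least one of the top-N
    retrieved database images falls within the ground-truth tolerance.

    ranked_db_ids: per query, database indices ordered by decreasing
    similarity; ground_truth_sets: one set of correct indices per query."""
    recalls = {}
    for n in n_values:
        hits = sum(1 for ranked, gt in zip(ranked_db_ids, ground_truth_sets)
                   if gt.intersection(ranked[:n]))
        recalls[n] = 100.0 * hits / len(ground_truth_sets)
    return recalls
```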

4.3. Results

The proposed contextual patch-NetVLAD was compared against four state-of-the-art approaches: NetVLAD [10], DELG [42], patch-NetVLAD [11], and patch-NetVLAD+ [12]. These methods were chosen due to their utilization of patch-level global descriptors as local features and their ability to re-rank patches. The experiments were conducted using two public benchmark datasets, Pittsburgh30k [13] and FMDataset [40], as well as a benchmark dataset that we collected ourselves.
Public benchmark datasets. Our study evaluates five different methods on two public benchmark datasets, Pittsburgh30k [13] and FMDataset [40], as shown in Table 2. From each dataset, we randomly selected 500 images and applied data augmentation to generate 5000 augmented images per dataset for the performance evaluation. Specifically, our method achieves a Recall@10 of 99.82% on both Pittsburgh30k [13] and FMDataset [40], emphasizing the effectiveness of our data augmentation strategy in enhancing recognition accuracy. DELG [42] achieves moderate performance on the two datasets, with a Recall@1 of 49.11% and 54.65%, respectively. NetVLAD [10] and Patch-NetVLAD [11] demonstrate competitive performance, achieving a Recall@5 as high as 97.45% on Pittsburgh30k. Patch-NetVLAD+ [12] achieves the highest Recall@1 (52.73%) on Pittsburgh30k [13], indicating enhanced precision for top-ranked retrievals.
Our benchmark dataset. The quantitative performance comparison of various VPR approaches on our benchmark dataset is shown in Table 3. Our proposed approach demonstrates superior performance compared to the other state-of-the-art methods, including DELG [42], NetVLAD [10], Patch-NetVLAD [11], and Patch-NetVLAD+ [12]. Specifically, our method achieves a Recall@5 of 94.73% and a Recall@10 of 97.68% on our UTown7 dataset. These results represent substantial improvements of 6.46% and 6.01%, respectively, over Patch-NetVLAD [11]. Moreover, compared to Patch-NetVLAD+ [12], our approach improves Recall@5 by 6.12% and Recall@10 by 4.65%. Although our method's Recall@1 is slightly lower than those of DELG [42] and NetVLAD [10], it surpasses all other methods in Recall@5 and Recall@10. The runtime to process one query image is 14.45 s, because our approach applies a matching process to re-rank the top candidates in order to improve retrieval performance.
Our findings show noticeable differences between Recall@5 and Recall@10. The inferior Recall@1 performance might be due to the dataset characteristics: the augmentation process can introduce ambiguity into local features, affecting their distinctiveness and the clustering within their vicinity. Despite the decrease in Recall@1, both Recall@5 and Recall@10 show improvement. This is due to the increased diversity of the augmented dataset, which assists the model in capturing a wider range of features, consequently enhancing retrievals at larger values of N.

4.4. Ablation Study

This section presents ablation studies that assess the contributions of the different components of our framework to the final recognition performance using our UTown7 dataset.
  • To justify the proposed context-driven patch feature descriptor, we chose the patch-NetVLAD framework as the baseline approach. We then evaluated the performance of this framework with the proposed context-driven patch feature descriptor incorporated. Table 4 presents a quantitative performance evaluation using Recall@1, Recall@5, and Recall@10. The incorporation of the proposed context-driven patch feature descriptor leads to accuracy improvements of 5.61% at Recall@5 and 5.27% at Recall@10, respectively.
  • To justify the proposed context-driven patch-matching mechanism, we chose the patch-NetVLAD framework as the baseline approach. We then compared four variations, formed by enabling or disabling each of the two proposed rules: the saliency context-driven weighting rule and the cluster context-driven weighting rule. As shown in Table 5, incorporating both rules achieves the highest accuracy, with recall rates of 94.73% at Recall@5 and 97.68% at Recall@10.

4.5. Discussion and Limitations

The proposed approach utilizes a sliding window technique to extract patch descriptors from NetVLAD, forming the foundation for a robust VPR system. Previous VPR methods such as Patch-NetVLAD and Patch-NetVLAD+ have employed NetVLAD for image retrieval but have struggled to balance local and global descriptor emphases, which impacts accuracy and discriminative power. To counter these issues, the proposed approach introduces two modules: a context-aware patch descriptor and a context-aware patch-matching mechanism. By decomposing the image into patches, it emphasizes the use of local regions for matching.
  • The key strength of the proposed method lies in its ability to tackle the inherent challenges of VPR. The introduction of a context-aware patch descriptor ensures that the system captures both local and global information, mitigating the limitations of previous methods. Furthermore, the context-aware patch-matching mechanism enhances the discriminative power of descriptors extracted from locally similar regions in the descriptor space.
  • The proposed approach has two potential limitations. The first lies in the challenge of balancing the local and global descriptor emphases: our method uses both local and global information, which can be difficult to balance effectively, potentially impacting the accuracy and discriminative power of the system. The second lies in computational complexity, because the proposed approach applies a refinement pass (matching) to re-rank the top candidates in order to improve retrieval performance.

5. Conclusions

A new visual place recognition approach has been proposed in this paper by integrating two proposed modules into the conventional patch-NetVLAD framework to form a contextual patch-NetVLAD approach. Our approach achieves more accurate place recognition results due to the incorporation of two key components: the proposed context-aware patch descriptor, which aggregates features from neighboring patches, and the proposed patch-matching mechanism, which assigns different weights to guide the contributions of various patches. The improved accuracy is verified in our quantitative performance comparisons using two public benchmark datasets and our own benchmark dataset. In the future, we intend to apply our contextual patch-NetVLAD to a larger indoor dataset or a mixed indoor-outdoor dataset to further evaluate its performance and usage.

Author Contributions

Conceptualization, W.S., W.C. and J.T.; data curation, W.S. and W.C.; methodology, W.S., W.C. and J.T.; software, W.S. and W.C.; supervision, J.T.; writing—original draft, W.S., W.C., R.H. and J.T.; writing—review and editing, W.S., W.C., R.H. and J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, X.; Wang, L.; Su, Y. Visual place recognition: A survey from deep learning perspective. Pattern Recognit. 2021, 113, 107760. [Google Scholar] [CrossRef]
  2. Masone, C.; Caputo, B. A Survey on Deep Visual Place Recognition. IEEE Access 2021, 9, 19516–19547. [Google Scholar] [CrossRef]
  3. Barros, T.; Pereira, R.; Garrote, L.; Premebida, C.; Nunes, U.J. Place recognition survey: An update on deep learning approaches. arXiv 2021, arXiv:2106.10458. [Google Scholar]
  4. Schubert, S.; Neubert, P.; Garg, S.; Milford, M.; Fischer, T. Visual Place Recognition: A Tutorial. IEEE Robot. Autom. Mag. 2023, 2–16. [Google Scholar] [CrossRef]
  5. Berton, G.; Mereu, R.; Trivigno, G.; Masone, C.; Csurka, G.; Sattler, T.; Caputo, B. Deep Visual Geo-localization Benchmark. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5386–5397. [Google Scholar]
  6. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
  7. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar]
  8. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  9. Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. Aggregating local descriptors into a compact image representation. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3304–3311. [Google Scholar]
  10. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1437–1452. [Google Scholar] [CrossRef] [PubMed]
  11. Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. Patch-NetVLAD: Multi-scale fusion of locally-global descriptors for place recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14141–14152. [Google Scholar]
  12. Cai, Y.; Zhao, J.; Cui, J.; Zhang, F.; Feng, T.; Ye, C. Patch-NetVLAD+: Learned patch descriptor and weighted matching strategy for place recognition. In Proceedings of the 2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Bedford, UK, 20–22 September 2022; pp. 1–8. [Google Scholar]
  13. Torii, A.; Sivic, J.; Okutomi, M.; Pajdla, T. Visual place recognition with repetitive structures. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2346–2359. [Google Scholar] [CrossRef] [PubMed]
  14. Jegou, H.; Perronnin, F.; Douze, M.; Sanchez, J.; Perez, P.; Schmid, C. Aggregating local image descriptors into compact codes. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 1704–1716. [Google Scholar] [CrossRef] [PubMed]
  15. Badino, H.; Huber, D.; Kanade, T. Real-time topometric localization. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA, 14–18 May 2012; pp. 1635–1642. [Google Scholar]
  16. Sánchez, J.; Perronnin, F.; Mensink, T.; Verbeek, J. Image classification with the Fisher vector: Theory and practice. Int. J. Comput. Vis. 2013, 105, 222–245. [Google Scholar] [CrossRef]
  17. Ong, E.J.; Husain, S.S.; Bober-Irizar, M.; Bober, M. Deep architectures and ensembles for semantic video classification. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3568–3582. [Google Scholar] [CrossRef]
  18. Miech, A.; Laptev, I.; Sivic, J. Learnable pooling with context gating for video classification. arXiv 2017, arXiv:1706.06905. [Google Scholar]
  19. Zhu, Y.; Wang, J.; Xie, L.; Zheng, L. Attention-based pyramid aggregation network for visual place recognition. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 99–107. [Google Scholar]
  20. Garg, S.; Suenderhauf, N.; Milford, M. Semantic-geometric visual place recognition: A new perspective for reconciling opposing views. Int. J. Robot. Res. 2022, 41, 573–598. [Google Scholar] [CrossRef]
  21. Peng, G.; Zhang, J.; Li, H.; Wang, D. Attentional pyramid pooling of salient visual residuals for place recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 885–894. [Google Scholar]
  22. Arshad, S.; Kim, G.-W. A Robust Feature Matching Strategy for Fast and Effective Visual Place Recognition in Challenging Environmental Conditions. Int. J. Control. Autom. Syst. 2023, 21, 948–962. [Google Scholar] [CrossRef]
  23. Sergi, I.; Leo, M.; Carcagnì, P.; La Franca, M.; Distante, C.; Patrono, L. A Microservices Architecture based on a Deep-learning Approach for an Innovative Fruition of Art and Cultural Heritage. J. Commun. Softw. Syst. 2022, 18, 184–194. [Google Scholar] [CrossRef]
  24. Ali-Bey, A.; Chaib-Draa, B.; Giguere, P. MixVPR: Feature Mixing for Visual Place Recognition. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2997–3006. [Google Scholar]
  25. Taira, H.; Okutomi, M.; Sattler, T.; Cimpoi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7199–7209. [Google Scholar]
  26. Sarlin, P.E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12716–12725. [Google Scholar]
  27. Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A trainable CNN for joint description and detection of local features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8092–8101. [Google Scholar]
  28. Fathy, M.E.; Tran, Q.H.; Zia, M.Z.; Vernaza, P.; Chandraker, M. Hierarchical metric learning and matching for 2D and 3D geometric correspondences. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 803–819. [Google Scholar]
  29. Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 467–483. [Google Scholar]
  30. Noh, H.; Araujo, A.; Sim, J.; Weyand, T.; Han, B. Large-scale image retrieval with attentive deep local features. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3456–3465. [Google Scholar]
  31. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  32. Spencer, J.; Bowden, R.; Hadfield, S. Scale-adaptive neural dense features: Learning via hierarchical context aggregation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6200–6209. [Google Scholar]
  33. Sünderhauf, N.; Shirazi, S.; Jacobson, A.; Dayoub, F.; Pepperell, E.; Upcroft, B.; Milford, M. Place recognition with Convnet landmarks: Viewpoint-robust, condition-robust, training-free. In Proceedings of the Robotics: Science and Systems XI, Rome, Italy, 13–17 July 2015; pp. 1–10. [Google Scholar]
  34. Zitnick, C.L.; Dollár, P. Edge boxes: Locating object proposals from edges. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 391–405. [Google Scholar]
  35. Camara, L.G.; Přeučil, L. Visual place recognition by spatial matching of high-level CNN features. Robot. Auton. Syst. 2020, 133, 103625. [Google Scholar] [CrossRef]
  36. Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934. [Google Scholar]
  37. Garg, S.; Suenderhauf, N.; Milford, M. Lost appearance-invariant place recognition for opposite viewpoints using visual semantics. arXiv 2018, arXiv:1804.05526. [Google Scholar]
  38. Li, Z.; Lee, C.D.W.; Tung, B.X.L.; Huang, Z.; Rus, D.; Ang, M.H. Hot-NetVLAD: Learning Discriminatory Key Points for Visual Place Recognition. IEEE Robot. Autom. Lett. 2023, 8, 974–980. [Google Scholar] [CrossRef]
  39. Arandjelović, R.; Zisserman, A. DisLocation: Scalable descriptor distinctiveness for location recognition. In Proceedings of the 12th Asian Conference on Computer Vision, Singapore, 1–5 November 2015; pp. 188–204. [Google Scholar]
  40. Zhu, Z.; Xu, F.; Yan, C.; Hao, X.; Ji, X.; Zhang, Y.; Dai, Q. Real-time Indoor Scene Reconstruction with RGBD and Inertial Input. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 7–12. [Google Scholar]
  41. Torii, A.; Arandjelovic, R.; Sivic, J.; Okutomi, M.; Pajdla, T. 24/7 place recognition by view synthesis. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1808–1817. [Google Scholar]
  42. Cao, B.; Araujo, A.; Sim, J. Unifying deep local and global features for image search. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 726–743. [Google Scholar]
Figure 1. A comparison between the public Pittsburgh30k [13] dataset (the first row) and our benchmark dataset (the fourth row). The second and third rows show the color statistics (in terms of red, green, and blue histograms) of the corresponding images. The x-axis and y-axis indicate the intensity values and the count of pixels, respectively.
Figure 2. An overview of the proposed contextual patch-NetVLAD approach.
Figure 3. An overview of the benchmark dataset (UTown7) used in our experiments.
Table 1. The list of image augmentation methods used to create the benchmark dataset (UTown7) for the performance evaluation of various VPR approaches.

Method | Detailed Parameter Settings
Random crop | [0.6, 1]
Random flip | True or False
Random rotate | [−50, 50]
Random scale | [0.8, 1.2]
Random brightness, contrast, saturation | [−0.3, 0.3]
Random noise | [0, 255]
Table 2. The quantitative performance comparison of various VPR approaches on public benchmark datasets (Recall@1 / Recall@5 / Recall@10, %).

Method | Pittsburgh30k [13] | FMDataset [40]
DELG [42] | 49.11 / 95.28 / 97.19 | 54.65 / 95.77 / 98.23
NetVLAD [10] | 50.82 / 97.45 / 99.73 | 55.36 / 96.64 / 99.64
Patch-NetVLAD [11] | 50.82 / 97.45 / 99.64 | 55.36 / 97.27 / 99.82
Patch-NetVLAD+ [12] | 52.73 / 96.73 / 99.63 | 55.64 / 96.64 / 99.73
Ours | 50.82 / 97.09 / 99.82 | 55.36 / 97.52 / 99.82
Table 3. The quantitative performance comparison of various VPR approaches on our UTown7 dataset (%).

Method | Recall@1 | Recall@5 | Recall@10
DELG [42] | 78.52 | 87.46 | 90.19
NetVLAD [10] | 79.76 | 89.12 | 92.52
Patch-NetVLAD [11] | 79.76 | 88.27 | 91.67
Patch-NetVLAD+ [12] | 79.76 | 88.61 | 93.03
Ours | 76.53 | 94.73 | 97.68
Table 4. The ablation study of the proposed context-driven patch feature descriptor (%).

Method | Recall@1 | Recall@5 | Recall@10
Baseline | 79.76 | 88.27 | 91.67
+ Proposed context-driven patch feature descriptor | 79.76 | 93.88 | 96.94
Table 5. The ablation study of the proposed context-driven patch-matching mechanism (%). A check mark (✓) indicates that the corresponding rule is enabled.

Cluster Context-Driven Weighting Rule (2) | Saliency Context-Driven Weighting Rule (3) | Recall@1 | Recall@5 | Recall@10
– | – | 79.76 | 93.88 | 96.94
✓ | – | 79.76 | 88.61 | 93.03
– | ✓ | 72.79 | 93.20 | 97.32
✓ | ✓ | 76.53 | 94.73 | 97.68
