Article

A2TPNet: Alternate Steered Attention and Trapezoidal Pyramid Fusion Network for RGB-D Salient Object Detection

Songsong Duan, Xiuju Gao, Chenxing Xia and Bin Ge
1 College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
2 School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan 232001, China
3 Hefei Comprehensive National Science Center, Institute of Energy, Hefei 230031, China
4 Anhui Purvar Bigdata Technology Co., Ltd., Huainan 232001, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(13), 1968; https://doi.org/10.3390/electronics11131968
Submission received: 25 April 2022 / Revised: 15 June 2022 / Accepted: 16 June 2022 / Published: 24 June 2022
(This article belongs to the Section Microwave and Wireless Communications)

Abstract

RGB-D salient object detection (SOD) aims at locating the most eye-catching object in visual input by fusing complementary information of the RGB modality and the depth modality. Most existing RGB-D SOD methods integrate multi-modal features indiscriminately to generate the saliency map, ignoring the ambiguity between different modalities. To better use multi-modal complementary information and alleviate the negative impact of ambiguity among different modalities, this paper proposes a novel Alternate Steered Attention and Trapezoidal Pyramid Fusion Network (A2TPNet) for RGB-D SOD composed of a Cross-modal Alternate Fusion Module (CAFM) and a Trapezoidal Pyramid Fusion Module (TPFM). CAFM focuses on fusing cross-modal features, taking full account of the ambiguity between cross-modal data through an Alternate Steered Attention (ASA), and it reduces the interference of redundant information and non-salient features in the interactive process through a collaboration mechanism containing channel attention and spatial attention. TPFM endows the RGB-D SOD model with more powerful feature expression capabilities by combining multi-scale features to enhance the expressive ability of contextual semantics of the model. Extensive experimental results on five publicly available datasets demonstrate that the proposed model consistently outperforms 21 state-of-the-art methods.

1. Introduction

Salient object detection (SOD) aims to highlight the most eye-catching object or area in a given scene by simulating the human visual attention mechanism. In recent years, SOD has developed rapidly due to its wide range of applications, such as image retrieval [1], video segmentation [2], semantic segmentation [3], video tracking [4], character reconstruction [5], face recognition [6], thumbnail creation [7], and quality evaluation [8]. Most existing SOD methods mainly focus on RGB images and achieve satisfactory results for simple scenes. However, it is difficult to cope with challenging scenarios, such as complex backgrounds, low contrast, similar textures, transparent objects, and low light, using single-modal data.
With the popularity of depth acquisition devices, such as the Kinect camera and the Huawei Mate 30, depth maps have been increasingly introduced into different fields (for example, SOD), as they can provide information complementary to RGB features, such as spatial structure, 3D distribution, and object edges. Since the RGB image comprises rich color and texture features while the depth image comprises abundant spatial and contour information, many works [9,10,11,12,13] attempt to integrate depth maps into RGB SOD, namely RGB-D SOD, to boost the performance of SOD results. However, there are two main problems with these methods that need to be addressed:
(1) Ambiguity between cross-modal features: Previous works [9,14,15,16] attempt to integrate RGB features and depth features to improve the performance of SOD methods. These methods usually fuse the cross-modal features indiscriminately, so that an RGB image plays the same role as a depth map. However, those RGB-D SOD methods neglect the ambiguity between RGB features and depth features. Some visual demonstrations of ambiguities between the two modalities are shown in Figure 1, where the red boxes indicate the difference in saliency between the RGB image and the depth map. Taking Figure 1e as an example, the tree on the right is an obvious salient object in the depth map while it is classified as part of the background in the RGB image. Therefore, how to resolve the ambiguity of cross-modal features is a key issue, and doing so can further improve the results of the SOD model. In this paper, our model explicitly accounts for the ambiguity between the two modalities and eliminates its negative effect.
(2) Multi-scale feature integration effects: Benefiting from the integration and utilization of multi-scale features, most current RGB-D SOD networks [11,12] achieve promising results by considering both high-level semantic information and low-level local information. However, most existing RGB-D SOD models utilize multi-scale information in a simple fusion way, such as concatenation [11] or a simple convolution layer [12], which makes it challenging to establish a feature interaction bridge between multi-scale features and to fully exploit high-level semantic information and low-level local information. In this paper, we therefore design an effective integration scheme of multi-scale features for accurate salient object detection.
As a solution to the aforementioned issues, this paper proposes a novel alternate steered attention and trapezoidal pyramid fusion network (A2TPNet). Concretely, a cross-modal alternate fusion module (CAFM) is designed to distinguish the ambiguity between RGB features and depth features, which can elevate the performance of RGB-D SOD. Concurrently, an alternate steered attention (ASA) mechanism is formulated to filter out redundant information and non-salient features by generating channel and spatial weights from cross-modal features. Additionally, a trapezoidal pyramid fusion module (TPFM) is presented to integrate cross-modal features (depth features and RGB features) with a multi-level sparse fusion scheme, capturing multi-scale features to further improve the performance of SOD. Notably, the fluid pyramid [17] has been proposed for image saliency detection tasks. In contrast to [17], this paper adopts a sparse connection method and transfers the integrated features to the next level instead of using dense connections. In addition, the proposed trapezoidal pyramid is less computationally intensive and has a more powerful feature expression ability than the fluid pyramid.
Extensive experiments were conducted to validate the effectiveness of our method, including comparisons with 21 state-of-the-art RGB-D SOD methods over five widely adopted benchmark datasets. Overall, the main contributions of this work are as follows:
  • An Alternate Steered Attention and Trapezoidal Pyramid Fusion Network (A2TPNet) for RGB-D salient object detection is proposed to explore and eliminate ambiguity between multi-modality features for accurate saliency detection.
  • A Cross-modal Alternate Fusion Module (CAFM) is utilized to integrate RGB features and depth features alternately, and an Alternate Steered Attention (ASA) mechanism is designed to eliminate the ambiguity of multi-modality in the alternating process.
  • A Trapezoidal Pyramid Fusion Module (TPFM) is proposed with a multi-level sparse fusion scheme, which can adequately establish the feature interaction bridge between multi-scale features and primarily integrate high-level semantic information and low-level local information.
The remainder of this article is organized as follows: Section 2 reviews the related work, and our model for RGB-D SOD is presented in Section 3. Comprehensive experimental results are discussed in Section 4 for performance evaluations compared with SOTA RGB-D SOD methods. Finally, concluding remarks and future directions are given in Section 5.

2. Related Work

2.1. RGB-D Salient Object Detection

Early RGB-D SOD methods [18,19,20] relied on various hand-designed features, such as contrast [18], shape [19], and priors [20]. Limited by the expressive ability of hand-designed features, it is challenging for traditional methods to handle complex scenes, such as low contrast, complex backgrounds, and salient objects whose shapes are similar to the surroundings. Recently, the convolutional neural network (CNN) has shown great potential to achieve high performance in RGB-D SOD tasks, and several works [10,17,21,22,23] have employed CNNs for RGB-D SOD and achieved state-of-the-art results. Similar to the two-stream CNN proposed in [24], Han et al. [21] utilized a fully connected layer to fuse cross-modal features extracted from RGB and depth based on a two-view CNN, producing the final prediction. However, it is difficult for a cross-view transfer structure to deal with salient objects on depth maps in complex scenarios.
To effectively fuse multi-scale information and establish a communicative bridge between multi-scale features, Chen et al. [14] designed a progressive fusion method. The deeper prediction was utilized to guide the shallower features, and the prediction of different scales was supervised by ground truth before final fusion. However, this strategy of saliency prediction based on five scales increases the training time and may also cause overfitting. Fan et al. [25] employed the depth depurator unit to filter out low-quality depth maps with large MAE for saliency detection. Nevertheless, the threshold is a fixed hyperparameter, which cannot dynamically adjust its value to adaptively judge the quality of depth maps. Different from [25], Piao et al. [18] proposed a Depth Distiller to convert the depth information to the RGB stream to guide salient prediction. Although this idea is compelling, the performance may be greatly limited by the quality of the predicted depth map.
To effectively integrate multi-scale features, this paper designs a trapezoidal pyramid fusion module (TPFM) to extract beneficial information from multi-scale features, where a sparse connection way is presented to build a transmission path from deep features to shallow features, reducing time and space consumption.

2.2. The Depth Quality of the Depth Map

Even though previous RGB-D SOD methods [26,27,28] show good capabilities, they neglect the adverse effects of low-quality depth maps. Notably, the limitations of RGB-D SOD caused by depth map quality are twofold. The first is that raw depth images often contain missing values or errors arising from multiple factors [26,29], such as the collecting device, occlusion, reflection, and weather conditions. The second is that the salient object expressed in the depth map may be inconsistent with the real salient object [27], especially for a low-contrast RGB image, which may seriously interfere with the RGB-D SOD model's ability to locate the salient object accurately. In addition to the issues mentioned, many other factors contribute to the uncertain quality of depth maps, hindering the progress of data-driven RGB-D SOD methods.
To solve the above-mentioned issues, several methods [18,26,29,30,31,32] have been proposed from different viewpoints. Zhao et al. [18] proposed a new contrast loss function to enlarge the contrast between the foreground and background of the depth map in order to repair its quality. Nevertheless, this function generates fragmentary results when the contrast loss cannot enhance the foreground. Fan et al. [26] used structural measures to estimate whether the quality of a depth map is good or poor and trained a classifier to judge depth map quality. Unfortunately, this method can be affected by low-contrast RGB images. Piao et al. [29] introduced a two-phase strategy to train an RGB-D SOD model with knowledge distillation techniques [31]. However, the complementary information obtained from depth data via knowledge distillation is inconsistent at test time.
To address the aforementioned depth map quality problem, this paper proposes a novel cross-modal fusion strategy based on a well-designed alternate strategy to eliminate the influence of low-quality depth maps. The ASA mechanism steers the RGB-D SOD model to automatically focus attention on the essential regions of RGB features and depth features during salient feature extraction, so the negative effects of low-quality depth maps can be automatically eliminated.

3. Methodology

3.1. Overview Network Architecture

The complete pipeline of the proposed alternate steered attention and trapezoidal pyramid fusion network can be divided into three stages: (1) Two-stream encoders for extracting features, (2) Cross-modal alternate fusion module (CAFM) for multi-modal features, and (3) Trapezoidal pyramid fusion module (TPFM) for multi-scale features, as shown in Figure 2.
(1) Two-stream Encoders: Two standalone VGG16 backbone networks are applied as encoders to extract cross-modal features from an RGB image I and the corresponding depth map D, respectively. As a result, five RGB features { F r i } i = 1 5 and five depth features { F d i } i = 1 5 can be generated from the RGB image and the corresponding depth map by the RGB encoder and depth encoder, respectively. Note that the resolution of the input images and the corresponding depth maps are set to 3 × 256 × 256 and 1 × 256 × 256, respectively.
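For concreteness, the following PyTorch sketch shows one way to split two standalone VGG16 backbones into five feature stages as described above. The stage boundaries, the single-channel adaptation of the depth branch, the module names, and the use of torchvision (≥ 0.13) pretrained weights are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16


class TwoStreamVGGEncoder(nn.Module):
    """Two standalone VGG16 backbones producing five feature levels each.
    Stage boundaries follow the usual VGG16 pooling layout (an assumption)."""

    def __init__(self):
        super().__init__()
        rgb_layers = list(vgg16(weights="IMAGENET1K_V1").features.children())
        dep_layers = list(vgg16(weights="IMAGENET1K_V1").features.children())
        # Adapt the depth branch to a 1-channel input (assumption; replicating
        # the depth map to 3 channels is an equally common alternative).
        dep_layers[0] = nn.Conv2d(1, 64, kernel_size=3, padding=1)
        splits = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]
        self.rgb_stages = nn.ModuleList(nn.Sequential(*rgb_layers[a:b]) for a, b in splits)
        self.dep_stages = nn.ModuleList(nn.Sequential(*dep_layers[a:b]) for a, b in splits)

    def forward(self, rgb, depth):
        f_r, f_d = [], []
        x, y = rgb, depth
        for rs, ds in zip(self.rgb_stages, self.dep_stages):
            x, y = rs(x), ds(y)
            f_r.append(x)  # {F_r^i}, i = 1..5
            f_d.append(y)  # {F_d^i}, i = 1..5
        return f_r, f_d


encoder = TwoStreamVGGEncoder()
f_r, f_d = encoder(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
```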
(2) Cross-modal Alternate Fusion Module (CAFM): To efficiently integrate cross-modal features, the CAFM is proposed to integrate multi-modal features in an alternate way. As shown in Figure 3, an information exchange channel between RGB features and depth features is implemented in the proposed alternate fusion strategy to improve the complementary effect of cross-modal features. However, redundant information and non-salient features are inevitably generated in the process of alternation, which will deteriorate the predicted results. To solve this issue, an alternate steered attention (ASA) mechanism is presented to filter out this harmful information through a hybrid of channel attention and spatial attention. The details are presented in Section 3.2.
(3) Trapezoidal Pyramid Fusion Module (TPFM): As is well known, features at different scales play different roles in SOD. In order to make full use of multi-scale features to improve the performance of the proposed model, this paper designs a TPFM (Figure 2) to effectively integrate multi-scale features by building a transmission path from deep features to shallow features. Unlike previous work [12,33] using densely connected fusion, the proposed A2TPNet introduces a sparsely connected fashion to fuse multi-scale features from deep to shallow, achieving faster computation and fewer parameters than [12]. In addition, the proposed A2TPNet combines TPFM with dilated convolutions (TPFM-D) to obtain a larger receptive field, enabling our model to capture the semantic information of saliency. The details are presented in Section 3.3.

3.2. Cross-Modal Alternate Fusion Module (CAFM)

To capture the affiliation of features, various attention mechanisms [32,34] have been proposed to extract meaningful information from a global or local perspective. Unfortunately, these attention mechanisms are generated from single-modal data, which may limit the available information and the performance gains. However, it is reasonable to believe that better feature representations for RGB-D SOD can be obtained if cross-modal features are additionally taken into account. Therefore, it is necessary to tailor a mechanism to explore the correlation between cross-modal features and the resulting performance gains. As a result, the proposed method deploys a cooperation mechanism between channel attention (CA) and spatial attention (SA) to form the alternate steered attention (ASA) mechanism, which can comprehensively explore the correlation between channels and within features. The cooperation of channel attention and spatial attention aims to overcome the deficiency of a single CA or SA, where CA cannot capture the vital information within features and SA does not consider the correlation between channels. As shown in Figure 3b, the channel weights and spatial weights can be obtained using the alternate fusion strategy between the RGB feature and the depth feature in the ASA mechanism. The calculation process is
W_s = softmax(Resconv(X_r) ⊙ Resconv(X_d)),
W_c = softmax(G(Resconv(X_r) ⊕ Resconv(X_d))),
where X_r, X_d ∈ R^(H×W×C) represent RGB features and depth features, respectively; Resconv(·) denotes the residual convolution shown in Figure 3b; W_s and W_c indicate the spatial and channel weights, respectively; G(·) is the global average pooling operation; ⊙ and ⊕ denote element-wise multiplication and element-wise addition, respectively; and softmax is the softmax activation function. Finally, the fused feature can be obtained through the ASA mechanism, which can be described as:
X_α^tmp = Resconv(Resconv(X_α)),
X_α^f = (W_s ⊙ X_α^tmp) ⊕ (W_c ⊙ X_α^tmp),
where α ∈ {r, d}.
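A minimal PyTorch sketch of the ASA mechanism defined by the equations above is given below. The exact layout of Resconv and the softmax axes are not specified in the text and are therefore assumptions; the module takes the feature being steered (X_α) and its cross-modal partner as inputs, matching the way ASA(·,·) is invoked in the CAFM formulas below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResConv(nn.Module):
    """Residual 3x3 convolution block (exact layout is an assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        return F.relu(self.body(x) + x)


class ASA(nn.Module):
    """Alternate Steered Attention: a spatial weight from the element-wise product
    of the two modality features and a channel weight from their global-pooled sum."""
    def __init__(self, channels):
        super().__init__()
        self.res_a, self.res_o = ResConv(channels), ResConv(channels)
        self.res_t1, self.res_t2 = ResConv(channels), ResConv(channels)

    def forward(self, x_alpha, x_other):
        a, o = self.res_a(x_alpha), self.res_o(x_other)
        b, c, h, w = a.shape
        # W_s: softmax over spatial positions of the element-wise product (axis assumed).
        w_s = torch.softmax((a * o).view(b, c, -1), dim=-1).view(b, c, h, w)
        # W_c: softmax over channels of the globally averaged element-wise sum.
        w_c = torch.softmax(F.adaptive_avg_pool2d(a + o, 1), dim=1)
        x_tmp = self.res_t2(self.res_t1(x_alpha))
        return w_s * x_tmp + w_c * x_tmp  # X_alpha^f
```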
Considering the ambiguity between cross-modal features, the proposed method adopts CAFM to incorporate RGB features and depth features alternately. Specifically, an element-wise addition and an element-wise multiplication are used to form the fused features F_ad and F_mu in the alternate strategy, and the fused features F_ad and F_mu are deployed to generate the channel weights and the spatial weights, respectively. The channel weights and the spatial weights are then used to obtain RGB features and depth features through the cross-modal alternation shown in Figure 3. The process can be described as follows:
X_r^f = ASA(conv(X_r), conv(conv(X_d))),
X_d^f = ASA(conv(X_d), conv(conv(X_r))),
f_cat = cat(X_r^f, X_d^f),
where conv(·) indicates a convolution operation with a 3 × 3 kernel; ASA(·) represents the ASA mechanism; cat is the concatenation operation; and X_r^f and X_d^f indicate the RGB features and depth features after filtering, respectively. Then, the proposed model alternately fuses and concatenates these features as:
X_ff^d = X_r^f ⊙ conv(conv(X_d)),
X_ff^r = X_d^f ⊙ conv(conv(X_r)),
F_f^i = cat(X_ff^d, X_ff^r, conv(f_cat)),
where F_f^i ∈ R^(H×W×C) represents the alternately fused cross-modal feature and i indicates the feature level in our model.
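The cross-modal alternation above can be sketched as follows, reusing the ASA module from the previous sketch. The element-wise operator between the steered features and the cross-modal convolution branch, the sharing of the first convolution between the two paths, and the final channel-reduction convolution are assumptions made so the sketch is runnable, not the authors' exact design.

```python
import torch
import torch.nn as nn


def conv_bn_relu(channels):
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                         nn.BatchNorm2d(channels), nn.ReLU(inplace=True))


class CAFM(nn.Module):
    """Cross-modal Alternate Fusion Module (sketch); ASA is the module sketched above."""
    def __init__(self, channels):
        super().__init__()
        self.conv_r1, self.conv_r2 = conv_bn_relu(channels), conv_bn_relu(channels)
        self.conv_d1, self.conv_d2 = conv_bn_relu(channels), conv_bn_relu(channels)
        self.conv_cat = nn.Conv2d(2 * channels, channels, 3, padding=1)   # conv(f_cat)
        self.asa_r, self.asa_d = ASA(channels), ASA(channels)
        self.reduce = nn.Conv2d(3 * channels, channels, 3, padding=1)     # assumption

    def forward(self, x_r, x_d):
        r1, d1 = self.conv_r1(x_r), self.conv_d1(x_d)    # conv(X_r), conv(X_d)
        r2, d2 = self.conv_r2(r1), self.conv_d2(d1)      # conv(conv(X_r)), conv(conv(X_d))
        x_rf = self.asa_r(r1, d2)                        # X_r^f = ASA(conv(X_r), conv(conv(X_d)))
        x_df = self.asa_d(d1, r2)                        # X_d^f = ASA(conv(X_d), conv(conv(X_r)))
        f_cat = torch.cat([x_rf, x_df], dim=1)           # f_cat
        x_ffd = x_rf * d2                                # X_ff^d (operator assumed)
        x_ffr = x_df * r2                                # X_ff^r (operator assumed)
        f_f = torch.cat([x_ffd, x_ffr, self.conv_cat(f_cat)], dim=1)
        return self.reduce(f_f)                          # F_f^i in R^(HxWxC)
```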
As shown in Figure 4, low-quality depth maps may cause ambiguity between RGB and depth features. The bottom row of Figure 4 is a representative sample, where the red boxes indicate the difference in salient objects. Many existing RGB-D SOD methods (such as ASIFNet [35] and CoNet [36]) fail to obtain correct results because they ignore the impact of this ambiguity. Unlike them, our method can obtain complete results when encountering the interference of ambiguity and low-quality depth maps. The saliency regions on which RGB features and depth features jointly focus are emphasized in Figure 4d–f, which is attributed to the effect of the alternate strategy. Therefore, it is reasonable to conclude that our method can alleviate the negative effects of depth map degradation.

3.3. Trapezoidal Pyramid Fusion Module (TPFM)

Global semantic information is more manifest in deep features, which helps the model locate the complete salient object, while shallow features focus on local information, such as the edge and contour of a salient object. Therefore, RGB-D SOD tasks need to ensure both the expression of global semantics and the preservation of local detail information when dealing with cross-modal fused features. Based on the above considerations, the proposed model designs a multi-scale feature fusion module, named the trapezoidal pyramid fusion module (TPFM), which is dedicated to fusing cross-modal features of different scales as well as features of different dilation rates at the same level. The detailed schematic of TPFM is shown in Figure 2. To obtain a larger receptive field, TPFM and dilated convolutions are united to form TPFM-D, enabling our method to better capture salient semantic information. We input the fused features { F_f^i }_{i=1}^5 generated from CAFM into TPFM-D to obtain the enhanced features { F_e^i }_{i=1}^5 with a large receptive field and then input these features into the TPFM to predict the saliency results.
The TPFM has five layers, and the first layer is composed of five nodes (features). The computation from the first layer to the second layer in TPFM can be formulated as follows:
F_k^se = UP(F_(k+1)^fs) ⊕ F_k^fs,  k ∈ {1, 2, 3, 4},
where k is the index of the nodes (features) of the second layer and { F_i^fs }_{i=1}^5 are the nodes (features) of the first layer in TPFM. UP(·) indicates the up-sampling operation, and the outputs { F_k^se }_{k=1}^4 are the four nodes of the second layer. The proposed method repeats this operation to obtain the remaining layers of TPFM. Finally, the channels of the fifth-layer node are compressed to 1 and a sigmoid activation function is applied to obtain the final saliency prediction S_pre.
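A compact sketch of the trapezoidal pyramid is shown below: five nodes are progressively reduced to a single node by upsampling each deeper node and combining it with its shallower neighbor. The combination (addition followed by a 3 × 3 convolution) and the assumption that all node features share a common channel count are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TPFM(nn.Module):
    """Trapezoidal Pyramid Fusion Module (sketch): each pyramid layer fuses
    adjacent nodes until a single node remains."""
    def __init__(self, channels, levels=5):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1)
                          for _ in range(levels - 1 - l))
            for l in range(levels - 1))

    def forward(self, feats):                  # feats: [F^1 (shallow) ... F^N (deep)]
        nodes = list(feats)
        for layer_convs in self.convs:
            nxt = []
            for k, conv in enumerate(layer_convs):
                deeper = F.interpolate(nodes[k + 1], size=nodes[k].shape[2:],
                                       mode="bilinear", align_corners=False)
                nxt.append(conv(nodes[k] + deeper))   # UP(F_{k+1}) combined with F_k
            nodes = nxt
        return nodes[0]                        # single fused node


# Example: five same-channel feature maps at decreasing resolutions.
feats = [torch.randn(1, 64, 256 // 2 ** i, 256 // 2 ** i) for i in range(5)]
fused = TPFM(64)(feats)   # output at the resolution of the shallowest input
```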
We combine the dilated convolution operation and the TPFM to form the TPFM-D. Specifically, the formula can be described as:
f_dl^j = DL_j(F_f^i),  j ∈ {2, 4, 6, 8},
where i ∈ {1, 2, 3, 4, 5} indicates the layer index in the decoder and DL_j(·) is a dilated convolution with dilation rate j ∈ {2, 4, 6, 8}. Then, the proposed method utilizes TPFM and a residual connection to obtain the output F_e^i of TPFM-D. The calculation can be defined as:
F_e^i = F_f^i ⊕ TPFM(F_DL),
where F_DL = { f_dl^2, f_dl^4, f_dl^6, f_dl^8 }, TPFM indicates the Trapezoidal Pyramid Fusion Module, and i ∈ {1, 2, 3, 4, 5} denotes the i-th layer in the decoder.
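Combining the dilated branches with the pyramid and a residual connection gives the TPFM-D sketch below; the TPFM class is the one from the previous sketch, and the kernel sizes are assumptions.

```python
import torch.nn as nn


class TPFMD(nn.Module):
    """TPFM-D (sketch): four parallel dilated convolutions (rates 2, 4, 6, 8) feed a
    TPFM over their outputs, and the result is added back to the input as a residual."""
    def __init__(self, channels):
        super().__init__()
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in (2, 4, 6, 8))
        self.tpfm = TPFM(channels, levels=4)

    def forward(self, f_f):
        f_dl = [conv(f_f) for conv in self.dilated]   # {f_dl^2, f_dl^4, f_dl^6, f_dl^8}
        return f_f + self.tpfm(f_dl)                  # F_e^i = F_f^i ⊕ TPFM(F_DL)
```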
To explain the function of TPFM, Figure 5 displays a visual illustration of the multi-scale features of TPFM. The features of the top two layers obviously contain disordered local information (e.g., edges, contours), while the last two layers focus on the salient regions and manifest more global information (e.g., integrality). These results illustrate the effectiveness of TPFM.

3.4. Loss Function

The universal loss function in the SOD tasks is binary cross entropy (BCE), a pixel-level loss that independently performs error calculation and supervision at different positions, defined as:
L_bce = −(1/(H × W)) Σ_(h=1)^H Σ_(w=1)^W [ g log p + (1 − g) log(1 − p) ],
where P = { p | 0 < p < 1 } ∈ R^(1×H×W) and G = { g | 0 < g < 1 } ∈ R^(1×H×W) represent the predicted value and the corresponding ground truth, respectively, and H and W represent the height and width of the input image. L_bce calculates the error between the ground truth G and the prediction P at each pixel.
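In PyTorch this pixel-wise loss corresponds directly to the built-in BCE criterion, assuming the prediction has already been passed through a sigmoid as described in Section 3.3:

```python
import torch
import torch.nn as nn

bce_loss = nn.BCELoss(reduction="mean")            # averages over all H x W positions

pred = torch.rand(8, 1, 256, 256)                  # S_pre after sigmoid
gt = torch.randint(0, 2, (8, 1, 256, 256)).float() # binary ground truth
loss = bce_loss(pred, gt)
```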

4. Experiments

4.1. Datasets

To comprehensively evaluate the performance of the proposed model, this paper conducts experiments on five public datasets, including LFSD [37], NLPR [38], DUT [39], NJU2K [40], and RGBD135 [41] datasets. LFSD includes 100 light fields collected using a Lytro light field camera, and consists of 60 indoor and 40 outdoor scenes. DUT consists of 800 indoor and 400 outdoor scenes with challenging depth images, such as multiple objects, complex backgrounds, and low-contrast environments. NLPR contains 1000 paired RGB-D images from a variety of indoor and outdoor scenes. NJU2K consists of 1985 stereo image pairs, collected from the Internet, 3D movies, and photographs taken by a Fuji W3 stereo camera. RGBD135 consists of 135 indoor RGB-D images captured by Kinect.
For fair comparisons, this paper uses the same training set and test set as [10,26]. The training set includes 650 randomly selected RGB-D image pairs from the NLPR dataset, 1400 image pairs from the NJU2K dataset, and 800 image pairs from the DUT dataset. The training samples are resized to 256 × 256 during training. The remaining publicly available RGB-D SOD datasets and the remaining image pairs of the above three datasets are used as the test set.

4.2. Evaluation Metrics

To quantitatively analyze the performance of the proposed A2TPNet, the adaptive E-measure (adpEm), mean E-measure (meanEm) [42], adaptive F-measure (adpFm), mean F-measure (meanFm) [43], weighted F-measure (WF) [44], Mean Absolute Error (MAE) [45], and the Precision-Recall (PR) curve are adopted.
To obtain the PR, predicted saliency maps are first binarized with fixed thresholds varying from 0 to 255, and then the precision and recall scores can be obtained by comparing the binary maps with the ground truth. The definitions of precision and recall are as follows:
Precision = area(S ∩ G) / area(S),
Recall = area(S ∩ G) / area(G),
where G and S respectively denote the foreground of ground truth and the foreground of the saliency map.
In general, saliency detection requires both precision and recall rates to be high. However, it is difficult to meet the requirement at the same time. Therefore, F-measure is defined as a comprehensive measurement to balance recall and precision:
F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall),
where β² is set to 0.3 to emphasize precision, as suggested in [43]; Precision and Recall denote the precision score and recall score. On this basis, adpFm is obtained under different thresholds in [0, 255]. The WF [44] is proposed to improve the existing F-measure.
MAE measures the pixel-level errors between the predicted saliency map S and ground truth G:
MAE = (1/(W × H)) Σ_(i=1)^W Σ_(j=1)^H | S(i, j) − G(i, j) |.
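As a reference, the sketch below computes precision, recall, the F-measure with β² = 0.3, and MAE for a single saliency map; the helper names and the ground-truth binarization at 0.5 are illustrative assumptions, and the E-measure (which requires the enhanced alignment matrix of [42]) is not reproduced here.

```python
import numpy as np


def precision_recall(sal, gt, threshold):
    """Precision/recall at a fixed binarization threshold (sal, gt in [0, 1])."""
    s = sal >= threshold
    g = gt >= 0.5
    inter = np.logical_and(s, g).sum()
    return inter / (s.sum() + 1e-8), inter / (g.sum() + 1e-8)


def f_measure(precision, recall, beta2=0.3):
    """F-beta with beta^2 = 0.3, emphasizing precision as in [43]."""
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)


def mae(sal, gt):
    """Mean absolute error between the saliency map and the ground truth."""
    return np.abs(sal - gt).mean()


# Example: sweep thresholds 0..255 for a PR curve (maps assumed in [0, 1]).
sal = np.random.rand(256, 256)
gt = (np.random.rand(256, 256) > 0.5).astype(np.float32)
curve = [precision_recall(sal, gt, t / 255.0) for t in range(256)]
print(f_measure(*curve[128]), mae(sal, gt))
```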
The E-measure is utilized to jointly characterize image-level statistics and local pixel matching:
E_ϕ = (1/(W × H)) Σ_(i=1)^W Σ_(j=1)^H ϕ_FM(i, j),
where ϕ_FM represents the enhanced alignment matrix [42]; adpEm and meanEm can be obtained on this basis.

4.3. Implementation Details

In this design, VGG16 is used as the backbone network of both the RGB and depth encoders. The backbones are initialized with ImageNet pre-trained weights; only the convolutional layers are retained in the two encoders, while the final pooling and fully connected classification layers are removed. In addition, our model is implemented in PyTorch on a server with one NVIDIA RTX 2080Ti GPU accelerator card. The Adam optimizer with momentum is used to train the proposed model. The weight decay is set to 5 × 10−4, the batch size to 8, the initial learning rate to 5 × 10−3, and the momentum to 0.9. The proposed network is trained for 30 epochs until convergence.
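A minimal training-loop sketch matching the reported settings is given below; A2TPNet and train_loader are hypothetical placeholders for the full model and data pipeline, and mapping the reported momentum of 0.9 to Adam's first-moment coefficient is an assumption.

```python
import torch

model = A2TPNet()        # hypothetical top-level module combining the sketches above
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3,
                             betas=(0.9, 0.999), weight_decay=5e-4)
criterion = torch.nn.BCELoss()

for epoch in range(30):                    # 30 training epochs
    for rgb, depth, gt in train_loader:    # assumed to yield 256x256 pairs, batch size 8
        pred = model(rgb, depth)           # saliency prediction S_pre
        loss = criterion(pred, gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```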

4.4. Comparison with State-of-the-Art Methods

Extensive experiments and comparisons are conducted on five public RGB-D datasets (LFSD, DUT, NLPR, RGBD135, and NJU2K) against 21 state-of-the-art methods, including A2dele [29], ASIFNet [35], CMWNet [14], CPFP [12], D3Net [26], DANet [46], DFNet [29], DMRA [47], ICNet [10], TANet [11], MMCI [48], CoNet [49], SSF [50], CJLB [51], BiANet [19], DQSD [52], HAINet [53], CFIDNet [54], DCFM [55], DRLF [56], and MobileSal [57]. Saliency maps of the comparison methods are generated by the original code under default parameters or are provided by the authors of those papers.
(1) Quantitative Evaluation: To evaluate the performance of the 21 RGB-D SOD methods on the five public datasets, the PR curves are presented in Figure 6 and a summary of the complete quantitative evaluation results is shown in Table 1. It can be seen that our model achieves the best results over the comparison methods on the five datasets, except for the adpEm score on the DUT dataset.
On the large-scale NJU2K dataset, our model outperforms the second-best HAINet [53] in terms of adpEm, meanEm, adpFm, meanFm, WF, and MAE. Furthermore, our model obtains perceptible gains (0.948 → 0.915, 0.919 → 0.915, 0.918 → 0.914, 0.901 → 0.900, 0.032 → 0.033) in the metrics of meanEm, adpFm, meanFm, WF, and MAE compared with the suboptimal method SSF [50] on DUT, which is a large-scale dataset with confusing backgrounds, transparent objects, and multi-object scenarios. On the small-scale LFSD dataset, our method still obtains the best results compared with other RGB-D SOD methods on the six evaluation metrics. The proposed model obtains percentage gains of 0.7%, 0.6%, 2.4%, 1.8%, 2.3%, and 22.6% in the metrics of adpEm, meanEm, adpFm, meanFm, WF, and MAE compared with the suboptimal method HAINet [53] on RGBD135. Our method also obtains the best results on the NLPR dataset, which contains many outdoor and indoor scenes.
To further evaluate the performance, the PR curves on the five datasets are reported in Figure 6; it can be seen that our model is effective and obtains both higher precision and recall scores than the other state-of-the-art methods on all datasets.
(2) Visual Comparisons: To further demonstrate the effectiveness of the proposed A2TPNet, this article provides some visual comparison results of different methods. As shown in Figure 7, our model outperforms all other methods in several complex scenes: low-contrast scenes, scenes with similar foreground and background, multi-object scenes, confusing-background scenes, and low-quality depth map scenes. In these typical challenging scenarios, our model can effectively and accurately locate the salient objects, showing that the proposed method outperforms the other compared models.

4.5. Ablation Study

In this section, this article provides the ablation study to validate the effectiveness of different components designed in the proposed network on NLPR [38], DUT [39], and NJU2K [40] datasets. We mainly investigate and discuss: (1) the benefits of the CAFM, (2) the effectiveness of the ASA mechanism in eliminating the ambiguity between the RGB features and the depth features, and (3) the effectiveness of the TPFM in integrating multi-scale features.
(1) The effectiveness of the CAFM: The baseline model (denoted as “Model 0”) contains a VGG16 backbone and a concatenation operation. The performance of this basic network without any additional module or mechanism is reported in Table 2. To verify the effectiveness of the CAFM, we add CAFM to “Model 0”, which is denoted as “Model 1”. The performance of “Model 1” is shown in Table 2. Compared with “Model 0”, “Model 1” yields better performance, with percentage gains of 2.6%, 3.2%, 4.6%, 5.1%, 5.8%, and 26% in terms of adpEm, meanEm, adpFm, meanFm, WF, and MAE on the DUT dataset, respectively, and percentage gains of 0.8%, 1%, 2.2%, 1.8%, 2.1%, and 9.8% in terms of adpEm, meanEm, adpFm, meanFm, WF, and MAE on the NJU2K dataset. These comparison results demonstrate the effectiveness of our CAFM: removing the CAFM from “Model 1” causes a clear performance drop, which illustrates that the proposed CAFM is a compelling fusion strategy.
(2) The effectiveness of the ASA mechanism: The proposed ASA mechanism is utilized to reduce redundant information and non-salient features during fusion. Based on “Model 1”, we remove the ASA mechanism from the CAFM, which is denoted as “Model 2”. As shown in Table 2, “Model 1” achieves percentage gains of 2.1%, 2.6%, 3.9%, 4.5%, 5.0%, and 22.4% on the DUT dataset and 0.7%, 1%, 1.7%, 1.8%, 2.1%, and 12.5% on the NLPR dataset in terms of adpEm, meanEm, adpFm, meanFm, WF, and MAE compared with “Model 2”, which demonstrates the advantages of the ASA mechanism in the alternate fusion of cross-modal features.
(3) The effectiveness of TPFM: To investigate the effectiveness of TPFM, this paper provides two variants, “Model 3” and “Model 4”, where the combination of the TPFM and “Model 1” is marked as “Model 3” and the incorporation of TPFM-D into “Model 3” is denoted as “Model 4”. As shown in Table 2, the results of “Model 3” outperform “Model 1” (e.g., adpEm: 0.905 → 0.900, meanEm: 0.911 → 0.909, adpFm: 0.856 → 0.844, meanFm: 0.861 → 0.854, WF: 0.837 → 0.830, and MAE: 0.053 → 0.055 on the NJU2K dataset), which confirms the effectiveness of TPFM. In addition, we further append the TPFM-D to verify its effectiveness; the proposed full model obtains percentage gains of 2.5%, 1.6%, 6.4%, 3.8%, 4.4%, and 25% compared with “Model 4” in terms of adpEm, meanEm, adpFm, meanFm, WF, and MAE on the NLPR dataset. We observe that TPFM-D is a rewarding combination of the TPFM and dilated convolution.

4.6. Limitations and Analysis

Although our model achieves promising results, it is challenging to locate salient objects ideally in extreme environments. Salient objects with abundant textures remain challenging. As shown in Figure 8, the proposed method fails to highlight the salient objects accurately: the rich texture in the RGB images and the unclear texture information in the depth maps prevent our method from extracting useful features, resulting in meaningless predictions.

5. Conclusions

In this paper, a novel Alternate Steered Attention and Trapezoidal Pyramid Fusion Network (A2TPNet) is proposed for RGB-D SOD tasks. The proposed model reconsiders the difference between the two modalities and explores the ambiguity between the RGB and depth modalities, which is vital for accurate saliency reasoning in RGB-D images. Some examples of this ambiguity are presented in Figure 1. Based on this view, a Cross-modal Alternate Fusion Module (CAFM) with alternate steered attention (ASA) is proposed to eliminate the negative effects of the ambiguity between RGB and depth modalities. Thanks to the effectiveness of ASA, our model can effectively mitigate these adverse effects. Furthermore, to further enhance the semantics of cross-modal fused features, our model constructs a multi-scale feature fusion module with a trapezoidal pyramid structure, namely TPFM, which adopts a top-down progressive strategy to integrate multi-scale features. We have conducted extensive experiments, including quantitative analysis, visual comparison, and ablation studies, demonstrating that our model outperforms 21 other SOTA RGB-D SOD methods. In particular, our method obtains a strong MAE score of 0.015 on the RGBD135 dataset. In short, our method mainly accounts for the ambiguity of the RGB and depth modalities through cross-modal alternation and an attention mechanism, which provides a feasible idea for the extraction and learning of multi-modal data. Furthermore, we adopt sparse connections to construct a progressive multi-scale feature decoder. We believe that the proposed strategy of exploring the ambiguity between the two modalities and the sparse connection of multi-scale features is valuable to the community.
Our method may be further improved in the following aspects in the future. First, a joint learning framework that models both the ambiguity and the consistency between RGB and depth modalities may further enhance the representative ability of multi-modality features. Second, efficient and simple multi-scale feature integration strategies (e.g., sparse connectivity) may be employed for saliency inference. In addition, the images in existing datasets are far from real-life conditions [59]. In future work, we will consider the noise generated in real environments by employing denoising technologies, such as DCSR [60].

Author Contributions

Data curation, S.D. and X.G.; Methodology, S.D., X.G. and C.X.; Writing–original draft, S.D. and X.G.; Writing–review & editing, C.X. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (number 6210071479); Natural Science Research Project of Colleges and Universities in Anhui Province (number KJ2020A0299); University-level key projects of Anhui University of science and technology (number QN2019102); University-level general projects of Anhui University of science and technology (number xjyb2020-04); Anhui Province Natural Science Foundation of China (number 2108085QF258).

Data Availability Statement

The datasets involved in this paper are all from open-source links. Researchers in the field have integrated them, and they can be obtained from the Object-level Salient Object Detection datasets chapter at http://mmcheng.net/socbenchmark/.

Acknowledgments

Special thanks to reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gao, Y.; Wang, M.; Tao, D.; Ji, R.; Dai, Q. 3-D object retrieval and recognition with hypergraph analysis. IEEE Trans. Image Process. 2012, 21, 142–149.
  2. Hong, S.; You, T.; Kwak, S.; Han, B. Online tracking by learning discriminative saliency map with convolutional neural network. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 597–606.
  3. Fan, D.-P.; Ji, G.-P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. PraNet: Parallel reverse attention network for polyp segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, Lima, Peru, 4–8 October 2020; pp. 263–273.
  4. Martinel, N.; Micheloni, C.; Foresti, G.L. Kernelized saliency-based person re-identification through multiple metric learning. IEEE Trans. Image Process. 2015, 24, 5645–5658.
  5. Yao, Y.; Chen, T.; Xie, G.S.; Zhang, C.; Shen, F.; Wu, Q.; Zhang, J. Non-salient region object mining for weakly supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 2623–2632.
  6. Adjabi, I.; Ouahabi, A.; Benzaoui, A.; Taleb-Ahmed, A. Past, present, and future of face recognition: A review. Electronics 2020, 9, 1188.
  7. Wang, W.; Shen, J.; Yu, Y.; Ma, K.L. Stereoscopic thumbnail creation via efficient stereo saliency detection. IEEE Trans. Vis. Comput. Graph. 2016, 23, 2014–2027.
  8. Xia, C.; Zhang, H.; Gao, X.; Li, K. Exploiting background divergence and foreground compactness for salient object detection. Neurocomputing 2020, 383, 194–211.
  9. Cheng, M.-M.; Mitra, N.J.; Huang, X.; Torr, P.H.; Hu, S.-M. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 569–582.
  10. Zhu, C.; Li, G. A multilayer backpropagation saliency detection algorithm and its applications. Multimed. Tools Appl. 2018, 77, 181–197.
  11. Feng, D.; Barnes, N.; You, S.; McCarthy, C. Local background enclosure for RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–21 June 2016; pp. 2343–2350.
  12. Chen, H.; Li, Y.; Su, D. Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognit. 2019, 86, 376–385.
  13. Yao, C.; Feng, L.; Kong, Y.; Li, S.; Li, H. Double cross-modality progressively guided network for RGB-D salient object detection. Image Vis. Comput. 2022, 117, 104351.
  14. Fan, D.-P.; Zhai, Y.; Borji, A.; Yang, J.; Shao, L. BBSNet: RGB-D salient object detection with a bifurcated backbone strategy network. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 275–292.
  15. Fu, K.; Fan, D.-P.; Ji, G.-P.; Zhao, Q. JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3052–3062.
  16. Li, G.; Liu, Z.; Ling, H. ICNet: Information conversion network for RGB-D based salient object detection. IEEE Trans. Image Process. 2020, 29, 4873–4884.
  17. Li, G.; Liu, Z.; Ye, L.; Wang, Y.; Ling, H. Cross-modal weighting network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 665–681.
  18. Zhao, J.X.; Cao, Y.; Fan, D.P.; Cheng, M.M.; Li, X.Y.; Zhang, L. Contrast prior and fluid pyramid integration for RGBD salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3927–3936.
  19. Zhang, Z.; Lin, Z.; Xu, J.; Jin, W.-D.; Lu, S.-P.; Fan, D.-P. Bilateral attention network for RGB-D salient object detection. IEEE Trans. Image Process. 2021, 30, 1949–1961.
  20. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259.
  21. Liu, T.; Yuan, Z.; Sun, J.; Wang, J.; Zheng, N.; Tang, X.; Shum, H.-Y. Learning to detect a salient object. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 353–367.
  22. Zhu, W.; Liang, S.; Wei, Y.; Sun, J. Saliency optimization from robust background detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2814–2821.
  23. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient object detection in the deep learning era: An in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3239–3259.
  24. Shen, X.; Wu, Y. A unified approach to salient object detection via low rank matrix recovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 853–860.
  25. Jian, M.; Wang, J.; Liu, X.; Yu, H. Visual saliency detection based on full convolution neural networks and center prior. In Proceedings of the International Conference on Human System Interaction, Richmond, VA, USA, 25–27 June 2019; pp. 225–228.
  26. Hou, Q.; Cheng, M.-M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P.H. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3203–3212.
  27. Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. BASNet: Boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 7479–7489.
  28. Liu, Y.; Zhang, X.-Y.; Bian, J.-W.; Zhang, L.; Cheng, M.-M. SAMNet: Stereoscopically attentive multi-scale network for lightweight salient object detection. IEEE Trans. Image Process. 2021, 30, 3804–3814.
  29. Piao, Y.; Rong, Z.; Zhang, M.; Ren, W.; Lu, H. A2dele: Adaptive and attentive depth distiller for efficient RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9060–9069.
  30. Chen, Z.; Cong, R.; Xu, Q.; Huang, Q. DPANet: Depth potentiality-aware gated attention network for RGB-D salient object detection. IEEE Trans. Image Process. 2020, 30, 7012–7024.
  31. Pang, Y.; Zhang, L.; Zhao, X.; Lu, H. Hierarchical dynamic filtering network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 235–252.
  32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
  33. Shi, C.; Zhang, W.; Duan, C.; Chen, H. A pooling-based feature pyramid network for salient object detection. Image Vis. Comput. 2021, 107, 104099.
  34. Zhou, X.; Li, G.; Gong, C.; Liu, Z.; Zhang, J. Attention-guided RGBD saliency detection using appearance information. Image Vis. Comput. 2020, 95, 103888.
  35. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  36. Fan, D.-P.; Lin, Z.; Zhang, Z.; Zhu, M.; Cheng, M.-M. Rethinking RGB-D salient object detection: Models, data sets, and large-scale benchmarks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 2075–2089.
  37. Chen, S.; Fu, Y. Progressively guided alternate refinement network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 520–538.
  38. Fan, D.-P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.-M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 698–704.
  39. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 22–24 June 2009; pp. 1597–1604.
  40. Margolin, R.; Zelnik-Manor, L.; Tal, A. How to evaluate foreground maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 248–255.
  41. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740.
  42. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
  43. Han, J.; Chen, H.; Liu, N.; Yan, C.; Li, X. CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE Trans. Cybern. 2018, 48, 3171–3183.
  44. Wang, N.; Gong, X. Adaptive fusion for RGB-D salient object detection. IEEE Access 2019, 55, 255–277.
  45. Chen, H.; Li, Y. Three-stream attention-aware network for RGB-D salient object detection. IEEE Trans. Image Process. 2019, 28, 2825–2835.
  46. Piao, Y.; Ji, W.; Li, J.; Zhang, M.; Lu, H. Depth-induced multi-scale recurrent attention network for saliency detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 733–740.
  47. Chen, H.; Deng, Y.; Li, Y.; Hung, T.-Y.; Lin, G. RGBD salient object detection via disentangled cross-modal fusion. IEEE Trans. Image Process. 2020, 29, 8407–8416.
  48. Li, C.; Cong, R.; Kwong, S.; Hou, J.; Fu, H.; Zhu, G.; Zhang, D.; Huang, Q. ASIF-Net: Attention steered interweave fusion network for RGB-D salient object detection. IEEE Trans. Cybern. 2020, 51, 88–100.
  49. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
  50. Zhang, M.; Ren, W.; Piao, Y.; Rong, Z.; Lu, H. Select, supplement and focus for RGB-D saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3472–3481.
  51. Zhu, X.; Li, Y.; Fu, H.; Fan, X.; Shi, Y.; Lei, J. RGB-D salient object detection via cross-modal joint feature extraction and low-bound fusion loss. Neurocomputing 2021, 453, 623–635.
  52. Chen, C.; Wei, J.; Peng, C.; Qin, H. Depth-quality-aware salient object detection. IEEE Trans. Image Process. 2021, 30, 2350–2363.
  53. Li, G.; Liu, Z.; Chen, M.; Bai, Z.; Lin, W.; Ling, H. Hierarchical alternate interaction network for RGB-D salient object detection. IEEE Trans. Image Process. 2021, 30, 3528–3542.
  54. Chen, T.; Hu, X.; Xiao, J.; Zhang, G.; Wang, S. CFIDNet: Cascaded feature interaction decoder for RGB-D salient object detection. Neural Comput. Appl. 2022, 34, 7547–7563.
  55. Wang, F.; Pan, J.; Xu, S.; Tang, J. Learning discriminative cross-modality features for RGB-D saliency detection. IEEE Trans. Image Process. 2022, 31, 1285–1297.
  56. Wang, X.; Li, S.; Chen, C.; Fang, Y.; Hao, A.; Qin, H. Data-level recombination and lightweight fusion scheme for RGB-D salient object detection. IEEE Trans. Image Process. 2020, 30, 458–471.
  57. Wu, Y.H.; Liu, Y.; Xu, J.; Bian, J.W.; Gu, Y.C.; Cheng, M.M. MobileSal: Extremely efficient RGB-D salient object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021.
  58. Sun, P.; Zhang, W.; Wang, H.; Li, S.; Li, X. Deep RGB-D saliency detection with depth-sensitive attention and automatic multi-modal fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1407–1417.
  59. Mahdaoui, A.E.; Ouahabi, A.; Moulay, M.S. Image denoising using a compressive sensing approach based on regularization constraints. Sensors 2022, 22, 2199.
  60. Ouahabi, A. (Ed.) Signal and Image Multiresolution Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2012.
Figure 1. Visual demonstrations of ambiguities among different modalities. The red boxes indicate salient differences between the depth map and the RGB image. (a–e) present five examples of the ambiguity between RGB and depth images. GT means the ground truth of the saliency map.
Figure 2. Pipeline of the proposed A2TPNet.
Figure 3. The structure of the cross-modal alternate fusion module (CAFM).
Figure 4. Visual demonstrations of cross-modal fusion in CAFM. The annotations are (a) RGB images, (b) depth maps, (c) ground truth, (d) RGB feature in 1st level of RGB Encoder, (e) depth features in 1st level of depth Encoder, (f) the fused features of (d) and (e) by CAFM, (g) the predicted saliency map, (h) the predicted saliency map by ASIFNet [35], and (i) the predicted saliency map by CoNet [36]. The red boxes indicate the ambiguity of cross-modal features.
Figure 5. Some visual demonstrations of multi-scale features. The annotations are (a) RGB images, (b) depth maps, (c) ground truth, (d) the fused features in the 1st level, (e) the fused features in the 2nd level, (f) the fused features in the 3rd level, (g) the fused features in the 4th level, (h) the fused features in the 5th level, (i) multi-scale fused features, and (j) the predicted saliency map.
Figure 6. Precision–Recall curves of the proposed model and other salient object detection methods on the LFSD [37], NLPR [38], DUT [39], NJU2K [40], and RGBD135 [41] datasets.
Figure 7. Visual comparison of different RGB-D SOD methods. Annotations from left to right are: (a) Input image, (b) Depth, (c) Ground Truth, (d) DMRA [47], (e) TANet [11], (f) MMCI [15], (g) CPFP [12], (h) CMWN [14], (i) DANet [46], (j) D3Net [26], (k) ICNet [10], and (l) Ours.
Figure 8. Some failure cases of our model A2TPNet, DSA2F [58], and D3Net [26]. The top two rows show salient objects with occlusion, and the bottom rows show salient objects with abundant texture.
Table 1. Quantitative results of the proposed method against 21 state-of-the-art RGB-D SOD methods in terms of adpEm, meanEm, adpFm, meanFm, WF, and MAE. The top three results are highlighted in red, blue, and green, respectively. ↑ and ↓ indicate that larger and smaller values are better, respectively.
DatasetMetricCPFPDMRAMMCITANetA2deleSSFCMWNetCoNetDANetD3NetDFNetICNetASIFCJLBBiANetDQSDHAINetDRLFMobileSalCFIDNetDCFMOurs
C V P R 19 C V P R 19 P R 19 T I P 19 C V P R 20 C V P R 20 E C C V 20 E C C V 20 E C C V 20 T N N L S 20 T I P 20 T I P 21 T C y b 21 N P 21 T I P 21 T I P 21 T I P 21 T I P 21 T P A M I 21 N C A 22 T I P 22
LFSD a d p E m 0.867 0.899 0.840 0.851 0.880 0.901 0.907 0.901 0.877 0.863 0.839 0.900 0.861 0.850 0.822 0.844 0.8720.8940.9010.9050.908
m e a n E m 0.8630.8930.7740.8200.8780.8910.9000.8950.8720.8480.8230.8910.8500.8400.7810.8630.8590.8810.8950.8940.903
a d p F m 0.8130.8480.7790.7940.8350.8670.8700.8470.8260.8040.7670.8610.8270.7860.7510.8420.8210.8400.8570.8610.870
m e a n F m 0.8110.8440.7210.7710.8330.8590.8610.8440.8250.7950.7490.8520.8180.7860.7150.8260.8080.8280.8490.8500.866
W F 0.7750.8140.6630.0710.8100.8310.8330.8180.7890.7590.0700.8210.7800.7530.6700.7960.7720.8000.8250.8250.835
M A E 0.0880.0750.1310.1110.0730.0660.0660.0710.0820.0950.1180.0710.0900.1060.1270.0850.0890.0790.0700.0680.065
RGBD35 a d p E m 0.9270.9440.9040.9190.9220.9480.9670.9450.9600.9510.9230.9590.9600.9250.9700.9670.9540.9730.9430.9670.974
m e a n E m 0.8880.9360.8250.8630.9190.9320.9540.9340.9220.9100.8950.9400.9320.8680.9550.9580.9100.9590.9340.9480.964
a d p F m 0.8290.8570.7620.7940.8650.8760.9000.8610.8910.8700.8180.8890.8700.8300.8940.9130.8680.9100.8980.8960.936
m e a n F m 0.8240.8680.7350.7890.8620.8730.9090.8730.8770.8600.8220.8930.8730.8080.9010.9130.8520.9110.8980.9010.930
W F 0.7870.8490.6490.7380.8450.8600.8870.8560.8480.8280.7790.8670.8560.7740.8870.8970.8290.8950.8750.8810.918
M A E 0.0380.0290.0650.0460.0280.0250.0220.0270.0280.0310.0400.0270.0290.0380.0210.0190.0300.0210.0230.0230.015
NLPR a d p E m 0.9240.9410.8710.9160.9450.9510.9400.9340.9440.9450.9330.9440.9460.9390.9390.9350.9520.9360.9520.9510.9400.953
m e a n E m 0.9180.9390.8410.9010.9420.9470.9390.9330.9320.9350.9280.9400.9380.9310.9240.9350.9490.9220.9530.9480.9380.950
a d p F m 0.8230.8540.7300.7950.8780.8750.8590.8480.8650.8610.8380.8690.8710.8490.8490.8420.8910.8440.8770.8850.8540.895
m e a n F m 0.8400.8640.7370.8190.8760.8820.8770.8650.8710.8720.8540.8840.8750.8610.8560.8640.8930.8540.8890.8920.8750.895
W F 0.8130.8450.6750.7790.8670.8740.8560.8490.8490.8480.8270.8640.8560.8480.8330.8430.8800.8300.8740.8760.8560.880
M A E 0.0360.0310.0590.0410.0280.0260.0290.0310.0310.0290.0340.0280.0300.0330.0320.0290.0250.0320.0250.0260.0290.024
DUT a d p E m 0.8680.9300.9300.9520.9220.9520.9290.8490.8530.9010.8830.8940.8890.9390.8710.9400.9500.951
m e a n E m 0.8410.9260.9260.9150.9110.9470.9160.7890.8320.8840.8690.8630.8630.9360.8410.9260.9440.948
a d p F m 0.7930.8830.8930.9150.8650.9090.8840.7550.7490.8300.8230.8100.8180.9060.8030.9120.9060.919
m e a n F m 0.7790.8850.8880.9140.8630.9110.8790.7170.7420.8260.8130.7990.8070.9080.7830.9020.9070.918
W F 0.7420.8570.8700.9000.8310.8950.8460.6680.6980.7840.7790.7600.7750.8830.7400.8690.8890.901
M A E 0.0760.0480.0420.0330.0560.0330.0470.0960.1040.0720.0720.0750.0720.0380.0800.0440.0350.032
NJU2K a d p E m 0.9000.9200.8810.9090.9160.9350.9220.9240.9260.9150.9130.9120.9230.8950.9070.9130.9310.9030.9390.9290.9250.942
m e a n E m 0.9100.9190.8510.8950.9130.9280.9220.9250.9200.9220.9120.9130.9210.9060.8920.9190.9340.9090.9350.9370.9300.942
a d p F m 0.8370.8720.8130.8440.8740.8860.8800.8720.8760.8650.8580.8670.8750.8300.8480.8610.8960.8490.8940.8920.8810.906
m e a n F m 0.8500.8730.7930.8410.8700.8850.8810.8740.8740.8780.8600.8680.8770.8520.8440.8750.8980.8580.8910.8980.8880.907
W F 0.8280.8530.7390.8040.8510.8710.8560.8560.8520.8540.8310.8430.8540.8350.8110.8520.8790.8310.8740.8820.8670.890
M A E 0.0530.0510.0780.0600.0510.0420.0450.0460.0460.0460.0520.0520.0470.0560.0560.0500.0380.0550.0410.0380.0430.037
Table 2. Results of ablation studies. The best results are highlighted in red.
Variants | DUT (adpEm / meanEm / adpFm / meanFm / WF / MAE) | NLPR (adpEm / meanEm / adpFm / meanFm / WF / MAE) | NJU2K (adpEm / meanEm / adpFm / meanFm / WF / MAE)
Model 0 | 0.900 / 0.888 / 0.820 / 0.820 / 0.788 / 0.070 | 0.919 / 0.921 / 0.817 / 0.838 / 0.817 / 0.039 | 0.893 / 0.901 / 0.826 / 0.839 / 0.813 / 0.061
Model 1 | 0.923 / 0.916 / 0.858 / 0.862 / 0.834 / 0.052 | 0.920 / 0.926 / 0.823 / 0.847 / 0.827 / 0.035 | 0.900 / 0.909 / 0.844 / 0.854 / 0.830 / 0.055
Model 2 | 0.904 / 0.893 / 0.826 / 0.825 / 0.794 / 0.067 | 0.914 / 0.917 / 0.809 / 0.832 / 0.810 / 0.040 | 0.890 / 0.899 / 0.822 / 0.834 / 0.809 / 0.062
Model 3 | 0.924 / 0.914 / 0.863 / 0.864 / 0.838 / 0.053 | 0.937 / 0.936 / 0.855 / 0.869 / 0.853 / 0.032 | 0.905 / 0.911 / 0.856 / 0.861 / 0.837 / 0.053
Model 4 | 0.924 / 0.914 / 0.861 / 0.863 / 0.836 / 0.053 | 0.930 / 0.935 / 0.841 / 0.862 / 0.843 / 0.032 | 0.904 / 0.911 / 0.849 / 0.857 / 0.834 / 0.055
Ours | 0.951 / 0.948 / 0.919 / 0.918 / 0.901 / 0.032 | 0.953 / 0.950 / 0.895 / 0.895 / 0.880 / 0.024 | 0.942 / 0.942 / 0.906 / 0.907 / 0.890 / 0.037
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
