Review

Attention Mechanism Used in Monocular Depth Estimation: An Overview

School of Information Science and Technology, North China University of Technology, Beijing 100144, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(17), 9940; https://doi.org/10.3390/app13179940
Submission received: 11 July 2023 / Revised: 18 August 2023 / Accepted: 25 August 2023 / Published: 2 September 2023

Abstract
Monocular depth estimation (MDE), as one of the fundamental tasks of computer vision, plays an important role in downstream applications such as virtual reality, 3D reconstruction, and robotic navigation. Convolutional neural network (CNN)-based methods have made remarkable progress compared with traditional methods based on visual cues. However, recent research reveals that the performance of MDE using CNN can be degraded by the local receptive field of CNN. To bridge this gap, various attention mechanisms have been proposed to model long-range dependency. Although reviews of CNN-based MDE algorithms have been reported, a comprehensive outline of how attention boosts MDE performance has not yet been provided. In this paper, we first categorize recent attention-related works into CNN-based, Transformer-based, and hybrid (CNN–Transformer) approaches according to how the attention mechanism affects the extraction of global features. Secondly, we discuss the details and contributions of attention-based MDE methods published from 2020 to 2022. Then, we compare the performance of typical attention-based methods. Finally, the challenges and trends of the attention mechanism used in MDE are discussed.

1. Introduction

Depth information between the camera and objects is crucial for applications such as 3D scene reconstruction, robotic obstacle avoidance, and extended reality [1,2]. Traditionally, depth information is obtained using technologies such as LiDAR, stereo cameras, and time of flight (TOF). Recently, monocular depth estimation from a single image has attracted considerable interest because MDE is superior to other technologies in terms of cost, payload, and ease of assembly. However, depth estimation from a single image is an ill-posed problem because a pixel of an image corresponds to infinitely many physical points in the real world. To address this issue, researchers draw inspiration from human vision, in which visual cues are exploited as prior knowledge to determine distances between objects, even when only one eye is used. Visual cues include object scale, shadow, occlusion, texture, shape, etc. Therefore, MDE can be formulated as the problem of uncovering the underlying non-linear mapping between visual cues and the corresponding depth information.
In 2014, Eigen et al. proposed a supervised MDE method using CNN to predict depth from only a single image [3]. Inspired by their groundbreaking work, numerous MDE approaches have been intensively investigated [4,5,6,7]. An encoder–decoder architecture built from CNN is predominant in the MDE community. CNN uses convolutional operations within a local receptive field and downsampling to extract hierarchical features from input images [8,9]. The lower layers have high resolution and a small receptive field, while the higher layers have low resolution and a large receptive field. Kavuran et al. employ CNN for classification tasks, thereby addressing several issues in human development, such as education, economy, and epidemics [8]. In addition, Hamad et al. effectively detect the novel coronavirus from X-rays using CNN [9]. Different from image classification and detection, dense prediction tasks such as depth estimation require long-range context and detailed surroundings to determine the depth value of the central pixel [10]. However, the higher-level features extracted by CNN lose too much detail. This problem is not fully addressed even when skip connections are utilized to supplement details from lower layers [11]. This drawback leads to missed detection of small objects, blurred edges, and incomplete objects. Therefore, CNN alone is not well suited for the MDE task [10].
To address the issue of the local receptive field of CNN, the attention mechanism is introduced into MDE to capture long-distance dependency. Attention is inspired by the biological vision system and was originally proposed to model the correlations between the source and the target in sequence-to-sequence tasks [12]. Essentially, attention reweights the input information and yields a new representation with a global receptive field. In the MDE task, attention is conducted either in the spatial domain or in the channel domain of the feature maps extracted by encoders. Experiments show that the points of the new representation that contribute more are assigned higher weights, which leads to a remarkable improvement for MDE tasks [12]. Recently, the Transformer was constructed from multi-head self-attention modules in which convolutional operations are entirely discarded [13]. The Transformer has achieved significant success in natural language processing. Furthermore, the Transformer was adapted for computer vision tasks, and the vision Transformer (ViT) was proposed by Dosovitskiy et al. [14]. In 2021, Ranftl et al. proposed the dense prediction Transformer (DPT), which uses the ViT as the backbone architecture [10]. Leveraging the global receptive field and the absence of downsampling operations in the ViT, DPT outperforms state-of-the-art CNN-based methods by a large margin in terms of accuracy for MDE tasks. After that, numerous Transformer variants following this idea emerged in the MDE community [11,15,16,17]. Although Transformer-based methods obtain substantial improvements for MDE tasks, these models are difficult to train compared with CNN models because the Transformer lacks spatial inductive bias [10]. To mitigate this drawback, an intuitive idea is to combine the global receptive field of the Transformer and the inductive bias of CNN into an integrated framework. In 2022, Zhang et al. proposed Lite-Mono, which effectively integrates CNN and Transformer to capture local features and global context information [18]. Compared with Transformer-based methods, it maintains strong performance on MDE with a significantly reduced number of parameters. Herein, we call such CNN–Transformer architectures hybrid methods. Soon after, several hybrid models were proposed in the field of MDE [19,20,21].
The existing MDE methods can be categorized into supervised and self-supervised approaches from the perspective of learning strategy. The training of supervised methods requires ground truth, which is expensive to label. Leveraging the inherent geometric consistency of stereo images or video sequences, self-supervised methods remove the requirement of ground truth and use photometric error to train the depth estimation networks. Consequently, we summarize the existing attention-based approaches into six categories: supervised CNN-based, self-supervised CNN-based, supervised Transformer-based, self-supervised Transformer-based, supervised hybrid, and self-supervised hybrid methods. Hereafter, we follow this taxonomy to review the literature.
In this work, we provide a comprehensive survey of the attention mechanism used in MDE methods. There have been a few overviews of monocular depth estimation [22,23,24]. However, to the best of our knowledge, this is the first survey that focuses on the attention mechanism used in MDE. Our contribution lies in providing a detailed review of how attention is integrated into MDE methods, as well as why attention can benefit the performance of dense prediction tasks. We hope that our contributions not only clarify the concept of attention in MDE, but also help related researchers to identify potential research directions.
The remainder of the paper is structured as follows: Section 2 introduces the background of MDE and attention. Section 3 analyzes the existing MDE methods. Section 4 presents a performance comparison of typical algorithms. Section 5 discusses the challenges and trends of MDE algorithms. Section 6 concludes the review.

2. Preliminary

2.1. Monocular Depth Estimation

The goal of monocular depth estimation is to retrieve distance information using only a single image. This is highly challenging because a pixel of a 2D image may correspond to infinitely many physical points in 3D space. However, humans can predict distance by aggregating the surrounding visual cues even when only one eye is used. This fact suggests that visual cues may encode distance information. Thus, MDE can be formulated as the problem of uncovering an underlying mapping between visual cues and distance information, which can be defined as follows:
$D = f(v_1, v_2, \dots, v_n)$  (1)
where $D$ is the distance, $v_1, v_2, \dots, v_n$ are visual cues, and $f$ is a non-linear function. Visual cues, such as scale, shadow, texture, etc., are often fused into intermediate representations extracted by encoders, while decoders are used to gather the features and predict the distance information. Therefore, Equation (1) can be rewritten as Equation (2):
$D = \mathrm{Dec}(\mathrm{Enc}(X))$  (2)
where $X$ is the input image, $\mathrm{Enc}$ is an encoder, and $\mathrm{Dec}$ is a decoder. Most of the existing MDE methods exploit an encoder–decoder architecture consisting of CNN, Transformer, or a mixture of CNN and Transformer.
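To make Equation (2) concrete, the following is a minimal PyTorch-style sketch of the encoder–decoder abstraction. The module name, layer sizes, and the sigmoid output head are illustrative assumptions and do not correspond to any specific published architecture.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Minimal sketch of Equation (2): D = Dec(Enc(X))."""
    def __init__(self):
        super().__init__()
        # Encoder: strided convolutions extract hierarchical features while downsampling.
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: upsampling plus convolutions regress a one-channel depth map.
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),  # normalized (inverse) depth, an assumption
        )

    def forward(self, x):
        return self.dec(self.enc(x))

depth = TinyDepthNet()(torch.randn(1, 3, 192, 640))  # output shape: (1, 1, 192, 640)
```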

2.2. Attention

The concept of attention is inspired by the biological vision system. For instance, in the image classification task, the salient components are more relevant to the object category and should receive more attention after a glimpse of the holistic scene. In the machine translation task, some words may have more influence on the output. Therefore, each component of the source may have a different impact on the target and should not be treated equally. The attention mechanism is able not only to model the relevance between the source and the target, but also to generate new representations according to the weights of each component of the source. Technically, the attention model calculates coefficients using a query (Q) and a set of keys (K) and generates new representations by combining values (V) weighted by the coefficients. A generalized attention model is defined in Equation (3). For details, please refer to [25].
$\mathrm{Attn}(Q, K, V) = \sum_i p(\sigma(k_i, Q)) \times v_i$  (3)
where $p$ is a distribution function and $\sigma$ is an alignment function.
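As a rough illustration of Equation (3), the sketch below instantiates the alignment function as a dot product and the distribution function as a softmax; these specific choices, and the tensor shapes, are assumptions for demonstration only.

```python
import torch

def generalized_attention(q, K, V):
    """Sketch of Equation (3): align each key with the query, turn the scores into a
    distribution, and return the values weighted by that distribution."""
    scores = K @ q                       # alignment sigma(k_i, Q) for every key: shape (n,)
    weights = torch.softmax(scores, 0)   # distribution p over the keys
    return weights @ V                   # sum_i p(sigma(k_i, Q)) * v_i

q = torch.randn(64)        # one query vector
K = torch.randn(10, 64)    # ten keys
V = torch.randn(10, 64)    # ten values
out = generalized_attention(q, K, V)   # new representation, shape (64,)
```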
Bahdanau et al. first proposed an attention model to address the machine translation task [12]. Recently, the attention mechanism, in cooperation with CNN, has become popular in computer vision tasks to capture long-range context [26]. In the MDE task, attention can be applied over the spatial axis or the channel axis. Additionally, the source and the target may be the same sequence or different sequences. Thus, we classify the attention mechanisms in the MDE task into four categories: channel self-attention (CSA), channel cross-attention (CCA), spatial self-attention (SSA), and spatial cross-attention (SCA).
Channel self-attention. The attention map is generated over the different channels of the feature maps extracted by an encoder. Channel self-attention fuses the information of several channels since each channel reflects a different aspect of the input image. Hu et al. built a squeeze-and-excitation (SE) block [27], as shown in Figure 1, to recalibrate the relationships among channels. Global average pooling is first applied over the whole image to produce the global dependency in the squeeze stage; a sigmoid gating function is then used to generate an attention map in the excitation stage.
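A minimal sketch of the SE block described above is given below; the reduction ratio r is a typical but assumed hyperparameter.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention in the spirit of [27]."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, x):                             # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                        # squeeze: global average pooling -> (B, C)
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)    # excitation: channel weights (B, C, 1, 1)
        return x * w                                  # recalibrate the channels
```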
Woo et al. proposed the convolutional block attention module (CBAM), consisting of a channel attention module and a spatial attention module, to enhance the performance of CNN [28]. The channel attention module of CBAM is shown in Figure 2. The input feature map $F \in \mathbb{R}^{C \times H \times W}$ is pooled in parallel over the $H \times W$ spatial dimensions by average-pooling and max-pooling operations to obtain two 1D vectors of $C$ elements. These two vectors are then fed into fully connected layers to generate the channel attention map, i.e., the channel weights. Finally, the new representation is obtained by multiplying the channel weights with the original inputs. Compared with SE, CBAM uses not only average pooling but also maximum pooling. In addition, the channel attention module is followed by a spatial attention module in CBAM.
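The following sketch illustrates the CBAM channel attention module just described, with the average- and max-pooled descriptors passed through a shared MLP; the reduction ratio is again an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class CBAMChannelAttention(nn.Module):
    """Channel attention in the spirit of CBAM [28]."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))                  # descriptor from average pooling: (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))                   # descriptor from max pooling: (B, C)
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)  # channel weights (B, C, 1, 1)
        return x * w                                        # reweight the channels
```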
Channel cross-attention. Unlike channel self-attention, the attention map of channel cross-attention is generated by interactions between groups of feature maps extracted by different encoders. Ates et al. utilize the channel cross-attention module shown in Figure 3a to capture global dependencies among channels by leveraging cross-attention across channel tokens of multi-scale encoder features [29]. First, each token is subjected to layer normalization. The tokens are then concatenated along the channel dimension to form keys and values. Finally, the projected queries, keys, and values are fed into the cross-attention operation.
Spatial self-attention. The input of the attention module is an input image or a specific feature map. We call it spatial self-attention when Q, K, and V come from the same input. Woo et al. proposed such an attention implementation, shown in Figure 4 [28]. The input feature maps are first pooled along the channel axis by average-pooling and max-pooling operations. These two feature descriptors are then concatenated and convolved with a 7 × 7 kernel to produce the spatial attention weights. Finally, the new representation is generated by multiplying the attention weights with the original input. Another popular scheme of spatial self-attention is used in the Transformer; we will discuss it later.
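A minimal sketch of this spatial attention scheme is shown below; the 7 × 7 kernel follows the description above, while everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class CBAMSpatialAttention(nn.Module):
    """Spatial attention in the spirit of CBAM [28]: channel-wise avg/max pooling, 7x7 conv, sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                          # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)                          # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)                           # (B, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial weights (B, 1, H, W)
        return x * w                                               # reweight spatial positions
```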
Spatial cross-attention. We call it spatial cross-attention if Q, K, and V come from different feature maps. As illustrated in Figure 3b, Ates et al. designed spatial cross-attention to capture spatial dependencies across spatial tokens [29]. They first perform layer normalization and concatenation along the channel dimension. Different from the channel cross-attention module, the concatenated tokens are utilized as queries and keys, while each individual token is used as a value. Depth-wise projections are then applied to the queries, keys, and values before the attention computation.
Note that the main difference between self-attention and cross-attention is that the former, whose query, key and value come from the same sequence, performs attention computation within one sequence, while the latter computes attention between sequences.

2.3. Transformer

In 2017, Vaswani et al. proposed an innovative transduction model called the Transformer, which relies entirely on the self-attention mechanism [13]. The Transformer achieved remarkable success in natural language processing (NLP) and was later adapted for computer vision tasks. The Transformer enjoys a global receptive field and consistent resolution since convolutional and downsampling operations are discarded. The key component of the Transformer is scaled dot-product attention, which is shown in Figure 5. First, the input is projected to produce the Q, K, and V vectors. Then, weights are calculated from Q and the set of keys, followed by scaling and softmax operations. Finally, the output is generated by multiplying the weights with V. The scaled dot-product attention of the Transformer is essentially a form of self-attention since Q, K, and V come from the same input. Scaled dot-product attention is defined as follows:
$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$  (4)
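A direct sketch of Equation (4) is given below; the token count and embedding width are arbitrary assumptions used only to show the shapes involved.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Equation (4): softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of every query with every key
    return torch.softmax(scores, dim=-1) @ V        # weighted sum of the values

tokens = torch.randn(1, 196, 64)  # e.g., a batch with 196 patch tokens of width 64 (assumed sizes)
out = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention: Q, K, V share one source
```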

3. Attention Mechanism in MDE Methods

3.1. Overview of MDE Framework with Attention

Most MDE methods start from an encoder–decoder architecture in which encoders extract hierarchical features while decoders fuse multi-scale features to yield the depth map. Although the combinations of backbones and attention modules vary a lot, we try to identify the predominant architectures for attention-based MDE methods. On the encoder side, features are mainly extracted by CNN, Transformer, or hybrids of both; furthermore, the hybrids of CNN and Transformer can be divided into serial and parallel schemes. Therefore, according to the architecture of the encoder, we summarize four typical prototypes of CNN-based, Transformer-based, and CNN–Transformer-based MDE frameworks, which are shown in Figure 6, Figure 7, Figure 8 and Figure 9, respectively. Figure 6 shows a typical architecture of CNN-based MDE. ResNets [30] are often used as the backbone to extract features. In such structures, attention modules can be placed at any stage of the encoder and decoder, and even at the skip connections, to model long-range dependency by spatial or channel attention. In Figure 7, we outline a Transformer-based MDE method, in which images are first embedded into patch tokens and then fed into Transformer blocks. The attention mechanism resides in the Transformer blocks or can be located at any stage of the decoder and the skip connections. Figure 8 illustrates a serial scheme of the CNN–Transformer-based MDE framework. Unlike the Transformer-based methods, the input images are first projected into feature maps by convolutional operations, and the features are then fed into Transformer blocks. Figure 9 presents another option for a CNN–Transformer-based MDE framework, in which CNN and Transformer blocks are arranged in parallel. The RGB images are input into the two branches separately, and fusion blocks are utilized to integrate global and local information.
Notably, the components of the schemes mentioned above may not appear simultaneously in one framework; we simply report the possible combinations in one figure. In addition, we only describe supervised MDE methods in Figure 6, Figure 7, Figure 8 and Figure 9, and the inputs and outputs of each block in these figures are two-dimensional images. For self-supervised MDE, an extra camera pose estimator is needed to obtain the rotation and translation matrices, and the photometric error between the original image and the warped image is used to train the networks.
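As a rough illustration of this training signal, the snippet below sketches a per-pixel photometric error between the target frame and the source frame warped into the target view. The optional SSIM term and the 0.85 weighting follow a common convention in self-supervised MDE and are assumptions here, not details of any specific method reviewed.

```python
import torch

def photometric_error(target, warped, alpha=0.85, ssim=None):
    """Per-pixel photometric error between a target frame and a source frame warped
    into the target view (both (B, 3, H, W)); the SSIM term is optional here."""
    l1 = (target - warped).abs().mean(dim=1, keepdim=True)   # L1 term, (B, 1, H, W)
    if ssim is None:
        return l1                                            # L1-only fallback
    # Weighted combination of an SSIM-based dissimilarity and the L1 term (assumed weighting).
    return alpha * ssim(target, warped) + (1.0 - alpha) * l1
```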

3.2. CNN-Based Attention Methods

3.2.1. Supervised Training

In 2021, Gao et al. proposed a weakly supervised MDE method utilizing the commonly used U-net architecture [31]. Both the encoder and decoder are constructed from VGG blocks and channel self-attention modules to recalibrate the channel-wise information. A spatial self-attention module is inserted before the prediction head to capture long-range dependency. Different from the attention of CBAM and SE, they adopted adaptive weighted pooling to generate the attention maps, as shown in Figure 10.
Chen et al. designed a context aggregation module, which is inserted between the encoder and the decoder to model pixel-level similarity [32]. The context aggregation module consists of two branches: a spatial self-attention branch identical to that of the Transformer, and an image-level global average pooling branch to obtain awareness of categories. Aich et al. proposed a bidirectional attention mechanism to calculate attention maps across all the intermediate features [33], which can be seen as cross-attention among different groups of features. In their method, global context is implemented by average pooling with a large kernel. In addition to the stages of the encoder and decoder, attention modules can also be inserted into the skip connections of the widely used U-net architecture. Zhang et al. designed a CBAM and embedded it into the skip connections [34]. Lee et al. reported a patch-wise attention mechanism in which EdgeConv [36] is used to extract patch-wise edge features and patch-wise attention weights are produced based on the edge features [35]. In their method, the spatial self-attention module is placed after the last layer of the encoder. Attention maps are commonly calculated along the channel direction and over the spatial view. However, Jung et al. adopted a multi-view attention mechanism in which attention maps are created along the channel, height, and width directions [37]. Each attention block, consisting of channel attention and spatial attention, can be located at any stage of the decoder. The multi-view attention model is shown in Figure 11.
Edge features are salient parts of both RGB images and depth maps. Motivated by this fact, Naderi et al. proposed an adaptive geometric attention (AGA) module to measure the structural similarity of the features output by the encoder and decoder [38]. The AGA module consists of two parallel branches, namely, an SE channel attention model whose input is the concatenation of the encoder and decoder outputs, and a spatial attention model constructed from the cosine distance between the intermediate representations of the encoder and decoder at each spatial point. The AGA scheme is shown in Figure 12.
Lu et al. used the Fourier transform to separate the low-frequency and high-frequency information of RGB images into different frequency bands, and combined coarse depth maps with intermediate features to refine the depth maps [39]. Ren et al. proposed a novel MDE method using temporal attention in which cross-channel attention is performed along the temporal channel dimension [40]. The inputs of the temporal attention model are three consecutive frames warped by the results of optical flow estimation.

3.2.2. Self-Supervised Training

Recently, self-supervised approaches have become predominant in the MDE community. Self-supervised MDE methods exploit the geometric constraints contained in video sequences or stereo pairs as training loss, thus removing the expensive cost of sample labelling. To calculate the photometric error from video sequences, pose networks are often used to estimate camera extrinsics to obtain warped images. Zhang et al. reported a self-supervised method in which channel and spatial attention are embedded at the bottleneck of an encoder–decoder architecture [41]. Similar works include [42,43], in which channel and spatial attention blocks are inserted at the bottleneck in parallel or sequentially. The channel and spatial attention modules can also be inserted at any stage of the decoder [44]. Similarly, Lei et al. also added a channel attention block to each stage of the decoder of a U-net structure to adjust the weights of the channels input from the skip connections [45]. Johnston et al. proposed a discrete disparity prediction method in which a ResNet encoder followed by a spatial self-attention module is used to predict low-resolution disparity [46]. The low-resolution disparity is then processed by a multi-scale decoder to yield high-resolution disparity. A temporally consistent depth (TC-Depth) prediction method was proposed by Ruhkamp et al. [31], in which a spatial-temporal attention block is placed at the bottleneck to aggregate geometric and sequential consistencies. Unlike SE and CBAM, TC-Depth adopts the physical distance between points in 3D space to compute the spatial attention map, defined as follows:
$\mathrm{Attn}(i, j) = \exp\!\left(-\frac{\lVert P_i - P_j \rVert_2}{\sigma}\right)$  (5)
where $P_i$ and $P_j$ are points in 3D space back-projected from the depths $d_i$, $d_j$ and the pixel coordinates $C_i$ and $C_j$ of the 2D image, and $K$ denotes the camera intrinsics:
$P_i = K^{-1} d_i \cdot C_i, \quad P_j = K^{-1} d_j \cdot C_j$  (6)
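The sketch below illustrates a distance-based spatial attention map built from back-projected 3D points in the spirit of Equations (5) and (6); the tensor layouts, the identity intrinsics, the value of σ, and the toy image size are all assumptions for illustration, not the authors' implementation.

```python
import torch

def backproject(depth, coords_h, K_inv):
    """Equation (6): P = K^{-1} * d * C for homogeneous pixel coordinates C of shape (3, H*W)."""
    return K_inv @ (coords_h * depth.flatten())          # 3D points, shape (3, H*W)

def distance_attention(points, sigma=1.0):
    """Equation (5): Attn(i, j) = exp(-||P_i - P_j||_2 / sigma) for all point pairs."""
    dist = torch.cdist(points.T, points.T)               # pairwise Euclidean distances (H*W, H*W)
    return torch.exp(-dist / sigma)

# Toy example on a 4x5 depth map with identity intrinsics (illustrative assumptions).
H, W = 4, 5
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
coords_h = torch.stack([xs.flatten(), ys.flatten(), torch.ones(H * W)])   # (3, H*W)
depth = torch.rand(H, W) * 10
attn = distance_attention(backproject(depth, coords_h, torch.eye(3)))     # (20, 20) attention map
```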
Another contribution of this work is temporal attention, which uses consecutive frames as Q and K to calculate attention weights along the temporal axis. We categorize this mechanism as a kind of spatial cross-attention since Q and K come from different representations. Yan et al. proposed a channel-wise attention-based depth prediction method in which a structure perception module and a detail emphasis module are utilized to aggregate and recalibrate the channel features [47]. Bhattacharyya et al. proposed a lightweight network for MDE using stereo pairs [48]. In their scheme, the discriminator of a generative adversarial network (GAN) [49] is used to distinguish the warped image from the original image. The generator of the GAN is a U-net with attention blocks embedded in the decoder. The proposed attention block consists of global average pooling operations and a group of dilated convolutions with increasing receptive fields to aggregate multi-scale features and recalibrate the channel-wise information. The attention model of [48] is shown in Figure 13.
Song et al. designed a multi-level dual attention network (MLDA-Net) for self-supervised MDE [50]. The input image is first down-sampled to extract multi-level features. These features are then fused after processing by channel self-attention. Notably, unlike SE and CBAM, the proposed channel attention exploits a dot-product operation instead of global average pooling to yield the attention weights, as shown in Figure 14.
Atrous spatial pyramid pooling (ASPP) is often used in CNN to enlarge the receptive field. Xu et al. proposed a multi-scale spatial attention-guided block for self-supervised MDE [51]. The block consists of ASPP modules followed by spatial attention constructed from parallel spatial self-attention and cross-attention similar to that of CBAM. A combination of spatial attention based on dot-product operations with the channel attention of CBAM is used for self-supervised MDE in [52]; however, the order of spatial attention and channel attention is reversed compared with CBAM. A hard attention strategy is also used to weight the multi-scale depth maps in the calculation of the photometric loss. To balance the global average pooling and maximum pooling of CBAM, Li et al. proposed the normalized CBAM (NCBAM), which limits the input features to the range [−1, 1] [53]. NCBAM modules are incorporated into the skip connections of the depth prediction network, as well as the pose estimation network, to mitigate the effects of texture copy and depth drift. The cost volume created from the features of two consecutive frames contains geometric and photometric information and can thus be used for depth prediction. To recalibrate the cost volume and enlarge the receptive field, Hong et al. designed a voxel-based attention model via 3D convolutions and a recurrent attention mechanism via sequential 2D convolutions along the depth dimension [54]. A lightweight framework designed to run on various embedded devices is proposed in [55], in which channel attention blocks are inserted in the decoder stage. Wei et al. presented an unsupervised MDE method using stereo pairs [56]. In their framework, triaxial squeeze attention modules (TSAM) are embedded into the encoder. TSAM, shown in Figure 15, performs attention operations along three directions: vertical, horizontal, and channel.
Ling et al. proposed an unsupervised MDE approach using a multi-warp reconstruction strategy [57]. Regarding attention, they embedded channel and spatial attention operations into the encoder and decoder structures. Xiang et al. presented an unsupervised MDE method for autonomous driving in which channel and spatial attention blocks are inserted at each stage of the decoder [58]. Another contribution is a scale-awareness loss to determine the absolute scale information.

3.3. Transformer-Based Attention Methods

3.3.1. Supervised Training

The dense prediction Transformer (DPT) is the first MDE model that uses ViT as a backbone [10]. The features of the different encoder stages, extracted by Transformer blocks, are resampled and fused by convolutional operations, followed by a prediction head that outputs the depth map. Compared with the state-of-the-art CNN-based method, DPT achieves a remarkable 28% improvement in terms of accuracy. However, DPT needs a large amount of training data because convolutional operations are entirely removed from the encoder (DPT-base and DPT-large). An overview of the DPT model is shown in Figure 16.
DPT inspired a surge of investigations of the Transformer for MDE [15,59], as well as applications in downstream tasks [60]. Wu et al. proposed a rich global feature-guided network based on the Transformer [61], which highlights global information fusion in the decoder stage through large-kernel convolutional attention. Most existing MDE methods utilize a regression strategy to predict depth; however, BinsFormer [16] formulates depth prediction as a regression–classification problem by predicting the bin centers and the probability that a certain pixel belongs to certain bins. In their scheme, the Transformer is used to adaptively predict the bins, while any existing backbone can be used to extract features. We categorize BinsFormer as a Transformer-based method since its performance is better when the Transformer is used as the backbone. Depthformer also adopts a regression–classification strategy that uses a Transformer module to predict the bins and an encoder identical to that of DPT to extract features [17].
The Transformer suffers from heavy computation since the multi-head spatial attention operations are conducted over the holistic image. To alleviate the computational payload, the Swin Transformer limits the attention operations to specific windows [62]. Cheng et al. proposed an MDE method using the Swin Transformer as the backbone [63]; meanwhile, CBAMs are used in the decoder stage to fuse multi-scale features. Agarwal et al. also adopted the Swin Transformer to extract intermediate features [11]. In addition, the commonly used skip connection is replaced by a skip attention module, which fuses lower-resolution features of the decoder and higher-resolution features of the encoder in a cross-attention manner. In line with the goal of computational reduction, RA-Swin [64] was proposed for MDE. RA-Swin applies the Swin Transformer to extract hierarchical features, followed by RefineNet to fuse the multi-scale features in the decoder stage. An alternative strategy for mitigating the computational payload of the Transformer is to simplify its architecture. Ibrahem et al. reported a lightweight MDE framework in which the number of Transformer blocks is reduced from 12 to 6 or 4 [65]. Shu et al. proposed a real-time MDE method named SideRT using the Swin Transformer as the backbone [66]. The contribution of SideRT is that cross-attention between adjacent scales of feature maps is conducted to capture global information.

3.3.2. Self-Supervised Training

Yun et al. presented a ViT-based MDE method that combines self-supervised training with supervised training [67]. Unlike DPT, it uses non-local fusion blocks instead of common fusion blocks to gather long-range context in the decoder stage. Yang et al. proposed a self-supervised method that uses Transformers for both the depth network and the pose network [68]. The highlight of the method is that a simplified Transformer is used instead of the original architecture, i.e., (1) the learnable layer generating the V matrix is replaced by column-wise average pooling, and (2) the feature maps are first reduced in the spatial dimension before being fed to the attention module. In addition, the features of the depth network are introduced into the pose network in a spatial cross-attention manner. To reduce the complexity and parameter size, the pooling pyramid vision Transformer was proposed for MDE [69], in which spatial reduction and pooling operations are applied to the K, V, and Q matrices, respectively. Han et al. proposed Transformer-based depth estimation via self-supervised learning (TransDSSL) [70]. The highlights of TransDSSL are: (1) a pixel-wise attention module is used as a skip connection to supply details from the encoder stages, and (2) the estimated depth of the high-resolution layer is used to guide the training of the low-resolution feature maps. The pixel-wise attention module is shown in Figure 17.
Varma et al. constructed a self-supervised MDE method [71] by readapting DPT for the depth network and the decoder part of Monodepth2 [5] for the pose network. In their method, a data-efficient image Transformer (DeiT) [72] is used in the pose network to estimate the camera pose and camera intrinsics simultaneously.

3.4. Hybrid Attention Methods

3.4.1. Supervised Training

AdaBins is a typical hybrid method that incorporates the advantages of Transformer and CNN architectures [73]. AdaBins adopts a regression–classification scheme consisting of two stages. An EfficientNet [74] encoder–decoder is used in the first stage to generate decoded features, while a simplified Transformer structure named mViT is utilized to adaptively predict the bin widths and the pixel-wise probabilities over the bin centers. The decoded features from the encoder–decoder structure are fed into the mViT module, shown in Figure 18, to generate a range attention map by pixel-wise dot-product operations and to predict the bin widths by an MLP. The highlight of AdaBins lies in the fact that it builds global dependency over the decoded features by pixel-wise dot products; this global dependency is beneficial for regressing an accurate depth map.
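To give a sense of the regression–classification idea, the sketch below composes a depth map from adaptively predicted bin widths and per-pixel bin probabilities; the exact bin-center formula, the depth range, and the tensor shapes are assumptions in the spirit of this family of methods rather than the authors' implementation.

```python
import torch

def depth_from_bins(bin_widths, probs, d_min=1e-3, d_max=80.0):
    """Compose depth from normalized bin widths (B, K) (each row sums to 1) and per-pixel
    bin probabilities (B, K, H, W): expected value over assumed adaptive bin centers."""
    edges = d_min + (d_max - d_min) * torch.cumsum(bin_widths, dim=1)   # right bin edges (B, K)
    widths = (d_max - d_min) * bin_widths
    centers = edges - 0.5 * widths                                      # bin centers (B, K)
    # Expected depth per pixel: sum over bins of probability * center.
    return torch.einsum("bk,bkhw->bhw", centers, probs).unsqueeze(1)    # (B, 1, H, W)
```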
Compared with DPT-base and DPT-large, DPT-hybrid exploits ResNet50 instead of linear projection to create the tokens [10]. Thus, the backbone of DPT-hybrid can be seen as a hybrid encoder consisting of CNN and Transformer. The benefit of the hybrid encoder is that the convolutional operations balance depth accuracy and complexity. Hong et al. designed a supervised MDE method using 3D point clouds as supplementary information [75]. In their method, RGB images and point clouds are input into convolutional networks followed by a simple AdaBins module to produce the depth map. Unlike the serial mechanism of DPT-hybrid, DepthFormer [76] adopts a parallel scheme that consists of two encoder branches based on the Transformer and CNN, respectively. The features extracted by the Transformer are first enhanced by spatial self-attention. Then, the features from CNN and Transformer are fed into a cross-attention module to model the dependency between global and local information. Similar to DepthFormer, Manimaran et al. also exploited two separate encoders to extract global and local features [77]. The Transformer-based encoder uses 224 × 224 images as inputs, while the DenseNet-based encoder [78] uses 512 × 512 images as inputs. Notably, focal self-attention is employed to replace regular self-attention to reduce the computational complexity of DepthFormer. Huo et al. argued that the tokens of the Transformer should not be treated equally because most attention focuses on regions that are similar to the reference patches [79]. In light of this evidence, they proposed the token attention mechanism shown in Figure 19 to reweight the tokens. Features are first extracted by convolutional operations, token attention is then conducted using these features as inputs, and finally, a Transformer is utilized to enhance the features.

3.4.2. Self-Supervised Training

Tomar et al. presented a self-supervised depth prediction structure consisting of three encoders, i.e., one based on the Transformer to extract global information and two based on CNN to extract local features from images of different resolutions [19]. All the features of the three encoders are fused by a mask-guided multi-scale fusion block, followed by a Transformer-based decoder to generate the depth map. MonoFormer was proposed by Bae et al. to address the generalizability issue of MDE methods [20]. They found that Transformer-based MDE models outperform CNN-based models in terms of generalizability. Inspired by this observation, MonoFormer employs a hybrid encoder consisting of a CNN module followed by Transformers. Moreover, an attention connection module consisting of spatial attention and channel attention is employed to replace the traditional skip connection of the classical encoder–decoder structure. Zhao et al. proposed a self-supervised MDE method named MonoViT consisting of a lightweight pose network and a Transformer–CNN hybrid depth estimation network [21]. Unlike the parallel scheme of DepthFormer, they designed a joint CNN and Transformer layer in which three Transformer blocks and a convolutional block are arranged in parallel. Four joint CNN and Transformer layers are cascaded to form the backbone of the depth estimation network. Figure 20 illustrates the structure of the joint CNN and Transformer layer. In addition, MonoViT adopts spatial attention and channel attention in the decoder to fuse the multi-scale features.
Hwang et al. presented a cost-volume-based self-supervised MDE method [80]. The cost volume is first created by calculating the similarity between the features of the source and target images. Then, the cost volume is fed into an encoder consisting of CNN followed by Transformers to extract features. Finally, the features are decoded by fusion blocks consisting of spatial attention, channel attention, and upsampling operations. Zhang et al. proposed Lite-Mono, a lightweight, self-supervised hybrid model [18]. They investigated the effective combination of CNN and Transformer, designing a consecutive dilated convolutions module and a local–global feature interaction module; the former extracts rich local features, and the latter captures global characteristics via a self-attention mechanism. The features extracted by the encoder are then fed to the decoder to generate a high-resolution depth map through multi-scale fusion (Figure 21).

3.5. Summary of MDE Methods

We have presented the details of typical MDE methods published from 2020 to 2022. In this section, we further summarize these methods in Table 1, where each method is described in terms of category, training strategy, attention mechanism, and contributions. We hope this list provides related researchers with a clear picture of the attention mechanisms used in the MDE task.

4. Performance Comparison

4.1. Metrics

We adopt the metrics commonly used in the MDE community to compare the performance of the typical algorithms. These metrics are the root mean square error (RMSE), a traditional measure of regression error; the logarithmic RMSE (RMSElog), in which the logarithm makes the error relative, reducing the effect of large errors at long distances; the absolute relative error (AbsRel), which normalizes per-pixel errors, reducing the effect of large errors at long distances; the square relative error (SqRel), in which the square penalizes larger depth errors; and the accuracy rate metric $\delta_t$. They are defined as follows [1]:
RMSE: $\sqrt{\frac{1}{N}\sum_{i}^{N}\left(d_i - d_i^{g}\right)^2}$

RMSElog: $\sqrt{\frac{1}{N}\sum_{i}^{N}\left(\log d_i - \log d_i^{g}\right)^2}$

AbsRel: $\frac{1}{N}\sum_{i}^{N}\frac{\left|d_i - d_i^{g}\right|}{d_i^{g}}$

SqRel: $\frac{1}{N}\sum_{i}^{N}\frac{\left(d_i - d_i^{g}\right)^2}{d_i^{g}}$

$\delta_t$: % of $d_i$ s.t. $\max\!\left(\frac{d_i}{d_i^{g}}, \frac{d_i^{g}}{d_i}\right) < 1.25^{t}$
where $d_i$ and $d_i^{g}$ are the predicted depth and the ground truth of pixel $i$, respectively, and $N$ is the total number of pixels.
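For reference, the snippet below is a straightforward implementation of these metrics; the function name and the validity mask convention (ground truth greater than zero) are our own assumptions.

```python
import torch

def depth_metrics(pred, gt, t=1):
    """Standard MDE metrics over valid pixels: RMSE, RMSElog, AbsRel, SqRel, and delta_t."""
    valid = gt > 0                       # assumed convention: zero marks missing ground truth
    d, g = pred[valid], gt[valid]
    rmse = torch.sqrt(((d - g) ** 2).mean())
    rmse_log = torch.sqrt(((torch.log(d) - torch.log(g)) ** 2).mean())
    abs_rel = ((d - g).abs() / g).mean()
    sq_rel = (((d - g) ** 2) / g).mean()
    delta = (torch.max(d / g, g / d) < 1.25 ** t).float().mean()
    return {"RMSE": rmse, "RMSElog": rmse_log, "AbsRel": abs_rel,
            "SqRel": sq_rel, f"delta_{t}": delta}
```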

4.2. Comparison and Analysis

KITTI [81] is an outdoor dataset widely used for MDE algorithm validation. The dataset contains a total of 93,000 RGB-D training images with corresponding ground truth. These images were collected from the city of Karlsruhe, rural areas, and highways, and consist of five categories: “Road”, “City”, “Residential”, “Campus”, and “Person”. In this paper, we report and analyze the experimental results of the typical MDE algorithms on the KITTI dataset. Although self-supervised MDE methods have gained remarkable improvements recently, it is unfair to compare the performance of self-supervised methods with that of supervised methods. Therefore, we report the comparison results of supervised methods and self-supervised methods separately, in Table 2 and Table 3, respectively. The best results are in bold. Notably, not all of the methods mentioned above are included in Table 2 and Table 3, since some methods do not publish experimental results on the KITTI dataset.
Compared with the CNN-based approaches [32,33,34,35,38,39,40], Transformer-based methods generally perform better in MDE tasks [10,16,17,63,66]. The reasons are summarized as follows. Firstly, MDE is a dense prediction problem, which requires predicting the depth of every pixel in the image. However, CNN-based approaches apply extensive downsampling operations, resulting in a substantial, irreversible loss of pixel information on the encoder side. Owing to this limitation of CNN, considerable image features are missed, and thus the performance is greatly degraded. Secondly, CNN lacks the ability to capture long-distance dependencies in the image owing to its limited receptive field (3 × 3 or 5 × 5). Although atrous convolution enlarges the receptive field [82], it is still limited compared with the Transformer. Therefore, Transformer-based models are superior at handling dense prediction tasks such as MDE. Nevertheless, there are still individual Transformer-based methods [18,20,21,80,81] in Table 3 that perform worse than some CNN-based ones. This is because some of these models utilize lightweight Transformers, which pursue a dramatic reduction in parameters and computation at the inevitable cost of final performance; this does not conflict with our conclusion.
The comparisons suggest the following: (1) for the supervised methods, the Transformer-based method (PixelFormer [11]) achieves the best performance, while the hybrid method (DepthFormer [14]) yields comparable results; (2) for the self-supervised methods, the hybrid method (MonoViT [21]) outperforms the other methods by a large margin; (3) compared with the supervised methods, the self-supervised methods exhibit great potential since the performance gap is narrowing.

5. Challenges and Trends

Estimating depth from a single image remains a great challenge since the mapping between visual cues and depth is still not well understood. Although intensive investigations have been conducted in this field, there are still obstacles to be overcome, such as computation and memory costs, generalizability, prediction accuracy, construction of training datasets, ambiguous scale, etc. In this section, we focus on the challenges and trends of how the attention mechanism helps the MDE task improve performance.

5.1. Determining the Range of Context

Long-range dependency can help alleviate edge blur and object inconsistency issues in the MDE task. The attention operations of most existing methods are conducted over the whole image; however, is attention over the holistic image the best answer? Attention over the holistic image imposes a great computational payload, but it may not be the only option. For example, pixels within the same object usually have similar depths, and thus these pixels need a relatively small context. However, a larger context is needed when we try to distinguish objects occluded by other objects. Therefore, it is worth investigating how to determine a suitable context range.

5.2. Reducing Complexity of Multi-Head Self-Attention

The Transformer, based on multi-head self-attention (MHSA), has achieved great success in modeling sequence-to-sequence tasks such as those in NLP. To adapt the structure of the Transformer for the MDE task, images are segmented into patches and projected into a sequence of vectors. However, the MDE task is more demanding than NLP because the sequence length is longer. For conventional attention mechanisms, in which the outputs of different heads are integrated in a simply concatenative manner, there is clearly redundant information captured across attention heads. Adopting the same MHSA scheme as in NLP therefore leads to a great computational cost. If we could capture the information exclusive to each head while filtering out the generic information, it would greatly benefit the MDE task. Therefore, how to design an efficient MHSA remains to be investigated in the future.

5.3. Exploring the Potential of Temporal Attention

Video sequences are often used to estimate depth from a single image according to the geometric constraints of multi-view vision. There are similarities in intensity, shape, texture, edges, trajectories, and other properties among consecutive frames. These similarities imply visual cues for the MDE task. Temporal attention, which can improve the understanding and representation of dynamic information in videos, is an attention mechanism that focuses attention on a certain time instant or interval. It can be categorized into soft attention and hard attention, corresponding to whether the attention output distribution is soft or one-hot, respectively. How to model the affinity between consecutive images using temporal attention remains an open problem.

5.4. Developing Fast Implementation of Attention Algorithms

MDE has great potential in real-time applications such as robotic navigation, virtual reality, and autonomous driving. Although MDE models based on attention mechanisms have now achieved impressive performance [10,16,17,66,73,76], there are still no frameworks that can estimate depth maps with high accuracy and resolution in real time under low computational resources. Recently, some studies optimized the Transformer-based network architecture to drastically reduce the number of model parameters and the processing effort [18,20]; however, these models not only remain unable to run in real time on embedded devices, but also sacrifice prediction accuracy, which hinders the application of MDE. Therefore, there is a pressing need to accelerate attention operations while maintaining high estimation accuracy.

6. Conclusions

Monocular depth estimation plays an important role in downstream applications such as virtual reality, 3D reconstruction, and robotic navigation. The traditional CNN architecture is not well suited for MDE due to its local receptive field. To mitigate this limitation of CNN, various methods using attention mechanisms have been proposed to boost the performance of MDE. In this paper, we focus on the attention mechanism used in existing MDE methods. We summarize the papers published from 2020 to 2022 and try to explain why the attention mechanism can improve the performance of MDE. We first introduce the principles of the commonly used attention mechanisms. Then, the details of the literature are analyzed according to architecture and training strategy. Furthermore, we compare the performance of typical algorithms. Finally, we discuss the challenges and point out potential directions for the future. We hope our contributions can facilitate the work of related researchers.

Author Contributions

Conceptualization, Y.L. and X.W.; methodology, X.W.; software, H.F.; validation, X.W., Y.L. and H.F.; formal analysis, X.W.; investigation, Y.L.; resources, Y.L.; data curation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, H.F.; visualization, H.F.; supervision, X.W.; project administration, X.W.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant #62071006.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kerdvibulvech, C.; Dong, Z.Y. Roles of artificial intelligence and extended reality development in the post-COVID-19 Era. In Proceedings of the HCI International 2021-Late Breaking Papers: Multimodality, eXtended Reality, and Artificial Intelligence: 23rd HCI International Conference, HCII 2021, Virtual Event, 24–29 July 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 445–454. [Google Scholar]
  2. Kerdvibulvech, C. A Digital Human Emotion Modeling Application Using Metaverse Technology in the Post-COVID-19 Era. In Proceedings of the International Conference on Human-Computer Interaction, Copenhagen, Denmark, 23–28 July 2023; Springer Nature: Cham, Switzerland, 2023; pp. 480–489. [Google Scholar]
  3. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
  4. Wang, G.; Li, Y. Monocular depth estimation using synthetic data with domain-separated feature alignment. In Proceedings of the 2022 6th International Conference on Computer Science and Artificial Intelligence, Beijing China, 9–11 December 2022; pp. 100–105. [Google Scholar]
  5. Godard, C.; Mac Aodha, O.; Firman, M. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar]
  6. Wofk, D.; Ma, F.; Yang, T.J. Fastdepth: Fast monocular depth estimation on embedded systems. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 6101–6108. [Google Scholar]
  7. Zhou, T.; Brown, M.; Snavely, N. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858. [Google Scholar]
  8. Kavuran, G.; Gökhan, Ş.; Yeroğlu, C. COVID-19 and human development: An approach for classification of HDI with deep CNN. Biomed. Signal Process. Control. 2023, 81, 104499. [Google Scholar] [CrossRef] [PubMed]
  9. Hamad, Q.S.; Samma, H.; Suandi, S.A. Feature selection of pre-trained shallow CNN using the QLESCA optimizer: COVID-19 detection as a case study. Appl. Intell. 2023, 53, 18630–18652. [Google Scholar]
  10. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12179–12188. [Google Scholar]
  11. Agarwal, A.; Arora, C. Attention attention everywhere: Monocular depth prediction with skip attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 5861–5870. [Google Scholar]
  12. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  13. Vaswani, A.; Shazeer, N.; Parmar, N. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  15. Polasek, T.; Čadík, M.; Keller, Y. Vision UFormer: Long-range monocular absolute depth estimation. Comput. Graph. 2023, 111, 180–189. [Google Scholar] [CrossRef]
  16. Li, Z.; Wang, X.; Liu, X. Binsformer: Revisiting adaptive bins for monocular depth estimation. arXiv 2022, arXiv:2204.00987. [Google Scholar]
  17. Agarwal, A.; Arora, C. Depthformer: Multiscale vision transformer for monocular depth estimation with global local information fusion. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3873–3877. [Google Scholar]
  18. Zhang, N.; Nex, F.; Vosselman, G. Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18537–18546. [Google Scholar]
  19. Tomar, S.S.; Suin, M.; Rajagopalan, A.N. Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 308–326. [Google Scholar]
  20. Bae, J.; Moon, S.; Im, S. MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 187–196. [Google Scholar]
  21. Zhao, C.; Zhang, Y.; Poggi, M. Monovit: Self-supervised monocular depth estimation with a vision transformer. In Proceedings of the 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, 12–15 September 2022; pp. 668–678. [Google Scholar]
  22. Ming, Y.; Meng, X.; Fan, C. Deep learning for monocular depth estimation: A review. Neurocomputing 2021, 438, 14–33. [Google Scholar] [CrossRef]
  23. Dong, X.; Garratt, M.A.; Anavatti, S.G. Towards real-time monocular depth estimation for robotics: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16940–16961. [Google Scholar] [CrossRef]
  24. Bae, J.; Hwang, K.; Im, S. A Study on the Generality of Neural Network Structures for Monocular Depth Estimation. arXiv 2023, arXiv:2301.03169. [Google Scholar]
  25. Chaudhari, S.; Mithal, V.; Polatkan, G. An attentive survey of attention models. ACM Trans. Intell. Syst. Technol. (TIST) 2021, 12, 1–32. [Google Scholar] [CrossRef]
  26. Li, Y.; Lin, C.; Li, H. Unsupervised domain adaptation with self-attention for post-disaster building damage detection. Neurocomputing 2020, 415, 27–39. [Google Scholar] [CrossRef]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  28. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  29. Ates, G.C.; Mohan, P.; Celik, E. Dual Cross-Attention for Medical Image Segmentation. arXiv 2023, arXiv:2303.17696. [Google Scholar]
  30. He, K.; Zhang, X.; Ren, S. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  31. Ruhkamp, P.; Gao, D.; Chen, H. Attention meets geometry: Geometry guided spatial-temporal attention for consistent self-supervised monocular depth estimation. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 837–847. [Google Scholar]
  32. Chen, Y.; Zhao, H.; Hu, Z. Attention-based context aggregation network for monocular depth estimation. Int. J. Mach. Learn. Cybern. 2021, 12, 1583–1596. [Google Scholar] [CrossRef]
  33. Aich, S.; Vianney JM, U.; Islam, M.A. Bidirectional attention network for monocular depth estimation. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11746–11752. [Google Scholar]
  34. Zhang, X.; Abdelfattah, R.; Song, Y. Depth Monocular Estimation with Attention-based Encoder-Decoder Network from Single Image. In Proceedings of the 2022 IEEE 24th International Conference on High Performance Computing & Communications(HPCC), Chengdu, China, 18–20 December 2022; pp. 1795–1800. [Google Scholar]
  35. Lee, M.; Hwang, S.; Park, C. Edgeconv with attention module for monocular depth estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2858–2867. [Google Scholar]
  36. Wang, Y.; Sun, Y.; Liu, Z. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. (Tog) 2019, 38, 1–12. [Google Scholar] [CrossRef]
  37. Jung, G.; Yoon, S.M. Monocular depth estimation with multi-view attention autoencoder. Multimed. Tools Appl. 2022, 81, 33759–33770. [Google Scholar] [CrossRef]
  38. Naderi, T.; Sadovnik, A.; Hayward, J. Monocular depth estimation with adaptive geometric attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 944–954. [Google Scholar]
  39. Lu, Z.; Chen, Y. Pyramid frequency network with spatial attention residual refinement module for monocular depth estimation. J. Electron. Imaging 2022, 31, 023005. [Google Scholar] [CrossRef]
  40. Ren, H.; El-Khamy, M.; Lee, J. Deep Monocular Video Depth Estimation Using Temporal Attention. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1988–1992. [Google Scholar]
  41. Zhang, M.; Ye, X.; Fan, X. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention. Neurocomputing 2020, 379, 250–261. [Google Scholar] [CrossRef]
  42. Zhang, C.; Liu, J.; Han, C. Unsupervised learning of depth estimation based on attention model from monocular images. In Proceedings of the 2020 International Conference on Virtual Reality and Visualization (ICVRV), Recife, Brazil, 13–14 November 2020; pp. 191–195. [Google Scholar]
  43. Jiang, C.; Liu, H.; Li, L. Attention-based self-supervised learning monocular depth estimation with edge refinement. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 3218–3222. [Google Scholar]
  44. Zhang, Q.; Lin, D.; Ren, Z. Attention Mechanism-based Monocular Depth Estimation and Visual Odometry. In Proceedings of the 2021 IEEE International Conference on Real-Time Computing and Robotics (RCAR), Xining, China, 15–19 July 2021; pp. 951–956. [Google Scholar]
  45. Lei, Z.; Wang, Y.; Li, Z. Attention based multilayer feature fusion convolutional neural network for unsupervised monocular depth estimation. Neurocomputing 2021, 423, 343–352. [Google Scholar] [CrossRef]
  46. Johnston, A.; Carneiro, G. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4756–4765. [Google Scholar]
  47. Yan, J.; Zhao, H.; Bu, P. Channel-wise attention-based network for self-supervised monocular depth estimation. In Proceedings of the 2021 International Conference on 3D vision (3DV), London, UK, 1–3 December 2021; pp. 464–473. [Google Scholar]
  48. Bhattacharyya, S.; Shen, J.; Welch, S. Efficient unsupervised monocular depth estimation using attention guided generative adversarial network. J. Real-Time Image Process. 2021, 18, 1357–1368. [Google Scholar] [CrossRef]
  49. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  50. Song, X.; Li, W.; Zhou, D. MLDA-Net: Multi-level dual attention-based network for self-supervised monocular depth estimation. IEEE Trans. Image Process. 2021, 30, 4691–4705. [Google Scholar] [CrossRef] [PubMed]
  51. Xu, X.; Chen, Z.; Yin, F. Multi-scale spatial attention-guided monocular depth estimation with semantic enhancement. IEEE Trans. Image Process. 2021, 30, 8811–8822. [Google Scholar] [CrossRef] [PubMed]
  52. Fan, C.; Yin, Z.; Xu, F. Joint soft–hard attention for self-supervised monocular depth estimation. Sensors 2021, 21, 6956. [Google Scholar] [CrossRef] [PubMed]
  53. Li, Y.; Luo, F.; Xiao, C. Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module. Comput. Vis. Media 2022, 8, 631–647. [Google Scholar] [CrossRef]
  54. Hong, Z.; Wu, Q. Self-supervised monocular depth estimation via two mechanisms of attention-aware cost volume. Vis. Comput. 2022, 1–15. [Google Scholar] [CrossRef]
  55. Liu, S.; Tu, X.; Xu, C. Deep neural networks with attention mechanism for monocular depth estimation on embedded devices. Future Gener. Comput. Syst. 2022, 131, 137–150. [Google Scholar] [CrossRef]
  56. Wei, J.; Pan, S.; Gao, W. Triaxial squeeze attention module and mutual-exclusion loss based unsupervised monocular depth estimation. Neural Process. Lett. 2022, 54, 4375–4390. [Google Scholar] [CrossRef]
  57. Ling, C.; Zhang, X.; Chen, H. Unsupervised monocular depth estimation using attention and multi-warp reconstruction. IEEE Trans. Multimed. 2021, 24, 2938–2949. [Google Scholar] [CrossRef]
  58. Xiang, J.; Wang, Y.; An, L. Visual attention-based self-supervised absolute depth estimation using geometric priors in autonomous driving. IEEE Robot. Autom. Lett. 2022, 7, 11998–12005. [Google Scholar] [CrossRef]
  59. Gupta, A.; Prince, A.A.; Fredo, A.R.J. Transformer-based Models for Supervised Monocular Depth Estimation. In Proceedings of the 2022 International Conference on Intelligent Controller and Computing for Smart Power (ICICCSP), Hyderabad, India, 21–23 July 2022; pp. 1–5. [Google Scholar]
  60. Françani, A.O.; Maximo, M.R.O.A. Dense Prediction Transformer for Scale Estimation in Monocular Visual Odometry. In Proceedings of the 2022 Latin American Robotics Symposium (LARS), 2022 Brazilian Symposium on Robotics (SBR), and 2022 Workshop on Robotics in Education (WRE), São Paulo, Brazil, 18–21 October 2022; pp. 1–6. [Google Scholar]
  61. Wu, B.; Wang, Y. Rich global feature guided network for monocular depth estimation. Image Vis. Comput. 2022, 125, 104520. [Google Scholar] [CrossRef]
  62. Liu, Z.; Lin, Y.; Cao, Y. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  63. Cheng, Z.; Zhang, Y.; Tang, C. Swin-Depth: Using transformers and multi-scale fusion for monocular-based depth estimation. IEEE Sens. J. 2021, 21, 26912–26920. [Google Scholar] [CrossRef]
  64. Chen, M.; Liu, J.; Zhang, Y. RA-Swin: A RefineNet Based Adaptive Model Using Swin Transformer for Monocular Depth Estimation. In Proceedings of the 2022 8th International Conference on Virtual Reality (ICVR), Nanjing, China, 26–28 May 2022; pp. 270–279. [Google Scholar]
  65. Ibrahem, H.; Salem, A.; Kang, H.S. RT-ViT: Real-time monocular depth estimation using lightweight vision transformers. Sensors 2022, 22, 3849. [Google Scholar] [CrossRef] [PubMed]
  66. Shu, C.; Chen, Z.; Chen, L. SideRT: A real-time pure transformer architecture for single image depth estimation. arXiv 2022, arXiv:2204.13892. [Google Scholar]
  67. Yun, I.; Lee, H.J.; Rhee, C.E. Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; Volume 36, pp. 3224–3233. [Google Scholar]
  68. Yang, J.; An, L.; Dixit, A. Depth estimation with simplified transformer. arXiv 2022, arXiv:2204.13791. [Google Scholar]
  69. Zhang, Q.; Wei, C.; Li, Q. Pooling Pyramid Vision Transformer for Unsupervised Monocular Depth Estimation. In Proceedings of the 2022 IEEE International Conference on Smart Internet of Things (SmartIoT), Xining, China, 19–21 August 2022; pp. 100–107. [Google Scholar]
  70. Han, D.; Shin, J.; Kim, N. TransDSSL: Transformer based depth estimation via self-supervised learning. IEEE Robot. Autom. Lett. 2022, 7, 10969–10976. [Google Scholar] [CrossRef]
  71. Varma, A.; Chawla, H.; Zonooz, B. Transformers in self-supervised monocular depth estimation with unknown camera intrinsics. arXiv 2022, arXiv:2202.03131. [Google Scholar]
  72. Touvron, H.; Cord, M.; Douze, M. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  73. Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4009–4018. [Google Scholar]
  74. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  75. Hong, Y.; Liu, X.; Dai, H. PCTNet: 3D Point Cloud and Transformer Network for Monocular Depth Estimation. In Proceedings of the 2022 10th International Conference on Information and Education Technology (ICIET), Matsue, Japan, 9–11 April 2022; pp. 415–419. [Google Scholar]
  76. Li, Z.; Chen, Z.; Liu, X. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. arXiv 2022, arXiv:2203.14211. [Google Scholar]
  77. Manimaran, G.; Swaminathan, J. Focal-WNet: An Architecture Unifying Convolution and Attention for Depth Estimation. In Proceedings of the 2022 IEEE 7th International Conference for Convergence in Technology (I2CT), Pune, India, 7–9 April 2022; pp. 1–7. [Google Scholar]
  78. Huang, G.; Liu, Z.; Van Der Maaten, L. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  79. Huo, Z.; Chen, Y.; Wei, J. Transformer-Based Monocular Depth Estimation Using Token Attention. SSRN 2022. [Google Scholar] [CrossRef]
  80. Hwang, S.J.; Park, S.J.; Baek, J.H. Self-supervised monocular depth estimation using hybrid transformer encoder. IEEE Sens. J. 2022, 22, 18762–18770. [Google Scholar] [CrossRef]
  81. Geiger, A.; Lenz, P.; Stiller, C. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  82. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
Figure 1. Squeeze-and-excitation block (image courtesy of Hu et al. [27]).
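To make the operation in Figure 1 concrete, the snippet below gives a minimal PyTorch-style sketch of a squeeze-and-excitation block; the class and variable names are our own illustrative choices, and the reduction ratio r = 16 follows the default reported by Hu et al. [27].

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: squeeze by global average pooling,
    excite with a two-layer MLP, then rescale each channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)             # global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),     # bottleneck FC layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),     # restore channel dimension
            nn.Sigmoid(),                                   # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                      # (B, C) channel descriptor
        w = self.excite(w).view(b, c, 1, 1)
        return x * w                                        # channel-wise recalibration
```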
Figure 2. Scheme of the channel self-attention of CBAM (image courtesy of Woo et al. [28]).
Figure 3. (a) Scheme of channel cross-attention; (b) scheme of spatial cross-attention (image courtesy of Ates et al. [29]).
Figure 4. Scheme of the spatial self-attention of CBAM (image courtesy of Woo et al. [28]).
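Similarly, the channel self-attention of Figure 2 and the spatial self-attention of Figure 4 can be sketched in a few lines. The snippet below is only an illustrative PyTorch-style reading of CBAM, assuming the default settings of Woo et al. [28] (reduction ratio 16 and a 7 × 7 convolution); in CBAM the two modules are applied sequentially, channel attention first and spatial attention second.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: shared MLP over average- and max-pooled descriptors."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                  # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                   # max-pooled descriptor
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w                                        # channel-refined features

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: 7x7 convolution over pooled channel statistics."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)                   # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)                    # channel-wise max map
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                                     # spatially refined features
```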
Figure 5. Scaled dot-product attention mechanism of the Transformer (image courtesy of Vaswani et al. [13]).
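In equation form, the scaled dot-product attention of Figure 5 is the standard formulation of Vaswani et al. [13], where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```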
Figure 6. Possible combinations of CNN-based supervised MDEs.
Figure 7. Possible combinations of Transformer-based supervised MDEs.
Figure 8. Serial scheme of CNN–Transformer-based supervised MDE.
Figure 9. Parallel scheme of CNN–Transformer-based supervised MDE.
Figure 10. Channel attention and spatial attention used in [31].
Figure 11. Channel and spatial attentions across three directions (image courtesy of Jung et al. [37]).
Figure 12. AGA module used in [38] (image courtesy of Naderi et al. [38]).
Figure 13. Attention module used in [48] (image courtesy of Bhattacharyya et al. [48]).
Figure 14. Channel attention module of MLDA-Net (image courtesy of Song et al. [50]).
Figure 15. Structure of TSAM (image courtesy of Wei et al. [56]).
Figure 16. Overview of the DPT architecture (image courtesy of Ranftl et al. [10]).
Figure 17. Overview of the pixel-wise attention module. The two inputs are the intermediate feature maps of the encoder and the decoder (image courtesy of Han et al. [70]).
Figure 18. Overview of the attention module used in AdaBins (image courtesy of Bhat et al. [73]).
Figure 19. Overview of the token attention module used in [79] (image courtesy of Huo et al. [79]).
Figure 20. Joint CNN and Transformer layer used in [21] (image courtesy of Zhao et al. [21]).
Figure 21. The consecutive dilated convolutions (CDC) module and local–global features interaction (LGFI) module of Lite-Mono (image courtesy of Zhang et al. [18]).
Table 1. Summary of attention-based MDE methods from 2020 to 2022.

| Models | Strategy | Methods | Attention Mechanism | Contributions |
| CNN-based | Supervised | ANUW [31], 2021 | CSA, SSA | U-net with channel and spatial attentions |
| | | ACAN [32], 2021 | SSA | Context aggregation module |
| | | BANet [33], 2021 | SCA | Bidirectional attention |
| | | Zhang et al. [34], 2022 | CSA, SSA | Inserting attentions into skip connections |
| | | Lee et al. [35], 2022 | SSA | EdgeConv attention module |
| | | MVAA [37], 2022 | CSA, SSA | Multi-view attention model |
| | | Naderi et al. [38], 2022 | CSA, SCA | Adaptive geometric attention |
| | | PFN [39], 2022 | SSA | Spatial attention residual refinement |
| | | Ren et al. [40], 2020 | CCA | Temporal attention |
| | Self-supervised | Zhang et al. [41], 2020 | CSA, SSA | Channel and spatial attentions |
| | | Zhang et al. [42], 2020 | SSA | Spatial attention |
| | | Jiang et al. [43], 2021 | SSA, CSA | Channel and spatial attentions |
| | | Zhang et al. [44], 2021 | CSA, SSA | Channel and spatial attentions |
| | | AgFU-Net [45], 2021 | CSA | Attention of skip connection |
| | | Johnston et al. [46], 2020 | SSA | Prediction of discrete disparity with spatial attention |
| | | TC-Depth [31], 2021 | SSA, SCA | Temporal attention |
| | | CADepth-Net [47], 2021 | CSA | Channel attention |
| | | Bhattacharyya [48], 2021 | CSA | Channel attention at multi-scale |
| | | MLDA-Net [50], 2021 | CSA | Channel attention using dot-product |
| | | Xu et al. [51], 2021 | SSA, SCA | Combining ASPP with spatial attention |
| | | SHdepth [52], 2021 | SSA, CSA | Channel and spatial attentions |
| | | Li et al. [53], 2022 | CSA, SSA | Normalized CBAM |
| | | Hong et al. [54], 2022 | SSA | 3D spatial attention |
| | | EDNet [55], 2022 | CSA | Channel attention |
| | | Wei et al. [56], 2022 | CSA, SSA | Triaxial squeeze attention |
| | | Ling et al. [57], 2022 | CSA, SSA | Channel and spatial attentions |
| | | VADepth [58], 2022 | CSA, SSA | Channel and spatial attentions |
| Transformer-based | Supervised | DPT [10], 2021 | SSA | Transformers as backbone |
| | | Gupta et al. [59], 2022 | SSA | Transformers as backbone |
| | | ViUT [15], 2022 | SSA | Transformers as backbone |
| | | DPT-VO [60], 2022 | SSA | DPT for scale estimation |
| | | RGFN [61], 2022 | SSA | Large kernel convolution attention |
| | | BinsFormer [16], 2022 | SSA, SCA | Using Transformers to predict bins center |
| | | DepthFormer [17], 2022 | SSA | Using Transformers to predict bins center |
| | | Swin-Depth [63], 2021 | CSA, SSA | SW-Transformers as backbone |
| | | PixelFormer [11], 2022 | CCA | Skip attention module |
| | | RA-Swin [64], 2022 | SSA | SW-Transformers as backbone |
| | | RT-ViT [65], 2022 | SSA | Lightweight framework |
| | | SideRT [66], 2022 | SSA, SCA | Cross-scale attention |
| | Self-supervised | Yun et al. [67], 2021 | SSA | Non-local fusion block |
| | | DEST [68], 2022 | SSA, SCA | Simplified Transformers for both depth and pose net |
| | | PPViT [69], 2022 | SSA | Spatial reduction before attention |
| | | TransDSSL [70], 2022 | SSA | Pixel-wise skip attention; self-distillation loss |
| | | MT-SfMLearner [71], 2022 | SSA | Estimate camera intrinsics |
| Hybrid | Supervised | AdaBins [73], 2020 | SSA | Predict pixel probability using attention |
| | | DPT-hybrid [10], 2021 | SSA | Hybrid encoder as backbone |
| | | PCTNet [75], 2022 | SSA | Use simple AdaBins module as decoder |
| | | DepthFormer [76], 2022 | SSA, SCA | Fuse features of Transformer and CNN |
| | | Focal-WNet [77], 2022 | SSA | Fuse features of Transformer and CNN |
| | | Huo et al. [79], 2022 | SSA | Token attention |
| | Self-supervised | Tomar et al. [19], 2022 | SSA | Fuse features of Transformer and CNN |
| | | MonoFormer [20], 2022 | SSA, CSA | Attention connection module |
| | | MonoViT [21], 2022 | SSA, CSA | Joint CNN and Transformer layer |
| | | Hwang et al. [80], 2022 | SSA, CSA | Fuse features of Transformer and CNN |
| | | Lite-Mono [18], 2022 | SSA, CSA | Lightweight framework with attention mechanism |
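As a reading aid for Tables 2 and 3, the reported measures follow the error and accuracy metrics commonly used for KITTI evaluation, where \(\hat{d}_i\) and \(d_i\) denote the predicted and ground-truth depths of pixel i and N is the number of valid pixels:

```latex
\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert \hat{d}_i - d_i \rvert}{d_i}, \qquad
\mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{(\hat{d}_i - d_i)^2}{d_i},

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{d}_i - d_i)^2}, \qquad
\mathrm{RMSE_{log}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log \hat{d}_i - \log d_i\right)^2},

\delta_j = \frac{1}{N}\left|\left\{\, i : \max\!\left(\frac{\hat{d}_i}{d_i}, \frac{d_i}{\hat{d}_i}\right) < 1.25^{\,j} \right\}\right|, \quad j \in \{1, 2, 3\}.
```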
Table 2. Comparison of supervised MDE algorithms on the KITTI dataset.

| Models | Methods | AbsRel↓ | SqRel↓ | RMSE↓ | RMSElog↓ | δ₁↑ | δ₂↑ | δ₃↑ |
| CNN-based | ACAN [32], 2021 | 0.075 | - | 3.509 | 0.118 | 0.930 | 0.985 | 0.996 |
| | BANet [33], 2021 | 0.083 | 0.181 | 3.300 | - | 0.938 | 0.988 | 0.997 |
| | Zhang et al. [34], 2022 | 0.061 | 0.297 | 2.548 | - | 0.947 | 0.989 | 0.996 |
| | Lee et al. [35], 2022 | 0.065 | 0.261 | 2.925 | 0.107 | 0.947 | 0.992 | 0.998 |
| | Naderi et al. [38], 2022 | 0.070 | - | 3.223 | 0.113 | 0.944 | 0.991 | 0.998 |
| | PFN [39], 2022 | 0.069 | 0.302 | 2.652 | 0.112 | 0.953 | 0.989 | 0.995 |
| | Ren et al. [40], 2020 | 0.038 | 0.209 | 4.771 | - | 0.952 | 0.993 | 0.998 |
| Transformer-based | DPT [10], 2021 | 0.062 | - | 2.573 | 0.092 | 0.959 | 0.995 | 0.999 |
| | Swin-Depth [63], 2021 | 0.064 | 0.232 | 2.643 | 0.097 | 0.957 | 0.994 | 0.999 |
| | PixelFormer [11], 2022 | 0.051 | 0.149 | 2.081 | 0.077 | 0.976 | 0.997 | 0.999 |
| | BinsFormer [16], 2022 | 0.052 | 0.151 | 2.098 | 0.079 | 0.974 | 0.997 | 0.999 |
| | DepthFormer [17], 2022 | 0.058 | - | 2.285 | - | 0.967 | 0.996 | 0.999 |
| | RGFN [61], 2022 | 0.057 | 0.182 | 2.250 | 0.086 | 0.968 | 0.996 | 0.999 |
| | SideRT [66], 2022 | 0.054 | 0.173 | 2.249 | 0.082 | 0.972 | 0.997 | 0.999 |
| Hybrid | AdaBins [73], 2020 | 0.058 | 0.190 | 2.360 | 0.088 | 0.964 | 0.995 | 0.999 |
| | DPT-hybrid [10], 2021 | 0.062 | - | 2.573 | 0.092 | 0.959 | 0.995 | 0.999 |
| | DepthFormer [76], 2022 | 0.052 | 0.158 | 2.143 | 0.079 | 0.975 | 0.997 | 0.999 |
| | Focal-WNet [77], 2022 | 0.082 | 0.355 | 3.076 | 0.120 | 0.926 | 0.986 | 0.997 |
Table 3. Comparison of self-supervised MDE algorithms on the KITTI dataset.

| Models | Methods | AbsRel↓ | SqRel↓ | RMSE↓ | RMSElog↓ | δ₁↑ | δ₂↑ | δ₃↑ |
| CNN-based | Zhang et al. [41], 2020 | 0.135 | 0.879 | 4.051 | 0.195 | 0.822 | 0.943 | 0.981 |
| | Zhang et al. [42], 2020 | 0.173 | 1.153 | 4.979 | 0.249 | 0.752 | 0.916 | 0.968 |
| | Jiang et al. [43], 2021 | 0.127 | 0.913 | 4.970 | 0.202 | 0.846 | 0.953 | 0.980 |
| | AgFU-Net [45], 2021 | 0.119 | 0.922 | 5.033 | 0.211 | 0.851 | 0.947 | 0.977 |
| | Johnston et al. [46], 2020 | 0.106 | 0.861 | 4.699 | 0.185 | 0.889 | 0.962 | 0.982 |
| | TC-Depth [31], 2021 | 0.082 | 0.667 | 4.104 | - | 0.921 | - | 0.997 |
| | CADepth-Net [47], 2021 | 0.096 | 0.694 | 4.264 | 0.173 | 0.908 | 0.968 | 0.984 |
| | Bhattacharyya [48], 2021 | 0.120 | 0.889 | 4.329 | 0.192 | 0.865 | 0.943 | 0.989 |
| | MLDA-Net [50], 2021 | 0.097 | 0.658 | 4.278 | 0.178 | 0.889 | 0.965 | 0.984 |
| | Xu et al. [51], 2021 | 0.125 | 1.553 | 5.844 | 0.202 | 0.865 | 0.946 | 0.977 |
| | SHdepth [52], 2021 | 0.108 | 0.812 | 4.634 | 0.185 | 0.887 | 0.962 | 0.982 |
| | Li et al. [53], 2022 | 0.098 | 0.810 | 4.672 | 0.177 | 0.890 | 0.964 | 0.983 |
| | Hong et al. [54], 2022 | 0.096 | 0.742 | 4.450 | 0.176 | 0.901 | 0.965 | 0.983 |
| | EDNet [55], 2022 | 0.095 | - | 4.311 | - | 0.891 | 0.965 | 0.986 |
| | Wei et al. [56], 2022 | 0.138 | 1.319 | 5.499 | 0.231 | 0.802 | 0.924 | 0.966 |
| | Ling et al. [57], 2022 | 0.121 | 0.971 | 5.206 | 0.214 | 0.843 | 0.944 | 0.975 |
| | VADepth [58], 2022 | 0.109 | 0.785 | 4.624 | 0.190 | 0.875 | 0.960 | 0.982 |
| Transformer-based | DEST [68], 2022 | 0.095 | 0.752 | 4.207 | 0.165 | - | - | - |
| | PPViT [69], 2022 | 0.120 | 0.997 | 5.033 | 0.217 | 0.859 | 0.948 | 0.977 |
| | TransDSSL [70], 2022 | 0.095 | 0.711 | 4.321 | 0.172 | 0.906 | 0.967 | 0.984 |
| | MT-SfMLearner [71], 2022 | 0.104 | 0.799 | 4.547 | 0.181 | 0.893 | 0.963 | 0.982 |
| Hybrid | Tomar et al. [19], 2022 | 0.112 | 0.750 | 4.528 | 0.187 | 0.881 | 0.963 | 0.983 |
| | MonoFormer [20], 2022 | 0.108 | 0.806 | 4.594 | 0.184 | 0.884 | 0.963 | 0.983 |
| | MonoViT [21], 2022 | 0.067 | 0.389 | 3.108 | 0.115 | 0.950 | 0.992 | 0.998 |
| | Hwang et al. [80], 2022 | 0.095 | 0.696 | 4.317 | 0.174 | 0.902 | 0.965 | 0.983 |
| | Lite-Mono [18], 2022 | 0.097 | 0.710 | 4.309 | 0.174 | 0.905 | 0.967 | 0.984 |