4.5.2. Comparison of Stereo Matching Models
Comparisons with depth estimation:
Table 3 compares the depth values estimated by our model with those of current state-of-the-art monocular depth estimation methods on the test set of the KITTI benchmark. The first two rows of Table 3 report the depth results of the supervision labels, i.e., the output of the self-supervised stereo matching model given the original stereo pairs as input. Although these labels were produced by a self-supervised model, their depth results were better than those of the supervised depth estimation method on most metrics. From the estimation metrics in Table 3, we can see that under the same training conditions, our method outperforms the other methods on most metrics, and the results after post-processing perform better still. For the depth evaluation cap of 80 m, our method before post-processing was only slightly inferior to Godard et al. [32] on the RMSE and SRD metrics, by about 0.016 and 0.131, respectively; on all other metrics it was better than the other methods trained under the same conditions. After post-processing, our method surpassed every other method on all metrics under the same training conditions; in particular, on RMSE(log) and ARD it improved on the current state-of-the-art method by 0.019 and 0.008, respectively. When the predicted depth was capped at 50 m, our method before post-processing was only slightly inferior, by about 0.002 m, to Garg et al. [25] on the RMSE metric under the same training mode. After post-processing, RMSE decreased by 0.017, making our method on average 0.001 m more accurate than that of Garg et al. [25].
Table 3 also reports the depth estimation results of the methods [28,32] that added extra temporal video sequences in the training stage. Despite the additional training data, our method still outperformed Zhan et al. [28] and exceeded Godard et al. [32] on four metrics at the 80 m depth cap. The comparison also shows that a gap remains between our method and the state-of-the-art supervised depth estimation methods [11]. In conclusion, under the same training conditions, our method was superior to the current state-of-the-art self-supervised monocular depth estimation methods.
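The metrics discussed above (ARD, SRD, RMSE, RMSE(log), with 50 m and 80 m caps) are the standard monocular-depth evaluation measures. A minimal NumPy sketch of how they are typically computed, assuming per-pixel depth arrays in metres (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def depth_metrics(gt, pred, cap=80.0):
    """Standard depth-estimation metrics with an evaluation cap.

    gt, pred: 1-D arrays of ground-truth and predicted depths in metres.
    cap: maximum evaluated depth (e.g., 50 m or 80 m, as in Table 3).
    """
    # Evaluate only valid ground-truth pixels within the cap.
    mask = (gt > 0) & (gt <= cap)
    gt = gt[mask]
    pred = np.clip(pred[mask], 1e-3, cap)  # clamp predictions to a sane range

    ard = np.mean(np.abs(gt - pred) / gt)          # absolute relative difference
    srd = np.mean((gt - pred) ** 2 / gt)           # squared relative difference
    rmse = np.sqrt(np.mean((gt - pred) ** 2))      # root mean squared error
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return {"ARD": ard, "SRD": srd, "RMSE": rmse, "RMSE(log)": rmse_log}
```

Lower values are better for all four metrics, which is why the small margins quoted above (e.g., 0.016 on RMSE) directly indicate accuracy differences.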
Figure 5 shows a qualitative comparison between our semi-supervised method and current mainstream monocular depth estimation methods by visualizing the output disparity maps. Perhaps unsurprisingly, the ground-truth disparities obtained by the 3D scanner provide the best visual reference, but the recorded pixel points are sparse. As can be seen from Figure 5, the methods of [12,25,32] can obtain a good depth estimation map for a single view of the scene, but our method presents the details of the depth map more clearly and smoothly.
Different Training Modes of the Stereo Matching Network: From Table 5, we can see that there were some errors between the synthesized right view and the original right view, so it may not be appropriate to directly reuse the weights of a stereo matching model trained on the original stereo pairs as the weights of our stereo matching model. To determine which training mode is most suitable for our stereo matching model, we conducted three experiments, denoted SSD, OOD, and SOD. In SSD, the original left view and the synthesized right view serve as both the self-supervised input data and the supervision labels; in OOD, the original stereo pairs serve as both the input data and the supervision labels; in SOD, the synthesized right view and the original left view serve as the input data, while the original stereo pairs serve as the supervision labels. In all three cases, the input data in the test phase were the combination of the reconstructed right view and the original left view. As can be seen from Table 6, regardless of whether the evaluation cap was 50 m or 80 m, SOD was better than the other two training modes on most estimation metrics, while SSD was the worst, scoring even lower than monocular self-supervised depth estimation methods. At the 80 m depth cap, OOD was only slightly worse than SOD, by 0.003, on the SRD metric. Although OOD achieved good depth estimation on the original stereo pairs, it never saw the reconstructed right view during training, so its generalization to the reconstructed right view was poor. In SSD, the supervision labels were themselves reconstructed right views containing errors, which caused large deviations in depth estimation for real scenes. When the depth cap was 50 m, SOD was much better than OOD on all estimation metrics.
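The three training modes differ only in how the input data and supervision labels are paired. A small sketch of that pairing logic, assuming the views are available as arrays or tensors (the function and argument names here are illustrative, not from the paper's code):

```python
def make_training_pair(left, right, right_syn, mode):
    """Assemble (input, supervision) pairs for the three training modes.

    left, right: the original stereo pair.
    right_syn:   the right view synthesized (reconstructed) from the left view.
    Returns a tuple (input_pair, label_pair).
    """
    if mode == "SSD":
        # Original left + synthesized right as both input and label.
        return (left, right_syn), (left, right_syn)
    if mode == "OOD":
        # Original stereo pair as both input and label.
        return (left, right), (left, right)
    if mode == "SOD":
        # Synthesized right + original left as input,
        # original stereo pair as supervision label.
        return (left, right_syn), (left, right)
    raise ValueError(f"unknown training mode: {mode}")
```

The sketch makes the asymmetry of SOD explicit: it is the only mode whose input distribution matches the test phase (reconstructed right view) while its labels remain error-free (original pair), which is consistent with its stronger results in Table 6.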
Figure 6 shows qualitative depth estimation results of the stereo matching model under the three training modes. It is difficult to judge which training mode is better from the images alone, but we can see intuitively that SOD achieves clearer segmentation at the edges of scene objects. We can also see from Figure 6 that, because the reconstructed right view was used as the self-supervision label in SSD mode, scene objects in the depth map have blurred boundaries at their edges. OOD mode never saw the features of the synthesized right view during training, so the generalization of the model is greatly reduced. In SOD mode, we concatenate the reconstructed right view and the original left view as the input data and use the original stereo pair as the supervision label. This model both extracts input features from the reconstructed right view and is supervised by the original scene, and it produces the best depth estimation results.
Comparison of Different Supervision Parameters: As can be seen from Table 3, the depth values predicted from the original stereo pairs by the stereo matching network had a certain error compared with the ground-truth depth data. Therefore, using these disparity values directly as supervision labels to train our model would introduce larger errors and reduce the generalization of the stereo matching network. To let the supervision loss play the most appropriate optimization role in the model, we used a weighting parameter to adjust the proportion of this cost in the loss function and thereby constrain its effect on the model.
Table 2 reports the depth estimation metrics of the stereo matching network trained with different values of this parameter. We restricted the supervision loss parameter to the range [0, 1] and selected eight values in this range to train our model. We can see from Table 2 that when only the supervision loss was used to optimize the model, its generalization ability was poor due to the errors in the supervision labels. As the parameter decreased toward 0.1, the depth estimation results gradually improved, while below 0.1 the results gradually degraded as the parameter decreased further. Like a parabola, the model performed best on all metrics and setups when the parameter was 0.1. Therefore, in this paper, we set the supervision loss parameter to 0.1.
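One plausible form of this weighting, assuming the convention that a value of 1 uses only the supervision term and smaller values shift weight toward the self-supervised photometric (warping) term; the exact form and symbols in the paper's loss may differ:

```python
def total_loss(photometric_loss, supervision_loss, lam=0.1):
    """Convex combination of the two training losses.

    lam = 1.0 -> only the disparity-supervision term (labels with errors);
    lam = 0.0 -> only the self-supervised photometric term;
    lam = 0.1 -> the setting reported as best in Table 2.
    """
    return (1.0 - lam) * photometric_loss + lam * supervision_loss
```

Under this convention, the parabola-like behavior described above corresponds to the supervision term being useful as a weak constraint (small lam) but harmful when it dominates, because its labels contain errors.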
We also made a qualitative comparison of the different loss parameter values by visualizing the output disparity maps. As we can see from Figure 7, when only the supervised loss was used to optimize the model, the depth estimation obtained through the warping function was very poor. When the parameter was much greater or much less than 0.1, the visualized disparity map obviously deviated from the original view. For values close to 0.1, the visualized disparity maps were similar to that obtained at 0.1; however, the quantitative results in Table 2 and the qualitative results in Figure 7 show that the disparity estimation with the parameter set to 0.1 was superior in accuracy, details, and smoothness.