ED2IF2-Net: Learning Disentangled Deformed Implicit Fields and Enhanced Displacement Fields from Single Images Using Pyramid Vision Transformer
Abstract
1. Introduction
- A Pyramid-Vision-Transformer-based ED2IF2-Net is proposed for end-to-end single-view implicit 3D reconstruction; it disentangles implicit field reconstruction into accurate topological structures and enhanced surface details while keeping inference time competitive. To our knowledge, it is the first method to utilize transformers for single-view implicit 3D reconstruction. Experimental results show superior performance in both overall reconstruction and detail recovery.
- Finer topological structures are achieved through iterative refinement of the coarse implicit field using multiple IFDBs. Each IFDB deforms the implicit field from coarse to fine based on the query point and pixel-aligned local feature variations at continuous scales. ED2IF2-Net also enhances the surface-detail representation at the spatial and channel levels.
- A novel loss function consisting of four terms is proposed: the coarse shape loss and the overall shape loss constrain the reconstruction of the coarse shape and of the overall shape after fusion, while the novel deformation loss and the Laplacian loss enable ED2IF2-Net to reconstruct structural details and recover surface details, respectively (a sketch of this objective follows this list).
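As a rough illustration of how the four terms could combine, the following PyTorch sketch supervises the coarse field, each intermediate deformed field, the fused field, and a Laplacian (surface-detail) response. The L1 penalties, the loss weights, and all tensor names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def four_term_loss(sdf_coarse, sdf_steps, sdf_fused, lap_pred, sdf_gt, lap_gt,
                   w_coarse=1.0, w_deform=1.0, w_overall=1.0, w_lap=0.5):
    """Hypothetical four-term objective; each tensor holds per-query values of shape (B, N, 1)."""
    l_coarse = F.l1_loss(sdf_coarse, sdf_gt)             # coarse shape loss
    # Deformation loss: constrain every intermediate implicit field produced
    # by the IFDBs, so each refinement step is supervised.
    l_deform = sum(F.l1_loss(s, sdf_gt) for s in sdf_steps) / len(sdf_steps)
    l_overall = F.l1_loss(sdf_fused, sdf_gt)             # overall shape loss (after fusion)
    l_lap = F.l1_loss(lap_pred, lap_gt)                  # Laplacian (surface detail) loss
    return w_coarse * l_coarse + w_deform * l_deform + w_overall * l_overall + w_lap * l_lap
```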
2. Related Works
2.1. Shape Representations
2.2. Implicit Methods for Single-View 3D Reconstruction
2.3. Laplacian Operators
2.4. Transformers in Computer Vision
3. Methodology
3.1. Overview
3.2. Disentanglement Method
3.3. Network Architecture
3.3.1. Pyramid Vision Transformer Encoder
3.3.2. Coarse Shape Decoder
3.3.3. Deformation Decoder
Algorithm 1 Deformation
Input: coarse implicit field, multi-scale local features, query point P and its projection p on the image
Output: deformed implicit field
Implicit Field Deformation Block
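A minimal PyTorch sketch of Algorithm 1 and the IFDB, under stated assumptions: each block predicts an additive residual for the implicit field from the query point, its current field value, and a pixel-aligned local feature (assumed here to be already sampled at the projection p). The residual form, layer widths, and tensor shapes are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class IFDB(nn.Module):
    """One deformation step: predict a residual for the implicit field from the
    query point, its current field value, and a pixel-aligned local feature."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3 + 1, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, sdf, points, local_feat):
        # sdf: (B, N, 1), points: (B, N, 3), local_feat: (B, N, feat_dim),
        # with local_feat assumed pre-sampled at each point's projection p.
        residual = self.mlp(torch.cat([local_feat, points, sdf], dim=-1))
        return sdf + residual  # deform the existing field rather than re-predict it

def deform(sdf_coarse, points, feats, blocks):
    """Algorithm 1, sketched: refine the coarse field with one IFDB per feature scale."""
    sdf, intermediates = sdf_coarse, []
    for feat, block in zip(feats, blocks):   # coarse-to-fine local features
        sdf = block(sdf, points, feat)
        intermediates.append(sdf)            # supervised by the deformation loss
    return sdf, intermediates
```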
3.3.4. Surface Detail Decoder
3.4. Loss Function and Sampling Strategy
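This section pairs the four-term loss with a sampling strategy. A surface-biased weighted sampling of the kind evaluated as WS in the ablations might look like the following sketch; the point counts and the Gaussian perturbation scale are illustrative assumptions.

```python
import numpy as np

def weighted_sample(surface_pts, n_near=3072, n_uniform=1024, sigma=0.03):
    """Draw most query points as Gaussian perturbations of ground-truth surface
    points and the remainder uniformly in the bounding cube, biasing supervision
    toward the surface where small-scale detail lives."""
    idx = np.random.randint(len(surface_pts), size=n_near)
    near_surface = surface_pts[idx] + np.random.normal(scale=sigma, size=(n_near, 3))
    uniform = np.random.uniform(-0.5, 0.5, size=(n_uniform, 3))
    return np.concatenate([near_surface, uniform], axis=0)
```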
4. Experiment Results and Discussion
4.1. Dataset and Metrics
4.2. Implementation Details
4.3. Comparison with SOTA Approaches
4.4. Ablation Studies
- Option 1: In this option, we keep the original PVT encoder in the network, plus the coarse shape decoder (CSD) and a random sampling strategy, and only the coarse shape loss is applied. It can be seen from Figure 7 that the CSD with random sampling can only reconstruct a coarse shape with few structural details and no surface details, which is consistent with the quantitative results in Table 2.
- Option 2: On the basis of the first option, the network is trained with weighted sampling (WS). It can be found from Figure 7 that WS enables the network to reconstruct more details, especially at small scales.
- Option 3: In this option, we still use PVT as the encoder, but we directly initialize a random signed distance value for each query point and iteratively refine it in the deformation decoder (DD). The network is then trained with WS, constrained only by the deformation loss. It can be observed from Figure 7 that, without the coarse implicit field, the network reconstructs poor surfaces and topologies. Moreover, quite a few surface artifacts emerge near the shape due to the absence of the coarse implicit field.
- Option 4: With this option, the CSD and the DD together serve as the decoders, and only the overall shape loss with WS is used for the loss estimation. It can be seen in Figure 7 that such a network creates fewer shape artifacts and distortions, but it still fails to reconstruct the full structure of the shape, which is attributed to the fact that the loss function takes no account of the intermediate implicit fields generated during the iterative deformation.
- Option 5: Based on the previous options, the coarse shape loss and the deformation loss are applied to train the CSD and the DD, respectively; WS is also used here. Figure 7 illustrates that the network with this option is capable of reconstructing more accurate topological structures and producing a smoother shape.
- Option 6: In this option, the surface detail decoder in ED2IF2-Net with WS drops the prediction of the backward displacement map, and the deformed implicit field is fused with the forward displacement map only. The surface detail decoder in this case is denoted SDD_S and the normal case SDD_N (a sketch of the SDD_N fusion follows this list). It can be noticed from Figure 7 that, without the backward displacement map, surface details may be reconstructed incorrectly and distortions may occur at the structural level, since the missing map prevents fine-tuning.
- Option 7: Only the Laplacian loss is removed from the standard ED2IF2-Net loss function; the rest remains unchanged. It can be noted from Figure 7 that, in this case, the surface details of the reconstruction cannot be clearly recovered and distortions may appear.
- Option 8: The encoder in ED2IF2-Net-T is replaced with ResNet18, keeping the rest of the settings fixed. The reconstruction results in Figure 7 contain plenty of artifacts, which may be caused by ResNet18's weaker feature extraction compared with PVT, indicating that PVT is the better fit for ED2IF2-Net.
- Option 9: When this option is selected, all HAMs in the SDD_N of the standard ED2IF2-Net are removed and the rest of the network settings remain fixed. As shown in Figure 7, the surface details of the reconstructed objects become unclear without the HAM, which is consistent with the quantitative results in Table 2, where the variant shows increased ECD-3D and ECD-2D. These results confirm the effectiveness of HAMs in enhancing surface details.
- Option 10: We remove the DD from the standard ED2IF2-Net and exclude the deformation loss term from the loss function to create a variant pipeline similar to D2IM-Net. As shown in Figure 7, the shapes reconstructed by this variant are not comparable to the ones reconstructed by the standard ED2IF2-Net. It is worth noting that the quantitative comparisons in Table 1 and Table 2 show that, although the variant (marked in orange) performs slightly worse than the standard ED2IF2-Net, it still outperforms D2IM-Net, which confirms the superiority of our network pipeline.
- Option 11: To further validate the effectiveness of the deformation decoder (DD) and the deformation loss in reconstructing finer topological structures, we add the DD to the network of Option 10 while keeping the other settings unchanged. As shown in Figure 7, this variant generally reconstructs object shapes with more detailed topological structures than Option 10, which further demonstrates the contribution of the DD. However, the variant still struggles to generate visually appealing object shapes compared to the standard ED2IF2-Net, emphasizing the importance of the deformation loss in the reconstruction process.
- Option 12: To further demonstrate the superiority of the proposed method in feature extraction, we replace the PVT-Tiny and PVT-Large image encoders in the standard ED2IF2-Net-T and ED2IF2-Net-L with DeiT-Tiny and DeiT-Base [37], respectively, while keeping the other settings unchanged. The qualitative results in Figure 7 show that, with the DeiT encoders, the network tends to reconstruct inferior results with poor topological structure and surface details. This further confirms the effectiveness of the PVT encoder in feature extraction.
- Option 13: To further validate the effectiveness of HAM in the surface detail decoder for enhancing surface detail representation, all HAMs in the surface detail decoder of the standard model are replaced with CBAMs [45], while keeping the other settings unchanged. The qualitative reconstruction results are depicted in Figure 7. Compared with the standard ED2IF2-Net, the variant struggles to capture and recover clear surface details of the object, resulting in artifacts around the shape. This further demonstrates that HAM is more effective than CBAM in enhancing the capability of ED2IF2-Net to handle surface details.
- Option 14: The standard ED2IF2-Net proposed in this paper, including all components and the full four-term loss function, trained with WS.
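To make Option 6 concrete, here is a hedged sketch of SDD_N-style fusion: the deformed implicit field is combined with displacements sampled from forward and backward maps, choosing the map per query point with a front/back mask. The map layout, the mask, and the additive fusion are assumptions in the spirit of detail-disentangled methods such as D2IM-Net [29], not the paper's exact operators.

```python
import torch
import torch.nn.functional as F

def fuse_displacements(sdf_deformed, points_2d, fwd_map, bwd_map, front_mask):
    """Sample a per-point displacement from the forward map (visible side) or the
    backward map (occluded side) and add it to the deformed implicit field.
    sdf_deformed: (B, N, 1); points_2d: (B, N, 2) in [-1, 1]; maps: (B, 1, H, W)."""
    grid = points_2d.unsqueeze(2)                               # (B, N, 1, 2)
    d_fwd = F.grid_sample(fwd_map, grid, align_corners=False)   # (B, 1, N, 1)
    d_bwd = F.grid_sample(bwd_map, grid, align_corners=False)
    d_fwd = d_fwd.squeeze(-1).transpose(1, 2)                   # -> (B, N, 1)
    d_bwd = d_bwd.squeeze(-1).transpose(1, 2)
    displacement = torch.where(front_mask, d_fwd, d_bwd)        # per-point map choice
    return sdf_deformed + displacement                          # enhanced field (SDD_N)
```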
4.5. Computational Complexity
4.6. Applications
4.6.1. Test on Online Product Images
4.6.2. Surface Detail Transfer
4.6.3. Pasting a Logo
4.7. Discussion about the Effects of Camera Sensor Type on ED2IF2-Net
5. Limitations and Future Works
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| CNN | Convolutional Neural Network |
| MLP | Multi-Layer Perceptron |
| PVT | Pyramid Vision Transformer |
| IFDB | Implicit Field Deformation Block |
| HAM | Hybrid Attention Module |
| SDF | Signed Distance Function |
| SRA | Spatial-Reduction Attention |
| SOTA | State Of The Art |
| IoU | Intersection over Union |
| CD | Chamfer Distance |
| EMD | Earth Mover's Distance |
| ECD-3D | Edge Chamfer Distance of the reconstructed shape |
| ECD-2D | Edge Chamfer Distance in the image |
| FOV | Field of View |
| CSD | Coarse Shape Decoder |
| WS | Weighted Sampling |
| DD | Deformation Decoder |
| SDD_N | Normal Surface Detail Decoder (predicting both forward and backward displacement maps) |
| SDD_S | Surface Detail Decoder predicting only a single forward displacement map |
Appendix A. Implementation Details
Appendix A.1. Advantages of Utilizing a 224 × 224 RGB Image as Input
- Dataset compatibility: The majority of images in existing publicly available 3D reconstruction datasets are provided at a resolution of 224 × 224. Selecting 224 × 224 RGB images as input therefore aligns better with these datasets, leading to improved training efficacy of the network.
- Resource constraints: Higher-resolution inputs increase computational and memory requirements, resulting in longer training times and higher hardware demands. Opting for 224 × 224 RGB images reduces computational resource consumption while maintaining high performance.
- Information preservation: 3D reconstruction involves processing and analyzing input images to extract relevant features. Choosing 224 × 224 RGB images as input preserves enough detailed information for feature extraction, resulting in enhanced 3D reconstruction performance.
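A minimal input pipeline consistent with the 224 × 224 RGB convention above might look as follows; the ImageNet normalization statistics and the file name are illustrative assumptions.

```python
from PIL import Image
from torchvision import transforms

# Resize to the 224 x 224 RGB input discussed above; the normalization
# statistics are the usual ImageNet values, assumed here for illustration.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("input.png").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
```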
Appendix A.2. Analysis of Parameter Settings
Appendix A.2.1. Setting Batch_Size as 16
- Reduced memory consumption: A smaller batch_size leads to decreased memory usage since fewer data samples need to be stored per batch. This enables a larger number of batches to fit within the available memory, facilitating efficient training and inference processes.
- Improved model stability: A smaller batch_size enhances the stability of the model by introducing greater randomness in the samples within each batch. This randomization can help mitigate the risk of overfitting, resulting in a more robust and generalizable model.
- Improved tuning effectiveness: A smaller batch_size allows for faster observation of the model’s training progress. This expedited feedback loop enables quicker adjustments and fine-tuning of hyperparameters.
Appendix A.2.2. Setting Learning Rate as 5 × 10⁻⁵
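Wiring the two reported hyperparameters together (batch_size = 16 from Appendix A.2.1 and a 5 × 10⁻⁵ learning rate, assumed here to be used with the Adam optimizer the paper cites [61]), a toy training loop could look like this; the dataset and network are stand-ins so the snippet runs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the snippet executes; the real dataset and network come from the paper.
train_set = TensorDataset(torch.randn(64, 3, 224, 224), torch.randn(64, 1))
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1))

loader = DataLoader(train_set, batch_size=16, shuffle=True)    # batch_size = 16 (Appendix A.2.1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)      # learning rate 5e-5 (Appendix A.2.2)

for images, target in loader:
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(images), target)        # placeholder objective
    loss.backward()
    optimizer.step()
```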
Appendix A.2.3. Reasons for Other Settings
References
1. Shi, Z.; Meng, Z.; Xing, Y.; Ma, Y.; Wattenhofer, R. 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers. In Proceedings of the British Machine Vision Conference (BMVC), Virtual, 22–25 November 2021; British Machine Vision Association: Durham, UK, 2021; p. 405.
2. Peng, K.; Islam, R.; Quarles, J.; Desai, K. TMVNet: Using Transformers for Multi-View Voxel-Based 3D Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 222–230.
3. Yagubbayli, F.; Tonioni, A.; Tombari, F. LegoFormer: Transformers for Block-by-Block Multi-View 3D Reconstruction. arXiv 2021, arXiv:2106.12102.
4. Tiong, L.C.O.; Sigmund, D.; Teoh, A.B.J. 3D-C2FT: Coarse-to-Fine Transformer for Multi-View 3D Reconstruction. In Proceedings of the Asian Conference on Computer Vision (ACCV), Macau, China, 4–8 December 2022; pp. 1438–1454.
5. Li, X.; Kuang, P. 3D-VRVT: 3D Voxel Reconstruction from a Single Image with Vision Transformer. In Proceedings of the 2021 International Conference on Culture-Oriented Science & Technology (ICCST), Beijing, China, 18–21 November 2021; pp. 343–348.
6. Xie, H.; Yao, H.; Sun, X.; Zhou, S.; Zhang, S. Pix2Vox: Context-Aware 3D Reconstruction from Single and Multi-View Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2690–2698.
7. Choy, C.B.; Xu, D.; Gwak, J.; Chen, K.; Savarese, S. 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 628–644.
8. Sun, Y.; Liu, Z.; Wang, Y.; Sarma, S.E. Im2Avatar: Colorful 3D Reconstruction from a Single Image. arXiv 2018, arXiv:1804.06375.
9. Tatarchenko, M.; Dosovitskiy, A.; Brox, T. Octree Generating Networks: Efficient Convolutional Architectures for High-Resolution 3D Outputs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2107–2115.
10. Wu, J.; Wang, Y.; Xue, T.; Sun, X.; Freeman, W.T.; Tenenbaum, J.B. MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017.
11. Fan, H.; Su, H.; Guibas, L.J. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 605–613.
12. Lun, Z.; Gadelha, M.; Kalogerakis, E.; Maji, S.; Wang, R. 3D Shape Reconstruction from Sketches via Multi-View Convolutional Networks. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 67–77.
13. Kurenkov, A.; Ji, J.; Garg, A.; Mehta, V.; Gwak, J.; Choy, C.; Savarese, S. DeformNet: Free-Form Deformation Network for 3D Shape Reconstruction from a Single Image. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 858–866.
14. Lin, C.H.; Kong, C.; Lucey, S. Learning Efficient Point Cloud Generation for Dense 3D Object Reconstruction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
15. Kar, A.; Tulsiani, S.; Carreira, J.; Malik, J. Category-Specific Object Reconstruction from a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1966–1974.
16. Li, X.; Liu, S.; Kim, K.; De Mello, S.; Jampani, V.; Yang, M.H.; Kautz, J. Self-Supervised Single-View 3D Reconstruction via Semantic Consistency. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 677–693.
17. Wang, N.; Zhang, Y.; Li, Z.; Fu, Y.; Liu, W.; Jiang, Y.G. Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 55–71.
18. Park, J.J.; Florence, P.; Straub, J.; Newcombe, R.; Lovegrove, S. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 165–174.
19. Chen, Z.; Zhang, H. Learning Implicit Fields for Generative Shape Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5932–5941.
20. Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy Networks: Learning 3D Reconstruction in Function Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4460–4470.
21. Littwin, G.; Wolf, L. Deep Meta Functionals for Shape Representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1824–1833.
22. Michalkiewicz, M.; Pontes, J.K.; Jack, D.; Baktashmotlagh, M.; Eriksson, A. Deep Level Sets: Implicit Surface Representations for 3D Shape Inference. arXiv 2019, arXiv:1901.06802.
23. Wu, R.; Zhuang, Y.; Xu, K.; Zhang, H.; Chen, B. PQ-NET: A Generative Part Seq2Seq Network for 3D Shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 829–838.
24. Xu, Q.; Wang, W.; Ceylan, D.; Mech, R.; Neumann, U. DISN: Deep Implicit Surface Network for High-Quality Single-View 3D Reconstruction. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 492–502.
25. Wang, Y.; Zhuang, Y.; Liu, Y.; Chen, B. MDISN: Learning Multiscale Deformed Implicit Fields from Single Images. Vis. Inform. 2022, 6, 41–49.
26. Xu, Y.; Fan, T.; Yuan, Y.; Singh, G. Ladybird: Quasi-Monte Carlo Sampling for Deep Implicit Field Based 3D Reconstruction with Symmetry. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 248–263.
27. Bian, W.; Wang, Z.; Li, K.; Prisacariu, V.A. Ray-ONet: Efficient 3D Reconstruction from a Single RGB Image. In Proceedings of the British Machine Vision Conference (BMVC), Virtual, 22–25 November 2021.
28. Peng, S.; Niemeyer, M.; Mescheder, L.; Pollefeys, M.; Geiger, A. Convolutional Occupancy Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 523–540.
29. Li, M.; Zhang, H. D2IM-Net: Learning Detail Disentangled Implicit Fields from Single Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10246–10255.
30. Saito, S.; Huang, Z.; Natsume, R.; Morishima, S.; Kanazawa, A.; Li, H. PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2304–2314.
31. Saito, S.; Simon, T.; Saragih, J.; Joo, H. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 84–93.
32. He, T.; Collomosse, J.; Jin, H.; Soatto, S. Geo-PIFu: Geometry and Pixel Aligned Implicit Functions for Single-View Human Reconstruction. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 9276–9287.
33. Takikawa, T.; Litalien, J.; Yin, K.; Kreis, K.; Loop, C.; Nowrouzezahrai, D.; Jacobson, A.; McGuire, M.; Fidler, S. Neural Geometric Level of Detail: Real-Time Rendering with Implicit 3D Shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11358–11367.
34. Deng, Y.; Yang, J.; Tong, X. Deformed Implicit Field: Modeling 3D Shapes with Learned Dense Correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10286–10296.
35. Yang, M.; Wen, Y.; Chen, W.; Chen, Y.; Jia, K. Deep Optimized Priors for 3D Shape Modeling and Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3269–3278.
36. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021.
37. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training Data-Efficient Image Transformers & Distillation through Attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 10347–10357.
38. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 213–229.
39. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844.
40. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 7242–7252.
41. Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4804–4814.
42. Wang, D.; Cui, X.; Chen, X.; Zou, Z.; Shi, T.; Salcudean, S.; Wang, Z.J.; Ward, R. Multi-View 3D Reconstruction with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 5722–5731.
43. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 568–578.
44. Li, G.; Fang, Q.; Zha, L.; Gao, X.; Zheng, N. HAM: Hybrid Attention Module in Deep Convolutional Neural Networks for Image Classification. Pattern Recognit. 2022, 129, 108785.
45. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
46. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 624–632.
47. Tang, Y.; Gong, W.; Chen, X.; Li, W. Deep Inception-Residual Laplacian Pyramid Networks for Accurate Single-Image Super-Resolution. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 1514–1528.
48. Denton, E.L.; Chintala, S.; Szlam, A.; Fergus, R. Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2015.
49. Li, S.; Xu, X.; Nie, L.; Chua, T.S. Laplacian-Steered Neural Style Transfer. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1716–1724.
50. Liu, S.; Li, T.; Chen, W.; Li, H. Soft Rasterizer: A Differentiable Renderer for Image-Based 3D Reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7708–7717.
51. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010.
52. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41.
53. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
54. Lorensen, W.E.; Cline, H.E. Marching Cubes: A High Resolution 3D Surface Construction Algorithm. ACM SIGGRAPH Comput. Graph. 1987, 21, 163–169.
55. Esedoğlu, S.; Ruuth, S.; Tsai, R. Diffusion Generated Motion Using Signed Distance Functions. J. Comput. Phys. 2010, 229, 1017–1042.
56. Yao, Y.; Schertler, N.; Rosales, E.; Rhodin, H.; Sigal, L.; Sheffer, A. Front2Back: Single View 3D Shape Reconstruction via Front to Back Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 531–540.
57. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:1512.03012.
58. Remelli, E.; Lukoianov, A.; Richter, S.; Guillard, B.; Bagautdinov, T.; Baque, P.; Fua, P. MeshSDF: Differentiable Iso-Surface Extraction. In Advances in Neural Information Processing Systems (NeurIPS); Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 22468–22478.
59. Chen, Z.; Tagliasacchi, A.; Zhang, H. BSP-Net: Generating Compact Meshes via Binary Space Partitioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 45–54.
60. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2019.
61. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
62. Ma, C.; Shi, L.; Huang, H.; Yan, M. 3D Reconstruction from Full-View Fisheye Camera. arXiv 2015, arXiv:1506.06273.
63. Strecha, C.; Zoller, R.; Rutishauser, S.; Brot, B.; Schneider-Zapp, K.; Chovancova, V.; Krull, M.; Glassey, L. Quality Assessment of 3D Reconstruction Using Fisheye and Perspective Sensors. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2015, 2, 215.
64. Kakani, V.; Kim, H.; Kumbham, M.; Park, D.; Jin, C.B.; Nguyen, V.H. Feasible Self-Calibration of Larger Field-of-View (FOV) Camera Sensors for the Advanced Driver-Assistance System (ADAS). Sensors 2019, 19, 3369.
65. Fan, J.; Zhang, J.; Maybank, S.J.; Tao, D. Wide-Angle Image Rectification: A Survey. Int. J. Comput. Vis. 2022, 130, 747–776.
66. Hart, J.C. Sphere Tracing: A Geometric Method for the Antialiased Ray Tracing of Implicit Surfaces. Vis. Comput. 1996, 12, 527–545.
67. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019.
68. Alkhulaifi, A.; Alsahli, F.; Ahmad, I. Knowledge Distillation in Deep Learning and Its Applications. PeerJ Comput. Sci. 2021, 7, e474.
69. Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
Table 1. Quantitative comparison with SOTA approaches across 13 object categories (↑: higher is better; ↓: lower is better).

| Metric | Method | Plane | Bench | Box | Car | Chair | Display | Lamp | Speaker | Rifle | Sofa | Table | Phone | Boat | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IoU↑ | IM-Net | 55.4 | 49.5 | 51.5 | 74.5 | 52.2 | 56.2 | 29.6 | 52.6 | 52.3 | 64.1 | 45.0 | 70.9 | 56.6 | 54.6 |
| | DISN | 57.5 | 52.9 | 52.3 | 74.3 | 54.3 | 56.4 | 34.7 | 54.9 | 59.2 | 65.9 | 47.9 | 72.9 | 55.9 | 57.0 |
| | MDISN | 60.4 | 54.6 | 52.2 | 74.5 | 55.6 | 59.4 | 38.2 | 55.8 | 62.2 | 68.5 | 48.6 | 73.5 | 60.4 | 58.8 |
| | D2IM-Net | 60.6 | 55.7 | 52.1 | 74.6 | 56.2 | 61.9 | 40.8 | 54.5 | 63.4 | 69.3 | 48.2 | 73.8 | 62.5 | 59.5 |
| | D2IM-Net | 59.2 | 53.8 | 52.6 | 73.5 | 54.7 | 62.4 | 41.1 | 54.3 | 62.9 | 68.5 | 48.0 | 74.3 | 61.6 | 59.0 |
| | ED2IF2-Net-T | 62.9 | 57.8 | 55.2 | 75.8 | 56.5 | 63.7 | 38.9 | 54.6 | 64.5 | 71.1 | 49.3 | 72.6 | 61.8 | 60.4 |
| | ED2IF2-Net-L | 63.5 | 59.6 | 56.5 | 76.4 | 57.3 | 64.2 | 39.6 | 55.7 | 65.1 | 70.6 | 50.8 | 72.1 | 62.3 | 61.1 |
| CD↓ | IM-Net | 12.65 | 15.10 | 11.39 | 8.86 | 11.27 | 13.77 | 63.84 | 21.83 | 8.73 | 10.30 | 17.82 | 7.06 | 13.25 | 16.61 |
| | DISN | 9.96 | 8.98 | 10.19 | 5.39 | 7.71 | 10.23 | 25.76 | 17.90 | 5.58 | 9.16 | 13.59 | 6.40 | 11.91 | 10.98 |
| | MDISN | 5.77 | 6.29 | 8.78 | 5.21 | 6.68 | 8.13 | 15.59 | 14.54 | 6.98 | 6.96 | 10.36 | 5.36 | 6.20 | 8.22 |
| | D2IM-Net | 7.32 | 6.03 | 9.16 | 4.98 | 6.41 | 8.25 | 14.57 | 14.69 | 5.14 | 6.45 | 9.83 | 5.42 | 7.56 | 8.14 |
| | D2IM-Net | 7.14 | 6.15 | 8.92 | 5.06 | 6.34 | 8.03 | 14.59 | 14.41 | 5.27 | 6.58 | 9.67 | 5.49 | 7.12 | 8.06 |
| | ED2IF2-Net-T | 6.31 | 5.62 | 8.13 | 4.66 | 6.15 | 7.59 | 14.17 | 13.06 | 4.38 | 6.06 | 8.64 | 5.47 | 6.45 | 7.44 |
| | ED2IF2-Net-L | 5.89 | 5.34 | 7.86 | 4.52 | 6.03 | 7.42 | 13.91 | 12.75 | 4.41 | 6.12 | 8.54 | 5.39 | 6.23 | 7.26 |
| EMD↓ | IM-Net | 2.90 | 2.80 | 3.14 | 2.73 | 3.01 | 2.81 | 5.85 | 3.80 | 2.65 | 2.71 | 3.39 | 2.14 | 2.75 | 3.13 |
| | DISN | 2.67 | 2.48 | 3.04 | 2.67 | 2.67 | 2.73 | 4.38 | 3.47 | 2.30 | 2.62 | 3.11 | 2.06 | 2.77 | 2.84 |
| | MDISN | 2.33 | 2.17 | 2.91 | 2.70 | 2.52 | 2.50 | 3.67 | 3.30 | 2.17 | 2.43 | 2.81 | 2.11 | 2.42 | 2.62 |
| | D2IM-Net | 2.24 | 2.18 | 2.93 | 2.61 | 2.65 | 2.62 | 3.72 | 3.28 | 2.14 | 2.36 | 2.78 | 1.91 | 2.53 | 2.61 |
| | D2IM-Net | 2.32 | 2.13 | 3.01 | 2.58 | 2.62 | 2.66 | 3.67 | 3.41 | 2.25 | 2.44 | 2.86 | 2.00 | 2.49 | 2.65 |
| | ED2IF2-Net-T | 2.12 | 2.15 | 2.96 | 2.57 | 2.59 | 2.48 | 3.55 | 3.21 | 2.18 | 2.42 | 2.72 | 2.08 | 2.37 | 2.57 |
| | ED2IF2-Net-L | 2.07 | 2.12 | 2.93 | 2.45 | 2.54 | 2.51 | 3.47 | 3.16 | 2.11 | 2.38 | 2.66 | 1.95 | 2.28 | 2.51 |
| ECD-3D↓ | IM-Net | 7.89 | 6.85 | 8.72 | 8.72 | 6.61 | 8.20 | 9.95 | 10.80 | 6.74 | 7.90 | 7.10 | 7.24 | 8.23 | 8.07 |
| | DISN | 6.84 | 5.73 | 6.97 | 6.80 | 5.64 | 7.65 | 11.27 | 10.77 | 3.50 | 6.06 | 6.01 | 7.08 | 5.83 | 6.94 |
| | MDISN | 6.32 | 5.13 | 6.84 | 6.87 | 5.57 | 7.39 | 10.06 | 10.26 | 3.53 | 6.29 | 5.95 | 6.72 | 5.94 | 6.68 |
| | D2IM-Net | 5.67 | 4.77 | 6.61 | 7.28 | 5.23 | 6.74 | 9.18 | 9.09 | 3.43 | 6.42 | 6.30 | 6.09 | 5.68 | 6.34 |
| | D2IM-Net | 5.98 | 5.16 | 6.91 | 6.46 | 5.04 | 7.13 | 8.97 | 9.73 | 3.57 | 6.02 | 5.67 | 6.60 | 5.34 | 6.35 |
| | ED2IF2-Net-T | 5.31 | 4.51 | 6.54 | 6.76 | 5.19 | 6.51 | 9.07 | 8.94 | 3.16 | 6.03 | 5.93 | 6.02 | 5.45 | 6.11 |
| | ED2IF2-Net-L | 5.33 | 4.45 | 6.60 | 6.72 | 5.15 | 6.49 | 9.11 | 8.87 | 3.12 | 5.98 | 5.86 | 5.96 | 5.38 | 6.08 |
| ECD-2D↓ | IM-Net | 2.53 | 2.85 | 4.47 | 3.34 | 2.70 | 3.23 | 3.36 | 4.20 | 3.14 | 2.98 | 2.85 | 2.42 | 3.05 | 3.16 |
| | DISN | 2.67 | 2.21 | 2.25 | 2.04 | 1.98 | 3.16 | 4.86 | 3.34 | 1.35 | 2.06 | 2.07 | 2.26 | 2.00 | 2.48 |
| | MDISN | 2.36 | 2.13 | 2.01 | 2.12 | 1.64 | 2.65 | 4.47 | 2.98 | 1.39 | 2.08 | 1.97 | 1.93 | 1.91 | 2.28 |
| | D2IM-Net | 1.99 | 1.67 | 1.79 | 2.07 | 1.71 | 1.95 | 3.16 | 2.64 | 1.28 | 2.01 | 1.88 | 1.62 | 1.73 | 1.96 |
| | D2IM-Net | 1.98 | 1.77 | 1.74 | 1.77 | 1.58 | 2.68 | 3.01 | 2.72 | 1.77 | 1.78 | 1.74 | 2.14 | 2.27 | 2.07 |
| | ED2IF2-Net-T | 1.92 | 1.51 | 1.66 | 1.94 | 1.63 | 1.87 | 2.99 | 2.48 | 1.36 | 2.04 | 1.79 | 1.58 | 1.65 | 1.88 |
| | ED2IF2-Net-L | 1.96 | 1.44 | 1.68 | 1.85 | 1.59 | 1.79 | 3.04 | 2.41 | 1.26 | 1.88 | 1.84 | 1.52 | 1.63 | 1.84 |
Table 2. Ablation results. The component configuration of each option (encoder, decoders, attention module, loss terms, and sampling strategy) is described in Section 4.4; paired rows report the Tiny- and Large-encoder variants.

| Option | Encoder | IoU↑ | CD↓ | EMD↓ | ECD-3D↓ | ECD-2D↓ |
|---|---|---|---|---|---|---|
| Option 1 | PVT-Tiny | 52.4 | 9.96 | 3.11 | 7.42 | 3.58 |
| | PVT-Large | 52.9 | 9.87 | 3.09 | 7.36 | 3.52 |
| Option 2 | PVT-Tiny | 53.5 | 9.23 | 3.04 | 7.16 | 3.17 |
| | PVT-Large | 53.7 | 9.31 | 2.95 | 6.98 | 3.06 |
| Option 3 | PVT-Tiny | 54.1 | 8.67 | 2.96 | 6.85 | 2.88 |
| | PVT-Large | 54.2 | 8.63 | 2.87 | 6.64 | 2.75 |
| Option 4 | PVT-Tiny | 54.7 | 8.32 | 2.89 | 6.53 | 2.54 |
| | PVT-Large | 54.9 | 8.17 | 2.81 | 6.35 | 2.46 |
| Option 5 | PVT-Tiny | 55.1 | 7.74 | 2.83 | 6.29 | 2.27 |
| | PVT-Large | 55.4 | 7.63 | 2.75 | 6.02 | 2.11 |
| Option 6 | PVT-Tiny | 55.6 | 7.21 | 2.78 | 5.87 | 1.95 |
| | PVT-Large | 56.1 | 6.97 | 2.67 | 5.64 | 1.82 |
| Option 7 | PVT-Tiny | 55.9 | 6.58 | 2.71 | 5.49 | 1.76 |
| | PVT-Large | 56.5 | 6.39 | 2.62 | 5.38 | 1.67 |
| Option 8 | ResNet18 | 51.2 | 10.83 | 3.16 | 7.59 | 3.65 |
| Option 9 | PVT-Tiny | 56.4 | 6.26 | 2.61 | 5.21 | 1.65 |
| | PVT-Large | 57.0 | 6.12 | 2.56 | 5.18 | 1.62 |
| Option 10 | PVT-Tiny | 56.2 | 6.35 | 2.64 | 5.22 | 1.69 |
| | PVT-Large | 56.4 | 6.27 | 2.61 | 5.20 | 1.64 |
| Option 11 | PVT-Tiny | 56.3 | 6.30 | 2.62 | 5.21 | 1.67 |
| | PVT-Large | 56.8 | 6.14 | 2.57 | 5.17 | 1.61 |
| Option 12 | DeiT-Tiny | 54.8 | 6.54 | 2.76 | 5.43 | 1.78 |
| | DeiT-Base | 55.6 | 6.48 | 2.69 | 5.32 | 1.68 |
| Option 13 | PVT-Tiny | 56.0 | 6.26 | 2.72 | 5.25 | 1.71 |
| | PVT-Large | 56.9 | 6.18 | 2.64 | 5.23 | 1.67 |
| Option 14 | PVT-Tiny | 56.5 | 6.15 | 2.59 | 5.19 | 1.63 |
| | PVT-Large | 57.3 | 6.03 | 2.54 | 5.15 | 1.59 |
| Model | IM-Net | DISN | MDISN | D2IM-Net | ED2IF2-Net-T | ED2IF2-Net-L |
|---|---|---|---|---|---|---|
| Training Time (h) | 138 | 105 | 84 | 68 | 47 | 66 |
| Inference Time (ms) | 204.15 | 188.19 | 162.73 | 146.57 | 97.64 | 144.09 |
| Application | Method | IoU↑ | CD↓ | EMD↓ | ECD-3D↓ | ECD-2D↓ |
|---|---|---|---|---|---|---|
| Surface Detail Transfer | D2IM-Net | 51.4 | 6.87 | 3.01 | 5.86 | 2.25 |
| | ED2IF2-Net-T | 52.3 | 6.69 | 2.88 | 5.72 | 2.01 |
| | ED2IF2-Net-L | 53.1 | 6.58 | 2.75 | 5.65 | 1.92 |
| Pasting a Logo | D2IM-Net | 53.4 | 6.67 | 2.93 | 5.66 | 2.08 |
| | ED2IF2-Net-T | 54.6 | 6.52 | 2.84 | 5.57 | 1.88 |
| | ED2IF2-Net-L | 55.8 | 6.36 | 2.69 | 5.42 | 1.75 |