Article

Object-Aware 3D Scene Reconstruction from Single 2D Images of Indoor Scenes

Department of Multimedia Engineering, Dongguk University-Seoul, 30, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(2), 403; https://doi.org/10.3390/math11020403
Submission received: 29 November 2022 / Revised: 9 January 2023 / Accepted: 10 January 2023 / Published: 12 January 2023
(This article belongs to the Section Mathematics and Computer Science)

Abstract

Recent studies have shown that deep learning achieves excellent performance in reconstructing 3D scenes from multiview images or videos. However, these reconstructions do not provide the identities of objects, and object identification is necessary for a scene to be functional in virtual reality or interactive applications. The objects in a scene reconstructed as one mesh are treated as a single object, rather than individual entities that can be interacted with or manipulated. Reconstructing an object-aware 3D scene from a single 2D image is challenging because the image conversion process from a 3D scene to a 2D image is irreversible, and the projection from 3D to 2D reduces a dimension. To alleviate the effects of dimension reduction, we proposed a module to generate depth features that can aid the 3D pose estimation of objects. Additionally, we developed a novel approach to mesh reconstruction that combines two decoders that estimate 3D shapes with different shape representations. By leveraging the principles of multitask learning, our approach demonstrated superior performance in generating complete meshes compared to methods relying solely on implicit representation-based mesh reconstruction networks (e.g., local deep implicit functions), as well as producing more accurate shapes compared to previous approaches for mesh reconstruction from single images (e.g., topology modification networks). The proposed method was evaluated on real-world datasets. The results showed that it could effectively improve the object-aware 3D scene reconstruction performance over existing methods.

1. Introduction

Three-dimensional reconstruction from images is a long-standing issue in computer vision and can benefit various applications. For example, the reconstructed mesh can be used for virtual reality (VR) applications [1], games [2], etc. Three-dimensional reconstruction from 3D data, such as point clouds or depth images, is generally more accurate than reconstruction from 2D images because the 3D data provides more information about the shape and geometry of the object. However, obtaining 3D data often requires special devices such as lidar or depth cameras, which can be more expensive or difficult to use than standard 2D cameras. On the other hand, 2D images such as RGB images are easier to obtain, but the reconstruction can be more challenging due to the lack of 3D information. The advancement of deep learning has spurred a range of research addressing issues related to 3D understanding from 2D images, including 3D object detection [3], object tracking [4], object reconstruction [5], and scene reconstruction [6]. Among them, methods that perform 3D scene reconstruction from multiview images or videos have achieved good results. However, when only a single 2D image is available, the reconstructed quality is unsatisfactory because of the scale ambiguity caused by the dimension reduction in the 3D to 2D projection. Additionally, the majority of existing studies [7,8,9,10] have reconstructed objects in images as a single mesh, without separating individual objects within the scene. As a result, these reconstructed scenes are unsuitable for VR applications or games, as they do not allow the identification and interaction of individual objects. Therefore, object-aware 3D reconstruction has been considered for reconstructing an interactable 3D scene.
Object-aware 3D scene reconstruction from a 2D image involves estimating the 3D shape and pose of individual objects depicted in the image, the 3D layout bounding box of the scene, and the camera pose used to capture the input image [6]. The goal is to reconstruct a 3D scene that accurately reflects the real-world scene depicted in the input image. However, 2D image formation is irreversible because a single 2D image may be generated from different 3D scenes [11]. Thus, it is challenging to infer poses accurately from a single 2D image without depth information. Some methods estimate the 3D poses of objects by using constraints between objects in a scene; for example, objects should not overlap in the 3D world [12,13]. By adding these constraints to the loss functions during training [12,13], reasonable relative poses of objects can be recovered, but the results still leave considerable room for improvement. Another challenge is that a 2D image typically shows only a limited portion of a 3D object, with occluded and hidden surfaces that are not visible in the image. As a result, it can be difficult to accurately reconstruct the entire 3D geometry of an object from a single 2D image because important information is missing. Several studies [6,14,15,16,17] have addressed this problem. Early deep learning approaches reconstructed 3D shapes using voxel representations or point clouds, which capture the global shape well but may not accurately reproduce surface details. More recently, implicit function-based methods [5,9,16] have resolved the resolution issue; however, they sometimes produce broken meshes.
This study proposes a method to estimate the camera parameters, layout bounding box, 3D object bounding boxes, and 3D object shapes from a 2D image. Unlike previous methods that directly estimated this information from a 2D image, the proposed method first estimates the depth to recover the scale information, enhancing the 3D detection accuracy. For mesh reconstruction, we designed a mesh reconstruction network that outputs a mesh in two different formats followed by a multitask learning scheme. We summarize the contributions of this study as follows:
  • A depth-feature generation module is introduced, which can solve the scale ambiguity issue. The generated depth features can be beneficial to recovering the 3D bounding boxes of the layout and objects.
  • We propose enhancing the mesh reconstruction results using multitask learning. The mesh reconstruction network comprises the following components: a 2D image encoder that encodes a 2D input to a feature vector; and two decoders, wherein one decodes the feature vector into a point cloud and the other decodes the feature vector into an implicit mesh. The two decoders allow the image encoder to extract more information that can be utilized by the decoders, thereby reducing broken meshes.
The object-aware 3D scene reconstruction pipeline is shown in Figure 1. It consists of two stages: initial estimation and refinement. The input image is first preprocessed to generate 2D object bounding boxes. The initial estimation stage produces coarse estimations of the 3D object bounding boxes, camera pose, layout bounding box, and 3D object meshes. The refinement stage predicts residual errors and improves the coarse estimations to obtain the final 3D scene.
To evaluate our proposed methods, we followed the evaluation metrics and datasets used in [6,16]. We evaluated our methods using the SUN RGB-D [18] and Pix3D [19] datasets. The SUN RGB-D dataset provided ground-truth data for the evaluation of 3D object pose estimation, 3D layout estimation, and camera pose estimation. The Pix3D dataset was used for the evaluation of 3D mesh reconstruction.
The rest of this study is organized as follows. Related work on object-aware 3D scene reconstruction is reviewed in Section 2. Section 3 explains the proposed method and its components. Section 4 describes the experiments and their results. The final section summarizes and concludes the study’s results.

2. Related Work

Object-aware 3D scene reconstruction requires estimating the layout bounding box, 3D object bounding boxes, and object shapes [20]. Early studies focused on layout estimation and 3D object bounding box estimation without estimating 3D object shapes [12,13,21,22]. They estimated the layout bounding box only using the edge [23], area [24], and contour features [25]. Recently, there has been a growing focus on estimating 3D object bounding boxes [12,13,22]. In these studies, various constraints were used to guide the 3D object bounding box estimations. For example, in an indoor scene, objects are physically stable. Therefore, they must be aligned with walls and floors [12], and when 3D bounding boxes are projected onto a 2D image, the projected 2D bounding boxes should be consistent with the ground-truth 2D bounding boxes [22]. When humans exist in the scene, the human–object interaction provides enhanced knowledge of the object placement [13]. In the methods mentioned above, reconstructed objects are represented only by 3D bounding boxes, and the object shapes are not recovered.
Several shape retrieval methods have created 3D scenes with object shapes [26,27,28,29,30,31,32]. A computer-aided design (CAD) model that is closest in similarity to the object in the image is selected from a CAD database based on the distance in the embedding space between the input image and the CAD model. Although these methods allow for creating clean 3D scenes, the reconstruction quality and efficiency depend considerably on the variety and size of the CAD database. If no similar CAD models exist in the database, poorly matched CAD models are selected, resulting in inaccurate reconstructions. Moreover, if the CAD database is extensive, searching for similar 3D models takes a long time.
Instead of retrieving similar CAD models, single-object reconstruction methods are usually incorporated into 3D scene reconstruction pipelines [6,16] to reconstruct object meshes. Early studies reconstructed shapes as point clouds [33,34]; however, point clouds cannot represent mesh topologies. Voxel grid representations have also been widely adopted [35,36,37,38], but the reconstructed meshes lack detail because they are restricted by the resolution of the voxel grid and the computational capacity of the device. Several methods have employed octrees [39,40,41] to obtain a high-resolution voxel grid, but the resolution issue has not yet been fully solved. Other studies [42,43,44] have recovered shapes by deforming a template using single-object mesh reconstruction methods. Good reconstruction results were obtained when the target object had the same topology as the template; however, reconstruction accuracy can be very low if the topologies differ significantly. Topology modification methods [6,14] can be applied in such cases, but the reconstructed meshes suffer from self-intersection problems. To solve the resolution and non-watertight issues of the methods mentioned above, signed distance fields [45] and implicit fields [46] have been adopted to represent meshes. These methods are not restricted by resolution and can be adapted to objects with various topologies. Similar to the proposed method, Implicit3D [16] adopted implicit fields for mesh reconstruction. However, Implicit3D sometimes generates broken meshes, and we propose a method that combines point-cloud and implicit-field representations in the mesh decoder's output to improve the global shapes of the reconstructed meshes.

3. Methods

An overview of the proposed object-aware 3D scene reconstruction method is shown in Figure 2. It jointly estimates the camera pose, 3D layout bounding box, 3D object bounding boxes, and 3D object shapes. First, the input images are preprocessed to generate object proposals using a pretrained 2D detector, the faster region-based convolutional neural network (Faster R-CNN) [47]. Similar to [16], 3D scene reconstruction is then divided into two stages. The first is the initial estimation stage, which includes layout estimation, 3D object detection, and mesh reconstruction networks. We adopt the layout estimation and 3D object detection networks from [16] and propose a new mesh reconstruction network based on multitask learning. The second is the refined estimation stage, which includes a depth-feature generation network and a scene graph convolutional network (SGCN). The depth-feature generation network generates depth features for the layout and objects. The SGCN follows the design of [16]; it produces the final camera pose, 3D layout bounding box, and 3D object bounding boxes by combining the initial estimations from the first stage with the depth features from the depth-feature generation network.
This study follows the world coordinate system setup proposed in [16], which places the origin at the camera center, aligns the x-axis with the forward orientation of the camera, and aligns the y-axis with the direction perpendicular to the floor. Let β and γ denote the pitch and roll of the camera, respectively; the camera pose is then represented as R(β, γ). A 3D bounding box is defined by its center C ∈ ℝ³, size s ∈ ℝ³, and orientation θ in 3D space. Specifically, the 3D layout bounding box is represented as (C_l, s_l, θ_l), while a 3D object bounding box is represented as (C_o, s_o, θ_o). Directly predicting the 3D object bounding box center C_o is difficult [22]. We therefore define δ ∈ ℝ² as the 2D offset between the 2D projection of the 3D center and the center c ∈ ℝ² of the 2D object bounding box produced by the 2D object detector, and d as the distance from the center of the 3D object bounding box to the camera center in 3D space. Let K denote the 3 × 3 camera intrinsic matrix; C_o is then parametrized as in Equation (1). Thus, the 3D object bounding box can be decomposed as (δ, d, s_o, θ_o).
$C_o = R^{-1}(\beta, \gamma) \cdot d \cdot \dfrac{K^{-1}[c + \delta,\, 1]^T}{\left\lVert K^{-1}[c + \delta,\, 1]^T \right\rVert_2}$ (1)
The goal is to estimate the camera pose R(β, γ), the 3D layout bounding box (C_l, s_l, θ_l), and the 3D object bounding boxes (δ, d, s_o, θ_o). As directly regressing the orientations, sizes, and distance can be error-prone, classification and regression are combined to obtain these estimates. Specifically, the range of each size and orientation of the 3D layout/object bounding box is evenly divided into a set of small sub-ranges (bins); the network classifies which bin the target value belongs to and regresses the residual error within that bin. As in [16], β, γ, s_l, θ_l, d, s_o, and θ_o adopt this combination of classification and regression, while C_l and δ are directly regressed. This is because C_l is computed from a precomputed 3D layout center and its offset, and δ is the offset from the detected 2D object bounding box center c.
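For illustration, the parametrization in Equation (1) can be implemented directly. The following minimal NumPy sketch recovers the 3D object center from the estimated offset, distance, and camera parameters (the function and variable names are illustrative and not part of the released implementation):

```python
import numpy as np

def recover_object_center(c, delta, d, K, R):
    """Recover the 3D object center C_o according to Equation (1).

    c     : (2,) detected 2D bounding-box center in pixels
    delta : (2,) estimated offset between the projected 3D center and c
    d     : scalar distance from the 3D object center to the camera center
    K     : (3, 3) camera intrinsic matrix
    R     : (3, 3) camera rotation matrix R(beta, gamma)
    """
    # Homogeneous image coordinates of the projected 3D center.
    p = np.append(c + delta, 1.0)
    # Back-project into a camera-space ray and normalize it to unit length.
    ray = np.linalg.inv(K) @ p
    ray = ray / np.linalg.norm(ray)
    # Scale the unit ray by the estimated distance and rotate into world space.
    return np.linalg.inv(R) @ (d * ray)
```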
Initial Estimation Stage. The layout estimation network uses Resnet-18 [48] as the backbone, followed by three multilayer perceptron (MLP) networks. The three MLP networks output the classification and regression estimates of the camera pose R(β, γ), the orientation θ_l and size s_l of the layout bounding box, and the regression estimate of the offset of the 3D layout bounding box center. The entire 2D image is the input of the layout estimation network. The 3D object detection network takes the 2D proposals produced by the 2D detector and estimates the 3D object bounding boxes; its outputs are the regression estimate of δ and the classification and regression estimates of s_o and θ_o. The network architecture follows the designs proposed in previous studies [6,16]. The 3D layout and object bounding boxes are estimated in the camera space. In contrast to the mesh reconstruction network in [16], we designed the mesh reconstruction network based on multitask learning, with two decoders that generate shapes in different representations; this is explained in Section 3.1.
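As a concrete illustration of the hybrid classification-and-regression scheme used by these heads, the sketch below decodes a continuous orientation angle from per-bin classification logits and per-bin regressed residuals (the bin count and the [-π, π) range are illustrative assumptions rather than the exact configuration of our networks):

```python
import math
import torch

def decode_binned_angle(cls_logits, residuals, num_bins=12):
    """Decode a continuous angle from bin classification and residual regression.

    cls_logits : (B, num_bins) classification scores over angle bins
    residuals  : (B, num_bins) regressed offsets within each bin, in radians
    """
    bin_width = 2.0 * math.pi / num_bins
    # Center angle of every bin, covering [-pi, pi).
    bin_centers = -math.pi + (torch.arange(num_bins, device=cls_logits.device) + 0.5) * bin_width
    best_bin = cls_logits.argmax(dim=1)                              # (B,)
    best_residual = residuals.gather(1, best_bin[:, None]).squeeze(1)
    return bin_centers[best_bin] + best_residual
```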
Refined Estimation Stage. The refined estimation stage includes two networks: a depth-feature generation network that extracts depth features from the input image, and an SGCN that refines the coarse estimates of the camera pose, 3D layout bounding box, and 3D object bounding boxes. The SGCN represents the objects, the layout, and the relationships between them as graph nodes. Each node contains features extracted in the initial estimation stage, which are flattened and concatenated into a node feature vector. The node feature vectors are encoded into feature vectors of the same dimension using MLP networks and are then updated iteratively by aggregating information from neighboring nodes, again using MLP networks. The architecture is based on the design proposed in [16]; in contrast to that study, however, the global depth feature is included in the layout node and the object depth features are included in the object nodes.
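The simplified sketch below illustrates one message-passing round of such a scene graph network. It is a generic simplification for exposition only; the actual SGCN follows [16] and differs in its node encodings and update functions.

```python
import torch
import torch.nn as nn

class SceneGraphConv(nn.Module):
    """One simplified message-passing round over the scene graph."""
    def __init__(self, dim=512):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.message = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, nodes, adjacency):
        """
        nodes     : (N, dim) node features (layout node plus object nodes)
        adjacency : (N, N) matrix with 1.0 where two nodes are connected
        """
        h = self.encode(nodes)
        n, dim = h.shape
        # Pairwise messages: message[i, j] is computed from (h[i], h[j]).
        pair = torch.cat([h[:, None, :].expand(n, n, dim),
                          h[None, :, :].expand(n, n, dim)], dim=-1)
        msgs = self.message(pair) * adjacency[..., None]       # mask non-edges
        agg = msgs.sum(dim=1) / adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
        # Update every node with its aggregated neighborhood message.
        return self.update(torch.cat([h, agg], dim=-1))
```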

3.1. Mesh Reconstruction Network Based on Multitask Learning

It has been demonstrated in several studies [49,50] that simultaneously learning multiple tasks can improve performance on the original desired task, as certain information shared across tasks may be more effectively extracted through another task. In this study, motivated by multitask learning, we designed a mesh reconstruction network to learn different mesh representations. The architecture of the mesh reconstruction network is shown in Figure 3.
The cropped object image is encoded into an image feature vector using Resnet-18, which is then concatenated with a one-hot encoded object class and further encoded into a shape embedding that contains the shape information of the target object. Two different shape decoders were used. One is a Local Deep Implicit Functions (LDIF) decoder [51] with the same architecture as that used in [16]. Given the shape embedding and a 3D point in a canonical space, it estimates whether the 3D point lies inside the target 3D mesh. At inference time, 3D points are densely sampled in the canonical space and evaluated using the decoder, turning the canonical space into an occupancy field, which is processed with the marching cubes algorithm [52] to generate a triangular mesh. However, the local deep implicit representation sometimes generates broken meshes because of insufficient global shape supervision. To solve this problem, we employ an additional point cloud decoder that estimates the global shape of the target object. The point cloud decoder adopted in the architecture is described in [33].
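A minimal PyTorch-style sketch of this multitask arrangement is given below. The module names, layer sizes, and the simple occupancy decoder are illustrative assumptions; the actual implicit branch follows the LDIF decoder of [16,51], and only the implicit branch is used at inference time.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MultiTaskMeshNet(nn.Module):
    def __init__(self, num_classes=9, embed_dim=1024, num_points=2048):
        super().__init__()
        backbone = resnet18(weights=None)   # pretrained=False on older torchvision
        backbone.fc = nn.Identity()         # 512-d image feature
        self.encoder = backbone
        self.embed = nn.Sequential(
            nn.Linear(512 + num_classes, embed_dim), nn.ReLU())
        # Implicit branch: predicts whether a query 3D point lies inside the mesh.
        self.implicit_decoder = nn.Sequential(
            nn.Linear(embed_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 1))
        # Point cloud branch: regresses a fixed-size global point set (training only).
        self.point_decoder = nn.Sequential(
            nn.Linear(embed_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_points * 3))
        self.num_points = num_points

    def forward(self, image, class_onehot, query_points):
        feat = self.encoder(image)                               # (B, 512)
        z = self.embed(torch.cat([feat, class_onehot], dim=1))   # (B, embed_dim)
        b, n, _ = query_points.shape
        z_rep = z[:, None, :].expand(b, n, z.shape[1])
        occupancy = self.implicit_decoder(
            torch.cat([z_rep, query_points], dim=2))             # (B, N, 1)
        points = self.point_decoder(z).view(b, self.num_points, 3)
        return occupancy, points
```

During training, the implicit branch is supervised with the LDIF losses and the point cloud branch with the chamfer distance, as described in Section 3.3.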

3.2. Depth-Feature Generation Network

In the refined estimation stage, we add the global and object depth features to the layout and object nodes of the SGCN, respectively. The added depth features help alleviate the inaccurate 3D perception caused by scale ambiguity. The networks that extract the global and object depth features share the same architecture but have different weights. The network architecture is shown in Figure 4.
The depth-feature extraction network uses the Resnet-18 as the backbone, followed by a two-layer MLP network. We employed the pretrained LapDepthNet [53] to predict depth images from 2D images, as it has been trained on the indoor scene dataset NYU-Depth V2 [54]. The global depth feature was extracted from the global depth image, and the object depth features were extracted from the cropped depth image of the object area. In the SGCN, we concatenate the global depth feature with the original layout feature to create the layout node feature, and we concatenate the object depth features with the original object features to create the object node features. This is in addition to the features used in Implicit3D [16].
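A minimal sketch of how the depth features can be attached to the SGCN node features is shown below. The 1024- and 512-dimensional outputs follow the implementation details in Section 4.1; the single-channel input convolution and the remaining names are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DepthFeatureNet(nn.Module):
    """ResNet-18 backbone followed by a two-layer MLP, as in Figure 4."""
    def __init__(self, out_dim):
        super().__init__()
        backbone = resnet18(weights=None)   # pretrained=False on older torchvision
        # Assume a single-channel depth map; alternatively, the depth map can be
        # replicated to three channels and the original first convolution kept.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.mlp = nn.Sequential(nn.Linear(512, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, depth):
        return self.mlp(self.backbone(depth))

# Separate weights for the scene-level and object-level depth features.
global_depth_net = DepthFeatureNet(out_dim=1024)   # whole-scene depth map
object_depth_net = DepthFeatureNet(out_dim=512)    # cropped object depth maps

def build_node_features(layout_feat, object_feats, scene_depth, object_depths):
    """Concatenate depth features onto the original SGCN node features."""
    layout_node = torch.cat([layout_feat, global_depth_net(scene_depth)], dim=1)
    object_nodes = torch.cat([object_feats, object_depth_net(object_depths)], dim=1)
    return layout_node, object_nodes
```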

3.3. Loss Function

Losses for Mesh Reconstruction Network. The losses for the mesh reconstruction network are designed for its two decoders, the LDIF decoder and the point cloud decoder. Let L_ldif and L_p denote the losses for the LDIF decoder and the point cloud decoder, respectively, L_m denote the mesh reconstruction loss, and λ_p denote the weight of L_p. The mesh reconstruction loss is defined as Equation (2):
$L_m = L_{ldif} + \lambda_p L_p$, (2)
where L_p is the chamfer distance [51] calculated between the ground-truth 3D points and the 3D points estimated by the point cloud decoder. L_ldif is further defined as follows:
$L_{ldif} = \lambda_s L_s + \lambda_u L_u + L_c$, (3)
In Equation (3), L_s and L_u are the point sampling losses, calculated using the L2 loss between the ground-truth and predicted labels of near-surface 3D points and uniformly sampled 3D points, respectively [16]. Further, λ_s and λ_u are the weights of L_s and L_u, respectively, and L_c is the shape element loss [51].
Losses for Layout Estimation Network and 3D Object Detection Network. We adopt the same losses as [16] for the layout estimation and 3D object detection networks. Specifically, a hybrid strategy that combines a classification loss and a regression loss is used to optimize (s_l, s_o, θ_l, θ_o, d, β, γ) [22]. The classification and regression losses employ the softmax loss [55] and the smooth-L1 (Huber) loss [47], respectively. As C_l is calculated from its offset and a precomputed layout center, and δ also represents an offset, both are optimized with the L2 loss.
Joint Losses. In the refined estimation stage, a joint loss is employed that combines the layout estimation network loss L_l, the 3D object detection network loss L_d, the cooperative loss L_co, and the physical violation loss L_phy [16]. L_co describes the consistency constraints between the 3D bounding boxes projected onto the image and the 2D bounding boxes; a detailed description is available in [22]. The physical violation loss is inspired by the observation that objects in a scene should not intersect with each other; a detailed definition is provided in [16]. The joint loss L_joint is expressed as Equation (4):
$L_{joint} = L_l + L_d + \lambda_{co} L_{co} + \lambda_{phy} L_{phy}$, (4)
where λ_co and λ_phy are the weights of the cooperative and physical violation losses, respectively.
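With the weights listed in Section 4.1, Equations (2)-(4) reduce to simple weighted sums. The sketch below shows this combination; the individual loss terms are assumed to be computed elsewhere, and the function names are illustrative.

```python
def mesh_loss(l_ldif, l_point, lambda_p=100.0):
    """Equation (2): total mesh reconstruction loss L_m."""
    return l_ldif + lambda_p * l_point

def ldif_loss(l_near, l_uniform, l_element, lambda_s=0.1, lambda_u=1.0):
    """Equation (3): LDIF decoder loss from near-surface, uniformly sampled,
    and shape element terms."""
    return lambda_s * l_near + lambda_u * l_uniform + l_element

def joint_loss(l_layout, l_detection, l_cooperative, l_physical,
               lambda_co=150.0, lambda_phy=20.0):
    """Equation (4): joint loss used in the refined estimation stage."""
    return (l_layout + l_detection
            + lambda_co * l_cooperative + lambda_phy * l_physical)
```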

4. Experiments

4.1. Experimental Setup

Dataset. Following previous studies [6,16], the SUN RGB-D dataset [18] and the Pix3D dataset [19] were used in the experiments. The SUN RGB-D dataset contains 10,335 indoor scene images, 3D camera pose annotations, 3D layout bounding boxes, semantic segmentation, and labels. From this dataset, 5050 images were used for testing and 5280 for training. The Pix3D dataset comprises 10,069 real-world images and 395 indoor object models under 9 categories. These images and shapes are annotated using pixel-level 2D–3D alignment. From this dataset, 7556 images were used for training and 2513 for testing. To ensure a fair comparison, we used the same train/test splits as [16].
Implementation. In our implementation, we used the open-source repository of Implicit3D [16] as the software backbone and modified the code to fit our specific needs. Specifically, we added a depth-feature generation network and a multitask learning mechanism. The depth map estimation in our experiments was based on the implementation and pretrained model of LapDepthNet for the NYU-Depth V2 dataset. For the depth-feature generation network, we used a Resnet-18 backbone from Implicit3D with a two-layer MLP network added on top. For the mesh reconstruction network, we based our point cloud decoder on the vanilla version of the decoder in [33], implemented as a three-layer MLP network. We also used the evaluation code provided by Implicit3D. The PyTorch version was 1.7.0, and the Python version was 3.8.13. We used the 2D detection results from [16] as inputs for our networks. The output dimensions of the depth features from the depth-feature generation network were 1024 for the whole-scene depth maps and 512 for the cropped object depth maps. We first used the training method of Implicit3D to train the networks in the initial estimation stage. After fixing the network weights of the initial estimation stage, the depth-feature extraction and SGCN networks were trained jointly for 30 epochs. Finally, all networks except the mesh reconstruction network were jointly trained for 30 epochs to fine-tune their weights. During training, we set the hyperparameters λ_p = 100.0, λ_s = 0.1, λ_u = 1.0, λ_co = 150.0, and λ_phy = 20.0. The networks were trained on an Ubuntu server with six NVIDIA GeForce RTX 3090 graphics cards; during inference, only one graphics card was used.

4.2. Quantitative Results

This section presents quantitative evaluation results for layout estimation, camera pose estimation, and 3D object detection on the SUN RGB-D dataset and for mesh reconstruction on the Pix3D dataset. A comparison between the proposed method and existing methods is also presented.
Layout Estimation. The layout estimation for the SUN RGB-D dataset was evaluated using the 3D intersection over union (IoU) metric. We compared the results with previous studies [6,16,22,27], which are listed in Table 1. The proposed method achieved the best score among all the methods.
Camera Pose Estimation. Similar to previous studies [6,16,22], the mean squared error (MSE) between the estimated and ground-truth camera pitch and roll was used to evaluate the accuracy of the estimated camera pose. The results are presented in Table 1, with the best results highlighted in bold. The proposed method obtained the lowest MSE for pitch estimation among all methods but a slightly higher MSE for roll estimation.
3D Object Detection. To evaluate the 3D object detection performance, we used the mean average precision (mAP) as the evaluation metric, similar to Total3D [6] and Implicit3D [16]. The mAP is calculated from the 3D IoU between the estimated 3D bounding boxes and the ground truth; a detection is considered a true positive when the IoU is greater than 0.15. A quantitative comparison with existing methods is presented in Table 2, wherein all methods used the same evaluation metric and dataset. The results indicate that the proposed method achieved the best score in the majority of categories and the best performance on average.
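For reference, a generic sketch of the per-category average precision computation with the 0.15 3D IoU threshold is shown below (a 101-point interpolated AP; the exact matching and interpolation protocol used by [6,16] may differ):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Average precision for one category.

    scores           : (N,) detection confidence scores
    is_true_positive : (N,) 1 if the detection matches an unmatched ground-truth
                       box with 3D IoU > 0.15, else 0
    num_gt           : number of ground-truth boxes in the category
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # 101-point interpolation of precision over recall.
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 101):
        mask = recall >= t
        ap += (precision[mask].max() if mask.any() else 0.0) / 101
    return ap
```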
Mesh Reconstruction. We evaluated the mesh reconstruction performance on the Pix3D dataset using the chamfer distance (CD), a metric that computes the distance between two point clouds. We evaluated the mesh reconstruction network on 256 × 256 × 256 voxel grids and then executed the marching cubes algorithm to extract triangular meshes. The reconstructed meshes were aligned with the ground-truth meshes using the iterative closest point algorithm. A total of 10 K points were sampled from the estimated and ground-truth meshes, and the distance between the two point clouds was calculated. We conducted experiments using point cloud decoders with different numbers of output points (1024, 2048, and 4096) and compared using the CD loss and the earth mover's distance (EMD) loss for the point cloud term in Equation (2). The performance of our method is compared with that of several state-of-the-art approaches [6,14,16,56] in Table 3. The model that used the CD loss and produced 2048 output points outperformed the others on average. In Table 3, the categorical evaluation scores of [6,14,56] were obtained from Total3D [6], and the mean CD score was recalculated across all testing samples. The evaluation score of Implicit3D was taken from the Implicit3D paper, where it was calculated across the whole test set. Table 4 shows the number of samples of each category in the test set of the Pix3D dataset.
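For reference, a minimal implementation of the symmetric chamfer distance between two sampled point sets is shown below (a sketch only; the evaluation code of [16] may apply a different normalization or scaling):

```python
import torch

def chamfer_distance(p1, p2):
    """Symmetric chamfer distance between two point clouds.

    p1 : (N, 3) points sampled from the estimated mesh
    p2 : (M, 3) points sampled from the ground-truth mesh
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    dist = torch.cdist(p1, p2, p=2.0) ** 2
    # Average nearest-neighbor distance in both directions.
    return dist.min(dim=1).values.mean() + dist.min(dim=0).values.mean()
```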
Table 1. Quantitative evaluation results for layout and camera pose estimations. The bolded results represent the best scores.
Method | Cam Pitch (MSE) | Cam Roll (MSE) | 3D Layout (IoU)
CooP [27] | 3.28 | 2.19 | 56.9
HoPR [22] | 7.60 | 3.12 | 54.9
Total3D [6] | 3.15 | 2.09 | 59.2
Implicit3D [16] | 2.98 | 2.11 | 64.4
Proposed | 2.97 | 2.16 | 64.5
Table 2. Quantitative evaluation results for 3D object detection on the SUN RGB-D dataset. The bolded results represent the highest scores.
Method | Desk | Sofa | Bed | Sink | Dresser | Lamp | Chair | Cabinet | Table | Nightstand | mAP
CooP [27] | 19.90 | 36.67 | 57.71 | 15.95 | 15.98 | 3.28 | 15.21 | 10.47 | 31.16 | 11.36 | 21.77
HoPR [22] | 4.79 | 28.37 | 58.29 | 2.18 | 13.71 | 2.41 | 13.56 | 0.48 | 12.12 | 8.80 | 14.47
Total3D [6] | 27.93 | 44.90 | 60.65 | 18.50 | 21.19 | 5.04 | 17.55 | 14.51 | 36.48 | 17.01 | 26.38
Implicit3D [16] | 49.03 | 69.10 | 89.32 | 33.81 | 29.27 | 11.90 | 35.14 | 33.93 | 57.35 | 41.34 | 45.21
Proposed | 49.36 | 70.26 | 89.46 | 31.52 | 33.60 | 12.21 | 35.38 | 32.29 | 55.05 | 46.46 | 45.56
Execution Time and Network Parameters. We tested the full pipeline on 100 images from the test split of the SUN RGB-D dataset. The input images were first preprocessed using a pretrained 2D object detection network, Faster R-CNN [47], and the resulting preprocessed data were fed into our 3D reconstruction networks. The networks in the initial estimation stage and the refined estimation stage contain a total of 192.59 million parameters. As shown in Table 5, the average time taken to reconstruct a 3D scene from the preprocessed data was 3.68 s, demonstrating the efficiency of our method.

4.3. Qualitative Results

We used the pretrained model of Implicit3D to produce reconstruction results and compared them with those obtained using the proposed method. Figure 5 shows the visualization results on the SUN RGB-D test set. The first row shows the input images, and the second row shows the visualization results with the ground-truth layout and 3D object bounding boxes. The third and fourth rows show the 3D detection and layout estimation results produced by the pretrained model of Implicit3D and by the proposed method, respectively. The last two rows show the object-aware 3D scene reconstruction results produced using Implicit3D and the proposed method. We visualized the reconstruction results of the first 100 testing images and then randomly selected images containing at least two objects that were detected and reconstructed.
The accuracy of the 3D bounding box estimates is difficult to judge from the second to fourth rows alone. However, when combined with the reconstructed scenes, the proposed method shows better results than Implicit3D. For example, the wardrobe in the first column has a wrong orientation when produced using Implicit3D, whereas the proposed method estimates the correct orientation. In the second column, Implicit3D predicts the wrong orientation even for large furniture such as the bed. In the fourth column, the two farthest chairs are also incorrectly oriented when the reconstructed scene is compared with the input image. In addition, the proposed method produces more complete object meshes in the reconstructed scenes.
We also compared the reconstruction results for single objects from the Pix3D dataset. Our mesh reconstruction network consists of an LDIF decoder and a point cloud decoder. The point cloud decoder is used only during training to encourage the Resnet-18 backbone to extract features that are useful for generating complete global shapes; the LDIF decoder generates the mesh results during inference, as in [16]. In general, Implicit3D outputs complete meshes with visually indistinguishable mesh quality. However, in cases where Implicit3D fails to generate complete meshes, our mesh reconstruction results can still be complete. We visualized the reconstruction results of all test images and show several cases where Implicit3D generated broken meshes in Figure 6. The first row of the right part of Figure 6 shows that our mesh reconstruction quality is visually comparable to that of [16], but our method fails to handle details such as thin structures, as shown in the left part of the first row of Figure 6.

4.4. Discussion

In our experiments, we observed that incorporating depth features into the refined estimation stage resulted in slightly improved estimates of the camera pitch and the 3D layout bounding box (Table 1), but at the cost of slightly worse camera roll estimation. This suggests that depth features provide only a limited benefit for 3D layout and camera pose estimation. As presented in Table 2, our method achieved better evaluation scores for 7 out of 10 categories and better average results when using depth features in the refined estimation stage, which indicates that depth features can help improve 3D object bounding box estimation.
As shown in Table 3, the reconstructed meshes differ when we set different numbers of output points. When using the chamfer distance (CD) loss with 1024, 2048, and 4096 output points, the average reconstruction performance first increases and then decreases. With 1024 output points, the global shape supervision is likely insufficient, resulting in poor mesh reconstruction. With 4096 output points, the network extracts too much information about the global shape and ignores the local detail features that are necessary for accurate surface reconstruction. The experiments also show that both the CD loss and the earth mover's distance (EMD) loss can improve the reconstruction quality of the mesh reconstruction network, with the CD loss achieving a slightly better average performance than the EMD loss. All of our models except the one that outputs 4096 points achieve better average CD scores than the other methods. The best model in our experiments performs well for specific object categories, such as chairs, wardrobes, and bookcases, but performs poorly for others. This is likely because our mesh reconstruction network has difficulty handling thin structures, as demonstrated by the bed frame in the first row of Figure 6. However, our method performs better than Implicit3D for objects with large surfaces, as seen in the results on the Pix3D dataset. Additionally, when images are taken under certain constrained conditions, our method can infer more accurate global shapes than Implicit3D (Figure 7). We tested the proposed method and Implicit3D on photos captured under constrained conditions, such as a bird's-eye camera angle, dim environment illumination, texture-less backgrounds and objects, and occluded objects. The results in Figure 7 indicate that Implicit3D failed to recover the shapes accurately, whereas the proposed method recovered the shapes with satisfactory quality.
To improve the proposed method’s performance on a wider range of objects, future work could focus on extending the method to handle unseen object categories more effectively. Currently, the method relies on the object categories that have been learned using the mesh reconstruction network. To improve the performance in novel object categories, it may be necessary to develop new techniques for transferring knowledge from known categories or for learning more flexible models that can adapt to a wider range of objects.

5. Conclusions

In this paper, we proposed an object-aware 3D scene reconstruction network that jointly estimates the camera pose, 3D layout, 3D object poses, and object shapes. We introduced a depth-feature generation network in the refined estimation stage to address the depth ambiguity issue in 2D-to-3D understanding, and we applied multitask learning to the mesh reconstruction network to obtain more complete meshes. We evaluated the proposed method on the real-world SUN RGB-D and Pix3D datasets and compared the results with state-of-the-art methods. Our experimental results on the SUN RGB-D dataset showed that the method improves 3D object bounding box estimation for the majority of object categories. The mesh reconstruction quality on the Pix3D dataset demonstrated that the proposed multitask learning-based mesh reconstruction network is beneficial for complete shape estimation. One limitation of this study is that the current mesh reconstruction network can only estimate shapes for objects in the learned object categories. In future work, the proposed method will be extended to more general object reconstruction tasks.

Author Contributions

Conceptualization, M.W., K.C.; methodology, software, validation, writing—original draft preparation, M.W.; writing—review and editing, M.W., K.C.; supervision, project administration, funding acquisition, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (2022R1A2C2006864).

Data Availability Statement

SUN RGB-D [18] and Pix3D [19] datasets are used in this study. The datasets can be found here: https://rgbd.cs.princeton.edu (accessed on 14 November 2022) and http://pix3d.csail.mit.edu (accessed on 14 November 2022), respectively.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Manni, A.; Oriti, D.; Sanna, A.; De Pace, F.; Manuri, F. Snap2cad: 3D indoor environment reconstruction for AR/VR applications using a smartphone device. Comput. Graph. 2021, 100, 116–124.
  2. Ferdani, D.; Fanini, B.; Piccioli, M.C.; Carboni, F.; Vigliarolo, P. 3D reconstruction and validation of historical background for immersive VR applications and games: The case study of the Forum of Augustus in Rome. J. Cult. Herit. 2020, 43, 129–143.
  3. Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the Conference on Robot Learning, PMLR, London, UK, 8–11 November 2022; pp. 180–191.
  4. Hu, H.N.; Yang, Y.H.; Fischer, T.; Darrell, T.; Yu, F.; Sun, M. Monocular quasi-dense 3d object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1.
  5. Saito, S.; Simon, T.; Saragih, J.; Joo, H. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 84–93.
  6. Nie, Y.; Han, X.; Guo, S.; Zheng, Y.; Chang, J.; Zhang, J.J. Total3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 55–64.
  7. Bozic, A.; Palafox, P.; Thies, J.; Dai, A.; Nießner, M. Transformerfusion: Monocular rgb scene reconstruction using transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 1403–1414.
  8. Sun, J.; Xie, Y.; Chen, L.; Zhou, X.; Bao, H. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15598–15607.
  9. Peng, S.; Niemeyer, M.; Mescheder, L.; Pollefeys, M.; Geiger, A. Convolutional occupancy networks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 523–540.
  10. Denninger, M.; Triebel, R. 3d scene reconstruction from a single viewport. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 51–67.
  11. Michalkiewicz, M.; Parisot, S.; Tsogkas, S.; Baktashmotlagh, M.; Eriksson, A.; Belilovsky, E. Few-shot single-view 3-d object reconstruction with compositional priors. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 614–630.
  12. Du, Y.; Liu, Z.; Basevi, H.; Leonardis, A.; Freeman, B.; Tenenbaum, J.; Wu, J. Learning to exploit stability for 3d scene parsing. Adv. Neural Inf. Process. Syst. 2018, 31, 1733–1743.
  13. Chen, Y.; Huang, S.; Yuan, T.; Qi, S.; Zhu, Y.; Zhu, S.C. Holistic++ scene understanding: Single-view 3d holistic scene parsing and human pose estimation with human-object interaction and physical commonsense. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8648–8657.
  14. Pan, J.; Han, X.; Chen, W.; Tang, J.; Jia, K. Deep mesh reconstruction from single rgb images via topology modification networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9964–9973.
  15. Xu, Q.; Wang, W.; Ceylan, D.; Mech, R.; Neumann, U. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. Adv. Neural Inf. Process. Syst. 2019, 32, 490–500.
  16. Zhang, C.; Cui, Z.; Zhang, Y.; Zeng, B.; Pollefeys, M.; Liu, S. Holistic 3d scene understanding from a single image with implicit representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8833–8842.
  17. Weng, Z.; Yeung, S. Holistic 3d human and scene mesh estimation from single view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 334–343.
  18. Song, S.; Lichtenberg, S.P.; Xiao, J. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 567–576.
  19. Sun, X.; Wu, J.; Zhang, X.; Zhang, Z.; Zhang, C.; Xue, T.; Tenenbaum, J.B.; Freeman, W.T. Pix3d: Dataset and methods for single image 3d shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2974–2983.
  20. Pintore, G.; Mura, C.; Ganovelli, F.; Fuentes-Perez, L.; Pajarola, R.; Gobbetti, E. State-of-the-art in Automatic 3D Reconstruction of Structured Indoor Environments. In Proceedings of the Computer Graphics Forum; Wiley Online Library: New York, NY, USA, 2020; Volume 39, pp. 667–699.
  21. Choi, W.; Chao, Y.W.; Pantofaru, C.; Savarese, S. Understanding indoor scenes using 3d geometric phrases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 33–40.
  22. Huang, S.; Qi, S.; Xiao, Y.; Zhu, Y.; Wu, Y.N.; Zhu, S.C. Cooperative holistic scene understanding: Unifying 3d object, layout, and camera pose estimation. Adv. Neural Inf. Process. Syst. 2018, 31, 206–217.
  23. Mallya, A.; Lazebnik, S. Learning informative edge maps for indoor scene layout prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, CL, USA, 7–13 December 2015; pp. 936–944.
  24. Dasgupta, S.; Fang, K.; Chen, K.; Savarese, S. Delay: Robust spatial layout estimation for cluttered indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 616–624.
  25. Ren, Y.; Li, S.; Chen, C.; Kuo, C.C.J. A coarse-to-fine indoor layout estimation (cfile) method. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 36–51.
  26. Izadinia, H.; Shan, Q.; Seitz, S.M. Im2cad. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5134–5143.
  27. Huang, S.; Qi, S.; Zhu, Y.; Xiao, Y.; Xu, Y.; Zhu, S.C. Holistic 3d scene parsing and reconstruction from a single rgb image. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 187–203.
  28. Avetisyan, A.; Dahnert, M.; Dai, A.; Savva, M.; Chang, A.X.; Nießner, M. Scan2cad: Learning cad model alignment in rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2614–2623.
  29. Kuo, W.; Angelova, A.; Lin, T.Y.; Dai, A. Mask2cad: 3d shape prediction by learning to segment and retrieve. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 260–277.
  30. Engelmann, F.; Rematas, K.; Leibe, B.; Ferrari, V. From points to multi-object 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4588–4597.
  31. Kuo, W.; Angelova, A.; Lin, T.Y.; Dai, A. Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 12589–12599.
  32. Gümeli, C.; Dai, A.; Nießner, M. ROCA: Robust CAD Model Retrieval and Alignment from a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4022–4031.
  33. Fan, H.; Su, H.; Guibas, L.J. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 605–613.
  34. Achlioptas, P.; Diamanti, O.; Mitliagkas, I.; Guibas, L. Learning representations and generative models for 3d point clouds. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 40–49. Available online: http://proceedings.mlr.press/v80/achlioptas18a.html (accessed on 12 November 2022).
  35. Li, L.; Khan, S.; Barnes, N. Silhouette-assisted 3d object instance reconstruction from a cluttered scene. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 27–28 October 2019.
  36. Kundu, A.; Li, Y.; Rehg, J.M. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3559–3568.
  37. Tulsiani, S.; Gupta, S.; Fouhey, D.F.; Efros, A.A.; Malik, J. Factoring shape, pose, and layout from the 2d image of a 3d scene. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 302–310.
  38. Gkioxari, G.; Malik, J.; Johnson, J. Mesh r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 9785–9795.
  39. Riegler, G.; Ulusoy, A.O.; Geiger, A. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3577–3586.
  40. Tatarchenko, M.; Dosovitskiy, A.; Brox, T. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2088–2096.
  41. Wang, P.S.; Sun, C.Y.; Liu, Y.; Tong, X. Adaptive O-CNN: A patch-based deep representation of 3D shapes. ACM Trans. Graph. (TOG) 2018, 37, 1–11.
  42. Wang, N.; Zhang, Y.; Li, Z.; Fu, Y.; Liu, W.; Jiang, Y.G. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 52–67.
  43. Chen, Z.; Zhang, H. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5939–5948.
  44. Pavllo, D.; Spinks, G.; Hofmann, T.; Moens, M.F.; Lucchi, A. Convolutional generation of textured 3d meshes. Adv. Neural Inf. Process. Syst. 2020, 33, 870–882.
  45. Park, J.J.; Florence, P.; Straub, J.; Newcombe, R.; Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 165–174.
  46. Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4460–4470.
  47. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, CL, USA, 7–13 December 2015; pp. 1440–1448.
  48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  49. Chen, P.Y.; Liu, A.H.; Liu, Y.C.; Wang, Y.C.F. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2624–2632.
  50. He, L.; Lu, J.; Wang, G.; Song, S.; Zhou, J. SOSD-Net: Joint semantic object segmentation and depth estimation from monocular images. Neurocomputing 2021, 440, 251–263.
  51. Genova, K.; Cole, F.; Sud, A.; Sarna, A.; Funkhouser, T. Local deep implicit functions for 3d shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4857–4866.
  52. Lorensen, W.E.; Cline, H.E. Marching cubes: A high resolution 3D surface construction algorithm. ACM Siggraph Comput. Graph. 1987, 21, 163–169.
  53. Song, M.; Lim, S.; Kim, W. Monocular depth estimation using laplacian pyramid-based depth residuals. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4381–4393.
  54. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 746–760.
  55. Groueix, T.; Fisher, M.; Kim, V.G.; Russell, B.C.; Aubry, M. A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 216–224.
  56. Bridle, J. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. Adv. Neural Inf. Process. Syst. 1989, 2, 211–217.
Figure 1. Pipeline of object-aware 3D scene reconstruction utilized in this study. The object-aware 3D scene reconstruction follows a coarse-to-fine manner. First, the input image is preprocessed to generate 2D bounding boxes. The initial estimation stage outputs coarse estimations, and then the refined estimation stage refines these estimates by predicting residual errors.
Figure 2. Overview of the proposed method. The input image is first preprocessed using a 2D object detector which outputs the 2D bounding boxes and categories of objects depicted in the input image. Then the 3D bounding boxes of layout and objects and 3D meshes are estimated in the initial estimation stage and further refined in the refined estimation stage.
Figure 3. Architecture of the mesh reconstruction network. The mesh reconstruction network outputs shapes in two different representations to encourage ResNet and MLP to extract features that contain the necessary information for both representations.
Figure 4. Architecture of the depth-feature generation network. The depth-feature generation network consists of two parts: LapDepthNet, which generates depth images from the input image, and a depth feature extraction network, which extracts depth features from the estimated depth image. The depth feature extraction network utilizes a ResNet as its backbone, followed by multiple MLP layers. While the design of the global and object depth image feature extraction networks is the same, they use different weights.
Figure 5. Detection and reconstruction results of various methods on the SUN RGB-D dataset [18]. The top row shows the input image. The second to fourth rows display the 3D bounding boxes obtained from ground truth, Implicit3D [16], and the proposed method. The fifth and sixth rows show the reconstructed 3D scenes using Implicit3D [16] and the proposed method, respectively.
Figure 6. Mesh reconstruction results for the Pix3D dataset. (a,d) input images; (b,e) rendered images of reconstructed meshes produced using Implicit3D; (c,f) rendered images of reconstructed meshes produced using the proposed method.
Figure 7. Mesh reconstruction results of single images captured under constrained conditions. (a,d) input images; (b,e) rendered images of reconstructed meshes produced using Implicit3D; (c,f) rendered images of reconstructed meshes produced using the proposed method.
Table 3. Evaluation results of mesh reconstruction on the Pix3D dataset. The bolded results represent the best scores.
Categories | Desk | Table | Bed | Chair | Wardrobe | Bookcase | Sofa | Tool | Misc. | Mean
TMN [14] | 7.08 | 17.42 | 7.78 | 6.86 | 4.09 | 5.93 | 4.25 | 4.13 | 23.68 | 8.43
AtlasNet [56] | 8.59 | 19.46 | 9.03 | 8.37 | 4.78 | 6.91 | 6.24 | 6.95 | 40.05 | 10.17
Total3D [6] | 5.93 | 14.19 | 5.99 | 5.32 | 3.83 | 6.56 | 3.36 | 3.12 | 26.93 | 6.84
Implicit3D [16] | 7.85 | 11.73 | 4.11 | 5.45 | 4.31 | 3.96 | 5.61 | 2.39 | 24.65 | 6.72
Proposed-2048 pts (EMD) | 10.59 | 12.13 | 5.32 | 5.03 | 3.17 | 5.00 | 3.54 | 6.88 | 21.61 | 6.55
Proposed-1024 pts (CD) | 10.27 | 13.79 | 4.84 | 5.07 | 2.96 | 3.90 | 3.27 | 4.56 | 22.71 | 6.71
Proposed-4096 pts (CD) | 10.37 | 13.34 | 4.96 | 5.78 | 3.13 | 3.71 | 3.31 | 3.41 | 31.33 | 6.97
Proposed-2048 pts (CD) | 9.80 | 12.08 | 4.45 | 5.24 | 3.26 | 3.57 | 3.58 | 4.51 | 26.71 | 6.46
Table 4. The number of samples in the test set of Pix3D used for evaluation.
Categories | Desk | Table | Bed | Chair | Wardrobe | Bookcase | Sofa | Tool | Misc. | Total
Number of samples | 175 | 467 | 248 | 959 | 60 | 90 | 486 | 11 | 17 | 2513
Table 5. Execution time and network parameters of the proposed method.
Method | Network Parameters | Average Execution Time
Proposed | 192.59 million | 3.68 s