Article

3D City Reconstruction: A Novel Method for Semantic Segmentation and Building Monomer Construction Using Oblique Photography

1 School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
2 Changsha Urban Planning Information Service Center, Changsha 410083, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8795; https://doi.org/10.3390/app13158795
Submission received: 20 June 2023 / Revised: 19 July 2023 / Accepted: 27 July 2023 / Published: 30 July 2023

Abstract: Existing 3D city reconstruction via oblique photography can only produce surface models, lacking semantic information about the urban environment and the ability to distinguish individual buildings. Here, we propose a method for the semantic segmentation of 3D model data from oblique photography and for building monomer construction and implementation. Mesh data were converted into and mapped as point sets, which were clustered into superpoint sets via rough geometric segmentation, facilitating subsequent feature extraction. In the local neighborhood computation of semantic segmentation, a neighborhood search method based on geodesic distances improved the rationality of the neighborhood. In addition, feature information was retained via the superpoint sets. Considering the practical requirements of large-scale 3D datasets, this study offers a robust and efficient segmentation method that combines traditional random forest and Markov random field models to segment 3D scene semantics. To address the need for modeling individual and unique buildings, our methodology utilized the 3D mesh data of buildings as the data source for contour extraction. Model monomer construction and building contour extraction were based on mesh model slices and assessments of geometric similarity, which allowed these two processes to be achieved simultaneously and automatically.

1. Introduction

Three-dimensional (3D) city reconstruction is a popular research topic in virtual geographical environments, photogrammetry, and computer vision. As an effective means for the multidimensional presentation of urban elements, 3D city reconstruction has broad application prospects in fields such as natural resource supervision, urban planning, engineering simulations for construction purposes, and the scheduling of emergency protocols. With the recent development of unmanned aerial vehicle platforms, remote sensors, graphic image processing, 3D visualization [1], and other technologies, oblique photography has gradually become the mainstream method for 3D modeling and is widely used in large-scale 3D city reconstructions [2,3].
Oblique photography 3D reconstruction has the advantages of fast data acquisition, low production costs, a high degree of automation, and a rich expression of results; however, a common disadvantage is the lack of semantic information in such models. Computers cannot automatically identify and distinguish between the ground, trees, buildings, roads, and other objects in a scene, i.e., it is difficult to perform monomer management of ground object elements. Therefore, a 3D model of a city as reconstructed via oblique photography can traditionally be used only for simple displays, and a connection between this virtual scene and the real-world parameters cannot be established through semantic information [4]. Increasing attention is being paid to the semantic classification and segmentation of 3D scenes and models to remedy this situation. Proposed methods include semantic classification based on manual annotation [5,6], a 3D point cloud classification based on deep learning [7,8,9,10,11], semantic segmentation of imagery based on dimension reduction [12,13], and a semantic classification based on 3D voxels [14,15]. However, mature and effective theoretical methods, technical systems, and software platforms have not yet been developed for semantic segmentation of urban 3D scenes acquired via oblique photography.
Due to its modeling principles and technical limitations, a model constructed using oblique photography presents a continuous whole, such as a triangular mesh. This is commonly known as a “surface” model, which cannot distinguish between individual buildings in geometric space and does not allow the objectified management of such models. To address this challenge, monomer processing is often conducted in subsequent stages [16,17] via physical cutting [18], dynamic labeling based on 2D vector graphics, manual ID labeling, or semiautomatic model replacement. However, in practice, the effect of these methods is not ideal: they require considerable manual intervention or rely on third-party data sources (such as planning or construction management data related to building outlines). That is, without building outline data, it is difficult to satisfy the requirements of objectification and a high degree of automation. Developing effective automated monomer processing for 3D models established via oblique photography is critical for achieving these goals.
This study explored methods for semantic classification, automatic individualization, and contour extraction using 3D model data from oblique photography. It proposes corresponding implementation paths and specific algorithm procedures and conducts experimental analysis. This research aimed to explore a beneficial method for processing and applications in urban 3D reconstruction using oblique photography. This study reflects innovation in the point set processing of oblique photography 3D data, superpoint construction, and feature fusion. A unique method for elevation plane cutting and polygon similarity determination is proposed for building model monomers, which can simultaneously achieve model monomer and contour extraction without relying on any third-party data.

2. Materials and Methods

2.1. Oblique Photography 3D Semantic Segmentation

Implementation Pathway for the Semantic Classification of 3D Mesh Data

The basic principles and objectives of semantic segmentation of 3D data are consistent with those valid for 2D image data [19,20,21,22,23,24,25]. However, unlike raster 2D images, urban 3D models are characterized by spatial dispersion, rich information, and large data scales. It is necessary to consider the characteristics of urban data and research the corresponding solutions in data organization, feature selection, and spatial retrieval for the semantic segmentation of 3D model mesh data from oblique photography. This study considered the following three key problems:
  • The organization and formatting of 3D data. Each minimum semantic unit needs to be marked and classified. For the 3D modeling results of oblique photography, the data units that can be used for semantic segmentation comprise triangular vertices and facets. Data can also be reorganized to meet the needs of the semantic segmentation of large-scale 3D urban scenes;
  • The classification and calculation of features. Feature selection is critical and normally performed manually; however, features can also be computed via fusion. Overall, the features of 3D data can be categorized into textural, color, spatial geometric, and fusion features. Another key issue in feature calculation is identifying and selecting the feature neighborhood space;
  • Classifier training and classification optimization. The last step of semantic segmentation is classification calculations. Existing machine learning and deep learning algorithms train the classifier using sample data and then classify the test data [26]. An effective classifier must be selected to optimize classification by considering the global continuity of geospatial data.
Based on the three challenges above, this study proposes the following path for 3D semantic segmentation of oblique photographs (Figure 1).
The main processes are as follows:
  • Step 1: Mapping an oblique photography 3D mesh model to point cloud data. To facilitate machine learning calculations, we performed point aggregation mapping of a 3D mesh model of oblique photography to transform the 3D mesh model into a point cloud model. In point cluster mapping, more abundant spatial location data are obtained while retaining topological relationships between points in the mesh. As a result, this type of point set provides richer semantic information than conventional point cloud data;
  • Step 2: Rough geometric segmentation of point cloud data based on spatial clustering. The 3D point dataset obtained in Step 1 was mapped and divided into different geometric regions. Each region is called a superpoint, and all the points that constitute the superpoint possess the same semantic attributes. A spatial clustering method based on regional growth was adopted to conduct the subsequent semantic segmentation based on superpoints, reducing the overall computational complexity;
  • Step 3: Point cloud spatial neighborhood search and geometric feature calculation. Feature extractions were performed for the adjacent 3D point regions for which the neighborhood search methodology is key. This study proposes using a geodesic distance calculation based on a surface model that considers the topological constraints of the grid (rendering it a more reasonable method than one that employs Euclidean distances);
  • Step 4: Clustering center feature solving. Feature extraction and computation are prerequisites for semantic segmentation. In this study, the geometric features of individual points were first obtained, and then feature fusion was performed on superpoints based on the features of individual points. Compared to individual points, the feature dimension of superpoints was increased.
  • Step 5: Semantic segmentation calculation. Semantic segmentation depends on the characteristics of the classifier. To conduct feature extraction, the 3D data need to define the characteristics of each object with a segmentation probability. This study proposed dividing the 3D scene into four categories: “ground”, “building”, “trees”, and “other”. We employed Markov random field and energy minimization to optimize the classification globally to sharpen category boundaries and render the local neighborhood classification as smoothly as possible.

2.2. Point Set Mapping and Point Cloud Geometric Segmentation

2.2.1. Mapping the Mesh Model to a Point Set

The spatial positions of the vertices of a triangular mesh representing a continuous 3D space determine the geometric information of the triangular surface, including the location, area, normal vector, and texture information. When extracting spatial features, each triangular surface is therefore characterized by the information from its three vertices. In geometric space, the processing of points is easier and more direct than that of surfaces in terms of computational complexity. To facilitate calculations, this study selected the vertices of triangular facets as computational units (Figure 2). The vertices of numerous triangular facets constitute a point aggregation model, thus transforming the semantic segmentation of a triangular mesh into the semantic segmentation of a point cloud. In contrast to a traditional LiDAR point cloud, the point cloud obtained through the transformation of mesh data possesses spatial geometry, texture, and color information while retaining the topological connections between points, which is of great value in describing local features.
Considering the textural information of the vertices from the oblique photography 3D model, the data model following transformation and reconstruction is described as follows: In the continuous 3D space, the point cloud of the triangle model is expressed as S = {F, P, T}, where F is the collection of facets that constitute the triangulation network, P is the collection of vertices that constitute the triangulation network, and T is the topological relationship between the triangles. This can be expressed as F = {Fi | i = 1, 2, 3, …, n} and Fi = {Pi1, Pi2, Pi3}, where Pi1, Pi2, and Pi3 are the three vertices that constitute the triangular facet Fi. The vertex feature $f(P_i)$ includes coordinate and texture information: $f(P_i) = \{x_{P_i}, y_{P_i}, z_{P_i}, r_{P_i}, g_{P_i}, b_{P_i}, N_{P_i}\}$, where $x_{P_i}, y_{P_i}, z_{P_i}$ are the coordinate values of the vertex, $r_{P_i}, g_{P_i}, b_{P_i}$ are the texture color values of the vertex, and $N_{P_i}$ is the set of points connected to the vertex $P_i$. The subsequent rough data segmentation utilizes the topological relationship T between triangular vertices.
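As a minimal illustration of this mapping (the function and argument names below are ours, and per-vertex colors are assumed to be available from the model texture), the following Python sketch converts the vertex arrays of a mesh into the per-point features and adjacency sets described above:

```python
import numpy as np

def mesh_to_point_set(vertices, colors, faces):
    """Map a triangular mesh S = {F, P, T} to a point set with features.

    vertices: (n, 3) float array of vertex coordinates (x, y, z)
    colors:   (n, 3) array of per-vertex texture colors (r, g, b)
    faces:    (m, 3) int array of vertex indices per triangular facet
    Returns per-point feature rows (x, y, z, r, g, b) plus the adjacency
    sets N_{P_i}, which preserve the topological relationship T.
    """
    features = np.hstack([vertices, colors.astype(np.float64)])
    adjacency = [set() for _ in range(len(vertices))]
    for i, j, k in faces:            # each triangle contributes three edges
        adjacency[i].update((j, k))
        adjacency[j].update((i, k))
        adjacency[k].update((i, j))
    return features, adjacency
```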

2.2.2. Geometric Segmentation of the Point Cloud via Spatial Clustering

The 3D mesh data from oblique photography were converted to a point cloud model (Section 2.2.1). To reduce the computational time during subsequent machine learning and improve the model’s scalability, the point cloud was divided into different subsets via clustering, with points in the same regional subset possessing homogeneous semantics. Following the terminology of Landrieu [27,28], each subset was labeled a “superpoint”. Compared to the unprocessed point cloud data, the data after clustering contained more information, such as cluster size and the relationship between classes. Considering the large number of planar regions in a 3D urban landscape, this study adopted a plane detection method based on region growth to conduct a coarse segmentation clustering of point clouds, in which each point set is regarded as belonging to a specific plane as a cluster. This method assumes that points in the same plane region of clustering have similar normal vectors [29]. The following iterative approach was adopted:
  • Step 1: An available seed point is selected from the dataset, and its neighborhood space is determined;
  • Step 2: The neighborhood that meets the regional plane detection rules is included in the region;
  • Step 3: Steps 1 and 2 are repeated for all neighborhoods;
  • Step 4: A new region is created until no neighborhoods meet the 3D rules of regional plane detection.
Our plane detection based on 3D points adopted a plane-fitting method based on principal component analysis. When the deviation between a point’s normal vector and the normal of the regional plane is less than a threshold value, and the point’s distance from the plane is less than a set value, that point is included in the regional plane. Each planar region is represented by a superpoint element. This procedure is sketched below.
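A simplified Python sketch of the region-growing clustering follows; per-point normals are assumed to be precomputed (e.g., via the PCA of Section 2.3.2), the thresholds are illustrative, and for brevity the regional plane is approximated by the seed point and its normal rather than re-fitted at each step:

```python
import numpy as np

def region_grow_planes(points, normals, adjacency,
                       angle_thresh_deg=10.0, dist_thresh=0.1):
    """Coarse plane segmentation by region growing; each region is a superpoint.

    A neighbor joins the current region when its normal deviates from the
    seed normal by less than angle_thresh_deg and it lies within dist_thresh
    of the seed's plane (a simplification of a full PCA plane fit).
    """
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    labels = np.full(len(points), -1, dtype=int)   # -1 means unassigned
    region_id = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue                               # already in a region
        labels[seed] = region_id
        stack = [seed]
        while stack:                               # grow the region outward
            cur = stack.pop()
            for nb in adjacency[cur]:
                if labels[nb] != -1:
                    continue
                same_dir = abs(np.dot(normals[nb], normals[seed])) > cos_thresh
                plane_d = abs(np.dot(points[nb] - points[seed], normals[seed]))
                if same_dir and plane_d < dist_thresh:
                    labels[nb] = region_id
                    stack.append(nb)
        region_id += 1                             # start a new region
    return labels
```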

2.3. Neighborhood Search and Feature Extraction

2.3.1. K-Neighborhood Search Based on Geodesic Distances

An effective neighborhood search is the key to feature extraction from a continuous space. For a given 3D point set, neighborhoods are typically defined using one of the following three approaches [30]: the spherical [31], cylindrical [18], and k-neighborhood [32] methods. The selection of suitable radii for spherical and cylindrical neighborhoods depends on the empirical judgment of different datasets. The k-neighborhood exhibits better flexibility when the point cloud density changes and is more suitable for the 3D urban environment, in which point cloud density varies significantly. As a result, we utilized the k-neighborhood method to determine the vertex neighborhood of each triangular facet.
The basic principle of this method is to calculate the Euclidean distances between a given point and other points, then sort them in ascending order. The so-called “k points” located closest to the examined point are the k-neighborhood points. However, such a direct use of Euclidean distances would abandon the plane topology constraint, and it is more reasonable to identify the similarity of samples in a manifold space. Therefore, we used the geodesic distances based on the surface model to quantify distances in the k-neighborhood.
The key to calculating the geodesic distance between two points on the surface of a triangular grid model is to identify the geodesic. In this study, the geodesic refers to the shortest path between two points on the surface of a 3D model and is, therefore, limited to that surface. Geodesic distance calculations can be divided into accurate and approximate methods. The precise methods pose high requirements regarding model topology, often require many calculations, and can be extremely time-consuming. Considering the time cost, computational accuracy, and model topology involved, we selected the Dijkstra shortest-path algorithm to approximate the geodesic distance between two vertices. The vertices of the triangles represent the nodes of the Dijkstra graph, and the lengths of the edges connecting the points are the weights of the edges of the Dijkstra graph, as illustrated by the numbers in Figure 3. The shortest route between the source point S and target point E was the line segment set {SP2, P2P6, P6E}, as shown by the blue lines in Figure 3, and the shortest geodesic distance was 15.
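Under the assumption that the point set and edge adjacency come from the mapping of Section 2.2.1, the sketch below shows how such a geodesic k-neighborhood can be retrieved with a Dijkstra search that stops once k vertices are settled (`geodesic_knn` is an illustrative name):

```python
import heapq
import numpy as np

def geodesic_knn(points, adjacency, source, k):
    """k-neighborhood of `source` under Dijkstra geodesic distances.

    Edge weights are Euclidean lengths of mesh edges, so shortest paths
    are constrained to the model surface instead of cutting through space.
    Returns up to k (vertex index, geodesic distance) pairs.
    """
    dist = {source: 0.0}
    visited = set()
    heap = [(0.0, source)]
    settled = []
    while heap and len(settled) < k + 1:    # +1: the source pops first
        d, u = heapq.heappop(heap)
        if u in visited:
            continue                        # stale heap entry
        visited.add(u)
        settled.append((u, d))
        for v in adjacency[u]:
            nd = d + float(np.linalg.norm(points[u] - points[v]))
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return settled[1:]                      # drop the source itself
```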

2.3.2. Geometric Feature Extraction of 3D Points

Each 3D point in this study contained both geometric and textural features. The four geometric features selected for extraction were the local planarity (fp), local horizontality (fv), local sphericity (fs), and relative elevation (fz). Using these values, a 4-dimensional vector $V(p_i) = \{f_p(p_i), f_v(p_i), f_s(p_i), f_z(p_i)\}$ can be constructed for each point, constituting the feature space of the point. The calculation of each variable is described below, and a combined computational sketch follows the list:
  • Local planarity (fp). Local planarity can be expressed as the flatness of the neighborhood space and is an important feature of many 3D urban objects, such as building façades, pitched roofs, and flat ground, all of which exhibit good local planarity. In the k-neighborhood space, local flatness can be calculated from the sum of the squares of the distances from all points in the neighborhood space to the optimal fitting plane of the local neighborhood. Therefore, the key is determining the neighborhood’s optimal fitting plane. This problem can be solved by establishing a neighborhood covariance matrix and estimating the local surface normal vectors based on its eigenvectors and eigenvalues [33,34]. Given a 3D space point pi, with its k-neighborhood space being N(pi) and the centroid of its neighborhood space being $\bar{p}$, the 3 × 3 covariance matrix (C) is calculated as follows:
    $$C = \frac{1}{k}\sum_{p_j \in N(p_i)} (p_j - \bar{p})(p_j - \bar{p})^{T}.\tag{1}$$
    Following Equation (1), C is a symmetric, positive semi-definite matrix. It has three orthogonal eigenvectors, 𝑣0, 𝑣1, and 𝑣2, which correspond to three non-negative eigenvalues, λ0, λ1, and λ2, respectively. If λ2 ≥ λ1 ≥ λ0 ≥ 0 is assumed, the minimum eigenvalue λ0 corresponds to the eigenvector 𝑣0, which approximately represents the normal vector of the optimal fitting plane in the neighborhood space [35]. As such, λ0 represents the deviation from the local plane along the surface normal vector 𝑣0. Therefore, we can define the flatness metric at pi according to Equation (2) [36]:
    $$f_p(p_i) = 1 - \frac{3\lambda_0}{\lambda_0 + \lambda_1 + \lambda_2};\tag{2}$$
  • Local horizontality (fv). Horizontal features are more evident on flat roofs and urban roads. Horizontality can be measured (Equation (3)) as the absolute value of the cosine between the local normal vector $v_0$ of point pi and the vertical z-axis. A value close to 1 indicates a horizontal surface, whereas a value close to zero indicates a vertical plane:
    $$f_v(p_i) = \left| v_0 \cdot n_z \right|,\quad n_z = \langle 0, 0, 1 \rangle;\tag{3}$$
  • Local sphericity (fs). Sphericity determines whether geometric shapes within a local area resemble a ball shape, typical for tree crowns. Spherical properties can be calculated from the three eigenvalues λ2 ≥ λ1 ≥ λ0 ≥ 0 defined previously; the larger the value obtained with Equation (4), the stronger the spherical properties:
    $$f_s(p_i) = \frac{\lambda_0}{\lambda_2};\tag{4}$$
  • Relative elevation (fz). Different ground object types are linked to certain height ranges in urban environments. Owing to topographic fluctuations, absolute elevation lacks a unified differential expression. Therefore, a relative elevation was defined in this study to describe the elevation information of each object, calculated as follows:
    $$f_z(p_i) = \frac{z_i - z_{min}}{z_{max} - z_{min}},\tag{5}$$
    where $z_i$ is the elevation of point pi, and $z_{max}$ and $z_{min}$ are the maximum and minimum elevation values of all points in the neighborhood space.
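The four features can be computed together from the neighborhood covariance eigendecomposition, as in the following sketch (degenerate neighborhoods with zero eigenvalue sums or zero elevation ranges are ignored for brevity):

```python
import numpy as np

def point_features(p_idx, points, neighbor_idx):
    """Compute (f_p, f_v, f_s, f_z) for one point from its k-neighborhood.

    points:       (n, 3) array of all point coordinates
    neighbor_idx: indices of the k-neighborhood N(p_i), e.g., from the
                  geodesic search of Section 2.3.1
    """
    nbrs = points[neighbor_idx]
    centroid = nbrs.mean(axis=0)
    diff = nbrs - centroid
    C = diff.T @ diff / len(nbrs)              # covariance matrix, Equation (1)
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending: l0 <= l1 <= l2
    l0, l1, l2 = eigvals
    normal = eigvecs[:, 0]                     # eigenvector of smallest eigenvalue
    f_p = 1.0 - 3.0 * l0 / (l0 + l1 + l2)      # local planarity, Equation (2)
    f_v = abs(normal @ np.array([0.0, 0.0, 1.0]))  # horizontality, Equation (3)
    f_s = l0 / l2                              # local sphericity, Equation (4)
    z = nbrs[:, 2]
    f_z = (points[p_idx, 2] - z.min()) / (z.max() - z.min())  # Equation (5)
    return np.array([f_p, f_v, f_s, f_z])
```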

2.3.3. Feature Fusion and Extraction of Superpoints

Feature fusion and extraction were conducted on each superpoint. The fusion features are the averages of all 3D point features contained in the superpoint cluster, i.e., the average local planarity $\bar{f_p}(s_j)$, average local horizontality $\bar{f_v}(s_j)$, average local sphericity $\bar{f_s}(s_j)$, and average relative elevation $\bar{f_z}(s_j)$. The extracted features were obtained by calculating the statistical features of the 3D points within the superpoint. All points within a superpoint are considered a local neighborhood space, and the eigenvalues of the local neighborhood covariance matrix were calculated as λs2 ≥ λs1 ≥ λs0 ≥ 0. Three geometric features were defined for each superpoint: the length $f_l(s_j)$, area $f_a(s_j)$, and volume $f_v(s_j)$, calculated as $f_l(s_j) = \lambda_{s2}$, $f_a(s_j) = \lambda_{s1}\lambda_{s2}$, and $f_v(s_j) = \lambda_{s0}\lambda_{s1}\lambda_{s2}$. Thus, each superpoint was ultimately characterized by a 7-dimensional vector: $V(s_j) = \{\bar{f_p}(s_j), \bar{f_v}(s_j), \bar{f_s}(s_j), \bar{f_z}(s_j), f_l(s_j), f_a(s_j), f_v(s_j)\}$.
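A sketch of this fusion step, assuming the per-point features of Section 2.3.2 are already available for the members of one superpoint:

```python
import numpy as np

def fuse_superpoint_features(sp_points, sp_point_features):
    """Build the 7-dimensional superpoint vector from its member points.

    sp_points:         (n, 3) coordinates of the points in one superpoint
    sp_point_features: (n, 4) per-point (f_p, f_v, f_s, f_z) vectors
    """
    mean_feats = sp_point_features.mean(axis=0)  # averaged f_p, f_v, f_s, f_z
    centroid = sp_points.mean(axis=0)
    diff = sp_points - centroid
    C = diff.T @ diff / len(sp_points)           # superpoint covariance matrix
    ls0, ls1, ls2 = np.linalg.eigvalsh(C)        # ascending eigenvalues
    f_len = ls2                                  # length feature
    f_area = ls1 * ls2                           # area feature
    f_vol = ls0 * ls1 * ls2                      # volume feature
    return np.concatenate([mean_feats, [f_len, f_area, f_vol]])
```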

2.4. Random Forest Semantic Segmentation

The basic idea of 3D semantic segmentation is to calculate the probability of an object belonging to each classification label and then assign the label attaining the highest probability value. This can be conducted via supervised classification based on machine learning, with the trained model acting as the classifier. For a given feature vector x, the predicted category label y is assigned with conditional probability p(y|x). Therefore, our primary objective was to select an appropriate classifier. We opted to use a random forest as the classifier for 3D urban landscapes with large amounts of data, as it has the advantages of parallel computation [37], support for high-dimensional features, and strong generalizability.
A prior probability distribution of the observational data was created using a random forest. However, random forest classification only provides the category probability of a single discrete object without considering the correlation between spatial objects in its neighborhood. Consequently, some point clouds can be isolated into incorrect categories, and their classification labels are often not optimal solutions. Simultaneously, there are significant contextual prior knowledge constraints in 3D classifications (such as the elevation of the ground being lower than that of the building and roof, the building being adjacent to the ground, and the absence of trees on a roof). Classification results should discern between the boundaries of different categories as clearly as possible and obtain the smoothest possible classification in the local neighborhood. To this end, we employed the Markov random field model [38] to determine optimal classification labels.
Let P = {P1, P2,…, Pn} be a set of points, L = {L1, L2,…, Lm} comprise a set of classification labels, and m is the number of classification labels. X = {X1, X2,…, Xn} is a set of random variables in the set P, and each 3D point is associated with a random variable. According to Bayes’ rule:
$$P(s \mid d) = \frac{P(d \mid s)\,P(s)}{P(d)},\tag{6}$$
where $P(d \mid s)$ is the conditional probability distribution of the observed value d given state s (also known as the likelihood function); $P(d)$ is a fixed value; and $P(s)$ is the prior probability of the initial value generated via random forest classification. Assuming that the optimal state is $s^*$, the solution to the problem can be expressed as
$$s^* = \arg\max_{s \in S} P(s)\,P(d \mid s).\tag{7}$$
According to the Hammersley–Clifford theorem, the probability distribution of the Markov random field undirected graph must be expressed as the product of non-negative functions on the largest clique that satisfies the Gibbs distribution [39,40]. According to Gibbs measures of the Potts model [41], the solution of Equation (7) can be transformed into the minimum value of the solution of Equation (8):
$$E(s) = \sum_{i=1}^{N} D_i(s_i) + \gamma \sum_{i,j \in N_p(i)} V_{i,j}(s_i, s_j),\tag{8}$$
where $D_i(s_i)$ can be understood as the cost of assigning the current object to classification label $s_i$: the greater the difference between the object’s observed and expected features under that label, the larger this value and the smaller the matching degree. Furthermore, $V_{i,j}(s_i, s_j)$ describes the degree of proximity between adjacent objects, where i and j represent two adjacent points in the neighborhood space; the larger this value, the smaller the local smoothness. The first term of Equation (8), $\sum_{i=1}^{N} D_i(s_i)$, expresses the global optimality of the classification results, while the second term, $\gamma \sum V_{i,j}(s_i, s_j)$, expresses their local smoothness. In this classification problem based on the Markov model, Equation (8) was solved for its minimum value using the α-expansion algorithm [42].
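The following sketch illustrates the overall scheme, with scikit-learn’s random forest providing the unary costs $D_i$; since a full α-expansion solver is beyond a short example, a simple iterated conditional modes (ICM) pass with a Potts pairwise term stands in for the graph-cut optimization, and γ plays the same smoothing role as in Equation (8):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classify_and_smooth(train_X, train_y, X, adjacency, gamma=0.5, iters=5):
    """Random forest unary probabilities refined by a Potts-model ICM pass.

    ICM is a simplified stand-in for the alpha-expansion solver used in
    the study; both decrease E(s) = sum_i D_i(s_i) + gamma * sum_ij V_ij.
    """
    rf = RandomForestClassifier(n_estimators=100).fit(train_X, train_y)
    proba = rf.predict_proba(X)              # prior P(s) per object
    unary = -np.log(proba + 1e-9)            # D_i(s_i) as negative log-probability
    labels = unary.argmin(axis=1)            # initial labeling from the prior
    n_labels = proba.shape[1]
    for _ in range(iters):                   # greedy local energy descent
        for i in range(len(labels)):
            costs = unary[i].copy()
            for j in adjacency[i]:           # Potts term: penalize disagreement
                for l in range(n_labels):
                    if l != labels[j]:
                        costs[l] += gamma
            labels[i] = costs.argmin()
    return labels
```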

2.5. Monomer and Contour Extraction

Thus far, building model objects have been extracted from the oblique photography model using the semantic segmentation method. Based on an analysis of existing building-related methods, we believe that the following aspects should be considered in establishing 3D building monomers:
  • The monomer modeling process should avoid using third-party data;
  • Irreversible damage to the original data should be avoided;
  • The degree of monomer automation must be maximized;
  • Contour extraction should be optimally integrated with monomer processing to avoid multiple traversal data calculations.
Considering the characteristics inherent to the 3D data of oblique photography and the need for calculation efficiency, this study utilizes a 3D mesh as a data source. The proposed monomer and contour extraction method is based on a hierarchical “slicing” of oblique 3D models to determine the geometric similarity of polygons.

2.5.1. Contour Set Generation Based on Horizontal Slicing of Grid Model Data

Our basic principle for building contour extraction was to assemble a collection of candidate building contours from sections taken at different elevations; the geometric similarity of these contour lines is then assessed to extract the optimal contour line and perform logical monomer processing (Figure 4).
In the oblique photography 3D model, the horizontal plane where the highest point of a building is located is taken as the starting level. The model is “sliced” horizontally at fixed intervals between the highest and lowest points of the building (Figure 5). Note that no physical or geometric cuts of the model are involved; only the projection of all grid mesh data above the cutting-plane elevation is obtained for the horizontal cutting plane at each elevation interval, facilitating subsequent contour extraction.
The triangular plane of the model was projected onto the elevation plane and rasterized to allow the extraction of a binarized raster orthographic projection image on each plane. The size of grid cells selected for rasterization was 0.1 m × 0.1 m, balancing extraction efficiency and accuracy (for large cell sizes, the extraction accuracy is low, but the calculation efficiency is high). The process of projecting the triangle onto the elevation plane and performing binarization is summarized in Figure 6.
A binarized raster image of building contours was obtained for each elevation plane. Examples of binarized raster images are shown in Figure 7.
The raster image data of the projected building surfaces were obtained through binarization. To obtain building contours, building edges were extracted from the raster data. This involves the vectorization of raster data, for which relatively mature algorithms exist. However, in contrast to typical classified image data, in which ground objects are complex and include many different linear and areal entities, the binary raster data processed in this study contained simple information over a large image range, and the vector objects generated were all polygons. Considering these characteristics, we used a method based on image edge detection [43] and boundary tracking [44] to extract polygon contours from the binary imagery.
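A compact sketch of the slicing, rasterization, and contour extraction chain is given below; OpenCV’s `fillPoly` and `findContours` are used here as stand-ins for the rasterization and the edge detection [43] / boundary tracking [44] steps, and faces are kept only when they lie entirely above the cutting plane (an assumption on our part):

```python
import numpy as np
import cv2  # OpenCV; findContours substitutes for refs [43,44] in this sketch

def slice_to_contours(vertices, faces, cut_z, cell=0.1):
    """Project faces above elevation cut_z onto the XOY plane, binarize on
    a grid of `cell` meters, and return polygon contours in world units."""
    tri = vertices[faces]                      # (m, 3, 3) triangle vertices
    above = tri[:, :, 2].min(axis=1) >= cut_z  # keep faces fully above the plane
    tri2d = tri[above][:, :, :2]               # drop z: orthographic projection
    lo = vertices[:, :2].min(axis=0)
    hi = vertices[:, :2].max(axis=0)
    cols, rows = np.ceil((hi - lo) / cell).astype(int) + 1
    img = np.zeros((rows, cols), dtype=np.uint8)
    for t in tri2d:                            # rasterize each projected triangle
        poly = np.round((t - lo) / cell).astype(np.int32)
        cv2.fillPoly(img, [poly], 255)
    contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [c.reshape(-1, 2) * cell + lo for c in contours]
```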

2.5.2. Monomer and Contour Selection Based on Geometric Similarity

Using the methods described in Section 2.5.1, multiple contour lines corresponding to horizontal sections at different elevations were obtained, forming a contour line set. To extract a unique optimal contour for each building, two tasks must be accomplished: (1) collating the set of contour lines belonging to the same monomer and (2) selecting one contour line (among those cut from the same monomer) as the building contour line.
In this study, the monomer of the building adopted the logical monomer method, which does not cut the model geometrically or physically but generates a bounding box of the model and attaches the model ID and attributes. Multiple and unique contour line sets were obtained for separate high-rise buildings without logical connections. Determining whether the contour sets belong to the same building is the key to determining the logical monomer. Our monomer determination was based on the following principles:
  • The contour of a single building “sliced” at different elevations is completely closed;
  • For a contour set of the same building, two geometrically similar contour lines must exist at different elevations;
  • The ideal situation for a regularly shaped building is one in which the contour polygons of different elevations overlap completely when projected onto the same XOY plane. The vector polygons obtained by “slicing” the oblique model blocks at different elevation planes were compared; model blocks whose vector polygons deviated in position and shape by less than the set thresholds were determined to represent the same building monomer.

2.5.3. Geometric Similarity Indicators of Contours

The following three indicators were proposed to measure the spatial geometric similarity between two different contours, acting as quantified indicators for the subsequent monomer determination and contour selection:
  • Positional deviations of contour planes. This refers to the spatial distance between contour polygons in the XOY plane, as measured by the distance between the geometric centers of two contour polygons. Contour lines of the same building should exhibit relatively small plane space deviations at separate elevations;
  • Area deviations of contour polygons. This refers to the difference in surface areas between two contour polygons. The more the areas of two different contour lines resemble each other, the higher their geometric similarity;
  • Geometric deviations of contours. This reflects the global similarity between two contours, mainly characterized by contour shape and size.
To increase the computational efficiency and facilitate the description of the above indicators, an outer bounding box of a building contour was first established when determining the geometric shape of the plane space position and contour. The plane space position of the contour is replaced by the space position of the outer bounding box, and the geometric shape of the contour is determined by the length and width of the outer bounding box. As shown in Figure 8, a series of contour sets were obtained for each building via horizontal “slicing” at different elevation planes. Subsequent monomer and contour selections were calculated and solved based on the geometric information of this contour series.
Table 1 lists the formulas used to calculate each of the three indicators. First, the outer bounding boxes were solved for each contour. P1 and P2 are the geometric centers of the outer bounding boxes of source contour C1 and target contour C2, respectively. S1 and S2 represent the areas of these two contours. L1 and L2 are the lengths, and W1 and W2 are the widths of the outer bounding boxes of these contours, respectively.
Using the above three indicators, conditional strategies can be applied to determine whether two contours are similar. All indicators had to meet their predetermined thresholds simultaneously before two contours could be labeled “similar”. These conditions were as follows (a computational sketch follows the list):
  1. The deviation of the plane position must not exceed 0.1 m (an empirical threshold); the position similarity (SimP) is calculated as follows:
$$SimP = \begin{cases} True, & \Delta D \le 0.1\ \mathrm{m} \\ False, & \Delta D > 0.1\ \mathrm{m}; \end{cases}\tag{9}$$
  2. The differences between the lengths and widths of the outer bounding boxes adopt a relative threshold that does not exceed 1/20 of the sum of the compared quantities; the geometric similarity (comprising length similarity SimL and width similarity SimW) is calculated as
$$SimL = \begin{cases} True, & \Delta L \le (L_1 + L_2)/20 \\ False, & \Delta L > (L_1 + L_2)/20, \end{cases}\tag{10}$$
$$SimW = \begin{cases} True, & \Delta W \le (W_1 + W_2)/20 \\ False, & \Delta W > (W_1 + W_2)/20; \end{cases}\tag{11}$$
  3. The contour area deviation adopts a relative threshold of no more than 1/20 of the sum of the compared areas. The similarity in the contour areas (SimS) is determined as follows:
$$SimS = \begin{cases} True, & \Delta S \le (S_1 + S_2)/20 \\ False, & \Delta S > (S_1 + S_2)/20; \end{cases}\tag{12}$$
  4. When the values obtained for SimP, SimL, SimW, and SimS are all “True”, the two contours are considered similar. This can be expressed as follows:
$$Sim(C_1, C_2) = SimP\ \&\ SimL\ \&\ SimW\ \&\ SimS.\tag{13}$$
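The sketch below implements the four conditions; the outer bounding box substitutes for the exact contour geometry, as in the text, and the polygon area is computed with the shoelace formula:

```python
import numpy as np

def polygon_area(c):
    """Shoelace area of a closed polygon given as an (n, 2) vertex array."""
    x, y = c[:, 0], c[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def contours_similar(c1, c2, pos_tol=0.1, frac=20.0):
    """Evaluate SimP, SimL, SimW, and SimS for two contour polygons."""
    def bbox(c):
        lo, hi = c.min(axis=0), c.max(axis=0)
        return (lo + hi) / 2.0, hi[0] - lo[0], hi[1] - lo[1]  # center, L, W
    p1, l1, w1 = bbox(c1)
    p2, l2, w2 = bbox(c2)
    s1, s2 = polygon_area(c1), polygon_area(c2)
    sim_p = np.linalg.norm(p1 - p2) <= pos_tol      # plane position, SimP
    sim_l = abs(l1 - l2) <= (l1 + l2) / frac        # bounding-box length, SimL
    sim_w = abs(w1 - w2) <= (w1 + w2) / frac        # bounding-box width, SimW
    sim_s = abs(s1 - s2) <= (s1 + s2) / frac        # contour area, SimS
    return sim_p and sim_l and sim_w and sim_s      # Sim(C1, C2)
```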

2.5.4. Monomer Judgment and Contour Selection

Based on the assumptions listed in Section 2.5.3, building monomers were determined as follows: contour polygons of different oblique model blocks were compared; when all geometric similarity indicators met the predetermined requirements, the two contour polygons were considered to correspond to the same building monomer. The vector polygon generated at the highest elevation plane slice meeting the similarity conditions was taken as the contour line of the building. During the calculations, excessively small or large contours were removed based on a given area range, decreasing the interference of noisy data. Such area ranges can be customized by users according to the types of buildings present. The stepwise monomer determination and contour selection process is as follows:
  • Step 1: Beginning with the highest elevation slice, designate one contour polygon as the source polygon FS for the current elevation (elevation = H). In the subsequent (lower) slice (elevation = H − ΔH), find a target contour polygon Fo such that Sim(FS, Fo) = True; in this case, FS and Fo belong to the same building monomer. FS is accepted as the outline of the current building and incorporated into the contour set F, and the polygon Fo is marked as extracted;
  • Step 2: If no Fo can be found in the elevation plane H − ΔH such that Sim(FS, Fo) = True, FS does not represent the contour line of a single building. The next polygon is selected from the elevation plane H as the source polygon, and Step 1 is repeated;
  • Step 3: After all contour polygons in elevation plane H have been used as source contours, the contour polygons in the H − ΔH elevation plane are selected as source polygons, target polygons are searched for in the next elevation plane H − 2ΔH, and matching contours are added to set F, repeating the processes outlined in Steps 1 and 2 (contour polygons already marked as extracted are no longer included in the calculation);
  • Step 4. Traverse all “sliced” surfaces of different elevations in descending order from high to low, repeating Step 1 to Step 3 until all elevation cut surfaces are traversed, and the selected contour set is F.
In addition to buildings, a scene may contain objects such as trees and lampposts, whose outline areas are small and whose distribution is scattered. To minimize the influence of these ground objects on the accuracy of building monomer extraction, we established a filtering area threshold, St, and eliminated contours whose area was smaller than St from the final contour set F. A sketch of the full selection loop follows.
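This greedy loop, reusing `contours_similar` and `polygon_area` from the sketch in Section 2.5.3, pairs each source contour with a similar target on the next slice down and filters small contours with the area threshold St (`area_min` here):

```python
def select_building_contours(slices, area_min):
    """Monomer judgment and contour selection over elevation slices.

    slices: list of contour lists, ordered from the highest elevation plane
    downward (step size ΔH). Contours below area_min are treated as noise.
    """
    selected = []                                # the contour set F
    claimed = [set() for _ in slices]            # polygons already extracted
    for level in range(len(slices) - 1):         # Steps 1-4: descend the slices
        for i, src in enumerate(slices[level]):
            if i in claimed[level] or polygon_area(src) < area_min:
                continue                         # skip extracted or noisy contours
            for j, tgt in enumerate(slices[level + 1]):
                if j in claimed[level + 1]:
                    continue
                if contours_similar(src, tgt):   # same building monomer
                    selected.append(src)         # src becomes the building outline
                    claimed[level + 1].add(j)    # tgt leaves the candidate pool
                    break
    return selected
```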

3. Results

3.1. Semantic Segmentation of an Oblique Photography 3D Model

The data used for the 3D semantic segmentation in this study comprised three sets: a training set, a test set, and experimental segmentation data. The training set was used to train the classification model, and the test set was used to test the classification results and evaluate the model’s classification accuracy. The experimental segmentation data were used to perform a 3D segmentation experiment according to the semantic segmentation model. Due to a lack of publicly available oblique photography datasets, we independently annotated the training and test data. These data were divided into packets covering small scenes of 140 m × 140 m to facilitate data annotation and training tests. The training set included 3D data for 10 scenes, comprising 3,325,855 triangular mesh facets. The test set included 3D data for two scenes, comprising 944,858 triangular mesh facets. The experimental segmentation data included the 3D data for two scenes obtained via oblique photography, covering approximately 3 km2 and 0.1 km2, respectively. The experimental data were selected from an oblique photography 3D model of Changsha City, China, and the data covered buildings, ground, trees, and other strongly representative elements, as shown in Figure 9.

3.1.1. Data Format Conversion and Point Cloud Mapping

According to this study’s point set mapping method, all grid data were converted and mapped to 3D point cloud data. Figure 10 illustrates the process of point set mapping; Figure 10A is the original oblique photography model, Figure 10B is its corresponding grid model, Figure 10C is an enlarged local grid model, and Figure 10D shows the data mapped as a point cloud.

3.1.2. Data Classification Annotation

Currently, there are no accessible training or test datasets for the 3D semantic segmentation of oblique photography. Semantic marking tools were therefore used to annotate the point cloud data obtained after mapping the oblique 3D mesh data (as shown in Figure 11) to provide training data for our classification model and conduct tests.
Figure 12A depicts the training set data for 10 selected scenes, including buildings, the ground, trees, and other objects. This study classified data into four categories: ground, buildings, trees, and others. Figure 12B shows the product data after converting the original grid data from the training set (OBJ files) to point cloud data containing texture information. Following semantic annotation, the model file contained all the classification annotation information.

3.1.3. Model Training

The process of 3D semantic segmentation involves several key parameters, including the k-neighborhood number k and the energy function coefficient γ. During model training, we tuned these two parameters by assessing their respective influences on training accuracy as follows:
  • Influence of the k value of k-neighborhoods on classification accuracy. With γ set to a fixed value of 0.2, the empirical results of training with different k values are shown in Table 2. The effects on the classification accuracy and mean F1 score were also investigated. According to a comprehensive analysis, when the k value was between 6 and 12, the training classification accuracy and F1 score were good (Figure 13). As a result, the empirical value for k in this study was set to 10;
  • Influence of the energy function coefficient γ on classification accuracy. With k set to a fixed value of 10, Table 3 presents the training results using different γ values. Their effects on classification accuracy and mean F1 scores were investigated. According to a comprehensive analysis, a γ value between 0.4 and 0.6 delivered good training classification accuracy and F1 scores (Figure 14). As a result, the empirical value for γ in this study was set to 0.5.

3.1.4. Semantic Segmentation

Three scenes were selected for the segmentation test, as shown in Figure 15 and Figure 16.
The precision, recall, and intersection-over-union (IoU) ratio were used to evaluate the classification results for multiple categories: a binary confusion matrix was adopted for each class, with the positive case belonging to the current class and the negative case to all other classes. This study first used the training set to train the model and then applied the test set to the trained model. The results are summarized in Table 4. In the test dataset, the classification accuracy was high; in particular, the building accuracy reached 0.87. For the 323,081 points in the test scenario data, the running time of semantic segmentation was 98 s.
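For reference, these per-class metrics can be derived from one-vs-rest confusion counts as in the following sketch (the four-category setup follows Section 2.1, and the function name is ours):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes=4):
    """Precision, recall, and IoU per class from one-vs-rest confusion counts.

    Each class in turn is treated as the positive case and all remaining
    classes as the negative case, matching the evaluation described above.
    """
    metrics = {}
    for c in range(n_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))   # true positives
        fp = int(np.sum((y_pred == c) & (y_true != c)))   # false positives
        fn = int(np.sum((y_pred != c) & (y_true == c)))   # false negatives
        metrics[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        }
    return metrics
```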

3.2. Building Monomer and Contour Extraction

A building model with an area of approximately 1 km2 was selected and sectioned at different elevation planes. The projection and binarization results of the grid model slices are shown in Figure 17, where h represents the elevation of the different planes.
After obtaining the binarized raster image of elevation planes as shown in Figure 17, filtering and edge detection technologies were used to extract the initial edge contours of the building, as shown in Figure 18.
The practical application of this method primarily focuses on two indexes: the monomer extraction accuracy and the recall rate of monomeric buildings. To evaluate the accuracy and recall ratio, this study calibrated the ground-truth data according to the vector data of the current urban architecture. Under the proposed algorithm, the accuracy and recall ratio of building monomer extraction are primarily affected by the elevation difference ΔH used in “cutting” and the area threshold St used in filtering building contours. Table 5 and Table 6 compare the accuracy and recall under varying elevation differences and area thresholds, respectively. When evaluating the influence of one factor on the monomer effect, the other factor was assigned a constant value: in Table 5, the area threshold is 200 m2; in Table 6, the elevation difference is 2 m.
Figure 19 and Figure 20 show the change curves corresponding to Table 5 and Table 6, respectively.
As shown in Figure 19, as the elevation difference ΔH increases, the accuracy of the monomer increases, whereas the recall rate decreases. This is consistent with the intuitive judgment based on experience. When the elevation difference increases, some low trees and non-building features are filtered out to improve the accuracy. However, extremely large elevation differences may result in the omission of some low buildings during extraction, thereby reducing the recall rate.
As shown in Figure 20, the accuracy of the building monomer extraction and the recall rate increase with the area threshold. This is also consistent with empirical judgment. When the area threshold increases, smaller trees and non-building features are eliminated; however, smaller house structures may also be eliminated.
Figure 21 depicts the monomer rendering results according to the building contours and height information extracted via monomer segmentation. The monomer segmentation and extraction of an oblique photography 3D model with a range of 5 km2 took approximately 456 s. Through manual interpretation and verification, the accuracy of this monomer extraction was approximately 90%.

4. Discussion

This study addressed the lack of semantic information in 3D models of oblique photography. A semantic segmentation method was proposed based on 3D mesh data obtained from oblique photography. A point set mapping method was adopted to convert the mesh data into point cloud data, and rough geometric segmentation was performed simultaneously. We adopted a neighborhood search method based on geodesic distances in the local neighborhood computation of semantic segmentation. During feature selection and calculation, superpoint theory was used to fuse local features, realize the feature extraction of the segmented object, and reduce the computational load of machine learning. Finally, the oblique photographic model was partitioned into ground, trees, buildings, and others to segment the scene visually. Our proposed technique can advance the 3D semantic segmentation of oblique photography based on the successful implementation results. The semantic segmentation method proposed in this study for oblique photography 3D data was validated in Section 3.1. From the implementation results, this method can segment the continuously surfaced 3D mesh model into different land object types, which is beneficial for the subsequent processing and application of oblique photography 3D data.
To acquire 3D building monomers and conduct contour extraction using semantically segmented oblique photography, our technique included hierarchical “slicing” of a 3D model followed by similarity assessments of polygon geometry. This method used a 3D mesh model as the object, projected the triangular meshes of buildings onto different elevation planes, and then processed mesh binarization and mesh vector conversion to extract initial contour sets for buildings. Finally, a polygon space similarity discrimination method calculated a single body and extracted the adopted building contour from the contour set data. The proposed method for individualizing building models based on oblique photography 3D data was validated in Section 3.2. The implementation results show that this method can accurately extract individual components from oblique photography 3D models. This is beneficial for the subsequent object-oriented management of oblique photography 3D models.
This study also analyzed and evaluated the time efficiency of the proposed method. In terms of 3D semantic segmentation, the computation time of the model presented in this paper is approximately 98 s/km2. For building individualization, the computation time of the method proposed in this paper is approximately 90 s/km2. Based on an area of approximately 200 km2 for a medium-sized city, it is estimated that the model in this paper will take around 10.4 h to complete 3D semantic segmentation and individualization. This processing time is acceptable in practical applications.
Compared to existing methods for semantic segmentation and individual extraction of urban 3D models, this study exhibits several characteristics and innovations, including the following:
(1) Regarding 3D semantic segmentation, this study utilized oblique photography-based 3D models as the data source and conducted point cloud processing, forming superpoints through clustering to reduce computational complexity. A search method based on surface model geodesic distance was proposed for feature neighborhood searches. It differs from traditional Euclidean distance clustering methods for discrete point cloud data by leveraging spatial topological information of mesh data and providing a more rational approach to spatial distance measurement;
(2) In terms of monomer and contour extraction of oblique photography-based 3D architectural models, this study innovatively proposes an approach based on elevation plane segmentation and geometric similarity analysis. Compared to physical or interactive individualization methods, our method has the advantages of not relying on third-party data, possessing a high degree of automation, and integrating monomer and contour extraction.
However, this study has some limitations. Three aspects of developing 3D reconstructions using oblique photography require further research: First, a more refined semantic segmentation is needed. Microscopic and fine semantic recognition should include the classification of components such as windows, balconies, and roof structures. Second, in terms of 3D semantic segmentation, more advanced deep learning methods can be employed to further improve segmentation accuracy. Third, the extracted single building contours should be regularized to reduce jagged edges to meet application needs.

5. Conclusions

Oblique photography 3D reconstruction is an important simulation modeling method with broad applications. To meet the needs of semantic extraction and building monomers from such photography, this study used 3D mesh data as the research object and proposed a holistic solution. Our technique enriched the post-processing system for the 3D reconstruction of oblique photography, exploring several problems and offering a feasible solution.
First, a machine learning-based semantic segmentation method is proposed for oblique photography 3D model data. According to the characteristics of the data to be segmented, various methods, such as mesh data point mapping, superpoint construction, and spatial measurement based on geodesic distances, are used in combination for data processing and neighborhood retrieval. This allows for feature fusion and extraction of segmentation objects, followed by the use of traditional machine learning methods to accomplish semantic segmentation.
Second, relying solely on the oblique photography model as the data source, a monomer and contour extraction method is proposed based on model elevation plane “cutting” and polygon geometric similarity judgment. This method can simultaneously realize model individualization and outline extraction for buildings.
The experimental results support the effectiveness and applicability of the proposed method. However, there are some limitations in this study. In the future, further research is warranted regarding semantic segmentation granularity, the introduction of deep learning methods, and the regularization of building outlines.

Author Contributions

Conceptualization, Y.Z. and W.X.; methodology, W.X. and C.Y.; software, W.X.; validation, W.X.; formal analysis, Y.Z.; investigation, W.X. and C.Y.; resources, W.X. and C.Y.; data curation, W.X. and C.Y.; writing—original draft preparation, W.X.; writing—review and editing, W.X.; visualization, W.X. and C.Y.; supervision, Y.Z.; project administration, W.X.; funding acquisition, W.X. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Natural Science Foundation of China (42171364, 42171452) and the R&D Project Plan of the Ministry of Housing and Urban–Rural Development of China (2019-K-158).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are not publicly available as they involve ownership issues.

Acknowledgments

The authors thank Yunsheng Zhang from Central South University, China for his helpful suggestions and comments.

Conflicts of Interest

The authors declare no conflict of interest. The sponsors had no role in the design, execution, interpretation, or writing of the study.

References

  1. Güler, O.; Savaş, S. Stereoscopic 3D teaching material usability analysis for interactive boards. Comput. Anim. Virtual Worlds 2022, 33, e2041. [Google Scholar] [CrossRef]
  2. Li, D.R.; Xiao, X.G.; Guo, B.X.; Jiang, W.; Shi, Y. Oblique Image Based Automatic Aerotriangulation and its Application in 3D City Model Reconstruction. J. Wuhan Univ. (Inf. Sci. Ed.) 2016, 41, 711–721. [Google Scholar]
  3. Liang, Y.B.; Cui, T.J. Research progress of GRC photogrammetry. J. Tianjin Normal Univ. (Nat. Sci. Ed.) 2017, 37, 1–6. [Google Scholar]
  4. Wang, Q.D. A Semantic Modeling Framework-Based Method for 3D Building Reconstruction from Airborne LiDAR Point. Ph.D. Thesis, Wuhan University, Hubei, China, 2017. [Google Scholar]
  5. Horn, B.K.P. Extended Gaussian images. Proc. IEEE 1984, 72, 1671–1686. [Google Scholar] [CrossRef] [Green Version]
  6. Bu, S.; Liu, Z.; Han, J.; Wu, J.; Ji, R. Learning High-Level Feature by Deep Belief Networks for 3-D Model Retrieval and Recognition. IEEE Trans. Multimed. 2014, 16, 2154–2167. [Google Scholar] [CrossRef]
  7. Zhang, R. Research on Semantic Segmentation of Polymorphic Objects in Complex 3D Scene Based on Laser Point Cloud. Ph.D. Thesis, Strategic Support Force Information Engineering University, Henan, China, 2018. [Google Scholar]
  8. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
  9. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  10. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 834–848. [Google Scholar] [CrossRef] [Green Version]
  11. Aijazi, A.K.; Checchin, P.; Trassoudaine, L. Segmentation based classification of 3D urban point clouds: A super-voxel approach for evaluation. Remote Sens. 2013, 5, 1624–1650. [Google Scholar] [CrossRef] [Green Version]
  12. Hang, S.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 945–953. [Google Scholar] [CrossRef]
  13. Kalogerakis, E.; Averkiou, M.; Maji, S.; Chaudhuri, S. 3D Shape Segmentation with Projective Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3779–3788. [Google Scholar] [CrossRef]
  14. Xu, X.; Corrigan, D.; Dehghani, A.; Caulfield, S.; Moloney, D. 3D Object Recognition Based on Volumetric Representation Using Convolutional Neural Networks. In Articulated Motion and Deformable Objects, Proceedings of the 9th International Conference, AMDO 2016, Palma de Mallorca, Spain, 13–15 July 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 147–156. [Google Scholar]
  15. Li, Y.; Pirk, S.; Su, H.; Qi, C.R.; Guibas, L.J. FPNN: Field Probing Neural Networks for 3D Data. In Advances in Neural Information Processing Systems 29, Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar] [CrossRef]
  16. Chen, Y. Research on Building Monomer Extraction Technology Based on Oblique Image Dense Matching Point Cloud. Master’s Thesis, PLA Information Engineering University, Henan, China, 2017. [Google Scholar]
  17. Cai, X.Y. Building Monomer Method Based on UAV Oblique Photography Scene Modeling. Master’s Thesis, Nanjing Normal University, Nanjing, China, 2018. [Google Scholar]
  18. Lei, J.T.; Liu, Q.; Pan, C.L.; Luo, Y.T.; Chen, R.B. Research on the monomer technology of oblique photography three-dimensional model based on vector cutting. Sci. Surv. Mapp. 2021, 46, 84–91. [Google Scholar] [CrossRef]
  19. Filin, S.; Pfeifer, N. Neighborhood Systems for Airborne Laser Data. Photogramm. Eng. Remote Sens. 2005, 71, 743–755. [Google Scholar] [CrossRef]
  20. Asma, A.A.; Fouad, S.T. Distinguishing license plate numbers using discrete wavelet transform technology based deep learning. Indones. J. Electr. Eng. Comput. Sci. 2023, 30, 1771–1776. [Google Scholar] [CrossRef]
  21. Maha, A.M.; Lemya, A.A.H.; Asma, A.A. New algorithm based on deep learning for number recognition. Int. J. Math. Comput. Sci. 2023, 18, 429–438. [Google Scholar]
  22. Abduldaim, M.; Abdulrahman, A.A.; Tahir, F.S. The effectiveness of discrete hermite wavelet filters technique in digital image watermarking. Indones. J. Electr. Eng. Comput. Sci. 2022, 25, 1392–1399. [Google Scholar] [CrossRef]
  23. Tahir, F.S.; Abdulrahman, A.A. The effectiveness of the Hermite wavelet discrete filter technique in modify a convolutional neural network for person identification. Indones. J. Electr. Eng. Comput. Sci. 2023, 31, 290–298. [Google Scholar] [CrossRef]
  24. Elaksher, A.F.; Bethel, J.S. Reconstructing 3D Buildings from LIDAR Data. In Proceedings of the ISPRS Commission III Symposium, Photogrammetric and Computer Vision, Graz, Austria, 9–13 September 2002; pp. 102–107. [Google Scholar]
  25. Elberink, S.O.; Maas, H.G. The use of anisotropic height texture measures for the segmentation of airborne laser scanner data. Int. Arch. Photogramm. Remote Sens. 2000, 33, 678–684. [Google Scholar]
26. Güler, O.; Yücedağ, I. Developing a CNC lathe augmented reality application for industrial maintenance training. In Proceedings of the 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 19–21 October 2018; pp. 1–6. [Google Scholar] [CrossRef]
  27. Wu, W.M. Research on the Automatic ID Method of Oblique Photography Model. Geomat. Spat. Inf. Technol. 2018, 41, 223–225. [Google Scholar]
  28. Landrieu, L.; Simonovsky, M. Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4558–4567. [Google Scholar]
  29. Lafarge, F.; Mallet, C. Creating Large-scale City Models from 3D-point Clouds: A Robust Approach with Hybrid Representation. Int. J. Comput. Vis. 2012, 99, 69–85. [Google Scholar] [CrossRef] [Green Version]
  30. Weinmann, M.; Jutzi, B.; Hinz, S.; Mallet, C. Semantic Point Cloud Interpretation based on Optimal Neighborhoods, Relevant Features and Efficient Classifiers. ISPRS J. Photogramm. Remote Sens. 2015, 105, 286–304. [Google Scholar] [CrossRef]
  31. Lee, I.; Schenk, A. Perceptual organization of 3D surface points. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2002, 34, 193–198. [Google Scholar]
  32. Linsen, L.; Prautzsch, H. Local Versus Global Triangulations. In Proceedings of the Eurographics, Manchester, UK, 2–3 September 2001; pp. 257–263. [Google Scholar] [CrossRef]
  33. Pauly, M.; Gross, M.; Kobbelt, L.P. Efficient simplification of point-sampled surfaces. In Proceedings of the IEEE Visualization, Boston, MA, USA, 27 October 2002; pp. 163–170. [Google Scholar] [CrossRef] [Green Version]
  34. Hoppe, H.; Derose, T.; Duchamp, T.; McDonald, J.; Stuetzle, W. Surface Reconstruction from Unorganized Points. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques, Chicago, IL, USA, 26 July 1992; pp. 71–78. [Google Scholar]
  35. Zhang, L.W. Research on 3D Surface Reconstruction Technology of Scattered Point Cloud. Ph.D. Thesis, National University of Defense Technology, Changsha, China, 2009. [Google Scholar]
36. Gao, Y.L. 3D Model Reconstruction of Tilted Photographic Buildings Based on Vehicle-Mounted Point Cloud Enhancement. Ph.D. Thesis, Wuhan University, Wuhan, China, 2017. [Google Scholar]
37. Hackel, T.; Wegner, J.D.; Schindler, K. Fast Semantic Segmentation of 3D Point Clouds with Strongly Varying Density. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 177–184. [Google Scholar]
  38. Mount, D.M. ANN Programming Manual; Technical Report; Department of Computer Science, University of Maryland: College Park, MD, USA, 1998. [Google Scholar]
  39. Yan, D.M.; Wintz, J.; Mourrain, B.; Wang, W.; Boudon, F.; Godin, C. Efficient and robust reconstruction of botanical branching structure from laser scanned points. In Proceedings of the 2009 11th IEEE International Conference on Computer-Aided Design and Computer Graphics, Huangshan, China, 19–21 August 2009; pp. 572–575. [Google Scholar]
40. Shapovalov, R.; Velizhev, A. Cutting-plane training of non-associative Markov network for 3D point cloud segmentation. In Proceedings of the 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, Hangzhou, China, 16–19 May 2011; pp. 1–8. [Google Scholar]
  41. Li, S.Z. Markov Random Field Modeling in Image Analysis; Springer: London, UK, 2009. [Google Scholar]
  42. Boykov, Y.; Veksler, O.; Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1222–1239. [Google Scholar] [CrossRef] [Green Version]
43. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698. [Google Scholar] [CrossRef]
  44. Freeman, H. Boundary Encoding and Processing. In Picture Processing and Psychopictorics; Lipkin, B.S., Rosenfeld, A., Eds.; Academic Press: New York, NY, USA, 1970; pp. 241–266. [Google Scholar]
Figure 1. Workflow for realizing semantic classification based on 3D mesh data.
Figure 2. Conversion of oblique photography into 3D data. (A) An oblique photography 3D model. (B) Point cloud data of triangular mesh vertices.
Figure 3. Diagram of solving the shortest distance between three-dimensional lattice points based on the Dijkstra algorithm.
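The geodesic neighborhood search illustrated in Figure 3 can be realized with a standard Dijkstra traversal over the mesh edge graph. The sketch below is a minimal Python illustration under assumptions (adjacency-list input, Euclidean edge weights, an optional max_dist cutoff); it is not the authors' implementation.

    import heapq
    import math

    def dijkstra_geodesic(vertices, edges, source, max_dist=float("inf")):
        # vertices: list of (x, y, z) tuples; edges: dict {i: [j, ...]} of mesh edges.
        # Returns {vertex index: approximate geodesic distance from source}, cut off at max_dist.
        dist = {source: 0.0}
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v in edges[u]:
                nd = d + math.dist(vertices[u], vertices[v])  # edge weight = Euclidean edge length
                if nd < dist.get(v, float("inf")) and nd <= max_dist:
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return dist

The k nearest entries of the returned dictionary (or all entries within a radius) can then serve as the geodesic neighborhood used for feature extraction.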
Figure 4. Process of monomer construction and contour extraction for a 3D building model.
Figure 5. Real-scene model diagram of 3D building elevation segmentation.
Figure 6. 3D model projection and binarization process.
Figure 7. Binary images formed by triangular rasterization (elevation = 31 m).
Figure 8. Contour lines obtained via horizontal “slices” at different elevations of a high-rise building.
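As a companion to Figures 6–8, the sketch below illustrates one way to perform the slice-and-binarize step: every triangle whose edges cross the plane z = h contributes an intersection segment, and the segments are rasterized into a binary image from which contours can be traced. Grid bounds, cell size, and all names are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def slice_to_binary(triangles, h, xmin, ymin, cell, shape):
        # triangles: (N, 3, 3) array of triangle vertex coordinates (x, y, z).
        # Returns a uint8 image with 1 where the plane z = h intersects the mesh.
        img = np.zeros(shape, dtype=np.uint8)
        for tri in triangles:
            pts = []
            for a, b in ((0, 1), (1, 2), (2, 0)):
                za, zb = tri[a, 2], tri[b, 2]
                if (za - h) * (zb - h) < 0:  # this edge strictly crosses the slice plane
                    t = (h - za) / (zb - za)
                    pts.append(tri[a, :2] + t * (tri[b, :2] - tri[a, :2]))
            if len(pts) == 2:  # rasterize the triangle-plane intersection segment
                p, q = pts
                steps = int(np.linalg.norm(q - p) / cell) + 1
                for s in np.linspace(0.0, 1.0, steps + 1):
                    x, y = p + s * (q - p)
                    i, j = int((y - ymin) / cell), int((x - xmin) / cell)
                    if 0 <= i < shape[0] and 0 <= j < shape[1]:
                        img[i, j] = 1
        return img

Vertices lying exactly on the plane are ignored here for brevity. Initial contours such as those in Figure 18 can then be traced from the binary image, e.g., with edge detection [43] and chain-code boundary encoding [44].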
Figure 9. Data ranges of the training data set and test data set.
Figure 10. 3D scene of oblique photography and its corresponding grid model: (A) the original 3D model; (B) the 3D grid model; (C) the local 3D grid model; (D) point cluster mapping data for the local area.
Figure 11. A semantic annotation tool is used to annotate point clusters after mapping.
Figure 12. Training data set and point set model obtained from an oblique photography 3D model: (A) training set data: original oblique model data; (B) training set data: point set mapping and semantically labeled data.
Figure 13. Variation in classification accuracy and mean F1 scores with changes in the value of k: (A) variation in training classification accuracy; (B) variation in mean F1 scores.
Figure 14. Variation in classification accuracy and mean F1 scores with changing γ values: (A) training classification accuracy; (B) mean F1 scores.
Figure 15. Test results of 3D semantic segmentation for two small local scenes: (A) original oblique photographic data; (B) mesh point cloud data; (C) semantic classification results.
Figure 16. Test results of 3D semantic segmentation for a larger scene: (A) original oblique photographic data; (B) mesh point cloud data; (C) semantic classification results.
Figure 17. Binarization results of grid projection after slicing at different elevations: (A) h = 47 m; (B) h = 49 m; (C) h = 51 m; (D) h = 53 m.
Figure 18. Initial contours extracted from different elevation cut surfaces at (A) h = 47 m; (B) h = 49 m; (C) h = 51 m; (D) h = 53 m.
Figure 19. Curves of accuracy and recall under different elevation differences, ΔH.
Figure 20. Curves of accuracy and recall under different area thresholds, St.
Figure 21. Monomer rendering results based on the extracted building contours.
Table 1. Calculation of contour geometric similarity indicators.
Geometric Similarity Index    Deviation Formula
Positional deviations         ΔP = |P1 − P2|
Geometric deviations          ΔL = |L1 − L2|; ΔW = |W1 − W2|
Contour area deviations       ΔS = |S1 − S2|
(The table’s remaining two columns, the source contour and target contour of the computed object, are illustrations and are omitted here.)
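A minimal sketch of how the Table 1 deviations could be computed for two contour slices is given below; it assumes the position P is the contour centroid, L and W are the axis-aligned bounding-box length and width, and S is the shoelace polygon area. These conventions, and all names, are illustrative assumptions rather than the paper's exact measurement definitions.

    import numpy as np

    def shoelace_area(c):
        # Polygon area of an (N, 2) contour via the shoelace formula.
        x, y = c[:, 0], c[:, 1]
        return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

    def contour_deviations(src, tgt):
        # src, tgt: (N, 2) arrays of contour points from adjacent slices.
        dP = np.linalg.norm(src.mean(axis=0) - tgt.mean(axis=0))  # ΔP = |P1 − P2| (assumed centroid distance)
        L1, W1 = src.max(axis=0) - src.min(axis=0)                # source bounding-box length and width
        L2, W2 = tgt.max(axis=0) - tgt.min(axis=0)                # target bounding-box length and width
        dL, dW = abs(L1 - L2), abs(W1 - W2)                       # ΔL = |L1 − L2|, ΔW = |W1 − W2|
        dS = abs(shoelace_area(src) - shoelace_area(tgt))         # ΔS = |S1 − S2|
        return dP, dL, dW, dS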
Table 2. Results of classification accuracy and F1 scores under different neighborhood k values.
k value                   4        6        8        12       16       20       30
Classification accuracy   0.9553   0.9561   0.9583   0.9565   0.9519   0.9460   0.9460
Mean F1 score             0.8796   0.8774   0.8875   0.8912   0.8354   0.8737   0.8737
Table 3. Results of classification accuracy and F1 scores under different energy function coefficients (γ values).
γ value                   0.01     0.1      0.4      0.6      1.0      2        6
Classification accuracy   0.9535   0.9553   0.9566   0.9544   0.9471   0.9351   0.8844
Mean F1 score             0.8700   0.8556   0.8907   0.8873   0.8278   0        0
Table 4. Test results based on the training model.
Category    Precision   Recall     IoU
Ground      0.795524    0.796856   0.661390
Building    0.872807    0.822344   0.797025
Trees       0.766479    0.895214   0.703374
Other       0.748356    0.823761   0.723279
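The per-class precision, recall, and IoU in Table 4 follow the standard confusion-matrix definitions; as a reference, the short sketch below derives all three from a matrix C in which, by assumption, C[i, j] counts points of true class i predicted as class j.

    import numpy as np

    def per_class_metrics(C):
        # C: square confusion matrix; rows are true classes, columns are predictions.
        tp = np.diag(C).astype(float)
        fp = C.sum(axis=0) - tp  # predicted as the class but belonging to another
        fn = C.sum(axis=1) - tp  # belonging to the class but predicted as another
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        iou = tp / (tp + fp + fn)  # intersection over union per class
        return precision, recall, iou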
Table 5. Accuracy and recall under different elevation differences, ΔH.
ΔH (m)   Accuracy   Recall
0.5      0.812      0.922
0.8      0.821      0.905
1.0      0.833      0.893
1.5      0.834      0.894
2.0      0.843      0.885
2.5      0.854      0.871
3.0      0.877      0.797
5.0      0.885      0.786
6.0      0.901      0.732
Table 6. Accuracy and recall under different area thresholds, St.
St (m²)   Accuracy   Recall
3.0       0.732      0.903
5.0       0.781      0.894
8.0       0.792      0.886
10.0      0.866      0.879
20.0      0.893      0.885
50.0      0.905      0.864
80.0      0.914      0.822
100.0     0.923      0.782
150.0     0.945      0.742
