3.2. Data Collection and Pre-Processing
3.2.1. Remote Sensing Data Preparation and Pre-Processing
The preparation and preprocessing of the remote sensing data consist of four parts: preparation of the Single Look Complex (SLC) data, image alignment and cropping, selection and alignment of the DEM data, and selection of the master image.
Single Look Complex (SLC) data are the basis of PS-InSAR. The dataset in this study comes from the ESA Sentinel-1 Earth observation satellite. The satellite parameters are shown in
Table 2.
A total of 15 SLC images were selected for this study, covering the period from 22 September 2019 to 8 March 2020. The selected images cover the entire Shenzhen area.
3.2.2. Deep Learning and Image Processing Data Collection and Pre-Processing
Data Collection
The study's primary purpose is to identify and measure cracks in the subway and in the engineering facilities along the subway line at dangerous and abnormal subsidence points. Engineering facilities include roads, buildings, bridges, etc., and their materials differ. Therefore, the dataset used for model training should contain as many materials as possible, such as asphalt, concrete and bricks. In order to obtain a large and diverse set of high-quality crack images, the data were drawn from the following four sources:
Using search engines such as
www.baidu.com (accessed on 5 March 2021) and
www.bing.com (accessed on 12 March 2021) to retrieve and collect crack images posted on the Internet;
Manual shooting with handheld cameras in Shenzhen;
Use of DJI's 4RTK drone to photograph high-rise and concealed locations;
Public datasets from existing research.
After collection, the images need to be preprocessed by cropping them to a fixed size. The images are then labeled to indicate whether or not they contain cracks. Finally, the crack information is highlighted by grayscaling and image enhancement.
Image Cropping and Labeling
When a deep learning model performs image classification, the input images need to be cropped to a uniform size. In the data collected for this experiment, the images taken by the handheld camera are 3024 × 4032 pixels, those taken by the drone are 4864 × 3648 pixels, the public dataset images are 227 × 227 pixels, and the sizes of the images downloaded from the Internet vary. However, training a deep learning model requires image data of uniform size. At the same time, to avoid the impact of image compression and distortion on training, this study cuts the oversized images into 227 × 227 samples and filters out images that contain no concrete or crack information. Finally, a dataset of 8206 images was obtained: 1294 images taken manually, 912 taken by drone, 2000 downloaded from the Internet, and 4000 from the public dataset, with positive and negative samples each accounting for 50% (see
Table 3 for details). Finally, following the hold-out validation method, the data were divided into training, validation and test sets in a ratio of 8:1:1. As shown in
Table 4, the training set contains 6566 images, the validation set contains 820 images, and the test set contains 820 images.
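As a rough illustration of the tiling and hold-out split described above, the following minimal Python sketch cuts an oversized photo into non-overlapping 227 × 227 samples and splits the pooled samples 8:1:1; the paths, helper names and random seed are illustrative assumptions rather than the authors' preprocessing code.

```python
import random
from pathlib import Path
from PIL import Image

TILE = 227  # sample size used throughout this study

def tile_image(path: Path, out_dir: Path) -> None:
    """Cut an oversized photo into non-overlapping 227 x 227 tiles."""
    img = Image.open(path)
    w, h = img.size
    out_dir.mkdir(parents=True, exist_ok=True)
    for top in range(0, h - TILE + 1, TILE):
        for left in range(0, w - TILE + 1, TILE):
            tile = img.crop((left, top, left + TILE, top + TILE))
            tile.save(out_dir / f"{path.stem}_{top}_{left}.png")

def holdout_split(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle once and split the sample list into training/validation/test sets (8:1:1)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * ratios[0])
    n_val = int(len(samples) * ratios[1])
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]
```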
3.2.3. Image Processing Data Pre-Processing
Grayscale
The color of a color image is jointly determined by the R, G and B components, but RGB does not directly reveal the morphological characteristics of the image; instead, the color is rendered according to optical principles. Converting a color image into a grayscale image that contains only luminance information, not color information, therefore helps reduce the computational effort. Grayscaling transforms a color image into a grayscale image by discarding all color information and retaining only the brightness of each pixel. Grayscale processing methods primarily include the component method, the maximum method, the average method and the weighted average method. The weighted average method effectively enhances the data features, enabling the model to better learn the crack information, so it is employed in this paper for grayscale preprocessing of the images. The three channel components of the color image are weighted and averaged according to the importance of each channel, and the resulting weighted average is used as the gray value of the grayscale image, as shown in Equation (1).
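Because Equation (1) is not reproduced here, the sketch below assumes the widely used BT.601 weights (0.299, 0.587, 0.114) for the weighted average method; the input file name is illustrative.

```python
import cv2
import numpy as np

def to_grayscale_weighted(bgr: np.ndarray) -> np.ndarray:
    """Weighted average grayscale: Gray = 0.299 R + 0.587 G + 0.114 B (BT.601 weights, assumed)."""
    b = bgr[..., 0].astype(np.float32)
    g = bgr[..., 1].astype(np.float32)
    r = bgr[..., 2].astype(np.float32)
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    return np.clip(gray, 0, 255).astype(np.uint8)

# cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY) applies the same kind of weighted average internally.
img = cv2.imread("crack_sample.png")  # illustrative file name
gray = to_grayscale_weighted(img)
```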
Image Enhancement
In infrastructure images, the difference between a crack and the surrounding background color is generally small, which leads to low crack contrast. To make the crack edges more prominent, the image's contrast needs to be enhanced. Contrast is measured by the grayscale range: the larger the range, the higher the contrast and the clearer the image. Contrast enhancement is based on the grayscale histogram, a function of the gray levels of the image that describes the number or proportion of pixels at each gray level, i.e., the number of pixels corresponding to each gray level between 0 and 255 is counted.
A linear transformation is used to enhance contrast; it is a computationally inexpensive method that enlarges the dynamic range by linearly stretching the original gray levels. The linear transformation is given by Equations (2)–(4).
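Since Equations (2)–(4) are not reproduced here, the following minimal sketch assumes a generic linear stretch that maps the gray-level range [low, high] onto [0, 255]; the exact piecewise form used by the authors may differ.

```python
import numpy as np

def linear_stretch(gray, low=None, high=None):
    """Linearly map the gray-level range [low, high] onto [0, 255] to raise contrast."""
    low = int(gray.min()) if low is None else low
    high = int(gray.max()) if high is None else high
    g = gray.astype(np.float32)
    stretched = (g - low) * 255.0 / max(high - low, 1)
    return np.clip(stretched, 0, 255).astype(np.uint8)
```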
3.3. Settlement Point Identification Model
The deformation of the ground elevation is the basis for the selection of settlement points. InSAR is a space-borne Earth observation technology that can calculate the topography and surface changes in a target area by using pairs of complex images acquired by radar over the same ground area at different times or from different angles [
23]. Coherence and atmospheric inhomogeneity limit the use of InSAR [
24,
25]. The PS-InSAR algorithm, by concentrating the calculation on the parts of the target with high coherence, ensures that the target maintains stable scattering characteristics over a long time sequence [
30], and is largely unaffected by spatial and temporal decorrelation. In addition, it can acquire data over long time spans with sub-millimeter accuracy and is capable of all-weather, round-the-clock monitoring. These PS points are generally distributed on hard ground objects without vegetation cover, so the technology is well suited to urban areas.
After inputting the prepared 15 SLC images into the SARscape software (v.5.6), a connection graph can be generated. The central image is selected as the master image, and the other 14 images are registered to it to eliminate the relative offset between the master and slave images caused by orbital errors.
Differential interferometric processing is then performed on the master and slave images to obtain the interferogram set based on the common master image. The differential interferogram set of the region is obtained by subtracting the DEM information from the interferogram set. Then, the amplitude dispersion index (ADI) threshold method is used to identify point targets with stable radar scattering. After the above processing, the remaining phase includes the deformation phase, the noise phase, the elevation error and the atmospheric delay phase. The SARscape software can eliminate the factors affecting monitoring accuracy, such as terrain residuals, elevation errors and atmospheric errors. In order to obtain more accurate deformation information, points with an error greater than 1 mm were deleted in ArcGIS, and 9,998,423 PS points were obtained from the satellite data. The settlement velocity ranges from −41.06 mm/y to 7.93 mm/y.
Figure 2 shows the PS-InSAR technology processing flow.
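The PS point selection in this study is carried out inside SARscape; purely to illustrate the amplitude dispersion index (ADI) thresholding mentioned above, the following NumPy sketch marks candidate pixels whose ADI falls below a threshold, where the 0.25 value is a commonly quoted choice and not taken from the authors' processing chain.

```python
import numpy as np

def ps_candidates(amplitude_stack: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Select persistent-scatterer candidates by the amplitude dispersion index.

    amplitude_stack: array of shape (n_images, rows, cols) holding the co-registered
    amplitudes of the SLC stack (here, 15 Sentinel-1 scenes). Pixels whose
    ADI = sigma / mean stays below the threshold are treated as stable scatterer candidates.
    """
    mean_amp = amplitude_stack.mean(axis=0)
    std_amp = amplitude_stack.std(axis=0)
    adi = np.divide(std_amp, mean_amp, out=np.ones_like(mean_amp), where=mean_amp > 0)
    return adi < threshold  # boolean mask of candidate PS pixels
```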
3.4. Crack Recognition Model
Convolutional neural networks can build deeper networks with fewer parameters, thus reducing the time and computational cost of training. VGG16, a classical convolutional neural network architecture, consists of 13 convolutional layers and 3 fully connected layers. It employs small 3 × 3 convolution kernels and max-pooling layers, giving a relatively streamlined yet deep architecture, as shown in
Figure 3. Consequently, VGG16 was chosen as the classifier in our crack recognition model, converting the crack detection problem into a binary classification task that distinguishes between image samples with and without cracks. Constructing the crack recognition model with VGG16 primarily involves three steps: data collection, data preprocessing and model training. The first two steps were completed in the previous sections.
Two types of data are used: data collected by dispatching drones based on the settlement values, and data specially prepared for model training to improve the generality of the crack recognition model. The size of the existing dataset images is 227 × 227, so the input of the VGG16 model is adjusted to 227 × 227. In order not to lose boundary information, zeros are added around the periphery of the original image matrix. Without padding, the convolution operation shrinks the feature map relative to the original matrix; zero padding solves this problem and keeps the convolved matrix the same size as the original. Since the input is a grayscale image, the first layer uses 3 × 3 single-channel convolution kernels; 64 such kernels produce a 64-channel feature map. The ReLU activation function makes the output non-linear. The next layer applies 64-channel 3 × 3 convolution kernels to generate another 64-channel feature map; the number of channels in an output feature map always equals the number of convolution kernels. During training, the initial learning rate was set to 0.001, the SGD optimizer was used, and the momentum was set to 0.5.
To further reduce the number of parameters, average pooling is applied after the max pooling, yielding a uniform output of 7 × 7 × 512 feature maps. After the 13 convolution layers, the output of the last convolution layer is flattened into 25,088 values (7 × 7 × 512) and fed into the three fully connected layers, which map the 25,088 values to 4096 neurons. Finally, a SoftMax layer determines whether the input image contains cracks. The model code is available in the Crack Recognition Model of the
Supplementary Materials.
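A minimal PyTorch sketch of the adapted network described above (single-channel 227 × 227 input, two-class output, SGD with a learning rate of 0.001 and momentum of 0.5) is given below; the exact layer changes in the authors' code, provided in the Supplementary Materials, may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

# Standard VGG16 topology: 13 convolutional layers followed by 3 fully connected layers.
model = models.vgg16(weights=None)

# Grayscale 227 x 227 input: replace the first 3-channel convolution with a 1-channel one.
model.features[0] = nn.Conv2d(1, 64, kernel_size=3, padding=1)

# torchvision's VGG16 ends its feature extractor with adaptive average pooling to
# 7 x 7 x 512 = 25,088 values, which are flattened and mapped to 4096 neurons.
# Two-class output (crack / no crack); softmax is applied at inference time.
model.classifier[6] = nn.Linear(4096, 2)

criterion = nn.CrossEntropyLoss()  # combines log-softmax and negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.5)

# Sanity check on a dummy batch of 64 grayscale 227 x 227 images.
dummy = torch.randn(64, 1, 227, 227)
assert model(dummy).shape == (64, 2)
```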
The training results of the model are shown in
Figure 4. The convolutional network inputs 8206 images, with 6566 images in the training set and 820 images in the validation set. The batch size is 64 images, and one epoch comprises 103 batches. All of the training set data are used for training, and the number of epochs is set to 20, 30, 40, 50, 70, 80, 90 and 100 to find the best-trained model. In this study, training loss, validation loss, training accuracy and validation accuracy are used to assess the accuracy of the crack recognition model. When training for 20 epochs, as shown in
Figure 4a, there is no convergence trend for training accuracy and verification accuracy. At the 80th epoch, as shown in
Figure 4b, the training accuracy has not yet converged but the validation accuracy is close to convergence. As shown in
Figure 4c–e, after 100 epochs of training, the training loss and validation loss are 0.046 and 0.042, respectively; the training accuracy of the model converges to 99.48% and the validation accuracy converges to 98.66%, which is sufficiently accurate for subsequent use.
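The epoch-level bookkeeping of training loss, validation loss, training accuracy and validation accuracy could be organized roughly as follows; the data loaders and device handling are assumptions rather than the authors' training script.

```python
import torch
from torch.utils.data import DataLoader

def run_epoch(model, loader, criterion, optimizer=None, device="cpu"):
    """One pass over a loader; returns (mean loss, accuracy). Trains only if an optimizer is given."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * images.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            seen += images.size(0)
    return total_loss / seen, correct / seen

# train_set / val_set are assumed to yield (1 x 227 x 227 tensor, label) pairs.
# train_loader = DataLoader(train_set, batch_size=64, shuffle=True)  # roughly 103 batches per epoch
# val_loader = DataLoader(val_set, batch_size=64)
# for epoch in range(100):
#     train_loss, train_acc = run_epoch(model, train_loader, criterion, optimizer)
#     val_loss, val_acc = run_epoch(model, val_loader, criterion)
```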
Figure 5a is a photo of a crack from which feature maps are extracted by convolution.
Figure 5 shows the convolution process and how the features change from clear to fuzzy, progressing from edge feature extraction to regional feature extraction.
Figure 5c is the 0th layer of convolution, which extracts edge information and shows 30 feature images.
Figure 5d shows that the second convolution layer further extracts the edge information, but it is gradually blurred.
Figure 5e shows the regional features extracted by the fifth layer of convolution.
3.5. Crack Measurement Model
After classification by the convolutional neural network, a series of image labels indicating which images contain cracks is obtained. However, classification alone cannot further analyze the cracks in the image, so digital image processing techniques are needed to extract and analyze the crack information.
First, Canny edge detection, a classic edge detection algorithm, is applied to the grayed and contrast-enhanced image; it extracts useful structural information and significantly reduces the amount of data. After edge detection, five morphological operations are required to highlight the crack information and remove noise. In the first morphological operation, the kernel size is set to 3 × 3, and a dilation followed by a closing operation is performed. The closing operation can remove noise, but because the cracks and the noise are highlighted simultaneously, a single closing operation rarely gives acceptable results. In the second morphological operation, another closing operation is performed; the noise is clearly removed and the crack information is further highlighted. In the third morphological operation, candidate cracks are screened as follows: (1) it is determined whether the area of the candidate is larger than a preset area threshold; (2) if the width is less than the height, the width and height are swapped; a candidate is considered a crack if its width is greater than WHRatio times its height, its width is greater than 100 pixels, and conditions (1) and (2) are met. In the fourth operation, the image is opened; here, dilating and then eroding the image connects the broken cracks. Finally, two erosion operations are carried out. After the five morphological operations, all cracks have been screened.
Figure 6a,b compares the images before and after the five morphological operations.
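A minimal OpenCV sketch of the edge detection and morphological screening described above is shown below; the Canny thresholds, the area threshold and the WHRatio value are assumed parameters, and the grouping of the five operations only approximates the authors' pipeline.

```python
import cv2
import numpy as np

KERNEL = np.ones((3, 3), np.uint8)  # 3 x 3 structuring element used throughout
AREA_MIN = 500                      # assumed area threshold (pixels)
WH_RATIO = 3                        # assumed width-to-height ratio threshold

def extract_crack_mask(gray: np.ndarray) -> np.ndarray:
    """Canny edges followed by morphological cleaning, roughly mirroring the five steps."""
    edges = cv2.Canny(gray, 50, 150)                        # thresholds are illustrative
    mask = cv2.dilate(edges, KERNEL)                        # 1) dilation + closing
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, KERNEL)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, KERNEL)  # 2) second closing
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, KERNEL)   # 3-4) screening and opening
    mask = cv2.erode(mask, KERNEL, iterations=2)            # 5) two erosions
    return mask

def is_crack(contour) -> bool:
    """Screening rules: area threshold, long-side swap, elongation and minimum length."""
    if cv2.contourArea(contour) < AREA_MIN:
        return False
    _, (w, h), _ = cv2.minAreaRect(contour)
    if w < h:
        w, h = h, w  # make sure the long side is treated as the width
    return w > WH_RATIO * max(h, 1) and w > 100
```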
After the crack's shape has been extracted, the crack's length, width and area need to be calculated. In our study, the length and width of a crack refer to the length and width of the minimum bounding rectangle around the crack, while the area is the enclosed pixel area. However, most current UAVs are equipped with monocular cameras, which cannot capture the distance between the lens and the object. Therefore, a laser rangefinder was installed on the UAV platform and synchronized with the shutter, so that the distance between the lens and the object is captured at the moment of shooting. Then, based on the measured distance, the camera parameters and the focal length formula, the actual length and width of the cracks can be obtained as shown in Formula (5),
where
D is the actual length of the crack,
L is the distance obtained by the laser rangefinder,
V is the focal length of the camera,
S is the number of pixels on the long side of the image sensor,
d′ is the number of pixels occupied by the crack in the image, and
s is the physical size of the long side of the image sensor. The DJI 4RTK UAV used in this experiment has a camera pixel size of 2.41 μm and a focal length of 8.8 mm.
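Formula (5) itself is not reproduced in the text; under the standard pinhole camera (similar triangles) model, the variable definitions above suggest the reconstruction below, given as an assumption rather than the authors' exact equation (s/S is the pixel pitch, 2.41 μm for the DJI 4RTK camera).

```latex
% Plausible reconstruction of Formula (5): object size = image size * distance / focal length,
% with the image size expressed through the pixel count d' and the pixel pitch s/S.
D = \frac{L \cdot d' \cdot \frac{s}{S}}{V} = \frac{L \, d' \, s}{V \, S}
```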
Finally, the crack information extracted in the experiment is overlaid on the original image, and the measured information is saved in TXT format for experts to evaluate, as shown in
Figure 6c. The model code is available in the Crack Measurement Model of the
Supplementary Materials.
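Putting the measurement step together, a brief sketch is given below, assuming the minimum bounding rectangle returned by OpenCV and the DJI 4RTK parameters quoted above; the function names and the TXT report format are illustrative.

```python
import cv2

PIXEL_PITCH_MM = 2.41e-3  # DJI 4RTK pixel size: 2.41 um
FOCAL_MM = 8.8            # camera focal length: 8.8 mm

def crack_size_mm(contour, distance_mm: float):
    """Convert a crack's minimum-bounding-rectangle size from pixels to millimetres."""
    _, (w_px, h_px), _ = cv2.minAreaRect(contour)
    scale = distance_mm * PIXEL_PITCH_MM / FOCAL_MM  # ground size of one pixel (mm)
    return w_px * scale, h_px * scale

def save_report(path: str, sizes) -> None:
    """Write measured length/width pairs to a TXT report for expert evaluation."""
    with open(path, "w") as f:
        for i, (length_mm, width_mm) in enumerate(sizes):
            f.write(f"crack {i}: length = {length_mm:.1f} mm, width = {width_mm:.1f} mm\n")
```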