1. Introduction
Subway trains are integral to traffic systems, modernization, and urban culture [1,2,3]. The axle boxes of a subway train support the whole weight of the vehicle and ensure the reliability of the train [4,5], and their rolling bearings are the vital components that transfer the loads and torque filtered by the air spring to the shaft. Failures therefore unavoidably occur in rolling bearings and can result in economic loss or even human casualties. As a result, fast and accurate fault diagnosis of axle box bearings helps maintain the smooth operation of urban rail transit, extend service time, and ensure travel safety.
The fault diagnosis methods used for rolling bearings can be classified into two categories: vibration-based signal analysis and machine-learning-powered methods [6,7,8,9,10]. In general, vibration-based signal analysis methods detect faults by extracting fault-related vibration components and characteristic frequencies. However, vehicle-mounted sensors also pick up irrelevant vibration signals from other components, such as the shaft and gearbox, when subway trains operate at high speed. Hence, in the early stage of a fault, bearing-related signals tend to be overwhelmed by the stronger components and harmonics of other parts or by environmental noise. Therefore, it is hard to extract a pure fault-related vibration signal with traditional vibration-based signal analysis.
Machine-learning-powered fault diagnosis methods detect faults by extracting a series of statistical parameters (e.g., kurtosis, root mean square, energy and entropy) to represent the bearings' health states. These parameters can then be used to train classifiers (e.g., a support vector machine (SVM), a deep neural network (DNN), or a Bayes network) to distinguish different fault characteristics. Among them, SVMs are a class of generalized linear classifiers that perform binary classification in a supervised learning manner; neural networks are extensions of perceptrons, while DNNs can be understood as neural networks with many hidden layers. Nevertheless, the extracted statistical parameters cannot guarantee accurate discrimination between different faults. Therefore, finding suitable training parameters for traditional classifiers is a long-standing challenge for machine-learning-powered fault diagnosis methods [11].
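To make the feature extraction step concrete, the following is a minimal NumPy sketch of the statistical parameters named above. It is an illustrative variant, not the pipeline of any cited paper; in particular, definitions of signal entropy vary across the literature, and the energy-distribution form used here is one common choice.

```python
import numpy as np

def health_features(x):
    """Statistical features commonly used to represent a bearing's health state.

    One illustrative variant: kurtosis, root mean square, energy, and a
    Shannon entropy computed over the normalized energy distribution.
    """
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))                        # root mean square
    energy = np.sum(x ** 2)                               # signal energy
    m2 = np.mean((x - x.mean()) ** 2)                     # second central moment
    kurtosis = np.mean((x - x.mean()) ** 4) / m2 ** 2     # 4th moment / variance^2
    p = x ** 2 / energy                                   # energy distribution
    entropy = -np.sum(p * np.log(p + 1e-12))              # Shannon entropy of p
    return {"rms": rms, "energy": energy, "kurtosis": kurtosis, "entropy": entropy}

# Illustrative vibration segment (hypothetical data); the resulting feature
# vector could then be fed to an SVM or similar classifier as described above.
features = health_features(np.sin(np.linspace(0, 20, 1024)) + 0.05 * np.random.randn(1024))
```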
In recent years, deep learning (DL) methods, which take vibration-related signals as input data, have been applied in various fields [12,13,14,15,16]. For example, S. Roy et al. [12] extended the successful application of DL in medical imaging to COVID-19 and paved the way for future research on DL-assisted diagnosis of COVID-19 from medical imaging datasets. K.B. Lee et al. and H. S. DIKBAYIR et al. [13,14,15] detected vehicles in different complex driving environments based on DL. For the extraction of fault-related signal features, traditional machine-learning-powered methods, which rely on fault-related preprocessing, lack multiple levels of nonlinear transformations [17]. DL can not only adaptively extract deep fault features from the input layer but can also ease the difficulty of parameter optimization.
Among DL methods, diagnostic models with a classical convolutional neural network (CNN) structure, such as AlexNet and VGG, are the most widely used. However, because big data are characterized by large volume, varied modalities, fast generation and high overall value but low value density, these network models have no choice but to increase their depth to parse the massive data, which leads to a huge number of training parameters and to overfitting. By contrast, deep residual networks (ResNets) are an effective variant of CNNs that use identity shortcuts to ease the difficulty of parameter optimization [18,19,20,21]. ResNets and their variants have been applied to fault diagnosis in a few papers [22,23,24,25,26,27,28]. For instance, C. Zhou et al. [24] analyzed COVID-19 chest X-ray images based on image regrouping and ResNet-SVM; this method reaches 93% accuracy on a relatively small dataset. M. Zhao et al. [27] proposed the deep residual shrinkage network (DRSN), an evolution of ResNets. Compared with the structure of ResNets, the DRSN adds a shrinkage block (a soft-threshold function), and the method is shown to be effective for fault diagnosis under heavy noise.
The developed residual neural networks can adapt to large-scale data and have good nonlinear expression capability, but much of the existing research is based on 1D fault vibration signals, which makes full use of the self-extraction capability of DL but also limits the diagnostic accuracy. This article develops a fault reconstruction characteristics classification method using ResNet-152 with inserted multi-layer convolutional kernels. The data are structured by convolutional units with multi-layer stacked convolutional kernels to enhance the nonlinear representation, and one-dimensional vibration signals processed by the Gramian angular summation field (GASF) are easier for convolutional layers to manipulate. The main contributions of this paper are as follows:
- (1)
Three-layer stacked convolutional kernels are inserted into ultra-deep ResNets to replace large-size or fewer-layer convolutional kernels and improve the nonlinear representation of feature images;
- (2)
The fault datasets are reconstructed to increase the data scale and retain the temporal features in the fault data, while reducing the difficulty of the convolution process;
- (3)
Research is conducted on the axle box bearings of subway trains to improve the efficiency and accuracy of diagnosis for this component.
Additionally, this study verifies the role of superimposed convolution kernels on a specific object for the first time on the basis of the theory; another novelty is the training of deep learning networks using reconstructed fault feature signals, as researchers have overlooked the importance of modest feature engineering while focusing too much on the powerful learning ability of deep learning.
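To make the GASF preprocessing mentioned above concrete, the following is a minimal sketch under stated assumptions: the rescaling range, segment length and the synthetic signal are illustrative, and this is a generic GASF mapping rather than the authors' exact pipeline.

```python
import numpy as np

def gasf(x):
    """Map a 1D signal to a Gramian angular summation field (GASF) image.

    The signal is rescaled to [-1, 1], encoded as polar angles via arccos,
    and the image is G[i, j] = cos(phi_i + phi_j).
    """
    x = np.asarray(x, dtype=float)
    # Min-max rescale to [-1, 1] so that arccos is well defined.
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(np.clip(x_scaled, -1.0, 1.0))          # polar-angle encoding
    # cos(phi_i + phi_j) = cos(phi_i)cos(phi_j) - sin(phi_i)sin(phi_j)
    return np.outer(np.cos(phi), np.cos(phi)) - np.outer(np.sin(phi), np.sin(phi))

# Illustrative 1D vibration segment (hypothetical data).
signal = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * np.random.randn(64)
image = gasf(signal)   # 64 x 64 image, ready for a convolutional network
```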
This paper is organized as follows: in Section 2, ResNets-related methods are introduced. In Section 3, the design of the fundamental architecture of ResNet-152-MSRF, the data reconstruction methods, the experimental protocols and the complete experimental results are presented. In Section 4, the experimental results are summarized, and the advantages and disadvantages of each model are reviewed.
2. Basic Components
2.1. Basic Structure of Residual Neural Networks
ResNets share many of the same components as traditional CNNs, such as convolution layers, the rectified linear unit (ReLU) activation function, batch normalization (BN), the loss function and pooling layers. In fact, the pooling layer, which downsamples fault-related features before passing them to the next block, is optional in many deep neural networks. The theory of these basic components is described as follows.
The convolution operation in a CNN is the key component of the entire network and is the essential difference from a fully connected (FC) neural network. The convolutional layer in a CNN can effectively reduce the number of trainable parameters, so that the training speed of the model is greatly improved. In a neural network, the fewer the trainable parameters, the less likely the network is to overfit. The convolution operation can be expressed as follows:
$$y_j(u,v) = f\left(\sum_{i}\sum_{m,n} x_i(u+m,\, v+n)\, k_{ij}(m,n) + b_j\right)$$
where $x_i$ is the $i$th channel of the input feature map, $k_{ij}$ is the $j$th corresponding convolution kernel, $f(\cdot)$ is the activation function of the corresponding layer, $b_j$ is the corresponding bias term, $(u,v)$ and $(m,n)$ are the relative positions during feature mapping, and $y_j$ is the $j$th channel of the output feature map. The convolution operation can be repeated several times to obtain a large number of feature maps.
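To make the indexing above concrete, here is a minimal NumPy sketch of the multi-channel convolution (valid padding, unit stride; the shapes and the ReLU default are illustrative assumptions, and, as in most CNN literature, the operation implemented is technically cross-correlation):

```python
import numpy as np

def conv2d(x, k, b, f=lambda z: np.maximum(z, 0)):
    """Multi-channel 2D convolution: y_j = f(sum_i x_i * k_ij + b_j).

    x: input feature map, shape (C_in, H, W)
    k: kernels, shape (C_in, C_out, Kh, Kw)
    b: biases, shape (C_out,)
    f: activation function (ReLU by default)
    """
    c_in, h, w = x.shape
    _, c_out, kh, kw = k.shape
    y = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for j in range(c_out):                        # each output channel y_j
        for u in range(y.shape[1]):
            for v in range(y.shape[2]):
                patch = x[:, u:u + kh, v:v + kw]  # local region of all input channels
                y[j, u, v] = np.sum(patch * k[:, j]) + b[j]
    return f(y)
```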
BN is an important technique for normalizing feature data as a trainable process inserted into the deep learning architecture [29,30]. Training a deep network is a complex process: whenever a small change occurs in the first few layers of the network, it is cumulatively amplified through the later layers. Hence, the purpose of BN is to reduce the internal covariate shift, in which updates of the front layers' training parameters change the distribution of the back layers' input data. In effect, BN forces the distribution of the input value of any neuron in each layer of the network back to a standard normal distribution with a mean of zero and a variance of one, so that the activation input value falls in the region where the nonlinear function is most sensitive to its input. The BN operation is expressed as follows:
$$y_i = \gamma\,\frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$
where $x_i$ and $y_i$ represent the input and output features of the $i$th observation in a mini-batch, $\mu_B$ and $\sigma_B^2$ are the mean and variance of the mini-batch, $\gamma$ and $\beta$ are two trainable parameters to adjust the distribution, and $\epsilon$ is a constant that tends to zero.
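A minimal NumPy sketch of the BN transform above, applied per feature over a mini-batch (the shapes and the value of $\epsilon$ are illustrative assumptions):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """y_i = gamma * (x_i - mu_B) / sqrt(sigma_B^2 + eps) + beta.

    x: mini-batch of features, shape (N, D); gamma, beta: shape (D,).
    """
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # standardize toward N(0, 1)
    return gamma * x_hat + beta             # trainable rescale and shift
```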
The loss function is used to measure the quality of a set of parameters by comparing the difference between the expected output and the true output. In multi-category tasks, the cross-entropy error is usually the objective function to be minimized. Compared with other traditional error functions, cross-entropy promises a higher training efficiency. Apart from that, in order to strengthen the features, cross-entropy is usually used with the softmax function, which maps the output to values between zero and one. The softmax function can be expressed as follows:
$$p_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}, \qquad z_j = W_j x$$
the first step is to take the $j$th row of the weight matrix $W$ and multiply that row with the input $x$, computing $z_j$ for all $j$, and then the softmax function is applied to obtain a normalized probability. Cross-entropy is expressed as follows:
$$E = -\sum_{j} t_j \log p_j$$
where $t_j$ is the $j$th actual probability of the observation. After calculating the cross-entropy error, the gradient descent algorithm is used to optimize the parameters, and the network is fully trained after several iterations.
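The two formulas combine into a short, numerically stable sketch (NumPy; the logits and the one-hot target are illustrative values, not data from the paper):

```python
import numpy as np

def softmax(z):
    """p_j = exp(z_j) / sum_k exp(z_k), with max subtraction for stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, t):
    """E = -sum_j t_j * log(p_j); t holds the actual (one-hot) probabilities."""
    return -np.sum(t * np.log(p + 1e-12))

z = np.array([2.0, 0.5, -1.0])   # logits z_j = W_j x (illustrative)
t = np.array([1.0, 0.0, 0.0])    # true class distribution (one-hot)
loss = cross_entropy(softmax(z), t)
```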
2.2. Insertion of Multi-Scale Superimposed Receptive Field
In this section, the motivation for fault characteristics reconstruction and the multi-scale superimposed receptive field inserted into the architecture of the deep residual network are introduced.
The receptive field corresponds to the convolutional kernel, which realizes local perception of the corresponding input; the implementation is a weighted summation over a local region of the input. The kernel size must be larger than 1 to enlarge the receptive field, so 1 × 1 kernels are not used for feature extraction here. Convolution kernels of even size cannot guarantee that the input and output feature map sizes stay equal even if padding is added symmetrically (e.g., if the input is 4 × 4, the convolution kernel is 2 × 2 and the padding is 1 on each side, there are 5 outputs per direction after sliding, which does not correspond to the input). Compared with bigger convolution kernels, multi-layer stacked small-sized convolution kernels come with more activation functions, richer features and greater discriminative power: each convolution operation is accompanied by an activation function, so using more convolution layers makes the decision function more discriminative.
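The sliding arithmetic and the stacking argument follow from two standard formulas, sketched below (the function names are ours; both formulas assume unit stride):

```python
def conv_out_size(w, k, p, s=1):
    """Standard output-size formula for a convolution: (W - K + 2P) / S + 1."""
    return (w - k + 2 * p) // s + 1

def stacked_receptive_field(n, k=3):
    """Receptive field of n stacked k x k kernels with unit stride: 1 + n*(k-1)."""
    return 1 + n * (k - 1)

print(conv_out_size(4, 2, 1))        # 5 -> no longer matches the 4 x 4 input
print(stacked_receptive_field(2))    # 5 -> two 3x3 layers cover a 5x5 field
print(stacked_receptive_field(3))    # 7 -> three 3x3 layers cover a 7x7 field
```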
Replacing a large-size convolution kernel with multi-layer stacked convolutional kernels involves a parameter calculation. Table 1 compares the use of stacked convolutions across different kinds of networks and across the same kind of network at different depths, including VGG-16, VGG-19, ResNet-50 and ResNet-152. As shown in Table 1, multi-layer stacked convolutional kernels have a larger number of parameters than large-size or fewer-layer convolutional kernels, but the parameter growth rates all remain below 1%: replacing the 7 × 7 convolutional kernels with 3 × 3 + 3 × 3 + 3 × 3 stacked convolutional kernels in the VGG-19 network gives the smallest parameter growth rate of 0.06%, and replacing the 5 × 5 convolutional kernels with 3 × 3 + 3 × 3 stacked convolutional kernels in the ResNet-152 network gives the largest parameter growth rate of 0.9%.
The superiority of the residual network can also be seen in Table 1: the number of trainable parameters of the ResNet with a depth of 152 layers is less than 20% of that of the VGG networks with only 16 or 19 layers, so ResNets are well suited for embedding stacked convolutional kernels, which stabilize the number of parameters while keeping the network lightweight.
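For intuition about the growth rates quoted from Table 1, the weight count of a single convolution layer can be computed directly. This is a rough sketch with an illustrative channel width; note that with equal channel widths a single replacement can even reduce the weight count, so the sub-1% whole-network figures in Table 1 cannot be reproduced from one layer alone:

```python
def conv_params(c_in, c_out, k):
    """Trainable weights of one k x k convolution layer (bias terms included)."""
    return c_in * c_out * k * k + c_out

c = 64  # illustrative channel width; Table 1 counts whole networks, not one layer
print(conv_params(c, c, 5), 2 * conv_params(c, c, 3))   # 102464 vs 73856
print(conv_params(c, c, 7), 3 * conv_params(c, c, 3))   # 200768 vs 110784
# The net sub-1% growth in Table 1 is a whole-network figure that also depends
# on the channel widths at each replaced position and on the extra bias/BN
# parameters that each added layer brings.
```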
3. Design of Fundamental Architectures for ResNet-152-MSRF
In this section, the architecture of ResNet-152-MSRF is elaborated.
The core part of the developed ResNet-152-MSRF is shown in Figure 1; the convolution process is achieved by stacked convolutional kernels. The input image is passed to the next stage as tensor data with deep nonlinear features through the action of a three-layer stacked convolution kernel. Conventional neural networks gradually lose local features through pooling at each layer as the depth increases, which is fatal for fault diagnosis; ResNets, on the other hand, enable ultra-deep network structures. ResNet-152-MSRF has 152 convolutional layers, each of which performs three nonlinear transformations because of the embedded 3-layer stacked convolutional kernel; average pooling is used between layers while the number of feature channels increases and the feature size decreases. Figure 2 shows the overall architecture of ResNet-152-MSRF. The input features of the previous convolutional layer are added to its output features by an identity shortcut, with the prerequisite that the shapes are the same (e.g., if the input shape of the previous layer is 64 × 64 × 16, then the output feature shape must be the same). A Dropout(0.5) function is used between convolutional layers to randomly deactivate neurons and prevent overfitting, where 0.5 is the random neuron discard rate. Finally, the fully connected layer is attached, with the number of output nodes equal to the number of categories.
The advantage of this architecture is that it can be trained well on large datasets, and the network is deep enough to perform sufficient nonlinear transformations for the computer to discriminate features.
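The description above can be made concrete with a hedged PyTorch sketch of one residual unit: a three-layer stack of 3 × 3 kernels, an identity shortcut requiring matching shapes, and Dropout(0.5) between units. This is a minimal reading of Figures 1 and 2, not the authors' exact implementation, and all channel and feature-map sizes are illustrative.

```python
import torch
import torch.nn as nn

class StackedResidualUnit(nn.Module):
    """One residual unit with a 3-layer stack of 3x3 convolutions (sketch).

    Three stacked 3x3 kernels give three nonlinear transformations per unit;
    padding=1 preserves the feature-map shape so the identity shortcut can be
    added directly, and Dropout(0.5) follows the unit as described in the text.
    """
    def __init__(self, channels: int):
        super().__init__()
        layers = []
        for _ in range(3):  # three stacked 3x3 kernels; padding keeps the shape
            layers += [
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.stack = nn.Sequential(*layers)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity shortcut: the input is added to the stacked-convolution output.
        return self.dropout(self.stack(x) + x)

# Illustrative shapes: a 16-channel 64x64 feature map passes through unchanged.
unit = StackedResidualUnit(channels=16)
out = unit(torch.randn(1, 16, 64, 64))   # shape preserved: (1, 16, 64, 64)
```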