1. Introduction
The rolling bearing is one of the most crucial components of rotating machinery, which is widely used in industrial applications [1,2]. Owing to harsh working environments and variable heavy loads, rolling bearings are prone to many types of faults, which may cause severe accidents and financial losses. Therefore, accurate fault diagnosis of rolling bearings is of great significance for ensuring the safety and operational stability of mechanical systems [3,4].
With the continuous development of artificial intelligence technology in industry, diagnosis methods based on machine learning are widely used in the intelligent fault diagnosis of rotating machinery [5]. However, traditional machine learning methods require internal parameters to be set manually, which demands considerable background knowledge and professional experience. Moreover, traditional machine learning methods cannot adaptively learn the extracted signal features; thus, their recognition ability is limited. To address these issues, deep learning methods have been introduced into fault diagnosis. Because deep learning models have powerful modeling and image feature extraction capabilities, many previous studies have converted one-dimensional vibration signals into two-dimensional images as model input. He et al. [6] processed sensor data with the short-time Fourier transform (STFT) to obtain spectrum images. Tao et al. [7] also applied the STFT to convert raw vibration signals into images. Shao et al. [8] generated images of the raw signal using the continuous wavelet transform (CWT). Wang et al. [9] obtained 2D signal representation maps with the synchro-extracting transform (SET). However, most of these signal-to-image methods rely heavily on expert experience to set appropriate internal parameters. To address this problem, some researchers have introduced the Gramian Angular Field (GAF) method, which converts signals into images without parameter selection [10]. Tang et al. [11] decomposed vibration signals to obtain appropriate signal components and converted them into images by GAF. Han et al. [12] compared GAF with the Markov Transition Field (MTF) and verified the superiority of GAF in information preservation. As a type of GAF, the Gramian Angular Difference Field (GADF) obtains a matrix by calculating the trigonometric difference between each pair of points. It maintains the temporal dependency of the signal and preserves abundant features through its polar coordinate representation. Therefore, GADF is employed in this paper to transform vibration signals into images.
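As a rough illustration of the GADF idea described above, the following sketch uses the commonly cited formulation: the signal is rescaled to [-1, 1], each sample is mapped to an angle by the inverse cosine, and the image entry (i, j) is the sine of the angle difference. The function name `gadf` and the min-max rescaling are assumptions made for this example; the exact formulation used in this paper is given in Section 2 and may differ in detail.

```python
import numpy as np

def gadf(signal):
    """Gramian Angular Difference Field of a 1-D signal (illustrative sketch)."""
    x = np.asarray(signal, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("a constant signal cannot be rescaled to [-1, 1]")
    # Min-max rescaling to [-1, 1] so that arccos is defined for every sample.
    x = (2.0 * x - hi - lo) / (hi - lo)
    x = np.clip(x, -1.0, 1.0)  # guard against rounding slightly outside the range
    # Polar encoding: each sample becomes an angle in [0, pi].
    phi = np.arccos(x)
    # Entry (i, j) is sin(phi_i - phi_j), the trigonometric difference of points i and j.
    return np.sin(phi[:, None] - phi[None, :])

# Example: a short sine segment becomes a 64 x 64 image.
image = gadf(np.sin(np.linspace(0.0, 4.0 * np.pi, 64)))
print(image.shape)
```

The resulting matrix is antisymmetric with a zero diagonal, and the row and column indices follow the time order of the samples, which is how the temporal dependency is retained in the image.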
Owing to the complex environment and the influence of vibration from other mechanical components, the bearing fault vibration signal collected by the sensor contains background noise, which affects the accuracy of fault diagnosis [13,14]. Bearing fault features can be extracted by optimally filtering the signal to obtain clear periodic impact components. Moreover, the collected bearing fault signal can be regarded as the convolution of the impact signal with the transmission path, so the fault impact signal can be recovered by deconvolution [15]. Endo et al. [16] introduced minimum entropy deconvolution (MED) to improve the diagnosis of gear tooth faults and achieved good performance. However, the MED algorithm can extract only individual impulses and may produce spurious impulse components. Moreover, the iterative procedure of MED is complex, and its search for the optimal filter is inefficient. Considering these drawbacks, McDonald et al. [17] proposed maximum correlated kurtosis deconvolution (MCKD), which uses the correlated kurtosis norm as the objective function of the filter. Wang et al. [18] denoised vibration signals by MCKD and effectively emphasized periodic impulses. Jia et al. [19] combined MCKD with an improved spectral kurtosis to diagnose early bearing faults. Although MCKD can extract more impulse components than MED, it can still extract only a limited number of impulses. In addition, the internal parameters of MCKD must be set from prior knowledge, which means that noise reduction is effective only when the parameters are selected appropriately. To address the issues of these two methods, the Multipoint Optimal Minimum Entropy Deconvolution Adjusted (MOMEDA) method was developed [20]. Because the bearing fault period is unpredictable in practical engineering, MOMEDA presets a range of target periods and deconvolves the signal for each candidate period; the multipoint kurtosis (MKurt) is then obtained by calculating the kurtosis of the filtered signals. When a bearing component fails, the multipoint kurtosis spectrum shows significant peaks at the bearing fault period and its harmonics, which reflect the fault information of the component. McDonald et al. [20] successfully applied MOMEDA to gearbox fault detection. However, because kurtosis is sensitive to accidental pulses and not robust against noise, multipoint kurtosis may give a wrong indication when the signal contains accidental pulses and heavy noise [21]. Considering that L-kurtosis is more robust to spurious noise spikes than kurtosis, this paper develops a method for constructing temporal features of the multipoint envelope L-kurtosis (MELkurt) and combines it with GADF to propose an enhanced image representation method for vibration signals.
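The robustness argument above can be illustrated with a small numerical sketch. It compares the ordinary sample kurtosis with the sample L-kurtosis (the ratio of the fourth to the second L-moment, computed here from probability-weighted moments of the sorted samples) on Gaussian noise with and without a single accidental spike. This is only the standard L-kurtosis statistic, not the MELkurt indicator designed in this paper; the function names, the spike amplitude, and the signal length are assumptions chosen for the example.

```python
import numpy as np

def kurtosis(x):
    """Ordinary (non-excess) sample kurtosis: fourth central moment over squared variance."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d**4) / np.mean(d**2) ** 2

def l_kurtosis(x):
    """Sample L-kurtosis (lambda_4 / lambda_2) from probability-weighted moments."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(n)  # zero-based rank of each ordered sample
    b0 = x.mean()
    b1 = np.sum(i * x) / (n * (n - 1))
    b2 = np.sum(i * (i - 1) * x) / (n * (n - 1) * (n - 2))
    b3 = np.sum(i * (i - 1) * (i - 2) * x) / (n * (n - 1) * (n - 2) * (n - 3))
    lam2 = 2.0 * b1 - b0
    lam4 = 20.0 * b3 - 30.0 * b2 + 12.0 * b1 - b0
    return lam4 / lam2

# Hypothetical test signal: Gaussian noise, then the same noise with one large spike.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)
spiky = clean.copy()
spiky[100] = 50.0  # a single accidental pulse

print("kurtosis   clean / spiky:", kurtosis(clean), kurtosis(spiky))
print("L-kurtosis clean / spiky:", l_kurtosis(clean), l_kurtosis(spiky))
```

For Gaussian data the kurtosis is close to 3 and the L-kurtosis close to 0.1226. Because kurtosis raises deviations to the fourth power, one spike inflates it by orders of magnitude, whereas L-kurtosis is a linear combination of the ordered samples and changes only modestly.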
Owing to their powerful feature learning and extraction performance, intelligent diagnosis methods based on deep learning have been applied in various engineering areas [22]. In particular, models based on the convolutional neural network (CNN) have been widely studied for bearing fault diagnosis [23,24]. Wang et al. [25] combined the squeeze-and-excitation (SE) network with a CNN to propose SE-CNN, using symmetrized dot pattern (SDP) images of vibration signals as input. Wen et al. [26] designed a new Transfer CNN (TCNN) and incorporated the architecture of TCNN with Visual Geometry Group 19 (VGG-19). Yao et al. [27] introduced the butterfly-transform (BFT) module into MobileNet V3 and proposed BFT-MobileNet V3, which achieved better diagnosis accuracy with less computation. Chen et al. [28] proposed a fault diagnosis method that combines Cyclic Spectral Coherence (CSCoh) with a CNN, which effectively improved the recognition accuracy of bearing faults. CNN-based models have thus been applied successfully to a variety of fault diagnosis problems. However, CNN-based models are weak at learning relationships between different pixel regions and need more convolutional layers to capture global information. If the background noise increases or the application scenario changes, the diagnostic accuracy and stability of CNNs decline because of their limited transfer capability. Motivated by the remarkable achievements of transformer architectures in natural language processing, many researchers have introduced transformer-based models to image processing, where they have shown excellent transfer and modeling capabilities. To extend transformer-based models to bearing fault diagnosis, this paper introduces the Vision Transformer (ViT) and enhances its performance [29]. First, to overcome the shortcomings of ViT in modeling links between different local areas, we introduce the Super Token Transformer block and the Super Token Mixer (STM) module [30]. Second, Conditional Positional Encoding (CPE) is incorporated into the designed model to improve its generalization ability [31]. On this basis, we propose a novel deep learning method named the Conditional Super Token Transformer (CSTT).
In this work, a novel intelligent diagnosis approach is established based on an enhanced vibration signal image representation method and CSTT. MOMEDA is combined with the designed Multipoint Envelope L-Kurtosis to enhance the fault features of vibration signals. GADF is then applied to convert the enhanced signals into images, yielding distinguishable feature representations of different bearing faults. Finally, the proposed Conditional Super Token Transformer is used to recognize rolling bearing fault types by taking advantage of its feature extraction capability.
The remainder of this paper is organized as follows. Section 2 describes the principles of MOMEDA and GADF and introduces the details of the designed Multipoint Envelope L-Kurtosis. Section 3 introduces the proposed CSTT and its theoretical background. Section 4 presents the proposed bearing fault diagnosis framework, based on the enhanced vibration signal image representation method and CSTT. In Section 5, the proposed method is validated, and comparisons are carried out on two different datasets. Finally, the main conclusions are summarized in Section 6.
6. Conclusions
This work presents a novel deep learning method for rolling bearing fault diagnosis based on MELkurt, GADF, and CSTT. By combining MELkurt with GADF, an enhanced image representation method for vibration signals is developed. The designed MELkurt is superior to MKurt for enhancing fault signal features because it is more robust to background noise. GADF converts the obtained MELkurt temporal signals into images without internal parameters having to be set in advance, which avoids heavy reliance on prior knowledge. In addition, the GADF images preserve the variable features and the temporal dependency of the signals. To extract the features of the GADF images effectively and automatically, the original Vision Transformer (ViT) is improved by incorporating the Super Token Transformer block, the Super Token Mixer (STM) module, and the Conditional Positional Encoding (CPE) mechanism, yielding the Conditional Super Token Transformer (CSTT). On two experimental datasets, the results show that the GADF image datasets of MELkurt achieve higher diagnostic accuracy and better stability than those of MKurt, which indicates that the MELkurt spectra are more suitable for feature visualization. In comparison with the ViT and several CNN-based models, the proposed CSTT outperforms them, with an average recognition accuracy of 100% and a standard deviation of 0. The proposed method exhibits outstanding performance in bearing fault diagnosis, with excellent feature extraction and generalization ability.
In future work, the proposed model will be applied to diagnose more bearing states with different severity levels. In addition, multimodal information fusion will be considered to further improve the diagnosis accuracy of the proposed method.