3.1. Application Example in CMAPSS Dataset
The database chosen for the study is a variant of the Commercial Modular Aero-Propulsion System Simulation (CMAPSS), which is publicly available and recognized as one of the datasets most frequently used for benchmarking prediction algorithms. It was recently updated through joint work between NASA and ETH Zurich's Intelligent Maintenance Systems center, so that the sensor sampling rate has been increased to 1 Hz, making it suitable for studying models oriented toward large volumes of data.
The CMAPSS-2 [31] is composed of a set of synthetic run-to-failure (RTF) trajectories, that is, artificial degradation histories of nine turbofan engines produced by the simulator from real flight conditions, which are characterized by the scenario-descriptor variables: altitude, Mach number, throttle-resolver angle (TRA), and total temperature at the fan inlet. The dataset is divided into six units designated for training and another three for testing, whose operating conditions differ slightly from the others. In this study, only the training data from CMAPSS-2 were used, which does not compromise the feasibility study since the tested model is unsupervised and, therefore, uses only a part of the samples from each unit for training.
The inserted degradation pattern is of a continuous type and is divided into four states: the degradation condition at the beginning of the operation; the normal state; a transition zone between the normal and abnormal conditions; and an abnormal state. The simulation considers the alternating presence of failure modes in the main sub-components of the engine: fan, LPC, HPC, HPT, and LPT. Their deterioration is modeled by adjustments in flow capacity and efficiency. More information about the modeling can be found in Chao et al.'s work [32].
Figure 5 outlines the allocation of the main subsystems of a turbofan engine.
In this application example, the units have been subjected to high- and low-pressure turbine failure modes, with a random initial deterioration of about 10% of the health index implicit in the simulator.
Table 3 details the failure modes for each unit and provides additional information on the number of samples, the transition time to abnormality, and the end of life (teol) in cycles.
Figure 6 details the trajectory imposed on the flow and efficiency modifiers for the tested units.
This application example follows the framework with fixed sets of hyperparameters and a fixed neural network architecture, whose feature space is composed of 18 variables, which are the same condition-monitoring signals used by Chao et al. [32]. In addition, a detailed description of the CMAPSS simulator variables can be found in [31].
The autoencoder models are subject to a validation procedure that consists of two steps: the first one is to evaluate whether its performance (through an analysis of the metrics presented in
Section 2.4) surpasses that of a simplified baseline model, which does not use deep learning, and the second one is to compare it with alternatives presented in the literature that employ similar techniques and databases.
The baseline model is built from a simple regression-extrapolation procedure applied to the pre-processed original inputs of the database, following this sequence of steps: downsampling at a rate of one sample every 200 (without crossing the limits of operational cycles), then smoothing with a simple moving average of size 500, so that the samples of this model and of the one submitted for validation are comparable, and finally applying the methodology (see Figure 2) and evaluating performance with the metrics of Section 2.4.
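The baseline pre-processing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the DataFrame layout, and the `unit` column are assumptions; only the downsampling rate (one sample every 200, restricted to each unit) and the moving-average size (500) come from the text.

```python
import numpy as np
import pandas as pd

def preprocess_baseline(df, unit_col="unit", step=200, window=500):
    """Downsample each unit's signals and smooth with a simple moving average.

    Grouping by unit ensures that neither the downsampling stride nor the
    moving-average window crosses operational-cycle/unit boundaries, as the
    text requires. Column names and defaults are illustrative assumptions.
    """
    out = []
    for _, g in df.groupby(unit_col, sort=False):
        # Downsample within the unit (1 sample every `step`).
        g = g.iloc[::step]
        # Simple moving average over the downsampled signals.
        sig = g.drop(columns=[unit_col]).rolling(window, min_periods=1).mean()
        sig[unit_col] = g[unit_col].values
        out.append(sig)
    return pd.concat(out, ignore_index=True)
```

The per-unit grouping is the important design point: a global rolling window would blend samples from the end of one unit's life into the start of the next.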
3.2. Results from Application Example
The MSE loss convergence during the networks’ training progression is shown in
Figure 7 and the progression of useful life estimations over the course of the operation of the units is presented in
Figure 8 and
Figure 9. The time instant
,
x-axis, is normalized in relation to the total life (
) of the motors and is interpreted as a percentage (0–100%) of
or as normalized cycles. The y-axis indicates the predicted RUL at instant
t (also expressed as a percentage of
) and the orange dashed line, the real value of the
(that is,
at that instant. It is noted that the beginning of the forecast differs from the units since it is directly related to the abnormality detection capacity, which is made by a criterion similar to that used by Rosa et al. [
23], wherein there is a difference in a consecutive set of points of the maximum reconstruction error between the samples in NOC.
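A detection criterion of this kind can be sketched as below. This is an interpretation, not the exact rule of Rosa et al. [23]: the threshold (maximum reconstruction error over the NOC samples) follows the text, while the required run length of consecutive exceedances is an illustrative parameter the paper does not specify.

```python
import numpy as np

def first_detection(errors, noc_errors, run_length=5):
    """Return the index where abnormality is first flagged, or None.

    A unit is declared abnormal at the start of the first run of
    `run_length` consecutive reconstruction errors that all exceed the
    maximum error observed on the NOC (normal-operating-condition)
    samples. `run_length` is an assumed, illustrative value.
    """
    threshold = np.max(noc_errors)
    above = np.asarray(errors) > threshold
    count = 0
    for i, flag in enumerate(above):
        count = count + 1 if flag else 0
        if count >= run_length:
            return i - run_length + 1  # start of the consecutive run
    return None
```

Requiring several consecutive exceedances rather than a single one makes the detector robust to isolated noise spikes, which is why the first prediction time differs between units.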
For all the analyzed models, the time of the first prediction (tfpt) occurred after half of the degradation time of the engines. The range from 50% to 65% constitutes a region of instability in the forecasts, in which some remaining-life estimates exceed the value of teol (near 100%) or underestimate it (close to 1%). This is because the deterioration trends are incipient and have a low rate of change, which makes it difficult for the algorithm to decide which of the curves is the most appropriate, as some have a very similar fit condition. After 70% of teol, a stable convergence zone is formed, and the adherence of the projections to the real RUL curve gradually improves up to 100%, which is the desired behavior. Compared to the Baseline, the proposed models advance to a stable condition much earlier (~65% versus ~80%).
Of the three models tested, the Conv-1d showed the best result in terms of early convergence to the actual prognostic result for all units. It can be seen from
Figure 8 that it is the model with the most anticipated first average prediction time of all the units and adheres to the reference line of progression of the RUL in about 75% of the
teol. The MLP model visually manifested a behavior similar to the convolutional one and also presented a zone of forecast instability with high fractional RMSE, but with time-stamping metrics (tfpt, HT(5), and HT(20)) that are later than those of the convolutional model. The LSTM model did not show a concentrated region of large prediction errors like the previous two, but it did show sparse peaks of high errors for two or three cycles in units 2, 16, 15, and 5. Although it may seem that the LSTM provides more stable predictions, gaps in the forecasts may in fact occur, especially in the region of 60–75% teol, in which large-magnitude discrepancies are suppressed by the restriction of the algorithm to disregard RUL estimates if r(t) + t exceeds teol by more than 300 cycles.
For a moving average subsequence of
n = 500, it can be seen that the three autoencoder models outperform the Baseline, which starts to provide consistent forecasts only after 80 normalized cycles have elapsed. An increase in the time window of the moving average could have a positive impact, especially on the base model, as it benefits the most from signal attenuation in regions of instability. However, increasing n reduces the number of samples of each unit available for curve fitting in the prediction algorithm, so that the RUL of some units listed in
Table 3 could not be calculated.
The difference between the forecast and the actual value of the RUL, also expressed as a percentage of teol, is shown in
Figure 10 and
Figure 11. The blue dashed lines indicate 20% error limits in
Figure 10 and 5% error limits in
Figure 11, which is taken as a reference for calculating the prognostic horizon. The proposed method manages to keep the estimates within the error margin of ±20% teol, but it has difficulty meeting the goal of ±5% teol, with only a few units achieving this result even after 80% of the machine's life. There are two possible reasons for this behavior: the first, mentioned above, is the absence of a global tuning of the model, including the neural architecture, which is not at its optimal performance in terms of training with NOC samples; the second is the uncertainty regarding the choice of the error threshold for the prognosis, which can inflate the estimates above what was expected.
The summary of the results obtained for the values of the performance metrics is presented in
Table 4, while
Table A1 (in
Appendix A) presents all the results organized by unit. Both tables show the RMSE; the fractional RMSEs L1, L2, and L3; the time of the first prediction (tfpt); the metric of Equation (6) divided by the total number of estimates; the cumulative relative accuracy; and the prognostic horizons for 5% and 20% errors. It should be noted that L1, L2, and L3 stand for the RMSE fraction computed only over samples inside the first, second, and third thirds of the second half of the normalized teol, respectively.
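The fractional RMSEs can be illustrated with the short sketch below. The function name and interface are assumptions; the bin edges follow the paper's definition of L1, L2, and L3 as the first, second, and third thirds of the second half of the normalized life (i.e., [0.5, 0.667), [0.667, 0.833), and [0.833, 1.0]).

```python
import numpy as np

def fractional_rmse(t_norm, rul_true, rul_pred):
    """RMSE restricted to the three thirds of the second half of life.

    `t_norm` holds prediction instants normalized to [0, 1] by teol.
    Returns a dict {"L1": ..., "L2": ..., "L3": ...}; NaN where a bin
    received no predictions. Illustrative sketch, not the authors' code.
    """
    t = np.asarray(t_norm, dtype=float)
    err = np.asarray(rul_pred, dtype=float) - np.asarray(rul_true, dtype=float)
    edges = [0.5, 0.5 + 1 / 6, 0.5 + 2 / 6, 1.0]
    out = {}
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:]), start=1):
        # Last bin is closed on the right so t = 1.0 (end of life) is included.
        mask = (t >= lo) & (t <= hi) if k == 3 else (t >= lo) & (t < hi)
        out[f"L{k}"] = float(np.sqrt(np.mean(err[mask] ** 2))) if mask.any() else np.nan
    return out
```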
There is no expressive gain in the RMSE of the proposed model when compared to the Baseline (−15.64%,
Table 4) due to the rough projections made at the beginning of the degradation process. When these prognostic samples are disregarded, it is possible to notice a performance gain for this metric, which is expressive from the third third (L3) onward and improves with the proximity of teol, therefore quantitatively corroborating the notion that the proposed models advance to the convergence zone before the Baseline.
An inspection of Table 4 reveals that the global RMSE is lower for the Baseline model. The reasons are that the Baseline produced fewer and later estimates in comparison to the autoencoders, as can be seen in Figure 8 and Figure 9, and, closer to the end of life, the prediction errors tend to be smaller due to the presence of more information about the pronounced degradation. The models start to equate in performance as they approach the stable convergence zone, with only a slight divergence between the L3 values. Although the Baseline also has a lower L3, it should be noticed that it has performed fewer predictions even in that region (see Figure 9 and Figure 11). The prognostic horizon is certainly greater for the autoencoder-based solutions, highlighting the Conv-1d, which has the earliest tfpt, so a correlation with the prognostic horizon was already expected. The difference between tfpt and the prognostic horizon could be interpreted as a latency of the model or as an acceptable error margin.
Moreover, the other error evaluation metric (Equation (6)) differed from the RMSE's outcomes by showing similar values across all models. This fact is justifiable because, even though the models differed significantly in global accuracy, all of them displayed a greater tendency to overestimate predictions, which is penalized by this metric. CRA, in turn, follows the RMSE behavior, as they are almost analogous measurements when a linear weighting (Equation (10)) is taken.
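A cumulative relative accuracy of this kind can be sketched as follows. The exact weighting of the paper's Equation (10) is not reproduced here; the sketch uses the common definition RA(t) = 1 − |RUL_true(t) − RUL_pred(t)| / RUL_true(t) with weights growing linearly toward the end of life, which is an assumption consistent with the "linear weighting" mentioned in the text.

```python
import numpy as np

def cra(rul_true, rul_pred):
    """Cumulative relative accuracy with linear weights (illustrative).

    Computes the relative accuracy RA at each prediction instant and
    aggregates it as a weighted mean, with weights increasing linearly
    so that late-life predictions count more. The normalization choice
    is an assumption; the paper's Equation (10) may differ in detail.
    """
    rt = np.asarray(rul_true, dtype=float)
    rp = np.asarray(rul_pred, dtype=float)
    ra = 1.0 - np.abs(rt - rp) / rt          # relative accuracy per instant
    w = np.arange(1, len(ra) + 1, dtype=float)  # linear weights
    return float(np.sum(w * ra) / np.sum(w))
```

A perfect predictor yields CRA = 1, and overestimation or underestimation reduces it in proportion to the relative error, which is why CRA tracks the RMSE ranking under a linear weighting.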
Generally, the proposed autoencoder models are more stable than the Baseline model, detect abnormalities earlier, and enter the region of stable convergence earlier. They manage to meet the margin-of-error requirement below 20% of teol for at least a fourth of the unit's life but struggle to meet the 5% prognostic-horizon requirement.
Finally, the comparison with the literature is based on the publication by Chao et al. [
32], who also built deep learning models to estimate the RUL on the CMAPSS-2 basis. This comparison aims to verify if the framework exhibits coherent behavior for the predictions over time. It is made by the qualitative inspection of the prediction errors’ progression, see
Figure 10 and
Figure 11, which is also plotted by Chao et al. [
32] for the same three kinds of layers used in this study. Moreover, some performance measurements taken in this work are compared with the results obtained by the cited author. They are the RMSE and prognostic horizon.
The presented models could not overcome the data-oriented arrangements programmed by Chao et al. [
32], nor is this the intention, as they use supervised learning, thus mapping the channel signature throughout the degradation evolution and not only in the NOC. Even if it is not possible to exceed this author in performance, it is important to note the great proximity between the mean squared error values in the stable convergence zone (L3). There is a great similarity between the behavior of the operating-time forecasts plotted by this author and the one shown in this work. Greater uncertainty is also demonstrated at the beginning of the forecasting process, which gradually reduces until
teol. It is observed that the use of a supervised technique allows a
tfpt very close to the beginning of the unit's life and that the supervised method purely derived from the ANN can make inferences almost in real time after being trained, as each new sample is generated (without the computational cost of curve fitting). On the other hand, supervised learning techniques tend to be more specific to the application (the failure mode) and have a shorter lifespan, requiring retraining to adapt to changes in the operating equipment.
Therefore, the advantage of our approach is its capacity to be easily implemented in an industrial context, which has the particularity of an abundance of engineering-system data in NOC with few recorded faults. Furthermore, real scenarios often have lower-quality data labels or unlabeled data. Our framework is designed with this observation in mind, since there is no need to attribute labels or even to discriminate sets of abnormal samples. Another point is that we elaborate a complete framework that embraces detection and prognostic models, while Chao et al. [32] focus only on training models for RUL estimation, without addressing scalability. In the end, the proposed framework is more suitable for use in different industrial domains and has an extensive application range because it does not require physical information or intensive knowledge about the fault's nature and its signature in the sensors' readings.