This section provides experimental results for (1) machine learning models obtained through a random search, (2) hyperparameter tuning for different validation metrics, and (3) a comparison with state-of-the-art models under different validation metrics.
5.1. Hyperparameter Search Ranges
In this work, we consider only convolutional neural networks (CNNs), as they contain many hyperparameters to tune. Finding good hyperparameter combinations is therefore more challenging than for, e.g., multilayer perceptrons, and the tuning of CNNs benefits more from an efficient evaluation strategy than the tuning of simpler neural network architectures. Moreover, CNNs have demonstrated good performance even in the presence of various hiding and/or masking countermeasures, and numerous SCA works consider them exclusively; see, e.g., [8,9,10,30].
Convolutional neural networks commonly consist of convolutional layers, pooling layers, and fully connected layers. A convolutional layer computes the outputs of neurons connected to local regions of the input, each taking a dot product between its weights and a small region of the input volume. Pooling decreases the number of extracted features by performing a down-sampling operation along the spatial dimensions. Finally, the fully connected layers compute the hidden activations or the class scores.
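The two core operations described above can be illustrated with a minimal NumPy sketch for one-dimensional side-channel traces: each convolution output is a dot product between a kernel and a local region of the trace, and pooling down-samples the resulting feature maps. The layer sizes and kernels here are purely illustrative.

```python
import numpy as np

def conv1d(trace, kernels, stride=1):
    """Each output neuron is the dot product of a kernel's weights with a
    local region of the input (valid padding, no bias)."""
    k = kernels.shape[1]
    n_out = (len(trace) - k) // stride + 1
    out = np.empty((kernels.shape[0], n_out))
    for f, w in enumerate(kernels):
        for i in range(n_out):
            out[f, i] = np.dot(w, trace[i * stride : i * stride + k])
    return out

def max_pool1d(feature_maps, size=2):
    """Down-sample along the spatial dimension; size=2 halves the
    number of extracted features."""
    n = feature_maps.shape[1] // size
    return feature_maps[:, : n * size].reshape(
        feature_maps.shape[0], n, size).max(axis=2)

# toy trace and two random kernels (illustrative values only)
rng = np.random.default_rng(0)
trace = rng.normal(size=32)
kernels = rng.normal(size=(2, 3))

fmaps = conv1d(trace, kernels)   # shape (2, 30)
pooled = max_pool1d(fmaps)       # shape (2, 15)
```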
Table 1 provides the selected ranges for the hyperparameter tuning processes. These ranges result in a search space containing $2.7\times 10^{9}$ possible combinations. As we can see, we allow CNNs to contain up to eight hidden layers, combining convolutional and dense layers; a pooling layer always follows each convolutional layer. As the ASCADf and ASCADr datasets contain 50,000 and 200,000 profiling traces, respectively, larger models would tend to overfit.
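A random search over such a space amounts to independently sampling each hyperparameter from its range; the total number of combinations is simply the product of the range sizes. The sketch below uses placeholder ranges, not the actual values from Table 1.

```python
import random
from math import prod

# placeholder ranges for illustration -- the actual values are in Table 1
search_space = {
    "conv_layers":   list(range(1, 5)),
    "dense_layers":  list(range(1, 5)),
    "filters":       [4, 8, 16, 32, 64, 128],
    "kernel_size":   list(range(2, 11)),
    "neurons":       [10, 20, 50, 100, 200, 300, 400, 500],
    "learning_rate": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
    "batch_size":    [100, 200, 400],
}

# the size of the search space is the product of the range sizes
n_combinations = prod(len(v) for v in search_space.values())

def sample_hyperparameters(rng=random):
    """Draw one random configuration for the random search."""
    return {name: rng.choice(values) for name, values in search_space.items()}
```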
5.2. Random Hyperparameter Search with Different Validation Metrics
We compare different early stopping metrics in a random hyperparameter search process for the two ASCAD datasets and different leakage models. The results for the CHES CTF dataset are only provided for the Hamming weight leakage model. Each randomly selected CNN is trained for 200 epochs, and we save the trained weights at the end of each epoch. At the end of the training, each early stopping metric indicates the best training epoch, and we restore the trained weights from that epoch. Then, once the training is finished, we compute GE on the attack set, which contains a larger number of traces. Note that 200 epochs is a relatively small number of training epochs, and, as shown in this section, simply stopping after the full 200 epochs may also deliver good results in some cases.
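The save-every-epoch/restore-best procedure can be sketched framework-agnostically: train for a fixed number of epochs, record the validation metric and a copy of the weights after each epoch, and afterwards restore the weights from the epoch the metric indicates as best. The `model` interface below is a hypothetical stand-in for a real network.

```python
import copy

def train_with_metric_restore(model, train_one_epoch, metric_fn,
                              epochs, lower_is_better=True):
    """Train for `epochs` epochs, saving the weights after every epoch;
    then restore the weights from the epoch the validation metric
    indicates as best. Returns the per-epoch metric values."""
    history = []
    for _ in range(epochs):
        train_one_epoch(model)
        history.append((metric_fn(model), copy.deepcopy(model.weights)))
    best_metric, best_weights = min(history) if lower_is_better else max(history)
    model.weights = best_weights   # restore the best-epoch weights
    return [m for m, _ in history]

# toy stand-in for a real network: "weights" is a single float
class ToyModel:
    def __init__(self):
        self.weights = 0.0

model = ToyModel()
metrics = train_with_metric_restore(
    model,
    train_one_epoch=lambda m: setattr(m, "weights", m.weights + 1.0),
    metric_fn=lambda m: abs(m.weights - 2.0),   # toy metric, best at 2.0
    epochs=3)
```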
Table 2 gives the number of validation traces $V$ considered for each early stopping metric, while the partition $Q$ is the number of traces used to calculate each specific metric. For instance, GE is the average of multiple key rank executions over $Q$ traces, randomly selected from the larger set $V$ for each key rank execution. We set $V$ greater than $Q$ so that the sampling of each subset $Q$ preserves a certain randomness; this way, the obtained results better indicate the generalization capacity of the models. For mutual information, we apply all $V$ validation traces. FGE estimation considers only 50 traces for $Q$ and 500 for $V$. We tested other values for $Q$ and $V$, from 20 to 200 (with a step of 10 traces), and 50 was the minimum value for $Q$ and $V$ that still preserves the best results for FGE. This range was selected to align with the number of attack traces usually reported in related works: state-of-the-art techniques commonly require between 100 and 200 attack traces to break the considered datasets. At the same time, by considering a setting with fewer than 100 attack traces, we allow for further improvements in the results.
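The sampling scheme above can be sketched in a few lines: each key rank execution draws $Q$ traces from the validation set of size $V$, accumulates the per-trace log-likelihoods per key candidate, and ranks the correct key; GE is the average rank over the executions. The function name and input layout are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fast_guessing_entropy(log_probs, correct_key, q, n_executions=100, rng=None):
    """Estimate GE as the average rank of the correct key over several
    key rank executions; each execution sums the log-likelihoods of q
    traces sampled from the larger validation set of size V = len(log_probs).

    log_probs: array of shape (V, n_keys) with log P(key | trace) per trace.
    """
    rng = rng or np.random.default_rng()
    v = len(log_probs)
    ranks = []
    for _ in range(n_executions):
        idx = rng.choice(v, size=q, replace=False)  # draw Q traces from V
        scores = log_probs[idx].sum(axis=0)         # accumulate log-likelihoods
        # rank 1 means the correct key candidate has the highest score
        ranks.append(1 + int(np.sum(scores > scores[correct_key])))
    return float(np.mean(ranks))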
We execute 500 searches for each dataset, considering the Hamming weight and identity leakage models.
Table 3 provides the average time overhead in percentage for each considered metric. As we can see, the FGE estimation adds at most 3.35% overhead among the four considered scenarios. For the ASCADr dataset, the overhead is only 1.19% and 1.49%, which can be considered negligible compared with its counterparts. As expected, the empirical GE and GEEA methods provide the largest overheads, although GEEA is faster than empirical GE. The mutual information method provides the second-best results, as its calculation is more straightforward than that of guessing entropy.
Table 4 provides the percentage of cases in which each metric selects a generalizing model with early stopping (a model that reaches GE = 1 in the attack phase, as indicated by line 13 in Algorithm 1) from the random search. Together with GEEA, fast GE is a highly efficient metric (top-two performance in all considered scenarios). Most importantly, we verified that FGE is always superior to the situation where no early stopping is used (200 epochs in the table), with negligible overhead. For the identity leakage model, FGE shows the best results.
Figure 2 shows the results for the ASCADf dataset. When side-channel traces are labeled according to the Hamming weight leakage model, the correct key is recovered with 514 traces for the GEEA metric and 534 traces (the second best) with the FGE early stopping metric. In the case of the identity leakage model, the best results are achieved with the FGE metric, where 101 attack traces are needed to reach a GE equal to 1, which is aligned with state-of-the-art results [8,9,21]. The good performance of the mutual information metric, compared with the GE obtained after the full 200 epochs, indicates the effectiveness of early stopping metrics in preventing the best model from overfitting. Again, we confirm that FGE is highly competitive in both leakage models and requires 10× fewer validation traces.
For the ASCADr dataset, the results for FGE are also very promising, as shown in Figure 3. For the Hamming weight leakage model, FGE provides the best results, followed by the mutual information metric. In the case of the identity leakage model, the best result is obtained after all 200 epochs, showing that this number of epochs is appropriate for the best model found through the random search. When early stopping is considered, the best results are obtained with the FGE metric.
Figure 4 provides results for the CHES CTF dataset. The FGE metric provides the second-best results after GEEA. The results for the CHES CTF dataset are only shown for the Hamming weight leakage model, as this dataset provides poor results with the identity leakage model, as discussed in [9].
Furthermore, the best models selected with empirical GE as an early stopping metric provided less efficient results. As already mentioned in [15], empirical GE requires a very large validation set, and a more stable GE estimation can be obtained by selecting larger validation sets. Of course, larger validation sets provide a better estimation of model generalization, which is especially important for models with suboptimal performance that require more traces to show a GE reduction for the correct key. However, computing GE over this large number of traces is undesirable for an early stopping metric due to the significant time overhead.
5.3. Hyperparameter Tuning with Different Validation Metrics
This section analyzes how the evaluated early stopping metrics perform with Bayesian optimization (BO) for the hyperparameter search [22]. For that, we consider the open-source Bayesian Optimization method provided in the keras-tuner [31] Python package. We run Bayesian optimization for 100 searches with the ASCAD datasets and the Hamming weight and identity leakage models. We repeat each search process five times for each early stopping metric. The guessing entropy results without early stopping (the “all epochs” labels in the figures of the previous section) are omitted because keras-tuner inherently implements early stopping, and it is therefore not possible to select the best model while ignoring early stopping. The results reported in this section are extracted from the best-found model out of the five search attempts.
The results from BO for the ASCADf dataset are shown in Figure 5. The best results are obtained with FGE for both the Hamming weight and identity leakage models. In particular, for the identity leakage model, as shown in Figure 5b, the best-found model achieves a GE equal to 1 with less than half of the attack traces needed for GEEA. In these experiments, mutual information provides less efficient results.
Figure 6 provides BO results for the ASCADr dataset. For the Hamming weight leakage model, GEEA and FGE provide the best results. For the identity leakage model, the results for FGE are superior: only 60 attack traces are required for key byte recovery, while empirical GE requires 10× more attack traces to succeed. Again, the mutual information metric delivers the worst results.
When running hyperparameter tuning with Bayesian optimization for the CHES CTF dataset and the Hamming weight leakage model, the results obtained with FGE are significantly better than those of the other validation metrics, as shown in Figure 7. FGE returns the best model, reaching a GE equal to 1 in the attack phase with only 232 traces, while the other metrics always require significantly more attack traces.