A New Phase Classifier with an Optimized Feature Set in ML-Based Phase Prediction of High-Entropy Alloys

Zhang, Yifan; Ren, Wei; Wang, Weili; Ding, Shujian; Li, Nan

doi:10.3390/app132011327


by Yifan Zhang ¹, Wei Ren ¹,²,*, Weili Wang ²,*, Shujian Ding ² and Nan Li ²

¹ School of Science, Xi’an University of Posts & Telecommunications, Xi’an 710121, China

² School of Physical Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China

* Authors to whom correspondence should be addressed.

Appl. Sci. 2023, 13(20), 11327; https://doi.org/10.3390/app132011327

Submission received: 15 August 2023 / Revised: 9 October 2023 / Accepted: 13 October 2023 / Published: 15 October 2023


Abstract

The phases of high-entropy alloys (HEAs) are closely related to their properties. However, phase prediction poses a significant challenge because of the extensive search space and complex formation mechanisms of HEAs. This study presents an accurate and efficient methodology for predicting alloy phases. A machine learning classifier was first developed using 145 features and a dataset of 1009 samples to differentiate four types of alloy phases. Feature selection was then performed using an Embedded algorithm and a genetic algorithm, which reduced the feature set to nine features. The LightGBM algorithm was chosen to train the machine learning model. Finally, oversampling and cost-sensitive methods were applied to address the insufficient accuracy of LightGBM in BCC+FCC phase classification. The resulting accuracy of the alloy phase prediction model, evaluated through ten-fold cross-validation, is 0.9544.

Keywords:

high-entropy alloy; machine learning; data imbalance; phase selection

1. Introduction

High-entropy alloys (HEAs) [1,2] are usually composed of four or more metallic elements. The high configurational entropy ($\Delta S_{mix}$) of HEAs makes it easier for them to form stable single-phase solid solutions (SS) than for conventional alloys [3]. In principle, the special structure of HEA solid solutions allows materials researchers to design alloys with excellent mechanical, thermoelectric, and electrochemical properties. Different HEA phases, including the face-centered cubic (FCC), body-centered cubic (BCC), and hexagonal close-packed (HCP) phases, exhibit different properties [4]. In addition, the core effects of HEAs, namely the high-entropy effect, sluggish diffusion effect, lattice distortion effect, and cocktail effect [5,6], help make HEAs high-performance materials [7,8,9]. However, the multi-principal-element structure and complex formation mechanism of HEAs make it extremely difficult to design high-performance HEA materials. There is an urgent need for a method that can rapidly explore the composition space of HEAs, so that the phase of a candidate alloy, and hence its likely properties, can be predicted and the desired materials can be found.

Traditional computational methods such as density functional theory [10,11,12,13], ab initio calculations [12], and the calculation of phase diagrams (CALPHAD) [14] have accelerated the search for high-performance HEAs to some extent. However, these methods are computationally inefficient and require huge computational resources [15], and some of them are not sufficiently accurate. Additionally, several parametric methods for differentiating alloy phases are mainly based on the Hume–Rothery rules [16] and rely on only a few thermodynamic parameters, as summarized by Guo et al. [4]. For example, the possible phases of an alloy can be judged from the values of parameters such as mixing entropy ($\Delta S_{mix}$), mixing enthalpy ($\Delta H_{mix}$), valence electron concentration (VEC), atomic size difference (δr), and Allen electronegativity (χ). However, because each of these parametric criteria relies on only one or a few parameters, it is difficult to obtain good alloy phase classification results.

In recent years, many studies have used data-driven machine learning (ML) [17,18,19] methods to predict alloy phases. These studies use previously constructed empirical parameters as ML input features and rely on the linear or nonlinear mapping capability of ML to project the alloy phase classification problem into a high-dimensional space where it becomes linearly separable. Huang et al. [20] used $\Delta S_{mix}$, $\Delta H_{mix}$, VEC, δr, and χ as input features for the K-Nearest Neighbors (KNN), support vector machine (SVM), and artificial neural network (ANN) algorithms to distinguish between SS phases and intermetallic compounds. Krishna et al. [21] used Huang’s input features for six ML algorithms and analyzed the importance of the features using a self-organizing map, a scatter plot, a radar plot, etc.; the alloy phases were then predicted using ANN with an accuracy of more than 80%. Islam et al. [22] used the same features to construct an ANN model for differentiating SS, amorphous, and intermetallic compounds, reaching an accuracy of 83% under four-fold cross-validation. These studies show that ANN predicts alloy phases well. Agarwal et al. [23] used elemental compositions and empirical parameters, respectively, as input features to an adaptive neuro-fuzzy inference system, obtaining accuracies of 84.21% and 80% in the classification of alloy phases. This suggests that good phase classification results can be achieved by using only elemental compositions to build ML models. Mandal et al. [24] used the SVM algorithm with five thermodynamic, configurational, and electronic parameters as input features, achieving a classification accuracy of 93.84% for SS, amorphous, and intermetallic compounds and 84.32% for BCC, FCC, and their mixed phases. Machaka et al. [25] used various feature selection methods to screen a feature set containing elemental alloy compositions, empirical HEA design parameters, metallurgy-informed alloy processing and postprocessing parameters, and other features. With the resulting feature set, alloy phase classification accuracy exceeded 90% for various machine learning algorithms. Zhang et al. [26] used a genetic algorithm to screen 70 features related to alloy phase classification. The selected features were then fitted using SVM, which resulted in a classification accuracy of 88.7% for SS and non-SS and 91.3% for BCC, FCC, and their mixed phases.

Although these prediction methods are effective, their input features are mainly chosen from empirical parameters constructed by previous researchers. Because the alloy phase formation mechanism is not yet clearly understood, many parameters that may affect phase formation in HEAs remain unexplored, which makes it difficult to further optimize prediction models. Constructing a large feature set therefore plays an important role in discovering new features that can be used for alloy phase classification.

In addition, past HEA data contain a large proportion of single-phase SS. When data-driven models are built using ML, this imbalance between phase classes lowers the prediction accuracy for the minority phase classes. To solve this problem, many studies have used the random oversampling method. For example, Chang et al. [27] applied random oversampling to the dataset and trained the support vector machine, gradient boosting decision tree, multi-layer perceptron, and other algorithms to differentiate the SS phases. Risal et al. [28] used a feature selection method based on feature importance and correlation to filter modeling features and found that K-Nearest Neighbors and Random Forest classifiers performed significantly better on the dataset after oversampling and principal component analysis (PCA) dimensionality reduction. However, these methods applied random oversampling to the dataset before model training and evaluation, which resulted in data leakage and thus overestimation of model accuracy. Since random oversampling simply duplicates minority-class samples, it can also cause the model to overfit. Moreover, most methods use greedy feature selection algorithms, such as Embedded and sequential forward/backward selection algorithms, which may lead to a suboptimal feature set being chosen for model construction. Finally, feature extraction methods such as PCA should be avoided because they change the distribution of features and make the model lose its physical interpretability.

In this work, a large feature set containing 145 features was constructed to explore previously unused empirical parameters. The LightGBM algorithm was selected to build classifiers that distinguish the different phases. Combining two feature selection algorithms enabled fast and accurate feature selection over this large feature set, and an optimal set of nine features was determined by a global search. Finally, to address the data imbalance and model overfitting issues, ensemble learning, SMOTE-based oversampling methods, and cost-sensitive methods were combined.

2. Methodology

2.1. Dataset Obtaining and Parameter Calculation

The alloy phase data were collected from references [29,30,31,32,33]. Since as-cast samples are stable, all alloy samples in the dataset are as-cast. After collation, the dataset contained a total of 1009 samples of solid-solution-phase binary alloys, medium-entropy alloys, and HEAs: 391 BCC, 392 FCC, 74 BCC+FCC, and 152 HCP.

Feature engineering is one of the most important steps because it determines whether the model can accurately capture the formation pattern of alloy phases. Due to the complex formation mechanism of alloy phases, it is challenging to select an appropriate feature set for model construction. We expected that a large feature set might help uncover empirical parameters that have not yet been used to predict alloy phases. The materials-agnostic platform for informatics and exploration (Magpie) [34] was therefore used to generate 143 features directly from the alloy composition and molar ratios. Among these, the Magpie features describing atomic radius, electronegativity, and electron concentration have been reported to be closely related to the formation of alloy phases [20,35]. The Magpie features and their calculation are as follows:

1. The $L_{p}$ norm features of HEAs

The $L_{p}$ norm features of the HEA are calculated as shown in Equation (1):

L_{p} = {(\sum_{i = 1}^{n} {|c_{i}|}^{p})}^{\frac{1}{p}}

(1)

where $c_{i}$ represents the molar ratio of each metal element of the HEA; p is taken as 3, 5, 7, and 10 to construct four $L_{p}$ norm features, namely $L_{3}$, $L_{5}$, $L_{7}$, and $L_{10}$.
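As a minimal sketch of Equation (1), the norm can be computed directly from a list of molar fractions. The function name `lp_norm` and the equimolar five-element composition are illustrative assumptions, not part of the original work.

```python
def lp_norm(fractions, p):
    """L_p norm of the molar fractions c_i, following Equation (1)."""
    return sum(abs(c) ** p for c in fractions) ** (1.0 / p)

# Hypothetical equimolar five-element alloy (each c_i = 0.2).
equimolar = [0.2] * 5

# The four norm features use p = 3, 5, 7, and 10.
lp_features = {f"L_{p}": lp_norm(equimolar, p) for p in (3, 5, 7, 10)}
```

For an equimolar alloy with n elements the norm reduces to n^(1/p - 1), so these features mainly encode how many elements are present and how unevenly they are distributed.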

2. Elemental molar-ratio weighted features of HEAs

For each of the following elemental properties, six statistics are calculated according to Equations (2)–(7): the atomic number (Num), Mendeleev number (MN), atomic weight (AW), melting temperature (Tm), column (Col), row (Row), covalent radius (R), electronegativity (χ), the number of filled s, p, d, and f valence electrons, the total number of filled valence electrons, the number of unfilled s, p, d, and f valence orbitals, the total number of unfilled valence orbitals, and the specific volume, band gap energy, magnetic moment (per atom), and space group number of the 0 K ground state.

F = \sum_{i = 1}^{n} c_{i} F_{i},

(2)

\mathrm{max}_{F} = \max_{i = 1, \dots, n} (c_{i} F_{i}),

(3)

\mathrm{min}_{F} = \min_{i = 1, \dots, n} (c_{i} F_{i}),

(4)

\Delta F = \mathrm{max}_{F} - \mathrm{min}_{F},

(5)

\delta F = \sqrt{\sum_{i = 1}^{n} c_{i} \times {(F - F_{i})}^{2}},

(6)

\mathrm{most}_{F} = \frac{1}{n} \sum_{i = 1}^{n} F_{i},

(7)

where $F_{i}$ is the value of one of the above properties for element i and $c_{i}$ represents the molar ratio of each element of the HEA.
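The six statistics of Equations (2)–(7) can be sketched for a single elemental property as follows. The function name `weighted_stats`, the dictionary keys, and the two-element example values are illustrative assumptions; the formulas are transcribed as written above rather than taken from the Magpie source code.

```python
import math

def weighted_stats(fractions, values):
    """Six statistics of one elemental property F, following Equations (2)-(7)."""
    weighted = [c * f for c, f in zip(fractions, values)]
    mean = sum(weighted)                      # Eq. (2): molar-ratio weighted mean
    max_f = max(weighted)                     # Eq. (3)
    min_f = min(weighted)                     # Eq. (4)
    range_f = max_f - min_f                   # Eq. (5)
    dev = math.sqrt(sum(c * (mean - f) ** 2   # Eq. (6): weighted mismatch
                        for c, f in zip(fractions, values)))
    most = sum(values) / len(values)          # Eq. (7)
    return {"F": mean, "max_F": max_f, "min_F": min_f,
            "delta_range_F": range_f, "dev_F": dev, "most_F": most}

# Hypothetical binary alloy: c = (0.5, 0.5), property values F_i = (1.0, 3.0).
stats = weighted_stats([0.5, 0.5], [1.0, 3.0])
```

By this count, the 22 properties listed above yield 22 × 6 = 132 features; together with the four $L_{p}$ norms, the four valence fractions, and the three ionic features, this accounts for the 143 Magpie features.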

3. Valence orbital occupation features

The fractions of filled s, p, d, and f valence electrons are calculated as follows:

\mathrm{frac}_{O}\mathrm{Valence} = \frac{\sum_{i = 1}^{n} c_{i} N_{i}^{O}}{\sum_{i = 1}^{n} c_{i} N_{i}},

(8)

where O denotes the s, p, d, or f orbital, $N_{i}^{O}$ is the number of filled valence electrons of element i in orbital O, and $N_{i}$ is the total number of filled valence electrons of element i.

4. Ionic compound features

Three features are included. The first feature (CanFormIonic) describes whether it is possible to form a neutral ionic compound, assuming each element takes exactly one of its common charge states. The second feature is the maximum ionic character ($I_{max}$) between any two elements in the HEA (see Equation (10)). The third feature is the mean ionic character ($\bar{I}$) (see Equation (11)). Both are based on the ionic character between two elements with electronegativities $χ_{i}$ and $χ_{j}$, defined in Equation (9):

I (χ_{i}, χ_{j}) = 1 - e^{- 0.25 {(χ_{i} - χ_{j})}^{2}},

(9)

I_{max} = \max (I (χ_{i}, χ_{j})),

(10)

\bar{I} = \sum c_{i} c_{j} \times I (χ_{i}, χ_{j}) .

(11)
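The two numeric ionic features can be sketched as below. The summation range of Equation (11) is not stated explicitly, so this sketch assumes a double sum over all ordered pairs (i, j); summing over unordered pairs instead would halve the mean. The function names and the example electronegativities are hypothetical.

```python
import math

def ionic_character(chi_i, chi_j):
    """Pairwise ionic character, Equation (9)."""
    return 1.0 - math.exp(-0.25 * (chi_i - chi_j) ** 2)

def ionic_features(fractions, electronegativities):
    """Return (I_max, I_mean) following Equations (10) and (11).

    Assumption: the mean is a double sum over all ordered pairs (i, j);
    the i == j terms vanish because I(chi, chi) = 0.
    """
    n = len(fractions)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    i_max = max(ionic_character(electronegativities[i], electronegativities[j])
                for i, j in pairs)
    i_mean = sum(fractions[i] * fractions[j]
                 * ionic_character(electronegativities[i], electronegativities[j])
                 for i, j in pairs)
    return i_max, i_mean

# Hypothetical binary alloy with electronegativities 1.0 and 3.0.
i_max, i_mean = ionic_features([0.5, 0.5], [1.0, 3.0])
```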

In addition, a considerable number of studies have demonstrated that thermodynamic parameters such as $\Delta S_{mix}$ and $\Delta H_{mix}$ have a strong correlation with the formation of HEA phases and the origin of high-entropy effects [36]. $\Delta S_{mix}$ and $\Delta H_{mix}$ are therefore also added to the feature set (see Equations (12) and (13) for the calculation).

\Delta S_{mix} = - 8.314 \sum_{i = 1}^{n} c_{i} \times \ln (c_{i}),

(12)

\Delta H_{mix} = 4 \sum_{i = 1, j > i}^{n} c_{i} c_{j} H_{i - j}^{mix} .

(13)
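Equations (12) and (13) translate into a few lines of code. In this sketch the constant 8.314 is the gas constant in J/(mol·K), and the binary mixing enthalpies are passed in as a dictionary keyed by element index pairs; that data layout and the numerical values are assumptions made for illustration only.

```python
import math

R = 8.314  # gas constant in J/(mol K), the prefactor of Equation (12)

def mixing_entropy(fractions):
    """Ideal mixing entropy, Equation (12)."""
    return -R * sum(c * math.log(c) for c in fractions if c > 0)

def mixing_enthalpy(fractions, pair_enthalpy):
    """Mixing enthalpy, Equation (13); pair_enthalpy[(i, j)] holds H_{i-j}^{mix} for i < j."""
    n = len(fractions)
    return 4.0 * sum(fractions[i] * fractions[j] * pair_enthalpy[(i, j)]
                     for i in range(n) for j in range(i + 1, n))

# Equimolar five-element alloy: the entropy reduces to R * ln(5).
s_mix = mixing_entropy([0.2] * 5)

# Hypothetical equimolar ternary alloy with made-up binary enthalpies (kJ/mol).
hypothetical_pairs = {(0, 1): -4.0, (0, 2): 2.0, (1, 2): -1.0}
h_mix = mixing_enthalpy([1 / 3] * 3, hypothetical_pairs)
```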

From the above steps, the dataset contains 1009 alloy phase samples, and the feature set includes 145 features.

2.2. Model Construction Using ML

By revealing the implicit relationship between target features and alloy phases, ML can help to understand the formation mechanism of HEAs and accelerate the discovery of new alloys. In general, it is difficult to choose the best algorithm for alloy phase prediction in advance, because the features used, the model hyperparameters, and the dataset distribution vary from study to study. Building the alloy phase prediction model therefore requires testing many algorithms to select the best one. In this work, we test nine algorithms, namely Logistic Regression (LR), SVM, K-Nearest Neighbors (KNN), Back Propagation Neural Network (BPNN), Adaptive Boosting (AdaBoost), eXtreme Gradient Boosting (XGBoost), Gradient Boosting Decision Tree (GBDT), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF), and take the algorithm with the highest accuracy as the benchmark algorithm for model construction.

Feature engineering removes redundant features and noise, reduces the risk of model overfitting, and improves model accuracy and interpretability. Data pre-processing such as dimensionality reduction can change the data distribution and cause the data to lose interpretability, and should therefore be avoided in the feature engineering process. Because the feature set contains a large number of features, it is crucial to select the most suitable ones. Feature selection can be performed in Embedded and Wrapper modes, both of which tie the selected features to the ML model. The Embedded mode incorporates feature selection into model training and evaluates the weight coefficient of each feature in the specific model to find the features that contribute most to the model. The Wrapper mode directly takes the accuracy of the model as the evaluation criterion of a feature subset and searches the feature set with a given search strategy to optimize model accuracy. Common Wrapper algorithms include Sequential Forward Selection (SFS) and recursive feature elimination (RFE). However, these algorithms are often greedy, and the features they select tend to give less accurate predictions than features selected by a global optimization algorithm. The genetic algorithm (GA), a search algorithm based on the principle of biological evolution, can search a large space for the optimal solution by simulating genetic and natural selection processes. Compared with SFS, RFE, and other search strategies, GA has better global search capability, which makes it more likely to converge to the global optimum. However, as a global optimization algorithm, GA requires a large amount of computational resources when applied to a large feature set. Therefore, this work combines the Embedded and Wrapper modes to quickly select the best combination of features for modeling while reducing the use of computational resources.

To avoid overfitting or underfitting, we use 10-fold cross-validation (10-fold) to assess the generalization of the model during algorithm selection, feature selection, and hyperparameter optimization. The 10-fold method divides the original dataset into 10 subsets of nearly equal size; in each round, 9 subsets are used as the training set and 1 as the test set, and the rounds are repeated until every subset has been used once as the test set. The performance on each test set is recorded, and the average is taken as the performance metric of the model. Compared with a single train/test split, the 10-fold method makes better use of the dataset for performance evaluation and yields more reliable evaluation results.
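The 10-fold procedure described above can be sketched without any ML library. The helper names `k_fold_indices` and `cross_validate`, the fixed seed, and the placeholder scoring callback are assumptions for illustration; in practice the callback would train a classifier on the training indices and return its accuracy on the test indices. Whether the original work stratified the folds by phase is not stated, so this sketch uses a plain shuffled split.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validate(n_samples, fit_and_score, k=10):
    """Average the per-fold score returned by the fit_and_score callback."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Placeholder scorer (not a real model): returns the training fraction per fold.
mean_score = cross_validate(1009, lambda train, test: len(train) / 1009)
```

With 1009 samples, nine folds contain 101 samples and one contains 100, so the subsets are only approximately equal in size.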

2.3. Oversampling and Cost-Sensitive Learning

While ML as a data-driven approach is efficient to run, the sample size and quality of the dataset are critical to obtaining a highly accurate model. Due to the multi-principal-element structure and the four core effects of HEAs, the data from previous studies are mostly single-phase SS, so the alloy phase types are unevenly distributed and the classes are imbalanced. As a result, when constructing an ML model, the number of single-phase SS samples is much larger than the number of mixed-phase samples, which makes it difficult for traditional classification algorithms to achieve good results. In our dataset, there are only 74 BCC+FCC mixed-phase samples, which also leads to class imbalance. Optimizing the model and improving the prediction accuracy on such an imbalanced dataset is therefore an urgent problem.

Oversampling changes the distribution of the original dataset by adding minority-class samples, transforming the dataset from imbalanced to balanced. Many studies [27,28] have used random oversampling to address the class imbalance in alloy phase data. However, random oversampling replicates minority-class samples to increase their proportion, which often increases the risk of overfitting [37] and can negatively affect the decisions of the model. Some studies performed random oversampling on the dataset before it was divided into training and test sets, which causes data leakage, because copies of the same sample can appear in both sets, and severely overestimates the prediction accuracy.

Unlike random oversampling, which balances data by copying existing samples, the Synthetic Minority Oversampling Technique (SMOTE) uses adjacent samples to generate new minority-class samples, effectively avoiding sample duplication. Specifically, the SMOTE algorithm treats minority-class samples as feature vectors and generates a new synthetic sample by interpolating between a minority-class sample and one of its randomly chosen nearest minority-class neighbors in feature space. Because SMOTE generates synthetic samples around every minority-class sample regardless of how close it lies to other classes, the synthetic samples may overlap with samples of other classes. To reduce this overlap, Borderline-SMOTE, Adaptive Synthetic Sampling (ADASYN), and other adaptive methods were proposed. Unlike SMOTE, which generates synthetic samples from every minority-class sample, Borderline-SMOTE generates synthetic samples only from the minority-class samples near the class boundary, which can reduce the generation of noisy samples and improve the classification accuracy. SVM-SMOTE is a variant of Borderline-SMOTE that first trains an SVM classifier and then uses the boundary information of the classifier to restrict where new synthetic samples are generated, so the new samples are expected to lie closer to the real data distribution. ADASYN decides how many synthetic samples to generate for each minority-class sample according to how difficult that sample is to learn, generating more samples for minority-class samples that are surrounded by majority-class neighbors. It thus adapts to different data distributions, which can improve the quality of the generated samples and further enhance model performance.
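The core interpolation step of SMOTE can be illustrated with a short sketch. This is a simplified version written for clarity, not the implementation used in the study: the function names, the default `k`, the seed, and the four-point minority class are hypothetical, and refinements such as per-class sampling ratios are omitted.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: squared_distance(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical minority class: the four corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(corners, n_new=10, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, no exact duplicates of the original data are needed to balance the classes.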

Cost-sensitive learning (CS) also addresses data imbalance, by assigning different costs to the misclassification of different classes. Under the CS strategy, the classifier pays more attention to minority-class samples so that as many of them as possible are classified correctly: a higher cost is assigned when a minority-class sample is misclassified as a majority-class sample. In this way, the bias of the model toward the majority-class samples can be effectively reduced, and the ability of the model to identify and generalize to minority-class samples can be improved.
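One common way to set such costs is to weight each class inversely to its frequency. The sketch below applies this heuristic to the class counts of the dataset described in Section 2.1; the heuristic itself and the function names are assumptions, since the exact cost values used in this work are not stated.

```python
# Class counts of the dataset described in Section 2.1.
class_counts = {"BCC": 391, "FCC": 392, "BCC+FCC": 74, "HCP": 152}

def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number of classes * class count)."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}

def weighted_error(true_labels, predicted_labels, weights):
    """Misclassification cost in which each error counts with its true-class weight."""
    cost = sum(weights[t] for t, p in zip(true_labels, predicted_labels) if t != p)
    return cost / sum(weights[t] for t in true_labels)

weights = balanced_class_weights(class_counts)
# Misclassifying the single BCC+FCC sample costs more than misclassifying a BCC sample.
error = weighted_error(["BCC", "BCC+FCC"], ["BCC", "FCC"], weights)
```

Under this scheme a misclassified BCC+FCC sample is penalized roughly five times as heavily as a misclassified BCC or FCC sample, which is the kind of asymmetry cost-sensitive learning relies on.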

3. Results and Discussion

3.1. Algorithm Comparison of the ML Model

As the alloy phase formation mechanism is complex and nonlinear, nonlinear mappings are required to effectively improve the accuracy of the model. Therefore, the RBF kernel was used for SVM, and the Rectified Linear Unit (ReLU) activation function and the Adaptive Moment Estimation (Adam) optimizer were used for BPNN to improve the nonlinear mapping capability of the models. Subsequently, the dataset containing 145 features was used as the input for LR, SVM, KNN, BPNN, AdaBoost, RF, XGBoost, GBDT, and LightGBM, with the alloy phase classes as the output, to train ML models and evaluate the candidate algorithms. The classification performance of each algorithm is evaluated by accuracy, precision, recall, and F1-score under 10-fold, which are calculated as follows:

Accuracy = \frac{TP + TN}{TP + TN + FN + FP},

(14)

Precision = \frac{TP}{TP + FP},

(15)

Recall = \frac{TP}{TP + FN},

(16)

F1\text{-}score = \frac{2 \, Precision \times Recall}{Precision + Recall},

(17)

where TP and TN are the numbers of correctly classified positive-class and negative-class samples, respectively, and FP and FN are the numbers of samples misclassified as the positive class and as the negative class, respectively.
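Equations (14)–(17) are defined for a positive and a negative class, whereas the classifier here has four classes. A minimal sketch of one common reading, in which each phase in turn is treated as the positive class (one-vs-rest), is given below; this reading, the function names, and the small three-class confusion matrix are assumptions for illustration, since the averaging scheme is not stated in the text.

```python
def one_vs_rest_counts(confusion, k):
    """TP, TN, FP, FN for class k; confusion[i][j] counts true class i predicted as j."""
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    return tp, tn, fp, fn

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score, Equations (14)-(17)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical 3-class confusion matrix (rows: true class, columns: prediction).
toy_confusion = [[8, 2, 0],
                 [1, 7, 2],
                 [0, 1, 9]]
class1 = binary_metrics(*one_vs_rest_counts(toy_confusion, 1))
```

As a consistency check on the numbers quoted in this section, 74 BCC+FCC samples with 5 + 13 = 18 misclassifications give a recall of 56/74 ≈ 0.7568, which matches the 75.68% reported for LightGBM.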

Figure 1 shows the confusion matrices of the classifiers fitted by the nine ML algorithms under 10-fold. The data imbalance makes every model less effective, to some extent, in classifying HCP and BCC+FCC, and the classification is particularly poor for BCC+FCC. For HCP prediction, ensemble learning (EL) algorithms such as RF, XGBoost, GBDT, and LightGBM show clear advantages over the LR, KNN, SVM, and BPNN algorithms. Although the EL algorithms are still not ideal for predicting BCC+FCC, they improve on the traditional algorithms to some extent. This result suggests that EL algorithms partly suppress the negative effects of data imbalance, and it also demonstrates their advantages in dealing with sparse and redundant tabular data. To select the best EL algorithm, the accuracy, precision, recall, and F1-score of each algorithm were compared. As shown in Table 1, LightGBM has the best prediction performance, with accuracy, precision, recall, and F1-score all exceeding 0.95. Therefore, LightGBM is used as the benchmark algorithm to construct the alloy phase classifier. However, due to data imbalance, the prediction accuracy of LightGBM for BCC+FCC is only 75.68%, which is much lower than that for BCC, FCC, and HCP and has a negative impact on the accurate prediction of alloy phases. According to the classification results for BCC+FCC in Figure 1g, five and thirteen BCC+FCC samples were misclassified as BCC and FCC, respectively, but none were misclassified as HCP. This may be because most of the empirical parameters were designed to distinguish between BCC, BCC+FCC, and FCC, and the phase changes in the order BCC, BCC+FCC, FCC as these parameters vary. It also indicates that the classification boundary between HCP and the other three phases is relatively clear, so that the classification accuracy for HCP reaches 91.45% despite the data imbalance in HCP.

3.2. Feature Set

Feature selection is essential because the dataset contains 145 features. Using a global search algorithm such as GA directly on all 145 features would consume excessive time and memory. Moreover, because of the large number of features, the feature set filtered by GA alone may still contain a lot of redundant information. Therefore, a two-stage feature selection procedure is proposed to quickly filter a low-redundancy feature set out of the 145 features while still classifying alloy phases accurately. First, the faster Embedded method, built on a specific machine learning algorithm (LightGBM, selected in Section 3.1), roughly filters the feature set to obtain a simplified feature set. The simplified feature set is then further filtered using a global optimization algorithm such as GA. In this way, the speed and accuracy of feature selection are balanced.

In Section 3.1, LightGBM was selected as the benchmark algorithm to build the classifier. The Embedded method was first used to evaluate the importance of the features and to select those that are important to LightGBM, reducing feature redundancy. The alloy phase dataset was fitted using LightGBM, and the features were ranked according to their importance. Subsequently, features were added to the feature set one at a time in order of decreasing importance, and the feature group with the highest accuracy was selected.
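The two-stage procedure can be sketched as follows. Everything here is a simplified stand-in: `best_prefix` mimics the importance-ordered Embedded stage, `genetic_search` is a basic GA with truncation selection, one-point crossover, and bit-flip mutation, and `toy_score` replaces the 10-fold LightGBM accuracy with a made-up function. The GA operators and hyperparameters of the original study are not given in the text, so all of them are assumptions.

```python
import random

def best_prefix(ranked, score):
    """Stage 1: add features in importance order and keep the best-scoring prefix."""
    best, best_score = ranked[:1], score(ranked[:1])
    for k in range(2, len(ranked) + 1):
        s = score(ranked[:k])
        if s > best_score:
            best, best_score = ranked[:k], s
    return best

def genetic_search(candidates, score, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    """Stage 2: search subsets of the candidates with a simple elitist GA."""
    rng = random.Random(seed)
    n = len(candidates)

    def subset(mask):
        return [f for f, keep in zip(candidates, mask) if keep]

    def fitness(mask):
        return score(subset(mask)) if any(mask) else float("-inf")

    # Start from the full stage-1 set plus random masks.
    population = [[True] * n]
    population += [[rng.random() < 0.5 for _ in range(n)]
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]       # survivors, kept unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)               # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]                # bit-flip mutation
            children.append(child)
        population = parents + children
    return subset(max(population, key=fitness))

# Toy stand-in for cross-validated accuracy: reward useful features, penalize size.
useful = {"a", "c", "e"}
def toy_score(features):
    return sum(f in useful for f in features) - 0.1 * len(features)

ranked_features = ["a", "b", "c", "d", "e", "f"]    # hypothetical importance order
stage1 = best_prefix(ranked_features, toy_score)
stage2 = genetic_search(stage1, toy_score)
```

Because the surviving parents are carried over unchanged, the best score in the population never decreases, so the GA result is at least as good as the stage-1 set it starts from.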

As shown in Figure 2a, the LightGBM classifier achieves the best fit when the top 52 features in the importance ranking are included. Since this reduced feature set is small enough for a global search, GA was then used for feature selection. As shown in Figure 2b, after the global search over the fifty-two features through GA, nine features are retained: δTm, δχ, χ, Col, AW, ΔMN, max_R, L₇, and min_R. To further evaluate these features, their importance was ranked using the Embedded method (Figure 2c), and the correlation between them was analyzed using the Pearson correlation coefficient (PCC) method (Figure 2d). Figure 2c shows that the mismatch-related parameters δTm and δχ rank in the top two in importance. It is worth mentioning that the parameters δTm, Col, ΔMN, AW, and L₇ were rarely or never used in previous studies for the construction of alloy phase classifiers. These new parameters may play a new role in exploring the formation mechanism of HEA phases. To check whether the nine selected features contain a lot of redundant information, the PCC between each pair of features was calculated. The PCC is formulated as:

PCC = \frac{\sum_{i = 1}^{n} (u_{i} - \bar{u}) (v_{i} - \bar{v})}{\sqrt{\sum_{i = 1}^{n} {(u_{i} - \bar{u})}^{2}} \sqrt{\sum_{i = 1}^{n} {(v_{i} - \bar{v})}^{2}}},

(18)

where $u_{i}$ and $v_{i}$ are the values of the two features for sample i, $\bar{u}$ and $\bar{v}$ are their means, the numerator is proportional to the covariance between the two features, and the denominator is proportional to the product of their standard deviations.

In Figure 2d, the $|PCC|$ between every pair of features is less than 0.8, which indicates that the selected feature set is free of highly redundant features and helps improve the physical interpretability of the model. Since the nine selected features show no strong linear correlation in the PCC analysis, no further feature selection is required.
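The redundancy check based on Equation (18) can be sketched as below. The function names, the 0.8 threshold argument, and the three toy feature columns are illustrative assumptions; the columns are not data from this study.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient, Equation (18)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    numerator = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    denominator = (math.sqrt(sum((a - mean_u) ** 2 for a in u))
                   * math.sqrt(sum((b - mean_v) ** 2 for b in v)))
    return numerator / denominator

def redundant_pairs(columns, threshold=0.8):
    """List feature pairs whose |PCC| reaches the redundancy threshold."""
    names = list(columns)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(columns[a], columns[b])) >= threshold]

# Hypothetical feature columns: x and y are nearly collinear, z is not.
toy_columns = {"x": [1.0, 2.0, 3.0, 4.0],
               "y": [2.0, 4.0, 6.0, 8.5],
               "z": [1.0, -1.0, 1.0, -1.0]}
flagged = redundant_pairs(toy_columns)
```

An empty list from such a check corresponds to the situation reported for the nine selected features, where no pair reaches the 0.8 threshold.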

To further evaluate the classification effect of each feature in linear space, a 9 × 9 scatter matrix plot (Figure 3) was drawn to visualize the alloy phase distribution for every pair of features, with the diagonal subplots showing the distribution of the different alloy phases along each individual feature. In the diagonal subplots, none of the individual features separates the phases completely, which means that no single feature can fully classify the alloy phases. This also indicates that the complex nonlinear relationship of alloy phases is difficult to fit using linear ML algorithms. However, the off-diagonal plots of Col, which ranks third in feature importance in Figure 2c, against the other features show that the classification boundaries of Col are relatively clear in two-dimensional linear space for the three classes BCC, FCC, and HCP. In contrast, the BCC+FCC phase overlaps strongly with the BCC and FCC phases in two-dimensional linear space, which may be one of the reasons for the low classification accuracy of the classifier for the BCC+FCC phase.

To further understand the decision-making mechanism of the model, we used the Shapley Additive Explanations (SHAP) [38] interpretable machine learning method. SHAP, as a post hoc interpretation method, analyzes a “black box” algorithm locally or globally by calculating the marginal contribution of each feature to the prediction results. Unlike the feature importance results in Figure 2c, SHAP takes the trained model as a whole and analyzes the contribution of individual features to each alloy phase, which allows for a better assessment of the rules followed by the model in actual decision-making. As shown in Figure 4, and consistent with the Col distribution in Figure 3, Col has the largest contribution in distinguishing the alloy phases in the ML model. The histogram of Col shows that Col separates BCC well from the other phases and is also effective for classifying the HCP phase. In addition, χ performs well in classifying BCC+FCC, the mixed phase that is the most difficult to separate, while L₇ performs well in classifying FCC. Each of these features thus focuses on a different alloy phase, and it is their nonlinear combination through the nonlinear mapping ability of LightGBM that allows the model to show good classification results for the four alloy phases.
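The "marginal contribution" that SHAP attributes to each feature is the Shapley value. The brute-force sketch below enumerates all feature coalitions for a made-up three-feature value function; it illustrates the definition only and is not how the SHAP library computes values for tree models, which relies on much faster model-specific algorithms. The feature names `a`, `b`, `c` and the value function are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {f}) - value(s))  # marginal contribution
        phi[f] = total
    return phi

# Toy value function: additive effects of a and b plus an a-c interaction.
def toy_value(coalition):
    return (2.0 * ("a" in coalition) + 1.0 * ("b" in coalition)
            + 0.5 * ("a" in coalition and "c" in coalition))

phi = shapley_values(toy_value, ["a", "b", "c"])
```

The interaction term is split equally between the two features involved, and the values sum to the difference between the full-model output and the empty-coalition output, which is the additivity property that makes per-feature contributions comparable across phases.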

After feature selection, the number of features decreases from one hundred and forty-five to nine, and the accuracy of the LightGBM classifier does not deteriorate because of the decrease in the number of features. At the same time, feature selection reduces the redundancy of data and improves the physical interpretability of the model.

3.3. Data Imbalance Issue

The LightGBM algorithm uses the nine features screened in Section 3.2 as input parameters to predict the alloy phase. To address the low accuracy of BCC+FCC phase prediction caused by class imbalance among the alloy phases, we processed the dataset with four oversampling methods: SMOTE, SVM-SMOTE, Borderline-SMOTE, and ADASYN. It is worth noting that, when 10-fold cross-validation was used to evaluate accuracy, only the training portion of each of the ten splits was oversampled, and the held-out fold was used for testing only. This prevents synthetic samples derived from test data from leaking into training, which would otherwise inflate the estimated accuracy and mask overfitting.
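The fold-wise procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `oversample` is a deliberately simplified stand-in for SMOTE or ADASYN (it interpolates between two randomly chosen same-class samples instead of nearest neighbours), the folds are not stratified, and the two-feature dataset with its 40/20 class split is synthetic.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oversample(X, y, seed=0):
    """Simplified SMOTE-like oversampling (assumption, not real SMOTE).

    Synthetic points are interpolated between two random samples of the
    same class until every class matches the largest class in size.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            X_out.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            y_out.append(label)
    return X_out, y_out


# Synthetic, imbalanced toy data (hypothetical values).
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(60)]
y = ["BCC"] * 40 + ["BCC+FCC"] * 20

folds = k_fold_indices(len(X), k=10)
fold_summaries = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(X)) if i not in held_out]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Oversample the training split only; the test fold stays untouched.
    X_res, y_res = oversample(X_train, y_train)
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    # A classifier would be fit on (X_res, y_res) and scored on (X_test, y_test).
    fold_summaries.append((Counter(y_res), len(X_test)))

print(fold_summaries[0])
```

Oversampling before splitting would instead let synthetic points interpolated from a test sample end up in the training set, which is the leakage the fold-wise order avoids.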

In the hyperparameter optimization of LightGBM using Bayesian optimization, the accuracy corresponding to each hyperparameter combination was evaluated in the same way. Further, we applied cost-sensitive learning during training to increase the model's penalty for misclassifying minority classes and thus improve the classification accuracy of minority-class samples.
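One common way to realize such a penalty is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, weight = N / (K × n_c), as an assumption; the text does not state which weighting scheme was used, and the class counts here are hypothetical rather than taken from the dataset.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency class weights: N / (K * n_c) for each class c."""
    n_total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Hypothetical class counts, chosen only for illustration.
class_counts = {"BCC": 50, "FCC": 30, "BCC+FCC": 12, "HCP": 8}
weights = balanced_class_weights(class_counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
```

With these weights every class contributes the same total weight to the loss, so an error on a rare class costs more than an error on a common one. A dictionary of this form could, for example, be supplied through the `class_weight` parameter of LightGBM's scikit-learn-style classifier.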

The dataset was trained with each of the four oversampling methods, and the confusion matrices of the results were evaluated using 10-fold cross-validation. As shown in Figure 5, the prediction accuracies of LightGBM under the four oversampling methods are very close, which may be because all of these methods are derived from SMOTE. However, the minority classes reveal differences: after ADASYN oversampling, the HCP phase reaches the highest classification accuracy, 92.11%, and the overall classification accuracy is 95.44%. Although all of the oversampling methods combined with the cost-sensitive method achieved good prediction results, the LightGBM model trained with ADASYN classifies the minority classes better. Its overall classification accuracy also compares well with the other three SMOTE-based oversampling methods. The LightGBM model was therefore constructed by combining ADASYN with the cost-sensitive method and was further evaluated using precision, recall, and F1-score. Under 10-fold cross-validation, precision reached 0.9575, F1-score reached 0.9543, and recall reached 0.9544. This shows that combining the EL-based LightGBM algorithm, ADASYN oversampling, and cost-sensitive learning improves the classification accuracy of the minority-class BCC+FCC phase from 75.68% to 86.49%, alleviating the poor minority-class accuracy caused by data imbalance. To validate the predictive effectiveness of the model, we tested it on data from outside the training dataset. The results, shown in Table 2, demonstrate that the model predicts the phases of all five test alloys correctly.
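The evaluation metrics above can be derived from a confusion matrix as sketched below. The averaging scheme is not stated in the text; support-weighted averaging is assumed here because it makes recall coincide with accuracy, as the reported 0.9544 values do. The 3 × 3 confusion matrix is hypothetical.

```python
def weighted_metrics(cm):
    """Accuracy and support-weighted precision, recall, and F1.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    precision = recall = f1 = 0.0
    for i in range(k):
        support = sum(cm[i])                          # true members of class i
        predicted = sum(cm[r][i] for r in range(k))   # predicted as class i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
cm = [
    [45, 3, 2],
    [4, 30, 1],
    [3, 2, 10],
]
accuracy, precision, recall, f1 = weighted_metrics(cm)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Per-class accuracy figures such as those quoted for HCP or BCC+FCC correspond to the per-class recall, that is, the diagonal entry divided by its row sum.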

4. Conclusions

In this work, a LightGBM classifier that distinguishes the BCC, FCC, BCC+FCC, and HCP phases was developed. To explore empirical parameters for alloy phase classification that had not been proposed before, a feature set consisting of 145 features was constructed. Through the Embedded method and GA, a set of modeling features with low redundancy and high accuracy was screened, which included parameters such as δT_m, Col, ΔMN, AW, L₇, and others. These parameters have seldom or never been used in the construction of alloy-phase classifiers. In addition, SHAP analysis showed that Col makes a significant contribution to differentiating the alloy phases. To address the issue of data imbalance and improve the accuracy for BCC+FCC, a combination of EL, ADASYN oversampling, and cost-sensitive learning was implemented. This approach increases the classification accuracy of BCC+FCC from 75.68% to 86.49%, and the overall accuracy of the classifier reaches 95.44% under 10-fold cross-validation, which may accelerate the development of new HEAs.

Author Contributions

Methodology, Y.Z.; Software, Y.Z.; Validation, Y.Z., S.D. and N.L.; Resources, W.R. and W.W.; Data curation, S.D. and N.L.; Writing—original draft, Y.Z.; Writing—review & editing, W.R.; Visualization, Y.Z.; Supervision, W.R. and W.W.; Project administration, W.R. and W.W.; Funding acquisition, W.R. and W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the National Natural Science Foundation of China (Grant Nos. 51931005, 52171048, and 51571163) and the Key Industry Innovation Chain Project of Shaanxi Province (2020ZDLGY12-02).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Yeh, J.W.; Chen, S.K.; Lin, S.J.; Gan, J.Y.; Chin, T.S.; Shun, T.T.; Tsau, C.H.; Chang, S.Y. Nanostructured High-Entropy Alloys with Multiple Principal Elements: Novel Alloy Design Concepts and Outcomes. Adv. Eng. Mater. 2004, 6, 299–303.
2. Wu, P.; Gan, K.; Yan, D.; Fu, Z.; Li, Z. A non-equiatomic FeNiCoCr high-entropy alloy with excellent anti-corrosion performance and strength-ductility synergy. Corros. Sci. 2021, 183, 109341.
3. Miracle, D.B.; Senkov, O.N. A critical review of high entropy alloys and related concepts. Acta Mater. 2017, 122, 448–511.
4. Guo, S. Phase selection rules for cast high entropy alloys: An overview. Mater. Sci. Technol. 2015, 31, 1223–1230.
5. Li-Sheng, Z.; Guo-Liang, M.; Li-Chao, F.; Jing-Yi, T. Recent Progress in High-entropy Alloys. In Proceedings of the 2012 2nd International Conference on Materials Engineering for Advanced Technologies (ICMEAT 2012), Xiamen, China, 27–28 December 2012.
6. Ranganathan, S. Alloyed pleasures: Multimetallic cocktails. Curr. Sci. 2003, 85, 1404–1406.
7. Wu, Y.D.; Cai, Y.H.; Wang, T.; Si, J.J.; Zhu, J.; Wang, Y.D.; Hui, X.D. A Refractory Hf₂₅Nb₂₅Ti₂₅Zr₂₅ High-Entropy Alloy with Excellent Structural Stability and Tensile Properties. Mater. Lett. 2014, 130, 277–280.
8. Yu, Y.; Wang, J.; Li, J.; Kou, H.; Liu, W.J.T.I. Tribological behavior of AlCoCrCuFeNi and AlCoCrFeNiTi_0.5 High entropy alloys under Hydrogen peroxide solution against different counterparts. Tribol. Int. 2015, 92, 203–210.
9. Cheng, P.; Zhao, Y.; Xu, X.; Wang, S.; Sun, Y.; Hou, H. Microstructural evolution and mechanical properties of Al_0.3CoCrFeNiSi_x high-entropy alloys containing coherent nanometer-scaled precipitates. Mater. Sci. Eng. A 2020, 772, 138681.
10. Grabowski, B.; Ma, D.; Neugebauer, J.; Raabe, D.; Kormann, F. Ab initio thermodynamics of the CoCrFeMnNi high entropy alloy: Importance of entropy contributions beyond the configurational one. Acta Mater. 2015, 100, 90–97.
11. Zhang, C.; Zhang, F.; Chen, S. Computational Thermodynamics Aided High-Entropy Alloy Design. JOM 2012, 64, 839–845.
12. Jiang, C.; Uberuaga, B.P. Efficient Ab initio Modeling of Random Multicomponent Alloys. Phys. Rev. Lett. 2016, 116, 105501.
13. Saal, J.E.; Berglund, I.S.; Sebastian, J.T.; Liaw, P.K.; Olson, G.B. Equilibrium high entropy alloy phase stability from experiments and thermodynamic modeling. Scr. Mater. 2017, 146, 5–8.
14. Senkov, O.N.; Miller, J.D.; Miracle, D.B.; Woodward, C. Accelerated exploration of multi-principal element alloys for structural applications. Calphad 2015, 50, 32–48.
15. Li, R.; Xie, L.; Wang, W.Y.; Liaw, P.K.; Zhang, Y. High-Throughput Calculations for High-Entropy Alloys: A Brief Review. Front. Mater. 2020, 7, 290.
16. Hume-Rothery, W. Comments on papers resulting from Hume-Rothery’s Note—1965. Acta Metall. 1967, 15, 567–569.
17. Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
18. Huang, W.; Bai, X.-M. Machine learning based on-the-fly kinetic Monte Carlo simulations of sluggish diffusion in Ni-Fe concentrated alloys. J. Alloys Compd. 2023, 937, 168457.
19. Zhang, Y.-F.; Ren, W.; Wang, W.-L.; Ding, S.-J.; Li, N.; Chang, L.; Zhou, Q. Machine learning combined with solid solution strengthening model for predicting hardness of high entropy alloys. Acta Phys. Sin. 2023, 72, 110177.
20. Huang, W.; Martin, P.; Zhuang, H.L. Machine-learning phase prediction of high-entropy alloys. Acta Mater. 2019, 169, 225–236.
21. Krishna, Y.V.; Jaiswal, U.K.; Rahul, M.R. Machine learning approach to predict new multiphase high entropy alloys. Scr. Mater. 2021, 197, 113804.
22. Islam, N.; Huang, W.; Zhuang, H.L. Machine learning for phase selection in multi-principal element alloys. Comput. Mater. Sci. 2018, 150, 230–235.
23. Agarwal, A.; Prasada, R.A.K. Artificial Intelligence Predicts Body-Centered-Cubic and Face-Centered-Cubic Phases in High-Entropy Alloys. JOM 2019, 71, 3424–3432.
24. Mandal, P.; Choudhury, A.; Mallick, A.B.; Ghosh, M. Phase Prediction in High Entropy Alloys by Various Machine Learning Modules Using Thermodynamic and Configurational Parameters. Met. Mater. Int. 2023, 29, 38–52.
25. Machaka, R. Machine learning-based prediction of phases in high-entropy alloys. Comput. Mater. Sci. 2021, 188, 110244.
26. Zhang, Y.; Wen, C.; Wang, C.; Antonov, S.; Xue, D.; Bai, Y.; Su, Y. Phase prediction in high entropy alloys with a rational selection of materials descriptors and machine learning models. Acta Mater. 2020, 185, 528–539.
27. Chang, H.; Tao, Y.; Liaw, P.K.; Ren, J. Phase prediction and effect of intrinsic residual strain on phase stability in high-entropy alloys with machine learning. J. Alloys Compd. 2022, 921, 166149.
28. Risal, S.; Zhu, W.; Guillen, P.; Sun, L. Improving phase prediction accuracy for high entropy alloys with Machine learning. Comput. Mater. Sci. 2021, 192, 110389.
29. Borg, C.K.H.; Frey, C.; Moh, J.; Pollock, T.M.; Gorsse, S.; Miracle, D.B.; Senkov, O.N.; Meredig, B.; Saal, J.E. Expanded dataset of mechanical properties and observed phases of multi-principal element alloys. Sci. Data 2020, 7, 430.
30. Yang, X.; Zhang, Y. Prediction of high-entropy stabilized solid-solution in multi-component alloys. Mater. Chem. Phys. 2012, 132, 233–238.
31. Qi, J.; Cheung, A.M.; Poon, S.J. High Entropy Alloys Mined From Binary Phase Diagrams. Sci. Rep. 2019, 9, 15501.
32. Pei, Z.; Yin, J.; Hawk, J.A.; Alman, D.E.; Gao, M.C. Machine-learning informed prediction of high-entropy solid solution formation: Beyond the Hume-Rothery rules. NPJ Comput. Mater. 2020, 6, 50.
33. Lee, K.; Ayyasamy, M.V.; Delsa, P.; Hartnett, T.Q.; Balachandran, P.V. Phase classification of multi-principal element alloys via interpretable machine learning. NPJ Comput. Mater. 2022, 8, 25.
34. Ward, L.; Agrawal, A.; Choudhary, A.; Wolverton, C. A general-purpose machine learning framework for predicting properties of inorganic materials. NPJ Comput. Mater. 2016, 2, 16028.
35. Zhang, Y.; Zhou, Y.J.; Lin, J.P.; Chen, G.L.; Liaw, P.K. Solid-Solution Phase Formation Rules for Multi-component Alloys. Adv. Eng. Mater. 2008, 10, 534–538.
36. Zhang, Y.-F.; Ren, W.; Wang, W.-L.; Li, N.; Zhang, Y.-X.; Li, X.-M.; Li, W.-H. Interpretable hardness prediction of high-entropy alloys through Ensemble learning. J. Alloys Compd. 2023, 945, 169329.
37. Megahed, F.M.; Chen, Y.-J.; Megahed, A.; Ong, Y.; Altman, N.; Krzywinski, M. The class imbalance problem. Nat. Methods 2021, 18, 1270–1272.
38. Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777.

Figure 1. Confusion matrix of (a) LR, (b) KNN, (c) SVM, (d) BPNN, (e) AdaBoost, (f) RF, (g) LightGBM, (h) GBDT, and (i) XGBoost algorithms.

Figure 2. Feature selection. (a) The importance of the 145 features was evaluated using the LightGBM algorithm, and the accuracy of LightGBM was evaluated under 10-fold cross-validation as features were added in order of importance. (b) Global search of the feature set selected in (a) using GA. (c) Feature importance levels in the LightGBM algorithm. (d) Heatmap of feature intercorrelation measured by PCC.

Figure 3. Pair-plot matrix showing the correlation between the nine features linked with phase selection in the 1009 alloys. The diagonal plots show the distribution of BCC, FCC, HCP, and BCC+FCC over the values of each of the nine features. The off-diagonal plots show the distribution of the phases between any two of the nine features. The subplots outlined by the red box show the distributions of the feature Col against the other features.

Figure 4. Relationship between features and alloy phases analyzed by the SHAP method.

Figure 5. Confusion matrices of cost-sensitive LightGBM combined with the (a) SMOTE, (b) ADASYN, (c) Borderline-SMOTE, and (d) SVM-SMOTE oversampling methods.

Table 1. Prediction accuracy, precision, recall, and F1-score of ML algorithms under 10-fold cross-validation.

ML Algorithm	Accuracy	Precision	Recall	F1-Score
LR	0.7929	0.7846	0.7929	0.7848
KNN	0.7650	0.7622	0.7651	0.7624
SVM-rbf	0.7889	0.8219	0.7889	0.7834
BPNN	0.7967	0.7862	0.7899	0.7875
AdaBoost	0.6026	0.6910	0.6016	0.6182
Random Forest	0.9485	0.9478	0.9485	0.9477
LightGBM	0.9515	0.9508	0.9514	0.9507
GBDT	0.9465	0.9457	0.9465	0.9458
XGBoost	0.9445	0.9436	0.9445	0.9435

Table 2. Test results of the alloy phase prediction model.

Alloy	Experimental Alloy Phase	Predicted Alloy Phase
Hf_0.26Nb_1.0Ta_1.0Ti_0.58Zr_0.42	BCC	BCC
La_0.04Pb_0.96Se_1.0Sn_1.0Te_1.0	FCC	FCC
Al_0.5Co_0.5Cr_0.5Cu_0.25Fe_0.5Ni_1.0	BCC+FCC	BCC+FCC
Mo_0.1Nb_1.0Ti_1.0V_0.3Zr_1.0	BCC	BCC
Al_0.29Cr_0.34Fe_1.0Mn_0.66Ni_0.57	FCC	FCC

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
