Predicting Models for Plant Metabolites Based on PLSR, AdaBoost, XGBoost, and LightGBM Algorithms Using Hyperspectral Imaging of Brassica juncea

Yoon, Hyo In; Lee, Hyein; Yang, Jung-Seok; Choi, Jae-Hyeong; Jung, Dae-Hyun; Park, Yun Ji; Park, Jai-Eok; Kim, Sang Min; Park, Soo Hyun

doi:10.3390/agriculture13081477

Open Access Communication

by Hyo In Yoon ¹, Hyein Lee ¹, Jung-Seok Yang ¹, Jae-Hyeong Choi ¹,², Dae-Hyun Jung ¹,³, Yun Ji Park ¹, Jai-Eok Park ¹, Sang Min Kim ¹,² and Soo Hyun Park ¹,*

¹ Smart Farm Research Center, Korea Institute of Science and Technology (KIST), Saimdang-ro 679, Gangneung 25451, Republic of Korea

² Department of Bio-Medical Science & Technology, University of Science and Technology, Seoul 02792, Republic of Korea

³ Department of Smart Farm Science, Kyung Hee University, Yongin 17104, Republic of Korea

* Author to whom correspondence should be addressed.

Agriculture 2023, 13(8), 1477; https://doi.org/10.3390/agriculture13081477

Submission received: 24 May 2023 / Revised: 29 June 2023 / Accepted: 29 June 2023 / Published: 26 July 2023

(This article belongs to the Special Issue Advances in Agricultural Engineering Technologies and Application)

Abstract: The integration of hyperspectral imaging with machine learning algorithms has presented a promising strategy for the non-invasive and rapid detection of plant metabolites. For this study, we developed prediction models using partial least squares regression (PLSR) and boosting algorithms (such as AdaBoost, XGBoost, and LightGBM) for five metabolites in Brassica juncea leaves: total chlorophyll, phenolics, flavonoids, glucosinolates, and anthocyanins. To enhance the model performance, we employed several spectral data preprocessing methods and feature-selection algorithms. Our results showed that the boosting algorithms generally outperformed the PLSR models in terms of prediction accuracy. In particular, the LightGBM model for chlorophyll and the AdaBoost model for flavonoids improved the prediction performance, with R²_P = 0.71–0.74, compared to the PLSR models (R²_P = 0.53–0.58). The final models for the glucosinolates and anthocyanins performed sufficiently for practical uses such as screening, with R²_P = 0.82–0.85 and RPD = 2.4–2.6. Our findings indicate that the application of a single preprocessing method is more effective than utilizing multiple techniques. Additionally, the boosting algorithms with feature selection exhibited superior performance compared to the PLSR models in the majority of cases. These results highlight the potential of hyperspectral imaging and machine learning algorithms for the non-destructive and rapid detection of plant metabolites, which could have significant implications for the field of smart agriculture.

Keywords:

prediction models; hyperspectral image; PLSR model; AdaBoost; XGBoost; LightGBM

1. Introduction

Mustard (Brassica juncea), also commonly known as Chinese mustard, brown mustard, leaf mustard, vegetable mustard, and oriental mustard, is an annual plant that belongs to the Brassicaceae family [1]. Mustard contains bioactive components such as glucosinolates and their degradation products; polyphenols (flavonoids and anthocyanins); and large amounts of dietary fiber, chlorophyll, β-carotene, ascorbic acid, minerals, and volatile components [2]. Mustard is used as a spice because of its pungent taste. It also has important uses in medicine; its leaves are used as a diuretic, stimulant, and expectorant in folk medicine. Previous studies have found that B. juncea has bactericidal properties, can reduce the risk of atherosclerosis, and has antioxidant- and peroxynitrite-scavenging effects [1,3]. Additionally, B. juncea has exhibited antibacterial and antitumor properties and has been shown to improve various metabolic disorders [4]. Despite its excellent bioactivity, its industrial use as a raw material for medicine is limited by traditional analytical techniques, which are time-consuming and destructive. Therefore, a non-destructive method of determining bioactive compound contents should be developed for quality control in the production stage.

Recently, hyperspectral imaging has been used for the assessment of the biophysical traits of plants. Spectral information from hyperspectral images can be combined with various data processing and mining tools to ensure fast, non-destructive, and highly accurate detection of functional component contents [5]. Preprocessing of spectral data, which commonly includes normalization, derivatives, and smoothing, is an important step for suppressing the undesired effects of measurement conditions and enhancing relevant features [6]. Partial least squares regression (PLSR) is a widely used method to analyze large amounts of hyperspectral data and predict functional components in plants, such as chlorophyll and carotenoids in spinach [7] or total polyphenols in cocoa beans [8]. In our previous study, we aimed to develop a predictive model for the functional components of mustard plants using a PLSR prediction model based on hyperspectral images and preprocessing techniques [9]. In that study, we found that a preprocessing combination of standard normal variate (SNV) transformation and the first derivative (1st Der) resulted in high-performance prediction models for the total chlorophyll, carotenoid, and glucosinolate contents, while a preprocessing combination of the Savitzky–Golay (SG) filter and SNV transformation gave the highest prediction performance for the total phenolics. However, the accuracy of these models was limited because the amount of data was relatively small and the models were only applied in an indoor environment.

Machine learning techniques, combined with hyperspectral imaging, have been extensively used for the determination of food quality [10], such as identifying contaminants in food [11]. Among them, boosting methods in ensemble learning are attracting attention for their outstanding performance and have paved the way for data analysis. Boosting algorithms, such as adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), and the light gradient-boosting machine (LightGBM), have performed well in hyperspectral imaging-based data classification tasks [12,13]. Effective training of machine learning models usually requires abundant data for a more accurate predictive model [14]. To train the model and improve on the accuracy of the PLSR prediction model for functional components such as chlorophyll, phenolics, flavonoids, glucosinolates, and anthocyanins in mustard plants, we first acquired more hyperspectral imaging data of plant leaves. For this study, we aimed to develop a model with excellent predictive performance by adding enough training data to apply boosting algorithms and applying a combination of data processing methods. This analysis has expanded upon the previous study by including the prediction of the total phenolic components, which was not previously considered. However, the prediction of the total carotenoids was excluded from the current study due to its poor performance in the previous study. To apply the developed model and predict functional components in the growing environment, hyperspectral images were measured from various angles.

2. Materials and Methods

2.1. Training Data Acquisition

The plant growth conditions and analysis methods used for this study were the same as those described in detail in the previous study [9]. Briefly, mustard plants (B. juncea L. Czern.) were cultivated in three different environments. Plants in an indoor farm were hydroponically grown under mixed LEDs and with Hoagland nutrient solution. Plants in a greenhouse and an open field were grown in pots filled with commercial soil and fertilizer. Fifteen plants from each cultivation environment were harvested for 4 weeks after the transplant to ensure variation in growth stage and leaf color. A total of 122 fully expanded leaves were collected for analysis.

As with the experimental setup in the previous study [9], the hyperspectral imaging system consisted of a hyperspectral imaging camera (MicroHSI 410 SHARK; Corning Inc., Corning, NY, USA) and eight 15 W halogen lamps. A total of 112 hyperspectral images were acquired, with 1408 spatial pixels and 150 spectral bands in the range of 400–1000 nm. After the hyperspectral imaging data were obtained, the leaves were freeze-dried for 4 days and powdered for component analysis [9,15]. The powder obtained by pulverization after freeze-drying was divided into 3 replicates of 20 mg each and used for the analysis of 5 functional components. Briefly, previously published methods were used for the determination of the total chlorophyll content [16], total phenolic content [17], total flavonoid content [18], total glucosinolate content [19], and total anthocyanin content [20]. In the component analysis, replicate values that deviated strongly from the others were excluded, and the average of the remaining values was used as the component value for model development.

2.2. Data Processing and Prediction Models

The average of the spectral data was extracted from hyperspectral images within predefined regions of interest, as in the previous method [9]. The average spectral data of 150 bands for each of the 112 hyperspectral images were obtained [9]. The preprocessing methods, used alone or in combination (Table S1), included normalization, logarithmic transformation, a Savitzky–Golay (SG) filter, the 1st and 2nd derivatives after SG filtering, multiplicative scatter correction (MSC), and standard normal variate (SNV) transformation. The SG filter was applied with a third-order polynomial fit over a window of five data points, using the SciPy package in Python 3.9. A total of 36 preprocessing combinations were used to prepare the spectral data for the development of the predictive models.
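The preprocessing steps named above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the synthetic spectra, the base-10 logarithm, and the use of the mean spectrum as the MSC reference are all assumptions. Only the SG settings (five-point window, third-order polynomial) and the matrix size (112 spectra by 150 bands) come from the text.

```python
import numpy as np
from scipy.signal import savgol_filter


def sg_smooth(spectra, deriv=0):
    """Savitzky-Golay filter: five-point window, third-order polynomial.
    deriv=1 or deriv=2 gives the 1st or 2nd derivative after smoothing."""
    return savgol_filter(spectra, window_length=5, polyorder=3, deriv=deriv, axis=1)


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row)."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum
    (assumed here to be the mean spectrum unless one is supplied)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        # Fit row ~ slope * ref + intercept, then remove both effects.
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected


def log_inverse(spectra):
    """Log (1/R) transformation of reflectance R (base 10 is an assumption)."""
    return np.log10(1.0 / spectra)


# Synthetic stand-in for the 112 x 150 matrix of mean leaf spectra.
rng = np.random.default_rng(0)
base = 0.3 + 0.2 * np.sin(np.linspace(0, 3, 150))
reflectance = base * rng.uniform(0.8, 1.2, (112, 1)) + rng.uniform(0.0, 0.05, (112, 1))
reflectance += rng.normal(0, 0.002, reflectance.shape)

# One combination from the study: Log (1/R), then 1st Der, then MSC.
combined = msc(sg_smooth(log_inverse(reflectance), deriv=1))
```

Chaining the functions in different orders gives the kind of preprocessing combinations listed in Table S1; the order used in the last line is only one plausible reading of "Log (1/R) + 1st Der + MSC".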

Partial least squares regression (PLSR), adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), and light gradient-boosting machine (LightGBM) algorithms were applied to predict the content of each metabolite in the plants. PLSR is a method that is commonly used to predict metabolite content from hyperspectral data. It works by extracting latent variables (LVs), which are linear combinations of the original predictor variables that capture the maximum covariance between the spectral data and the response. The number of LVs is chosen based on the optimal performance of the relevant model, which is typically determined through cross-validation.

AdaBoost, XGBoost, and LightGBM are boosting algorithms that are also commonly used for regression tasks [21,22,23]. Boosting algorithms combine multiple weak learners (e.g., decision trees) into a strong learner, which improves the accuracy of predictions. In this study, boosting algorithms were used for both feature selection and regression.

To reduce redundant information in the hyperspectral data, feature selection based on the feature importance obtained from the boosting algorithms was used. Only bands with a feature importance value greater than 1.25 times the average value were selected. Model development was implemented using the Scikit-learn, XGBoost, and LightGBM packages in Python 3.9.
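The selection rule above (keep bands whose importance exceeds 1.25 times the mean importance) might look like the following sketch. It uses scikit-learn's `AdaBoostRegressor` for both selection and regression; the XGBoost and LightGBM regressors expose a comparable `feature_importances_` attribute. The helper name `select_bands`, the hyperparameters, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor


def select_bands(X, y, factor=1.25, seed=0):
    """Fit a boosting model on all bands and keep those whose feature
    importance is greater than `factor` times the mean importance."""
    selector = AdaBoostRegressor(n_estimators=100, random_state=seed).fit(X, y)
    importance = selector.feature_importances_
    mask = importance > factor * importance.mean()
    return mask, importance


# Synthetic data in which bands 20 and 90 carry the signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(89, 150))
y = 2.0 * X[:, 20] - 1.0 * X[:, 90] + 0.1 * rng.normal(size=89)

band_mask, importance = select_bands(X, y)

# Refit the regression model on the selected bands only.
model = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X[:, band_mask], y)
print(int(band_mask.sum()), "of", X.shape[1], "bands selected")
```

Because the selection and regression steps are separate, the selector and the final regressor need not be the same algorithm, which matches pairings in the study such as XGBoost selection followed by LightGBM prediction.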

The preprocessing and feature-selection methods were determined, and the parameters for all of the algorithms were optimized after tenfold cross-validation based on the training dataset, corresponding to 80% of the data. After hyperparameter tuning, the performance of the final model was tested with an independent validation dataset that corresponded to 20% of the data. The model performance was evaluated based on the coefficient of determination (R²) and the root mean square error (RMSE), as follows:

R^{2} = 1 - \frac{\sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}

(1)

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}

(2)

where y_{i} is the measured value of the component analysis; \hat{y}_{i} is the value predicted by the model; \bar{y} is the mean value of the component analysis; and n is the number of samples.
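As a small worked check of Equations (1) and (2), the sketch below computes both metrics with NumPy on four made-up values. The function names and the numbers are illustrative only.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination, Equation (1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, Equation (2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Hypothetical measured and predicted contents for four samples.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]

# Residual sum of squares = 2, total sum of squares = 20.
r2_value = r_squared(measured, predicted)   # 1 - 2/20 = 0.9
rmse_value = rmse(measured, predicted)      # sqrt(2/4), about 0.707
```

The same functions apply to calibration, cross-validation, and prediction results; only the set of samples passed in changes.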

3. Results and Discussion

3.1. Development of a Prediction Model Based on Hyperspectral Imaging with the PLSR, AdaBoost, XGBoost, and LightGBM Algorithms

The total chlorophyll, phenolic, flavonoid, glucosinolate, and anthocyanin contents in the B. juncea plants are summarized in Table 1. The reflectance spectra of 112 leaves were obtained by averaging the hyperspectral data, followed by preprocessing, as shown in Figure 1. In the spectra of the B. juncea leaves, reflectance was relatively low in the green region and relatively high in the red region compared to that of a typical green leaf. These differences could be caused by the absolute contents and ratio of chlorophyll and anthocyanin in the leaf [24]. The plant used in this study was a red mustard cultivar with purple–green leaves and a high anthocyanin content (Table 1).

PLSR models for five metabolites were developed using 36 preprocessing methods (Table S1). The optimal combination of preprocessing methods for each of the five PLSR models, as well as the optimal number of latent variables (LVs) for each component, was determined based on the lowest root mean square error of cross-validation (RMSECV) values, as shown in Table 2. Spectral preprocessing is an essential step in order to avoid undesirable scattering effects and reveal signals that correspond to chemical components [25]. The appropriate preprocessing method will depend on various factors, including the wavelength range and interval, the prediction model, the target compound, and the plant organs used, such as leaves and fruits. In previous studies, the performance of the PLSR model in detecting total phenolic content using a VIS-NIR hyperspectral imaging system was improved with the normalization method in apple fruits [26] and with the SG filter and derivative transformation in Arabidopsis leaves [27]. Derivative transforms emphasize spectral features but also amplify noise in the data. The first and second derivatives remove an additive and a linear baseline, respectively. Logarithmic transformation can be employed to address a non-linear problem. In a previous study using the SWIR hyperspectral imaging system, the logarithmic transformation Log (1/R) improved the performance of the PLSR model for the ABA content in zucchini leaves [28]. MSC and SNV transformation are useful in reducing spectral variability due to scattering and baseline shifts. To further improve model performance, spectral preprocessing methods can be used in combination [9]. In this study, the prediction performance of the PLSR model was higher with the single preprocessing methods than with combinations of multiple methods (Table 2).

The AdaBoost, XGBoost, and LightGBM prediction models were also developed using 36 preprocessing methods. The best preprocessing method for each algorithm and metabolite was determined based on low RMSECV values (Table S2). After that, the prediction models were compared according to the selection of three features (bands) based on the feature importance in the boosting algorithms (Table 3). For several models, the reduced set of spectral bands selected by the algorithms gave better performance than the full set of bands. Hyperspectral data require band selection because they contain a large amount of highly correlated and redundant information. Reducing the number of features, even to less than 20% of the total number of bands, can enhance the performance of a regression algorithm [28,29]. Combinations of different feature-selection and regression algorithms can improve model accuracy [30].

The importance values for selecting the features of each best performance model are given in Figure 2. For chlorophyll prediction, the highest importance was at 480.57 nm, followed by 916.83 nm, among 17 bands selected based on the XGBoost algorithm with 1st Der processing data. For phenolic prediction, the feature importance was the highest at 904.82 nm, followed by 760.73 nm, among 28 bands selected based on the LightGBM algorithm with Norm processing data. The selected features were distributed in the ranges of 488.57–544.61 nm and 672.68–992.87 nm, respectively. For flavonoid prediction, the feature importance was the highest at 692.69 nm, followed by 608.64 nm, among 33 bands selected based on the AdaBoost algorithm with 2nd Der processing data. The feature importance for glucosinolate prediction was concentrated in the range of 870–900 nm. The highest importance values were, in order, at 872.80, 896.81, 880.8, and 628.66 nm among 28 bands selected based on the AdaBoost algorithm with SNV processing data. For anthocyanin prediction, the feature importance was the highest at 924.83 nm among 37 bands selected with the LightGBM algorithm with Log (1/R), 1st Der, and MSC processing data. The features were selected depending on the spectral data preprocessing method as well as the selection algorithm.

Overall, the boosting algorithms showed better prediction performances compared to the PLSR models (Table 2 and Table 3). Specifically, the LightGBM model was found to be the best for predicting chlorophyll, while the AdaBoost model was the best for predicting phenolics, flavonoids, glucosinolates, and anthocyanins. The boosting models performed better than the best PLSR models, except for anthocyanins, where the PLSR models showed better performances. The performances of the best prediction models for five metabolites are given in Figure 3. The best model for chlorophyll was the 1st Der processing–XGBoost selection–LightGBM prediction model with 17 bands selected (R²_P = 0.737, RMSEP = 1.052). The best model for phenolics was the Norm processing–LightGBM selection–AdaBoost prediction model with 28 bands selected (R²_P = 0.594, RMSEP = 1.426). For flavonoids, the 2nd Der processing–AdaBoost selection–AdaBoost prediction model with 33 bands selected performed the best (R²_P = 0.709, RMSEP = 1.417). For glucosinolates, the SNV processing–AdaBoost selection–AdaBoost prediction model with 28 bands selected was best (R²_P = 0.816, RMSEP = 4.744). The best boosting model for anthocyanins was the Log (1/R)–1st Der–MSC processing–LightGBM selection–AdaBoost prediction model with 37 bands selected (R²_P = 0.824, RMSEP = 1.876). The best PLSR model for anthocyanins was that with 9 LVs and using Log (1/R)–SG filter–MSC processing data, which had the highest performance of all of the models (R²_P = 0.850, RMSEP = 1.728). The ratio of performance to deviation (RPD), i.e., the standard deviation of the measured values divided by the RMSEP, was 2.4 for the final AdaBoost models for glucosinolates and anthocyanins, which indicates that these models are sufficient for practical screening applications [31,32]. However, the best PLSR model for anthocyanins performed even better, with an RPD value of 2.6. It is worth noting that the selection of the best model depended on various factors, including the spectral data preprocessing method and the selection algorithm used.
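For readers unfamiliar with the RPD, the sketch below computes it under the common definition of the standard deviation of the measured validation values divided by the RMSEP. The use of the sample standard deviation (`ddof=1`) is an assumption, since conventions differ, and the four values are made up.

```python
import numpy as np


def rpd(y_true, y_pred, ddof=1):
    """RPD = SD of the measured values / RMSEP.
    ddof=1 (sample SD) is an assumed convention."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmsep = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return float(np.std(y_true, ddof=ddof) / rmsep)


# Hypothetical measured and predicted contents for four validation samples.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]

# SD = sqrt(20/3), RMSEP = sqrt(1/2), so RPD = sqrt(40/3), about 3.65.
rpd_value = rpd(measured, predicted)
```

A higher RPD means the prediction error is small relative to the natural spread of the measured values, which is why the text treats values around 2.4 to 2.6 as adequate for screening.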

3.2. Application of the Functional Component Prediction Model with Visualization

Prediction models based on hyperspectral imaging can be used to predict content at the single-pixel level and generate compound distribution maps. The prediction models developed here were applied to plants in a growing environment, using the spectrum of every pixel. The spatial distribution of five metabolites was found to be uneven across the leaf area (Figure 4). Yuan et al. (2021) visualized the distribution of SPAD values, which indicate chlorophyll content, in pepper leaves [29]. The distribution of the total phenolics has been visualized using hyperspectral imaging and modeling in Arabidopsis plants [27] and shelled cocoa beans [8]. Hence, by employing a hyperspectral imaging system and the necessary software to run the algorithm, we could non-destructively and continuously monitor the compound distribution. This phytochemical monitoring will aid in making cultivation decisions to effectively control the quality of functional plants.
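Pixel-wise application of a trained model can be sketched as follows. The helper `predict_map`, the leaf mask, the band mask, and the synthetic cube are hypothetical stand-ins; in practice each pixel spectrum would also need the same preprocessing that was used when the model was trained.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor


def predict_map(cube, model, band_mask, leaf_mask):
    """Predict a content value for every leaf pixel of a hyperspectral cube.

    cube: (rows, cols, bands) array; band_mask: boolean vector of selected
    bands; leaf_mask: (rows, cols) boolean array marking leaf pixels.
    Pixels outside the leaf are returned as NaN."""
    spectra = cube[leaf_mask][:, band_mask]
    content = np.full(leaf_mask.shape, np.nan)
    content[leaf_mask] = model.predict(spectra)
    return content


rng = np.random.default_rng(3)

# Hypothetical training data: 60 mean spectra, content driven by band 40.
X_train = rng.uniform(0.1, 0.6, (60, 150))
y_train = 10.0 * X_train[:, 40] + 0.1 * rng.normal(size=60)
band_mask = np.zeros(150, dtype=bool)
band_mask[[10, 40, 120]] = True
model = AdaBoostRegressor(n_estimators=50, random_state=0)
model.fit(X_train[:, band_mask], y_train)

# Hypothetical 20 x 30 pixel cube with a rectangular "leaf" region.
cube = rng.uniform(0.1, 0.6, (20, 30, 150))
leaf_mask = np.zeros((20, 30), dtype=bool)
leaf_mask[5:15, 8:22] = True

content_map = predict_map(cube, model, band_mask, leaf_mask)
```

The resulting array can be passed to any image-plotting routine to obtain a distribution map of the kind shown in Figure 4.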

4. Conclusions

A prediction model using hyperspectral imaging was developed based on PLSR and boosting algorithms such as AdaBoost, XGBoost, and LightGBM to predict five metabolites in B. juncea: total chlorophyll, phenolics, flavonoids, glucosinolates, and anthocyanins. To improve the model performance, various spectral data preprocessing methods and feature-selection algorithms were adopted. For both the PLSR and boosting models, the prediction performance was higher with single preprocessing methods than with combinations of multiple methods. Feature selection based on boosting algorithms could improve prediction performance. The cross-validation and prediction performances were better in the boosting algorithms than in the PLSR models, except for anthocyanin prediction. In particular, the final models for glucosinolates and anthocyanins performed sufficiently for practical uses such as screening, with R²_P = 0.82–0.85 and RPD = 2.4–2.6. This research presents a promising approach for the rapid and accurate prediction of metabolites in plants using hyperspectral imaging, which can contribute to the development of precision agriculture and plant breeding.

Overall, our results showed that boosting algorithms can be applied to predict the functional components of medicinal plants. Many studies have compared spectral data preprocessing methods and tried to improve prediction performance. We have confirmed that prediction performance can be improved by reducing spectral bands with a feature-selection algorithm. To develop faster and more accurate prediction techniques, it is necessary to continuously introduce the latest algorithms and data processing methods. Based on hyperspectral images, non-destructive monitoring techniques of functional components can be used as tools for quality control in the field of smart agriculture, including in the medicinal plant industry.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agriculture13081477/s1, Table S1: List of preprocessing methods for hyperspectral data of B. juncea plants; Table S2: Determination of preprocessing methods for AdaBoost, XGBoost, and LightGBM prediction algorithms for five metabolites in B. juncea plants.

Author Contributions

Conceptualization, S.H.P. and S.M.K.; funding acquisition, S.M.K.; investigation, S.H.P. and H.I.Y.; methodology, H.I.Y., J.-H.C., D.-H.J., J.-E.P. and Y.J.P.; project administration, S.H.P.; software, H.I.Y. and H.L.; validation, H.I.Y. and S.H.P.; writing—original draft, H.I.Y. and S.H.P.; review and editing, H.I.Y., J.-S.Y. and S.H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korean Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET) and by the Korean Smart Farm R&D Foundation (KosFarm) through the Smart Farm Innovation Technology Development Program, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA), the Ministry of Science and ICT (MSIT), and the Rural Development Administration (RDA) (421034-04).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article and Supplementary Materials; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare that this research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

1. Szőllősi, R. Indian Mustard (Brassica juncea L.) Seeds in Health. In Nuts and Seeds in Health and Disease Prevention; Academic Press: Cambridge, MA, USA, 2020; pp. 357–364.
2. Tian, Y.; Deng, F. Phytochemistry and Biological Activity of Mustard (Brassica juncea): A Review. CyTA—J. Food 2020, 18, 704–718.
3. Kumar, V.; Kumar Thakur, A.; Dev Barothia, N.; Chatterjee, S.S. Therapeutic Potentials of Brassica juncea: An Overview. CellMed 2011, 1, e2.
4. Park, C.H.; Park, Y.E.; Yeo, H.J.; Kim, J.K.; Park, S.U. Effects of Light-Emitting Diodes on the Accumulation of Phenolic Compounds and Glucosinolates in Brassica juncea Sprouts. Horticulturae 2020, 6, 77.
5. Sarić, R.; Nguyen, V.D.; Burge, T.; Berkowitz, O.; Trtílek, M.; Whelan, J.; Lewsey, M.G.; Čustović, E. Applications of Hyperspectral Imaging in Plant Phenotyping. Trends Plant Sci. 2022, 27, 301–315.
6. Beć, K.B.; Grabska, J.; Bonn, G.K.; Popp, M.; Huck, C.W. Principles and Applications of Vibrational Spectroscopic Imaging in Plant Science: A Review. Front. Plant Sci. 2020, 11, 1226.
7. Zhang, C.; Wang, Q.; Liu, F.; He, Y.; Xiao, Y. Rapid and Non-Destructive Measurement of Spinach Pigments Content during Storage Using Hyperspectral Imaging with Chemometrics. Measurement 2017, 97, 149–155.
8. Caporaso, N.; Whitworth, M.B.; Fowler, M.S.; Fisk, I.D. Hyperspectral Imaging for Non-Destructive Prediction of Fermentation Index, Polyphenol Content and Antioxidant Activity in Single Cocoa Beans. Food Chem. 2018, 258, 343–351.
9. Choi, J.-H.; Park, S.H.; Jung, D.-H.; Park, Y.J.; Yang, J.-S.; Park, J.-E.; Lee, H.; Kim, S.M. Hyperspectral Imaging-Based Multiple Predicting Models for Functional Component Contents in Brassica juncea. Agriculture 2022, 12, 1515.
10. Saha, D.; Manickavasagan, A. Machine Learning Techniques for Analysis of Hyperspectral Images to Determine Quality of Food Products: A Review. Curr. Res. Food Sci. 2021, 4, 28–44.
11. Bonifazi, G.; Capobianco, G.; Gasbarrone, R.; Serranti, S. Contaminant Detection in Pistachio Nuts by Different Classification Methods Applied to Short-Wave Infrared Hyperspectral Images. Food Control 2021, 130, 108202.
12. Jafarzadeh, H.; Mahdianpari, M.; Gill, E.; Mohammadimanesh, F.; Homayouni, S. Bagging and Boosting Ensemble Classifiers for Classification of Multispectral, Hyperspectral and PolSAR Data: A Comparative Evaluation. Remote Sens. 2021, 13, 4405.
13. Weksler, S.; Rozenstein, O.; Haish, N.; Moshelion, M.; Wallach, R.; Ben-Dor, E. Detection of Potassium Deficiency and Momentary Transpiration Rate Estimation at Early Growth Stages Using Proximal Hyperspectral Imaging and Extreme Gradient Boosting. Sensors 2021, 21, 958.
14. Sha, W.; Guo, Y.; Yuan, Q.; Tang, S.; Zhang, X.; Lu, S.; Guo, X.; Cao, Y.-C.; Cheng, S. Artificial Intelligence to Power the Future of Materials Science and Engineering. Adv. Intell. Syst. 2020, 2, 1900143.
15. Park, Y.J.; Park, J.-E.; Truong, T.Q.; Koo, S.Y.; Choi, J.-H.; Kim, S.M. Effect of Chlorella Vulgaris on the Growth and Phytochemical Contents of “Red Russian” Kale (Brassica napus Var. Pabularia). Agronomy 2022, 12, 2138.
16. Lichtenthaler, H.K.; Buschmann, C. Chlorophylls and Carotenoids: Measurement and Characterization by UV-VIS Spectroscopy. Curr. Protoc. Food Anal. Chem. 2001, 1, F4.3.1–F4.3.8.
17. Thomas, M.; Badr, A.; Desjardins, Y.; Gosselin, A.; Angers, P. Characterization of Industrial Broccoli Discards (Brassica oleracea Var. Italica) for Their Glucosinolate, Polyphenol and Flavonoid Contents Using UPLC MS/MS and Spectrophotometric Methods. Food Chem. 2018, 245, 1204–1211.
18. Dewanto, V.; Xianzhong, W.; Adom, K.K.; Liu, R.H. Thermal Processing Enhances the Nutritional Value of Tomatoes by Increasing Total Antioxidant Activity. J. Agric. Food Chem. 2002, 50, 3010–3014.
19. Mawlong, I.; Sujith Kumar, M.S.; Gurung, B.; Singh, K.H.; Singh, D. A Simple Spectrophotometric Method for Estimating Total Glucosinolates in Mustard De-Oiled Cake. Int. J. Food Prop. 2017, 20, 3274–3281.
20. Yang, Y.-C.; Sun, D.-W.; Pu, H.; Wang, N.-N.; Zhu, Z. Rapid Detection of Anthocyanin Content in Lychee Pericarp during Storage Using Hyperspectral Imaging Coupled with Model Fusion. Postharvest Biol. Technol. 2015, 103, 55–65.
21. Freund, Y.; Schapire, R.E. A Short Introduction to Boosting. J. Jpn. Soc. Artif. Intell. 1999, 14, 771–780.
22. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
23. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process Syst. 2017, 30, 3146–3154.
24. Gitelson, A.A.; Merzlyak, M.N.; Chivkunova, O.B. Optical Properties and Nondestructive Estimation of Anthocyanin Content in Plant Leaves. Photochem. Photobiol. 2001, 74, 38–45.
25. Mishra, P.; Lohumi, S.; Khan, H.A.; Nordon, A. Close-Range Hyperspectral Imaging of Whole Plants for Digital Phenotyping: Recent Applications and Illumination Correction Approaches. Comput. Electron. Agric. 2020, 178, 105780.
26. Hasanzadeh, B.; Abbaspour-Gilandeh, Y.; Soltani-Nazarloo, A.; Hernández-Hernández, M.; Gallardo-Bernal, I.; Hernández-Hernández, J.L. Non-Destructive Detection of Fruit Quality Parameters Using Hyperspectral Imaging, Multiple Regression Analysis and Artificial Intelligence. Horticulturae 2022, 8, 598.
27. Jayapal, P.K.; Joshi, R.; Sathasivam, R.; Van Nguyen, B.; Faqeerzada, M.A.; Park, S.U.; Sandanam, D.; Cho, B.-K. Non-Destructive Measurement of Total Phenolic Compounds in Arabidopsis under Various Stress Conditions. Front. Plant Sci. 2022, 13, 982247.
28. Burnett, A.C.; Serbin, S.P.; Davidson, K.J.; Ely, K.S.; Rogers, A. Detection of the Metabolic Response to Drought Stress Using Hyperspectral Reflectance. J. Exp. Bot. 2021, 72, 6474–6489.
29. Yuan, Z.; Ye, Y.; Wei, L.; Yang, X.; Huang, C. Study on the Optimization of Hyperspectral Characteristic Bands Combined with Monitoring and Visualization of Pepper Leaf SPAD Value. Sensors 2021, 22, 183.
30. Luo, M.; Wang, Y.; Xie, Y.; Zhou, L.; Qiao, J.; Qiu, S.; Sun, Y. Combination of Feature Selection and CatBoost for Prediction: The First Application to the Estimation of Aboveground Biomass. Forests 2021, 12, 216.
31. Bellon-Maurel, V.; Fernandez-Ahumada, E.; Palagos, B.; Roger, J.-M.; McBratney, A. Critical Review of Chemometric Indicators Commonly Used for Assessing the Quality of the Prediction of Soil Attributes by NIR Spectroscopy. TrAC Trends Anal. Chem. 2010, 29, 1073–1081.
32. Heil, K.; Schmidhalter, U. An Evaluation of Different NIR-Spectral Pre-Treatments to Derive the Soil Parameters C and N of a Humus-Clay-Rich Soil. Sensors 2021, 21, 1423.

Figure 1. Single preprocessing methods for hyperspectral data of B. juncea plants: raw reflectance (A), normalization (B), logarithmic transformation (C), Savitzky–Golay filter (D), first and second derivatives after SG filtering (E,F), multiplicative scatter correction (G), and standard normal variate transformation (H). Colored lines represent different leaf samples. The combinations of preprocessing methods are listed in Table S1.

Figure 2. Feature importance values used to determine the best prediction models for total chlorophyll (A), phenolics (B), flavonoids (C), glucosinolates (D), and anthocyanins (E) in B. juncea plants. Orange bars represent selected features, and light blue bars represent unselected features, i.e., those not used in the prediction model. The best prediction models are documented in Table 3.

Figure 3. The optimal models for predicting the concentrations of total chlorophyll (A), phenolics (B), flavonoids (C), glucosinolates (D), and anthocyanins (E,F) in B. juncea plants, as presented in Table 3. R²_P and RMSEP denote the coefficient of determination and the root mean square error of prediction, respectively.

Figure 4. Distribution maps of five metabolites in B. juncea plants, produced by applying the hyperspectral image-based prediction models in a growing environment: total chlorophyll, phenolics, flavonoids, glucosinolates, and anthocyanins.

Table 1. Statistical summary of the five components in B. juncea plants.

Metabolites	Min	Max	Mean	Standard Deviation
Chlorophyll (mg g⁻¹ DW)	2.54	12.17	7.24	2.28
Phenolics (mg g⁻¹ DW)	2.13	11.28	6.22	2.33
Flavonoids (mg g⁻¹ DW)	3.00	13.59	7.52	2.51
Glucosinolates (µmol g⁻¹ DW)	11.88	55.08	31.36	11.25
Anthocyanins (mg g⁻¹ DW)	0.00	33.80	3.23	5.35

Table 2. Performance of the five best PLSR models for each component of B. juncea plants, with their preprocessing methods and optimal numbers of latent variables (LVs).

			Calibration		Cross-Validation		Prediction	
Metabolites	Preprocessing Method	Optimal LVs	R²_C	RMSEC	R²_CV	RMSECV	R²_P	RMSEP
Chlorophyll	Log (1/R) + 1st Der + MSC	5	0.667	1.332	0.407	1.777	0.567	1.349
	Log (1/R) + 1st Der + SNV	5	0.667	1.332	0.405	1.779	0.553	1.371
	Raw reflectance	8	0.619	1.425	0.393	1.798	0.530	1.405
	SG filter	2	0.431	1.740	0.388	1.806	0.526	1.411
	1st Der	5	0.659	1.347	0.384	1.811	0.575	1.337
Phenolics	SG filter	11	0.731	1.206	0.433	1.751	0.558	1.487
	Norm + SNV	6	0.566	1.532	0.425	1.761	0.458	1.648
	SNV	6	0.566	1.532	0.425	1.761	0.458	1.648
	Norm + SG filter + SNV	6	0.554	1.552	0.424	1.763	0.440	1.675
	SG filter + SNV	6	0.554	1.552	0.424	1.763	0.440	1.675
Flavonoids	1st Der	1	0.452	1.814	0.406	1.888	0.507	1.845
	SG filter	1	0.442	1.830	0.404	1.892	0.477	1.902
	Raw reflectance	2	0.448	1.820	0.398	1.901	0.531	1.802
	2nd Der	1	0.450	1.818	0.382	1.926	0.412	2.016
	Log (1/R) + 1st Der + MSC	3	0.509	1.716	0.365	1.953	0.233	2.303
Glucosinolates	Raw reflectance	8	0.783	5.229	0.647	6.667	0.725	5.804
	Norm + SG filter	7	0.746	5.651	0.647	6.668	0.662	6.435
	SG filter	11	0.807	4.922	0.646	6.679	0.759	5.436
	Norm + SG filter + MSC	8	0.753	5.581	0.633	6.794	0.646	6.591
	SG filter + MSC	8	0.753	5.581	0.633	6.799	0.645	6.592
Anthocyanins	Log (1/R) + SNV	8	0.854	2.109	0.746	2.780	0.836	1.808
	Log (1/R) + SG filter + SNV	9	0.855	2.104	0.745	2.786	0.849	1.737
	Log (1/R)	11	0.868	2.002	0.745	2.787	0.837	1.801
	Log (1/R) + MSC	9	0.853	2.115	0.744	2.791	0.861	1.664
	Log (1/R) + SG filter + MSC	9	0.850	2.137	0.743	2.794	0.850	1.728

R²: coefficient of determination; RMSEC, RMSECV, and RMSEP: root mean square errors of calibration, cross-validation, and prediction, respectively. Bold indicates the best performance based on the RMSEP for each component.
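
For readers who want to reproduce these indicators, a minimal Python sketch follows. It assumes the common definitions: R² as one minus the ratio of residual to total sum of squares, RMSE as the root of the mean squared residual, and RPD (reported in the abstract) as the standard deviation of the reference values divided by the RMSE. The function names and the toy values are hypothetical, and the exact variants used by the authors (for example, sample versus population standard deviation) are not stated here.

```python
import math
from statistics import mean, stdev


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rpd(y_true, y_pred):
    """Ratio of performance to deviation: SD of reference values / RMSE."""
    return stdev(y_true) / rmse(y_true, y_pred)  # sample SD is an assumption


# Hypothetical reference and predicted concentrations (not from the study).
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]

print(f"RMSE = {rmse(y_true, y_pred):.3f}")
print(f"R2   = {r_squared(y_true, y_pred):.3f}")
print(f"RPD  = {rpd(y_true, y_pred):.3f}")
```

Applying the same functions to the calibration, cross-validation, and prediction sets would give the C, CV, and P variants listed in the table.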

Table 3. Performance of the AdaBoost, XGBoost, and LightGBM prediction models for five metabolites in B. juncea plants according to the feature-selection algorithm, after selection of the preprocessing method and hyperparameter tuning.

		Feature Selection		Calibration		Cross-Validation		Prediction	
Prediction Model	Preprocessing Method	Method	Feature No.	R²_C	RMSEC	R²_CV	RMSECV	R²_P	RMSEP
Total Chlorophyll
AdaBoost	Log (1/R) + 2nd Der + MSC	Full band	150	0.878	0.807	0.448	1.714	0.594	1.307
		AdaBoost	28	0.926	0.628	0.573	1.507	0.476	1.483
		XGBoost	13	0.868	0.838	0.463	1.690	0.348	1.656
		LightGBM	35	0.929	0.616	0.541	1.563	0.502	1.447
XGBoost	Log (1/R) + 2nd Der + MSC	Full band	150	0.996	0.137	0.519	1.600	0.476	1.484
		AdaBoost	28	0.997	0.120	0.594	1.471	0.545	1.382
		XGBoost	13	0.891	0.763	0.488	1.651	0.455	1.514
		LightGBM	35	1.000	0.033	0.628	1.407	0.576	1.334
LightGBM	1st Der	Full band	150	0.945	0.543	0.414	1.766	0.695	1.133
		AdaBoost	31	0.829	0.954	0.463	1.691	0.648	1.217
		XGBoost	17	0.743	1.170	0.388	1.805	0.737	1.052
		LightGBM	35	0.960	0.462	0.551	1.547	0.657	1.201
Total Phenolics
AdaBoost	Norm	Full band	150	0.924	0.642	0.581	1.505	0.521	1.549
		AdaBoost	37	0.931	0.611	0.641	1.393	0.517	1.554
		XGBoost	16	0.921	0.652	0.646	1.382	0.512	1.562
		LightGBM	28	0.925	0.637	0.618	1.437	0.594	1.426
XGBoost	Norm + SG filter	Full band	150	1.000	0.027	0.627	1.419	0.390	1.748
		AdaBoost	34	0.974	0.372	0.573	1.518	0.354	1.798
		XGBoost	15	0.969	0.406	0.605	1.461	0.406	1.724
		LightGBM	30	0.933	0.601	0.557	1.546	0.378	1.765
LightGBM	1st Der	Full band	150	0.882	0.798	0.538	1.580	0.559	1.486
		AdaBoost	36	0.862	0.864	0.572	1.520	0.517	1.556
		XGBoost	10	0.770	1.115	0.565	1.532	0.386	1.753
		LightGBM	38	0.942	0.558	0.602	1.467	0.499	1.583
Total Flavonoids
AdaBoost	2nd Der	Full band	150	0.872	0.878	0.512	1.712	0.704	1.429
		AdaBoost	33	0.827	1.018	0.551	1.642	0.709	1.417
		XGBoost	12	0.913	0.724	0.572	1.602	0.575	1.714
		LightGBM	34	0.847	0.958	0.538	1.666	0.623	1.615
XGBoost	1st Der	Full band	150	0.972	0.409	0.516	1.705	0.586	1.692
		AdaBoost	36	0.986	0.286	0.586	1.577	0.564	1.736
		XGBoost	7	0.932	0.640	0.545	1.653	0.644	1.569
		LightGBM	46	0.997	0.138	0.580	1.588	0.568	1.728
LightGBM	1st Der	Full band	150	0.874	0.868	0.483	1.761	0.585	1.693
		AdaBoost	36	0.905	0.754	0.543	1.657	0.519	1.823
		XGBoost	7	0.651	1.448	0.531	1.678	0.594	1.676
		LightGBM	46	0.955	0.518	0.548	1.648	0.503	1.854
Total Glucosinolates
AdaBoost	SNV	Full band	150	0.935	2.868	0.666	6.481	0.768	5.333
		AdaBoost	28	0.907	3.417	0.674	6.401	0.816	4.744
		XGBoost	14	0.913	3.301	0.699	6.157	0.782	5.169
		LightGBM	34	0.935	2.852	0.677	6.372	0.730	5.748
XGBoost	SG filter + SNV	Full band	150	0.997	0.644	0.670	6.445	0.751	5.521
		AdaBoost	29	0.996	0.670	0.676	6.382	0.763	5.389
		XGBoost	12	0.993	0.928	0.715	5.987	0.778	5.211
		LightGBM	41	1.000	0.233	0.707	6.071	0.776	5.238
LightGBM	Log (1/R) + 1st Der + SNV	Full band	150	0.875	3.959	0.702	6.122	0.675	6.308
		AdaBoost	30	0.962	2.183	0.741	5.709	0.662	6.435
		XGBoost	8	0.901	3.538	0.744	5.678	0.613	6.890
		LightGBM	51	0.985	1.386	0.739	5.729	0.665	6.411
Total Anthocyanins
AdaBoost	Log (1/R) + 1st Der	Full band	150	0.975	0.865	0.834	2.246	0.714	2.390
		AdaBoost	26	0.968	0.986	0.819	2.349	0.639	2.682
		XGBoost	11	0.969	0.973	0.735	2.839	0.519	3.097
		LightGBM	37	0.976	0.851	0.822	2.329	0.824	1.876
XGBoost	1st Der	Full band	150	1.000	0.003	0.724	2.899	0.265	3.830
		AdaBoost	20	1.000	0.041	0.725	2.892	0.742	2.271
		XGBoost	11	0.987	0.625	0.738	2.826	0.251	3.865
		LightGBM	40	0.997	0.297	0.664	3.198	0.389	3.492
LightGBM	Log (1/R) + 1st Der	Full band	150	0.899	1.756	0.685	3.097	0.743	2.264
		AdaBoost	24	0.918	1.575	0.699	3.028	0.485	3.204
		XGBoost	9	0.826	2.303	0.687	3.089	0.314	3.699
		LightGBM	39	0.933	1.430	0.742	2.804	0.717	2.375

R²: coefficient of determination; RMSEC, RMSECV, and RMSEP: root mean square errors of calibration, cross-validation, and prediction, respectively. Bold indicates the best performance based on the RMSEP for each component.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

MDPI and ACS Style

Yoon, H.I.; Lee, H.; Yang, J.-S.; Choi, J.-H.; Jung, D.-H.; Park, Y.J.; Park, J.-E.; Kim, S.M.; Park, S.H. Predicting Models for Plant Metabolites Based on PLSR, AdaBoost, XGBoost, and LightGBM Algorithms Using Hyperspectral Imaging of Brassica juncea. Agriculture 2023, 13, 1477. https://doi.org/10.3390/agriculture13081477

