Proceeding Paper

Multivariate Spectra Analysis: PLSR vs. PCA + MLR †

1 Heinrich Blasius Institute for Physical Technologies, Hamburg University of Applied Sciences, Berliner Tor 21, 20099 Hamburg, Germany
2 School of Computing, Engineering & Physical Sciences, University of the West of Scotland, Scotland PA1 2BE Paisley, UK
* Author to whom correspondence should be addressed.
† Presented at the 7th International Electronic Conference on Sensors and Applications, 15–30 November 2020; Available online: https://ecsa-7.sciforum.net/.
Eng. Proc. 2020, 2(1), 83; https://doi.org/10.3390/ecsa-7-08226
Published: 14 November 2020
(This article belongs to the Proceedings of 7th International Electronic Conference on Sensors and Applications)

Abstract

For mixtures of compounds with very similar spectral features, as is common for larger organic molecules, multivariate analysis (MVA) methods can be applied to determine the concentrations of the individual components. We analyzed photoacoustic spectra of mixtures of different volatile organic compounds with and without feature selection and feature projection. The methods considered are Multiple Linear Regression (MLR), Principal Component Analysis (PCA), Partial Least Squares Regression (PLSR) and the Random Forest Algorithm (RFA). Even though PLSR provided the best prediction accuracy, the other techniques also exhibited some advantages.

1. Introduction

Spectroscopic probing of energetic transitions of molecules or atoms enables the analysis of mixtures and the selective determination of concentrations. Successfully applied laser spectroscopic methods include absorption spectroscopy, atomic emission spectroscopy, fluorescence spectroscopy and photoacoustic spectroscopy (PAS) [1].
If the spectral features of the single substances are broad and overlap strongly, evaluating the spectra requires a multivariate analysis. The general suitability of Partial Least Squares Regression (PLSR) for determining the absolute concentrations of the different components of a mixture has been demonstrated [2,3]. However, those studies also revealed certain limitations of this evaluation method. Therefore, we investigated further methods of multivariate statistics and compared their prediction accuracy.

2. Materials and Methods

2.1. Experiment

The investigation was performed on mixtures of five Volatile Organic Compounds (VOCs): 2-Butanone (C₄H₈O), 1-Propanol (C₃H₈O), Ethylbenzene (C₈H₁₀), Styrene (C₈H₈) and Hexanal (C₆H₁₂O).
A spectrum of each VOC was recorded with a photoacoustic analyzer based on an optical parametric oscillator (OPO). The system delivers highly resolved spectra in the mid-IR wavelength region between 3.2 and 3.5 μm [4,5].
The measured spectra of the single VOCs were weighted and additively combined in several variations in order to obtain a larger dataset covering a wider range of concentrations. To account for the measurement uncertainty, noise was added to each of these synthetic mixtures.
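The construction of the synthetic mixtures can be sketched as follows. This is a minimal illustration, not the authors' procedure: the Gaussian-shaped "pure spectra", the concentration range, the Gaussian noise model and its standard deviation are all placeholder assumptions, and the function name `make_synthetic_mixtures` is hypothetical.

```python
import numpy as np

def make_synthetic_mixtures(pure_spectra, n_mixtures, conc_range, noise_std, rng):
    """Weight pure-component spectra, add them up, and add noise.

    pure_spectra: array (n_components, n_wavelengths), signal per unit concentration.
    Returns X (n_mixtures, n_wavelengths) and Y (n_mixtures, n_components).
    """
    n_components = pure_spectra.shape[0]
    low, high = conc_range
    # Assumption: concentrations drawn uniformly and independently per component.
    Y = rng.uniform(low, high, size=(n_mixtures, n_components))
    # Additive (linear) superposition of the weighted single-component spectra.
    X = Y @ pure_spectra
    # Assumption: measurement uncertainty modelled as additive Gaussian noise.
    X = X + rng.normal(0.0, noise_std, size=X.shape)
    return X, Y

rng = np.random.default_rng(0)
# Placeholder spectra: five broad, overlapping bands (NOT measured data).
wavelengths = 3.3 + 0.001 * np.arange(200)          # micrometres, 1 nm steps
centers = np.linspace(3.36, 3.44, 5)
pure = np.exp(-((wavelengths[None, :] - centers[:, None]) / 0.03) ** 2)

X, Y = make_synthetic_mixtures(pure, n_mixtures=100, conc_range=(0.0, 100.0),
                               noise_std=0.5, rng=rng)
```

Because the mixtures are built by linear superposition, the linear regression models discussed in the paper are well matched to this kind of data by construction.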

2.2. Multivariate Analysis

Multivariate analysis (MVA) is used to identify the relationship between the photoacoustic spectra and the concentrations of the different VOCs. The so-called response matrix $Y$ of size $m \times n_y$ contains the dependent variables $y_{il}$, i.e., the concentration of VOC $l$ in mixture/spectrum $i$. We investigated $n_y = 5$ components and $m = 100$ mixtures. The predictor matrix $X$ of size $m \times n_x$ contains the independent variables $x_{ij}$, which correspond to the photoacoustic signal of mixture/spectrum $i$ at wavelength $j$. One measurement contains $n_x = 200$ values, which are equally distributed over the wavelength range (3.3 μm to 3.5 μm) in 1 nm steps. For the analysis, the synthetic spectra are split into a training set of 70 spectra and a validation set of 30 spectra.
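The matrix shapes and the 70/30 split described above can be sketched as follows. The random placeholder data and the use of a random permutation for the split are assumptions; the text does not state how the spectra were assigned to the two sets. The helper name `split_train_validation` is hypothetical.

```python
import numpy as np

def split_train_validation(X, Y, n_train, rng):
    """Randomly assign rows (spectra) to a training and a validation set."""
    idx = rng.permutation(X.shape[0])
    train, val = idx[:n_train], idx[n_train:]
    return X[train], Y[train], X[val], Y[val]

rng = np.random.default_rng(1)
m, n_x, n_y = 100, 200, 5          # mixtures, wavelengths, VOCs (from the text)
X = rng.random((m, n_x))           # placeholder predictor matrix (spectra)
Y = rng.random((m, n_y))           # placeholder response matrix (concentrations)

X_train, Y_train, X_val, Y_val = split_train_validation(X, Y, n_train=70, rng=rng)
```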
The investigated methods, Multiple Linear Regression (MLR), Partial Least Squares Regression (PLSR) and Principal Component Analysis (PCA), are all linear. Since the absorption of the VOCs at low concentrations is relatively weak, a linear relationship between the photoacoustic signal and the concentration can be assumed. MLR, the simplest model, is defined as follows:
$$Y = X B + E_{\mathrm{pre}}, \tag{1}$$
$$\hat{Y} = X B, \tag{2}$$
with the model's linearity coefficients $B$, the prediction error $E_{\mathrm{pre}}$ and the predicted values $\hat{Y}$ (here, the concentrations).
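A least-squares fit of the MLR model above can be sketched with NumPy. This is an illustration under assumptions, not the authors' implementation: the data are random placeholders, and no intercept is fitted because the model as written has none. Note that with 200 wavelengths and only 70 training spectra the system is underdetermined, so `np.linalg.lstsq` returns the minimum-norm solution; how the original study handled this is not stated.

```python
import numpy as np

def fit_mlr(X, Y):
    """Estimate B in Y = X B + E by least squares (minimum-norm if underdetermined)."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B

def predict(X, B):
    """Predicted concentrations Y_hat = X B."""
    return X @ B

rng = np.random.default_rng(2)
X_train = rng.random((70, 200))            # placeholder training spectra
B_true = rng.normal(size=(200, 5))         # placeholder coefficients
Y_train = X_train @ B_true                 # noise-free placeholder concentrations

B = fit_mlr(X_train, Y_train)
Y_hat = predict(X_train, B)
```

In this underdetermined setting the training data are reproduced exactly even though `B` generally differs from `B_true`, which is one reason a validation set is needed to judge the model.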

2.3. Dimensionality Reduction by Feature Projection

Dimensionality reduction can increase the accuracy of the regression. PLSR performs it as a feature projection prior to the actual regression. Feature projection generates a smaller set of new variables while preserving most of the information in the original dataset.
While Principal Component Analysis (PCA) decomposes only the matrix of independent variables $X$ (Equation (3)), PLSR also decomposes the matrix of dependent variables $Y$ (Equation (4)), each into corresponding linear combinations, $T P^{\mathsf{T}}$ and $U Q^{\mathsf{T}}$ [6]:
$$X = T P^{\mathsf{T}} + E, \tag{3}$$
$$Y = U Q^{\mathsf{T}} + F. \tag{4}$$
The noise in the data sets $X$ and $Y$ is captured by the corresponding residual matrices $E$ and $F$.
The regression model described by Equation (2) also applies to PLSR. The linearity coefficients $B$ are determined by the model's weights $W$ and loadings $P$ and $Q$ [6]:
$$B = W \left( P^{\mathsf{T}} W \right)^{-1} Q^{\mathsf{T}}. \tag{5}$$
Equations (3)–(5) can be solved with the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm [6].
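A compact NIPALS sketch for PLSR with several response variables is given below. It is a textbook-style illustration, not the authors' code, and it assumes the usual transposes and inverse in the coefficient formula, $B = W (P^{\mathsf{T}} W)^{-1} Q^{\mathsf{T}}$. Further assumptions: the data are noise-free placeholders, no mean-centering is applied (to mirror the intercept-free model), and the number of components is set to five. The function name `pls_nipals` is hypothetical.

```python
import numpy as np

def pls_nipals(X, Y, n_components, tol=1e-10, max_iter=500):
    """NIPALS for PLS regression; returns coefficients B, scores T, Y-loadings Q."""
    X = np.array(X, dtype=float)   # working copies, deflated in place below
    Y = np.array(Y, dtype=float)
    m, n_x = X.shape
    n_y = Y.shape[1]
    W = np.zeros((n_x, n_components))
    P = np.zeros((n_x, n_components))
    Q = np.zeros((n_y, n_components))
    T = np.zeros((m, n_components))
    for a in range(n_components):
        u = Y[:, 0].copy()                 # start vector for the Y scores
        t_old = np.zeros(m)
        for _ in range(max_iter):
            w = X.T @ u
            w /= np.linalg.norm(w)         # X weights, unit length
            t = X @ w                      # X scores
            q = Y.T @ t / (t @ t)          # Y loadings
            u = Y @ q / (q @ q)            # Y scores
            if np.linalg.norm(t - t_old) <= tol * np.linalg.norm(t):
                break
            t_old = t
        p = X.T @ t / (t @ t)              # X loadings
        X -= np.outer(t, p)                # deflate X
        Y -= np.outer(t, q)                # deflate Y
        W[:, a], P[:, a], Q[:, a], T[:, a] = w, p, q, t
    # Coefficients in terms of the original X: B = W (P^T W)^-1 Q^T
    B = W @ np.linalg.solve(P.T @ W, Q.T)
    return B, T, Q

rng = np.random.default_rng(3)
C = rng.uniform(0.0, 100.0, size=(70, 5))  # placeholder concentrations
S = rng.random((5, 200))                   # placeholder pure spectra
X, Y = C @ S, C

B, T, Q = pls_nipals(X, Y, n_components=5)
Y_hat = X @ B
```

The check that `X @ B` equals `T @ Q.T` confirms that the coefficient matrix reproduces the score-based prediction directly from the original spectra.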

2.4. Dimensionality Reduction by Feature Selection

In addition to feature projection, we investigated feature (subset) selection, a method of dimensionality reduction in which only the most relevant features (independent variables) of the original data set are retained [7,8]. We used the Random Forest Algorithm (RFA) implementation of the Python library scikit-learn, version 0.19.1 [9].
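One way to use a random forest for feature selection is to rank the wavelengths by the forest's impurity-based importances and keep the top fraction. The sketch below uses scikit-learn's `RandomForestRegressor` and its `feature_importances_` attribute; the ranking criterion, the number of trees, the kept fraction of 0.7 and the placeholder data are assumptions, since the text does not describe how the authors derived the selection. The function name `select_features_rf` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_features_rf(X, Y, keep_fraction, n_estimators=50, random_state=0):
    """Return sorted column indices of the most important features."""
    forest = RandomForestRegressor(n_estimators=n_estimators,
                                   random_state=random_state)
    forest.fit(X, Y)                                  # multi-output regression
    n_keep = int(round(keep_fraction * X.shape[1]))
    order = np.argsort(forest.feature_importances_)[::-1]  # most important first
    return np.sort(order[:n_keep])

rng = np.random.default_rng(4)
C = rng.uniform(0.0, 100.0, size=(70, 5))             # placeholder concentrations
S = rng.random((5, 200))                              # placeholder pure spectra
X_train = C @ S + rng.normal(0.0, 0.5, size=(70, 200))
Y_train = C

selected = select_features_rf(X_train, Y_train, keep_fraction=0.7)
X_train_reduced = X_train[:, selected]                # input for a subsequent MLR
```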

3. Results and Discussion

For the evaluation of the individual methods, two values are considered: the Mean Absolute Error (MAE) and the standard deviation $s$ of the prediction error $E_{\mathrm{pre}} = Y - \hat{Y}$, both averaged over all five VOCs. The bias of each prediction method is 6 ppb (parts per billion) or below.
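The evaluation quantities can be computed as sketched below, taking the prediction error as $Y - \hat{Y}$, which follows from the MLR model equations. The use of the sample standard deviation (`ddof=1`) and the tiny example arrays are assumptions for illustration; the function name `prediction_metrics` is hypothetical.

```python
import numpy as np

def prediction_metrics(Y_true, Y_pred):
    """MAE and standard deviation averaged over components, plus per-component bias."""
    E = Y_true - Y_pred                         # prediction error per sample and VOC
    mae = np.mean(np.abs(E), axis=0).mean()     # MAE per VOC, then averaged
    s = np.std(E, axis=0, ddof=1).mean()        # sample std per VOC, then averaged
    bias = np.mean(E, axis=0)                   # mean error per VOC
    return mae, s, bias

# Tiny made-up example with three samples and two components.
Y_true = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y_pred = np.array([[1.5, 2.0], [2.5, 4.0], [5.0, 7.0]])
mae, s, bias = prediction_metrics(Y_true, Y_pred)
```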
Table 1 lists the results of the different multivariate analysis methods.
MLR is, in general, well suited for determining concentrations but gives less accurate results than the other methods. Combining it with RFA as a feature selection method leaves the accuracy unchanged. Feature selection nevertheless has a significant advantage: it reduces the measuring time considerably, since only the ca. 70% of the spectrum with the most significant values has to be recorded rather than the entire spectrum. This enables sensors with an approximately 30% shorter response time, which is quite relevant considering that recording a complete spectrum can take several hours.
Applying feature projection increases the prediction accuracy: the MAE drops from 6.8 ppm for MLR to 5.9 ppm for PCA + MLR and 5.8 ppm for PLSR (Table 1). PLSR thus provides the highest prediction accuracy of all methods. An advantage of PCA + MLR is that the dimensionality reduction is performed independently of the regression, so that even data sets of completely unknown composition can be used for it.
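The independence of the projection step from the regression in PCA + MLR can be illustrated as follows. In this sketch the PCA loadings are computed from all available spectra, including some without known concentrations, and only the labelled spectra enter the regression. Everything here is an assumption for illustration: the noise-free placeholder data, the choice of five principal components, mean-centering with an intercept term, and the hypothetical function names.

```python
import numpy as np

def fit_pca(X, n_components):
    """PCA via SVD; returns the column means and the loadings (n_x, n_components)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def pca_scores(X, mean, loadings):
    """Project spectra onto the principal components."""
    return (X - mean) @ loadings

def fit_mlr_on_scores(T, Y):
    """MLR on the scores; an intercept column is added because scores are centered."""
    A = np.column_stack([np.ones(len(T)), T])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return coef

def predict_from_scores(T, coef):
    return coef[0] + T @ coef[1:]

rng = np.random.default_rng(5)
S = rng.random((5, 200))                             # placeholder pure spectra
C_train = rng.uniform(0.0, 100.0, size=(70, 5))      # known concentrations
C_val = rng.uniform(0.0, 100.0, size=(30, 5))
C_unknown = rng.uniform(0.0, 100.0, size=(30, 5))    # treated as unknown below
X_train, X_val, X_unlabelled = C_train @ S, C_val @ S, C_unknown @ S

# Projection step: uses spectra only, labelled and unlabelled alike.
mean, loadings = fit_pca(np.vstack([X_train, X_unlabelled]), n_components=5)
# Regression step: uses only the spectra with known concentrations.
coef = fit_mlr_on_scores(pca_scores(X_train, mean, loadings), C_train)
Y_val_hat = predict_from_scores(pca_scores(X_val, mean, loadings), coef)
```

PLSR cannot use the unlabelled spectra in this way, because its decomposition of $X$ is guided by $Y$.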
Based on the first results presented here, the MVA models will be investigated in the future by cross-validation and additional test data in the form of real gas mixtures. In addition, the feature selection will be investigated in greater depth. It can also be combined with the feature projection methods which have been introduced here.

Author Contributions

Conceptualization, M.W.; methodology, S.V.; experiment, S.V.; data curation, S.V.; writing—original draft preparation, M.W. and S.V.; writing—review and editing, M.W. and S.V.; supervision, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Demtröder, W. Laserspektroskopie: Grundlagen und Techniken, 4th ed.; Springer: Berlin/Heidelberg, Germany, 2000; ISBN 978-3-662-08266-9.
  2. Loh, A.; Wolff, M. Multivariate Analysis of Photoacoustic Spectra for the Detection of Short-Chained Hydrocarbon Isotopologues. Molecules 2020, 25, 2266.
  3. Saalberg, Y.; Wolff, M. Multivariate Analysis as a Tool to Identify Concentrations from Strongly Overlapping Gas Spectra. Sensors 2018, 18, 1562.
  4. Bruhns, H.; Marianovich, A.; Wolff, M. Photoacoustic Spectroscopy Using a MEMS Microphone with Inter-IC Sound Digital Output. Int. J. Thermophys. 2014, 35, 2292–2301.
  5. Bruhns, H.; Saalberg, Y.; Wolff, M. Photoacoustic Hydrocarbon Spectroscopy Using a Mach-Zehnder Modulated cw OPO. Sens. Transducers 2015, 188, 40.
  6. Kessler, W. Multivariate Datenanalyse für die Pharma-, Bio- und Prozessanalytik: Ein Lehrbuch, 1st ed.; WILEY-VCH: Weinheim, Germany, 2008; ISBN 978-3-527-31262-7.
  7. Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L.A. Feature Extraction: Foundations and Applications; Springer: Berlin/Heidelberg, Germany, 2008; ISBN 978-3-540-35488-8.
  8. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  9. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Table 1. Results of different multivariate analysis methods.

Method       MAE / ppm    s / ppm
MLR          6.8          9.2
RFA + MLR    6.8          9.2
PCA + MLR    5.9          7.9
PLSR         5.8          7.8
Citation

Vervoort, S.; Wolff, M. Multivariate Spectra Analysis: PLSR vs. PCA + MLR. Eng. Proc. 2020, 2, 83. https://doi.org/10.3390/ecsa-7-08226
