Application of Dimensionality Reduction and Machine Learning Methods for the Interpretation of Gas Sensor Array Readouts from Mold-Threatened Buildings

Łagód, Grzegorz; Piłat-Rożek, Magdalena; Majerek, Dariusz; Łazuka, Ewa; Suchorab, Zbigniew; Guz, Łukasz; Kočí, Václav; Černý, Robert

doi:10.3390/app13158588

by Grzegorz Łagód ¹,*, Magdalena Piłat-Rożek ²,*, Dariusz Majerek ², Ewa Łazuka ², Zbigniew Suchorab ¹, Łukasz Guz ¹, Václav Kočí ³,⁴ and Robert Černý ³

¹ Faculty of Environmental Engineering, Lublin University of Technology, 20-618 Lublin, Poland

² Faculty of Technology Fundamentals, Lublin University of Technology, 20-618 Lublin, Poland

³ Faculty of Civil Engineering, Czech Technical University in Prague, 166 29 Prague, Czech Republic

⁴ Institute of Technology and Business in Ceske Budejovice, 370 01 Ceske Budejovice, Czech Republic

* Authors to whom correspondence should be addressed.

Appl. Sci. 2023, 13(15), 8588; https://doi.org/10.3390/app13158588

Submission received: 12 May 2023 / Revised: 5 July 2023 / Accepted: 22 July 2023 / Published: 26 July 2023

(This article belongs to the Special Issue Selected Papers from 4th Central European Symposium on Thermophysics (CEST2022))

Featured Application

The solutions presented in the work, based on the use of a multi-sensor matrix and the analysis of multidimensional data, can be used in practice for assessing the mycological risk of buildings, detecting the presence of mold in rooms and evaluating the risk of the sick building syndrome.

Abstract

This paper addresses moisture-related problems connected with the mold threat in buildings and sick building syndrome (SBS), as well as the application of an electronic nose to the evaluation of different building envelopes and building materials. The machine learning methods used to analyze multidimensional signals are important components of the e-nose system. These multidimensional signals are derived from a gas sensor array, which, together with instrumentation, constitutes the hardware of this system. The accuracy of the classification of mold threat in buildings largely depends on the appropriate selection of the data analysis methods used. This paper proposes a method of data analysis using principal component analysis, metric multidimensional scaling and the Kohonen self-organizing map, which are unsupervised machine learning methods, to visualize and reduce the dimensionality of the data. For the final classification of observations and the identification of datasets from gas sensor arrays analyzing air from buildings threatened by mold, as well as from other reference materials, hierarchical cluster analysis and two supervised learning methods, the MLP neural network and the random forest, were used.

Keywords:

multidimensional signals analysis; dimensionality reduction; machine learning methods; gas sensors array; electronic nose; mold-threatened buildings

1. Introduction

The presence of fungi in the indoor environment is a widespread phenomenon. It is estimated that several hundred species of fungi are present indoors, including Cladosporium sphaerospermum, Penicillium chrysogenum, Aspergillus niger, Aspergillus versicolor, Alternaria alternata, and Stachybotrys chartarum [1]. Exposure to these fungi can lead to many adverse and serious health effects, including allergies, fungal infections, or toxic reactions [2]. High relative humidity (optimally above 70%) and adequate ambient temperature, in the range of 16–35 °C, are conducive to fungal growth [3,4].

Fungal spores are the largest group of biological particles present in the air. They enter the respiratory system with the inhaled air, which contributes to allergies. They range in size from a few micrometers to tens of micrometers; being smaller than pollen grains, they can penetrate the respiratory tract more deeply [5], causing allergic reactions in both the upper and lower respiratory tracts.

More than a few dozen species of mold can be distinguished among the dangerous factors in building infrastructure. Because of their low requirements for development conditions, they occur in many different places. The most commonly mentioned mold fungi include the genera Penicillium, Aspergillus (Aspergillus parasiticus, Aspergillus flavus, Aspergillus fumigatus, and Aspergillus niger), Stachybotrys (Stachybotrys chartarum), Cladosporium, Alternaria, and Fusarium [6,7,8].

Mold fungi are heterotrophic organisms with a eukaryotic structure. They feed on dead or living organic matter, and their enzymes enable the decomposition of complex organic compounds. Contamination of the indoor environment by fungi, and thus the increased occurrence of SBS, is mainly due to the presence of microbial volatile organic compounds (mVOCs) [9]. These substances are metabolites formed during the life processes of fungi and are volatile compounds due to their physicochemical properties [10,11]. These compounds cause the characteristic odor of mold and the sensory indications of the presence of filamentous fungi in buildings. The toxins considered most harmful to human and animal health are metabolites of Stachybotrys chartarum, Aspergillus versicolor, and fungi of the genus Fusarium [12]. The most dangerous mycotoxins produced by fungi include ochratoxin A (OT), zearalenone (ZEN), aflatoxins (AF), trichothecenes, and fumonisins (F) [13]. Other toxins produced by fungi include ketones, alcohols, esters, terpenes, and sulfur compounds, including the following [14]: ketones: 2-heptanone, 2-pentanone, and 3-octanone; alcohols: 2-methyl-1-butanol, 3-methyl-1-butanol, 2-methylpropanol, 3-octanol, 1-octen-3-ol, 1-hexanol, 1-pentanol, 2-methylisoborneol, and geosmin; terpenes and sesquiterpenes: limonene and pinene; furans: 3-methylfuran; sulfur: 296 compounds; and hydrocarbons: alkanes, olefins, dienes, and trienes. Currently, about 400 such compounds and about 350 species of fungi are known to produce them [12,15]. The intensity of mVOC production is influenced by factors such as fungal species composition, substrate and nutrient availability, and thermo-humidity conditions (air temperature and relative humidity) [16,17].

It should be noted that the mVOCs produced by fungi can be transformed into other compounds, making their detection much more difficult. For example, alcohols are easily oxidized to aldehydes and then to carboxylic acids, whereas ketones can be transformed into aldehydes [18]. Such transformations hinder the detection of mVOCs in the air and thus reduce the effectiveness of chemical methods for assessing the degree of infestation of buildings.

When assessing the degree of infestation of buildings, it is important to determine the concentration of fungal spores in infested buildings. This concentration is expressed in colony-forming units (CFU) per cubic meter of air. Its permissible value in residential buildings is 50 CFU/m³ for mold fungi and 150 CFU/m³ for mixed fungi (mold and non-mold), excluding pathogens. In contrast, according to the PN-EN 13098:2007 standard currently in force in Poland, the permissible number of indoor CFU should be less than 500 CFU/m³. Fungal colonies of the Cladosporium and Alternaria genera are allowed at 300 CFU/m³, while the presence of Aspergillus fumigatus and Stachybotrys chartarum is not allowed to any extent [19]. According to Piotrowska and Żakowska [20], the dominant species in buildings are Aspergillus versicolor, Penicillium chrysogenum, and Cladosporium cladosporioides.

It should be noted that spore concentrations in infested buildings can exceed 1000 CFU/m³; values as high as 67,000 CFU/m³ have been reported in particularly infested areas [21], and up to 260,000 CFU/m³ in the carpets and bedding of infested rooms [22].

The presence of fungi can be determined using numerous techniques. E-nose devices appear to have significant application potential as a method for the fast detection of fungal presence in buildings. The e-nose was successfully used to find fungi in archives and libraries [23]. E-nose devices may additionally be applied to assess the threat of mold in equipment and buildings. Literature data indicate that an array of 16 MOS (metal oxide semiconductor) sensors used indoors was sufficient to identify and classify five fungi species at a quality level of approximately 96%. Additionally, five mold-emitted microbial volatile organic compounds (mVOCs) were identified [24]. Two e-noses, Moses II and Kamina, were used in other studies to detect the presence of fungi. The former produced more precise results. However, it was not possible to distinguish between species, despite the application of a sophisticated evaluation model. According to M. Kuske’s research [25], 12 MOS sensors made it possible to identify one of four mold species, bred under laboratory conditions on particleboard, plasterboard, oriented strand board (OSB), and wallpaper, with an estimation accuracy of 80–85%.

The functional elements of e-noses are arrays consisting of many different sensors [26,27,28]. The resulting signal is multidimensional, which hinders its interpretation and requires advanced analysis techniques [29,30,31,32]. The MOS sensor array is made up of thin-film metal oxide semiconductor materials. In the case of the applied sensor array, this is tin dioxide (SnO₂) [26,29]. When the sensor is exposed to the sample, specific mVOCs and VOCs present in the air become adsorbed onto the surface of the sensing material [33]. Different gases will have different affinities to the sensor material, leading to varying degrees of adsorption [34]. The presence of adsorbed gases on the surface of the metal oxide semiconductor material causes a change in the electrical resistance of the material [35]. This change in resistance is directly related to the concentration of the detected gas, as well as to the type of sensor [36]. Individual sensors are not precisely selective, which is why they react with varying intensity to a range of different gas compounds. On the commercial market, there is a lack of MOS sensors dedicated to detecting all VOCs, especially mVOCs. However, air pollutants do not occur as single chemical compounds, and deteriorated air quality also means increased concentrations of compounds to which at least one of the many sensors reacts.

In order to visualize multivariate data, the following methods can be applied: principal component analysis (PCA), metric multidimensional scaling, and Kohonen self-organizing map, which are unsupervised machine-learning methods. Hierarchical cluster analysis, random forest, and MLP neural network were used to classify objects.

The principal component analysis method was first formalized in 1901 by Pearson [37], and then in 1933 by Hotelling [38]. The main purpose of PCA is to reduce the dimensionality of the data, that is, to extract the information contained in n variables and store it in the form of a new system of orthogonal variables, the principal components [39]. The PCA algorithm seeks a linear combination of the variables contained in dataset X with the greatest variance. If the means of the variables of X are stored in the vector μ, while Σ denotes its covariance matrix, then the following matrix is sought:

Y = Γ^{T} (X - μ),

(1)

where Γ is an orthogonal matrix. In turn,

Λ = Γ^{T} Σ Γ,

(2)

is a diagonal matrix with eigenvalues on the main diagonal, ordered so that

λ_{1} \geq λ_{2} \geq \dots \geq λ_{n} \geq 0 .

(3)

Thus, the i-th principal component is the i-th column of the matrix Y, given by the equation

Y_{i} = Γ_{i}^{T} (X - μ),

(4)

where Γ_{i} is the i-th column of the matrix Γ, which is also called the factor loading vector [40].
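Equations (1) to (4) can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' implementation (the analyses in this paper were run in R); the function name `pca`, the power-iteration solver with deflation, and the toy data are assumptions introduced for this example.

```python
import math

def pca(rows, n_components=2, iters=1000):
    """Sketch of PCA: returns (eigenvalues, components, scores).

    rows is a list of observations, each a list of n variable values.
    Eigenpairs of the covariance matrix are found by power iteration with
    deflation, an assumption chosen to keep the example dependency-free.
    """
    n_obs, n_var = len(rows), len(rows[0])
    # mu: vector of variable means
    mu = [sum(r[j] for r in rows) / n_obs for j in range(n_var)]
    centered = [[r[j] - mu[j] for j in range(n_var)] for r in rows]
    # Sigma: sample covariance matrix
    sigma = [[sum(c[a] * c[b] for c in centered) / (n_obs - 1)
              for b in range(n_var)] for a in range(n_var)]
    eigenvalues, components = [], []
    for _ in range(n_components):
        # Arbitrary start vector; it must not be orthogonal to the sought eigenvector.
        v = [float(j + 1) for j in range(n_var)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(sigma[a][b] * v[b] for b in range(n_var)) for a in range(n_var)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # remaining variance is numerically zero
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: eigenvalue lambda_i belonging to the unit vector v
        lam = sum(v[a] * sigma[a][b] * v[b]
                  for a in range(n_var) for b in range(n_var))
        eigenvalues.append(lam)
        components.append(v)
        # Deflation: remove the component just found from Sigma
        sigma = [[sigma[a][b] - lam * v[a] * v[b] for b in range(n_var)]
                 for a in range(n_var)]
    # Scores: Y_i = Gamma_i^T (x - mu) for every observation x
    scores = [[sum(g[j] * c[j] for j in range(n_var)) for g in components]
              for c in centered]
    return eigenvalues, components, scores

# Toy data lying exactly on the line x2 = 2 * x1
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
eigenvalues, components, scores = pca(data)
explained = eigenvalues[0] / sum(eigenvalues)  # share of variance of component 1
```

For the toy data all variance lies along one direction, so the first component carries essentially the whole explained variance. The ratio `explained` is the same kind of quantity that is used to decide how many components to keep.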

Dimensionality reduction, which is the main goal of PCA, is usually used to represent multidimensional data in a two- or three-dimensional space. It is a very common visualization technique, used in environmental engineering works [41,42] as well as for visualizing electronic nose readings [28,31,43].

Multidimensional scaling was created by Torgerson and described in 1952 [44]; this method was designed to detect similarity between observations. In multidimensional scaling, a distinction is made between metric and non-metric methods. The non-metric method is used when the dataset under consideration contains variables measured in different units, which prevents the calculation of the Euclidean distance between them. Metric scaling, on the other hand, is employed when all measurements are made in the same unit. In the dataset considered in this work, all measurements were made in one unit, so it was acceptable to use metric scaling.

This method aims to preserve the distances between points in the reduced space. The algorithm is based on minimizing the difference between the pairwise distances in the original space and those in the reduced one. Thus, the stress function relying on these distances is minimized:

\mathrm{STRESS} = \sqrt{\frac{\sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} {({‖ x_{i} - x_{j} ‖}_{2} - {‖ y_{i} - y_{j} ‖}_{2})}^{2}}{\sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} {‖ x_{i} - x_{j} ‖}_{2}^{2}}},

(5)

where {‖ \cdot ‖}_{2} denotes the Euclidean norm, x_{i} is the i-th observation in the original dataset, and y_{i} is the i-th observation in the reduced space [45,46].

Although this algorithm is called metric, its solution is not obtained in a linear fashion. The model’s answer is obtained iteratively: in a k-dimensional space, where 1 \leq k \leq n - 1, the points are moved to minimize the STRESS function [47].
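The STRESS function of Equation (5) can be evaluated directly for any candidate configuration. The short Python sketch below does only that; it does not perform the iterative minimization, and the function name `stress` and the example points are assumptions made for illustration.

```python
import math
from itertools import combinations

def stress(original, reduced):
    """Normalized STRESS of Equation (5) for two equally long lists of points."""
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(original)), 2):
        d_orig = math.dist(original[i], original[j])  # ||x_i - x_j||_2
        d_red = math.dist(reduced[i], reduced[j])     # ||y_i - y_j||_2
        num += (d_orig - d_red) ** 2
        den += d_orig ** 2
    return math.sqrt(num / den)

x = [(0.0, 0.0, 1.0), (3.0, 4.0, 1.0), (6.0, 8.0, 1.0)]
y_exact = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # drops a constant coordinate
y_lossy = [(0.0,), (3.0,), (6.0,)]              # keeps the first coordinate only

stress_exact = stress(x, y_exact)
stress_lossy = stress(x, y_lossy)
```

Dropping the constant third coordinate preserves every pairwise distance, so `stress_exact` is zero. Keeping only the first coordinate shortens the distances 5, 10, 5 to 3, 6, 3, which gives a STRESS of about 0.4.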

Like PCA, metric multidimensional scaling can be used to visualize data in a low-dimensional space. Among other things, it has been used in the past to study the distribution patterns of phyto- and bacterio-plankton in different seasons [48], as well as to identify the most important habitats of endangered and endemic fish inhabiting coral reefs [49].

Self-organizing maps were presented in 1982 by Kohonen [50]. They are a type of neural network that maps objects from a multidimensional space onto a map while preserving their features. These maps are usually represented as a two-dimensional grid on which observations from the multidimensional space are marked. As this is intended to be a transformation that preserves the characteristics of objects, the data that are similar in the original space are represented close together in a self-organizing map [51]. The Kohonen map consists of p \cdot q neurons, where the numbers p and q are the dimensions of this map. These numbers are established in advance at the beginning of the algorithm. Each neuron is an n-element vector (where n is the number of explanatory variables in the original set). Usually, neurons in Kohonen maps have a rectangular, hexagonal, or circular topology. Training a self-organizing map involves gradually shrinking the neighborhood of each neuron while updating the weights assigned to individual neurons, which are randomly initialized at the start of the algorithm [52]. Kohonen networks have been used to visualize, cluster, and analyze multivariate environmental engineering data, among others, in [53,54,55].
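The training idea described above can be sketched in a few lines of Python. This is a simplified illustration and not the `kohonen` R package used in this work: it assumes a rectangular grid (the study used a hexagonal one), a Gaussian neighborhood, and linear decay schedules. The names `train_som` and `bmu` and all parameter values are hypothetical.

```python
import math
import random

def bmu(weights, x):
    """Grid coordinates of the best matching unit (closest weight vector)."""
    return min(weights, key=lambda k: math.dist(weights[k], x))

def train_som(data, p, q, epochs=100, lr0=0.5, seed=0):
    """Train a p x q rectangular self-organizing map on a list of points."""
    rng = random.Random(seed)
    n = len(data[0])
    lo = [min(x[j] for x in data) for j in range(n)]
    hi = [max(x[j] for x in data) for j in range(n)]
    # One n-element weight vector per neuron, drawn at random within the data range
    weights = {(r, c): [rng.uniform(lo[j], hi[j]) for j in range(n)]
               for r in range(p) for c in range(q)}
    sigma0 = max(p, q) / 2.0
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            x = data[idx]
            frac = t / steps
            lr = lr0 * (1.0 - frac)                      # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # neighborhood radius shrinks
            b = bmu(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - b[0]) ** 2 + (c - b[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # neighborhood strength
                for j in range(n):
                    w[j] += lr * h * (x[j] - w[j])       # move weight toward sample
            t += 1
    return weights

# Two well-separated toy groups of observations
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
weights = train_som(data, 3, 3)
unit_a = bmu(weights, (0.0, 0.0))
unit_b = bmu(weights, (10.0, 10.0))
```

After training, observations from the two distant groups are expected to map to different neurons, which is the property exploited when labeled observations are plotted on the map.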

There are many algorithms for hierarchical clustering; the main differences between them are the choice of the metric in which the distances between the observed objects are calculated and the choice of the method for determining clusters. The most commonly used metric is Euclidean, which was also used in this work; the distance between two vectors x, y of length n is determined using the formula:

d (x, y) = \sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}} .

(6)

There are many methods for determining clusters, for example, Ward’s method presented in [56], as well as clustering by the single [57] and complete linkage methods [58]. In this paper, the complete linkage method, which is an agglomerative method, was used. At the beginning of the algorithm, each observation is in a separate cluster. The distance between clusters A and B is calculated as follows:

D (A, B) = \max_{x \in A, y \in B} d (x, y) .

(7)

Thus, joining of two clusters depends on the farthest pair of observations. In this method of object clustering, as the cluster grows in size, it becomes increasingly difficult to attach more objects to the cluster, because the increase in cluster size automatically distances it from other observations and clusters considered in the analysis [59]. The hierarchical clustering method was used in [41] to classify the water quality in the Fuji River and in [60] to assess the sources of heavy metal contamination in water.
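The agglomerative procedure built on Equations (6) and (7) can be sketched as follows. This naive Python version is meant only to make the merging rule concrete; the function name `complete_linkage` and the one-dimensional toy points are assumptions, and the sketch stops at a requested number of clusters instead of building the full dendrogram.

```python
import math

def complete_linkage(points, n_clusters):
    """Merge clusters until n_clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every observation starts alone

    def cluster_distance(a, b):
        # Equation (7): Euclidean distance between the farthest pair of members
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # join the two closest clusters
        del clusters[j]
    return clusters

points = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)]
three = complete_linkage(points, 3)
two = complete_linkage(points, 2)
```

With three clusters the result is `[[0, 1], [2, 3], [4]]`. Reducing to two clusters joins the first two groups, because their farthest pair is 11 apart, closer than either is to the isolated point at 50.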

In an MLP (multilayer perceptron) neural network, the results are obtained not only through a set of n explanatory variables derived from the input data, but also through a set of auxiliary variables, which are hidden (latent) units. Each of them is a linear combination of the original variables, usually transformed with a nonlinear activation function g:

h_{k} (x) = g (β_{0 k} + \sum_{j = 1}^{n} x_{j} β_{j k}) .

(8)

The coefficient β_{j k} denotes the effect of the j-th variable on the formation of the k-th hidden unit. Assuming that there are H hidden units in the network, the result of the neural network will be a linear combination of the following form:

f (x) = γ_{0} + \sum_{k = 1}^{H} γ_{k} h_{k} (x) .

(9)

Optimization of the coefficients of such a network involves minimizing the sum of squares of the errors with an additional penalty term, weighted by a predetermined regularization parameter λ, to control the overfitting of the model to the training data. Thus, the following sum is minimized:

\sum_{i = 1}^{N} {(y_{i} - f (x_{i}))}^{2} + λ (\sum_{k = 1}^{H} \sum_{j = 1}^{n} β_{j k}^{2} + \sum_{k = 0}^{H} γ_{k}^{2}),

(10)

where N is the number of observations in the learning set. The larger the regularization parameter, the lower the chance of model overfitting [61].
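Equations (8) to (10) can be written out as a small Python sketch. The logistic sigmoid is assumed here as the activation function g, since the text does not name one, and all function names and numbers are hypothetical. The sketch evaluates the network and its penalized loss and does not train it.

```python
import math

def sigmoid(z):
    """Assumed activation function g (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-z))

def hidden_units(x, beta0, beta):
    """Equation (8): h_k(x) = g(beta0[k] + sum_j x[j] * beta[j][k])."""
    return [sigmoid(beta0[k] + sum(x[j] * beta[j][k] for j in range(len(x))))
            for k in range(len(beta0))]

def predict(x, beta0, beta, gamma0, gamma):
    """Equation (9): f(x) = gamma0 + sum_k gamma[k] * h_k(x)."""
    h = hidden_units(x, beta0, beta)
    return gamma0 + sum(gamma[k] * h[k] for k in range(len(h)))

def penalized_loss(xs, ys, beta0, beta, gamma0, gamma, lam):
    """Equation (10): squared error plus lam times the sum of squared weights."""
    sse = sum((y - predict(x, beta0, beta, gamma0, gamma)) ** 2
              for x, y in zip(xs, ys))
    penalty = (sum(b ** 2 for row in beta for b in row)
               + gamma0 ** 2 + sum(g ** 2 for g in gamma))
    return sse + lam * penalty

# n = 2 input variables, H = 2 hidden units, all beta coefficients set to zero
beta0 = [0.0, 0.0]
beta = [[0.0, 0.0], [0.0, 0.0]]   # beta[j][k]: variable j -> hidden unit k
gamma0, gamma = 1.0, [2.0, 2.0]

output = predict((5.0, 7.0), beta0, beta, gamma0, gamma)
loss = penalized_loss([(5.0, 7.0)], [3.0], beta0, beta, gamma0, gamma, lam=0.1)
```

With all β equal to zero, each hidden unit equals g(0) = 0.5, so the output is 1 + 2·0.5 + 2·0.5 = 3. The squared error for the target 3 vanishes and only the penalty 0.1·(1 + 4 + 4) = 0.9 remains, which shows how a larger λ pushes the optimizer toward smaller weights.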

MLP neural networks are widely used in various scientific fields, including predicting ozone and nitrogen dioxide levels in the air [62], assessing the toxicity of ionic liquids in cell lines of rats suffering from leukemia [63], or predicting the diameter of TiO₂ nanotubes [64].

Although random forests can be created using a variety of algorithms presented in different papers [65,66,67], among others, the most widely used one at present is the one developed in 1999. This algorithm for learning a random forest begins by selecting the number m of trees from which the forest will be created. For each of the m trees, a bootstrap sample of data is first drawn, on which the tree will be trained. Next, k variables are drawn, where 1 \leq k \leq n - 1 is a fixed number of variables on which the splitting rules in the tree under consideration will be created [61]. According to Breiman [68], the number k for classification tasks should be equal to the square root of the number of explanatory variables rounded down, while for regression it should be one-third of the number of such variables.
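The two random ingredients of this algorithm, the bootstrap sample and the draw of k candidate variables, can be sketched in Python as below. Tree construction itself is omitted. The helper names are hypothetical, and rounding one-third down for regression is an assumption of this example.

```python
import math
import random

def default_k(n_variables, task="classification"):
    """Rule of thumb for the number of candidate variables k."""
    if task == "classification":
        return max(1, math.isqrt(n_variables))  # square root, rounded down
    return max(1, n_variables // 3)             # one-third (rounded down here)

def bootstrap_sample(n_observations, rng):
    """Draw n indices with replacement; indices never drawn are out of bag."""
    in_bag = [rng.randrange(n_observations) for _ in range(n_observations)]
    out_of_bag = sorted(set(range(n_observations)) - set(in_bag))
    return in_bag, out_of_bag

def draw_variables(n_variables, k, rng):
    """Pick k distinct candidate variables. The text describes one draw per
    tree; common implementations repeat the draw at every split."""
    return sorted(rng.sample(range(n_variables), k))

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_sample(20, rng)
candidates = draw_variables(8, default_k(8), rng)
```

Because sampling is done with replacement, some observations appear several times in `in_bag` while others are left out of bag and can serve for an internal error estimate.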

Random forests have been used in the past to estimate atmospheric aerosol concentrations [69], to spatially interpolate environmental variables using mud samples [70], to classify VOC vapors [71], or to detect mycophenolic acid in silage [72].

In the literature, one can find works in which gas sensor arrays were used with the aforementioned methods, as well as works in which gas sensor arrays were applied to buildings affected by sick building syndrome. However, there is a lack of works that present an effective step-by-step procedure for analyzing data from mold-infested buildings with the methods used in this paper.

2. Materials and Methods

2.1. Materials

To carry out the research, a mobile measurement device was developed. The main component of the measuring device was a sensor array consisting of eight MOS-type gas sensors, presented in Figure 1. Figaro 2600 series gas sensors in TO-5 metal housings, with a maximum power of less than 300 mW, were used. The physical and chemical parameters of the gas sensors used in the study are summarized in Table 1. The sensors, mounted in their bases, were arranged in a circular array. Additional temperature and humidity measurements were carried out using Maxim-Dallas DS18B20 and Honeywell HIH-4000 sensors.

2.2. Measurement Protocol

Before each measurement session, the sensors were flushed with clean air for 10 min. The duration of this initial cleaning of the sensor array with dry air was based on calibration studies of the device: a period of 10 min was sufficient to stabilize the resistances of the individual sensors. The shorter flushing time (2 min) before each measurement, in turn, was due to the relatively low level of VOC air contamination. The initial flushing was necessary because ambient air could enter the sensor chamber while the meter was not in use between measurement sessions. After each measurement session, the sensor array was also flushed with clean air at a flow of 200 cm³/min. The first resistance measurement of individual sensors in the array was performed for a blank sample without added chemicals (sample bag background). Resistance measurements were then carried out for the reference gases, starting with the sample with the lowest concentration.

Each reference gas measurement included a zero air flush phase lasting t^{0} = 2 min and a measurement phase lasting t^{samp} = 5 min. During the initial part of the t^{samp} phase of the sample measurement, there was a rapid decrease in the resistance of the sensor, proportional to the concentration of impurities in the test sample. At the end of the t^{samp} period of exposure to a given sample, the sensor resistance values stabilized, and the average resistance value R_{S}^{samp} for this period could be determined. An example of a graph of resistance changes for the sensors is shown in Figure 2.

For each sensor, the relative resistance was determined according to the following formula:

R_{w} = R_{S}^{samp} / R_{S}^{0},

(11)

where R_{S}^{samp} is the stable value of sensor resistance for the sample under test and R_{S}^{0} is the baseline (sensor resistance for zero air). The baseline R_{S}^{0} was taken as the average value of the resistance of a given sensor from the final phase of flushing the sensors with zero air before the measurements.
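Equation (11) amounts to a ratio of two averages, which the following Python sketch makes explicit. The function name, the `tail` parameter (number of final readings averaged in each phase), and the synthetic readings are assumptions for illustration; the sampling rate of the device is not given here.

```python
def mean(values):
    return sum(values) / len(values)

def relative_resistance(flush_readings, sample_readings, tail=30):
    """Equation (11): R_w = R_S^samp / R_S^0 for one sensor.

    Both averages use only the last `tail` readings of their phase,
    where the resistance is assumed to have stabilized.
    """
    baseline = mean(flush_readings[-tail:])  # R_S^0, end of zero-air flush
    stable = mean(sample_readings[-tail:])   # R_S^samp, end of sample exposure
    return stable / baseline

# Synthetic resistances in arbitrary units: settling phase, then a stable tail
flush = [120.0] * 90 + [100.0] * 30
sample = [60.0] * 270 + [25.0] * 30
r_w = relative_resistance(flush, sample)
```

For these synthetic readings the ratio is 25 / 100 = 0.25. Values below 1 correspond to the drop in resistance observed when the sensor is exposed to a contaminated sample.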

The phenomenon of drift was investigated during the initial flushing of the sensor array. Measurements were performed over a short period of time, and during this period, no drift in the sensor responses caused by aging of the sensor’s detection elements was observed. The relatively low air pollution in the tested rooms also did not lead to excessive poisoning of the sensors.

2.3. Applied Software and Packages

The statistical analyses and graphs included in this paper were prepared using the R programming language version 4.2.1 [73] in the RStudio environment version 2022.7.0.548 [74]. The functions from the packages listed below were used for the calculations. The tidyverse package was created by Wickham and the RStudio team in 2016 [75], whereas tidymodels was used to support machine learning models and follows the tidyverse philosophy [76]. The kohonen package allowed for learning and visualization of self-organizing Kohonen maps [77]. The ggfortify [78] and ggplot2 [79] libraries enabled creating consistent graphs. The factoextra package [80] allowed for applying the hierarchical clustering algorithm and creating a scatter plot for the PCA model.

2.4. Measured Objects

Research was conducted in nine selected rooms with different levels of mold threat. A short presentation of each object is shown in Table 2. The flux of the sampled air was equal to 200 cm³/min and sampling was conducted close to the building barriers with visible mold bloom. Additionally, the decayed wood and clean air were selected as the reference samples.

3. Results

The data from the last 30 s of readings from the facilities were used in model development. Principal component analysis was performed on the full dataset. According to the scree plot shown in Figure 3, the appropriate number of principal components was 2 or 3, and the first two components explained almost 85% of the variance. For this reason, only the first two components were chosen to visualize the data.

This decision turned out to be the right one, as the observations derived from the samples of the individual objects formed distinct, homogeneous clusters that did not overlap with the observations derived from other samples in the principal component space in Figure 4. The most clearly separable groups were the clean air and decayed wood samples, which are located in the upper left and upper right corners of the graph, respectively. This agrees with intuition, as the air was completely uncontaminated by mold, while the tested wood was entirely covered with it. The remaining observations from the other samples were grouped mainly in the right half-plane. Furthest from the decayed wood sample were the data from samples B5 and B9, which, according to Table 2, were characterized by the absence of odor nuisance, although in sample B9 the experts noted a minor mold bloom. The remaining data were grouped into two clusters: B1, B2, B3, and B4 in the first quadrant and B6, B7, and B8 in the fourth quadrant of the system. The indicated clusters did not represent groups with the same degree of mold infestation, but the closest to the reference wood sample was sample B4, in which significant mold bloom and high odor nuisance were also observed.

The result of metric multidimensional scaling can be found in Figure 5; again, the greatest difference was between the measurements from the decayed wood sample and the clean air sample. The closest to air was the B5 sample, which was characterized only by salt efflorescence. In contrast, the closest to decayed wood were the measurements from sample B7, which was characterized by significant mold bloom and a high odor nuisance, similar to the previously mentioned sample B4. The remaining results were arranged in a manner similar to that shown in Figure 4.

The Kohonen map was made in a hexagonal topology, on a 20 × 20 grid of neurons. The observations from the analyzed dataset were mapped onto the grid and labeled according to the sample to which they belonged, as can be observed in Figure 6. The map shows that all groups of observations separated from each other and formed disjoint groups. The observations from the clean air sample were not located in close proximity to those from the decayed wood sample. However, other groups of observations were not particularly well placed on the map, as, for example, the B9 measurements were next to those from the clean air reference sample.

The hierarchical clustering method grouped the observations from the original dataset into six homogeneous clusters. There were six clusters due to the fact that the analyzed objects were rated by experts according to the degree of mold bloom. This enabled verifying the correctness of the clusters created by the algorithm. The dendrogram in Figure 7a shows the differences between the groups of observations allowing for the creation of clusters. Meanwhile, Figure 7b shows where the observations belonging to each of the created clusters are located on the PCA plane. The observations from each sample are all assigned to one of the clusters, so they are homogeneous. According to Table 3, cluster 6 containing the observations from the clean air sample and cluster 3 with the B5 sample did not contain elements from the other observed groups. Cluster 5 contained the observations from the wood sample and object B7, which was significantly infested by mold. In cluster 4, there were observations from samples B6, B8, and B9, some of which were non-molded objects and some had light mold bloom. It can be seen that the worst classified group comprised the non-molded objects, which were classified into four different clusters. The overall correctness of the classification by hierarchical clustering could be assumed as 48.2%.

The dataset consisted of 1672 observations, which were randomly divided in a 2:1 ratio: 2/3 of the data were used to train the supervised learning models, and the remainder became the test set for the models. The hyperparameters of the neural network were tuned using five-fold cross-validation. The grid had three levels for each hyperparameter, and the tuned values, when possible, were equidistant from each other. The parameters tuned were the number of hidden units at levels 1, 5, and 10; the number of epochs at levels 50, 125, and 200; and the value of the regularization parameter at levels 10^{-10}, 10^{-5}, and 1. The result of model tuning can be seen in Figure 8. The model with 5 hidden units, 125 training iterations, and a regularization parameter equal to 1 turned out to be the best. On the training set, it achieved 100% correctness in classifying samples into the classes determined by the degree of mold bloom. The neural network achieved the same correctness for a model with 10 hidden units and two models with 200 iterations of training, but the model that was less computationally complex was chosen. The selected model also achieved perfect classification on the test set in all of the samples considered, as can be observed in Figure 9, which shows the diagonal classification error matrix.
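The data split and the tuning grid described above can be laid out in a short Python sketch. It only enumerates the index sets and hyperparameter combinations and does not fit any model. The seed, the rounding of the training-set size, and the helper names are assumptions; the original analysis was carried out with tidymodels in R.

```python
import random
from itertools import product

N_OBSERVATIONS = 1672

def split_indices(n, train_fraction=2 / 3, seed=0):
    """Shuffle observation indices and cut them into training and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train_fraction)  # rounding rule is an assumption
    return idx[:n_train], idx[n_train:]

def k_folds(indices, k=5):
    """Distribute training indices over k cross-validation folds."""
    return [indices[i::k] for i in range(k)]

# Three levels per hyperparameter: hidden units, epochs, regularization
grid = list(product([1, 5, 10], [50, 125, 200], [1e-10, 1e-5, 1.0]))

train, test = split_indices(N_OBSERVATIONS)
folds = k_folds(train)
n_fits = len(grid) * len(folds)  # model fits needed for the full grid search
```

A three-level grid over three hyperparameters yields 27 candidate models, so five-fold cross-validation requires 135 fits before the final model is refitted on the whole training set.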

The random forest model was supposed to be tuned based on a grid of hyperparameters, but it turned out that regardless of their selection, all of the models had 100% classification accuracy. Therefore, a model with default parameter values was used, i.e., 500 trees in the forest, 3 candidate variables for the split at each node, and a minimum of 1 observation in each tree node. Such a model also achieved the maximum possible classification quality on both the training set and the test set, as shown in Figure 10, which presents the error matrix on the test set. Figure 11 demonstrates the ROC curves for each of the fungus infestation classes in the dataset. They had an ideal shape due to the 100% correctness of the classification.

4. Discussion

The principal component analysis method is used to analyze the sick-building syndrome in order to visualize the position of the test samples in low-dimensional space. An example of such an application of the PCA method is the work of [81]. The data analyzed therein came from samples of building materials and rooms tested for fungal contamination using an electronic nose. The data were visualized on a PCA plane; in the case of this work, the first two principal components explained 96.5% of the variance of the original variables. In the plot, only the reference samples were distinguished by their distribution; the other samples did not group into distinct clusters. The article [23] analyzed the samples of three types of paper, on which three different types of fungi were observed. The paper samples were examined using an electronic nose. The two-dimensional graph created by the PCA method for all data apparently discriminated the samples into three clusters, corresponding to the three paper types considered. The paper also produced additional graphs using this method, which highlighted the samples relating to only one of the paper types at 75% or 100% relative humidity. An additional method used in the paper was cluster analysis, which was also applied to the sets containing only one type of paper at 75% or 100% humidity. In the PCA charts, superior results were obtained at 100% RH, while the cluster analysis in each case distinguished between the paper samples from the control sample and those in which any type of fungus was observed.

In the present study, results similar to those of the articles considered above were observed: the percentage of variance explained by the first two principal components was high enough to visualize the results in two-dimensional space. In addition, the control observations formed distinct clusters in the PCA plot, separate from the other samples.
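The notion of "percentage of explained variance" can be made concrete with a minimal sketch. It assumes only two variables, so that the eigenvalues of the covariance matrix have a closed form; the study itself worked with more sensor signals and with R packages, and the function names and readings below are hypothetical.

```python
import math


def covariance_2d(xs, ys):
    """Sample variances and covariance of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy


def explained_variance_ratio_2d(xs, ys):
    """Share of total variance carried by each of the two principal components."""
    sxx, sxy, syy = covariance_2d(xs, ys)
    total = sxx + syy
    if total == 0:
        raise ValueError("both variables are constant")
    # Eigenvalues of the symmetric 2 x 2 covariance matrix [[sxx, sxy], [sxy, syy]]
    half_trace = total / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]
    return [ev / total for ev in eigenvalues]


# Hypothetical readings of two strongly correlated sensors (illustration only)
sensor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sensor_b = [1.1, 1.9, 3.2, 3.9, 5.1]
ratios = explained_variance_ratio_2d(sensor_a, sensor_b)
print(ratios)
```

When the variables are strongly correlated, the first ratio is close to 1, which is the situation in which a plot of the leading components gives a faithful picture of the data.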

The multidimensional scaling method is used to visualize the dissimilarity between the studied objects. In the paper [82], it was employed to show the differences between bacterial communities in three different moving bed bioreactors. The method showed that the samples from each reactor were clustered together and separated from the others on a plane. Therefore, it could be concluded that the three reactors contained different bacterial communities.

In this study, the result of metric multidimensional scaling was similar to that obtained using the principal component analysis method, i.e., the reference samples differed from the other types of samples.

In the analysis of sick building syndrome, the paper [83] used the Kohonen self-organizing map to represent the classification of facilities into groups characterized by six levels of mold infestation risk. In addition, in the paper [84], the authors used self-organizing maps to build an odor control map from electronic nose data collected by environmental odor monitoring systems. A classification into three groups created using the k-means method was overlaid on the map. Further in the article, a supervised Kohonen network was also used to predict the quantitative concentration of odor.

In this paper, only the mapping of observations onto a self-organizing map is presented. The location of clusters on the map and the application of the supervised Kohonen network will be the subject of further research.
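Mapping observations onto a self-organizing map amounts to assigning each observation to the node whose codebook vector lies closest to it (the best matching unit). The sketch below shows only this assignment step under simple assumptions: the codebook is taken as already trained, Euclidean distance is used, and the 2 x 2 grid, the codebook vectors, and the observations are hypothetical.

```python
import math


def best_matching_unit(observation, codebook):
    """Return the grid coordinates of the node whose codebook vector is closest."""
    return min(codebook, key=lambda node: math.dist(observation, codebook[node]))


def map_observations(observations, codebook):
    """Count how many observations fall on each node of the map."""
    hits = {node: 0 for node in codebook}
    for obs in observations:
        hits[best_matching_unit(obs, codebook)] += 1
    return hits


# Hypothetical trained codebook: grid position -> 3-dimensional codebook vector
codebook = {
    (0, 0): [0.0, 0.0, 0.0],
    (0, 1): [1.0, 0.0, 0.0],
    (1, 0): [0.0, 1.0, 0.0],
    (1, 1): [1.0, 1.0, 1.0],
}
# Hypothetical standardized sensor readings (illustration only)
observations = [
    [0.1, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.9, 1.0, 0.8],
    [1.0, 0.9, 1.0],
]
hits = map_observations(observations, codebook)
print(hits)
```

Coloring each node by the class labels of the observations that fall on it gives a map of the kind described here; training the codebook itself is not shown.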

Hierarchical cluster analysis was used in [81], where the data on mold contamination of building materials and buildings were classified into four groups. This work also investigated and classified the facilities affected by sick building syndrome. The observations from clean air samples were classified into a separate cluster. In addition, in the work in question, a separate group was formed by the observations from a decayed wood sample, while the remaining observations were assigned to the clusters that were not consistent with the degree of contamination of the samples belonging to them. The article [85] also used this method to group indoor air pollutants into three classes, corresponding to high, moderately high, and moderate frequency of reporting by workers in non-industrial work environments.

The results of the hierarchical cluster analysis presented in this paper resemble those presented in the article [81] in that, apart from the cluster containing the observations from the clean air sample, the other groups contained observations from samples with varying degrees of mold infestation.

The random forest model has been used in the past to classify the readings obtained from an electronic nose. In the paper [30], the random forest model was used in the classification of samples for different stages of wastewater treatment. In that work, the model achieved 97.5% correctness on the training set and 100% on the test set. In [86], the task was to detect different gases and the changes in their concentrations. The model in that work obtained 99.75% correct classifications, but it was not the best classifier for the data considered by the authors: the k-nearest neighbors method achieved a perfect classification.

The present paper and the above-mentioned works support the conclusion that random forests usually perform well in classifying objects from multidimensional gas sensor arrays. They are cited as among the better, or the best, of the applied classifiers.

On the other hand, the MLP neural network shows varying results in object classification. In the aforementioned work [86], it produced a result of only 9.6% correct classifications. Interestingly, this work also used an SLP (single-layer perceptron) network, which achieved 62.78% accurate classifications. However, these results were significantly worse than those obtained by the random forest.

In turn, in [87], an MLP neural network was used to predict the thermal performance in buildings. The dataset on which the authors worked was divided into training, test, and validation sets. In that work, a neural network with 50 neurons was used. The mean squared error on the training and test sets was 0, while on the validation set it was 0.0004.

The results obtained in the articles discussed were similar to those obtained in the present work, especially in [87], because it was an object classification task. However, the neural network trained in this work had only five hidden units, so it was much less complex than the one discussed in [87]. It also obtained 100% classification accuracy, so it was much better than both networks trained in the work [86].

5. Summary and Conclusions

Based on the analyses carried out using machine learning models, the following can be concluded:

The collected data allow for the classification of the considered samples into groups determined by the degree of mold infestation;
Although the visualizations produced by the SOM, PCA, and metric multidimensional scaling did not perfectly show the differences between the samples under consideration, the plots produced by PCA and metric scaling showed these differences better; the first two PCA components explained almost 85% of the data variance;
Classification by an unsupervised algorithm, here the hierarchical clustering algorithm, yielded a low accuracy of 48.2%;
Supervised learning methods performed much better on the given task, as both algorithms used classified objects into mold infestation groups 100% correctly, on both the training set and the test set;
The random forest turned out to be the better of the two models in that no tuning of its hyperparameters was needed: even the model with default values classified all of the objects correctly.

This paper presents the results of classification on one dataset, while the results of the experiments on completely new datasets, to which the learned models will be applied, are already in development. Given that the measurements themselves are calibrated, it is expected that the classification results on the new datasets will confirm the generalization abilities of the obtained models.

Author Contributions

Conceptualization, G.Ł., M.P.-R. and D.M.; methodology, G.Ł., M.P.-R., D.M. and Z.S.; software, M.P.-R. and D.M.; validation, G.Ł., D.M. and Z.S.; formal analysis, G.Ł. and Z.S.; investigation, G.Ł., Ł.G. and Z.S.; resources, G.Ł., Ł.G. and Z.S.; data curation, Ł.G., M.P.-R. and D.M.; writing—original draft preparation, G.Ł., M.P.-R., Ł.G. and Z.S.; writing—review and editing, all of the authors; visualization, M.P.-R.; supervision, G.Ł., D.M., E.Ł., Z.S., V.K. and R.Č.; project administration, G.Ł., D.M. and Z.S.; funding acquisition, G.Ł. and Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

The research was partially supported by the Czech Science Foundation within the project No. 22-00420S.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All necessary materials and data are included within the paper.

Acknowledgments

This research was prepared within the activities of project Mobility FCE, Mobility CTU.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. World Health Organization. WHO Guidelines for Indoor Air Quality: Dampness and Mould; World Health Organization, Regional Office for Europe: Copenhagen, Denmark, 2009; ISBN 9789289041683.
2. Peccia, J.; Kwan, S.E. Buildings, Beneficial Microbes, and Health. Trends Microbiol. 2016, 24, 595–597.
3. Mohamed, D.J.; Martiny, J.B. Patterns of Fungal Diversity and Composition along a Salinity Gradient. ISME J. 2011, 5, 379–388.
4. Spicer, R.; Gangloff, H. Establishing Site Specific Reference Levels for Fungi in Outdoor Air for Building Evaluation. J. Occup. Environ. Hyg. 2005, 2, 257–266.
5. Richard, E.; Heutte, N.; Sage, L.; Pottier, D.; Bouchart, V.; Lebailly, P.; Garon, D. Toxigenic Fungi and Mycotoxins in Mature Corn Silage. Food Chem. Toxicol. 2007, 45, 2420–2425.
6. Sessa, R.; Di Pietro, M.; Schiavoni, G.; Santino, I.; Altieri, A.; Pinelli, S.; Del Piano, M. Microbiological Indoor Air Quality in Healthy Buildings. New Microbiol. 2002, 25, 51–56.
7. Kuske, M.; Romain, A.-C.; Nicolas, J. Microbial Volatile Organic Compounds as Indicators of Fungi. Can an Electronic Nose Detect Fungi in Indoor Environments? Build. Environ. 2005, 40, 824–831.
8. Isaksson, T.; Thelandersson, S.; Ekstrand-Tobin, A.; Johansson, P. Critical Conditions for Onset of Mould Growth under Varying Climate Conditions. Build. Environ. 2010, 45, 1712–1721.
9. Chen, X.; Li, F.; Liu, C.; Yang, J.; Zhang, J.; Peng, C. Monitoring, Human Health Risk Assessment and Optimized Management for Typical Pollutants in Indoor Air from Random Families of University Staff, Wuhan City, China. Sustainability 2017, 9, 1115.
10. Schenkel, D.; Lemfack, M.C.; Piechulla, B.; Splivallo, R. A Meta-Analysis Approach for Assessing the Diversity and Specificity of Belowground Root and Microbial Volatiles. Front. Plant Sci. 2015, 6, 707.
11. Lemfack, M.C.; Gohlke, B.-O.; Toguem, S.M.T.; Preissner, S.; Piechulla, B.; Preissner, R. MVOC 2.0: A Database of Microbial Volatiles. Nucleic Acids Res. 2018, 46, D1261–D1265.
12. Santana Oliveira, I.; da Silva Junior, A.G.; de Andrade, C.A.S.; Lima Oliveira, M.D. Biosensors for Early Detection of Fungi Spoilage and Toxigenic and Mycotoxins in Food. Curr. Opin. Food Sci. 2019, 29, 64–79.
13. Żukiewicz-Sobczak, W.; Sobczak, P.; Imbor, K.; Krasowska, E.; Horoch, A.; Wojtyła, A.; Piątek, J. Fungal Hazards in Buildings and Flats—Impact on the Human Organism. Med. Og. Nauk. Zdr. 2012, 18, 141–146.
14. Eggleston, P.A.; Bush, R.K. Environmental Allergen Avoidance: An Overview. J. Allergy Clin. Immunol. 2001, 107, S403–S405.
15. Jeleń, H.; Wasowicz, E. Volatile Fungal Metabolites and Their Relation to the Spoilage of Agricultural Commodities. Food Rev. Int. 1998, 14, 391–426.
16. Bjurman, J. Ergosterol as an Indicator of Mould Growth on Wood in Relation to Culture Age, Humidity Stress and Nutrient Level. Int. Biodeterior. Biodegrad. 1994, 33, 355–368.
17. Börjesson, T.S.; Stöllman, U.M.; Schnürer, J.L. Off-Odorous Compounds Produced by Molds on Oatmeal Agar: Identification and Relation to Other Growth Characteristics. J. Agric. Food Chem. 1993, 41, 2104–2111.
18. Atkinson, R. Atmospheric Chemistry of VOCs and NOx. Atmos. Environ. 2000, 34, 2063–2101.
19. Lacey, J. Indoor Aerobiology and Health. In Building Mycology Management of Decay and Health in Buildings; Jagjit, S., Ed.; Routledge: London, UK, 1995; pp. 75–120.
20. Piotrowska, M.; Żakowska, Z.; Gliścińska, A.; Bogusłąwska-Kozłowska, J. The Role of Outdoor Air on Fungal Aerosols Formation in Indoor Environment. In Proceedings of the II International Scientific Conference: Microbial Biodegradation and Biodeterioration of Technical Materials, Łódź, Poland, 30–31 May 2001; pp. 113–118. (In Polish).
21. Riggs, M.A.; Rao, C.Y.; Brown, C.M.; Van Sickle, D.; Cummings, K.J.; Dunn, K.H.; Deddens, J.A.; Ferdinands, J.; Callahan, D.; Moolenaar, R.L.; et al. Resident Cleanup Activities, Characteristics of Flood-Damaged Homes and Airborne Microbial Concentrations in New Orleans, Louisiana, October 2005. Environ. Res. 2008, 106, 401–409.
22. Adhikari, A.; Jung, J.; Reponen, T.; Lewis, J.S.; DeGrasse, E.C.; Grimsley, L.F.; Chew, G.L.; Grinshpun, S.A. Aerosolization of Fungi, (1→3)-β-d Glucan, and Endotoxin from Flood-Affected Materials Collected in New Orleans Homes. Environ. Res. 2009, 109, 215–224.
23. Pinzari, F.; Fanelli, C.; Canhoto, O.; Magan, N. Electronic Nose for the Early Detection of Moulds in Libraries and Archives. Indoor Built Environ. 2004, 13, 387–395.
24. Schiffman, S.S.; Wyrick, D.W.; Gutierrez-Osuna, R.; Nagle, H.T. Effectiveness of an Electronic Nose for Monitoring Bacterial and Fungal Growth. Proc. ISOEN 2000, 2000, 173–180.
25. Kuske, M.; Padilla, M.; Romain, A.C.; Nicolas, J.; Rubio, R.; Marco, S. Detection of Diverse Mould Species Growing on Building Materials by Gas Sensor Arrays and Pattern Recognition. Sens. Actuators B Chem. 2006, 119, 33–40.
26. Suchorab, Z.; Frąc, M.; Guz, Ł.; Oszust, K.; Łagód, G.; Gryta, A.; Bilińska-Wielgus, N.; Czerwiński, J. A Method for Early Detection and Identification of Fungal Contamination of Building Materials Using E-Nose. PLoS ONE 2019, 14, e0215179.
27. Wang, B.; Li, X.; Chen, D.; Weng, X.; Chang, Z. Development of an Electronic Nose to Characterize Water Quality Parameters and Odor Concentration of Wastewater Emitted from Different Phases in a Wastewater Treatment Plant. Water Res. 2023, 235, 119878.
28. Apetrei, C.; Apetrei, I.M.; Villanueva, S.; de Saja, J.A.; Gutierrez-Rosales, F.; Rodriguez-Mendez, M.L. Combination of an E-Nose, an e-Tongue and an e-Eye for the Characterisation of Olive Oils with Different Degree of Bitterness. Anal. Chim. Acta 2010, 663, 91–97.
29. Garbacz, M.; Malec, A.; Duda-Saternus, S.; Suchorab, Z.; Guz, Ł.; Łagód, G. Methods for Early Detection of Microbiological Infestation of Buildings Based on Gas Sensor Technologies. Chemosensors 2020, 8, 7.
30. Piłat-Rożek, M.; Łazuka, E.; Majerek, D.; Szeląg, B.; Duda-Saternus, S.; Łagód, G. Application of Machine Learning Methods for an Analysis of E-Nose Multidimensional Signals in Wastewater Treatment. Sensors 2023, 23, 487.
31. Moufid, M.; Bouchikhi, B.; Tiebe, C.; Bartholmai, M.; El Bari, N. Assessment of Outdoor Odor Emissions from Polluted Sites Using Simultaneous Thermal Desorption-Gas Chromatography-Mass Spectrometry (TD-GC-MS), Electronic Nose in Conjunction with Advanced Multivariate Statistical Approaches. Atmos. Environ. 2021, 256, 118449.
32. Yaqoob, U.; Younis, M.I. Chemical Gas Sensors: Recent Developments, Challenges, and the Potential of Machine Learning—A Review. Sensors 2021, 21, 2877.
33. Wang, J.; Zhou, Q.; Peng, S.; Xu, L.; Zeng, W. Volatile Organic Compounds Gas Sensors Based on Molybdenum Oxides: A Mini Review. Front. Chem. 2020, 8, 339.
34. He, S.; Gui, Y.; Wang, Y.; Yang, J. A Self-Powered β-Ni(OH)2/MXene Based Ethanol Sensor Driven by an Enhanced Triboelectric Nanogenerator Based on β-Ni(OH)2@PVDF at Room Temperature. Nano Energy 2023, 107, 108132.
35. Wang, Y.; Gui, Y.; He, S.; Yang, J. Hybrid Nanogenerator Driven Self-Powered SO2F2 Sensing System Based on TiO2/Ni/C Composites at Room Temperature. Sens. Actuators B Chem. 2023, 377, 133053.
36. Huang, J.; Wu, J. Robust and Rapid Detection of Mixed Volatile Organic Compounds in Flow Through Air by a Low Cost Electronic Nose. Chemosensors 2020, 8, 73.
37. Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
38. Hotelling, H. Analysis of a Complex of Statistical Variables into Principal Components. J. Educ. Psychol. 1933, 24, 498–520.
39. Abdi, H.; Williams, L.J. Principal Component Analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
40. Mardia, K.V.; Kent, T.; Bibby, J. Multivariate Analysis; Academic Press Limited: Cambridge, MA, USA, 1979.
41. Shrestha, S.; Kazama, F. Assessment of Surface Water Quality Using Multivariate Statistical Techniques: A Case Study of the Fuji River Basin, Japan. Environ. Model. Softw. 2007, 22, 464–475.
42. Kazi, T.G.; Arain, M.B.; Jamali, M.K.; Jalbani, N.; Afridi, H.I.; Sarfraz, R.A.; Baig, J.A.; Shah, A.Q. Assessment of Water Quality of Polluted Lake Using Multivariate Statistical Techniques: A Case Study. Ecotoxicol. Environ. Saf. 2009, 72, 301–309.
43. Łagód, G.; Guz, Ł.; Sabba, F.; Sobczuk, H. Detection of Wastewater Treatment Process Disturbances in Bioreactors Using the E-Nose Technology. Ecol. Chem. Eng. S 2018, 25, 405–418.
44. Torgerson, W.S. Multidimensional Scaling: I. Theory and Method. Psychometrika 1952, 17, 401–419.
45. Ghojogh, B.; Ghodsi, A.; Karray, F.; Crowley, M. Multidimensional Scaling, Sammon Mapping, and Isomap: Tutorial and Survey. arXiv 2020, arXiv:2009.08136.
46. Mardia, K.V. Some Properties of Clasical Multi-Dimesional Scaling. Commun. Stat. Theory Methods 1978, 7, 1233–1241.
47. Borg, I.; Groenen, P.J.F.; Mair, P. Applied Multidimensional Scaling; Springer: Berlin/Heidelberg, Germany, 2013; ISBN 978-3-642-31847-4.
48. Su, X.; Steinman, A.D.; Xue, Q.; Zhao, Y.; Tang, X.; Xie, L. Temporal Patterns of Phyto- and Bacterioplankton and Their Relationships with Environmental Factors in Lake Taihu, China. Chemosphere 2017, 184, 299–308.
49. Purcell, S.W.; Clarke, K.R.; Rushworth, K.; Dalton, S.J. Defining Critical Habitats of Threatened and Endemic Reef Fishes with a Multivariate Approach. Conserv. Biol. 2014, 28, 1688–1698.
50. Kohonen, T. Self-Organized Formation of Topologically Correct Feature Maps. Biol. Cybern. 1982, 43, 59–69.
51. Hollmen, J. Self-Organizing Map (SOM). Available online: http://users.ics.aalto.fi/jhollmen/dippa/node9.html (accessed on 18 March 2023).
52. Jain, A.K.; Mao, J.; Mohiuddin, K.M. Artificial Neural Networks: A Tutorial. Computer 1996, 29, 31–44.
53. Lee, B.-H.; Scholz, M. Application of the Self-Organizing Map (SOM) to Assess the Heavy Metal Removal Performance in Experimental Constructed Wetlands. Water Res. 2006, 40, 3367–3374.
54. Astel, A.; Tsakovski, S.; Barbieri, P.; Simeonov, V. Comparison of Self-Organizing Maps Classification Approach with Cluster and Principal Components Analysis for Large Environmental Data Sets. Water Res. 2007, 41, 4566–4578.
55. Kalteh, A.M.; Hjorth, P.; Berndtsson, R. Review of the Self-Organizing Map (SOM) Approach in Water Resources: Analysis, Modelling and Application. Environ. Model. Softw. 2008, 23, 835–845.
56. Ward, J.H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236.
57. Sneath, P.H.A. The Application of Computers to Taxonomy. Microbiology 1957, 17, 201–226.
58. Sørensen, T. A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and Its Application to Analysis of the Vegetation on Danish Commons. Biol. Skr. 1948, 5, 1–34.
59. Legendre, P.; Legendre, L. Numerical Ecology, 3rd Edition; Elsevier Science BV: Amsterdam, The Netherlands, 2012.
60. Fu, J.; Zhao, C.; Luo, Y.; Liu, C.; Kyzas, G.Z.; Luo, Y.; Zhao, D.; An, S.; Zhu, H. Heavy Metals in Surface Sediments of the Jialu River, China: Their Relations to Environmental Factors. J. Hazard. Mater. 2014, 270, 102–109.
61. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013.
62. Agirre-Basurko, E.; Ibarra-Berastegi, G.; Madariaga, I. Regression and Multilayer Perceptron-Based Models to Forecast Hourly O3 and NO2 Levels in the Bilbao Area. Environ. Model. Softw. 2006, 21, 430–446.
63. Torrecilla, J.S.; García, J.; Rojo, E.; Rodríguez, F. Estimation of Toxicity of Ionic Liquids in Leukemia Rat Cell Line and Acetylcholinesterase Enzyme by Principal Component Analysis, Neural Networks and Multiple Lineal Regressions. J. Hazard. Mater. 2009, 164, 182–194.
64. Isik, E.; Tasyurek, L.B.; Isik, I.; Kilinc, N. Synthesis and Analysis of TiO2 Nanotubes by Electrochemical Anodization and Machine Learning Method for Hydrogen Sensors. Microelectron. Eng. 2022, 262, 111834.
65. Dietterich, T.G. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Mach. Learn. 2000, 40, 139–157.
66. Breiman, L. Using Adaptive Bagging to Debias Regressions; Statistics Department UCB: Berkeley, CA, USA, 1999; Volume 547, pp. 3–7.
67. Ho, T.K. Random Decision Forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282.
68. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
69. Hu, X.; Belle, J.H.; Meng, X.; Wildani, A.; Waller, L.A.; Strickland, M.J.; Liu, Y. Estimating PM 2.5 Concentrations in the Conterminous United States Using the Random Forest Approach. Environ. Sci. Technol. 2017, 51, 6936–6944.
70. Li, J.; Heap, A.D.; Potter, A.; Daniell, J.J. Application of Machine Learning Methods to Spatial Interpolation of Environmental Variables. Environ. Model. Softw. 2011, 26, 1647–1659.
71. Wang, B.; Zhang, J.; Wang, T.; Li, W.; Lu, Q.; Sun, H.; Huang, L.; Liang, X.; Liu, F.; Liu, F.; et al. Machine Learning-Assisted Volatile Organic Compound Gas Classification Based on Polarized Mixed-Potential Gas Sensors. ACS Appl. Mater. Interfaces 2023, 15, 6047–6057.
72. Ge, Y.; Liu, P.; Chen, Q.; Qu, M.; Xu, L.; Liang, H.; Zhang, X.; Huang, Z.; Wen, Y.; Wang, L. Machine Learning-Guided the Fabrication of Nanozyme Based on Highly-Stable Violet Phosphorene Decorated with Phosphorus-Doped Hierarchically Porous Carbon Microsphere for Portable Intelligent Sensing of Mycophenolic Acid in Silage. Biosens. Bioelectron. 2023, 237, 115454.
73. R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2022; Available online: https://www.R-project.org (accessed on 2 February 2023).
74. R Studio Team. RStudio: Integrated Development Environment for R; R Studio Team: Boston, MA, USA, 2022.
75. Wickham, H.; Averick, M.; Bryan, J.; Chang, W.; McGowan, L.D.; François, R.; Grolemund, G.; Hayes, A.; Henry, L.; Hester, J.; et al. Welcome to the Tidyverse. J. Open Source Softw. 2019, 4, 1686.
76. Kuhn, M.; Wickham, H. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. Available online: https://www.tidymodels.org (accessed on 2 February 2023).
77. Wehrens, R.; Buydens, L.M.C. Self- and Super-Organizing Maps in R: The Kohonen Package. J. Stat. Softw. 2007, 21, i05.
78. Tang, Y.; Horikoshi, M.; Li, W. Ggfortify: Unified Interface to Visualize Statistical Results of Popular R Packages. R. J. 2016, 8, 474.
79. Wickham, H. Ggplot2: Elegant Graphics for Data Analysis; Springer: New York, NY, USA, 2016; ISBN 978-3-319-24277-4.
80. Kassambara, A.; Mundt, F. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses; R Package Version 1.0.7. 2020. Available online: https://CRAN.R-project.org/package=factoextra (accessed on 4 February 2023).
81. Majerek, D.; Guz, Ł.; Suchorab, Z.; Łagód, G.; Sobczuk, H. The Application of the Statistical Classifying Models for Signal Evaluation of the Gas Sensors Analyzing Mold Contamination of the Building Materials; API Publishing: Chicago, IL, USA, 2017; p. 040024.
82. Gonzalez-Silva, B.M.; Jonassen, K.R.; Bakke, I.; Østgaard, K.; Vadstein, O. Nitrification at Different Salinities: Biofilm Community Composition and Physiological Plasticity. Water Res. 2016, 95, 48–58.
83. Licen, S.; Cozzutto, S.; Astel, A.; Adami, G.; Barbieri, P. Extracting Knowledge from Hybrid Instrumental Environmental Odour Monitoring Systems: Self Organizing Maps, Data Fusion and Supervised Kohonen Networks for Prediction. In Proceedings of the 2019 IEEE International Symposium on Olfaction and Electronic Nose (ISOEN), Fukuoka, Japan, 26–29 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
84. Łagód, G.; Majerek, D.; Guz, Ł.; Nabrdalik, M. Analysis of Gas Sensors Array Signals for Evaluation of Mold Contamination in Buildings; API Publishing: Chicago, IL, USA, 2018; p. 020022.
85. Syazwan, A.; Rafee, B.M.; Juahir, H.; Azman, A.Z.F.; Nizar, A.; Izwyn, Z.; Rozalini, M.; Syahidatussyakirah, K.; Muhaimin, A.; Syafiq, M.Y.A.; et al. Analysis of Indoor Air Pollutants Checklist Using Environmetric Technique for Health Risk Assessment of Sick Building Complaint in Nonindustrial Workplace. Drug. Healthc. Patient Saf. 2012, 4, 107–126.
86. Wibowo, F.W. Wihayati Classification of Gases and Concentration Levels Obtained from Sensor Array Detection as Electronic Nose. In Proceedings of the 2021 3rd International Conference on Electronics Representation and Algorithm (ICERA), Virtual, 29–30 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 51–56.
87. Odesola, I.; Ige, E.; Adesokan, A.; Ige, I. An ANN Approach for Estimation of Thermal Comfort and Sick Building Syndrome. Rev. Intell. Artif. 2019, 33, 151–158.

Figure 1. View of the sensor array arranged in a circular manner at an equal distance from the gas sample inlet. A DS18B20 temperature sensor and a HIH-4000 humidity sensor are located in the center (sensor numbering as shown in Table 1).

Figure 2. Changes in the resistance of S1–S8 sensors during a single measurement cycle (example for 8 × MOS array).

Figure 3. Scree plot for the principal component analysis.

Figure 4. Principal component analysis data visualization on a two-dimensional plane.

Figure 5. Metric multidimensional scaling mapping on a two-dimensional plane.

Figure 6. Kohonen self-organizing map.

Figure 7. Results of the hierarchical clustering algorithm: (a) dendrogram and (b) assigned clusters on the plane derived from PCA coordinates.

Figure 8. Relation between number of hidden units, epochs, amount of regularization, and ROC−AUC curve on the training sample.

Figure 9. Confusion matrix for the MLP neural network on the test sample.

Figure 10. Confusion matrix for a random forest model on the test sample.

Figure 11. ROC curve on each class on the test sample.

Table 1. Parameters of the gas sensors used in the 8 × MOS array.

No.	Sensor Type	Purpose	Measurement Range [ppm]
1	TGS2600-	Air pollution sensor	1 ÷ 30 (for H₂)
2	TGS2602-	General air pollution sensor	1 ÷ 30 (for ethanol)
3	TGS2610-	Propane sensor	500 ÷ 10,000
4	TGS2610-	Propane and butane sensor with carbon filter	500 ÷ 10,000
5	TGS2611-	Methane and natural gas sensor	500 ÷ 10,000
6	TGS2611-	Methane sensor with carbon filter	500 ÷ 10,000
7	TGS2612-	Methane, propane and isobutane sensor	1 ÷ 25% (explosive)
8	TGS2620-	Ethyl alcohol and solvent vapor sensor	50 ÷ 5000

Table 2. Features of the analyzed objects and reference samples.

Sample Name	Mold Bloom	Odor Nuisance	Object Features	Type of Object
B1	Visible	Plain	Roof construction failure	House, day room
B2	None	Perceptible	No thermal insulation	House, day room
B3	None	Plain	Incorrect thermal insulation	House, wainscot
B4	Significant	High	No thermal and water insulation	House, basement
B5	None (salt efflorescence)	None	Incorrect water insulation	House, basement
B6	None	None	Poor ventilation	Multifamily, bedroom
B7	Significant	High	Incorrect insulation	House, living room
B8	None	None	None	House, living room
B9	Fine Stains	None	Poor ventilation	Multifamily, wardrobe
Air	-	-	-	Clean air
DT	Totally stricken	High	Fully covered in mold	Decayed timber

Table 3. Classification of objects into individual clusters formed by hierarchical clustering.

Observed	Cluster 1	Cluster 2	Cluster 3	Cluster 4	Cluster 5	Cluster 6
Air	0	0	0	0	0	174
None	162	176	178	352	0	0
Fine Stains	0	0	0	177	0	0
Visible	35	0	0	0	0	0
Significant	0	176	0	0	176	0
Totally stricken	0	0	0	0	66	0
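The counts in Table 3 can be turned into a single accuracy figure once clusters are paired with classes. The paper does not state how the pairing was made, so the sketch below rests on an assumption: each cluster is assigned to exactly one class, and the assignment that maximizes the number of agreeing observations is taken. The function names are illustrative; the counts are copied from Table 3.

```python
from itertools import permutations

# Rows: observed classes; columns: Cluster 1 ... Cluster 6 (counts from Table 3)
classes = ["Air", "None", "Fine Stains", "Visible", "Significant", "Totally stricken"]
table = [
    [0, 0, 0, 0, 0, 174],
    [162, 176, 178, 352, 0, 0],
    [0, 0, 0, 177, 0, 0],
    [35, 0, 0, 0, 0, 0],
    [0, 176, 0, 0, 176, 0],
    [0, 0, 0, 0, 66, 0],
]


def matched_accuracy(table):
    """Best accuracy over all one-to-one assignments of clusters to classes."""
    n = len(table)
    total = sum(map(sum, table))
    best_hits, best_perm = -1, None
    for perm in permutations(range(n)):  # perm[i] = column index paired with class i
        hits = sum(table[i][perm[i]] for i in range(n))
        if hits > best_hits:
            best_hits, best_perm = hits, perm
    return best_hits / total, best_perm


def purity(table):
    """Share of observations belonging to the majority class of their cluster."""
    total = sum(map(sum, table))
    return sum(max(column) for column in zip(*table)) / total


accuracy, assignment = matched_accuracy(table)
for cls, col in zip(classes, assignment):
    print(f"{cls} -> Cluster {col + 1}")
print(f"matched accuracy = {accuracy:.3f}, purity = {purity(table):.3f}")
```

Under this assumption, 806 of the 1672 observations agree with their paired class, i.e., about 48.2%, which is consistent with the accuracy reported for hierarchical clustering. Cluster purity, which lets several clusters share one class, is higher (about 0.728) and is shown only for contrast.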

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

