Article

Updating Indoor Air Quality (IAQ) Assessment Screening Levels with Machine Learning Models

Department of Building Environment and Energy Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
*
Author to whom correspondence should be addressed.
Int. J. Environ. Res. Public Health 2022, 19(9), 5724; https://doi.org/10.3390/ijerph19095724
Submission received: 22 March 2022 / Revised: 5 May 2022 / Accepted: 6 May 2022 / Published: 8 May 2022

Abstract

Indoor air quality (IAQ) standards have been evolving to improve the overall IAQ situation. To enhance the performances of IAQ screening models using surrogate parameters in identifying unsatisfactory IAQ, and to update the screening models such that they can apply to a new standard, a novel framework for the updating of screening levels, using machine learning methods, is proposed in this study. The classification models employed are Support Vector Machine (SVM) algorithm with different kernel functions (linear, polynomial, radial basis function (RBF) and sigmoid), k-Nearest Neighbors (kNN), Logistic Regression, Decision Tree (DT), Random Forest (RF) and Multilayer Perceptron Artificial Neural Network (MLP-ANN). With carefully selected model hyperparameters, the IAQ assessment made by the models achieved a mean test accuracy of 0.536–0.805 and a maximum test accuracy of 0.807–0.820, indicating that machine learning models are suitable for screening the unsatisfactory IAQ. Further to that, using the updated IAQ standard in Hong Kong as an example, the update of an IAQ screening model against a new IAQ standard was conducted by determining the relative impact ratio of the updated standard to the old standard. Relative impact ratios of 1.1–1.5 were estimated and the corresponding likelihood ratios in the updated scheme were found to be higher than expected due to the tightening of exposure levels in the updated scheme. The presented framework shows the feasibility of updating a machine learning IAQ model when a new standard is being adopted, which shall provide an ultimate method for IAQ assessment prediction that is compatible with all IAQ standards and exposure criteria.

1. Introduction

Indoor air quality (IAQ) has gained enormous attention in the past decade due to the considerable amount of time we spend indoors nowadays [1,2]. To tackle the problem of poor IAQ, different countries have their own set of IAQ standards, with different measurement parameters and range of exposure limits. Representative parameters, such as carbon dioxide (CO2) and respirable suspended particulates (RSP), are always on the list, while total volatile organic compounds (TVOC), carbon monoxide (CO), ozone (O3), formaldehyde (HCHO), airborne bacteria count (ABC) may be included, depending on the application purpose of the standard [3,4,5,6,7]. The exposure limits are usually established based on health risk analysis, in which lifelong exposure to that level of pollutant shall not produce significant adverse effects on the public [8].
Alternatively, instead of complying strictly with the IAQ standard, the screening approach for assessing IAQ has become popular in recent years due to its simplicity and lower monitoring cost. With a large enough sample size, we can find out the “common” IAQ problems that one type of premises often experiences, thereby identifying the representative IAQ parameters that explain the majority of poor IAQ. The simplest way to reduce the cost of IAQ assessment is to measure only these representative parameters and check whether they exceed the standard. One of the most notable examples is using CO2 level as an indicator of acceptable IAQ to adjust the fresh air quantity [9]. However, this approach may overlook IAQ problems caused by other IAQ parameters; therefore, a surrogate approach was proposed to identify surrogate IAQ parameters that are not just representative but also statistically correlated with other IAQ parameters. An express assessment protocol using three or five IAQ parameters, developed by Hui et al. [10], successfully screened out more than 90% of offices with poor IAQ, which provided an alternative for IAQ pre-assessment without the need to conduct a full assessment (all nine parameters). This study gave insight into the ability of a limited number of parameters to identify problematic IAQ. Further to that, Wong et al. [11] proposed using CO2, RSP and TVOC as the surrogate indicators for evaluating IAQ in offices. The dependence and the correlations of the other nine parameters on the levels of the proposed surrogate indicators were found to be statistically significant. The result served as strong support that CO2, RSP and TVOC could be good surrogate indicators for other IAQ parameters, in terms of representativeness, ease of measurement and the possibility of real-time monitoring [12].
Individually, CO2, RSP and TVOC represent occupant load and ventilation rate, system filtration performance and indoor activities, and emissions from building materials and finishes, respectively, which serve as good indicators for the general IAQ of an environment with a ventilation system. To sum up, using surrogate indicators for IAQ evaluation can reduce the scale of measurement, as some high-risk premises are screened out at a preliminary stage, thereby reducing the resources required to identify problematic premises [10,11].
Based on the aforementioned efforts for simplifying IAQ assessment, an efficient and cost-effective IAQ screening protocol was proposed by Wong et al. [13] for identifying asymptomatic IAQ problems. The IAQ index, the average fractional dose to exposure limits of the representative pollutants, was proposed and was used to diagnose unsatisfactory IAQ in air-conditioned offices in the study by Mui et al. [14]. IAQ indices from 525 offices were evaluated using a five-level screening test with thresholds determined by likelihood ratios of unsatisfactory IAQ. A likelihood ratio larger than 1 indicates a high-risk sample having an excessive occurrence of unsatisfactory IAQ, whereas a likelihood ratio smaller than 1 identifies a low-risk sample. Given the pre-test probability of unsatisfactory IAQ and the regional failure percentage of the Hong Kong IAQ Certification Scheme, the post-test probability of offices with unsatisfactory IAQ can be estimated using the IAQ screening test. This screening test with representative IAQ parameters provides a much simpler and more cost-effective alternative for IAQ assessment. If an environment “fails” the screening test (i.e., any one of the three surrogate indicators exceeds the exposure limit), immediate remedies can be decided on to improve the IAQ. If not, based on the post-test probability given by the screening test, facility management can determine the threshold of the test and the threshold of the remedy according to its willingness to invest manpower and resources in improving the IAQ. A further, comprehensive test will only be needed if the screening test result falls between the two thresholds [14].
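The step from pre-test to post-test probability can be illustrated with the standard odds form of Bayes' rule. The sketch below is a minimal illustration only: the function name and the numerical inputs are hypothetical and are not taken from the cited studies, which may have used a different computational route.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability into a post-test probability.

    Uses the odds form of Bayes' rule:
        post-test odds = pre-test odds * likelihood ratio
    """
    if not 0.0 < pre_test_prob < 1.0:
        raise ValueError("pre_test_prob must lie strictly between 0 and 1")
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical inputs: a 25% pre-test probability and a likelihood ratio of 3
example = post_test_probability(0.25, 3.0)
```

A likelihood ratio above 1 raises the post-test probability above the pre-test value, and a ratio below 1 lowers it, which matches the high-risk/low-risk reading described above.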
It is noteworthy that this approach does not simply test some of the parameters against the standard, but rather uses these parameters to predict the probability of dissatisfying the standard based on correlation. Therefore, an assessment model developed based on the levels of surrogate parameters and probability of failing an IAQ standard is essential in IAQ screening practice. More improvements have been made to the IAQ index to further reduce the resources required for IAQ screening [15]; however, as powerful as it is in screening the IAQ of similar environments, prior knowledge of the IAQ of premises in the region is required [10], and the index may not be applicable to other kinds of space or against another set of IAQ standards.
In fact, throughout the development of IAQ policy, exposure limits have been updated from time to time, based on collective professional judgement and managerial decisions with a balance of social acceptance. The World Health Organization (WHO) has been making constant efforts to improve and refine the air quality standards, since the establishment of the air quality guidelines on selected pollutants in 2005 [16], which include the REVIHAAP project to review the health impacts of air pollution [17], and the HRAPIE project to identify dose–response relationship for RSP, O3 and nitrogen dioxide (NO2) [18]. Results from these two projects supported the comprehensive review of the European Union air quality policy in 2013 and many follow-up consultations and discussion forums on the preparation for an updated guideline [19]. In September 2021, the WHO issued the new Global Air Quality Guideline that reduced levels of key air pollutants to address the accumulated pieces of evidence of health effects and significant risks associated with poor air quality [20]. In 2019, the IAQ standard in Hong Kong was updated with stricter exposure limits to meet the updated IAQ guidelines published by the World Health Organization. The update consisted of the removal of three comfort parameters, the inclusion of visual inspection of mould condition and more stringent limits for CO, RSP and radon (Rn). Considering that the IAQ index itself, the screening levels and the likelihood ratios were all developed using the old standard, it is essential to identify the effect of the new IAQ standard on the suitability and performance of the established screening methods and to provide a framework for “updating” the screening levels.
With exposure standards being updated regularly in practical situations without the quantitatively assessed probable impact of the tightening of levels, fine tuning the IAQ screening baseline is deemed necessary. However, given that past data were assessed using the old standard, the iterative process for baseline determination using newly collected data takes a long time and is not ideal for responding to the rapid change in the need for environmental control. This presents a problem if the standard is being updated. Can the existing IAQ assessment model based on a statistical analysis of old data be useful against the new standard?
In this study, we propose using machine learning methods for the development of a surrogate IAQ assessment model, which may be a solution to the problem of an updated IAQ standard and avoid the iterative process for baseline determination. Machine learning is a state-of-the-art method for environmental prediction. It is commonly used in outdoor pollution predictions [21] and indoor energy simulations [22]. The awareness and application of machine learning modeling in IAQ emerged in the past decade. A comprehensive review of existing machine learning and statistical models for IAQ prediction, conducted by Wei et al. [23], suggested that the majority of existing research focuses on using machine learning algorithms to predict pollutant concentrations. The most popular statistical models applied to IAQ include artificial neural network (ANN), multiple linear regression (MLR), partial least squares (PLS), and random forest (RF). They focus on predicting the concentrations of airborne particles, including RSP, e.g., [24,25,26], CO2, e.g., [27,28], NO2, e.g., [29] and Rn, e.g., [30,31], in indoor environments using outdoor data. Recently, the forecasting of IAQ has become popular for the sake of improving public health and well-being, since precautionary actions can be taken ahead of time [32]. Machine learning methods, such as linear and non-linear autoregressive models [33], are used to develop IAQ forecasting models using the historical profile of IAQ parameters. As continuous monitoring of IAQ is required as the basis of time-series machine learning models, it is common to forecast temperature, e.g., [34,35], relative humidity, e.g., [35,36], CO2, e.g., [34,35,36] and CO, e.g., [36], as they can be easily monitored using low-cost sensors [23].
Forecasting the concentration of indoor aldehydes, volatile organic compounds (VOC), and semi-VOC using statistical models remains scarce [33], and an example of using the nonlinear threshold autoregressive (TAR) model and Chaos-dynamics-based model to forecast HCHO is presented in the study by Ouaret et al. [37]. All things considered, it is advisable to test and compare different statistical models for each specific case, as demonstrated by many studies that used machine learning methods for IAQ modelling [33].
Besides indoor air pollutant prediction and forecasting, there are other examples of applying machine learning methods in IAQ-related research that can be found in the literature. Zimmerman et al. [38] applied random forests (RFs) to improve low-cost sensor performance for more accurate IAQ monitoring. Leong et al. [39] used a support vector machine (SVM) for the prediction of the air pollution index (API) in Malaysia. Their study demonstrated that the radial basis function (RBF) kernel function could accurately and effectively predict API. Sarkhosh et al. [40] used a decision tree (DT) model to identify the most influential parameters that contributed to the prevalence of Sick Building Syndrome (SBS) in office buildings. The high prevalence of SBS was found to be related to job satisfaction, ergonomic parameters, microbiological pollutants and 1-methyl-4-(1-methylethyl) benzene concentration.
While IAQ prediction and forecasting give us a better understanding of the IAQ situation we are experiencing, it is of equal importance to identify whether the level of IAQ is considered acceptable or not before any follow-up mitigation or precautionary strategies are taken; therefore, an IAQ assessment model is essential.
To our best knowledge, we have identified the following research gaps in the field:
  • Using machine learning methods to assess whether the IAQ is acceptable or not with a given IAQ standard;
  • Addressing the issues of updating/changing IAQ standards, which would affect the screening levels and results; and
  • Predicting the updated screening baselines of IAQ with new standards.
Therefore, in this study, we discuss the possibility of using machine learning methods to “update” the screening levels, such that the IAQ screening method can still be applicable with a new standard. Using Hong Kong’s case of an updated IAQ standard as an example, in this paper, we present a universal framework of using machine learning models in predicting the updated IAQ screening levels, which includes:
  • Developing and evaluating the performance of machine learning IAQ assessment models with surrogate IAQ parameters;
  • Quantifying the impact of an updated scheme (i.e., an IAQ standard) on the machine learning IAQ assessment model; and
  • Evaluating the model flexibility in adapting an updated/another exposure standard.
Applicable to all IAQ standards and guidelines, this framework not only enables the implementation of a territory-wide IAQ screening program but also facilitates IAQ monitoring and improvements.

2. Materials and Methods

In the following section, the framework for updating the screening levels of IAQ assessment models is presented. To demonstrate the updating process, machine learning models for IAQ assessment based on the developed IAQ index algorithm and screening methodology were first developed using selected machine learning modelling methods. The performances of the models were evaluated, and with the average assessment results from the models, the relative impact ratios of the updated standard on the old standard were determined. The framework details the feasibility of developing machine learning IAQ assessment models, methods for model performance evaluation and the procedures for updating the screening levels with an updated standard.

2.1. Overview of the Data

IAQ assessment data collected from a cross-sectional IAQ survey of 525 air-conditioned offices in Hong Kong reported in a previous study was adopted to evaluate the performance of machine learning models [14]. The surveyed premises, which covered various grades, types and ages, included a wide range of open-plan offices from 10 m2 to 300 m2. The IAQ survey was conducted for the fulfilment of the Hong Kong IAQ Certification Scheme (the Scheme); therefore, the measurement protocol, sampling locations, period and equipment strictly followed the requirements stated in the Scheme. As such, 8 h continuous samplings were conducted during the office-occupied hours with a sampling density of 500 m2. All the sampling points were selected by the IAQ professionals during the walkthrough inspection before the actual measurement.
Two IAQ assessment schemes, Schemes 1 and 2, are exhibited in Table 1. Scheme 1 was the old IAQ objective in the Hong Kong IAQ Certification Scheme, and Scheme 2 is the updated one, revised to align the requirements with the latest IAQ guidelines by the World Health Organization [41]. In the updated scheme, the exposure limits of CO, Rn and RSP are tightened to provide better public health protection. As mentioned above, the IAQ index using likelihood ratios cannot adapt to an updated standard since it was developed based on the previous standard; using machine learning algorithms to model the IAQ index and IAQ dissatisfaction can, therefore, be a universal solution to this barrier.
A statistical summary of the dataset extracted for this study, which consists of three independent yet closely correlated IAQ surrogate indicators concerning the IAQ index [14], namely CO2, RSP and TVOC, is presented in Table 2. These three parameters were selected as the surrogate indicators among the remaining 9 pollutants in the Scheme: RSP represents the filtering efficiency of the air-conditioning system, CO2 represents the occupant load and ventilation rate, and TVOC indicates building emission [13]. The overall summary of the dataset is shown at the top of the table, with the range of CO2 = 339–1497 ppm, RSP = 4–125 μg m−3, TVOC = 0–3144 μg m−3 and the calculated IAQ index = 0.189–1.99. Using the two assessment schemes introduced in Table 1, each sample in this dataset was further classified as “Satisfactory IAQ” (i.e., all 9 pollutant levels fulfil the assessment scheme) or “Unsatisfactory IAQ” (i.e., 1 or more of the 9 pollutant levels fail the assessment scheme). While the mean values of CO2, RSP and TVOC in the “Satisfactory IAQ” group were significantly different from those in the “Unsatisfactory IAQ” group (p < 0.05, t-test), the corresponding group means (satisfactory or unsatisfactory) obtained under Schemes 1 and 2 were not significantly different from each other (p > 0.1, t-test). Table 2 also exhibits the IAQ index θ, which is an IAQ indicator determined using Equation (1), with j = 1,…,3, Φj* being the fractional dose of RSP, CO2 and TVOC, Φj the exposure level of the assessed parameter over an exposure time, and Φj,e the reference exposure limit under Scheme 1 (RSP = 180 μg m−3, CO2 = 1000 ppm, TVOC = 600 μg m−3) [15].
$$\theta = \frac{1}{3}\sum_{j=1}^{3}\Phi_j^{*}; \quad \Phi_j^{*} = \frac{\Phi_j}{\Phi_{j,e}} \qquad (1)$$
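As a worked illustration of Equation (1), the following sketch computes the IAQ index from the three surrogate levels using the Scheme 1 reference limits quoted above. The function and variable names are illustrative choices, not part of the original study.

```python
# Scheme 1 reference exposure limits quoted in the text
# (CO2 in ppm, RSP and TVOC in micrograms per cubic metre)
SCHEME1_LIMITS = {"CO2": 1000.0, "RSP": 180.0, "TVOC": 600.0}


def iaq_index(levels, limits=SCHEME1_LIMITS):
    """Average fractional dose of the surrogate pollutants, as in Equation (1)."""
    fractional_doses = [levels[name] / limits[name] for name in limits]
    return sum(fractional_doses) / len(fractional_doses)


# Hypothetical office sample with every pollutant at half of its limit
theta = iaq_index({"CO2": 500.0, "RSP": 90.0, "TVOC": 300.0})
```

An office with all three surrogate levels exactly at their limits has an index of 1, so values well below 1 indicate that, on average, the surrogates sit comfortably under the reference limits.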

2.2. Data Preprocessing

Figure 1 shows the pair plots of the IAQ parameters grouped by satisfactory and unsatisfactory IAQ assessed using Schemes 1 and 2. A linear data scaling to the range [0, 1] was applied for data normalization.
The training data and testing data were randomly selected at a distribution ratio of training data (1 − rd) and testing data (rd), as shown in Equation (2), where nd,t and nd,g are the numbers of data points in the testing and training datasets, respectively.
$$r_d = \frac{n_{d,t}}{n_{d,g}} \qquad (2)$$
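The scaling and random partitioning described above can be sketched with the Python standard library alone. This is an illustrative sketch rather than the study's code; in particular, it assumes that rd is the share of all samples held out for testing, as the (1 − rd) versus rd wording suggests, and the helper names are hypothetical.

```python
import random


def min_max_scale(column):
    """Linearly rescale a list of values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def train_test_split(rows, test_fraction, seed=0):
    """Randomly partition rows into (training, testing) lists.

    Assumption: test_fraction plays the role of r_d, interpreted as the
    fraction of all samples assigned to the testing set.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


scaled_co2 = min_max_scale([339.0, 918.0, 1497.0])
train_rows, test_rows = train_test_split(range(10), test_fraction=0.2)
```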
Multifold cross-validation was employed for model validation. The training dataset was divided into 5 and 10 subsets of equal size and each subset was tested using the hyperparameters trained on the remaining subsets. The cross-validation accuracy was determined based on the percentage of correctly classified data. A grid search was then conducted to optimize the model hyperparameters, which were later used to retrain the model for evaluation.
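The K-fold procedure can be sketched as follows. The code is a simplified illustration under stated assumptions: `fit` and `predict` are hypothetical callables standing in for any classifier, folds are taken as contiguous index blocks, and the data are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, validation_idx) pairs for k folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def cross_validation_accuracy(fit, predict, X, y, k):
    """Mean fraction of correctly classified validation samples over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / k
```

A grid search then amounts to repeating this calculation for every candidate hyperparameter value and keeping the value with the highest cross-validation accuracy before retraining on the full training set.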
The model accuracy AC, the probability of the model making a correct prediction [14], is usually compared with the baseline accuracy ACbl in Equation (3), which indicates the certainty of the predictions made without the algorithm, where mode(N) is the number of occurrences of the most frequent true result and N is the sample size.
$$AC_{bl} = \frac{\mathrm{mode}(N)}{N} \qquad (3)$$
The baseline accuracy values adopted are 0.682 and 0.670 for Schemes 1 and 2, respectively. A model with an accuracy below the baseline is considered to be unsatisfactory.
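Reading Equation (3) as the share of the most frequent class, the baseline can be computed as in the sketch below. The labels used are hypothetical and do not reproduce the survey data.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels: 7 satisfactory ("S") and 3 unsatisfactory ("U") offices
example_baseline = baseline_accuracy(["S"] * 7 + ["U"] * 3)
```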
In this study, as shown in Figure 2, a total of 16 (=4 × 2 × 2) evaluation conditions were generated from 4 different combinations (rd = 0.2, 0.3, 0.4, 0.5) of training and testing data, 2 multifold cross-validations (K = 5, 10) and 2 IAQ schemes (Schemes 1 and 2). Trained models (without grid-search-tuned model hyperparameters) and retrained models (with grid-search-tuned model hyperparameters) were then evaluated using the testing data of the 16 evaluation conditions, and finally, 32 sets of testing results were obtained for evaluating the performance of the 9 models for IAQ assessment.
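The count of evaluation conditions follows directly from the Cartesian product of the three settings, as this short sketch shows. The dictionary keys are illustrative names only.

```python
from itertools import product

TEST_RATIOS = (0.2, 0.3, 0.4, 0.5)  # r_d values
FOLDS = (5, 10)                     # K in K-fold cross-validation
SCHEMES = (1, 2)                    # IAQ assessment schemes

conditions = [
    {"r_d": r_d, "K": k, "scheme": scheme}
    for r_d, k, scheme in product(TEST_RATIOS, FOLDS, SCHEMES)
]

# Each condition is evaluated twice: trained and retrained (grid-search-tuned)
n_result_sets = len(conditions) * 2
```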

2.3. Models for Evaluation

Table 3 shows the classification models (classifiers) employed for developing the IAQ assessment model. The selected models included Support Vector Machine (SVM) with different kernel functions (i.e., linear, polynomial, radial basis function (RBF), and sigmoid), k-Nearest Neighbors (kNN), Logistic Regression, Decision Tree (DT), Random Forest (RF) and Multilayer Perceptron Artificial Neural Network (MLP-ANN). These algorithms are commonly used for developing IAQ prediction and forecasting models based on the literature review described in the introduction. In order to provide a universal framework for developing the IAQ assessment models and updating the screening levels, these popular models were adopted and their performances were evaluated. More details of each machine learning model and its hyperparameters can be found in Appendix A.
Table 3 also presents the test ranges of the hyperparameters, the cross-validation accuracy and the model accuracy with the testing datasets, and the corresponding hyperparameters that gave the best prediction accuracy in all tests. The development and the training of models were coded in the Python programming language using the scikit-learn library described by Pedregosa et al. [42].
Regularization was applied to avoid overfitting by penalizing large coefficients [43]. It was intended to reduce the generalization error but not the training error. As a result, the application of regularization allowed a certain amount of misclassified data points in the training dataset [44]. To minimize the error between the true value yi and the predicted value xβ, the cost function f shown in Equation (4) could be expressed with the L2 loss function $\sum_i \left( y_i - \sum_j x_{ij}\beta_j \right)^2$ and the regularization factor C [45].
$$f = \sum_i \left( y_i - \sum_j x_{ij}\beta_j \right)^2 + C \sum_j \beta_j^2 \qquad (4)$$
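Equation (4) can be evaluated directly for a given coefficient vector. The sketch below is a plain-Python illustration with hypothetical inputs; it only computes the cost and does not perform the minimization.

```python
def ridge_cost(X, y, beta, C):
    """Squared-error loss plus an L2 penalty on the coefficients (Equation (4))."""
    squared_error = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        squared_error += (y_i - prediction) ** 2
    penalty = C * sum(b_j ** 2 for b_j in beta)
    return squared_error + penalty


# Hypothetical two-sample, two-feature example
cost = ridge_cost(X=[[1.0, 0.0], [0.0, 1.0]], y=[1.0, 2.0], beta=[1.0, 1.0], C=0.5)
```

In this form, a larger C penalizes large coefficients more strongly. Some libraries, scikit-learn among them, instead parameterize C as the inverse of the regularization strength, so the direction of the effect should be checked against the convention in use.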

3. Results and Discussion

Figure 3 illustrates the cross-validation accuracy of the SVM classifiers with linear, RBF, sigmoid and polynomial kernels. Consistent accuracy of AC > 0.8 was observed when the regularization factor C was ≥2 for the SVM with linear kernel, and for the whole test ranges of the SVM with RBF and polynomial kernels. However, the SVM with sigmoid kernel did not perform well for the training datasets, as compared with other kernels, with AC ≤ 0.65, which dropped significantly for C ≥ 0.6.
Figure 4 shows the cross-validation accuracy of the kNN classifier, which was consistent for k = 2–11. The accuracy was more sensitive to the weight function applied, although, as observed in Figure 4a, a larger k compensated for the accuracy drop.
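To make the role of k and the weight function concrete, the following is a minimal kNN sketch with uniform and inverse-distance voting. It is an illustration under stated assumptions rather than the classifier used in the study: the data points are hypothetical, and the option names "uniform" and "distance" are chosen here for readability.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k, weights="uniform"):
    """Classify query by a (weighted) vote among its k nearest training points."""
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for distance, label in neighbours[:k]:
        if weights == "uniform":
            votes[label] += 1.0
        else:
            # Inverse-distance weighting: closer neighbours count more
            votes[label] += 1.0 / (distance + 1e-12)
    return max(votes, key=votes.get)


# Hypothetical scaled samples: "S" = satisfactory, "U" = unsatisfactory
train_X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
train_y = ["S", "S", "U", "U", "U"]
query = (0.05, 0.0)

uniform_vote = knn_predict(train_X, train_y, query, k=5, weights="uniform")
distance_vote = knn_predict(train_X, train_y, query, k=5, weights="distance")
```

With k = 5 the uniform vote is dominated by the three distant "U" points, whereas inverse-distance weighting lets the two nearby "S" points decide, which illustrates why the two settings interact.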
According to Figure 5, the logistic regression classifier improved the prediction accuracy for regularization factor C > 2. The choice of training dataset was found to be insignificant to the model accuracy.
Figure 6 graphs the cross-validation accuracy of the decision tree classifier. Within the range of 0.75–0.8, the accuracy was sensitive to the size of the dataset, the impurity function, the minimum number of samples required to split an internal node ns, and the minimum number of samples required to be at a leaf node nr. It became less sensitive when the maximum depth value was greater than or equal to 10 (i.e., D ≥ 10).
Figure 7 exhibits the cross-validation accuracy of the random forest classifier. The accuracy, which became less sensitive for D ≥ 2, was improved, as compared with Figure 6. It can be seen that the number of trees nf compensated for the accuracy drop due to D ≤ 5.
A wide range of hyperparameters can be adopted for a MLP-ANN classifier. In this study, 100 and 200 neurons in the inner layers 1, 3, 4 and 6 were evaluated, with neuron arrangements of each layer in the ratios of (1), (1:8:1), (1:4:4:1) and (1:2:2:2:2:1). Figure 8 illustrates the cross-validation accuracy of the 60 configurations of the model hyperparameters for the inner-layer architecture (i.e., x-axis with legends 1–60, Table A1). A very sensitive accuracy ranging from <0.45 to about 0.8 was observed.
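One way to read the neuron arrangements is to split a total neuron budget across the inner layers in proportion to the stated ratios. The sketch below makes that assumption explicit; it is not confirmed by the text that the 100 and 200 neurons refer to a total across layers, and the function name is hypothetical.

```python
def hidden_layer_sizes(total_neurons, ratio):
    """Split total_neurons across layers in proportion to ratio.

    Assumption: the neuron count is a total shared by all inner layers.
    """
    unit = total_neurons / sum(ratio)
    return tuple(round(part * unit) for part in ratio)


RATIOS = [(1,), (1, 8, 1), (1, 4, 4, 1), (1, 2, 2, 2, 2, 1)]

architectures = {
    (total, ratio): hidden_layer_sizes(total, ratio)
    for total in (100, 200)
    for ratio in RATIOS
}
```

Under this reading, 100 neurons in the ratio 1:8:1 correspond to three inner layers of 10, 80 and 10 neurons.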
It was challenging to set up a suitable MLP-ANN for an engineering application without prior selection of the model hyperparameters. Table 4 shows the test accuracy of the MLP-ANN classifier. The identity activation function made the best predictions with the highest (mean and median) test accuracy. Iteration schemes ADAM and L-BFGS, with constant learning rates only, returned more accurate predictions, as compared with SGD.
To sum up, all of the IAQ assessment models developed achieved maximum test accuracies within a narrow range of 0.807–0.820, with the mean test accuracy ranging from 0.536 to 0.805. Table 5 presents the best-performing models in the 32 tests (16 each for the trained and retrained models). The results showed that the SVM with polynomial kernel gave the highest test accuracy and next-best predictions in the trained and retrained model tests. Moreover, models with decision tree and random forest classifiers gained 4 and 3 counts (out of 16), respectively, in the trained model test, whereas the SVM with linear kernel gained 8 counts (i.e., the best prediction performance) in the retrained model test. These classifiers can be good choices for accurate IAQ assessment model development.

4. Model Prediction of IAQ Assessment with IAQ Index Updates

The IAQ index was developed previously as a screening strategy to screen out premises with problematic IAQ based on assessment Scheme 1. Given that the assessment scheme has been updated to Scheme 2, this section evaluates the relative impact of the index due to the updated values of baselines in the two schemes.
The relative impact on the IAQ index for IAQ assessment with Schemes 1 and 2 was evaluated using three uniformly distributed ranges: CO2 = 400–1400 ppm, RSP = 1–120 μg m−3, and TVOC = 0–1500 μg m−3. The selected ranges of surrogate pollutants generally cover the observable range in the office IAQ database. Determined by Monte Carlo sampling techniques, the three IAQ parameters in the above ranges were used to calculate the corresponding IAQ index and to predict the IAQ satisfaction/dissatisfaction using the trained and retrained classifiers.
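The sampling step can be sketched as below, drawing the three surrogate levels uniformly from the stated ranges and computing the IAQ index against the Scheme 1 reference limits given in Section 2.1. The function name, seed and sample size are illustrative; the subsequent classification would be made by whichever trained classifier is under evaluation and is not reproduced here.

```python
import random

# Uniform sampling ranges stated in the text
RANGES = {"CO2": (400.0, 1400.0), "RSP": (1.0, 120.0), "TVOC": (0.0, 1500.0)}
# Scheme 1 reference exposure limits from Section 2.1
LIMITS = {"CO2": 1000.0, "RSP": 180.0, "TVOC": 600.0}


def sample_iaq_indices(n_samples, seed=0):
    """Draw surrogate pollutant levels uniformly and compute each IAQ index."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        levels = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
        theta = sum(levels[name] / LIMITS[name] for name in LIMITS) / len(LIMITS)
        samples.append((levels, theta))
    return samples


monte_carlo_samples = sample_iaq_indices(1000)
```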
Figure 9 shows the percentage of predicted satisfactory and unsatisfactory IAQ for the range of IAQ indices under Schemes 1 and 2. The IAQ satisfaction was assessed by the best performing trained and retrained IAQ classification models (with model accuracy shown in brackets). Classifications were performed with models with classifiers of a decision tree, a random forest, SVM with polynomial kernel and RBF kernel for Scheme 1, and models with classifiers of kNN, MLP-ANN, SVM with linear kernel and polynomial kernel for Scheme 2. The figure shows that the predictions of unsatisfactory IAQ made by these models generally agree with each other, with a deviation up to ±5% from the average prediction of satisfactory IAQ with Scheme 2.
The IAQ index in Figure 9 does not map any particular office distribution function and, thus, a relative approach was adopted to study the relative impact of Scheme 2 on Scheme 1, in terms of assessment likelihood, using the dataset summarized in Table 2. The relative impact ratio r2,1 is determined by Equation (5), where xu and xs are the distribution functions of the IAQ index for unsatisfactory and satisfactory IAQ respectively.
$$r_{2,1} = \frac{LR_2}{LR_1}; \quad LR = \frac{\int_{x_1}^{x_2} f(x_u)\,dx}{\int_{x_1}^{x_2} f(x_s)\,dx} \qquad (5)$$
Table 6 outlines a proposed likelihood ratio LR1 for air-conditioned offices with unsatisfactory IAQ using Scheme 1, as reported in an earlier study [29]. The estimation of r2,1 was made based on the average predictions from all models shown in Figure 9. Normality of the IAQ index was assumed (p > 0.05, w/s test). Based on the relative impact values determined for the IAQ index ranges <0.32, 0.32–0.42, 0.43–0.53, 0.54–0.64 and ≥0.65, the corresponding values of LR2 were computed (by LR2 = r2,1 LR1) and summarized in Table 6. The corresponding likelihood ratios in Scheme 2 were found to be higher due to the tightening of assessment criteria in the updated scheme.
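Under the normality assumption, the likelihood ratio in Equation (5) reduces to a ratio of normal probabilities over an index range. The sketch below illustrates this with hypothetical distribution parameters, which are not the values fitted in the study; the function names are likewise illustrative.

```python
from statistics import NormalDist


def likelihood_ratio(unsatisfactory, satisfactory, x1, x2):
    """Ratio of the probabilities that the IAQ index falls in [x1, x2]
    for the unsatisfactory and satisfactory groups (Equation (5))."""
    p_unsat = unsatisfactory.cdf(x2) - unsatisfactory.cdf(x1)
    p_sat = satisfactory.cdf(x2) - satisfactory.cdf(x1)
    return p_unsat / p_sat


def relative_impact_ratio(lr_scheme2, lr_scheme1):
    """r_{2,1}: likelihood ratio under Scheme 2 relative to Scheme 1."""
    return lr_scheme2 / lr_scheme1


# Hypothetical IAQ index distributions (mean, standard deviation)
sat = NormalDist(mu=0.4, sigma=0.1)
unsat = NormalDist(mu=0.6, sigma=0.1)

lr_upper_band = likelihood_ratio(unsat, sat, 0.54, 0.64)
lr_lower_band = likelihood_ratio(unsat, sat, 0.0, 0.32)
```

Given a published LR1 for a band and an estimated r2,1, the updated value follows as LR2 = r2,1 × LR1, which is the relation used to fill Table 6.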

5. Conclusions

One of the ongoing IAQ development tasks is to constantly improve IAQ objectives so that they are updated, relevant and attainable. Territory-wide IAQ screening should be implemented immediately, and later, periodically, to understand the overall IAQ situation and to maintain an up-to-date IAQ profile. Given so many IAQ standards with a wide range of exposure limits established by various governments, a universal framework for IAQ assessment modelling, which applies to all standards, is of urgent need.
In this study, a new strategy for unsatisfactory IAQ prediction using machine learning models of three surrogate IAQ indicators in the IAQ index was proposed. The results showed that all selected machine learning models performed well, achieving a maximum test accuracy of 0.807–0.820. Among the selected models, SVM with linear kernel and polynomial kernel, decision tree classifier and random forest classifier gave an IAQ classification with higher accuracy. To further demonstrate the use of IAQ index with different exposure limits in IAQ assessment, machine learning models of IAQ index using two different baselines (Schemes 1 and 2) were presented. The predictions of IAQ made by all selected models generally agreed with each other, with a ±5% deviation observed in the prediction of satisfactory IAQ under Scheme 2. The likelihood ratio of the IAQ index in Scheme 2 also increased with the tightening criteria for assessing exposure levels.
As demonstrated, machine learning models for IAQ index give promising prediction accuracy in identifying unsatisfactory IAQ, and that shall provide an ultimate strategy for IAQ screening and assessment, even under various IAQ standards and exposure criteria.

Author Contributions

Conceptualization, L.-T.W. and K.-W.M.; methodology, L.-T.W.; formal analysis, L.-T.W.; writing—original draft preparation, L.-T.W., K.-W.M. and T.-W.T.; writing—review and editing, L.-T.W., K.-W.M. and T.-W.T.; supervision, L.-T.W. and K.-W.M.; project administration, K.-W.M.; funding acquisition, K.-W.M. and L.-T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly supported by a grant from the Collaborative Research Fund (CRF) COVID-19 and Novel Infectious Disease (NID) Research Exercise, Research Grants Council of the Hong Kong Special Administrative Region, China (Project no. PolyU P0033675/C5108-20G, HKPU P0033675/E-RB0P, PolyU 15217221 P0037773/Q-86B, PolyU 152088/17E P0005278/Q-59V) and the Research Institute for Smart Energy (RISE) Matching Fund (Project no. P0038532).

Data Availability Statement

Data available on request.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Nomenclature

IAQ index and updates
j: surrogate parameter
Φ_j*: fractional dose
Φ_j: exposure level
Φ_j,e: reference exposure limit
θ: IAQ index
r: relative impact ratio
x_u/x_s: distribution functions for unsatisfactory/satisfactory IAQ index
LR: likelihood ratio

Data processing
X: data vector
r_d/1 − r_d: test data/training data
n_d,t/n_d,g: number of data points in the test/training set
AC: model accuracy
AC_bl: baseline accuracy
TP/TN: true positive/negative
FP/FN: false positive/negative
N: sample size
K: number of folds

Units for IAQ parameters
ppm: parts per million
μg m−3: microgram per cubic meter
Bq m−3: becquerels per cubic meter
CFU m−3: colony-forming units per cubic meter

Regularization
f: cost function
y_i: true value
xβ: predicted value
C: regularization factor
n: number of dimensions

Decision tree/random forest
p_j: probability of j
j: class
D: tree’s maximum depth
n_s/n_r: minimum number of samples required to split an internal node/be at a leaf node
n_f: number of trees

Support Vector Machines
α, β: constants
x_i: inputs
y_i: output class
M: margin half-width
ε_i: slack variables
c_0, c_1: hyperparameters for K(x_i,x_j)
K(x_i,x_j): kernel function
γ: kernel coefficient

k-Nearest Neighbors
k: constant
d(x_i,y_i): Euclidean distance
ŷ: predictions
W: weight function
d_k^−1: neighbour distance

MLP-ANN
R: dataset
m/o: dimension for input/output
J: local gradient of function f
β: parameter
y: independent variables
δ: increment

Logistic regression
x_0: sigmoid’s midpoint of x
x: inputs
k: logistic growth rate
w: coefficient vector

Appendix A

Appendix A.1. Support Vector Machine (SVM)

The support vector machine (SVM) algorithm identifies the optimal hyperplane in an n-dimensional space that distinctly separates the data points to be classified into two classes (in this study, satisfaction or dissatisfaction). The algorithm maximizes the margin between these two classes. The linear classifier can be expressed by Equation (A1), where α and β are constants, x is the input vector of inputs xi [46,47], and yi is the output class.
$$f(x) = \beta_0 + \sum_i \alpha_i \langle x_i, x \rangle; \quad f(y_i) = \begin{cases} 0, & f(x_i) < 0 \\ 1, & f(x_i) > 0 \end{cases} \tag{A1}$$
To maximize the margin half-width M of the strip that separates the data points into the two classes, slack variables εi are specified for the soft margins, such that observations (training data) on the wrong side are allowed. It is a trade-off between misclassification of the training samples and simplicity of the decision surface suitable for a general model.
In Equation (A2), C is the regularization factor that is optimized for the number of samples [42]. For a large value of C, the optimizer chooses a smaller-margin hyperplane if that hyperplane can classify all the training points correctly. Conversely, a small value of C causes the optimizer to look for a larger-margin separating hyperplane. This description follows the penalty convention, in which C weights the slack terms; in the budget form, where C is an upper bound on the total slack, the direction is reversed. Regularization improves the numerical stability of the model and reduces its generalization error when predicting unseen data.
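$$\sum_i \varepsilon_i \le C; \quad y_i\left(\beta_0 + \beta_1 x_{i1} + \cdots\right) \ge M\left(1 - \varepsilon_i\right), \quad \varepsilon_i \ge 0 \tag{A2}$$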
Four types of kernel functions K(xi,xj) in SVM were investigated in this study: linear, polynomial, radial basis function (RBF) and sigmoid kernels, expressed below in Equations (A3)–(A6), where c0 and c1 are the hyperparameters for the functions [48], and γ is the kernel coefficient, which defines how far the influence of a single training sample reaches. A large γ shrinks the area of influence of each support vector, so that regularization can no longer prevent overfitting, whereas a very small γ over-constrains the model so that it cannot capture the complexity of the data. The behavior of the model is very sensitive to the value of γ.
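$$K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j) = \langle x_i, x_j \rangle \tag{A3}$$
$$K(x_i, x_j) = \left(c_0 + \gamma \langle x_i, x_j \rangle\right)^{c_1} \tag{A4}$$
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) \tag{A5}$$
$$K(x_i, x_j) = \tanh\left(c_0 + \gamma \langle x_i, x_j \rangle\right) \tag{A6}$$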
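The four kernels in Equations (A3)–(A6) can be sketched in a few lines of plain Python. This is only an illustration: the function names and the default values of γ, c0 and c1 are assumptions chosen for the example, not the settings or the implementation used in the study.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    # Equation (A3): K = <u, v>
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, c0=1.0, c1=3):
    # Equation (A4): K = (c0 + gamma * <u, v>) ** c1
    return (c0 + gamma * dot(u, v)) ** c1


def rbf_kernel(u, v, gamma=1.0):
    # Equation (A5): K = exp(-gamma * ||u - v||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, c0=0.0):
    # Equation (A6): K = tanh(c0 + gamma * <u, v>)
    return math.tanh(c0 + gamma * dot(u, v))


# Two hypothetical feature vectors (illustrative values only).
a = [0.6, 0.2, 0.5]
b = [0.7, 0.3, 0.9]
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(a, b))
```

For the RBF kernel, increasing γ makes the kernel value fall off faster with distance, which is why the model becomes more local as γ grows.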

Appendix A.2. k-Nearest Neighbors (kNN)

The k-nearest neighbors (kNN) algorithm is a non-parametric classification approach that classifies a point based on the majority class of the k-neighbors closest to the point. The average response of the k-closest points to x is given by Equation (A7).
$$f(x) = \frac{1}{k} \sum_{i=1}^{k} y_i \tag{A7}$$
The Euclidean distance d(xi,yi), expressed in Equation (A8), is usually adopted for calculating the distance [49].
$$d(x_i, y_i) = \sqrt{\sum_{i=1}^{k} \left(x_i - y_i\right)^{2}} \tag{A8}$$
The neighbors closer to a query point have a greater influence than the neighbors that are farther away. Therefore, the predictions ŷ can be made with a non-negative weight function of the neighbor distance, W ∼ d_k^−1, as shown in Equation (A9).
$$\hat{y} = \sum_{i=1}^{n} W(x_i, x_j)\, x_i \tag{A9}$$
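As a rough illustration of the difference between uniform weights (W = 1) and inverse-distance weights (W = 1/d), the sketch below classifies a query point by a vote of its k nearest neighbours. The function names and the toy data are hypothetical, and the voting rule is a simplified stand-in for the weighted prediction described above rather than the implementation used in the study.

```python
import math
from collections import defaultdict


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(train_X, train_y, query, k=3, weighted=False):
    """Return the class with the largest (optionally distance-weighted) vote."""
    # Sort the training points by distance to the query and keep the k closest.
    neighbours = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbours:
        # W = 1/d gives closer neighbours more influence; W = 1 counts all equally.
        # The small constant avoids division by zero for an exact match.
        votes[label] += 1.0 / (distance + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Toy one-dimensional data: one class-1 point close to the query,
# two class-0 points farther away.
train_X = [[0.0], [1.0], [1.1]]
train_y = [1, 0, 0]
query = [0.1]
print("uniform :", knn_predict(train_X, train_y, query, k=3, weighted=False))
print("weighted:", knn_predict(train_X, train_y, query, k=3, weighted=True))
```

With uniform weights the two distant neighbours outvote the close one; with inverse-distance weights the close neighbour dominates.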

Appendix A.3. Logistic Regression

A logistic regression algorithm is a linear classification model. The probabilities of the outcomes of a single trial are modelled using the logistic function exhibited in Equation (A10), where x0 is the x value of the sigmoid’s midpoint, and k is the logistic growth rate [50].
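$$f(x) = \frac{1}{1 + \exp\left(-k\left(x - x_0\right)\right)} \tag{A10}$$
The model is fitted by minimizing the L2-regularized cost function expressed in Equation (A11), where w is the coefficient vector, c is the intercept and C is the regularization factor.
$$f(x) = \min_{w,c} \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i\left(X_i^{T} w + c\right)\right) + 1\right) \tag{A11}$$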
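The logistic function of Equation (A10) and the quantity minimized in Equation (A11) can be evaluated directly, as in the minimal sketch below. It assumes that the class labels are coded as −1 and +1, which is the usual convention for this form of the cost; the function names and the toy data are hypothetical.

```python
import math


def logistic(x, k=1.0, x0=0.0):
    """Equation (A10): logistic function with growth rate k and midpoint x0."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))


def regularized_cost(w, c, X, y, C=1.0):
    """Equation (A11) before minimization, assuming labels y_i in {-1, +1}.

    cost = 0.5 * w.w + C * sum_i log(exp(-y_i * (X_i.w + c)) + 1)
    """
    penalty = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + c)
        loss += math.log(math.exp(-margin) + 1.0)
    return penalty + C * loss


# Toy data: two samples with two features each (illustrative values only).
X = [[0.4, 0.2], [0.9, 1.1]]
y = [-1, 1]
print(logistic(0.0))
print(regularized_cost([0.0, 0.0], 0.0, X, y, C=1.0))
print(regularized_cost([1.0, 1.0], -1.0, X, y, C=1.0))
```

A larger C puts more weight on the data-fit term relative to the penalty on w, so the fitted model is less strongly regularized.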

Appendix A.4. Decision Tree (DT) and Random Forest (RF)

A decision tree (DT) is a non-parametric learning algorithm that partitions the data into subsets for classification [40]. The goal is to create the smallest possible tree (training model) that can predict the value of a target variable by learning simple decision rules. A tree can be seen as a piecewise constant approximation. The binary partitioning process continues until no further splits can be made, so that the tree nodes are pure. The node purity can be measured by the Gini impurity (GI) or by the information entropy (EI). GI measures the frequency at which any element of the dataset is mislabeled when it is randomly labeled. EI measures the disorder of the features with the target. A tree node is determined by minimizing the chosen index so that all the contained elements in the node are of one unique class. The GI and EI can be expressed by Equations (A12) and (A13), where p_j is the probability of class j.
$$GI = 1 - \sum_j p_j^{2} \tag{A12}$$
$$EI = -\sum_j p_j \log_2 p_j \tag{A13}$$
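Both impurity measures in Equations (A12) and (A13) can be computed from the class labels in a node. The sketch below is illustrative only: the function names are hypothetical, and the example node simply reuses the counts of satisfactory (358) and unsatisfactory (167) cases reported for Scheme 1 in Table 2.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency p_j of each class j present in the node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    # Equation (A12): GI = 1 - sum_j p_j^2
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy_impurity(labels):
    # Equation (A13): EI = -sum_j p_j * log2(p_j)
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


# A node holding the whole Scheme 1 dataset: 358 satisfactory, 167 unsatisfactory.
root = ["S"] * 358 + ["U"] * 167
print("GI:", gini_impurity(root))
print("EI:", entropy_impurity(root))
```

Both measures are zero for a pure node and largest when the classes are equally mixed, so a split is preferred when it lowers the chosen measure in the child nodes.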
Regularization can be done by confining the tree size, the tree’s maximum depth D, the minimum number of samples required to split an internal node ns, and the minimum number of samples required to be at a leaf node nr.
A random forest (RF) is a meta-estimator that fits several decision tree classifiers to various subsamples of the dataset. It is also known as a random decision forest (RDF) that uses the mode of the classification to improve the predictive accuracy and control the problem of over-fitting [51]. The number of trees in the forest is a hyperparameter to be tuned, in addition to those hyperparameters for a decision tree.

Appendix A.5. Multilayer Perceptron Artificial Neural Network (MLP-ANN)

A multilayer perceptron artificial neural network (MLP-ANN) is a supervised learning algorithm that learns a function f(·): R^m → R^o by training on a dataset R with m-dimensional input and o-dimensional output. It can also learn a nonlinear function approximator for predicting the output. As ANNs do not have predefined assumptions, they have a low sensitivity to error term assumptions and a high tolerance to noise. Therefore, an MLP-ANN can be used to examine the relationships in complex nonlinear datasets in the same way as conventional statistical techniques, but without many of the parametric restrictions about the nature of the data relationships [29]. The algorithm is described by Equation (A14), where J is the local gradient of function f with respect to the parameters β, y is the independent variables, λ is the damping factor and δ is the increment.
$$\left(J^{T} J + \lambda\, \mathrm{diag}\left(J^{T} J\right)\right) \delta = J^{T} \left[y - f(\beta)\right] \tag{A14}$$
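Equation (A14) has the form of a Levenberg–Marquardt damped least-squares step. To make the update concrete, the sketch below applies it to a deliberately small problem with a single parameter, where the matrices reduce to scalars. The model y = exp(βx), the fixed damping factor and the function name are all assumptions made for illustration; this is not the training routine of an MLP.

```python
import math


def lm_fit(xs, ys, beta=0.0, lam=0.01, n_iter=20):
    """Fit y = exp(beta * x) by repeating the damped step of Equation (A14).

    With one parameter, J^T J and diag(J^T J) are the same scalar, so
    (J^T J + lam * diag(J^T J)) * delta = J^T [y - f(beta)] is solved directly.
    """
    for _ in range(n_iter):
        predictions = [math.exp(beta * x) for x in xs]
        residuals = [y - p for y, p in zip(ys, predictions)]  # y - f(beta)
        jacobian = [x * p for x, p in zip(xs, predictions)]   # df/dbeta per sample
        jtj = sum(j * j for j in jacobian)                    # J^T J
        jtr = sum(j * r for j, r in zip(jacobian, residuals)) # J^T [y - f(beta)]
        delta = jtr / (jtj + lam * jtj)                       # increment
        beta += delta
    return beta


# Synthetic data generated with beta = 0.5 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]
print(lm_fit(xs, ys))
```

A larger damping factor shortens each step and makes the update behave more like gradient descent, while a value close to zero recovers the Gauss–Newton step.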
The hyperparameters are adjusted for model performance. Hidden layer arrangement includes the number of hidden layers and the number of neurons in each hidden layer. The activation function of a neuron defines the output of that neuron given an input. Four activation functions (identity, logistic, tanh and rectified linear unit (ReLU)) used in this study are given in Equations (A15)–(A18).
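$$f(x) = x \tag{A15}$$
$$f(x) = \frac{1}{1 + \exp(-x)} \tag{A16}$$
$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \tag{A17}$$
$$f(x) = \begin{cases} 0, & x \le 0 \\ x, & x > 0 \end{cases} \tag{A18}$$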
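The four activation functions in Equations (A15)–(A18) are simple enough to write out directly. The function names below are chosen for illustration only.

```python
import math


def identity(x):
    # Equation (A15): f(x) = x
    return x


def logistic(x):
    # Equation (A16): f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Equation (A17): (exp(x) - exp(-x)) / (exp(x) + exp(-x))
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    # Equation (A18): 0 for x <= 0, x otherwise
    return x if x > 0 else 0.0


for value in (-2.0, 0.0, 2.0):
    print(value, identity(value), logistic(value), tanh(value), relu(value))
```

The logistic and tanh functions saturate for large positive or negative inputs, whereas the identity and ReLU functions do not bound positive inputs.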
Moreover, the iterative method adopted for training the neural network (weight optimization) can be specified. The L-BFGS quasi-Newton method approximates the second derivative (Hessian) of the objective function, which leads to a more efficient descent direction [52]. Stochastic gradient descent (SGD) optimizes an objective function with differentiable smoothness properties by using a gradient estimate calculated from a randomly selected subset of the data rather than the entire dataset [53]. Adaptive moment estimation (Adam) is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments [54].
The learning rate determines the step size of the weight updates. The default value for the constant learning rate is 0.001 for all iterative methods. Optional learning rate schedules are available for the stochastic gradient descent solver. The “invscaling” schedule gradually decreases the learning rate at each time step using an inverse scaling exponent of the time step, while the “adaptive” schedule keeps the learning rate constant as long as the training loss keeps decreasing. Once the training loss stops decreasing, the current learning rate is generally divided by 5.
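The two optional schedules can be sketched as follows. This is a simplified illustration under stated assumptions: the exponent of 0.5, the tolerance, the rule of waiting for two consecutive epochs without improvement, and the function names are all choices made for the example rather than values reported in the study.

```python
def invscaling_rate(initial_rate, t, power_t=0.5):
    """Inverse-scaling schedule: the rate shrinks as initial_rate / t**power_t."""
    return initial_rate / (t ** power_t)


def adaptive_rates(initial_rate, losses, tol=1e-4, patience=2):
    """Adaptive schedule: keep the rate constant while the loss keeps decreasing.

    Whenever `patience` consecutive epochs fail to reduce the best loss by at
    least `tol`, the current rate is divided by 5.
    Returns the rate in effect after each epoch.
    """
    rate = initial_rate
    best_loss = float("inf")
    stalled_epochs = 0
    history = []
    for loss in losses:
        if loss > best_loss - tol:
            stalled_epochs += 1      # no meaningful improvement this epoch
        else:
            stalled_epochs = 0
        best_loss = min(best_loss, loss)
        if stalled_epochs >= patience:
            rate /= 5.0
            stalled_epochs = 0
        history.append(rate)
    return history


print([invscaling_rate(0.001, t) for t in (1, 4, 16)])
print(adaptive_rates(0.001, [1.0, 0.8, 0.8, 0.8, 0.7]))
```

In the second example the loss stalls for two epochs, so the rate drops to one fifth of its initial value and stays there once the loss starts decreasing again.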

Appendix B

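Table A1. Configuration sets of the model hyperparameters for the inner layer architecture for the MLP-ANN classifier.
| Legend | Activation | C | Learning Rate | Solver |
|---|---|---|---|---|
| 1 | identity | 0.0001 | constant | Adam |
| 2 | logistic | 0.0001 | constant | Adam |
| 3 | relu | 0.0001 | constant | Adam |
| 4 | tanh | 0.0001 | constant | Adam |
| 5 | identity | 0.05 | constant | Adam |
| 6 | logistic | 0.05 | constant | Adam |
| 7 | relu | 0.05 | constant | Adam |
| 8 | tanh | 0.05 | constant | Adam |
| 9 | identity | 1 | constant | Adam |
| 10 | logistic | 1 | constant | Adam |
| 11 | relu | 1 | constant | Adam |
| 12 | tanh | 1 | constant | Adam |
| 13 | identity | 0.0001 | constant | LBFGS |
| 14 | logistic | 0.0001 | constant | LBFGS |
| 15 | relu | 0.0001 | constant | LBFGS |
| 16 | tanh | 0.0001 | constant | LBFGS |
| 17 | identity | 0.05 | constant | LBFGS |
| 18 | logistic | 0.05 | constant | LBFGS |
| 19 | relu | 0.05 | constant | LBFGS |
| 20 | tanh | 0.05 | constant | LBFGS |
| 21 | identity | 1 | constant | LBFGS |
| 22 | logistic | 1 | constant | LBFGS |
| 23 | relu | 1 | constant | LBFGS |
| 24 | tanh | 1 | constant | LBFGS |
| 25 | identity | 0.0001 | adaptive | SGD |
| 26 | logistic | 0.0001 | adaptive | SGD |
| 27 | relu | 0.0001 | adaptive | SGD |
| 28 | tanh | 0.0001 | adaptive | SGD |
| 29 | identity | 0.05 | adaptive | SGD |
| 30 | logistic | 0.05 | adaptive | SGD |
| 31 | relu | 0.05 | adaptive | SGD |
| 32 | tanh | 0.05 | adaptive | SGD |
| 33 | identity | 1 | adaptive | SGD |
| 34 | logistic | 1 | adaptive | SGD |
| 35 | relu | 1 | adaptive | SGD |
| 36 | tanh | 1 | adaptive | SGD |
| 37 | identity | 0.0001 | constant | SGD |
| 38 | logistic | 0.0001 | constant | SGD |
| 39 | relu | 0.0001 | constant | SGD |
| 40 | tanh | 0.0001 | constant | SGD |
| 41 | identity | 0.05 | constant | SGD |
| 42 | logistic | 0.05 | constant | SGD |
| 43 | relu | 0.05 | constant | SGD |
| 44 | tanh | 0.05 | constant | SGD |
| 45 | identity | 1 | constant | SGD |
| 46 | logistic | 1 | constant | SGD |
| 47 | relu | 1 | constant | SGD |
| 48 | tanh | 1 | constant | SGD |
| 49 | identity | 0.0001 | invscaling | SGD |
| 50 | logistic | 0.0001 | invscaling | SGD |
| 51 | relu | 0.0001 | invscaling | SGD |
| 52 | tanh | 0.0001 | invscaling | SGD |
| 53 | identity | 0.05 | invscaling | SGD |
| 54 | logistic | 0.05 | invscaling | SGD |
| 55 | relu | 0.05 | invscaling | SGD |
| 56 | tanh | 0.05 | invscaling | SGD |
| 57 | identity | 1 | invscaling | SGD |
| 58 | logistic | 1 | invscaling | SGD |
| 59 | relu | 1 | invscaling | SGD |
| 60 | tanh | 1 | invscaling | SGD |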
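The 60 legend numbers in Table A1 appear to follow a regular pattern: five solver and learning-rate combinations, three values of C and four activation functions, enumerated with the activation changing fastest. Assuming that ordering, the configuration sets can be regenerated as a Cartesian product. The variable and function names below are hypothetical.

```python
from itertools import product

# Solver and learning-rate combinations in the order they appear in Table A1.
SOLVER_SCHEDULES = [
    ("Adam", "constant"),
    ("LBFGS", "constant"),
    ("SGD", "adaptive"),
    ("SGD", "constant"),
    ("SGD", "invscaling"),
]
C_VALUES = [0.0001, 0.05, 1]
ACTIVATIONS = ["identity", "logistic", "relu", "tanh"]


def build_configurations():
    """Map each legend number (1-60) to its hyperparameter configuration."""
    configurations = {}
    # itertools.product varies the last iterable fastest, i.e. the activation.
    grid = product(SOLVER_SCHEDULES, C_VALUES, ACTIVATIONS)
    for legend, ((solver, schedule), c, activation) in enumerate(grid, start=1):
        configurations[legend] = {
            "activation": activation,
            "C": c,
            "learning_rate": schedule,
            "solver": solver,
        }
    return configurations


configurations = build_configurations()
print(len(configurations))
print(configurations[31])
```

Looking up a legend number from a cross-validation figure then returns the activation, C, learning-rate schedule and solver that it denotes.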

References

  1. Klepeis, N.E.; Nelson, W.C.; Ott, W.R.; Robinson, J.P.; Tsang, A.M.; Switzer, P.; Behar, J.V.; Hern, S.C.; Engelmann, W.H. The National Human Activity Pattern Survey (NHAPS): A resource for assessing exposure to environmental pollutants. J. Expo. Sci. Environ. Epidemiol. 2001, 11, 231–252.
  2. Burroughs, H.E.; Hansen, S.J. Managing Indoor Air Quality; Fairmont Press: Lilburn, GA, USA, 2001.
  3. Brown, S.K. Indoor Air Quality. Australia: State of the Environment Technical Paper Series (Atmosphere); Department of the Environment, Sport and Territories: Canberra, Australia, 1997.
  4. Husman, T.M. The Health Protection Act, national guidelines for indoor air quality and development of the national indoor air programs in Finland. Environ. Health Perspect. 1999, 107 (Suppl. S3), 515–517.
  5. Azuma, K.; Uchiyama, I.; Ikeda, K. The regulations for indoor air pollution in Japan: A public health perspective. J. Risk Res. 2008, 11, 301–314.
  6. Aurola, R.; Valikyla, T. (Eds.) Guidelines for Healthy Housing; Ministry of Social Affairs and Health: Pori, Finland, 1997. (In Finnish)
  7. Ad-hoc-Arbeitsgruppe IRK-AGLMB. Guideline values for indoor air: General Scheme. Bundesgesundheitsblatt 1996, 39, 422–426. (In German)
  8. Meyers, R.A. Encyclopedia of Physical Science and Technology; Academic Press: San Diego, CA, USA, 2002.
  9. Schell, M.; Int-Hout, D. Demand Control Ventilation Using CO2. ASHRAE J. 2001, 43, 18–29.
  10. Hui, P.S.; Wong, L.T.; Mui, K.W. Feasibility study of an Express Assessment Protocol for the indoor air quality of air-conditioned offices. Indoor Built Environ. 2006, 15, 373–378.
  11. Wong, L.T.; Mui, K.W.; Hui, P.S. A statistical model for characterizing common air pollutants in air-conditioned offices. Atmos. Environ. 2006, 40, 4246–4257.
  12. Indoor Air Quality Management Group. Practice Note for Managing Air Quality in Air-Conditioned Public Transport Facilities; Environmental Protection Department: Hong Kong, China, 2003.
  13. Wong, L.T.; Mui, K.W.; Hui, P.S. Screening for indoor air quality of air-conditioned offices. Indoor Built Environ. 2007, 16, 438–443.
  14. Mui, K.W.; Hui, P.S.; Wong, L.T. Diagnostics of unsatisfactory indoor air quality in air-conditional workplaces. Indoor Built Environ. 2011, 20, 313–320.
  15. Wong, L.T.; Mui, K.W.; Tsang, T.W. Evaluation of indoor air quality screening strategies: A step-wise approach for IAQ screening. Int. J. Environ. Res. Public Health 2016, 13, 1240.
  16. WHO Regional Office for Europe. Air Quality Guidelines: Global Update 2005: Particulate Matter, Ozone, Nitrogen Dioxide and Sulfur Dioxide; World Health Organization Regional Office for Europe: Copenhagen, Denmark, 2006.
  17. WHO Regional Office for Europe. Review of Evidence on Health Aspects of Air Pollution—REVIHAAP Project: Final Technical Report; World Health Organization Regional Office for Europe: Copenhagen, Denmark, 2013.
  18. WHO Regional Office for Europe. Health Risks of Air Pollution in Europe—HRAPIE Project. Recommendations for Concentration–Response Functions for Cost–Benefit Analysis of Particulate Matter, Ozone and Nitrogen Dioxide; World Health Organization Regional Office for Europe: Copenhagen, Denmark, 2013.
  19. WHO Regional Office for Europe. Evolution of WHO Air Quality Guidelines: Past, Present and Future; World Health Organization Regional Office for Europe: Copenhagen, Denmark, 2017.
  20. WHO. WHO Global Air Quality Guidelines. Particulate Matter (PM2.5 and PM10), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide; World Health Organization: Geneva, Switzerland, 2021.
  21. Rybarczyk, Y.; Zalakeviciute, R. Machine learning approaches for outdoor air quality modelling: A systematic review. Appl. Sci. 2018, 8, 2570.
  22. Seyedzadeh, S.; Rahimian, F.; Glesk, I.; Roper, M. Machine learning for estimation of building energy consumption and performance: A review. Vis. Eng. 2018, 6, 5.
  23. Wei, W.; Ramalho, O.; Malingre, L.; Sivanantham, S.; Little, J.C.; Mandin, C. Machine learning and statistical models for predicting indoor air quality. Indoor Air 2019, 29, 704–726.
  24. Elbayoumi, M.; Ramli, N.A.; Fitri Md Yusof, N.F. Development and comparison of regression models and feedforward backpropagation neural network models to predict seasonal indoor PM2.5–10 and PM2.5 concentrations in naturally ventilated schools. Atmos. Pollut. Res. 2015, 6, 1013–1023.
  25. Yuchi, W.; Gombojav, E.; Boldbaatar, B.; Galsuren, J.; Enkhmaa, S.; Beejin, B.; Naidan, G.; Ochir, C.; Legtseg, B.; Byambaa, T.; et al. Evaluation of random forest regression and multiple linear regression for predicting indoor fine particulate matter concentrations in a highly polluted city. Environ. Pollut. 2019, 245, 746–753.
  26. Park, S.; Kim, M.; Kim, M.; Namgung, H.G.; Kim, K.T.; Cho, K.H.; Kwon, S.B. Predicting PM10 concentration in Seoul metropolitan subway stations using artificial neural network (ANN). J. Hazard. Mater. 2018, 341, 75–82.
  27. Skön, J.; Johansson, M.; Raatikainen, M.; Leiviskä, K.; Kolehmainen, M. Modelling indoor air carbon dioxide (CO2) concentration using neural network. World Acad. Sci. Eng. Technol. Int. Sci. Index. 2012, 6, 737–741.
  28. Khazaei, B.; Shiehbeigi, A.; Haji Molla Ali Kani, A.R. Modeling indoor air carbon dioxide concentration using artificial neural network. Int. J. Environ. Sci. Technol. 2019, 16, 729–736.
  29. Challoner, A.; Pilla, F.; Gill, L. Prediction of indoor air exposure from outdoor air quality using an artificial neural network model for inner city commercial buildings. Int. J. Environ. Res. Public Health 2015, 12, 15233–15253.
  30. Kropat, G.; Bochud, F.; Jaboyedoff, M.; Laedermann, J.P.; Murith, C.; Palacios, M. Improved predictive mapping of indoor radon concentrations using ensemble regression trees based on automatic clustering of geological units. J. Environ. Radioact. 2015, 147, 51–62.
  31. Kropat, G.; Bochud, F.; Jaboyedoff, M.; Laedermann, J.P.; Murith, C.; Gruson, M.P.; Baechler, S. Predictive analysis and mapping of indoor radon concentrations in a complex environment using kernel estimation: An application to Switzerland. Sci. Total Environ. 2015, 505, 137–148.
  32. Ahn, J.; Shin, D.; Kim, K.; Yang, J. Indoor air quality analysis using deep learning with sensor data. Sensors 2017, 17, 2476.
  33. Saini, J.; Dutta, M.; Marques, G. Indoor air quality prediction systems for smart environments: A systematic review. J. Ambient Intell. Smart Environ. 2020, 12, 433–453.
  34. Montgomery, D.C.; Jennings, C.L.; Kulahci, M. Introduction to Time Series Analysis and Forecasting; John Wiley & Sons: New York, NY, USA, 2008.
  35. Yu, T.C.; Lin, C.C. An intelligent wireless sensing and control system to improve indoor air quality: Monitoring, prediction, and preaction. Int. J. Distrib. Sens. Netw. 2015, 11, 140978.
  36. Han, Z.; Gao, R.X.; Fan, Z. Occupancy and indoor environment quality sensing for smart buildings. In Proceedings of the 2012 IEEE International Instrumentation and Measurement Technology Conference Proceedings, Congress Graz, Graz, Austria, 13–16 May 2012; IEEE: Piscataway, NJ, USA, 2012.
  37. Ouaret, R.; Ionescu, A.; Petrehus, V.; Candau, Y.; Ramalho, O. Spectral band decomposition combined with nonlinear models: Application to indoor formaldehyde concentration forecasting. Stoch. Environ. Res. Risk Assess. 2018, 32, 985–997.
  38. Zimmerman, N.; Presto, A.A.; Kumar, P.N.; Gu, J.; Hauryliuk, A.; Robinson, E.S.; Robinson, A.L.; Subramanian, R. A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring. Atmos. Meas. Tech. 2018, 11, 291–313.
  39. Leong, W.C.; Kelani, R.O.; Ahmad, Z. Prediction of air pollution index (API) using support vector machine (SVM). J. Environ. Chem. Eng. 2020, 8, 103208.
  40. Sarkhosh, M.; Najafpoor, A.A.; Alidadi, H.; Shamsara, J.; Amiri, H.; Andrea, T.; Kariminejad, F. Indoor Air Quality associations with sick building syndrome: An application of decision tree technology. Build. Environ. 2021, 188, 107446.
  41. Indoor Air Quality Management Group. A Guide on Indoor Air Quality Certification Scheme for Offices and Public Places; Hong Kong Environmental Protection Department, Government of the Hong Kong Special Administrative Region: Hong Kong, China, 2019.
  42. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  43. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320.
  44. Bzdok, D.; Altman, N.; Krzywinski, M. Statistics versus machine learning. Nat. Methods 2018, 15, 233–234.
  45. Pecha, M.; Horák, D. Analyzing l1-loss and l2-loss Support Vector Machines Implemented in PERMON Toolbox. In AETA 2018—Recent Advances in Electrical Engineering and Related Sciences: Theory and Application; Zelinka, I., Brandstetter, P., Trong Dao, T., Hoang Duy, V., Kim, S., Eds.; Springer: Cham, Switzerland, 2020; pp. 13–23.
  46. Adak, M.F.; Ercan, S. Identification of Indoor Harmful Gas to Human Respiratory System using Support Vector Machines. In Proceedings of the 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 1–13 October 2019; IEEE: Piscataway, NJ, USA, 2019.
  47. Zhang, L.; Tian, F.; Nie, H.; Dang, L.; Li, G.; Ye, Q.; Kadri, C. Classification of multiple indoor air contaminants by an electronic nose and a hybrid support vector machine. Sens. Actuators B Chem. 2012, 174, 114–125.
  48. Intan, P.K. Comparison of Kernel Function on Support Vector Machine in Classification of Childbirth. J. Mat. Mantik. 2019, 5, 90–99.
  49. Imandoust, S.B.; Bolandraftar, M. Application of k-nearest neighbor (knn) approach for predicting economic events: Theoretical background. Int. J. Eng. 2013, 3, 605–610.
  50. Schein, A.I.; Ungar, L.H. Active learning for logistic regression: An evaluation. Mach. Learn. 2007, 68, 235–265.
  51. Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; IEEE: Piscataway, NJ, USA, 1995.
  52. Bollapragada, R.; Nocedal, J.; Mudigere, D.; Shi, H.J.; Tang, P.T.P. A progressive batching L-BFGS method for machine learning. In Proceedings of the International Conference on Machine Learning, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018.
  53. Bottou, L. Stochastic gradient learning in neural networks. In Proceedings of the Neuro-Nîmes, Nimes, France, 12–16 November 1990; EC2: Nanterre, France, 1991.
  54. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Figure 1. Pair plots of CO2, RSP, and TVOC grouped by assessed indoor air quality (IAQ) against assessment (a) Scheme 1, (b) Scheme 2.
Figure 2. Data processing for model training and evaluation.
Figure 3. Cross-validation accuracy of the SVM classifier. (a) Linear kernel, (b) rbf kernel, (c) sigmoid kernel, c0 = 0.01, (d) sigmoid kernel, c0 = 0.5, (e) polynomial kernel, c0 = 0, c1 = 2, (f) polynomial kernel, c0 = 1, c1 = 3.
Figure 4. Cross-validation accuracy of the kNN classifier. (a) W = 1/dk, (b) W = 1.
Figure 5. Cross-validation accuracy of the logistic classifier.
Figure 6. Cross-validation accuracy of the decision tree classifier. (a) Entropy impurity, nr = 6, (b) Gini impurity, nr = 2.
Figure 7. Cross-validation accuracy of the random forest classifier. (a) Entropy impurity, ns = 9, nf = 10, (b) Gini impurity, ns = 9, nf = 110, (c) Gini impurity, ns = 2, nf = 110.
Figure 8. Cross-validation accuracy of the MLP-ANN classifier. (a) 100 neurons, 1 hidden layer, (b) 200 neurons, 1 hidden layer, (c) 100 neurons, 6 hidden layers, (d) 200 neurons, 6 hidden layers, (e) 100 neurons, 3 hidden layers.
Figure 9. Predicted IAQ satisfaction and dissatisfaction with an IAQ index with assessment criteria, (a) Scheme 1, (b) Scheme 2.
Table 1. 8 h exposure limits of satisfactory indoor air quality.
| Parameter (Unit) | Scheme 1 | Scheme 2 |
|---|---|---|
| CO2 (ppm) | 1000 | 1000 |
| CO (ppm) | 8.7 | 6.1 |
| RSP (μg m−3) | 180 | 100 |
| NO2 (μg m−3) | 150 | 150 |
| O3 (μg m−3) | 120 | 120 |
| HCHO (μg m−3) | 100 | 100 |
| TVOC (μg m−3) | 600 | 600 |
| Radon (Bq m−3) | 200 | 167 |
| Airborne bacteria (CFU m−3) | 1000 | 1000 |
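The nomenclature defines a fractional dose Φj* from an exposure level Φj and a reference exposure limit Φj,e, and an IAQ index θ built from the three surrogate parameters (CO2, RSP and TVOC). The sketch below assumes, without confirmation from this section, that Φj* = Φj/Φj,e and that θ is the arithmetic mean of the three fractional doses. Under the Scheme 1 limits this assumption reproduces the overall mean index of about 0.473 in Table 2 from the overall mean levels, but it should be read as an illustration rather than the authors' definition. All names are hypothetical.

```python
# Scheme 1 exposure limits for the three surrogate parameters (Table 1).
SCHEME_1_LIMITS = {"CO2": 1000.0, "RSP": 180.0, "TVOC": 600.0}


def fractional_doses(levels, limits):
    """Assumed fractional dose: exposure level divided by its reference limit."""
    return {name: levels[name] / limits[name] for name in limits}


def iaq_index(levels, limits):
    """Assumed IAQ index: arithmetic mean of the fractional doses."""
    doses = fractional_doses(levels, limits)
    return sum(doses.values()) / len(doses)


# Overall mean levels reported in Table 2: CO2 in ppm, RSP and TVOC in ug/m3.
mean_levels = {"CO2": 658, "RSP": 30, "TVOC": 358}
print(fractional_doses(mean_levels, SCHEME_1_LIMITS))
print(iaq_index(mean_levels, SCHEME_1_LIMITS))
```

Because the assumed index is linear in the measured levels, evaluating it at the mean levels gives the mean of the index, which is what makes the comparison with Table 2 possible.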
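Table 2. Statistical summary of levels of indoor air quality surrogate parameters in 525 offices, (a) overall summary; (b) summary of the dataset being classified as “Satisfactory IAQ” regarding Schemes 1 and 2; (c) summary of the dataset being classified as “Unsatisfactory IAQ” regarding Schemes 1 and 2.

(a) Overall Summary
| Statistic | CO2 (ppm) | RSP (μg m−3) | TVOC (μg m−3) | IAQ Index |
|---|---|---|---|---|
| mean | 658 | 30 | 358 | 0.473 |
| std dev | 151 | 20 | 328 | 0.201 |
| min | 339 | 4 | 0 | 0.189 |
| 25% | 556 | 15 | 140 | 0.333 |
| 50% | 639 | 22 | 295 | 0.431 |
| 75% | 746 | 38 | 466 | 0.558 |
| max | 1497 | 125 | 3144 | 1.99 |

(b) Satisfactory IAQ, Scheme 1 (Count = 358)
| Statistic | CO2 (ppm) | RSP (μg m−3) | TVOC (μg m−3) | IAQ Index |
|---|---|---|---|---|
| mean | 634 | 28 | 242 | 0.397 |
| std dev | 126 | 20 | 152 | 0.111 |
| min | 339 | 4 | 0 | 0.189 |
| 25% | 546 | 14 | 113 | 0.312 |
| 50% | 624 | 20 | 209 | 0.381 |
| 75% | 714 | 33 | 354 | 0.477 |
| max | 998 | 125 | 597 | 0.725 |

(b) Satisfactory IAQ, Scheme 2 (Count = 352)
| Statistic | CO2 (ppm) | RSP (μg m−3) | TVOC (μg m−3) | IAQ Index |
|---|---|---|---|---|
| mean | 634 | 27 | 240 | 0.394 |
| std dev | 126 | 18 | 152 | 0.110 |
| min | 339 | 4 | 0.0 | 0.189 |
| 25% | 547 | 14 | 112 | 0.311 |
| 50% | 623 | 20 | 208 | 0.378 |
| 75% | 713 | 32 | 354 | 0.474 |
| max | 998 | 99 | 597 | 0.725 |

(c) Unsatisfactory IAQ, Scheme 1 (Count = 167)
| Statistic | CO2 (ppm) | RSP (μg m−3) | TVOC (μg m−3) | IAQ Index |
|---|---|---|---|---|
| mean | 709 | 34 | 607 | 0.637 |
| std dev | 184 | 19 | 446 | 0.249 |
| min | 396 | 7 | 45 | 0.202 |
| 25% | 384 | 19 | 346 | 0.488 |
| 50% | 678 | 29 | 517 | 0.406 |
| 75% | 807 | 44 | 738 | 0.737 |
| max | 1497 | 91 | 3144 | 1.991 |

(c) Unsatisfactory IAQ, Scheme 2 (Count = 173)
| Statistic | CO2 (ppm) | RSP (μg m−3) | TVOC (μg m−3) | IAQ Index |
|---|---|---|---|---|
| mean | 707 | 36 | 598 | 0.634 |
| std dev | 183 | 22 | 442 | 0.246 |
| min | 396 | 7 | 45.0 | 0.202 |
| 25% | 583 | 19 | 338 | 0.487 |
| 50% | 678 | 29 | 497 | 0.603 |
| 75% | 804 | 46 | 715 | 0.725 |
| max | 1497 | 125 | 3144 | 1.991 |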
Table 3. Selected machine learning models and hyperparameters for the development of IAQ assessment models.
| Models | Hyper-Parameters | Test Range | Validation Accuracy | Test Accuracy | Hyperparameters Used |
|---|---|---|---|---|---|
| SVM (linear) | rd | 0.2–0.5 | 0.794–0.832 | 0.752–0.824 | 0.4 |
| | C | 0.1–10,000 | | | 1.0 |
| SVM (polynomial) | rd | 0.2–0.5 | 0.813–0.839 | 0.753–0.833 | 0.4 |
| | C | 0.1–10,000 | | | 1000 |
| | c1 | 2, 3 | | | 3 |
| | c0 | 0, 1 | | | 1 |
| SVM (rbf) | rd | 0.2–0.5 | 0.806–0.831 | 0.762–0.824 | 0.4 |
| | C | 0.1–10,000 | | | 1.0 |
| SVM (sigmoid) | rd | 0.2–0.5 | 0.638–0.652 | 0.443–0.800 | 0.2 |
| | C | 0.0001–2000 | | | 0.0001 |
| | c0 | 0–1 | | | 0 |
| kNN | rd | 0.2–0.5 | 0.785–0.809 | 0.762–0.824 | 0.4 |
| | k | 2, 3, …, 11 | | | 10 |
| | W | 1, 1/dk | | | 1 |
| Logistic regression | rd | 0.2–0.5 | 0.790–0.825 | 0.753–0.810 | 0.4 |
| | C | 0.001–20,000 | | | 1 |
| Decision tree | rd | 0.2–0.5 | 0.805–0.829 | 0.714–0.838 | 0.2 |
| | D | 3, 4, …, 14 | | | 4 |
| | ns | 3, 4, …, 19 | | | 3 |
| | nr | 2, 3, …, 6 | | | 2 |
| | Impurity | GI, EI | | | EI |
| Random forest | rd | 0.2–0.5 | 0.824–0.844 | 0.724–0.829 | 0.3 |
| | nf | 10, 60, 110 | | | 60 |
| | D | 1, 2, …, 11 | | | 2 |
| | ns | 1, 2, …, 9 | | | 3 |
| | nr | 2, 3, …, 6 | | | 1 |
| | Impurity | GI or EI | | | GI |
| MLP-ANN | rd | 0.2–0.5 | 0.807–0.836 | 0.714–0.810 | 0.4 |
| | C | 0.0001, 0.05, 1 | | | 0.0001 |
| | Neurons | 100, 200 | | | 200 |
| | Hidden layer | 1, 3, 4, 6 | | | 3 |
| | Activation | Identity, logistic, tanh, relu | | | relu |
| | Iteration | LBFGS, SGD, Adam | | | LBFGS |
| | Learning rate | Constant, invscaling, adaptive | | | Constant |
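Table 4. Test accuracy of the MLP-ANN classifier (5-fold and 10-fold).
| Activation | Iteration | Learning Rate | Mean | Median | Min | Max |
|---|---|---|---|---|---|---|
| identity | All | All | 0.740 | 0.795 | 0.336 | 0.836 |
| logistic | All | All | 0.636 | 0.646 | 0.348 | 0.828 |
| tanh | All | All | 0.728 | 0.783 | 0.348 | 0.836 |
| relu | All | All | 0.701 | 0.743 | 0.348 | 0.836 |
| all | ADAM | Constant | 0.765 | 0.801 | 0.638 | 0.832 |
| all | LBFGS | Constant | 0.767 | 0.802 | 0.638 | 0.836 |
| all | SGD | Adaptive | 0.712 | 0.648 | 0.638 | 0.828 |
| all | SGD | Constant | 0.712 | 0.648 | 0.638 | 0.836 |
| all | SGD | invscaling | 0.550 | 0.646 | 0.336 | 0.676 |
| identity | ADAM | constant | 0.801 | 0.806 | 0.641 | 0.832 |
| identity | LBFGS | constant | 0.805 | 0.806 | 0.778 | 0.824 |
| identity | SGD | adaptive | 0.758 | 0.793 | 0.638 | 0.828 |
| identity | SGD | constant | 0.758 | 0.791 | 0.638 | 0.836 |
| identity | SGD | invscaling | 0.579 | 0.646 | 0.336 | 0.668 |
| logistic | ADAM | constant | 0.667 | 0.646 | 0.638 | 0.828 |
| logistic | LBFGS | constant | 0.683 | 0.646 | 0.638 | 0.820 |
| logistic | SGD | adaptive | 0.646 | 0.646 | 0.638 | 0.652 |
| logistic | SGD | constant | 0.646 | 0.646 | 0.638 | 0.652 |
| logistic | SGD | invscaling | 0.536 | 0.646 | 0.348 | 0.652 |
| relu | ADAM | constant | 0.794 | 0.804 | 0.638 | 0.832 |
| relu | LBFGS | constant | 0.797 | 0.804 | 0.638 | 0.836 |
| relu | SGD | adaptive | 0.689 | 0.646 | 0.638 | 0.823 |
| relu | SGD | constant | 0.689 | 0.646 | 0.638 | 0.826 |
| relu | SGD | invscaling | 0.536 | 0.646 | 0.348 | 0.652 |
| tanh | ADAM | constant | 0.799 | 0.805 | 0.641 | 0.832 |
| tanh | LBFGS | constant | 0.782 | 0.772 | 0.702 | 0.824 |
| tanh | SGD | adaptive | 0.754 | 0.786 | 0.638 | 0.826 |
| tanh | SGD | constant | 0.755 | 0.786 | 0.638 | 0.836 |
| tanh | SGD | invscaling | 0.548 | 0.646 | 0.348 | 0.676 |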
Table 5. The most accurate classifiers in 32 comparison tests.
| Classifier | Trained Model: Count (N = 16) | Trained Model: Test Accuracy | Retrained Model: Count (N = 16) | Retrained Model: Test Accuracy | Trained & Retrained Models: Count (N = 32) | Trained & Retrained Models: Test Accuracy |
|---|---|---|---|---|---|---|
| SVM (linear) | 0 | | 8 | 0.811 | 8 | 0.811 |
| SVM (polynomial) | 6 | 0.820 | 6 | 0.816 | 12 | 0.818 |
| SVM (rbf) | 0 | | 2 | 0.814 | 2 | 0.814 |
| SVM (sigmoid) | 0 | | 0 | | 0 | |
| kNN | 2 | 0.807 | 0 | | 2 | 0.807 |
| Logistic regression | 0 | | 0 | | 0 | |
| Decision tree | 4 | 0.814 | 0 | | 4 | 0.814 |
| Random forest | 3 | 0.819 | 0 | | 3 | 0.819 |
| MLP-ANN | 1 | 0.810 | 0 | | 1 | 0.810 |
Table 6. IAQ index of air-conditioned offices in Hong Kong.
| IAQ Index θ | Likelihood Ratio (Scheme 1), LR1 | Relative Impact, r2,1 | Likelihood Ratio (Scheme 2), LR2 |
|---|---|---|---|
| <0.32 | 0.1 | 1.4 | 0.1 |
| 0.32–0.42 | 0.4 | 1.2 | 0.5 |
| 0.43–0.53 | 0.8 | 1.1 | 0.9 |
| 0.54–0.64 | 1.7 | 1.3 | 2.2 |
| ≥0.65 | 25 | 1.5 | 38 |
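The values in Table 6 are consistent, up to rounding of the published figures, with the Scheme 2 likelihood ratio being the Scheme 1 likelihood ratio scaled by the relative impact ratio, that is LR2 ≈ r2,1 × LR1. The sketch below checks this reading against the tabulated values. The relation is an inference from the table, not a formula quoted from this section, and the names are hypothetical.

```python
# Rows of Table 6: (IAQ index band, LR1, relative impact r2,1, reported LR2).
TABLE_6 = [
    ("<0.32", 0.1, 1.4, 0.1),
    ("0.32-0.42", 0.4, 1.2, 0.5),
    ("0.43-0.53", 0.8, 1.1, 0.9),
    ("0.54-0.64", 1.7, 1.3, 2.2),
    (">=0.65", 25, 1.5, 38),
]


def updated_likelihood_ratio(lr_old, relative_impact):
    """Assumed update rule: scale the old likelihood ratio by the relative impact."""
    return lr_old * relative_impact


for band, lr1, r21, lr2_reported in TABLE_6:
    lr2_estimated = updated_likelihood_ratio(lr1, r21)
    print(band, "estimated:", round(lr2_estimated, 2), "reported:", lr2_reported)
```

Under this reading, only the relative impact ratio for each index band needs to be determined when a standard changes; the screening levels of the existing model are then rescaled rather than rebuilt.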
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Wong, L.-T.; Mui, K.-W.; Tsang, T.-W. Updating Indoor Air Quality (IAQ) Assessment Screening Levels with Machine Learning Models. Int. J. Environ. Res. Public Health 2022, 19, 5724. https://doi.org/10.3390/ijerph19095724
