In this section, we present two applications to real datasets of environmental pollution from Santiago, Chile and Lima, Peru.
5.1. Chile Air Pollution
In this application, data collected from the Pudahuel MACAM during the year 2015 in the CEM period (1 April 2015 to 31 August 2015) were used. Meteorological and air pollutant data for this station were obtained from the National Air Quality Information System (SINCA in Spanish) website of the Chilean MMA, which provides air quality data for the entire country (http://sinca.mma.gob.cl, accessed on 15 June 2020). The Pudahuel station registered the highest PM concentrations during 2015. It is the most influential monitoring station in Santiago, informing administrative decisions based on predicted critical episodes; see [32].
The explanatory variables used in the PVCM are (i) the maximum level of PM$_{10}$ in $\mu$g/Nm$^3$ (PM$_{10}$); (ii) the average wind speed in meters per second (WIND); (iii) the average relative humidity as a percentage (RH); and (iv) the average temperature in degrees Celsius (TEMP). The response variable considered is the maximum level of PM$_{2.5}$ in $\mu$g/Nm$^3$ (PM$_{2.5}$). We started our study with an exploratory analysis of the response variable, PM$_{2.5}$.
Table 1 reports a descriptive summary of the data, including maximum, minimum, range, mean, median, standard deviation (SD), coefficient of variation (CV), and kurtosis (CK) for the response variable. The primary air quality regulation for PM$_{2.5}$ is 50 $\mu$g/Nm$^3$ at the 24 h level. According to Table 1, the response variable exceeds this primary air quality regulation.
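The summary measures listed for Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `describe` and `days_above` are hypothetical, the kurtosis is the moment-based version (equal to 3 for a normal distribution), which may differ from the definition used for Table 1, and the input values are made up rather than taken from the Pudahuel series.

```python
import statistics

def describe(values):
    """Summary statistics of the kind listed for Table 1 (assumed definitions)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)                      # sample SD (n - 1 denominator)
    m2 = sum((v - mean) ** 2 for v in values) / n      # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n      # fourth central moment
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "cv": 100.0 * sd / mean,                       # coefficient of variation in %
        "kurtosis": m4 / m2 ** 2,                      # moment-based, not excess kurtosis
    }

def days_above(values, limit):
    """Number of daily values strictly above a regulatory limit."""
    return sum(1 for v in values if v > limit)

# Hypothetical daily maxima in micrograms per normal cubic meter (illustration only).
toy = [22.0, 35.0, 48.0, 51.0, 64.0, 120.0]
summary = describe(toy)
exceed = days_above(toy, 50.0)   # 50 is the 24 h limit quoted in the text
```

Counting exceedances alongside the summary makes the comparison with the 24 h regulation explicit instead of reading it off the maximum alone.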
Figure 1a shows a boxplot with identification of possible atypical data. In this figure, observations {65, 73, 74, 75} are highlighted as possibly atypical, indicating the need to use distributions with heavy tails. Figure 1b shows a correlation plot of the explanatory variables and the response variable. It reveals a high positive correlation between PM$_{2.5}$ and PM$_{10}$ (correlation coefficient 0.82), while the other explanatory variables show moderate or low correlation with the response variable PM$_{2.5}$, i.e., WIND (−0.53), TEMP (−0.22), and RH (−0.49).
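The coefficients quoted from the correlation plot are presumably Pearson correlations; that is an assumption, since the text does not name the coefficient. A minimal sketch of the computation, with a hypothetical function name and made-up inputs:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values: a perfectly increasing and a perfectly decreasing relationship.
r_up = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_down = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

A coefficient only measures linear association, which is why the scatter plots are still needed to detect the nonlinear relationship with WIND.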
Figure 2 shows scatter plots of the explanatory variables, the response variable, and possible interactions between explanatory variables. The relationship between PM$_{2.5}$ and the explanatory variable PM$_{10}$ is linear (Figure 2a), while the relationship between PM$_{2.5}$ and WIND is nonlinear (Figure 2b). In addition, Figure 2c,d suggests that the RH and TEMP explanatory variables could be interacting with the WIND variable in a nonlinear way.
These trends suggest a PVCM between PM$_{2.5}$ and the explanatory variables. Specifically, we assume the following model:

$$y_i = \alpha x_i + z_{i1}\beta_1(r_i) + z_{i2}\beta_2(r_i) + \epsilon_i, \quad i = 1, \ldots, n,$$

where $y_i$ denotes the response value associated with the $i$-th PM$_{2.5}$ level, $x_i$ is the $i$-th PM$_{10}$ level, $\beta_1(\cdot)$ and $\beta_2(\cdot)$ are unknown smooth arbitrary functions of the explanatory variable $r_i$ (WIND) associated with the explanatory variables $z_{i1}$ and $z_{i2}$ (RH and TEMP), and $\epsilon_i$ is a random error that follows Student's $t$-distribution. To verify the distributional assumption established in the model, we produced a quantile-quantile (QQ) plot of the standardized residuals. Figure 3a shows the good fit of the Student-$t$ PVCM. Figure 3b identifies observations {13, 32, 33, 37, 65, 73, 74, 75, 145} as possible outliers. We apply the procedure described in Section 3.3 to select the smoothing parameters. Subsequently, we use the AIC method to select the value of the degrees of freedom $\nu$ that maximizes the penalized log-likelihood function for the Student-$t$ PVCM. For this, a grid of values of $\nu$ was considered, and the value that maximizes the penalized log-likelihood function was taken as optimal.
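The grid search over the degrees of freedom can be illustrated on a much simpler model. The sketch below is not the PVCM fit: it replaces it with a plain Student-t location-scale model fitted by EM at each candidate value, with no penalty term, and then keeps the grid value with the largest log-likelihood. The function names, the grid, and the sample are all hypothetical.

```python
import math

def t_loglik(y, mu, phi, nu):
    """Student-t log-likelihood with location mu, squared scale phi, and nu degrees of freedom."""
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(math.pi * nu * phi))
    return sum(const - 0.5 * (nu + 1) * math.log1p((v - mu) ** 2 / (nu * phi))
               for v in y)

def fit_location_scale(y, nu, iters=200):
    """EM updates for mu and phi with nu held fixed."""
    n = len(y)
    mu = sum(y) / n
    phi = sum((v - mu) ** 2 for v in y) / n
    for _ in range(iters):
        # E-step: weights shrink for observations far from the current location.
        w = [(nu + 1) / (nu + (v - mu) ** 2 / phi) for v in y]
        # M-step: weighted mean, then weighted squared scale.
        mu = sum(wi * v for wi, v in zip(w, y)) / sum(w)
        phi = sum(wi * (v - mu) ** 2 for wi, v in zip(w, y)) / n
    return mu, phi

def select_nu(y, grid):
    """Return (nu, loglik) for the grid value with the largest log-likelihood."""
    scored = [(nu, t_loglik(y, *fit_location_scale(y, nu), nu)) for nu in grid]
    return max(scored, key=lambda pair: pair[1])

# Hypothetical sample with one gross outlier; not the pollution data.
sample = [48.0, 52.0, 50.0, 47.0, 53.0, 49.0, 51.0, 50.5, 49.5, 120.0]
grid = [2, 3, 4, 5, 8, 12, 30]   # illustrative grid, not the one used in the paper
nu_hat, ll_hat = select_nu(sample, grid)
```

In this toy setting every grid value uses the same number of parameters, so ranking by AIC and ranking by log-likelihood coincide; in the PVCM the comparison is made on the penalized log-likelihood instead.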
Table 2 shows the parameter estimates, the penalized log-likelihood evaluated at the MPLEs, SE estimates, and AIC value.
Figure 4 shows the estimated functions and their corresponding confidence bands based on the approximate SE (dashed curves). These plots suggest that the estimated functions vary with the explanatory variable WIND.
Figure 5 shows a plot of the observed versus predicted PM$_{2.5}$ values for the Student-$t$ PVCM. The predictions are good in the sense that the Student-$t$ PVCM follows the overall trend of the observed PM$_{2.5}$ levels.
Figure 6 displays index plots of the Mahalanobis distance under the Student-$t$ PVCM (Figure 6a) and of the estimated weights of the Student-$t$ PVCM (Figure 6b). In these figures, possible outliers and/or influential values under the fitted model can be observed. In Figure 6a, cases {13, 37, 65, 73, 74, 82} are highlighted as possible outliers. Figure 6b shows that the estimated weights for these observations take the smallest values, confirming the robustness of the MPLEs against outlying observations under the Student-$t$ PVCM. The cases detected in Figure 6a correspond to 20 April, 14 May, 11 June, 19 June, 20 June, and 28 June, respectively.
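The link between a large Mahalanobis distance and a small weight can be made concrete with the usual Student-t weight function for a univariate response, namely (nu + 1) / (nu + distance). The sketch below assumes that form, which is standard for Student-t regression but is not spelled out in this section; the function name and all numbers are hypothetical.

```python
def distances_and_weights(y, fitted, phi, nu):
    """Squared standardized distances and the corresponding Student-t weights."""
    deltas = [(yi - mi) ** 2 / phi for yi, mi in zip(y, fitted)]
    weights = [(nu + 1) / (nu + d) for d in deltas]
    return deltas, weights

# Made-up response and fitted values; the third observation is far from its fit.
y_obs = [10.0, 12.0, 40.0]
y_fit = [11.0, 11.0, 11.0]
deltas, weights = distances_and_weights(y_obs, y_fit, phi=4.0, nu=4.0)
```

Because the weight decreases monotonically in the distance, the cases flagged in the distance plot are exactly the ones that receive the smallest weights, which is the downweighting mechanism behind the robustness of the estimates.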
Local influence analysis detects the effect of small perturbations on parameter estimation. To identify possible influential cases under the fitted model, diagnostic plots for local influence are presented below. We present index plots of the influence measure for each of the model estimates. In this application, we use a cutoff line to determine whether an observation is influential or not. In addition, we present a confirmatory analysis that examines how the estimates behave when the observations highlighted as possibly influential under local influence techniques are eliminated.
In Figure 7a, the observations highlighted as influential under case-weight perturbation for the parametric component of the Student-$t$ PVCM are {20, 70}, which were registered on 27 April and 16 June, while in Figure 7b the observations highlighted as influential for the first smooth function are {58, 70, 75}, which were registered on 4 June, 16 June, and 21 June. In Figure 7c, the observations highlighted as influential for the Student-$t$ PVCM are {58, 75}, which were registered on 4 June and 21 June. In Figure 7d, the observations highlighted as influential for the Student-$t$ PVCM are {33, 75}, which were registered on 10 May and 21 June.
Next, we analyze how the parameter estimates behave when the explanatory variable PM$_{10}$ is perturbed. In Figure 8a–d, no observations are highlighted as influential for any of the four estimates under the Student-$t$ PVCM.
Considering the results obtained from the local influence plots, we note that under explanatory variable perturbation the estimates are less sensitive for small degrees of freedom.
Now, we address the relative changes (RC, in %) of the parameter estimates when the observations highlighted as possible outliers and/or influential data in the local influence plots are removed. From the above, the highlighted observations are collected in a set $I$ to be eliminated. The relative change of each estimated parameter is obtained using

$$\mathrm{RC}_{\theta_j} = \left| \frac{\hat{\theta}_j - \hat{\theta}_{j(I)}}{\hat{\theta}_j} \right| \times 100\%,$$

where $\hat{\theta}_j$ and $\hat{\theta}_{j(I)}$ denote the MPLE of $\theta_j$ and the MPLE of $\theta_j$ after removing the observations indexed by $I$, respectively. The results obtained for set $I$ are displayed in Table 3.
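As a small worked illustration of the relative change, the following sketch computes the percentage change between an estimate from the full data and the estimate after deleting a set of cases. It assumes the usual absolute-ratio definition of the RC; the function name and the numbers are hypothetical and are not taken from Table 3.

```python
def relative_change(full_estimate, reduced_estimate):
    """Relative change (in %) of an estimate after removing a set of observations."""
    return abs((full_estimate - reduced_estimate) / full_estimate) * 100.0

# Hypothetical estimates: 2.0 with all observations, 1.8 after deleting the flagged cases.
rc = relative_change(2.0, 1.8)   # approximately 10 (percent)
```

A large RC signals that the estimate depends noticeably on the deleted cases, but it does not by itself imply an inferential change, which is why the p-values are checked as well.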
Note that in the local influence analysis the observations detected as possibly influential in the parametric component are not necessarily detected in the nonparametric component. For example, under case-weight perturbation, observations #20 and #70 were detected as potentially influential for the parametric component. However, of these two observations only #70 is indicated as possibly influential in the nonparametric component for the first smooth function. In Table 3, the individual elimination of observations #20 and #75, identified as potentially influential cases, produces relative changes of 10% and 6.7%, respectively. On these days, 27 April and 21 June, high concentrations of PM$_{2.5}$ and PM$_{10}$ were recorded, being higher for observation #75, while the wind speed for observation #75 was very close to the minimum recorded throughout the period. The elimination of set $I$, whose observations were detected as potentially influential in both the nonparametric and parametric components, generates substantial changes in two of the estimates, on the order of 22% and 33%.
In Santiago, Chile, according to the MMA, an environmental alert was declared on the day of observation #58 and a pre-emergency on the days of observations #70 and #75. Thus, there appears to be a relationship between the official air quality alerts and the influential observations detected by our model.
In addition, Table 3 shows that even though some RC values are large, inferential changes are not observed (i.e., p-values remain below 0.01). Note that the elimination of the observations highlighted in the diagnostic plots causes larger changes in the parameter estimates. Thus, the well-known robust aspects of the maximum likelihood estimates from Student-$t$ models do not necessarily extend to other perturbation schemes, indicating the need for diagnostic examination in each case.
5.2. Lima Air Pollution
In this application, the dataset covers a period of two years (from 1 January 2017 to 31 December 2018) and includes PM$_{10}$ ($\mu$g/Nm$^3$) concentrations recorded by year, month, day, and hour, together with ambient temperature in degrees Celsius, relative air humidity in percent, and wind speed in meters per second. This dataset is based on data from five air quality monitoring stations of the SENAMHI: (i) Ate (ATE), (ii) Jesús María (CDM), (iii) Carabayllo (CRB), (iv) Huachipa (HCH), and (v) San Martín de Porres (SMP), with two located in North Lima, two in East Lima, and one in Central Lima; see [4,33]. The primary air quality regulation for PM$_{10}$ is 100 $\mu$g/m$^3$ at the 24 h level; see [34].
For this illustration, we use validated datasets from 2017 and 2018 during the CEM period, provided by SENAMHI. This allows us to obtain valid results. We analyze only the HCH monitoring station, as it is the station that presents the most critical pollution levels in Lima.
In this illustration, we consider the following explanatory variables: (i) maximum level of PM$_{10}$ in $\mu$g/Nm$^3$ (MAXPM$_{10}$); (ii) maximum wind speed in meters per second (WIND); (iii) minimum temperature in degrees Celsius (TEMP); and (iv) maximum relative humidity in percent (RH). For these data, the response variable considered is the average PM$_{10}$ concentration in $\mu$g/Nm$^3$ (PM$_{10}$).
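Since the raw records are indexed by hour while the model variables are averages and maxima, some aggregation step is needed. The sketch below shows one way to reduce hourly readings to a daily mean and a daily maximum. It assumes that the aggregation is by calendar day and that missing readings are simply skipped, neither of which is stated in the text; the function name and the records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def daily_mean_and_max(records):
    """Map each calendar day to (mean, max) of its available hourly values."""
    by_day = defaultdict(list)
    for timestamp, value in records:
        if value is not None:                 # skip missing hourly readings
            by_day[timestamp.date()].append(value)
    return {day: (sum(vals) / len(vals), max(vals))
            for day, vals in sorted(by_day.items())}

# Made-up hourly readings for two days, including one missing value.
records = [
    (datetime(2017, 4, 1, 0), 80.0),
    (datetime(2017, 4, 1, 1), 100.0),
    (datetime(2017, 4, 1, 2), None),
    (datetime(2017, 4, 2, 0), 60.0),
    (datetime(2017, 4, 2, 1), 90.0),
    (datetime(2017, 4, 2, 2), 120.0),
]
daily = daily_mean_and_max(records)
first_day = daily[datetime(2017, 4, 1).date()]
second_day = daily[datetime(2017, 4, 2).date()]
```

Keeping the mean and the maximum together per day mirrors the pairing of the response (average concentration) with the explanatory variable built from the maximum.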
Table 4 provides descriptive statistics for PM$_{10}$ levels. According to this table, the data at HCH follow an empirical distribution with heavy tails. PM$_{10}$ concentration levels are high, with great variability. These critical pollution levels occur at the HCH monitoring station due to the intense activity of the automotive fleet, and emissions from factories increase them further [4,33].
In Figure 9a, observation #212 has been highlighted as possibly atypical. Figure 9b shows a correlation plot of the explanatory variables and the response variable, revealing a high positive correlation between PM$_{10}$ and MAXPM$_{10}$ (correlation coefficient 0.91), while the other explanatory variables show moderate or low correlation with the response variable PM$_{10}$, i.e., WIND (0.36), RH (−0.52), and TEMP (0.47).
Figure 10 shows scatter plots of the explanatory variables, the response variable, and possible interactions between explanatory variables. For example, the relationship between PM$_{10}$ and the explanatory variable MAXPM$_{10}$ is linear (Figure 10a), while the relationship between PM$_{10}$ and WIND is nonlinear (Figure 10b). In addition, Figure 10c,d suggests that the TEMP and RH explanatory variables could be interacting with the WIND variable in a nonlinear way.
Based on the trends observed in the previous graphs, a PVCM between PM$_{10}$ and the explanatory variables is suggested. Specifically, we assume that

$$y_i = \alpha x_i + z_{i1}\beta_1(r_i) + z_{i2}\beta_2(r_i) + \epsilon_i, \quad i = 1, \ldots, n,$$

where $y_i$ denotes the PM$_{10}$ concentration, $x_i$ denotes the maximum PM$_{10}$ concentration, $z_{i1}$ is the minimum temperature, $z_{i2}$ is the maximum relative air humidity, and $r_i$ is the maximum wind speed for the $i$-th experimental unit, $\beta_1(\cdot)$ and $\beta_2(\cdot)$ are unknown functions, and the $\epsilon_i$ are independent random errors that follow a Student's $t$-distribution with $\nu$ degrees of freedom. In Figure 11a, which shows the QQ plot for the standardized residuals, an adequate fit of the Student-$t$ PVCM is observed, whereas Figure 11b identifies the observations {39, 51, 54, 155, 156, 160, 165, 167, 178, 188, 211} as possible outliers.
Table 5 shows the parameter estimates, the penalized log-likelihood evaluated at the MPLEs, SE estimates, and AIC value.
The estimated smooth functions and their corresponding confidence bands are shown in Figure 12. Both graphs confirm the nonlinear trend noted in the exploratory analysis of the data. In other words, both smooth functions vary with the explanatory variable WIND.
Figure 13 shows the plot of the observed PM$_{10}$ versus the values predicted by the Student-$t$ PVCM. From the trend observed in the graph, the estimates obtained under the model appear adequate, as they generate reasonable estimated mean values.
Figure 14 displays index plots of the Mahalanobis distance under the Student-$t$ PVCM (Figure 14a) and of the estimated weights of the Student-$t$ PVCM (Figure 14b). In these figures, we can see possible outliers and/or influential values under the fitted model. In Figure 14a, cases {39, 51, 54, 155, 156, 160, 165, 167, 178, 188, 211} are highlighted as possible outliers. Figure 14b shows that the estimated weights for these observations take the smallest values, confirming the robustness of the MPLEs against outlying observations under the Student-$t$ PVCM. The cases detected in Figure 14a correspond to 9 May 2017, 21 May 2017, 24 May 2017, 2 April 2018, 3 April 2018, 7 April 2018, 12 April 2018, 14 April 2018, 25 April 2018, 5 May 2018, and 28 May 2018, respectively.
In Figure 15a, the observations highlighted as influential under case-weight perturbation for the parametric component are {167, 171, 178, 181, 212}, which were registered on 14 April 2018, 18 April 2018, 25 April 2018, 28 April 2018, and 29 May 2018, while in Figure 15b the highlighted observations are {17, 29, 30}, which were registered on 17 April 2017, 29 April 2017, and 30 April 2017. In Figure 15c, the highlighted observations are {29, 30}, which were registered on 29 April 2017 and 30 April 2017. In Figure 15d, the highlighted observations are {51, 54, 160, 165}, which were registered on 21 May 2017, 24 May 2017, 7 April 2018, and 12 April 2018.
Now, we analyze how the estimators behave when the explanatory variable MAXPM$_{10}$ is perturbed. In Figure 16a–d, no observations are highlighted as influential for any of the four estimates.
Considering the results obtained from the local influence plots, we note that under explanatory variable perturbation the estimates are less sensitive for small degrees of freedom of Student's $t$-distribution. This robust aspect of the estimators was observed in the previous application as well.
Here, we address the RC (in %) of the parameter estimates when the observations highlighted as possible outliers and/or influential data in the local influence plots are removed. The results obtained for set $I$ are displayed in Table 6.
As in the previous application, this influence analysis shows that the influential data in the parametric part are not necessarily the same as in the nonparametric component. To illustrate this, under the case-weight perturbation scheme, the observations {167, 171, 178, 181, 212} are detected as influencing the parametric component and not the nonparametric one. In Table 6, we note that the individual removal of observations #167 and #178, identified as potentially influential cases, produces relative changes on the order of 7.53% and 8.42%, respectively. These correspond to 14 April 2018 and 25 April 2018 of the CEM period, respectively. A high concentration of PM$_{10}$ and a high wind speed were recorded for #167, while observation #212 corresponds to the maximum PM$_{10}$ recorded, which was detected as atypical in the exploratory analysis (Figure 9a). Finally, the elimination of the set of observations I = {165, 167, 171, 178, 181, 212}, which were detected as potentially influential in both the nonparametric and parametric components, leads to substantial changes in the MPLEs of two of the parameters, on the order of 23.1% and 16.3%, respectively.
In summary, the diagnostic analysis based on the local influence method and residuals confirms that the proposed model is suitable for modeling pollution data, even if there are outliers and potentially influential observations.