Next Article in Journal
Improving Air Transportation by Using the Fuzzy Origin–Destination Matrix
Previous Article in Journal
On New Classes of Stancu-Kantorovich-Type Operators
Previous Article in Special Issue
The Influence of Latent and Chronic Infection on Pathogen Persistence
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

COVID-19 Data Imputation by Multiple Function-on-Function Principal Component Regression

by
Christian Acal
,
Manuel Escabias
,
Ana M. Aguilera
*,† and
Mariano J. Valderrama
Department of Statistics and O.R. and IMAG, University of Granada, 18071 Granada, Spain
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2021, 9(11), 1237; https://doi.org/10.3390/math9111237
Submission received: 21 April 2021 / Revised: 18 May 2021 / Accepted: 24 May 2021 / Published: 28 May 2021
(This article belongs to the Special Issue Statistical Methods for the Analysis of Infectious Diseases)

Abstract

:
The aim of this paper is the imputation of missing data of COVID-19 hospitalized and intensive care curves in several Spanish regions. Taking into account that the curves of cases, deceases and recovered people are completely observed, a function-on-function regression model is proposed to estimate the missing values of the functional responses associated with hospitalized and intensive care curves. The estimation of the functional coefficient model in terms of principal components’ regression with the completely observed data provides a prediction equation for the imputation of the unobserved data for the response. An application with data from the first wave of COVID-19 in Spain is developed after properly homogenizing, registering and smoothing the data in a common interval so that the observed curves become comparable. Finally, Canonical Correlation Analysis is performed on the functional principal components to interpret the relationship between hospital occupancy rate and illness response variables.

1. Introduction

The virus SARS-CoV-2 has been the main global concern ever since its start, at the end of 2019 in China. Its rapid propagation has put all areas of society on alert, not only the field of medicine. Nevertheless, a year and half after the beginning of the pandemic, the virus incidence has not seemed to decrease and the number of deaths continues its upward trend throughout the world. To obtain some idea of extremely negative impact of the pandemic, Coronavirus Disease (COVID-19) has caused a total of 2,780,266 deaths over the planet as of 28 March 2021, according to the real-time database developed by Johns Hopkins University [1]. Another crucial topic derived from the illness is the economic crisis which has devastated all countries. For instance, the unemployment rate is up 5.1% in last three months of 2020 in the UK, according to official data.
In order to combat this terrible situation, there is a great need to understand the development of the pandemic. Knowing its behaviour will enable correct decision making to mitigate the spread of the virus and to restore people’s daily lives as soon as possible. To do so, the scientific community is focusing all its efforts on developing new techniques, capable of modelling and predicting the evolution of COVID-19. The main variables of interest that gauge how the epidemiological situation stands in a country are the number of positive, recovered and deceased cases. Another important indicator is the number of people who are hospitalized or in intensive care units. From a mathematical perspective, many authors have already attempted to tackle these variables from different statistical perspectives. A new Bayesian indicator is introduced in [2] to forecast the beginning of a new wave. In [3], semi-empirical models based on the logistic map are considered in order to predict the variables in different phases of the pandemic in Spain. Likewise, Ref. [4] apply SIR models to analyse the trend of the disease over the world and, more specifically, in India. These variables are also addressed from the time series design by considering quasi-Poisson regression and two-piece scale mixture normal distribution when there is a lack of symmetry in the error’s distribution in [5,6], respectively. Additionally, Ref. [7] make an exhaustive comparison of five deep learning methods to forecast the number of new cases and recovered cases in Italy, Spain, France, China, the USA and Australia. Regarding the role of the environmental conditions in the evolution of the illness, Ref. [8] study whether the number of cases in China is connected with the daily average temperature and relative humidity through a generalized additive model. On this point, Ref. [9] show how the choice of the spatio-temporal model may affect the relationship between the spread of the virus and certain environmental conditions. Information theory metrics are also used to understand how time series associated with the pandemic are interconnected or causally related to each other [10]. In addition, how the incubation period distribution could vary by age and gender is investigated in [11]. On the other hand, a new family of distributions is introduced in [12] to model daily cases and deaths in Egypt and Saudi Arabia.
Taking the nature of the variables of interest into account, an approach based on Functional Data Analysis (FDA) is proposed in the current paper for data imputation. FDA is a modern branch of statistics that aims to analyse the information coming from curves or functions that evolve over time, space or other continuous arguments. Under this definition, it is clear that the number of COVID-19 positive, recovered, deceased, hospitalized and intensive care cases come from the observation of functional variables. FDA is usually applied in many areas of knowledge, such as Biosciences, Environment, Economy, Chemometrics and Electronics. A detailed review of the most important FDA methodologies, applications and computational aspects can be seen in books [13,14,15,16,17]. In this regard, some works have been developed, focused on revealing complex patterns of COVID-19 illness from an FDA viewpoint. Functional Principal Component Analysis (FPCA) and functional time series approaches based on dynamic FPCA are applied in [18] to explain variability and predict COVID-19 confirmed and death cases in the United States. On the other hand, a new Varimax rotation approach to FPCA is introduced in [19] to better interpret the main modes of variability in COVID-19 confirmed cases in the first wave in Spain. Time-varying FDA methods, to model the cumulative COVID-19 curves of cases by pooling data across countries, are applied in [20]. A multivariate FDA approach was also considered for spatio-temporal prediction of COVID-19 mortality counts in Spain [21].
All statistical models require complete and high-quality data to be able to provide accurate predictions, but, unfortunately, neither of these aspects are normally fulfilled during a pandemic. In the first wave of COVID-19 in Spain, a change in the way of recording data in some Autonomous Communities produced incomplete data in hospitalized and intensive care curves. In this paper, a functional linear regression model is proposed for the imputation of these missing curves, so that complete data are available to estimate the predictive models with guarantees. Although there are many works related to the imputation of multivariate data [22,23], there is a lot to be done in the functional framework. A novel approach for multiple imputation based on functional mixed effects models was proposed by [24] in a longitudinal data context. Different solutions to scalar-on-function regression with missing observations in the response are considered in [25,26,27,28,29]. Additionally, an extension to multiple functional regression imputation that handles both scalar and functional response variables related to EEG data is proposed in [30]. Likewise, different FDA imputation methods under sparse and irregular functional data settings are performed in [31]. The extension of the function-on-function linear regression (FFLR) model [32,33,34] to the case of multiple functional predictors is proposed in this paper to estimate the curves of hospitalized and intensive care people (functional responses) from the curves of confirmed, deceased and recovered cases (functional predictors).
In addition to this introduction, the manuscript scheme consists of a description of the data where the process of the homogenization, registration and smoothing of the sample curves is detailed in order to make them comparable (Section 2). The theoretical framework of multiple function-on-function linear regression and the imputation procedure based on principal components regression appear in Section 3. An application to COVID-19 data in the Spanish Autonomous Communities during the first wave of the pandemic is developed in Section 4. Finally, Section 5 contains a discussion about the results obtained throughout this paper.

2. Data Homogenization, Registration and Smoothing

Spain is organized administratively in autonomous communities (ACs) or territorial governments that have transferred health affairs. This territorial organization consists of 17 ACs plus two autonomous cities (Ceuta and Melilla) located on the African continent, and which have been excluded from the analyses presented here because they have no exclusive competences in the organization of health care assumed by the Spanish government. The 17 ACs are, in alphabetical order: Andalucía, Aragón, Asturias, Islas Baleares, Islas Canarias, Cantabria, Castilla La Mancha, Castilla León, Cataluña, Extremadura, Galicia, Madrid, Murcia, Navarra, País Vasco, La Rioja and Valencia. The population is highly variable between the different ACs. While Madrid, Catalunya and Andalucía have more than six, seven and eight million inhabitants, respectively (6,663,394, 7,675,217 and 8,414,240), La Rioja has approximately three hundred thousand inhabitants (316,798).
The first wave of the COVID-19 pandemic in Spain occurred between 2 February and 27 April 2020. In those early days of the pandemic, Spanish authorities published daily and accumulated data of the evolution of the pandemic in Spain, based on the information communicated by the different ACs. Specifically, the data, published daily, correspond to the following variables: number of confirmed (positive) cases, hospitalized people, people in intensive care units (ICUs), recovered people and deceased persons. The observed data for some of the ACs can be seen in Figure 1. The problem that arose, and gave rise to this work, is that some ACs (Castilla La Mancha, Castilla León, Madrid and Galicia) modified the recording of the data associated with people in ICU and hospitalized people from a specific day (see Figure 2). The mathematical action against COVID-19 of the Spanish Mathematics Committee (http://matematicas.uclm.es/cemat/covid19/) (accessed on: 26 May 2021) called for the development of a meta-predictor (collaborative prediction) based on the predictions from different models/algorithms, contributed by interested researchers, which builds optimized combinations of them, disaggregated by ACs. Therefore, the imputation of the missing hospitalized and ICU data is fundamental to building forecasting models to provide optimal predictions of the evolution of the pandemic through these variables. In order to solve this problem, a functional regression model is proposed in this paper to estimate the expected form of the missing accumulated data of ICU admissions and hospitalizations from the observed accumulated data of cases, deaths and recoveries.
From this point, the time evolution of COVID-19 cases, deceases, recoveries, hospitalizations and ICU admission will be considered as functional variables that will be denoted as X 1 ( t ) , X 2 ( t ) , X 3 ( t ) , Y 1 ( t ) and Y 2 ( t ) , respectively (the X-variables will be considered as predictors and the Y-variables as responses in the functional regression models). The observed data are the number of daily cumulative informed values of these five functional variables for the seventeen ACs in Spain from 20 February 2020 to 27 April 2020 (see data source (https://cnecovid.isciii.es/covid19/#documentacion-y-datos) (accessed on: 26 May 2021)). Then, a random sample of curves { ( x i j ( t ) , y i k ( t ) ) : i = 1 , 17 ; j = 1 , 2 , 3 ; k = 1 , 2 } which are observed daily are available.
Before carrying out a functional analysis of the data, it is necessary to complete a data registration, given the absence of uniformity in the publication of the observations. This means that, in the same functional variable, the first day with available data and the number of discrete observations in each AC are different. For example, in Andalucía, the first recorded data of hospitalized persons were from 10 March, in which 32 hospitalized people were registered and, for this variable, there were 49 discrete observations in this AC. On the other hand, in Madrid, the first recorded data of hospitalized persons were from 12 March, in which 1304 hospitalized people were registered, and the number of discrete observations in this AC was 47. However, the curves of positive cases recorded 62 and 63 discrete observations in Andalucía and Madrid, respectively. In addition, the population size of each AC could influence the adjustments of the proposed models, as larger numbers of cases were observed in larger communities, and the different population sizes of the different ACs makes it impossible to compare data between them. In order to avoid both problems, the number of cases per 10,000 inhabitants is considered, and the first observation for each curve corresponds to the day that first exceeds the maximum of the first reported values, discarding the previous ones.
After data homogenization, the period of observation and the number of discrete observations of each functional variable for each AC continue to differ. An important constraint of FDA methods is that all sample curves of a functional variable must be observed in the same domain. Classic solutions to this problem are based on the registration of all curves in a common interval (see [13]). In this paper, we propose registering all curves in the interval [ 0 , 1 ] by applying the FDA methodologies to the synchronized curves defined by
x i j * ( u ) = x i j ( T i j s t a r t + u ( T T i j s t a r t ) ) , y i k * ( u ) = y i k ( T i k s t a r t + u ( T T i k s t a r t ) ) u [ 0 , 1 ] ,
where [ T i j s t a r t , T ] and [ T i k s t a r t , T ] are the observed domains for the i -th predictor and the k -th response curves, respectively ( i = 1 , , 17 ; j = 1 , 2 , 3 ; k = 1 , 2 ) . From now on, and by abuse of a notation that helps to simplify the exposition, x i j and y i k will represent the registered curves.

From Discrete Daily Observations to Curves

Although the functional data are sets of curves, their true functional form is unknown and the recorded data are observations of each curve at a finite collection of timepoints. Then, the first step in FDA is to reconstruct the functional form of the curves from the observed discrete data.
There are different approaches to the processing of functional data, among which we can highlight the classic ones based on a basis representation of the curves [13] and the ones based on local–polynomial regression [16]. In this paper, basis expansions of the curves are considered by assuming that each of the functional variables ( X 1 , X 2 , X 3 ; Y 1 , Y 2 ) generating the sample curves, are smooth stochastic processes with trajectories in the space L 2 ( [ 0 , 1 ] ) of squared integrable functions in the interval [ 0 , 1 ] . In what follows, the basis expansion approach is illustrated for a random sample of a functional variable, defined on a general interval T . In our dataset, this procedure must be performed on each of the five considered functional variables for which the type and dimension of the basis could be different.
Consider a random of sample curves { x i ( t ) : i = 1 , , n ; t T } from a functional variable X with values in L 2 ( T ) , and let us assume that noisy observations x i k are available for each curve at a set of time knots t i 1 , t i 2 , , t i m i T , that is,
x i k = x i ( t i k ) + ϵ i k i = 1 , , n ; k = 1 , , m i .
Let us also suppose that the sample curves belong to a finite-dimensional space generated by a basis of functions { ϕ 1 ( t ) , , ϕ p ( t ) } . Therefore, each curve of the functional data set admits a basis representation in the form
x i ( t ) = j = 1 p a i j ϕ j ( t ) , i = 1 , , n .
The functional form of each curve is then determined by the vector of its basis coefficients a i = ( a i 1 , , a i p ) , which can be estimated in different ways, with least squares approximation being the most common method, providing the following estimation: a ^ i = ( Φ i Φ i ) 1 Φ i x i , where Φ i = ( ϕ j ( t i k ) ) m i × p , j = 1 , , p , k = 1 , , m i .
The type of basis must be selected according to the characteristics of the curves in the functional dataset. The most common basis are B-splines and trigonometric functions (see, for example, [13]). The former generates spaces of spline functions, piecewise polynomial functions that are smoothly joined and have good local behaviour. The latter provides suitable spaces for periodic functions. Many other bases were used in practice, such as bases of wavelets which are more appropriate for curves with discontinuities and sharp spikes. An application of wavelet approximation from sample curves of lupus and stress level was developed in [32]. A robust estimation of the mean function, together with a simultaneous confidence band, based on polynomial spline estimation, is developed in [35].
In this paper, a basis of cubic B-splines of dimension ten with equally spaced knots was used to approximate the five samples of curves of COVID-19 from their daily discrete data. Least squares approximation was performed on each curve to estimate the basis coefficients. The cubic regression splines of all curves considered here can be seen in Figure 3.

3. Functional Linear Regression Imputation with Missing Values in the Response

Motivated by the imputation of the missing curves of COVID-19 hospitalized and intensive care people, a functional linear regression model with functional response and several functional predictors is proposed in this paper. The general formulation of this multiple function-on-function linear regression (MFFLR) model and its estimation in terms of functional principal components regression are summarized in this section.

3.1. Multiple Function-on-Function Linear Model

The multiple function-on-function linear model allows for the estimation of a functional response Y from a vector of J functional predictor variables denoted by X = ( X 1 , , X J ) . Let us consider a random sample from ( X , Y ) denoted by { ( x i , y i ) : i = 1 , , n } with x i = ( x i 1 , x i 2 , , x i J ) , and let us assume that all functional variables take values on the Hilbert space L 2 ( T ) of the squared integrable functions on the interval T , with the usual inner product defined by < f , g > = T f ( t ) g ( t ) d t , t T .
Then, the functional linear model is formulated as
y i ( t ) = α ( t ) + j = 1 J T x i j ( s ) β j ( s , t ) d s + ε i ( t ) , i = 1 , , n ,
where α ( t ) is the intercept function, β j ( s , t ) are the J coefficient functions and ϵ i ( t ) are independent functional errors. Model (2) can be written in matrix form as
y i ( t ) = α ( t ) + T x i ( s ) β ( s , t ) d s + ε i ( t ) , i = 1 , , n ,
where x i ( s ) = ( x i 1 ( s ) , x i 2 ( s ) , , x i J ( s ) ) and β ( s , t ) = ( β 1 ( s , t ) , β 2 ( s , t ) , , β J ( s , t ) ) .
This expression considers that all functional variables are defined in the same interval T , but this is not a restriction and the model can be easily generalized for different domains in each of the functional variables. The estimation of this model is an ill-posed problem that is usually solved by least-squares penalized approaches and basis expansion of functional parameters and/or sample curves [13]. Some of the basis expansion approaches reduce the model to a multivariate linear model for the matrix of response basis coefficients in terms of the matrix of predictors’ basis coefficients. The main problem is that this multivariate model is affected by high multicollinearity, which causes inaccurate estimation of the parameters. Despite the good predictive ability of the model, this fact makes its interpretation more difficult. The most-studied solutions avoid the need for cross-validation to estimate the penalty parameter by reducing the problem to linear regression on uncorrelated predictor variables. Approaches based on functional PCA [36,37,38,39,40,41] and functional Partial Least Squares (PLS) [42,43,44,45,46,47] were widely studied in the literature in the context of different functional regression models.
In this paper, a principal components’ regression approach is considered. This can be seen as an extension of the principal components’ prediction models developed in [48,49] to predict a functional variable in a future interval of time from its evolution in the past. In the so-called PCP models, the functional response and the functional predictor correspond to the same functional variable, but were observed in different time periods. In the present approach, truncated principal component decompositions of the functional response and the functional predictors turn the functional linear model into a multivariate linear model in terms of a reduced set of response and predictor principal components.

3.2. Functional Principal Component Regression

Let us consider the principal component decompositions of both the response and the predictor functional variables, given by
x i j ( t ) = x ¯ j ( t ) + l = 1 n 1 ξ i l x j f l x j ( t ) , y i ( t ) = y ¯ ( t ) + l = 1 n 1 ξ i l y f l y ( t ) ,
where the principal components scores are given by
ξ i l x j = < x i j x ¯ j , f l x j > = T ( x i j ( t ) x ¯ j ( t ) ) f l x j ( t ) d t , ξ i l y = < y i y ¯ , f l y > = T ( y i ( t ) y ¯ ( t ) ) f l y ( t ) d t ,
with the weight functions f l x j and f l y being the eigenfunctions of the sample covariance operators of x i j ( t ) and y i ( t ) , respectively. The principal components scores are centered uncorrelated scalar variables with maximum variance given by the eigenvalues associated with their weight functions: V a r ( ξ i l x j ) = λ l x j , V a r ( ξ i l y ) = λ l y .
Theoretical and asymptotic properties of FPCA for Hilbert-valued random functions were studied in [50,51,52,53,54]. In the case of a basis expansion for each functional variable (see Equation (1)), each functional PCA is equivalent to the multivariate PCA of the matrix A Ψ 1 / 2 , with A = ( a i j ) being the n × p matrix of basis coefficients and Ψ being the p × p matrix of inner products between basis functions, Ψ = ( Ψ i j ) = < ϕ i , ϕ j > , i , j = 1 , , p . The vector of basis coefficients of the the l -th PC weight function f l ( t ) is given by b l = Ψ 1 / 2 v l , where v l is the l -th eigenvector of the sample covariance matrix of A Ψ 1 / 2 (see [55] for a detailed study).
The principal component decompositions given in Equation (3) turn the MFFLR Model (2) into a linear regression model for each PC of the functional response Y on all PCs of the functional predictors
ξ i k y = j = 1 J l = 1 n 1 b k l x j ξ i l x j + ε i k , i = 1 , , n ; k = 1 , , n 1 ,
with the functional coefficients given by β j ( s , t ) = k = 1 n 1 l = 1 n 1 b k l x j f k x j ( s ) f l y ( t ) .
By truncating each principal component decomposition, the following principal component multiple function-on-function linear regression (PC-MFFLR) model for the functional response is obtained
y ^ i ( s ) = y ¯ ( s ) + k = 1 K ξ ^ i k y f k y ( s ) = y ¯ ( s ) + k = 1 K j = 1 J l L k j b ^ k l x j ξ i l x j f k y ( s ) ,
with b ^ k l x j being the linear least-squared estimation of the regression coefficients b k l .
Different selection model approaches were developed to select the optimum PCs of each predictor variable (subsets L k j ), to be considered in Model (6) when it comes to estimating the first J PCs of the response variable. It is well known that PCs are ordered according to their explained variability and that the most explanatory components of the predictor variable might not be the most correlated with the response variable. In the case of the simple function-on-function linear model with only one predictor, a procedure that selects pairs of response/predictor PCs based on both explained variability and correlation was developed in [32]. A supervised version of FPCA that estimates the PCs by considering the correlation of the functional predictor and response variable was developed for the scalar-on-function regression model [56]. The usual selection models procedures based on stepwise and best subset regression, combined with cross-validation, can be adapted to this functional regression context.

3.3. Imputation of Missing Response Curves

Let us consider that all the predictor variables X j are completely observed and only the response variable Y has missing values. Let us assume, without loss of generality, that in the sample, the first n values of the response are observed and the last m values are missing. That means that there are n complete observed curves for all variables and m incomplete observations that are missing values for the response.
In order to estimate the missing response curves, the parameters b k l in Model (5) are estimated with the complete n sample curves of response and predictors. Then, the missing response curves { y i m i s s ( s ) : i = n + 1 , , n + m } are estimated by computing the principal component scores of predictors { ξ i l x j : i = n + 1 , , n + m , l = 1 , , n 1 } given by the Expression (4), and substituting them in the Equation (6). Then, the estimated PC-MFFLR model can be used to predict new response values Y on a test sample and to provide an accurate interpretation of the relationship between the predictor and the response variables.
If the objective is to predict the response variable in a future interval, a regression model of type (6) could be estimated to predict the response variable Y ( s ) in the future interval of amplitude k, denoted by [ T , T + k ] , in terms of the predictor variables ( X 1 ( t ) , , X J ( t ) , Y ( t ) ) in the past interval of time [ 0 , T ] . In the case of the COVID-19 data, the parameter k must be selected by taking the average number of days it takes for a person to develop severe symptoms and need to be admitted to the hospital into account.

4. Covid-19 Application Results

Let us remember that the main aim of this paper is the imputation of hospitalized and intensive care curves for those ACs with missing data. To do this, multiple function-on-function linear regression approaches are developed here. In addition, a canonical correlation analysis (CCA) is performed to interpret the relationship between variables related with hospital occupation (hospitalized and intensive care people) and illness response (positive, deceased and recovered people). The computational results were obtained with the free software R (‘fda’ and ‘yacca’ R-packages for FPCA and CCA, respectively).

4.1. Data Imputation

The imputation problem is solved by applying a multiple function-on-function linear regression for each of the responses Y 1 ( t ) (hospitalized) and Y 2 ( t ) (intensive care) from the three functional predictors X 1 ( t ) (sick), X 2 ( t ) (deceased) and X 3 ( t ) (recovered). Both functional regression models are estimated from the data of the thirteen ACs with complete data (training sample). Then, the predictions for the four ACs with missing data (Castilla La Mancha, Castilla León, Galicia and Madrid) are used for data imputation.
The first step is the estimation of the functional PCs for each of the five functional predictors. As a result, the first PC explained almost all the variability in the five predictors ( 99.32 % , 98.73 % , 97.97 % , 98.59 % , 96.37 % for X 1 , X 2 , X 3 , Y 1 , Y 2 , respectively). Figure 4 and Figure 5 show the weight functions associated to each first PC, and the perturbations of the sample mean curves obtained, by adding and subtracting a multiple of them. In order to obtain weight functions and PC scores which are much easier to interpret, two new functional Varimax rotation approaches were introduced in [19] with application for COVID-19 confirmed people.
After obtaining these functional principal components’ analyses, we consider a training sample composed of all the ACs except Castilla La Mancha, Castilla León, Galicia and Madrid, which will be considered as the prediction sample.
Taking into account that the first component of X 1 ( t ) , X 2 ( t ) and X 3 ( t ) were revealed to be highly and significantly correlated with the first components of Y 1 ( t ) and Y 2 ( t ) ; meanwhile, the other cross-correlations between PCs were not significant, and the function-on function linear regression models were reduced to the following linear models for the first PC of the response in terms of the first PC of each of the predictors
ξ ^ i 1 y k = γ 0 + ξ i 1 x 1 γ 1 y k + ξ i 1 x 2 γ 2 y k + ξ i 1 x 3 γ 3 y k + ε i y k , k = 1 , 2 ; i = 1 , , 17 .
These models allow for the accurate estimation of the first component of Y 1 ( t ) and Y 2 ( t ) from the first components of X 1 ( t ) , X 2 ( t ) and X 3 ( t ) with a determination coefficient of R 2 = 0.9249 and R 2 = 0.7443 , respectively. Finally, the Karhunen–Loève expansion, in terms of the predictor principal components, provides the following prediction equation for Y 1 ( t ) and Y 2 ( t )
y ^ i k ( t ) = y ¯ k ( t ) + ξ ^ i 1 y k f 1 y k ( t ) , k = 1 , 2 ; i = 1 , , 17 .
In order to evaluate the prediction ability of these models, the square root of the mean squared errors between observed and predicted curves are calculated by the expression
R M S E ( y i k ) = 0 1 ( y i k ( t ) y ^ i k ( t ) ) 2 d t 1 2 k = 1 , 2 ; i = 1 , , 13 .
These results can be seen in Table 1, where it can be observed that the predictions for ICU admission curves are more accurate.
Some of the observed and estimated training curves can be seen in Figure 6 and Figure 7 next to confidence bands for the predicted curves. These confidence bands are obtained by pointwise confidence intervals, computed for each fixed timepoint t p as follows
y ^ i k ( t p ) ± 2 × S E ^ ( y ^ i k ( t p ) ) , k = 1 , 2 ,
where S E ^ ( y ^ i k ( t p ) ) = S E ^ ( ξ ^ i k y k ) f 1 y k ( t p ) , k = 1 , 2 with S E ^ ( ξ ^ i k y k ) being the standard error of the PC prediction given by the corresponding multiple linear regression fit.
Finally, the expected curves provided by the regression models in Equation (7) for hospitalizations and ICU admissions in the badly recorded ACs, next to their confidence bands and observed curves, are drawn in Figure 8 and Figure 9.
The prediction of the missing curves using the PC-MFFLR considered models provides a pointwise estimation of hospitalizations and ICU admissions that corrects the inaccurate reported data. These pointwise predictions, next to their anomalous values for the first and last days of the first wave of COVID-19 in Castilla La Mancha, Castilla León, Madrid and Galicia, can be seen in Table 2.
The obtained predictions can be considered as an imputation of the real behaviour of these curves in the observation period if the mode of data communication did not change. Thus, in Castilla La Mancha on 27 April 8464, people were reported as hospitalized and the imputation provided by the model is 9062 cases; in Castilla León, 7777 versus 7658; in Galicia, 2758 versus 3511; in Madrid, 5039 versus 30,587. For ICU admissions, the differences between the registered and imputed cases are, again, evident. It can be seen that, in Castilla La Mancha on 27 April, 584 people were registered as admitted in ICU and the model gives an estimation of 871; in Castilla León, 522 versus 674; in Galicia, 73 versus 236; in Madrid, 780 versus 3111.

4.2. Canonical Correlation Analysis

Once the missing data were imputed and complete curves are available for the 17 ACs, the relationship between the variables related to the number of people admitted to hospitals (hospitalized and ICU people) and the ones affected by the disease (sick, deceased and recovered) can be studied. Canonical Correlation Analysis (CCA) was applied on the two sets of first principal components associated with these functional variables to explore this relationship without necessarily distinguishing between independent and dependent variables. The analysis makes sense because the correlations between PCs in the two groups are very high, suggesting that the variables are not linearly independent.
In agreement with the above, the first principal component of each functional variable is selected to carry out the analysis. Thus, the dataset consists of a sample of the seventeen Spanish ACs in an attempt to determine which factors influence in the hospital occupancy rate. The two groups of variables are, on the one hand, Hospital occupancy rate (HOR) formed by the first PC of hospitalized people and of ICU people ( ξ ^ 1 y 1 , ξ ^ 1 y 2 ) , and, on the other hand, Illness response (IR) comprised by the first PC of positive people, deceased people and recovered people ( ξ ^ 1 x 1 , ξ ^ 1 x 2 , ξ ^ 1 x 3 ) . The estimates of the squared canonical correlations between the two canonical variables for each pair appear in Table 3, next to the outcomes associated with the Barlett’s test for testing the null hypothesis that the two canonical variate pairs are uncorrelated. As a result, it can be concluded that both canonical pairs are significantly correlated and dependent on each other (there is a relationship between the two sets of variables).
Note that the squared canonical correlations represent, for each pair, the percentage of variance in one canonical variate explained by the variation in the other one, but say nothing about the extent to which the canonical variates themselves account for variation in the original variables. Then, around 95.4% of the variation in the first canonical variate for HOR ( U 1 ) is described by the variation in the first canonical variate for IR ( V 1 ), and almost 71% of the variation in U 2 is explained by V 2 . This fact suggests that both canonical correlations are important. Figure 10 displays how the values of the canonical variates are spread over the plane. The linear relation in each pair is clearly visible in these scatterplots. Likewise, it is possible to draw conclusions about which ACs behave similarly during the first wave. The results are in accordance with multiple studies about the COVID-19 pandemic in Spain (see, for example, [19]).
Additionally, the estimated canonical coefficients (loadings) for the HOR and RI variables are in Table 4 and Table 5, respectively. The magnitudes of these coefficients give the contributions of the individual variables to the corresponding canonical variable. Hence, the canonical variables are determined as follows
U 1 = 0.1767528 × ξ 1 y 2 + 0.1214623 × ξ 1 y 1 U 2 = 3.2637387 × ξ 1 y 2 + 0.2556912 × ξ 1 y 1 V 1 = 0.0336045 × ξ 1 x 1 + 0.1877252 × ξ 1 x 2 0.0044276 × ξ 1 x 3 V 2 = 0.1094736 × ξ 1 x 1 1.0135127 × ξ 1 x 2 + 0.0047968 × ξ 1 x 3
Once the raw canonical coefficients have been estimated, the following step is to interpret each canonical component. For that purpose, the squared correlations between the variables in each group and the canonical components are computed in Table 6 and Table 7 for the HOR and IR groups, respectively. These parameters indicate the fraction of HOR and IR variance associated with each of their components separately. Let us observe that, for the second canonical variables ( U 2 , V 2 ) , none of the correlations are large, so this pair provides very little information about the variables. Regarding the first canonical variate pair ( U 1 , V 1 ) , all the correlations with the variables are uniformly high. This means that U 1 and V 1 are an overall measure of HOR and IR variables, respectively, with U 1 being highly correlated with hospitalizations and V 1 more correlated with positive cases and deceased people.
These outcomes expose that the level of saturation in the hospitals are particularly determined by the number of hospitalized people; meanwhile, the response to the pandemic is governed by the number of positive cases and deaths. Despite the fact that the number of people in UCI and the number of recovered people play an important role in the canonical variates, their contribution is smaller.
Finally, a canonical redundancy analysis was performed in order to study the percentage of variance of one group of variables which is accounted for by the other (in the usual least squares sense). The results of this analysis can be seen in Table 8, and the correlations between each set of variables and the opposite group of canonical variates in Table 6 and Table 7. Table 8 shows that both components of the first canonical pair are a good overall predictor of the opposite set of variables, since the explained proportions of variance for HOR and IR are 0.864 and 0.839, respectively. Nevertheless, despite the significant correlation for the second pair, these variables do not account for a great amount of variability. This statement is corroborated by the squared correlations displayed in Table 6 and Table 7. These measures indicate that the first canonical variate of the IR group has an outstanding predictive power for the number of hospitalized (95.09%) and a considerable influence for the number of people in ICU (77.81%) as well. Similar interpretations are reached for the first canonical variable of HOR, which is a superb predictor of the number of cases and deaths (92.13% and 88.40%, respectively), and to lesser extent, of the number of recuperated (71.39%). The second canonical variables add virtually nothing given that the fraction of variance in each variable set attributable to the other group through the respective canonical variates barely overcomes 10% of the total variability.

5. Conclusions

The current economic and sanitary crisis provoked by the virus SARS-CoV-2 has taken up most of the planet’s attention since the World Health Organization declared a worldwide emergency state in the middle of March 2020. In order to control the propagation of the virus, the scientific community is immersed in the development of statistical models that enable governments to control the behaviour of the pandemic and to mitigate the devastating effects of the COVID-19 illness. Thus, it is essential to build powerful models to guarantee accurate predictions. Taking into account the nature of the variables of interest (for instance, number of positive cases, deceases, recovered, hospitalised and people in intensive care units), a wide variety of models have been tackled by considering Functional Data Analysis methodologies. Nevertheless, the good performance of these models depends on the quality of the data, which is not always as good as one might expect, especially in periods of pandemic, where the data are usually incomplete. On this matter, an extension of function-on-function linear regression is proposed for the imputation of missing values in the response, where the functional coefficients are estimated by means of principal components regression. The motivation for this work is to forecast the curves of hospitalized and intensive care people (functional responses) from the curves of positive cases, deaths and recoveries (functional predictors) for several Spanish Autonomous Communities that changed the means of recording data related to hospital occupancy rate. The imputation of these curves is made once the linear model is estimated, with a training sample composed by the remainder of communities that did not modify their way of registering data. The performance of the model is outstanding for the training sample, since the observed and predicted curves are very similar for both functional responses. Regarding the prediction sample, the obtained forecasts can be considered as an imputation of what should have been the real behaviour of these curves in the observation period if the mode of data communication did not change. It can be observed that the model captures the trend of the curves up to the change. Additionally, once the missing data were imputed, a canonical correlation analysis was carried out in order to study the possible relationship between the two groups of variables: hospital occupancy rate (number of hospitalized people and ICU admissions) and illness response (number of positive cases, deaths and recovered people). The first principal component score of each variable was selected to form the canonical analysis, since only the first principal component explains almost all the variability in the five functional variables. After an exhaustive analysis, both sets of variables were shown to be highly correlated with each other and, moreover, each of the first canonical variables is a good overall predictor of the opposite group of variables. At this point, the variables with more predictive power are the number of hospitalizations, positives and deceases. In sum, the present document introduces a new mechanism for the imputation of missing at random functional response curves and shows the relationship among interesting functional variables associated with the COVID-19 pandemic.

Author Contributions

All authors contributed equally in developing the methodology, implementing the proposed estimation algorithms with the software R, performing the simulation study and application, as well as in writing this manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by project MTM2017-88708-P of the Spanish Ministry of Science, Innovation and Universities (also supported by the FEDER program), project FQM-307 of the Government of Andalusia (Spain) and the Ph.D. grant (FPU18/01779) awarded to Christian Acal.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACAutonomous Community
CCACanonical Correlation Analysis
FPCAFunctional Principal Component Analysis
FDAFunctional Data Analysis
HORHospital occupancy rate
IRIllness Response
MFFRMultiple Function-on-Function Regression
PCPrincipal Component
PCAPrincipal Component Analysis
PLSPartial Least Squares
PC-MFFRPrincipal Components Multiple Function-on-Function Regression

References

  1. Dong, E.; Du, H.; Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 2020, 20, 533–534. [Google Scholar] [CrossRef]
  2. Berihuete, A.; Sanchez-Sanchez, M.; Suarez-Llorens, A. A Bayesian Model of COVID-19 Cases Based on the Gompertz Curve. Mathematics 2021, 9, 228. [Google Scholar] [CrossRef]
  3. Mora, J.C.; Pérez, S.; Dvorzhak, A. Application of a Semi-Empirical Dynamic Model to Forecast the Propagation of the COVID-19 Epidemics in Spain. Forecasting 2020, 2, 452–469. [Google Scholar] [CrossRef]
  4. Agarwal, P.; Jhajharia, K. Data analysis and modeling of COVID-19. J. Stat. Manag. Syst. 2021, 24, 1–16. [Google Scholar] [CrossRef]
  5. Tobias, A. Evaluation of the lockdowns for the SARS-CoV-2 epidemic in Italy and Spain after one month follow up. Sci. Total Environ. 2020, 725, 138539. [Google Scholar] [CrossRef] [PubMed]
  6. Maleki, M.; Mahmoudi, M.R.; Heydari, M.H.; Pho, K.H. Modeling and forecasting the spread and death rate of coronavirus (COVID-19) in the world using time series models. Chaos Solitons Fractals 2020, 140, 110151. [Google Scholar] [CrossRef]
  7. Zeroual, A.; Harrou, F.; Dairi, A.; Sun, Y. Deep learning methods for forecasting COVID-19 time-Series data: A Comparative study. Chaos Solitons Fractals 2020, 140, 110121. [Google Scholar] [CrossRef] [PubMed]
  8. Qi, H.; Xiao, S.; Shi, R.; Ward, M.P.; Chen, Y.; Tu, W.; Su, Q.; Wang, W.; Wang, X.; Zhang, Z. COVID-19 transmission in Mainland China is associated with temperature and humidity: A time-series analysis. Sci. Total Environ. 2020, 728, 138778. [Google Scholar] [CrossRef]
  9. Briz-Redon, A. The impact of modelling choices on modelling outcomes: A spatio-temporal study of the association between COVID-19 spread and environmental conditions in Catalonia (Spain). Stoch. Environ. Res. Risk Assess. 2021. [Google Scholar] [CrossRef]
  10. Zanin, M.; Papo, D. Assessing functional propagation patterns in COVID-19. Chaos Solitons Fractals 2020, 138, 109993. [Google Scholar] [CrossRef]
  11. Pak, D.; Langohr, K.; Ning, J.; Cortés-Martínez, J.; Gómez-Melis, G.; Shen, Y. Modeling the Coronavirus Disease 2019 Incubation Period: Impact on Quarantine Policy. Mathematics 2020, 8, 1631. [Google Scholar] [CrossRef]
  12. Mansour, M.; Farsi, M.; Mohamed, S.; Elrazik, M. Modeling the COVID-19 Pandemic Dynamics in Egypt and Saudi Arabia. Mathematics 2021, 9, 827. [Google Scholar] [CrossRef]
  13. Ramsay, J.O.; Silverman, B.W. Functional Data Analysis, 2nd ed.; Springer: New York, NY, USA, 2005. [Google Scholar]
  14. Ramsay, J.O.; Silverman, B.W. Applied Functional Data Analysis: Methods and Case Studies; Springer: New York, NY, USA, 2002. [Google Scholar]
  15. Ramsay, J.O.; Hooker, G.; Graves, S. Functional Data Analysis with R and MATLAB; Springer: New York, NY, USA, 2009. [Google Scholar]
  16. Ferraty, F.; Vieu, P. Nonparametric Functional Data Analysis. Theory and Practice; Springer: New York, NY, USA, 2006. [Google Scholar]
  17. Horvath, L.; Kokoszka, P. Inference for Functional Data with Applications; Springer: New York, NY, USA, 2012. [Google Scholar]
  18. Tang, C.; Wang, T.; Zhang, P. Functional data analysis: An application to COVID-19 data in the United States. arXiv 2020, arXiv:2009.08363. [Google Scholar]
  19. Acal, C.; Aguilera, A.M.; Escabias, M. New Modeling Approaches Based on Varimax Rotation of Functional Principal Components. Mathematics 2020, 8, 2085. [Google Scholar] [CrossRef]
  20. Carroll, C.; Bhattacharjee, S.; Chen, Y.; Dubey, P.; Fan, J.; Gajardo, A.; Zhou, X.; Müller, H.G.; Wang, J.L. Time dynamics of COVID-19. Sci. Rep. 2020, 10, 21040. [Google Scholar] [CrossRef] [PubMed]
  21. Torres-Signes, A.; Frías, M.P.; Ruiz-Medina, M.D. COVID-19 mortality analysis from soft-data multivariate curve regression and machine learning. arXiv 2021, arXiv:2008.06344. [Google Scholar]
  22. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2019. [Google Scholar]
  23. Graham, J.W. Missing Data: Analysis and Design; Springer Science & Business Media: New York, NY, USA, 2012. [Google Scholar]
  24. He, Y.; Yucel, R.; Raghunathan, T.E. A functional multiple imputation approach to incomplete longitudinal data. Stat. Med. 2011, 30, 1137–1156. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Ferraty, F.; Sued, M.; Vieu, P. Mean estimation with data missing at random for functional covariables. Statistics 2013, 47, 688–706. [Google Scholar] [CrossRef]
  26. Ling, N.; Liang, L.; Vieu, P. Nonparametric regression estimation for functional stationary ergodic data with missing at random. J. Stat. Plan. Inference 2015, 162, 75–87. [Google Scholar] [CrossRef]
  27. Ling, N.; Liu, Y.; Vieu, P. Conditional mode estimation for functional stationary ergodic data with responses missing at random. Statistics 2016, 50, 991–1013. [Google Scholar] [CrossRef]
  28. Crambes, C.; Henchiri, Y. Regression imputation in the functional linear model with missing values in the response. J. Stat. Plan. Inference 2019, 201, 103–119. [Google Scholar] [CrossRef] [Green Version]
  29. Febrero-Bande, M.; Galeano, P.; González-Manteiga, W. Estimation, imputation and prediction for the functional linear model with scalar response with responses missing at random. Comput. Stat. Data Anal. 2019, 131, 91–103. [Google Scholar] [CrossRef]
  30. Ciarleglio, A.; Petkova, E.; Harel, O. Multiple imputation in functional regression with applications to EEG data in a depression study. arXiv 2020, arXiv:2001.08175. [Google Scholar]
  31. Rao, A.R.; Reimherr, M. Modern multiple imputation with functional data. Stat 2021, 10, e331. [Google Scholar] [CrossRef]
  32. Aguilera, A.M.; Escabias, M.; Ocaña, F.A.; Valderrama, M.J. Functional Wavelet-Based Modelling of Dependence Between Lupus and Stress. Methodol. Comput. Appl. Probab. 2015, 17, 1015–1028. [Google Scholar] [CrossRef]
  33. Valderrama, M.; Ocaña, F.; Aguilera, A.; Ocaña-Peinado, F. Forecasting pollen concentration by a two-step functional model. Biometrics 2010, 66, 578–585. [Google Scholar] [CrossRef]
  34. Qi, X.; Luo, R. Function-on-function regression with thousands of predictive curves. J. Multivar. Anal. 2018, 163, 51–66. [Google Scholar] [CrossRef]
  35. Lima, I.; Cao, G.; Billor, N. Robust simultaneous inference for the mean function of functional data. Test 2019, 28, 785–803. [Google Scholar] [CrossRef]
  36. Chiou, J.M.; Müller, H.G.; Wang, J.L. Functional response models. Stat. Sin. 2004, 14, 659–677. [Google Scholar]
  37. Escabias, M.; Aguilera, A.M.; Valderrama, M.J. Principal component estimation of functional logistic regression: Discussion of two different approaches. J. Nonparametr. Stat. 2004, 16, 365–384. [Google Scholar] [CrossRef] [Green Version]
  38. Müller, H.G.; Stadtmüller, U. Generalized functional linear models. Ann. Stat. 2005, 33, 774–805. [Google Scholar] [CrossRef] [Green Version]
  39. Aguilera-Morillo, M.C.; Aguilera, A.M.; Escabias, M.; Valderrama, M.J. Penalized spline approaches for functional logit regression. Test 2013, 22, 251–277. [Google Scholar] [CrossRef]
  40. Escabias, M.; Aguilera, A.; Aguilera-Morillo, M. Functional PCA and Base-Line Logit Models. J. Classif. 2014, 31, 296–324. [Google Scholar] [CrossRef]
  41. Aguilera, A.M.; Aguilera-Morillo, M.C.; Preda, C. Penalized versions of functional PLS regression. Chemom. Intell. Lab. Syst. 2016, 154, 80–92. [Google Scholar] [CrossRef]
  42. Preda, C.; Saporta, G. PLS regression on a stochastic process. Comput. Stat. Data Anal. 2005, 48, 149–158. [Google Scholar] [CrossRef] [Green Version]
  43. Escabias, M.; Aguilera, A.M.; Valderrama, M.J. Functional PLS logit regression model. Comput. Stat. Data Anal. 2007, 51, 4891–4902. [Google Scholar] [CrossRef]
  44. Aguilera, A.M.; Escabias, M.; Preda, C.; Saporta, G. Using basis expansion for estimating functional PLS regression. Applications with chemometric data. Chemom. Intell. Lab. Syst. 2010, 104, 289–305. [Google Scholar] [CrossRef]
  45. Delaigle, A.; Hall, P. Methodology and theory for partial least squares applied to functional data. Ann. Stat. 2012, 40, 322–352. [Google Scholar] [CrossRef] [Green Version]
  46. Febrero-Bande, M.; Galeano, P.; González-Manteiga, W. Functional principal component regression and functional partial least squares regression: An overview and a comparative study. Int. Stat. Rev. 2017, 85, 61–83. [Google Scholar] [CrossRef]
  47. Aguilera, A.M.; Acal, C.; Aguilera-Morillo, M.C.; Jiménez-Molinos, F.; Roldán, J.B. Homogeneity problem for basis expansion of functional data with applications to resistive memories. Math. Comput. Simul. 2021, 186, 41–51. [Google Scholar] [CrossRef]
  48. Aguilera, A.M.; Ocaña, F.A.; Valderrama, M.J. An approximated principal component prediction model for continuous-time stochastic processes. Appl. Stoch. Model. Data Anal. 1997, 13, 61–72. [Google Scholar] [CrossRef]
  49. Aguilera, A.M.; Ocaña, F.A.; Valderrama, M.J. Forecasting with unequally spaced data by a functional principal component approach. Test 1999, 8, 233–254. [Google Scholar] [CrossRef]
  50. Deville, J.C. Méthodes statistiques et numériques de l’analyse harmonique. Ann. De L’INSEE 1974, 15, 3–101. [Google Scholar] [CrossRef]
  51. Dauxois, J.; Pousse, A.; Romain, Y. Asymptotic theory for the principal component analysis of a vector random function: Some applications to statistical inference. J. Multivar. Anal. 1982, 12, 136–156. [Google Scholar] [CrossRef] [Green Version]
  52. Ocaña, F.A.; Aguilera, A.M.; Valderrama, M.J. Functional Principal Components Analysis by Choice of Norm. J. Multivar. Anal. 1999, 71, 262–276. [Google Scholar] [CrossRef] [Green Version]
  53. Hall, P.; Hosseini-Nasab, M. On properties of functional principal components analysis. J. R. Stat. Soc. B 2006, 68, 109–126. [Google Scholar] [CrossRef]
  54. Ruiz-Castro, J.E.; Acal, C.; Aguilera, A.M.; Aguilera-Morillo, M.C.; Roldán, J.B. Linear-Phase-Type probability modelling of functional PCA with applications to resistive memories. Math. Comput. Simul. 2021, 186, 71–79. [Google Scholar] [CrossRef]
  55. Ocaña, F.A.; Aguilera, A.M.; Escabias, M. Computational considerations in functional principal component analysis. Comput. Stat. 2007, 22, 449–465. [Google Scholar] [CrossRef]
  56. Nie, Y.; Wang, L.; Liu, B.; Cao, J. Supervised functional principal component analysis. Stat. Comput. 2018, 28, 713–723. [Google Scholar] [CrossRef]
Figure 1. Discrete daily observations of accumulated positive cases, deaths, hospitalizations, ICU admissions and recoveries in Madrid, Andalucía, Cataluña and La Rioja.
Figure 1. Discrete daily observations of accumulated positive cases, deaths, hospitalizations, ICU admissions and recoveries in Madrid, Andalucía, Cataluña and La Rioja.
Mathematics 09 01237 g001
Figure 2. Discrete daily observations of accumulated hospitalizations and ICU admissions in Madrid, Castilla La Mancha, Castilla León and Galicia.
Figure 2. Discrete daily observations of accumulated hospitalizations and ICU admissions in Madrid, Castilla La Mancha, Castilla León and Galicia.
Mathematics 09 01237 g002
Figure 3. Curves of accumulated positive cases, deaths, recoveries, hospitalizations and ICU admissions.
Figure 3. Curves of accumulated positive cases, deaths, recoveries, hospitalizations and ICU admissions.
Mathematics 09 01237 g003
Figure 4. First weight PC functions (left), sample mean curves and the perturbations for each functional response (right) y ¯ k ± 2 λ 1 y k f 1 y k ; k = 1 , 2 .
Figure 4. First weight PC functions (left), sample mean curves and the perturbations for each functional response (right) y ¯ k ± 2 λ 1 y k f 1 y k ; k = 1 , 2 .
Mathematics 09 01237 g004
Figure 5. First weight PC functions (a,c,e), sample mean curves and the perturbations (b,d,f) for each functional predictor x ¯ j ± 2 λ 1 x j f 1 x j , j = 1 , 2 , 3 .
Figure 5. First weight PC functions (a,c,e), sample mean curves and the perturbations (b,d,f) for each functional predictor x ¯ j ± 2 λ 1 x j f 1 x j , j = 1 , 2 , 3 .
Mathematics 09 01237 g005
Figure 6. Observed and predicted curves (with pointwise confidence bands) of hospitalizations (left) and ICU admissions (right) in some of the training ACs: Andalucía, Cantabria and Navarra.
Figure 6. Observed and predicted curves (with pointwise confidence bands) of hospitalizations (left) and ICU admissions (right) in some of the training ACs: Andalucía, Cantabria and Navarra.
Mathematics 09 01237 g006
Figure 7. Observed and predicted curves (with pointwise confidence bands) of hospitalizations (left) and ICU admissions (right) in some of the training ACs: Asturias, Valencia and La Rioja.
Figure 7. Observed and predicted curves (with pointwise confidence bands) of hospitalizations (left) and ICU admissions (right) in some of the training ACs: Asturias, Valencia and La Rioja.
Mathematics 09 01237 g007
Figure 8. Observed and predicted curves (with pointwise confidence bands) of hospitalizations (left) and ICU admissions (right) in Castilla León and Castilla La Mancha.
Figure 8. Observed and predicted curves (with pointwise confidence bands) of hospitalizations (left) and ICU admissions (right) in Castilla León and Castilla La Mancha.
Mathematics 09 01237 g008
Figure 9. Observed and predicted curves (with pointwise confidence bands) of hospitalizations (left) and ICU admissions (right) in Madrid and Galicia.
Figure 9. Observed and predicted curves (with pointwise confidence bands) of hospitalizations (left) and ICU admissions (right) in Madrid and Galicia.
Mathematics 09 01237 g009
Figure 10. Scatterplot for the first and second canonical variate pairs.
Figure 10. Scatterplot for the first and second canonical variate pairs.
Mathematics 09 01237 g010
Table 1. Root mean squared prediction errors for hospitalizations ( y i 1 ) and ICU admissions ( y i 2 ) curves in the different training ACs.
Table 1. Root mean squared prediction errors for hospitalizations ( y i 1 ) and ICU admissions ( y i 2 ) curves in the different training ACs.
AC RMSE ( y i 1 ) RMSE ( y i 2 )
Andalucía0.775779480.13645363
Aragón1.510750140.16559390
Asturias3.055648280.05135305
Islas Baleares0.473971620.30168996
Islas Canarias1.175636810.04168788
Cantabria0.919930380.06896749
Catalunya3.456561210.62176527
Valencia0.925912200.04144271
Extremadura3.309613800.59974926
Murcia1.620147520.11596922
Navarra1.261097420.20248242
País vasco4.468847520.30765768
La Rioja3.376927980.22644853
Table 2. Pointwise imputation of hospitalizations and ICU admissions for the first and last days of the first COVID-19 wave in Castilla La Mancha, Castilla León, Madrid and Galicia.
Table 2. Pointwise imputation of hospitalizations and ICU admissions for the first and last days of the first COVID-19 wave in Castilla La Mancha, Castilla León, Madrid and Galicia.
Hospitalizations
Castilla La ManchaCastilla LeónGaliciaMadrid
TimeObsPredObsPredObsPredObsPred
163541347649555757015181351
283856562963090667623371779
31547735798784104380923372247
41826932977961114796137102772
521621164119711631250112037783371
621621436145713941338127651684059
727071758182316561447142463384853
829772124221419481630156473885768
930182520264822591767169884416794
817382007080697026093171819128,159
819983177174706426523222793028,482
824384307264715526743270746428,802
830485427397724626943316707729,120
834286547506733627073362660129,434
838587637555742427223407618329,740
841788687653750827353449589230,037
844489697703758627463484544130,320
846490627777765827583511503930,587
ICU Admissions
Castilla La ManchaCastilla LeónGaliciaMadrid
TimeObsPredObsPredObsPredObsPred
123372435292777127
2234543443535102152
3295654544744135184
4377069675553180224
5378885836963180273
6651101061028674253332
7761361201249885253404
814216713715011296340488
9182202170178123107491587
53178450161510823711112826
53479350362210123710762853
5378015056289623810242878
546809508634922399812903
553817510639872399492930
559826511645902398922962
565838515653852398733001
579852518662832388213050
584871522674732367803111
Table 3. Estimates of the canonical correlations next to χ 2 values associated with Bartlett’s omnibus statistic, degrees of freedom and p-values for each canonical variate pair.
Table 3. Estimates of the canonical correlations next to χ 2 values associated with Bartlett’s omnibus statistic, degrees of freedom and p-values for each canonical variate pair.
Canonical CorrelationSquared Canonical CorrelationStatdfp-Value
0.97656930.953687655.827216<0.001
0.83986690.705376415.886732<0.001
Table 4. Canonical coefficients for HOR variables.
Table 4. Canonical coefficients for HOR variables.
U 1 U 2
ICU−0.1767528−3.2637387
Hospitalized0.12146230.2556912
Table 5. Canonical coefficients for IR variables.
Table 5. Canonical coefficients for IR variables.
V 1 V 2
Cases0.03360450.1094736
Deceased0.1877252−1.0135127
Recovered−0.00442760.0047968
Table 6. Squared correlations between the HOR variables and the canonical variables.
Table 6. Squared correlations between the HOR variables and the canonical variables.
U 1 U 2 V 1 V 2
ICU0.81588800.1841119530.77810230.129868216
Hospitalized0.99707560.0029243530.95089860.002062769
Table 7. Squared correlations between the IR variables and the canonical variables.
Table 7. Squared correlations between the IR variables and the canonical variables.
V 1 V 2 U 1 U 2
Cases0.96606820.033664290.92132720.02374600
Deceased0.92694400.072371540.88401490.05104917
Recovered0.74858810.040127340.71391920.02830487
Table 8. Total fraction of HOR (IR) variance accounted by IR (HOR) variables, through each canonical variate in first row (second row).
Table 8. Total fraction of HOR (IR) variance accounted by IR (HOR) variables, through each canonical variate in first row (second row).
U 1 U 2 V 1 V 2
HOR--0.864500460.06596549
RI0.839753760.03436668--
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Acal, C.; Escabias, M.; Aguilera, A.M.; Valderrama, M.J. COVID-19 Data Imputation by Multiple Function-on-Function Principal Component Regression. Mathematics 2021, 9, 1237. https://doi.org/10.3390/math9111237

AMA Style

Acal C, Escabias M, Aguilera AM, Valderrama MJ. COVID-19 Data Imputation by Multiple Function-on-Function Principal Component Regression. Mathematics. 2021; 9(11):1237. https://doi.org/10.3390/math9111237

Chicago/Turabian Style

Acal, Christian, Manuel Escabias, Ana M. Aguilera, and Mariano J. Valderrama. 2021. "COVID-19 Data Imputation by Multiple Function-on-Function Principal Component Regression" Mathematics 9, no. 11: 1237. https://doi.org/10.3390/math9111237

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop