# Graph Regression Model for Spatial and Temporal Environmental Data—Case of Carbon Dioxide Emissions in the United States

## Abstract

## 1. Introduction

#### Contributions

## 2. Problem Statement—The Classical Approach

#### 2.1. Preliminaries

#### 2.2. System Model

1. Determine, for each of the $N$ different locations, the specific relationship between the response variables ${\left\{{y}_{t,i}\right\}}_{t=1}^{T}$ and the set of covariates ${\left\{{\mathit{x}}_{t}\right\}}_{t=1}^{T}$.
2. Based on this relationship, make a prediction of the CO${}_{2}$ levels in different locations in space and time.

#### 2.3. Problem Formulation with a Classical Linear Regression Model

**Theorem 1**

- (A1) The matrix $\mathbf{\Phi}$ is nonrandom and has full rank, i.e., its columns are linearly independent;
- (A2) The vector $\mathit{y}$ is a random vector such that the following hold:
  - (i) $\mathbb{E}\left[\mathit{y}\right]=\mathbf{\Phi}{\mathit{\beta}}_{0}$ for some ${\mathit{\beta}}_{0}$;
  - (ii) $\mathrm{Var}\left(\mathit{y}\right)=\mathbf{\Sigma}$ is a known positive definite matrix.
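
Assumptions (A1) and (A2) describe the generalized least squares setting. The sketch below assumes the estimator of interest takes the usual form $\widehat{\mathit{\beta}}={(\mathbf{\Phi}^{\top}\mathbf{\Sigma}^{-1}\mathbf{\Phi})}^{-1}\mathbf{\Phi}^{\top}\mathbf{\Sigma}^{-1}\mathit{y}$; that form is an assumption here, since the conclusion of the theorem is not reproduced above. The function name `gls_estimate` and the simulated data are hypothetical and serve only as illustration.

```python
import numpy as np

def gls_estimate(Phi, y, Sigma):
    """Solve (Phi^T Sigma^{-1} Phi) beta = Phi^T Sigma^{-1} y for beta."""
    # Use linear solves with Sigma instead of forming its inverse explicitly.
    Sigma_inv_Phi = np.linalg.solve(Sigma, Phi)
    Sigma_inv_y = np.linalg.solve(Sigma, y)
    return np.linalg.solve(Phi.T @ Sigma_inv_Phi, Phi.T @ Sigma_inv_y)

# Simulated data (illustrative only, not the data of the study).
rng = np.random.default_rng(0)
n, p = 200, 3
Phi = rng.normal(size=(n, p))            # fixed design with full column rank (A1)
beta0 = np.array([1.0, -2.0, 0.5])       # E[y] = Phi @ beta0 (A2-i)
A = rng.normal(size=(n, n))
Sigma = A @ A.T / n + np.eye(n)          # known positive definite covariance (A2-ii)
noise = np.linalg.cholesky(Sigma) @ rng.normal(size=n)
y = Phi @ beta0 + noise

beta_hat = gls_estimate(Phi, y, Sigma)
```

With $\mathbf{\Sigma}$ equal to the identity, the same function reduces to ordinary least squares.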

#### 2.4. Generalized Linear Models

## 3. Proposed Graph Regression Model

#### 3.1. Penalized Regression Model over Graph

- Case 1—$\tilde{\mathit{L}}={\mathit{I}}_{T}\otimes \mathit{L}$: the penalization induces the smoothness of the successive mean vectors $\mathbb{E}\left[{\mathit{y}}_{1}\right],\dots ,\mathbb{E}\left[{\mathit{y}}_{T}\right]$ over a static graph structure $\mathit{L}$.
- Case 2—$\tilde{\mathit{L}}=\mathrm{diag}({\mathit{L}}_{1},\dots ,{\mathit{L}}_{T})$: the penalization induces the smoothness of the successive mean vectors $\mathbb{E}\left[{\mathit{y}}_{1}\right],\dots ,\mathbb{E}\left[{\mathit{y}}_{T}\right]$ over a time-varying graph structure, ${\mathit{L}}_{1},\dots ,{\mathit{L}}_{T}$.
- Case 3—$\tilde{\mathit{L}}={\mathit{D}}_{h}^{\top}({\mathit{I}}_{T-1}\otimes \mathit{L}){\mathit{D}}_{h}$ or $\tilde{\mathit{L}}={\mathit{D}}_{h}^{\top}\mathrm{diag}({\mathit{L}}_{1},\dots ,{\mathit{L}}_{T-1}){\mathit{D}}_{h}$: the penalization induces the smoothness of the time differences of the mean vectors, $\mathbb{E}\left[{\mathit{y}}_{2}\right]-\mathbb{E}\left[{\mathit{y}}_{1}\right],\dots ,\mathbb{E}\left[{\mathit{y}}_{T}\right]-\mathbb{E}\left[{\mathit{y}}_{T-1}\right]$, over a graph structure that is either static or time-varying, respectively. The matrix ${\mathit{D}}_{h}^{\top}$, of dimension $NT\times N(T-1)$, is defined as$${\mathit{D}}_{h}^{\top}=\left[\begin{array}{cccccc}-{\mathit{I}}_{N}& {\mathbf{0}}_{N}& \dots & \dots & \dots & {\mathbf{0}}_{N}\\ {\mathit{I}}_{N}& -{\mathit{I}}_{N}& {\mathbf{0}}_{N}& \dots & \dots & {\mathbf{0}}_{N}\\ {\mathbf{0}}_{N}& {\mathit{I}}_{N}& -{\mathit{I}}_{N}& {\mathbf{0}}_{N}& \dots & {\mathbf{0}}_{N}\\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\ {\mathbf{0}}_{N}& \dots & {\mathbf{0}}_{N}& {\mathit{I}}_{N}& -{\mathit{I}}_{N}& {\mathbf{0}}_{N}\\ {\mathbf{0}}_{N}& \dots & \dots & {\mathbf{0}}_{N}& {\mathit{I}}_{N}& -{\mathit{I}}_{N}\\ {\mathbf{0}}_{N}& \dots & \dots & \dots & {\mathbf{0}}_{N}& {\mathit{I}}_{N}\end{array}\right].$$
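
The three choices of $\tilde{\mathit{L}}$ can be assembled with a few lines of NumPy. The sketch below assumes that $\mathit{L}$ is the combinatorial Laplacian (degree matrix minus adjacency matrix) of a small toy graph; the helper names `laplacian`, `difference_operator` and `block_diag`, as well as the toy weights, are hypothetical and not taken from the study.

```python
import numpy as np

def laplacian(W):
    """Combinatorial Laplacian D - W of a symmetric adjacency matrix W (assumed form)."""
    return np.diag(W.sum(axis=1)) - W

def difference_operator(N, T):
    """D_h of shape (N(T-1), NT): maps stacked [y_1; ...; y_T] to [y_2 - y_1; ...]."""
    D = np.zeros((N * (T - 1), N * T))
    for t in range(T - 1):
        rows = slice(t * N, (t + 1) * N)
        D[rows, t * N:(t + 1) * N] = -np.eye(N)
        D[rows, (t + 1) * N:(t + 2) * N] = np.eye(N)
    return D

def block_diag(blocks):
    """Place square blocks along the diagonal of a zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

N, T = 3, 4
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])                          # toy adjacency matrix
L = laplacian(W)

L_case1 = np.kron(np.eye(T), L)                          # Case 1: static graph
L_list = [laplacian((t + 1) * W) for t in range(T)]      # toy time-varying graphs
L_case2 = block_diag(L_list)                             # Case 2: diag(L_1, ..., L_T)
D_h = difference_operator(N, T)
L_case3 = D_h.T @ np.kron(np.eye(T - 1), L) @ D_h        # Case 3: static graph on differences

# Quadratic forms on a stacked mean vector mu = [mu_1; ...; mu_T].
rng = np.random.default_rng(0)
mu = rng.normal(size=N * T)
M = mu.reshape(T, N)                                     # row t holds mu_{t+1}
penalty_case1 = mu @ L_case1 @ mu
penalty_case3 = mu @ L_case3 @ mu
```

In this construction, the Case 1 penalty equals the sum of the per-time-step quadratic forms, and the Case 3 penalty equals the same sum applied to the successive differences of the mean vectors.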

**Proposition 1.**

**Proof.**

#### 3.2. Learning and Prediction Procedure

**Algorithm 1** Learning procedure of the proposed penalized regression model over graph

**Input:** ${\mathcal{D}}_{train}={\left\{{\mathit{x}}_{t},{\mathit{y}}_{t}\right\}}_{t=1}^{{\rho}_{train}T}$, ${\mathcal{D}}_{val}={\left\{{\mathit{x}}_{t},{\mathit{y}}_{t}\right\}}_{t={\rho}_{train}T+1}^{({\rho}_{train}+{\rho}_{val})T}$, ${\mathcal{D}}_{test}={\left\{{\mathit{x}}_{t},{\mathit{y}}_{t}\right\}}_{t=({\rho}_{train}+{\rho}_{val})T+1}^{T}$

- 1: Iterations of a numerical optimization method
- 2: **while** ${E}_{{\mathcal{D}}_{val}}^{\ast}\ne {E}_{{\mathcal{D}}_{val}}^{min}$ **do**
  - 3: Let ${\mathit{\gamma}}^{\ast}$ denote the candidate for the values of hyperparameters for this iteration of the chosen derivative-free optimization technique.
  - 4: Given ${\mathit{\gamma}}^{\ast}$, obtain the optimal regression coefficient ${\widehat{\mathit{\beta}}}^{\ast}$ in (15) using only the data from the training set ${\mathcal{D}}_{train}$:$${\widehat{\mathit{\beta}}}^{\ast}=\underset{\mathit{\beta}}{\arg\min}\left(V\left(\mathit{y}\in {\mathcal{D}}_{train};\mathit{\beta}\right)+{\gamma}_{1}^{\ast}{\mathit{\beta}}^{\top}\mathit{\beta}+{\gamma}_{2}^{\ast}\sum _{t\in {\mathcal{D}}_{train}}{\mathit{g}}^{-1}{\left(\mathit{\varphi}\left({\mathit{x}}_{t}\right)\mathit{\beta}\right)}^{\top}\tilde{\mathit{L}}{\mathit{g}}^{-1}\left(\mathit{\varphi}\left({\mathit{x}}_{t}\right)\mathit{\beta}\right)\right).$$
  - 5: Compute the estimator of the generalization error using the validation set:$${E}_{{\mathcal{D}}_{val}}^{\ast}=\frac{1}{{\rho}_{val}T}\sum _{t\in {\mathcal{D}}_{val}}{\left\|{\mathit{y}}_{t}-{\mathit{g}}^{-1}\left(\mathit{\varphi}\left({\mathit{x}}_{t}\right){\widehat{\mathit{\beta}}}^{\ast}\right)\right\|}^{2}$$
- 6: **end while**

**Output:** Optimal hyperparameters $\widehat{\mathit{\gamma}}$ and regression coefficients $\widehat{\mathit{\beta}}$
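
The learning procedure can be sketched under simplifying assumptions that are not those of the study: an identity link $\mathit{g}$, a squared-error loss $V$, a static Laplacian applied at each time step, and a plain grid search standing in for the derivative-free optimizer. Under these assumptions, step 4 has a closed-form solution. All function names and the simulated data below are hypothetical.

```python
import itertools
import numpy as np

def fit_beta(Phi, Y, L, gamma1, gamma2):
    """Closed-form minimizer for an identity link and squared-error loss.

    Phi: (T, N, p) design blocks phi(x_t); Y: (T, N) responses; L: (N, N) Laplacian.
    """
    p = Phi.shape[2]
    gram = np.einsum("tnp,tnq->pq", Phi, Phi)            # sum_t phi_t^T phi_t
    smooth = np.einsum("tnp,nm,tmq->pq", Phi, L, Phi)    # sum_t phi_t^T L phi_t
    rhs = np.einsum("tnp,tn->p", Phi, Y)                 # sum_t phi_t^T y_t
    return np.linalg.solve(gram + gamma1 * np.eye(p) + gamma2 * smooth, rhs)

def validation_error(Phi, Y, beta):
    """Mean over time steps of the squared error ||y_t - phi_t beta||^2."""
    resid = Y - Phi @ beta
    return np.mean(np.sum(resid ** 2, axis=1))

def learn(Phi_train, Y_train, Phi_val, Y_val, L, grid):
    """Grid search over (gamma1, gamma2), scored on the validation set."""
    best = (np.inf, None, None)
    for g1, g2 in itertools.product(grid, grid):
        beta = fit_beta(Phi_train, Y_train, L, g1, g2)   # step 4: training set only
        err = validation_error(Phi_val, Y_val, beta)     # step 5: validation set
        if err < best[0]:
            best = (err, (g1, g2), beta)
    return best

# Simulated data (illustrative only).
rng = np.random.default_rng(0)
T, N, p = 60, 4, 3
W = np.ones((N, N)) - np.eye(N)                          # toy complete graph
L = np.diag(W.sum(axis=1)) - W
Phi = rng.normal(size=(T, N, p))
beta_true = np.array([1.0, -0.5, 2.0])
Y = Phi @ beta_true + 0.1 * rng.normal(size=(T, N))

# Chronological split: first 60% for training, next 20% for validation.
n_train, n_val = 36, 12
tr = slice(0, n_train)
va = slice(n_train, n_train + n_val)
err, gammas, beta_hat = learn(Phi[tr], Y[tr], Phi[va], Y[va], L,
                              grid=[0.0, 0.01, 0.1, 1.0])
```

The split is chronological rather than random, so the validation error is always computed on observations that follow the training period. The remaining observations would play the role of the test set.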

## 4. Numerical Study—CO${}_{2}$ Prediction in the United States

#### 4.1. Choice of Covariates and Data Pre-Processing

- Daily weather data for the United States of America, available from the National Centers for Environmental Information (NCEI) platform at https://www.ncdc.noaa.gov/ghcnd-data-access (accessed on 1 August 2023), including maximum temperature (TMAX), minimum temperature (TMIN) and precipitation (PREC);
- Temporal information to capture the time patterns of the data;
- Lagged CO${}_{2}$ emission variables to take into account the time correlation of the response.
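
Lagged response covariates of the kind listed above can be built by shifting the response array in time. The sketch below is illustrative only: the function name `add_lags` is hypothetical, and the lag orders (1 and 2) are assumptions, not the orders used in the study.

```python
import numpy as np

def add_lags(y, lags):
    """Build lagged copies of a (T, N) response array.

    Returns (features, target): features has shape (T - max_lag, N, len(lags)),
    with features[t, i, j] equal to the target at time t, location i, shifted
    back by lags[j] time steps; target is y[max_lag:].
    """
    max_lag = max(lags)
    feats = np.stack([y[max_lag - k: y.shape[0] - k] for k in lags], axis=-1)
    return feats, y[max_lag:]

# Toy response: T = 5 days, N = 2 locations.
y = np.arange(10, dtype=float).reshape(5, 2)
feats, target = add_lags(y, lags=(1, 2))
```

The first `max_lag` days are dropped from the target because their lagged values are not available.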

#### 4.2. Graph Construction of the Spatial Component

#### 4.3. Numerical Experiments

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Proof of Proposition 1

## Appendix B. List of Counties Used in the Numerical Study

List of Counties

Number | County | State | Number | County | State |
---|---|---|---|---|---|
1 | Anoka County | Minnesota | 31 | Daviess County | Kentucky |
2 | Dakota County | Minnesota | 32 | Hopkins County | Kentucky |
3 | Lyon County | Minnesota | 33 | Russell County | Kentucky |
4 | Buchanan County | Iowa | 34 | Alamance County | North Carolina |
5 | Crawford County | Iowa | 35 | Lenoir County | North Carolina |
6 | Page County | Iowa | 36 | Pender County | North Carolina |
7 | Union County | Iowa | 37 | Randolph County | North Carolina |
8 | Ashley County | Arkansas | 38 | Charleston County | South Carolina |
9 | Columbia County | Arkansas | 39 | Dillon County | South Carolina |
10 | Outagamie County | Wisconsin | 40 | Lee County | South Carolina |
11 | Dane County | Wisconsin | 41 | Marlboro County | South Carolina |
12 | Clark County | Illinois | 42 | Pickens County | South Carolina |
13 | Mercer County | Illinois | 43 | Bartholomew County | Indiana |
14 | Ogle County | Illinois | 44 | Posey County | Indiana |
15 | Stephenson County | Illinois | 45 | Mahoning County | Ohio |
16 | Lawrence County | Tennessee | 46 | Shelby County | Ohio |
17 | Obion County | Tennessee | 47 | Delta County | Michigan |
18 | Cumberland County | Tennessee | 48 | Montcalm County | Michigan |
19 | Hinds County | Mississippi | 49 | Washtenaw County | Michigan |
20 | Tate County | Mississippi | 50 | Armstrong County | Pennsylvania |
21 | Blount County | Alabama | 51 | Montour County | Pennsylvania |
22 | Autauga County | Alabama | 52 | Lebanon County | Pennsylvania |
23 | Marengo County | Alabama | 53 | Luzerne County | Pennsylvania |
24 | Morgan County | Alabama | 54 | Addison County | Vermont |
25 | Talladega County | Alabama | 55 | Windsor County | Vermont |
26 | Bulloch County | Georgia | 56 | Grant Parish | Louisiana |
27 | Habersham County | Georgia | 57 | Red River Parish | Louisiana |
28 | Bradford County | Florida | 58 | Vermilion Parish | Louisiana |
29 | Clay County | Florida | 59 | Madison Parish | Louisiana |
30 | Taylor County | Florida | | | |


**Figure 2.** Choice of the covariate WD to encapsulate information about the weekday for the CO${}_{2}$ emission. (**a**) Spatial and temporal average of the CO${}_{2}$ emission per weekday. (**b**) Values assigned to the covariate WD depending on the current weekday.

**Figure 3.** Illustration of the time correlation of the daily CO${}_{2}$ emissions per county with the autocorrelation function (ACF) of three different counties.

**Figure 7.** Boxplots of the RMSE obtained after 50 random choices of two counties per state for the different regression models (70% of the dataset is used for training).

**Table 1.** RMSE of the penalized regression model over graph with the Laplacian defined using an adjacency matrix based either on geodesic distances or on empirical correlations.

Root Mean Square Error (RMSE): Distances Versus Empirical Correlations

Perc. Train | Testing Set: Graph (Distance) | Testing Set: Graph (Correlation) | Validation Set: Graph (Distance) | Validation Set: Graph (Correlation) | Training Set: Graph (Distance) | Training Set: Graph (Correlation) |
---|---|---|---|---|---|---|
70% | 16.42 | 27.04 | 13.67 | 14.92 | 13.40 | 7.96 |

Root Mean Square Error (RMSE)

Perc. Train | Testing Set: Graph Reg. | Testing Set: Ridge | Testing Set: OLS | Validation Set: Graph Reg. | Validation Set: Ridge | Validation Set: OLS | Training Set: Graph Reg. | Training Set: Ridge | Training Set: OLS |
---|---|---|---|---|---|---|---|---|---|
50% | 35.65 | 41.43 | 42.10 | 16.80 | 17.86 | 17.65 | 9.13 | 6.74 | 6.55 |
60% | 30.02 | 36.77 | 41.41 | 15.02 | 19.60 | 19.73 | 21.73 | 6.52 | 6.52 |
70% | 16.42 | 22.65 | 49.52 | 13.67 | 17.13 | 16.44 | 13.40 | 7.94 | 7.02 |

**Table 3.** RMSE of the different regression models without the use of the lagged response variables as covariates.

Root Mean Square Error (RMSE) without Lagged Variables

Perc. Train | Testing Set: Graph Reg. | Testing Set: Ridge | Testing Set: OLS | Validation Set: Graph Reg. | Validation Set: Ridge | Validation Set: OLS | Training Set: Graph Reg. | Training Set: Ridge | Training Set: OLS |
---|---|---|---|---|---|---|---|---|---|
70% | 38.54 | 38.54 | 41.76 | 20.28 | 20.28 | 20.34 | 9.65 | 9.65 | 9.64 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tayewo, R.; Septier, F.; Nevat, I.; Peters, G.W.
Graph Regression Model for Spatial and Temporal Environmental Data—Case of Carbon Dioxide Emissions in the United States. *Entropy* **2023**, *25*, 1272.
https://doi.org/10.3390/e25091272
