# Information-Criterion-Based Lag Length Selection in Vector Autoregressive Approximations for I(2) Processes

by
Department of Business Administration and Economics, Bielefeld University, Universitätsstrasse 25, D-33615 Bielefeld, Germany
Econometrics 2023, 11(2), 11; https://doi.org/10.3390/econometrics11020011
Received: 15 December 2022 / Revised: 17 April 2023 / Accepted: 18 April 2023 / Published: 20 April 2023

## Abstract

When using vector autoregressive (VAR) models for approximating time series, a key step is the selection of the lag length. Often this is performed using information criteria, even though a theoretical justification is lacking in some cases. For stationary processes, the asymptotic properties of the corresponding estimators are documented in great generality in the book by Hannan and Deistler (1988). If the data-generating process is not a finite-order VAR, the selected lag length typically tends to infinity as a function of the sample size. For invertible vector autoregressive moving average (VARMA) processes, this typically happens roughly proportionally to $\log T$. The same approach to lag length selection is also followed in practice for more general processes, for example, unit root processes. In the I(1) case, the literature suggests that the behavior is analogous to the stationary case. For I(2) processes, no such results are currently known. This note closes this gap, concluding that information-criteria-based lag length selection for I(2) processes indeed has properties similar to those in the stationary case.
JEL Classification:
C13; C32

## 1. Introduction

The vector autoregressive (VAR) model is still the workhorse of empirical modeling of economic time series. The VAR model explains the process $(y_t)_{t \in \mathbb{Z}}$, $y_t \in \mathbb{R}^p$, using the difference equation
$y_t = A_1 y_{t-1} + \cdots + A_h y_{t-h} + u_t , \quad t \in \mathbb{Z} ,$
where $(u_t)_{t \in \mathbb{Z}}$ is a white noise process. Under the stability assumption $|A(z)| \neq 0$ for $|z| \le 1$, where $A(z) = I_p - A_1 z - \cdots - A_h z^h$, this equation has a unique stationary solution. Often, prior to VAR modeling, deterministic components such as a constant, a linear trend, and seasonal dummies are extracted.
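The stability condition can be checked numerically via the companion matrix: all solutions of $|A(z)| = 0$ lie outside the unit circle exactly when all eigenvalues of the companion matrix lie strictly inside it. The sketch below illustrates this; the coefficient matrices are arbitrary examples, not taken from the paper.

```python
import numpy as np

def is_stable(A_list):
    """Check |A(z)| != 0 for |z| <= 1 via the companion matrix.

    A_list contains the (p x p) coefficient matrices A_1, ..., A_h.
    """
    p = A_list[0].shape[0]
    h = len(A_list)
    companion = np.zeros((h * p, h * p))
    companion[:p, :] = np.hstack(A_list)
    if h > 1:
        companion[p:, :-p] = np.eye((h - 1) * p)
    # stable iff all eigenvalues are strictly inside the unit circle
    return bool(np.max(np.abs(np.linalg.eigvals(companion))) < 1.0)

A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
A2 = np.array([[0.1, 0.0], [0.05, 0.1]])
print(is_stable([A1, A2]))     # True: a stable VAR(2)
print(is_stable([np.eye(2)]))  # False: a unit root
```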
Oftentimes, this model is viewed as an approximation to the data-generating process. This happens, for instance, if the data are generated by a vector autoregressive moving average (VARMA) process, where $u_t = \varepsilon_t + B_1 \varepsilon_{t-1} + \cdots + B_g \varepsilon_{t-g}$ for white noise $(\varepsilon_t)_{t \in \mathbb{Z}}$ with expectation zero and variance $\Sigma > 0$. The VARMA system $(A(z), B(z))$, $B(z) = I_p + B_1 z + \cdots + B_g z^g$ (where the pair $(A(z), B(z))$ is left-coprime), is called invertible if $|B(z)| \neq 0$ for $|z| \le 1$. Under stability and invertibility, we obtain the VAR$(\infty)$ representation
$y_t = \sum_{j=1}^{\infty} \Phi_j y_{t-j} + \varepsilon_t .$
Under invertibility, we have
$\| \Phi_j \| \le C \rho_0^j ,$
where $\rho_0^{-1} > 1$ typically is the smallest modulus of the solutions of $|B(z)| = 0$. Thus, the coefficients converge to zero geometrically, implying that a VAR approximation seems plausible.
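As an illustration (with an arbitrary parameter value, not taken from the paper), the VAR$(\infty)$ coefficients of a scalar invertible MA(1) can be computed by power-series inversion, and the geometric decay with $\rho_0 = |b|$ verified numerically:

```python
import numpy as np

def var_inf_coeffs(b, n):
    """First n VAR(infinity) coefficients Phi_j of u_t = eps_t + b*eps_{t-1}.

    Inverts B(z) = 1 + b z as a power series: B(z)^{-1} = sum_j c_j z^j
    with c_0 = 1 and c_j = -b * c_{j-1}; then Phi_j = -c_j for j >= 1.
    """
    c = np.zeros(n + 1)
    c[0] = 1.0
    for j in range(1, n + 1):
        c[j] = -b * c[j - 1]
    return -c[1:]

b, n = 0.7, 30
phi = var_inf_coeffs(b, n)
rho0 = abs(b)  # reciprocal of the modulus of the zero of B(z) = 1 + b z
# geometric decay: |Phi_j| = rho0**j exactly in the scalar MA(1) case
print(np.allclose(np.abs(phi), rho0 ** np.arange(1, n + 1)))  # True
```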
When estimating a VAR model, one has to decide on the lag length, h. The choice of the lag length is typically based on adding a penalty term to the estimation accuracy as measured by the residual variance $\Sigma_h$ for a range of potential lag lengths $0 \le h \le H_T$, resulting in so-called information criteria. Here, the residual variance $\Sigma_h$ is estimated from the data as follows:1
$\hat{\Sigma}_h = T^{-1} \sum_{t=h+1}^{T} \hat{\varepsilon}_t(h) \hat{\varepsilon}_t(h)' , \quad \hat{\varepsilon}_t(h) = y_t - \hat{A}_1 y_{t-1} - \cdots - \hat{A}_h y_{t-h} .$
If no structural restrictions apply, the matrices $\hat{A}_1, \ldots, \hat{A}_h$ are typically estimated using OLS or the Yule–Walker equations (see, for example, Hannan and Deistler 1988, p. 211).2 In this paper, we only consider OLS estimation.
Information criteria are then defined as
$IC(h; C_T) = \log \det \hat{\Sigma}_h + C_T \frac{h p^2}{T} .$
Here, $h p^2$ denotes the number of parameters estimated in the conditional mean equation, and $C_T$ is a penalty factor. The most prominent choices are AIC ($C_T = 2$) and BIC ($C_T = \log T$). The minimal integer $\hat{h}$ minimizing IC is then the selected lag length.3
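To make the definitions concrete, the following sketch (simulated univariate data for simplicity; not code from the paper) computes the residual variance by OLS for each candidate lag length and selects the lag by minimizing $IC(h; C_T)$ with $C_T = \log T$, i.e., BIC:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 2000, 1
eps = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):          # simulate an AR(1) with coefficient 0.6
    y[t] = 0.6 * y[t - 1] + eps[t]

def sigma_hat(y, h):
    """Residual variance of an OLS VAR(h) fit (univariate case)."""
    T = len(y)
    if h == 0:
        resid = y.copy()
    else:
        X = np.column_stack([y[h - j - 1:T - j - 1] for j in range(h)])
        coef, *_ = np.linalg.lstsq(X, y[h:], rcond=None)
        resid = y[h:] - X @ coef
    return resid @ resid / T

H, CT = 8, np.log(T)           # BIC penalty factor
ic = [np.log(sigma_hat(y, h)) + CT * h * p**2 / T for h in range(H + 1)]
h_hat = int(np.argmin(ic))
print(h_hat)  # the true lag length is 1; BIC selects it with high probability
```

For a finite-order VAR such as this AR(1), BIC recovers the true lag length; for the VARMA processes discussed next, the selected lag length instead grows with the sample size.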
For the case of stationary invertible VARMA processes, the asymptotic properties are well documented in Section 6.6 of HD. Under appropriate assumptions on the noise (see Theorem 6.6.3 of HD) and upper bounds $H_T$ for the lag length, for $\hat{h}_{BIC}$ estimated using BIC, we have $\hat{h}_{BIC} / l(T) \to 1$ a.s. (Theorem 6.6.3 of HD), and the same limit holds for $\hat{h}_{AIC}$ in probability (Theorem 6.6.4 of HD), where $l(T) = \log T / (-2 \log \rho_0)$. Consequently, $\rho_0^{l(T)} = \exp( - \log \rho_0 \log T / (2 \log \rho_0) ) = T^{-1/2}$. Thus, in this case, asymptotically, the approximation error (2) is of the order $T^{-1/2}$.
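The identity $\rho_0^{l(T)} = T^{-1/2}$ can be verified directly; the small check below uses arbitrary values for $\rho_0$ and $T$:

```python
import math

# l(T) = log T / (-2 log rho0)  implies
# rho0**l(T) = exp(l(T) * log rho0) = exp(-log(T)/2) = T**(-1/2)
rho0, T = 0.7, 1000
lT = math.log(T) / (-2 * math.log(rho0))
print(abs(rho0 ** lT - T ** -0.5) < 1e-12)  # True
```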
Moreover, Theorem 7.4.7 of HD states that in a more general setting, if $(y_t)_{t \in \mathbb{Z}}$ is not generated by a finite autoregression, under the same assumptions on $(\varepsilon_t)_{t \in \mathbb{Z}}$ as above and with $\sum_{j=1}^{\infty} j^{1/2} \| \Phi_j \| < \infty$, we have
$IC(h; C_T) = \log \det \dot{\Sigma}_T + \tilde{L}_T(h) (1 + o_P(1)) , \quad \tilde{L}_T(h) = \frac{h p^2}{T} (C_T - 1) + \mathrm{tr}[ \Sigma^{-1} ( \Sigma_h - \Sigma ) ] ,$
uniformly in $0 \le h \le H_T = o( T / \log T )$, if $C_T > 1$. Here, $\dot{\Sigma}_T = \frac{1}{T} \sum_{t=1}^{T} \varepsilon_t \varepsilon_t'$ is independent of h. Now impose the following assumption on the sequence $\Sigma_h , h \in \mathbb{N}$:
Assumption 1.
For the sequence $\Sigma_h , h \in \mathbb{N} ,$ there exists a twice continuously differentiable function $\theta(h)$ with second derivative $\theta''(h)$ such that $\lim_{h \to \infty} \{ \mathrm{tr}[ \Sigma^{-1} ( \Sigma_h - \Sigma ) ] \} / \theta(h) = 1$ and
$\liminf_{h \to \infty} h^2 \theta''(h) / \theta(h) > 0 .$
Under this assumption, HD show that the optimal h (for $IC(h; C_T)$) is close to a minimizer $l(T)$ of the deterministic function $L_T(h) = \frac{h p^2}{T} (C_T - 1) + \theta(h)$.
The limitation to stationary processes with autoregressive coefficients $\Phi_j$ declining fast enough excludes persistent processes such as unit root processes, which are often encountered in economic time series. Nevertheless, it is common practice to select the lag length using information criteria also in the case of highly persistent processes. For integrated processes of the I(1) kind, this is backed by some results in the literature, such as Ng and Perron (1995), stating that lag length selection for VARMA models has properties analogous to the stationary case. These results are generalized to a larger class of processes in Lütkepohl and Saikkonen (1999).4
However, in the literature, there are currently no results available for the I(2) case of doubly integrated processes. This class of processes was introduced by Granger and Lee (1989), who illustrated the main underlying idea using an example from inventory processes: if the demand for a product (a flow variable) is modeled as an I(1) process (which is realistic in a number of cases; see Granger and Lee 1989), then the stock of the product at the producers will be I(2), as stock variables sum up the corresponding flow variables. In a macroeconomic framework, I(2) processes have been investigated, for example, by Johansen and Juselius and coworkers, who argue that I(2) processes play a role if the model contains nominal quantities, as inflation is often found to be integrated. Since inflation is the rate of change of prices, it follows that prices must then be I(2) (see, for example, Juselius 2006, chps. 16–18, or Johansen 1995, chp. 9).
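A minimal numerical version of the Granger and Lee inventory logic (simulated series, purely illustrative): if the flow is a random walk, i.e., I(1), then the cumulated stock must be differenced twice to recover the underlying shocks, i.e., it is I(2).

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
eps = rng.standard_normal(T)
demand = np.cumsum(eps)    # I(1) flow: a random walk
stock = np.cumsum(demand)  # I(2) stock: cumulates the flow

# one difference of the stock gives back the I(1) flow ...
print(np.allclose(np.diff(stock), demand[1:]))  # True
# ... and a second difference recovers the stationary shocks
print(np.allclose(np.diff(stock, 2), eps[2:]))  # True
```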
While these models have been used in the literature, the corresponding inferential methodology focuses on the case of autoregressions with a finite lag length. If the data-generating process is more general, such models are often an approximation, and the lag length must be selected. In practice, this is performed using information criteria such as AIC or BIC in this situation as well, although a theoretical justification is missing.
This paper closes the following gap: we provide a general result for the asymptotic behavior of lag length selection using information criteria for I(2) processes, linking the performance for the I(2) processes to a related stationary process for which the results of HD can be applied. The proof of this result can be found in the appendix. The result is illustrated using a simulation study in Section 3.

## 2. Integrated Processes

The key to the results for integrated processes is the insight that the OLS residuals of (1) and—for example—of the vector error correction equation
$\Delta y_t = \Pi y_{t-1} + \sum_{j=1}^{h-1} \Gamma_j \Delta y_{t-j} + u_t , \quad \Delta y_t = y_t - y_{t-1} ,$
are identical. Since the information criteria are a function of $\hat{\Sigma}_h$ (see (3)), they are invariant to linear transformations of the regressors as well as to transformations of the dependent variable obtained by adding linear functions of the regressors.
If in (6) $\Pi = 0$, this defines a VAR($h-1$) process $(\Delta y_t)_{t \in \mathbb{Z}}$ for which stationary theory applies under stability assumptions for $\Gamma(z) = I_p - \sum_{j=1}^{h-1} \Gamma_j z^j$. As the estimation of $\Pi$ is superconsistent in this case, one would suspect that the inclusion of $y_{t-1}$ as a regressor does not change the lag length selection substantially, except for adding one lag to account for the differencing. A similar reasoning applies in the case of reduced rank of $\Pi$.
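The invariance of the OLS residuals can be checked numerically. The sketch below (a univariate random walk is used purely for illustration) regresses $y_t$ on its own lags, and $\Delta y_t$ on $y_{t-1}, \Delta y_{t-1}, \ldots, \Delta y_{t-h+1}$, and confirms that the two residual series coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
T, h = 300, 3
y = np.cumsum(rng.standard_normal(T))  # a random walk, for concreteness

def ols_resid(Y, X):
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ coef

# levels regression: y_t on y_{t-1}, ..., y_{t-h}
X_lev = np.column_stack([y[h - j - 1:T - j - 1] for j in range(h)])
r_lev = ols_resid(y[h:], X_lev)

# VECM regression: dy_t on y_{t-1}, dy_{t-1}, ..., dy_{t-h+1}
dy = np.diff(y)
X_vecm = np.column_stack(
    [y[h - 1:T - 1]] + [dy[h - j - 2:T - j - 2] for j in range(h - 1)]
)
r_vecm = ols_resid(dy[h - 1:], X_vecm)

# the regressor sets span the same space, and the dependent variables
# differ by a linear function of the regressors: identical residuals
print(np.allclose(r_lev, r_vecm))  # True
```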
Paulsen (1984) shows that for finite-order autoregressive I(1) processes, the results for the stationary case can indeed be extended. Lütkepohl and Saikkonen (1999) extend some of the results to VAR($\infty$) I(1) processes, rectifying earlier incorrect proofs in the literature. Bauer and Wagner (2004) provide theory for the case of multifrequency I(1) processes, such as seasonally integrated processes, by using the vector-of-seasons representation linking the multifrequency I(1) case to the I(1) case.
In more detail, let, in the I(1) case, $\Pi = \alpha \beta'$, where $\beta \in \mathbb{R}^{p \times r}$, $\beta' \beta = I_r$, and5 $\beta_\perp \in \mathbb{R}^{p \times (p-r)}$, $\beta' \beta_\perp = 0$, $\beta_\perp' \beta_\perp = I_{p-r}$ denote matrices such that $B = [ \beta , \beta_\perp ]$ is orthonormal. Let $\tilde{y}_t = B' y_t = ( \tilde{y}_{1,t}' , \tilde{y}_{2,t}' )'$, $\tilde{y}_{1,t} \in \mathbb{R}^r$. Then assume that
$\begin{pmatrix} \tilde{y}_{1,t} \\ \Delta \tilde{y}_{2,t} \end{pmatrix} = u_t = \begin{pmatrix} u_{1,t} \\ u_{2,t} \end{pmatrix} = \tilde{c}_\bullet(L) \varepsilon_t ,$
where $| \tilde{c}_\bullet(z) | \neq 0$ for $|z| \le 1$, $\tilde{c}_\bullet(z) = \sum_{j=0}^{\infty} \tilde{c}_{\bullet,j} z^j$, and where it is assumed that $\sum_{j=0}^{\infty} j^{3/2} \| \tilde{c}_{\bullet,j} \| < \infty$. Under these assumptions, it follows that we obtain the VAR($\infty$) representation $\tilde{\Phi}(L) u_t = \varepsilon_t$, where $\tilde{\Phi}(z) = \tilde{c}_\bullet^{-1}(z)$, implying the VAR($\infty$) representation
$\tilde{\Phi}(L) u_t = [ \tilde{\Phi}(L) \, \mathrm{diag}( I_r , \Delta I_{p-r} ) B' ] y_t = \varepsilon_t$
for the I(1) process $(y_t)_{t \in \mathbb{Z}}$. Furthermore, the triangular representation (7) implies that the process $(\tilde{y}_{2,t})_{t \in \mathbb{Z}}$ is integrated and not cointegrated; compare, for example, Saikkonen and Lütkepohl (1996), sct. 2.
The corresponding VAR(h) approximation is given as $y_t = \sum_{j=1}^{h} A_j y_{t-j} + e_t(h)$. Premultiplying with $B'$ and using that (for $h > 1$) the set of regressors $y_{t-1}, y_{t-2}, \ldots, y_{t-h}$ (of dimension $hp$) can be linearly transformed into the set of regressors $u_{t-1}, \ldots, u_{t-h+1}$ (of dimension $p(h-1)$) plus $\tilde{y}_{2,t-1} \in \mathbb{R}^{p-r}$ and $u_{1,t-h} = \tilde{y}_{1,t-h} \in \mathbb{R}^r$, we obtain (for appropriate coefficient matrices)
$\tilde{y}_t = \tilde{\Phi}_{1,2} \tilde{y}_{2,t-1} + \sum_{j=1}^{h-1} \tilde{\Phi}_j u_{t-j} + A_{h,1} u_{1,t-h} + \tilde{e}_t(h)$
where $\tilde{e}_t(h) = B' e_t(h)$. Using (7) then leads to
$u_t = \Phi_{1,2} \tilde{y}_{2,t-1} + \sum_{j=1}^{h-1} \tilde{\Phi}_j u_{t-j} + A_{h,1} u_{1,t-h} + \tilde{e}_t(h) .$
Here, all variables are stationary except for $\tilde{y}_{2,t-1}$, which is I(1) and not cointegrated by design. Consequently, $\Phi_{1,2} = 0$ has to hold (compare equation (A.2) in Saikkonen and Lütkepohl 1996). Omitting this regressor, the approximation almost equals a VAR(h) approximation of $(u_t)_{t \in \mathbb{Z}}$, except for the inclusion of $u_{1,t-h}$ instead of the full vector $u_{t-h}$. We have $\Sigma_{h-1} \ge \Sigma_{h-1}^u \ge \Sigma_h \ge \Sigma_h^u$ for each $h > 1$, where $\Sigma_h^u$ denotes the residual variance in a VAR(h) approximation for $(u_t)_{t \in \mathbb{Z}}$. This shows that the residual variance of $(y_t)_{t \in \mathbb{Z}}$ and that of $(u_t)_{t \in \mathbb{Z}}$ are closely related.
Thus, the essential step in linking the asymptotic properties of the lag length selection in the I(1) case (obtained from inclusion of $\tilde{y}_{2,t-1}$) to the ones in the stationary case (assuming that $\tilde{y}_{2,t-1}$ is left out, which is infeasible in practice as the matrix $\beta$ is not known prior to estimation) lies in establishing that the inclusion of $\tilde{y}_{2,t-1}$ has negligible effects.
For the I(2) case, the same route can be taken. To this end, consider processes according to the following assumptions:
Assumption 2.
Let $(y_t)_{t \in \mathbb{Z}}$ be an I(2) process (not generated by a finite-order autoregression) obtained as a solution to the equation (with deterministic values $y_0, y_1$, and $y_2$)
$\Delta^2 y_t = \alpha \beta' y_{t-1} + \Gamma \Delta y_{t-1} + \sum_{j=1}^{\infty} \Pi_j \Delta^2 y_{t-j} + \varepsilon_t , \quad t \in \mathbb{Z} ,$
where
• $\alpha , \beta \in \mathbb{R}^{p \times r} , 0 \le r < p$ are full column rank matrices;
• The function $\Pi(z) = (1-z)^2 I_p - \alpha \beta' z - \Gamma (1-z) z - \sum_{j=1}^{\infty} (1-z)^2 \Pi_j z^j$ (converging absolutely for $|z| < 1 + \delta , \delta > 0$) fulfills that $| \Pi(z) | = 0$ implies $|z| > 1$ or $z = 1$;
• With $\beta_2 = \beta_\perp \eta_\perp$, $\alpha_2 = \alpha_\perp \xi_\perp$ (where $\alpha_\perp' \Gamma \beta_\perp = \xi \eta'$, $\eta , \xi \in \mathbb{R}^{(p-r) \times s}$ are of full column rank $s < p - r$), the matrix
$\alpha_2' \Big( I_p + \Gamma \beta ( \beta' \beta )^{-1} ( \alpha' \alpha )^{-1} \alpha' \Gamma - \sum_{j=1}^{\infty} \Pi_j \Big) \beta_2$
is nonsingular.
Furthermore, the process $(\varepsilon_t)_{t \in \mathbb{Z}}$ is independent identically distributed (iid) with mean zero, variance $\Sigma > 0$, and $E( \varepsilon_{t,j}^4 \log^+ | \varepsilon_{t,j} | ) < \infty , j = 1, \ldots, p$.
From these assumptions, it follows that the process $(y_t)_{t \in \mathbb{Z}}$ is I(2), with cointegration and potentially multi-cointegration occurring. The structure of the process is best seen in the triangular representation, which is equivalent to the one used above (see, for example, Li and Bauer 2020): assume that the stationary process $(u_t)_{t \in \mathbb{Z}}$ is generated according to $u_t = \tilde{c}_\bullet(L) \varepsilon_t$ (for suitable assumptions on $\tilde{c}_\bullet(z)$, including $| \tilde{c}_\bullet(1) | \neq 0$) and related to the process $(y_t)_{t \in \mathbb{Z}}$ via
$\begin{pmatrix} \tilde{y}_{1,t} - A \Delta \tilde{y}_{3,t} \\ \Delta \tilde{y}_{2,t} \\ \Delta^2 \tilde{y}_{3,t} \end{pmatrix} = u_t , \quad \tilde{y}_t = \begin{pmatrix} \tilde{y}_{1,t} \\ \tilde{y}_{2,t} \\ \tilde{y}_{3,t} \end{pmatrix} = \begin{pmatrix} \beta' \\ \beta_1' \\ \beta_2' \end{pmatrix} y_t ,$
where $\beta , \beta_1$, and $\beta_2$ are as in Assumption 2. Then, clearly, $(\tilde{y}_{3,t})_{t \in \mathbb{Z}} = \beta_2' (y_t)_{t \in \mathbb{Z}}$ is I(2) and not cointegrated, $(\tilde{y}_{2,t})_{t \in \mathbb{Z}} = \beta_1' (y_t)_{t \in \mathbb{Z}}$ is I(1) and not cointegrated, and $(\tilde{y}_{1,t})_{t \in \mathbb{Z}} = \beta' (y_t)_{t \in \mathbb{Z}}$ is I(1) and cointegrates with $\Delta \tilde{y}_{3,t}$ to stationarity for nonzero A. If $A = 0$, it is stationary.6 The cointegrating relations $\tilde{y}_{1,t} - A \Delta \tilde{y}_{3,t}$ between an I(1) process and a differenced I(2) process are termed multi-cointegration by Granger and Lee (1989).
It follows that (using $u_{3,t} = \Delta^2 \tilde{y}_{3,t} = \Delta \tilde{y}_{3,t} - \Delta \tilde{y}_{3,t-1}$)
$\Delta^2 \tilde{y}_t = \begin{pmatrix} \Delta^2 \tilde{y}_{1,t} \\ \Delta^2 \tilde{y}_{2,t} \\ \Delta^2 \tilde{y}_{3,t} \end{pmatrix} = \begin{pmatrix} \tilde{y}_{1,t} - \tilde{y}_{1,t-1} - \Delta \tilde{y}_{1,t-1} \\ \Delta \tilde{y}_{2,t} - \Delta \tilde{y}_{2,t-1} \\ \Delta^2 \tilde{y}_{3,t} \end{pmatrix} = u_t + \begin{pmatrix} A \Delta \tilde{y}_{3,t} - \tilde{y}_{1,t-1} - \Delta \tilde{y}_{1,t-1} \\ - \Delta \tilde{y}_{2,t-1} \\ 0 \end{pmatrix} = \begin{pmatrix} -I & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \tilde{y}_{t-1} + \begin{pmatrix} -I & 0 & A \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix} \Delta \tilde{y}_{t-1} + \underbrace{\begin{pmatrix} I & 0 & A \\ 0 & I & 0 \\ 0 & 0 & I \end{pmatrix}}_{\mathcal{A}} u_t .$
Denoting the last coefficient matrix in the preceding equation by $\mathcal{A}$, writing $\mathcal{A} u_t = c_\bullet(L) \varepsilon_t$, and using $c_\bullet^{-1}(L) = c_\bullet^{-1}(1) + c_\bullet^*(1) \Delta + c_\bullet^{**}(L) \Delta^2$, we obtain from premultiplying the above equation with $c_\bullet^{-1}(L)$:
$c_\bullet^{-1}(L) \Delta^2 \tilde{y}_t = \varepsilon_t - c_\bullet^{-1}(1) \begin{pmatrix} I \\ 0 \\ 0 \end{pmatrix} \beta' y_{t-1} + \Big[ c_\bullet^*(1) \begin{pmatrix} -I \\ 0 \\ 0 \end{pmatrix} \beta' + c_\bullet^{-1}(1) \begin{pmatrix} -\beta' + A \beta_2' \\ -\beta_1' \\ 0 \end{pmatrix} \Big] \Delta y_{t-1} + \tilde{c}(L) \Delta^2 \tilde{y}_{t-1} .$
Noting that $c_\bullet(0) = \mathcal{A}$ such that $B \mathcal{A} c_\bullet^{-1}(0) B' = I$, it follows that
$\Gamma = B \mathcal{A} \Big[ c_\bullet^*(1) \begin{pmatrix} -I \\ 0 \\ 0 \end{pmatrix} \beta' + c_\bullet^{-1}(1) \begin{pmatrix} -\beta' + A \beta_2' \\ -\beta_1' \\ 0 \end{pmatrix} \Big]$
and thus
$\alpha = - B \mathcal{A} c_\bullet^{-1}(1) \begin{pmatrix} I \\ 0 \\ 0 \end{pmatrix} , \quad \alpha_\perp' = - \begin{pmatrix} 0 & I & 0 \\ 0 & 0 & I \end{pmatrix} c_\bullet(1) \mathcal{A}^{-1} B' , \quad \alpha_\perp' \Gamma \beta_\perp = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} .$
This confirms the low rank decompositions $\alpha \beta'$ and $\alpha_\perp' \Gamma \beta_\perp = \xi \eta'$ as well as the ranks of these matrices. The VAR($\infty$) representation then follows from calculating the coefficients of $\tilde{c}(L)$ and $c_\bullet(L)^{-1}$. Therefore, the triangular representation immediately shows that the process $(y_t)_{t \in \mathbb{Z}}$ is I(2) and not integrated of higher order, but leaves the dynamics contained in $c_\bullet(z)$ unspecified. The VAR($\infty$) representation, on the other hand, focuses on the dynamics, but makes the derivation of the properties of the processes defined by the difference equation less immediate.
To illustrate how the constraint (8) excludes a higher order of integration in the solution process, assume for the moment that the process $(y_t)_{t \in \mathbb{Z}}$ is I(2) and consider $\alpha_2' \Delta^2 y_t$:
$\alpha_2' \Delta^2 y_t = \alpha_2' \Gamma \Delta y_{t-1} + \alpha_2' \sum_{j=1}^{\infty} \Pi_j \Delta^2 y_{t-j} + \alpha_2' \varepsilon_t .$
Here, $\alpha_2' \Gamma \beta_\perp = 0$ (as $\alpha_2'$ denotes the second block row of $\alpha_\perp'$), such that $\alpha_2' \Gamma \Delta y_{t-1} = \alpha_2' \Gamma \beta ( \beta' \beta )^{-1} \beta' \Delta y_{t-1}$. Now, using the first difference of the model equation, we see that
$( \alpha' \alpha )^{-1} \alpha' \Delta^3 y_t = \beta' \Delta y_{t-1} + ( \alpha' \alpha )^{-1} \Big( \alpha' \Gamma \Delta^2 y_{t-1} + \alpha' \sum_{j=1}^{\infty} \Pi_j \Delta^3 y_{t-j} + \alpha' \Delta \varepsilon_t \Big)$
such that $\beta' \Delta y_{t-1} = - ( \alpha' \alpha )^{-1} \alpha' \Gamma \Delta^2 y_{t-1} + \Delta v_t$ for some stationary process $(v_t)_{t \in \mathbb{Z}}$.
Combining these facts, we see that
$\alpha_2' \Delta^2 y_t = - \alpha_2' \Gamma \beta ( \beta' \beta )^{-1} ( \alpha' \alpha )^{-1} \alpha' \Gamma \Delta^2 y_{t-1} + \alpha_2' \sum_{j=1}^{\infty} \Pi_j \Delta^2 y_{t-j} + \alpha_2' \varepsilon_t + \gamma \Delta v_t$
for some matrix $\gamma$. Thus (the last equation defines $\tilde{a}(L)$),
$\alpha_2' \Delta^2 y_t + \alpha_2' \Gamma \beta ( \beta' \beta )^{-1} ( \alpha' \alpha )^{-1} \alpha' \Gamma \Delta^2 y_{t-1} - \alpha_2' \sum_{j=1}^{\infty} \Pi_j \Delta^2 y_{t-j} = \tilde{a}(L) \Delta^2 y_t = \alpha_2' \varepsilon_t + \gamma \Delta v_t .$
Consequently,
$\tilde{a}(L) \beta_2 \beta_2' \Delta^2 y_t = \tilde{a}(L) \Delta^2 y_t - \tilde{a}(L) [ \beta \beta' + \beta_1 \beta_1' ] \Delta^2 y_t = \alpha_2' \varepsilon_t + \gamma \Delta v_t - \tilde{a}(L) [ \beta \beta' + \beta_1 \beta_1' ] \Delta^2 y_t .$
The right-hand-side process is stationary with nonsingular spectrum at $z = 1$, as it equals $\alpha_2' \varepsilon_t + \Delta \tilde{v}_t$ for a stationary process $(\tilde{v}_t)_{t \in \mathbb{Z}}$. This shows that for the properties of $\beta_2' \Delta^2 y_t$, the square matrix
$\tilde{a}(1) \beta_2 = \alpha_2' \Big( I_p + \Gamma \beta ( \beta' \beta )^{-1} ( \alpha' \alpha )^{-1} \alpha' \Gamma - \sum_{j=1}^{\infty} \Pi_j \Big) \beta_2$
is of major importance: if it is nonsingular, then $\beta_2' \Delta^2 (y_t)_{t \in \mathbb{Z}}$ is stationary. If it is singular, such that the process still contains a unit root, then $\beta_2' \Delta^2 (y_t)_{t \in \mathbb{Z}}$ is an integrated process. Hence, the nonsingularity of (8) ascertains that the solution process $(y_t)_{t \in \mathbb{Z}}$ is I(2) and not I(3) (or integrated of even higher order).
For simplicity of presentation, no deterministic terms are included in the model formulation. Then, the corresponding result for I(2) processes is the main result of this note:
Theorem 1.
Let $(y_t)_{t \in \mathbb{Z}}$ be generated according to Assumption 2. Let the lag length selection be performed by minimizing the information criterion $IC(h; C_T)$ with penalty term $C_T \ge c > 1$, $C_T / T \to 0$, over the integers $0 \le h \le H_T = o( T^{1/3} )$. Let the minimal minimizing argument be denoted as $\hat{h}(C_T)$. Then,
(i)
$P( \hat{h}(C_T) \le M ) \to 0$ for each constant $M > 0$.
(ii)
If Assumption 1 holds for $\Sigma_h$ with function $\theta(h)$, then $\hat{h}(C_T) / h_T^* \to 1$ in probability, where $h_T^*$ minimizes the function $L_T(h; C_T) = h p^2 ( C_T - 1 ) / T + \theta(h)$.
(iii)
If $(y_t)_{t \in \mathbb{Z}}$ is an I(2) invertible VARMA process corresponding to the left-coprime pair $(A(z), B(z))$, then $- 2 \hat{h}(C_T) \log \rho_0 / \log T \to 1$ in probability, where $\rho_0^{-1} = \min \{ |z| : z \in \mathbb{C} , |B(z)| = 0 \}$.
The same results hold if the process is demeaned and/or detrended before estimation.
The result shows that the lag length selection essentially follows the minima of the function $L_T(h; C_T)$, analogously to the stationary case. In the VARMA situation, we obtain a lag length that is proportional to $\log T$ and depends on the location of the zero closest to the unit circle. Otherwise, the minima of $L_T(h; C_T)$ will increase with increasing sample size, as the penalty term $h p^2 ( C_T - 1 ) / T$ tends to zero for $T \to \infty$ while $\theta(h)$ decreases monotonically.
The assumptions here are stronger than needed in almost all respects. Convergence of $\Pi(z)$ on $|z| < 1 + \delta$ can be weakened to $\sum_{j=1}^{\infty} j^a \| \Pi_j \| < \infty$ for suitable $a > 0$. In addition, the iid assumption for $(\varepsilon_t)_{t \in \mathbb{Z}}$ can be weakened to martingale difference type assumptions as in HD. I use the stronger assumptions here in order to stay in line with Li and Bauer (2020) (in the following, LB), on which the proof (to be found in Appendix A) is based.7
Note that Assumption 2 excludes the finite lag length case in which $(y_t)_{t \in \mathbb{Z}}$ is generated by an autoregression of, say, order $h_0$. In this case, we have $\Sigma_h = \Sigma , h \ge h_0$. It follows that, asymptotically, $\hat{h}(C_T) \ge h_0$ with probability one if $C_T / T \to 0 , C_T \ge c > 1$. Furthermore, the probability of $\hat{h}(C_T) > h_0$ tends to zero for $C_T \to \infty$ as a function of T, as the penalty then dominates the estimation error for $\hat{\Sigma}_h$. This demonstrates that BIC is (weakly) consistent for VAR($h_0$) processes.
Inspecting the proof, it is obvious that analogous results also hold for the I(1) and the multifrequency I(1) case. Moreover, often the lag length is not selected uniformly for all component processes; instead, different lag lengths are selected for the individual components. Also in this case, the proof in the appendix can be adapted to show that the asymptotic properties in the integrated case are analogous to the ones in the stationary case obtained by suitably differencing the process.
Such an extension also pertains to VARX models. In this situation, it is also possible to be more general, such that the exogenous process does not need to possess a VAR($\infty$) representation with sufficiently fast decreasing coefficients, but may show (stationary or integrated) long memory, if the number of lags to be included is subject to an upper bound. The bottom line in all such cases is that the integration properties of the process are of less concern if differencing leads to stationarity with sufficiently fast decaying impulse responses.
Finally, note that similar results appear possible for processes of higher order of integration.

## 3. Simulations

In this section, the theory is illustrated with a simulation exercise involving MA(1) processes of the form
$u_t = \varepsilon_t - \theta \varepsilon_{t-1}$
for a standard Gaussian white noise process $(\varepsilon_t)_{t \in \mathbb{Z}}$. These processes are stationary, and invertible if all eigenvalues of $\theta$ are smaller than one in modulus (in the scalar case, $|\theta| < 1$). In the scalar case, the zero of the process closest to the unit circle corresponds to $\rho_0 = |\theta|$. Therefore, the required number of lags in an autoregressive approximation is controlled via $\theta$ and is expected to grow similarly to $\log T / ( - 2 \log \rho_0 )$ as a function of the sample size.
A total of $M = 1000$ realizations of the process with sample sizes $T = 100, 200, 400$, and $T = 800$ are simulated, and the parameter $\theta$ is varied on a regular grid $\theta = -0.9, -0.7, \ldots, 0.9$.
We thus investigate the increase of the average selected lag length for univariate MA(1) processes as a function of the sample size as well as the location of the zero closest to the unit circle. This is performed for the stationary process as well as for a doubly integrated MA(1) process obtained via double summation of the stationary process. The double integration is expected to add two lags compared to the lag length required for the stationary process $( u t ) t ∈ Z$ as $Δ 2 y t = u t$. Furthermore, the double integration potentially also adds a deterministic quadratic trend. Thus, lag length selection is performed on the corresponding detrended process in the I(2) case.
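The following self-contained sketch reproduces the flavor of this exercise for a single parameter value ($\theta = 0.5$, $T = 800$, fewer replications than in the paper; all choices are made for speed, not to reproduce the reported figures): it selects lag lengths by BIC for the stationary MA(1) and for its quadratically detrended double sum, and compares the averages.

```python
import numpy as np

rng = np.random.default_rng(3)
T, theta, M, H = 800, 0.5, 200, 12

def bic_lag(y, H):
    """BIC-selected lag length of a univariate OLS VAR(h) fit, h = 0..H."""
    T = len(y)
    ics = []
    for h in range(H + 1):
        if h == 0:
            resid = y.copy()
        else:
            X = np.column_stack([y[h - j - 1:T - j - 1] for j in range(h)])
            coef, *_ = np.linalg.lstsq(X, y[h:], rcond=None)
            resid = y[h:] - X @ coef
        ics.append(np.log(resid @ resid / T) + np.log(T) * h / T)
    return int(np.argmin(ics))

sel_stat, sel_i2 = [], []
t_grid = np.arange(T)
for _ in range(M):
    eps = rng.standard_normal(T)
    u = eps - theta * np.concatenate(([0.0], eps[:-1]))      # MA(1)
    y = np.cumsum(np.cumsum(u))                              # I(2) by double summation
    y = y - np.polyval(np.polyfit(t_grid, y, 2), t_grid)     # quadratic detrend
    sel_stat.append(bic_lag(u, H))
    sel_i2.append(bic_lag(y, H))

# the doubly integrated series should need roughly two more lags on average
print(np.mean(sel_stat), np.mean(sel_i2))
```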
Figure 1 shows the resulting average lag lengths selected using BIC. The main effects contained in the theorem are clearly visible: the selected lag length increases with the sample size. The logarithmic scale on the x-axis for plot (a) shows8 the roughly linear increase in $\log T$. In addition, larger absolute values of $\theta$ result in larger lag lengths. The average lag lengths selected with BIC are very similar to the optimal values $h_T^*$ (corresponding to the stationary process $(u_t)_{t \in \mathbb{Z}}$). The doubly integrated processes $(y_t)_{t \in \mathbb{Z}}$ require roughly two more lags in all cases compared to the stationary process $\Delta^2 (y_t)_{t \in \mathbb{Z}}$, except close to $\theta = 0.9$ (which is close to a pole-zero cancellation), where on average only one additional lag is selected.

## 4. Conclusions

In this paper, we investigate the properties of the commonly used information-criteria-based lag length selection for VAR approximations of I(2) processes. The discussion shows that the asymptotic properties of the lag length selection criteria are analogous to those in the standard stationary setting, but typically two extra lags are required to account for the double integration.
In the invertible VARMA case, this implies that the lag lengths selected using AIC or BIC tend to infinity as a function of the sample size, proportionally to $\log T$. The proportionality constant depends on the location of the zero closest to the unit circle. This is identical to the stationary case. The proof of the result indicates that this property is robust to a larger number of unit roots being present in the data-generating process.

## Funding

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation—Projektnummer 276051388) which is gratefully acknowledged. I acknowledge support for the publication costs by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.

## Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

## Conflicts of Interest

The author declares no conflict of interest.

## Appendix A. Proof of Theorem 1

Proof.
As in Johansen (1995), sct. 4.3, it follows that the process
$u_t = \begin{pmatrix} \tilde{y}_{1,t} - A \Delta \tilde{y}_{3,t} \\ \Delta \tilde{y}_{2,t} \\ \Delta^2 \tilde{y}_{3,t} \end{pmatrix} = c_\bullet(L) \varepsilon_t , \quad \tilde{y}_t = \begin{pmatrix} \tilde{y}_{1,t} \\ \tilde{y}_{2,t} \\ \tilde{y}_{3,t} \end{pmatrix} = \begin{pmatrix} \beta' \\ \beta_1' \\ \beta_2' \end{pmatrix} y_t ,$
is stationary for $A = \bar{\alpha}' \Gamma \beta_2$ and $\beta_1 = \beta_\perp \eta$ for an appropriate choice of the initial conditions $y_0, y_1$, and $y_2$. The MA representation, as in Johansen (1995), Theorem 4.6, shows that the difference for different values of $y_0, y_1$, and $y_2$ amounts to the addition of deterministic terms, which are easily dealt with.
The properties of $c • ( z )$ follow from the usual arguments, as in Johansen (1995), sct. 4.3 (see Theorem 4.6 on p. 58) noting that derivatives of absolutely convergent power series have the same radius of convergence as the original function.
The rest of the proof uses this triangular representation. Then,
$\tilde{y}_{1,t} = A \Delta \tilde{y}_{3,t} + u_{1,t} , \quad \Delta \tilde{y}_{2,t} = u_{2,t} , \quad \Delta^2 \tilde{y}_{3,t} = u_{3,t} ,$
where $\tilde{c}_\bullet(L)^{-1} u_t = \varepsilon_t$ is a VAR($\infty$) process. Note, however, that the triangular representation requires knowledge of $\beta , \beta_1 , \beta_2$ and hence is not operational in general. We use it here only as a technical device in the proof in order to separate processes of different orders of integration. In particular, the representation above, in conjunction with the nonsingularity of the spectrum of $(u_t)_{t \in \mathbb{Z}}$ at $z = 1$, directly implies that $\tilde{y}_{3,t}$ is I(2) and not cointegrated, $\tilde{y}_{2,t}$ is I(1) and not cointegrated, and $\tilde{y}_{1,t}$ is I(1) or I(0) and cointegrated with $\Delta \tilde{y}_{3,t}$.
Using this notation, (A.2) of LB states that9
$\Delta^2 y_t = \Phi_2 \tilde{y}_{2,t-1} + \Phi_3 \tilde{y}_{3,t-1} + \Theta \Delta \tilde{y}_{3,t-1} + \sum_{j=1}^{h} \Xi_j u_{t-j} + \Xi_{h+1,1:2} \begin{pmatrix} u_{1,t-h-1} \\ u_{2,t-h-1} \end{pmatrix} + \Xi_{h+2,1} ( u_{1,t-h-2} - A u_{3,t-h-1} ) + e_t(h+2) .$
The regressors on the right-hand side are a linear transformation of the regressors in the VAR($h+2$) approximation in levels. $\Phi_2 , \Phi_3$, and $\Theta$ subsume all nonstationary directions of the regressors. Since these are not cointegrated (as follows from the triangular representation) and all other quantities in the equation are stationary, their coefficients need to be zero, that is, $\Phi_2 = 0 , \Phi_3 = 0 , \Theta = 0$.
We rewrite the equation as $\Delta^2 y_t = \Psi z_t + \Xi_h U_{t-1,h} + e_t(h+2)$, where
$z_t = ( \tilde{y}_{2,t-1}' , \Delta \tilde{y}_{3,t-1}' , \tilde{y}_{3,t-1}' )'$
and $U_{t-1,h}$ collects all stationary regressors. Denoting $\langle a_t , b_t \rangle = T^{-1} \sum_{t=h+1}^{T} a_t b_t'$ and $W_t = [ z_t' , U_{t-1,h}' ]'$, we obtain, from the Frisch–Waugh–Lovell theorem,
$\hat{\Sigma}_{h+2} = \langle \Delta^2 y_t , \Delta^2 y_t \rangle - \langle \Delta^2 y_t , W_t \rangle \langle W_t , W_t \rangle^{-1} \langle W_t , \Delta^2 y_t \rangle = \langle \Delta^2 y_t , \Delta^2 y_t \rangle - \langle \Delta^2 y_t , U_{t-1,h} \rangle \langle U_{t-1,h} , U_{t-1,h} \rangle^{-1} \langle U_{t-1,h} , \Delta^2 y_t \rangle - \langle \Delta^2 y_t , z_t^\pi \rangle \langle z_t^\pi , z_t^\pi \rangle^{-1} \langle z_t^\pi , \Delta^2 y_t \rangle$
where $z_t^\pi = z_t - \langle z_t , U_{t-1,h} \rangle \langle U_{t-1,h} , U_{t-1,h} \rangle^{-1} U_{t-1,h}$. We introduce the scaling matrix $D_T = \mathrm{diag}( T^{-1/2} I_{r+s} , T^{-3/2} I_s )$. Then, $D_T \langle z_t , z_t \rangle D_T \to_d W$ for some random, almost surely positive definite matrix $W \in \mathbb{R}^{(r+2s) \times (r+2s)}$. Moreover,
The theory for stationary processes implies that the second factor is $O_P(1)$. The expectation of the first factor is $O( h / T ) = o(1)$ (compare LB, p. 17). Thus, the whole term is $O_P( h / T ) = o_P(1)$ uniformly in $h \le H_T$, implying that $( D_T \langle z_t^\pi , z_t^\pi \rangle D_T )^{-1} = ( D_T \langle z_t , z_t \rangle D_T )^{-1} ( 1 + o_P(1) ) = O_P(1)$.
Similarly,
$\langle \Delta^2 y_t , U_{t-1,h} \rangle \langle U_{t-1,h} , U_{t-1,h} \rangle^{-1} \langle U_{t-1,h} , z_t \rangle D_T = O_P( T^{-1/2} )$
since $\| \langle \Delta^2 y_t , U_{t-1,h} \rangle \langle U_{t-1,h} , U_{t-1,h} \rangle^{-1} - \Psi_h \| = O_P( h / T )$, according to Theorem 7.4.5 of HD, where $\langle \Psi_h U_{t-1,h} , z_t \rangle D_T = O_P( T^{-1/2} )$. Since $\langle \Delta^2 y_t , z_t \rangle D_T = O_P( T^{-1/2} )$, we obtain $\langle \Delta^2 y_t , z_t^\pi \rangle D_T = O_P( T^{-1/2} )$. Thus,
$\langle \Delta^2 y_t , z_t^\pi \rangle \langle z_t^\pi , z_t^\pi \rangle^{-1} \langle z_t^\pi , \Delta^2 y_t \rangle = ( \langle \Delta^2 y_t , z_t^\pi \rangle D_T ) ( D_T \langle z_t^\pi , z_t^\pi \rangle D_T )^{-1} D_T \langle z_t^\pi , \Delta^2 y_t \rangle = O_P( T^{-1/2} ) O_P(1) O_P( T^{-1/2} ) = O_P( 1 / T ) .$
Comparing with the expression for $\hat{\Sigma}_{h+2}$ from above, we obtain
$\hat{\Sigma}_{h+2} = \langle \Delta^2 y_t , \Delta^2 y_t \rangle - \langle \Delta^2 y_t , U_{t-1,h} \rangle \langle U_{t-1,h} , U_{t-1,h} \rangle^{-1} \langle U_{t-1,h} , \Delta^2 y_t \rangle + O_P( 1 / T ) .$
Now,
$\Delta^2 \tilde{y}_t = \begin{pmatrix} \Delta^2 u_{1,t} + A \Delta u_{3,t} \\ \Delta u_{2,t} \\ u_{3,t} \end{pmatrix} = \begin{pmatrix} I_r & 0 & A \\ 0 & I_s & 0 \\ 0 & 0 & I_{p-r-s} \end{pmatrix} u_t + \begin{pmatrix} - 2 u_{1,t-1} + u_{1,t-2} - A u_{3,t-1} \\ - u_{2,t-1} \\ 0 \end{pmatrix}$
implies
$| \hat{\Sigma}_{h+2} | = | \tilde{\Sigma}_h^u | + O_P( 1 / T ) , \quad \tilde{\Sigma}_h^u := \langle u_t , u_t \rangle - \langle u_t , U_{t-1,h} \rangle \langle U_{t-1,h} , U_{t-1,h} \rangle^{-1} \langle U_{t-1,h} , u_t \rangle .$
Thus, $| \hat{\Sigma}_h^u | \ge | \tilde{\Sigma}_h^u | \ge | \hat{\Sigma}_{h+2}^u |$ and $| \hat{\Sigma}_{h+2} | - | \tilde{\Sigma}_h^u | = O_P( T^{-1} )$ uniformly in $0 < h < H_T$, where $\hat{\Sigma}_h^u$ denotes the estimated residual variance of a long VAR approximation of $(u_t)_{t \in \mathbb{Z}}$ using lag length h.
This is sufficient for the results, as can be seen as follows: divergence to infinity for $h ^ ( C T )$ is obvious since the error term is at most of the same order as the penalty term.
To establish (ii), consider the case $C T → ∞$ first. Note that the order $O P ( T − 1 )$ together with $C T → ∞$ implies that $I C ( h ; C T ) = I C u ( h ; C T ) ( 1 + o P ( 1 ) )$, where $I C u$ denotes the criterion for $( u t ) t ∈ Z$. This is the prerequisite for (ii) and (iii) according to the arguments in HD on pp. 332–34 following Theorem 7.4.7. (ii): Let $h T *$ denote the minimizer over h of $L T ( h )$ (for $h ∈ R +$). Then, a mean value expansion around $h T *$ leads to
$L T ( h ^ T ; C T ) − L T ( h T * ; C T ) = 1 2 θ ″ ( h ¯ T ) ( h ^ T − h T * ) 2 = θ ″ ( h ¯ T ) ( h T * ) 2 2 θ ( h T * ) ( h ^ T h T * − 1 ) 2 θ ( h T * ) ≥ 0$
where $h ¯ T$ is an intermediate value. Now $I C ( h ; C T )$ is minimized at $h ^ T$, where $I C ( h ; C T ) − Σ ˙ T − L T ( h ; C T ) = O P ( T − 1 ) + o P ( θ ( h ) )$ uniformly in h, according to the above in combination with (7.4.22) of HD. This implies that either $h ^ T / h T * → 1$ or $θ ( h T * ) = O P ( T − 1 )$ such that the positive right hand side of the above can be reversed by the estimation error to obtain $I C ( h ^ T ; C T ) ≤ I C ( h T * ; C T )$.
However, $\theta(h_T^*) = O_P(T^{-1})$ contradicts $h_T^*$ minimizing $L_T(h; C_T) = h p^2 (C_T - 1)/T + \theta(h)$: in that case, moving to $h_T^* - 1$ would reduce $L_T$ by $p^2 (C_T - 1)/T$ minus a smaller error term, due to $C_T \to \infty$. This leads to $\hat{h}_T / h_T^* \to 1$.
For $h \to \infty$, the order $IC(h; C_T) = IC^u(h; C_T)(1 + o_P(1))$ also holds with $C_T$ remaining finite, as for AIC. Since the lag length selected for finite $C_T$ is at least as large as the one selected using BIC, $\hat{h}_T \to \infty$ follows, again implying the results.
All statements above also hold for demeaned and/or detrended processes. □
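The criterion analyzed in this proof can be made concrete with a short sketch: for each candidate lag length $h$, fit a VAR($h$) by OLS, record $\log|\hat{\Sigma}_h|$, add the penalty $h p^2 C_T / T$, and select the minimizing $h$ (smallest $h$ in case of draws). The function name and the simulated VMA(1) example below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def var_ic_lag_selection(y, H, C_T):
    """Return the lag length h in 1..H minimizing
    IC(h; C_T) = log|Sigma_hat_h| + h * p^2 * C_T / T_eff."""
    T_full, p = y.shape
    best_h, best_ic = None, np.inf
    for h in range(1, H + 1):
        # Use the same effective sample (t > H) for every h so that the
        # criteria are comparable across lag lengths (one common variant).
        Y = y[H:]
        X = np.hstack([y[H - j:T_full - j] for j in range(1, h + 1)])
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)   # OLS fit of VAR(h)
        resid = Y - X @ beta
        T_eff = Y.shape[0]
        Sigma = resid.T @ resid / T_eff                # residual covariance estimate
        ic = np.log(np.linalg.det(Sigma)) + h * p**2 * C_T / T_eff
        if ic < best_ic:                               # strict '<': draws go to the smallest h
            best_h, best_ic = h, ic
    return best_h

# Illustration: a bivariate VMA(1) process, whose true VAR order is infinite,
# so the selected lag length should grow with the sample size.
rng = np.random.default_rng(0)
T, p = 800, 2
eps = rng.standard_normal((T + 1, p))
y = eps[1:] + 0.9 * eps[:-1]
h_bic = var_ic_lag_selection(y, H=20, C_T=np.log(T))  # BIC penalty C_T = log T
h_aic = var_ic_lag_selection(y, H=20, C_T=2.0)        # AIC penalty C_T = 2
```

Since the AIC penalty ($C_T = 2$) is smaller than the BIC penalty ($C_T = \log T$), the AIC choice is never below the BIC choice, in line with the argument at the end of the proof.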

## Notes

1. Other variants exist, for example, using the same time points in the summation for all models; compare with Kilian and Lütkepohl (2017).
2. Hannan and Deistler (1988) will, in the following, be abbreviated as HD.
3. In the unlikely case of draws, the smallest integer $h$ achieving the minimum is selected.
4. Lütkepohl and Saikkonen (1999) also correct an argument contained in Ng and Perron (1995).
5. Here and below we use the notation $X_\perp$ for a full column rank matrix whose columns span the orthonormal complement of the column space of a full column rank matrix $X$.
6. Since we fix the values $y_0, y_1$, and $y_2$, the processes will only be stationary for appropriate choices of these values; otherwise, the process is the sum of a stationary process and the effects of the initial values.
7. As pointed out by a reviewer (for which I am grateful), the moment condition in HD is slightly stronger than assuming finite fourth moments as used in LB.
8. Note that the optima $h_T^*$ correspond to $\log T / (-2 \log \rho_0)$ rounded to the nearest integer.
9. LB uses the lag length $h + 2$ instead of $h$.
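Note 8 relates the deterministic optima $h_T^*$ of $L_T(h) = h p^2 C_T / T + \theta(h)$ to $\log T / (-2 \log \rho_0)$ when $\theta(h)$ decays geometrically. A toy calculation of the grid minimizer illustrates this logarithmic growth; the constants $c = 1$, $\rho_0 = 0.5$, and the grid bound are illustrative assumptions, and the exact minimizer can deviate from the rounded ratio by lower-order terms:

```python
import math

def h_star(T, rho0, p=2, c=1.0, C_T=None, H=200):
    """Grid minimizer over h = 1..H of L_T(h) = h * p^2 * C_T / T + c * rho0**(2h)."""
    C_T = math.log(T) if C_T is None else C_T   # BIC-type penalty by default
    L = lambda h: h * p**2 * C_T / T + c * rho0 ** (2 * h)
    return min(range(1, H + 1), key=L)

rho0 = 0.5
for T in (100, 200, 400, 800):
    # grid minimizer vs. the rounded ratio from note 8
    print(T, h_star(T, rho0), round(math.log(T) / (-2 * math.log(rho0))))
```

Doubling $T$ increases the minimizer by roughly $\log 2 / (-2 \log \rho_0)$, i.e., far more slowly than any power of $T$.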

## References

1. Bauer, Dietmar, and Martin Wagner. 2004. Autoregressive Approximations to MFI(1) Processes. Working Paper No. 174, Reihe Ökonomie/Economics Series. Vienna: Institut für Höhere Studien (IHS).
2. Granger, Clive W. J., and Tae-Hwy Lee. 1989. Investigation of Production, Sales and Inventory Relationships Using Multicointegration and Non-Symmetric Error Correction Models. Journal of Applied Econometrics 4: 145–59.
3. Hannan, Edward James, and Manfred Deistler. 1988. The Statistical Theory of Linear Systems. New York: John Wiley.
4. Johansen, Søren. 1995. Likelihood-Based Inference in Cointegrated Vector Auto-Regressive Models. Oxford: Oxford University Press.
5. Juselius, Katarina. 2006. The Cointegrated VAR Model: Methodology and Applications. Oxford: Oxford University Press.
6. Kilian, Lutz, and Helmut Lütkepohl. 2017. Structural Vector Autoregressive Analysis. Cambridge: Cambridge University Press.
7. Li, Yuanyuan, and Dietmar Bauer. 2020. Modeling I(2) processes using vector autoregressions where the lag length increases with the sample size. Econometrics 8: 38.
8. Lütkepohl, Helmut, and Pentti Saikkonen. 1999. Order Selection in Testing for the Cointegrating Rank of a VAR Process. In Cointegration, Causality, and Forecasting: A Festschrift in Honour of Clive W. J. Granger. Oxford: Oxford University Press, pp. 168–99.
9. Ng, Serena, and Pierre Perron. 1995. Unit Root Tests in ARMA Models with Data-Dependent Methods for the Selection of the Truncation Lag. Journal of the American Statistical Association 90: 268–81.
10. Paulsen, Jostein. 1984. Order determination of multivariate autoregressive time series with unit roots. Journal of Time Series Analysis 5: 115–27.
11. Saikkonen, Pentti, and Helmut Lütkepohl. 1996. Infinite-Order Cointegrated Vector Autoregressive Processes: Estimation and Inference. Econometric Theory 12: 814–44.
Figure 1. Average selected lag length for the stationary processes (blue) and the $I ( 2 )$ processes (black, dashed). Red stars denote the optimizers $h T *$. (a) For $θ = − 0.9$ over sample sizes $T = 100 , 200 , 400 , 800$. (b) For sample size $T = 800$ over different values of $θ$.
