Proceeding Paper

Measuring Extremal Clustering in Time Series †

Marta Ferreira
Centro de Matemática, Universidade do Minho, 4710-057 Braga, Portugal
Presented at the 9th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 12–14 July 2023.
Eng. Proc. 2023, 39(1), 64; https://doi.org/10.3390/engproc2023039064
Published: 6 July 2023
(This article belongs to the Proceedings of The 9th International Conference on Time Series and Forecasting)

Abstract

The propensity of data to cluster at extreme values is important for risk assessment. For example, heavy rain persisting over time can lead to catastrophic floods. The extremal index, a measure from Extreme Value Theory, quantifies the degree of clustering of high values in a time series. Inference about the extremal index requires a prior choice of values for tuning parameters, which affects the efficiency of existing estimators. In this work, we propose an algorithm that avoids this choice. Its performance is evaluated through simulation, and the method is illustrated with real data.

1. Introduction

The occurrence of extreme values can lead to risky situations. Climate change, the global economic and financial crisis resulting from the COVID-19 pandemic, and the war in Ukraine have drawn continuously growing attention from analysts to the risk of extreme phenomena. When extreme values persist over time, they form clusters, whose length can aggravate the phenomenon. Extreme Value Theory (EVT) provides a set of tools suited to this context. The extremal index is a measure of serial dependence that assesses the propensity of data to form clusters of extreme values. Figure 1 shows the maximum sea-surge heights, where clusters of high values are visible.
More precisely, let $\mathbf{X} = \{X_n\}_{n \geq 1}$ be a stationary sequence of random variables (r.v.) with a common marginal distribution function (d.f.) $F$, and denote $M_n = \max(X_1, \ldots, X_n)$. Then $\mathbf{X}$ has extremal index $\theta \in (0, 1]$ if, for each real $\tau > 0$, there exists a sequence of normalized levels $u_n$, i.e., satisfying $n(1 - F(u_n)) \to \tau$ as $n \to \infty$, such that $P(M_n \leq u_n) \to \exp(-\theta\tau)$. In the independent and identically distributed (i.i.d.) case, we have $P(M_n \leq u_n) \to \exp(-\tau)$ and thus $\theta = 1$. On the other hand, if $\theta = 1$, then the tail behavior of $\mathbf{X}$ resembles that of an i.i.d. sequence. Clustering of extreme values takes place whenever $\theta < 1$, and the smaller $\theta$ is, the larger the propensity for clusters to appear. Under some dependence conditions, $\theta$ is the arithmetic inverse of the mean cluster size (Hsing et al. [1] 1988).
Assuming $F$ is continuous, $U_i = F(X_i)$, $i = 1, \ldots, n$, are standard uniform r.v. and $P(-n \log(F(M_n)) \geq \tau) \approx P(n(1 - F(M_n)) \geq \tau) = P(M_n \leq u_n) \to \exp(-\theta\tau)$, with $F(M_n) = \max(U_1, \ldots, U_n)$. Thus, $Y_n = -n \log(F(M_n))$ and $Z_n = n(1 - F(M_n))$ asymptotically follow an exponential distribution with parameter $\theta$. The maximum likelihood estimator based on $Y_n$ was considered by Northrop ([2] 2015). More precisely, divide the time series $X_1, \ldots, X_n$ into $k_n$ blocks of length $b_n$, with $n = b_n k_n$. In the disjoint blocks case, the maximum of the $i$-th block is $M_{ni} = M_{((i-1)b_n+1):(ib_n)} = \max(X_{(i-1)b_n+1}, \ldots, X_{ib_n})$, $i = 1, \ldots, k_n$. In the sliding blocks case, the maximum of the $i$-th block is $M_{ni} = M_{i:(i+b_n-1)} = \max(X_i, \ldots, X_{i+b_n-1})$, $i = 1, \ldots, n - b_n + 1$. The Northrop estimator is given by
$$\tilde{\theta}^{N} = \left( \frac{1}{t_n} \sum_{i=1}^{t_n} \hat{Y}_{ni} \right)^{-1}, \qquad (1)$$
where $\hat{Y}_{ni} = -b_n \log(\hat{F}(M_{ni}))$ and $\hat{F}$ denotes the empirical d.f. estimating the usually unknown $F$, with $t_n = k_n$ or $t_n = n - b_n + 1$ depending on whether we are using disjoint or sliding blocks, respectively. Berghaus and Bücher ([3] 2018) considered
$$\tilde{\theta}^{B} = \left( \frac{1}{t_n} \sum_{i=1}^{t_n} \hat{Z}_{ni} \right)^{-1}, \qquad (2)$$
with $\hat{Z}_{ni} = b_n (1 - \hat{F}(M_{ni}))$, a more amenable formulation to derive the asymptotic properties. Here, we consider the Berghaus and Bücher estimator with bias adjustment given by
$$\hat{\theta} = \tilde{\theta}^{B} - 1/b_n. \qquad (3)$$
We also consider the sliding blocks version since it usually performs better (Northrop [2] 2015, Berghaus and Bücher [3] 2018).
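To fix ideas, estimator (3) under sliding blocks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exdex implementation: the function names `sliding_maxima` and `theta_bb_sliding` are hypothetical, and $\hat{F}$ is taken to be the plain empirical d.f. $\hat{F}(x) = n^{-1} \#\{j : X_j \leq x\}$, which may differ in detail from what the R package does.

```python
from bisect import bisect_right
from collections import deque


def sliding_maxima(x, b):
    """Maxima of all n - b + 1 sliding blocks of length b (monotone deque, O(n))."""
    window, maxima = deque(), []
    for i, value in enumerate(x):
        # Drop indices whose values can no longer be a block maximum.
        while window and x[window[-1]] <= value:
            window.pop()
        window.append(i)
        # Drop the front index once it falls outside the current block.
        if window[0] <= i - b:
            window.popleft()
        if i >= b - 1:
            maxima.append(x[window[0]])
    return maxima


def theta_bb_sliding(x, b):
    """Sketch of estimator (3): (mean of Z_hat)^(-1) - 1/b over sliding blocks.

    Assumption: F_hat is the plain empirical d.f. of the whole sample.
    """
    n = len(x)
    if not 1 <= b < n:
        raise ValueError("block length must satisfy 1 <= b < n")
    ordered = sorted(x)
    # Z_hat_i = b * (1 - F_hat(M_i)), with F_hat(m) = #{X_j <= m} / n.
    z = [b * (1.0 - bisect_right(ordered, m) / n) for m in sliding_maxima(x, b)]
    mean_z = sum(z) / len(z)
    if mean_z == 0.0:
        raise ValueError("degenerate sample: every block attains the sample maximum")
    return 1.0 / mean_z - 1.0 / b
```

Calling `theta_bb_sliding(x, b)` for a grid of block lengths `b` gives the kind of sample path of estimates that is inspected for stability regions.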
Observe that the estimators above depend on a single tuning parameter: the block length $b \equiv b_n$. This is an advantage of these methods, since most estimators of θ presented in the literature have two sources of uncertainty and thus two parameters to be defined in advance: the scheme used to identify clusters of high values and the high threshold above which the clusters occur. Among the best known are the Nandagopalan ([4] 1990), Runs and Blocks (Weissman and Novak, [5] 1998 and references therein), K-gaps (Süveges and Davison, [6] 2010), censored/truncated (Holešovský and Fusek, [7,8] 2020/22), and cycles (Ferreira and Ferreira, [9] 2018) estimators. Other estimators require a single tuning parameter, such as the intervals estimator, which needs a high threshold to be fixed (Ferro and Segers, [10] 2003). For estimators that, like the Northrop estimator above, only require the choice of the block length for maxima, we cite Gomes ([11] 1993), Ancona-Navarrete and Tawn ([12] 2000), and Ferreira and Ferreira ([13] 2022).
As already highlighted in the literature, there is no simple optimal methodology for choosing the block length and obtaining a single estimate of θ. In EVT, a bias–variance trade-off is typically observed in sample path estimates of rare event parameters. For block estimators, the bias decreases with b while the variance increases. A recurrent method is to plot the estimates obtained for successive block sizes and to identify visually, case by case, plateau zones of these estimates. Stability around a value indicates a reasonable estimate, and, because of the bias–variance trade-off, this stability region should in general correspond to values of b that are neither too small nor too large. Figure 2 plots the trajectory of estimates (full line) along with 95% confidence intervals (CI) (dashed line) obtained for each block length b from 1 to 100 in a random sample of size 1000 generated from a moving maximum model with standard Fréchet margins. A plateau region in the estimates is visible around the true value θ = 0.5 (horizontal line) for block sizes between 25 and 45. Observe the large variability for large values of b and the higher bias for small values of b.
Some methods have been proposed in the literature to help in the choice of tuning parameters based on the stability regions of the graph of estimates: see, e.g., Frahm et al. ([14] 2005), Gomes and Neves ([15] 2020), and their references. In particular, the algorithm proposed in Frahm et al. ([14] 2005) was implemented in the context of estimating bivariate tail dependence, and in Ferreira ([16] 2018), it was applied to extremal index estimators requiring the choice of a high threshold. In this work, our objective is to adapt the algorithm developed in Frahm et al. ([14] 2005) to estimator (3) in order to find a suitable plateau of estimates, taking into account the bias–variance trade-off. As a byproduct, this allows us to circumvent the selection of the single tuning parameter, namely the size of the blocks from which the sequence of maxima is extracted, as described above. The method is detailed in Section 2 and analyzed through simulation in Section 3. We end with an application to real data.

2. Estimation Method

Our proposed estimation of θ is based on the bias-corrected estimator $\hat{\theta}$ in (3), computed with sliding blocks, and on the heuristic plateau-finding algorithm of Frahm et al. ([14] 2005).
The algorithm is described in the following steps:
Step 1. Calculate estimates $\hat{\theta}_b$ from estimator (3) for $1 \leq b \leq t < n$;
Step 2. Smooth the results of the previous step by taking means of $2w + 1$ successive estimates, giving $\overline{\hat{\theta}}_1, \ldots, \overline{\hat{\theta}}_{t-2w}$; we consider bandwidth $w = \lfloor 0.02\, t \rfloor$;
Step 3. Define plateaus of length $m = \lfloor \sqrt{t - 2w} \rfloor$, i.e., $p_j = (\overline{\hat{\theta}}_j, \ldots, \overline{\hat{\theta}}_{j+m-1})$, $j = 1, \ldots, t - 2w - m + 1$;
Step 4. Compute the standard deviation $s$ of $\overline{\hat{\theta}}_1, \ldots, \overline{\hat{\theta}}_{t-2w}$ and choose the first plateau $p_j$ satisfying $\sum_{i=j+1}^{j+m-1} |\overline{\hat{\theta}}_i - \overline{\hat{\theta}}_j| \leq 2s$;
Step 5. The extremal index is estimated through $\frac{1}{m} \sum_{i=1}^{m} \overline{\hat{\theta}}_{j+i-1}$, i.e., taking the average of the estimates that constitute the plateau chosen in the previous step. This is denoted the plateau estimator.
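Steps 2 to 5 can be sketched as follows, taking the estimates of Step 1 as input. This is a minimal Python illustration, not the author's code: the function name `plateau_estimate` is hypothetical, the bandwidth and plateau length are read as $w = \lfloor 0.02\,t \rfloor$ and $m = \lfloor \sqrt{t - 2w} \rfloor$ in line with Frahm et al. ([14] 2005), the sample standard deviation (divisor $t - 2w - 1$) is assumed for $s$, and returning `None` when no plateau qualifies is a choice made here because the text does not specify that case.

```python
import math
from statistics import stdev


def plateau_estimate(estimates, w=None):
    """Return (plateau estimate, 0-based start index j), or None if no plateau qualifies.

    `estimates` holds theta_hat_b for b = 1, ..., t (Step 1 is done by the caller).
    """
    t = len(estimates)
    if w is None:
        w = math.floor(0.02 * t)  # assumed reading of the bandwidth
    k = t - 2 * w  # number of smoothed values
    if k < 2:
        raise ValueError("need at least two smoothed values")

    # Step 2: moving averages of 2w + 1 successive estimates.
    width = 2 * w + 1
    smoothed = [sum(estimates[i:i + width]) / width for i in range(k)]

    # Step 3: plateau length m = floor(sqrt(t - 2w)) (assumed reading).
    m = math.isqrt(k)

    # Step 4: first plateau whose absolute deviations from its first value sum to <= 2s.
    s = stdev(smoothed)  # sample standard deviation (assumption)
    for j in range(k - m + 1):
        plateau = smoothed[j:j + m]
        if sum(abs(v - plateau[0]) for v in plateau[1:]) <= 2 * s:
            # Step 5: average of the values forming the chosen plateau.
            return sum(plateau) / m, j

    return None  # behaviour when no plateau is found is not specified in the text
```

Returning the start index alongside the estimate makes it possible to read off the same plateau positions in another smoothed series, such as a sequence of confidence bounds.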
The estimators (1), (2), and (3) are already implemented, with the respective CIs, in the R package exdex (Northrop and Christodoulides [17] 2019). We use exdex to compute estimator (3) under sliding blocks and the respective lower and upper 95% CI bounds. We also apply Steps 1, 2, and 3 to the lower and upper bounds of the CIs. Once the plateau of θ estimates is chosen in Step 4, we pick the corresponding plateau in the CI limits, and in Step 5 we average the plateau values of the lower CI limit and, separately, those of the upper CI limit.
We are going to analyze the estimation method described above through simulation. The models that will be used are the following:
  • First-order auto-regressive model with standard Cauchy marginals (ARC), $X_i = \rho X_{i-1} + \epsilon_i$, $\{\epsilon_i\}$ i.i.d. having Cauchy d.f. with location 0 and scale $1 - |\rho|$, and $\theta = 1 - \rho$ if $\rho > 0$ (Chernick et al. [18], 1991); we consider $\rho = 0.9$ and $\theta = 0.1$;
  • An m-dependent model (MMU), $X_i = \max(U_i, U_{i+1}, \ldots, U_{i+m-1})$, $i \geq 1$, where $\{U_i\}$ is an i.i.d. sequence of r.v. (Newell [19] 1964) with $\theta = 1/m$; we consider $U_i$, $i \geq 1$, standard uniform r.v., and $m = 3$, and thus, $\theta = 1/3$;
  • Moving maxima Fréchet model (MMF), $X_i = \max_{j=0,\ldots,d} a_j Z_{i-j}$ with $a_j \geq 0$, $\sum_{j=0}^{d} a_j = 1$ and $\{Z_i\}$ i.i.d. standard Fréchet, where $\theta = \max_{j=0,\ldots,d} a_j$ (Weissman and Cohen [20] 1995); we consider $d = 2$ and parameters $a_0 = 1/3$, $a_1 = 1/6$, and $a_2 = 1/2$, and thus, $\theta = 1/2$;
  • ARCH(1) process, $X_i = (\beta + \alpha X_{i-1}^2)^{1/2} \epsilon_i$, with i.i.d. Gaussian innovations $\{\epsilon_i\}$, $\alpha = 0.7$, and $\beta = 2 \cdot 10^{-5}$, where $\theta = 0.721$ (Cai, [21] 2019);
  • First-order max auto-regressive (MAR), $X_i = \max(\phi X_{i-1}, \epsilon_i)$, $i \geq 1$, $X_0 = \epsilon_1/(1 - \phi)$, $\{\epsilon_i\}$ i.i.d. with standard Fréchet marginals and $\theta = 1 - \phi$ (Davis and Resnick [22] 1989); we consider $\phi = 0.1$ and $\theta = 0.9$;
  • An i.i.d. sequence (Ind) of Fréchet r.v. where θ = 1 .
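Three of the models listed above (MMU, MMF, and MAR) can be generated with the Python standard library alone, as in the sketch below. The function names are hypothetical. Standard Fréchet draws are obtained as $1/E$ with $E$ standard exponential, and in the MAR case the starting value $X_0$ uses a separate Fréchet draw divided by $1 - \phi$, which is an assumption made here for simplicity.

```python
import random


def std_frechet(rng):
    """Draw from the standard Frechet d.f. exp(-1/x) as 1/E with E ~ Exp(1)."""
    e = rng.expovariate(1.0)
    while e == 0.0:  # guard against a (practically impossible) zero draw
        e = rng.expovariate(1.0)
    return 1.0 / e


def simulate_mmu(n, m=3, rng=None):
    """MMU: X_i = max(U_i, ..., U_{i+m-1}) with i.i.d. standard uniform U_i."""
    rng = random.Random() if rng is None else rng
    u = [rng.random() for _ in range(n + m - 1)]
    return [max(u[i:i + m]) for i in range(n)]


def simulate_mmf(n, a=(1 / 3, 1 / 6, 1 / 2), rng=None):
    """MMF: X_i = max_j a_j Z_{i-j}, j = 0, ..., d, with standard Frechet Z_i."""
    rng = random.Random() if rng is None else rng
    d = len(a) - 1
    z = [std_frechet(rng) for _ in range(n + d)]
    # With 0-based i, z[i + d - j] plays the role of Z_{i-j}.
    return [max(a[j] * z[i + d - j] for j in range(d + 1)) for i in range(n)]


def simulate_mar(n, phi=0.1, rng=None):
    """MAR: X_i = max(phi * X_{i-1}, eps_i) with standard Frechet eps_i."""
    rng = random.Random() if rng is None else rng
    x_prev = std_frechet(rng) / (1.0 - phi)  # X_0 from a separate draw (assumption)
    x = []
    for _ in range(n):
        x_prev = max(phi * x_prev, std_frechet(rng))
        x.append(x_prev)
    return x
```

Passing a seeded generator, e.g. `simulate_mmf(1000, rng=random.Random(2023))`, makes a replication reproducible; the seed value is arbitrary.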

3. Simulation Study and Application

The simulation study is based on 1000 replications of samples of size 1000 generated from each of the models described above, which cover different values of θ. We apply the plateau method of Section 2 to estimate both θ and the respective 95% CI lower and upper bounds. Table 1 contains the overall results of the plateau method. Figure 3 shows the simulation results of $\hat{\theta}$ given in (3) for each block size b, together with the results of the plateau method. For each model, the plateau estimate (thicker gray horizontal full line) is located, as expected, in a plateau zone of the sample path of estimates plotted as a function of block size b (full black line). The plateau estimate is also close to the true value (blue horizontal full line). The limits of the 95% CIs estimated by the plateau method (thicker gray horizontal dotted–dashed lines) include the true value of θ in all cases except the independent model, where the average upper bound reported in Table 1 (0.9969) falls marginally below θ = 1. In the ARCH case, the estimates closest to the true value of θ occur for large values of b where the variability is very high, which makes it difficult to apply the plateau methodology. Even so, the root mean squared error (rmse) of 0.1126 is not very large. The independent model (Ind) has θ = 1, which lies on the boundary of the parameter space and typically makes statistical estimation difficult. Still, the plateau method shows relatively low bias and rmse. Observe also that in the MAR model, θ = 0.9 is quite close to the boundary value of 1, and the plateau method performs very well.
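The summary columns of Table 1 are linked by two simple identities: the bias is the mean of the estimates minus the true θ, and the squared rmse equals the squared bias plus the variance of the estimates. The short Python check below, using only the values reported in Table 1, illustrates this. The helper name `check_row` is hypothetical, and small gaps are expected from four-decimal rounding and from the divisor used for the standard deviation.

```python
import math

# (model, true theta, mean, bias, rmse, sd) as reported in Table 1
TABLE_1 = [
    ("ARC", 0.1, 0.1106, 0.0106, 0.0218, 0.0190),
    ("MMU", 1 / 3, 0.3587, 0.0254, 0.0494, 0.0424),
    ("MMF", 0.5, 0.5160, 0.0160, 0.0636, 0.0616),
    ("ARCH", 0.721, 0.7634, 0.0424, 0.1126, 0.1044),
    ("MAR", 0.9, 0.9017, 0.0017, 0.0827, 0.0827),
    ("Ind", 1.0, 0.9709, -0.0291, 0.0643, 0.0573),
]


def check_row(theta, mean, bias, rmse, sd):
    """Return the gaps |mean - theta - bias| and |sqrt(bias^2 + sd^2) - rmse|."""
    return abs(mean - theta - bias), abs(math.hypot(bias, sd) - rmse)


for model, *row in TABLE_1:
    bias_gap, rmse_gap = check_row(*row)
    print(f"{model:5s} bias gap = {bias_gap:.5f}  rmse gap = {rmse_gap:.5f}")
```

The decomposition also shows which component dominates the rmse for each model, for instance the standard deviation in the MAR case, where the reported bias is only 0.0017.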

Application to Real Data

We illustrate the method with the newlyn data set available in the R package exdex, consisting of 2894 sea-surge heights measured on the coast at Newlyn, Cornwall, UK, over the years 1971–1976. The observations correspond to the maximum hourly surge heights over contiguous periods of 15 h (left plot in Figure 4). Previous analyses of these data can be found in Northrop ([2] 2015) and references therein. The sample path of estimates from (3) as a function of block size b and the respective 95% confidence limits are plotted in the right graph of Figure 4. The horizontal full line corresponds to the plateau estimate of θ, 0.2577, and the horizontal dotted–dashed lines correspond to the plateau 95% CI estimate (0.2206, 0.2948).

4. Conclusions

This work addresses the estimation of the extremal index θ. This is an important measure in time series analysis, particularly in assessing risky phenomena, as it measures the propensity for clusters of extreme values to occur. The estimation of θ requires a prior setting of tuning parameter values, which affects the precision of the estimates. In this work, we presented an algorithm that allows estimation of θ free of tuning parameters. We applied this methodology to diverse models, and the results are encouraging in several cases. Future work will continue the study of this methodology and develop it further in order to improve its applicability to different types of models.

Funding

The research at CMAT was partially financed by Portuguese Funds through FCT—Fundação para a Ciência e a Tecnologia within the Projects UIDB/00013/2020 and UIDP/00013/2020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Hsing, T.; Hüsler, J.; Leadbetter, M.R. On the exceedance point process for a stationary sequence. Probab. Theory Relat. Fields 1988, 78, 97–112.
  2. Northrop, P.J. An efficient semiparametric maxima estimator of the extremal index. Extremes 2015, 18, 585–603.
  3. Berghaus, B.; Bücher, A. Weak convergence of a pseudo maximum likelihood estimator for the extremal index. Ann. Stat. 2018, 46, 2307–2335.
  4. Nandagopalan, S. Multivariate Extremes and Estimation of the Extremal Index. Ph.D. Thesis, University of North Carolina, Chapel Hill, NC, USA, 1990.
  5. Weissman, I.; Novak, S.Y. On blocks and runs estimators of the extremal index. J. Stat. Plan. Inference 1998, 66, 281–288.
  6. Süveges, M.; Davison, A.C. Model misspecification in peaks over threshold analysis. Ann. Appl. Stat. 2010, 4, 203–221.
  7. Holešovský, J.; Fusek, M. Estimation of the extremal index using censored distributions. Extremes 2020, 23, 197–213.
  8. Holešovský, J.; Fusek, M. Improved interexceedance-times-based estimator of the extremal index using truncated distribution. Extremes 2022, 25, 695–720.
  9. Ferreira, H.; Ferreira, M. Estimating the extremal index through local dependence. Ann. L’Institut Henri-Poincaré-Probab. Stat. 2018, 54, 587–605.
  10. Ferro, C.A.T.; Segers, J. Inference for clusters of extreme values. J. R. Stat. Soc. Ser. B 2003, 65, 545–556.
  11. Gomes, M. On the estimation of parameters of rare events in environmental time series. In Statistics for the Environment 2: Water Related Issues; Barnett, V., Turkman, K., Eds.; Wiley: Hoboken, NJ, USA, 1993; pp. 225–241.
  12. Ancona-Navarrete, M.A.; Tawn, J.A. A comparison of methods for estimating the extremal index. Extremes 2000, 3, 5–38.
  13. Ferreira, H.; Ferreira, M. A new blocks estimator for the extremal index. Commun. Stat.-Theory Methods 2022; in press.
  14. Frahm, G.; Junker, M.; Schmidt, R. Estimating the tail-dependence coefficient: Properties and pitfalls. Insur. Math. Econ. 2005, 37, 80–100.
  15. Gomes, D.P.; Neves, M.M. Extremal index blocks estimator: The threshold and the block size choice. J. Appl. Stat. 2020, 47, 2846–2861.
  16. Ferreira, M. Heuristic Tools for the Estimation of The Extremal Index: A Comparison of Methods. Revstat-Stat. J. 2018, 16, 115–136.
  17. Northrop, P.J.; Christodoulides, C. Exdex: Estimation of the Extremal Index. R Package Version 1.0.1. 2019. Available online: https://CRAN.R-project.org/package=exdex (accessed on 10 January 2023).
  18. Chernick, M.R.; Hsing, T.; McCormick, W.P. Calculating the extremal index for a class of stationary sequences. Adv. Appl. Probab. 1991, 23, 835–850.
  19. Newell, G.F. Asymptotic Extremes for m-Dependent Random Variables. Ann. Math. Stat. 1964, 35, 1322–1325.
  20. Weissman, I.; Cohen, U. The extremal index and clustering of high values for derived stationary sequences. J. Appl. Prob. 1995, 32, 972–981.
  21. Cai, J.J. Statistical inference on D(d)(un) condition and estimation of the Extremal Index. arXiv 2019, arXiv:1911.06674.
  22. Davis, R.; Resnick, S. Basic properties and prediction of max-ARMA processes. Adv. Appl. Probab. 1989, 21, 781–803.
Figure 1. Maximum hourly sea-surge heights (over contiguous 15-h time periods) in years 1971–1976 at the Newlyn Coast, Cornwall, UK.
Figure 2. Estimates of $\hat{\theta}$ given in (3) for successive values of block size $b = 1, \ldots, 100$ (full line) obtained for a sample simulated from a moving maxima Fréchet model with θ = 0.5 (horizontal line). The dashed lines correspond to 95% CI.
Figure 3. Simulation results: average of estimates of θ for each block size $b = 2, \ldots, 200$ using $\hat{\theta}$ in (3) (full black line) and average of respective 95% CI upper and lower bounds (dotted lines); plateau estimation of θ (thicker gray horizontal full line) and respective plateau estimates of 95% CI upper and lower bounds (thicker gray horizontal dotted–dashed lines). The true value of θ corresponds to the blue horizontal full line.
Figure 4. (Left) Maximum hourly (within successive 15-hour periods) surge height time series at Newlyn Coast, Cornwall, UK, in years 1971–1976; (Right) Sample path estimates obtained from estimator in (3) (full line) and respective 95% CI limits (dotted lines) for successive values of block size b, plateau estimate of θ (horizontal full line), and respective 95% CI plateau estimate limits (horizontal dotted–dashed lines).
Table 1. Simulation results of plateau method: average of θ estimates (mean), average of lower and upper 95% CI bound estimates, bias, root mean squared error (rmse), and standard deviation of θ estimates (sd).

| Model | mean | lower | upper | bias | rmse | sd |
|---|---|---|---|---|---|---|
| ARC (θ = 0.1) | 0.1106 | 0.0841 | 0.1372 | 0.0106 | 0.0218 | 0.0190 |
| MMU (θ = 1/3) | 0.3587 | 0.3042 | 0.4139 | 0.0254 | 0.0494 | 0.0424 |
| MMF (θ = 0.5) | 0.5160 | 0.4379 | 0.5940 | 0.0160 | 0.0636 | 0.0616 |
| ARCH (θ = 0.721) | 0.7634 | 0.6267 | 0.8920 | 0.0424 | 0.1126 | 0.1044 |
| MAR (θ = 0.9) | 0.9017 | 0.7779 | 0.9763 | 0.0017 | 0.0827 | 0.0827 |
| Ind (θ = 1) | 0.9709 | 0.8756 | 0.9969 | −0.0291 | 0.0643 | 0.0573 |
Ferreira, M. Measuring Extremal Clustering in Time Series. Eng. Proc. 2023, 39, 64. https://doi.org/10.3390/engproc2023039064