A Geometric Perspective on Functional Outlier Detection
Abstract
1. Introduction
1.1. Problem Setting and Proposal
1.2. Background and Related Work
2. Functional Outlier Detection as a Manifold-Learning Problem
2.1. The Two Notions of Functional Outliers: Off- and On-Manifold
- An off-manifold outlier if and ;
- An on-manifold outlier if and .
2.2. Methods
2.3. Examples of Functional Outlier Scenarios
2.3.1. Outlier Scenarios Based on Existing Taxonomies
2.3.2. General Functional Outlier Scenarios
3. Experiments
3.1. Qualitative Analysis of Real Data
3.2. Quantitative Analysis of Synthetic Data
3.2.1. Methods
3.2.2. Data-Generating Processes
3.2.3. Performance Assessment
3.2.4. Results
3.3. General Dissimilarity Measures and Manifold Methods
4. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
LOF | Local outlier factor |
FD(A) | Functional data (analysis) |
(F)PCA | (Functional) principal component analysis |
HDR | High-density region |
NHST | Null hypothesis significance testing |
ECG | Electrocardiogram |
MDS | Multidimensional scaling |
DTW | Dynamic time warping |
MS-plot | Magnitude–shape plot |
GOF | Goodness of fit |
DO | Directional outlyingness |
TV | Total variation depth |
ED | Elastic depth |
DGP | Data-generating process |
ECDF | Empirical cumulative distribution function |
AUC | Area under the ROC curve |
MBD | Modified band depth |
MEI | Modified epigraph index |
MO | Mean directional outlyingness |
VO | Variability of directional outlyingness |
Appendix A. Formalizing Phase Variation Scenarios
Appendix A.1. Phase Variation: Case I
Appendix A.2. Phase Variation: Case II
Appendix B. Sensitivity Analysis
Dataset | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ECG | 0.96 | 0.97 | 0.98 | 0.97 | 0.97 | 0.99 | 0.94 | 0.99 | 0.94 | 0.98 | 0.90 | 0.97 |
Octane | 0.94 | 0.99 | 0.96 | 0.98 | 0.97 | 0.99 | 0.98 | 0.99 | 0.98 | 0.99 | 0.96 | 0.98 |
Weather | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Tecator | 0.97 | 0.99 | 0.96 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 |
Wine | 0.98 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 |
Wasserstein | ||||||||||||
Dataset | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 | 2 vs. 5 | 5 vs. 20 |
ECG | 0.89 | 0.96 | 0.87 | 0.96 | 0.86 | 0.95 | 0.86 | 0.95 | 0.85 | 0.94 | 0.98 | 0.97 |
Octane | 0.96 | 0.98 | 0.95 | 0.99 | 0.96 | 0.98 | 0.94 | 0.97 | 0.94 | 0.97 | 0.95 | 0.96 |
Weather | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Tecator | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.96 | 0.99 |
Wine | 0.99 | 1.00 | 0.98 | 1.00 | 0.98 | 1.00 | 0.98 | 0.99 | 0.98 | 0.99 | 0.99 | 1.00 |
Appendix C. Quantitative Results on the fdaoutlier Package DGPs
Appendix D. Visualization Methods: roahd::outliergram, fdaoutlier::msplot, Translation–Phase–Amplitude Boxplots, Elastic Depth Boxplots, and HDR Boxplots
Appendix E. In-Depth Analysis of Simulation Model 7
Appendix F. Examples of the DGPs Used for the Quantitative Evaluation
Appendix G. ArrowHead Data
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Herrmann, M.; Scheipl, F. A Geometric Perspective on Functional Outlier Detection. Stats 2021, 4, 971-1011. https://doi.org/10.3390/stats4040057