Article

Statistical Prediction of Future Sports Records Based on Record Values

Institute of Statistics, RWTH Aachen University, D-52056 Aachen, Germany
*
Author to whom correspondence should be addressed.
Stats 2023, 6(1), 131-147; https://doi.org/10.3390/stats6010008
Submission received: 29 November 2022 / Revised: 3 January 2023 / Accepted: 6 January 2023 / Published: 11 January 2023
(This article belongs to the Section Statistical Methods)

Abstract

Point prediction of future record values based on sequences of previous lower or upper records is considered by means of the method of maximum product of spacings, where the underlying distribution is assumed to be a power function distribution and a Pareto distribution, respectively. Moreover, exact and approximate prediction intervals are discussed and compared with regard to their expected lengths and their percentages of coverage. The focus is on deriving explicit expressions in the point and interval prediction procedures. Predictions and forecasts are of interest, e.g., in sports analytics, which is gaining more and more attention in several sports disciplines. Previous works on forecasting athletic records have mainly been based on extreme value theory. The presented statistical prediction methods are exemplarily applied to data from various disciplines of athletics as well as to data from American football based on fantasy football points according to the points per reception scoring scheme. The results are discussed along with basic assumptions and the choice of underlying distributions.

1. Introduction

The collection and analysis of data are gaining more and more attention in many different fields, including several sports disciplines. For example, in US sports such as baseball, basketball and American football, data regarding many aspects of a game, team or player have been recorded for many years. This trend can be observed in other areas of sports, such as athletics, as well. Thus, statistical methods and analysis have become a topic of growing interest, and predictions and forecasts are an interesting part of sports analysis.
Among many other authors and articles on statistical methodology in sports, we refer to some of them. Based on results for order statistics and record values, Noubary [1] proposed methods to forecast athletic records. Several authors applied extreme value theory. In Einmahl and Magnus [2], the authors examined and compared world records in athletics based on extreme value theory, and they forecast the ultimate record in a specific discipline. By this, they initiated a series of subsequent works in this area. Noubary [3] was concerned with athletic records and calculated the probabilities of future best performances based on tail estimation in extreme value theory. Einmahl and Smeets [4] estimated ultimate world records for the 100 m running of women and men. Henriques-Rodrigues et al. [5] focused on the semiparametric estimation of the extreme value index and of the right endpoint of the underlying distribution in case it was finite. The ultimate record in men’s long jump was examined in Fraga Alves et al. [6]. Annual best times were used in Stephenson and Tawn [7] to fit a model based on extreme value distributions to come up with extrapolated values which served as predicted records. Adam and Tawn [8] considered annual records in athletics and swimming disciplines and fitted the parameters of the generalized extreme value distribution as underlying distribution.
A variety of other approaches for data from, e.g., baseball, football and athletics, can be found in the literature. As examples, we refer to Albert et al. [9] and Wunderlich and Memmert [10].
In contrast to forecasting procedures via extreme value theory in the sense of a long-term prediction by determining an extreme value distribution, the focus of our approach is on statistical prediction of the next record or records by means of previous record values in a particular sports discipline. We first repeat properties of upper and lower record values while focusing on lower records. We use the maximum product of spacings prediction procedure to derive a point predictor of the next record value based on previous ones, which are usually recorded. In the statistical model, we assume the lower/upper record values to be based on a sequence of independent and identically distributed random variables having a power function distribution/Pareto distribution due to our focus on explicit statistical prediction methods. Then, we derive some exact and approximate prediction intervals. We compare these intervals with regard to their expected lengths, and we analyze their percentages of coverage in a simulation study. We apply our results to real data from women’s 100 m and various other disciplines of athletics as well as to American football data by considering the evaluation of players by assigning so-called fantasy football points to players’ actions. Moreover, we discuss the distributional assumptions.

2. Prediction of Record Values

We first recall some basic properties of record values. For details, we refer to the monograph of Arnold et al. [11].
Let $(X_i)_{i \in \mathbb{N}}$ be a sequence of independent and identically distributed (iid) random variables with absolutely continuous cumulative distribution function (cdf) $F$ and probability density function (pdf) $f$. Here, $\mathbb{N}$ denotes the set of positive integers. The random variables
$$T_u(1) = 1 \quad \text{and} \quad T_u(n+1) = \min\{\, j > T_u(n) : X_j > X_{T_u(n)} \,\}, \quad n \in \mathbb{N},$$
are called upper record times. The random variables $R_n = X_{T_u(n)}$, $n \in \mathbb{N}$, are then called upper record values. Analogously, the lower record times
$$T_l(1) = 1 \quad \text{and} \quad T_l(n+1) = \min\{\, j > T_l(n) : X_j < X_{T_l(n)} \,\}, \quad n \in \mathbb{N},$$
are used to define the sequence of lower record values $L_n = X_{T_l(n)}$, $n \in \mathbb{N}$. The joint densities for the first $r \in \mathbb{N}$ upper and lower record values $\mathbf{R} = (R_1, \dots, R_r)$ and $\mathbf{L} = (L_1, \dots, L_r)$ are given by
$$f^{\mathbf{R}}(x_1, \dots, x_r) = f(x_r) \prod_{i=1}^{r-1} \frac{f(x_i)}{1 - F(x_i)} \tag{1}$$
where $-\infty < x_1 < x_2 < \dots < x_r < \infty$ and
$$f^{\mathbf{L}}(x_1, \dots, x_r) = f(x_r) \prod_{i=1}^{r-1} \frac{f(x_i)}{F(x_i)} \tag{2}$$
where $\infty > x_1 > x_2 > \dots > x_r > -\infty$, respectively.
It is well known that, in distribution, the upper record values $R_1, \dots, R_r$ as well as the lower record values $L_1, \dots, L_r$ can be simultaneously expressed by independent and identically standard exponentially distributed random variables $Z_1, \dots, Z_r$ via
$$(R_1, \dots, R_r) \overset{d}{=} \left( F^{-1}\!\left( 1 - \exp\!\left( -\sum_{i=1}^{j} Z_i \right) \right) \right)_{1 \le j \le r}, \qquad (L_1, \dots, L_r) \overset{d}{=} \left( F^{-1}\!\left( \exp\!\left( -\sum_{i=1}^{j} Z_i \right) \right) \right)_{1 \le j \le r}, \tag{3}$$
where $\overset{d}{=}$ denotes equality in distribution (see Arnold et al. [11]).
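The definitions of record times and record values above can be illustrated with a short Python sketch. The function names and the sample values are hypothetical and only serve as an illustration; they are not data from this article.

```python
def lower_records(xs):
    """Return the lower record values L_1, L_2, ... of the sequence xs."""
    records = []
    for x in xs:
        # The first observation is a record by definition (T_l(1) = 1);
        # afterwards a new lower record needs a strictly smaller value.
        if not records or x < records[-1]:
            records.append(x)
    return records


def upper_records(xs):
    """Return the upper record values R_1, R_2, ... of the sequence xs."""
    records = []
    for x in xs:
        # Same rule as above with the inequality reversed.
        if not records or x > records[-1]:
            records.append(x)
    return records


# Made-up running times in seconds, for illustration only.
times = [11.2, 11.0, 11.1, 10.9, 11.0, 10.8]
print(lower_records(times))
print(upper_records(times))
```

Ties are not counted as records here, which matches the strict inequalities in the definition of the record times.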
Aiming at deriving explicit statistical procedures, we choose power function distributions in case of lower records and Pareto distributions in case of upper records as underlying distributions; both families of distributions turn out to have a good fit to various data sets in sports. For parameters $\lambda > 0$ and $\beta > 0$, the cdf and pdf of the power function distribution $Pow(\lambda, \beta)$ are given by
$$F_{\lambda,\beta}(x) = \left( \frac{x}{\lambda} \right)^{\beta} \quad \text{and} \quad f_{\lambda,\beta}(x) = \frac{\beta x^{\beta - 1}}{\lambda^{\beta}}, \quad x \in (0, \lambda), \tag{4}$$
and, for parameters $\mu > 0$ and $\gamma > 0$, the cdf and pdf of the Pareto distribution $Par(\mu, \gamma)$ are given by
$$F_{\mu,\gamma}(x) = 1 - \left( \frac{\mu}{x} \right)^{\gamma} \quad \text{and} \quad f_{\mu,\gamma}(x) = \frac{\gamma \mu^{\gamma}}{x^{\gamma + 1}}, \quad x \in (\mu, \infty). \tag{5}$$
In the sequel, we assume the parameters $\beta$ and $\gamma$ to be unknown and the (threshold) parameters $\lambda$ and $\mu$ to be fixed.
There is a close relationship between power function and Pareto distributions, namely, if $Y \sim Par(\mu, \gamma)$ holds, then $Y^{-1} \sim Pow(\lambda, \beta)$ follows, where $\lambda = 1/\mu$ and $\beta = \gamma$. Because of this relationship, we get consistent results for running (looking for minimum times), throwing and jumping events in athletics (looking for maximum lengths, widths and heights) without transforming times into velocities, as, e.g., in Einmahl and Magnus [2].
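The relationship between the two families can be checked numerically. The following Python sketch evaluates the cdf of $1/Y$ for $Y \sim Par(\mu, \gamma)$ and compares it with the cdf of $Pow(1/\mu, \gamma)$; the function names and the parameter values $\mu = 2$, $\gamma = 3$ are arbitrary choices for illustration.

```python
import math


def pow_cdf(x, lam, beta):
    """cdf of the power function distribution Pow(lam, beta)."""
    if x <= 0:
        return 0.0
    if x >= lam:
        return 1.0
    return (x / lam) ** beta


def par_cdf(x, mu, gamma):
    """cdf of the Pareto distribution Par(mu, gamma)."""
    if x <= mu:
        return 0.0
    return 1.0 - (mu / x) ** gamma


def reciprocal_cdf(x, mu, gamma):
    """cdf of 1/Y at x > 0 for Y ~ Par(mu, gamma): P(1/Y <= x) = P(Y >= 1/x)."""
    return 1.0 - par_cdf(1.0 / x, mu, gamma)


mu, gamma = 2.0, 3.0  # arbitrary illustrative parameters
for x in (0.05, 0.1, 0.25, 0.4, 0.5, 0.8):
    # Both columns should agree: 1/Y ~ Pow(1/mu, gamma).
    print(x, reciprocal_cdf(x, mu, gamma), pow_cdf(x, 1.0 / mu, gamma))
```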

2.1. Point Prediction

We consider point prediction of future record values based on previous records in a sequence of iid random variables. Among others, Kaminsky and Rhodin [12] presented the maximum likelihood predictor (MLP) and Volovskiy and Kamps [13] introduced the maximum observed likelihood prediction (MOLP). Based on a Pareto distribution, the MLP and MOLP for upper record values were shown in Volovskiy and Kamps [14]. However, both methods share the disadvantage that neither the next upper nor the next lower record value can be predicted reasonably, since the methods lead to the last record as predictor of the next one. Raqab et al. [15] derived the MLP and the so-called conditional median predictor. Moreover, there are some Bayesian results for estimation and prediction in the upper record Pareto setting in Ahmadi and Doostparast [16], Madi and Raqab [17] and Raqab et al. [15]. For further references, we refer to Volovskiy and Kamps [13] and Volovskiy and Kamps [14].
We applied the maximum product of spacings prediction (MPSP) procedure for lower record values along the lines of Volovskiy and Kamps [14] for upper record values. The idea underlying the MPSP method is to apply a transformation to the sequence $\mathbf{L} = (L_1, \dots, L_s)$ of observed records $L_1, \dots, L_r$, $r < s$, as well as the yet-to-be-observed record values up to $L_s$, such that the transformed random variables are distributed as order statistics $U_{1:s-1}, \dots, U_{s-1:s-1}$ from an iid sample of $s - 1$ uniform random variables. Then, by considering the spacings of the observed data after applying the transformation, the MPSP method chooses the prediction value $\pi^{(s)}_{MPSP}$ for the record value $L_s$ that renders the observed data as uniform as possible according to some measure of uniformity. For a related estimation method in parametric inference utilizing the probability integral transform, we refer to Cheng and Amin [18] and Ranneby [19].
In what follows, we assume that the observed record values are from a sequence of iid random variables with continuous cdf $F_\theta$, which may depend on an unknown parameter $\theta \in \Theta$, where $\Theta$ denotes the set of parameters. Let $G_\theta$ denote the function $G_\theta(x) = -\ln(F_\theta(x))$, $x \in \mathbb{R}$. Then, Equation (3) reveals that
$$G_\theta(L_i) \overset{d}{=} \tilde{R}_i, \quad i \in \mathbb{N}, \tag{6}$$
where $(\tilde{R}_i)_{i=1}^{\infty}$ is the sequence of upper record values in an iid sequence of standard exponential random variables. Reasoning as in Volovskiy and Kamps [14], we conclude that
$$\frac{G_\theta(L_i)}{G_\theta(L_s)} \overset{d}{=} U_{i:(s-1)}, \quad i = 1, \dots, s - 1. \tag{7}$$
As mentioned above, the prediction of the $s$th record is assumed to be based on observations of the record values $L_1, \dots, L_r$. Since the expected values of the order statistics $U_{1:s-1}, \dots, U_{s-1:s-1}$ are equidistantly arranged in the interval $(0, 1)$, Equation (7) motivates the following inferential procedure for the future realization of $L_s$: using the notational convention that, for an interval $I \subseteq \mathbb{R}$ and $n \in \mathbb{N}$, $I^{n}_{<} = \{ (x_1, \dots, x_n) \in I^n \mid x_1 < x_2 < \dots < x_n \}$, for $n \in \mathbb{N}$, we define
$$Z_n = \left\{ (\theta, x_1, \dots, x_n) \in \Theta \times \mathbb{R}^{n}_{<} \;\middle|\; (x_1, \dots, x_n) \in (\alpha(F_\theta), \omega(F_\theta))^{n}_{<} \right\}$$
and
$$P_n(\theta, x_1, \dots, x_n) = \prod_{i=1}^{n} \left( \frac{G_\theta(x_i)}{G_\theta(x_n)} - \frac{G_\theta(x_{i-1})}{G_\theta(x_n)} \right), \quad (\theta, x_1, \dots, x_n) \in Z_n,$$
and we call any function $\hat{x} = (\hat{x}_{r+1}, \dots, \hat{x}_s) : \mathbb{R}^{r}_{<} \to \mathbb{R}^{s-r}_{<}$ a maximum product of spacings predictor of $L_s$ if, for a suitable function $\hat{\theta} : \mathbb{R}^{r}_{<} \to \Theta$, we have that $\hat{x}$ and $\hat{\theta}$ maximize $P_s$, i.e., for any $\theta \in \Theta$,
$$(\hat{\theta}(x), x, \hat{x}(x)) \in Z_s, \quad x \in (\alpha(F_\theta), \omega(F_\theta))^{r}_{<}$$
and
$$P_s(\hat{\theta}(x), x, \hat{x}(x)) = \max_{\tilde{\theta} \in \Theta,\; \tilde{x} \in \mathbb{R}^{s-r}_{<} :\; (\tilde{\theta}, x, \tilde{x}) \in Z_s} P_s(\tilde{\theta}, x, \tilde{x}), \quad x \in (\alpha(F_\theta), \omega(F_\theta))^{r}_{<},$$
where $\alpha(F_\theta)$ and $\omega(F_\theta)$ denote the left and right endpoint of $F_\theta$, respectively. Note that the maximization of $P_s$ with respect to $\theta$ is necessary since $\theta$ is considered to be unknown. However, given that in the present situation our focus is exclusively on predictive inference, the main object of interest is the point predictor given by the function $\pi^{(s)}_{MPSP} = \hat{x}$. By actually carrying out the above maximization, one obtains that the MPSP of $L_s$ takes the form
$$\pi^{(s)}_{MPSP}(\mathbf{L}) = F^{-1}_{\hat{\theta}(\mathbf{L})} \left( \left( F_{\hat{\theta}(\mathbf{L})}(L_r) \right)^{s/r} \right), \quad s > r, \tag{8}$$
where the function $\hat{\theta}$ is such that, for any $\theta \in \Theta$,
$$(\hat{\theta}(x), x) \in Z_r, \quad x \in (\alpha(F_\theta), \omega(F_\theta))^{r}_{<}$$
and
$$P_r(\hat{\theta}(x), x) = \max_{\tilde{\theta} \in \Theta :\; (\tilde{\theta}, x) \in Z_r} P_r(\tilde{\theta}, x), \quad x \in (\alpha(F_\theta), \omega(F_\theta))^{r}_{<}.$$
The proof of Equation (8) proceeds along the same lines as the proof of (Volovskiy and Kamps [14], Theorem 2.3). The maximum product of spacings prediction procedure is applied to the problem of sports records prediction in Section 3.
In the particular situation of an underlying power function distribution as in Equation (4), one easily sees that, for any sequence of observed lower record values $x \in (0, \lambda)^{r}_{<}$, the objective function $\beta \mapsto P_r(\beta, x)$ is independent of the shape parameter $\beta$, since the ratio in Equation (7) is
$$\frac{G_\theta(L_i)}{G_\theta(L_s)} = \frac{\ln(L_i) - \ln(\lambda)}{\ln(L_s) - \ln(\lambda)}, \quad i = 1, \dots, s - 1.$$
Consequently, according to Formula (8), the MPSP of $L_s$ is given by
$$\pi^{(s)}_{MPSP}(\mathbf{L}) = \lambda \left( \frac{L_r}{\lambda} \right)^{s/r}.$$
The MPSP is based on the ratios of $G_\theta$ evaluated for certain arguments. Hence, in the power function situation, the shape parameter $\beta$ drops out.
Analogously, the MPSP of the $s$th upper record value in a Pareto distributed sequence of random variables is given by
$$\pi^{(s)}_{MPSP}(\mathbf{R}) = \mu \left( \frac{R_r}{\mu} \right)^{s/r}.$$
It should be noted that the threshold parameters $\lambda$ and $\mu$ are supposed to be known.
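Since the two predictors have the same functional form, a single helper suffices to evaluate them. The Python sketch below is a minimal illustration; the function name `mpsp` and all numerical inputs are hypothetical and not taken from the data analyzed in this article.

```python
import math


def mpsp(last_record, threshold, r, s):
    """MPSP of the s-th record given the r-th record and a known threshold.

    For lower records from Pow(lam, beta), pass threshold = lam;
    for upper records from Par(mu, gamma), pass threshold = mu.
    The shape parameter does not enter the predictor.
    """
    if not 1 <= r < s:
        raise ValueError("need 1 <= r < s")
    return threshold * (last_record / threshold) ** (s / r)


# Hypothetical lower-record example: threshold 11.3, fifth record 10.5.
print(mpsp(10.5, 11.3, r=5, s=6))  # predicted next lower record, below 10.5

# Hypothetical upper-record example: threshold 8.0, fourth record 8.9.
print(mpsp(8.9, 8.0, r=4, s=5))    # predicted next upper record, above 8.9
```

On the log scale the predictor is linear in $s$: $\ln(\pi^{(s)}_{MPSP}/\lambda) = (s/r) \ln(L_r/\lambda)$, so each additional record step moves the prediction by the average log-distance per observed record.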

2.2. Interval Prediction

There are only a few articles in the literature dealing with prediction and prediction intervals for lower record values. In one of them, Wang et al. [20] discussed estimation and prediction in a lower record setting with a baseline cdf $F_{\lambda,\alpha}(x) = (D_\lambda(x))^{\alpha}$ for $x > 0$ with parameters $\lambda, \alpha > 0$ from the proportional reversed hazard family, where $D_\lambda$ is a cdf itself depending only on the parameter $\lambda$.
Prediction intervals for upper record values have been studied in several articles. For example, Awad and Raqab [21] compared different intervals for records based on an exponential distribution. Asgharzadeh et al. [22] and Raqab et al. [15] introduced some intervals for upper record values from Pareto distributions. The intervals for lower record values proposed in this section are similar to those in the aforementioned articles and are applied to sports data in Section 3.

2.2.1. Lower Record Values from Power Function Distributions

According to Equation (2), the density of the first $r$ lower record values is
$$f^{\mathbf{L}}(x \mid \lambda, \beta) = \frac{\beta^{r}}{\lambda^{\beta}} \, x_r^{\beta - 1} \prod_{i=1}^{r-1} \frac{1}{x_i}.$$
From the log-likelihood function
$$l(\beta \mid x, \lambda) = r \ln(\beta) - \beta \ln(\lambda) + (\beta - 1) \ln(x_r) - \sum_{i=1}^{r-1} \ln(x_i),$$
we obtain the maximum likelihood estimator
$$\hat{\beta}_{ML} = \frac{r}{\ln(\lambda) - \ln(L_r)}$$
of $\beta$, which is required in the sequel.
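For experimentation, lower record values from a power function distribution can be generated directly from the representation in Equation (3), since $F^{-1}_{\lambda,\beta}(u) = \lambda u^{1/\beta}$ gives $L_j = \lambda \exp(-\frac{1}{\beta} \sum_{i=1}^{j} Z_i)$. The following Python sketch simulates such a sequence and evaluates the maximum likelihood estimator; the function names, the seed and the parameter values are illustrative assumptions.

```python
import math
import random


def simulate_lower_records(lam, beta, r, rng):
    """Simulate L_1, ..., L_r from Pow(lam, beta) via sums of Exp(1) variables."""
    records, total = [], 0.0
    for _ in range(r):
        total += rng.expovariate(1.0)  # running sum Z_1 + ... + Z_j
        records.append(lam * math.exp(-total / beta))
    return records


def mle_beta(records, lam):
    """Maximum likelihood estimate r / (ln(lam) - ln(L_r)) of beta."""
    r = len(records)
    return r / (math.log(lam) - math.log(records[-1]))


rng = random.Random(2023)  # fixed seed so the sketch is reproducible
recs = simulate_lower_records(11.3, 70.0, 6, rng)
print(recs)
print(mle_beta(recs, 11.3))
```

Only the last record $L_r$ and the number of records $r$ enter the estimate, in line with the log-likelihood above.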
In order to derive some prediction intervals for future lower record values, we examine the statistic
$$T_1 = \frac{\ln L_s - \ln L_r}{\ln L_r - \ln L_1}.$$
By using Equation (3), one gets
$$T_1 \overset{d}{=} \frac{\ln(\lambda) - \frac{1}{\beta} \sum_{i=1}^{s} Z_i - \ln(\lambda) + \frac{1}{\beta} \sum_{i=1}^{r} Z_i}{\ln(\lambda) - \frac{1}{\beta} \sum_{i=1}^{r} Z_i - \ln(\lambda) + \frac{1}{\beta} Z_1} = \frac{\sum_{i=r+1}^{s} Z_i}{\sum_{i=2}^{r} Z_i}.$$
Since the $Z_i$ are standard exponential random variables, the sums in the numerator and denominator follow a gamma distribution. Furthermore, the sums in the latter expression are independent, and, after scaling, we find
$$\frac{r-1}{s-r} \, T_1 = \frac{r-1}{s-r} \cdot \frac{\ln L_s - \ln L_r}{\ln L_r - \ln L_1} \sim F(2(s-r), 2(r-1)),$$
where $F(\cdot, \cdot)$ denotes an F-distribution with respective degrees of freedom. Now, we can compute a $(1-\alpha)$-prediction interval for the $s$th lower record value by using F-quantiles, namely
$$I_1 = \left[ L_r \exp\!\left( \frac{s-r}{r-1} \left( \ln L_r - \ln L_1 \right) F_{1-\alpha/2}(2(s-r), 2(r-1)) \right),\; L_r \exp\!\left( \frac{s-r}{r-1} \left( \ln L_r - \ln L_1 \right) F_{\alpha/2}(2(s-r), 2(r-1)) \right) \right],$$
where $\alpha \in (0, 1)$. Next, we consider
$$T_2 = \frac{\ln L_s - \ln L_r}{\ln L_r - \ln(\lambda)} \overset{d}{=} \frac{\sum_{i=r+1}^{s} Z_i}{\sum_{i=1}^{r} Z_i}.$$
Thus, we find
$$\frac{r}{s-r} \cdot \frac{\ln L_s - \ln L_r}{\ln L_r - \ln(\lambda)} \sim F(2(s-r), 2r).$$
Hence, a $(1-\alpha)$-prediction interval is given by
$$I_2 = \left[ L_r \exp\!\left( \frac{s-r}{r} \left( \ln L_r - \ln \lambda \right) F_{1-\alpha/2}(2(s-r), 2r) \right),\; L_r \exp\!\left( \frac{s-r}{r} \left( \ln L_r - \ln \lambda \right) F_{\alpha/2}(2(s-r), 2r) \right) \right].$$
A generalized version of the statistic $T_2$ is used in Wang et al. [20] in a setting with unknown parameter $\lambda$. For a known parameter $\lambda$, prediction interval $I_2$ will be applied instead of $I_1$.
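For the next record, $s = r + 1$, the two exact intervals can be evaluated without statistical libraries, because the F-distribution with 2 numerator degrees of freedom has the closed-form cdf $1 - (1 + 2x/m)^{-m/2}$ for $F(2, m)$. The Python sketch below relies on this specialization; it is not a general implementation for arbitrary $s$, and the function names and the record values are hypothetical.

```python
import math


def f_quantile_2_m(p, m):
    """p-quantile of F(2, m), obtained by inverting 1 - (1 + 2x/m)^(-m/2)."""
    return (m / 2.0) * ((1.0 - p) ** (-2.0 / m) - 1.0)


def interval_i1_next(records, alpha):
    """Exact (1 - alpha) interval I_1 for the next lower record (s = r + 1)."""
    r = len(records)
    if r < 2:
        raise ValueError("I_1 needs at least two observed records")
    m = 2 * (r - 1)
    # (s - r)/(r - 1) * (ln L_r - ln L_1); negative for lower records
    scale = (math.log(records[-1]) - math.log(records[0])) / (r - 1)
    lower = records[-1] * math.exp(scale * f_quantile_2_m(1 - alpha / 2, m))
    upper = records[-1] * math.exp(scale * f_quantile_2_m(alpha / 2, m))
    return lower, upper


def interval_i2_next(records, lam, alpha):
    """Exact (1 - alpha) interval I_2 for the next lower record (s = r + 1)."""
    r = len(records)
    m = 2 * r
    # (s - r)/r * (ln L_r - ln lam); negative because L_r < lam
    scale = (math.log(records[-1]) - math.log(lam)) / r
    lower = records[-1] * math.exp(scale * f_quantile_2_m(1 - alpha / 2, m))
    upper = records[-1] * math.exp(scale * f_quantile_2_m(alpha / 2, m))
    return lower, upper


observed = [11.1, 10.9, 10.8, 10.7]  # hypothetical lower records
print(interval_i1_next(observed, alpha=0.10))
print(interval_i2_next(observed, lam=11.3, alpha=0.10))
```

Because the log-differences are negative, the larger quantile produces the lower interval bound, which is why $F_{1-\alpha/2}$ appears in the left endpoint.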
In addition to those exact intervals, we study some approximate $(1-\alpha)$-prediction intervals based on statistics similar to $T_1$ and $T_2$, starting with $T_3 = \ln L_r - \ln L_s$. It follows that
$$2 \beta T_3 \overset{d}{=} 2 \beta \, \frac{1}{\beta} \sum_{i=r+1}^{s} Z_i \sim \chi^2(2(s-r)),$$
where $\chi^2(\cdot)$ denotes a $\chi^2$-distribution with respective degrees of freedom. Plugging in $\hat{\beta}_{ML} = r / (\ln(\lambda) - \ln(L_r))$ for $\beta$ leads to the approximate $(1-\alpha)$-prediction interval
$$I_3 = \left[ L_r \exp\!\left( \frac{\ln L_r - \ln \lambda}{2r} \, \chi^2_{1-\alpha/2}(2(s-r)) \right),\; L_r \exp\!\left( \frac{\ln L_r - \ln \lambda}{2r} \, \chi^2_{\alpha/2}(2(s-r)) \right) \right],$$
containing $\chi^2$-quantiles. Furthermore, we consider the statistic $T_4 = \ln(\lambda) - \ln L_s$. From Equation (3), one gets
$$2 \beta T_4 \overset{d}{=} 2 \sum_{i=1}^{s} Z_i \sim \chi^2(2s)$$
and plugging in $\hat{\beta}_{ML}$ leads to
$$I_4 = \left[ \lambda \exp\!\left( \frac{\ln L_r - \ln \lambda}{2r} \, \chi^2_{1-\alpha/2}(2s) \right),\; \lambda \exp\!\left( \frac{\ln L_r - \ln \lambda}{2r} \, \chi^2_{\alpha/2}(2s) \right) \right],$$
as an approximate $(1-\alpha)$-prediction interval.
The prediction intervals $I_1, \dots, I_4$ are compared via the expected length and coverage percentage in a simulation study below.

2.2.2. Upper Record Values from Pareto Distributions

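The corresponding prediction intervals for upper record values based on Pareto distributed random variables can be obtained analogously and are given by
$$\begin{aligned}
I_1 &= \left[ R_r \exp\!\left( \frac{s-r}{r-1} \left( \ln R_r - \ln R_1 \right) F_{\alpha/2}(2(s-r), 2(r-1)) \right),\; R_r \exp\!\left( \frac{s-r}{r-1} \left( \ln R_r - \ln R_1 \right) F_{1-\alpha/2}(2(s-r), 2(r-1)) \right) \right], \\
I_2 &= \left[ R_r \exp\!\left( \frac{s-r}{r} \left( \ln R_r - \ln \mu \right) F_{\alpha/2}(2(s-r), 2r) \right),\; R_r \exp\!\left( \frac{s-r}{r} \left( \ln R_r - \ln \mu \right) F_{1-\alpha/2}(2(s-r), 2r) \right) \right], \\
I_3 &= \left[ R_r \exp\!\left( \frac{\ln R_r - \ln \mu}{2r} \, \chi^2_{\alpha/2}(2(s-r)) \right),\; R_r \exp\!\left( \frac{\ln R_r - \ln \mu}{2r} \, \chi^2_{1-\alpha/2}(2(s-r)) \right) \right], \\
I_4 &= \left[ \mu \exp\!\left( \frac{\ln R_r - \ln \mu}{2r} \, \chi^2_{\alpha/2}(2s) \right),\; \mu \exp\!\left( \frac{\ln R_r - \ln \mu}{2r} \, \chi^2_{1-\alpha/2}(2s) \right) \right].
\end{aligned}$$
The upper-record interval $I_1$ was introduced in Asgharzadeh et al. [22]. The upper-record intervals $I_2$ and $I_4$ can be derived from the corresponding intervals in Awad and Raqab [21] via an analogous formulation of the transformation in Equation (6) for upper record values. Raqab et al. [15] proposed the upper-record interval $I_3$.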

2.2.3. Expected Lengths of Prediction Intervals

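For the expected lengths $l_i = E(\hat{b}_i(\mathbf{L}) - \hat{a}_i(\mathbf{L}))$ of the prediction intervals $I_i = [\hat{a}_i(\mathbf{L}), \hat{b}_i(\mathbf{L})]$, $i = 1, \dots, 4$, for future lower record values, the following expressions can be derived.
According to Equation (3), the expectation of the upper bound of $I_1$ is given by
$$\begin{aligned}
& E\!\left[ L_r \exp\!\left( \frac{s-r}{r-1} \left( \ln L_r - \ln L_1 \right) F_{\alpha/2}(2(s-r), 2(r-1)) \right) \right] \\
&\quad = E \exp\!\left( \ln(L_r) + \frac{s-r}{r-1} F_{\alpha/2}(2(s-r), 2(r-1)) \left( \ln(L_r) - \ln(L_1) \right) \right) \\
&\quad = E \exp\!\left( \ln(\lambda) - \frac{1}{\beta} \sum_{i=1}^{r} Z_i + \frac{s-r}{r-1} F_{\alpha/2}(2(s-r), 2(r-1)) \left( -\frac{1}{\beta} \sum_{i=1}^{r} Z_i + \frac{1}{\beta} Z_1 \right) \right) \\
&\quad = \lambda \, E \exp\!\left( -\frac{1}{\beta} Z_1 \right) \prod_{i=2}^{r} E \exp\!\left( -\frac{1}{\beta} \left( 1 + \frac{s-r}{r-1} F_{\alpha/2}(2(s-r), 2(r-1)) \right) Z_i \right) \\
&\quad = \lambda \, \frac{1}{1 + \frac{1}{\beta}} \left( \frac{1}{1 + \left( 1 + \frac{s-r}{r-1} F_{\alpha/2}(2(s-r), 2(r-1)) \right) \frac{1}{\beta}} \right)^{r-1}.
\end{aligned}$$
In the last step, it was used that the moment-generating function of a standard exponential random variable $Z$ takes the form $E \exp(tZ) = \frac{1}{1-t}$ for $t < 1$. Hence,
$$l_1 = \frac{\lambda \, ((r-1)\beta)^{r-1}}{1 + \frac{1}{\beta}} \left[ \left( \frac{1}{(r-1)\beta + r - 1 + (s-r) F_{\alpha/2}(2(s-r), 2(r-1))} \right)^{r-1} - \left( \frac{1}{(r-1)\beta + r - 1 + (s-r) F_{1-\alpha/2}(2(s-r), 2(r-1))} \right)^{r-1} \right].$$
The expected lengths of the other prediction intervals can be calculated similarly. They are given by
$$\begin{aligned}
l_2 &= \lambda (r\beta)^{r} \left[ \left( \frac{1}{\beta r + r + (s-r) F_{\alpha/2}(2(s-r), 2r)} \right)^{r} - \left( \frac{1}{\beta r + r + (s-r) F_{1-\alpha/2}(2(s-r), 2r)} \right)^{r} \right], \\
l_3 &= \lambda (2r\beta)^{r} \left[ \left( \frac{1}{2\beta r + 2r + \chi^2_{\alpha/2}(2(s-r))} \right)^{r} - \left( \frac{1}{2\beta r + 2r + \chi^2_{1-\alpha/2}(2(s-r))} \right)^{r} \right], \\
l_4 &= \lambda (2r\beta)^{r} \left[ \left( \frac{1}{2\beta r + \chi^2_{\alpha/2}(2s)} \right)^{r} - \left( \frac{1}{2\beta r + \chi^2_{1-\alpha/2}(2s)} \right)^{r} \right].
\end{aligned}$$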
In the comparisons of prediction intervals, we restricted ourselves to the most relevant cases in practice, namely $s = r + 1$ and $s = r + 2$. Moreover, as examples, we chose the parameters as $(\lambda, \beta) = (11.3, 70)$ and $(\lambda, \beta) = (45.5, 90)$, which corresponded to the women’s 100 m and the men’s 400 m, respectively (see Table 1 and Table 2 and Section 3).
In Table 1, the expected lengths of the prediction intervals $I_1, \dots, I_4$ are shown for the two chosen parameter combinations and for different values of $r$ and $s$. The prediction interval $I_2$, which uses $\lambda$ as a known parameter, outperforms $I_1$, where the first record value is used. Compared with the other intervals, $I_4$ has large expected lengths, which are even increasing in the number of record values used in the prediction method. The asymptotic method $I_3$ tends to have slightly shorter prediction intervals than the exact ones, $I_1$ and $I_2$. Figure 1 and Figure 2 illustrate the expected lengths of all intervals for $s = r + 1$ and $s = r + 2$, respectively.
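The expected lengths $l_2$ and $l_3$ can be evaluated with a few lines of Python when $s = r + 1$, because the required quantiles then have closed forms: $F_p(2, 2r) = r((1-p)^{-1/r} - 1)$ and $\chi^2_p(2) = -2\ln(1-p)$. The sketch below uses these specializations; the function names and the choice $r = 5$, $\alpha = 0.1$ are illustrative assumptions and do not reproduce Table 1.

```python
import math


def expected_length_i2_next(lam, beta, r, alpha):
    """Expected length l_2 of I_2 for s = r + 1."""
    def term(p):
        # (s - r) * F_p(2, 2r) with s - r = 1
        f_part = r * ((1.0 - p) ** (-1.0 / r) - 1.0)
        return (r * beta / (beta * r + r + f_part)) ** r
    return lam * (term(alpha / 2) - term(1 - alpha / 2))


def expected_length_i3_next(lam, beta, r, alpha):
    """Expected length l_3 of I_3 for s = r + 1."""
    def term(p):
        chi2 = -2.0 * math.log(1.0 - p)  # p-quantile of chi^2(2)
        return (2 * r * beta / (2 * beta * r + 2 * r + chi2)) ** r
    return lam * (term(alpha / 2) - term(1 - alpha / 2))


for lam, beta in ((11.3, 70.0), (45.5, 90.0)):
    l2 = expected_length_i2_next(lam, beta, r=5, alpha=0.10)
    l3 = expected_length_i3_next(lam, beta, r=5, alpha=0.10)
    print(lam, beta, round(l2, 4), round(l3, 4))
```

In this setting the approximate interval $I_3$ comes out shorter than the exact interval $I_2$, in line with the qualitative comparison described above.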
We performed a simulation study concerning the percentages of coverage of the prediction intervals. We generated $n = 10{,}000$ sequences of record values from a power function distribution and considered the first $r$ record values as known and the remaining records as unknown. Then, we computed the four prediction intervals based on these sequences and their empirical percentages of coverage. The results in Table 2 show that the empirical coverage of the exact prediction intervals $I_1$ and $I_2$ was always close to $1 - \alpha$. As observed in Table 1, $I_3$ had small expected lengths, which corresponded to percentages of coverage (considerably) smaller than 90%, throughout. The asymptotic prediction interval $I_4$ turned out to be too conservative. Summarizing, the exact prediction intervals $I_1$ and $I_2$ seemed to outperform the approximate approaches $I_3$ and $I_4$.
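A reduced version of such a coverage experiment can be sketched in Python for $I_2$ and $I_3$ with $s = r + 1$. It is not the simulation code used for Table 2: the function names, the seed, the number of runs and the choice $r = 5$ are assumptions, and the closed-form quantiles of $F(2, 2r)$ and $\chi^2(2)$ replace general quantile routines.

```python
import math
import random


def simulate_lower_records(lam, beta, n_records, rng):
    """Lower records from Pow(lam, beta) via cumulative Exp(1) sums."""
    records, total = [], 0.0
    for _ in range(n_records):
        total += rng.expovariate(1.0)
        records.append(lam * math.exp(-total / beta))
    return records


def interval_i2_next(records, lam, alpha):
    """Exact interval I_2 for s = r + 1; uses F_p(2, 2r)/r = (1-p)^(-1/r) - 1."""
    r = len(records)
    d = math.log(records[-1]) - math.log(lam)  # negative
    lower = records[-1] * math.exp(d * ((alpha / 2) ** (-1.0 / r) - 1.0))
    upper = records[-1] * math.exp(d * ((1 - alpha / 2) ** (-1.0 / r) - 1.0))
    return lower, upper


def interval_i3_next(records, lam, alpha):
    """Approximate interval I_3 for s = r + 1; uses chi2_p(2) = -2 ln(1 - p)."""
    r = len(records)
    d = math.log(records[-1]) - math.log(lam)  # negative
    lower = records[-1] * math.exp(-(d / r) * math.log(alpha / 2))
    upper = records[-1] * math.exp(-(d / r) * math.log(1 - alpha / 2))
    return lower, upper


def coverage(lam, beta, r, alpha, n_runs, seed):
    """Empirical coverage of I_2 and I_3 for the (r + 1)-th record."""
    rng = random.Random(seed)
    hits2 = hits3 = 0
    for _ in range(n_runs):
        recs = simulate_lower_records(lam, beta, r + 1, rng)
        observed, nxt = recs[:r], recs[r]
        lo2, hi2 = interval_i2_next(observed, lam, alpha)
        lo3, hi3 = interval_i3_next(observed, lam, alpha)
        hits2 += lo2 <= nxt <= hi2
        hits3 += lo3 <= nxt <= hi3
    return hits2 / n_runs, hits3 / n_runs


print(coverage(11.3, 70.0, r=5, alpha=0.10, n_runs=20000, seed=1))
```

With these settings the empirical coverage of $I_2$ should be close to the nominal level, while $I_3$ should fall below it, matching the pattern reported for Table 2.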

3. Application to Athletics and American Football Data

Records play an important role in athletics. We applied statistical prediction methods for future lower and upper record values, namely, future world records, in athletic events, and we focused on the respective next record, i.e., s = r + 1 was chosen. We assumed that lower records, as in running events, were based on a power function distribution, and that upper records, as in throwing and jumping events, were based on a Pareto distribution. These assumptions have to be justified in a given data situation. As an example, we considered the women’s 100 m; data of this discipline have been studied by extreme value methods before (Einmahl and Magnus [2], Einmahl and Smeets [4], Stephenson and Tawn [7]). We used the data provided at https://www.worldathletics.org/records/all-time-toplists, accessed on 19 December 2019. These data contain the personal bests of respective athletes, and we considered results until 2018. Since the times were taken with an accuracy of 0.01 s, we further adjusted those times in order to eliminate ties as in Einmahl and Magnus [2]. A histogram of these slightly modified times is given in Figure 3.
The common record model as introduced in Section 2 is based on a sequence of iid random variables. While the assumption of independence seems to be reasonable, the assumption of identical distribution is questionable due to a possible trend in the data over time. In what follows, we stick to the iid situation. For the reason of comparability of sports results and in order to better approximately meet the assumption of identical distributions, we considered the times of top athletes only, whose running times were below a given threshold. In Figure 4 and in our prediction results, this threshold was chosen to be $\lambda = 11.3$.
The curve in Figure 4 illustrates the density function of a power function distribution with parameters $\lambda = 11.3$ and $\beta = 69.73$, where the latter value is a maximum likelihood estimate. Figure 4 suggests that the power function assumption seems to be reasonable. However, the respective quantile–quantile (Q–Q) plot in Figure 5 gives rise to question the assumptions.
As the smallest values indicate, the power function assumption may have to be modified, or the assumptions of independence and identical distributions may have to be relaxed. Nevertheless, in view of the histogram and for illustration of our theoretical results, we stuck to the power function assumption.
Then, we applied the point predictor MPSP for lower and upper records and the prediction intervals $I_1$, $I_2$ and $I_3$ in their lower-record and upper-record versions, which were shown in Section 2. The world record progression for the different events was recorded in https://www.worldathletics.org/records/by-category/world-records, accessed on 25 November 2022. Unfortunately, not all of these lists were consistent. For example, some times in the running events were measured by hand and therefore had an accuracy of 0.1 instead of 0.01 for electronically measured times. In this listing, a record of 11.20 was regarded as faster than the old record 11.1 for the women’s 100 m. To avoid such data problems, we excluded times measured by hand. To illustrate the prediction results, Table 3 shows world records in the women’s 100 m along with the respective MPSP and prediction intervals $I_1$, $I_2$ and $I_3$ based on the previous record values. The point predictor (MPSP) and the prediction intervals shown in row $i$ of Table 3 are based on the previous records listed in rows $1, \dots, i - 1$, $i \geq 2$.
It can be observed that most world records were close to the MPSP and were within the prediction intervals. An exception was the record run by Florence Griffith-Joyner in 1988. It was considerably smaller than the MPSP and the lower bound of the statistical prediction intervals. Moreover, such an exceptional record had a strong effect on the statistical prediction of subsequent record values.
By using the same procedure in other running events and an analogous approach for upper records in throwing and jumping events with underlying Pareto distributions, where the distributional assumption was approximately met, we derived the results shown in Table 4.
Obviously, the number of world records was small in the women’s 800 m, 1500 m, marathon and javelin throw. Therefore, the predictions in Table 4 may differ from the actual next record. Moreover, the respective prediction intervals were quite large. For example, the lower bound for the next marathon record was unsatisfactory.
However, in the men’s marathon, the lower bound was smaller than two hours as well. The number of records was $r = 8$, so the prediction interval was expected to be quite reasonable, having a lower bound below the two-hour mark. In October 2019, the world record holder Eliud Kipchoge broke this mark under special conditions regarding, for example, the course, which is why it was not considered an actual world record.
In general, prediction intervals tend to be large, if the number of observed records is fairly small. Thus, in such a real data situation, the prediction interval may not be of practical use (see Table 4).
In addition to the analysis of athletic events, we applied point and interval prediction to data from American football. A possible data set could result from the NFL Combine with its athletics disciplines.
However, we focused on data resulting from actual football games in the NFL. To combine most relevant metrics in the analysis of a player’s performance on the field, we considered so-called fantasy football points (see, e.g., https://fantasy.nfl.com/research/scoringleaders, accessed on 2 March 2021). These fantasy points are calculated according to the points per reception (PPR) scoring scheme (see Table 5 and the glossary at https://www.pro-football-reference.com, accessed on 14 January 2020), for each game separately. Many providers of football data additionally offer fantasy points of players that are added up to a full season performance. Here, we focus on single game performances.
Related to these data of fantasy football points, we may define world records in American football. When measured by means of the PPR scheme, the respective world record performance (as of December 2019) by quarterback Michael Vick was composed of 333 passing yards, 4 passing touchdowns, 80 rushing yards and 2 rushing touchdowns in a single game, which yielded
$$\frac{333}{25} + 4 \times 4 + \frac{80}{10} + 2 \times 6 = 49.32 \text{ fantasy points}.$$
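The calculation above can be written as a small Python helper. It covers only the four components that appear in this example (one point per 25 passing yards, 4 points per passing touchdown, one point per 10 rushing yards, 6 points per rushing touchdown); the remaining entries of the PPR scheme in Table 5 are not modeled, and the function name is hypothetical.

```python
def fantasy_points(pass_yards, pass_tds, rush_yards, rush_tds):
    """Fantasy points from passing and rushing statistics only.

    Weights are read off the worked example in the text; other PPR
    components (e.g., receptions) are deliberately left out.
    """
    return pass_yards / 25 + 4 * pass_tds + rush_yards / 10 + 6 * rush_tds


# Single-game performance quoted in the text.
print(round(fantasy_points(333, 4, 80, 2), 2))
```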
We distinguished between the skill positions quarterback, wide receiver, running back and tight end. Other players such as linemen, defense players and special teams were not included in the analysis. The points of kickers and whole defense teams were measured in the standard scoring scheme as well, but they followed a discrete distribution and were therefore not considered here.
We determined the sequences of records for quarterbacks, wide receivers, running backs and tight ends as the basis for point and interval prediction. The underlying game data are provided at https://www.pro-football-reference.com (accessed on 14 January 2020, publicly available until 2020) and cover each game of the regular seasons from 2000 to 2019. The fantasy football points were rounded to one digit for quarterbacks as well. The Pareto assumption could be justified by comparing the histogram and the estimated density function as above. Only for the tight end’s points did a Pareto distribution not seem to be reasonable. Since there were just two records in the wide receiver’s data, we did not compute predictions for this position either. The progressions of the world records and the corresponding predictions can be found in Table 6 and Table 7.
Using box score statistics such as fantasy points to measure a player’s performance has been discussed intensively, since this approach does not weight the yardage gained by its effectiveness in a drive. Four yards at third and three are regarded as being as valuable as four yards at third and five, although the outcome of the play is different.

4. Discussion

We developed and applied point and interval prediction based on a sequence of lower or upper record values to various data sets from athletics and American football to come up with predictions of future record values. In the examples, these predicted values were successively compared with actual subsequent record values. Several prediction methods were discussed and results for lower record values were established, which were used, e.g., for analyzing data from running disciplines in athletics. The procedure as well as the results can also be applied to other sports disciplines. In the model of common record values, we assumed that the considered record values were smallest or largest observations of a sequence of independent and identically distributed random variables. In particular, this assumption of underlying identical distributions presumed that the athletes were able to perform on the same level. In order to approximately meet the assumption of identical distribution, we only considered results better than some threshold interpreted in such a way that we considered just top athletes and top performances. Here, we applied power function distributions with respect to lower record values and Pareto distributions with respect to upper record values, which often are seen to have a reasonable fit to the data. Of course, there are disciplines mentioned in this article, where the distributional assumptions are not really justified, such as in women’s 800 m and in tight end’s fantasy points. Moreover, certain dependencies within sports data cannot be avoided. For example, some sportspersons in athletics broke a record more than once and therefore, their names occur several times in the list of records. These results cannot be regarded as independent. However, if there are many people involved in a list of records, independence may be assumed at least approximately. 
Furthermore, in sports such as American football, there are dependencies among a team or between competitors. Despite some possible criticism on the model of common record values in sports applications, the prediction methods performed quite well. Most world records and point predictions lay within the respective prediction intervals. Prominent exceptions are, e.g., the records of Florence Griffith-Joyner in the women’s 100 m and Bob Beamon’s world record in the men’s long jump. Such exceptional performances were much better than expected and had a strong influence on statistical predictions.

5. Conclusions

Based on successive upper or lower record values in a sequence of independent and identically distributed random variables with an underlying Pareto distribution and a power function distribution, respectively, we obtained explicit statistical methods for point prediction and interval prediction. In this article, the procedures were derived in the case of lower records. Prediction intervals were compared by the criterion of expected lengths, and percentages of coverage were tabulated. The results were illustrated via real data sets from several athletics disciplines and American football. The forecasting of records has been considered in the literature before, mainly by means of extreme value theory, whereas we focused on statistical prediction methods to come up with point and interval predictions. When applied to the data sets, the model assumptions of independence and identical distribution of random variables as well as the specification of the underlying distribution have to be discussed in view of the developments in the respective sports disciplines and the prediction results. This may lead one to consider other underlying distributions providing a better fit as well as to refine the record model by incorporating a possible trend in the data over time. A heuristic approach to take into account a trend in the underlying data could be to estimate this trend, to rescale the original data ahead of prediction and then to modify the prediction results accordingly. Prediction within a thorough statistical model will be the subject matter of our further research.
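
The heuristic trend adjustment mentioned above can be sketched as follows. This is only one possible reading and not a method from the article: the linear trend, the multiplicative rescaling to a reference year, the function names and the example data are all assumptions made for illustration, and the actual predictor is left out.

```python
def fit_linear_trend(years, values):
    """Ordinary least-squares line value ~ a + b * year, returned as a callable."""
    n = len(years)
    mean_t = sum(years) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(years, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return lambda t: intercept + slope * t


def rescale(years, values, trend, ref_year):
    """Step 2: express every value on the level of ref_year (assumed multiplicative)."""
    return [v * trend(ref_year) / trend(t) for t, v in zip(years, values)]


def back_transform(prediction, trend, ref_year, target_year):
    """Step 4: move a prediction made on the ref_year level to target_year."""
    return prediction * trend(target_year) / trend(ref_year)


# Hypothetical data: performances that follow a perfectly linear trend.
years = [2000, 2005, 2010]
values = [11.0, 10.9, 10.8]
trend = fit_linear_trend(years, values)                 # step 1: estimate the trend
adjusted = rescale(years, values, trend, 2010)          # step 2: rescale the data
# Step 3 (not shown): apply a record-based predictor to the adjusted data.
shifted = back_transform(adjusted[-1], trend, 2010, 2020)
```

Whether an additive or a multiplicative adjustment is more appropriate depends on the discipline and would have to be checked against the data.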

Author Contributions

Conceptualization, C.E. and U.K.; methodology, U.K. and G.V.; software, C.E.; validation, C.E., U.K. and G.V.; formal analysis, C.E.; investigation, C.E. and G.V.; data curation, C.E.; writing—original draft preparation, C.E. and G.V.; writing—review and editing, U.K.; visualization, C.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://www.worldathletics.org/records/all-time-toplists (accessed on 19 December 2019), https://www.worldathletics.org/records/by-category/world-records (accessed on 25 November 2022), https://www.pro-football-reference.com (accessed on 14 January 2020). The data from https://www.pro-football-reference.com were publicly available until 2020.

Acknowledgments

The authors would like to thank the reviewers for their helpful comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
iid: independent and identically distributed
cdf: cumulative distribution function
pdf: probability density function
MLP: maximum likelihood predictor
MOLP: maximum observed likelihood predictor
MPSP: maximum product of spacings predictor
QQ: quantile–quantile
NFL: National Football League
PPR: points per reception

References

  1. Noubary, R.D. A procedure for prediction of sports records. J. Quant. Anal. Sport. 2005, 1.
  2. Einmahl, J.H.J.; Magnus, J.R. Records in athletics through extreme-value theory. J. Am. Stat. Assoc. 2008, 103, 1382–1391.
  3. Noubary, R.D. Tail modeling, track and field records, and Bolt’s effect. J. Quant. Anal. Sport. 2010, 6.
  4. Einmahl, J.H.J.; Smeets, S.G.W.R. Ultimate 100-m world records through extreme-value theory. Stat. Neerl. 2011, 65, 32–42.
  5. Henriques-Rodrigues, L.; Gomes, M.; Pestana, D. Statistics of extremes in athletics. Revstat Stat. J. 2011, 9, 127–153.
  6. Fraga Alves, I.; de Haan, L.; Neves, C. How far can man go? In Advances in Theoretical and Applied Statistics; Torelli, N., Pesarin, F., Bar-Hen, A., Eds.; Springer: Heidelberg, Germany, 2013; pp. 187–197.
  7. Stephenson, A.G.; Tawn, J.A. Determining the best track performances of all time using a conceptual population model for athletics records. J. Quant. Anal. Sport. 2013, 9, 67–76.
  8. Adam, M.B.; Tawn, J.A. Modelling record times in sport with extreme value methods. Malays. J. Math. Sci. 2016, 10, 1–21.
  9. Albert, J.; Glickman, M.E.; Swartz, T.B.; Koning, R.H. (Eds.) Handbook of Statistical Methods and Analyses in Sports; CRC Press: Boca Raton, FL, USA, 2019.
  10. Wunderlich, F.; Memmert, D. Forecasting the outcomes of sports events: A review. Eur. J. Sport Sci. 2021, 21, 944–957.
  11. Arnold, B.C.; Balakrishnan, N.; Nagaraja, H.N. Records; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1998.
  12. Kaminsky, K.S.; Rhodin, L.S. Maximum likelihood prediction. Ann. Inst. Stat. Math. 1985, 37, 507–517.
  13. Volovskiy, G.; Kamps, U. Maximum observed likelihood prediction of future record values. TEST 2020, 29, 1072–1097.
  14. Volovskiy, G.; Kamps, U. Maximum product of spacings prediction of future record values. Metrika 2020, 83, 853–868.
  15. Raqab, M.Z.; Ahmadi, J.; Doostparast, M. Statistical inference based on record data from Pareto model. Statistics 2007, 41, 105–118.
  16. Ahmadi, J.; Doostparast, M. Bayesian estimation and prediction for some life distributions based on record values. Stat. Pap. 2006, 47, 373–392.
  17. Madi, M.T.; Raqab, M.Z. Bayesian prediction of temperature records using the Pareto model. Environmetrics 2004, 15, 701–710.
  18. Cheng, R.C.H.; Amin, N.A.K. Estimating parameters in continuous univariate distributions with a shifted origin. J. R. Stat. Soc. Ser. B 1983, 45, 394–403.
  19. Ranneby, B. The maximum spacing method. An estimation method related to the maximum likelihood method. Scand. J. Stat. 1984, 11, 93–112.
  20. Wang, B.X.; Yu, K.; Coolen, F.P. Interval estimation for proportional reversed hazard family based on lower record values. Stat. Probab. Lett. 2015, 98, 115–122.
  21. Awad, A.M.; Raqab, M.Z. Prediction intervals for the future record values from exponential distribution: Comparative study. J. Stat. Comput. Simul. 2000, 65, 325–340.
  22. Asgharzadeh, A.; Abdi, M.; Kuş, C. Interval estimation for the two-parameter Pareto distribution based on record values. Selçuk J. Appl. Math. 2011, 149–161.

Figure 1. Expected lengths of the 90% prediction intervals I1, …, I4 for the next lower record value (s = r + 1) based on the first r lower record values from a Pow(λ, β) distribution with λ = 11.3 and β = 70.

Figure 2. Expected lengths of the 90% prediction intervals I1, …, I4 for the next but one lower record value (s = r + 2) based on the first r lower record values from a Pow(λ, β) distribution with λ = 11.3 and β = 70.

Figure 3. Histogram of the times of women’s 100 m in seconds.

Figure 4. Histogram of the results of women’s 100 m faster than 11.3 s and pdf of the power function distribution with λ = 11.3 and β = 69.73, where the latter is the maximum likelihood estimate based on all times below 11.3 as upper threshold.

Figure 5. Power function Q–Q plot for the times of women’s 100 m faster than 11.3 s, where the five smallest times were omitted when calculating the regression line.

Table 1. Expected lengths of the 95% and 90% prediction intervals I1, …, I4 for the sth lower record value based on the first r records from a Pow(λ, β) distribution.

| | I1 (α = 5%) | I2 (α = 5%) | I3 (α = 5%) | I4 (α = 5%) | I1 (α = 10%) | I2 (α = 10%) | I3 (α = 10%) | I4 (α = 10%) |
|---|---|---|---|---|---|---|---|---|
| λ = 11.3, β = 70 | | | | | | | | |
| r = 3, s = 4 | 1.45 | 1.03 | 0.54 | 1.13 | 0.98 | 0.74 | 0.44 | 0.95 |
| r = 8, s = 9 | 0.66 | 0.64 | 0.51 | 1.61 | 0.51 | 0.49 | 0.41 | 1.35 |
| r = 8, s = 10 | 1.01 | 0.97 | 0.72 | 1.67 | 0.79 | 0.77 | 0.60 | 1.40 |
| r = 25, s = 26 | 0.43 | 0.43 | 0.40 | 2.17 | 0.34 | 0.34 | 0.32 | 1.83 |
| r = 25, s = 27 | 0.63 | 0.63 | 0.57 | 2.18 | 0.51 | 0.51 | 0.47 | 1.83 |
| r = 25, s = 28 | 0.78 | 0.78 | 0.70 | 2.19 | 0.64 | 0.64 | 0.58 | 1.84 |
| λ = 45.5, β = 90 | | | | | | | | |
| r = 3, s = 4 | 4.72 | 3.32 | 1.72 | 3.61 | 3.15 | 2.37 | 1.39 | 3.02 |
| r = 8, s = 9 | 2.15 | 2.07 | 1.64 | 5.21 | 1.65 | 1.60 | 1.32 | 4.38 |
| r = 8, s = 10 | 3.29 | 3.16 | 2.35 | 5.43 | 2.59 | 2.50 | 1.95 | 4.56 |
| r = 25, s = 26 | 1.47 | 1.46 | 1.36 | 7.41 | 1.17 | 1.17 | 1.10 | 6.23 |
| r = 25, s = 27 | 2.16 | 2.15 | 1.96 | 7.46 | 1.76 | 1.75 | 1.62 | 6.28 |
| r = 25, s = 28 | 2.69 | 2.68 | 2.40 | 7.52 | 2.21 | 2.20 | 2.00 | 6.32 |

Table 2. Percentages of coverage of the 90% prediction intervals I1, …, I4 for the sth lower record value based on n = 10,000 sequences of the first r lower record values from a Pow(λ, β) distribution.

| | I1 | I2 | I3 | I4 |
|---|---|---|---|---|
| λ = 11.3, β = 70 | | | | |
| r = 3, s = 4 | 0.9041 | 0.9001 | 0.8229 | 0.9404 |
| r = 8, s = 9 | 0.8965 | 0.8989 | 0.8716 | 0.9910 |
| r = 8, s = 10 | 0.8995 | 0.8994 | 0.8492 | 0.9785 |
| r = 25, s = 26 | 0.8995 | 0.8985 | 0.8890 | 0.9998 |
| r = 25, s = 27 | 0.8980 | 0.8990 | 0.8832 | 0.9988 |
| r = 25, s = 28 | 0.9032 | 0.9040 | 0.8808 | 0.9971 |
| λ = 45.5, β = 90 | | | | |
| r = 3, s = 4 | 0.9041 | 0.9001 | 0.8229 | 0.9404 |
| r = 8, s = 9 | 0.8965 | 0.8989 | 0.8716 | 0.9910 |
| r = 8, s = 10 | 0.8995 | 0.8994 | 0.8492 | 0.9785 |
| r = 25, s = 26 | 0.8995 | 0.8985 | 0.8890 | 0.9998 |
| r = 25, s = 27 | 0.8980 | 0.8990 | 0.8832 | 0.9988 |
| r = 25, s = 28 | 0.9032 | 0.9040 | 0.8808 | 0.9971 |

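
A coverage study of this kind can be sketched with a short Monte Carlo simulation. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that Pow(λ, β) has cdf F(x) = (x/λ)^β on (0, λ), it covers only the case s = r + 1, and it uses one exact pivotal interval built from W = log(R_r/R_{r+1}) / log(R_1/R_r), whose cdf is 1 − (1 + w)^−(r−1) under that assumption. All function names are illustrative.

```python
import math
import random


def pivot_quantile(p, r):
    """p-quantile of W, whose cdf is 1 - (1 + w) ** -(r - 1) under the assumed model."""
    return (1.0 - p) ** (-1.0 / (r - 1)) - 1.0


def simulate_lower_records(lam, beta, count, rng):
    """First `count` lower record values, assuming the cdf (x / lam) ** beta."""
    records, current = [], lam
    for _ in range(count):
        # Given the current level c, the next lower record is c * U ** (1 / beta).
        current *= (1.0 - rng.random()) ** (1.0 / beta)
        records.append(current)
    return records


def coverage(lam, beta, r, alpha=0.10, n=10_000, seed=1):
    """Share of simulated sequences whose (r + 1)th record falls inside the interval."""
    rng = random.Random(seed)
    w_lo = pivot_quantile(alpha / 2, r)
    w_hi = pivot_quantile(1 - alpha / 2, r)
    hits = 0
    for _ in range(n):
        rec = simulate_lower_records(lam, beta, r + 1, rng)
        log_ratio = math.log(rec[r - 1] / rec[0])   # log(R_r / R_1), not positive
        lower = rec[r - 1] * math.exp(w_hi * log_ratio)
        upper = rec[r - 1] * math.exp(w_lo * log_ratio)
        hits += lower <= rec[r] <= upper
    return hits / n


print(coverage(11.3, 70.0, r=3), coverage(45.5, 90.0, r=3))
```

Because the pivot does not depend on λ or β, the estimated coverage should be close to 0.90 for either parameter pair, which is consistent with the two identical blocks in Table 2.
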
Table 3. World records, maximum product of spacings predictor and (approximate) 90% prediction intervals I1, I2 and I3 for the next record of the women’s 100 m based on the previous records.

| s | World Record | MPSP | I1 | I2 | I3 |
|---|---|---|---|---|---|
| 1 | 11.20 | | | | |
| 2 | 11.08 | 11.10 | | [9.46, 11.19] | [10.91, 11.19] |
| 3 | 11.07 | 10.97 | [9.03, 11.07] | [10.35, 11.07] | [10.76, 11.07] |
| 4 | 11.04 | 10.99 | [10.63, 11.07] | [10.69, 11.07] | [10.84, 11.07] |
| 5 | 11.01 | 10.98 | [10.77, 11.04] | [10.76, 11.04] | [10.85, 11.04] |
| 6 | 10.88 | 10.95 | [10.80, 11.01] | [10.78, 11.01] | [10.84, 11.01] |
| 7 | 10.81 | 10.81 | [10.62, 10.88] | [10.62, 10.88] | [10.68, 10.88] |
| 8 | 10.79 | 10.74 | [10.56, 10.81] | [10.56, 10.81] | [10.61, 10.81] |
| 9 | 10.76 | 10.73 | [10.58, 10.79] | [10.57, 10.79] | [10.61, 10.79] |
| 10 | 10.49 | 10.70 | [10.57, 10.76] | [10.55, 10.76] | [10.59, 10.76] |
| 11 | | 10.41 | [10.22, 10.49] | [10.22, 10.49] | [10.26, 10.49] |

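
For readers who want to reproduce an interval of this kind, the following is a minimal sketch of one exact pivotal prediction interval for the next lower record. It is not necessarily the construction behind I1, I2 or I3. It assumes that Pow(λ, β) has cdf F(x) = (x/λ)^β on (0, λ), so that the log-spacings between successive lower records are iid exponential with rate β and the ratio W = log(R_r/R_{r+1}) / log(R_1/R_r) has cdf 1 − (1 + w)^−(r−1), free of λ and β. The function name and its arguments are illustrative.

```python
import math


def next_lower_record_interval(records, alpha=0.10):
    """Equal-tailed (1 - alpha) interval for the next lower record (sketch).

    `records` holds the observed lower record values in order of occurrence;
    at least two are needed because the pivot uses log(R_1 / R_r).
    """
    r = len(records)
    if r < 2:
        raise ValueError("at least two record values are required")
    first, last = records[0], records[-1]
    log_ratio = math.log(last / first)          # log(R_r / R_1) < 0

    def w(p):
        # p-quantile of the pivot W with cdf 1 - (1 + w) ** -(r - 1)
        return (1.0 - p) ** (-1.0 / (r - 1)) - 1.0

    # R_{r+1} = R_r * (R_r / R_1) ** W, so a large W gives a small next record.
    lower = last * math.exp(w(1 - alpha / 2) * log_ratio)
    upper = last * math.exp(w(alpha / 2) * log_ratio)
    return lower, upper


lo, hi = next_lower_record_interval([11.20, 11.08])
print(round(lo, 2), round(hi, 2))
```

With the first two women’s 100 m records, 11.20 and 11.08, the sketch returns roughly [9.03, 11.07], which agrees with the I1 entry in row s = 3 of Table 3.
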
Table 4. World records (until 2022), maximum product of spacings predictor and 90% prediction interval I2/I2 for the next record based on the previous records for various athletic events.

| Event | Record (Women) | r (Women) | MPSP (Women) | Prediction Interval (Women) | Record (Men) | r (Men) | MPSP (Men) | Prediction Interval (Men) |
|---|---|---|---|---|---|---|---|---|
| 100 m | 10.49 | 10 | 10.41 | [10.22, 10.49] | 9.58 | 13 | 9.53 | [9.40, 9.58] |
| 100/110 m hurdles | 12.12 | 9 | 12.04 | [11.83, 12.12] | 12.80 | 9 | 12.71 | [12.50, 12.80] |
| 200 m | 21.34 | 9 | 21.15 | [20.68, 21.33] | 19.19 | 5 | 18.90 | [18.03, 19.18] |
| 400 m | 47.60 | 12 | 47.25 | [46.42, 47.58] | 43.03 | 4 | 42.46 | [40.53, 43.00] |
| 800 m | 1:53.28 | 2 | 1:50.06 | [1:32.74, 1:53.11] | 1:40.91 | 7 | 1:40.20 | [1:38.29, 1:40.87] |
| 1500 m | 3:50.07 | 3 | 3:45.61 | [3:28.01, 3:49.84] | 3:26.00 | 8 | 3:24.55 | [3:20.77, 3:25.92] |
| 10,000 m | 29:01.03 | 9 | 28:40.22 | [27:48.19, 28:59.95] | 26:11.00 | 13 | 26:02.20 | [25:41.55, 26:10.55] |
| Marathon | 2:14:04 | 2 | 2:06:45 | [1:30:47, 2:13:41] | 2:01:09 | 8 | 2:00:11 | [1:57:40, 2:01:06] |
| Shot put | 22.63 | 26 | 22.81 | [22.64, 23.19] | 23.37 | 16 | 23.56 | [23.38, 24.01] |
| Javelin throw | 72.28 | 3 | 78.70 | [72.60, 111.95] | 98.48 | 8 | 101.07 | [98.61, 108.23] |
| Discus throw | 76.80 | 17 | 77.56 | [76.84, 79.31] | 74.08 | 12 | 74.89 | [74.12, 76.88] |
| Long jump | 7.52 | 14 | 7.57 | [7.52, 7.70] | 8.95 | 9 | 9.06 | [8.96, 9.36] |
| High jump | 2.09 | 13 | 2.10 | [2.09, 2.13] | 2.45 | 22 | 2.46 | [2.45, 2.48] |

Table 5. List of points given in fantasy football’s PPR scoring.

| Event | Score |
|---|---|
| Passing | 1 point per 25 yards |
| Passing touchdowns | 4 points |
| Interceptions thrown | −2 points |
| Rushing/receiving yards | 1 point per 10 yards |
| Receptions | 1 point |
| Touchdowns | 6 points |
| 2-Point conversions | 2 points |
| Fumbles lost | −2 points |

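
The scoring rules in Table 5 translate directly into a small function. The sketch below is illustrative only: the class and field names are hypothetical, fractional points are assumed to be awarded pro rata for yards, and the "Touchdowns" row is read as rushing or receiving touchdowns, i.e., touchdowns other than passing touchdowns.

```python
from dataclasses import dataclass


@dataclass
class GameStats:
    """Per-game statistics of one player (field names are hypothetical)."""
    passing_yards: int = 0
    passing_tds: int = 0
    interceptions: int = 0
    rush_rec_yards: int = 0        # rushing plus receiving yards
    receptions: int = 0
    rush_rec_tds: int = 0          # touchdowns other than passing touchdowns
    two_point_conversions: int = 0
    fumbles_lost: int = 0


def ppr_points(s: GameStats) -> float:
    """Fantasy points under the PPR scheme of Table 5 (pro-rata yards assumed)."""
    return (
        s.passing_yards / 25          # 1 point per 25 passing yards
        + 4 * s.passing_tds           # 4 points per passing touchdown
        - 2 * s.interceptions         # -2 points per interception thrown
        + s.rush_rec_yards / 10       # 1 point per 10 rushing/receiving yards
        + 1 * s.receptions            # 1 point per reception
        + 6 * s.rush_rec_tds          # 6 points per touchdown
        + 2 * s.two_point_conversions # 2 points per 2-point conversion
        - 2 * s.fumbles_lost          # -2 points per fumble lost
    )


qb_game = GameStats(passing_yards=250, passing_tds=2, interceptions=1)
rb_game = GameStats(rush_rec_yards=100, receptions=5, rush_rec_tds=1, fumbles_lost=1)
print(ppr_points(qb_game), ppr_points(rb_game))
```

The two hypothetical games score 16 and 19 points, respectively.
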
Table 6. World records (2000–2019), maximum product of spacings predictor and (approximate) 90% prediction intervals I1, I2 and I3 for the next record based on the previous records for the fantasy points of quarterbacks.

| s | Player | Team | Records | MPSP | I1 | I2 | I3 |
|---|---|---|---|---|---|---|---|
| 1 | Cade McNown | CHI | 34.3 | | | | |
| 2 | Trent Green | STL | 36.3 | 42.0 | | [34.7, 1621.4] | [34.7, 63.0] |
| 3 | Peyton Manning | IND | 37.4 | 41.3 | [36.4, 106.5] | [36.5, 89.4] | [36.5, 53.6] |
| 4 | Trent Green | KAN | 37.9 | 41.2 | [37.5, 50.5] | [37.6, 61.4] | [37.6, 49.9] |
| 5 | Michael Vick | ATL | 38.2 | 40.9 | [38.0, 45.0] | [38.0, 53.1] | [38.0, 47.5] |
| 6 | Daunte Culpepper | MIN | 41.8 | 40.6 | [38.3, 43.1] | [38.3, 49.3] | [38.3, 46.0] |
| 7 | Michael Vick | PHI | 49.3 | 44.7 | [41.9, 49.2] | [41.9, 54.2] | [41.9, 51.1] |
| 8 | | | | 53.4 | [49.5, 62.4] | [49.5, 66.7] | [49.5, 62.8] |

Table 7. World records (2000–2019), maximum product of spacings predictor and (approximate) 90% prediction intervals I1, I2 and I3 for the next record based on the previous records for the fantasy points of running backs.

| s | Player | Team | Records | MPSP | I1 | I2 | I3 |
|---|---|---|---|---|---|---|---|
| 1 | Duce Staley | PHI | 36.2 | | | | |
| 2 | Marshall Faulk | STL | 44.9 | 43.7 | | [36.6, 1284.9] | [36.6, 63.6] |
| 3 | Marshall Faulk | STL | 45.6 | 54.9 | [45.4, 2688.2] | [45.4, 182.1] | [45.4, 82.1] |
| 4 | Fred Taylor | JAX | 51.8 | 52.4 | [45.9, 101.6] | [45.9, 93.5] | [45.9, 69.3] |
| 5 | Shaun Alexander | SEA | 56.1 | 59.4 | [52.1, 95.7] | [52.2, 95.2] | [52.2, 78.0] |
| 6 | Clinton Portis | DEN | 57.4 | 63.6 | [56.4, 91.4] | [56.5, 93.8] | [56.5, 81.6] |
| 7 | Jamaal Charles | KAN | 59.5 | 64.0 | [57.7, 83.8] | [57.7, 87.4] | [57.7, 79.4] |
| 8 | | | | 65.6 | [59.8, 82.1] | [59.8, 85.8] | [59.8, 79.8] |


Empacher, C.; Kamps, U.; Volovskiy, G. Statistical Prediction of Future Sports Records Based on Record Values. Stats 2023, 6, 131-147. https://doi.org/10.3390/stats6010008
