2.1. Maximum Entropy Method
The maximum entropy method (MEM) for the estimation of probability density is a rigorous method for producing unbiased parametric estimates [39]. A brief summary of the multivariate case for d variables is provided as background for related previous work. The variables are described as a vector in a d-dimensional space given by $\mathbf{x} = (x_1, x_2, \ldots, x_d)$. Moments in the i-th direction are given as $\mu_i^{(m)} = \langle x_i^{m} \rangle$. To describe different moments for each component of $\mathbf{x}$, it is convenient to define the multi-index $\mathbf{m} = (m_1, m_2, \ldots, m_d)$. A general moment in terms of $\mathbf{m}$ is given as $\mu_{\mathbf{m}} = \langle x_1^{m_1} x_2^{m_2} \cdots x_d^{m_d} \rangle$. These product moments can be further generalized by using d Cartesian product functions given by

$\mu_{\mathbf{m}} = \langle \Phi_{\mathbf{m}}(\mathbf{x}) \rangle$ with $\Phi_{\mathbf{m}}(\mathbf{x}) = \prod_{j=1}^{d} \phi_{m_j}(x_j)$, (1)

where $\phi_{m_j}$ is a one-dimensional basis function of order $m_j$.
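For example, with d = 2 and the multi-index $\mathbf{m} = (2, 1)$ (an illustrative choice, not a case from the benchmarks below), the product moment is

$\mu_{(2,1)} = \langle x_1^{2}\, x_2 \rangle = \int x_1^{2}\, x_2\, p(x_1, x_2)\, dx_1\, dx_2,$

which coincides with $\langle \Phi_{(2,1)}(\mathbf{x}) \rangle$ when monomial basis functions $\phi_{m_j}(x_j) = x_j^{m_j}$ are used.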
The MEM obtains an unbiased density $\hat{p}(\mathbf{x})$ by maximizing Shannon entropy while satisfying all known constraints, including the normalization condition. This can be expressed as an optimization problem by maximizing the expression

$-\int \hat{p}(\mathbf{x}) \ln \hat{p}(\mathbf{x})\, d\mathbf{x} + \lambda_0 \left( \int \hat{p}(\mathbf{x})\, d\mathbf{x} - 1 \right) + \sum_{\mathbf{m}} \lambda_{\mathbf{m}} \left( \int \Phi_{\mathbf{m}}(\mathbf{x})\, \hat{p}(\mathbf{x})\, d\mathbf{x} - \mu_{\mathbf{m}} \right).$ (2)

Solving the extremum problem formally gives the solution to Equation (2) to be of the form

$\hat{p}(\mathbf{x}) = A \exp\left( \sum_{\mathbf{m}} \lambda_{\mathbf{m}} \Phi_{\mathbf{m}}(\mathbf{x}) \right),$ (3)

where $\hat{p}(\mathbf{x})$ is the sample estimate of the probability density function (PDF), A is a normalization constant, and $\{\lambda_{\mathbf{m}}\}$ is the set of Lagrange coefficients. Determining $\{\lambda_{\mathbf{m}}\}$ for a set of moments $\{\mu_{\mathbf{m}}\}$ is computationally difficult, particularly as the number of moments becomes large and as the number of dimensions increases. Computational methods have been developed for a limited number of moments and dimensions [22,25], including more recent methods based on fractional moments. In other works, MEM is combined with dimension reduction techniques [26,27] to mitigate the computational difficulties of higher dimensions. A more sophisticated approach first creates a histogram of the sampled data to obtain a rough estimate, and then estimates the coefficients of 1D slices of the histogram in each dimension. These coefficients are then expanded in d dimensions to produce a grid of coefficients used in Equation (3) with Legendre polynomials as a basis set [23,25].
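To make the functional form of Equation (3) concrete, the short sketch below evaluates a one-dimensional density of the form $A \exp(\sum_j \lambda_j T_j(x))$ with a Chebyshev basis and fixes A by numerical quadrature. The coefficient values are arbitrary and the use of NumPy is an implementation assumption; this is not the fitting procedure used to determine the Lagrange coefficients.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def mem_density_unnorm(lambdas, x):
    """Unnormalized MEM form on [-1, 1]: exp(sum_j lambda_j * T_j(x))."""
    return np.exp(C.chebval(x, lambdas))

def mem_density(lambdas, n_quad=2001):
    """Return p(x) = A * exp(sum_j lambda_j * T_j(x)), with A fixed by a simple
    numerical quadrature so that p integrates to one on [-1, 1]."""
    grid = np.linspace(-1.0, 1.0, n_quad)
    dx = grid[1] - grid[0]
    A = 1.0 / (np.sum(mem_density_unnorm(lambdas, grid)) * dx)
    return lambda x: A * mem_density_unnorm(lambdas, x)

# Illustrative coefficients only; in the estimator described below the coefficients
# are fitted to the data rather than chosen by hand.
p_hat = mem_density(np.array([0.0, 0.3, -1.2, 0.1]))
x = np.linspace(-1.0, 1.0, 2001)
print(np.sum(p_hat(x)) * (x[1] - x[0]))  # close to 1.0
```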
Our recent nonparametric probability density estimator for univariate data [33,34,35,36] begins with MEM to obtain the general form of the solution given in Equation (3) for the case d = 1. Chebyshev polynomials are used as a basis, with the domain on [−1, 1] after the data are scaled. The Lagrange coefficients are optimized based on a trial cumulative distribution function (CDF), $u = \hat{F}(x) = \int_{-1}^{x} \hat{p}(x')\, dx'$, where $\hat{F}$ is evaluated for all data points $\{x_k\}$ in the random sample. The set $\{u_k\} = \{\hat{F}(x_k)\}$ will be uniform on [0, 1] when the parameter set $\{\lambda_m\}$ is correct within statistical resolution. Instead of specifying a preset number of Lagrange multipliers, the coefficients are optimized by a random search method, adding more terms to the expansion if necessary. The process continues until the CDF ensures that $\{u_k\}$ is distributed as sampled uniform random data (SURD). The deviation from SURD is measured using a scoring function based on single order statistics, which is sample size invariant [34]. The method is stable for several hundred Lagrange multipliers. Key features of the method include outlier removal and adaptive resolution, which avoid the bandwidth and boundary problems of KDE methods. A generalization of this univariate density estimation method to higher dimensions using the form given in Equation (3) is not possible, because the properties of order statistics are lost once d > 1 is considered. Nevertheless, a multivariate PDF can be obtained by calculating many 1D PDFs in terms of a product array of conditional probability functions.
2.2. Product Array of Conditional Probabilities
Similar in strategy to other methods [23,25], a multivariate probability density is estimated using multiple strips of one-dimensional conditional probabilities. The multivariate joint probability distribution is expressed as a product of one-dimensional conditional and marginal probability distributions. To illustrate the method, consider the simplest multivariable case of two variables. The joint probability of two variables is expressed as a product of conditional and marginal probabilities as $p(x_1, x_2) = p(x_2 \mid x_1)\, p(x_1)$. Estimating the conditional and marginal probability densities separately using the 1D PDF estimator, the joint probability density is estimated as $\hat{p}(x_1, x_2) = \hat{p}(x_2 \mid x_1)\, \hat{p}(x_1)$.
Figure 1a demonstrates a two-dimensional example with a 5 by 5 grid of cells, numbered from 1 to 25, superimposed over a scatter plot of sample data from a bivariate standard Gaussian distribution. The grid lines shown are used to stratify the data into strips, and the number of grid lines is the same in each dimension. For dimension d, the number of grid lines is heuristically determined by Equation (4) as a function of the sample size n. The scale factor of 100 in Equation (4) provides a good balance between accuracy and computation time. Moreover, Equation (4) suggests a practical minimum number of grid lines to use, and, read in reverse, it gives a reasonable guide for how many independent observations are needed to produce a good nonparametric estimate as a function of dimension. Since the details of a particular problem affect the minimum sample size required for an accurate estimate, this heuristic sets a reasonable reference. For instance, sharp features or heavy tails in the probability density require more samples, while fewer samples suffice in high dimensions when there are strong correlations between the d random variables.
As seen in Figure 1, the grid line spacing is not uniform, because the spacing adapts to the density of the data in each dimension. This is achieved by first calculating the marginal densities with the 1D PDF estimator for each variable. The range of the CDF on [0, 1] is divided into equally spaced marks separated by a constant probability increment set by the number of grid lines from Equation (4). To map from [0, 1] back to the variable, the locations of the grid lines are obtained using the inverse of the estimated CDF, which is the estimated quantile function $\hat{Q}_j$ for the j-th variable. The grid lines therefore divide the data into equal fractions for the j-th variable, at the locations given by Equation (5). For the case shown in Figure 1, each variable is partitioned into five strips. In general, this method creates one cell for each combination of strips across the d variables, giving the 25 cells of Figure 1a.
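The adaptive grid placement can be sketched as follows: place equally spaced marks on [0, 1] and map them back through each variable's quantile function. The sketch below substitutes the empirical quantile function (numpy.quantile) for the quantile function obtained from the 1D PDF estimator, and the five-strip choice simply mirrors Figure 1.

```python
import numpy as np

def adaptive_grid_lines(samples, n_strips=5):
    """Place interior grid lines per variable at equal probability spacing,
    mapped back to the variable through an (empirical) quantile function.

    samples: array of shape (n, d); returns a list of d arrays holding the
    n_strips - 1 interior grid-line locations for each variable."""
    probs = np.arange(1, n_strips) / n_strips           # 0.2, 0.4, 0.6, 0.8 for 5 strips
    return [np.quantile(samples[:, j], probs) for j in range(samples.shape[1])]

rng = np.random.default_rng(1)
data = rng.standard_normal((2500, 2))                   # bivariate standard Gaussian
lines = adaptive_grid_lines(data, n_strips=5)
print(lines[0])  # close to the 0.2/0.4/0.6/0.8 quantiles of a standard normal
```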
The estimated density is evaluated along the centerline midway between the two grid lines that define the boundaries of a data strip, as shown in Figure 1b for the $x_1$ variable. The blue points represent all observations whose $x_1$ value falls within the strip (−0.82 < $x_1$ < −0.21). These points are projected onto the vertical centerline (ignoring the $x_1$ values) to create a 1D problem where the probability density is a function of $x_2$ along the centerline positioned at the center of the strip. This process is repeated for the five centerlines representing each data strip. Special points along the estimated conditional probabilities corresponding to cell centers are evaluated to obtain an estimate of the joint probability density at the center of each cell. Interpolation is used to obtain estimates anywhere else in the hypercube if needed. The joint probability density estimate at the center of cell #9 in Figure 1a, for example, is the product of two 1D PDF estimates:

$\hat{p}\big(x_1^{(j_1)}, x_2^{(j_2)}\big) = \hat{p}\big(x_2^{(j_2)} \mid x_1^{(j_1)}\big)\, \hat{p}\big(x_1^{(j_1)}\big),$ (6)

where $x_1^{(j_1)}$ and $x_2^{(j_2)}$ denote the cell-center coordinates indexed by the strips $(j_1, j_2)$ containing the cell.
Because each conditional probability density and marginal probability density is individually normalized by the 1D estimator, multiplying a conditional probability density by a marginal probability density means that normalizing the joint probability density function is unnecessary; normalization of the entire constructed joint probability density is automatic. Since there is no smoothing function (as used in KDE through a selected kernel), the shrinkage that always exists with KDE is avoided. The general procedure of using a product array of conditional probability densities and a marginal probability density is markedly stable and accurate.
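A minimal two-variable sketch of this construction is given below, following the strip of Figure 1b. It substitutes scipy.stats.gaussian_kde for the 1D PDF estimator (so the shrinkage-free property discussed above does not carry over to the sketch), and the sample size and quantile-based cell centers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal((2, 2500))            # bivariate standard Gaussian sample

# One data strip in x1 (the Figure 1b strip boundaries are reused for illustration).
lo, hi = -0.82, -0.21
in_strip = (x1 > lo) & (x1 < hi)

p1 = gaussian_kde(x1)                              # marginal density of x1
p2_given_strip = gaussian_kde(x2[in_strip])        # conditional density along the centerline

x1_center = 0.5 * (lo + hi)                        # centerline location in x1
x2_centers = np.quantile(x2, (np.arange(5) + 0.5) / 5)   # cell centers along x2

# Joint density at the five cell centers within this strip (cf. Equation (6)).
joint_at_centers = p2_given_strip(x2_centers) * p1([x1_center])[0]
print(joint_at_centers)
```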
This process is iterated for each additional variable. Equation (6) generalizes to d dimensions by writing a joint probability density estimate as a conditional probability density estimate multiplied by a marginal probability density estimate, given as

$\hat{p}(x_1, x_2, \ldots, x_d) = \hat{p}(x_d \mid x_1, \ldots, x_{d-1})\, \hat{p}(x_1, \ldots, x_{d-1}).$ (7)

The 1D PDF estimator makes conditional probability density estimates based on stratified data assigned to a centerline. At the next iteration, the marginal probability density for the first d − 1 variables is expanded into another conditional probability density and a marginal probability density of d − 2 variables. Recursively repeating this process until a marginal probability density estimate of a single remaining variable is reached yields the expression

$\hat{p}\big(x_1^{(j_1)}, x_2^{(j_2)}, \ldots, x_d^{(j_d)}\big) = \hat{p}\big(x_1^{(j_1)}\big) \prod_{k=2}^{d} \hat{p}\big(x_k^{(j_k)} \mid x_1^{(j_1)}, \ldots, x_{k-1}^{(j_{k-1})}\big),$ (8)

where the value $x_k^{(j_k)}$ denotes a specific value of the k-th variable, $x_k$. The specific point is labeled using the index $j_k$, and it lies on a centerline for the k-th variable, midway between two grid lines that bound a data strip. In particular, each $j_k$ corresponds to a specific $x_k^{(j_k)}$ that is set by Equation (5). Although the factorization in Equation (8) is valid for all continuous values of the variables, the functions are evaluated only at the discrete grid points indicated. Indeed, all discrete points used to represent the d-dimensional joint probability density are initially determined by the univariate marginal probability densities for each dimension, as described above, where the grid lines are defined through Equation (5).
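As a worked instance of the recursion for an assumed d = 3, expanding Equation (7) twice gives the fully factored form of Equation (8):

$\hat{p}(x_1, x_2, x_3) = \hat{p}(x_3 \mid x_1, x_2)\, \hat{p}(x_1, x_2) = \hat{p}(x_3 \mid x_1, x_2)\, \hat{p}(x_2 \mid x_1)\, \hat{p}(x_1),$

with each factor evaluated at the grid points $x_k^{(j_k)}$ defined through Equation (5).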
In this approach, the 1D PDF estimator is applied many times: once per variable to obtain the marginal PDFs as an initialization step (i.e., d times), and once per centerline representing a conditional probability density based on stratified data. For example, the 1D PDF estimator is used seven times for the case shown in Figure 1, where d = 2 and each variable is divided into five strips (two marginal estimates plus five conditional estimates). In general, variable k contributes one conditional estimate per combination of strips of the preceding k − 1 variables, so with $N_s$ strips per variable the total number of calls is $d + \sum_{k=2}^{d} N_s^{\,k-1}$, which grows rapidly with both the dimension and the number of strips. A summary of the main steps at a high level is as follows (a minimal code sketch follows the list):
Estimate 1D marginal densities for each variable.
Define strip boundaries from the marginal CDF of each variable at equal probability spacing.
Estimate the conditional probability density at the centerline of each strip.
Initialize the joint probability density at the center of each cell to the marginal density of the first variable.
For each of the remaining variables, iterate v = 1 to d − 1:
- (a) Extract the sample points for successive stratification as v is incremented.
- (b) Project the sample points found in (a) onto the centerline representing a special point.
- (c) Estimate the conditional probability density using the 1D PDF estimator per centerline.
- (d) Multiply the estimated conditional probability density from (c) into the current value per cell.
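A minimal sketch of the listed steps for general d is shown below. It again substitutes scipy.stats.gaussian_kde for the 1D PDF estimator and uses empirical quantiles to place the grid, whereas the method described above obtains the grid from the estimated marginal CDFs; the strip count and sample size are illustrative.

```python
import itertools
import numpy as np
from scipy.stats import gaussian_kde

def product_array_density(data, n_strips=5, estimator_1d=gaussian_kde):
    """Sketch of the listed steps for d variables. Returns (centers, grid), where
    grid[(j1, ..., jd)] approximates the joint density at the corresponding cell center."""
    n, d = data.shape
    edge_probs = np.linspace(0.0, 1.0, n_strips + 1)
    center_probs = (np.arange(n_strips) + 0.5) / n_strips
    # Steps 1-2: per-variable strip boundaries and cell centers at equal probability spacing.
    edges = [np.quantile(data[:, j], edge_probs) for j in range(d)]
    centers = [np.quantile(data[:, j], center_probs) for j in range(d)]
    # Step 4: initialize every cell with the marginal density of the first variable.
    marg = estimator_1d(data[:, 0])(centers[0])
    grid = marg.reshape((n_strips,) + (1,) * (d - 1)) * np.ones((n_strips,) * d)
    # Step 5: for each remaining variable, stratify on all previously processed variables,
    # estimate the conditional density on each centerline, and multiply it into the cells.
    for k in range(1, d):
        for strip in itertools.product(range(n_strips), repeat=k):
            mask = np.ones(n, dtype=bool)
            for j, s in enumerate(strip):
                mask &= (data[:, j] >= edges[j][s]) & (data[:, j] <= edges[j][s + 1])
            vals = estimator_1d(data[mask, k])(centers[k])   # conditional at cell centers
            grid[strip] *= vals.reshape((n_strips,) + (1,) * (d - k - 1))
    return centers, grid

rng = np.random.default_rng(3)
sample = rng.standard_normal((5000, 3))
_, joint = product_array_density(sample, n_strips=4)
print(joint.shape)  # (4, 4, 4) grid of cell-center density estimates
```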
Note that dimension reduction methods could be used as a preprocessing step before this algorithm is used when dealing with high-dimensional data. This line of analysis will be reported elsewhere. Importantly, the implementation of a transformation as a preprocessing step does not change the algorithm presented in this paper.
2.3. Synthetic Data Benchmark
A Gaussian mixture model (GMM) of various types [12,15,16] is often used to benchmark estimation methods. The general form of a GMM with m modes is given as

$p(\mathbf{x}) = \sum_{i=1}^{m} w_i\, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ with $\sum_{i=1}^{m} w_i = 1$, (9)

where a Gaussian distribution in d dimensions is given as

$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^d\, |\boldsymbol{\Sigma}|}} \exp\!\left[ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right].$ (10)

To assess the accuracy and speed of the multivariate estimator, five families of GMMs are considered, with parameters given in Table 1. For each case, the covariance matrices depend on a correlation factor k, and several values of k are considered.
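A benchmark of this form can be set up as sketched below: define the mixture of Equations (9) and (10), sample from it, and evaluate its exact density for later error calculations. The two-mode layout, weights, and correlation factor shown are placeholders; the actual benchmark parameters are those of Table 1.

```python
import numpy as np
from scipy.stats import multivariate_normal

def make_gmm(d=3, k=0.5):
    """Two-mode example GMM with covariance off-diagonals set by a correlation factor k."""
    cov = np.full((d, d), k) + (1.0 - k) * np.eye(d)
    means = [np.zeros(d), 2.0 * np.ones(d)]              # modes along the body diagonal
    weights = np.array([0.5, 0.5])
    comps = [multivariate_normal(mean=m, cov=cov) for m in means]
    pdf = lambda x: sum(w * c.pdf(x) for w, c in zip(weights, comps))
    def sample(n, rng):
        counts = rng.multinomial(n, weights)             # component counts for n draws
        return np.vstack([c.rvs(size=ni, random_state=rng) for c, ni in zip(comps, counts)])
    return pdf, sample

rng = np.random.default_rng(4)
pdf, sample = make_gmm(d=3, k=0.5)
data = sample(10000, rng)
print(data.shape, pdf(np.zeros(3)))
```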
In addition, a single-mode Cauchy model is considered. The joint PDF is expressed as a function of the marginal and cumulative PDFs using a copula probability density, $c(\cdot)$, according to Sklar's theorem as

$p(x_1, \ldots, x_d) = c\big(F_1(x_1), \ldots, F_d(x_d)\big) \prod_{j=1}^{d} p_j(x_j),$ (11)

where the same Cauchy distribution is used for all the marginals in Equation (11). Five different copulas were considered for generating correlated random samples: Gaussian, t, Frank, Clayton, and Gumbel. The Gaussian and t copulas are easily extended to an arbitrary number of dimensions and were thus calculated separately and compared for all cases. The remaining Archimedean copulas were restricted to the analysis of bivariate data only. The copula correlations are set to produce an effect analogous to the correlation factor k used in the covariance matrices for the GMMs. The Cauchy distribution is given by

$p(x) = \frac{1}{\pi \gamma \left[ 1 + \left( \frac{x - x_0}{\gamma} \right)^{2} \right]},$ (12)

and in standard form the location and scale parameters are set to $x_0 = 0$ and $\gamma = 1$, respectively. Note that for distributions with a single mode, performance is unaffected by translations and scaling of variables. Hence, only standard parameters are reported.
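Correlated Cauchy samples of the kind described here can be generated with a Gaussian copula as sketched below: draw correlated standard normals, map them to uniforms with the normal CDF, and push the uniforms through the Cauchy quantile function. The correlation value and dimension are placeholders.

```python
import numpy as np
from scipy import stats

def gaussian_copula_cauchy(n, d=2, rho=0.5, rng=None):
    """Sample n points whose marginals are standard Cauchy and whose dependence
    comes from a Gaussian copula with constant pairwise correlation rho."""
    rng = np.random.default_rng() if rng is None else rng
    corr = np.full((d, d), rho) + (1.0 - rho) * np.eye(d)
    z = rng.multivariate_normal(np.zeros(d), corr, size=n)   # correlated standard normals
    u = stats.norm.cdf(z)                                     # copula: uniforms on (0, 1)
    return stats.cauchy.ppf(u)                                # standard Cauchy marginals

samples = gaussian_copula_cauchy(5000, d=2, rho=0.5, rng=np.random.default_rng(5))
print(samples.shape)
```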
A visualization of a typical estimate is shown in Figure 2 for the third GMM (two heavily overlapped modes) listed in Table 1. To assess the accuracy of an estimate, the mean squared error (MSE) between the estimate and the known distribution is used. The MSE is defined as

$\mathrm{MSE} = \frac{1}{N_e} \sum_{i=1}^{N_e} \big( \hat{p}(\mathbf{x}_i) - p(\mathbf{x}_i) \big)^2,$ (13)

for $N_e$ evaluation points $\{\mathbf{x}_i\}$, which are obtained by applying the estimated quantile functions to a hypercube {g} of regular, uniformly spaced points on [0, 1], covering all possible combinations across the d variables. In addition to the direct formula of Equation (13), the MSE was also calculated for subsamples of points. First, the estimates at all evaluation points are sorted from largest to smallest. Sweeping over a range of percentiles allows the estimates to be stratified into nine distinct bins. For each bin, the MSE is calculated for all points that fall into the bin. It was found that the MSE is distributed approximately uniformly across all bins for all cases discussed in this subsection.
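The error measure can be sketched as below: build the evaluation grid from per-variable quantiles, apply Equation (13), and additionally report the MSE within percentile bins of the estimated density. Empirical quantiles, the callable signatures of p_true and p_hat, and nine equal-count bins are assumptions of the sketch.

```python
import numpy as np

def mse_on_quantile_grid(p_true, p_hat, samples, pts_per_dim=10, n_bins=9):
    """MSE between p_hat and p_true on a hypercube of evaluation points obtained by
    mapping uniformly spaced probabilities through each variable's quantile function,
    plus the MSE within percentile bins of the estimated density.

    p_true, p_hat: callables taking an (N, d) array of points and returning densities."""
    n, d = samples.shape
    probs = (np.arange(pts_per_dim) + 0.5) / pts_per_dim
    axes = [np.quantile(samples[:, j], probs) for j in range(d)]   # per-variable quantiles
    mesh = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([m.ravel() for m in mesh], axis=1)              # all combinations
    err2 = (p_hat(pts) - p_true(pts)) ** 2
    order = np.argsort(p_hat(pts))[::-1]                           # largest estimate first
    bins = np.array_split(err2[order], n_bins)
    return err2.mean(), [b.mean() for b in bins]
```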
An important property of a product array of conditional probabilities to assess is the sensitivity to variable ordering when calculating Equation (8) using the five-step process outlined above. For d variables, there are d! possible orderings that can be executed. A multivariate distribution for d = 4 is constructed using the fifth GMM listed in Table 1 with four modes, designed to ensure that each dimension has its own distinct features. Density estimates for sample sizes 100 and 1000 are calculated and compared over 10 samples for each of the 24 possible orderings. At each of the grid points determined by Equation (5), which do not depend on the variable order, there is no discernible difference between the mean variance due to different variable orders and the variance obtained between samples (data not shown). Based on this result, no conditions are imposed on how to choose the variable order. It is worth noting that an average could be performed over different variable orders if significant differences between orders occurred. However, no case has been found where accuracy or speed is sensitive to variable ordering. The effect of differences in variable ordering, if any, will be investigated when dimension reduction methods, such as principal component analysis, are combined with this algorithm.
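Ordering sensitivity can be probed directly by permuting the columns of the sample before constructing the estimate, as sketched below; product_array_density refers to the earlier illustrative sketch, not to the authors' implementation, and the spread statistic is simply the largest standard deviation across orderings.

```python
import itertools
import numpy as np

def ordering_spread(data, estimate_fn, n_strips=4):
    """Rebuild the density grid under every variable ordering and report the largest
    standard deviation of the cell-center estimates across orderings."""
    d = data.shape[1]
    grids = []
    for order in itertools.permutations(range(d)):        # all d! orderings
        _, grid = estimate_fn(data[:, list(order)], n_strips=n_strips)
        # Undo the axis permutation so that cells line up across orderings.
        grids.append(np.moveaxis(grid, range(d), order))
    return np.stack(grids).std(axis=0).max()

# Example: spread = ordering_spread(sample, product_array_density)
```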
As a further benchmark, the first through fourth GMMs listed in Table 1 are used to assess the speed and accuracy of the method. The GMM parameters are such that all variable orderings are statistically the same, because the distributions are centered on the origin and along a single direction (the body diagonal) in d dimensions. This means that the center positions of all Gaussian distributions within the mixture lie along a one-dimensional line that is not optimal for any of the variables. These datasets were generated for two through six variables, using degenerate parameters in each dimension to provide a clear demonstration of trends as the dimension increases or the correlation factor is changed. In addition, dozens of other distributions were generated and analyzed (data not shown) to verify that symmetrical GMMs are not artificially more accurate than cases without special symmetries. Therefore, as a fair representation of the accuracy of the method, the average MSE is calculated over 10 trials for six distributions: the four GMMs in Table 1, a single-mode Cauchy modeled with a Gaussian copula, and an additional single-mode Cauchy using a t-copula. Results spanning five orders of magnitude in sample size are listed in Table 2.
Several trends are noticeable and are summarized in the plots of Figure 3, averaged over the four GMMs and the two Cauchy distributions. The top row shows calculation time as a function of sample size with no correlation (left) and as a function of linear correlation for a sample size of 100,000 (right). The bottom row of Figure 3 shows the accuracy for the same parameters. In summary, as sample size increases, computational time increases and accuracy improves, as expected. As variables become strongly correlated, accuracy and computational time both decrease. For all distributions, higher dimensions require more time to estimate. Somewhat less intuitively, accuracy improves when the dimension is increased, which can be observed regardless of the subsampling applied to Equation (13).