Article

Quasar Identification Using Multivariate Probability Density Estimated from Nonparametric Conditional Probabilities

1 Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
2 Department of Physics and Optical Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
* Author to whom correspondence should be addressed.
Affiliate Faculty of the UNC Charlotte School of Data Science.
Mathematics 2023, 11(1), 155; https://doi.org/10.3390/math11010155
Submission received: 30 October 2022 / Revised: 12 December 2022 / Accepted: 26 December 2022 / Published: 28 December 2022
(This article belongs to the Special Issue Probability Distributions and Their Applications)

Abstract: Nonparametric estimation for a probability density function that describes multivariate data has typically been addressed by kernel density estimation (KDE). A novel density estimator recently developed by Farmer and Jacobs offers an alternative high-throughput automated approach to univariate nonparametric density estimation based on maximum entropy and order statistics, improving accuracy over univariate KDE. This article presents an extension of the single variable case to multiple variables. The univariate estimator is used to recursively calculate a product array of one-dimensional conditional probabilities. In combination with interpolation methods, a complete joint probability density estimate is generated for multiple variables. Good accuracy and speed performance in synthetic data are demonstrated by a numerical study using known distributions over a range of sample sizes from 100 to $10^6$ for two to six variables. Performance in terms of speed and accuracy is compared to KDE. The multivariate density estimate developed here tends to perform better as the number of samples and/or variables increases. As an example application, measurements are analyzed over five filters of photometric data from the Sloan Digital Sky Survey Data Release 17. The multivariate estimation is used to form the basis for a binary classifier that distinguishes quasars from galaxies and stars with up to 94% accuracy.

1. Introduction

The estimation of probability density is an important topic of research in statistical inference across a myriad of applications. Density estimation has become increasingly relevant as the amount of available data grows massive, especially in fields such as bioinformatics and astronomy. Nonparametric methods are of general interest because they impose minimal conditions on the underlying functional form of the probability density and are therefore applicable without prior knowledge of specific data characteristics. Kernel density estimation (KDE), first introduced over 60 years ago [1,2], remains a viable model-free method with well-established properties [3]. A shortcoming of KDE, however, is its poor handling of bounded data, referred to as the boundary problem, often addressed by the reflection method [4] or alternate kernel selection [5,6,7]. In addition, the selection of bandwidth to control smoothness in KDE is generally a challenge. Thus, much attention has been devoted to intelligent and adaptive bandwidth methods [8,9].
Nonparametric density estimation becomes more difficult as additional random variables are considered. In particular, KDE loses computational efficiency in higher dimensional spaces, with more serious boundary and bandwidth problems to reconcile. Despite these difficulties, recent research continues to advance the field of multivariate nonparametric estimation using kernel density techniques [10,11,12,13,14,15]. Methods of dimension reduction are often used to mitigate problems arising from high dimensions by identifying the most relevant variables [16,17]. Other methods for density estimation in high dimensions include neural networks [18,19,20,21], maximum entropy [22,23,24,25,26,27], and semi-parametric methods such as Gaussian mixture model analysis [28]. For further discussion, comparison, and review of these and other methods, see [29,30,31,32].
Previously, Farmer and Jacobs introduced a nonparametric density estimator for univariate data [33] that suffers from neither boundary problems nor bandwidth selection. Based on combining a maximum entropy approach with universal quality measures for the estimate using single order statistics [34], the method was shown to produce accurate results on many diverse data sets compared to competing implementations [35,36]. In the rest of this paper, a multivariate extension is developed in Section 2 and benchmarked against synthetic data to demonstrate accuracy and performance. In Section 3, the results are compared with a MATLAB implementation of KDE. In Section 4, a sample of astronomical data from the Sloan Digital Sky Survey is analyzed to show the utility of our multivariate probability density estimation method when applied to classification; the results are competitive with state-of-the-art machine learning (ML) classifications. Conclusions are drawn in Section 5.

2. Multivariate Probability Density Estimation

The multivariate estimator described in this section is based on an underlying univariate method available as C++ or Java classes, with user interfaces plugged into both R [37] and MATLAB [38]. The first subsection provides a brief background to previous and recent research on multivariate maximum entropy methods. The new method is then developed, and the last subsection evaluates performance in terms of accuracy and computational time. Trends according to sample size and dimensionality are presented for mixtures of Gaussian distributions and a product of Cauchy distributions using a variety of copulas.

2.1. Maximum Entropy Method

The maximum entropy method (MEM) for the estimation of probability density is a rigorous method for producing unbiased parametric estimates [39]. A brief summary of the multivariate case for $d$ variables is provided as background for related previous work. The variables are described as a vector in a $d$-dimensional space given by $\mathbf{x} = (x_1, x_2, \ldots, x_d)$. Moments in the $i$-th direction are given as $\alpha_i$. To describe different moments for each component of $\mathbf{x}$, it is convenient to define $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_d)$. A general moment in terms of $x_i$ is given as $\prod_{i=1}^{d} x_i^{\alpha_i}$. These product moments can be further generalized by using $d$ Cartesian product functions given by

$$\mu_{\boldsymbol{\alpha}} = \int \rho(\mathbf{x}) \prod_{i=1}^{d} f_i^{\alpha_i}(x_i)\, dx_i. \tag{1}$$
The MEM obtains an unbiased density $\rho(\mathbf{x})$ by maximizing the Shannon entropy while satisfying all known constraints, including the normalization condition. This can be expressed as an optimization problem by maximizing the expression

$$H = -\int \rho(\mathbf{x}) \log \rho(\mathbf{x}) \prod_{i=1}^{d} dx_i + \sum_{\boldsymbol{\alpha}} \gamma_{\boldsymbol{\alpha}} \int \rho(\mathbf{x}) \prod_{i=1}^{d} f_i^{\alpha_i}(x_i)\, dx_i + \lambda \int \rho(\mathbf{x}) \prod_{i=1}^{d} dx_i. \tag{2}$$
Solving the extremum problem formally gives the solution to Equation (2) to be of the form
$$\hat{\rho}(\mathbf{x}) = A \exp\left[ \sum_{\boldsymbol{\alpha}} \gamma_{\boldsymbol{\alpha}} \prod_{i=1}^{d} f_i^{\alpha_i}(x_i) \right], \tag{3}$$
where $\hat{\rho}(\mathbf{x})$ is the sample estimate of the probability density function (PDF), $A$ is a normalization constant, and $\gamma_{\boldsymbol{\alpha}}$ is a set of Lagrange coefficients. Determining $\gamma_{\boldsymbol{\alpha}}$ for a set of moments $\mu_{\boldsymbol{\alpha}}$ is computationally difficult, particularly as the number of moments becomes large and as the number of dimensions increases. Computational methods have been developed for a limited number of moments and dimensions [22,25], including more recent methods based on fractional moments. In other works, MEM is combined with dimension reduction techniques [26,27] to mitigate the computational difficulties of higher dimensions. A more sophisticated approach first creates a histogram of the sampled data to obtain a rough estimate, and then estimates the coefficients of 1D slices of the histogram in each dimension. These coefficients are then expanded in $d$ dimensions to produce a grid of coefficients used in Equation (3) with Legendre polynomials as a basis set [23,25].
Our recent nonparametric probability density estimator for univariate data [33,34,35,36] begins with MEM to obtain the general form of the solution given in Equation (3) for the case $d = 1$. Chebyshev polynomials are used as a basis, with the domain on $[-1, 1]$ after the data are scaled. The $\gamma_{\alpha}$ coefficients are optimized based on a trial cumulative distribution function (CDF), given by $F_X(x \mid \gamma_{\alpha})$, where $r_n = F_X(x_n \mid \gamma_{\alpha})$ is evaluated for all data points $\{x_n\}$ in the random sample. The set $\{r_n\}$ will be uniform on $[0, 1]$ when the parameter set $\gamma_{\alpha}$ is correct within statistical resolution. Instead of specifying a preset number of Lagrange multipliers, the coefficients are optimized by a random search method, adding more terms to the expansion if necessary. The process continues until the CDF ensures that $\{r_n\}$ is distributed as sampled uniform random data (SURD). The deviation from SURD is measured using a scoring function based on single order statistics, which is sample size invariant [34]. The method is stable for several hundred Lagrange multipliers. Key features of the method include outlier removal and adaptive resolution, avoiding the bandwidth and boundary problems of KDE methods. A generalization of this univariate density estimation method to higher dimensions using the form given in Equation (3) is not possible, because the properties of order statistics are lost once $d > 1$ is considered. Nevertheless, a multivariate PDF can be obtained by calculating many 1D PDFs in terms of a product array of conditional probability functions.
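To make the SURD criterion concrete, the following minimal MATLAB sketch transforms a sample through a trial CDF and checks the result against uniform order statistics. This is not the authors' implementation: the standard normal CDF stands in for the optimized maximum entropy expansion, and the simple z-score check stands in for the single-order-statistics scoring function of [34].

% Sketch of the SURD criterion. normcdf is a stand-in for the trial CDF
% F_X(x | gamma_alpha); the z-score check is a stand-in for the authors'
% sample-size-invariant scoring function based on single order statistics.
n = 1000;
x = randn(n, 1);                     % random sample from the true distribution
r = sort(normcdf(x));                % r_n should behave as uniform order statistics
k = (1:n)';
mu = k / (n + 1);                    % mean of the k-th uniform order statistic
sd = sqrt(mu .* (1 - mu) / (n + 2)); % its standard deviation (Beta(k, n+1-k))
z = (r - mu) ./ sd;                  % standardized deviations from SURD
fprintf('max |z| = %.2f (O(1) values indicate SURD)\n', max(abs(z)));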

2.2. Product Array of Conditional Probabilities

Similar in strategy to other methods [23,25], a multivariate probability density is estimated using multiple strips of one-dimensional conditional probabilities. The multivariate joint probability distribution is expressed as a product of $d - 1$ conditional probability distributions and a marginal distribution. To illustrate the method, consider the simplest multivariable case of two variables. The joint probability of two variables is expressed as a product of conditional and marginal probabilities as $\rho(x_1, x_2) = \rho(x_1 \mid x_2)\, \rho(x_2)$. Estimating the conditional and marginal probability densities separately using the 1D PDF estimator, the joint probability density is estimated as $\hat{\rho}(x_1, x_2) = \hat{\rho}(x_1 \mid x_2)\, \hat{\rho}(x_2)$.
Figure 1a demonstrates a two-dimensional example with a 5 by 5 grid of cells, numbered from 1 to 25, superimposed over a scatter plot of sample data from a bivariate standard Gaussian distribution. The grid lines shown are used to stratify the data into strips. The number of grid lines is the same in each dimension. For dimension $d$ with $d > 1$, the number of grid lines, denoted by $n_l$, is heuristically given as

$$n_l = \left( \frac{n}{100} \right)^{\frac{1}{d-1}} + 1, \tag{4}$$
where $n$ is the sample size. The scale factor of 100 provides a good balance between accuracy and computation time. Moreover, Equation (4) suggests that $n_l = 3$ is a practical minimum number of grid lines to use. It is found that $n \approx 100 \times 2^{d-1}$ (for $d \geq 1$) gives a reasonable guide for how many independent observations produce a good nonparametric estimate as a function of dimension. Since the details of a particular problem affect the minimum sample size needed to obtain an accurate estimate, this heuristic sets a reasonable reference. For instance, sharp features or heavy tails in the probability density require more samples, while fewer samples suffice in high dimensions when there are strong correlations between the $d$ random variables.
As seen in Figure 1, the grid line spacing is not uniform, because the spacing adapts to the density of the data in each dimension. This is achieved by first calculating the marginal densities with the 1D PDF estimator for each variable. The range of the CDF on $[0, 1]$ is divided into equally spaced marks separated by $1/(n_l - 1)$, where $n_l$ is determined by Equation (4). To map from $[0, 1]$ back to the variable, the locations of the grid lines are obtained using the inverse of the estimated CDF, which is the estimated quantile function $\hat{q}_j(u)$ for the $j$-th variable. The grid lines divide the percent of data uniformly for the $j$-th variable at the following locations:

$$\hat{q}_j\!\left(\frac{1}{n}\right),\ \hat{q}_j\!\left(\frac{1}{n_l - 1}\right),\ \hat{q}_j\!\left(\frac{2}{n_l - 1}\right),\ \ldots,\ \hat{q}_j\!\left(\frac{n_l - 2}{n_l - 1}\right),\ \hat{q}_j\!\left(1 - \frac{1}{n}\right). \tag{5}$$

For the case shown in Figure 1, $n_l = 6$. In general, this method creates $(n_l - 1)^d$ cells.
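A minimal MATLAB sketch of this grid construction follows, with ksdensity serving only as a stand-in for the authors' 1D PDF estimator; the floor of three grid lines suggested by Equation (4) is enforced explicitly.

% Sketch of Equations (4) and (5): adaptive grid lines from an estimated
% marginal CDF. ksdensity stands in for the authors' univariate estimator.
n = 3600; d = 2;
x = randn(n, 1);                                  % one variable of the sample
nl = max(3, round((n / 100)^(1 / (d - 1)) + 1));  % Equation (4)
[F, xi] = ksdensity(x, 'Function', 'cdf');        % estimated marginal CDF
[F, iu] = unique(F);                              % guard against flat CDF tails
u = [1/n, (1:nl-2) / (nl - 1), 1 - 1/n];          % CDF marks of Equation (5)
gridLines = interp1(F, xi(iu), u);                % inverse CDF at the marks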
The estimated density is evaluated along the centerline between the two grid lines that define the boundaries of a data strip, as shown in Figure 1b for the $x_1$ variable. The blue points represent all observations of $x_1$ that fall within the strip ($-0.82 < x_1 < -0.21$). These points are projected onto the vertical centerline (ignoring the $x_1$ values) to create a 1D problem where the probability density is a function of $x_2$ along the centerline positioned at $x_1 = -0.51$. This process is repeated for five centerlines representing each data strip. Special points along the estimated conditional probabilities corresponding to cell centers are evaluated to obtain an estimate of the joint probability density at the center of each cell. Interpolation is used to obtain estimates anywhere else in the hypercube if needed. The joint probability density estimate at the center of cell #9 in Figure 1a, for example, is the product of two 1D PDF estimates:

$$\hat{\rho}(-0.51, 0.58) = \hat{\rho}(x_2 = 0.58 \mid x_1 = -0.51)\, \hat{\rho}(x_1 = -0.51). \tag{6}$$
Because a conditional probability density is multiplied by a marginal probability density, normalizing the constructed joint probability density function is unnecessary; normalization of the entire joint probability density is automatic. Since there is no smoothing function (as used in KDE through a selected kernel), the shrinkage that always exists with KDE is avoided. The general procedure of using a product array of conditional probability densities and a marginal probability density is markedly stable and accurate.
This process is iterated for each additional variable. Equation (6) generalizes to d dimensions by writing a joint probability density estimate as a conditional probability density estimate multiplied by a marginal probability density estimate, given as
$$\hat{\rho}(x_1, \ldots, x_d) = \hat{\rho}(x_1 \mid x_2, \ldots, x_d)\, \hat{\rho}(x_2, \ldots, x_d). \tag{7}$$
The 1D PDF estimator makes conditional probability density estimates based on stratified data assigned to a centerline. At the next iteration, the marginal probability density for $d - 1$ variables is expanded into another conditional probability density and a marginal probability density of $d - 2$ variables. Recursively repeating this process until a marginal probability density estimate of the last variable is reached yields the expression

$$\begin{aligned} \hat{\rho}\big(x_1(i_1), \ldots, x_d(i_d)\big) ={}& \hat{\rho}\big(x_1(i_1) \mid x_2(i_2), \ldots, x_d(i_d)\big)\, \hat{\rho}\big(x_2(i_2) \mid x_3(i_3), \ldots, x_d(i_d)\big) \\ & \times \hat{\rho}\big(x_3(i_3) \mid x_4(i_4), \ldots, x_d(i_d)\big) \cdots \hat{\rho}\big(x_{d-1}(i_{d-1}) \mid x_d(i_d)\big)\, \hat{\rho}\big(x_d(i_d)\big), \end{aligned} \tag{8}$$

where the value $x_k(i_k)$ denotes a specific value of the $k$-th variable, $x_k$. The specific point is labeled using the index $i_k$, which lies on a centerline for the $k$-th variable, midway between the two grid lines that bound a data strip. In particular, each $x_k(i_k)$ corresponds to a specific $\hat{q}_k$ that is set by Equation (5). Although Equation (8) is valid for all continuous values of $\mathbf{x}$, as indicated, the functions are evaluated only at discrete grid points. Indeed, all discrete points used to represent the $d$-dimensional joint probability density are initially determined by the univariate marginal probability densities for each dimension, as described above, where the $n_l$ grid lines are placed through Equation (5).
In this approach, the 1D PDF estimator is applied many times: once per variable for the marginal PDF as an initialization step (i.e., $d$ times), and once per centerline representing a conditional probability density based on stratified data. For example, the 1D PDF estimator is used seven times when $n_l = 6$ and $d = 2$, as shown in Figure 1. In general, the number of calls is $N_c = d + \sum_{k=2}^{d} (n_l - 1)^{k-1}$. For comparison, $N_c = 785$ when $n_l = 6$ and $d = 5$. A summary of the main steps, a five-step process, is as follows (a bivariate sketch is given after the list):
1. Estimate the 1D marginal densities for each variable.
2. Define strip boundaries from the marginal CDF per variable at equal probability spacing.
3. Estimate the conditional probability density at the centerline of each strip.
4. Initialize the joint probability density at the center of each cell to the marginal $\hat{\rho}(x_d)$.
5. For each of the remaining variables, iterate $v = 1$ to $d - 1$:
   (a) Extract the sample points for successive stratification as $v$ is incremented.
   (b) Project the sample points found in (a) onto the centerline representing a special point.
   (c) Estimate the conditional probability density using the 1D PDF estimator per centerline.
   (d) Multiply the estimated conditional probability density from (c) into the current value per cell.
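As a concrete illustration, the following MATLAB sketch carries out the bivariate version of this five-step process. It is a minimal sketch, not the authors' implementation: ksdensity stands in for the 1D maximum entropy estimator throughout.

% Minimal 2D sketch of the product-array construction (Section 2.2).
% ksdensity is a stand-in for the authors' 1D maximum entropy estimator.
n = 2500; d = 2;
X = mvnrnd([0 0], [1 0.5; 0.5 1], n);              % bivariate Gaussian sample
nl = round((n / 100)^(1 / (d - 1)) + 1);           % Equation (4): grid lines
u = [1/n, (1:nl-2) / (nl - 1), 1 - 1/n];           % Equation (5): CDF marks

edges = zeros(2, nl);
for j = 1:2                                        % steps 1-2: adaptive grid per variable
    [F, xi] = ksdensity(X(:,j), 'Function', 'cdf');
    [F, iu] = unique(F);                           % guard against flat CDF tails
    edges(j,:) = interp1(F, xi(iu), u);
end
c1 = (edges(1,1:end-1) + edges(1,2:end)) / 2;      % centerlines of the x1 strips
c2 = (edges(2,1:end-1) + edges(2,2:end)) / 2;      % cell centers along x2

p1 = ksdensity(X(:,1), c1);                        % marginal rho_hat(x1) at centerlines
joint = zeros(nl-1, nl-1);                         % (nl - 1)^d cells
for s = 1:nl-1                                     % steps 3-5: one conditional per strip
    inStrip = X(:,1) >= edges(1,s) & X(:,1) < edges(1,s+1);
    p2given1 = ksdensity(X(inStrip,2), c2);        % rho_hat(x2 | x1 = c1(s))
    joint(s,:) = p1(s) * p2given1;                 % Equation (6) per cell
end
% joint(i,j) estimates rho(x1 = c1(i), x2 = c2(j)); interpolation (e.g.,
% interp2) provides estimates anywhere else in the plane.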
Note that dimension reduction methods could be used as a preprocessing step before this algorithm is used when dealing with high-dimensional data. This line of analysis will be reported elsewhere. Importantly, the implementation of a transformation as a preprocessing step does not change the algorithm presented in this paper.

2.3. Synthetic Data Benchmark

A Gaussian mixture model (GMM) of various types [12,15,16] is often used to benchmark estimation methods. The general form of a GMM with m modes is given as
$$\rho(\mathbf{x}) = \sum_{k=1}^{m} w_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k), \tag{9}$$
where a Gaussian distribution in d dimensions is given as
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_k|}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^{T} \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right]. \tag{10}$$
To assess the accuracy and speed of the multivariate estimator, five families of GMMs are considered, with parameters given in Table 1. For each case, the covariance matrices have elements $\Sigma_k^{ij} = \kappa + (1 - \kappa)\, \delta_{ij}$, where the correlation values $\kappa = \{0.0, 0.15, 0.3, 0.45, 0.6, 0.75, 0.9\}$ are considered.
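A short MATLAB sketch of this benchmark construction might look as follows, here using the two-mode GMM of Table 1 with the covariance structure above (the dimension and $\kappa$ are chosen only for illustration):

% Sketch of the benchmark covariance, Sigma_ij = kappa + (1 - kappa)*delta_ij,
% applied to the two-mode GMM of Table 1 (slightly overlapped modes).
d = 3; kappa = 0.6;
Sigma = kappa * ones(d) + (1 - kappa) * eye(d);   % unit variance, correlation kappa
mu = [zeros(1, d); 5 * ones(1, d)];               % mu_1 = 0, mu_2 = 5 along the diagonal
gm = gmdistribution(mu, cat(3, Sigma, Sigma), [0.4 0.6]);  % weights from Table 1
X = random(gm, 10000);                            % synthetic benchmark sample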
In addition, a single mode Cauchy model for $d = 2, 3, 4, 5, 6$ is considered. The joint PDF is expressed as a function of the marginal densities and CDFs using a copula probability density, $c(\cdot)$, according to Sklar's theorem as

$$\rho(x_1, x_2, \ldots, x_d) = c\big(F_1(x_1), F_2(x_2), \ldots, F_d(x_d) \mid \kappa\big) \prod_{k=1}^{d} \rho_k(x_k), \tag{11}$$
where the same Cauchy distribution is used for all the marginals in Equation (11). Five different copulas were considered for generating correlated random samples: Gaussian, t, Frank, Clayton, and Gumbel. The Gaussian and t copulas are easily extended to an arbitrary number of dimensions and were thus calculated separately and compared for all cases. The remaining Archimedean copulas were restricted to the analysis of bivariate data only. The copula correlations are set as $\kappa = 0.0, 0.15, 0.3, 0.45, 0.6, 0.75, 0.9$ to produce an effect analogous to the $\kappa$ used in the covariance matrices for the GMMs. The Cauchy distribution is given by
$$C(x \mid x_0, \gamma) = \frac{1}{\pi \gamma} \left[ \frac{\gamma^2}{(x - x_0)^2 + \gamma^2} \right], \tag{12}$$

and in standard form the location and scale parameters are set to $x_0 = 0$ and $\gamma = 1$, respectively. Note that for distributions with a single mode, performance is unaffected by translations and scalings of the variables. Hence, only standard parameters are reported.
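A minimal MATLAB sketch of Equation (11) in sampling form follows: correlated uniforms from a copula are pushed through the standard Cauchy quantile function $x_0 + \gamma \tan(\pi(u - 1/2))$. The Clayton parameter below is chosen only for illustration and is not one of the $\kappa$ values above.

% Sketch: correlated Cauchy samples via a copula, per Equation (11).
n = 10000; kappa = 0.6;
U = copularnd('Gaussian', [1 kappa; kappa 1], n);  % bivariate Gaussian copula
X = tan(pi * (U - 0.5));                           % standard Cauchy marginals (x0 = 0, gamma = 1)
V = copularnd('Clayton', 2, n);                    % a bivariate Archimedean alternative
Y = tan(pi * (V - 0.5));                           % same Cauchy marginals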
A visualization of a typical estimate for $d = 3$ is shown in Figure 2 for the third GMM listed in Table 1 (two heavily overlapped modes) with $\kappa = 0$. To assess the accuracy of an estimate, the mean squared error (MSE) between the estimate and the known distribution is used. The MSE is defined as

$$\mathrm{MSE} = \frac{1}{p} \sum_{g=1}^{p} \left[ \rho\big(\mathbf{x}^{(g)}\big) - \hat{\rho}\big(\mathbf{x}^{(g)}\big) \right]^2 \tag{13}$$
for $p = 9^d$ evaluation points defined as

$$\mathbf{x}^{(g)} \in \big\{ \hat{q}_1(g),\, \hat{q}_2(g),\, \ldots,\, \hat{q}_d(g) \big\}, \tag{14}$$

where $\{g\}$ forms a hypercube of arguments to the estimated quantile functions at the regular uniformly spaced points $\{0.1, 0.2, \ldots, 0.9\}^d$, covering all $9^d$ possible combinations. In addition to the direct formula of Equation (13), the MSE was also calculated for subsamples of points. First, all $\rho(\mathbf{x}^{(g)})$ are sorted from largest to smallest. Sweeping over a range of percentiles allows the estimates to be stratified into nine distinct bins. For each bin, the MSE is calculated over all points that fall into it. The MSE was found to be distributed approximately uniformly across all bins for all cases discussed in this subsection.
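The evaluation grid of Equations (13) and (14) can be sketched in MATLAB as follows. For brevity, a crude product-of-marginals density stands in for the estimate, which is sensible here only because the simulated components are independent.

% Sketch of Equations (13) and (14): MSE on the 9^d quantile grid (d = 2).
d = 2; n = 10000;
X = mvnrnd(zeros(1, d), eye(d), n);      % synthetic sample with known density
probs = 0.1:0.1:0.9;                     % uniformly spaced CDF arguments
q1 = quantile(X(:,1), probs);            % stand-ins for the estimated quantiles
q2 = quantile(X(:,2), probs);
[G1, G2] = ndgrid(q1, q2);               % all 9^2 combinations
pts = [G1(:), G2(:)];
rhoTrue = mvnpdf(pts);                   % known density at the evaluation points
rhoHat = ksdensity(X(:,1), pts(:,1)) .* ksdensity(X(:,2), pts(:,2));
mse = mean((rhoTrue - rhoHat).^2);       % Equation (13)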
An important property to assess for a product array of conditional probabilities is sensitivity to variable ordering when calculating Equation (8) using the five-step process outlined above. For $d$ variables, there are $d!$ possible orderings. A multivariate distribution for $d = 4$ is constructed using the fifth GMM listed in Table 1 with four modes, designed to ensure that each dimension has its own distinct features. Density estimates for sample sizes 100 and 1000 are calculated and compared over 10 samples for each of the 24 possible orderings. For each of the grid points determined by Equation (5), which do not depend on the variable order, there is no discernible difference between the mean variance due to different variable orders and the variance obtained between samples (data not shown). Based on this result, no conditions are imposed on how to choose the variable order. It is worth noting that an average could be performed over different variable orders if significant order-dependent differences occurred; however, no case has been found where accuracy or speed is sensitive to variable ordering. Any effect of variable ordering will be investigated further when dimension reduction methods, such as principal component analysis, are combined with this algorithm.
As a further benchmark, the first through fourth GMMs listed in Table 1 are used to assess the speed and accuracy of the method. The GMM parameters are such that all variable orderings are statistically the same, because the distributions are centered on the origin and along a single direction (the body diagonal) in $d$ dimensions. This means that the center positions of all Gaussian distributions within the mixture lie along a one-dimensional line that is not optimal for any of the variables. These datasets were generated for two through six variables, using degenerate parameters in each dimension to provide a clear demonstration of trends as the dimension increases or the correlation factor changes. In addition, dozens of other distributions were generated and analyzed (data not shown) to verify that symmetrical GMMs are not artificially more accurate than cases without special symmetries. Therefore, as a fair representation of the accuracy of the method, the average MSE is calculated over 10 trials for six distributions: the four GMMs in Table 1, a single mode Cauchy modeled with a Gaussian copula, and an additional single mode Cauchy using a t-copula with parameter $\nu = 1$. Examples for $d = 2, 3, 4, 5, 6$ and five orders of magnitude variation in sample size are listed in Table 2.
Several trends are noticeable and are summarized in the plots in Figure 3, averaged over the four GMMs and two Cauchy distributions. The top row shows calculation time as a function of sample size with no correlation (left) and as a function of linear correlation for a sample size of 100,000 (right). The bottom row of Figure 3 shows the accuracy for the same parameters. In summary, as sample size increases, computational time increases and accuracy improves, as expected. As variables become strongly correlated, both accuracy and computational time decrease. For all distributions, higher dimensions require more time to estimate. Somewhat less intuitively, accuracy improves as the dimension increases, which is observed regardless of the subsampling used in Equation (13).

3. Comparison to Kernel Density Estimation

This section extends the synthetic data analysis to include a one-to-one comparison of our method with KDE as implemented in MATLAB. The analysis is also extended to include additional copulas for two dimensions. All comparisons were constructed using MATLAB functionality, which provides robust and reliable implementations for generating copula PDFs and KDE estimates. The standard KDE implementation in MATLAB, mvksdensity(), is employed for the comparison. In addition to a random data sample, the mvksdensity() function requires as input the requested points for estimation and a bandwidth for each variable. The grid of estimation points returned from the multivariate PDF estimator was used to ensure a fair comparison for resolution and calculation time. For bandwidth, Silverman's rule of thumb [40] was calculated as

$$bw_i = \sigma_i \left[ \frac{4}{(d + 2)\, n} \right]^{\frac{1}{d + 4}}, \tag{15}$$

where $\sigma_i$ is the standard deviation of the $i$-th variable.
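A minimal MATLAB sketch of this comparison setup follows: bandwidths from Equation (15) passed to mvksdensity. For simplicity the evaluation points here are a subset of the sample rather than the grid produced by our estimator.

% Sketch of the KDE comparison: Silverman bandwidths, Equation (15),
% supplied to mvksdensity at a chosen set of evaluation points.
d = 3; n = 5000;
X = mvnrnd(zeros(1, d), eye(d), n);                % synthetic sample
bw = std(X) * (4 / ((d + 2) * n))^(1 / (d + 4));   % one bandwidth per variable
pts = X(1:100, :);                                 % evaluation points (illustrative)
tic; f = mvksdensity(X, pts, 'Bandwidth', bw); toc % KDE estimate with timing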
As already mentioned, Gaussian and t-copulas are easily extended to higher dimensions, but MATLAB provides the additional Archimedean copula families only for bivariate data. Table 3 reports MSE values over a range of sample sizes and correlations for a single mode Cauchy distribution modeled with Frank, Gumbel, and Clayton copulas, estimated with both our method and KDE for two variables. The MSE values for the three different models are similar for both estimation methods. However, for a large number of samples, the accuracy of our product-of-conditional-probabilities method is often better than that of KDE, regardless of the degree of correlation in the data or the copula. This trend can be seen in Figure 4. These plots are averaged over the three copula models in Table 3, the Gaussian and t-copulas, and four GMMs. The trends in MSE are the same for all seven of these distributions.
Extending the comparison, Table 4 reports the MSE of KDE for three to six variables for four GMM distributions and a Cauchy distribution modeled with a t-copula. Figure 5 summarizes the comparative trends in computation time and MSE versus sample size and dimension, averaged over these distributions. Due to the increasingly long computation times of KDE at high sample sizes, these test cases were performed for only a single trial. For all dimensions and small sample sizes, our method is slightly slower than KDE with comparable accuracy, but it is much faster and more accurate for large sample sizes. Beyond 100,000 samples, KDE has no advantage, although the precise crossover point depends on the number of variables and the type of distribution estimated. With KDE, moreover, there is a subjective choice of bandwidth, and the grid points to be evaluated must be specified; our method, by contrast, removes subjective parameters and has automatic grid spacing. Here we used the same grid points that our method identified to make a one-to-one comparison. Nevertheless, it is clear from the bottom left panel of Figure 5 that KDE often provides a better estimate. The problem cases are due to the Cauchy distribution with the t-copula, but the absolute MSE is sufficiently small for practical applications. It is worth mentioning that sliding windows could be implemented in our method to help improve accuracy.

4. Quasar Identification

The classification of astronomical objects from surveys is a fundamental task in astronomy. In the last 20 years, the Sloan Digital Sky Survey (SDSS) has collected a cumulative 652 TB of astronomical data over seventeen data releases [41], with more than 1.1 billion photometric observations. However, only a small fraction of these observations, about 3 million, have been spectroscopically classified into general categories. Classification based on spectrum data (capturing a continuum of frequencies over a broad range) is extremely time consuming and cannot keep pace with the growth in photometric observations. In recent years, machine learning (ML) methods have been applied to the automatic classification of photometric data with increasing accuracy [42,43,44]. In this section, we develop a unique approach to classifying astronomical data using multivariate density estimates. The identification of quasars is presented primarily as an example of how probability densities can be applied to high-dimensional data. Nevertheless, the results of this small-scale analysis yield accuracy competitive with other ML methods.

4.1. Experimental Data

We have collected photometric data from the latest data release (DR17) [41], representing the three classes of spectroscopically confirmed objects: stars, galaxies, and quasars. Stars are collections of luminous gas undergoing nuclear fusion and form the building blocks of many other larger objects. Galaxies consist of gravitationally bound stars, gas, dust, and dark matter, and can be further classified by shape into three categories: spiral, elliptical, and irregular. Quasars are a form of active galactic nucleus in which the quasar is more luminous than the surrounding stars in the galaxy, making it appear as a stellar source. However, these objects are typically bluer than stars, and some emit in the radio frequency range, unlike stellar sources [45].
With SkyServer, photometric flags were checked to ensure that each object was detected in BINNED1 and that clean photometry data were collected. Clean photometry requires the absence of flags that could affect the photometric measurements or indicate that the measurement may not be accurate enough to be accepted. The flags checked include NOPROFILE, PEAKCENTER, NOTCHECKED, PSF_FLUX_INTERP, SATURATED, BAD_COUNTS_ERROR, DEBLEND_NOPEAK, INTERP_CENTER, and COSMIC_RAY. In addition, redshift flags were checked to eliminate observations with questionable spectroscopy, which could lead to misclassification of the object. Further information about SDSS data can be found in [46,47,48,49,50]. We provide the analyzed data set as .csv files in the Supplementary Materials.
SDSS photometry covers five filters (u, g, r, i, z) ranging from the ultraviolet regime to the infrared. Each filter has an effective wavelength and zero-point magnitude, given in Table 5. The magnitudes reported from the filters are inverse hyperbolic sine magnitudes, measured in nanomaggies [51]. In astronomy, these filters are used to construct colour indices that indicate various attributes of an observed object, such as temperature and composition. Traditionally, these colour indices are simple differences in the reported magnitudes of two filters. It is primarily from these five filter measurements that ML methods attempt to infer the classification of the observed object.

4.2. Binary Classification Using Multivariate Density

A major difficulty for many ML methods in classifying objects based on photometric data is the unbalanced representation of known, already verified objects [43]. Quasars and stars are outnumbered by galaxies by more than five to one, potentially inflating the accuracy of galaxy prediction. To maintain balanced data, the number of quasars and stars in the training set is limited, which can severely impact the performance of some ML methods. The classification method described in this section requires only a small subset of data to accurately estimate the probability densities of each object type. Thus, we split the training data into two equal groups: 50,000 spectroscopically confirmed quasars and 50,000 confirmed non-quasars. The non-quasar group consisted of 25,000 stars and 25,000 galaxies. Equal training groups containing as few as 10,000 each of quasars/non-quasars also produced reasonable results, whereas increasing the training set beyond 50,000 did not offer significant improvement.
Initially, no assumptions were made regarding which combinations of the five filters could best differentiate between classes. Separate probability distributions for quasars and non-quasars were estimated for each of the 31 possible combinations of one to five filters. For each pair of the 31 distributions, the following density ratio was constructed for all values $\mathbf{x}$ in the test sets for quasars, $q$, and non-quasars, $nq$:

$$\mathrm{ratio}(\mathbf{x}) = \frac{\hat{\rho}_q(\mathbf{x})}{\hat{\rho}_q(\mathbf{x}) + \hat{\rho}_{nq}(\mathbf{x})}. \tag{16}$$
The method of classifying densities based on this ratio was adopted from [29] and modified to empirically determine an appropriate threshold based on our training data. One hundred thresholds in equal increments between 0 and 1 were considered. For each threshold, a prediction was made by checking the condition

$$\text{predict true when:} \quad \mathrm{ratio}(\mathbf{x}) \geq \mathrm{threshold}. \tag{17}$$
The numbers of true positives, $tp$ (correctly identified as quasars), true negatives, $tn$ (correctly identified as not quasars), false positives, $fp$ (misclassified as quasars), and false negatives, $fn$ (misclassified as not quasars) are calculated. The relationships between the true positive rate ($tpr$), false positive rate ($fpr$), and J statistic ($J$), defined in Equation (18), can be explored using receiver operator characteristics:

$$tpr = \frac{tp}{tp + fn}, \qquad fpr = \frac{fp}{fp + tn}, \qquad J = \frac{tp}{tp + fn} + \frac{tn}{tn + fp} - 1. \tag{18}$$
The maximum point of the J statistic, also known as Youden's J statistic [52], defines the optimal threshold for the prediction in Equation (17). An example of the J statistic and receiver operator characteristic (ROC) is shown in Figure 6. The probability densities created with the training data, along with the optimal threshold value, are then all that is needed to make future predictions for quasar identification based on one or more of the filters observed in photometric data.
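The threshold search of Equations (16)-(18) reduces to a short loop. The MATLAB sketch below uses placeholder labels and ratios; in practice, isQuasar would hold the spectroscopic labels and ratio the density ratio of Equation (16) for each training object.

% Sketch of the threshold search in Equations (16)-(18).
n = 1e5;
isQuasar = rand(n, 1) < 0.5;                          % placeholder labels
ratio = 0.3 + 0.4 * isQuasar + 0.25 * randn(n, 1);    % placeholder density ratios
thresholds = linspace(0, 1, 100);                     % one hundred thresholds
J = zeros(size(thresholds));
for t = 1:numel(thresholds)
    pred = ratio >= thresholds(t);                    % Equation (17)
    tp = sum(pred & isQuasar);   fn = sum(~pred & isQuasar);
    tn = sum(~pred & ~isQuasar); fp = sum(pred & ~isQuasar);
    J(t) = tp / (tp + fn) + tn / (tn + fp) - 1;       % Youden's J, Equation (18)
end
[~, best] = max(J);
optimalThreshold = thresholds(best);                  % threshold maximizing J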

4.3. Quasar Classification Results

To demonstrate the performance of the binary classification method described in the prior section, the remaining 750,000 quasar, star, and galaxy photometric observations in the test set were predicted as either quasar or non-quasar. For each of these objects, the associated filter values were translated into probabilities of being either quasar or non-quasar by multidimensional linear interpolation on the probability densities created from the training data for each type. Predictions were made according to Equations (16) and (17) and compared to the known classifications from the spectrum data. Since performance can be quantified through a variety of measures, each with its own merits, accuracy, precision, recall, and F1 measures are all reported in Table 6.
Although no single variable alone discriminates markedly well, the results show that the best predictions do not require all five filters. In fact, the highest accuracy, more than 94%, is achieved using only the g and r filters, with high percentages across all measures. The ROC plot in Figure 6 represents the training data for these two filters, and the leftmost plot in Figure 7 shows a two-dimensional surface plot of probability density. The bright red strip is the density of all quasars in the test set, and the lighter streak below it corresponds to the star and galaxy data. The other two plots in Figure 7 show density plots for different filter pairs. These examples also cluster well, but with slightly more overlap and lower accuracy. The results in Table 6 show that including more variables does not improve classification, and in some cases the results worsen.
Visualizing probability density plots for more than two variables can be accomplished using cross-sectional plots, such as the example in Figure 2. Cross-section plots are a useful tool for detailed high-dimensional analysis in localized areas. The plot in Figure 8 represents the clustering between quasars (red) and non-quasars (blue) for three filters, with the size of the circles corresponding to the magnitude of the density. Densities of less than 0.035 are not plotted. These masked points account for less than 20% of the total probability but nearly 90% of the grid points, confirming the noted observation that most of high-dimensional space is empty. Although the relative densities are somewhat difficult to discern visually, this plot shows the clusters well.

5. Conclusions

A nonparametric multivariate probability density estimator based on a product of conditional probability densities and a marginal probability density has been presented. Marginal and conditional probability densities are estimated through a series of calls to a one-dimensional PDF estimator based on the principle of maximum entropy and optimized using single order statistics. This method is currently available as software packages in R, MATLAB, C++, and Java. In this paper, we show that our method is fast and accurate for up to six variables, based on results from synthetic data involving four Gaussian mixture models and five different copula generating schemes with a Cauchy marginal distribution.
We also compared results for dimensions two through six with the multivariate KDE implementation in MATLAB. Both methods have comparable accuracy and speed across a wide range of cases, but our method consistently performs better when dealing with a sufficiently large number of samples in any dimension. When the number of samples exceeds 100,000, the MATLAB implementation of multivariate KDE becomes very slow compared to our method as the number of samples continues to increase. In addition, our method has no subjective settings for bandwidth or resolution. We supplied the MATLAB implementation of KDE with the evaluation points that are automatically generated by our method from the estimated marginal probability densities and their quantiles; the KDE method was therefore given an advantage in our comparisons that would normally not be present in applications.
Although the current implementation can evaluate the density at specific points in up to 10 dimensions, practical considerations of computing resources for interpolation methods currently limit the algorithm to several dimensions. Future work involving computational mathematics will leverage dimension reduction methods and the sparsity of high-dimensional data by implementing data structures suitable for interpolation on a sparse hyper-dimensional grid to reduce memory and CPU needs; the current straightforward implementation already handles up to 10 dimensions and a million samples very well. To illustrate applicability, we demonstrated the utility of our method by estimating the probability density for five variables describing astronomical photometric data, successfully classifying quasars with more than 94% accuracy.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math11010155/s1, Supplemental Materials Data Files: GALAXY, QSO, and STAR.

Author Contributions

Conceptualization, J.F. and D.J.J.; methodology, J.F. and E.A.; software, J.F.; validation, J.F. and D.J.J.; formal analysis, J.F.; investigation, J.F. and E.A.; resources, D.J.J.; data curation, E.A.; writing—original draft preparation, J.F. and E.A.; writing—review and editing, J.F., E.A. and D.J.J.; visualization, J.F.; supervision, D.J.J.; project administration, D.J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The multivariate probability density estimator can be downloaded from the following repositories. MATLAB: https://github.com/jennyfarmer/PDFAnalyze (accessed on 11 December 2022). R: https://cran.r-project.org/web/packages/PDFEstimator/index.html (accessed on 11 December 2022).

Acknowledgments

The computing resources and support used to produce the results presented in this paper were provided by the University Research Computing group in the Office of OneIT at the University of North Carolina at Charlotte. We also thank the two reviewers who asked us to do more comparisons, which has led to stronger conclusions. Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and Participating Institutions. SDSS acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS web site is www.sdss4.org (accessed on 25 July 2022). The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, Center for Astrophysics|Harvard & Smithsonian (CfA), the Chilean Participation Group, the French Participation Group, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatório Nacional/MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Parzen, E. On Estimation of a Probability Density Function and Mode. Ann. Math. Stat. 1962, 33, 1065–1076.
2. Rosenblatt, M. Remarks on Some Nonparametric Estimates of a Density Function. Ann. Math. Stat. 1956, 27, 832–837.
3. Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization; Wiley Series in Probability and Statistics; John Wiley & Sons, Incorporated: Somerset, UK, 2015.
4. Schuster, E.F. Incorporating support constraints into nonparametric estimators of densities. Commun. Stat.-Theory Methods 1985, 14, 1123–1136.
5. Müller, H.G. Smooth Optimum Kernel Estimators Near Endpoints. Biometrika 1991, 78, 521–530.
6. Chen, S.X. Probability Density Function Estimation Using Gamma Kernels. Ann. Inst. Stat. Math. 2000, 52, 471–480.
7. Lapko, A.V.; Lapko, V.A. Fast Algorithm for Choosing Kernel Function Blur Coefficients in a Nonparametric Probability Density Estimate. Meas. Tech. 2018, 61, 540–545.
8. Malarvel, M.; Singh, H.; Nayak, S.R. An Improved Kernel Density Estimation with adaptive bandwidth selection for Edge detection. In Proceedings of the 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 982–986.
9. Ziane, Y.; Adjabi, S.; Zougab, N. Adaptive Bayesian bandwidth selection in asymmetric kernel density estimation for nonnegative heavy-tailed data. J. Appl. Stat. 2015, 42, 1645–1658.
10. Markovich, L.A. Nonparametric Estimation of Multivariate Density and its Derivative by Dependent Data Using Gamma Kernels. J. Math. Sci. 2021, 254, 550–573.
11. Bouezmarni, T.; Rombouts, J.V.K. Nonparametric density estimation for multivariate bounded data. J. Stat. Plan. Inference 2010, 140, 139–152.
12. Wang, J.; Liu, Y.; Chang, J. An Improved Model for Kernel Density Estimation Based on Quadtree and Quasi-Interpolation. Mathematics 2022, 10, 2402.
13. Yang, N.; Huang, Y.; Hou, D.; Liu, S.; Ye, D.; Dong, B.; Fan, Y. Adaptive Nonparametric Kernel Density Estimation Approach for Joint Probability Density Function Modeling of Multiple Wind Farms. Energies 2019, 12, 1356.
14. Ngatchou-Wandji, J.; Ltaifa, M.; Njamen Njomen, D.A.; Shen, J. Nonparametric Estimation of the Density Function of the Distribution of the Noise in CHARN Models. Mathematics 2022, 10, 624.
15. Jin, Y.; He, Y.; Huang, D. An Improved Variable Kernel Density Estimator Based on L2 Regularization. Mathematics 2021, 9, 2004.
16. Hwang, J.N.; Lay, S.R.; Lippman, A. Nonparametric Multivariate Density Estimation: A Comparative Study. Signal Process. IEEE Trans. 1994, 42, 2795–2810.
17. Li, W.; Zhang, C.; Tsung, F.; Mei, Y. Nonparametric monitoring of multivariate data via KNN learning. Int. J. Prod. Res. 2021, 59, 6311–6326.
18. Magdon-Ismail, M.; Atiya, A. Density estimation and random variate generation using multilayer networks. IEEE Trans. Neural Netw. 2002, 13, 497–520.
19. Peerlings, D.E.W.; Brakel, J.A.V.D.; Basturk, N.; Puts, M.J.H. Multivariate Density Estimation by Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–12.
20. Puchert, P.; Hermosilla, P.; Ritschel, T.; Ropinski, T. Data-driven deep density estimation. Neural Comput. Appl. 2021, 33, 16773–16807.
21. Trentin, E. Asymptotic Convergence of Soft-Constrained Neural Networks for Density Estimation. Mathematics 2020, 8, 572.
22. Abramov, R.V. An improved algorithm for the multidimensional moment-constrained maximum entropy problem. J. Comput. Phys. 2007, 226, 621–644.
23. Dai, H.; Zhang, H.; Wang, W. A new maximum entropy-based importance sampling for reliability analysis. Struct. Saf. 2016, 63, 71–80.
24. De Martino, A.; De Martino, D. An introduction to the maximum entropy approach and its application to inference problems in biology. Heliyon 2018, 4, e00596.
25. Kouskoulas, Y.; Pierce, L.E.; Ulaby, F.T. A computationally efficient multivariate maximum-entropy density estimation (MEDE) technique. IEEE Trans. Geosci. Remote Sens. 2004, 42, 457–468.
26. Li, G.; Wang, Y.X.; Zeng, Y.; He, W.X. A new maximum entropy method for estimation of multimodal probability density function. Appl. Math. Model. 2022, 102, 137–152.
27. Zhang, X.; Pandey, M.D. Structural reliability analysis based on the concepts of entropy, fractional moment and dimensional reduction method. Struct. Saf. 2013, 43, 28–40.
28. Wang, X.; Wang, Y. Nonparametric multivariate density estimation using mixtures. Stat. Comput. 2013, 25, 349–364.
29. Konopko, K.; Janczak, D. Classification method based on multidimensional probability density function estimation dedicated to embedded systems. IFAC-PapersOnLine 2018, 51, 318–323.
30. Wang, Z.; Scott, D.W. Nonparametric density estimation for high-dimensional data—Algorithms and applications. Wiley Interdiscip. Rev. Comput. Stat. 2019, 11, e1461.
31. Ruzgas, T.; Lukauskas, M.; Čepkauskas, G. Nonparametric Multivariate Density Estimation: Case Study of Cauchy Mixture Model. Mathematics 2021, 9, 2717.
32. Wang, T.; Guan, Z. Bernstein polynomial model for nonparametric multivariate density. Statistics 2019, 53, 321–338.
33. Farmer, J.; Jacobs, D. High throughput nonparametric probability density estimation. PLoS ONE 2018, 13, e0196937.
34. Farmer, J.; Merino, Z.; Gray, A.; Jacobs, D. Universal Sample Size Invariant Measures for Uncertainty Quantification in Density Estimation. Entropy 2019, 21, 1120.
35. Farmer, J.; Jacobs, D.J. MATLAB tool for probability density assessment and nonparametric estimation. SoftwareX 2022, 18, 101017.
36. Farmer, J.; Jacobs, D. PDFEstimator: An R Package for Density Estimation and Analysis. R J. 2022, 14, 305–319.
37. Farmer, J.; Jacobs, D. PDFEstimator: Multivariate Nonparametric Probability Density Estimator. R Package Version 4.2. 2022. Available online: https://CRAN.R-project.org/package=PDFEstimator (accessed on 11 December 2022).
38. Farmer, J.; Jacobs, D.J. PDFAnalyze. 2022. Available online: https://github.com/jennyfarmer/PDFAnalyze (accessed on 11 December 2022).
39. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. 1957, 106, 620–630.
40. Läuter, H. Silverman, B.W. Density Estimation for Statistics and Data Analysis. Biom. J. 1988, 30, 876–877.
41. Abdurro'uf, N.; Accetta, K.; Aerts, C.; Silva Aguirre, V.; Ahumada, R.; Ajgaonkar, N.; Filiz Ak, N.; Alam, S.; Allende Prieto, C.; Almeida, A.; et al. The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar, and APOGEE-2 Data. Astrophys. J. 2022, 259, 35.
42. Acharya, V.; Bora, P.S.; Navin, K.; Nazareth, A.; Anusha, P.S.; Rao, S. Classification of SDSS photometric data using machine learning on a cloud. Curr. Sci. 2018, 115, 249–257.
43. Clarke, A.O.; Scaife, A.M.M.; Greenhalgh, R.; Griguta, V. Identifying galaxies, quasars, and stars with machine learning: A new catalogue of classifications for 111 million SDSS sources without spectra. Astron. Astrophys. 2020, 639, A84.
44. Rony, M.A.T.; Reza, D.S.A.A.; Mostafa, R.; Ullah, M.A. Application of Machine Learning to Interpret Predictability of Different Models: Approach to Classification for SDSS Sources. In Proceedings of the 2021 International Conference on Electronics, Communications and Information Technology (ICECIT), Khulna, Bangladesh, 14–16 September 2021; pp. 1–4.
45. Ryden, B.; Peterson, B.M. Foundations of Astrophysics; Addison-Wesley: Boston, MA, USA, 2010.
46. Blanton, M.R.; Bershady, M.A.; Abolfathi, B.; Albareti, F.D.; Allende Prieto, C.; Almeida, A.; Alonso-García, J.; Anders, F.; Anderson, S.F.; Andrews, B.; et al. Sloan Digital Sky Survey IV: Mapping the Milky Way, Nearby Galaxies, and the Distant Universe. Astron. J. 2017, 154, 28.
47. Gunn, J.E.; Carr, M.; Rockosi, C.; Sekiguchi, M.; Berry, K.; Elms, B.; de Haas, E.; Ivezić, Ž.; Knapp, G.; Lupton, R.; et al. The Sloan Digital Sky Survey Photometric Camera. Astron. J. 1998, 116, 3040–3081.
48. Fukugita, M.; Ichikawa, T.; Gunn, J.E.; Doi, M.; Shimasaku, K.; Schneider, D.P. The Sloan Digital Sky Survey Photometric System. Astron. J. 1996, 111, 1748.
49. Doi, M.; Tanaka, M.; Fukugita, M.; Gunn, J.E.; Yasuda, N.; Ivezić, Ž.; Brinkmann, J.; de Haars, E.; Kleinman, S.J.; Krzesinski, J.; et al. Photometric Response Functions of the Sloan Digital Sky Survey Imager. Astron. J. 2010, 139, 1628–1648.
50. Gunn, J.E.; Siegmund, W.A.; Mannery, E.J.; Owen, R.E.; Hull, C.L.; Leger, R.F.; Carey, L.N.; Knapp, G.R.; York, D.G.; Boroski, W.N.; et al. The 2.5 m Telescope of the Sloan Digital Sky Survey. Astron. J. 2006, 131, 2332–2359.
51. Stoughton, C.; Lupton, R.H.; Bernardi, M.; Blanton, M.R.; Burles, S.; Castander, F.J.; Connolly, A.J.; Eisenstein, D.J.; Frieman, J.A.; Hennessy, G.S.; et al. Sloan Digital Sky Survey: Early Data Release. Astron. J. 2002, 123, 485–548.
52. Youden, W.J. Index for rating diagnostic tests. Cancer 1950, 3, 32–35.
Figure 1. Data partitioning: Example of adaptive cell sizes for a bivariate Gaussian distribution. (a) Cells are defined by crossing grid lines in each orthogonal direction. (b) Only vertical grid lines are shown to highlight how data are binned into strips, and how the data within a strip are projected onto its centerline.
Figure 2. Visualization: Representation of a three-variable estimate for a GMM where cross sections are shown for the third variable (top row), then variables 1 and 2 (middle row), and variables 2 and 3 (bottom row).
Figure 3. Performance benchmarks based on synthetic data from Cauchy and GMM distributions. Computation time (top row), MSE (bottom row), no correlation (left column), sample size of 100,000 (right column).
Figure 4. Mean squared error (MSE) for two-variable synthetic data from Cauchy and GMM distributions averaged over five different copulas. No correlation (left), sample size of 100,000 (right).
Figure 5. Performance benchmarks: mean squared error (left) and computation time (right) for synthetic data generated using the t-copula ($\nu = 1$) with Cauchy distribution and GMM for different correlations. The legend on the right panel applies to the left panel as well.
Figure 6. Sensitivity and specificity benchmarks: receiver operator characteristic (ROC) and J-statistic curves for quasar prediction using the g and r filters from training data. The maximum of Youden's J-statistic, located at the vertical black line, is taken as the optimal threshold for the density ratio. The black dashed line represents a 50/50 chance of guessing correctly.
Figure 7. Two-dimensional surface plots showing relative densities for the g and r filters (left), u and z filters (middle), and r and i filters (right). The densities of quasars are represented by the dark strips towards the top, and the non-quasars are the lower density clusters.
Figure 8. Three-dimensional representations of quasar (red) and non-quasar (blue) clusters.
Table 1. Parameters for GMM distributions.
Table 1. Parameters for GMM distributions.
GMM | Characteristics | Parameters | Component Weights
1 | single mode | μ1 = 0 | w1 = 1.00
2 | two slightly overlapped modes | μ1 = 0; μ2 = 5·1_d | w1 = 0.40; w2 = 0.60
3 | two heavily overlapped modes | μ1 = 0; μ2 = 3·1_d | w1 = 0.40; w2 = 0.60
4 | three modes | μ1 = 0; μ2 = 5·1_d; μ3 = 3·1_d | w1 = 0.40; w2 = 0.30; w3 = 0.30
5 | four modes | μ1 = (0, 0, 0, 0); μ2 = (0, 5, 0, 0); μ3 = (0, 0, 3, 0); μ4 = (5, 0, 0, 3) | w1 = 0.25; w2 = 0.25; w3 = 0.25; w4 = 0.25
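As a concrete illustration of this setup, the sketch below draws samples from GMM 5 of Table 1 (four modes in d = 4). Table 1 specifies only the component means and weights, so the identity covariance used here is an assumption made for illustration.

```python
import numpy as np

def sample_gmm(n, means, weights, seed=0):
    """Draw n samples from a Gaussian mixture with the given means and weights."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    comp = rng.choice(len(weights), size=n, p=weights)   # pick a mixture component per sample
    # Unit-variance (identity covariance) noise around each chosen mean -- an assumption
    return means[comp] + rng.standard_normal((n, means.shape[1]))

means = [(0, 0, 0, 0), (0, 5, 0, 0), (0, 0, 3, 0), (5, 0, 0, 3)]   # GMM 5 of Table 1
samples = sample_gmm(10_000, means, weights=[0.25, 0.25, 0.25, 0.25])
```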
Table 2. Accuracy of the new method on synthetic data, assessed by MSE on a log base 10 scale.
n | Linear Correlation | 1 Mode | Slight Overlap | Heavy Overlap | 3 Modes | Cauchy (Gaussian Copula) | Cauchy (t Copula)

d = 2
100 | 0 | −2.992 | −2.9737 | −3.5559 | −3.089 | −3.0531 | −2.6246
100 | 0.6 | −2.6929 | −2.8666 | −3.3938 | −2.933 | −2.8176 | −2.4168
100 | 0.9 | −1.9043 | −2.5126 | −2.9725 | −2.4644 | −2.1569 | −1.9062
10,000 | 0 | −3.9117 | −4.6788 | −5.0853 | −4.5131 | −4.083 | −4.0139
10,000 | 0.6 | −3.7526 | −4.5365 | −5.0591 | −4.4494 | −4.078 | −3.8905
10,000 | 0.9 | −3.3898 | −4.1205 | −4.5882 | −4.1181 | −3.8844 | −3.4951
1,000,000 | 0 | −6.1185 | −6.4944 | −6.6217 | −6.249 | −4.7689 | −4.9609
1,000,000 | 0.6 | −5.8909 | −6.376 | −6.5939 | −6.207 | −4.8762 | −4.8306
1,000,000 | 0.9 | −5.2806 | −5.7028 | −6.122 | −5.61 | −4.7895 | −4.5139

d = 3
100 | 0 | −3.338 | −4.0794 | −4.9053 | −4.0782 | −4.4971 | −3.5923
100 | 0.6 | −3.1286 | −3.8302 | −4.5882 | −3.8311 | −4.0811 | −3.2549
100 | 0.9 | −2.404 | −3.2133 | −3.8205 | −3.1796 | −2.8755 | −2.4116
10,000 | 0 | −4.444 | −5.3379 | −5.814 | −5.4227 | −5.4487 | −4.8823
10,000 | 0.6 | −4.1592 | −4.9542 | −5.388 | −4.9706 | −5.2078 | −4.474
10,000 | 0.9 | −3.2825 | −3.8304 | −4.1097 | −3.7935 | −4.403 | −3.3035
1,000,000 | 0 | −5.04 | −6.3629 | −7.1626 | −6.2106 | −5.5167 | −5.287
1,000,000 | 0.6 | −4.8568 | −6.1395 | −6.9824 | −6.0369 | −5.3879 | −5.0574
1,000,000 | 0.9 | −4.4647 | −5.5948 | −6.2233 | −5.546 | −5.0788 | −4.4882

d = 4
100 | 0 | −4.0605 | −5.2822 | −6.2842 | −5.219 | −5.6353 | −4.4012
100 | 0.6 | −3.6758 | −4.8618 | −5.8091 | −4.8384 | −5.144 | −3.7552
100 | 0.9 | −2.8089 | −3.9414 | −4.6717 | −3.8819 | −3.4677 | −2.6696
10,000 | 0 | −5.1289 | −6.002 | −6.6362 | −6.0567 | −5.9235 | −4.8418
10,000 | 0.6 | −4.4678 | −5.403 | −6.1653 | −5.4537 | −5.3646 | −4.4309
10,000 | 0.9 | −3.0998 | −3.9566 | −4.6743 | −4.0733 | −4.2874 | −3.2767
1,000,000 | 0 | −5.5674 | −6.947 | −7.8205 | −6.9272 | −6.9513 | −6.3571
1,000,000 | 0.6 | −5.2438 | −6.5087 | −7.2547 | −6.4942 | −6.6235 | −5.9865
1,000,000 | 0.9 | −4.372 | −5.2291 | −5.4616 | −5.1896 | −5.8956 | −4.2999

d = 5
100 | 0 | −4.9695 | −6.4975 | −7.677 | −6.4148 | −6.8743 | −5.0235
100 | 0.6 | −4.3866 | −5.8932 | −7.0189 | −5.8546 | −6.133 | −4.3321
100 | 0.9 | −3.2431 | −4.6621 | −5.5086 | −4.6075 | −4.1286 | −2.864
10,000 | 0 | −5.8235 | −7.0079 | −7.6207 | −6.849 | −6.8503 | −5.0366
10,000 | 0.6 | −4.9794 | −6.2658 | −7.006 | −6.2862 | −5.9073 | −4.501
10,000 | 0.9 | −3.3098 | −4.4792 | −5.5053 | −4.577 | −4.4204 | −3.1359
1,000,000 | 0 | −6.2117 | −7.6695 | −8.5417 | −7.7618 | −8.2213 | −6.5347
1,000,000 | 0.6 | −5.5774 | −6.8932 | −7.6975 | −6.8798 | −7.4491 | −5.6284
1,000,000 | 0.9 | −4.0165 | −5.0726 | −5.6349 | −5.0277 | −5.6588 | −3.5659

d = 6
100 | 0 | −5.9506 | −7.7196 | −9.0519 | −7.6235 | −8.2492 | −5.9397
100 | 0.6 | −5.1613 | −6.9208 | −8.2044 | −6.8646 | −7.1974 | −4.874
100 | 0.9 | −3.6589 | −5.3711 | −6.3335 | −5.2924 | −4.837 | −3.0772
10,000 | 0 | −6.5171 | −7.7206 | −8.4242 | −7.6341 | −7.5321 | −3.998
10,000 | 0.6 | −5.5881 | −6.9177 | −7.505 | −6.8224 | −5.9014 | −3.1111
10,000 | 0.9 | −2.5653 | −5.3691 | −6.2167 | −5.214 | −3.0605 | −1.7374
1,000,000 | 0 | −6.9501 | −8.7879 | −9.6764 | −8.828 | −9.0723 | −7.0393
1,000,000 | 0.6 | −6.0385 | −7.6199 | −8.7143 | −7.5842 | −8.1493 | −5.9897
1,000,000 | 0.9 | −4.1541 | −5.5531 | −6.1914 | −5.5167 | −6.121 | −3.7104
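For reference, the accuracy metric reported in Tables 2–4 reduces to a short computation, sketched below. The `estimated_pdf` and `true_pdf` callables and the choice of evaluation points are placeholders assumed for illustration; the exact evaluation grid behind the tables is not restated here.

```python
import numpy as np
from scipy import stats

def log10_mse(estimated_pdf, true_pdf, points):
    """Mean squared error between two densities over evaluation points, on a log10 scale."""
    err = estimated_pdf(points) - true_pdf(points)
    return np.log10(np.mean(err ** 2))

# Example with a known bivariate Gaussian as the ground truth:
true = stats.multivariate_normal(mean=np.zeros(2)).pdf
# log10_mse(my_estimate, true, eval_points) yields values on the scale of the table entries
```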
Table 3. Accuracy comparison by copula on synthetic data, measured by MSE on a log base 10 scale.
n | Linear Correlation | KDE Frank | KDE Gumbel | KDE Clayton | PDFe Frank | PDFe Gumbel | PDFe Clayton

d = 2
100 | 0 | −2.8728 | −2.8728 | −2.8728 | −2.9576 | −2.9576 | −2.9576
100 | 0.6 | −2.6722 | −2.626 | −2.6866 | −2.5971 | −2.763 | −2.6054
100 | 0.9 | −2.2262 | −2.2918 | −2.263 | −1.847 | −2.212 | −1.8354
10,000 | 0 | −2.7559 | −2.7559 | −2.7559 | −4.0664 | −4.0664 | −4.0664
10,000 | 0.6 | −2.5926 | −2.5835 | −2.599 | −3.9504 | −4.025 | −4.009
10,000 | 0.9 | −2.1911 | −2.2492 | −2.2187 | −3.6872 | −3.791 | −3.746
1,000,000 | 0 | −2.75 | −2.75 | −2.75 | −4.8083 | −4.8083 | −4.8083
1,000,000 | 0.6 | −2.5891 | −2.5796 | −2.5931 | −4.7268 | −4.861 | −4.8188
1,000,000 | 0.9 | −2.19 | −2.2472 | −2.2157 | −4.5811 | −4.7294 | −4.5754
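The Archimedean copulas compared in Table 3 can also be sampled directly. As one example, the sketch below generates bivariate Cauchy data coupled by a Clayton copula via conditional inversion, one standard construction; the value of θ and the sample size are illustrative assumptions, not parameters from the study.

```python
import numpy as np
from scipy import stats

def clayton_cauchy(n, theta, seed=0):
    """Bivariate samples with Cauchy marginals coupled by a Clayton copula."""
    rng = np.random.default_rng(seed)
    u1 = rng.uniform(size=n)
    v = rng.uniform(size=n)
    # Invert the conditional Clayton copula C(u2 | u1) = v
    u2 = (u1 ** (-theta) * (v ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)
    return stats.cauchy.ppf(np.column_stack((u1, u2)))   # impose Cauchy marginals

samples = clayton_cauchy(100_000, theta=2.0)
```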
Table 4. Accuracy of KDE on synthetic data, assessed by MSE on a log base 10 scale.
n | Linear Correlation | 1 Mode | Slight Overlap | Heavy Overlap | 3 Modes | Cauchy (t Copula)

d = 3
100 | 0 | −4.1386 | −4.1395 | −4.8599 | −4.144 | −3.4908
100 | 0.6 | −3.4097 | −3.8311 | −4.556 | −3.8416 | −3.2515
100 | 0.9 | −2.3603 | −3.2001 | −3.807 | −3.1692 | −2.6413
10,000 | 0 | −4.9142 | −4.8448 | −5.2822 | −5.237 | −3.4791
10,000 | 0.6 | −4.2714 | −4.3496 | −4.8517 | −4.6171 | −3.2383
10,000 | 0.9 | −2.9386 | −3.3834 | −3.8891 | −3.4686 | −2.6387
1,000,000 | 0 | −6.1506 | −5.8395 | −6.2219 | −6.4355 | −3.4791
1,000,000 | 0.6 | −5.5695 | −5.1841 | −5.6007 | −5.7253 | −3.2383
1,000,000 | 0.9 | −4.0055 | −3.8203 | −4.1742 | −4.1773 | −2.6387

d = 4
100 | 0 | −4.6547 | −5.2899 | −6.2212 | −5.2882 | −4.3846
100 | 0.6 | −3.9574 | −4.8606 | −5.7576 | −4.8448 | −3.9911
100 | 0.9 | −2.7626 | −3.9356 | −4.6623 | −3.8989 | −3.0408
10,000 | 0 | −5.4618 | −5.5969 | −6.3757 | −5.8134 | −4.3846
10,000 | 0.6 | −4.4133 | −5.0493 | −5.8536 | −5.2017 | −3.9909
10,000 | 0.9 | −2.9931 | −3.9835 | −4.6775 | −3.998 | −3.0408
1,000,000 | 0 | −6.5901 | −6.4913 | −7.0893 | −7.0414 | −4.3846
1,000,000 | 0.6 | −5.7393 | −5.6978 | −6.3405 | −6.1482 | −3.9909
1,000,000 | 0.9 | −3.8813 | −4.2186 | −4.7903 | −4.4377 | −3.0408

d = 5
100 | 0 | −5.5451 | −6.5117 | −7.6067 | −6.4505 | −5.2295
100 | 0.6 | −4.6641 | −5.8895 | −6.9732 | −5.8483 | −4.6736
100 | 0.9 | −3.2053 | −4.6587 | −5.5052 | −4.6077 | −3.3277
10,000 | 0 | −6.2732 | −6.6471 | −7.6214 | −6.6802 | −5.2295
10,000 | 0.6 | −4.9596 | −5.9687 | −6.9863 | −6.0107 | −4.6736
10,000 | 0.9 | −3.2888 | −4.6708 | −5.5061 | −4.6444 | −3.3277
1,000,000 | 0 | −6.8298 | −7.2737 | −8.0351 | −7.674 | −5.2295
1,000,000 | 0.6 | −5.7404 | −6.3466 | −7.213 | −6.6061 | −4.6736
1,000,000 | 0.9 | −3.7864 | −4.7544 | −5.5313 | −4.8137 | −3.3277

d = 6
100 | 0 | −6.2909 | −7.7238 | −8.9959 | −7.6472 | −6.0267
100 | 0.6 | −5.3681 | −6.9159 | −8.1793 | −6.855 | −5.3023
100 | 0.9 | −3.6589 | −5.369 | −6.3328 | −5.3134 | −3.5044
10,000 | 0 | −6.518 | −7.7202 | −9.0551 | −7.6405 | −6.0267
10,000 | 0.6 | −5.5932 | −6.9154 | −8.2017 | −6.8703 | −5.3023
10,000 | 0.9 | −3.6696 | −5.369 | −6.3337 | −5.3154 | −3.5044
1,000,000 | 0 | −7.3333 | −8.2114 | −9.2153 | −8.5489 | −6.0267
1,000,000 | 0.6 | −6.0458 | −7.1718 | −8.2881 | −7.361 | −5.3023
1,000,000 | 0.9 | −3.9933 | −5.4035 | −6.3396 | −5.4077 | −3.5044
Table 5. SDSS filter effective wavelengths and zero-point magnitudes, obtained from the open-access Sloan Digital Sky Survey repository [51].
Filter | Effective Wavelength (Å) | Zero-Point Magnitude
u | 3551 | 24.63
g | 4686 | 25.11
r | 6166 | 24.80
i | 7480 | 24.36
z | 8932 | 22.83
Table 6. Results for binary classification of quasars. AUC is the area under the ROC curve, and the four test measures are defined as accuracy = (tp + tn)/(tp + tn + fp + fn), precision = tp/(tp + fp), recall = tp/(tp + fn), and F1 = 2·precision·recall/(precision + recall). The highest value in each column is marked with an asterisk.
Filters | AUC (train) | Max J-Stat (train) | Accuracy (test) | Precision (test) | Recall (test) | F1 (test)
u | 0.85407 | 0.61054 | 0.78823 | 0.61357 | 0.84218 | 0.70992
g | 0.77359 | 0.47972 | 0.66327 | 0.47571 | 0.92431* | 0.62814
r | 0.63121 | 0.19652 | 0.52846 | 0.37157 | 0.77031 | 0.50132
i | 0.60007 | 0.15052 | 0.49736 | 0.35557 | 0.77987 | 0.48844
z | 0.65549 | 0.2295 | 0.59228 | 0.40314 | 0.67651 | 0.50521
u,g | 0.92807 | 0.75122 | 0.89691 | 0.84332 | 0.8167 | 0.8298
u,r | 0.92698 | 0.75246 | 0.89248 | 0.81955 | 0.83427 | 0.82684
u,i | 0.92744 | 0.75492 | 0.89659 | 0.8371 | 0.82432 | 0.83066
u,z | 0.93799 | 0.77104 | 0.90171 | 0.83815 | 0.84343 | 0.84078
g,r | 0.95607* | 0.86708* | 0.94133* | 0.9019 | 0.90811 | 0.90499*
g,i | 0.95149 | 0.84284 | 0.93417 | 0.89929 | 0.88518 | 0.89218
g,z | 0.94823 | 0.8343 | 0.92712 | 0.87736 | 0.88716 | 0.88223
r,i | 0.86871 | 0.6222 | 0.80208 | 0.63672 | 0.83074 | 0.7209
r,z | 0.85938 | 0.58856 | 0.78651 | 0.61632 | 0.81105 | 0.7004
i,z | 0.76759 | 0.43014 | 0.67371 | 0.48234 | 0.82517 | 0.60881
u,g,r | 0.88845 | 0.67842 | 0.8617 | 0.77225 | 0.78082 | 0.77651
u,g,i | 0.91927 | 0.74596 | 0.89013 | 0.81833 | 0.82637 | 0.82233
u,g,z | 0.93486 | 0.77954 | 0.90417 | 0.84249 | 0.84689 | 0.84469
u,r,i | 0.88751 | 0.71086 | 0.87866 | 0.80898 | 0.79287 | 0.80084
u,r,z | 0.92097 | 0.77096 | 0.91212 | 0.89062 | 0.81442 | 0.85081
u,i,z | 0.90818 | 0.77258 | 0.90644 | 0.86141 | 0.82935 | 0.84508
g,r,i | 0.94753 | 0.84308 | 0.93797 | 0.92165* | 0.87258 | 0.89645
g,r,z | 0.93511 | 0.78606 | 0.91066 | 0.86411 | 0.84205 | 0.85294
g,i,z | 0.9504 | 0.86244 | 0.94011 | 0.90556 | 0.89912 | 0.90233
r,i,z | 0.84166 | 0.53186 | 0.75415 | 0.57352 | 0.78399 | 0.66244
u,g,r,i | 0.87307 | 0.63298 | 0.78556 | 0.60192 | 0.89497 | 0.71976
u,g,r,z | 0.91732 | 0.7476 | 0.86375 | 0.72566 | 0.89588 | 0.80183
u,g,i,z | 0.90163 | 0.73838 | 0.8569 | 0.71218 | 0.8977 | 0.79425
u,r,i,z | 0.89835 | 0.71494 | 0.85013 | 0.70729 | 0.87509 | 0.78229
g,r,i,z | 0.93524 | 0.78376 | 0.90894 | 0.86202 | 0.83823 | 0.84996
u,g,r,i,z | 0.88515 | 0.67998 | 0.81467 | 0.64065 | 0.9057 | 0.75046
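The four measures defined in the Table 6 caption translate directly into code; the following minimal sketch computes them from raw confusion-matrix counts.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```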