Logistic Regression Based on Individual-Level Predictors and Aggregate-Level Responses

Xu, Zheng

doi:10.3390/math11030746

Open AccessArticle

Logistic Regression Based on Individual-Level Predictors and Aggregate-Level Responses

by

Zheng Xu

Department of Mathematics and Statistics, Wright State University, Dayton, OH 45435, USA

Mathematics 2023, 11(3), 746; https://doi.org/10.3390/math11030746

Submission received: 17 January 2023 / Revised: 27 January 2023 / Accepted: 30 January 2023 / Published: 2 February 2023

(This article belongs to the Special Issue Statistical Methods in Data Science and Applications)

Download Versions Notes

Abstract

:

We propose estimation methods to conduct logistic regression based on individual-level predictors and aggregate-level responses. We derive the likelihood of logistic models in this situation and proposed estimators with different optimization methods. Simulation studies have been conducted to evaluate and compare the performance of the different estimators. A real data-based study has been conducted to illustrate the use of our estimators and compare the different estimators.

Keywords:

Poisson binomial distribution; logistic regression; data aggregation; likelihood; numerical optimization

MSC:

62J12

1. Introduction

Data can be reported at different levels due to various considerations including economic, confidentiality, and data collection difficulty. For example, the US Census Bureau reports income at the household level. The aggregate-level data in this example are household income, which is a measure of the combined incomes of all people sharing a particular household or place of residence. The individual-level data in this example are individuals’ incomes. The aggregate-level data are defined as data aggregated from individual-level data by groups. Although there are risks in estimating individual-level relationships based on aggregate-level data, such as unequal correlations between variables in aggregate-level data and between the same variables in individual-level data [1,2], researchers continue to use aggregate-level data because in many situations, individual-level data are not available and valid methods for estimating individual-level relationships based on aggregate-level data can be derived [1,3]. The terms “individual” and “aggregate” refer to the different levels and units of analysis [1].

This article intends to solve the problem of estimating models describing an individual-level relationship based on an aggregate-level response variable Y and individual-level predictors X. Examples of data situations include survey data, multivariate time series, social data, and biological data, collected and reported at different levels.

Our interest in developing methods to analyze aggregate data was motivated by real-life examples. One example is group testing of infectious diseases in bio-statistics. To reduce the costs, a two-stage sequential testing strategy is applied. In the first stage, group testing is conducted. Individuals showing positive in the first stage are called back for a follow-up individual test. With the first-stage group testing data available, analyses can be conducted. The second example is consumer demand studies in economics. The consumer’s characteristics data are available at the individual level, whereas the consumer’s purchase data are available only at the aggregate level. The third example is the analysis of multivariate time series. It is likely that different time series are reported at different frequencies. To study the relationships between multiple time series with different frequencies, researchers need to develop statistical methods.

Suppose there are n observations in the sample,

(X_{i}, Y_{i})

,

i = 1, 2, \dots, n

,

X \in R^{p}

,

Y \in R

, aggregated into K groups,

G_{1}, G_{2}, \dots, G_{K}

, with group sizes, respectively, of

n_{1}, n_{2}, \dots, n_{K}

,

\sum n_{g} = n

. Denote the set of observations in Group g as

G_{g} = {g 1, g 2, \dots, g n_{g}}

. Aggregate-level X and Y, i.e.,

(X_{g}^{*}, Y_{g}^{*})

,

g = 1, 2, \dots, K

are

X_{g}^{*} = \sum_{i \in G_{g}} X_{i} = \sum_{i = 1}^{n_{g}} X_{g i} and Y_{g}^{*} = \sum_{i \in G_{g}} Y_{i} = \sum_{i = 1}^{n_{g}} Y_{g i} .

(1)

Note that

Y_{g}^{*}

can be any summary statistic calculated from individual-level Y in Group g, and we study summation aggregation in this paper.

Researchers have solved this problem for linear models [4,5,6]. Suppose the linear model describing individual-level data

(X_{i}, Y_{i})

is

Y_{i} = X_{i}^{T} β + ϵ_{i}, i = 1, 2, \dots, n .

Then, the corresponding model describing the aggregated data

(X_{g}^{*}, Y_{g}^{*})

is

Y_{g}^{*} = {(X_{g}^{*})}^{T} β + ϵ_{g}^{*}, g = 1, 2, \dots, K,

where

ϵ_{g}^{*} = \sum_{i \in G_{g}} ϵ_{i}

is the aggregate-level error so that weighted least squares (WLS) can be applied when

ϵ_{i} \sim i . i . d . N (0, σ^{2})

[4]. More estimators have been proposed for linear regression based on aggregate data or partially aggregate data including Palm and Nijman’s MLE estimator [5] and Rawashdeh and Obeidat’s Bayesian estimator [6].

Although the estimations of linear regression models in the above data situation have been well studied, more studies are needed for the estimations of other regression models. The aim of this article is to study the estimations of logistic models in the data situation of aggregate-level Y and individual-level X. We derive the likelihoods and our estimators with different optimization methods in Section 2, conduct simulation studies to evaluate and compare the performances of different estimators in Section 3, illustrate the use of different estimators in real data-based studies in Section 4, provide discussions in Section 5, and draw conclusions in Section 6.

2. Methods

Suppose n independent observations

(X_{i}, Y_{i})

are modeled by a logistic model

\log (\frac{P (Y_{i} = 1)}{1 - P (Y_{i} = 1)}) = X_{i}^{T} β, i = 1, 2, \dots, n .

(2)

Then,

Y_{i} \sim Bernoulli (π_{i})

, where

π_{i} = P (Y_{i} = 1) = \frac{e x p (X_{i}^{T} β)}{1 + e x p (X_{i}^{T} β)} .

When individual-level X and Y are both available, the logistic model as a general linear model can be estimated using a range of methods including the Newton–Raphson method and Fisher’s scoring method [7,8].

2.1. Likelihood of Aggregate-Level Y and Individual-Level X

When individual-level Y is not available, we can derive estimators based on aggregate-level Y and individual-level X. Suppose the n observations of

(X_{i}, Y_{i})

are aggregated into K groups, as described in the introduction section, with the aggregated data

(X_{g}^{*}, Y_{g}^{*}), g = 1, \dots, K

, defined in Equation (1).

Aggregate-level Y is obtained by summing all Y within each group. Thus, the distribution of the sum of multiple independent random variables is helpful for studying data aggregation. In our logistic regression scenario, we need to calculate the sum of multiple Bernoulli random variables. In statistics, the Poisson binomial distribution is the distribution of a sum of independent Bernoulli random variables, which do not necessarily have different success probabilities [9,10]. The term

PoissonBinomial (n, (π_{1}, π_{2}, \dots, π_{n}))

is used to refer to the distribution of the sum of n independent Bernoulli random variables with success probabilities

π_{1}, π_{2}, \dots, π_{n}

[9].

Because

Y_{g}^{*}

is the sum of

n_{g}

independent Bernoulli random variables,

Y_{g}^{*} \sim PoissonBinomial (n_{g}, (π_{g 1}, π_{g 2}, \dots, π_{g n_{g}})),

(3)

where the success probability for the ith individual in Group g is

π_{g i} = P (Y_{g i} = 1) = \frac{e x p (X_{g i}^{T} β)}{1 + e x p (X_{g i}^{T} β)} .

(4)

Denote the individual likelihood for

Y_{g}^{*}

as

L_{g} (β) = P (Y_{g}^{*}; X_{g 1}, \dots, X_{g n_{g}}, β)

. Then, the aggregate likelihood

L (β) = \prod_{g = 1}^{K} L_{g} (β)

.

2.2. Calculation and Maximization of Likelihood

Computing the likelihood function needs to calculate the probability mass function for

Y_{g}^{*} \sim PoissonBinomial (n_{g}, (π_{g 1}, π_{g 2}, \dots, π_{g n_{g}}))

. The variable

Y_{g}^{*}

will reduce to

Binomial (n_{g}, π)

when

π_{g 1} = π_{g 2} = \dots = π_{g n_{g}}

. This case can happen when aggregation is based on the values of X and the individual-level predictors

X_{i}

are the same within each group. This specific aggregation has been well studied in the topic of logistic regression based on aggregate data [7,11]. We consider aggregation not based on X, i.e., allowing different values of X in a group, in this paper.

In general, for a variable

Y \sim PoissonBinomial (n, (π_{1}, π_{2}, \dots, π_{n}))

, the probability mass function is

P (Y = y) = \sum_{A \in F_{y}} \prod_{i \in A} π_{i} \prod_{j \in A^{c}} (1 - π_{j})

, where

F_{y}

is the set of all subsets of y integers that can be selected from

{1, 2, 3, \dots, n}

and

A^{c}

is the complement of A [9]. The set

F_{k}

contains

(\binom{n}{k})

elements so the sum over it is computationally intensive and even infeasible for large n. Instead, more efficient ways were proposed, including the use of a recursive formula to calculate

P (Y = y)

based on

P r (Y = k)

,

k = 0, \dots, y - 1

, which is numerically unstable for large n [12], and the inverse Fourier transform method [13]. Hong [10] further developed it by proposing an algorithm that efficiently implements the exact formula with a closed expression for the Poisson binomial distribution. We adopted Hong’s algorithm [10] and exact formula in calculating the likelihood function

L (β)

since they are more precise and numerically stable.

Commonly used optimization methods were adopted to maximize the likelihood

L (θ)

, including (1) Nelder and Mead’s simplex method (NM) [14], (2) the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [15], and (3) the conjugate gradient (CG) method [16].

2.3. Large-Sample Properties of Estimators

As mentioned above, our proposed estimators are obtained by maximizing the aggregate likelihood

L (β)

using different optimization methods (NM, BFGS, and CG). The MLE

{\hat{β}}_{M L E}

is an estimator that maximizes the aggregate likelihood function, i.e.,

{\hat{β}}_{M L E} = a r g m a x_{β} L (β)

. If our three optimization methods can always obtain the maximizer of

L (β)

, the three estimators will be equal and exactly the same as the MLEs.

In practice, the three optimization methods may not obtain the same value as the MLE. We observed that as the sample size increases, the values obtained using the three optimizations become closer and nearly the same for a large sample size. In discussing large-sample properties, we refer to the scenario of an infinite number of observations and assume that the three optimization methods can always obtain MLEs under the scenario of large samples, i.e., the scenario of an infinite number of observations. Then, our three estimators are identical to the MLE and have the same large-sample properties as the MLE. We add a cautious note that if our estimators are still quite different from the MLE under the large-sample scenario, we cannot state that our estimators have the same large-sample properties as the MLE.

The large-sample properties of the MLE

{\hat{β}}_{M L E}

[17] include (i) consistency, i.e.,

{\hat{β}}_{M L E} \to β

in probability, and (ii) asymptotic normality, i.e.,

{\hat{β}}_{M L E} \sim N (β, I {(β)}^{- 1})

, where

I (β)

is the expected information matrix, defined as the negative expectation of the second derivative of the log-likelihood. The expected information matrix can be approximated using the observed information matrix, which is the negative of the second derivative (the Hessian matrix) of the log-likelihood function [17].

2.4. Software Implementation

All analyses in this paper were conducted using R software (version 4.2.0). Multiple R packages were used as follows:

The $P o i s s o n B i o m i a l$ package. This package implements multiple exact and approximate methods to calculate Poisson binomial distributions [10]. We used this package to calculate the Poisson binomial distributions and aggregate likelihood $L (β)$ .
The $s t a t s$ package. This package contains the $o p t i m ()$ function, which can conduct general-purpose optimization based on multiple optimization methods, including the Nelder–Mead, BFGS, and CG methods. We used this function to obtain our three estimators using three optimization methods.
The $g l m$ package. This package can be used to fit generalized linear models including logistic regression. We used this package to conduct logistic regression.

2.5. Computational Burden

The computational burden of our method relies on three factors: (1) p, (2) aggregate-level data sample size K, and (3) group size

n_{g}

.

Our estimator for

β

is obtained by maximizing the aggregate likelihood

L (β) = \prod_{g = 1}^{K} L_{g} (β)

,

β \in R^{p}

using three optimization methods (NM, BFGS, and CG). The number of evaluations of the optimization function

L (β)

and the derivatives will increase with respect to an increase in p. Large p will decrease the performance. Given a small fixed number p, the number of function evaluations is

O (1)

. Because

L (β) = \prod_{g = 1}^{K} L_{g} (β)

, the computational amount for

L (β)

is K times the computational amount for

L_{g} (β)

.

The computation of

L_{g} (β)

includes two steps. In Step 1, the success probabilities are calculated using Equation (4). The computational burden of Step 1 is

O (n_{g})

. In Step 2, the probability mass for a Poisson binomial random variable described in Equation (3) is calculated. This step adopts Hong’s Algorithm A, which is an efficient implementation of the discrete Fourier transform of the characteristic function (DFT-CF) of the Poisson binomial distribution [10]. The computational burden of Step 2 is

O (n_{g}^{2})

. In total, the computational burden of our estimation method is

O (1) \times K \times O (n_{g}^{2}) = O (K n_{g}^{2})

, given a small constant p.

3. Simulation Studies

We conducted simulation studies to evaluate the performance of the five estimators. The first estimator, named individual-LR, is the logistic regression estimator based on individual-level X and Y. This estimator is infeasible when only aggregate Y is available. Because aggregate-level Y contains less information compared to individual-level Y, we expect that this infeasible estimator can provide an upper bound for the performance of feasible estimators based on aggregate-level Y. The second estimator, named naive LR, is the logistic regression estimator based on the mean X in each group and the aggregate Y, i.e.,

Y_{g}^{*} \sim B i n (n_{g}, X_{g}^{*} / n_{g}),

g = 1, 2, \dots, K

. This estimator can provide a rough approximate estimation.

Estimators 3 to 5 are our estimators that maximize the aggregate likelihood

L (β)

using the Nelder–Mead optimization, CG optimization, and BFGS optimization, named aggregate LR with NM, aggregate LR with CG, and aggregate LR based on BFGS, respectively.

The performances of the estimators were compared in three scenarios. In each scenario, simulations were conducted with sample sizes (

K = 300, 500, 1000

), equal group sizes (

n_{g} = 7, 30

), and different parameter values. Data were generated as follows:

In Scenario 1, $X_{i 1} \sim N (0, 1)$ , $X_{i} = {(1, X_{i 1})}^{T}$ , $Y_{i} \sim Bernoulli (e^{X_{i}^{T} β} / (1 + e^{X_{i}^{T} β}))$ ,
$β = {(1, - 2)}^{T}$ (Scenario 1A) or ${(1, 3)}^{T}$ (Scenario 1B).
In Scenario 2, $X_{i 1} \sim N (0, 1)$ , $X_{i 2} \sim t (d f = 5)$ , $X_{i} = {(1, X_{i 1}, X_{i 2})}^{T}$ ,
$Y_{i} \sim Bernoulli (e^{X_{i}^{T} β} / (1 + e^{X_{i}^{T} β}))$ , $β = {(- 1, 1, 2)}^{T}$ (Scenario 2A) or ${(0, - 2, 1)}^{T}$
(Scenario 2B).
In Scenario 3, $(X_{i 1}, X_{i 2}) \sim BivariateNormal (0, 2, 1, 4, ρ = 0.5)$ , $X_{i 3} \sim Cauchy (0, 1)$ , $X_{i} = {(1, X_{i 1}, X_{i 2}, X_{i 3})}^{T}$ , $Y_{i} \sim Bernoulli (e^{X_{i}^{T} β} / (1 + e^{X_{i}^{T} β}))$ , $β = {(- 1, 1, 0, - 1)}^{T}$
(Scenario 3A) or ${(0, - 2, 1, 1)}^{T}$ (Scenario 3B).

The bias, variance, mean square error (MSE), and mean absolute deviation (MAD) of each of the five estimators’ (E1 to E5) model parameters

(β_{0}, \dots, β_{p})

were calculated. Denote the bias, variance, MSE, and MAD of the q-th estimator of

β_{j}

as

B i a s ({\hat{β}}_{j, E_{q}})

,

V a r ({\hat{β}}_{j, E_{q}})

,

M S E ({\hat{β}}_{j, E_{q}})

, and

M A D ({\hat{β}}_{j, E_{q}})

. The average squared bias, variance, MSE, and MAD of the qth estimator were calculated as

\begin{matrix} \bar{B i a s^{2}} (E_{q}) & = & [(B i a s^{2} ({\hat{β}}_{0, E_{q}}) + \dots + (B i a s^{2} ({\hat{β}}_{p, E_{q}})] / (p + 1), \\ \bar{V a r} (E_{q}) & = & [V a r ({\hat{β}}_{0, E_{q}}) + \dots + V a r ({\hat{β}}_{p, E_{q}})] / (p + 1), \\ \bar{M S E} (E_{q}) & = & [M S E ({\hat{β}}_{0, E_{q}}) + \dots + (M S E ({\hat{β}}_{p, E_{q}})] / (p + 1), \\ \bar{M A D} (E_{q}) & = & [M A D ({\hat{β}}_{0, E_{q}}) + \dots + (M A D ({\hat{β}}_{p, E_{q}})] / (p + 1) . \end{matrix}

Please note that we averaged over the squared bias instead of the bias because the positive bias and negative bias can cancel out when averaging the bias. The average across the parameters allows us to obtain the average performance in terms of the squared bias, variance, MSE, and MAD and still maintain the equality of the bias, variance, and MSE, i.e.,

\bar{M S E} (E_{q}) = \bar{B i a s^{2}} (E_{q}) + \bar{V a r} (E_{q}) .

In Table 1, we report the average squared biases and variances for the five estimators (E1 to E5) under the different scenarios, sample sizes K, and aggregation sizes

n_{g}

. As we expected, there was a relatively large bias for the naive estimator E2, which used an approximate likelihood by conducting logistic regressions using the average X. Our estimators (E3 to E5) had relatively small biases because these estimators were working on the correct and exact likelihood functions. The first estimator E1 had the smallest bias by working on individual-level X and individual-level Y. This estimator is widely used when individual-level Y is available. However, under the scenario we intended to solve, only aggregate-level Y was available. Thus, the E1 estimator is infeasible. We still report the performance of E1 to provide some measurements of the possible upper bound of the performance. Because data aggregation will discard information, we expect that estimator E1 will generally perform better than the estimators based on aggregate Y.

Next, we check the variances of all five estimators. The variances of all five estimators were similar in the same magnitude level. There was no estimator that performed uniformly better or even generally better than the other estimators. The naive estimator E2 had similar performance or even slightly better performance in the average variance compared with the other estimators (E1, E3–E5). Our estimators (E3 to E5) were slightly worse in terms of variance. We think the slightly worse performance of our estimators (E3–E5) was likely due to the nonlinear optimization to find the MLE in our estimators. In comparison, the logistic regression estimators (E1 and E2) were calculated using iteratively re-weighted least squares (IRLS) (logistic regression ensures global concavity so that it is simpler to find the MLE), which was numerically more stable compared to the nonlinear optimization of a general likelihood function using (1) Nelder and Mead’s simplex method [14], (2) the BFGS method [15], and (3) the conjugate gradient (CG) method [16].

We point out that although the naive estimator E2 worked on an incorrect (or approximate) likelihood function, which can lead to a large bias due to the incorrect likelihood function, the performance of the variance of E2 did not necessarily become worse. A similar phenomenon was the under-fitting in the data analysis. Suppose the true relationship is a quadratic function. If a linear function is used in model fitting, the estimator will have a large bias due to model mis-specification, whereas the variance may not increase. We note that the main disadvantage of estimator E2 was the use of an incorrect or approximate likelihood function, which can lead to a large bias. Using the correct exact likelihood, i.e., our estimators (E3 to E5), can solve the issue of bias due to the slight increase in variance from the switch in finding the MLE from iteratively reweighted least squares (IRLS) to nonlinear optimization using the Nelder and Mead’s simplex, BFGS, and CG methods. We compared the decrease in bias and increase in variance and think the bias reduction will dominate the variance increase in our estimators. We calculated the overall performance in terms of the MSE and MAD to confirm it.

Our simulation results showed that the naive estimator had a large bias due to the use of an incorrect or approximate likelihood function, which can hurt the MSE. Thus, in Table 2, we report the average performance of the five estimators (E1 to E5) in terms of the MSE and MAD. Our simulation results indicated that our proposed estimators (E3 to E5) were better than the naive LR estimator (E2). As expected, the infeasible estimator (E1) based on individual-level Y performed better than the other four feasible estimators (E2 to E5) based on aggregate-level Y due to the loss of information in the data aggregation. Our estimator based on Nelder and Mead’s simplex optimization (E3) performed slightly better than our estimator based on BFGS optimization (E4) and CG optimization (E5).

We found the performances of our estimators (E3, E4, E5) were slightly worse when the group size

n_{g} = 30

compared with the performances of our estimators when the group size

n_{g} = 7

. We expect that the performance of our estimators may decrease for a large group size

n_{g}

due to rounding errors in computation.

4. Real Data-Based Studies

We used real data to illustrate the use of our estimators and compare the different estimators. The dataset used was the “Social-Network-Ads” dataset from the Kaggle Machine Learning Forum (https://www.kaggle.com, accessed on 12 January 2023).

The dataset has been used by statisticians and data scientists to illustrate the use of logistic regression in categorical data analysis. We used the dataset to illustrate the use of our method to conduct logistic regression in the presence of data aggregation.

The Social-Network-Ads dataset in Kaggle is a categorical dataset for determining whether a user purchased a particular product. The dataset (https://www.kaggle.com/datasets, accessed on 12 January 2023) contains 400 people/observations. The information about the person’s purchase action (purchased with a binary variable of 1 denotes purchased and 0 denotes not purchased), as well as the person’s age and estimated salary, is provided. Logistic regression has been recommended in Kaggle to model the person’s purchase action based on the person’s age and estimated salary. We intend to apply our method to this dataset in the presence of data aggregation.

The original dataset is at the individual level, which allows us to conduct logistic regression based on individual-level Y and X. We standardized X by

X^{*} = (X - m e a n (X)) / s d (X)

in data pre-processing. Standardization of X allows for better estimation and interpretation. Standardized coefficients

β^{*}

are obtained by logistics regression of Y on standardized data

X^{*}

. The original slope coefficients in

β

can be calculated by the formula

\hat{β} = {\hat{β}}^{*} \times s d (X)

and then the intercept coefficient can be calculated.

We imposed data aggregation on this dataset with an aggregation size

n_{g} = 3, 5, 7

. We randomly divided the persons into groups of size

n_{g}

and calculated the group aggregate of the purchase actions Y. Due to confidentiality and the cost of collecting individual-level data, businesses and organizations can choose to post data information at an aggregate level. We mimicked the data aggregation process by random grouping and calculated the aggregate-level Y based on the individual-level Y. We repeated the data aggregation 300 times. In this way, we generated 300 datasets, with the individual-level X and aggregate-level Y calculated.

For each dataset, we conducted logistic regression based on individual-level X and Y and obtained our estimator E1. Since data aggregation discards information, we evaluated the other estimators by checking whether they were close to estimator E1. Because the true values of the coefficients in individual-level logistic regression models are not known in real data-based studies, we used estimator E1 as a gold-standard estimator. We compared the other estimators based on aggregate-level Y to determine which estimator was closer to our gold-standard estimator E1. Note that E1 is an infeasible estimator when individual-level X is not available.

The estimator E1 was calculated based on individual-level X and individual-level Y. The estimated value of estimator E1 remained the same in our 300 generated datasets and E1 was treated as the gold-standard estimator; thus, we denote it as

(β_{0}, β_{1}, β_{2})

.

Denote the estimated value of

β_{i}

for the j-th estimator in the k-th dataset by

{\hat{β}}_{i, E j} (D_{k})

. The bias, variance, MSE, and MAD of estimators E2 to E5 for

β_{0}

,

β_{1}

, and

β_{2}

were calculated by the formulae

\begin{matrix} \bar{{\hat{β}}_{i, E j}} & = & \sum_{k = 1}^{300} {\hat{β}}_{i, E j} (D_{k}) / 300 \\ B i a s ({\hat{β}}_{i, E j}) & = & \bar{{\hat{β}}_{i, E j}} - β_{i} \\ V a r ({\hat{β}}_{i, E j}) & = & \sum_{k = 1}^{300} {{\hat{β}}_{i, E j} (D_{k}) - \bar{{\hat{β}}_{i, E j}}}^{2} / 300 \\ M S E ({\hat{β}}_{i, E j}) & = & \sum_{k = 1}^{300} {{\hat{β}}_{i, E j} (D_{k}) - β_{i}}^{2} / 300 \\ M A D ({\hat{β}}_{i, E j}) & = & \sum_{k = 1}^{300} | {\hat{β}}_{i, E j} (D_{k}) - β_{i} | / 300 \end{matrix}

For the four estimators based on aggregate-level Y and individual-level X, i.e., E2 to E5, we report the biases and variances in Table 3. We can see that in most cases, there are large biases in estimating

β_{0}

and

β_{2}

and relatively smaller biases in estimating

β_{1}

using the naive estimator E2. Our proposed estimators (E3 to E5) always achieved smaller biases compared to the naive estimator E2. This is because the naive estimator E2 used an approximate likelihood instead of an exact likelihood, which our proposed estimators are based on. In terms of variance, the naive estimator had a relatively smaller variance compared with our estimators E3 to E5. We point out that the calculation algorithm used in E2, i.e., iteratively reweighted least squares (IRLS), was more numerically stable compared with the nonlinear optimization algorithms adopted by our estimators, i.e., Nelder and Mead’s simplex method, the BFGS method, and the conjugate gradient method.

We then checked the overall performance of the different estimators and report the MSE and MAD in Table 4. We found that our estimators (E3 to E5) had better performance than the naive estimator (E2) in terms of the MSE and MAD in all situations based on the Social-Network-Ads dataset.

5. Discussion

Our estimators are obtained by maximizing the nonlinear likelihood function

L (β)

,

β \in R^{p}

. Different optimization methods can influence the performance of our estimators. Further studies can be conducted on other optimization methods such as the genetic algorithm or using multiple starting values. The performance of optimization is expected to decrease when p increases.

We only consider independent individual-level data, i.e.,

(X_{i}, Y_{i})

,

i = 1, 2, \dots, n

. The n observations are randomly divided into groups of size

n_{g}

and the aggregate-level Y is calculated after grouping. In this paper, we only consider the situation of “grouping completely at random”, which means that the grouping mechanism is completely random. The values of X and Y do not influence the grouping. Further studies can be conducted beyond this type of grouping mechanism.

Our aggregation scheme is based on independent individual-level data. There are more aggregations schemes. For example, temporal aggregation can aggregate dependent data, which can generate aggregated low-frequency time series based on high-frequency time series by summing every m consecutive time points. For example, we can aggregate daily time series into weekly time series by summing every

m = 7

consecutive daily observations. Temporal aggregation is often based on a time series model such as an integer-valued generalized autoregressive conditional heteroskedasticity (INGARCH) [18].

We note that the proposed methods also allow for other link functions in addition to the logit link. For example, when a probit link function is used, we can estimate individual-level probit models based on aggregate-level Y and individual-level X. In addition, we only consider binary responses in this paper. A follow-up study to extend our methods to handle responses with more than two levels are under development.

6. Conclusions

We proposed methods to estimate logistic models based on individual-level predictors and aggregate-level responses. We conducted simulation studies to evaluate the performance of the estimators and show the advantage of our estimators. We then used the Social-Network-Ads dataset to illustrate the use of our estimators in the presence of data aggregation and compared the different estimators. Both the simulation studies and real data-based studies have shown the advantage of our estimators in estimating logistics models describing individual-level behaviors based on aggregate-level Y and individual-level X, i.e., when there is data aggregation in the response variable.

Funding

This research received no external funding.

Data Availability Statement

All data used in the study are publicly available.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BFGS	Broyden–Fletcher–Goldfarb–Shanno method
CF	Characteristic function
CG	Conjugate gradient
DFT	Discrete Fourier transform
IRLS	Iteratively re-weighted least squares
LR	Logistics regression
MAD	Mean Absolute Deviation
MSE	Mean Square Error
NM	Nelder-Mead method

References

Firebaugh, G. A rule for inferring individual-level relationships from aggregate data. Am. Sociol. Rev. 1978, 43, 557–572. [Google Scholar] [CrossRef]
Robinson, W.S. Ecological correlations and the behavior of individuals. Int. J. Epidemiol. 2009, 38, 337–341. [Google Scholar] [CrossRef]
Hammond, J.L. Two sources of error in ecological correlations. Am. Sociol. Rev. 1973, 38, 764–777. [Google Scholar] [CrossRef]
Hsiao, C. Linear regression using both temporally aggregated and temporally disaggregated data. J. Econom. 1979, 10, 243–252. [Google Scholar] [CrossRef]
Palm, F.C.; Nijman, T.E. Linear regression using both temporally aggregated and temporally disaggregated data. J. Econom. 1982, 19, 333–343. [Google Scholar] [CrossRef]
Rawashdeh, A.; Obeidat, M. A Bayesian Approach to Estimate a Linear Regression Model with Aggregate Data. Austrian J. Stat. 2019, 48, 90–100. [Google Scholar] [CrossRef]
Agresti, A. Categorical Data Analysis; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2013. [Google Scholar]
Givens, G.; Hoeting, J. Computational Statistics; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2012. [Google Scholar]
Wang, Y.H. On the number of successes in independent trials. Stat. Sin. 1993, 3, 295–312. [Google Scholar]
Hong, Y. On computing the distribution function for the Poisson binomial distribution. Comput. Stat. Data Anal. 2013, 59, 41–51. [Google Scholar] [CrossRef]
Bilder, C.; Loughin, T. Analysis of Categorical Data with R; Chapman & Hall/CRC Texts in Statistical Science; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
Chen, X.H.; Dempster, A.P.; Liu, J.S. Weighted finite population sampling to maximize entropy. Biometrika 1994, 81, 457–469. [Google Scholar] [CrossRef]
Fernández, M.; Williams, S. Closed-form expression for the poisson-binomial probability density function. IEEE Trans. Aerosp. Electron. Syst. 2010, 46, 803–817. [Google Scholar] [CrossRef]
Nelder, J.A.; Mead, R. A simplex method for function minimization. Comput. J. 1965, 7, 308–313. [Google Scholar] [CrossRef]
Fletcher, R. A new approach to variable metric algorithms. Comput. J. 1970, 13, 317–322. [Google Scholar] [CrossRef]
Flecher, R.; Reeves, C. Function minimization by conjugate gradient. Comput. J. 1964, 7, 149–154. [Google Scholar] [CrossRef]
Shao, J. Mathematical Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
Su, B.; Zhu, F. Temporal aggregation and systematic sampling for INGARCH processes. J. Stat. Plan. Inference 2022, 219, 120–133. [Google Scholar] [CrossRef]

Table 1. Average Squared Bias and Variance of Estimators E1 to E5 in Scenarios 1A to 3B. K is the sample size of the aggregate data.

n_{g}