Article

Scrambling Reports: New Estimators for Estimating the Population Mean of Sensitive Variables

by Pablo O. Juárez-Moreno 1,*, Agustín Santiago-Moreno 2, José M. Sautto-Vallejo 2 and Carlos N. Bouza-Herrera 3
1 Higher School of Sociology, Universidad Autónoma de Guerrero, Acapulco 39310, Mexico
2 Faculty of Mathematics, Universidad Autónoma de Guerrero, Acapulco 39650, Mexico
3 Faculty of MATCOM, Universidad de La Habana, La Habana 11300, Cuba
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(11), 2572; https://doi.org/10.3390/math11112572
Submission received: 13 April 2023 / Revised: 12 May 2023 / Accepted: 29 May 2023 / Published: 4 June 2023
(This article belongs to the Special Issue Current Research in Biostatistics)

Abstract

Warner proposed a methodology called randomized response techniques, which, through the random scrambling of sensitive variables, allows the non-response rate to be reduced and the response bias to be diminished. In this paper, we present a randomized response technique using simple random sampling. The scrambling of the sensitive variable is performed through the selection of a report $R_i$, $i = 1, 2, 3$. In order to evaluate the accuracy and efficiency of the proposed estimators, a simulation was carried out with two databases, where the sensitive variables are the destruction of poppy crops in Guerrero, Mexico, and the age at first sexual intercourse. The results show that more accurate estimates are obtained with the proposed model.

1. Introduction

When carrying out survey sampling, the goal of the sampler is to collect, based on a sample, the greatest amount of information in order to estimate a certain characteristic of the population under study. To accomplish the objective of having accurate and truthful measurements, the sampler must have a sufficient amount of financial and methodological resources. If the sampler cannot solve any of the aforementioned issues, in practice, problems will arise in the collection of the information of interest, and these problems are a component of so-called “sampling errors”. These errors are mainly due to a lack of response (non-response) or response bias. In addition, these sampling errors increase when the information to be obtained is about a sensitive characteristic. That is, respondents are more likely to avoid answering or give untruthful responses to questions on topics such as drugs, sexual violence, alcoholism, crime, etc.
The literature offers different techniques for obtaining answers to direct questions of a sensitive nature, such as the bogus pipeline developed by Jones and Sigall [1], the unmatched count developed by Raghavarao and Federer [2], and the randomized response (RR) proposed by Warner [3]. The bogus pipeline and unmatched count techniques serve their purpose of protecting the confidentiality of respondents. However, compared to randomized response techniques, their shortcomings lie in the implementation costs, the veracity of the results due to the lack of unbiased estimators, and their characteristics (variance, estimation error, and so on); see [4,5]. Due to its methodology and statistical foundations, Warner’s [3] proposal is the most appropriate for reducing response bias and non-response rates, estimating the characteristic of interest, and maintaining the confidentiality of respondents so as to protect them from being stigmatized when providing a sensitive response.
In the first work on randomized responses, Warner [3] considered a dichotomous population $U$ of size $N$; that is, the elements of the population are classified according to their possible responses into the groups $U_A$, consisting of people who have the sensitive characteristic $Y$, and $U_{\bar{A}}$, consisting of people who do not have it. Using simple random sampling (SRS), a sample $s$ of size $n$ is selected in order to estimate the proportion $\pi_A$ of people with the sensitive qualitative characteristic. The sensitive response of the respondent is scrambled with the help of a randomization device that selects the sensitive question with probability $P$. Hence, $\pi_y = P\pi_A + (1 - P)(1 - \pi_A)$. Warner’s estimator of the population proportion $\pi_A$ of a sensitive characteristic $A$ is $\hat{\pi}_A = \frac{\rho_{ys} - (1 - P)}{2P - 1}$, which requires $P \neq 1/2$. Extensions to deal with quantitative sensitive variables were developed by Greenberg et al. [6], Eriksson et al. [7], Huang [8], Bouza [9], Arnab [10], Singh and Gorey [11], Hussain and Shahid [12], Narjis et al. [13], Bouza et al. [14], Hussain et al. [15], and Azeem and Ali [16], among other works. RR techniques have also been applied to sensitive issues in areas such as health (see Murtaza et al. [17]), social issues (see Chong et al. [18]), and drug use (see Perri et al. [19] and Kirtadze et al. [20]). We present a variation of Saleem et al.’s [21] paper, in which the authors proposed a scrambling procedure for quantitative sensitive variables.
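The following is a minimal Python sketch of Warner's procedure as described above, not code from the paper. The function names, the seed, and the reading of $\rho_{ys}$ as the sample proportion of "yes" reports are assumptions made for illustration.

```python
import random


def warner_estimate(yes_proportion, p):
    """Warner's estimator of pi_A from the observed proportion of 'yes' reports.

    Assumes yes_proportion plays the role of rho_ys in the text.
    """
    if p == 0.5:
        raise ValueError("P = 1/2 makes the estimator undefined (2P - 1 = 0)")
    return (yes_proportion - (1.0 - p)) / (2.0 * p - 1.0)


def simulate_warner(pi_a, p, n, seed=0):
    """Simulate n respondents (hypothetical helper) and return the estimate of pi_A."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        has_trait = rng.random() < pi_a
        asked_sensitive = rng.random() < p
        # 'yes' if asked about A and has A, or asked about not-A and lacks A
        answer_yes = has_trait if asked_sensitive else not has_trait
        yes += answer_yes
    return warner_estimate(yes / n, p)


# With pi_A = 0.2 and P = 0.7, pi_y = 0.7 * 0.2 + 0.3 * 0.8 = 0.38.
exact = warner_estimate(0.38, 0.7)
simulated = simulate_warner(pi_a=0.2, p=0.7, n=20000)
```

Plugging the exact value of $\pi_y$ into the estimator recovers $\pi_A$, which is a quick way to check the signs in the formula.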
In this study, we used two databases to evaluate the estimators. One of them was obtained from a census on the cultivation of illegal drugs in the State of Guerrero, Mexico (see México Unido Contra La Delincuencia [22]). The sensitive variable is the area devoted to such crops. This research is very important because, despite efforts to curb the production of illicit drugs, their cultivation increases. A goal of the involved authorities is to examine behavior when using scrambling techniques to provide farmers with the confidence that their reports are not going to stigmatize them. Eradication efforts of such crops have an impact on ecosystems: policies disproportionately affect smallholders, pushing them to marginality, and programs such as the aerial spraying of herbicides affect biodiversity by fragmenting and degrading forest habitat and wildlife. Severe damage to the environment, which may be a consequence of eradication policies, imposes the need to periodically review their effects from societal perspectives. Sample surveys should be developed periodically. The other database was provided by research on the age of first sexual intercourse (see Secretaria De Salud [23]). Early sexual activity in adolescence has multiple short- and long-term negative impacts on further emotional development and the quality of health, both mental and physical. Different studies maintain that having sex before the age of 13 increases the likelihood of sexually transmitted infections and other unhealthy behaviors, such as alcohol abuse. It is also associated with delinquency, violence, and intergenerational health problems due to unintended pregnancies. See Epstein et al. [24] for a discussion on these facts. Previous reliability studies on first intercourse have given some idea of the rates of falsified answers; see Brener et al. [25] as an example. Obtaining truthful answers while protecting privacy is possible with the use of RR techniques, which also provide higher rates of response from surveyed persons.
The content of this document is organized as follows. In the first part, we propose a variation of the model proposed by Saleem et al. [21] under SRS. The goal of this variation is to improve Saleem et al.’s [21] model in terms of precision, resulting in an R3 report with an unbiased estimator of the mean under specified conditions. In the second section, we evaluate the quality of the estimators in terms of accuracy and efficiency. We developed numerical studies on the behavior of the RR techniques presented using the two databases. Both studies provide recommendations on the use of the estimators derived for the considered scrambling procedures. Numerical and graphical studies were performed using simulations.

2. Materials and Methods

Proposed RR Scrambling Procedure Using SRSWR

Randomized response techniques increase the participation of respondents to direct questions regarding a sensitive characteristic by providing them with confidence when reporting the value of their sensitive characteristic Y. Otherwise, the sampler is generally faced with a high proportion of non-responses and/or false responses. In practice, RR techniques, which are better at scrambling the sensitive value Y, will be perceived with more confidence by the respondents, who are more likely to supply its true value. We propose a variation of the work of Saleem et al. [21]. The RR proposed is a compulsory randomized response technique, in which the respondent’s response is randomly scrambled by one of the following three reports:
$R_1 = Y_i + S_i$, $R_2 = Y_i - S_i$, or $R_3 = Y_i S_i$.
Each report individually scrambles the true value of $Y$.
Take $g \in [0, 1]$ and $\alpha \in \{-1, 1\}$, which are independent constants known and/or generated by the sampler. $S$ is an auxiliary or scrambling variable, with the mean $E(S) = \mu_S = 0$ and variance $\sigma_S^2$ fixed by the sampler. The report is
$$Z^* = g(Y + \alpha S) + (1 - g)YS.$$
Our proposal substitutes the last alternative report with $R_{(3)} = Y_i / S_i$ and uses $S$ with the mean $\mu_S > 0$ and variance $\sigma_S^2$. It is also a compulsory randomized response technique. Now, the respondent’s response is randomly scrambled by $R_1$, $R_2$, or $R_{(3)}$. Therefore, the RR model is given by:
$$Z = g(Y + \alpha S) + (1 - g)Y/S.$$
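The two report models can be written as small Python functions. This is an illustrative sketch under the definitions above, with hypothetical function names; it is not taken from the paper or from Saleem et al. [21].

```python
def scramble_z_star(y, s, g, alpha):
    """Z* report: g * (Y + alpha * S) + (1 - g) * Y * S (multiplicative last term)."""
    return g * (y + alpha * s) + (1.0 - g) * y * s


def scramble_z(y, s, g, alpha):
    """Proposed Z report: g * (Y + alpha * S) + (1 - g) * Y / S (ratio last term)."""
    if s == 0:
        raise ValueError("S must be non-zero for the ratio report Y / S")
    return g * (y + alpha * s) + (1.0 - g) * y / s


# Limiting cases for y = 10 and s = 2:
r1 = scramble_z(10.0, 2.0, g=1.0, alpha=1)        # additive report R1 = Y + S
r2 = scramble_z(10.0, 2.0, g=1.0, alpha=-1)       # subtractive report R2 = Y - S
r3_ratio = scramble_z(10.0, 2.0, g=0.0, alpha=1)  # ratio report R(3) = Y / S
r3_product = scramble_z_star(10.0, 2.0, g=0.0, alpha=1)  # product report R3 = Y * S
```

Setting $g = 1$ recovers $R_1$ or $R_2$ depending on the sign of $\alpha$, and $g = 0$ recovers the third report, which is where the two models differ.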
SRSWR (simple random sampling with replacement) is used to select a sample $s$ of size $n$ from a population $U$ in the reports. It is of interest to know the population characteristics of the sensitive value $Y$. The characteristics for $R_1$ and $R_2$ proposed by Saleem et al. [21] are as follows.
For $R_1$, $\bar{Y}(R_1) = \bar{R}_1 - \mu_S$, with the variance $V(\bar{R}_1) = \frac{\sigma_Y^2 + \sigma_S^2}{n}$; for $R_2$, $\bar{Y}(R_2) = \bar{R}_2 + \mu_S$, with the variance $V(\bar{R}_2) = \frac{\sigma_Y^2 + \sigma_S^2}{n}$. For both reports, $\hat{V}(\bar{R}_i) = \frac{\hat{\sigma}_Y^2 + \sigma_S^2}{n}$ for $i = 1, 2$, where $\hat{\sigma}_Y^2 = \frac{S_Z^2 - \sigma_S^2}{n}$ with $S_Z^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Z_i - \bar{Z})^2$. Their proposed estimator of $\mu_Y$ for the $Z^*$ model is $\hat{\mu}_Y^* = \bar{Z}/g$, with the variance
$$V(\hat{\mu}_Y^*) = \frac{1}{n}\left[g^2(\sigma_Y^2 + \alpha^2\sigma_S^2) + (1-g)^2\sigma_S^2(\sigma_Y^2 + \bar{Y}^2) + 2\alpha g(1-g)\bar{Y}\sigma_S^2\right].$$
We propose the following estimator of this variance:
$$\hat{V}(\hat{\mu}_Y^*) = \frac{1}{n}\left[g^2(\hat{\sigma}_Y^2 + \alpha^2\sigma_S^2) + (1-g)^2\sigma_S^2(\hat{\sigma}_Y^2 + \hat{\mu}_Y^{*2}) + 2\alpha g(1-g)\hat{\mu}_Y^*\sigma_S^2\right],$$
where
$$\hat{\sigma}_Y^2 = \frac{S_z^2 - g^2\alpha^2\sigma_S^2 - (1-g)^2\sigma_S^2\hat{\mu}_Y^{*2} - 2\alpha g(1-g)\hat{\mu}_Y^*\sigma_S^2}{g^2 + (1-g)^2\sigma_S^2}.$$
Our proposal uses $R_{(3)i} = Y_i / S_i$ instead of $R_{3i} = Y_i S_i$. It seems that respondents will perceive that $R_{(3)i}$ provides more confidence in scrambling $Y_i$. The next lemma gives the statistical properties of an estimation of the population mean based on the reports $R_{(3)i}$, $i = 1, \ldots, n$.
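Lemma 1.
The estimator of the mean of $Y$ using the scrambling procedure $R_{(3)}$ is $\bar{Y}(R_{(3)}) \approx \bar{R}_{(3)} \Big/ \left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)$, with the variance
$$V(\bar{Y}(R_{(3)})) \approx \frac{1}{n}\left[\sigma_Y^2 + \left(\frac{\frac{1}{\mu_S^2} + \frac{3\sigma_S^2}{\mu_S^4}}{\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2} - 1\right)(\sigma_Y^2 + \mu_Y^2)\right].$$
Proof.
Expectation. Note that it is a ratio estimator. The expectation of $R_{(3)}$ under the model is
$$E_{R_{(3)}}(R_{(3)i} \mid i) = E_{R_{(3)}}\left(\frac{Y_i}{S_i} \,\Big|\, i\right) = Y_i E_{R_{(3)}}\left(\frac{1}{S_i} \,\Big|\, i\right) \approx Y_i\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right).$$
This expression is derived by using a Taylor series approximation, $E\left(\frac{1}{S_i}\right) \approx \frac{1}{E(S_i)} + \frac{Var(S_i)}{E(S_i)^3} = \frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}$; see Singh [26]. Therefore, $E(\bar{Y}(R_{(3)})) \approx \bar{Y} \Big/ \left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)$. Calculating the design expectation,
$$E(E(R_{(3)i} \mid i)) = E_d\left(E_{R_{(3)}}\left(\frac{Y_i}{S_i} \,\Big|\, i\right)\right) \approx E_d\left(Y_i\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)\right) = \mu_Y\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right).$$
Hence, the estimator $\bar{R}_{(3)} \Big/ \left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right) = \mu_{Y R_{(3)}}$ is an approximately unbiased estimator of $\mu_Y$.
Variance of the estimator. The variance of $R_{(3)}$ under the model is
$$V_{R_{(3)}}(R_{(3)i} \mid i) = V_{R_{(3)}}\left(\frac{Y_i}{S_i} \,\Big|\, i\right) = Y_i^2 V_{R_{(3)}}\left(\frac{1}{S_i} \,\Big|\, i\right) \approx Y_i^2\left[\left(\frac{1}{\mu_S^2} + \frac{3\sigma_S^2}{\mu_S^4}\right) - \left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2\right],$$
where $V\left(\frac{1}{S_i}\right) = E\left(\frac{1}{S_i^2}\right) - \left[E\left(\frac{1}{S_i}\right)\right]^2 \approx \left(\frac{1}{\mu_S^2} + \frac{3\sigma_S^2}{\mu_S^4}\right) - \left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2$. Using, in both expectations, a Taylor series approximation, as developed by Singh [26],
$$\begin{aligned}
V\left(\frac{\bar{R}_{(3)}}{\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}}\right)
&= V_d\left(\frac{1}{\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right) n}\sum_{i \in s} E_R(R_{(3)i})\right) + E_d\left(\frac{1}{\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2 n^2}\sum_{i \in s} V_R(R_{(3)i})\right) \\
&\approx V_d\left(\frac{1}{\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right) n}\sum_{i \in s} Y_i\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)\right) + E_d\left(\frac{1}{\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2 n^2}\sum_{i \in s} Y_i^2\left[\left(\frac{1}{\mu_S^2} + \frac{3\sigma_S^2}{\mu_S^4}\right) - \left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2\right]\right) \\
&= \frac{1}{n^2}\sum_{i \in s} V_d(Y_i) + \frac{\left(\frac{1}{\mu_S^2} + \frac{3\sigma_S^2}{\mu_S^4}\right) - \left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2}{\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2 n^2}\sum_{i \in s} E_d(Y_i^2) \\
&= \frac{1}{n}\left[\sigma_{Y R_{(3)}}^2 + \left(\frac{\frac{1}{\mu_S^2} + \frac{3\sigma_S^2}{\mu_S^4}}{\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2} - 1\right)\left(\sigma_{Y R_{(3)}}^2 + \mu_{Y R_{(3)}}^2\right)\right].
\end{aligned}$$
Then, the lemma is proved. □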
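The Taylor approximation used in the proof can be checked numerically. The sketch below is an illustration only: the Gamma distribution for $S$, the constant sensitive value, the sample size, and all function names are assumptions chosen so that $\mu_S = 10$ and $\sigma_S^2 = 1$, and none of them come from the paper.

```python
import random


def inverse_mean_taylor(mu_s, var_s):
    """Second-order Taylor approximation of E(1/S): 1/mu_S + sigma_S^2 / mu_S^3."""
    return 1.0 / mu_s + var_s / mu_s ** 3


def r3_mean_estimate(reports, mu_s, var_s):
    """Mean of the R(3) reports divided by the approximation of E(1/S)."""
    return (sum(reports) / len(reports)) / inverse_mean_taylor(mu_s, var_s)


def simulate_r3(y_values, shape, scale, seed=0):
    """Scramble each y as y / S with S ~ Gamma(shape, scale) (illustrative choice)."""
    rng = random.Random(seed)
    return [y / rng.gammavariate(shape, scale) for y in y_values]


# Gamma(shape=100, scale=0.1) has mean 10 and variance 1.
reports = simulate_r3([20.0] * 50000, shape=100.0, scale=0.1)
estimate = r3_mean_estimate(reports, mu_s=10.0, var_s=1.0)
```

In this setup the rescaled mean of the scrambled reports lands close to the true value of 20, which is the behavior the lemma describes for a scrambling variable whose variance is small relative to its mean.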
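Since the estimator is only approximately unbiased, its bias is obtained as follows. From
$$E(E(R_{(3)i} \mid i)) = E_d\left(E_{R_{(3)}}\left(\frac{Y_i}{S_i} \,\Big|\, i\right)\right) \approx E_d\left(Y_i\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)\right) = \mu_Y\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right), \qquad \frac{\bar{R}_{(3)}}{\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}} = \mu_{Y R_{(3)}},$$
the bias is
$$B = E(\mu_{Y R_{(3)}}) - \mu_Y = \frac{\mu_Y}{\frac{\mu_S^2 + \sigma_S^2}{\mu_S^3}} - \mu_Y = \mu_Y\left(\frac{1}{\frac{\mu_S^2 + \sigma_S^2}{\mu_S^3}} - 1\right).$$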
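Remark 1.
The sampler is able to diminish this bias using a variable $S$ such that $\frac{1}{(\mu_S^2 + \sigma_S^2)/\mu_S^3} \approx 1$. Then, the mean squared error of $\mu_{Y R_{(3)}}$ is
$$MSE(\mu_{Y R_{(3)}}) = \frac{1}{n}\left[\sigma_{Y R_{(3)}}^2 + \left(\frac{\frac{1}{\mu_S^2} + \frac{3\sigma_S^2}{\mu_S^4}}{\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2} - 1\right)\left(\sigma_{Y R_{(3)}}^2 + \mu_{Y R_{(3)}}^2\right)\right] + \left[\mu_Y\left(\frac{1}{\frac{\mu_S^2 + \sigma_S^2}{\mu_S^3}} - 1\right)\right]^2.$$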
Remark 2.
Note that the bias depends on $S$ only through $\frac{\mu_S^2 + \sigma_S^2}{\mu_S^3}$, and then the estimator will be unbiased if $\sigma_S^2 = \mu_S^3 - \mu_S^2$ or, in the same way, if a large sample $n$ is taken such that it satisfies the equality $\sqrt[3]{\sigma_S^2} = \mu_S$. All these conditions are possible as long as $\mu_S$ and $\sigma_S^2$ are fixed by the researcher, as we pointed out above. Note that as $n \to \infty$, the variance tends to zero, and hence, $R_{(3)}$ is consistent.
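As a worked illustration of the unbiasedness condition in Remark 2, take $\mu_S = 2$. This numerical value is chosen only for illustration and does not come from the paper. The condition then fixes $\sigma_S^2$, and the bias expression vanishes:

$$\sigma_S^2 = \mu_S^3 - \mu_S^2 = 8 - 4 = 4, \qquad \frac{\mu_S^2 + \sigma_S^2}{\mu_S^3} = \frac{4 + 4}{8} = 1, \qquad B = \mu_Y\left(\frac{1}{1} - 1\right) = 0.$$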
Our proposal uses the estimator
$$\hat{\mu}_Y = \frac{\bar{Z} - \alpha g \mu_S}{g + (1-g)S_P},$$
where $S_P = \frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}$.
An estimation theory for this RR scrambling procedure is given in Lemma 2.
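Lemma 2.
The use of the $Z$ report has the following characteristics:
(i) $\hat{\mu}_Y \approx \frac{\bar{Z} - \alpha g \mu_S}{g + (1-g)S_P}$, which is an estimator of the population mean of $Y$.
(ii) $V(\hat{\mu}_Y) \approx \frac{1}{\left[g + (1-g)S_P\right]^2 n}\left[\sigma_Y^2\left(g^2 + (1-g)^2 S_P^2\right) + g^2\alpha^2\sigma_S^2 + (1-g)^2 S_{PV}\left(\sigma_Y^2 + \mu_Y^2\right)\right]$, which is the variance of the estimator.
(iii) $MSE(\hat{\mu}_Y) = V(\hat{\mu}_Y) + \left[\frac{\mu_Y - \alpha g \mu_S}{g + (1-g)S_P} - \mu_Y\right]^2$.
(iv) $\hat{V}(\hat{\mu}_Y) \approx \frac{1}{\left[g + (1-g)S_P\right]^2 n}\left[\hat{\sigma}_Y^2\left(g^2 + (1-g)^2 S_P^2\right) + g^2\alpha^2\sigma_S^2 + (1-g)^2 S_{PV}\left(\hat{\sigma}_Y^2 + \hat{\mu}_Y^2\right)\right]$ is an estimator of the variance, where $\hat{\sigma}_Y^2 \approx \frac{S_z^2\left[g + (1-g)S_P\right]^2 n - g^2\alpha^2\sigma_S^2 - (1-g)^2 S_{PV}\hat{\mu}_Y^2}{g^2 + (1-g)^2 S_P^2 + (1-g)^2 S_{PV}}$, and $S_z^2 = \frac{\sum_{i \in s}(z_i - \bar{z})^2}{n - 1}$.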
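Proof.
Taking the design expectation of the conditional expectation of $Z_i$ gives
$$\begin{aligned}
E(Z_i) &= E_d\left(E_{Z_i}\left(g(Y_i + \alpha S_i) + (1-g)\frac{Y_i}{S_i} \,\Big|\, i\right)\right) = E_d\left(g\left(Y_i + \alpha E_{Z_i}(S_i)\right) + (1-g)Y_i E_{Z_i}\left(\frac{1}{S_i}\right) \,\Big|\, i\right) \\
&\approx E_d\left(g(Y_i + \alpha\mu_S) + (1-g)Y_i\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)\right) = g(\mu_Y + \alpha\mu_S) + (1-g)\mu_Y\left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right) \\
&= g(\mu_Y + \alpha\mu_S) + (1-g)\mu_Y S_P;
\end{aligned}$$
hence, $\frac{\bar{Z} - \alpha g \mu_S}{g + (1-g)S_P}$ is the estimator of $\mu_Y$, where $S_P = \frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}$.
The expectation of $Z_i$ under the model is $E_{Z_i}(Z_i \mid i) = E_M\left(g(Y_i + \alpha S_i) \mid i\right) + E_M\left((1-g)Y_i/S_i \mid i\right) \approx g(Y_i + \alpha\mu_S) + (1-g)Y_i S_P$. The variance of $Z_i$ under the model is $V_{Z_i}(Z_i \mid i) = V_M\left(g(Y_i + \alpha S_i) \mid i\right) + V_M\left((1-g)Y_i/S_i \mid i\right) \approx g^2\alpha^2\sigma_S^2 + (1-g)^2 Y_i^2 S_{PV}$, where $S_{PV} = V\left(\frac{1}{S_i}\right) \approx \left(\frac{1}{\mu_S^2} + \frac{3\sigma_S^2}{\mu_S^4}\right) - \left(\frac{1}{\mu_S} + \frac{\sigma_S^2}{\mu_S^3}\right)^2$.
Therefore, the variance of the estimator is given by
$$\begin{aligned}
V(\hat{\mu}_Y) &= V\left(\frac{\bar{Z} - \alpha g \mu_S}{g + (1-g)S_P}\right) = V\left(\frac{\bar{Z}}{g + (1-g)S_P}\right) \\
&= V_d\left(\frac{1}{\left[g + (1-g)S_P\right] n}\sum_{i \in s} E_{Z_i}(Z_i \mid i)\right) + E_d\left(\frac{1}{\left[g + (1-g)S_P\right]^2 n^2}\sum_{i \in s} V_{Z_i}(Z_i \mid i)\right) \\
&\approx V_d\left(\frac{1}{\left[g + (1-g)S_P\right] n}\sum_{i \in s}\left[g(Y_i + \alpha\mu_S) + (1-g)Y_i S_P\right]\right) + E_d\left(\frac{1}{\left[g + (1-g)S_P\right]^2 n^2}\sum_{i \in s}\left[g^2\alpha^2\sigma_S^2 + (1-g)^2 Y_i^2 S_{PV}\right]\right) \\
&= \frac{1}{\left[g + (1-g)S_P\right]^2 n^2}\sum_{i \in s}\left[g^2 V_d(Y_i) + (1-g)^2 S_P^2 V_d(Y_i)\right] + \frac{1}{\left[g + (1-g)S_P\right]^2 n^2}\sum_{i \in s}\left[g^2\alpha^2\sigma_S^2 + (1-g)^2 S_{PV} E_d(Y_i^2)\right] \\
&= \frac{g^2\sigma_Y^2 + (1-g)^2 S_P^2\sigma_Y^2}{\left[g + (1-g)S_P\right]^2 n} + \frac{g^2\alpha^2\sigma_S^2 + (1-g)^2 S_{PV}\left(\sigma_Y^2 + \mu_Y^2\right)}{\left[g + (1-g)S_P\right]^2 n} \\
&= \frac{1}{\left[g + (1-g)S_P\right]^2 n}\left[\sigma_Y^2\left(g^2 + (1-g)^2 S_P^2\right) + g^2\alpha^2\sigma_S^2 + (1-g)^2 S_{PV}\left(\sigma_Y^2 + \mu_Y^2\right)\right].
\end{aligned}$$
A natural estimator for the variance is
$$S_z^2 = \frac{1}{\left[g + (1-g)S_P\right]^2 n}\left[\sigma_Y^2\left(g^2 + (1-g)^2 S_P^2\right) + g^2\alpha^2\sigma_S^2 + (1-g)^2 S_{PV}\left(\sigma_Y^2 + \mu_Y^2\right)\right].$$
Say,
$$S_z^2\left[g + (1-g)S_P\right]^2 n = \sigma_Y^2\left(g^2 + (1-g)^2 S_P^2\right) + g^2\alpha^2\sigma_S^2 + (1-g)^2 S_{PV}\,\sigma_Y^2 + (1-g)^2 S_{PV}\,\mu_Y^2.$$
That is,
$$S_z^2\left[g + (1-g)S_P\right]^2 n = \sigma_Y^2\left(g^2 + (1-g)^2 S_P^2 + (1-g)^2 S_{PV}\right) + g^2\alpha^2\sigma_S^2 + (1-g)^2 S_{PV}\,\mu_Y^2.$$
We denote
$$\frac{S_z^2\left[g + (1-g)S_P\right]^2 n - g^2\alpha^2\sigma_S^2 - (1-g)^2 S_{PV}\,\mu_Y^2}{g^2 + (1-g)^2 S_P^2 + (1-g)^2 S_{PV}} = \hat{\sigma}_Y^2.$$
The lemma is proved. □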
Note that the bias is:
$$B = E(\hat{\mu}_Y) - \mu_Y = E\left(\frac{\bar{Z} - \alpha g \mu_S}{g + (1-g)S_P}\right) - \mu_Y = \frac{\mu_Y - \alpha g \mu_S}{g + (1-g)S_P} - \mu_Y = \mu_Y\left(\frac{1}{g + (1-g)S_P} - 1\right) - \frac{\alpha g \mu_S}{g + (1-g)S_P}.$$
With the same conditions fixed for the $R_{(3)}$ report, we have $g + (1-g)S_P \approx 1$. Then, the expression of the bias will be zero, and the choice of the researcher to use the proposed report $R_{(3)}$, that is, to have $g = 0$, will make the estimate unbiased.
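The quantities $S_P$, $S_{PV}$, and the point estimator of Lemma 2 translate directly into code. The following Python sketch uses hypothetical function names and example inputs chosen only so that the arithmetic is easy to follow; it is an illustration of the formulas, not the authors' simulation code.

```python
def s_p(mu_s, var_s):
    """S_P: Taylor approximation of E(1/S) = 1/mu_S + sigma_S^2 / mu_S^3."""
    return 1.0 / mu_s + var_s / mu_s ** 3


def s_pv(mu_s, var_s):
    """S_PV: Taylor approximation of V(1/S) = E(1/S^2) - E(1/S)^2."""
    second_moment = 1.0 / mu_s ** 2 + 3.0 * var_s / mu_s ** 4
    return second_moment - s_p(mu_s, var_s) ** 2


def mean_estimator_z(reports, g, alpha, mu_s, var_s):
    """Point estimator (Z_bar - alpha*g*mu_S) / (g + (1 - g) * S_P) from Lemma 2 (i)."""
    z_bar = sum(reports) / len(reports)
    return (z_bar - alpha * g * mu_s) / (g + (1.0 - g) * s_p(mu_s, var_s))


# Example with mu_S = 2 and sigma_S^2 = 4, for which S_P = 1/2 + 4/8 = 1.
example = mean_estimator_z([10.0, 14.0], g=0.5, alpha=1, mu_s=2.0, var_s=4.0)
```

With these inputs the mean report is 12, the correction term $\alpha g \mu_S$ is 1, and the denominator is 1, so the estimate is 11.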

3. Results

In this section, we evaluate the accuracy and efficiency of the estimators. Because the expectation of the $R_3$ report by Saleem et al. [21] is zero, it is not possible to make a comparison with the $R_{(3)}$ report, so only the $Z^*$ and $Z$ models were compared using simple random sampling with replacement (SRSWR). We analyze the behavior of the estimators in two ways: numerically and graphically. To carry out the analysis, two different databases were used. For each one, two simulations of 1000 iterations were carried out, and the averages were computed. We fixed $\alpha = 0.5$ because we want to have the same probability of choosing $R_1$ or $R_2$, since addition and subtraction are inverse processes of each other. Furthermore, for each database, we ran the simulation twice, fixing $g = 0.7$ for the first run and $g = 0.3$ for the second run. The values of the auxiliary variable $S$ were fixed in such a way that the reports produce results similar to the data in the databases.
This evaluation was performed with the following measurements. The ratio of the relative errors evaluates the comparative accuracy of $\hat{\mu}_Y$ between the estimators of models $Z^*$ and $Z$; it is $\mathrm{Error} = \left(\frac{RE_{Z^*}}{RE_Z}\right)_s$, where $RE_k = \frac{\hat{y}_k - Y_k}{\bar{Y}_k}$. On the other hand, we have several measures to evaluate the efficiency of the estimator of the variance of the estimated mean in each model; these are:
(i) the average coefficient of variation, $ACV = \frac{100\sqrt{\hat{\sigma}_Y^2}}{\hat{\mu}_Y}$; (ii) the actual coverage percentage, ACP = percentage of replicates for which the CI covers $\mu_Y$, where the 95% confidence interval for $\mu_Y$ is $\left(\hat{\mu}_Y - 1.96\sqrt{\hat{\sigma}_Y^2},\ \hat{\mu}_Y + 1.96\sqrt{\hat{\sigma}_Y^2}\right)$; (iii) the average length of the confidence intervals, AL; and (iv) the average of the ratio of variances, $E\left(\frac{\hat{V}(Z^*)}{\hat{V}(Z)}\right)_s$. For SRSWR, the sample size $n = \frac{N\sigma_Y^2}{N e^2 + \sigma_Y^2}$ was calculated with a fixed sampling error $e$.
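The sample-size formula and the interval-based measures can be sketched in Python as follows. The function names are hypothetical, and the square root applied to the variance estimate in the coefficient of variation and in the confidence interval is an assumption about the intended formulas rather than something stated explicitly in the text.

```python
import math


def srswr_sample_size(population_size, var_y, error):
    """Sample size n = N * sigma_Y^2 / (N * e^2 + sigma_Y^2), left unrounded."""
    return population_size * var_y / (population_size * error ** 2 + var_y)


def coefficient_of_variation(mu_hat, var_hat):
    """Coefficient of variation in percent (assumes a square root of the variance)."""
    return 100.0 * math.sqrt(var_hat) / mu_hat


def confidence_interval(mu_hat, var_hat, z=1.96):
    """Normal-approximation interval mu_hat -/+ z * sqrt(var_hat)."""
    half_width = z * math.sqrt(var_hat)
    return (mu_hat - half_width, mu_hat + half_width)


# Population parameters reported for the two databases in Sections 3.1 and 3.2.
n_crops = srswr_sample_size(1157, 4947.115, 2.5)
n_intercourse = srswr_sample_size(7240, 12.79736, 0.1)
```

The unrounded results are close to the sample sizes $n = 470$ and $n = 1087$ reported for the two databases.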

3.1. Simulation with Data of Illicit Crops in Guerrero, Mexico

In the first database, the sensitive variable is the amount, in hectares, of poppy crops destroyed by the federal government in Mexico; we only used data from the State of Guerrero [22]. We considered that variable to be sensitive due to the media and social repercussions for the State of Guerrero, since it is a state where the majority of inhabitants make a living from tourism. The parameters of the sensitive variable are $N = 1157$, $\mu = 35.0968$, and $\sigma^2 = 4947.115$. The data used for the simulation cover the period 2015–2021. We used $e = 2.5$ as the error; therefore, $n = 470$ for SRSWR. Table 1 shows the numerical results of the estimations and measures for the models $Z^*$ and $Z$. Table 2 shows the results of the accuracy and efficiency of $Z^*$ against $Z$.
The numerical results in Table 2 show that, for the accuracy of the estimation of the sensitive value $Y$ with respect to the parameter $\mu_Y$, it is better to use our proposed model $Z$ than the $Z^*$ model because its estimate is closer to the true parameter $\mu_Y = 35.096$ and thus is more accurate. This is confirmed by the relative errors, which are smaller for both $g = 0.7$ and $g = 0.3$. Regarding efficiency, it is better to use the $Z^*$ model than the proposed $Z$ model, since it provides smaller values of the variance estimator.
Table 1 confirms what was described above; in addition, scrambling the sensitive value $Y$ with $R_{(3)}$ ($g = 0.3$) provides more accuracy than $R_1$/$R_2$ ($g = 0.7$, for $Z^*$ and $Z$) and $R_3$ ($g = 0.3$) in $Z^*$. In addition, the ACP results for $Z^*$ show the inaccuracy of its estimator. On the other hand, the $Z^*$ model provides smaller values of ACV and AL.
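As a quick arithmetic check of how the two tables relate, the ratio of relative errors can be recomputed from the rounded RE values reported in Table 1 (0.639 and 0.097 for $g = 0.7$; 3.489 and 0.109 for $g = 0.3$):

$$\frac{RE_{Z^*}}{RE_Z}\bigg|_{g = 0.7} = \frac{0.639}{0.097} \approx 6.59, \qquad \frac{RE_{Z^*}}{RE_Z}\bigg|_{g = 0.3} = \frac{3.489}{0.109} \approx 32.01.$$

These agree, up to rounding of the inputs, with the values 6.587 and 32.009 reported in Table 2.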

3.2. Simulation with Data about First Sexual Intercourse

In the second database, we used data from the National Health and Nutrition Survey 2021 [23] collected by the Ministry of Health of Mexico. From these data, as the sensitive variable $Y$, we selected the question, “At what age did you have your first sexual intercourse?” The responses have numeric values between 1 and 49, with $N = 7240$, $\mu_Y = 18.1221$, and $\sigma^2 = 12.79736$. It should be noted that this question from the survey was only posed to women and men between 20 and 49 years old. We set the sampling error $e = 0.1$ for SRSWR, and the resulting $n$ is 1087. As in the previous simulation, we show the results of accuracy and efficiency in Table 3 and Table 4.
Regarding accuracy and efficiency when using $Z^*$ or $Z$, the numerical results in Table 4 coincide with the conclusions of the previous simulation; that is, the estimation is more accurate when using our proposed model than when using $Z^*$. Again, as in the previous simulation, Table 3 shows that the $R_{(3)}$ report ($g = 0.3$) is more accurate than the others, and the percentage of replicates for which the CI covers $\mu_Y$ is zero when using the $Z^*$ model. In addition, it is better to use $Z^*$ than $Z$ to reduce the variance.

3.3. Graphical Simulation

Another way to analyze the behavior of $Z^*$ and $Z$ with both databases is to visualize the values of the following statistics: $\mathrm{Error} = \frac{RE_{Z^*}}{RE_Z}$ and $E\left(\frac{\hat{V}(Z^*)}{\hat{V}(Z)}\right)$. For the first database, the sample size increases over $n = 25, 50, \ldots, 1000$, and for the second database, over $n = 50, 100, \ldots, 2000$. In the next figures, we can observe the accuracy and efficiency under both models with $\alpha = 0.5$ fixed, for $g = 0.7$ and $g = 0.3$.
In Figure 1, in terms of the accuracy of the estimator μ ^ Y , it can be seen that it is better to use the Z model than the Z* model; as in the numerical results, it is more accurate to use the R(3) report (g = 0.3). Using the Z* model over the Z model with any report produces the minimum variance in the results. The graphs in Figure 2 agree with all the results already shown, where it is better to use the Z model for greater accuracy and the Z* model for the minimum variance.

4. Discussion

In this document, we propose a new randomized response technique, which allows us to obtain information on a variable of interest Y considered sensitive. In the study of the behavior of the proposed estimators, as already mentioned in this document, we treated the following as sensitive variables: the amount, in hectares, of destruction of poppy crops by the federal government of Mexico in the State of Guerrero and “At what age did you have your first sexual intercourse?”.
As a consequence of this study, for the first sensitive variable, it is preferable for researchers to use the proposed Z model to estimate the amount of poppy destruction more accurately. This matters in the national context: under the public policies of the current federal government [27], drug prevention programs and licit crop programs are implemented in order to reduce poppy crops, and their budgets are assigned on the basis of these estimates. An underestimate would leave these programs with an inadequate budget, while an overestimate would reduce the budget available to programs in other areas. Neither error is acceptable in a country such as Mexico.
In the analysis of the sensitive question “At what age did you have your first sexual intercourse?”, the same considerations apply, since the Z model provides the best estimate of the true value. On the other hand, a researcher in the area of health [28] may want not only the estimated value of a sensitive characteristic but also a confidence interval for its true value. In that case, it is better to use Z* due to its minimum variance, since it will provide narrower confidence intervals and, hypothetically, estimates with greater precision. This last statement is valid for unbiased estimators.
As a limitation of this work, the estimators in our proposal are as biased as in the work of Saleem et al. [21]. In our case, this is due to the use of ratio estimators, which, by their nature, are biased. In addition, the applicability of the ratio report R(3) is made more difficult in practical use compared to an addition, subtraction, or multiplication report. Finally, only a simple random sampling design was used.
For the aforementioned issues, it is recommended that, in future works, the estimators of the Z model under simple random sampling (SRS) be extended to stratified simple random sampling (SSRS). This variation is for the purpose of determining under which conditions it is better to use SRS or SSRS with the Z and Z* models, defining the gain in accuracy and optimal allocation, and so on. In addition, it would be desirable to propose other estimators that are not of the ratio type to make comparisons, in terms of accuracy and efficiency, with the estimators proposed in this document.

Author Contributions

Conceptualization, C.N.B.-H.; methodology, C.N.B.-H. and P.O.J.-M.; software, J.M.S.-V.; investigation, A.S.-M.; writing—original draft preparation, C.N.B.-H. and P.O.J.-M.; writing—review and editing, J.M.S.-V. and A.S.-M.; supervision, C.N.B.-H.; project administration, C.N.B.-H.; funding acquisition, C.N.B.-H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The first database was taken from a report by the Mexican government on the destruction of poppy crops in the State of Guerrero, Mexico, during the years 2015 to 2021. It can be accessed at https://www.mucd.org.mx. The data for the second simulated case are open data from the National Health and Nutrition Survey (ENSANUT, by its acronym in Spanish) and correspond to the year 2021. They can be accessed at https://ensanut.insp.mx/.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jones, E.E.; Sigall, H. The bogus pipeline: A new paradigm for measuring affect and attitude. Psychol. Bull. 1971, 76, 349–364.
  2. Raghavarao, D.; Federer, W.T. Block Total Response as an Alternative to the Randomized Response Method in Surveys. J. R. Stat Soc. Ser. B 1979, 41, 40–45.
  3. Gupta, S.; Thornton, B. Circumventing Social Desirability Response Bias in Personal Interview Surveys. Am. J. Math. Manag. Sci. 2002, 22, 369–383.
  4. Warner, S.L. Randomized response: A survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 1965, 60, 63–69.
  5. Bahadivand, S.; Doosti-Irani, A.; Karami, M. Prevalence of high-risk behaviors in reproductive age women in Alborz province in 2019 using unmatched count technique. BMC Women’s Health 2020, 20, 186.
  6. Greenberg, B.G.; Kuebler, R.R.J.; Abernathy, J.R.; Horvitz, D.G. Application of the Randomized Response Technique in Obtaining Quantitative Data. J. Am. Stat. Assoc. 1971, 66, 243–250.
  7. Eriksson, S.A. A new model for randomized response. Int. Stat. Rev. 1973, 41, 40–43.
  8. Huang, K.C. Estimation of sensitive characteristics using optional randomized technique. Qual. Quant. 2008, 42, 679–686.
  9. Bouza, C.N. Ranked set sampling and randomized response procedures for estimating the mean of a sensitive quantitative character. Metrika 2009, 70, 267–277.
  10. Arnab, R. Optional randomized response techniques for quantitative characteristics. Commun. Stat. Theory Methods 2018, 48, 4154–4170.
  11. Singh, H.P.; Gorey, S. On Two Stage Optional Randomized Response Model. Elixir Stat. 2018, 123C, 51963–51987.
  12. Hussain, Z.; Shahid, M.I. Improved Randomized Response in Optional Scrambling Models. J. Stat. Theory Pract. 2019, 18, 351–360.
  13. Narjis, G.; Shabbir, J.; Onyango, R. Partial Randomized Response Model for Simultaneous Estimation of Means of Two Sensitive Variables. Math. Probl. Eng. 2022, 2022, 1–13.
  14. Bouza-Herrera, C.N.; Juárez-Moreno, P.O.; Santiago-Moreno, A.; Sautto-Vallejo, J.M. A Two-Stage Scrambling Procedure: Simple and Stratified Random Sampling. An Evaluation of COVID 19’s data in Mexico. Investig. Oper. 2022, 43, 421–430.
  15. Hussain, Z.; Shakeel, S.; Cheema, S.A. Estimation of stigmatized population total: A new additive quantitative randomized response model. Commun. Stat. Theory Methods 2022, 51, 8741–8753.
  16. Azeem, M.; Ali, S. A neutral comparative analysis of additive, multiplicative, and mixed quantitative randomized response models. PLoS ONE 2023, 18, 4.
  17. Murtaza, M.; Singh, S.; Hussain, Z. Use of correlated scrambling variables in quantitative randomized response technique. Biom. J. 2020, 63, 134–147.
  18. Chong, A.; Chu, A.; So, M.; Chung, R. Asking Sensitive Questions Using the Randomized Response Approach in Public Health Research: An Empirical Study on the Factors of Illegal Waste Disposal. Int. J. Environ. Res. Public Health 2019, 16, 970.
  19. Perri, P.F.; Cobo-Rodríguez, B.; Rueda-García, M. A mixed-mode sensitive research on cannabis use and sexual addiction: Improving self-reporting by means of indirect questioning techniques. Qual. Quant. 2018, 52, 1593–1611.
  20. Kirtadze, I.; Otiashvili, D.; Tabatadze, M.; Vardanashvili, I.; Sturua, L.; Zabransky, T.; Anthony, J.C. Republic of Georgia estimates for prevalence of drug use: Randomized response technique suggest under-estimation. Drug Alcohol. Depend. 2018, 187, 300–304.
  21. Saleem, I.; Sanaullah, A.; Koyuncu, N. Estimation of Mean of a Sensitive Quantitative Variable in Complex Survey: Improved Estimator and Scrambled Randomized Response Model. J. Sci. 2019, 32, 1021–1043.
  22. México Unido Contra La Delincuencia. Datos Abiertos Sobre Acciones Antidrogas. Available online: https://www.mucd.org.mx (accessed on 19 January 2023).
  23. Secretaría de Salud. Encuesta Nacional de Salud y Nutrición. Available online: https://www.ensanut.insp.mx (accessed on 19 January 2023).
  24. Epstein, M.; Bailey, J.; Manhart, L.; Hill, K.; Hawkins, D.; Haggerty, K.; Catalano, R. Understanding the Link Between Early Sexual Initiation and Sexually Transmitted Infection: Test and Replication in Two Longitudinal Studies. J. Adolesc. Health. 2014, 54, 435–441.
  25. Brener, N.D.; Eaton, D.K.; Kann, L. The Association of Survey Setting and Mode with Self-Reported Health Risk Behaviors Among High School Students. Public. Opin. Q. 2014, 70, 354–374.
  26. Singh, S. Advanced Sampling Theory with Application, 1st ed.; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2003.
  27. México, Monitoreo de Plantíos de Amapola 2019–2020. Available online: https://www.unodc.org (accessed on 23 January 2023).
  28. Candia, R.; Caiozzi, G. Intervalos de confianza. Rev. Méd. Chile 2005, 133, 1111–1115.
Figure 1. Accuracy and efficiency under Z* and Z in database of illicit crops.
Figure 2. Accuracy and efficiency under Z* and Z in database of first sexual intercourse.
Table 1. Estimates and measures to evaluate the estimators of the models.

| $\alpha = 0.5$ | $Z^*$, $g = 0.7$ | $Z^*$, $g = 0.3$ | $Z$, $g = 0.7$ | $Z$, $g = 0.3$ |
| --- | --- | --- | --- | --- |
| $\hat{\mu}_Y$ | 57.53 | 157.9 | 33.23 | 32.85 |
| ACV | 6.73% | 3.35% | 185.5% | 129% |
| ACP | 1% | 0% | 100% | 100% |
| AL | 15.33 | 20.77 | 241.8 | 181.54 |
| $\hat{V}(\hat{\mu}_Y)$ | 15.64 | 28.55 | 3881 | 2216.5 |
| RE | 0.639 | 3.489 | 0.097 | 0.109 |
Table 2. Accuracy and efficiency of the estimators.

| $\alpha = 0.5$ | $g = 0.7$ | $g = 0.3$ |
| --- | --- | --- |
| $\mathrm{Error} = RE_{Z^*}/RE_Z$ | 6.587 | 32.009 |
| $E\left(\hat{V}(Z^*)/\hat{V}(Z)\right)$ | 0.004 | 0.01 |
Table 3. Estimates and measures to evaluate the estimators of the models.

| $\alpha = 0.5$ | $Z^*$, $g = 0.7$ | $Z^*$, $g = 0.3$ | $Z$, $g = 0.7$ | $Z$, $g = 0.3$ |
| --- | --- | --- | --- | --- |
| $\hat{\mu}_Y$ | 29.03 | 77.34 | 17.56 | 17.77 |
| ACV | 0.88% | 1.09% | 26.89% | 29.85% |
| ACP | 0% | 0% | 100% | 100% |
| AL | 1.005 | 3.331 | 18.51 | 20.79 |
| $\hat{V}(\hat{\mu}_Y)$ | 0.065 | 0.722 | 22.32 | 28.15 |
| RE | 0.601 | 3.268 | 0.0311 | 0.0195 |
Table 4. Accuracy and efficiency of the estimators.

| $\alpha = 0.5$ | $g = 0.7$ | $g = 0.3$ |
| --- | --- | --- |
| $\mathrm{Error} = RE_{Z^*}/RE_Z$ | 19.324 | 167.589 |
| $E\left(\hat{V}(Z^*)/\hat{V}(Z)\right)$ | 0.0029 | 0.0025 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
