Article

Partial Verification Bias Correction Using Inverse Probability Bootstrap Sampling for Binary Diagnostic Tests

by Wan Nor Arifin 1,2,* and Umi Kalsom Yusof 1,*

1 School of Computer Sciences, Universiti Sains Malaysia, Gelugor 11800, Pulau Pinang, Malaysia
2 Biostatistics and Research Methodology Unit, School of Medical Sciences, Universiti Sains Malaysia, Kubang Kerian 16150, Kelantan, Malaysia
* Authors to whom correspondence should be addressed.
Diagnostics 2022, 12(11), 2839; https://doi.org/10.3390/diagnostics12112839
Submission received: 19 October 2022 / Revised: 3 November 2022 / Accepted: 15 November 2022 / Published: 17 November 2022
(This article belongs to the Section Pathology and Molecular Diagnostics)

Abstract

In medical care, it is important to evaluate any new diagnostic test through diagnostic accuracy studies, in which the new test is compared with a gold standard test. The performance of a binary diagnostic test is usually measured by sensitivity (Sn) and specificity (Sp). However, these accuracy measures are often biased owing to selective verification of the patients, known as partial verification bias (PVB). Inverse probability bootstrap (IPB) sampling is a general method for correcting sampling bias in model-based analysis that produces debiased data for analysis. However, its utility in PVB correction has not been investigated before. The objective of this study was to investigate IPB in the context of PVB correction under the missing-at-random assumption for binary diagnostic tests. IPB was adapted for PVB correction, then tested and compared with existing methods using simulated and clinical data sets. The results indicated that IPB is accurate for Sn and Sp estimation, as it showed low bias. However, IPB was less precise than existing methods, as indicated by its higher standard error (SE). Despite this issue, IPB is recommended when subsequent analysis with full data analytic methods is expected. Further studies must be conducted to reduce the SE.

1. Introduction

Diagnostic tests play a central role in medical care; therefore, it is important to ensure the clinical validity of any new diagnostic test [1,2] in the form of diagnostic accuracy studies. The validation involves comparing a new test with the clinically accepted gold standard test, where the performance of the new test is assessed by accuracy measures [1,3,4]. For binary diagnostic tests, sensitivity (Sn) and specificity (Sp) are commonly reported [3,4,5]. However, most often, the verification of disease status by the gold standard test is costly, time-consuming, and invasive [1,5,6,7,8]. This issue with verification causes partial verification bias (PVB), which occurs when only some patients are selected for disease verification by the gold standard test [1]. These patients are usually those with positive diagnostic test results, while those with negative test results are less likely to be selected [6,8,9]. Whenever the disease status is missing for some patients because it is not verified, and the decision to verify depends on the result of the diagnostic test, this gives rise to the missing-at-random (MAR) missing data mechanism [5,6].
PVB is known to cause biased accuracy measures [1,6,10], so it is crucial to correct for this bias in analysis. Methods are available for PVB correction, depending on the scale of the diagnostic and gold standard tests, and the missing data mechanism. A recent review extensively covered all these methods [2], while a specific review on binary diagnostic and gold standard tests with practical implementation was covered in another article [11]. This study focused on PVB correction for binary diagnostic and gold standard tests under the MAR missing data mechanism.
For the binary diagnostic test and disease status (as verified by the gold standard test) under the MAR assumption, the available PVB correction methods can be roughly divided into Begg and Greenes' (BG)-based methods, propensity score (PS)-based methods, and the multiple imputation (MI) method. BG-based and MI methods rely on estimating the probability of disease status given the test result as an intermediate step before correcting the Sn and Sp estimates. This approach works because this probability, commonly known as the positive and negative predictive values [3], is unbiased under the MAR assumption [12]. PS-based methods estimate the probability of verification given the test result before correcting for the bias by a weighting method [13,14]. By estimating the verification probability, the PS has a clear and direct relationship with the verification problem, in this case, the PVB problem. Recent implementations of PVB correction methods can be seen in studies evaluating MRI and ultrasound in prostate cancer [15], serum pepsinogens in gastric cancer [16], and fine needle aspiration cytology in breast cancer [17], where the studies utilized BG-based and MI methods.
In a separate development in the field of ecology, Nahorniak et al. [18] proposed inverse probability bootstrap (IPB) sampling to eliminate the effect of sampling bias in model-based analysis. Although the bootstrap is generally known as a technique for obtaining the standard error of statistical estimates, they showed that it can also be used to obtain unbiased parameter estimates [18]. They achieved this by generating weighted bootstrap samples. IPB allows the use of the familiar and easily applied bootstrap technique by transforming the sample itself, instead of requiring a specific method to be modified or developed to account for the bias [18]. Because IPB is basically a bootstrap technique, it allows easy estimation of the standard error of a parameter estimate to obtain the confidence interval for statistical inference, although it may require a cross-validation technique for this purpose in more complicated situations [18].
There is a common link between the PS-based methods of PVB correction and IPB; both start by estimating the selection probability, or the verification probability in the context of PVB, before utilizing this probability to correct the bias by the same weighting methods. IPB offers an appealing approach to bias correction because it relies on the bootstrap technique and inherits its advantages. The weighted bootstrap sampling utilized by the IPB method has not been investigated in the context of PVB correction, so its potential use and how it can be adapted in this context remain to be studied. Therefore, this study aimed to investigate the applicability of the IPB sampling method in the context of PVB correction under the MAR assumption for binary diagnostic tests.

2. Materials and Methods

This section describes the simulated and clinical data sets used in this study, the proposed implementation of IPB sampling for PVB correction, the metrics for performance evaluation, the selected methods for comparison, and the experimental setup of this study. In addition, the notations used are T = test result, D = disease status, and V = verification status.

2.1. Data Sets

Simulated and clinical data sets were used in this study for performance evaluation and comparison between the methods. The use of simulated data sets allows performance evaluation against known parameter values [18,19]. The use of real data sets allows for comparison between the methods using reference data sets, following the practice of previous research in PVB correction [20,21,22,23].

2.1.1. Simulated Data Sets

The simulated data sets were generated by adapting the settings described in Harel and Zhou [21], Ünal and Burgut [22], and Rochani et al. [23]. The settings were as follows:
  • True disease prevalence (p) or P(D = 1): moderate = 0.40 and low = 0.10.
  • True sensitivity (Sn), P(T = 1 | D = 1): moderate = 0.6, high = 0.9.
  • True specificity (Sp), P(T = 0 | D = 0): moderate = 0.6, high = 0.9.
  • Verification probabilities: when verification depends only on the test result, this gives an MAR missingness mechanism. Fixed verification probabilities given the test result, P(V = 1 | T = t), were set at P(V = 1 | T = 1) = 0.8 and P(V = 1 | T = 0) = 0.4 [21]. In words, patients with positive test results are verified with probability 0.8, while patients with negative test results are verified with a lower probability of 0.4.
  • Sample sizes, N: 200 and 1000.
The counts in the complete 2 × 2 cross-tabulation of test result T versus disease status D follow a multinomial distribution [21,23]. Based on the pre-specified Sn = P(T = 1 | D = 1), Sp = P(T = 0 | D = 0), and p = P(D = 1) = π, the probabilities of the counts are distributed as M(π₁, π₂, π₃, π₄), where:

$$\begin{aligned}
\pi_1 &= P(T=1, D=1) = P(T=1 \mid D=1)\,P(D=1),\\
\pi_2 &= P(T=0, D=1) = [1 - P(T=1 \mid D=1)]\,P(D=1),\\
\pi_3 &= P(T=1, D=0) = [1 - P(T=0 \mid D=0)]\,P(D=0),\\
\pi_4 &= P(T=0, D=0) = P(T=0 \mid D=0)\,P(D=0).
\end{aligned}$$
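For example, with p = 0.4, Sn = 0.6, and Sp = 0.6, these give π₁ = 0.6 × 0.4 = 0.24, π₂ = 0.4 × 0.4 = 0.16, π₃ = 0.4 × 0.6 = 0.24, and π₄ = 0.6 × 0.6 = 0.36, which sum to one.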
Then, for each sample size setting N, the steps to generate a simulated PVB data set for MAR are as follows:
  • A complete data set of size N was generated from the multinomial distribution M(π₁, π₂, π₃, π₄). This produces values from 1 to 4 according to the cell probabilities.
  • The values were converted into realizations of the T = t and D = d variables, where 1 → (T = 1, D = 1), 2 → (T = 0, D = 1), 3 → (T = 1, D = 0), and 4 → (T = 0, D = 0).
  • Under the MAR assumption, a PVB data set was generated with verification probabilities P(V = 1 | T = 1) = 0.8 and P(V = 1 | T = 0) = 0.4; that is, the disease status of each unverified patient was set to missing. A minimal code sketch of this procedure follows.
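The following is a minimal R sketch of this generation procedure (our own illustration under the stated settings; the variable names test, disease, and verified follow the clinical data sets below, and the authors' actual code is available from the repository linked in the Data Availability Statement):

```r
# Minimal sketch: generate one PVB data set under MAR (illustration only).
simulate_pvb <- function(N, p, Sn, Sp, pv_pos = 0.8, pv_neg = 0.4) {
  # Multinomial cell probabilities for the (T, D) cross-tabulation
  probs <- c(Sn * p,              # 1: T = 1, D = 1
             (1 - Sn) * p,        # 2: T = 0, D = 1
             (1 - Sp) * (1 - p),  # 3: T = 1, D = 0
             Sp * (1 - p))        # 4: T = 0, D = 0
  cell <- sample(1:4, size = N, replace = TRUE, prob = probs)
  test    <- as.integer(cell %in% c(1, 3))
  disease <- as.integer(cell %in% c(1, 2))
  # Verification depends on the test result only (MAR):
  # P(V = 1 | T = 1) = pv_pos, P(V = 1 | T = 0) = pv_neg
  verified <- rbinom(N, 1, ifelse(test == 1, pv_pos, pv_neg))
  disease[verified == 0] <- NA  # unverified disease status is missing
  data.frame(test, disease, verified)
}

set.seed(3209673)
dat <- simulate_pvb(N = 200, p = 0.4, Sn = 0.6, Sp = 0.6)
```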

2.1.2. Clinical Data Sets

Two clinical data sets commonly used to illustrate PVB correction methods [21,23,24,25,26,27] were utilized. The original data in these studies were converted to an analysis-ready format (.csv). These data sets are described as follows:
  • Hepatic Scintigraphy Test
    This data set concerns the hepatic scintigraphy test for the detection of liver cancer [24]. Hepatic scintigraphy is an imaging method (the diagnostic test) to detect liver cancer. It was performed on 650 patients, of whom 344 were later verified by liver pathological examination (the gold standard test). The percentage of unverified patients was 47.1%. The data set contains the following variables:
    (a) Liver cancer, disease: binary, 1 = Yes, 0 = No;
    (b) Hepatic scintigraphy, test: binary, 1 = Positive, 0 = Negative;
    (c) Verified, verified: binary, 1 = Yes, 0 = No.
  • Diaphanography Test
    This data set concerns the diaphanography test for the detection of breast cancer [25]. Diaphanography is a noninvasive method (the diagnostic test) of breast examination by transillumination using visible or infrared light to detect the presence of breast cancer. It was tested on 900 patients, of whom 88 were later verified by breast tissue biopsy for histological examination (the gold standard test). The percentage of unverified patients was 90.2%. The data set contains the following variables:
    (a) Breast cancer, disease: binary, 1 = Yes, 0 = No;
    (b) Diaphanography, test: binary, 1 = Positive, 0 = Negative;
    (c) Verified, verified: binary, 1 = Yes, 0 = No.

2.2. Inverse Probability Bootstrap Sampling

Nahorniak et al. [18] proposed inverse probability bootstrap (IPB) sampling to correct for selection bias, comprising seven steps. In this study, the steps were adapted and simplified to the following five steps:
  (1) Calculate the selection probability P_i from the biased sample of size N by any statistical method.
  (2) Calculate the inverse sampling probability P_{i,IPB} as
      $$P_{i,\mathrm{IPB}} = \frac{1/P_i}{\sum_{i=1}^{n} 1/P_i},$$
      where P_{i,IPB} is scaled such that the sum equals one and n is the sample size for complete cases.
  (3) Generate b bootstrap samples of size n by resampling with replacement b times.
  (4) Estimate the parameter of interest as the mean of the parameter estimates from the b bootstrap samples.
  (5) Estimate the standard error (SE) as the standard deviation of the parameter estimates from the b bootstrap samples.
In this study, IPB sampling [18] was proposed for PVB correction by creating synthetic samples that are corrected for the bias. This was done by implementing Step (1) above using the propensity score (PS_i) in place of P_i, defined as
$$PS_i = P(V_i = 1 \mid T_i),$$
where PS_i may be known or may be obtained from a logistic regression on the observed data [13,28,29]. Note that, in the context of PVB, the n specified in Step (2) (i.e., the number of complete cases after excluding observations with missing D) will be smaller than the sample size in Step (1) (the size of the full data containing the V and T variables), denoted as N; n equals N times the verification proportion.
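For example, in the hepatic scintigraphy data set described earlier, N = 650 and 344 patients (52.9%) were verified, so n = 344.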
In Step (4), the parameters of interest are Sn and Sp. For each bootstrap sample, the Sn and Sp estimates are calculated according to the standard formulas of Sn and Sp (given below for the full data analysis method in Section 2.3.2). Following this calculation, the mean of the estimates over the b bootstrap samples is calculated. In Step (5), the SE is utilized to obtain the 100(1 − α)% confidence interval (CI) of the respective parameter estimate by the bootstrap normal CI [30,31] as
$$\widehat{Sn} \pm z_{1-\alpha/2} \times SE_{\mathrm{bootstrap}}(\widehat{Sn}),$$
$$\widehat{Sp} \pm z_{1-\alpha/2} \times SE_{\mathrm{bootstrap}}(\widehat{Sp}).$$
Assuming the bootstrap distribution is approximately normal with small bias, the bootstrap normal interval gives a reasonable estimate [31]. Other common bootstrap intervals [30,31] are also possible as IPB is based on the bootstrap technique.
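As a concrete illustration, the sketch below implements the five steps in R under the assumptions above (our own code, not the authors' implementation; the propensity score is estimated by a logistic regression of V on T, and the data frame layout follows the earlier simulation sketch):

```r
# Minimal sketch of IPB sampling for PVB correction (illustration only).
# 'dat' has columns test, disease (NA when unverified), and verified.
ipb_correct <- function(dat, b = 1000, alpha = 0.05) {
  # Step 1: estimate the propensity score PS_i = P(V = 1 | T)
  ps_fit <- glm(verified ~ test, family = binomial, data = dat)
  dat$ps <- predict(ps_fit, type = "response")
  cc <- dat[dat$verified == 1, ]  # complete cases, size n
  # Step 2: inverse sampling probabilities, scaled to sum to one
  w <- (1 / cc$ps) / sum(1 / cc$ps)
  # Step 3: weighted bootstrap resampling; per-sample Sn and Sp
  est <- replicate(b, {
    s <- cc[sample(nrow(cc), nrow(cc), replace = TRUE, prob = w), ]
    c(Sn = mean(s$test[s$disease == 1]),
      Sp = mean(1 - s$test[s$disease == 0]))
  })
  # Steps 4-5: bootstrap means, SEs, and normal CIs
  mean_est <- rowMeans(est)
  se_est   <- apply(est, 1, sd)
  z <- qnorm(1 - alpha / 2)
  list(estimate = mean_est, se = se_est,
       lower = mean_est - z * se_est, upper = mean_est + z * se_est)
}

res <- ipb_correct(dat)  # 'dat' from the simulation sketch above
```

Calling ipb_correct(dat) returns the bootstrap means of Sn and Sp, their SEs, and the corresponding bootstrap normal CIs.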

2.3. Performance Evaluation

2.3.1. Performance Metrics

The performance evaluation was based on the metrics that measure the difference between an estimate and its true value [19,32,33]. The selected performance metrics, bias and standard error, are defined below. For a finite number of simulations B, these are calculated as follows:
  • Bias
    Bias of a point estimator $\hat{\theta}$ is the difference between the expected value of $\hat{\theta}$ and the true value of the parameter $\theta$ [33]. Bias is calculated as follows:
    $$\mathrm{Bias} = E[\hat{\theta}] - \theta = \frac{1}{B}\sum_{i=1}^{B}\hat{\theta}_i - \theta.$$
  • Standard Error
    Standard error (SE) is the square root of the variance, calculated as follows:
    $$SE = \sqrt{\mathrm{Var}(\hat{\theta})} = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B}\left(\hat{\theta}_i - \bar{\theta}\right)^2},$$
    where $\bar{\theta}$ is the mean of $\hat{\theta}_i$ across repetitions.
Bias is often the main metric of interest [19], where it indicates the accuracy of a method [33] and whether, on average, the method targets the parameter θ [19]. SE shows the precision of the method [19,33], where a smaller SE indicates better precision [33].
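In R, for a vector theta_hat of B estimates of a parameter with true value theta, these metrics reduce to the following (a trivial sketch; the variable names are ours):

```r
# Performance metrics across B simulation repetitions (sketch)
bias <- mean(theta_hat) - theta  # accuracy: estimated E[theta_hat] - theta
se   <- sd(theta_hat)            # precision: sd() uses the B - 1 denominator
```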

2.3.2. Methods for Comparison

The following existing methods were compared with the IPB sampling method for PVB correction. Each method is described briefly, followed by the formulas to calculate Sn and Sp; a consolidated code sketch of the estimators follows the list. The BG method (representing BG-based methods), the inverse probability weighting estimator (representing PS-based methods), and the MI method were selected to represent the different approaches to PVB correction.
  • Full data analysis
    Full data analysis (FDA) represents the ideal analysis performed whenever full data are available without missing observations and bias, which is the standard way of calculating Sn and Sp. Sn and Sp for FDA [3] are calculated as follows:
    $$\widehat{Sn}_{FDA} = \hat{P}(T=1 \mid D=1),$$
    $$\widehat{Sp}_{FDA} = \hat{P}(T=0 \mid D=0).$$
  • Complete case analysis
    In the complete case analysis (CCA) method, the accuracy estimates are calculated from the complete cases only [34]. CCA is biased in the presence of partial verification bias and, hence, represents the uncorrected method. Sn and Sp for CCA are calculated as follows:
    $$\widehat{Sn}_{CCA} = \hat{P}(T=1 \mid D=1, V=1),$$
    $$\widehat{Sp}_{CCA} = \hat{P}(T=0 \mid D=0, V=1).$$
  • Begg and Greenes’ method
    Begg and Greenes (BG) [35] proposed a correction method based on Bayes’ theorem for when the missing data mechanism is MAR. Sn and Sp for the BG method [3,21,27,35] are calculated as follows:
    $$\widehat{Sn}_{BG} = \frac{\hat{P}(T=1)\,\hat{P}(D=1 \mid T=1, V=1)}{\hat{P}(T=1)\,\hat{P}(D=1 \mid T=1, V=1) + \hat{P}(T=0)\,\hat{P}(D=1 \mid T=0, V=1)},$$
    $$\widehat{Sp}_{BG} = \frac{\hat{P}(T=0)\,\hat{P}(D=0 \mid T=0, V=1)}{\hat{P}(T=1)\,\hat{P}(D=0 \mid T=1, V=1) + \hat{P}(T=0)\,\hat{P}(D=0 \mid T=0, V=1)}.$$
  • Inverse Probability Weighting Estimator
    Alonzo and Pepe [13] proposed the inverse probability weighting estimator (IPWE) method for PVB correction, based on the work of Horvitz and Thompson [36]. After estimating the verification probability PS_i, the IPWE method weights each observation in the verified sample by the inverse of PS_i to obtain the corrected Sn and Sp [13]. Sn and Sp for the IPWE method [13] are calculated as follows:
    $$\widehat{Sn}_{IPWE} = \frac{\sum_{i=1}^{n} T_i V_i D_i / \widehat{PS}_i}{\sum_{i=1}^{n} V_i D_i / \widehat{PS}_i},$$
    $$\widehat{Sp}_{IPWE} = \frac{\sum_{i=1}^{n} (1 - T_i)\, V_i (1 - D_i) / \widehat{PS}_i}{\sum_{i=1}^{n} V_i (1 - D_i) / \widehat{PS}_i}.$$
  • Multiple Imputation
    Harel and Zhou [21] proposed using MI, where each missing disease status is replaced by m > 1 plausible values, resulting in m complete data sets [5,21]. Each of these data sets is then analyzed by complete data methods; thereafter, the m estimates are combined to provide the final estimates [5,21]. In this study, logistic regression was utilized in the imputation step of the MI method. The disease status was imputed using the following logistic regression model fitted to the observed data:
    $$\mathrm{logit}[P(D_i = 1 \mid T_i)] = \beta_0 + \beta_1 T_i.$$
    Following the imputation, Sn and Sp for the MI method are calculated as follows:
    $$\overline{Sn}_{MI} = \frac{1}{m}\sum_{j=1}^{m} \widehat{Sn}_{FDA,j},$$
    $$\overline{Sp}_{MI} = \frac{1}{m}\sum_{j=1}^{m} \widehat{Sp}_{FDA,j}.$$
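The sketch below (our own R illustration, using the same dat from the earlier sketches) computes the CCA, BG, and IPWE estimates directly from these formulas; MI is omitted here because it is typically run through an imputation package such as mice:

```r
# Sketch of the comparison estimators on one PVB data set (illustration).
# dat: columns test, disease (NA when unverified), verified.
pvb_estimators <- function(dat) {
  cc <- dat[dat$verified == 1, ]  # verified (complete) cases
  # CCA: complete cases only (biased under PVB)
  sn_cca <- mean(cc$test[cc$disease == 1])
  sp_cca <- mean(1 - cc$test[cc$disease == 0])
  # BG: Bayes' theorem; P(T = t) from all N, P(D | T, V = 1) from verified
  pt1 <- mean(dat$test); pt0 <- 1 - pt1
  pd1_t1 <- mean(cc$disease[cc$test == 1])
  pd1_t0 <- mean(cc$disease[cc$test == 0])
  sn_bg <- pt1 * pd1_t1 / (pt1 * pd1_t1 + pt0 * pd1_t0)
  sp_bg <- pt0 * (1 - pd1_t0) /
    (pt1 * (1 - pd1_t1) + pt0 * (1 - pd1_t0))
  # IPWE: weight verified observations by the inverse propensity score
  ps <- predict(glm(verified ~ test, family = binomial, data = dat),
                newdata = cc, type = "response")
  sn_ipwe <- sum(cc$test * cc$disease / ps) / sum(cc$disease / ps)
  sp_ipwe <- sum((1 - cc$test) * (1 - cc$disease) / ps) /
    sum((1 - cc$disease) / ps)
  rbind(CCA  = c(Sn = sn_cca,  Sp = sp_cca),
        BG   = c(Sn = sn_bg,   Sp = sp_bg),
        IPWE = c(Sn = sn_ipwe, Sp = sp_ipwe))
}

pvb_estimators(dat)
```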
For the simulated data sets, the methods are compared by the mean of the estimates and the performance metrics (bias and SE), arranged by the sample sizes and Sn–Sp combinations. We did not consider coverage, i.e., the proportion of times the CI includes the true parameter value [19,32,33], as a performance metric in the simulation: because IPB is a bootstrap technique, the commonly used bootstrap CI calculation was implemented, and no new method of obtaining the CI was proposed.
For the clinical data sets, point estimates and the respective 95% CIs were estimated for comparison. For FDA and CCA, the CIs for Sn and Sp were calculated by using the Wald interval, while for the BG method, the calculation step given in the original article was followed [35]. For IPWE, the CIs were obtained by the bootstrap technique [13,14] using bootstrap bias-corrected and accelerated (BCa) interval [31]. For MI, the CIs were obtained by Rubin’s rule [11,37].

2.3.3. Experimental Setup

The R statistical programming language [38] version 3.6.3 was used to run the experiments within the RStudio [39] integrated development environment. The mice [40] (version 3.14.0) and simstudy [41] (version 0.5.0) R packages were used. The seed number for the random number generator was set to 3209673. The other experimental settings were the number of simulation runs B = 500 and the number of bootstrap samples b = 1000 [22]. As for MI, the number of imputations was m = 100 for the simulated data sets [42,43], while m was set equal to the percentage of incomplete cases for the real clinical data sets [44,45,46].

3. Results

3.1. Simulated Data Sets

The simulation results for the FDA, CCA, and PVB correction methods for p = 0.4 are displayed in Table 1. The results are arranged by the sample sizes N = 200 and 1000, followed by the Sn = (0.6, 0.9) and Sp = (0.6, 0.9) parameter combinations. The proportions of verification P(V = 1) were 0.59, 0.52, and 0.64 for the (Sn, Sp) pairs (0.6, 0.6), (0.6, 0.9), and (0.9, 0.6), respectively. For the experimental conditions with p = 0.4, analysis with CCA (i.e., without any correction) resulted in biased estimates, while in the ideal research situation with full data available, FDA showed very small bias. For all PVB correction methods, including IPB, the bias values for Sn and Sp were very small for all Sn and Sp combinations, and reduced further at the larger N = 1000. However, of all the correction methods, IPB displayed relatively larger SEs for both Sn and Sp estimation at N = 200, while the SEs became smaller at N = 1000.
Next, the simulation results for p = 0.1 are displayed in Table 2. Similarly, the results are arranged by the sample sizes, followed by the Sn and Sp parameter combinations. The proportions of verification were 0.57, 0.46, and 0.58 for the (Sn, Sp) pairs (0.6, 0.6), (0.6, 0.9), and (0.9, 0.6), respectively. The results showed different patterns for p = 0.1: the bias values for Sp were very small for all Sn and Sp combinations at all sample sizes, and all PVB correction methods, including IPB, showed small bias values for Sn for all Sn and Sp combinations. However, all correction methods underestimated the true Sn of 0.9, showing higher bias for the (Sn = 0.9, Sp = 0.6) combination at N = 200, with MI having the highest bias value at −0.100, followed by −0.068 for IPB and −0.063 for BG and IPWE. MI also showed relatively higher bias values for Sn for the (Sn = 0.6, Sp = 0.6) and (Sn = 0.6, Sp = 0.9) combinations at N = 200. CCA instead showed the smallest bias value, even in comparison to FDA; this was because CCA consistently overestimated Sn, so at a higher true Sn its estimate coincided with the true value, hence the pseudo-good result. At N = 1000, all PVB correction methods showed very small bias values for Sn. Similar to p = 0.4, IPB displayed relatively larger SEs for both Sn and Sp estimation at N = 200, although the SEs became smaller at N = 1000.

3.2. Clinical Data Sets

The results for the PVB correction methods using the clinical data sets are displayed in Table 3. CCA is displayed to illustrate the results without bias correction. All correction methods, including the proposed IPB, showed closely similar point estimates for the Sn and Sp of the hepatic data set and the Sp of the diaphanography data set. The MI method showed a slightly lower point estimate of Sn for the diaphanography data set compared to the rest of the methods. For the hepatic data set, all methods showed relatively similar 95% CIs. For the diaphanography data set, the 95% CIs of Sn for the BG and MI methods were close to each other, while the 95% CI of Sn for the IPB method was the widest among the methods; the same was observed for its 95% CI of Sp for this data set.

4. Discussion

The objective of this study was to investigate the applicability of the IPB sampling method in the context of PVB correction. Based on the simulated data sets, the IPB method performed well in terms of bias, although its SE was relatively larger than those of the other methods. Its performance was consistent for both moderate and low disease prevalence, while the MI method was the most affected at low disease prevalence. All methods showed better results at a larger sample size. All correction methods showed very small bias for Sp, while they varied in performance in correcting the Sn estimates. Based on the clinical data sets, IPB was consistent with the other correction methods for the hepatic data set. For the diaphanography data set, although the point estimates were consistent with the other methods, the CIs were relatively wider than those of the other methods.
Based on the results from the simulated data sets, in terms of bias, IPB was found to be as good as BG and IPWE in most experimental conditions, while being better than MI at low disease prevalence for estimating Sn. However, the SEs of Sn and Sp for IPB were larger than those of the other methods, most notably at a small sample size and low disease prevalence. As IPB only bootstraps the verified observations (V = 1), the bootstrapped sample size n is smaller than N (i.e., n = P(V = 1) × N). This in effect leads to larger standard errors, as it is generally known that smaller sample sizes lead to larger standard errors [31]. This explains why the SEs for IPB improved as N became larger, since n also became larger. In addition, IPB showed larger SEs in the low prevalence setting because the diseased group (D = 1), whose size is p × n, became smaller with lower disease prevalence. Again, as N became larger, the SEs for IPB improved as the group size also became larger.
Next, based on the results from the clinical data sets, IPB showed consistent results for both the point and interval estimates for the hepatic data set. However, it showed wider 95% CIs of Sn and Sp for the diaphanography data set. As observed in the simulated data sets, IPB exhibited a relatively large SE when the disease prevalence was low. The diaphanography data set has a large percentage of missing observations, at 90.2%. Quite likely, the true disease prevalence was also low for this data set, although this could not be verified without the full data. At the same time, the observed sample size for this data set was only 88 patients. These factors might explain the wide CIs for IPB, and also indicate that IPB is sensitive to small sample sizes, which in effect lead to larger SEs. As pointed out by Nahorniak et al. [18], although the SE for IPB was expected to be reasonably accurate, further work is required to assess the performance of its SE.
While IPB was shown to be a viable alternative PVB correction method, its precision, as indicated by the SE, was slightly lower. Despite this shortcoming, IPB has several advantages over the other PVB correction methods. First, IPB was found to be less biased than MI at low disease prevalence, while being comparable to BG and IPWE in terms of bias. Second, IPB shares an advantage with MI in that both allow the use of any full data analytic method, which BG and IPWE do not. The difference between IPB and MI is that, while MI restores the full data of size N by imputing the missing values of the outcome D, IPB restores the correct distribution of the data containing the complete cases n only. The ability to utilize the full data approach is advantageous when applying a new method for PVB correction, as shown by Roldán-Nofuentes and Regad [47] in applying MI to the estimation of the average Kappa coefficient of a binary diagnostic test, where the IPB method might also be applicable. Third, IPB is more straightforward to use than MI, as it only requires the estimation of the PS values, followed by the weighted bootstrap sampling procedure. In contrast, there are many imputation methods to choose from for MI [37], and the performance of MI depends on the chosen imputation method [48]. Finally, IPB shares an advantage with IPWE by using the PS (i.e., the probability of verification given the test result, P(V = 1 | T = t)). While the BG-based and MI methods rely on a correctly estimated probability of disease given the test result, P(D = d | T = t), the PS-based methods rely on a correct P(V = 1 | T = t) to perform the correction. Since P(D = d | T = t) will be incorrect when a case-control study design is used for a diagnostic accuracy study [8], the use of PS-based methods is advantageous in this situation.

5. Conclusions

PVB correction is important to ensure valid results for diagnostic accuracy studies affected by the PVB issue. Various correction methods have been developed, each with strengths and limitations. The IPB method is a general method for correcting sampling bias in model-based analysis, and its utility in PVB correction had not been investigated before. This study investigated the IPB sampling method in the context of PVB correction under the MAR assumption for binary diagnostic tests. The results showed that, for PVB correction, IPB demonstrated low bias, indicating that the method is accurate for the estimation of Sn and Sp. However, IPB showed a slightly higher SE than the other correction methods, which indicates that the method is less precise. Despite this issue, as highlighted in the previous section, IPB has several advantages over the other PVB correction methods. It is recommended to use IPB as an alternative to MI when debiased data are required for further analysis with full data analytic methods. Nevertheless, since the main disadvantage of IPB at this juncture is its larger SE, further research must be conducted to overcome this issue. In addition, since IPB is itself a bootstrap technique, more research can be conducted on different bootstrap intervals to find the most suitable one in the context of PVB correction.

Author Contributions

Conceptualization, W.N.A. and U.K.Y.; methodology, W.N.A. and U.K.Y.; software, W.N.A.; validation, W.N.A. and U.K.Y.; formal analysis, W.N.A.; investigation, W.N.A.; resources, U.K.Y.; data curation, W.N.A.; writing—original draft preparation, W.N.A.; writing—review and editing, W.N.A. and U.K.Y.; supervision, U.K.Y.; project administration, W.N.A.; funding acquisition, W.N.A. and U.K.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The article processing charge was funded by the Research Creativity and Management Office, School of Computer Sciences, and School of Medical Sciences, Universiti Sains Malaysia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and code presented in this article are available from this GitHub repository: https://github.com/wnarifin/ipb_in_pvb (accessed on 10 November 2022).

Acknowledgments

We thank our colleagues at the School of Computer Sciences and School of Medical Sciences, Universiti Sains Malaysia for their comments on the early findings of this study and this article’s draft.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
b: Number of bootstrap samples
B: Number of repetitions
BG: Begg and Greenes' method
CCA: Complete case analysis
CI: Confidence interval
D: Disease status
FDA: Full data analysis
IPB: Inverse probability bootstrap
IPWE: Inverse probability weighting estimator
m: Number of imputations
MAR: Missing at random
MI: Multiple imputation
n: Sample size for complete cases
N: Sample size
PVB: Partial verification bias
SE: Standard error
Sn: Sensitivity
Sp: Specificity
T: Test result
V: Verification status

References

  1. O’Sullivan, J.W.; Banerjee, A.; Heneghan, C.; Pluddemann, A. Verification bias. BMJ Evid. Based Med. 2018, 23, 54–55.
  2. Umemneku Chikere, C.M.; Wilson, K.; Graziadio, S.; Vale, L.; Allen, A.J. Diagnostic test evaluation methodology: A systematic review of methods employed to evaluate diagnostic tests in the absence of gold standard–An update. PLoS ONE 2019, 14, e0223832.
  3. Zhou, X.H.; Obuchowski, N.A.; McClish, D.K. Statistical Methods in Diagnostic Medicine, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2011.
  4. Pepe, M.S. The Statistical Evaluation of Medical Tests for Classification and Prediction; Oxford University Press: New York, NY, USA, 2011.
  5. Alonzo, T.A. Verification bias-impact and methods for correction when assessing accuracy of diagnostic tests. Revstat Stat. J. 2014, 12, 67–83.
  6. de Groot, J.A.H.; Bossuyt, P.M.M.; Reitsma, J.B.; Rutjes, A.W.S.; Dendukuri, N.; Janssen, K.J.M.; Moons, K.G.M. Verification problems in diagnostic accuracy studies: Consequences and solutions. BMJ 2011, 343, d4770.
  7. Schmidt, R.L.; Walker, B.S.; Cohen, M.B. Verification and classification bias interactions in diagnostic test accuracy studies for fine-needle aspiration biopsy. Cancer Cytopathol. 2015, 123, 193–201.
  8. Kohn, M.A. Studies of diagnostic test accuracy: Partial verification bias and test result-based sampling. J. Clin. Epidemiol. 2022, 145, 179–182.
  9. Schmidt, R.L.; Factor, R.E. Understanding Sources of Bias in Diagnostic Accuracy Studies. Arch. Pathol. Lab. Med. 2013, 137, 558–565.
  10. Rutjes, A.W.S.; Reitsma, J.B.; Coomarasamy, A.; Khan, K.S.; Bossuyt, P.M.M. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol. Assess. 2007, 11, 50.
  11. Arifin, W.N.; Yusof, U.K. Correcting for partial verification bias in diagnostic accuracy studies: A tutorial using R. Stat. Med. 2022, 41, 1709–1727.
  12. Zhou, X.H. Effect of verification bias on positive and negative predictive values. Stat. Med. 1994, 13, 1737–1745.
  13. Alonzo, T.A.; Pepe, M.S. Assessing accuracy of a continuous screening test in the presence of verification bias. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2005, 54, 173–190.
  14. He, H.; McDermott, M.P. A robust method using propensity score stratification for correcting verification bias for binary tests. Biostatistics 2012, 13, 32–47.
  15. Day, E.; Eldred-Evans, D.; Prevost, A.T.; Ahmed, H.U.; Fiorentino, F. Adjusting for verification bias in diagnostic accuracy measures when comparing multiple screening tests—An application to the IP1-PROSTAGRAM study. BMC Med. Res. Methodol. 2022, 22, 70.
  16. Robles, C.; Rudzite, D.; Polaka, I.; Sjomina, O.; Tzivian, L.; Kikuste, I.; Tolmanis, I.; Vanags, A.; Isajevs, S.; Liepniece-Karele, I.; et al. Assessment of Serum Pepsinogens with and without Co-Testing with Gastrin-17 in Gastric Cancer Risk Assessment—Results from the GISTAR Pilot Study. Diagnostics 2022, 12, 1746.
  17. El Chamieh, C.; Vielh, P.; Chevret, S. Statistical methods for evaluating the fine needle aspiration cytology procedure in breast cancer diagnosis. BMC Med. Res. Methodol. 2022, 22, 40.
  18. Nahorniak, M.; Larsen, D.P.; Volk, C.; Jordan, C.E. Using Inverse Probability Bootstrap Sampling to Eliminate Sample Induced Bias in Model Based Analysis of Unequal Probability Samples. PLoS ONE 2015, 10, e0131765.
  19. Morris, T.P.; White, I.R.; Crowther, M.J. Using simulation studies to evaluate statistical methods. Stat. Med. 2019, 38, 2074–2102.
  20. Kosinski, A.S.; Barnhart, H.X. Accounting for nonignorable verification bias in assessment of diagnostic tests. Biometrics 2003, 59, 163–171.
  21. Harel, O.; Zhou, X.H. Multiple imputation for correcting verification bias. Stat. Med. 2006, 25, 3769–3786.
  22. Ünal, İ.; Burgut, H.R. Verification bias on sensitivity and specificity measurements in diagnostic medicine: A comparison of some approaches used for correction. J. Appl. Stat. 2014, 41, 1091–1104.
  23. Rochani, H.; Samawi, H.M.; Vogel, R.L.; Yin, J. Correction of Verification Bias using Log-Linear Models for a Single Binary-Scale Diagnostic Tests. J. Biom. Biostat. 2015, 6, 266.
  24. Drum, D.E.; Christacopoulos, J.S. Hepatic scintigraphy in clinical decision making. J. Nucl. Med. 1972, 13, 908–915.
  25. Marshall, V.; Williams, D.C.; Smith, K.D. Diaphanography as a means of detecting breast cancer. Radiology 1984, 150, 339–343.
  26. Greenes, R.; Begg, C. Assessment of diagnostic technologies. Methodology for unbiased estimation from samples of selectively verified patients. Investig. Radiol. 1985, 20, 751–756.
  27. Zhou, X.H. Maximum likelihood estimators of sensitivity and specificity corrected for verification bias. Commun. Stat. Theory Methods 1993, 22, 3177–3198.
  28. Austin, P.C. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivar. Behav. Res. 2011, 46, 399–424.
  29. Yasunaga, H. Introduction to applied statistics—Chapter 1 propensity score analysis. Ann. Clin. Epidemiol. 2020, 2, 33–37.
  30. Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: New York, NY, USA, 1997.
  31. Woodward, M. Epidemiology: Study Design and Data Analysis; CRC Press: Boca Raton, FL, USA, 2014.
  32. Burton, A.; Altman, D.G.; Royston, P.; Holder, R.L. The design of simulation studies in medical statistics. Stat. Med. 2006, 25, 4279–4292.
  33. Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Duxbury Advanced Series; Cengage Learning: Delhi, India, 2002.
  34. de Groot, J.A.H.; Janssen, K.J.M.; Zwinderman, A.H.; Bossuyt, P.M.M.; Reitsma, J.B.; Moons, K.G.M. Correcting for partial verification bias: A comparison of methods. Ann. Epidemiol. 2011, 21, 139–148.
  35. Begg, C.B.; Greenes, R.A. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983, 39, 207–215.
  36. Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
  37. van Buuren, S. Flexible Imputation of Missing Data, 2nd ed.; Chapman & Hall/CRC Interdisciplinary Statistics; CRC Press: Boca Raton, FL, USA, 2018.
  38. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020.
  39. RStudio Team. RStudio: Integrated Development for R; RStudio, Inc.: Boston, MA, USA, 2020.
  40. van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 1–67.
  41. Goldfeld, K.; Wujciak-Jens, J. simstudy: Illuminating research methods through data generation. J. Open Source Softw. 2020, 5, 2763.
  42. Dong, Y.; Peng, C.Y.J. Principled missing data methods for researchers. SpringerPlus 2013, 2, 222.
  43. Royston, P.; White, I. Multiple Imputation by Chained Equations (MICE): Implementation in Stata. J. Stat. Softw. 2011, 45, 1–20.
  44. Bodner, T.E. What Improves with Increased Missing Data Imputations? Struct. Equ. Model. Multidiscip. J. 2008, 15, 651–675.
  45. White, I.R.; Royston, P.; Wood, A.M. Multiple imputation using chained equations: Issues and guidance for practice. Stat. Med. 2011, 30, 377–399.
  46. Pedersen, A.; Mikkelsen, E.; Cronin-Fenton, D.; Kristensen, N.; Pham, T.M.; Pedersen, L.; Petersen, I. Missing data and multiple imputation in clinical epidemiological research. Clin. Epidemiol. 2017, 9, 157–166.
  47. Roldán-Nofuentes, J.A.; Regad, S.B. Estimation of the Average Kappa Coefficient of a Binary Diagnostic Test in the Presence of Partial Verification. Mathematics 2021, 9, 1694.
  48. Faisal, S.; Tutz, G. Multiple imputation using nearest neighbor methods. Inf. Sci. 2021, 570, 500–516.
Table 1. Comparison between IPB and existing PVB correction methods for p = 0.4 with N = 200 and 1000 under three combinations of Sn and Sp.

| Methods | Sn: Mean | Sn: Bias | Sn: SE | Sp: Mean | Sp: Bias | Sp: SE |
|---|---|---|---|---|---|---|
| N = 200, Sn = 0.6, Sp = 0.6 | | | | | | |
| FDA | 0.603 | 0.003 | 0.055 | 0.602 | 0.002 | 0.044 |
| CCA | 0.754 | 0.154 | 0.060 | 0.430 | −0.170 | 0.060 |
| BG | 0.607 | 0.007 | 0.072 | 0.602 | 0.002 | 0.050 |
| IPWE | 0.607 | 0.007 | 0.072 | 0.602 | 0.002 | 0.050 |
| MI | 0.605 | 0.005 | 0.075 | 0.599 | −0.001 | 0.052 |
| IPB | 0.609 | 0.009 | 0.105 | 0.602 | 0.002 | 0.078 |
| N = 200, Sn = 0.6, Sp = 0.9 | | | | | | |
| FDA | 0.603 | 0.003 | 0.055 | 0.902 | 0.002 | 0.027 |
| CCA | 0.754 | 0.154 | 0.061 | 0.822 | −0.078 | 0.054 |
| BG | 0.608 | 0.008 | 0.075 | 0.903 | 0.003 | 0.030 |
| IPWE | 0.608 | 0.008 | 0.075 | 0.903 | 0.003 | 0.030 |
| MI | 0.605 | 0.005 | 0.076 | 0.901 | 0.001 | 0.031 |
| IPB | 0.605 | 0.005 | 0.118 | 0.903 | 0.003 | 0.049 |
| N = 200, Sn = 0.9, Sp = 0.6 | | | | | | |
| FDA | 0.899 | −0.001 | 0.033 | 0.601 | 0.001 | 0.044 |
| CCA | 0.945 | 0.045 | 0.027 | 0.427 | −0.173 | 0.057 |
| BG | 0.896 | −0.004 | 0.046 | 0.600 | 0.000 | 0.046 |
| IPWE | 0.896 | −0.004 | 0.046 | 0.600 | 0.000 | 0.046 |
| MI | 0.889 | −0.011 | 0.046 | 0.598 | −0.002 | 0.047 |
| IPB | 0.894 | −0.006 | 0.064 | 0.595 | −0.005 | 0.072 |
| N = 1000, Sn = 0.6, Sp = 0.6 | | | | | | |
| FDA | 0.602 | 0.002 | 0.023 | 0.600 | 0.000 | 0.019 |
| CCA | 0.751 | 0.151 | 0.026 | 0.428 | −0.172 | 0.026 |
| BG | 0.602 | 0.002 | 0.031 | 0.600 | 0.000 | 0.022 |
| IPWE | 0.602 | 0.002 | 0.031 | 0.600 | 0.000 | 0.022 |
| MI | 0.601 | 0.001 | 0.032 | 0.599 | −0.001 | 0.022 |
| IPB | 0.599 | −0.001 | 0.044 | 0.601 | 0.001 | 0.034 |
| N = 1000, Sn = 0.6, Sp = 0.9 | | | | | | |
| FDA | 0.602 | 0.002 | 0.023 | 0.900 | 0.000 | 0.012 |
| CCA | 0.752 | 0.152 | 0.026 | 0.818 | −0.082 | 0.025 |
| BG | 0.602 | 0.002 | 0.031 | 0.900 | 0.000 | 0.014 |
| IPWE | 0.602 | 0.002 | 0.031 | 0.900 | 0.000 | 0.014 |
| MI | 0.602 | 0.002 | 0.032 | 0.900 | 0.000 | 0.014 |
| IPB | 0.599 | −0.001 | 0.048 | 0.899 | −0.001 | 0.024 |
| N = 1000, Sn = 0.9, Sp = 0.6 | | | | | | |
| FDA | 0.901 | 0.001 | 0.015 | 0.601 | 0.001 | 0.020 |
| CCA | 0.948 | 0.048 | 0.012 | 0.429 | −0.171 | 0.027 |
| BG | 0.901 | 0.001 | 0.022 | 0.601 | 0.001 | 0.021 |
| IPWE | 0.901 | 0.001 | 0.022 | 0.601 | 0.001 | 0.021 |
| MI | 0.899 | −0.001 | 0.023 | 0.600 | 0.000 | 0.021 |
| IPB | 0.901 | 0.001 | 0.028 | 0.600 | 0.000 | 0.033 |

Abbreviations: CCA, complete case analysis; BG, Begg and Greenes’ method; FDA, full data analysis; IPWE, inverse probability weighting estimator; MI, multiple imputation; N, sample size; p, disease prevalence; SE, standard error; Sn, sensitivity; Sp, specificity; IPB, inverse probability bootstrap.
Table 2. Comparison between IPB and existing PVB correction methods for p = 0.1 with N = 200 and 1000 under three combinations of Sn and Sp.

| Methods | Sn: Mean | Sn: Bias | Sn: SE | Sp: Mean | Sp: Bias | Sp: SE |
|---|---|---|---|---|---|---|
| N = 200, Sn = 0.6, Sp = 0.6 | | | | | | |
| FDA | 0.596 | −0.004 | 0.112 | 0.601 | 0.001 | 0.037 |
| CCA | 0.743 | 0.143 | 0.117 | 0.429 | −0.171 | 0.049 |
| BG | 0.603 | 0.003 | 0.144 | 0.601 | 0.001 | 0.038 |
| IPWE | 0.603 | 0.003 | 0.144 | 0.601 | 0.001 | 0.038 |
| MI | 0.579 | −0.021 | 0.136 | 0.598 | −0.002 | 0.039 |
| IPB | 0.595 | −0.005 | 0.202 | 0.606 | 0.006 | 0.062 |
| N = 200, Sn = 0.6, Sp = 0.9 | | | | | | |
| FDA | 0.600 | 0.000 | 0.115 | 0.900 | 0.000 | 0.022 |
| CCA | 0.738 | 0.138 | 0.119 | 0.818 | −0.082 | 0.043 |
| BG | 0.598 | −0.002 | 0.143 | 0.900 | 0.000 | 0.023 |
| IPWE | 0.598 | −0.002 | 0.143 | 0.900 | 0.000 | 0.023 |
| MI | 0.568 | −0.032 | 0.134 | 0.900 | 0.000 | 0.023 |
| IPB | 0.599 | −0.001 | 0.214 | 0.898 | −0.002 | 0.042 |
| N = 200, Sn = 0.9, Sp = 0.6 | | | | | | |
| FDA | 0.875 | −0.025 | 0.062 | 0.602 | 0.002 | 0.035 |
| CCA | 0.910 | 0.010 | 0.042 | 0.430 | −0.170 | 0.048 |
| BG | 0.837 | −0.063 | 0.068 | 0.599 | −0.001 | 0.037 |
| IPWE | 0.837 | −0.063 | 0.068 | 0.599 | −0.001 | 0.037 |
| MI | 0.800 | −0.100 | 0.080 | 0.597 | −0.003 | 0.037 |
| IPB | 0.832 | −0.068 | 0.133 | 0.605 | 0.005 | 0.059 |
| N = 1000, Sn = 0.6, Sp = 0.6 | | | | | | |
| FDA | 0.600 | 0.000 | 0.048 | 0.600 | 0.000 | 0.017 |
| CCA | 0.754 | 0.154 | 0.053 | 0.429 | −0.171 | 0.022 |
| BG | 0.607 | 0.007 | 0.067 | 0.601 | 0.001 | 0.017 |
| IPWE | 0.607 | 0.007 | 0.067 | 0.601 | 0.001 | 0.017 |
| MI | 0.601 | 0.001 | 0.067 | 0.600 | 0.000 | 0.017 |
| IPB | 0.605 | 0.005 | 0.093 | 0.600 | 0.000 | 0.027 |
| N = 1000, Sn = 0.6, Sp = 0.9 | | | | | | |
| FDA | 0.600 | 0.000 | 0.048 | 0.900 | 0.000 | 0.010 |
| CCA | 0.749 | 0.149 | 0.052 | 0.819 | −0.081 | 0.019 |
| BG | 0.601 | 0.001 | 0.065 | 0.900 | 0.000 | 0.010 |
| IPWE | 0.601 | 0.001 | 0.065 | 0.900 | 0.000 | 0.010 |
| MI | 0.593 | −0.007 | 0.065 | 0.900 | 0.000 | 0.010 |
| IPB | 0.604 | 0.004 | 0.099 | 0.901 | 0.001 | 0.018 |
| N = 1000, Sn = 0.9, Sp = 0.6 | | | | | | |
| FDA | 0.899 | −0.001 | 0.028 | 0.600 | 0.000 | 0.016 |
| CCA | 0.947 | 0.047 | 0.025 | 0.428 | −0.172 | 0.021 |
| BG | 0.900 | 0.000 | 0.044 | 0.600 | 0.000 | 0.017 |
| IPWE | 0.900 | 0.000 | 0.044 | 0.600 | 0.000 | 0.017 |
| MI | 0.889 | −0.011 | 0.046 | 0.600 | 0.000 | 0.017 |
| IPB | 0.897 | −0.003 | 0.060 | 0.601 | 0.001 | 0.027 |

Abbreviations: CCA, complete case analysis; BG, Begg and Greenes’ method; FDA, full data analysis; IPWE, inverse probability weighting estimator; MI, multiple imputation; N, sample size; p, disease prevalence; SE, standard error; Sn, sensitivity; Sp, specificity; IPB, inverse probability bootstrap.
Table 3. Sn and Sp estimates of IPB and other methods with the respective 95% CIs using clinical data sets.

| Methods | Hepatic: Sn (95% CI) | Hepatic: Sp (95% CI) | Diaphanography: Sn (95% CI) | Diaphanography: Sp (95% CI) |
|---|---|---|---|---|
| CCA | 0.895 (0.858, 0.933) | 0.628 (0.526, 0.730) | 0.788 (0.648, 0.927) | 0.800 (0.694, 0.906) |
| BG | 0.836 (0.788, 0.884) | 0.738 (0.662, 0.815) | 0.292 (0.134, 0.449) | 0.973 (0.958, 0.988) |
| IPWE | 0.836 (0.784, 0.883) | 0.738 (0.651, 0.809) | 0.292 (0.165, 0.509) | 0.973 (0.955, 0.986) |
| MI | 0.834 (0.782, 0.885) | 0.738 (0.661, 0.815) | 0.279 (0.124, 0.435) | 0.972 (0.957, 0.987) |
| IPB | 0.838 (0.793, 0.882) | 0.738 (0.653, 0.822) | 0.290 (0.059, 0.520) | 0.973 (0.935, 1.000) |

Abbreviations: CCA, complete case analysis; CI, confidence interval; BG, Begg and Greenes’ method; IPWE, inverse probability weighting estimator; MI, multiple imputation; Sn, sensitivity; Sp, specificity; IPB, inverse probability bootstrap.
