Article

Modified Local Linear Estimators in Partially Linear Additive Models with Right-Censored Data Based on Different Censorship Solution Techniques

1
Department of Statistics, Mugla Sıtkı Kocman University, Mugla 48000, Turkey
2
Department of Mathematics and Statistics, Brock University, St. Catharines, ON L2S 3A1, Canada
*
Author to whom correspondence should be addressed.
Entropy 2023, 25(9), 1307; https://doi.org/10.3390/e25091307
Submission received: 12 July 2023 / Revised: 31 August 2023 / Accepted: 6 September 2023 / Published: 7 September 2023
(This article belongs to the Special Issue Information-Theoretic Criteria for Statistical Model Selection)

Abstract

This paper introduces a modified local linear regression (LLR) estimator for partially linear additive models (PLAM) when the response variable is subject to random right-censoring. For right-censored data, PLAM offers a more flexible and realistic approach to estimation by involving multiple parametric and nonparametric components, in contrast to the widely used partially linear models, which feature a single univariate nonparametric function. The LLR method is employed to estimate the unknown smooth functions using a modified backfitting algorithm, delivering a non-iterative solution for the right-censored PLAM. To address the censorship issue, three approaches are employed: synthetic data transformation (ST), Kaplan–Meier weights (KMW), and the kNN imputation technique (kNNI). Asymptotic properties of the modified backfitting estimators are detailed for both ST and KMW solutions. The advantages and disadvantages of these methods are discussed both theoretically and practically. Comprehensive simulation studies and real-world data examples are conducted to assess the performance of the introduced estimators. The results indicate that LLR performs well with both KMW and kNNI in the majority of simulation scenarios, and the real data example supports this finding.

1. Introduction

Partially linear models (PLMs) have gained considerable attention in the field of survival analysis, especially for modeling right-censored data. The flexibility and capability of PLMs to capture both parametric and nonparametric components make them a favored choice for analyzing survival data with complex relationships. The classical PLM is expressed as follows for completely observed data with a sample size $n$:
$$y_i = x_i^T \beta + f(t_i) + \varepsilon_i, \quad 1 \le i \le n \tag{1}$$
where the $y_i$'s are the completely observed response values (or lifetimes in survival analysis), $x_i \in \mathbb{R}^{p}$ are the parametric covariates (the rows of the $n \times p$ design matrix), $\beta = (\beta_1, \ldots, \beta_p)^T$ denotes the $p \times 1$ dimensional vector of regression coefficients, and $f(\cdot)$ is the univariate unknown smooth function to be estimated based on the values of the nonparametric covariate $t_i$. Finally, the $\varepsilon_i$'s are the random error terms with (i) $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$, (ii) $\mathrm{Cov}(\varepsilon_i, x_i) = 0$, and (iii) $E(\varepsilon_i \mid x_i, t_i) = 0$. Without censored data, model (1) has been studied by many researchers, and some of the notable studies include [1,2], among others. Additionally, ref. [3] proposed the local linear regression (LLR) estimation for model (1). In the right-censored case, the response variable $y_i$ is incompletely observed and censored from the right by a random censoring variable $\{c_i\}_{i=1}^{n}$, under the assumption that $x_i$ and $t_i$ are completely observed. Accordingly, the censoring mechanism and the resulting observed variables can be written as follows:
$$z_i = \min(y_i, c_i) \quad \text{with} \quad \delta_i = \begin{cases} 0, & \text{if } y_i \text{ is censored } (y_i > c_i) \\ 1, & \text{if } y_i \text{ is uncensored } (y_i \le c_i) \end{cases} \tag{2}$$
where $z_i$ denotes the incompletely observed response variable with the censoring indicator $\delta_i$. Thus, instead of $y_i$, the data pairs $(z_i, \delta_i)$ are used in the modeling procedure. There are several important studies on the estimation of model (1) under right-censored data, as given in (2), such as refs. [4,5,6], among others.
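As a minimal illustration of the censoring mechanism in (2), the following Python sketch builds the observed pairs from lifetimes and censoring times. The function name `censor` and the toy values are illustrative assumptions and are not taken from the paper.

```python
def censor(y, c):
    """Build the observed pairs (z_i, delta_i) of Eq. (2) from lifetimes y and censoring times c."""
    z = [min(yi, ci) for yi, ci in zip(y, c)]
    # delta_i = 1 when the lifetime is observed (y_i <= c_i), 0 when it is censored
    delta = [1 if yi <= ci else 0 for yi, ci in zip(y, c)]
    return z, delta


# Hypothetical toy values, not data from the paper
y = [2.0, 5.0, 3.5]
c = [4.0, 1.5, 3.5]
z, delta = censor(y, c)
print(z, delta)  # [2.0, 1.5, 3.5] [1, 0, 1]
```

Only the second observation is censored here, so its recorded value is the censoring time 1.5 rather than the lifetime 5.0.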
While model (1) offers reliable performance for both censored and uncensored data due to its ability to incorporate both parametric and nonparametric components, it encompasses only a single nonparametric component. This constraint requires researchers to select a sole nonparametric covariate from the dataset, a premise that might not align with many real-world situations. Furthermore, adhering to this limitation could result in less dependable estimations unless the dataset genuinely contains only one nonparametric covariate. To improve estimation accuracy and provide a more adaptable model that considers the right-censored response variable $z_i$, this research delves into the partially linear additive model (PLAM), tailored for $q$ nonparametric functions:
$$z_i = \beta_0 + x_i^T \beta + \sum_{j=1}^{q} f_j(t_{ij}) + \varepsilon_i, \quad 1 \le i \le n \tag{3}$$
Here, $q$ represents the number of nonparametric components, a value determined based on the nature of the relationship between $t_{ij}$ and $y_i$. When this relationship cannot be adequately captured by a linear parametric component, the covariate is treated as nonparametric and characterized by an unknown smooth function $f_j(t_{ij})$. As a result, the overall nonparametric component of model (3) is formed by the summation of these functions. The use of PLAMs in survival analysis with right-censored data allows for more realistic modeling of the relationship between covariates and survival outcomes by incorporating both multiple parametric and nonparametric components. By introducing nonparametric components, PLAMs provide a more adaptable framework for capturing potential nonparametric relationships between covariates and survival times. It is crucial to acknowledge that model (3) cannot be estimated unless the censorship problem is suitably addressed. Numerous studies in the literature have concentrated on estimating (3) for data that are fully observed and free of censoring. Ref. [7] discussed the combination of smoothing splines with semiparametric additive models, while ref. [8] studied the asymptotic properties of M-estimators for model (3). Additionally, ref. [9] presented a comprehensive review of partially linear additive models based on various smoothing techniques.
Distinct from the studies previously mentioned, this paper presents modified LLR estimators for PLAM (3) using three distinct censoring solutions: synthetic data transformation (ST), Kaplan–Meier weights (KMW), and kNN imputation (kNNI). Through the examination of these modified estimators and the exploration of various techniques to tackle censorship, valuable insights can be gained, and the accuracy and effectiveness of modeling right-censored data may be improved. This paper also explains the procedure for obtaining these estimators, encompassing the modified backfitting technique and a non-iterative approach, accompanied by comparative numerical studies. To the best of our knowledge, this research fills a gap in the literature on modeling right-censored data.
The remaining part of the paper is organized as follows: In Section 2, the fundamentals of right-censored data are presented, and solution approaches are explained. Section 3 covers the estimation of PLAM using modified LLR estimators based on various censorship solution techniques. In Section 4, the statistical properties of the estimators are provided. Section 5 and Section 6 present simulation and real data studies, respectively. Finally, Section 7 includes the conclusions of the paper.

2. Right-Censored Data and Solution Methods

In this section, we provide theoretical insights into modeling right-censored data. Let $F$ and $G$ represent the probability distribution functions of the response variable ($y_i$) and the censoring variable ($c_i$), respectively. Thus, for any arbitrary data point $u$, these functions can be expressed as follows:
$$F(u) = P(y_i \le u) \quad \text{and} \quad G(u) = P(c_i \le u) \tag{4}$$
It is essential to highlight that the estimation procedure for the model, utilizing the distributions specified in (4), critically relies on two "censorship assumptions" that constrain all variables within model (2). These assumptions, as outlined by ref. [10] and elaborated by ref. [11] in the context of right-censored regression models, are of considerable importance. In essence, the dataset must meet the following criteria.
A1. $y_i$ and $c_i$ are independent.
A2. $P(y_i \le c_i \mid y_i, x_i, t_{ij}) = P(y_i \le c_i \mid y_i)$.
Assumptions (A1) and (A2) can be explained as follows: (A2) posits that, given $y_i$, the covariates in the model carry no further information about the censoring of $y_i$. Assumption (A1) is particularly crucial when implementing censorship solutions. For a more in-depth discussion, see ref. [10]. Building on these assumptions, this section presents the three censorship solutions. Additionally, towards the end of the section, a figure illustrates the practical distinctions between the synthetic data transformation and kNN imputation methods.
Synthetic data transformation: To incorporate the impact of censorship into the modeling procedure, synthetic data transformation is a commonly employed solution method. Here, the incomplete response pairs $(z_i, \delta_i)$, $i = 1, \ldots, n$, are replaced by a synthetic response variable, as proposed by ref. [12]. Assuming that $G$ is a continuous and known function, the observed lifetimes $z_i$ can be modified in a manner that ensures unbiased estimation:
$$z_{iG} = \frac{\delta_i z_i}{1 - G(z_i)}, \quad i = 1, 2, \ldots, n \tag{5}$$
where $z_{iG}$ represents the synthetic response variable with $E(z_{iG} \mid x_i, t_{ij}) = E(z_i \mid x_i, t_{ij}) = x_i^T \beta + \sum_{j=1}^{q} f_j(t_{ij})$. In practice, however, the true distribution $G$ of the censoring variable is unknown. To address this challenge, ref. [12] suggested replacing $G$ with its estimate, known as the Product-Limit estimator (Kaplan–Meier estimator). This estimator calculates the survival probabilities at an arbitrary positive data point $u$ as follows:
$$1 - \hat{G}(u) = \prod_{i=1}^{n} \left( \frac{n - i}{n - i + 1} \right)^{I[z_{(i)} \le u,\ \delta_{(i)} = 0]}, \quad u \ge 0 \tag{6}$$
where $z_{(1)} \le \ldots \le z_{(n)}$ are the sorted values of the right-censored response variable $z_i$ and $\delta_{(i)}$ are the censoring indicators associated with $z_{(i)}$. Hence, $\hat{G}(z_i)$ is used instead of $G(z_i)$ in (5), and $z_{\hat{G}} = (z_{1\hat{G}}, \ldots, z_{n\hat{G}})^T$ can be obtained to fit the PLAM.
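The synthetic data transformation in (5) and (6) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R implementation: the function names are hypothetical, ties among the observed values are assumed absent, and censored points are mapped directly to zero so that no division by a vanishing survival probability occurs.

```python
def censoring_survival(z, delta):
    """Return a function u -> 1 - G_hat(u), the Kaplan-Meier estimate in Eq. (6)."""
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    z_sorted = [z[i] for i in order]
    d_sorted = [delta[i] for i in order]

    def surv(u):
        prod = 1.0
        for rank, (zi, di) in enumerate(zip(z_sorted, d_sorted), start=1):
            # only censored points (delta = 0) at or below u contribute a factor
            if zi <= u and di == 0:
                prod *= (n - rank) / (n - rank + 1)
        return prod

    return surv


def synthetic_response(z, delta):
    """Synthetic responses of Eq. (5) with G replaced by its estimate G_hat."""
    surv = censoring_survival(z, delta)
    # censored points have delta_i = 0, so their synthetic value is 0
    return [zi / surv(zi) if di == 1 else 0.0 for zi, di in zip(z, delta)]


# Hypothetical toy sample: the second observation is censored
z = [1.0, 2.0, 3.0, 4.0]
delta = [1, 0, 1, 1]
print(synthetic_response(z, delta))
```

In this toy sample the uncensored values observed after the censored point are inflated (3 and 4 become roughly 4.5 and 6), which is how the transformation compensates for the zeroed censored observation.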
Kaplan–Meier weights: Kaplan–Meier weights (KMW), as proposed by ref. [13], are a technique used in survival analysis to address right-censored data. The Kaplan–Meier estimator is a prevalent nonparametric approach for estimating survival probabilities under censoring. Nonetheless, applying standard regression techniques to censored data can lead to biased outcomes. Stute (1993) addressed this by presenting Kaplan–Meier weights, derived from the Kaplan–Meier survival probabilities for each data point. These weights adjust the contribution of each observation in the regression analysis, effectively accounting for the censoring mechanism. By incorporating the Kaplan–Meier weights into the regression model, unbiased estimates of the regression coefficients can be obtained.
Before computing the KMW, let us assume that $z_{(i)}$ denotes the ordered values of the incomplete response and that $x_{(i)}^T$, $\delta_{(i)}$, and $t_{(i)} = (t_{(i)1}, \ldots, t_{(i)q})$ are the correspondingly ordered values. Then, the Kaplan–Meier weight $w_i$ associated with $z_{(i)}$ is computed from the Kaplan–Meier estimator $\hat{F}$ of the lifetime distribution $F$ (the counterpart of (6) for $F$) as follows:
$$w_i = \hat{F}(z_{(i)}) - \hat{F}(z_{(i-1)}) = \frac{\delta_{(i)}}{n - i + 1} \prod_{r=1}^{i-1} \left( \frac{n - r}{n - r + 1} \right)^{\delta_{(r)}} \tag{7}$$
The KMW are obtained for all values of $z_{(i)}$ and collected in the diagonal matrix $W = \mathrm{diag}(w_1, \ldots, w_n)$. For further information about (7) and about incorporating these weights into regression models, see refs. [5,6].
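The weights in (7) are jumps of the Kaplan–Meier estimator and can be computed with a single pass over the sorted sample. The Python sketch below is illustrative only; the function name `km_weights` is an assumption, ties are assumed absent, and the weights are returned in the original (unsorted) order of the observations for convenience.

```python
def km_weights(z, delta):
    """Kaplan-Meier weights of Eq. (7), returned in the original order of the observations."""
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    weights = [0.0] * n
    prod = 1.0  # running product over the ranks r < i
    for rank, idx in enumerate(order, start=1):
        weights[idx] = delta[idx] / (n - rank + 1) * prod
        prod *= ((n - rank) / (n - rank + 1)) ** delta[idx]
    return weights


# Hypothetical toy sample: the second observation is censored
print(km_weights([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1]))
```

Without censoring every weight equals $1/n$; in the toy sample the censored point receives weight zero and its mass is redistributed to the later uncensored points.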
kNN imputation method: kNN imputation is a prevalent technique for addressing missing data across various domains, as discussed by researchers including [14]. Additionally, some studies, such as ref. [15], have adapted the kNN imputation method to manage right-censored data. This method allows right-censored data points to be estimated in practice without restrictive theoretical assumptions. Here, we provide a succinct overview of the kNN imputation technique and an algorithm tailored for the PLAM dataset. Essentially, the kNN method is a machine learning technique that hinges on the similarity between data points, utilizing distance metrics for predictions. The choice of a suitable similarity measure can greatly impact the results. The Euclidean norm is commonly employed as the distance measure and, in the context of censored data points, can be computed as $d_E(x_j, x_i) = \sqrt{\sum_{i=1}^{n_c} (x_j^c - x_i^c)^2}$, where $n_c$ is the number of censored data points and $x_j^c$ and $x_i^c$ denote the $j$th and $i$th values of a regressor that is strongly correlated with the response variable $z_i$. Details are provided in Algorithm 1. For imputation, the algorithm introduced by ref. [15] can be employed. The choice of the number of neighbors, $k$, is pivotal, especially given the possibility of some neighbors being right-censored. While ref. [16] suggests a small value for $k$, such as 1 or 2, here $k$ is chosen between 2 and 10 so as to minimize the mean squared error (MSE). This approach aims for precise imputation while taking the distinct attributes of the data into account.
Algorithm 1 Algorithm for kNN imputation for the right-censored data
Inputs
I1: Right-censored dataset $z_i$
I2: Censoring indicator $\delta_i$
I3: Number of nearest neighbours $k$
I4: Values of the predictor variable $x_i$ (the one most highly correlated with $z_i$)
Output: Imputed dataset $z^{knn} = (z_1^{knn}, \ldots, z_n^{knn})^T$
1: begin
2: for $i = 1$ to $n$ do
3:     if $\delta_i = 0$ do (if the data point is censored)
4:         for $j = 1$ to $n$ do
5:             Find the distances between $x_j$ and $x_i$ for each censored data point
6:         Sort the distances from small to large
7:         for $j = 1$ to $k$ do
8:             Take the first $k$ uncensored values of $z_i$ associated with the sorted distances
9:         Calculate the $i$th imputed value $z_i^{knn}$ as the average of the nearest $k$ records of $z_i$
10: Replace the censored data points ($z_i$, $\delta_i = 0$) in the censored dataset $z = (z_1, \ldots, z_n)$ with the imputed values $z_i^{knn}$
11: Return $z^{knn} = (z_1^{knn}, \ldots, z_n^{knn})^T$
12: end
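The steps of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the authors' code: it assumes a single scalar predictor `x`, uses the absolute difference as the distance, and the function name `knn_impute` is hypothetical.

```python
def knn_impute(z, delta, x, k):
    """Replace each censored z_i by the mean of z over its k nearest uncensored neighbours in x."""
    uncensored = [j for j, d in enumerate(delta) if d == 1]
    if len(uncensored) < k:
        raise ValueError("need at least k uncensored observations")
    z_imp = list(z)
    for i, d in enumerate(delta):
        if d == 0:  # the data point is censored
            # sort the uncensored points by their distance to x_i and keep the first k
            nearest = sorted(uncensored, key=lambda j: abs(x[j] - x[i]))[:k]
            z_imp[i] = sum(z[j] for j in nearest) / k
    return z_imp


# Hypothetical toy sample: the second observation is censored
z = [1.0, 2.0, 9.0, 4.0]
delta = [1, 0, 1, 1]
x = [0.1, 0.2, 0.9, 0.3]
print(knn_impute(z, delta, x, k=2))  # [1.0, 2.5, 9.0, 4.0]
```

The censored value 2.0 is replaced by the average of the responses of its two nearest uncensored neighbours in `x` (1.0 and 4.0), while uncensored values are left untouched.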
As previously mentioned, Figure 1 has been created to illustrate the practical distinctions between the manipulative solution techniques, namely ST and kNNI. This visualization provides insights into how these methods impact the response variable and the changes they bring about. It should be noted that the effect of KMW is not demonstrated in the figure since it is incorporated into the objective function of the right-censored PLAM as weights. However, further explanation regarding KMW will be provided in the next section when obtaining the modified LLR estimators.

3. Modified Estimator for PLAM

3.1. Fundamentals of PLAM

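Before explaining the modified LLR estimators, this section provides a concise overview of the fundamental concepts of PLAM and summarizes the steps involved in utilizing the backfitting algorithm. Additionally, we express the right-censored PLAM (3) in vector and matrix form as follows:
$$Z = \beta_0 + X\beta + \sum_{j=1}^{q} f_j + \varepsilon \tag{8}$$
The vectors and matrices in (8) are given explicitly by:
$$Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_n \end{pmatrix}, \quad X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \quad f_j = \begin{pmatrix} f_j(t_{j1}) \\ \vdots \\ f_j(t_{jn}) \end{pmatrix} \quad \text{and} \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix} \tag{9}$$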
The literature offers only a handful of studies specifically addressing the right-censored partially linear additive model (PLAM). In terms of estimating model (8), ref. [17] presented the primary optimization problem for the nonparametric additive model, which corresponds to $X\beta = 0$ in model (8), and ref. [18] formulated a similar problem for (8) as follows:
$$\min_{\beta, f} \ E\left[ \left( Y - X\beta - \beta_0 - \sum_{j=1}^{q} f_j \right)^2 \right] \tag{10}$$
The solution for the $j$th function $f_j(t_j)$ in the objective (10) can be written as $f_j(t_j) = E\left[ Y - \sum_{k \ne j} f_k(t_k) \mid t_j \right]$, and this statement leads to an equation system for the general solution of the model. Let $S_1, \ldots, S_q$ be smoothing matrices obtained from the LLR procedure. Then, the equation system for the estimation of model (8) can be obtained as follows:
$$\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_q & S_q & \cdots & I \end{pmatrix}_{nq \times nq} \begin{pmatrix} \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_q \end{pmatrix}_{nq \times 1} = \begin{pmatrix} S_1 (Y - X\hat{\beta}) \\ S_2 (Y - X\hat{\beta}) \\ \vdots \\ S_q (Y - X\hat{\beta}) \end{pmatrix}_{nq \times 1} \tag{11}$$
where $\hat{\beta}$ denotes the coefficients estimated by LLR, as shown in Section 3.2. For further details on (11), refer to [9]. The solution to system (11) yields the estimates of the functions $\{f_j(t_j)\}_{j=1}^{q}$. However, inverting the matrix on the left-hand side of (11), which comprises the smoothing matrices, becomes infeasible if its dimension ($nq \times nq$) is sufficiently large. As the dimension grows, solving the system in (11) becomes progressively more challenging, potentially reaching a point where it cannot be addressed directly (refer to [18]).
Hence, in practical applications, the system (11) is typically solved using the backfitting method, starting from initial components denoted by $\{\hat{f}_j^{(0)}\}_{j=1}^{q}$. Consequently, the LLR estimators are derived by the modified backfitting algorithm, which is given at the end of Section 3.

3.2. Local Linear Regression

Local linear regression (LLR) is a widely employed smoothing technique for nonparametric, semiparametric, and additive models. Its effectiveness has been demonstrated across diverse domains, such as medical research, engineering, and the analysis of time-to-event (or survival) data in time-series studies. In this section, we present three LLR estimators for the partially linear additive model (PLAM) described in (8), employing the introduced censorship solution methods. These estimators are derived using a modified backfitting algorithm. LLR is a kernel-based method that differs from kernel regression in that it fits a line locally rather than a constant. To illustrate the working procedure of LLR, let us consider a partially linear model with a univariate function, that is, $q = 1$ as given in (1), involving an unknown smooth function $f(\cdot)$. The key concept of LLR is to estimate model (1) linearly within small input intervals. To estimate the parameters of (1), the backfitting algorithm introduced by ref. [19] is used. Accordingly, the backfitting estimators $(\hat{\beta}, \hat{f})$ for model (1), where $\hat{f}_1 = (\hat{f}_1(t_1), \ldots, \hat{f}_1(t_n))^T$, can be obtained by using the corresponding matrices $S_{h_1}$ and $H_1$ in Algorithm 2, where $H_1 = S_{h_1} + \tilde{X} (\tilde{X}^T \tilde{X})^{-1} X^T (I - S_{h_1})$ for $\tilde{X} = (I - S_{h_1}) X$. Here, $S_{h_1}$ is computed based on the bandwidth parameter $h_1 > 0$ for LLR and is formed from the values $t_{1i}$ of the nonparametric variable.
In order to adapt the LLR method for estimating the parameters of the right-censored PLAM, a closer examination of the elements of the smoother matrix $S_{h_j}$ is required. Let $S_{h_j}$ ($j \le q$) be written in open form as $S_{h_j} = (s_{j1}, \ldots, s_{jn})^T$, where $s_{j1}, \ldots, s_{jn}$ are the row vectors of $S_{h_j}$ obtained from the values of the $j$th nonparametric covariate $t_j = (t_{j1}, \ldots, t_{jn})^T$. From the theory of LLR, $s_{jm}^T$ for any $t_{j1} \le m \le t_{jn}$ can be obtained as follows:
$$s_{jm}^T = d_1^T \left( t_{jm}^T W_{jm} t_{jm} \right)^{-1} t_{jm}^T W_{jm} \tag{12}$$
where $t_{jm}$, $d_1$, and $W_{jm}$ can be expressed as follows:
$$t_{jm} = \begin{pmatrix} 1 & (t_{j1} - m) \\ \vdots & \vdots \\ 1 & (t_{jn} - m) \end{pmatrix}, \quad d_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$
and
$$W_{jm} = \mathrm{diag}\left( h^{-1} K\left( \frac{t_{j1} - m}{h} \right), \ldots, h^{-1} K\left( \frac{t_{jn} - m}{h} \right) \right)$$
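To make the smoother row in (12) concrete, the following Python sketch computes it in closed form by inverting the 2 x 2 matrix $t_{jm}^T W_{jm} t_{jm}$ by hand. The Gaussian kernel and the function names are assumptions made for illustration; the paper does not prescribe a particular kernel at this point.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumed choice of K)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear_row(t, m, h, kernel=gaussian_kernel):
    """Row s_m of the LLR smoother at the target point m, so that f_hat(m) = sum_i s_m[i] * y[i]."""
    w = [kernel((ti - m) / h) / h for ti in t]   # diagonal of W
    d = [ti - m for ti in t]                     # second column of the local design matrix
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1                      # determinant of the 2 x 2 matrix t^T W t
    if det <= 0.0:
        raise ValueError("local design is singular; increase the bandwidth h")
    # first row of (t^T W t)^{-1} t^T W, written out explicitly
    return [wi * (s2 - di * s1) / det for wi, di in zip(w, d)]


def smoother_matrix(t, h):
    """Stack the rows evaluated at the observed design points."""
    return [local_linear_row(t, tm, h) for tm in t]


# Hypothetical check: a local linear fit reproduces a straight line exactly
t = [0.0, 0.25, 0.5, 0.75, 1.0]
y = [2.0 + 3.0 * ti for ti in t]
row = local_linear_row(t, 0.4, h=0.3)
print(sum(si * yi for si, yi in zip(row, y)))
```

Because the weights in each row sum to one and are orthogonal to the centered covariate, the fitted value at 0.4 agrees with the line $2 + 3t$ up to rounding.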
Based on the above, extending the LLR estimators to PLAM requires further adjustments. Moreover, it is crucial to satisfy the standard assumptions of LLR. The kernel function $K(\cdot)$ is continuous, and its moments $\mu_i(K) = \int u^i K(u)\, du$ satisfy $\mu_i(K) = 0$ for odd values of $i$ and $\mu_2(K) \neq 0$. The density $g$ of $t_{ji}$ satisfies $g(t_m) > 0$ for all $m$ in the support of $g$, and, as is common, $h \to 0$ and $nh \to \infty$ as $n \to \infty$. Finally, the second derivative of the nonparametric smooth function $f(\cdot)$ exists and is continuous. These assumptions are discussed in detail in ref. [20].
In the backfitting estimation procedure, some restrictions on $\{f_j(t_{ij})\}_{j=1}^{q}$ are needed to simplify the definition of model (8). First, $E[f_j(t_{ij})] = 0$ is assumed. Second, the parametric covariates $x_i^T$ and the right-censored response values $z_i$ are assumed to be centered at zero. These constraints are necessary to construct the centered smoother matrix $S_{h_j}$ used in the LLR estimation. Thus, the conditional expectation of model (8) can be expressed as follows:
$$E(z_i \mid x_i, t_i) = \beta_0 + x_i^T \beta + \sum_{j=1}^{q} f_j(t_{ij}), \quad i = 1, \ldots, n \tag{13}$$
By using the modified backfitting algorithm given in Algorithm 2, solutions for the PLAM parameters $\beta$ and $\{f_j\}_{j=1}^{q}$ can be obtained based on $S_{h_j}$. Thus, without any censoring adjustment, the PLAM estimators $(\hat{\beta}, \hat{f})$ based on the LLR are obtained.
Algorithm 2 Modified Backfitting Algorithm for Right-Censored PLAM
Inputs: $\beta_0 = E(Z_i) = \bar{Z}$; $X$: $n \times p$-dimensional covariates of the parametric component
    $T = (t_1, \ldots, t_q)$: $n \times q$-dimensional scaled nonparametric covariates; $\{f_k^{(0)}\}_{k=1}^{q}$: initial smooth functions
    $\beta^{(0)}$: initial regression coefficients; $Z$: $n \times 1$-dimensional vector of right-censored response values
    Tolerance value $tol = 0.05$ and max. iteration $= 100$.
Outputs: Modified PLAM estimators:
    O1: kNNI-based LLR estimators $\hat{\beta}^{imp}$ and $(\hat{f}_1^{imp}, \ldots, \hat{f}_q^{imp})$
    O2: ST-based estimators $\hat{\beta}^{ST}$ and $(\hat{f}_1^{ST}, \ldots, \hat{f}_q^{ST})$
    O3: KMW-based estimators $\hat{\beta}^{KMW}$ and $(\hat{f}_1^{KMW}, \ldots, \hat{f}_q^{KMW})$
Begin
1: Initialize $\beta$ and $f_1, \ldots, f_q$ as $\beta^{(0)}$ and $\{f_j^{(0)}\}_{j=1}^{q}$ using the covariates $X$ and $t_1, \ldots, t_q$.
2: while $tol \ge 0.05$ and $i < max.iteration$
Selection of the optimal bandwidth parameter $h_j$ by GCV (steps 3–13)
3:     Create a sequence of tuning parameters $h_{seq} \in [0.01, 1.5]$ of a predetermined length
4:     for ($l$ in 1:length) do
5:         Compute the smoothing matrix $S_{h_{seq}^{l}}$.
6:         if the censorship solution is KMW
7:             Compute $\tilde{X}$ and $H_j^{l} = S_{h_{seq}^{l}} + \tilde{X} (\tilde{X}^T W \tilde{X})^{-1} X^T W (I - S_{h_{seq}^{l}})$ where $\tilde{X} = (I - S_{h_{seq}^{l}}) X$
8:         Else
9:             Compute $\tilde{X}$ and $H_j^{l} = S_{h_{seq}^{l}} + \tilde{X} (\tilde{X}^T \tilde{X})^{-1} X^T (I - S_{h_{seq}^{l}})$ where $\tilde{X} = (I - S_{h_{seq}^{l}}) X$
10:        Calculate $GCV(h_{seq}^{l})$ as given in Equation (24)
11:    end
12:    Select the optimal $\hat{h}_j$ that minimizes $GCV(h_j)$ for the $j$th function $f_j$.
13:    Compute $S_{\hat{h}_j}$ for each criterion (and method).
Solution of the censorship problem (steps 14–25)
14:    if the censorship solution is kNNI
15:        Replace $Z$ with $Z_{imp}$ using Algorithm 1.
16:    if the censorship solution is ST
17:        Replace $Z$ with $Z_{ST}$ as shown in Equation (5)
18:    for $j$ in $1:q$ do
19:        if the censorship solution is KMW
20:            $\hat{\beta}_j^{(i)} = (X^T W X)^{-1} X^T W \left( Z - \beta_0 - \sum_{m<j} \hat{f}_m^{(i)} - \sum_{m>j} \hat{f}_m^{(i-1)} \right)$
21:            $\hat{f}_j^{(i)} = S_{\hat{h}_j} \left( Z - \beta_0 - X \hat{\beta}_j^{(i)} - \sum_{m<j} \hat{f}_m^{(i)} - \sum_{m>j} \hat{f}_m^{(i-1)} \right)$
22:        Else
23:            $\hat{\beta}_j^{(i)} = (X^T X)^{-1} X^T \left( Z - \beta_0 - \sum_{m<j} \hat{f}_m^{(i)} - \sum_{m>j} \hat{f}_m^{(i-1)} \right)$
24:            $\hat{f}_j^{(i)} = S_{\hat{h}_j} \left( Z - \beta_0 - X \hat{\beta}_j^{(i)} - \sum_{m<j} \hat{f}_m^{(i)} - \sum_{m>j} \hat{f}_m^{(i-1)} \right)$
25:    end
26:    $i = i + 1$
27:    $tol = (nq)^{-1} \left( \hat{f}_k^{(i)} - \hat{f}_k^{(i-1)} \right)^T \mathbf{1}$ where $\mathbf{1} = (1, \ldots, 1)^T$.
28: end
29: Return $\hat{\beta}$ and $(\hat{f}_1, \ldots, \hat{f}_q)$
30: end
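The core update of Algorithm 2 (steps 18 to 27 in the unweighted branch) can be sketched in Python under strong simplifications. The sketch assumes a single parametric covariate ($p = 1$), takes the smoother matrices as given instead of selecting bandwidths by GCV, re-centers each fitted function to respect $E[f_j] = 0$, and measures convergence by the mean absolute change of the fitted functions. All names are hypothetical and this is not the authors' R implementation.

```python
def matvec(S, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in S]


def backfit_plam(z, x, smoothers, tol=1e-8, max_iter=100):
    """Backfitting for z = beta0 + x*beta + sum_j f_j + eps with one scalar covariate x."""
    n, q = len(z), len(smoothers)
    beta0 = sum(z) / n                      # intercept fixed at the mean response
    beta = 0.0
    f = [[0.0] * n for _ in range(q)]       # initial smooth functions f_j^(0) = 0
    sxx = sum(xi * xi for xi in x)
    for _ in range(max_iter):
        change = 0.0
        for j in range(q):
            # partial residual that leaves out the j-th function (latest available f_m)
            others = [sum(f[m][i] for m in range(q) if m != j) for i in range(n)]
            r = [z[i] - beta0 - others[i] for i in range(n)]
            # least-squares update of the coefficient (step 23 with p = 1)
            beta = sum(xi * ri for xi, ri in zip(x, r)) / sxx
            # smooth what is left after removing the parametric part (step 24)
            new_f = matvec(smoothers[j], [ri - xi * beta for xi, ri in zip(x, r)])
            mean_f = sum(new_f) / n
            new_f = [v - mean_f for v in new_f]
            change += sum(abs(a - b) for a, b in zip(new_f, f[j]))
            f[j] = new_f
        if change / (n * q) < tol:
            break
    return beta0, beta, f
```

For the KMW branch, the least-squares update would be replaced by its weighted counterpart in step 20; for ST and kNNI, the only change is that `z` holds the transformed or imputed responses.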
Furthermore, it should be noted that ref. [20] presented a non-iterative formulation equivalent to the backfitting algorithm, based on an additive smoother matrix $S^{A} = \sum_{j=1}^{q} S_j^{*}$, to demonstrate the LLR estimation process in the absence of censorship; this formulation reveals the relationship between $Z$ and $\hat{f}^{A} = \sum_{j=1}^{q} \hat{f}_j$. Here, $S_j^{*}$ is computed from the equation system (11) based on $S_{h_j}$ (see ref. [9]). Additionally, this information elucidates the connection between a unique solution and the iterative backfitting process.
Accordingly, the LLR estimators for PLAM can be obtained for both ST and kNNI by replacing $Z$ with $Z_{ST}$ and $Z_{kNNI}$, respectively:
$$\hat{\beta}^{A} = (X^T \tilde{X})^{-1} X^T \tilde{Z} \tag{14}$$
$$\hat{f}^{A} = S^{A} (Z - \beta_0 - X \hat{\beta}^{A}) \tag{15}$$
For the KMW solution, the non-iterative estimators are obtained as follows:
$$\hat{\beta}_{KMW}^{A} = (X^T W \tilde{X})^{-1} X^T W \tilde{Z} \tag{16}$$
$$\hat{f}_{KMW}^{A} = S^{A} (Z - \beta_0 - X \hat{\beta}_{KMW}^{A}) \tag{17}$$
where $\tilde{X} = (I - S^{A}) X$ and $\tilde{Z} = (I - S^{A}) Z$. It should be noted that the validity of Equations (14)–(17) depends on the existence of a unique solution. Furthermore, the vector of fitted values for LLR can be expressed as follows:
$$\hat{\mu} = E(Z \mid X, t) = \hat{Z} = H^{A} Z \tag{18}$$
where $H^{A} = S^{A} + \tilde{X} (\tilde{X}^T \tilde{X})^{-1} X^T (I - S^{A})$ and, for the KMW solution, $H_{KMW}^{A} = S^{A} + \tilde{X} (\tilde{X}^T W \tilde{X})^{-1} X^T W (I - S^{A})$. Note that under completely observed data, $H^{A}$ is derived by [21] for the LLR estimator of PLAM.
To demonstrate and interpret each nonparametric component individually, the introduced modified backfitting algorithm is more suitable than Equations (16)–(18), which yield an additive outcome for the nonparametric component. Additionally, computing $S^{A}$ becomes significantly challenging as the dimension of the additive component increases. In this paper, the modified backfitting estimators $(\hat{\beta}^{A}, \hat{f}^{A})$ of LLR, obtained through Algorithm 2, are employed. This approach aims to showcase the performance of the estimated functions $\hat{f} = \{\hat{f}_j\}_{j=1}^{q}$. In Algorithm 2, to calculate the selection criterion GCV, the degrees of freedom (DF) are computed by $DF_j = \mathrm{tr}\left[ (I - H_j)^T (I - H_j) \right] = n - 2\,\mathrm{tr}(H_j) + \mathrm{tr}(H_j^T H_j)$, where $H_j$ denotes the hat matrix based on the $j$th nonparametric component. For further details about Algorithm 2, see ref. [9].
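The second equality in the DF expression follows directly from the linearity of the trace. The short derivation below is added for completeness and uses only $\mathrm{tr}(I) = n$ and $\mathrm{tr}(A^T) = \mathrm{tr}(A)$.

```latex
\begin{aligned}
DF_j &= \operatorname{tr}\!\left[ (I - H_j)^T (I - H_j) \right] \\
     &= \operatorname{tr}(I) - \operatorname{tr}(H_j) - \operatorname{tr}(H_j^T) + \operatorname{tr}(H_j^T H_j) \\
     &= n - 2\,\operatorname{tr}(H_j) + \operatorname{tr}(H_j^T H_j).
\end{aligned}
```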

4. Properties of the Estimator

The objective of this section is to assess the bias and variance of the modified LLR estimators introduced in the previous section. When evaluating the performance of the parametric component, the variances and biases of the regression coefficients are calculated using the non-iterative solutions given in Equations (14)–(17), owing to their theoretical simplicity.
Empirical studies can be conducted to calculate the bias and variance properties of the estimators. However, for LLR, the non-iterative formulations in Equations (14)–(17) can be employed to compute finite-sample properties for the other two methods. In this matter, the conditional bias $E(\hat{\beta}^{A} - \beta \mid X, t)$ and the variance $Var(\hat{\beta}^{A})$ are obtained based on Equations (14)–(17).
Let us rewrite $\hat{\beta}^{A}$ as:
$$\hat{\beta}^{A} = \beta + (X^T \tilde{X})^{-1} X^T \tilde{f}^{A} + (X^T \tilde{X})^{-1} X^T (I - S^{A})\, \varepsilon$$
where $S^{A} = \sum_{j=1}^{q} S_j^{*}$ and $\tilde{f}^{A} = \tilde{f}_1 + \ldots + \tilde{f}_q$ for $\tilde{f}_j = (I - S_{h_j}) f_j$, $j = 1, \ldots, q$. Then $B(\hat{\beta}^{A})$ and $Var(\hat{\beta}^{A})$ can be given by:
$$B(\hat{\beta}^{A}) = E(\hat{\beta}^{A} - \beta \mid X, t) = (X^T \tilde{X})^{-1} X^T \tilde{f}^{A} \tag{19}$$
$$Var(\hat{\beta}^{A}) = \hat{\sigma}_\varepsilon^2\, (X^T \tilde{X})^{-1} X^T (I - S^{A})^2 X (X^T \tilde{X})^{-1} \tag{20}$$
For the KMW solution, the counterparts of Equations (19) and (20) are given by:
$$B(\hat{\beta}_{KMW}^{A}) = E(\hat{\beta}_{KMW}^{A} - \beta \mid X, t) = (X^T W \tilde{X})^{-1} X^T W \tilde{f}_{KMW}^{A} \tag{21}$$
$$Var(\hat{\beta}_{KMW}^{A}) = \hat{\sigma}_\varepsilon^2\, (X^T W \tilde{X})^{-1} X^T W (I - S^{A})^2 X (X^T W \tilde{X})^{-1} \tag{22}$$
where $\hat{\sigma}_\varepsilon^2$ is the model variance estimated based on LLR. It can be computed using the hat matrix $H^{A}$, or $H_{KMW}^{A}$ for the KMW solution, both defined after Equation (18). In addition, one can replace $Z$ by $Z_{ST}$ or $Z_{imp}$. Accordingly, $\hat{\sigma}_\varepsilon^2$ is formulated as follows:
$$\hat{\sigma}_\varepsilon^2 = \frac{Z^T (I - H^{A})^T (I - H^{A}) Z}{\mathrm{tr}\left[ (I - H^{A})^T (I - H^{A}) \right]} \tag{23}$$
where the degrees of freedom (DF), given in the denominator of (23), are calculated by $DF^{A} = \mathrm{tr}\left[ (I - H^{A})^T (I - H^{A}) \right] = n - 2\,\mathrm{tr}(H^{A}) + \mathrm{tr}\left( (H^{A})^T H^{A} \right)$, and $H_{KMW}^{A}$ is used for the KMW solution. For further details on $DF^{A}$, see ref. [17]. The modified backfitting algorithm provided in Algorithm 2 requires the estimation of the model variance for each individual nonparametric function in order to calculate the GCV score for bandwidth parameter selection. Consequently, if $H^{A}$ is replaced by $H_j$ or $H_{KMW}^{j}$ in (23), the individual variance estimator $\hat{\sigma}_{\varepsilon j}^2$ can be easily obtained. The purpose of computing $\hat{\sigma}_{\varepsilon j}^2$ is to select the appropriate smoothing and bandwidth parameters using the GCV criterion, which relies on the estimated model variance. The GCV criterion can be summarized as follows.
GCV criterion: Generalized cross-validation is used to obtain a minimum score based on the optimal tuning parameter for the regression model. In terms of bandwidth selection in additive models with LLR, ref. [22] presented a detailed work on using GCV and its properties. Accordingly, to choose the optimal $h_j$ for the $j$th function $f_j$, the $GCV(h_j)$ score can be computed based on $\hat{\mu}$ given in (18):
$$GCV(h_j) = \frac{(Z - \hat{\mu})^T (Z - \hat{\mu})}{n \left( 1 - n^{-1}\, \mathrm{tr}(H_j) \right)^2} \tag{24}$$
where $H_j$ is the hat matrix obtained for $f_j$, which is provided at the end of Section 3. Notice that calculating the true $DF_j$ in PLAM is asymptotically justifiable if the parametric and nonparametric covariates $(x_i, t_j)$ are independent. If there is multicollinearity, then Equation (24) may need to be regularized properly because $DF_j$ is overestimated.
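As a small illustration of how the GCV score in (24) drives bandwidth selection, the Python sketch below evaluates the score for given fitted values and a given trace of the hat matrix, and picks the grid value with the smallest score. The callback `fit` is a hypothetical stand-in for whatever routine produces the fitted values and $\mathrm{tr}(H_j)$ at a bandwidth $h$.

```python
def gcv_score(z, fitted, trace_h):
    """GCV score of Eq. (24): RSS / (n * (1 - tr(H)/n)^2)."""
    n = len(z)
    if trace_h >= n:
        raise ValueError("tr(H) must be smaller than n")
    rss = sum((zi - mi) ** 2 for zi, mi in zip(z, fitted))
    return rss / (n * (1.0 - trace_h / n) ** 2)


def select_bandwidth(z, grid, fit):
    """Return the bandwidth in grid with the smallest GCV score.

    fit(h) is a hypothetical callback returning (fitted_values, trace_of_hat_matrix).
    """
    return min(grid, key=lambda h: gcv_score(z, *fit(h)))


# Hypothetical toy numbers
print(gcv_score([1.0, 2.0, 3.0], [1.5, 2.0, 2.5], trace_h=1.0))  # 0.375
```

The denominator penalizes flexible fits: as $\mathrm{tr}(H_j)$ approaches $n$, the score grows even if the residual sum of squares is small.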

4.1. Evaluation of Performance

4.1.1. Metrics for the Parametric Component

In this section, two metrics are presented to assess the performance of the LLR estimator $\hat{\beta}$ of the parametric component of the model: the scalar-valued mean dispersion error (SMDE) and the relative efficiency (RE), which is computed as the ratio of SMDE values. The formulations are given below:
$$SMDE(\hat{\beta}, \beta) = E\left[ (\beta - \hat{\beta})^T (\beta - \hat{\beta}) \right] = \mathrm{tr}\left( MSE(\hat{\beta}, \beta) \right) \tag{25}$$
where $MSE(\hat{\beta}, \beta)$ is expressed as the sum of the squared bias and the variance of $\hat{\beta}$, and is given by:
$$MSE(\hat{\beta}, \beta) = E\left[ (\beta - \hat{\beta}) (\beta - \hat{\beta})^T \right] = Var(\hat{\beta}) + \left[ B(\hat{\beta}) \right]^2 \tag{26}$$
Then, using (25), the REs of the methods in estimating $\beta$ can be computed. In this paper, the methods compared through the REs are the censorship solution techniques.
Let $\hat{\beta}_1$ and $\hat{\beta}_2$ represent the estimates of the parametric component based on two different censorship solutions. Accordingly, RE can be formulated as follows:
$$RE(\hat{\beta}_1, \hat{\beta}_2) = SMDE(\hat{\beta}_1, \beta) / SMDE(\hat{\beta}_2, \beta) \tag{27}$$
where $RE(\hat{\beta}_1, \hat{\beta}_2) < 1$ indicates that $\hat{\beta}_1$ is more efficient than $\hat{\beta}_2$.
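In a simulation study, the expectation in (25) is typically approximated by an average over replications. The Python sketch below shows this Monte Carlo version of SMDE and the resulting RE; the function names and the toy estimates are illustrative assumptions.

```python
def smde(estimates, beta):
    """Monte Carlo SMDE: average squared Euclidean distance between the estimates and beta."""
    total = 0.0
    for est in estimates:
        total += sum((b - bh) ** 2 for b, bh in zip(beta, est))
    return total / len(estimates)


def relative_efficiency(estimates_1, estimates_2, beta):
    """RE of Eq. (27); values below 1 favour the first estimator."""
    return smde(estimates_1, beta) / smde(estimates_2, beta)


# Hypothetical estimates from two replications of two censorship solutions
beta = [1.0, 0.5]
est_a = [[1.1, 0.5], [0.9, 0.5]]
est_b = [[1.2, 0.5], [0.8, 0.5]]
print(relative_efficiency(est_a, est_b, beta))
```

Here the first set of estimates lies twice as close to the true coefficients, so the RE is about 0.25.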

4.1.2. Metrics for the Nonparametric Component

To evaluate the quality of the estimated nonparametric component, two measures are presented. The first measure is the root mean squared error (RMSE), which measures the accuracy of each individual estimated function in the model. The second measure is the averaged root mean squared error (ARMSE), which is specifically designed to assess the performance of the overall additive component $\hat{f} = (\hat{f}_1, \ldots, \hat{f}_q)$. The formulations of RMSE and ARMSE are written as:
$$RMSE_j(f_j, \hat{f}_j) = \sqrt{ n^{-1} \sum_{i=1}^{n} \left( f_j(t_{ij}) - \hat{f}_j(t_{ij}) \right)^2 }, \quad 1 \le j \le q \tag{28}$$
and
$$ARMSE(f^{A}, \hat{f}^{A}) = q^{-1} \sum_{j=1}^{q} RMSE_j(f_j, \hat{f}_j) \tag{29}$$
where $f^{A} = \sum_{j=1}^{q} f_j$ and $\hat{f}^{A} = \sum_{j=1}^{q} \hat{f}_j$.
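Both measures are straightforward to compute once the true and estimated functions have been evaluated at the design points. The following Python sketch is illustrative; the function names are assumptions, and each function is represented by the list of its values at the $n$ observations.

```python
import math


def rmse(f_true, f_hat):
    """RMSE of Eq. (28) for one nonparametric function evaluated at the n design points."""
    n = len(f_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_true, f_hat)) / n)


def armse(true_functions, estimated_functions):
    """ARMSE of Eq. (29): the average of the q individual RMSE values."""
    scores = [rmse(ft, fh) for ft, fh in zip(true_functions, estimated_functions)]
    return sum(scores) / len(scores)


# Hypothetical values of q = 2 functions at n = 2 points
true_f = [[0.0, 0.0], [1.0, 1.0]]
est_f = [[3.0, 4.0], [1.0, 1.0]]
print(rmse(true_f[0], est_f[0]), armse(true_f, est_f))
```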

5. Simulation Study

The practical performance of the modified LLR estimators in the context of right-censored PLAM with various censorship solution methods is analyzed in this section. To achieve this, different settings for sample size ($n$), the number of additive nonparametric components ($q$), and the level of censoring (CL) are considered. Specifically, three sample sizes ($n = 50$, $100$, and $200$) and three levels of censoring ($CL = 5\%$, $20\%$, and $35\%$) are chosen. A total of eight scenarios are obtained by combining these configurations. Additionally, a total of 24 cases for analysis are formed by using three censorship solution methods. Moreover, accelerated failure time model estimation results are presented as benchmark performance scores. For this purpose, existing functions from the survival library in R are used. Note that the functions written in R for this paper are provided via the link: https://github.com/yilmazersin13/Censored-Partially-linear-additive-models/tree/main, accessed on 9 August 2023. The simulation design and setup used in this study follow a pattern commonly found in the literature (see ref. [4]). Small, medium, and large sample sizes are chosen, along with three different censoring levels, in accordance with reference articles. Furthermore, the nonparametric component count has been determined in two distinct ways, introducing a novel approach that differs from most similar studies (see ref. [9]).
After establishing the design, the data generation procedure for the right-censored PLAM is outlined here. First, the PLAM with completely observed responses is generated as:
$$y_i = x_i^T \beta + \sum_{j=1}^{q} f_j(t_{ji}) + \varepsilon_i, \quad 1 \le i \le n \tag{30}$$
where $x_i = (x_{i1}, x_{i2})^T$ is the $i$th row of the $n \times 2$ parametric covariate matrix, whose entries are independent and normally distributed, generated as $x_i \sim N(\mu_x = 0, \sigma_x^2 = 1)$. The vector of regression coefficients is set to $\beta = (1, 0.5)^T$. Regarding the nonparametric component, when $q = 2$ the smooth functions are $f_1(t_1) = 1 - 48 t_1 + 218 t_1^2 - 315 t_1^3 + 145 t_1^4$ with $t_1 = \{(i - 0.5)/n\}_{i=1}^{n}$ and $f_2(t_2) = \sin(2 t_2) + 2 e^{-16 t_2^2}$ with $t_2 \sim U[-2, 2]$. Note that, because of how the variables are scaled in the simulation study, the constant term $\alpha_0$ is not used throughout the section. Finally, the random error terms $\varepsilon_i$ are independent and identically distributed with zero mean and constant variance, $\varepsilon_i \sim N(0, \sigma_\varepsilon^2 = 0.5)$.
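The data-generating step for $q = 2$ can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' R implementation: the signs in the polynomial $f_1$, the negative exponent in $f_2$, and the uniform range $[-2, 2]$ reflect our reading of the typeset formulas, and the function name `generate_plam` and the seed are hypothetical.

```python
import math
import random


def f1(t: float) -> float:
    """First smooth function (polynomial); signs are an assumed reading."""
    return 1 - 48 * t + 218 * t**2 - 315 * t**3 + 145 * t**4


def f2(t: float) -> float:
    """Second smooth function: sine wave plus a narrow bump at zero."""
    return math.sin(2 * t) + 2 * math.exp(-16 * t**2)


def generate_plam(n: int, beta=(1.0, 0.5), sigma2_eps: float = 0.5, seed: int = 0):
    """Generate fully observed responses from the two-component PLAM."""
    rng = random.Random(seed)
    data = {"x1": [], "x2": [], "t1": [], "t2": [], "y": []}
    for i in range(1, n + 1):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # N(0, 1) covariates
        t1 = (i - 0.5) / n                                  # equally spaced grid
        t2 = rng.uniform(-2.0, 2.0)                         # U[-2, 2] design
        eps = rng.gauss(0.0, math.sqrt(sigma2_eps))         # error variance 0.5
        y = beta[0] * x1 + beta[1] * x2 + f1(t1) + f2(t2) + eps
        for key, value in zip(data, (x1, x2, t1, t2, y)):
            data[key].append(value)
    return data


sample = generate_plam(n=50)
```

Note that `random.gauss` takes a standard deviation, so the error variance of 0.5 is passed as its square root.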
After generating (30), the censorship procedure given in Algorithm 3 is applied: the right-censored response variable $Z$ is generated from the random censoring variable $C = (c_1, \ldots, c_n)^T$ and the censoring indicator $\delta = (\delta_1, \ldots, \delta_n)^T$.
Algorithm 3 Censoring Procedure
Input: Completely observed $y_i$
Output: Right-censored dependent variable $z_i$
1: For the given censoring level (CL), generate $\delta_i = I(y_i \le c_i)$ from the binomial distribution
2: for $i$ in $1$ to $n$
3:     if $\delta_i = 0$
4:         while $y_i \le c_i$
5:             generate $c_i \sim N(\mu_y, \sigma_y^2)$
6:     else
7:         $c_i = y_i$
8: end (for loop in Step 2)
9: for $i$ in $1$ to $n$
10:     if $y_i \le c_i$
11:         $z_i = y_i$
12:     else
13:         $z_i = c_i$
14: end (for loop in Step 9)
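The censoring procedure of Algorithm 3 can be sketched in Python as below. Several details are assumptions made for illustration only: each indicator is drawn as a Bernoulli variable with $P(\delta_i = 0) = CL$, the censoring value of an uncensored observation is set to $y_i$, and $\mu_y$ and $\sigma_y$ are taken as the sample mean and standard deviation of the responses. The name `censor_responses` is hypothetical.

```python
import random
import statistics


def censor_responses(y, censoring_level, seed=0):
    """Return (z, delta, c): censored responses, indicators, censoring values."""
    rng = random.Random(seed)
    mu_y = statistics.mean(y)
    sd_y = statistics.pstdev(y)
    if sd_y == 0:
        raise ValueError("responses must not all be identical")
    # Step 1: delta_i = 0 (censored) with probability equal to the censoring level.
    delta = [0 if rng.random() < censoring_level else 1 for _ in y]
    # Steps 2-8: draw c_i below y_i for censored points, otherwise set c_i = y_i.
    c = []
    for yi, di in zip(y, delta):
        if di == 0:
            ci = rng.gauss(mu_y, sd_y)
            while yi <= ci:               # redraw until the censoring value is smaller
                ci = rng.gauss(mu_y, sd_y)
        else:
            ci = yi
        c.append(ci)
    # Steps 9-14: the observed response is the minimum of y_i and c_i.
    z = [yi if yi <= ci else ci for yi, ci in zip(y, c)]
    return z, delta, c


rng = random.Random(42)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]
z, delta, c = censor_responses(y, censoring_level=0.35, seed=1)
```

The rejection loop terminates with probability one, but it can take many draws for responses far in the lower tail, since few normal draws fall below them.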
Then, the right-censored PLAM is obtained with the incomplete response variable $Z = (Z_1, \ldots, Z_n)^T$. Accordingly, the following figures and tables are provided based on the censorship solution techniques. Tables 1 and 2 present the results for the parametric component estimation, specifically the SMDE and RE values, respectively. In addition, as a benchmark method, the performance of AFT model estimation based on Cox's semiparametric proportional hazards (CPH) estimator is provided in both the simulation and real data examples. These estimates are obtained using the "survival" package in R.
Before presenting the findings, Figure 2 illustrates the bandwidth selection process across the different scenarios and shows how the chosen bandwidth depends on the censoring level and on the censorship solution method. For $f_1$, the selected bandwidth is less sensitive to the censoring level and the sample size. For $f_2$, however, the censoring level clearly influences the chosen bandwidth: at high censoring levels, smaller bandwidths are preferred for all solution methods. This outcome is intuitively reasonable since, especially for ST and kNNI, the data to be fitted become more undulating. The censoring level should therefore be taken into account in bandwidth selection. These findings agree with prior research: ref. [23] demonstrated similar behavior in a related context, highlighting the sensitivity of the bandwidth to censoring levels, and, in line with ref. [24], our observations underscore the need for cautious bandwidth selection under substantial censoring in order to model intricate data structures accurately.
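To make the role of the bandwidth concrete, a single local linear fit at a point $x_0$ can be sketched as follows. This is a generic textbook smoother, assuming a Gaussian kernel, and is not the modified backfitting estimator of this paper; the function name `local_linear` is hypothetical. A smaller $h$ concentrates the kernel weights on nearby points and yields a more flexible fit.

```python
import math


def local_linear(x0, x, y, h):
    """Local linear estimate of E[y | x = x0] with a Gaussian kernel of bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        u = xi - x0
        w = math.exp(-0.5 * (u / h) ** 2)   # kernel weight of observation i
        s0 += w
        s1 += w * u
        s2 += w * u * u
        t0 += w * yi
        t1 += w * u * yi
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("bandwidth too small for the available design points")
    # Intercept of the weighted least squares line, i.e. the fitted value at x0.
    return (s2 * t0 - s1 * t1) / det


# Sanity check on exactly linear data: a local linear fit reproduces the line.
x = [i / 20 for i in range(21)]
y = [2 + 3 * xi for xi in x]
fit = local_linear(0.37, x, y, h=0.2)
```

The closed form is the intercept obtained by solving the two weighted normal equations for a line centered at $x_0$, so no matrix library is needed.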
The results in Table 1 demonstrate that the estimation quality of the modified LLR estimators for the parametric component $\beta$ improves with lower censoring levels and larger sample sizes across all censorship techniques. These tendencies align with the expected theoretical behavior. Specifically, the LLR-KMW estimator exhibits dominant performance in many simulation combinations, closely followed by the LLR-kNNI estimator with competitive SMDE scores. However, LLR-ST does not perform well. SMDE scores of the CPH estimator are also presented in the table as a benchmark. Because the model is complex, with two different nonparametric functions, there is a substantial gap between the LLR-based estimators and the CPH estimator, which is expected.
Interestingly, in cases where $n = 50$ and $CL = 5\%$ or $CL = 20\%$, the LLR-kNNI estimator outperforms the LLR-KMW estimator. As the sample size increases, LLR-KMW takes the lead, in accordance with its theoretical behavior. It is worth noting that, due to its fully nonparametric nature, LLR-kNNI may yield better results under different configurations, demonstrating relative independence from specific simulation settings. This characteristic is observed in the combination of $n = 200$ and $CL = 20\%$.
Additionally, to assess the impact of censorship on the solution techniques, the increase in SMDE scores between censorship levels is examined. The results indicate that the LLR-ST estimator is the most affected by censorship, which aligns with the theoretical background of ST presented in Section 2.
In Table 2, the RE scores follow the convention that the numerators correspond to the columns and the denominators to the rows. Therefore, an RE value of less than 1 in Table 2 indicates that the method in the column is more efficient than the method in the corresponding row. Please note that, to save space, only certain simulation configurations are considered in Table 2. The results in the table confirm that LLR-KMW is more efficient than LLR-ST in all cases. At the same time, LLR-KMW and LLR-kNNI exhibit similar outcomes, indicating that neither is distinctly more efficient than the other in any simulation configuration for estimating the parametric component of the PLAM.
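The column-over-row convention can be illustrated with the SMDE scores reported in Table 1 for $n = 50$ and $CL = 5\%$. The sketch below assumes that each RE score is the ratio of the column method's SMDE to the row method's SMDE; the function name `relative_efficiency` is hypothetical. Because the published table was presumably computed from unrounded SMDE values, the ratios of the rounded scores may differ from the tabulated REs in the third decimal.

```python
# SMDE scores for n = 50, CL = 5%, as reported in Table 1.
smde = {"LLR-ST": 0.561, "LLR-KMW": 0.557, "LLR-kNNI": 0.545, "CPH": 0.991}


def relative_efficiency(smde_scores):
    """RE[row][col] = SMDE(col) / SMDE(row); values below 1 favor the column method."""
    return {
        row: {col: smde_scores[col] / smde_scores[row] for col in smde_scores}
        for row in smde_scores
    }


re_table = relative_efficiency(smde)
for row, scores in re_table.items():
    print(row, {col: round(value, 3) for col, value in scores.items()})
```

By construction the diagonal equals one and `re_table[a][b]` is the reciprocal of `re_table[b][a]`, so one triangle of the table carries all of the information.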
Furthermore, when the censoring level is very high ($CL = 35\%$), the RE scores deviate from 1, making the performance differences among the LLR estimators based on the solution techniques more apparent. Once again, it is evident that, especially for $n = 50$, ST is the technique most sensitive to censorship. Additionally, the results reveal that LLR-kNNI and LLR-KMW display similar RE scores in every combination. The REs of CPH in Table 2 also show a clear dominance of the LLR-based estimators in the estimation of the right-censored PLAM. This result suggests that the introduced estimator has important potential as an alternative estimator for the model of interest in survival analysis.
In Figure 3, the averaged values of the RE scores are displayed, confirming the interpretations from Table 2. The figure also shows the effects of both censorship and sample size. In panel (a), the RE values are very close to each other due to the very low censoring level ($CL = 5\%$). Panels (b) and (c) demonstrate the change in RE scores as the censoring level increases, with the differences between the estimators becoming more distinct, as mentioned earlier. Consequently, the LLR-kNNI and LLR-KMW estimators are more efficient than the LLR-ST estimator. In panel (c), the performances are once again close to each other, reflecting the large sample size ($n = 200$).
After analyzing the parametric component, the estimation of the additive nonparametric components is presented in Table 3 and Table 4. Table 3 displays the RMSE values computed for the individual functions, while Table 4 provides the ARMSE values for all simulation configurations, serving as a measure of the overall performance in estimating the nonparametric component of the right-censored PLAM. Upon initial examination, the LLR-KMW estimator demonstrates a significantly superior performance compared with the other two estimators across all simulation configurations. This dominance is further evidenced by the ARMSE results presented in Table 4, which contrast the outcomes observed in the parametric component estimation.
An interesting distinction in estimating the nonparametric component is that the performances of the introduced estimators deteriorate as the sample size increases. To explain this phenomenon, it is crucial to note that in the estimation of PLAMs, there exists a balance between the estimation of parametric and nonparametric components, which exhibits an inverse relationship. Furthermore, when data points are scattered widely around the representative smooth curve, the bias of the fitted curve increases. Additionally, the RMSE scores for the three modified LLR estimators are fairly similar to each other, confirming that the modified backfitting algorithm functions effectively with the censorship solution techniques.
Table 4 confirms the dominant role of the LLR-KMW estimator in estimating the nonparametric components of the right-censored PLAM. The success of the LLR-KMW estimator lies in its use of weighted estimation, which works well for both the parametric and nonparametric parts of the PLAM. Table 4, together with Figure 4 and Figure 5, shows a clear pattern: the LLR-KMW and LLR-kNNI estimators perform similarly in estimating the nonparametric component, and both outperform the LLR-ST estimator. In terms of estimating nonparametric components, the CPH estimator is naturally not expected to perform well due to its theoretical structure. However, it responds to changes in sample size and censoring level in a similar way to the LLR-based estimators. In summary, the introduced LLR-based estimators perform better than the classical CPH estimator.
Figure 4 illustrates the behavior of the estimators under different censoring levels with fixed sample sizes. In panels (a)–(b), the effect of the censoring level is investigated when the sample size is small ($n = 50$). It can be observed that while $f_2(t_2)$ is not significantly affected, the estimate of $f_1(t_1)$ is heavily influenced by the censored data points. It is important to note that this inference is also related to the initial values $(\beta_0, f_0)$ determined in the algorithm and their compatibility with the unknown functions $f_1$ and $f_2$ (see [9] for further discussion). Furthermore, the weakness of the LLR-ST estimator (red dotted line) is clear in all four panels (a)–(d), for both $n = 50$ and $n = 200$. Additionally, panels (c) and (d) support the findings of Table 3 and Table 4, leading to the conclusion that, for larger sample sizes, the fitted curves become more sensitive to the censoring level, resulting in a decrease in their performance.
Figure 5 investigates the effect of the sample size ($n$) for fixed censoring levels in the upper and lower panels. For $CL = 35\%$ in panels (c) and (d), LLR-KMW and LLR-ST exhibit a slightly more pronounced response to increasing sample size than LLR-kNNI. This result is expected due to the nonparametric nature of kNNI. Furthermore, the changes observed in the fitted curves are more noticeable for the estimation of $f_1(t_1)$, as was also the case in Figure 4. Additionally, for the lower censoring level ($CL = 5\%$) in panels (a)–(b), there is minimal variation between the fitted curves across sample sizes for both functions.
These trends are consistent with the findings reported by ref. [25], where a similar sensitivity of the ST-based estimator to sample size was identified in a related context. The reaction of the kNNI, KMW, and ST estimators to sample size fluctuations aligns with the observations made by ref. [26], reinforcing the notion that these estimators can exhibit greater flexibility in accommodating varying sample sizes.
To assess the performance of the introduced modified LLR estimators on real-world data and compare them with the simulation results, a real data example is presented in the following section, focusing on the hepatocellular carcinoma dataset.

6. Hepatocellular Carcinoma Data Example

In this section, the Hepatocellular Carcinoma dataset is modeled using the modified LLR estimators: LLR-ST, LLR-KMW, and LLR-kNNI. Their performances are compared with similar simulation configurations presented in Section 5. The dataset was originally presented by ref. [27] to investigate the gene expression of CXCL17 in hepatocellular carcinoma. Ref. [6] also studied this dataset, comparing parametric and semiparametric models on right-censored data. However, their study focused on a semiparametric model with a univariate nonparametric component using the covariate age. This paper considers a more realistic partially linear additive model (PLAM) that involves two nonparametric covariates.
The dataset consists of 227 data points and five explanatory variables: age, recurrence-free survival (RFS), CXCL17T (CXCT), CXCL17P (CXCP), and CXCL17N (CXCN). It should be noted that the logarithm of the response variable, overall survival time ($OS$), is used in this analysis. The parametric component of the PLAM is determined by the covariates CXCL17T, CXCL17P, and CXCL17N. Additionally, $Age$ and $RFS$ are considered as nonparametric covariates due to their nonlinear structures, as depicted in Figure 6. The figure also illustrates the censored data points versus the transformed data points using the kNNI and ST solutions. Furthermore, panels (C) and (D) display hypothetical curves that represent the data structure and nonlinearity.
The dataset contains 84 right-censored OS points, indicating a censoring level of $CL = 37\%$. This level of censorship can be classified as heavy censoring. Therefore, we expect that the results from the real data analysis may resemble the corresponding simulation configuration of $n = 200$ and $CL = 35\%$. Based on the information provided above, the partially linear additive model (PLAM) for the right-censored Hepatocellular Carcinoma dataset can be expressed as follows:
$$\log(OS_i) = \beta_0 + \beta_1 \mathrm{CXCL17T}_i + \beta_2 \mathrm{CXCL17P}_i + \beta_3 \mathrm{CXCL17N}_i + f_1(Age_i) + f_2(RFS_i) + \varepsilon_i \tag{31}$$
where $i = 1, \ldots, 227$, $\beta = (\beta_1, \beta_2, \beta_3)$, and $f = (f_1, f_2)$. When estimating the PLAM in (31), $\log(OS)$ is replaced by its ST version $\log(OS_{\hat{G}})$ and its kNNI version $\log(OS_{imp})$. KMW is also applied. The outcomes of the Hepatocellular Carcinoma dataset with the modified LLR estimators are provided in Table 5.
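For readers unfamiliar with the KMW step, the weights can be sketched as below. The sketch assumes that KMW refers to the standard Kaplan-Meier weights of Stute (ref. [13]), $w_{(i)} = \frac{\delta_{(i)}}{n - i + 1} \prod_{j=1}^{i-1} \left( \frac{n - j}{n - j + 1} \right)^{\delta_{(j)}}$, where the subscript $(i)$ denotes ordering by the observed response; the exact form used in this paper is given in its earlier sections. The function name and the tie-breaking rule (uncensored before censored) are illustrative choices.

```python
def kaplan_meier_weights(z, delta):
    """Kaplan-Meier weights for observed responses z and indicators delta (1 = uncensored)."""
    n = len(z)
    # Order by observed response; among ties, uncensored points come first.
    order = sorted(range(n), key=lambda i: (z[i], -delta[i]))
    weights = [0.0] * n
    running = 1.0  # product over earlier uncensored points of (n - j) / (n - j + 1)
    for rank, idx in enumerate(order, start=1):
        weights[idx] = delta[idx] * running / (n - rank + 1)
        if delta[idx] == 1:
            running *= (n - rank) / (n - rank + 1)
    return weights


# Three observations; the middle one is censored and therefore gets weight zero.
w_example = kaplan_meier_weights([1.0, 2.0, 3.0], [1, 0, 1])
# Without censoring, every observation receives the equal weight 1 / n.
w_full = kaplan_meier_weights([0.3, 0.1, 0.4, 0.2], [1, 1, 1, 1])
```

In the example, the mass that the censored point would have carried is passed on to the larger uncensored observation, which is how the weights compensate for censoring in a weighted least squares fit.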
Table 5 largely confirms the findings of the simulation study and demonstrates the superior performance of the LLR-KMW estimator in the estimation of the parametric component. However, in contrast to the simulation study, the LLR-ST estimator also provides results that are closer to the other two estimators, while the performance of LLR-kNNI is less satisfactory than expected. It should be noted that these conditions may be attributed to the relatively large sample size in terms of censored data. Additionally, regarding the bias of β, as anticipated, both ST and KMW yield lower values compared with kNNI, as they theoretically promise less biased estimates. Overall, the performance evaluation in Table 6 confirms that LLR-KMW exhibits the best results, which are evident from the RE scores.
In both Table 5 and Table 6, the performance of the benchmark CPH estimator is also provided and, as expected, it does not perform well, especially in the estimation of the nonparametric component. On the other hand, Table 5 shows that CPH has satisfactory bias values but large variances, which lead to large SMDE scores. This poor performance is closely related to the inability of CPH to represent smooth functions, and the RE scores confirm this inference. In summary, Table 6 confirms that the LLR-KMW estimator performs best, as reflected in its RE scores.
In Figure 7, bar plots of the calculated relative efficiencies (RE) are presented. Consistent with the findings in Table 5, LLR-KMW exhibits lower RE scores compared with the other two estimators, which aligns with the results of the simulation study. It is worth noting that while the difference in performance between the estimators may appear significant, numerically they are relatively close to each other, with the RE values scattered around one.
After assessing the estimation of the parametric component, Figure 8 presents the results of the estimation of the nonparametric components $f_1(Age)$ and $f_2(RFS)$. It is noteworthy that in this dataset, the relative failure of LLR-kNNI and the relative success of LLR-ST can be attributed to the structure of the nonparametric components. Both functions $f_1$ and $f_2$ exhibit favorable structures for the properties of LLR-ST, such as magnifying the magnitudes of uncensored data points and assigning zero to censored ones, as clearly observed in panel (ii) of Figure 8.
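The magnifying and zeroing behavior of ST can be seen in a small sketch. It assumes the synthetic data transformation of Koul et al. (ref. [12]), $\delta_i Z_i / (1 - \hat{G}(Z_i))$, with $1 - \hat{G}$ the Kaplan-Meier estimate of the survival function of the censoring variable; the exact variant used in this paper is defined in its earlier sections. The function name and the tie-breaking rule are illustrative choices.

```python
def synthetic_transform(z, delta):
    """Synthetic responses: uncensored values are inflated, censored values become zero."""
    n = len(z)
    # Order by observed response; among ties, uncensored points come first.
    order = sorted(range(n), key=lambda i: (z[i], -delta[i]))
    synthetic = [0.0] * n
    surv_c = 1.0  # Kaplan-Meier estimate of P(C > t), updated at censored points only
    for rank, idx in enumerate(order, start=1):
        if delta[idx] == 0:
            surv_c *= (n - rank) / (n - rank + 1)
        else:
            synthetic[idx] = z[idx] / surv_c
    return synthetic


# The censored middle point is set to zero; the last point is inflated from 3 to 6.
st_example = synthetic_transform([1.0, 2.0, 3.0], [1, 0, 1])
# Without censoring the transformation leaves the data unchanged.
st_full = synthetic_transform([0.3, 0.1, 0.4], [1, 1, 1])
```

The example also shows why heavy censoring makes the transformed data more variable: each additional censored point shrinks the denominator for all larger uncensored responses.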
To provide a more precise understanding of the solution procedures, the ST points and kNNI points are also included in the plots. These points illustrate why the fitted curves tend to lie below the region where all data points are scattered, especially in panel (ii). This is primarily influenced by the heavy censoring level, $CL = 37\%$. Additionally, in panel (i), one can observe the LLR-ST fitted curve being pulled down by the zeros. As expected, LLR-KMW follows a balanced approach between the other two estimators, as shown in Table 5, yielding the smallest ARMSE scores in the estimation of the nonparametric component of the PLAM.

7. Conclusions

This paper introduces three modified LLR estimators based on different censorship solutions: ST, KMW, and kNNI, to model the right-censored PLAM. For the solution methods that have a theoretical background, such as ST and KMW, the statistical properties and some asymptotic properties of LLR-ST and LLR-KMW are presented. This paper focuses on two main objectives and successfully achieves them. The two purposes of this study are to combine the backfitting LLR estimator with the censorship solutions and to compare them, both theoretically and practically. The performances of the modified LLR estimators are observed through simulation and real data studies. The following conclusions have been drawn from this study:
  • In the simulation study, the performance of the estimators is measured individually for both parametric and nonparametric components. Regarding the parametric component estimation, it is observed that LLR-KMW provides the best results, followed by LLR-kNNI. On the other hand, LLR-ST does not yield good results for any simulation configuration, and it is the estimator most affected by the censorship as its performance dramatically changes when the censoring level increases. In this case, LLR-KMW can be considered the most robust estimator, as it reacts to censorship in a more balanced way compared with the other two. In addition, the introduced estimators are also compared with the benchmark estimator for the survival model, CPH. It is observed that the LLR-based estimators perform better than the CPH, as discussed in Section 5.
  • In the estimation of the nonparametric components, the effects of sample size and censoring level are clearly different compared with the parametric component. However, similar to the parametric component, LLR-KMW exhibits dominant performance for both nonparametric functions. It is noteworthy that, as the sample size increases, all three estimators tend to provide closer performances in terms of fitted curves. Furthermore, it should be noted that the performance of the introduced estimators is highly dependent on the structure of the nonparametric component and its compatibility with the chosen censorship solution. Hence, this paper investigates the three different solutions in detail. Ultimately, because the CPH model lacks a smoother structural framework, it falls short when compared with the newly introduced estimators.
  • The analysis of the Hepatocellular Carcinoma data serves as a real-world example in this study. This dataset is selected due to its censoring level and sample size, which align closely with one of the simulation configurations ($n = 200$ and $CL = 35\%$), enabling a more realistic comparison. The results of the real data modeling demonstrate that the three introduced modified LLR estimators effectively handle the estimation of the right-censored PLAM for both parametric and nonparametric components. They exhibit a good level of agreement with the corresponding simulation configuration, with some minor differences. As expected, LLR-KMW yields the best results. Also, CPH does not show a good performance except in the bias of regression coefficients, as observed in the simulation study. Notably, one important difference between the real data and the simulation study is that LLR-ST exhibits a surprisingly better performance than LLR-kNNI in the estimation of both parametric and nonparametric components. However, this discrepancy can be attributed to the relatively large sample size ($n = 227$), and it does not imply inconsistency with the simulation results. On the contrary, it indicates a close agreement among all performances.

Author Contributions

Conceptualization: S.E.A. and D.A.; Methodology: E.Y. and D.A.; Formal analysis and investigation: D.A. and E.Y.; Writing—original draft preparation: E.Y.; Writing—review and editing: S.E.A. and E.Y.; Data Curation: E.Y.; Visualization: E.Y.; Software: E.Y.; Supervision: S.E.A. and D.A.; Funding acquisition: S.E.A. and D.A.; Resources: S.E.A. and D.A.; Supervision: S.E.A. and D.A. All authors have read and agreed to the published version of the manuscript.

Funding

The research of Dursun Aydın was supported by the TUBITAK 1002 project with the project number: 122F045.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The Hepatocellular Carcinoma dataset is publicly available in the R package "asaur".

Acknowledgments

The research of S. Ejaz Ahmed was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ruppert, D.; Wand, M.P.; Carroll, R.J. Semiparametric Regression (No. 12); Cambridge University Press: Cambridge, UK, 2003.
  2. Zhang, H.H.; Cheng, G.; Liu, Y. Linear or nonlinear? Automatic structure discovery for partially linear models. J. Am. Stat. Assoc. 2011, 106, 1099–1112.
  3. Hamilton, S.A.; Truong, Y.K. Local linear estimation in partly linear models. J. Multivar. Anal. 1997, 60, 1–19.
  4. Aydin, D.; Yilmaz, E. Modified estimators in semiparametric regression models with right-censored data. J. Stat. Comput. Simul. 2018, 88, 1470–1498.
  5. Orbe, J.; Virto, J. Penalized spline smoothing using Kaplan-Meier weights in semiparametric censored regression models. Stat. Oper. Res. Trans. 2022, 46, 95–114.
  6. Yenilmez, I.; Yılmaz, E.; Kantar, Y.M.; Aydın, D. Comparison of parametric and semi-parametric models with randomly right-censored data by weighted estimators: Two applications in colon cancer and hepatocellular carcinoma datasets. Stat. Methods Med. Res. 2022, 31, 372–387.
  7. Opsomer, J.D.; Ruppert, D.; Wand, M.P.; Holst, U.; Hössjer, O. Kriging with nonparametric variance function estimation. Biometrics 1999, 55, 704–710.
  8. Ichimura, H.; Lee, S. Characterization of the asymptotic distribution of semiparametric M-estimators. J. Econom. 2010, 159, 252–266.
  9. Ahmed, S.E.; Aydın, D.; Yılmaz, E. A survey of smoothing techniques based on a backfitting algorithm in estimation of semiparametric additive models. Wiley Interdiscip. Rev. Comput. Stat. 2023, 15, e1605.
  10. Stute, W. Nonlinear censored regression. Stat. Sin. 1999, 9, 1089–1102.
  11. Aydın, D.; Ahmed, S.E.; Yılmaz, E. Estimation of semiparametric regression model with right-censored high-dimensional data. J. Stat. Comput. Simul. 2019, 89, 985–1004.
  12. Koul, H.; Susarla, V.; Van Ryzin, J. Regression analysis with randomly right-censored data. Ann. Stat. 1981, 9, 1276–1288.
  13. Stute, W. Consistent estimation under random censorship when covariables are present. J. Multivar. Anal. 1993, 45, 89–103.
  14. Zhang, S. Nearest neighbor selection for iteratively kNN imputation. J. Syst. Softw. 2012, 85, 2541–2552.
  15. Ahmed, S.E.; Aydin, D.; Yılmaz, E. Nonparametric regression estimates based on imputation techniques for right-censored data. In International Conference on Management Science and Engineering Management; Springer International Publishing: Cham, Switzerland, 2019; pp. 109–120.
  16. Cartwright, M.H.; Shepperd, M.J.; Song, Q. Dealing with missing software project data. In Proceedings of the 5th International Workshop on Enterprise Networking and Computing in Healthcare Industry (IEEE Cat. No. 03EX717), Sydney, Australia, 5 September 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 154–165.
  17. Hastie, T.J.; Tibshirani, R.J. Generalized Additive Models; CRC Press: Boca Raton, FL, USA, 1990; Volume 43.
  18. Härdle, W.; Müller, M.; Sperlich, S.; Werwatz, A. Nonparametric and Semiparametric Models; Springer: Berlin, Germany, 2004; Volume 1.
  19. Buja, A.; Hastie, T.; Tibshirani, R. Linear smoothers and additive models. Ann. Stat. 1989, 17, 453–510.
  20. Opsomer, J.D.; Ruppert, D. A root-n consistent backfitting estimator for semiparametric additive modeling. J. Comput. Graph. Stat. 1999, 8, 715–732.
  21. Wei, C.H.; Liu, C. Statistical inference on semi-parametric partial linear additive models. J. Nonparametr. Stat. 2012, 24, 809–823.
  22. Kauermann, G.; Opsomer, J.D. Generalized cross-validation for bandwidth selection of backfitting estimates in generalized additive models. J. Comput. Graph. Stat. 2004, 13, 66–89.
  23. Chu, C.K. Bandwidth selection in nonparametric regression with general errors. J. Stat. Plan. Inference 1995, 44, 265–275.
  24. Hanley, J.A.; Parnes, M.N. Nonparametric estimation of a multivariate distribution in the presence of censoring. Biometrics 1983, 39, 129–139.
  25. Wang, Q.; Dinse, G.E. Linear regression analysis of survival data with missing censoring indicators. Lifetime Data Anal. 2011, 17, 256–279.
  26. Aydin, D.; Yilmaz, E. Semiparametric regression estimates based on some transformation techniques for right-censored data. Eskişehir Tech. Univ. J. Sci. Technol. A - Appl. Sci. Eng. 2019, 20, 1–12.
  27. Li, L.; Yan, J.; Xu, J.; Liu, C.-Q.; Zhen, Z.-J.; Chen, H.-W.; Ji, Y.; Wu, Z.-P.; Hu, J.-Y.; Zheng, L.; et al. CXCL17 expression predicts poor prognosis and correlates with adverse immune infiltration in hepatocellular carcinoma. PLoS ONE 2014, 9, e110064.
Figure 1. Working procedures of ST in panel (A) and kNNI in panel (B) for generated data.
Figure 2. Selection of the bandwidth parameter ($h$) for different scenarios and censorship solution methods when $n = 50$. In each panel, (i) and (ii) show the selection processes for $f_1(t_1)$ and $f_2(t_2)$, respectively.
Figure 3. Bar plots of averaged $RE$ scores.
Figure 4. Fitted curves to show the effect of the censoring level ($CL$). In each panel, (i) and (ii) show fitted curves for $f_1(t_1)$ and $f_2(t_2)$, respectively.
Figure 5. Fitted curves to show the effect of the sample size ($n$). In each panel, (i) and (ii) show fitted curves for $f_1(t_1)$ and $f_2(t_2)$, respectively.
Figure 6. Descriptive plots for the Hepatocellular Carcinoma dataset.
Figure 7. Bar plots of the REs for the modified LLR estimators based on the censorship solution methods.
Figure 8. Fitted curves obtained for the Hepatocellular Carcinoma dataset. Panel (i) shows $f(Age)$ and panel (ii) shows $f(RFS)$.
Table 1. Calculated $SMDE$ values for all simulation combinations.
| $n$ | $CL$ | LLR-ST | LLR-KMW | LLR-kNNI | CPH |
|---|---|---|---|---|---|
| 50 | 5% | 0.561 | 0.557 | 0.545 | 0.991 |
| 50 | 20% | 0.724 | 0.681 | 0.624 | 1.029 |
| 50 | 35% | 1.084 | 0.738 | 0.744 | 1.173 |
| 100 | 5% | 0.121 | 0.103 | 0.104 | 0.702 |
| 100 | 20% | 0.140 | 0.122 | 0.135 | 0.764 |
| 100 | 35% | 0.168 | 0.142 | 0.148 | 0.834 |
| 200 | 5% | 0.027 | 0.024 | 0.026 | 0.471 |
| 200 | 20% | 0.031 | 0.029 | 0.028 | 0.480 |
| 200 | 35% | 0.034 | 0.031 | 0.033 | 0.497 |
Bold color denotes the best performance score.
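Table 2. Comparative $RE$ scores for the modified LLR estimators.
| $n$ | $CL$ | Method | LLR-ST | LLR-KMW | LLR-kNNI | CPH |
|---|---|---|---|---|---|---|
| 50 | 5% | LLR-ST | 1.000 | 0.992 | 0.970 | 1.766 |
| 50 | 5% | LLR-KMW | 1.007 | 1.000 | 0.977 | 1.779 |
| 50 | 5% | LLR-kNNI | 1.030 | 1.023 | 1.000 | 1.818 |
| 50 | 5% | AFT | 0.566 | 0.562 | 0.549 | 1.000 |
| 50 | 35% | LLR-ST | 1.000 | 0.686 | 0.680 | 1.082 |
| 50 | 35% | LLR-KMW | 1.456 | 1.000 | 0.991 | 1.589 |
| 50 | 35% | LLR-kNNI | 1.468 | 1.008 | 1.000 | 1.576 |
| 50 | 35% | AFT | 0.924 | 0.629 | 0.634 | 1.000 |
| 200 | 5% | LLR-ST | 1.000 | 0.974 | 0.918 | 6.333 |
| 200 | 5% | LLR-KMW | 1.025 | 1.000 | 0.942 | 7.125 |
| 200 | 5% | LLR-kNNI | 1.088 | 1.060 | 1.000 | 6.576 |
| 200 | 5% | AFT | 0.158 | 0.140 | 0.152 | 1.000 |
| 200 | 35% | LLR-ST | 1.000 | 0.963 | 0.920 | 5.794 |
| 200 | 35% | LLR-KMW | 1.038 | 1.000 | 0.956 | 6.354 |
| 200 | 35% | LLR-kNNI | 1.085 | 1.045 | 1.000 | 5.969 |
| 200 | 35% | AFT | 0.173 | 0.157 | 0.167 | 1.000 |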
Bold color denotes the best performance score.
Table 3. RMSE values of the individual nonparametric functions $f_1(t_1)$ and $f_2(t_2)$.
| $n$ | $CL$ | $f_1(t_1)$: LLR-ST | $f_1(t_1)$: LLR-KMW | $f_1(t_1)$: LLR-kNNI | $f_2(t_2)$: LLR-ST | $f_2(t_2)$: LLR-KMW | $f_2(t_2)$: LLR-kNNI |
|---|---|---|---|---|---|---|---|
| 50 | 5% | 0.283 | 0.256 | 0.260 | 0.491 | 0.473 | 0.478 |
| 50 | 20% | 0.353 | 0.241 | 0.271 | 0.535 | 0.433 | 0.483 |
| 50 | 35% | 0.447 | 0.256 | 0.273 | 0.613 | 0.406 | 0.479 |
| 100 | 5% | 0.383 | 0.340 | 0.364 | 0.689 | 0.637 | 0.668 |
| 100 | 20% | 0.408 | 0.319 | 0.366 | 0.704 | 0.581 | 0.657 |
| 100 | 35% | 0.466 | 0.323 | 0.371 | 0.754 | 0.527 | 0.655 |
| 200 | 5% | 0.516 | 0.483 | 0.507 | 0.936 | 0.896 | 0.931 |
| 200 | 20% | 0.537 | 0.438 | 0.514 | 0.967 | 0.800 | 0.927 |
| 200 | 35% | 0.557 | 0.452 | 0.517 | 1.010 | 0.727 | 0.923 |
Bold color denotes the best performance score.
Table 4. $ARMSE(\hat{f}_1, \hat{f}_2)$ values for all simulation configurations.
| n | CL | LLR-ST | LLR-KMW | LLR-kNNI | CPH |
|---|---|---|---|---|---|
| 50 | 5% | 0.281 | 0.267 | 0.271 | 0.872 |
| 50 | 20% | 0.319 | 0.247 | 0.275 | 0.967 |
| 50 | 35% | 0.374 | 0.233 | 0.276 | 1.008 |
| 100 | 5% | 0.393 | 0.362 | 0.386 | 0.778 |
| 100 | 20% | 0.402 | 0.334 | 0.377 | 0.814 |
| 100 | 35% | 0.442 | 0.310 | 0.381 | 0.860 |
| 200 | 5% | 0.544 | 0.519 | 0.539 | 0.775 |
| 200 | 20% | 0.565 | 0.463 | 0.541 | 0.784 |
| 200 | 35% | 0.583 | 0.438 | 0.538 | 0.841 |
Bold color denotes the best performance score.
Table 5. Performance scores of the three introduced estimators.
| Measure | LLR-ST | LLR-KMW | LLR-kNNI | CPH |
|---|---|---|---|---|
| $Bias(\beta_1; \beta_2; \beta_3)$ | 0.42; 0.17; 0.08 | 0.30; 0.16; 0.17 | 0.40; 0.20; 0.21 | 0.24; 1.65; 0.40 |
| $Var(\beta_1; \beta_2; \beta_3)$ | 0.08; 0.26; 0.05 | 0.05; 0.24; 0.08 | 0.06; 0.26; 0.09 | 0.15; 0.68; 0.40 |
| $SMDE$ | 0.220 | 0.154 | 0.256 | 1.341 |
| $RMSE(f_1(Age))$ | 0.440 | 0.533 | 0.491 | - |
| $RMSE(f_2(RFS))$ | 0.350 | 0.168 | 0.208 | - |
| $ARMSE(f_1, f_2)$ | 0.395 | 0.350 | 0.350 | 1.822 |
Bold color denotes the best performance score.
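As a reading aid for Table 5, the short sketch below checks how the ARMSE row relates to the two RMSE rows. It assumes that ARMSE is the arithmetic mean of the per-function RMSE values; this section does not state the definition, so the function `armse` and the dictionary `table5` are illustrative names, and the check only shows that the assumption agrees with the reported figures to three decimals. CPH is left out because Table 5 reports no per-function RMSE for it.

```python
# Values copied from Table 5: (RMSE f1(Age), RMSE f2(RFS), reported ARMSE).
table5 = {
    "LLR-ST": (0.440, 0.350, 0.395),
    "LLR-KMW": (0.533, 0.168, 0.350),
    "LLR-kNNI": (0.491, 0.208, 0.350),
}


def armse(rmse_values):
    """Assumed definition: arithmetic mean of the per-function RMSE values."""
    return sum(rmse_values) / len(rmse_values)


for name, (rmse_f1, rmse_f2, reported) in table5.items():
    computed = armse([rmse_f1, rmse_f2])
    # Compare the recomputed mean with the value printed in the table.
    print(f"{name}: computed {computed:.4f}, reported {reported:.3f}")
```

Under this assumption the recomputed means differ from the tabulated ARMSE values by at most rounding error.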
Table 6. Relative efficiencies (REs).
| Estimator | LLR-ST | LLR-KMW | LLR-kNNI | CPH |
|---|---|---|---|---|
| LLR-ST | 1.000 | 0.699 | 1.160 | 6.095 |
| LLR-KMW | 1.429 | 1.000 | 1.659 | 8.707 |
| LLR-kNNI | 0.861 | 0.602 | 1.000 | 5.238 |
| CPH | 0.164 | 0.114 | 0.190 | 1.000 |
Bold color denotes the best performance score.
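The entries of Table 6 can be related to the SMDE row of Table 5 with a few lines of arithmetic. The sketch below assumes that the RE of a row estimator against a column estimator is the ratio SMDE(column) / SMDE(row); that formula is not stated in this section and is inferred here only because it reproduces the tabulated values to within about 0.005, a gap consistent with the SMDE inputs being rounded to three decimals. The names `smde` and `relative_efficiency` are hypothetical.

```python
# SMDE values copied from Table 5 (rounded to three decimals in the source).
smde = {"LLR-ST": 0.220, "LLR-KMW": 0.154, "LLR-kNNI": 0.256, "CPH": 1.341}


def relative_efficiency(smde_values):
    """Assumed definition: RE(row, col) = SMDE(col) / SMDE(row)."""
    return {
        row: {col: smde_values[col] / smde_values[row] for col in smde_values}
        for row in smde_values
    }


re_table = relative_efficiency(smde)
for row, cols in re_table.items():
    # One line per row estimator, columns in the same order as Table 6.
    print(row, " ".join(f"{value:.3f}" for value in cols.values()))
```

Read this way, a value above 1 in a row means the row estimator has the smaller SMDE of the pair, so the LLR-KMW row, whose off-diagonal entries all exceed 1, corresponds to the lowest SMDE in Table 5.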
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Yılmaz, E.; Aydın, D.; Ahmed, S.E. Modified Local Linear Estimators in Partially Linear Additive Models with Right-Censored Data Based on Different Censorship Solution Techniques. Entropy 2023, 25, 1307. https://doi.org/10.3390/e25091307
