Article

Latent Class Regression Utilizing Fuzzy Clusterwise Generalized Structured Component Analysis

1 Department of Psychological and Quantitative Foundations, College of Education, University of Iowa, Iowa City, IA 52242, USA
2 Department of Educational Research Methodology, School of Education, University of North Carolina at Greensboro, Greensboro, NC 27412, USA
3 Department of Education, College of Educational Sciences, Yonsei University, Seoul 03722, Korea
* Author to whom correspondence should be addressed.
Mathematics 2020, 8(11), 2076; https://doi.org/10.3390/math8112076
Submission received: 30 October 2020 / Revised: 17 November 2020 / Accepted: 18 November 2020 / Published: 20 November 2020
(This article belongs to the Special Issue Operations Research Using Fuzzy Sets Theory)

Abstract: Latent class analysis (LCA) has been applied in many research areas to disentangle the heterogeneity of a population. Despite its popularity, its estimation has been limited to maximum likelihood estimation (MLE), which requires large samples to satisfy both the multivariate normality assumption and the local independence assumption. Although many suggestions regarding adequate sample sizes have been proposed, researchers continue to apply LCA with relatively small samples. When covariates are involved, the estimation issues are encountered even more often. In this study, we suggest a different estimation approach for LCA with covariates, also known as latent class regression (LCR), using a fuzzy clustering method and generalized structured component analysis (GSCA). This new approach is free from the distributional assumption and stable in estimating parameters. Parallel to the three-step approach used in MLE-based LCA, we extend an algorithm of fuzzy clusterwise GSCA to LCR. The proposed algorithm is demonstrated with empirical data containing both categorical and continuous covariates. Because the proposed algorithm can be used for relatively small samples in LCR without requiring a multivariate normality assumption, it is more applicable to the social, behavioral, and health sciences.

1. Introduction

Latent class analysis (LCA [1,2,3]) is a popular statistical tool to identify the relationship between a categorical latent variable and observed categorical variables in a variety of research areas such as education [4], psychology [5], sociology [6], medicine [7,8], and public health [9]. LCA has been used to classify mutually exclusive heterogeneous subpopulations, also known as latent classes, based on participants’ responses collected as a set of observed categorical variables. In other words, LCA enumerates the latent classes in which sample units respond in similar patterns in terms of observed categorical variables. Model specification of LCA includes two sets of parameters: class membership probabilities and item-response probabilities within each class. The characteristics of each class are identified using the item-response probabilities, and each participant’s likelihood of belonging to each class is predicted based on the parameter estimates. Owing to the advent of maximum likelihood estimation methods (MLE [10]) and the development of software packages including Mplus [11], Proc LCA [12], Latent Gold [13], and poLCA [14], LCA has recently become more popular in a variety of research areas.
Although LCA has become more popular, there are still concerns regarding its estimation method. The aforementioned MLE of LCA using the expectation-maximization (EM) algorithm [15] causes several estimation issues [16]. The most common issue is related to model identification. When a model is underidentified, i.e., the amount of observed information from the response data is smaller than the number of unknown parameters, a unique solution may not be easily obtained because of the occurrence of multiple local maxima. In order to avoid the multiple local maxima and obtain the global maximum, most statistical programs for LCA use multiple sets of initial/starting values. However, a unique maximum solution is not found in all cases, because multiple sets of initial values do not guarantee convergence to a definite and unique solution.
Another identification issue may occur when the number of unknown parameters in the model is large. As the number of parameters increases (e.g., with the number of classes), more data are required in the contingency table, represented by the frequency table across item response categories and latent classes. However, a large contingency table often faces the issue of sparseness, because not all of its cells have a large enough cell size. Increasing the sample size may alleviate sparseness in the contingency table, because a larger sample tends to fill cells that previously had few observations [16]. In practice, however, a larger sample size does not always mitigate sparseness when there are too many cells. Lastly, MLE-based estimation assumes multivariate normality of the set of parameters. Multivariate normality is a strong assumption that is hard to confirm and easy to violate. As the number of parameters increases, the estimation is more likely to produce biases due to deviation from the normality assumption. In sum, the MLE-based estimation approach requires a large sample size to avoid estimation issues, although a large sample is not a panacea for resolving them. Moreover, large samples are often not achieved in empirical data analyses, and what counts as “large” also depends on many factors in the sample size calculation [17,18].
To resolve these estimation issues, the authors of [19] proposed an alternative estimation approach for LCA using fuzzy clusterwise generalized structured component analysis (gscaLCA). Although we provided the conceptual and computational algorithm of gscaLCA, the model used in our previous study was limited to a simple LCA model focusing only on enumerating classes without considering covariates. Modeling covariates in LCA, also known as latent class regression (LCR), has been widely used because it can demonstrate how covariates affect membership prevalence as well as membership probabilities [16]. Considering the popularity of LCR, the current study aims to propose an algorithm to estimate LCR by updating the gscaLCA algorithm. The study contributes a new and more flexible method for applying LCR that is not bound by the normality assumption and that yields stable and reliable estimates. This paper is organized as follows. First, we review the theoretical framework of generalized structured component analysis (GSCA [20]), the procedure for gscaLCA, and MLE-based LCA with covariates, which are the underlying concepts behind the algorithm of gscaLCA with covariates, hereinafter referred to as gscaLCR. Second, we provide a detailed algorithm for gscaLCR. These are followed by an illustration with empirical data from the National Longitudinal Study of Adolescent to Adult Health (Add Health [21]).

2. GscaLCR Algorithm

2.1. Generalized Structured Component Analysis (GSCA)

GSCA is a component-based approach to structural equation modeling (SEM), which encompasses three sub-models: measurement, structural, and weighted relation sub-models [22]. The measurement sub-model explains the relationship between latent variables and indicators, and the structural sub-model refers to the relationship between latent variables and/or between a latent variable and observed variables other than indicators, as in factor-based SEM. Lastly, the weighted relation sub-model defines a latent variable as a weighted composite, or component, of indicators, which is the unique part of GSCA. This weighted relation sub-model eases estimation by allowing latent variable scores to be calculated as component scores, which in turn facilitates estimating the parameters of the other two sub-models through alternating least squares estimation (LSE [23]). Owing to this alternating LSE, the existence of a global estimation criterion, and the compatibility with bootstrap methods, GSCA is often beneficial over factor-based SEM (for example, for a complicated model such as a dynamic system using fMRI [24]).

2.2. Generalized Structured Component Analysis for Latent Class Analysis (gscaLCA)

The gscaLCA algorithm was recently proposed by the authors of [19]. It was developed by combining fuzzy clusterwise GSCA [25] and optimal scaling in GSCA [22], which allows the algorithm to fit latent class analysis (LCA) within a component-based SEM framework. Fuzzy clusterwise GSCA updates membership probabilities according to the distance from centroids, which determines the latent classes, while simultaneously estimating the parameters of the GSCA model. Through this alternating estimation procedure, the membership probabilities are estimated so that the latent classes assigned to each sample unit take into account the relationship between observed variables and latent variables. Prior to the gscaLCA algorithm, fuzzy clusterwise GSCA had been proposed [22] but was limited to cases where the outcome variables are continuous. To extend fuzzy clusterwise GSCA from continuous to discrete outcome variables, the gscaLCA algorithm [19] employed the optimal scaling technique [26], which is also known as optimal data transformation [22].
The algorithm of gscaLCA estimates the parameters of the three sub-models with optimal scaling, followed by fuzzy clustering [27,28,29,30,31,32,33]. The fuzzy clustering updates the individuals’ memberships based on the optimally scaled data at each iteration. These two processes, updating the parameters with optimal scaling and fuzzy clustering, are applied alternately with the aim of minimizing the residuals of the models in the LSE of gscaLCA [19].
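The alternating structure described above can be sketched in Python. This is a minimal stand-in, not the gscaLCA implementation: cluster centroids take the place of the per-class GSCA parameters, and squared distances to centroids take the place of the per-class model residuals, so only the alternation (model update for fixed memberships, membership update for fixed model) is illustrated.

```python
import numpy as np

def fuzzy_alternate(S, K, m=2.0, tol=1e-8, max_iter=200, seed=0):
    """Sketch of alternating estimation in fuzzy clusterwise methods.

    Stand-in: per-class centroids play the role of the per-class GSCA
    parameters, and the squared distance to a centroid plays the role of
    the per-class model residual d_ki; the membership update is the same.
    """
    rng = np.random.default_rng(seed)
    N = S.shape[0]
    U = rng.dirichlet(np.ones(K), size=N)      # N x K memberships, rows sum to 1
    centroids = None
    for _ in range(max_iter):
        Um = U ** m
        # Analogue of Step 1.1: update the per-class model for fixed memberships.
        centroids = (Um.T @ S) / Um.sum(axis=0)[:, None]
        # Analogue of Step 1.2: update memberships for the fixed model.
        d = ((S[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centroids
```

On well-separated data, the memberships converge toward crisp assignments, while units near a class boundary retain intermediate probabilities, which is the behavior the fuzzifier m controls.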

2.3. Latent Class Regression

Introducing covariates into MLE-based LCA, i.e., latent class regression (LCR), has been developed by many researchers, and the analytic method has been widely used [34,35,36,37,38]. The covariates in LCA play the role of predicting or explaining class memberships [3,16]. Two approaches have typically been used to estimate the parameters of LCR: a one-step and a three-step approach. The one-step approach [39,40] fits the LCA model and the multinomial logistic regression of latent class membership on covariates simultaneously, which most software packages employ (e.g., poLCA [14]). On the other hand, the three-step approach consists of estimating the parameters of the LCA model (step 1), assigning a latent class to each subject based on the estimated membership probabilities (step 2), and fitting a structural model with the latent class scores and covariates (step 3). The last step is, for example, a multinomial logistic regression using the latent class scores as the dependent variable and the covariates as independent variables. The three-step approach follows factor score regression [41] or the latent structure model [42], i.e., a sequential ad hoc approach based on estimated latent variable scores after the latent variable models have been estimated.
The one-step approach estimates the parameters of the LCA model and fits a structural model simultaneously, which can cause serious bias due to model misspecification. In addition, LCA with covariates is hard to estimate in one step because of the large number of parameters, which often exceeds the known information [3,42]. Compared to the one-step approach, the three-step approach is less likely to suffer from identification issues and from bias due to model misspecification. On the other hand, the three-step approach in MLE-based LCA often estimates the relationship between covariates and latent class membership incorrectly because of classification error. Classification error refers to the difference between the predicted latent class scores, taken as consistent estimates, and the true latent class scores, and it leads to downward bias in parameter estimates [42]. The three-step approach has been improved to diminish the effect of the classification error: the improved approach corrects the bias and reflects the effect of covariates in the structural model of the latent structure model separately from estimating the parameters of LCA [3,42,43]. This improved procedure has become the standard for LCA with covariates in MLE-based LCA. However, the updated three-step approach in LCR is still not entirely free from the estimation issues associated with the multivariate normality assumption. In gscaLCA with covariates (gscaLCR), by contrast, such bias does not occur, because the estimation procedure does not require the normality assumption that causes the classification error in MLE-based LCA. In the current study, we discuss the three-step approach for LCA using GSCA with both categorical and continuous covariates.

2.4. Generalized Structured Component Analysis for Latent Class Regression (gscaLCR)

Building on the advantages of gscaLCA, such as avoiding the estimation burden and relaxing the normality assumption, the current study demonstrates a new algorithm with which to implement LCA with covariates using GSCA, gscaLCR. The proposed algorithm is based on fuzzy clusterwise GSCA and aligns with the aforementioned three-step approach in MLE-based LCA. The gscaLCR algorithm avoids unnecessary re-estimation of the LCA when covariates are added or removed, and it also enables researchers to model the influence of covariates in determining memberships. Encompassing these advantages, the gscaLCR algorithm consists of three main steps: (STEP 1) estimating the parameters of gscaLCA, (STEP 2) assigning the memberships, and (STEP 3) estimating the effect of covariates on membership. The flowchart of the algorithm is shown in Figure 1.

2.4.1. Step 1: Estimate Parameters of gscaLCA

To fit gscaLCA to the data, specifying the three sub-models of GSCA is necessary: the measurement sub-model (denoted by C), the structural sub-model (denoted by B), and the weighted relation sub-model (denoted by W). Covariates affecting the relationships between observed variables can be involved in the three sub-models. The inclusion of covariates in the GSCA model reflects the relationships among observed variables and affects the clustering results. However, this inclusion is not for examining the effects of covariates but for controlling for them when enumerating the latent classes. Whether to include covariate effects in the GSCA model is at the researchers’ discretion, which is beyond the scope of this study. An example of the model specification with covariates in this step is shown in Figure 2. The sub-models of GSCA are integrated into a single model equation as follows [22]:
$$\mathbf{V}'\mathbf{s}_i = \mathbf{A}'\mathbf{W}'\mathbf{s}_i + \mathbf{e}_i, \qquad (1)$$
where $\mathbf{V} = [\mathbf{I}, \mathbf{W}]$ and $\mathbf{A} = [\mathbf{C}, \mathbf{B}]$ for $\mathbf{I}$ an identity matrix, $'$ denotes the matrix transpose, $\mathbf{e}_i$ is a vector of residuals of the observed and latent variables, and $\mathbf{s}_i$ is the optimally scaled data of subject $i$ (i.e., $\mathbf{s}_i = OS(\mathbf{z}_i)$, where $OS$ refers to the transformation of the original categorical variables into their optimally scaled counterparts, and $\mathbf{z}_i$ is the original observed categorical data of subject $i$). The parameters of GSCA are estimated by minimizing $\mathbf{e}_i$.
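The single-equation form can be illustrated with toy matrices in Python. All dimensions and numeric values here are hypothetical, chosen only to show the shapes of V = [I, W] and A = [C, B] and how the residual vector is formed.

```python
import numpy as np

# Illustrative dimensions (hypothetical): J = 4 indicators, T = 2 components.
J, T = 4, 2
rng = np.random.default_rng(1)

W = np.zeros((J, T)); W[:2, 0] = 0.5; W[2:, 1] = 0.5   # weighted relation sub-model
C = np.zeros((T, J)); C[0, :2] = 0.8; C[1, 2:] = 0.8   # measurement sub-model (loadings)
B = np.array([[0.0, 0.3],                               # structural sub-model (paths)
              [0.0, 0.0]])

V = np.hstack([np.eye(J), W])   # J x (J + T):  V = [I, W]
A = np.hstack([C, B])           # T x (J + T):  A = [C, B]

s_i = rng.standard_normal(J)                 # one subject's optimally scaled data
e_i = V.T @ s_i - A.T @ (W.T @ s_i)          # residual vector in Equation (1)
```

The residual has one entry per observed variable plus one per latent variable, which is what the least squares criterion minimizes.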
Specifically, we aim to minimize the following criterion for fuzzy clusterwise generalized structured component analysis (see Appendix 4.1 in [22])
$$\phi = \sum_{k=1}^{K}\sum_{i=1}^{N} u_{ki}^{m}\, SS\!\left(\mathbf{V}_k'\mathbf{s}_i - \mathbf{A}_k'\mathbf{W}_k'\mathbf{s}_i\right) = \sum_{k=1}^{K} SS\!\left(\mathbf{S}\mathbf{V}_k - \mathbf{S}\mathbf{W}_k\mathbf{A}_k\right)_{\mathbf{U}_k^m}, \qquad (2)$$
with respect to $u_{ki}$, $\mathbf{W}_k$, and $\mathbf{A}_k$, under the constraint
$$\sum_{k=1}^{K} u_{ki} = 1.$$
$\mathbf{W}_k$, $\mathbf{A}_k$, and $\mathbf{V}_k$ are the weighted relation model matrix, the combined measurement and structural model matrix, and the combined identity and weighted relation model matrix for latent class (cluster) $k$, respectively. The number of latent classes in Equation (2) is $K$; Equation (2) reduces to Equation (1) when $K = 1$. The matrix $\mathbf{S} = [\mathbf{s}_1, \ldots, \mathbf{s}_N]'$ is an $N \times J$ optimally scaled data matrix of $N$ subjects and $J$ observed variables. Lastly, $u_{ki}$ refers to the fuzzy membership, or membership probability, of subject $i$ in latent class $k$, and the power $m$ applied to $u_{ki}$ is the fuzzifier, ranging from 1 to infinity, which is determined in advance; in practice, $m = 2$ is the most popular choice in fuzzy clustering [27,30,44,45]. In the second form of Equation (2), $\mathbf{U}_k^m$ denotes a diagonal matrix of fuzzy memberships, $\mathbf{U}_k^m = diag(u_{k1}^m, \ldots, u_{kN}^m)$, and $SS(\mathbf{M})_{\mathbf{U}_k^m}$ equals $trace(\mathbf{M}'\mathbf{U}_k^m\mathbf{M})$, where $\mathbf{M} = \mathbf{S}\mathbf{V}_k - \mathbf{S}\mathbf{W}_k\mathbf{A}_k$. To minimize the criterion in Equation (2), the following two steps are alternated until convergence.

Step 1.1. Update the Parameters of Generalized Structured Component Analysis ($\mathbf{W}_k$ and $\mathbf{A}_k$) in Each Cluster for Fixed $\mathbf{U}_k^m$

This step locates the parameters $\mathbf{W}_k$ and $\mathbf{A}_k$ that minimize the following:
$$\phi = \sum_{k=1}^{K} SS\!\left((\mathbf{U}_k^m)^{1/2}(\mathbf{S}\mathbf{V}_k - \mathbf{S}\mathbf{W}_k\mathbf{A}_k)\right) = \sum_{k=1}^{K} SS\!\left(\mathbf{S}_k\mathbf{V}_k - \mathbf{S}_k\mathbf{W}_k\mathbf{A}_k\right), \qquad (3)$$
where
$$\mathbf{S}_k = (\mathbf{U}_k^m)^{1/2}\mathbf{S}.$$
This minimizes the residual $\mathbf{e}_i$ in Equation (1) for each latent class (cluster), meaning that the parameters $\mathbf{W}_k$, $\mathbf{A}_k$, and $\mathbf{V}_k$ are updated within each latent class (see the details in [22]). By combining the criterion across the multiple latent classes, we can minimize the criterion in Equation (3). Another task in this step is the transformation of the original categorical variables into their optimally scaled counterparts, because the indicators in LCA are categorical. This optimal scaling procedure is executed to maintain the measurement characteristics of the observed data: the optimally scaled data for a nominal variable are restricted to take an identical value for observations that fall in the same category, and the optimally scaled data for an ordinal variable are required to preserve the observed order after transformation. The optimal scaling transformation is executed at each iteration until convergence.
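The nominal restriction has a simple least-squares form: every observation in the same category receives the same quantified value, namely the category mean of the model-implied values. A minimal sketch, assuming the model-implied values are available as a `target` vector (the function and variable names are illustrative):

```python
import numpy as np

def optimal_scale_nominal(z, target):
    """Least-squares optimal scaling for a nominal variable: every
    observation in the same category receives the same quantified value,
    namely the mean of the model-implied values (`target`) in that category."""
    z = np.asarray(z)
    target = np.asarray(target, dtype=float)
    s = np.empty(len(z))
    for cat in np.unique(z):
        mask = z == cat
        s[mask] = target[mask].mean()
    return s
```

For an ordinal variable, the additional monotonicity restriction would typically be imposed on top of this, e.g., via a pooled-adjacent-violators step, so that the quantified values preserve the observed category order.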

Step 1.2. Update the Membership Parameter $u_{ki}$ for Fixed $\mathbf{W}_k$ and $\mathbf{A}_k$

Finding $u_{ki}$ by minimizing Equation (2) with respect to $u_{ki}$ subject to $\sum_{k=1}^{K} u_{ki} = 1$ is equivalent to solving the system of equations obtained from the partial derivatives of a Lagrange function, $\phi^*$, defined by
$$\phi^* = \phi + \lambda\left(\sum_{k=1}^{K} u_{ki} - 1\right) = \sum_{k=1}^{K}\sum_{i=1}^{N} u_{ki}^{m} d_{ki} + \lambda\left(\sum_{k=1}^{K} u_{ki} - 1\right), \qquad (4)$$
where $\lambda$ is a Lagrange multiplier, and
$$d_{ki} = SS\!\left(\mathbf{V}_k'\mathbf{s}_i - \mathbf{A}_k'\mathbf{W}_k'\mathbf{s}_i\right).$$
Setting the partial derivatives to zero yields $\hat{u}_{ki} = \left(\frac{\lambda}{m\,d_{ki}}\right)^{\frac{1}{m-1}}$ with $\hat{\lambda} = \left(\sum_{k=1}^{K}\left(\frac{1}{m\,d_{ki}}\right)^{\frac{1}{m-1}}\right)^{-(m-1)}$, which can be used for the optimization [22,30].
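After the normalizing constant is absorbed, this update has the familiar fuzzy-clustering closed form: $u_{ki}$ is proportional to $d_{ki}^{-1/(m-1)}$, rescaled so the memberships of each subject sum to one. A small sketch:

```python
import numpy as np

def update_memberships(d, m=2.0, eps=1e-12):
    """Normalized closed-form membership update: u_ki is proportional to
    d_ki^(-1/(m-1)), rescaled so memberships sum to one over the K classes.
    Rows of `d` are subjects i, columns are classes k; `eps` guards against
    division by zero when a subject sits exactly on a class model."""
    inv = (np.asarray(d, dtype=float) + eps) ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)
```

With m = 2, a subject whose residual for class 1 is three times its residual for class 2 receives memberships of 0.25 and 0.75, respectively.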

2.4.2. Step 2: Assign Each Subject into a Latent Class Based on the Parameter Estimates from the Previous Step

Two types of assignment are considered in this step: hard partitioning and soft partitioning [3,46]. Hard partitioning results from modal or random assignment. Under modal assignment, a subject’s membership is determined by the latent indicator function $v_{ki}$, which takes the value 1 for the class in which the subject’s membership probability ($u_{ki}$) is largest, and 0 otherwise. On the other hand, soft partitioning focuses on proportional assignment: a subject belongs to each latent class proportionally to the estimated membership probabilities ($u_{ki}$) rather than being assigned a single membership.
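The two assignment rules can be sketched as follows (`U` is the N by K matrix of membership probabilities from Step 1):

```python
import numpy as np

def hard_partition(U):
    """Modal assignment: v_ki = 1 for the class with the largest
    membership probability, 0 otherwise."""
    V = np.zeros_like(U)
    V[np.arange(U.shape[0]), np.argmax(U, axis=1)] = 1.0
    return V

def soft_partition(U):
    """Proportional assignment: keep the (renormalized) membership
    probabilities as class weights."""
    return U / U.sum(axis=1, keepdims=True)
```

Hard partitioning discards the uncertainty in the memberships, whereas soft partitioning carries it forward into Step 3 as weights.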

2.4.3. Step 3: Fit a Regression Model with Covariates

Regardless of which partitioning assignment is selected in Step 2, two possible procedures can follow. With hard partitioning, a multinomial or binomial logistic regression on the covariates can be used, treating the assigned latent classes as the dependent variable. When $V_i$ refers to the categorical membership of subject $i$ obtained via hard partitioning, it takes one of $K$ categories; for example, $V_i$ can be 1, 2, or 3 when the number of latent classes $K$ is 3. The multinomial logistic regression on the covariates can be presented as
$$\log\left(\frac{P(V_i = k)}{P(V_i = k_0)}\right) = \beta_{0,k} + \beta_{1,k}\,cov_1 + \cdots + \beta_{C,k}\,cov_C, \qquad (5)$$
where $k_0$ is the reference category, $cov_c$ indicates the $c$th covariate, and $\beta_{c,k}$ is the regression coefficient for class $k$ and covariate $c = 1, \ldots, C$, where $C$ is the number of covariates. The other possible regression model is a binomial logistic regression with dummy variables ($v_{ki}$), i.e., class $k$ vs. the others. With a three-latent-class model, we have three dummy variables. For each latent class, we can fit the binomial regression
$$logit(v_{k} = 1) = \log\frac{\pi_{v_k}}{1 - \pi_{v_k}} = \beta_{0,k} + \beta_{1,k}\,cov_1 + \cdots + \beta_{C,k}\,cov_C, \qquad (6)$$
where $\pi_{v_k}$ is the probability of being in the $k$th class. When $K = 2$, fitting the multinomial or the binomial logistic regression gives the same results. The choice between them is at the researchers’ discretion, and a comparison of the two regressions is unnecessary from a methodological point of view: multinomial logistic regression can be used when a comparison between two particular latent classes with respect to covariate effects is required, whereas binary logistic regression can be used when the change in membership of each latent class is the main interest.
When soft partitioning is used, we can still fit multinomial and binomial logistic regressions as in Step 3 above; however, the membership probabilities ($u_{ki}$) need to enter the regression models as weights, which adjust each unit’s contribution to the regression. Although such weighted regressions are easy to fit, they may not be what researchers are looking for in LCA in some sense, because LCA mainly purports to identify heterogeneous subgroups rather than their likelihoods; this issue, however, is beyond the scope of this study. The effects of the covariates are examined via hypothesis tests using bootstrap methods, which provide interval estimates under LSE for the multinomial and binomial logistic regressions.
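A sketch of the weighted Step-3 regression under soft partitioning, for the binomial case for one class: each subject contributes an "in class k" row weighted by $u_{ki}$ and a complementary row weighted by $1 - u_{ki}$. The IRLS fitting routine below is a generic stand-in, not the bootstrap-based inferential procedure described in the text, and all function names are illustrative.

```python
import numpy as np

def weighted_logit(X, y, w, n_iter=50):
    """Weighted binomial logistic regression via Newton-Raphson (IRLS).
    `w` are case weights, here the membership probabilities u_ki."""
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (w * (y - p))
        H = (X1 * (w * p * (1 - p))[:, None]).T @ X1
        beta += np.linalg.solve(H + 1e-8 * np.eye(len(beta)), grad)
    return beta

def fit_soft_binomial(U, X, k):
    """Step 3 under soft partitioning, for class k: regress the dummy
    'in class k' on the covariates, duplicating each subject into a
    'yes' row (weight u_ki) and a 'no' row (weight 1 - u_ki)."""
    u_k = U[:, k]
    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(len(u_k)), np.zeros(len(u_k))])
    w2 = np.concatenate([u_k, 1.0 - u_k])
    return weighted_logit(X2, y2, w2)
```

The row-duplication trick makes the weighted likelihood identical to treating each subject as fractionally present in class k, which is exactly the proportional-assignment idea.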

3. An Illustration of Fitting gscaLCR into Empirical Data

3.1. Empirical Data

Data from the National Longitudinal Study of Adolescent to Adult Health (Add Health) were used in this illustration. The Add Health study was conducted to investigate how adolescents’ health trajectories influence the adult life course. As a longitudinal study, Add Health has had five waves to date. In this study, we focused on Wave IV data, which represent relatively active substance usage. The data consisted of 5144 subjects aged 24 to 32. Seven variables about substance usage and demographics were considered for this illustration of fitting gscaLCR. Five are dichotomous variables regarding the use of cigarettes, alcohol, marijuana, cocaine, and other illegal drugs. Two additional variables served as covariates indicating gender and education level. A total of 65.2% of subjects had smoked an entire cigarette, 80.3% had drunk beer, wine, or liquor more than two or three times, 54.7% had used marijuana, 19.1% said yes to cocaine usage, and 21.5% said yes to other types of illegal drugs [21]. The gender covariate is coded as a dichotomous variable, and 54.0% of participants were female. The education-level covariate is treated as a continuous variable ranging from 1 to 8: (1) did not graduate high school (8%), (2) graduated high school (16%), (3) vocational training after high school (10%), (4) college (33%), (5) college completed (20%), (6) graduate school (4%), (7) completed master’s degree (5%), and (8) beyond a master’s degree (4%). The mean education level was 3.923, which on this scale is close to the college level.

3.2. Method

To focus on demonstrating the gscaLCR algorithm, the current study used the three-class solution for the Add Health data found in the previous study [19]. It is worth noting that we aim to introduce the procedure of fitting covariates into an existing LCA model rather than enumerating the number of latent classes. For illustration of the gscaLCR algorithm, two different GSCA models were considered: (Model 1) a gscaLCA model without any covariates and (Model 2) a gscaLCA model with the gender covariate in Step 1 when enumerating the latent classes. The latter case, Model 2, is illustrated in Figure 2; the model specified there assumes that each observed variable is defined by a phantom latent variable and that those phantom variables are associated with each other.
Both the hard and soft partitioning approaches described in Step 2 were employed. In Step 3, two methods of fitting regressions (multinomial or binomial logistic) of the covariates on the latent classes were considered. Thus, we demonstrate eight sets of gscaLCR results (Models 1 and 2, hard and soft partitioning, and multinomial and binomial logistic regressions), following the flowchart of the gscaLCR algorithm in Figure 1. All analyses were executed with the gscaLCA package in R [47], in which we implemented the gscaLCR functions.

3.3. Results

The overall relationship between latent class membership prevalence and the five indicators is displayed in Figure 3. The results were consistent with those of our previous study, and we used the same names for the latent classes in the Add Health data as in [19]: (Class 1) smoking and drinking, (Class 2) heavy smoking and binge drinking, and (Class 3) heavy substance users. Depending on the inclusion of the gender covariate in the GSCA model (Model 1 vs. Model 2), the results differed slightly in membership prevalence. The membership prevalence of the three classes is 47.73%, 20.25%, and 32.02% in Model 1, whereas the prevalence with the gender covariate in Model 2 is 54.32%, 20.29%, and 25.38%. When the gender covariate was added to the GSCA model in Step 1, more subjects were placed in the smoking and drinking class (Class 1) and fewer subjects in the heavy substance user class (Class 3). On the other hand, the item response probabilities showed similar patterns in both models. The smoking and drinking class (Class 1) had moderate item response probabilities for experience with smoking (Model 1: 0.457 and Model 2: 0.401) and alcohol (Model 1: 0.619 and Model 2: 0.664) and low probabilities for the other three indicators (drug: 0.013 and 0.012; marijuana: 0.075 and 0.187; cocaine: 0.013 and 0.012 in Model 1 and Model 2, respectively). The heavy smoking and binge drinking class (Class 2) had high probabilities of experience with smoking (0.959 for both Models 1 and 2), alcohol (Model 1: 0.990 and Model 2: 0.997), and marijuana (Model 1: 0.997 and Model 2: 0.995) and low probabilities of experience with drug use (Model 1: 0.041 and Model 2: 0.043) and cocaine (Model 1: 0.041 and Model 2: 0.043). The heavy substance user class (Class 3) had relatively high probabilities of experience with all indicators (smoking: 0.747 and 0.941; alcohol: 0.954 and 0.946; drug: 0.630 and 0.792; marijuana: 0.967 and 0.963; cocaine: 0.553 and 0.696 in Model 1 and Model 2, respectively).

3.3.1. Multinomial Logistic Regression: Hard Partitioning

The estimated coefficients from the multinomial logistic regressions examining the effects of gender and education level are presented in Table 1, which includes results from both hard partitioning and soft partitioning with weights. For instance, with hard partitioning, the estimated coefficients of Model 1 for latent class 2, with latent class 1 as the reference category, were −0.836 (intercept; $\beta_{0,class2}$, $p < 0.001$), −0.240 (effect of gender; $\beta_{1,class2}$, $p = 0.002$), and 0.029 (effect of education level; $\beta_{2,class2}$, $p = 0.188$). The results showed that gender had a statistically significant effect on the prevalence of latent class 2 over latent class 1, but the effect of education level did not. These coefficients can be transformed into odds by exponentiating them. For the male group with the average education level (3.923), the odds of membership in latent class 2 relative to the reference latent class 1 are
$$\frac{\pi_{class\,2}}{\pi_{class\,1}} = \exp(-0.836 - 0.240(0) + 0.029(3.923)) = 0.4857.$$
That is, for the male group with the average education level, membership in latent class 2 is 0.4857 times as likely as membership in latent class 1. For the female group with the average education level (3.923), the corresponding odds are 0.382. Similarly, we found that the gender effect was statistically significant while the effect of education level was not. We also calculated the odds for latent class 3 relative to latent class 1; the computed odds are presented in Table 2.
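The odds computation can be reproduced numerically with the coefficients reported for Model 1 under hard partitioning:

```python
import math

# Coefficients for latent class 2 vs. class 1 (Model 1, hard partitioning, Table 1).
b0, b_gender, b_edu = -0.836, -0.240, 0.029
mean_edu = 3.923

# Gender coded 0 = male, 1 = female; education held at its mean.
odds_male   = math.exp(b0 + b_gender * 0 + b_edu * mean_edu)   # ~0.4857
odds_female = math.exp(b0 + b_gender * 1 + b_edu * mean_edu)   # ~0.382
```

Exponentiating the female gender coefficient alone, exp(−0.240) ≈ 0.787, gives the multiplicative change in these odds between the two gender groups.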
For Model 2, which included the gender covariate when fitting gscaLCA in Step 1, the results were similar, but the effect of education level in the comparison between latent class 3 and latent class 1 differed from that in Model 1. As shown in Table 1, education level was a significant predictor of being in latent class 3 rather than latent class 1. In sum, the estimated coefficients and odds showed that the gender effect was statistically significant for latent classes 2 and 3 relative to the reference latent class 1, regardless of whether the gender covariate was included in the GSCA model in Step 1. In contrast, education level was not significantly influential on the prevalence of latent classes 2 and 3 over latent class 1, except for the estimated coefficient for latent class 3 relative to latent class 1 in Model 2, which was significant.

3.3.2. Multinomial Logistic Regression: Soft Partitioning

When weights were taken into an account (i.e., soft partitioning was applied), the estimated coefficients were similar as the estimated coefficients with hard partitioning in Model 1 but slightly different from those of hard partitioning in Model 2. In Model 1, the estimated coefficients of covariate effects were slightly lower than the coefficients based on hard partitioning. Nevertheless, the statistical test results of the covariate effect in Model 1 were consistent with the results before applying the weights; the gender effect was statistically significant while the education level was not.
On the contrary, in Model 2, we found one inconsistency with those of hard partitioning. For the comparison between latent class 2 and latent class 1 (reference group), both gender and education level effects were not significant as −0.167 ( β 1 ,   c l a s s 2 ,   p = 0.090 ) and 0.004 ( β 2 ,   c l a s s 2 ,   p = 0.878 ), respectively. For the comparison between latent class 3 and latent class 1, both effects of gender and education level were significant as −0.646 ( β 1 ,   c l a s s 3 ,   p < 0.001 ) and −0.061 ( β 2 ,   c l a s s 3 , p = 0.016 ), which are aligned with the statistical test results of hard partitioning.
In summary, the results in terms of statistical significance were the same across hard and soft partitioning in Model 1. However, the results differed between Model 1 and Model 2: Model 2 showed different results across hard/soft partitioning regarding the effect of gender.
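The two partitioning schemes differ only in how the fuzzy memberships from Step 2 enter the Step 3 regression data: hard partitioning keeps one record per subject labeled with the modal class, while soft partitioning expands each subject into one record per class, carrying the membership probability as a regression weight. The sketch below is illustrative only; the membership matrix and covariates are hypothetical, and the actual analyses were run with the R package gscaLCA:

```python
def hard_partition(memberships):
    """Modal assignment: each subject gets its most probable class (0-based)."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

def soft_partition(memberships, covariates):
    """Expand each subject into one weighted record per class.

    Records with near-zero weight contribute almost nothing to the
    weighted (multinomial) logistic regression fit in Step 3.
    """
    records = []
    for row, cov in zip(memberships, covariates):
        for k, weight in enumerate(row):
            records.append({"class": k, "weight": weight, "cov": cov})
    return records

# Hypothetical membership probabilities for three subjects, three classes
memberships = [[0.70, 0.20, 0.10],
               [0.10, 0.15, 0.75],
               [0.40, 0.45, 0.15]]
covariates = [{"female": 1, "edu": 4},
              {"female": 0, "edu": 3},
              {"female": 1, "edu": 5}]

print(hard_partition(memberships))                   # modal classes per subject
print(len(soft_partition(memberships, covariates)))  # 3 subjects x 3 classes
```

In Step 3, the hard-partition labels feed an ordinary multinomial logistic regression, while the expanded records feed a weighted fit, so each subject's influence on a class-specific coefficient is proportional to its estimated membership in that class.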

3.3.3. Binomial Logistic Regression: Hard Partitioning

Binomial logistic regressions were fitted for each latent class separately, with the class variable recoded into three latent-class dummy variables based on the modal assignment. Interpretation of a binomial logistic regression is more straightforward than that of a multinomial logistic regression. Equation (6) can be rewritten as follows:
$$\mathrm{prob}(v_k = 1) = \frac{\exp(\beta_{0,k} + \beta_{1,k}\,\mathrm{cov}_{1,k} + \cdots + \beta_{C,k}\,\mathrm{cov}_{C,k})}{1 + \exp(\beta_{0,k} + \beta_{1,k}\,\mathrm{cov}_{1,k} + \cdots + \beta_{C,k}\,\mathrm{cov}_{C,k})},$$
which is the probability that a subject is in latent class k. For example, the probability of being in latent class 1 (the smoking and drinking class) in Model 1 can be presented as
$$\mathrm{prob}(v_1 = 1) = \frac{\exp(-0.245 + 0.427\,(\mathrm{Gender{:}F}) - 0.020\,(\mathrm{Edu.\ Level}))}{1 + \exp(-0.245 + 0.427\,(\mathrm{Gender{:}F}) - 0.020\,(\mathrm{Edu.\ Level}))}.$$
The coefficients here are taken from Table 3 (hard partitioning, Model 1). When the gender covariate was not associated with enumerating the latent classes in Step 1 (Model 1), the probability that a male subject with an average education level (Edu. Level = 3.923) belongs to latent class 1 is 41.98%, and the probability that a female subject with an average education level belongs to latent class 1 is 52.59%. Similarly, the probabilities for the other two latent classes were estimated, as presented in Table 2. The results showed about a 10% difference in membership prevalence between males and females, and the difference was statistically significant for latent class 1 (β_{1, class 1} = 0.427, p < 0.001) and latent class 3 (β_{1, class 3} = 0.473, p < 0.001). There were more females in latent class 1, smoking and drinking, and more males in latent class 3, heavy substance users. On the other hand, education level did not show a significant influence on membership prevalence for any of the three latent classes.
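Plugging the latent class 1 coefficients (Table 3, hard partitioning, Model 1) into Equation (6) reproduces the Table 2 percentages; a short Python check:

```python
import math

def class_prob(b0, b_gender, b_edu, female, edu):
    """Inverse-logit of the linear predictor in Equation (6)."""
    eta = b0 + b_gender * female + b_edu * edu
    return math.exp(eta) / (1 + math.exp(eta))

# Latent class 1 coefficients (Table 3, hard partitioning, Model 1)
B0, B_GENDER, B_EDU = -0.245, 0.427, -0.020
EDU_MEAN = 3.923  # average education level

p_male = class_prob(B0, B_GENDER, B_EDU, female=0, edu=EDU_MEAN)
p_female = class_prob(B0, B_GENDER, B_EDU, female=1, edu=EDU_MEAN)
print(f"male {p_male:.2%}, female {p_female:.2%}")
```

This evaluates to about 41.98% for males and 52.59% for females, the class 1 percentages shown in Table 2.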
With the same approach, we can interpret the estimated regression coefficients for Model 2 based on the results in Table 2 and Table 3. The results showed about a 10% difference in membership prevalence between males and females in latent class 1 (β_{1, class 1} = 0.426, p < 0.001; male = 48.59% and female = 59.13%) and latent class 3 (β_{1, class 3} = 0.540, p < 0.001; male = 30.83% and female = 20.62%), and the difference was statistically significant. Although the absolute prevalences in Model 1 and Model 2 differed, the differences between males and females in latent classes 1 and 3 are similar (around 10%). Another noticeable feature of Model 2 is that education level was significant for latent classes 1 and 3, whereas it was not significant in Model 1. This implies that the model specification influences not only the membership prevalence but also the relationship between the membership and the covariates.

3.3.4. Binomial Logistic Regression: Soft Partitioning

After applying the weights, the results of the binomial logistic regressions are presented in the bottom part of Table 3. The weighting yielded comparable results across Model 1 and Model 2. The gender effect was statistically significant for latent class 1 (β_{1, class 1} = 0.353, p = 0.008 in Model 1; β_{1, class 1} = 0.548, p < 0.001 in Model 2) and latent class 3 (β_{1, class 3} = 0.485, p < 0.001 in Model 1; β_{1, class 3} = 0.440, p < 0.001 in Model 2), but not for latent class 2 (β_{1, class 2} = −0.131, p = 0.250 in Model 1; β_{1, class 2} = 0.083, p = 0.449 in Model 2). The results were consistent regardless of whether the gender variable was included in Step 1, meaning that more females were in latent class 1 and more males were in latent class 3, which is aligned with the results before adjusting for gender in Step 1. On the other hand, the effects of education level changed between Model 1 and Model 2. Specifically, in Model 1, education level significantly affected the prevalence of latent class 2 and latent class 3 when the weights were applied, although the magnitude was small. Conversely, education level became a non-influential factor for latent classes 1 and 3 after the adjustment for the gender variable in Step 1 (Model 2). These results show that the weighting can produce slightly different results between Model 1 and Model 2, although the main trend is maintained. In sum, the results in terms of statistical significance were the same for the gender effect across hard/soft partitioning in Model 1, whereas the results for education level differed across hard/soft partitioning as well as between Model 1 and Model 2.
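Under soft partitioning, each observation's log-likelihood contribution in the binomial fit is multiplied by its class-membership weight; in R this corresponds to a weighted logistic regression (e.g., via the weights argument of glm). As a rough illustration of the estimator only, here is a bare-bones weighted logistic fit by gradient ascent; all data and numbers below are made up for the sketch:

```python
import math

def fit_weighted_logistic(x, y, w, lr=0.05, iters=5000):
    """Maximize the weighted Bernoulli log-likelihood by gradient ascent.

    Model: prob(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))); each observation's
    score contribution is multiplied by its weight w (under soft
    partitioning, the fuzzy membership probability).
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += wi * (yi - p)           # weighted score for the intercept
            g1 += wi * (yi - p) * xi      # weighted score for the slope
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Hypothetical data: class-membership indicator y, one covariate x,
# and fuzzy membership weights w
x = [0, 1, 2, 3, 4, 5, 1, 4]
y = [0, 0, 1, 0, 1, 1, 0, 1]
w = [1.0, 0.9, 0.8, 0.7, 1.0, 0.6, 0.5, 0.9]

b0, b1 = fit_weighted_logistic(x, y, w)
print(f"intercept {b0:.3f}, slope {b1:.3f}")
```

Setting all weights to 1 recovers the ordinary (hard-partitioning) fit, which is why the two schemes agree whenever the memberships are close to 0/1.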

4. Discussion

In this study, the new algorithm for latent class analysis utilizing fuzzy clustering within GSCA with covariates (i.e., gscaLCR) was discussed with a real-world example. More importantly, the specific algorithm of gscaLCR was established, and a function applying the three-step approach to gscaLCR is now available in an R package, gscaLCA [19]. Because it implements the least squares estimation approach, the proposed algorithm diminishes the effect of non-normality on enumerating the latent classes and on examining covariate effects on latent class prevalence. This means that researchers are now able to examine the effects of covariates using gscaLCR in parallel to maximum-likelihood-based LCR. Although gscaLCR and MLE-based LCR function similarly in addressing research questions in LCA, they are neither comparable nor competing approaches in an exploratory sense; rather, one of them should be selected prior to running the analysis in a confirmatory sense [48]. In this study, we focused on demonstrating that gscaLCR functions well in identifying homogeneous subgroups when the response variables are categorical.
Whether to include covariates in Step 1 of LCR is still controversial. Our results for the regression analyses in Step 3 indicated some discrepancy between the LCA models with and without covariates in Step 1. However, it may be meaningless to compare the two approaches directly. That is, including covariates in Step 1 is not merely optional; the researcher must decide which covariates should be included in Step 1, with a theoretical rationale for each covariate. On the other hand, in this study, greater stability in parameter estimates across hard/soft partitioning was observed in the LCA model without covariates in Step 1 than in the model with covariates.
Although the gscaLCR algorithm is sound and promising for LCR, there are several limitations. First, the gscaLCR algorithm was applied with two response options (i.e., dichotomous indicators) in the current study. However, the optimal scaling method is applicable to ordered categorical variables, and thus the gscaLCR algorithm can easily be extended to them. Although we could not include the results as an example in this study, we have verified that gscaLCR works well with indicators with more than two response options and have implemented the function in the gscaLCA package. The second limitation is that the efficiency of parameter estimation was not examined in this study, not because such a study is difficult to conduct but because we focused on proposing the new gscaLCR algorithm; our future research will examine the efficiency of parameter recovery. Third, we did not specify how small a sample size is sufficient to run gscaLCR. This may depend on the study design, but adequate sample sizes should be suggested, as was done for the MLE-based LCR; this will be studied as a follow-up research topic. The last limitation concerns tools for model evaluation. In GSCA, several evaluation tools are available, including FIT, Adjusted FIT (AFIT), and the Goodness of Fit Index (GFI) [22], as well as confirmatory tetrad analysis [49]. However, few criteria of good fit are available for those tools. In LCR, as an extension of LCA, it is necessary to identify the number of latent classes; to do so, LCR models must be compared objectively. We did not touch on this issue in this study, but such tools need to be provided along with gscaLCR.
It should also be noted that this study does not propose a new model in the LCR literature; rather, it provides a new enumeration approach, fuzzy clustering, and a new parameter estimation method, alternating least squares, for LCR using GSCA. That is, a detailed discussion of the model building procedure is beyond the scope of this research. We also extended testing the effects of covariates to soft and hard partitioning in gscaLCR using multinomial and binomial logistic regressions, as has been done in MLE-based LCA.
As noted, the bias from classification error discussed by Bolck et al. [42] does not arise in gscaLCR because gscaLCR does not require multivariate normality, so that discussion is not necessary here. However, there is still room to develop gscaLCA and the current estimation method beyond the present algorithm of gscaLCR. First, only listwise deletion for missing data is available in the current version of the gscaLCA package; implementing model-based multiple imputation is a future research topic. Second, the current gscaLCR does not provide analytical tools for multilevel modeling or multiple groups, which are available in MLE-based LCA. These limitations are not merely disadvantages but directions for further development. In spite of these limitations, it is necessary to continue developing such new approaches to latent class analysis, because MLE-based LCR does show many estimation and identification issues and requires a relatively large sample size, which can be addressed by the gscaLCR algorithm.
We believe that the gscaLCR algorithm provides a new framework for fitting LCR for researchers in mixture modeling, and it can also be developed into structural equation mixture modeling, combining factor analysis with mixture modeling.

Author Contributions

Conceptualization, S.P., S.K. and J.H.R.; methodology, S.P., S.K. and J.H.R.; software, S.P., S.K. and J.H.R.; validation, S.P., S.K. and J.H.R.; formal analysis, S.P., S.K. and J.H.R.; investigation, S.P., S.K. and J.H.R.; resources, S.P., S.K. and J.H.R.; data curation, S.P., S.K. and J.H.R.; writing—original draft preparation, S.P., S.K. and J.H.R.; writing—review and editing, S.P., S.K. and J.H.R.; visualization, S.P., S.K. and J.H.R.; supervision, J.H.R.; funding acquisition, J.H.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Yonsei University Research Grant of 2020.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lazarsfeld, P.; Henry, N. Latent Structure Analysis; Houghton Mifflin: Boston, MA, USA, 1968. [Google Scholar]
  2. McCutcheon, A. Latent Class Analysis; Sage Publications: Beverly Hills, CA, USA, 1987. [Google Scholar]
  3. Vermunt, J.K. Latent class modeling with covariates: Two improved three-step approaches. Polit. Anal. 2010, 18, 450–469. [Google Scholar] [CrossRef] [Green Version]
  4. Urick, A.; Bowers, A.J. What are the different types of principals across the United States? A latent class analysis of principal perception of leadership. Educ. Adm. Q. 2014, 50, 96–134. [Google Scholar]
  5. Lanza, S.T.; Cooper, B.R. Latent class analysis for developmental research. Child Dev. Perspect. 2016, 10, 59–64. [Google Scholar] [CrossRef] [PubMed]
  6. Xia, J.; Evans, F.H.; Spilsbury, K.; Ciesielski, V.; Arrowsmith, C.; Wright, G. Market segments based on the dominant movement patterns of tourists. Tour. Manag. 2010, 31, 464–469. [Google Scholar] [CrossRef] [Green Version]
  7. Doctor, S.M.; Liu, Y.; Whitesell, A.; Thwai, K.L.; Taylor, S.M.; Janko, M.; Emch, M.; Kashamuka, M.; Muwonga, J.; Tshefu, A.; et al. Malaria surveillance in the Democratic Republic of the Congo: Comparison of microscopy, PCR, and rapid diagnostic test. Diagn. Microbiol. Infect. Dis. 2016, 85, 16–18. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  8. Formann, A.K.; Kohlmann, T. Latent class analysis in medical research. Stat. Methods Med. Res. 1996, 5, 179–211. [Google Scholar] [CrossRef]
  9. Jiang, Y.; Perry, D.K.; Hesser, J.E. Suicide patterns and association with predictors among Rhode Island public high school students: A latent class analysis. Am. J. Public Health 2010, 100, 1701–1707. [Google Scholar] [CrossRef]
  10. Harville, D.A. Maximum likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 1977, 72, 320–338. [Google Scholar] [CrossRef]
  11. Muthen, B.; Muthen, L. Mplus User’s Guide, 8th ed.; Muthen & Muthen: Los Angeles, CA, USA, 1998. [Google Scholar]
  12. Lanza, S.T.; Dziak, J.J.; Huang, L.; Wagner, A.; Collins, L.M. PROC LCA & PROC LTA Users’ Guide Version 1.3.2.; The Methodology Center, Penn State: University Park, PA, USA, 2015. [Google Scholar]
  13. Vermunt, J.K.; Magidson, J. Latent GOLD 4.0 User’s Guide; Statistical Innovations Inc.: Belmont, MA, USA, 2005. [Google Scholar]
  14. Linzer, D.A.; Lewis, J.B. poLCA: An R package for polytomous variable latent class analysis. J. Stat. Softw. 2011, 42, 1–29. [Google Scholar] [CrossRef] [Green Version]
  15. Dempster, A.; Laird, N.; Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1. [Google Scholar]
  16. Collins, L.M.; Lanza, S.T. Latent Class and Latent Transition Analysis: With Applications in the Social Behavioral, and Health Sciences (Vol. 718); Wiley Series in Probability and Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2010; ISBN 978-0-470-22839-5. [Google Scholar]
  17. Dziak, J.J.; Lanza, S.T.; Tan, X. Effect size, statistical power, and sample size requirements for the bootstrap likelihood ratio test in latent class analysis. Struct. Equ. Model. Multidiscip. J. 2014, 21, 534–552. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Gudicha, D.W.; Tekle, F.B.; Vermunt, J.K. Power and sample size computation for Wald tests in latent class models. J. Classif. 2016, 33, 30–51. [Google Scholar] [CrossRef] [Green Version]
  19. Ryoo, J.H.; Park, S.; Kim, S. Categorical latent variable modeling utilizing fuzzy clustering generalized structured component analysis as an alternative to latent class analysis. Behaviometrika 2020, 47, 291–306. [Google Scholar] [CrossRef]
  20. Hwang, H.; Takane, Y. Generalized structured component analysis. Psychometrika 2004, 69, 81–99. [Google Scholar] [CrossRef]
  21. Harris, K.M.; Udry, J.R. National Longitudinal Study of Adolescent to Adult Health (Add Health) 1994–2008; Inter-University Consortium for Political and Social Research (ICPSR): Ann Arbor, MI, USA, 2018. [Google Scholar] [CrossRef]
  22. Hwang, H.; Takane, Y. Generalized Structured Component Analysis: A Component-Based Approach to Structural Equation Modeling; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
  23. De Leeuw, J.; Young, F.W.; Takane, Y. Additive structure in qualitative data: An alternating least squares method with optimal scaling features. Psychometrika 1976, 41, 471–503. [Google Scholar] [CrossRef]
  24. Jung, K.; Takane, Y.; Hwang, H.; Woodward, T.S. Dynamic GSCA (Generalized Structured Component Analysis) with applications to the analysis of effective connectivity in functional neuroimaging data. Psychometrika 2012, 77, 827–848. [Google Scholar] [CrossRef]
  25. Hwang, H.; DeSarbo, W.S.; Takane, Y. Fuzzy clusterwise generalized structured component analysis. Psychometrika 2007, 72, 181. [Google Scholar] [CrossRef] [Green Version]
  26. Young, F.W. Quantitative analysis of qualitative data. Psychometrika 1981, 46, 357–388. [Google Scholar] [CrossRef]
  27. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Springer: Boston, MA, USA, 1981; ISBN 978-1-4757-0452-5. [Google Scholar]
  28. Bezdek, J.C. Numerical taxonomy with fuzzy sets. J. Math. Biol. 1974, 1, 57–71. [Google Scholar] [CrossRef]
  29. Dunn, J.J. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybern. 1974, 3, 32–57. [Google Scholar] [CrossRef]
  30. Wedel, M.; Steenkamp, J.-B.E. A clusterwise regression method for simultaneous fuzzy market structuring and benefit segmentation. J. Mark. Res. 1991, 28, 385–396. [Google Scholar] [CrossRef]
  31. Maulik, U.; Saha, I. Automatic fuzzy clustering using modified differential evolution for image classification. IEEE Trans. Geosci. Remote Sens. 2010, 48, 3503–3510. [Google Scholar] [CrossRef]
  32. Aliahmadipour, L.; Torra, V.; Eslami, E. On hesitant fuzzy clustering and clustering of hesitant fuzzy data. In Fuzzy Sets, Rough Sets, Multisets and Clustering; Torra, V., Dahlbom, A., Narukawa, Y., Eds.; Studies in Computational Intelligence; Springer: Cham, Switzerland, 2017; Volume 671, pp. 157–168. [Google Scholar]
  33. Cacciola, M.; La Foresta, F.; Morabito, F.C.; Versaci, M. Advanced use of soft computing and eddy current test to evaluate mechanical integrity of metallic plates. NDT E Int. 2007, 40, 357–362. [Google Scholar] [CrossRef]
  34. Dayton, C.M.; Macready, G.B. Concomitant-variable latent-class models. J. Am. Stat. Assoc. 1988, 83, 173–178. [Google Scholar] [CrossRef]
  35. Dayton, C.M.; Macready, G.B. Use of categorical and continuous covariates in latent class analysis. In Applied Latent Class Analysis; Hagenaars, J., McCutcheon, A.L., Eds.; Cambridge University Press: Cambridge, UK, 2002; pp. 213–233. ISBN 0-521-59451-0. [Google Scholar]
  36. DeSarbo, W.S.; Oliver, R.L.; Rangaswamy, A. A simulated annealing methodology for clusterwise linear regression. Psychometrika 1989, 54, 707–736. [Google Scholar] [CrossRef]
  37. Wedel, M.; Kistemaker, C. Consumer benefit segmentation using clusterwise linear regression. Int. J. Res. Mark. 1989, 6, 45–59. [Google Scholar] [CrossRef]
  38. Van der Heijden, P.G.M.; Dessens, J.; Bockenholt, U. Estimating the concomitant-variable latent-class model with the EM algorithm. J. Educ. Behav. Stat. 1996, 21, 215–229. [Google Scholar] [CrossRef]
  39. Vermunt, J.K. LEM: A General Program for the Analysis of Categorical Data. Ph.D. Thesis, Tilburg University, The Netherlands, 1997. [Google Scholar]
  40. Yamaguchi, K. Multinomial logit latent-class regression models: An analysis of the predictors of gender-role attitudes among Japanese women. Am. J. Sociol. 2000, 105, 1702–1740. [Google Scholar] [CrossRef]
  41. Skrondal, A.; Laake, P. Regression among factor scores. Psychometrika 2001, 66, 563–575. [Google Scholar] [CrossRef] [Green Version]
  42. Bolck, A.; Croon, M.; Hagenaars, J. Estimating latent structure models with categorical variables: One-step versus three-step estimators. Polit. Anal. 2004, 12, 3–27. [Google Scholar] [CrossRef]
  43. Croon, M. Ordering the classes. In Applied Latent Class Analysis; Hagenaars, J., McCutcheon, A.L., Eds.; Cambridge University Press: Cambridge, UK, 2002; pp. 137–162. ISBN 0-521-59451-0. [Google Scholar]
  44. Gordon, G.J. Approximate Solutions to Markov Decision Processes. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1999. (No. CMU-CS-99-143). [Google Scholar]
  45. Hruschka, H. Market definition and segmentation using fuzzy clustering methods. Int. J. Res. Mark. 1986, 3, 117–134. [Google Scholar] [CrossRef]
  46. Dias, J.G.; Vermunt, J.K. A bootstrap-based aggregate classifier for model-based clustering. Comput. Stat. 2008, 23, 643–659. [Google Scholar] [CrossRef]
  47. R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2019. [Google Scholar]
  48. Hwang, H.; Takane, Y.; Jung, K. Generalized structured component analysis with uniqueness terms for accommodating measurement error. Front. Psychol. 2017, 8, 1–12. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  49. Ryoo, J.H.; Hwang, H. Model evaluation in generalized structured component analysis using confirmatory tetrad analysis. Front. Psychol. 2017, 8, 916. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. Flow chart of the three-step approach with the Generalized Structured Component Analysis for Latent Class Regression (gscaLCR) algorithm: (Step 1) estimating parameters of GSCA with latent class analysis (gscaLCA), (Step 2) assigning the membership, and (Step 3) estimating the effects of covariates.
Figure 2. A path diagram used for the GSCA model with the gender covariate (Model 2). L1–L6 are the latent traits for Smoking, Alcohol, Drug, Marijuana, Cocaine, and Gender. L1–L5 are associated with one another through phantom constructs, and the effect of gender on the other five indicators is represented by the paths from L6 to L1–L5.
Figure 3. Item response probabilities of Model 1 and Model 2. Model 1: no covariates added in the GSCA model; Model 2: gender covariate added in the GSCA model. Class 1: smoking and drinking; Class 2: binge drinking and heavy smoking; Class 3: heavy substance abuse.
Table 1. Multinomial Logistic Regression Coefficients of gscaLCR with Add Health data.
Partitioning | Comparison | Term | Estimate | Standard Error | z Value | Pr(>|z|)
Model 1:
Hard | Class 2/Class 1 | (Intercept) | −0.836 | 0.100 | −8.403 | <0.001 **
Hard | Class 2/Class 1 | Gender(F) | −0.240 | 0.076 | −3.173 | 0.002 **
Hard | Class 2/Class 1 | Edu. Level | 0.029 | 0.022 | 1.316 | 0.188
Hard | Class 3/Class 1 | (Intercept) | −0.165 | 0.085 | −1.948 | 0.051
Hard | Class 3/Class 1 | Gender(F) | −0.545 | 0.065 | −8.341 | <0.001 **
Hard | Class 3/Class 1 | Edu. Level | 0.014 | 0.019 | 0.716 | 0.474
Soft | Class 2/Class 1 | (Intercept) | −0.374 | 0.108 | −3.458 | <0.001 **
Soft | Class 2/Class 1 | Gender(F) | −0.275 | 0.082 | −3.353 | 0.001 **
Soft | Class 2/Class 1 | Edu. Level | 0.005 | 0.024 | 0.194 | 0.846
Soft | Class 3/Class 1 | (Intercept) | 0.163 | 0.099 | 1.634 | 0.102
Soft | Class 3/Class 1 | Gender(F) | −0.640 | 0.076 | −8.375 | <0.001 **
Soft | Class 3/Class 1 | Edu. Level | −0.024 | 0.023 | −1.043 | 0.297
Model 2:
Hard | Class 2/Class 1 | (Intercept) | −0.876 | 0.098 | −8.933 | <0.001 **
Hard | Class 2/Class 1 | Gender(F) | −0.211 | 0.074 | −2.845 | 0.004 **
Hard | Class 2/Class 1 | Edu. Level | 0.003 | 0.022 | 0.117 | 0.907
Hard | Class 3/Class 1 | (Intercept) | −0.209 | 0.088 | −2.365 | 0.018 *
Hard | Class 3/Class 1 | Gender(F) | −0.598 | 0.069 | −8.685 | <0.001 **
Hard | Class 3/Class 1 | Edu. Level | −0.063 | 0.021 | −3.081 | 0.002 **
Soft | Class 2/Class 1 | (Intercept) | −0.824 | 0.131 | −6.287 | <0.001 **
Soft | Class 2/Class 1 | Gender(F) | −0.167 | 0.099 | −1.693 | 0.090
Soft | Class 2/Class 1 | Edu. Level | −0.004 | 0.029 | −0.153 | 0.878
Soft | Class 3/Class 1 | (Intercept) | 0.082 | 0.110 | 0.744 | 0.457
Soft | Class 3/Class 1 | Gender(F) | −0.646 | 0.085 | −7.590 | <0.001 **
Soft | Class 3/Class 1 | Edu. Level | −0.061 | 0.025 | −2.403 | 0.016 *
Class 1: smoking and drinking; Class 2: binge drinking and heavy smoking; Class 3: heavy substance abuse. Model 1: no covariates added in the GSCA model; Model 2: gender covariate added in the GSCA model. ** p < 0.01, * p < 0.05.
Table 2. Interpretation Forms of Estimated Logistic Regressions based on hard partitioning.
 | Model 1, Male | Model 1, Female | Model 2, Male | Model 2, Female
Multinomial Logistic Regression: Odds
Class 2/Class 1 | 0.4857 | 0.3820 | 0.4214 | 0.3412
Class 3/Class 1 | 0.8958 | 0.5194 | 0.6337 | 0.3485
Binomial Logistic Regression: Percentage
Class 1 | 41.98% | 52.59% | 48.59% | 59.13%
Class 2 | 20.43% | 20.12% | 20.43% | 20.12%
Class 3 | 37.58% | 27.28% | 30.83% | 20.62%
Class 1: smoking and drinking; Class 2: binge drinking and heavy smoking; Class 3: heavy substance abuse. Model 1: no covariates added in the GSCA model; Model 2: gender covariate added in the GSCA model.
Table 3. Binomial Logistic Regression Coefficients of gscaLCR with Add Health data.
Partitioning | Latent Class | Term | Estimate | Standard Error | z Value | Pr(>|z|)
Model 1:
Hard | Class 1 | (Intercept) | −0.245 | 0.075 | −3.272 | 0.001 **
Hard | Class 1 | Gender(F) | 0.427 | 0.057 | 7.451 | <0.001 **
Hard | Class 1 | Edu. Level | −0.020 | 0.017 | −1.177 | 0.239
Hard | Class 2 | (Intercept) | −1.454 | 0.093 | −15.640 | <0.001 **
Hard | Class 2 | Gender(F) | −0.019 | 0.071 | −0.270 | 0.787
Hard | Class 2 | Edu. Level | 0.024 | 0.021 | 1.138 | 0.255
Soft | Class 1 | (Intercept) | 1.743 | 0.175 | 9.988 | <0.001 **
Soft | Class 1 | Gender(F) | 0.353 | 0.134 | 2.632 | 0.008 **
Soft | Class 1 | Edu. Level | −0.027 | 0.038 | −0.719 | 0.472
Soft | Class 2 | (Intercept) | 0.555 | 0.149 | 3.738 | <0.001 **
Soft | Class 2 | Gender(F) | −0.131 | 0.114 | −1.151 | 0.250
Soft | Class 2 | Edu. Level | 0.068 | 0.034 | 2.023 | 0.043 *
Model 2:
Hard | Class 1 | (Intercept) | −0.186 | 0.075 | −2.474 | 0.013 *
Hard | Class 1 | Gender(F) | 0.426 | 0.057 | 7.422 | <0.001 **
Hard | Class 1 | Edu. Level | 0.033 | 0.017 | 1.966 | 0.049 *
Hard | Class 2 | (Intercept) | −1.446 | 0.093 | −15.578 | <0.001 **
Hard | Class 2 | Gender(F) | −0.019 | 0.071 | −0.268 | 0.789
Hard | Class 2 | Edu. Level | 0.022 | 0.021 | 1.079 | 0.281
Soft | Class 1 | (Intercept) | 0.804 | 0.143 | 5.603 | <0.001 **
Soft | Class 1 | Gender(F) | 0.548 | 0.112 | 4.881 | <0.001 **
Soft | Class 1 | Edu. Level | 0.051 | 0.033 | 1.545 | 0.122
Soft | Class 2 | (Intercept) | −0.480 | 0.145 | −3.313 | 0.001 **
Soft | Class 2 | Gender(F) | 0.083 | 0.110 | 0.757 | 0.449
Soft | Class 2 | Edu. Level | 0.039 | 0.032 | 1.203 | 0.229
Each binomial regression models membership in the indicated latent class versus the rest. Class 1: smoking and drinking; Class 2: binge drinking and heavy smoking; Class 3: heavy substance abuse. Model 1: no covariates added in the GSCA model; Model 2: gender covariate added in the GSCA model. ** p < 0.01, * p < 0.05.


