The best way to do latent class analysis is to use Mplus, or, if you are interested in some very specific LCA models, you may need Latent Gold. Another decent option is PROC LCA in SAS. All the other tools can be frustrating, but they are helpful if your purposes happen to coincide with what a specific R package offers.
CRAN offers plenty of different ways to cluster your data, but most of these packages have a very narrow and specific utility. For example, I found at least 15 packages involving latent class models, of which only six perform latent class analysis in the sense of classification based on indicators; only a few of those allow nominal indicators, and none allows ordinal indicators.
First, there are all kinds of mixture models whose main purpose is to look for classes in which the regression parameters differ. The meaning of the latent classes here is different, as they are based not on the responses of respondents but on the effects of one variable on another. These packages include flexmix, fpc, mmlcr, lcmm, and others. I do not discuss them.
Second, there are updated Lazarsfeld latent class models whose main purpose is to obtain a discrete latent variable based on the responses of respondents. These models look for classes that are homogeneous in terms of responses and differ from each other by responses. Optionally, one can add predictors and distal outcomes (variables that depend on the classes but are not their indicators).
R packages' capabilities
| Package | Continuous responses | Binary responses | Categorical unordered (nominal) responses | Allows to add |
|---|---|---|---|---|
| poLCA | | + | + | class predictors |
| lcca | | + | + | class predictors, multiple groups |
| depmixS4 | + | + | + | class predictors, constraints |
| covLCA | | + | + | class predictors |
| RandomLCA | | + | | random intercept |
| BayesLCA | | + | | priors |
| e1071::lca | | + | | |
| Mclust | + | | | priors |
NONE of them handles ordinal responses (although the poLCA manual insists there is little practical difference from treating responses as nominal). NONE can apply the three-step approach to adding covariates. Below I describe the three packages that allow nominal indicators: poLCA, depmixS4, and lcca.
poLCA [Polytomous Latent Class Analysis]
Latent classes based on nominal responses only; predictors of the latent classes can be added (in one step).
Nice features:
- simple input. You just put:
> poLCA(cbind(indicator1, indicator2, indicator3)~1, data=mydata)
and poLCA gives class probabilities, conditional response probabilities and the fit statistics. The output is clear and simple.
- refits the model as many times as you like and lets you manipulate the convergence criterion and the number of iterations (see the sketch after this list).
- builds graphs automatically, although these graphs are not always useful.
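For example, a minimal sketch of these options, reusing the hypothetical indicator names from above (the parameter values are arbitrary):

# Refit from 10 random starting values, with explicit tolerance and iteration limits
poLCA(cbind(indicator1, indicator2, indicator3) ~ 1, data = mydata,
      nclass = 2, nrep = 10, tol = 1e-10, maxiter = 5000)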
Not so nice features:
- seems to have problems dealing with indicators that have different numbers of categories; for example, poLCA.entropy could not be calculated in my example;
- doesn’t handle ordinal or metric indicators;
- you can see only the probabilities, not the underlying thresholds; you have no control over individual parameters and cannot constrain them to test hypotheses;
- no function for comparing models; it has to be done manually (see the sketch after this list);
- in the manual the authors defend the one-step approach to adding covariates, citing the outdated publication by Bolck, Croon, and Hagenaars (2004), whereas corrections to the three-step approach were suggested a while ago (Vermunt 2010) and successfully implemented in Latent Gold and Mplus. You cannot even do it manually with poLCA.
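The manual comparison mentioned above is straightforward; here is a minimal sketch (assuming the indicators and data frame d1 from the example that follows) that fits models with different numbers of classes and collects their information criteria by hand:

# Fit 2- and 3-class models and compare their information criteria manually
f <- cbind(trust, trust_family, trust_strangers) ~ 1
m2 <- poLCA(f, data = d1, nclass = 2, nrep = 10, verbose = FALSE)
m3 <- poLCA(f, data = d1, nclass = 3, nrep = 10, verbose = FALSE)
data.frame(classes = 2:3,
           AIC = c(m2$aic, m3$aic),
           BIC = c(m2$bic, m3$bic))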
Example. I use WVS round 5 data from Canada and three items on trust: a 10-point generalized interpersonal trust scale and two 4-point scales of trust in family and trust in strangers.
The input commands are very simple, apart from the odd need to cbind the LCA indicators.
lca.polca <- poLCA(
  cbind(trust, trust_family, trust_strangers)~1,  # indicators on the left, no covariates
  nclass=2,          # number of latent classes
  data=d1,
  nrep=1,            # number of estimation runs from random starting values
  na.rm=F,           # keep cases with missing values
  graphs=T,          # plot class profiles
  maxiter = 100000   # maximum number of EM iterations
)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$trust_family
Pr(1) Pr(2) Pr(3) Pr(4)
class 1: 0.9058 0.0901 0.0019 0.0022
class 2: 0.7275 0.2356 0.0278 0.0091
$trust
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7) Pr(8) Pr(9) Pr(10)
class 1: 0.0000 0.0027 0.0058 0.0239 0.1103 0.1077 0.2272 0.3008 0.1407 0.0810
class 2: 0.1098 0.0612 0.1324 0.1287 0.1825 0.1846 0.1164 0.0665 0.0137 0.0042
$trust_strangers
Pr(1) Pr(2) Pr(3) Pr(4)
class 1: 0.0190 0.6439 0.274 0.063
class 2: 0.0094 0.2356 0.505 0.250
Estimated class population shares
0.6428 0.3572
Predicted class memberships (by modal posterior prob.)
0.677 0.323
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 2164
number of fully observed cases: 2071
number of estimated parameters: 31
residual degrees of freedom: 128
maximum log-likelihood: -7651.997
AIC(2): 15365.99
BIC(2): 15542.07
G^2(2): 132.5952 (Likelihood ratio/deviance statistic)
X^2(2): 181.2104 (Chi-square goodness of fit)
So, there are two classes and their profiles are nicely plotted, though the plot isn’t too informative as it’s hard to compare class profiles.
poLCA.entropy(lca.polca)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
  arguments imply differing number of rows: 4, 10
Entropy wasn’t calculated in my case. All the other functions worked smoothly.
Here’s a quick fix for the native poLCA “absolute entropy” function:
poLCA.entropy.fix <- function (lc) {
  K.j <- sapply(lc$probs, ncol)                         # number of categories per item
  fullcell <- expand.grid(sapply(K.j, seq, from = 1))   # all possible response patterns
  P.c <- poLCA.predcell(lc, fullcell)                   # predicted cell probabilities
  return(-sum(P.c * log(P.c), na.rm = TRUE))
}
poLCA.entropy.fix(lca.polca)
[1] 3.602397
And here is a solution from Daniel Oberski (slide 66), called Entropy R2:
entropy.R2 <- function(fit) {
  entropy <- function(p) sum(-p * log(p))
  error_prior <- entropy(fit$P)                          # class proportions
  error_post <- mean(apply(fit$posterior, 1, entropy))
  R2_entropy <- (error_prior - error_post) / error_prior
  R2_entropy
}
entropy.R2(lca.polca)
[1] 0.4205958
lca.polca$predclass gives a vector of predicted class membership for each respondent.
lca.polca$posterior gives a matrix of probabilities for each respondent and each class.
You may add predictors to the model by substituting ~1 with ~predictor1 + predictor2… But be careful: this applies the one-step approach, which means the classification will depend not only on your indicators but on the predictors too.
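For instance, a minimal sketch of the one-step covariate model, assuming hypothetical predictors age and sex are present in d1:

# One-step covariate model: class membership is regressed on the predictors
# during estimation (age and sex are hypothetical variables assumed to be in d1)
lca.polca.cov <- poLCA(
  cbind(trust, trust_family, trust_strangers) ~ age + sex,
  nclass = 2,
  data = d1,
  maxiter = 100000
)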
depmixS4 [Dependent Mixture Models]
The package fits various kinds of hidden Markov models, which generalize the LCA model. To avoid time-consuming mistakes in model specification, the analysis involves two steps: constructing a model with the mix function and then fitting it with the fit function. The family argument of mix specifies the type of each observed variable (continuous, nominal, or count) by adding the corresponding distribution name, e.g. gaussian or multinomial, to a list.
Nice features:
- can be used with nominal, continuous, count, and other kinds of variables; additionally, users can define response distributions themselves;
- can mix different kinds of indicators, for example, multinomial with continuous;
- likelihood ratio test is included and is as easy as
llratio(fit1, fit2)
- constraints are allowed
Not so nice features:
- Due to its different main purpose (latent transition modeling of time series), the terminology differs from conventional LCA: classes are called states.
- Some basic functions need to be written manually: there is no easy way to plot probabilities, and no G^2, X^2, or entropy values are reported.
Example. The same model is built:
library(depmixS4)
model_definition <- mix(
  list(trust_family ~ 1, trust ~ 1, trust_strangers ~ 1),
  family = list(multinomial("identity"),   # For every corresponding indicator
                multinomial("identity"),   # a family of distribution should be
                multinomial("identity")),  # indicated in the list.
  data = wvs.canada,
  nstates = 2,            # This is the number of classes
  nstart = c(0.5, 0.5),   # Prior probabilities of classes
  initdata = wvs.canada   # Our data
)
fit.mod <- fit(model_definition)  # Fit the model

iteration 0 logLik: -7754.755
.....
iteration 335 logLik: -7652.003
converged at iteration 339 with logLik: -7652.003

fit.mod
Convergence info: Log likelihood converged to within tol. (relative change)
'log Lik.' -7652.003 (df=31)
AIC: 15366.01
BIC: 15540.71
Log-likelihood values as well as AIC and BIC are very close to the ones found with poLCA.
The summary function shows the estimated class probabilities and the conditional response probabilities.
summary(fit.mod)
Mixture probabilities model
pr1 pr2
0.635535 0.364465
Response parameters
Resp 1 : multinomial
Resp 2 : multinomial
Resp 3 : multinomial
Re1.1 Re1.2 Re1.3 Re1.4 Re2.1 Re2.2 Re2.3 Re2.4 Re2.5 Re2.6 Re2.7
St1 0.9070074 0.08930021 0.001778364 0.001914039 0.0000000 0.002556402 0.005311421 0.02358108 0.1101063 0.1075408 0.2269529
St2 0.7288673 0.23409146 0.027594026 0.009447176 0.1076035 0.060335151 0.130693172 0.12715298 0.1814410 0.1833505 0.1189929
Re2.8 Re2.9 Re2.10 Re3.1 Re3.2 Re3.3 Re3.4
St1 0.30171593 0.14132783 0.080907262 0.019127226 0.6473959 0.2723859 0.06109097
St2 0.06946556 0.01515732 0.005807863 0.009429918 0.2377029 0.5032355 0.24963171
St1 here stands for state 1, which is class 1. Re1.1 stands for variable 1, response category 1; in my case it is the trust-in-family item, response “Trust completely”. Taken together, the probability that a respondent gives the response “Trust completely” to the question about their family, given that this respondent is a member of latent class 1, is 0.9070074, or roughly 0.9.
With a slightly messy function I could plot these conditional probabilities:
depmixS4.plot.probs <- function(fit.mod) {
  require(gridExtra)
  require(grid)
  require(ggplot2)
  require(reshape2)
  # collect conditional response probabilities for every indicator
  a <- lapply(1:length(fit.mod@response[[1]]), function(y) {
    sapply(fit.mod@response, function(x) x[[y]]@parameters$coefficients)
  })
  names(a) <- sapply(1:length(fit.mod@response[[1]]), function(y)
    as.character(fit.mod@response[[1]][[y]]@formula[[2]]))
  g <- grid::gList()
  for (i in 1:length(a)) {
    g[[i]] <- ggplotGrob(
      ggplot(melt(a[[i]]),
             aes(paste("Class", rev(Var2)), value,
                 fill = as.factor(Var1), label = round(value, 2))) +
        geom_col() +
        geom_text(position = position_stack(vjust = 0.5)) +
        coord_flip() +
        scale_fill_brewer(type = "seq", palette = i) +
        labs(x = "", y = "", title = names(a)[i], fill = "Categories") +
        theme_minimal()
    )
  }
  grid.arrange(grobs = g, nrow = length(a))
}
depmixS4.plot.probs(fit.mod)
fit.mod@posterior gives a predicted class membership variable and posterior probabilities of membership.
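For instance, a quick sketch of extracting the modal classification (assuming the posterior slot is a data frame with a state column, as in current depmixS4 versions):

head(fit.mod@posterior)          # modal state plus posterior probabilities per respondent
table(fit.mod@posterior$state)   # class sizes based on modal assignment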
Expanding the example to mixed distributions. Let’s take advantage of depmixS4’s abilities and assign the variable trust a normal distribution instead of a multinomial one:
mod2 <- mix(
  list(trust_family ~ 1, trust ~ 1, trust_strangers ~ 1),
  family = list(multinomial("identity"),
                gaussian("identity"),    # This line has changed
                multinomial("identity")),
  data = wvs.canada,
  nstates = 2,
  nstart = c(0.5, 0.5),
  initdata = wvs.canada
)
fit.mod2 <- fit(mod2)

iteration 0 logLik: -7958.793
...
iteration 140 logLik: -7754.039
converged at iteration 143 with logLik: -7754.038

fit.mod2
Convergence info: Log likelihood converged to within tol. (relative change)
'log Lik.' -7754.038 (df=17)
AIC: 15542.08
BIC: 15637.89

summary(fit.mod2)
Mixture probabilities model
      pr1       pr2
0.6056074 0.3943926

Response parameters
Resp 1 : multinomial
Resp 2 : gaussian
Resp 3 : multinomial
       Re1.1      Re1.2       Re1.3       Re1.4 Re2.(Intercept)   Re2.sd      Re3.1     Re3.2     Re3.3      Re3.4
St1 0.906664 0.08935629 0.001804677 0.002175068        7.577571 1.337904 0.01845478 0.6356047 0.2792356 0.06670489
St2 0.742831 0.22308549 0.025606576 0.008476908        4.642214 2.045671 0.01119042 0.2865875 0.4753749 0.22684724
The table of results shows that the first and third indicators have multinomial responses, so it reports probabilities for each response category given class membership. The second indicator is continuous, so it has two parameters: Re2.(Intercept), which is its mean, and Re2.sd, which is its standard deviation.
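As a quick sanity check (a sketch using the fitted object above and assuming no cases were dropped during estimation), the class-specific means reported as Re2.(Intercept) can be compared with raw means computed within the modal classes:

# Mean of the continuous trust item within each modal class
tapply(wvs.canada$trust, fit.mod2@posterior$state, mean, na.rm = TRUE)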
lcca [Latent class causal analysis]
This is my favorite, although it is a stillborn package: since the publication of version 2.0.0 in 2013 it has not been supported, it is alpha software, it runs only on Windows, and the package's authors are no longer working together (the developers seem to have switched to PROC LCA instead). I hope there will be some progress and they will find a way to keep developing lcca.
Nice features:
- In addition to classic LCA with nominal indicators, it can fit multiple-group LCA models and fix or relax all the response probabilities.
- Like poLCA, it allows adding covariates that affect class probabilities (class sizes). It also allows comparing the likelihoods of several models with the same number of classes but different covariates.
- Another nice feature of this package is its ability to account for sampling weights and other features of the sample design, including sampling clusters and strata, which is very useful when you work with survey data.
- You can also “flatten rhos”, which means avoiding extremely large or small thresholds corresponding to conditional response probabilities (which happens quite often).
- Furthermore, it implements an LCA version of the Rubin causal model, but I am not sure how popular this could be among social scientists.
Not-so-nice features:
- you still don’t have control over specific parameters, apart from fixing all conditional probabilities to be equal across groups in multiple-group analysis;
- it doesn’t provide usual fit statistics such as G-square or Chi-square;
- as can be seen from the output below, it doesn’t handle items with differing numbers of categories nicely;
- still no 3-step approach to adding covariates;
- it’s not on CRAN; for installation, visit the link;
- available only for Windows.
Example: WVS data from Canada, three trust items. The basic input commands are similar to poLCA:
lca.lcca <- lcca::lca(
  cbind(trust_family, trust, trust_strangers) ~ 1,
  nclass = 2,
  data = d1,
  flatten.rhos = 1
)
summary.lca(lca.lcca, show.all = T)

Summary of Latent-Class Analysis
====================================================
Data and model information
====================================================
Number of cases:                       2164
Total frequency for all cases:         2164
Number of measurement items:           3
Number of categories per item:         4 10 4
Number of latent classes:              2
Starting values for rhos:              randomly generated
Random seed 1:                         873
Random seed 2:                         906
Starting values for gammas:            uniform values
Flattening constant for rhos:          1
Max. number of EM iterations:          5000
Convergence criterion:                 0.000001
====================================================
Fit statistics
====================================================
The EM algorithm CONVERGED in:         382 iterations
Standard errors computed successfully.
Standard-error method:                 STANDARD
Number of free parameters estimated:   31.000000
Loglikelihood:                         -7652.201060
Loglikelihood + penalty:               -7668.500516
-2 * Loglikelihood:                    15304.402121
AIC (smaller is better):               15366.402121
BIC (smaller is better):               15542.473244
====================================================
Parameter estimates
====================================================
Class prevalences (gammas):
Class:                1        2
                 0.6567   0.3433

Item-response probabilities (rhos):
Response category 1
Class:                1        2
trust_family     0.9029   0.7242
trust            0.0004   0.1137
trust_strangers  0.0190   0.0096
Response category 2
Class:                1        2
trust_family     0.0920   0.2381
trust            0.0041   0.0612
trust_strangers  0.6405   0.2251
Response category 3
Class:                1        2
trust_family     0.0027   0.0281
trust            0.0078   0.1338
trust_strangers  0.2763   0.5095
Response category 4
Class:                1        2
trust_family     0.0024   0.0096
trust            0.0254   0.1301
trust_strangers  0.0641   0.2558
Response category 5
Class:                1        2
trust_family         NA       NA
trust             0.112    0.182
trust_strangers      NA       NA
Response category 6
Class:                1        2
trust_family         NA       NA
trust            0.1092   0.1847
trust_strangers      NA       NA
Response category 7
Class:                1        2
trust_family         NA       NA
trust            0.2268   0.1123
trust_strangers      NA       NA
Response category 8
Class:                1        2
trust_family         NA       NA
trust            0.2973   0.0634
trust_strangers      NA       NA
Response category 9
Class:                1        2
trust_family         NA       NA
trust            0.1382   0.0134
trust_strangers      NA       NA
Response category 10
Class:                1        2
trust_family         NA       NA
trust            0.0788   0.0054
trust_strangers      NA       NA

Standard errors for class prevalences (gammas):
            Est.   Std.Err
Class 1  0.65672   0.05432
Class 2  0.34328   0.05432

Standard errors for item-response probabilities (rhos):
                                          Est.   Std.Err
Class 1, trust_family   , Response 1   0.90291   0.01354
Class 1, trust_family   , Response 2   0.09197   0.01232
Class 1, trust_family   , Response 3   0.00270   0.00250
Class 1, trust_family   , Response 4   0.00242   0.00228
Class 1, trust          , Response 1   0.00037   0.00115
Class 1, trust          , Response 2   0.00405   0.00542
Class 1, trust          , Response 3   0.00779   0.00804
Class 1, trust          , Response 4   0.02543   0.00958
Class 1, trust          , Response 5   0.11204   0.01494
Class 1, trust          , Response 6   0.10920   0.01487
Class 1, trust          , Response 7   0.22684   0.01655
Class 1, trust          , Response 8   0.29728   0.01928
Class 1, trust          , Response 9   0.13821   0.01325
Class 1, trust          , Response 10  0.07879   0.00974
Class 1, trust_strangers, Response 1   0.01904   0.00417
Class 1, trust_strangers, Response 2   0.64055   0.02808
Class 1, trust_strangers, Response 3   0.27634   0.02062
Class 1, trust_strangers, Response 4   0.06407   0.01483
Class 2, trust_family   , Response 1   0.72418   0.02440
Class 2, trust_family   , Response 2   0.23815   0.02276
Class 2, trust_family   , Response 3   0.02806   0.00738
Class 2, trust_family   , Response 4   0.00961   0.00487
Class 2, trust          , Response 1   0.11368   0.02113
Class 2, trust          , Response 2   0.06122   0.01409
Class 2, trust          , Response 3   0.13376   0.02201
Class 2, trust          , Response 4   0.13007   0.02147
Class 2, trust          , Response 5   0.18201   0.02619
Class 2, trust          , Response 6   0.18471   0.02647
Class 2, trust          , Response 7   0.11234   0.03250
Class 2, trust          , Response 8   0.06341   0.03142
Class 2, trust          , Response 9   0.01341   0.01516
Class 2, trust          , Response 10  0.00540   0.01156
Class 2, trust_strangers, Response 1   0.00964   0.00506
Class 2, trust_strangers, Response 2   0.22507   0.03774
Class 2, trust_strangers, Response 3   0.50950   0.02959
Class 2, trust_strangers, Response 4   0.25579   0.02777
In order to perform multiple-group LCA, just add the groups argument to the above input and specify whether you want the conditional response probabilities (constrain.rhos) and class probabilities (constrain.gammas) to be equal across groups or not. When adding predictors, make sure to change the function to lca.cov. A separate post about multiple-group LCA in different software will follow.
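A minimal sketch of what such a call might look like; the grouping variable sex is hypothetical, and while the argument names come from the description above, their exact form is an assumption since the package is poorly documented:

# Hypothetical multiple-group call; `sex` and the logical values for the
# constraint arguments are assumptions, not documented lcca syntax
lca.lcca.mg <- lcca::lca(
  cbind(trust_family, trust, trust_strangers) ~ 1,
  nclass = 2,
  data = d1,
  groups = sex,               # hypothetical grouping variable
  constrain.rhos = TRUE,      # hold item-response probabilities equal across groups
  constrain.gammas = FALSE    # let class prevalences differ between groups
)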