Ways to do Latent Class Analysis in R

The best way to do latent class analysis is by using Mplus, or if you are interested in some very specific LCA models you may need Latent Gold. Another decent option is to use PROC LCA in SAS. All the other ways and programs might be frustrating, but are helpful if your purposes happen to coincide with the specific R package.
CRAN offers plenty of different ways to get clusters on your data, but most of these packages have a very narrow and specific utility. For example, I found at least 15 packages involving latent class models, of which only six perform latent class analysis in the form of classification based on indicators, and only two of them allow including nominal indicators, and none allows including ordinal indicators.

First, there are all kinds of mixture models  whose main purpose is to look for the classes in which the regression parameters differ. The meaning of the latent classes here is different as they are based not on the responses of respondents, but on the effects of one variables on other. These packages include flexmix, fpc, mmlcr, lcmm, and others. I do not discuss them.
Second, there are updated Lazarsfeld’s latent class models whose main purpose is to obtain a discrete latent variable based on the responses of respondents. These models look for homogeneous (in terms of responses) classes which differ by responses. Optionally, one can add some predictors and distal outcomes (variables that depend on classes but not their indicators).
R packages capabilities

Continuous responses Binary responses Categorical unordered (nominal) responses Allows to add
poLCA + + class predictors
lcca + + class predictors, multiple groups
depmixS4 + + + class predictors, constaints
covLCA + + class predictors
RandomLCA + random intercept
BayesLCA + priors
e1071::lca +
Mclust +  priors

NONE may utilize ordinal responses (although poLCA manual insists there is no big practical difference with treating responses as nominal ones). NONE can apply 3-step approach in adding covariates. Below I describe three packages that allow for nominal indicators: poLCA, depmixS4, and lcca.

poLCA -[Polytomous Latent class analysis]

Latent classes based on nominal responses (only), may add predictors of all latent classes (in one stage).
Nice features:

  • simple input. You just put:
> poLCA(cbind(indicator1, indicator2, indicator3)~1, data=mydata)

and poLCA gives class probabilities, conditional response probabilities and the fit statistics. The output is clear  and simple.

  • repeats fitting the model as many times as you like, and manipulate convergence criterion and number of iterations.
  •  builds graphs automatically, although not always these graphs are useful.

Not so nice features: 

  • seems to have problems in dealing with indicators with different number of categories; for example, poLCA.entropy wasn’t calculated in my example;
  • doesn’t handle ordinal or metric indicators;
  • you can see only probabilities but not the original thresholds; you don’t have control over any parameters, cannot constrain them to test hypotheses;
  • no function for comparing models, it needs to be done manually;
  • in the manual the authors defend one-step approach of adding covariates using outdated publication by  Bolck, Croon and Hagenaars (2004), whereas adjustments to the three-step approach were suggested a while ago (Vermunt 2010) and successfully implemented in LatentGold and and Mplus software. You can’t even do it manually with poLCA.

Example. I use WVS round 5 data from Canada and three items on trust: generalized interpersonal 10-point trust scale, and two 4-point scales of trust to family and trust to strangers.
The input commands are very simple, excluding an odd need to cbind the LCA indicators.


lca.polca <- poLCA(
                cbind(trust, trust_family, trust_strangers)~1,
                nclass=2,
                data=d1,
                nrep=1,
                na.rm=F,
                graphs=T,
                maxiter = 100000
             )
Conditional item response (column) probabilities,
 by outcome variable, for each class (row)
$trust_family
           Pr(1)  Pr(2) Pr(3) Pr(4)
class 1: 0.9058 0.0901 0.0019 0.0022
class 2: 0.7275 0.2356 0.0278 0.0091
$trust
           Pr(1)  Pr(2)  Pr(3)  Pr(4) Pr(5)  Pr(6)  Pr(7)  Pr(8)  Pr(9)  Pr(10)
class 1: 0.0000 0.0027 0.0058 0.0239 0.1103 0.1077 0.2272 0.3008 0.1407 0.0810
class 2: 0.1098 0.0612 0.1324 0.1287 0.1825 0.1846 0.1164 0.0665 0.0137 0.0042
$trust_strangers
          Pr(1)  Pr(2) Pr(3) Pr(4)
class 1: 0.0190 0.6439 0.274 0.063
class 2: 0.0094 0.2356 0.505 0.250
Estimated class population shares
 0.6428 0.3572
Predicted class memberships (by modal posterior prob.)
 0.677 0.323
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 2164
number of fully observed cases: 2071
number of estimated parameters: 31
residual degrees of freedom: 128
maximum log-likelihood: -7651.997
AIC(2): 15365.99
BIC(2): 15542.07
G^2(2): 132.5952 (Likelihood ratio/deviance statistic)
X^2(2): 181.2104 (Chi-square goodness of fit)

polca_plot.
So, there are two classes and their profiles are nicely plotted, though the plot isn’t too informative as it’s hard to compare class profiles.

poLCA.entropy(lca.polca)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : 
 arguments imply differing number of rows: 4, 10

Entropy wasn’t calculated in my case. All the other functions worked smoothly.
Here’s a quick fix for an “absolute entropy” native poLCA function:

poLCA.entropy.fix <- function (lc)
{
 K.j <- sapply(lc$probs, ncol)
 fullcell <- expand.grid(sapply(K.j, seq, from = 1))
 P.c <- poLCA.predcell(lc, fullcell)
 return(-sum(P.c * log(P.c), na.rm = TRUE))
}
poLCA.entropy.fix(lca.polca)
[1] 3.602397

And here is a solution from Daniel Oberski (slide 66), called Entropy R2:

entropy.R2 <- function(fit) {
  entropy <- function(p) sum(-p * log(p))
  error_prior <- entropy(fit$P) # Class proportions
  error_post <- mean(apply(fit$posterior, 1, entropy))
  R2_entropy <- (error_prior - error_post) / error_prior
  R2_entropy
}
entropy_R2(lca.polca)
[1] 0.4205958

lca.polca$predclass gives a vector of predicted class membership for each respondent.
lca.polca$posterior gives a matrix of probabilities for each respondent and each class.
You may add predictors to the model by substituting ~1 with ~predictor1 + predictor2… But be careful, it applies one-step approach which means the classification will depend not only on your indicators, but on the predictors too.

depmixS4 [Dependent Mixture Models]

The package allows to fit various kinds of hidden Markov models, to which LCA model is also generalized.  To avoid time-consuming mistakes in model specification, the analysis involves two steps: construction of a model with mix function and fitting it with fit function. family argument of mix function allows specifying a type of observed variables – whether they are continuous, nominal, or count by adding to a list corresponding distribution name, i.g. guassian or multinomial.
Nice features

  • can be used with nominal, continuous,  count, and other kinds of variables, additionally, users can construct response distribution themselves;
  • can mix different kinds of indicators, for example, multinomial with continuous;
  • likelihood ratio test is included and is as easy as llratio(fit1, fit2)
  • constraints are allowed

Not so nice features

  • Due to its different main purpose (which is latent transition modeling of time series), terminology differs from conventional LCA. Classes are called states.
  • Some basic functions are needed to be specified manually: There is no easy way to plot probabilities; no G2 or X2 values, no entropy.

Example. The same model is built

library(depmixS4) 
model_definition <- mix(list(trust_family ~ 1, trust ~ 1, trust_strangers ~ 1),
 family = list(multinomial("identity"), #For every corresponding 
 multinomial("identity"),  #  indicator a family of distribution 
 multinomial("identity")), # should be indicated in the list.
 data = wvs.canada,
 nstates = 2, #This is the number of classes
 nstart=c(0.5, 0.5), # Prior probabilities of classes
 initdata = wvs.canada #Our data
)
fit.mod <- fit(model_definition)   # Fit the model
iteration 0 logLik: -7754.755
.....
iteration 335 logLik: -7652.003
converged at iteration 339 with logLik: -7652.003
fit.mod
Convergence info: Log likelihood converged to within tol. (relative change)
'log Lik.' -7652.003 (df=31)
AIC: 15366.01
BIC: 15540.71

Loglikelihood values as well as AIC and BIC are very close to the ones found in with poLCA.
summary function shows found probabilities of classes and conditional response probabilities.

summary(fit.mod)
Mixture probabilities model
 pr1 pr2
0.635535 0.364465
Response parameters
Resp 1 : multinomial
Resp 2 : multinomial
Resp 3 : multinomial
 Re1.1 Re1.2 Re1.3 Re1.4 Re2.1 Re2.2 Re2.3 Re2.4 Re2.5 Re2.6 Re2.7
St1 0.9070074 0.08930021 0.001778364 0.001914039 0.0000000 0.002556402 0.005311421 0.02358108 0.1101063 0.1075408 0.2269529
St2 0.7288673 0.23409146 0.027594026 0.009447176 0.1076035 0.060335151 0.130693172 0.12715298 0.1814410 0.1833505 0.1189929
 Re2.8 Re2.9 Re2.10 Re3.1 Re3.2 Re3.3 Re3.4
St1 0.30171593 0.14132783 0.080907262 0.019127226 0.6473959 0.2723859 0.06109097
St2 0.06946556 0.01515732 0.005807863 0.009429918 0.2377029 0.5032355 0.24963171

St1 here stands for state 1, which is class 1. Re1.1 stands for variable 1, response to category 1, in my case it’s trust to family item, response “Trust completely”. Taking together,  a probability of a respondent to give response “Trust completely” to the question about their family given this respondent is member of latent class 1 is 0.9070074, or, rounding 0.9.
With a little messy function I could plot these conditional probabilities:

depmixS4.plot.probs <- function(fit.mod) {
  require(gridExtra)
  require(grid)
  require(ggplot2)
  require(reshape2)
  a<-lapply(1:length(fit.mod@response[[1]]), function(y) {
     sapply(fit.mod@response, function(x) x[[y]]@parameters$coefficients)
          })
 names(a)<-sapply(1:length(fit.mod@response[[1]]), function(y) as.character(fit.mod@response[[1]][[y]]@formula[[2]]))
 g<-grid::gList()
 for(i in 1:length(a)) {
 g[[i]]<- ggplotGrob(ggplot(melt(a[[i]]), aes(paste("Class", rev(Var2) ), value, fill=as.factor(Var1), label=round(value, 2)))+
 geom_col()+geom_text(position=position_stack(vjust=0.5))+coord_flip()+scale_fill_brewer(type="seq", palette=i)+
 labs(x="", y="", title=names(a)[i], fill="Categories")+theme_minimal())
 }
 grid.arrange(grobs=g, nrow=length(a) )
}
depmixS4.plot.probs(fit.mod)

Rplot
fit.mod@posterior gives a predicted class membership variable and posterior probabilities of membership.
Expanding example to mixed distributions.  Let’s take advantage of the depmixS4 abilities and assign variable trust normal distribution instead of multinomial:

mod2 <- mix(list(trust_family ~ 1, trust ~ 1, trust_strangers ~ 1),
   family = list(multinomial("identity"),
   gaussian("identity"), # This line has changed
   multinomial("identity")),
   data = wvs.canada,
   nstates = 2, 
   nstart=c(0.5, 0.5),
   initdata = wvs.canada)
fit.mod2 <- fit(mod2)
iteration 0 logLik: -7958.793
...
iteration 140 logLik: -7754.039
converged at iteration 143 with logLik: -7754.038
fit.mod2
Convergence info: Log likelihood converged to within tol. (relative change)
'log Lik.' -7754.038 (df=17)
AIC: 15542.08
BIC: 15637.89
summary(fit.mod2)
Mixture probabilities model
 pr1 pr2
0.6056074 0.3943926
Response parameters
Resp 1 : multinomial
Resp 2 : gaussian
Resp 3 : multinomial
 Re1.1 Re1.2 Re1.3 Re1.4 Re2.(Intercept) Re2.sd Re3.1 Re3.2 Re3.3 Re3.4
St1 0.906664 0.08935629 0.001804677 0.002175068 7.577571 1.337904 0.01845478 0.6356047 0.2792356 0.06670489
St2 0.742831 0.22308549 0.025606576 0.008476908 4.642214 2.045671 0.01119042 0.2865875 0.4753749 0.22684724

The table of results shows that first and third indicators have multinomial responses, hence it shows probabilities for each  response category given the class membership. The second indicator is continuous, so it has two parameters Re2.(Intercept)  which us mean, and Re2.sd, which  is standard deviation.

lcca [Latent class causal analysis]

this is my favorite, although it’s a dead-born package – since publication of 2.0.0 version in 2013 it hasn’t been supported, it’s an alpha software, it runs only on Windows, and owners of the package are not together anymore (the developers seem to switch to PROC LCA instead). I hope there will be some progress and they will find a way to keep developing lcca.
Nice features:

  • In addition to classic LCA with nominal indicators, it can do a multiple group LCA models and fix or relax all the response probabilities.
  • Like poLCA it allows to add covariates that have an effect on class probabilities (class sizes). It allows to compare likelihoods of several models with the same number of classes and differing covariates.
  • Another nice feature of this package is its ability to account for sampling weights and specific features of sample, including sampling clusters and strata, which is very useful given your using of survey data.
  • You can also “flatten rhos” which means avoiding too large and too small thresholds corresponding to conditional response probabilities (it’s often appears).
  • Furthermore, it implements the LCA version of Rubin causal model, but I am not sure how popular it could be among social scientists.

Not-so-nice features:

  • you still don’t have control over specific parameters excluding fixing all conditional probabilities to be equal across groups in multiple groups analysis;
  • it doesn’t provide usual fit statistics such as G-square or Chi-square;
  • as can be seen from below output it doesn’t nicely handle items with differing number of categories;
  • still no 3-step approach to adding covariates;
  • it’s not in the CRAN, for installation visit the  link;
  • available only for PC.

Example: WVS data from Canada, three trust items. The basic input commands are similar to poLCA:

lca.lcca<-lcca::lca(
              cbind(trust_family, trust, trust_strangers)~1,
              nclass=2,
              data=d1,
              flatten.rhos=1
           )
summary.lca(lca.lcca, show.all=T)
        Summary of Latent-Class Analysis
====================================================
Data and model information
====================================================
Number of cases:   2164
Total frequency for all cases:   2164
Number of measurement items:     3
Number of categories per item:   4 10 4
Number of latent classes:        2
Starting values for rhos:        randomly generated
Random seed 1:                   873
Random seed 2:                   906
Starting values for gammas:      uniform values
Flattening constant for rhos:    1
Max. number of EM iterations:    5000
Convergence criterion:           0.000001
====================================================
Fit statistics
====================================================
The EM algorithm CONVERGED in: 382 iterations
Standard errors computed successfully.
Standard-error method:   STANDARD
Number of free parameters estimated:      31.000000
Loglikelihood:                         -7652.201060
Loglikelihood + penalty:               -7668.500516
-2 * Loglikelihood:                    15304.402121
AIC (smaller is better):               15366.402121
BIC (smaller is better):               15542.473244
====================================================
Parameter estimates
====================================================
Class prevalences (gammas):
Class:                   1       2
                    0.6567  0.3433
Item-response probabilities (rhos):
   Response category 1
Class:                   1       2
   trust_family     0.9029  0.7242
   trust            0.0004  0.1137
   trust_strangers  0.0190  0.0096
   Response category 2
Class:                   1       2
   trust_family     0.0920  0.2381
   trust            0.0041  0.0612
   trust_strangers  0.6405  0.2251
   Response category 3
Class:                   1       2
   trust_family     0.0027  0.0281
   trust            0.0078  0.1338
   trust_strangers  0.2763  0.5095
   Response category 4
Class:                   1       2
   trust_family     0.0024  0.0096
   trust            0.0254  0.1301
   trust_strangers  0.0641  0.2558
   Response category 5
Class:                   1       2
   trust_family         NA      NA
   trust             0.112   0.182
   trust_strangers      NA      NA
   Response category 6
Class:                   1       2
   trust_family         NA      NA
   trust            0.1092  0.1847
   trust_strangers      NA      NA
   Response category 7
Class:                   1       2
   trust_family         NA      NA
   trust            0.2268  0.1123
   trust_strangers      NA      NA
   Response category 8
Class:                   1       2
   trust_family         NA      NA
   trust            0.2973  0.0634
   trust_strangers      NA      NA
   Response category 9
Class:                   1       2
   trust_family         NA      NA
   trust            0.1382  0.0134
   trust_strangers      NA      NA
   Response category 10
Class:                   1       2
   trust_family         NA      NA
   trust            0.0788  0.0054
   trust_strangers      NA      NA
Standard errors for class prevalences (gammas):
           Est. Std.Err
Class 1 0.65672 0.05432
Class 2 0.34328 0.05432
Standard errors for item-response probabilities (rhos):
                                         Est. Std.Err
Class 1, trust_family   , Response  1 0.90291 0.01354
Class 1, trust_family   , Response  2 0.09197 0.01232
Class 1, trust_family   , Response  3 0.00270 0.00250
Class 1, trust_family   , Response  4 0.00242 0.00228
Class 1, trust          , Response  1 0.00037 0.00115
Class 1, trust          , Response  2 0.00405 0.00542
Class 1, trust          , Response  3 0.00779 0.00804
Class 1, trust          , Response  4 0.02543 0.00958
Class 1, trust          , Response  5 0.11204 0.01494
Class 1, trust          , Response  6 0.10920 0.01487
Class 1, trust          , Response  7 0.22684 0.01655
Class 1, trust          , Response  8 0.29728 0.01928
Class 1, trust          , Response  9 0.13821 0.01325
Class 1, trust          , Response 10 0.07879 0.00974
Class 1, trust_strangers, Response  1 0.01904 0.00417
Class 1, trust_strangers, Response  2 0.64055 0.02808
Class 1, trust_strangers, Response  3 0.27634 0.02062
Class 1, trust_strangers, Response  4 0.06407 0.01483
Class 2, trust_family   , Response  1 0.72418 0.02440
Class 2, trust_family   , Response  2 0.23815 0.02276
Class 2, trust_family   , Response  3 0.02806 0.00738
Class 2, trust_family   , Response  4 0.00961 0.00487
Class 2, trust          , Response  1 0.11368 0.02113
Class 2, trust          , Response  2 0.06122 0.01409
Class 2, trust          , Response  3 0.13376 0.02201
Class 2, trust          , Response  4 0.13007 0.02147
Class 2, trust          , Response  5 0.18201 0.02619
Class 2, trust          , Response  6 0.18471 0.02647
Class 2, trust          , Response  7 0.11234 0.03250
Class 2, trust          , Response  8 0.06341 0.03142
Class 2, trust          , Response  9 0.01341 0.01516
Class 2, trust          , Response 10 0.00540 0.01156
Class 2, trust_strangers, Response  1 0.00964 0.00506
Class 2, trust_strangers, Response  2 0.22507 0.03774
Class 2, trust_strangers, Response  3 0.50950 0.02959
Class 2, trust_strangers, Response  4 0.25579 0.02777

In order to perform multiple groups LCA, just add groups argument to the above input and specify whether you want conditional response probabilities (constrain.rhos) and class probabilities (constrain.gammas) to be equal or not. When adding predictors make sure to change the function to lca.cov. A specific post about multiple groups LCA in different software will follow.

2 thoughts on “Ways to do Latent Class Analysis in R

Leave a Reply to Aline Cancel reply