Is measurement invariance just a nonsense?

The short answer is no.
But Christian Welzel has a different opinion. He and Ronald Inglehart have challenged the very common and largely accepted, though sometimes tedious and uncomfortable, practice of measurement invariance testing. I do not agree with them, but their challenge provokes a lot of thought and forces some clarifications in the current paradigm.
Aleman and Woods (2015) were among the few researchers who attempted to test the self-expression values index for measurement invariance across countries. Earlier, Hermann Dulmer and I tried to do more or less the same. My MGCFA models simply did not converge; there was too much non-invariance.
Christian Welzel and Ronald Inglehart do not think that these tests make any sense. In their response to Aleman and Woods (full text here) they argue that what we call non-invariance is meaningful variation and can be easily ignored, provided the researcher knows what he or she is measuring. The logic behind this position is close to the formative measurement approach, in which the measured latent variable (here, self-expression) is a causal consequence rather than a cause of its indicators. This implies that the indicators are unique, cannot be substituted for one another, and do not even have to correlate with each other. In other words, it all depends on the researcher’s decision: if I have decided that I measure self-expression, I don’t need to care about its internal consistency or the equivalence of its structure across groups; if self-expression can be seen in Germany, it can be measured in Uganda, and the fact that it is inconsistent in Uganda is not a measurement problem but points to the substantive fact that Ugandan people do not have self-expression articulated enough.
One more argument, and I guess their main one, is that self-expression (or its newer version, emancipative values) has very high predictive validity, that is, it correlates with everything. From the authors’ perspective this guarantees the validity and meaningfulness of the index. I don’t agree, and I summarize my counter-arguments as follows:

  1. The formative logic is authoritarian in nature. A researcher claims that the concept exists without testing it against data. And when a measurement fails to demonstrate consistency in some countries, the researcher says: this country’s population is not mature enough, and people don’t have enough cognitive skills to have this concept articulated in their minds. It sounds a bit arrogant to me. Assigning a phenomenon native to one culture to other cultures without testing means making it normative. (It sounds even more normative when Welzel and Inglehart call it “performance”, as if nations were taking an emancipative values exam.) This is against the etic logic of cross-cultural research (Triandis & Marin, 1983).
  2. When Welzel and Inglehart discuss the non-interchangeability of items in a construct, they actually discuss what we call “content validity”. Nobody ever claimed that constructs measured in reflective logic should not or may not possess content validity. Proponents of reflective measurement only say that it is not the only type of validity a measure should have. Content validity does not contradict the interchangeability of specific items. For example, measures of Schwartz values are supposed to cover all 10 values, while the items within each value index are interchangeable. It might be problematic, though, when one has only a very limited number of items (as in large-scale surveys); then content validity becomes an issue. For example, in the ESS the reliability of the 10 value scales was sacrificed in order to achieve higher content validity. Internal consistency (or reliability, as it is sometimes referred to) is frequently criticized as a criterion, since a low number of items makes it impossible to meet both content validity and consistency at the same time. Still, all the items must correlate significantly and positively with the latent variable; otherwise they simply don’t belong to the latent construct, or they might even be negatively related to it.
    The logic of measurement is to balance different kinds of validity and reliability and not to give any one kind full credit. If we rely fully on internal consistency, we will have the problems that Welzel and Inglehart have pointed out. If one relies fully on the predictive power of a measure, as they have suggested, one risks ending up with a measure that is no more meaningful than a simulated random variable with a given correlation with a criterion variable. The measure may become a fake, especially when the criterion variable (the one used to validate our measure) is the same as what we intend to predict in substantive research. (This is deeply endogenous: if we set out to find a variable correlated with GDP per capita, we will certainly find one, but it would not be correct to use it to prove a hypothesis about that variable.) Or one may run into problems of comparability, for which emancipative values have been criticized.
    As Messick (1989) states, the one and only type of validity we should actually care about is construct validity, that is, a measure should measure what it is supposed to measure. This is different from predictive validity. Our variable may predict many things and still measure something different from what we think it measures.
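
The point about a simulated random variable can be made concrete with a small Python sketch. It builds a “fake index” from nothing but a rescaled copy of the criterion plus random noise, so that it correlates at roughly 0.8 with the criterion by construction. The criterion values, the sample size, and the target correlation are all invented for illustration; nothing here comes from real survey data.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fake_index(criterion, target_r, rng):
    """Mix the standardized criterion with pure noise.

    The result has an expected correlation of target_r with the
    criterion, but no substantive content of its own.
    """
    n = len(criterion)
    mean = sum(criterion) / n
    sd = math.sqrt(sum((c - mean) ** 2 for c in criterion) / n)
    z = [(c - mean) / sd for c in criterion]
    noise_weight = math.sqrt(1 - target_r ** 2)
    return [target_r * zi + noise_weight * rng.gauss(0, 1) for zi in z]


rng = random.Random(42)
# Hypothetical criterion, e.g. something like log GDP per capita.
criterion = [rng.gauss(10, 2) for _ in range(5000)]
index = fake_index(criterion, 0.8, rng)

r = pearson(index, criterion)
print(f"correlation with criterion: {r:.2f}")
```

The variable “predicts” the criterion well, yet it is only noise plus the criterion itself. High predictive validity alone therefore says nothing about what a variable measures.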

Going beyond the basics of psychometrics, Welzel and Inglehart point to a rarely discussed feature of cross-country comparisons, namely that there are two kinds of country measures: characteristics that inherently belong to the country level (territory, language, cultural values, Gini, etc.) and country-level aggregates of individual characteristics. When inherently country-level characteristics are discussed, there is no disagreement: dimensions are found and tested at the country level. However, the country level of an individual variable is sometimes seen as problematic, since it is not readily understandable how it relates to the country and individual levels. The fact is, it has a different meaning, separate both from inherently country-level characteristics and from individual ones. The average level of emancipative values in a country reflects how modernized the country’s population is. It doesn’t say anything about the country’s culture or about individual members of the population, and it doesn’t imply that it can be validly compared across countries. Culture is seen as an entity separate from the population; it is something external that is much more related to the population’s past than to the modern population currently being studied. A culture may be individualistic while the population is at the same time very altruistic (these are not opposites, just different characteristics).
Therefore, when one is interested in the characteristics of a population, he or she refers to the mean of individual-level variables. Thus, the logic behind MGCFA is reasonable. If one ignores it and blindly turns to the ‘combinatory’ approach, he or she risks comparing a level of tolerance of homosexuality with the importance of children’s independence. Are these two variables compensatory? How can the importance of independence compensate for a lack of tolerance? A score of, for example, 0.5 becomes highly ambiguous: it may mean tolerance, or independence, or a moderate level of both.
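
The ambiguity of a combinatory score can be shown with a toy calculation. The sketch below assumes just two items, both already rescaled to the 0–1 range, and an unweighted mean as the combination rule. These are simplifying assumptions for illustration; the real index is built from more items.

```python
def combinatory_score(tolerance, independence):
    """Unweighted mean of two items, each assumed to lie in [0, 1]."""
    return (tolerance + independence) / 2


# Three hypothetical respondents (or country means) with very
# different profiles on the two items.
profiles = {
    "tolerant_only": (1.0, 0.0),
    "independence_only": (0.0, 1.0),
    "moderate_on_both": (0.5, 0.5),
}

scores = {name: combinatory_score(*items) for name, items in profiles.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```

All three profiles receive the same score, so the number alone cannot tell us which combination of tolerance and independence produced it.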


Hopefully, Aleman and Woods are preparing their rejoinder to Welzel & Inglehart’s response, so this is not the end of this suddenly emerged discussion.
Besides that, Boris Sokolov has already given an elegant and very comprehensive response in which he disproves most of Welzel’s arguments (slides are here).

UPD Feb 2018: Sokolov’s argument has been published in APSR: https://doi.org/10.1017/S0003055417000624


I can’t help mentioning that this is not the first discussion of this kind. A similar topic was already discussed in the Journal of Cross-Cultural Psychology in 2012, where Bomhoff & Gu expressed their concern about the validity of the self-expression index, claiming that it is not consistent in East Asia and hence not reliable. Welzel responded that they should be using the emancipative values index (which is more consistent theoretically, but not empirically) and demonstrated the approximate similarity of principal components’ loadings across 10 cultural zones (which is usually regarded as the lowest level, “configural invariance”, in the SEM paradigm). Inconsistently with that, he claimed that the emancipative values index is a formative one. I would add that the assumptions of factor analysis are in conflict with formative logic, so it is not an acceptable way to defend the measure. And finally, Bomhoff & Gu replied that the self-expression index (just like the emancipative one) has an item that shows different signs of correlation with external variables. In psychometrics this is called DIF (differential item functioning) and is regarded as a severe violation of validity. I don’t know whether DIF matters for the formative measurement approach; I guess not, like anything else besides The Will of a mighty researcher.
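
One common way to put a number on the “approximate similarity” of loadings between two groups is Tucker’s congruence coefficient. The sketch below is only an illustration of that general idea: the source does not say which criterion Welzel actually used, and the loadings are invented, not taken from any published analysis.

```python
import math


def tucker_congruence(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two loading vectors.

    phi = sum(a_i * b_i) / sqrt(sum(a_i^2) * sum(b_i^2))
    """
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = sum(a * a for a in loadings_a)
    norm_b = sum(b * b for b in loadings_b)
    return cross / math.sqrt(norm_a * norm_b)


# Hypothetical loadings of four items on one component in three groups.
zone_1 = [0.7, 0.6, 0.5, 0.6]
zone_2 = [1.4, 1.2, 1.0, 1.2]   # same pattern, twice the size
zone_3 = [0.7, 0.6, 0.5, -0.3]  # last item flips its sign

phi_same_pattern = tucker_congruence(zone_1, zone_2)
phi_sign_flip = tucker_congruence(zone_1, zone_3)

print(f"zone 1 vs zone 2: {phi_same_pattern:.3f}")
print(f"zone 1 vs zone 3: {phi_sign_flip:.3f}")
```

Note that the coefficient ignores the overall size of the loadings: zone 2 looks identical to zone 1 even though every loading is doubled. A high congruence therefore speaks only to the similarity of the pattern, which is why it corresponds to the weakest level of invariance rather than to equal loadings or intercepts.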
