# Chapter 2: Factor Analysis

### Terminology

In factor analysis, the latent constructs are referred to as factors or latent variables, and the observable variables which are treated as measures of the factors are called indicators, items, or observed variables. The survey questions in our examples are regarded as such indicators.

An indicator is not assumed to measure the corresponding factor perfectly but with some measurement error. For instance, the question "to what extent is it your duty to back the decisions made by the police even when you disagree with them?" is conceptually narrower than the factor "obligation to obey". This specific question provides information about the factor, but not enough on its own for reliable inference on the factor. For better measurement, several items are used to measure the same factor, and information from them is combined through factor analysis.

### The notation used here

We will denote a latent factor by the Greek letter η. Different factors are denoted with subscripts, so that for example η1 and η2 are two distinct factors. Indicators are denoted with y and their measurement errors with ε, with subscripts for different indicators and errors. In later sections the models will also involve observed variables which are treated as explanatory variables rather than measures of the factors; notation for them will be introduced later.

### The mathematical formula of a factor analysis model

Let η denote a single latent factor, and let y1, ..., yp be p indicators of η. The factor η is assumed to be normally distributed with mean κ and variance φ. The measurement model for item yj as a measure of η is

yj = νj + λjη + εj for each j = 1, ..., p.

This is a simple linear regression model where item yj is the dependent variable, factor η is the explanatory variable, and εj is the residual or measurement error. It is assumed that the εj are all normally distributed with means 0 and variances θj and that they are uncorrelated with η. The parameters of this measurement model of an item given one factor are the intercept νj, the regression coefficient λj - which in factor analysis is called the loading - and the variance θj of the measurement error. For instance, for the factor "Obligation to obey" we use the three indicators D18-D20, so p = 3 and the measurement model consists of the three models

y1 = ν1 + λ1η + ε1
y2 = ν2 + λ2η + ε2
y3 = ν3 + λ3η + ε3

which - if we substitute labels for the variables in this example - stands for

(item D18) = ν1 + λ1(Obligation to obey) + (measurement error)1
(item D19) = ν2 + λ2(Obligation to obey) + (measurement error)2
(item D20) = ν3 + λ3(Obligation to obey) + (measurement error)3

It is usually assumed that the measurement errors εj are all uncorrelated with each other, so that there are no "error correlations" (residual correlations) between the observed indicators of the factor after we control for their common dependence on η. However, this assumption is sometimes relaxed by allowing non-zero covariances cov(εj, εk) = θjk between the error terms of one or more specific pairs of items.

The model thus describes a situation where each item measures the factor, but not perfectly, so that the value of an item is determined by the factor and a measurement error. The larger λj²φ is relative to θj, the larger is the proportion of the variance of yj that is explained by the factor, and thus the more reliable yj is as an indicator of η.
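As a small numerical illustration of this idea, the sketch below computes the share of an item's variance that is explained by the factor, λj²φ / (λj²φ + θj). The parameter values are hypothetical and chosen only for illustration; they are not estimates from the survey data.

```r
# Hypothetical parameter values for one item (not estimates from real data)
lambda <- 0.8    # loading
phi    <- 1.0    # factor variance
theta  <- 0.36   # measurement error variance

var_y       <- lambda^2 * phi + theta   # total variance of the item
reliability <- lambda^2 * phi / var_y   # share of variance explained by the factor
reliability
```

With these values the factor accounts for 0.64 of the variance of the item; a larger error variance θj would lower this share.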

The model may have more than one factor. For example, suppose that six items y1, ..., y6 are regarded as measures of two latent factors η1 and η2. The measurement model may then be extended, for example as

y1 = ν1 + λ11η1 + λ12η2 + ε1
y2 = ν2 + λ21η1 + λ22η2 + ε2
y3 = ν3 + λ31η1 + λ32η2 + ε3
y4 = ν4 + λ41η1 + λ42η2 + ε4
y5 = ν5 + λ51η1 + λ52η2 + ε5
y6 = ν6 + λ61η1 + λ62η2 + ε6

where η1 and η2 are assumed to be jointly normally distributed, with means κ1 and κ2, variances φ1 and φ2, and covariance φ12. In this model, all items are measures of both factors. Often we consider more restrictive (and thus simpler) models, in particular ones where each item is taken to measure only one factor. This is achieved by setting some of the loadings λjk to 0. For example, suppose that y1, y2, y3 measure factor η1 and y4, y5, y6 measure η2. The measurement model is then

```
y1 = ν1 + λ11η1         + ε1
y2 = ν2 + λ21η1         + ε2
y3 = ν3 + λ31η1         + ε3
y4 = ν4         + λ42η2 + ε4
y5 = ν5         + λ52η2 + ε5
y6 = ν6         + λ62η2 + ε6
```

Factor analysis models where all items measure all factors and there are no error correlations are often referred to as Exploratory Factor Analysis (EFA) models, and models with other sets of assumptions (such as further constraints of zero loadings, other parameter constraints, or non-zero error correlations) are known as Confirmatory Factor Analysis (CFA) models.

### Path diagrams

Factor analysis models and other latent variable models are often depicted with path diagrams where most elements of the model equations are presented graphically. In a path diagram all observed variables are denoted by rectangles, and latent variables by circles. Arrows depict relationships among the variables. A double-headed arrow between two variables denotes a correlation between them, and a single-headed arrow denotes a conditional distribution (regression model) between two variables, with the arrow pointing to the dependent variable. Single-headed arrows which point to a variable without another variable at the other end of the arrow denote measurement errors (residuals); these may also be omitted from the diagram. In addition, labels and/or estimated values for some variables and/or parameters may also be included in the diagram, but these are also often omitted.

Figure 2.1 illustrates these conventions. It shows the path diagram for the two-factor confirmatory factor analysis model defined above, in which y1, y2, y3 measure only η1 and y4, y5, y6 measure only η2.

Figure 2.1: Theoretical model for the relationships of the constructs in the example.

### Model for the observed variables

A model which includes latent factors also implies a model for the distribution of the observed variables. For example, consider again the two-factor model for six items where

```
y1 = ν1 + λ11η1         + ε1
y2 = ν2 + λ21η1         + ε2
y3 = ν3 + λ31η1         + ε3
y4 = ν4         + λ42η2 + ε4
y5 = ν5         + λ52η2 + ε5
y6 = ν6         + λ62η2 + ε6
```

and where the factors η1 and η2 are assumed to be jointly normally distributed, with means κ1 and κ2, variances φ1 and φ2, and covariance φ12. This model implies that the means of the items depend on the model parameters as follows:

E(yj) = νj + λj1κ1 for j = 1, 2, 3 and E(yj) = νj + λj2κ2 for j = 4, 5, 6

and their variances as

var(yj) = (λj1)²φ1 + θj for j = 1, 2, 3

var(yj) = (λj2)²φ2 + θj for j = 4, 5, 6

and covariances between pairs of distinct items are

cov(yj,yk) = λj1λk1φ1 for j ≠ k in 1, 2, 3

cov(yj,yk) = λj2λk2φ2 for j ≠ k in 4, 5, 6

cov(yj,yk) = λj1λk2φ12 for j in 1, 2, 3 and k in 4, 5, 6
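These expressions can be checked numerically. The sketch below collects the loadings in a matrix Lambda, the factor variances and covariance in Phi, and the error variances in a diagonal matrix Theta, and computes the implied means as ν + Λκ and the implied covariance matrix as ΛΦΛ' + Θ. The matrix names and all parameter values are our own choices, made only for illustration.

```r
# Hypothetical parameter values, chosen only for illustration
Lambda <- matrix(0, nrow = 6, ncol = 2)
Lambda[1:3, 1] <- c(0.9, 0.8, 0.7)               # loadings of y1-y3 on eta1
Lambda[4:6, 2] <- c(0.6, 0.7, 0.8)               # loadings of y4-y6 on eta2
nu    <- rep(2, 6)                               # item intercepts
kappa <- c(0, 0)                                 # factor means
Phi   <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2)           # factor variances and covariance
Theta <- diag(c(0.4, 0.5, 0.6, 0.5, 0.4, 0.3))   # error variances

mu    <- as.vector(nu + Lambda %*% kappa)        # implied means of the items
Sigma <- Lambda %*% Phi %*% t(Lambda) + Theta    # implied variances and covariances

Sigma[1, 1]   # (lambda11)^2 * phi1 + theta1
Sigma[1, 2]   # lambda11 * lambda21 * phi1
Sigma[1, 4]   # lambda11 * lambda42 * phi12
```

Each printed element agrees with the corresponding formula above, for example Sigma[1, 4] = 0.9 × 0.6 × 0.5.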

### Estimation of the models

The implied model for the observed variables is what allows a factor analysis model to be actually estimated from observed data. The basic idea is to find values (estimates) for the parameters in such a way that the means, variances and covariances implied by the model are as close as possible to the sample means, variances and covariances of the observed variables. There are different ways (methods of estimation) for how this idea is implemented, corresponding to different criteria for “as close as possible”. The method of estimation which we use in this module is that of Maximum Likelihood (ML) estimation.
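To make the idea of "as close as possible" more concrete, the sketch below codes one standard form of the ML discrepancy between a sample covariance matrix S and a model-implied covariance matrix Σ under normality, F = log|Σ| + tr(SΣ⁻¹) − log|S| − p. This is a simplified illustration only: it ignores the means and any missing data, the function name `ml_discrepancy` is our own, and the matrices are hypothetical. Statistical packages carry out the actual estimation in more general ways.

```r
# ML discrepancy between a sample covariance matrix S and an implied matrix Sigma
# (covariance structure only; an illustrative sketch, not a package routine)
ml_discrepancy <- function(S, Sigma) {
  p <- nrow(S)
  log_det_Sigma <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  log_det_S     <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
  log_det_Sigma + sum(diag(S %*% solve(Sigma))) - log_det_S - p
}

# Hypothetical sample covariance matrix for two items
S <- matrix(c(1.0, 0.5,
              0.5, 1.0), nrow = 2)

ml_discrepancy(S, S)         # zero: the implied matrix reproduces S exactly
ml_discrepancy(S, diag(2))   # positive: a model with no covariance fits worse
```

ML estimation chooses the parameter values for which this discrepancy is as small as possible.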

### Identification of the model: Introduction

Before a factor analysis model – or any other latent variable model – can be estimated, we need to make sure that it is identified. A statistical model is identified if any given distribution of the observed variables is produced by a unique set of values for the parameters of the model. If a model is not identified, exactly the same observed distribution is implied by different values of the model parameters, in which case it will be impossible to give a single interpretation to what the fitted model is telling us about the questions we are using it to answer.

For example, suppose that we are interested in only one variable y, whose distribution is assumed to be normal with mean 0 and some variance. Suppose also that the hypothesized statistical model implies that Var(y) = αβ. Knowing the sample variance of y is then not enough to identify unique estimated values for the two model parameters α and β separately. If a specific pair of values {α, β} implies a variance which matches the sample variance of y, then so do {α/2, 2β}, {3α, β/3} and an infinite number of other pairs of values for the parameters. However, if the statistical model implies that Var(y) = α, using just the one parameter α, then knowing the variance of y we can identify a unique value for α.
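The same point can be seen in a few lines of R. The values of α and β below are arbitrary and serve only to show that different pairs of parameter values imply exactly the same variance.

```r
# Variance implied by the (unidentified) model Var(y) = alpha * beta
implied_var <- function(alpha, beta) alpha * beta

alpha <- 2   # arbitrary illustrative values
beta  <- 3

variances <- c(implied_var(alpha, beta),
               implied_var(alpha / 2, 2 * beta),
               implied_var(3 * alpha, beta / 3))
variances   # all three pairs imply the same variance
```

Since the data inform us only about the product αβ, no amount of data can distinguish between these three sets of parameter values.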

This example with just one variable is rather simplistic and artificial. In the context of multivariate models such as factor analysis, however, unidentified models may arise more easily and be less easy to spot. For factor analysis models, two types of questions of identifiability need to be resolved. The first is the identification of the latent scales, and the second is the inherent identifiability of the model parameters even after the latent scale has been selected. These two questions are discussed separately in the next two sections.

### Identifying the Latent scales

Identification of the model: Scales of individual factors

Since a factor cannot be directly observed, it also does not have a unique and natural scale of units on which it is measured. Instead, we need to make some assumptions to define the scales of all the factors and give meaning to factor values. To better understand the issue we face here, we may first consider the same question for a directly observable variable such as temperature. A temperature exists and is a well-defined concept, but that does not mean that the scale we use to assign numbers to it is also uniquely defined. For example, if we say that the temperature is 20 what does this mean exactly? Is it cold or warm? The answer would be different for temperatures measured on Fahrenheit and Celsius scales which have different definitions for what the origin (zero) represents and how much of a change is a change of one unit. A specific temperature is assigned a different number depending on the temperature scale we use. For example, 20 degrees Celsius is 68 degrees Fahrenheit. However, both numbers convey the same message, that it is a rather warm day! Hence, when we report the temperature we do not only give a number but we also name the scale we use.

Measuring a factor works in a similar way to temperature. We need to make some assumptions to define a factor’s scale. These assumptions are arbitrary, in that infinitely many different sets of them will produce exactly the same fit for the observed data, and that we are free to choose any one set of assumptions to identify the scales.

First, for each individual factor we need to specify its origin, i.e. what zero represents, and its measurement unit, i.e. how much of a change is a change of one unit. There are two common ways to define the origin: we either fix the expected value κ of the factor at 0, or fix at 0 the intercept of the measurement model of one item which measures the factor (e.g. ν1 = 0 if y1 measures the factor). The former is the default in almost all statistical packages. It implies that the average level of a factor in a population is given the value 0, which thus becomes the reference point for all other values. In the case that one measurement intercept is set at 0, this means that the indicator and the factor have the same origin – i.e. that whenever the average of the corresponding indicator is 0, the mean of the factor will be 0 as well.

Analogously, there are two common ways to define the measurement unit of any one factor: fixing either the variance φ of the factor at 1, or fixing at 1 the loading of that factor in the measurement model of one item which measures the factor (e.g. λ11 = 1 above, if y1 measures factor η1). The former implies that the standard deviation of the factor in the population is defined to be 1 unit, and the latter that the factor has approximately the same unit of measurement as the indicator for which the loading is set to be 1.
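The arbitrariness of the measurement unit can be illustrated numerically. The sketch below uses a one-factor model with three items and hypothetical parameter values, and shows that fixing the factor variance at 1 and fixing the first loading at 1 imply exactly the same variances and covariances for the items. The helper function `implied_cov` is our own and is introduced only for this illustration.

```r
# Implied covariance matrix of the items in a one-factor model
implied_cov <- function(lambda, phi, theta) {
  phi * outer(lambda, lambda) + diag(theta)
}

# Hypothetical parameter values, with the factor variance fixed at 1
lambda <- c(0.9, 0.8, 0.7)
phi    <- 1
theta  <- c(0.4, 0.5, 0.6)
Sigma_var_fixed <- implied_cov(lambda, phi, theta)

# The same model rescaled so that the first loading equals 1
scale_by <- lambda[1]
Sigma_loading_fixed <- implied_cov(lambda / scale_by, phi * scale_by^2, theta)

all.equal(Sigma_var_fixed, Sigma_loading_fixed)   # TRUE: the fit is identical
```

Dividing all the loadings by a constant and multiplying the factor variance by the square of that constant leaves the implied distribution of the items unchanged, which is why one of these quantities must be fixed.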

Identification of the model: Direction (rotation) of the factor scales

As the final part of identifying the latent scales, the "direction" of the scale of each factor needs to be fixed. For any one factor, we may reverse its scale (i.e. whether large values of the factor are associated with large or small values of the items) by changing the signs of all the loadings associated with it. If there are two or more factors, there are an infinite number of ways of choosing the directions of the scales. These choices are known as rotations of the factors. Statistical packages for factor analysis provide various default rules for choosing such rotations. An alternative but equivalent way to choose a rotation is to select for each factor one item (an "anchor item") for which we specify that that item measures only that factor, and has factor loadings of 0 for all other factors. For example, in the two-factor model above, where in the initial specification each of the six items measures both factors, choosing λ12 = 0 and λ61 = 0 fixes the scales – here in such a way that η1 is the factor which is measured by all items except for y6, and η2 the factor which is measured by all but y1. A model with only this minimum number of zero loadings needed to fix a rotation is still an exploratory factor analysis model. A confirmatory factor analysis model which specifies more than the minimum number of zero loadings also automatically fixes the directions of the scales.
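The following sketch illustrates why a rotation has to be chosen. For simplicity it assumes two uncorrelated factors with variances fixed at 1, so that the part of the item covariances explained by the factors is ΛΛ'. The loadings are hypothetical. Both reversing the sign of one factor and rotating the pair of factors change the loadings but leave ΛΛ' unchanged.

```r
# Hypothetical loadings of six items on two uncorrelated, unit-variance factors
Lambda <- matrix(c(0.8, 0.7, 0.6, 0.3, 0.2, 0.1,
                   0.1, 0.2, 0.3, 0.6, 0.7, 0.8), nrow = 6)

# Reversing the direction of the first factor: change the sign of its loadings
Lambda_flip <- Lambda %*% diag(c(-1, 1))

# Rotating the two factors by an arbitrary angle
angle <- pi / 6
Rot <- matrix(c(cos(angle), sin(angle),
                -sin(angle), cos(angle)), nrow = 2)
Lambda_rot <- Lambda %*% Rot

# The covariances explained by the factors are the same in all three cases
all.equal(tcrossprod(Lambda), tcrossprod(Lambda_flip))
all.equal(tcrossprod(Lambda), tcrossprod(Lambda_rot))
```

Because the observed data cannot distinguish between these sets of loadings, one of them has to be selected by a rotation rule or by zero constraints such as the anchor items described above.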

### Inherent identifiability of the model parameters

Identification of the model: Numbers of factors vs. Numbers of items

The second and more important issue of identifiability for a factor analysis model is that once we have fixed the latent scales and their directions as discussed above, the model should then be fully identified. If it is not, the model must be changed.

The most basic condition for the model to be identified is that it should not have more parameters than there are distinct pieces of information in the joint distribution of the observed variables from which the parameters will be estimated. For factor analysis, these observed pieces of information are the sample means, variances and covariances of the p observed items. The means are matched by the p freely estimable values of the factor means κk and item intercepts νj, so these parameters will be identified if the rest of the model is. The main question of identification is then whether the factor variances φk and covariances φkl, the factor loadings λjk, the error variances θj, and the error covariances θjk are identified. The total number of freely estimable parameters of these kinds should not exceed the total number of distinct variances and covariances of the observed items, which is

```
p(p + 1) / 2
```
(where p is the number of the observed items). This condition is necessary, meaning that a model which does not satisfy it is not identified. It is, however, not sufficient, meaning that there are models which satisfy this condition but are nevertheless not identified.

For exploratory factor analysis (EFA) models, this kind of identifiability is determined simply by the number of factors (let’s denote it q) compared to the number of observed indicators for the factors (p). An EFA model is identified if its degrees of freedom

```
((p - q)² - (p + q)) / 2
```
are greater than or equal to 0. This means, for example, that no EFA model is identified if we have only p=1 or p=2 indicators, only a model with q=1 factor is identified if there are 3 or 4 indicators, and a 2-factor model requires at least 5 indicators.
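The degrees of freedom ((p − q)² − (p + q)) / 2 are easy to evaluate directly. The short sketch below (the function name `efa_df` is our own) confirms the statements above and finds the smallest number of indicators for which an EFA model with 1, 2 or 3 factors has non-negative degrees of freedom.

```r
# Degrees of freedom of an EFA model with p items and q factors
efa_df <- function(p, q) ((p - q)^2 - (p + q)) / 2

efa_df(3, 1)   # 0: a 1-factor model for 3 items is just identified
efa_df(4, 2)   # negative: a 2-factor model for 4 items is not identified
efa_df(5, 2)   # positive: a 2-factor model for 5 items is identified

# Smallest number of items with non-negative df, for q = 1, 2, 3 factors
min_items <- sapply(1:3, function(q) {
  p <- q + 1
  while (efa_df(p, q) < 0) p <- p + 1
  p
})
min_items
```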

For a confirmatory factor analysis (CFA) model, identifiability depends both on how many indicators are treated as measures of each individual factor, and on other details of the specification of the model. The following conditions, which are sufficient but not necessary, can be used to determine identification for a very commonly used class of CFA models:

• If a CFA model is such that (i) it has 2 or more factors, (ii) all the factors are correlated with each other, (iii) there are no correlations between the measurement errors, (iv) each observed indicator measures only one factor, and (v) each factor is measured by at least 2 indicators, then the model is identified.
• If the model has 1 factor, there are no correlations between the measurement errors, and the factor is measured by at least 3 indicators, then the model is identified. This is simply a 1-factor EFA model, so the identification condition is just the general EFA condition on the number of indicators.
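As an illustration of the basic counting condition, the sketch below compares the number of distinct variances and covariances for p = 6 items with the number of free parameters in three models for six items: a 1-factor model, the 2-factor CFA model in which each item measures one factor, and a 2-factor EFA model with one zero loading per factor. In all three we assume that the factor variances are fixed at 1 and that there are no error covariances. The counts concern only the loadings, error variances and factor covariance, as in the discussion above.

```r
# Number of distinct variances and covariances of p observed items
n_moments <- function(p) p * (p + 1) / 2
p <- 6

# Free parameters: loadings + error variances + factor covariance
n_par_1factor <- 6 + 6        # 1-factor model (no factor covariance)
n_par_cfa2    <- 6 + 6 + 1    # 2-factor CFA, each item measures one factor
n_par_efa2    <- 10 + 6 + 1   # 2-factor EFA, one zero loading per factor

df <- n_moments(p) - c(one_factor = n_par_1factor,
                       cfa2 = n_par_cfa2,
                       efa2 = n_par_efa2)
df   # all non-negative, so the counting condition is satisfied
```

The value for the EFA model agrees with the general EFA formula for p = 6 and q = 2, which gives ((6 − 2)² − (6 + 2)) / 2 = 4.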

### Assessment of model fit

A factor analysis model implies a model for the variances and covariances of the observed indicators. This model is usually non-saturated, meaning that it has fewer parameters than there are distinct observed variances and covariances. This means that the model is parsimonious, but also that it will not exactly reproduce the observed variances and covariances. An exact fit for these quantities is provided by the "saturated model", which estimates them all individually by the simple sample variances and covariances of the observed items.

The goodness of fit of a factor analysis model can be assessed by comparing the estimated variance-covariance matrix implied by the model with the sample variance-covariance matrix from the saturated model. If the two are relatively similar, the factor analysis model is judged to fit well; if relatively different, the model is judged to fit poorly. Several different methods of model assessment may be used to make this comparison. Here we briefly discuss a few of them. More information on them and on other methods of model assessment can be found in, for example, [Kap08].

The chi-squared test of overall goodness of fit is (when ML estimation is used to fit the model) the likelihood ratio test between the fitted and the saturated models. It tests the null hypothesis that the factor analysis model fits the data. If the p-value of the test is smaller than α, the null hypothesis is rejected at a 100α% significance level and we conclude that the model does not fit the data.

A problem with the chi-squared test statistic is that it has high power, especially as the sample size increases. That means that it rejects models very easily, even when we otherwise conclude (e.g. by examining the differences between the fitted and sample correlations) that the apparent lack of fit is not very large in magnitude. This behaviour of the test has motivated the development of other tools of model assessment which are meant to be less sensitive to small amounts of lack of fit.

The Root Mean Square Error of Approximation (RMSEA) also quantifies a comparison between the fitted and saturated variance-covariance matrices. It takes non-negative values, such that (as a rule of thumb) values between 0 and 0.05 are considered to indicate good fit, values between 0.05 and 0.1 moderate fit, and values larger than 0.1 a bad fit.

The Comparative Fit Index (CFI) compares the fitted model to a "null model" which specifies that all of the observed indicators are uncorrelated with each other. It takes values between 0 and 1, with large values indicating better fit of the model. Values below 0.9 may be taken to indicate poor fit, and values above 0.95 a good fit.
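The sketch below shows how these two indices can be computed from chi-squared statistics, using commonly used definitions: RMSEA = √(max(χ² − df, 0) / (df (N − 1))) and CFI = 1 − max(χ² − df, 0) / max(χ²null − dfnull, χ² − df, 0). Packages differ in some details (for example, some use N rather than N − 1), so this is an illustration rather than a reproduction of any particular software output. All of the numbers are hypothetical.

```r
# Hypothetical test statistics, chosen only to illustrate the arithmetic
chisq_model <- 48;   df_model <- 8    # fitted model vs. saturated model
chisq_null  <- 3015; df_null  <- 15   # null model vs. saturated model
n <- 2001                             # sample size

# RMSEA: excess of the chi-squared statistic over its df, per df and observation
rmsea <- sqrt(max(chisq_model - df_model, 0) / (df_model * (n - 1)))

# CFI: 1 minus the excess misfit of the model relative to that of the null model
excess_model <- max(chisq_model - df_model, 0)
excess_null  <- max(chisq_null - df_null, excess_model)
cfi <- if (excess_null > 0) 1 - excess_model / excess_null else 1

c(RMSEA = rmsea, CFI = cfi)
```

With these hypothetical values both indices would suggest a reasonably good fit by the rules of thumb given above, even though a chi-squared statistic of 48 on 8 degrees of freedom would lead the overall goodness-of-fit test to reject the model.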

The AIC and BIC "information criterion" statistics may also be used to compare any fitted models for the same observed items, for example models with 1 vs. 2 factors. For each of these statistics, models with smaller values of the statistic are preferred. A model with a small value of AIC or BIC is judged to have a good balance between goodness of fit and parsimony, i.e. that it achieves a reasonable goodness of fit with a reasonably small number of parameters.
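For reference, the usual definitions are AIC = −2 log L + 2k and BIC = −2 log L + k log(N), where log L is the maximised log-likelihood, k the number of estimated parameters and N the sample size. The sketch below applies them to two hypothetical models; the log-likelihood values and the sample size are invented for illustration.

```r
# Hypothetical log-likelihoods and numbers of parameters for two fitted models
loglik <- c(one_factor = -1050, two_factor_cfa = -1020)
n_par  <- c(one_factor = 18,    two_factor_cfa = 19)
n      <- 500   # hypothetical sample size

aic <- -2 * loglik + 2 * n_par
bic <- -2 * loglik + log(n) * n_par

rbind(AIC = aic, BIC = bic)
names(which.min(aic))   # model preferred by AIC
names(which.min(bic))   # model preferred by BIC
```

Here both criteria prefer the 2-factor model, because its improvement in the log-likelihood outweighs the penalty for its one additional parameter.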

Standard likelihood ratio (LR) tests can also be used to compare nested pairs of models for the same indicators, for example models with and without zero constraints on some parameters. In any such comparisons, the null hypothesis is that the smaller (more parsimonious) of the two models fits as well as the larger (less parsimonious) model. Not all pairs of models can be tested in this way; in particular, an LR test cannot be used to compare factor analysis models with different numbers of factors, such as a 1-factor model against a 2-factor model. With large samples, this test too is often sensitive to even small lack of fit.
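When two nested models are both fitted by ML to the same data, the LR test statistic between them is the difference of their chi-squared statistics against the saturated model, and its degrees of freedom are the difference of their degrees of freedom. The sketch below carries out this calculation for hypothetical values of the two statistics.

```r
# Hypothetical chi-squared statistics against the saturated model
chisq_small <- 48; df_small <- 8   # smaller (more constrained) model
chisq_large <- 6;  df_large <- 4   # larger (less constrained) model

lr_stat <- chisq_small - chisq_large
lr_df   <- df_small - df_large
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(LR = lr_stat, df = lr_df, p = p_value)
```

A small p-value, as here, is evidence against the null hypothesis that the smaller model fits as well as the larger one.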

### Factor scores

After a factor analysis model has been fitted, we can use the estimated model to calculate predicted values for the factors for any individuals, based on their observed values of the indicators of the factors. These predictions are known as "factor scores". They are weighted sums of the values of the observed items, with the weights determined by the parameters of the fitted model. Roughly, indicators which are more reliable measures of a factor (in essence, those with larger loadings) will receive higher weights in the calculation of a factor score for that factor.
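To show what such a weighted sum looks like, the sketch below computes factor scores for a one-factor model by the so-called regression method, in which the score is φλ'Σ⁻¹(y − μ), where Σ is the implied covariance matrix and μ the implied mean vector of the items. The parameter values and the responses of the individual are hypothetical, and the factor mean is taken to be 0 so that μ equals the intercepts. The error variances are set equal across the items so that the effect of the loadings on the weights is easy to see.

```r
# Hypothetical parameters of a one-factor model for three items
lambda <- c(0.9, 0.6, 0.3)   # loadings
theta  <- rep(0.5, 3)        # error variances (equal, for simplicity)
nu     <- c(2, 2, 2)         # intercepts; factor mean is 0, so these are the item means
phi    <- 1                  # factor variance

Sigma   <- phi * outer(lambda, lambda) + diag(theta)   # implied covariance matrix
weights <- as.vector(solve(Sigma, phi * lambda))       # regression-method weights
weights

# Factor score for one hypothetical individual
y     <- c(3, 2.5, 2)
score <- sum(weights * (y - nu))
score
```

With equal error variances the weights are proportional to the loadings, so the first item, which has the largest loading, contributes most to the score.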

A calculated factor score may then be used as an observed single measure of the corresponding latent construct in subsequent analyses, for example as an explanatory or response variable in regression models for associations between the construct and other variables. Whether it is desirable to do this depends on our view of the meaning and role of the latent variables in the model specification in specific applications. If we believe that the factor analysis model is a more or less real representation of how the observed indicators measure an unobservable but real latent factor, then substituting a factor score for the factor is undesirable. The reason for this is that the factor score and the true value of the factor for an individual will not be identical, which in turn will cause measurement error bias in many types of analyses if the factor score is used directly in the role of the factor. Instead, we will then prefer analyses which do not calculate factor scores but which estimate measurement models and models for the latent constructs together in one go. Such structural equation models with latent variables are discussed in Chapter 4 of this module.

However, we may also have a less strong view of the role of the latent factors. This is the case if we use the factor analysis model simply as a pragmatic device for deriving a rule for calculating a summary measure of a construct from multiple imperfect indicators of it. The main or sole purpose of the factor analysis is then to calculate that summary measure – i.e. the factor score – so that we can use it in subsequent analyses. This approach also has the practical advantage that it is often much simpler to conduct the analysis in these separate steps – deriving the factor scores first, and then using them as observed variables in other analyses – than it is to combine them in one analysis.

Factor scores are not used in the main examples of this module. For completeness, we include here an example of how such scores can be calculated and saved in Stata and in R. This example uses a one-factor model for the three indicators of the construct "obligation to obey the police", and then calculates a factor score for that construct. Here the model is fitted and the scores calculated only for respondents in the United Kingdom.

An example of Stata commands for calculating and saving a factor score:

```stata
// Example of creating and saving factor scores:
sem (Obey -> bplcdc doplcsy dpcstrb) if cntry=="GB", ///
    var(Obey@1) method(mlmv)
predict obeyScore if cntry=="GB", latent(Obey)
// Factor score for latent variable Obey will be called obeyScore.
```

An example of R commands for calculating and saving a factor score:

```r
# Example of creating and saving factor scores:
library(lavaan)
#
ModelSyntax <- 'Obey =~ bplcdc + doplcsy + dpcstrb'
FittedModel <- sem(model = ModelSyntax, data = ESS5Police[ESS5Police$cntry=="GB",],
                   std.lv = TRUE, meanstructure = TRUE, missing="ml")
ESS5Police$obeyScore <- NA
fscores <- lavPredict(FittedModel, type="lv", method="regression")
ESS5Police[ESS5Police$cntry=="GB",][inspect(FittedModel,"case.idx"),"obeyScore"] <- fscores[,"Obey"]
# Factor score for latent variable Obey will be called obeyScore.
# This is calculated only for cases with no missing data on the indicators.
```

### Example 1 on Factor analysis: Models in one country

Consider the data in our example for respondents in the UK, for the questions D12-D17. It is proposed that questions D12 – D14 measure only the factor "effectiveness of the police" and questions D15 – D17 only the factor "procedural fairness of the police". Fit a 2-factor CFA model with this measurement structure and a correlation between the factors, and also fit a 1-factor model and a 2-factor EFA model for these items. Do the models fit the data well? In particular, does the CFA model fit the data well enough to be adequate for subsequent use for these items? Interpret the scales of the two factors and their estimated correlation in the CFA model.

```stata
// Example of fitting factor analysis models, UK data only
// 1-Factor model:
sem (Factor -> plcpvcr plccbrg plcarcr plcrspc plcfrdc plcexdc) ///
    if cntry=="GB", var(Factor@1) method(mlmv)
estat gof, stats(all) // Goodness-of-fit statistics
// 2-factor CFA model:
sem (Effective -> plcpvcr plccbrg plcarcr) ///
    (ProcFair -> plcrspc plcfrdc plcexdc) if cntry=="GB", ///
    var(Effective@1 ProcFair@1) method(mlmv)
estat gof, stats(all)
estimates store cfa2
matrix b=e(b) // Estimates from this model, for use as starting values for EFA
// 2-factor EFA model:
sem (Effective -> plcpvcr plccbrg plcarcr (plcrspc,init(0)) (plcfrdc,init(0))) ///
    (ProcFair -> (plccbrg,init(0)) (plcarcr,init(0)) plcrspc plcfrdc plcexdc) ///
    if cntry=="GB", ///
    var(Effective@1 ProcFair@1) method(mlmv) from(b)
estat gof, stats(all)
lrtest cfa2 . // Likelihood ratio test between this EFA model and the CFA model
```

```r
# Example of fitting factor analysis models, UK data only
library(lavaan)
# 1-Factor model:
ModelSyntax <- '
  Factor =~ plcpvcr + plccbrg + plcarcr
            + plcrspc + plcfrdc + plcexdc
'
FittedModel <- sem(model = ModelSyntax,
                   data = ESS5Police[ESS5Police$cntry=="GB",],
                   std.lv = TRUE, meanstructure = TRUE, missing="ml")
summary(FittedModel, fit.measures=TRUE)
# 2-factor CFA model:
ModelSyntax <- '
  Effective =~ plcpvcr + plccbrg + plcarcr
  ProcFair =~ plcrspc + plcfrdc + plcexdc
'
FittedModel.cfa2 <- sem(model = ModelSyntax,
                        data = ESS5Police[ESS5Police$cntry=="GB",],
                        std.lv = TRUE, meanstructure = TRUE, missing="ml")
summary(FittedModel.cfa2, fit.measures=TRUE)
# 2-factor EFA model:
ModelSyntax <- '
  Effective =~ plcpvcr+plccbrg+plcarcr+plcrspc+plcfrdc+0*plcexdc
  ProcFair =~ 0*plcpvcr+plccbrg+plcarcr+plcrspc+plcfrdc+plcexdc
'
FittedModel.efa2 <- sem(model = ModelSyntax,
                        data = ESS5Police[ESS5Police$cntry=="GB",],
                        std.lv = TRUE, meanstructure = TRUE, missing="ml")
summary(FittedModel.efa2, fit.measures=TRUE)
# Likelihood ratio test between this EFA model and the CFA model:
anova(FittedModel.cfa2, FittedModel.efa2)
```

Some model assessment statistics are shown in Table 2.1. From them, we observe the following:

• The overall goodness of fit test indicates that all three models fit poorly (with p<0.001 in each case). This is not unusual and is not in itself conclusive, because with moderate to large sample sizes the test is sensitive to even small amounts of lack of fit.
• The 2-factor EFA model differs from the 2-factor CFA model in that the CFA model sets to 0 several factor loadings which are not zero in the EFA model. These are the "cross-loadings" between a factor and the items which are not expected to be indicators of that factor, such as the loading of Effectiveness in the measurement model for item D15 (plcrspc). The likelihood ratio test between these two models is a test of the null hypothesis that all of these cross-loadings are 0 in the population. Here this hypothesis is rejected (p<0.001), indicating that the EFA model fits better than the CFA model. However, this may again be partly due to the sensitivity of the test.
• The RMSEA and CFI fit indices suggest that the 1-factor model fits badly, the 2-factor EFA model very well, and the 2-factor CFA model with a moderate to good fit.
• Both the AIC and BIC statistics have their smallest values for the 2-factor EFA model, suggesting that this model achieves the best balance between goodness of fit and parsimony (number of parameters).

These results give some support for the conclusion that the 2-factor CFA model with a simple measurement model gives an adequate fit to the data on the six survey items in the sample of UK respondents, although also with evidence that this model still fits less well than the 2-factor EFA model.

Table 2.1: Model assessment statistics

| Model | LR test vs. saturated model: p-value | AIC | BIC | RMSEA | CFI |
|---|---|---|---|---|---|
| 1-factor model | <0.001 | 39490 | 39595 | 0.188 | 0.808 |
| 2-factor EFA model | <0.001 | 38729 | 38863 | 0.007 | 1.000 |
| 2-factor CFA model | <0.001 | 38781 | 38891 | 0.054 | 0.986 |

LR test: 2-factor EFA vs. 2-factor CFA model: p-value <0.001

A path diagram for this CFA model, with estimated values of the parameters also included, is shown in Figure 2.2 (this graph was obtained from Stata, where the default is to show also the measurement errors as circled latent variables). All of the factor loadings have positive values. Recalling the coding of the items, this implies the interpretation that high values of the factors correspond to positive evaluations of the effectiveness and procedural fairness of the police. The estimated correlation of the factors is +0.58, so in the UK individuals who have a positive view of the effectiveness of the police also tend to have a positive view of their procedural fairness.

Figure 2.2: Path diagram and parameter estimates for a 2-factor Confirmatory Factor Analysis model for indicators of Effectiveness and Procedural fairness of the police (UK respondents).

### Example 2 on Factor analysis: Fitting the same model separately for each country

Consider the same six items for "effectiveness of the police" and "procedural fairness of the police" as in Example 1, and a 2-factor confirmatory factor analysis model with the same simple measurement structure as used there. Fit the model separately for the samples from each of the countries in the ESS. Does the model fit well in all of the countries? How much does the estimated correlation between the factors vary across the countries?

In both Stata and R, we can implement the calculations by programming a loop over the countries, carrying out the estimation one country at a time, and collecting the estimates we need in a summary table or matrix.

```stata
// Fitting the same 2-factor CFA model for 6 items separately for
// each country, and collecting some of the results in a table.
levelsof cntry, clean local(countries)
local n_c: word count `countries'
matrix results = J(`n_c',5,.)
matrix rownames results = `countries'
matrix colnames results = Fcorr pGlob pvEFA RMSEA CFI
// Loop over the countries, and collect results:
local i = 0
foreach c of local countries {
    local ++i
    display "Country: " "`c'"
    // 2-factor CFA model
    sem (Effective -> plcpvcr plccbrg plcarcr) ///
        (ProcFair -> plcrspc plcfrdc plcexdc) if cntry=="`c'", ///
        var(Effective@1 ProcFair@1) method(mlmv) iter(100) ///
        nolog satopts(nolog) baseopts(nolog)
    matrix b=e(b)
    estimates store cfa_mod
    estat gof, stats(all)
    local converged=e(converged)
    if (`converged'==1) {
        matrix results[`i',1] = _b[cov(Effective,ProcFair):_cons]
        matrix results[`i',2] = r(p_ms)
        matrix results[`i',4] = r(rmsea)
        matrix results[`i',5] = r(cfi)
    }
    // 2-factor EFA model, for comparison
    quietly: sem (Effective -> plcpvcr plccbrg plcarcr ///
        (plcrspc,init(0)) (plcfrdc,init(0))) ///
        (ProcFair -> (plccbrg,init(0)) (plcarcr,init(0)) ///
        plcrspc plcfrdc plcexdc) if cntry=="`c'", ///
        var(Effective@1 ProcFair@1) method(mlmv) from(b) iter(100)
    local converged=e(converged)
    if (`converged'==1) {
        lrtest cfa_mod .
        matrix results[`i',3] = r(p)
    }
}
//
matlist results, format(%4.3f)
```

# Fitting the same 2-factor CFA model for 6 items separately for
# each country, and collecting some of the results in a table.
library(lavaan)
#
countries <- unique(ESS5Police$cntry)
results <- matrix(NA,length(countries),5)
rownames(results) <- countries
colnames(results) <- c("Fcorr","pGlob","pvEFA","RMSEA","CFI")
ModelSyntax.cfa2 <- '
  Effective =~ plcpvcr+plccbrg+plcarcr
  ProcFair =~ plcrspc+plcfrdc+plcexdc
'
ModelSyntax.efa2 <- '
  Effective =~ plcpvcr+plccbrg+plcarcr+plcrspc+plcfrdc+0*plcexdc
  ProcFair =~ 0*plcpvcr+plccbrg+plcarcr+plcrspc+plcfrdc+plcexdc
'
# Loop over the countries, and collect results:
i <- 0
for(cntry in countries){
  i <- i+1
  cat("Country: ", cntry, "\n")
  # 2-factor CFA model
  FittedModel.cfa2 <- sem(model = ModelSyntax.cfa2,
    data = ESS5Police[ESS5Police$cntry==cntry,],
    std.lv = TRUE, meanstructure = TRUE, missing="ml")
  print(summary(FittedModel.cfa2,fit.measures=TRUE))
  conv.tmp <- inspect(FittedModel.cfa2,"converged")
  if(conv.tmp){
    results[i,1] <- coef(FittedModel.cfa2)["Effective~~ProcFair"]
    s.tmp <- fitMeasures(FittedModel.cfa2)
    results[i,2] <- s.tmp["pvalue"]
    results[i,4] <- s.tmp["rmsea"]
    results[i,5] <- s.tmp["cfi"]
  }
  # 2-factor EFA model, for comparison
  FittedModel.efa2 <- sem(model = ModelSyntax.efa2,
    data = ESS5Police[ESS5Police$cntry==cntry,],
    std.lv = TRUE, meanstructure = TRUE, missing="ml")
  conv.tmp <- inspect(FittedModel.efa2,"converged")
  if(conv.tmp){
    # Likelihood ratio test of the CFA model against the EFA model
    print(lr.tmp <- anova(FittedModel.efa2,FittedModel.cfa2))
    results[i,3] <- lr.tmp$"Pr(>Chisq)"[2]
  }
}
#
print(round(results,3))

Results from this analysis are shown in Table 2.2. We observe that the fit of the model is moderate to good in all of the countries, with RMSEA values between 0.035 and 0.077, and CFI between 0.967 and 0.995. The correlation between the factors (which are all scaled in the same direction, as we can confirm by checking the signs of the factor loadings in the attached output) is positive in all countries, with values between 0.396 for the Netherlands and 0.734 for Greece. The estimated standard errors of these correlations are around 0.03, so some of the between-country differences are clearly statistically significant. Here these results on the correlations are treated mainly as a descriptive summary of the variation across the countries. For any more detailed interpretation of them it would be preferable to draw also on substantive hypotheses or explanations about where and why these correlations might vary between countries.
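To make the statement about statistical significance concrete, here is a rough worked calculation, which is an illustration and not part of the original analysis. It assumes that each estimated correlation has a standard error of exactly 0.03 (the text only says "around 0.03"), that the country samples are independent, and that the estimates are approximately normally distributed. The symbol \hat{\rho} for an estimated factor correlation is notation introduced only for this illustration. Under these assumptions, the difference between the largest and smallest correlations in Table 2.2 (Greece and the Netherlands) can be compared with its approximate standard error:

```latex
\begin{aligned}
\hat{\rho}_{\mathrm{GR}} - \hat{\rho}_{\mathrm{NL}} &= 0.734 - 0.396 = 0.338, \\
\mathrm{s.e.}\bigl(\hat{\rho}_{\mathrm{GR}} - \hat{\rho}_{\mathrm{NL}}\bigr) &\approx \sqrt{0.03^2 + 0.03^2} \approx 0.042, \\
z &\approx \frac{0.338}{0.042} \approx 8.0 .
\end{aligned}
```

A z value of this size is far beyond conventional critical values such as 1.96. By contrast, a small difference such as that between Germany (0.488) and Norway (0.487) gives z ≈ 0.001/0.042 ≈ 0.02 under the same assumptions, which is nowhere near significant.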

Table 2.2: Results for a 2-factor confirmatory factor analysis model for indicators of Effectiveness and Procedural fairness of the police, fitted separately to samples in each country in the ESS.

| Country (country code) | Factor corr. | p-value: Overall | p-value: vs. EFA | RMSEA | CFI |
|---|---|---|---|---|---|
| Belgium (BE) | 0.450 | <0.001 | <0.001 | 0.059 | 0.977 |
| Bulgaria (BG) | 0.728 | <0.001 | <0.001 | 0.049 | 0.993 |
| Switzerland (CH) | 0.483 | <0.001 | <0.001 | 0.046 | 0.986 |
| Cyprus (CY) | 0.657 | <0.001 | <0.001 | 0.058 | 0.990 |
| Czech Republic (CZ) | 0.646 | <0.001 | <0.001 | 0.069 | 0.980 |
| Germany (DE) | 0.488 | <0.001 | <0.001 | 0.052 | 0.978 |
| Denmark (DK) | 0.553 | <0.001 | <0.001 | 0.051 | 0.980 |
| Estonia (EE) | 0.613 | <0.001 | <0.001 | 0.055 | 0.980 |
| Spain (ES) | 0.662 | <0.001 | <0.001 | 0.039 | 0.994 |
| Finland (FI) | 0.480 | <0.001 | <0.001 | 0.035 | 0.991 |
| France (FR) | 0.606 | <0.001 | <0.001 | 0.069 | 0.976 |
| United Kingdom (GB) | 0.582 | <0.001 | <0.001 | 0.054 | 0.986 |
| Greece (GR) | 0.734 | <0.001 | <0.001 | 0.050 | 0.995 |
| Croatia (HR) | 0.680 | <0.001 | <0.001 | 0.065 | 0.985 |
| Hungary (HU) | 0.612 | <0.001 | <0.001 | 0.039 | 0.992 |
| Ireland (IE) | 0.624 | <0.001 | <0.001 | 0.054 | 0.988 |
| Israel (IL) | 0.637 | <0.001 | <0.001 | 0.077 | 0.976 |
| Lithuania (LT) | 0.640 | <0.001 | <0.001 | 0.041 | 0.993 |
| Netherlands (NL) | 0.396 | <0.001 | <0.001 | 0.058 | 0.976 |
| Norway (NO) | 0.487 | <0.001 | <0.001 | 0.071 | 0.967 |
| Poland (PL) | 0.570 | <0.001 | <0.001 | 0.056 | 0.987 |
| Portugal (PT) | 0.455 | <0.001 | <0.001 | 0.064 | 0.982 |
| Russia (RU) | 0.690 | <0.001 | <0.001 | 0.069 | 0.984 |
| Sweden (SE) | 0.504 | <0.001 | <0.001 | 0.066 | 0.969 |
| Slovenia (SI) | 0.572 | <0.001 | <0.001 | 0.058 | 0.985 |
| Slovakia (SK) | 0.568 | <0.001 | <0.001 | 0.071 | 0.981 |
| Ukraine (UA) | 0.637 | <0.001 | <0.001 | 0.058 | 0.989 |
