# Chapter 3: The basic two-level models

This chapter is mainly intended for reading, with some review questions. We will proceed in three steps. First, we introduce two versions of variance component or random intercept models: the null model without explanatory variables and a model with explanatory variables. In these models, the regression constants (intercepts) can vary among the level 2 units (contexts). Secondly, we add further complexity in the random coefficient model, where the regression coefficients can show variation between contexts. In the final step, we will show models with explanatory variables also at the contextual level. We will conclude the theoretical part with an extended example, before looking at how the data file needs to be organized and estimating the models in SPSS and Stata.

# The variance component model

### The null model

Let us start with the simplest possible regression model without explanatory variables. The regression coefficient estimates the grand mean of the dependent variable, and the residuals are the individual deviations from the mean.

Y_{i} = β_{0} + e_{i}

The variance of the residuals is also the sample variance of Y. The situation is illustrated in the figure.

Figure 3.1. Variance of the residuals = the sample variance of Y

Let us rewrite the regression equation to also include a contextual level (schools, firms, countries).

Y_{ij} = β_{0} + u_{0j} + e_{ij}

Y_{ij} could be the wage of individual i in firm j. The regression constant has no j-subscript and is assumed to be fixed and equal to the grand mean. The new term in the model is the level 2 residual, u_{0j}, which represents the deviations of the firm means from the overall mean. This is illustrated in the figure below, where the residuals in the former model are split into two components representing the variation at the two levels, or the between-firm (u) and within-firm (e) variation. The variance in Y can be expressed as
Var(Y_{ij}) = Var(u_{0j} + e_{ij}) = Var(u_{0j}) + Var(e_{ij}) + 2*Cov(u_{0j},e_{ij})

Since we assume that the covariance between the level 1 and the level 2 residuals is zero, the variance in Y simplifies to the sum of the variances of the residuals:
Var(Y_{ij}) = Var(u_{0j}) + Var(e_{ij}) = σ_{u}^{2} + σ_{e}^{2}
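This variance decomposition can be checked with a small simulation. The sketch below generates data from the null model with invented values of β_{0}, σ_{u} and σ_{e}, and verifies that the sample variance of Y is close to the sum of the two residual variances:

```python
# Simulate the null model Y_ij = b0 + u_j + e_ij and check that
# Var(Y) is approximately the sum of the two residual variances.
# All parameter values are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
J, n_j = 200, 50               # 200 level 2 units, 50 individuals in each
sigma_u, sigma_e = 2.0, 4.0    # assumed standard deviations

u = rng.normal(0, sigma_u, size=J)           # level 2 residuals
e = rng.normal(0, sigma_e, size=(J, n_j))    # level 1 residuals
y = 10.0 + u[:, None] + e                    # b0 = 10 (invented)

print(round(float(y.var()), 1))   # close to sigma_u**2 + sigma_e**2 = 20
```

The covariance term drops out because u and e are drawn independently, mirroring the zero-covariance assumption in the model.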

Figure 3.2. Variance in Y = sum of the variances of the residuals

The natural next step is to define a coefficient showing the proportion of the variance in Y that stems from the variation between the level 2 units. This is known as the Intraclass Correlation Coefficient (ICC) or as the Variance Partition Coefficient (VPC), symbolized by the Greek letter rho:

ρ = σ_{u}^{2} / (σ_{u}^{2} + σ_{e}^{2})

The ICC is frequently used as a baseline for judging how much of the variance at the two levels can be explained by more complex models. It is also used to evaluate whether or not the level 2 variation is ignorable. It is hard to set a general threshold, but the hierarchical data structure should not be ignored if the ICC is 0.05 or more. If the ICC is smaller, one-level models can be considered, perhaps with robust estimation of standard errors.
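As a minimal numerical sketch, the ICC is just a ratio of the two variance components; the estimates below are invented:

```python
# Intraclass correlation from the two estimated variance components.
def icc(var_u: float, var_e: float) -> float:
    """Proportion of the total variance lying between level 2 units."""
    return var_u / (var_u + var_e)

# invented estimates from a null model
rho = icc(var_u=0.8, var_e=7.2)
print(rho)   # 0.1, i.e. 10% between-group variation: not ignorable
```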

### Random versus fixed effects

The variance component model implies *random effects*, in that the variation in the intercepts is captured by the variance in the level 2 residuals. The model is repeated below with the assumption of normally distributed errors. Residuals can be correlated within levels but not across levels.

Y_{ij} = β_{0} + u_{0j} + e_{ij}

u_{0j} ∼ N(0,σ_{u0}^{2}) e_{ij} ∼ N(0,σ_{e}^{2})

This model is similar to a One-Way ANOVA, where the level 2 units are seen as a factor (categorical X).

Y_{ij} = β_{0} + β_{j} + e_{ij}

e_{ij} ∼ N(0,σ_{e}^{2})

In the first model, the level 2 residuals form a random normally distributed variable with a variance of σ_{u}^{2} because we see the level 2 units (schools, firms) as a random sample from a population. In the ANOVA model, the level 2 units are seen as fixed, and can even be strategically chosen. The level 2 effects are captured by a set of (J-1) dummy variables (β_{j}).

In general, fixed-effects models represent an alternative to multilevel models. They are especially useful for (pooled) time series data, where they can be used to control for time-constant variables. In cross-sectional data, the fixed-effects model can be chosen when the number of level 2 units is small (<10), and when the units are strategically chosen rather than randomly sampled from a population. In cross-sectional data, however, the fixed-effects model cannot be combined with level 2 covariates. Multilevel analysis is the best choice in situations with many (15+) level 2 units, where the level 2 units constitute a random sample from a population and we want to include level 2 explanatory variables.

### Models with level 1 explanatory variables

The OLS regression model with one explanatory variable can be our point of departure:

Y_{i} = β_{0} + β_{1}*X_{i} + e_{i}

The two-level version can be expressed in one equation for each level or in a single equation. Both representations are helpful for understanding multilevel models. The first version, with two equations, shows the relationship to the slopes-as-outcome approach: in the second equation, the regression constants (intercepts) constitute the dependent variable.

Y_{ij} = β_{0j} + β_{1}*X_{ij} + e_{ij}

β_{0j} = β_{0} + u_{0j}

u_{0j} ∼ N(0,σ_{u}^{2}) e_{ij} ∼ N(0,σ_{e}^{2})

The one-equation version is the result of substituting the right-hand side of the second equation for β_{0j} in the first equation.

Y_{ij} = β_{0} + β_{1}*X_{ij} + u_{0j} + e_{ij}

As you may have seen in OLS regression, models with dummy variables can be graphed as parallel lines, such as a wage equation for men and women. Here, the level 2 residuals can be used in the same way to estimate the parallel regression lines of each level 2 unit. The line for each level 2 unit is: Y_{j} = (β_{0} + u_{0j}) + β_{1}*X_{ij}. Summing the two terms in the parenthesis gives the level 2 unit specific intercept, and the regression coefficient is common to all level 2 units. The individuals (students, employees, respondents) are spread around the level 2 unit line with a residual of e_{ij}. This is illustrated in the figure below, with the mean regression line and lines for two level 2 units.

Let us stop to consider the terminology:

β_{0} and β_{1} are fixed regression coefficients.

u_{0j} and e_{ij} are random effects or multilevel residuals.

σ_{u}^{2} and σ_{e}^{2} are random parameters to be estimated along with the (fixed) regression coefficients.

Figure 3.3. Mean regression line and lines for two level 2 units

### Adding explanatory variables at both levels

Variance component (random intercept) models can be extended with additional level 1 and level 2 explanatory variables without much extra complexity. Let us add X_{2} at level 1, the individual level, and Z at level 2, the contextual level.

Y_{ij} = β_{0j} + β_{1}*X_{1ij} + β_{2}*X_{2ij} + e_{ij}

β_{0j} = β_{0} + β_{3}*Z_{j} + u_{0j}

We continue to use betas to symbolize all regression coefficients, also those related to contextual covariates, although some texts use Greek gammas for the latter. For the one-equation version, we need to substitute the regression constant in the first equation with the right-hand side of the second equation. Note that the random part of the model is unchanged since the intercept is still the only parameter that is free to vary among level 2 units.

Y_{ij} = (β_{0} + β_{3}*Z_{j} + u_{0j}) + β_{1}*X_{1ij} + β_{2}*X_{2ij} + e_{ij}

Rearranging and removing the parenthesis:

Y_{ij} = β_{0} + β_{1}*X_{1ij} + β_{2}*X_{2ij} + β_{3}*Z_{j}+ u_{0j} + e_{ij}

### Pseudo R squares

Although we use the same symbols for the residuals as in the null model, we expect them to be smaller due to the two added explanatory variables. The differences between the residual variances of the null model and the current model can therefore be used to compute pseudo R squares. We add 'pseudo' to distinguish them from the R squares in OLS regression models. We can compute three pseudo R squares, one for each level and one for the two levels combined. Let us start by defining the R square for the individual level variation:

R_{1}^{2} = (σ_{e|null}^{2} - σ_{e|model}^{2}) / σ_{e|null}^{2}

The computation of the R square for the level 2 variation is quite similar:

R_{2}^{2} = (σ_{u|null}^{2} - σ_{u|model}^{2}) / σ_{u|null}^{2}

The total R square can be computed as follows:

R_{total}^{2} = ((σ_{u|null}^{2} + σ_{e|null}^{2}) - (σ_{u|model}^{2} + σ_{e|model}^{2})) / (σ_{u|null}^{2} + σ_{e|null}^{2})

Unlike the R square from OLS regression, these R squares can also be negative. This can be the case when the estimates are close to zero, especially for the level 2 variation.
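A minimal sketch of these pseudo R squares, computed as proportional reductions in the residual variances; all variance estimates are invented:

```python
# Pseudo R squares: proportional reduction in a residual variance when
# moving from the null model to a model with explanatory variables.
def pseudo_r2(var_null: float, var_model: float) -> float:
    return (var_null - var_model) / var_null

# invented variance estimates (level 1 = e, level 2 = u)
e_null, u_null = 7.2, 0.8     # null model
e_m, u_m = 6.0, 0.5           # model with X1 and X2

r2_level1 = pseudo_r2(e_null, e_m)                  # within-group variation
r2_level2 = pseudo_r2(u_null, u_m)                  # between-group variation
r2_total = pseudo_r2(e_null + u_null, e_m + u_m)    # both levels combined
print(round(r2_level1, 3), round(r2_level2, 3), round(r2_total, 3))
```

A negative value arises whenever the fitted model's variance estimate exceeds the null model's, which can happen when the true component is close to zero.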

# The random coefficient model

This model is also known as the random slope model. Again, it can be represented by one level 1 equation and several level 2 equations, depending upon the number of random coefficients. It is not advisable to define all level 1 explanatory variables as random at once, i.e. free to vary across the level 2 units. The most important reason is that the number of random parameters to be estimated can increase dramatically, as we will see later. A second reason is that the model may not converge.

The variance component model can be changed into a random coefficient model by adding a j-subscript to the regression coefficient in the level 1 equation and adding a second level 2 equation, with the regression coefficient (slope) as the dependent variable.

Y_{ij} = β_{0j} + β_{1j}*X_{ij} + e_{ij}

β_{0j} = β_{0} + u_{0j}

β_{1j} = β_{1} + u_{1j}

Y_{ij} = β_{0} + β_{1}*X_{ij} + (u_{0j} + u_{1j}*X_{ij} + e_{ij})

Note that the single residual term from the OLS regression model is now replaced by the three terms in the parenthesis. The new term includes the explanatory variable (X), which has the effect of creating heteroscedasticity since the level 2 residual variance will be a function of X.
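The implied variance of Y given X can be written out and sketched numerically: the level 2 contribution is σ_{u0}^{2} + 2*σ_{u01}*X + σ_{u1}^{2}*X^{2}, so the total variance changes with X. The parameter values below are invented:

```python
# Total variance of Y given X implied by the random coefficient model:
# var(u0) + 2*cov(u0,u1)*x + var(u1)*x**2 + var(e).
# Parameter values are invented for illustration.
def total_variance(x, var_u0=1.0, var_u1=0.25, cov_u01=0.1, var_e=4.0):
    return var_u0 + 2 * cov_u01 * x + var_u1 * x**2 + var_e

print(total_variance(0.0))   # 5.0 at x = 0
print(total_variance(4.0))   # larger away from x = 0: heteroscedasticity
```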

The two level 2 residuals are assumed to be normally distributed with expectations of zero and with a covariance matrix:

(u_{0j}, u_{1j}) ∼ N(0, Ω_{u}), where Ω_{u} = [σ_{u0}^{2}, σ_{u01}; σ_{u01}, σ_{u1}^{2}]

and e_{ij} ∼ N(0,σ_{e}^{2})

The new term is the covariance between the intercept and the slope residuals, and it is an additional parameter in the model. If several covariates are added to the model and set as free to vary, this will rapidly increase the number of random parameters. This also depends on assumptions about the covariances, a topic we will return to later. The random coefficient model will result in regression lines that may cross each other, as illustrated below:

Figure 3.4. Illustration of crossing regression lines using the random coefficient model

### Adding explanatory variables at both levels

Let us add X_{2} to the individual level model and Z to the level 2 model.

Y_{ij} = β_{0j} + β_{1j}*X_{1ij} + β_{2}*X_{2ij} + e_{ij}

β_{0j} = β_{0} + β_{3}*Z_{j} + u_{0j}

β_{1j} = β_{1} + β_{4}*Z_{j} + u_{1j}

The one-equation version results from substituting the two random coefficients in the level 1 equation by the right-hand sides of equations 2 and 3:

Y_{ij} = (β_{0} + β_{3}*Z_{j} + u_{0j}) + (β_{1} + β_{4}*Z_{j} + u_{1j})*X_{1ij} + β_{2}*X_{2ij} + e_{ij}

Rearranging results in this version, with a parenthesis added to show the random part of the model:

Y_{ij} = β_{0} + β_{1}*X_{1ij} + β_{2}*X_{2ij} + β_{3}*Z_{j} + β_{4}*X_{1ij}*Z_{j} + (u_{0j} + u_{1j}*X_{1ij} + e_{ij})

In this model, the level 2 variable Z can have two effects: it may influence the intercept as well as the regression coefficient of X_{1}. This is best seen in the three-equation version of the model. The one-equation version, however, clearly shows that the latter effect is a cross-level interaction.

The random part of the model is unchanged, however. Let us see the result of defining the regression coefficient of X_{2} as random, i.e. free to vary among level 2 units. Now, the three-equation version will have an additional equation:

Y_{ij} = β_{0j} + β_{1j}*X_{1ij} + β_{2j}*X_{2ij} + e_{ij}

β_{0j} = β_{0} + β_{3}*Z_{j} + u_{0j}

β_{1j} = β_{1} + β_{4}*Z_{j} + u_{1j}

β_{2j} = β_{2} + β_{5}*Z_{j} + u_{2j}

We substitute as above:

Y_{ij} = (β_{0} + β_{3}*Z_{j} + u_{0j}) + (β_{1} + β_{4}*Z_{j} + u_{1j})*X_{1ij} + (β_{2} + β_{5}*Z_{j} + u_{2j})*X_{2ij} + e_{ij}

Rearranging and putting a parenthesis around the random part yields:

Y_{ij} = β_{0} + β_{1}*X_{1ij} + β_{2}*X_{2ij} + β_{3}*Z_{j} + β_{4}*X_{1ij}*Z_{j} + β_{5}*X_{2ij}*Z_{j} + (u_{0j} + u_{1j}*X_{1ij} + u_{2j}*X_{2ij} + e_{ij})

We now have two cross-level interactions and a more complex random model. The full set of random parameters at level two forms a covariance matrix:

Ω_{u} = [σ_{u0}^{2}, σ_{u01}, σ_{u02}; σ_{u01}, σ_{u1}^{2}, σ_{u12}; σ_{u02}, σ_{u12}, σ_{u2}^{2}]

Figure 3.5. The full set of random coefficients at level two

The three variances are in the main diagonal, and the off-diagonal elements are covariances among the level 2 residuals. By adding one random coefficient, the number of random parameters has doubled from three to six. A model with five random coefficients results in 15 random parameters. This may be too much for the software to handle and is not advisable, especially in situations with a limited number of level 2 units. The general advice is to add complexities, such as additional random parameters, one by one and to keep those that appear to improve upon simpler models in the final model.
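The count of random parameters follows directly from the size of the covariance matrix; with k random coefficients (including the intercept) there are k variances and k*(k-1)/2 covariances, i.e. k*(k+1)/2 parameters in total:

```python
# Number of level 2 random parameters (variances + covariances) when
# k regression coefficients, including the intercept, are set random.
def n_random_params(k: int) -> int:
    return k * (k + 1) // 2

print(n_random_params(2))   # intercept + 1 random slope -> 3
print(n_random_params(3))   # adding a second random slope -> 6
print(n_random_params(5))   # 5 random coefficients -> 15
```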

SPSS and Stata handle this problem by default by estimating only the parameters in the main diagonal, which are normally the most interesting ones, assuming the level 2 residuals to be uncorrelated. Mlwin, by default, will estimate all of them. In the latter program, it is possible to simplify the model by constraining one or more of the random parameters to zero, and in SPSS and Stata it is possible to relax the default restriction in various ways. To estimate all random parameters, the covariance matrix has to be defined as unstructured.

# Statistical tests in multilevel analysis

The tests of interest are significance tests and confidence intervals for the regression coefficients and variance components, and likelihood ratio tests for the comparison of nested models. Some programs also compute fit indices that can be used to compare non-nested models.

### Testing regression coefficients

In OLS regression, statistical significance tests for the regression coefficients are based on the following null hypothesis:

H_{0} : β = 0 and H_{1}: β ≠ 0

The test statistic is the t-ratio, formed by dividing the estimated regression coefficient by its standard error:

t = b / SE(b)

The sampling distribution of the t-statistic is the Student's t-distribution with n - k degrees of freedom, where n is the sample size and k the number of parameters. Like the normal distribution, the t-distribution is symmetrical, but it is flatter for low degrees of freedom. As the sample size increases, the t-distribution approaches the normal distribution.

In multilevel analysis, the standard normal sampling distribution is assumed under the null hypothesis. This follows from the assumption of normally distributed residuals. Accordingly, the Stata procedures for estimating multilevel models, xtreg and xtmixed, report Z-statistics instead of t-statistics. The null hypothesis is identical to that of the t-test, and the Z-ratio is formed in the same way as the t-ratio:

Z = b / SE(b)

In SPSS, the Z-statistic is replaced by the Wald statistic, which is Z squared and follows the chi-square distribution. Users of Mlwin will have to compute the test statistic manually or do it from the menus. In HLM, a t-test with J - k - 1 degrees of freedom is used, where J is the number of groups (level 2 units) and k is the number of explanatory variables. This test is more conservative than the Z-test and the Wald test in small samples.

### Testing variance components

Testing variance components is less straightforward, although the special purpose software programs as well as SPSS and Stata produce estimates and standard errors. The Z-test is not satisfactory because the sampling distributions of the variance components are skewed and not normal. Especially in a situation with few groups and small variance components, the Z (and the Wald) statistic is clearly non-normally distributed. Still, variance estimates of twice the size of their standard errors or more may be taken as a first indication of statistical significance. However, there is agreement in the literature that the best way of testing the statistical significance of variance components is to use the likelihood ratio test. This test can also be used to test nested models. It can be seen as a parallel to the F-test in OLS regression analysis.

### Comparing nested models

Nested models can be viewed as simplifications of a more general model achieved by removing one or more of its random or fixed parameters. From the likelihood functions, all software programs produce a statistic that shows how well the model fits the data. This statistic, sometimes called the ‘deviance’, is defined as -2*ln(Likelihood), or just -2LL. We do not know the sampling distribution of this statistic, but the difference between the -2LLs (or the ratio of the likelihoods) for two models can be assumed to be chi-square distributed, with the degrees of freedom determined by the difference in the number of parameters in the two models. In general, we assume that the lower the deviance, the better the model. In the definition of the test statistic, model K is the most general one, and model K-H the smaller one, nested within the former with H fewer parameters:

χ^{2}_{H} = (-2LL_{K-H}) - (-2LL_{K})

Stata reports the log likelihood rather than -2LL, and for Stata output the test statistic can be defined as follows:

χ^{2}_{H} = -2*(LL_{K-H} - LL_{K}) or χ^{2}_{H} = 2*(LL_{K} - LL_{K-H})

Note that, in the second version, the log likelihood for model K-H is subtracted from the log likelihood of model K. Rabe-Hesketh and Skrondal (2012: 88-89) argue that this test statistic is conservative when testing variance components, since they have a lower boundary of zero. This means that the test could be seen as one-sided, and the probability value reported could be divided by two. Note also that the -2LL statistic depends upon the sample size. Valid tests therefore have to be based on identical sample sizes for the two models being compared.
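The test is simple to carry out by hand from two reported log likelihoods. The sketch below uses invented log likelihood values and tests one extra parameter (H = 1); for df = 1 the chi-square upper tail can be computed with the standard-library error function:

```python
# Likelihood ratio test from the log likelihoods that Stata reports.
# The log likelihood values are invented for illustration.
from math import erfc, sqrt

ll_small = -4521.7   # model K-H (fewer parameters), invented
ll_large = -4518.2   # model K, invented
h = 1                # difference in the number of parameters

chi2 = 2 * (ll_large - ll_small)   # same as (-2LL_small) - (-2LL_large)
p = erfc(sqrt(chi2 / 2))           # chi-square upper tail, valid for df = 1
print(round(chi2, 1), round(p, 4))
# when testing a single variance component the p-value may be halved
```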

# Data structure in multilevel analysis

When analysing cross-sectional data, the data files will normally have the desired format: a hierarchically sorted data file. With two levels, such as employees in firms or respondents in countries, we need to sort the file first by the firm or country and then by the individuals. With three levels, such as students in classes in schools, we sort first by schools, next by classes, then by students (although the final step is not strictly necessary). This means that the files need variables that identify the levels.

In longitudinal data, the data can be available in a *wide* or a *long* format. The wide format is a rectangular file with one line for each individual (subject) with the occasions (time points for the measurements) represented by variables, such as income1, income2 etc. However, multilevel models need the long format, where the occasions are nested within the subjects. In Stata, the *Reshape* command can change a file from a wide to a long format and vice versa. The SPSS equivalent is the Restructure command, which is accessed from the Data menu. Hox (2010: 79-83) gives a more thorough description of the differences between the two data formats.
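The wide-to-long operation can be sketched in pandas, analogous to what Stata's Reshape and SPSS's Restructure do; the data and column names are invented:

```python
# Reshape a wide file (one row per subject, income1..income3 as columns)
# into the long format multilevel models need. Data are invented.
import pandas as pd

wide = pd.DataFrame({
    "id": [1, 2],
    "income1": [300, 410],
    "income2": [320, 430],
    "income3": [350, 450],
})

long = pd.wide_to_long(wide, stubnames="income", i="id", j="occasion")
long = long.reset_index().sort_values(["id", "occasion"])
print(long)   # one row per subject-occasion: occasions nested in subjects
```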

In SPSS, Stata and Mlwin, as well as R, the contextual variables are added to the individual (level 1 record), all with identical values within a level 2 unit. If we want to include the Gross Domestic Product (GDP) of a country in a multilevel analysis, an identical GDP value must be added to all individuals within the same country.

How do we obtain and add explanatory variables at level 2? This of course depends on the level 2 units. To simplify, think of the European Social Survey, where the countries constitute the level 2 units. We can aggregate the data from within the data file to the country level. For example, in all rounds of the ESS, there is a question on how well households manage on their present income:

*Which of the descriptions on this card comes closest to how you feel about your household’s income nowadays? Living comfortably on present income (1), coping on present income (2), finding it difficult on present income (3), finding it very difficult on present income (4).*

We can define (3) and (4) as indicating difficulties managing on present income and recode these values to 1 and the rest to zero. Aggregating the new variable to the country level would result in a contextual variable measuring the proportion in each country with income problems. How this is done varies between the software packages. In SPSS, *Aggregate* will do the job, with the choice of adding the aggregated variable to the current file or saving the aggregated variables in a contextual level file. In Stata, *collapse* aggregates variables in a similar way in a new file.
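The recode-and-aggregate step can be sketched in pandas, mirroring SPSS's *Aggregate* with the option of adding the aggregated variable back to the current file; the data are invented:

```python
# Aggregate an individual level indicator to the country level and add
# it back to every level 1 record. Data are invented; values 3 and 4 of
# the income-feeling item indicate difficulties on present income.
import pandas as pd

ess = pd.DataFrame({
    "country": ["NO", "NO", "NO", "PL", "PL", "PL"],
    "hincfel": [1, 2, 3, 2, 3, 4],   # feeling about household income
})
ess["difficult"] = (ess["hincfel"] >= 3).astype(int)

# proportion with income problems, repeated on every individual record
ess["prop_difficult"] = ess.groupby("country")["difficult"].transform("mean")
print(ess)
```

Using `groupby(...).agg(...)` instead of `transform` would produce a separate contextual-level file, like Stata's *collapse*.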

Country level statistics are also available from the ESS website as well as from other sources such as Eurostat. From the ESS site, it is possible to add country level and regional level variables to the main ESS file (more about this in chapter 5).

If the aggregate variables are in a separate file, it will have to be added to the ESS main file. In SPSS, *Merge files - Add variables* in the Data menu will
merge files with a common level-2 identifier, and, in Stata, the *merge* command will do this job. If the number of level 2 units is low, as in the ESS, it is
also possible to add a contextual variable by recoding the country identification variable, although this is clearly inefficient.

# Sample sizes in multilevel analysis

This section on considerations relating to sample sizes in multilevel analysis builds on Hox (2010, Chapter 12). It translates into two questions: how large a sample is necessary for multilevel analysis, and what sample size is required to obtain a certain level of statistical power. We will focus on the former.

ML methods of estimation are asymptotic, which means that the sample size should be large. It is not clear, however, how the requirement for a large sample should be interpreted, nor whether it applies equally to all levels. The accuracy of estimates with varying sample sizes can be studied by simulation methods, and Hox summarizes current knowledge about this topic.

The regression coefficients are largely unbiased irrespective of which of the estimation methods mentioned is used. The standard errors of the regression coefficients are severely biased by using OLS estimation, and the ML estimates are slightly downwardly biased if the number of groups (level 2 units) is less than 50 and the normality assumption is violated.

For the variance components, the situation varies. The variance of the level 1 residual errors is normally very accurately estimated. The same goes for the level 2 variance in analyses with 100 or more groups (level 2 units). Simulations indicate that the level 2 variances will be satisfactorily estimated with 30 groups and underestimated if the number of groups is as low as 10. This means that multilevel analysis with countries as level 2 units will most likely yield downwardly biased estimates of country level variation.

Since multilevel analysis involves two or more levels, questions concerning optimal sample sizes are difficult to answer, and the best advice will also depend on the purpose. Hox mentions Kreft’s 30/30 rule, which means 30 groups with at least 30 individuals in each. This could be sufficient for the estimation of the regression coefficients but inadequate for other purposes. If it is cross-level interactions that are of interest, Hox recommends the 50/20 rule: 50 groups with 20 or more in each group. If there is strong interest in the random part, the advice is 100 groups with a minimum of ten in each.

# Estimation of residuals in multilevel models

Let us consider this simple variance component model:

Y_{ij} = β_{0} + β_{1}*X_{ij} + u_{0j} + e_{ij}

The estimates of the residuals in multilevel models are less straightforward than in OLS regression models, where we can estimate the residuals by subtracting the predicted values of the dependent variable from the observed values. In multilevel models with residuals at two or more levels, the estimation procedure is more complex (Rasbash, Steele, Browne & Prosser 2005: 36-41, Goldstein 1995, Appendix 2.2, Hox 2010: 24-32).

In multilevel models, the level 2 residuals are of special interest because they can be used to estimate the regression lines for each level 2 unit. The first step is to calculate the raw residual for each individual (level 1 unit):

r_{ij} = Y_{ij} - (β_{0} + β_{1}*X_{ij})

The raw residual for the jth level 2 unit (firm, country) is the mean of the raw residuals for the individuals belonging to level 2 unit j: r_{+j}. The level 2 residual for this unit is found by multiplying r_{+j} by a *shrinkage* factor (Rasbash, Steele, Browne & Prosser 2005: 36):

u_{0j} = r_{+j} * σ_{u0}^{2} / (σ_{u0}^{2} + σ_{e}^{2}/n_{j})

The multiplier is always less than or equal to 1, so that the estimated residual is almost always smaller than the raw residual. The multiplier is therefore known as a *shrinkage* factor. The shrinkage factor will be smaller (more shrinkage), the larger the level 1 residual (σ^{2}_{e}) compared with the intercept variance (σ^{2}_{u0}) and the smaller the sample for each unit (n_{j}).
This means that, for level 2 units where the information is scarce, such as having two to five observations, the estimate of the regression line for that unit is shrunk towards the mean regression. For level 2 units with larger sample sizes, the estimates of the regression coefficients are more reliable and the level 2 residual will be less shrunk towards the overall line.

Having estimated the level 2 residuals, the level 1 residuals can be calculated by subtracting the level 2 residual from the raw residual:

e_{ij} = r_{ij} - u_{0j}
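The behaviour of the shrinkage factor can be sketched numerically, writing it as σ_{u0}^{2} / (σ_{u0}^{2} + σ_{e}^{2}/n_{j}); the variance estimates below are invented:

```python
# Shrinkage factor applied to the raw level 2 residual: closer to 1 the
# more observations a unit has. Variance estimates are invented.
def shrinkage_factor(var_u0: float, var_e: float, n_j: int) -> float:
    return var_u0 / (var_u0 + var_e / n_j)

var_u0, var_e = 1.0, 9.0   # invented intercept and level 1 variances
for n in (2, 10, 100):
    k = shrinkage_factor(var_u0, var_e, n)
    print(n, round(k, 3))   # larger n_j -> less shrinkage toward the mean
```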

# Centring of variables

It is well known that the multiple (OLS) regression model is invariant under linear transformations of the variables. If we transform the variables, the estimates change in a predictable way, and we can recalculate to find the coefficients based on the untransformed variables (Hox 2010: 61-70). If we divide age by 10, the regression coefficient and its standard error become 10 times larger. This invariance only holds fully for models without polynomials, such as age and age squared, and without interaction terms (Aiken & West 1991). In those situations, the invariance is limited to the most complex term, the highest polynomial and the interaction effect, and does not extend to the main effects.

Why would we want to transform the variables? The two most common reasons are to obtain common metrics and to ensure that the regression constant is interpretable. In psychology, standardization of the variables is a common solution to both problems. Centring is sufficient for the second purpose. Centring means measuring one or more variables as deviations from the mean (X^{c}_{i} = X_{i} - x̄). In regression models, the constant (intercept) is the predicted value of the dependent variable when all explanatory variables are set to zero. If a regression model contains age, the regression constant will be the predicted value for newborn persons (age = 0). To avoid this, we can measure age in deviations from the overall mean, or from another value such as the minimum age in the data.

In linear multilevel models, this invariance only holds for variance component models. In situations with random coefficients (slopes), Hox (2010: 60-61) illustrates that linear transformations of the explanatory variable in question will affect the residuals of the slopes. However, as his empirical example shows (p. 62-63), the -2 log likelihood statistic remains the same for models with the original scoring, with standardized values, and with centred variables. In other words, the models are equivalent. The variance components change, but so do their standard errors, so that the ratio of the variances to their standard errors remains roughly the same.

In multilevel models, where variation in the intercepts is of interest, it is especially desirable that the intercepts refer to variable values represented in the data. If we grand mean centre all variables as explained above, the regression constant will be the predicted mean for persons with average values on the explanatory variables. The variances of the intercept and the slope can then be interpreted as the expected variances for the average person. Centring variables whose zero value lies outside the observed range is also advantageous when interpreting cross-level interactions.

An alternative to grand mean centring is to centre on the group mean. Assume that we have data from rounds 1 or 2 of the European Social Survey, where the countries are the level 2 units or groups. Assume further that household income is one of the explanatory variables. As income was measured in intervals, we can replace the interval number by the category mean. Whether we use the raw numbers, their natural logs or grand mean centred values, we conceal two sources of income differences: differences within each country and differences between the countries. In such situations, group mean centring makes more sense than grand mean centring. For a more thorough discussion of centring, see Hox (2010: 68-69).

The special purpose software for multilevel modelling, HLM and Mlwin, has options for automatic centring. In SPSS and Stata, grand mean centring has to be done manually by creating centred versions of variables, using *compute* in SPSS and *generate* in Stata. Group mean centring can be performed in one step in SPSS using the *Aggregate* command, while in Stata the operation requires two steps: first, *collapse* creates a separate file with the aggregated variables, and then *merge* adds the aggregated file to the level 1 (individual) file.
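Both kinds of centring can be sketched in pandas, mirroring the manual steps described for SPSS and Stata; the data are invented:

```python
# Grand mean and group (country) mean centring of an explanatory
# variable. Data are invented.
import pandas as pd

df = pd.DataFrame({
    "country": ["NO", "NO", "PL", "PL"],
    "income": [500, 700, 200, 400],
})
df["income_grand_c"] = df["income"] - df["income"].mean()
df["income_group_c"] = (
    df["income"] - df.groupby("country")["income"].transform("mean")
)
print(df)
# grand mean centred values sum to zero overall;
# group mean centred values sum to zero within each country
```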

# Analytical strategy

In OLS regression, there are two fundamentally different analytical strategies: top-down or bottom-up. A top-down strategy means starting with the full model and eliminating seemingly irrelevant explanatory variables. This is feasible given a reasonable number of explanatory variables. A bottom-up strategy means starting with a simple model and adding complexities. In multilevel analysis, the unanimous recommendation is to follow the bottom-up strategy. There are two reasons for this. Firstly, adding complexities in steps can help complex models to converge, whereas an initially complex model may fail to converge. Secondly, the number of parameters can easily become unmanageable in multilevel models. The number of fixed (regression) coefficients will be the same as in OLS regression, but the number of random parameters could easily become very high. If k is the number of regression coefficients (including the intercept) that are defined as random, the number of variances and covariances to be estimated is (k+1)*k/2.

The recommended analytical strategy is explained in more detail in Hox (2010: 56-59). Here, we briefly review the recommended steps.

**
Step 1 Estimate the null model
**

The first step is to estimate the null model and compute the intraclass correlation.

Y_{ij} = β_{0} + u_{0j} + e_{ij}

The purpose of this is twofold: to evaluate whether a multilevel analysis is necessary, and, if so, to estimate the baseline values of the variance components (random parameters). The intraclass correlation indicates how large a proportion of the variance in Y stems from variation between the level 2 units (groups, contexts). There is no agreed threshold, but five per cent between-group variation would certainly justify a multilevel approach. Stata also automatically performs a useful likelihood ratio test comparing the multilevel model with the OLS equivalent. The estimated random parameters can be used to calculate explained variation at both levels (pseudo R squares).

**Step 2 Develop the level 1 model**

The second step is to develop the full level 1 model, normally the individual level model. In the equation, K is the number of explanatory variables at the individual level.

Y_{ij} = β_{0} + β_{1}X_{1ij} + β_{2}X_{2ij} + ... + β_{K}X_{Kij} + u_{0j} + e_{ij}

The reason for developing a full level 1 model before proceeding is to avoid Hauser’s contextual fallacy and to find out how much of the level 2 variation is explained by *compositional* effects. This could be the case if the distributions of important explanatory variables differ between contexts, such as firms that differ in terms of the skill and educational levels of their employees in a study of wage determination.
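To see what a purely compositional effect looks like, the hypothetical sketch below simulates data where groups differ only in the distribution of an individual level variable x; the between-group variance from the null model largely disappears once x enters the model (names and data are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
j = np.repeat(np.arange(25), 40)
# group means of x differ, so groups differ in composition,
# but there is no genuine level 2 effect on y beyond x itself
x = rng.normal(loc=rng.normal(0.0, 1.0, 25)[j], scale=1.0)
y = 5.0 + 0.6 * x + rng.normal(0.0, 1.0, j.size)
df = pd.DataFrame({"y": y, "x": x, "g": j})

null = smf.mixedlm("y ~ 1", df, groups=df["g"]).fit()
lvl1 = smf.mixedlm("y ~ x", df, groups=df["g"]).fit()
var_u_null = float(null.cov_re.iloc[0, 0])  # inflated by composition
var_u_lvl1 = float(lvl1.cov_re.iloc[0, 0])  # close to zero after adding x
print(var_u_null, var_u_lvl1)
```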

**Step 3 Develop the random model**

In this step, we want to develop the random part of the model (this step comes later in Hox’s recommendation). Guided primarily by theory, we should test whether important regression coefficients in the individual level equation vary significantly among the level 2 units (contexts, groups). This should be done in steps, testing one regression coefficient, i.e. the effect of one level 1 variable, at a time. A first rough indication is whether the estimated variance of the level 2 residuals exceeds twice its standard error. Then perform a likelihood ratio test comparing models with and without the random effect, defining the covariance matrix as unstructured, i.e. including the covariance(s) of the level 2 residuals. Next, consider whether the covariance structure can be simplified by eliminating the covariance(s).

χ^{2}_{H} = (-2LL_{K-H}) - (-2LL_{K})

Here -2LL_{K} is the deviance of the model with K random parameters and -2LL_{K-H} that of the restricted model with H fewer parameters; under the null hypothesis, the statistic is chi-square distributed with H degrees of freedom.

It is also possible to model heteroskedasticity in the level 1 residual variance. This is most easily done in special purpose software such as MLwiN, which contains a step-by-step guide to modelling the level 1 variance.
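A hedged sketch of the likelihood ratio test for one random slope, again on simulated data (two degrees of freedom because the unrestricted model adds a slope variance and an intercept-slope covariance; names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
g = np.repeat(np.arange(25), 40)
x = rng.normal(size=g.size)
u0 = rng.normal(0.0, 1.0, 25)[g]        # random intercepts
u1 = rng.normal(0.0, 0.5, 25)[g]        # random slopes on x
y = 2.0 + (1.0 + u1) * x + u0 + rng.normal(0.0, 1.0, g.size)
df = pd.DataFrame({"y": y, "x": x, "g": g})

# fitted by ML; REML deviances are only comparable when the fixed part
# of the two models is identical
m0 = smf.mixedlm("y ~ x", df, groups=df["g"]).fit(reml=False)
m1 = smf.mixedlm("y ~ x", df, groups=df["g"], re_formula="~x").fit(reml=False)
chi2 = 2.0 * (m1.llf - m0.llf)          # = (-2LL_restricted) - (-2LL_full)
p = stats.chi2.sf(chi2, df=2)           # slope variance + covariance
print(chi2, p)
```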

**Step 4 Add level 2 explanatory variables**

Level 2 explanatory variables may be unique characteristics (global variables) of the level 2 units (such as the location and age of a firm, or the region a country belongs to), or they can be aggregated variables, taken from official statistics or aggregated from the data set itself. They can have two kinds of effects: on the intercepts and/or on the slopes. The ‘main’ effect of any level 2 variable can be seen as an effect on the intercept (the regression constant), as we saw in the separate equations for each level. It is recommended to add these main effects first.

Level 2 regressors can also affect one or more regression coefficients through cross-level interactions. In some software programs, this means we have to compute an interaction term as the product of the two variables involved; in others, the interaction can be specified directly. It is recommended to include cross-level interactions only for individual level variables whose effects were found to vary in the previous step, although you will find exceptions to this in applications, especially when the varying effect is backed by strong theory. With few level 2 units, the statistical power may also be too low to detect varying effects reliably through the variance components.

In situations with many level 1 units and few level 2 units, as when analysing data from the European Social Survey, we need to be careful not to introduce too many explanatory variables at the second level. If our data set includes 20 countries, this is the n to consider when deciding how many level 2 regressors to include. In such situations, two to three is better than five to ten. If there are more candidate level 2 explanatory variables, try them out one by one. This is especially important if cross-level interactions are also included.

The full model including a cross-level interaction now has the following structure:

Y_{ij} = β_{0} + β_{1}X_{1ij} + β_{2}X_{2ij} + β_{3}Z_{j} + β_{4}X_{1ij}Z_{j} + (u_{0j} + u_{1j}X_{1ij} + e_{ij})

In research applications, variance component models with and without level 2 regressors are found more frequently than models with cross-level interactions.
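A model with this structure can be sketched in Python with statsmodels, where the cross-level interaction X_{1}Z is written directly in the formula as `x1:z` (all names and the simulated data are illustrative, not from the text):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
j = np.repeat(np.arange(20), 50)         # 20 level 2 units, 50 cases each
z = rng.normal(size=20)[j]               # level 2 variable Z_j
x1 = rng.normal(size=j.size)
x2 = rng.normal(size=j.size)
u0 = rng.normal(0.0, 0.8, 20)[j]         # random intercepts
u1 = rng.normal(0.0, 0.3, 20)[j]         # random slopes on x1
y = (1.0 + 0.5 * x1 + 0.3 * x2 + 0.4 * z + 0.2 * x1 * z
     + u0 + u1 * x1 + rng.normal(0.0, 1.0, j.size))
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "z": z, "g": j})

# random intercept and random slope for x1; x1:z is the cross-level interaction
m = smf.mixedlm("y ~ x1 + x2 + z + x1:z", df, groups=df["g"],
                re_formula="~x1").fit()
print(m.params["x1:z"])
```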

# An example: educational attainment in local labour markets in Norway

Let us end the theoretical part of the module by reviewing an example of a multilevel analysis that contains all the points we have covered so far. This will hopefully help to clarify the concepts and the interpretation of multilevel models.

The theoretical point of departure is modernization theory. The main hypothesis is that the degree of modernization of the local labour market will affect the status attainment process, including educational attainment. Two testable derived hypotheses follow from this:

In labour markets dominated by the (traditional) primary sector, the level of individual attainment will be lower than in more modern labour markets.

The stronger the traditional sector, the stronger the effect of social background on individual attainment.

The data stem from a 10 per cent sample of the Norwegian 1960-1980 censuses. Data for the analysis of status attainment were available for a subsample of men and women aged 30-36 in 1980. The total sample size at level 1, the individual level, was about 40,000. The 19 Norwegian counties (shown below) constitute level 2 as proxies for the local labour markets.

The dependent variable is child’s education (ED) measured in years. The model below includes three explanatory variables: mother’s education, father’s education and occupational status. The black arrows indicate the effects of level 1 and the blue ones the effects of one level 2 variable, the percentage of people in the county who are employed in the primary sector. This is the indicator of modern versus traditional labour markets.

Figure 3.6. Counties in Norway

Figure 3.7. Model for explaining a child's education

Let us look at the formalization of the model, first in separate equations for the two levels, then in a single equation.

Level 1 (the micro model):

ED_{ij} = β_{0j} + β_{1}Med_{ij} + β_{2j}Fed_{ij} + β_{3}Focc_{ij} + e_{ij}

Level 2 (the macro model):

β_{0j} = β_{0} + β_{4}Primary_{j} + u_{0j}

β_{2j} = β_{2} + β_{5}Primary_{j} + u_{2j}

One equation version:

ED_{ij} = (β_{0} + β_{4}Primary_{j} + u_{0j}) + β_{1}Med_{ij} + (β_{2} + β_{5}Primary_{j} + u_{2j})Fed_{ij} + β_{3}Focc_{ij} + e_{ij}

Simplified, with the random part in parentheses:

ED_{ij} = β_{0} + β_{1}Med_{ij} + β_{2}Fed_{ij} + β_{3}Focc_{ij} + β_{4}Primary_{j} + β_{5}Fed_{ij}Primary_{j} + (u_{0j} + u_{2j}Fed_{ij} + e_{ij})

Before we present the estimated parameters, let us try to achieve a deeper understanding of the random coefficient for father’s education. In the figure below, separate OLS estimates of the coefficient are presented for each of the 19 counties.

Figure 3.8. The OLS estimates for the coefficient in all counties

The regression coefficients vary from around 0.17 to 0.37, and they are positively related to the primary sector percentage, although the pattern seems to deviate slightly from linearity.

Let us present the estimates and interpret the coefficients:

ÊD_{ij} = 11.277 + 0.498Med_{ij} + 0.372Fed_{ij} + 0.026Focc_{ij} - 0.056Primary_{j} + 0.016Primary_{j}Fed_{ij}

All individual level variables are grand-mean centred. This implies that the regression constant, 11.277, can be interpreted as the predicted mean years of education in a county with no primary sector employment, with the individual level variables at their mean values. An increase of 10 percentage points in primary employment is expected to decrease the predicted mean years of education by about half a year (10*-0.056=-0.56). Mother’s and father’s education are measured in levels and have a roughly equal overall effect (not shown), while the effect of father’s occupation is small. The effect of father’s education in the above equation is not the overall or main effect of the variable, but the effect when the primary percentage is set to zero. Those who do not fully understand this can go back to the review of interaction effects in the OLS regression part at the beginning of the module. Thus, the effect of father’s education (Fed) is 0.372 when the primary percentage is set to zero. The coefficient of the interaction term, 0.016, shows that the effect of father’s education increases by that amount for each percentage point increase in primary sector employment. Thus, an increase of 10 percentage points in the primary sector will increase the effect of father’s education by 0.16, to about 0.53. Both the direct effect of primary sector employment on the intercept and the interaction effect lend support to the two hypotheses.
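The conditional-effect arithmetic in the last paragraph can be written out directly, using the coefficients from the estimated equation above:

```python
# effect of father's education as a function of the primary sector percentage
b_fed = 0.372     # effect of Fed when the primary percentage is zero
b_inter = 0.016   # change in the Fed effect per percentage point of primary

def fed_effect(primary_pct):
    return b_fed + b_inter * primary_pct

print(fed_effect(0))    # 0.372
print(fed_effect(10))   # 0.372 + 0.16 = 0.532, about 0.53
```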