Example, part two: regression with dummies

To estimate the association between the country variable and the education length variable, we must enter both dummy variables simultaneously in a single regression analysis. The regression function looks like this:

yi = a + b1∙x1i + b2∙x2i + ei,

where yi represents the education length values, x1i the ‘Great Britain’ dummy variable values, and x2i the ‘Poland’ dummy variable values. We use the linear regression dialogue box and enter the variables as shown in Figure 14.
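The dummy coding behind x1i and x2i can be sketched in a few lines of Python. This is only an illustration of the coding scheme, not of the SPSS procedure itself; the function name and the country labels are assumptions made for the example.

```python
# Sketch of dummy coding with Norway as the reference category.
# The labels and the function name are illustrative assumptions.
REFERENCE = "Norway"
DUMMIES = ("Great Britain", "Poland")  # order matches x1, x2


def dummy_code(country):
    """Return (x1, x2) for one respondent's country."""
    if country != REFERENCE and country not in DUMMIES:
        raise ValueError(f"unexpected country: {country!r}")
    # Each dummy is 1 if the respondent lives in that country, else 0.
    return tuple(int(country == name) for name in DUMMIES)


for country in (REFERENCE, *DUMMIES):
    x1, x2 = dummy_code(country)
    print(f"{country}: x1={x1}, x2={x2}")
```

The reference category gets no dummy of its own: a respondent from Norway is identified by having the value 0 on both variables.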

Figure 14. Running regression analysis with a nominal independent variable coded as a set of dummy variables

This regression analysis produces the results presented in Table 8 and Table 9.

Table 8. SPSS output: Dummy variable regression goodness of fit statistics

Table 8 tells us that the differences between the mean education lengths of the three country samples ‘explain’ 5.3% of the total variance in education length.
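Why do differences between group means translate into explained variance? A regression on a full set of dummies predicts each person's own group mean, so R squared equals the between-group sum of squares divided by the total sum of squares. The sketch below shows this on a tiny synthetic data set; the numbers are invented for illustration and are not the survey data behind Table 8.

```python
# Synthetic education lengths (years); NOT the data behind Table 8.
groups = {
    "Norway": [14.0, 16.0],
    "Great Britain": [10.0, 12.0],
    "Poland": [9.0, 11.0],
}

values = [y for ys in groups.values() for y in ys]
grand_mean = sum(values) / len(values)

# Total variation of the dependent variable around the grand mean.
total_ss = sum((y - grand_mean) ** 2 for y in values)

# Variation of the group means around the grand mean, weighted by group size.
between_ss = sum(
    len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in groups.values()
)

r_squared = between_ss / total_ss
print(f"R squared = {r_squared:.3f}")
```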

Note also that we can use the F-statistic to test whether the mean education lengths of the three countries’ populations differ at all. In our example, the result of this test appears as a Sig. value in the ANOVA table that we obtain when we run the regression analysis. The same result is displayed in the Model summary table, provided that we click the ‘Statistics’ button in the linear regression dialogue box and tick the ‘R squared change’ option before we run the analysis.

This F-test can be interpreted as a test of whether the set of dummy variables, conceived as a block of variables, explains any variance at all at the population level. This is a useful option because the variable set may have statistically significant aggregate effects even if none of the individual dummies have such effects when tested separately by means of the t-statistic (i.e. by means of the Sig. values obtained in the coefficients table).
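For readers who want to see how the overall F-statistic relates to R squared, the standard formula is F = (R² / k) / ((1 − R²) / (n − k − 1)), where k is the number of dummies in the block and n is the number of cases. The sketch below implements that formula. The sample size used in the call is a hypothetical placeholder, since the actual n is not stated in the text, so the printed value should not be read as the F-value from the SPSS output.

```python
def f_statistic(r_squared, k, n):
    """Overall F for a regression with k predictors and n cases."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in [0, 1)")
    residual_df = n - k - 1
    if residual_df <= 0:
        raise ValueError("need more cases than estimated parameters")
    # Explained variance per predictor over unexplained variance per residual df.
    return (r_squared / k) / ((1 - r_squared) / residual_df)


# R squared = 0.053 comes from Table 8; k = 2 dummies.
# n_assumed is a hypothetical sample size, NOT taken from the source.
n_assumed = 5000
print(f_statistic(0.053, k=2, n=n_assumed))
```

The Sig. value reported by SPSS is the probability of obtaining an F-value at least this large from an F-distribution with k and n − k − 1 degrees of freedom if the dummies explain no variance in the population.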

Table 9. SPSS output: Dummy variable regression coefficients

The Constant coefficient shown in Table 9 can be interpreted as an estimate of the mean education length (in years) of people belonging to the reference category (i.e. Norwegians born before 1975). Why is that? Take a look at the regression function yi = a + b1∙x1i + b2∙x2i + ei. It results in the following predicted y-values: ŷi = a + b1∙x1i + b2∙x2i. We know that Norwegians’ values on both x-variables are 0. Hence, the only term on the right-hand side of the latter function that can differ from 0 is a, i.e. the constant term, which, according to Table 9, is estimated at 13.214. The implication is that, if the person whose x-values we put into the function is a Norwegian, that person’s predicted length of education is 13.214 years, and, since this analysis does not distinguish between different types of Norwegians, this must be an estimate of the mean education length of all Norwegians born before 1975.

The coefficient of the ‘Great Britain’ variable should be interpreted as an estimate of the difference in mean education lengths between those aged 30+ who live in Great Britain and those (30+) who live in Norway. If we insert the x-values of a person who lives in Great Britain, the function will read: yi = a + b1∙1 + b2∙0 + ei. Thus, the predicted mean value of the dependent variable is a + b1, which has been estimated at 13.214 – 1.206 = 12.008; hence the estimated mean education length of those who live in Great Britain is 12.008 years (i.e. 1.206 years shorter than the Norwegian mean value). Similarly, the estimated mean education length of those who live in Poland is a + b2 = 13.214 – 1.988 = 11.226 years, i.e. 1.988 years shorter than the Norwegian mean value.
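The arithmetic above can be checked by inserting each country's dummy values into the prediction function. The sketch below uses the coefficient estimates quoted in the text (a = 13.214, b1 = –1.206, b2 = –1.988); the function and variable names are illustrative.

```python
# Coefficient estimates as quoted from Table 9 in the text.
a = 13.214    # constant: estimated Norwegian mean
b1 = -1.206   # 'Great Britain' dummy
b2 = -1.988   # 'Poland' dummy


def predicted_mean(x1, x2):
    """Predicted education length for given dummy values."""
    return a + b1 * x1 + b2 * x2


predicted = {
    "Norway": predicted_mean(0, 0),
    "Great Britain": predicted_mean(1, 0),
    "Poland": predicted_mean(0, 1),
}

for country, years in predicted.items():
    print(f"{country}: {years:.3f} years")
```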

The Sig. values shown in Table 9 also reveal that, as long as we choose a significance level above 0.1%, both the British and the Polish mean education lengths are significantly shorter than the Norwegian mean (i.e. significantly shorter in the statistical sense of the word).
