Why include more than one independent variable?

Obviously, dependent variables are always associated with more than one other variable. Some of these other variables will normally be associated with each other, which means that they have some of their association with the dependent variable in common. Unless we include them all in the regression analysis, we cannot distinguish this common part of the association from the association that is ‘unique’ to each. For instance, we may be interested in the association between gender and occupational status. But besides being associated with gender, occupational status is also associated with educational level, professional experience etc., which, in turn, may be associated with gender. To find out how much of the association between gender and occupational status occurs independently of gender differences in education and experience, we need to include educational level and professional experience as independent variables in our analysis. (This makes it possible to compare men and women with equal education and experience.) The inclusion of other variables is particularly important if we aim to study causal effects of independent variables. Not including relevant independent variables increases the risk of overlooking causal associations or of confounding them with non-causal ones. It also keeps the unexplained variance of the dependent variable high and the reliability of our coefficient estimates low.
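The point about confounding can be illustrated with a minimal numpy sketch. The data below are entirely made up for illustration (they are not ESS data): education is constructed to differ by gender, and status depends on education but not directly on gender. A regression of status on gender alone then picks up the education difference, while adding education as a second independent variable makes the gender coefficient shrink towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data: gender (0 = man, 1 = woman) is associated with
# education, and status depends on education, not directly on gender.
gender = rng.integers(0, 2, n).astype(float)
education = 12 + 2 * gender + rng.normal(0, 2, n)
status = 30 + 3 * education + rng.normal(0, 5, n)

def ols(y, *xs):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Gender alone: its coefficient absorbs the education difference (about 6).
b_simple = ols(status, gender)

# Gender net of education: the gender coefficient is now close to zero.
b_multiple = ols(status, gender, education)

print(f"gender alone: {b_simple[1]:.2f}, gender given education: {b_multiple[1]:.2f}")
```

Comparing the two gender coefficients shows why leaving out a relevant independent variable can make a non-causal association look causal.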

We introduce multiple regression with an example based on Polish data. The task is to estimate how year of birth and the respondents’ parents’ education levels are associated with the respondents’ length of education. The parents’ education variables are ordinal. We could, however, follow the not uncommon practice of treating such variables as metric. The alternative strategy, which allows us to keep persons whose parents’ education is unknown in the active dataset, is to treat the variables as nominal and recode them into sets of dummy variables. There is a chance that this will improve the model’s fit with the data, and it gives us a larger and perhaps more representative dataset. But these nice features come at the cost of having to estimate more coefficients, which may decrease the accuracy of the coefficient estimates we wanted to compute in the first place. However, this accuracy also depends on the fit of our model and on the number of persons in our active dataset, and, since the ESS single-country datasets are quite large, we would normally be willing to estimate a few dummy variable sets with a limited number of variables in each, rather than treating ordinal variables as metric. In particular, we would do this if it substantially increased the fit of our model.

Unfortunately, there is no statistical test that can tell us whether a model based on a dummy variable set is better than a model based on the original ordinal variable. You can use the dummy variable solution if you are able to keep the number of dummy variables small, but you can also choose the ordinal variable model if its adjusted R² is higher than the dummy-based model’s adjusted R², or if the dummy-based model’s advantages are too small to make up for its extra coefficients. (In this case, the adjusted R², which is reported alongside the R² and which takes account of the number of coefficients included in the model, is better suited for comparing models’ fits than the proper R² is.)
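The comparison between the two strategies can be sketched in numpy with made-up data (again, not the ESS data): a five-category parent’s-education variable is given a deliberately non-linear association with years of schooling, and the same outcome is then regressed once on the ordinal codes treated as metric and once on a set of four dummies. The adjusted R² used below is the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1), which penalises the extra dummy coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical ordinal variable: parent's education in categories 1..5,
# with a non-linear (made-up) association with years of schooling.
parent_ed = rng.integers(1, 6, n)
step = np.array([0.0, 0.5, 1.0, 3.0, 3.5])        # category effects, not linear in the codes
years = 10 + step[parent_ed - 1] + rng.normal(0, 2, n)

def fit_r2(y, X):
    """OLS with intercept; returns (R2, adjusted R2)."""
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1 - resid.var() / y.var()
    k = X.shape[1] - 1                            # predictors, excluding intercept
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

# Metric treatment: one coefficient for the ordinal codes 1..5.
r2_m, adj_m = fit_r2(years, parent_ed.astype(float))

# Dummy treatment: four dummies, category 1 as the reference group.
dummies = np.column_stack([(parent_ed == c).astype(float) for c in range(2, 6)])
r2_d, adj_d = fit_r2(years, dummies)

print(f"metric:  R2={r2_m:.3f}  adjusted R2={adj_m:.3f}")
print(f"dummies: R2={r2_d:.3f}  adjusted R2={adj_d:.3f}")
```

Because the dummy model can reproduce any pattern of category means, its plain R² is never lower than the metric model’s; the adjusted R² tells us whether that gain is worth the three extra coefficients, which is exactly the comparison described above.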
