Chapter 1: What is linear regression?

The main purpose of linear regression analysis is to assess associations between dependent and independent variables. In this chapter, you will learn the basic idea behind this technique. You will also learn how to create a graphic presentation of the association between two variables by means of a regression line.

Strictly speaking, linear regression requires variables to be metric. Non-metric variables are either nominal or ordinal. The ESS data abound with ordinal variables, such as measurements of opinions. This creates problems for the application of linear regression analysis to ESS data. Some of these problems can be alleviated, and we deal with this in later chapters. For now, it suffices to say that, in addition to metric variables, dichotomous variables (variables with exactly two values) may be used as independent variables in linear regression analyses. This is illustrated in the following example.

Example: Why are men’s incomes higher than women’s incomes?

One proximate cause might be men’s longer working hours. The ESS data contain the dichotomous variable ‘gender’ and the metric variable ‘total hours normally worked per week in main job, overtime included’. Assume that the latter measures all paid work and that Poland is our country of interest. Figure 1 presents the dispersion of Polish men’s working hours (left column) and Polish women’s working hours (right column). Each circle represents one or several persons. The vertical dispersion of the circles indicates that working hours vary strongly within each group. How can we translate this into one single measure of gender differences? The most common procedure is to compute the difference between men’s and women’s mean values. In Figure 1, the upper horizontal line marks the men’s mean, whereas the lower line marks the women’s mean. We call these values conditional means because their computation is conditional on the individuals’ values on the gender variable. As the figure shows, working women’s mean is lower than working men’s mean.

Figure 1. Example 1: Regression line based on simple linear regression analysis with gender as the independent variable and total number of hours worked per week in main job as the dependent variable. Polish ESS round 2 data.

Rather than drawing two horizontal lines, however, we get an even more striking illustration of the difference between men and women by drawing a line between the mean point on the men’s column and the mean point on the women’s column. Such a line has been added in Figure 1. Now, the interesting thing here is that we get an identical line if we apply the ordinary least squares (OLS) method of linear regression analysis. Indeed, linear regression can be described as a method for establishing a linear relation between a set of units’ values on one or more independent variables and their mean values on a dependent variable. In this example, gender is the independent and hours worked the dependent variable.
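
The claim that OLS reproduces the line between the two conditional means can be checked with a few lines of code. The following is a minimal sketch in Python rather than SPSS. The working hours are made up (they are not ESS data), the coding 0 = men and 1 = women is an assumption, and ‘ols_line’ is a helper name introduced here for illustration only.

```python
from statistics import mean

def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up weekly working hours; gender coded 0 = men, 1 = women (assumption).
gender = [0, 0, 0, 0, 1, 1, 1, 1]
hours = [45, 50, 40, 41, 38, 30, 42, 34]

# Conditional means: mean hours within each gender group.
mean_men = mean([h for g, h in zip(gender, hours) if g == 0])
mean_women = mean([h for g, h in zip(gender, hours) if g == 1])

intercept, slope = ols_line(gender, hours)

print("conditional means:", mean_men, mean_women)
print("OLS intercept and slope:", intercept, slope)
```

With a 0/1 independent variable, the intercept should equal the mean of the group coded 0 and the slope should equal the difference between the two conditional means, so the regression line runs from one mean point to the other.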

Regression analysis does not add much to our understanding, however, if there is only one dichotomous independent variable. In such cases, a simple comparison of the two means will suffice. But the need for regression analysis as a simplifying device increases if there is more than one independent variable or if the independent variable has more than two values. In the latter case, the line that connects the conditional means may be bumpy rather than straight, and it can then no longer serve as a convenient representation of the relation between the variables. This, and the use of a regression line to make things simpler, is illustrated by our second example.

Example: Different birth cohorts’ length of education

This example can be motivated by the fact that, during the last century, social development led to an increase in educational opportunities for most people. Assume that we want to study how this affected different birth cohorts’ length of education in, say, Norway. Has the number of years spent in educational institutions increased steadily from one cohort to the next; and if so, how steep has the increase been?

In Figure 2, Norwegian survey sample members with identical variable value combinations are represented by small circles. Only people born before 1975 are included, because many younger people had not finished their studies at the time of the survey. Notice that almost all possible lengths of education are represented in every cohort. But there is also a tendency for the proportion of people with long educations to become greater as the cohorts get younger. Thus, the conditional mean education lengths are higher in younger than in older cohorts. If we draw a line through all these conditional means, however, we don't get a straight line but a zigzag line, as can be seen in Figure 2.

Figure 2. Example 2: Regression line based on simple linear regression analysis with year of birth as the independent variable and length of education measured in years as the dependent variable. Norwegian ESS round 2 data.

Thus, the mean education length has not risen at a constant rate from one cohort to the next. However, the long-term tendency to rise still seems pretty uniform over time. Hence, we might obtain a less fuzzy, and for many purposes fully adequate, picture by just letting a straight line ascend through the zigs and zags of the zigzag line. Such a line is exactly what we get when we apply the ordinary least squares method of regression analysis (OLS) to these data. The resulting regression line is shown in Figure 2, which also illustrates that this line always passes through the point at which the overall mean of the dependent variable meets the overall mean of the independent variable. The line’s relative closeness to the observed conditional means is achieved partly because of the defining principle of the OLS method, which says that the regression line should be drawn so that the sum of its squared vertical distances from the various individuals’ positions in the diagram is as small as possible.
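
The two properties of the OLS line mentioned above (it passes through the point of overall means, and it minimises the sum of squared vertical distances) can be illustrated with a small sketch. The Python code below uses made-up pairs of birth year and education length, not the Norwegian ESS data, and the function name ‘ssr’ is introduced here for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Made-up (year of birth, years of education) pairs; not ESS data.
data = [(1930, 8), (1930, 10), (1940, 9), (1940, 13), (1950, 10),
        (1950, 12), (1960, 14), (1960, 12), (1970, 13), (1970, 15)]
years = [x for x, _ in data]
educ = [y for _, y in data]

# Conditional means: mean education length within each birth year.
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)
cond_means = {x: mean(ys) for x, ys in sorted(groups.items())}

# OLS estimates of slope and intercept.
x_bar, y_bar = mean(years), mean(educ)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
         / sum((x - x_bar) ** 2 for x in years))
intercept = y_bar - slope * x_bar

def ssr(a, b):
    """Sum of squared vertical distances between the points and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in data)

print("conditional means:", cond_means)
print("line at the mean birth year:", intercept + slope * x_bar, "overall mean:", y_bar)
print("SSR at OLS estimates:", ssr(intercept, slope))
print("SSR with a slightly steeper line:", ssr(intercept, slope + 0.01))
```

In this toy data set the conditional means do not rise at a constant rate, yet any line other than the OLS line, such as the slightly steeper one tried at the end, gives a larger sum of squared distances.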

Since the regression line captures the association between year of birth and mean education length quite well, it would seem to be a good idea to choose a point on this line if we were to predict the education length for a person about whom we know nothing but his or her year of birth. Thus, we often treat the regression line as an expression of the association between observed values of the independent variable and predicted values of the dependent variable.

But note, also, that we have no guarantee that mean education length will continue to rise at the same long-term rate for cohorts born after 1974 as it did for those studied here. If it does not, a straight regression line should not be used to express associations between variables in studies that include people born before as well as people born after 1974. A possible solution to such problems is discussed in chapter 3.
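
Prediction from a regression line, and the caution about cohorts outside the observed range, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the intercept and slope are hypothetical values chosen for the example (they are not estimates from ESS data), the lower bound of the observed range is an assumption, and ‘predict_education’ is a name introduced here.

```python
# Hypothetical coefficients, chosen for illustration; not estimated from ESS data.
INTERCEPT = -222.4
SLOPE = 0.12

# The analysis covers people born before 1975; the lower bound is an assumption.
OBSERVED_RANGE = (1900, 1974)

def predict_education(birth_year):
    """Return the predicted years of education and a flag that is True
    when birth_year lies within the observed range of birth years."""
    low, high = OBSERVED_RANGE
    within_range = low <= birth_year <= high
    return INTERCEPT + SLOPE * birth_year, within_range

for year in (1950, 1990):
    predicted, within_range = predict_education(year)
    note = "" if within_range else " (outside observed range: treat with caution)"
    print(f"born {year}: predicted {predicted:.1f} years of education{note}")
```

The flag does nothing more than remind us that a prediction for a person born after 1974 extends the line beyond the cohorts on which it was based.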

On the following page, we will describe how SPSS can be used to create figures like the one presented in Figure 2.

Create Figure 2 using SPSS

We assume that you have downloaded the ESS data and installed a copy of SPSS.

Open SPSS by clicking on the appropriate link. Open the ESS data by clicking ‘File’, ‘Open’, and ‘Data’ on the SPSS menu bar before you select the folder and the data set.

You can then either proceed by pasting and running the SPSS syntax, or you can follow the instructions and use the menus in SPSS.

SPSS syntax

* You can copy this syntax and paste it into a syntax window in SPSS.
* Comments on commands start with an asterisk and end with a dot.
* Commands must always end with a dot.
* EXAMPLE, CREATE FIGURE 2 IN CHAPTER 1.

*The following command causes the cases to be weighted by the design weight variable 'dweight'.

WEIGHT BY dweight.

* The following commands cause SPSS to select for analysis those cases that belong to the Norwegian sample (value NO on country variable) and have lower values than 1975 on the birth year variable (& stands for AND, < stands for 'less than').
* The commands create a filter variable (filter_$) with value 1 for the selected cases and value 0 for the non-selected cases.
* Change the last part of line 2 (which starts after the first equals sign) if you wish to select other cases than the Norwegian ones. If you do this, you should also change the variable label, which can be found within double quotation marks on line 3.

USE ALL.
COMPUTE filter_$=cntry = 'NO' & yrbrn < 1975.
VARIABLE LABEL filter_$ "cntry = 'NO' & yrbrn < 1975 (FILTER)".
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

* The following command creates a scatterplot, with the variable length of education measured along the vertical axis and the variable year of birth measured along the horizontal axis.

GRAPH
/SCATTERPLOT(BIVAR)=yrbrn WITH eduyrs
/MISSING=LISTWISE.

Instructions

The ESS team advises you to weight the data. Therefore, click ‘Data’ on the menu bar, select ‘Weight Cases’, find the variable ‘Design weight’ towards the bottom of the dialogue box’s list of variables, click the variable label and insert it in the ‘Frequency variable’ field. Finish by clicking ‘OK’.

Figure 3. Weight cases procedure

Now, to reproduce Figure 2, you have to deselect the non-Norwegians and all those born after 1974. This is achieved by clicking ‘Data’ and ‘Select Cases’ on the menu bar. Next, tick ‘If condition is satisfied’, click ‘If’ and indicate that you want to analyse information about those Norwegians who were born before 1975. Do this by first typing the country variable name or selecting it from the variable list, and then adding = 'NO' to indicate that you want to select the Norwegian cases. (NO is the Norwegian value code.) Continue by typing the sign &, which indicates that yet another condition must be fulfilled. Finally, add this second condition by typing yrbrn < 1975, which requires the value of the birth year variable to be less than 1975. The full condition should read as follows: cntry = 'NO' & yrbrn < 1975. (See Figure 4.) When this is done, click ‘Continue’ and ‘OK’. (Avoid ticking the option ‘delete unselected cases’, since this will change your dataset permanently.) You must give a new ‘Select cases’ command if you want to change these settings and include non-Norwegians or younger people in your active data set.

Figure 4. Select cases procedure

Now you can create the graph. Go to the ‘Graphs’ menu. If you are using SPSS 14.0 or older versions, click ‘Scatter/Dot’. If you are using SPSS 15.0 or later, click ‘Legacy dialogs’ and ‘Scatter/Dot’. Choose ‘Simple Scatter’ and click ‘Define’. Select the birth year variable from the variable list and put it in the X Axis (horizontal axis) field. Put the education length variable in the Y Axis (vertical axis) field and click ‘OK’.

Figure 5. Creating a scatterplot

The following applies both to those who use syntax and to those who use the menus: Double-click on the scatterplot that appears in the output window. A new window opens. Choose ‘Interpolation line’ from the ‘Elements’ menu. Choose ‘Interpolation line’ in the dialogue box as well. Tick ‘Straight’ before you click ‘Apply’ and ‘Close’. A line that interpolates between cohort means will be inserted into the plot. (Drop the preceding step if you wish to create a figure without such an interpolation line.) Then choose ‘Fit line at total’ from the ‘Elements’ menu and ‘Fit line’ in the dialogue box. Tick ‘Linear’ before clicking ‘Apply’ and ‘Close’. The figure should now appear with a linear regression line and a zigzag line running through the conditional means. Text and other features of the figure can be edited. Finish editing before using ‘Copy chart’ from the ‘Edit’ menu to make a copy for pasting into a text file.

Exercise

Create a scatterplot with a regression line. Use the variable ‘How happy are you’ as the dependent variable on the vertical axis and ‘Subjective general health’ as the independent variable on the horizontal axis. Use the Norwegian sample or change to another country’s sample.

You will find that the observation markers create a grid pattern with 5 columns and 10 rows. The reason is that the independent variable has 5 values, while the dependent variable has 10 values, and that there are people in the dataset who represent almost all possible combinations of these values. Still, the regression line indicates that there is a tendency for low values on the independent variable to be combined with high values on the dependent variable. Why is that? The reason can be found in the way these variables are coded. Happy people have high values on the ‘How happy..’ variable, whereas people who perceive their own health as good have low values on the health variable. Thus, the downward slope of the regression line tells us that people who perceive their own health as good tend to be happier than other people.
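
How the coding of a variable determines the sign of the slope can be illustrated with a small sketch. The Python code below uses made-up answers, not ESS data. It assumes that health is coded from 1 (best) to 5 (worst) and that higher happiness values mean happier, and ‘ols_slope’ is a helper name introduced here.

```python
from statistics import mean

def ols_slope(x, y):
    """Return the slope of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

# Made-up answers: health coded 1 = best ... 5 = worst (assumption),
# happiness coded so that higher values mean happier.
health = [1, 1, 2, 2, 3, 3, 4, 5]
happy = [9, 8, 8, 7, 7, 5, 4, 3]

slope_original = ols_slope(health, happy)

# Reverse the health coding so that high values mean good health (1 becomes 5, 5 becomes 1).
health_reversed = [6 - h for h in health]
slope_reversed = ols_slope(health_reversed, happy)

print("slope with original coding:", slope_original)
print("slope with reversed coding:", slope_reversed)
```

Reversing the coding changes the sign of the slope but not its size, so the substantive conclusion about health and happiness stays the same.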
