
Chapter 2: Simple linear regression: The regression equation and the regression coefficient

Visual inspection of regression lines may be convenient, but their steepness and direction are usually indicated by numbers rather than figures. These numbers are called regression coefficients. This chapter will teach you how to compute them.

The regression equation

A good understanding of regression coefficients presupposes a basic knowledge of the mathematical representation of regression lines. The data and variables are the same as in the second example in chapter 1. The observed education length of person i (any person in the sample) is symbolised by yi. The value yi can be expressed as the sum of person i’s predicted education length (ŷi) and a residual (ei). We express this as follows:

yi = ŷi + ei

Furthermore, as we know from chapter 1, the regression line is a graphical expression of the association between people’s observed independent variable values and their predicted dependent variable values. Since the linear regression line is straight, this association can be expressed as a linear function. In such functions, the predicted education length is seen as changing in constant proportion to increases or decreases in people’s birth year variable values. We can express such functions as follows:

ŷi = a + b∙xi

Here, xi is person i’s birth year, while a and b symbolise constants (fixed numbers). These constants are the regression coefficients, or, to be more exact, the a is often called the constant or the intercept, while the b is called variable x’s regression coefficient because it determines how the predicted y values (the ŷi) change as the value of xi changes. A more thorough exposition of these points can be found here.

If we let the right-hand side of the function ŷi = a + b∙xi replace ŷi in the function yi = ŷi + ei, we get:

yi = a + b∙xi + ei

That is, the observed values of the dependent variable (in our example length of education) are conceived as being determined by four factors: the coefficients a and b, the independent variable xi (year of birth) and the residual ei. The latter is supposed to vary randomly and, hence, to be independent of xi. In Figure 6, this is illustrated for the persons with identification numbers 4 and 1892.

Figure 6. Graphical illustration of residuals and observed / predicted values

The line that slopes upwards from left to right is identical to the regression line we saw in Figure 2. The two circles that represent our selected persons are not positioned on the regression line, which means that their observed dependent variable values deviate from their predicted values. The observed values y4 and y1892 are marked on the right and left vertical axis, respectively. The predicted values ŷ4 and ŷ1892 are the y-variable values of the points on the regression line that lie immediately above person 4’s position and below person 1892’s position in the figure. These values are also marked on the vertical axes. The residuals, e4 and e1892, are the vertical distances between the persons’ circles and the regression line, i.e. they are the differences between the observed and the predicted dependent variable values (e4 = y4 - ŷ4, etc.). Persons whose circles are positioned below the regression line have negative residuals, while those whose circles lie above the regression line have positive residuals.

However, in order to know the sign and size of a person’s residual, we have to know where the regression line is positioned. Hence, we need a method for simultaneous computation of regression coefficients and residuals. In fact, as explained in chapter 1, the OLS method does just this by computing those a and b values that minimise the sum of squared residuals. We do not have to worry about the computational procedures. SPSS does all the necessary computations for us. Note, however, that this method always causes the sum and mean value of the residuals to be approximately 0.
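The idea behind the OLS computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SPSS's actual procedure: the five (birth year, education length) pairs are invented, the design weight is ignored, and the function names fit_ols and sum_squared_residuals are hypothetical. The sketch uses the standard closed-form solution for simple regression: b is the sum of the products (xi - mean of x)∙(yi - mean of y) divided by the sum of the squares (xi - mean of x)², and a is the mean of y minus b times the mean of x.

```python
def fit_ols(x, y):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(a, b, x, y):
    """Sum of squared differences between observed and predicted y values."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Invented data: birth year and years of education for five hypothetical persons.
birth_year = [1920, 1935, 1950, 1960, 1970]
education = [8.0, 9.0, 13.0, 11.0, 14.0]

a, b = fit_ols(birth_year, education)
residuals = [yi - (a + b * xi) for xi, yi in zip(birth_year, education)]

print("a =", a, "b =", b)
print("sum of residuals:", sum(residuals))  # zero apart from rounding error

# Moving the slope or the intercept away from the OLS values
# increases the sum of squared residuals.
best = sum_squared_residuals(a, b, birth_year, education)
print(best < sum_squared_residuals(a, b + 0.01, birth_year, education))
print(best < sum_squared_residuals(a + 0.5, b, birth_year, education))
```

The last two lines compare the OLS line with two slightly different lines, which is a quick way of seeing what "minimising the sum of squared residuals" means in practice.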


Performing ordinary linear regression analyses using SPSS

Follow the preparatory steps outlined in the first chapter, i.e. open the data set, turn on the design weight and select the Norwegian sample of persons born earlier than 1975. Then, run the regression analysis as follows:

You can also copy, paste and run this syntax:

*Syntax for the example in chapter 2, the Norwegian sample.
*The following command causes the cases to be weighted by the design weight variable 'dweight'.

WEIGHT BY dweight.

*The following commands cause SPSS to select for analysis those cases that belong to the Norwegian sample (value NO on country variable) and have lower values than 1975 on the birth year variable (& stands for AND, < stands for 'less than').
*In this process, the commands create a filter variable (filter_$) with value 1 for the selected cases and value 0 for the non-selected cases.
*Change the last part of line 2 (which starts after the first equals sign) if you wish to select other cases. If you do this, you should also change the variable label, which is in double quotation marks on line 3.

COMPUTE filter_$=cntry = 'NO' & yrbrn < 1975.
VARIABLE LABEL filter_$ "cntry = 'NO' & yrbrn < 1975 (FILTER)".
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.

*The following commands cause a linear regression analysis to be performed on the selected data with dependent variable 'eduyrs' and independent variable 'yrbrn'.
*Change variable names in the last two lines if you wish to run the analysis with other dependent and independent variables.



Figure 7. Running a simple (bivariate) linear regression analysis

The output you get if you execute these commands correctly contains the ‘Coefficients’ table shown here as Table 1. The computed values of a and b are shown in the B column. The item in the first row is the a-coefficient, which SPSS terms the ‘Constant’. The item in the second row is the birth year variable’s b-coefficient, which indicates the steepness of the regression line or, if you prefer, how much the predicted value of the dependent variable (length of education) increases when the value of the independent birth year variable increases by one unit (one year). The coefficient’s value is 0.097 (or, more exactly, 0.096672408), which means that each new cohort’s predicted length of education is 0.097 years longer than that of the cohort born one year before it.

The a-coefficient or ‘constant’ is identical to the predicted value of the dependent variable for those cases whose independent variable value is 0. But be careful when interpreting this coefficient. Here, its value is negative (-175.553), and surely no one has a negative length of education. The reason for this strange result is that the persons thus attributed a negative education length are supposed to have been born in year 0. But there are no survivors from year 0 in our sample, and the regression results only apply to persons whose x-variable values lie within the span used in the computations (values between 1910 and 1974). You should avoid making extrapolations beyond these limits, and extrapolations that extend far beyond these limits make little sense. Hence, the constant term has no substantive interpretation in this example, but we still need it for computations of predicted y values.
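The interpretation of the two coefficients can be checked numerically. The following Python sketch uses the coefficient values reported in this chapter (the constant is given with the extra decimals used in the prediction example); the function name predict is chosen for this illustration only.

```python
# Coefficients as reported in the chapter (constant and birth year slope).
a = -175.5535449315
b = 0.096672408


def predict(birth_year):
    """Predicted length of education (years) for a given birth year."""
    return a + b * birth_year


# Two cohorts born one year apart differ by b in predicted education length.
print(predict(1951) - predict(1950))

# Predictions at the limits of the birth years used in the computations.
print(predict(1910), predict(1974))

# Extrapolating to birth year 0 simply returns the constant, which is negative.
print(predict(0))
```

Within the observed span of birth years the predictions are plausible education lengths, while the value at birth year 0 is just the constant itself, which is why it should not be given a substantive interpretation here.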

Table 1. SPSS output: Simple linear regression coefficients

The computed coefficient values may be seen as interesting in themselves. But they can also be used to compute predicted dependent variable values for particular persons or groups of persons. Such computations are done by inserting these persons’ independent variable values and the computed coefficient values into the right-hand side of the function ŷi = a + b∙xi. For example, person 4 in Figure 6 was born in 1954. If we insert this value into the function together with the coefficient values, we get:

ŷ4 = (-175.5535449315) + 0.096672408 ∙ 1954 = 13.34

(More decimals than usual are used here, because rounded coefficients would give an inexact prediction.) In other words, we can predict this person’s length of education to be 13.34 years. But this person’s actual length of education is 11 years, so we get a residual value of 11 - 13.34 = -2.34 years, which accords with what Figure 6 shows us.
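The arithmetic for person 4, and the reason for carrying so many decimals, can be checked with a short Python sketch. The coefficient values, the birth year and the observed education length are those quoted in the text; the variable names are chosen for this illustration only. The second computation shows how far the prediction drifts if the coefficients are first rounded to three decimals.

```python
a = -175.5535449315  # constant, with the decimals used in the text
b = 0.096672408      # birth year coefficient, with the decimals used in the text

birth_year_person_4 = 1954
observed_education = 11.0  # years, as stated in the text

predicted_education = a + b * birth_year_person_4
residual = observed_education - predicted_education

print(round(predicted_education, 2))  # 13.34
print(round(residual, 2))             # -2.34

# Same prediction with coefficients rounded to three decimals, as in Table 1.
rounded_prediction = -175.553 + 0.097 * birth_year_person_4
print(rounded_prediction)  # roughly 13.99, i.e. about 0.64 years too high
```

Because the birth year is a large number, a rounding error of 0.0003 in b is multiplied by 1954, which is enough to shift the prediction by more than half a year.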

The coefficients presented in Table 1 pertain to those who are members of the ESS sample. The table also contains information about the accuracy of these coefficients regarded as indicators of the association between birth year and education length in the entire population of Norwegians born before 1975. Exactly what these columns (Std. Error, t and Sig.) tell us will, however, be explained in chapter 4.

The Beta column contains the regression coefficients one gets when the analysis is performed on standardised variables. You don’t have to know anything about them to perform ordinary regression analysis. In fact, in most cases you should avoid using them. Consult a regression analysis textbook if you want to know more about them.


Interpretation of the Model summary table

The regression results comprise three tables in addition to the ‘Coefficients’ table, but we limit our interest to the ‘Model summary’ table, which provides information about the regression line’s ability to account for the total variation in the dependent variable. Figure 6 demonstrates that the observed y-values are highly dispersed around the regression line. Thus, as regression analysts often put it, the regression model only ‘explains’ a limited proportion of the dependent variable’s total variation.

The dependent variable’s total variation can be measured by its variance. If the regression line is not completely horizontal (i.e. if the b coefficient is different from 0), then some of the total variance is accounted for by the regression line. This part of the variance is measured as the sum of the squared differences between the respondents’ predicted dependent variable values and the overall mean divided by the number of respondents. By dividing this explained variance by the total variance of the dependent variable, we arrive at the proportion of the total variance that is accounted for by the regression equation. This proportion varies between 0 and 1 and is symbolised by R² (R Square). As can be seen from Table 2, the value of our R² is 0.131, which means that 13.1 percent of the total variance in education length has been ‘explained’. Not very impressive, but not bad either compared with the R² values one tends to get in analyses of social survey data. The R is the square root of R². The Adjusted R² will be discussed later.
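The computation of the explained proportion of variance can be sketched as follows in Python. As before, this is an illustration under stated assumptions rather than a reproduction of the SPSS output: the five data points are invented, the design weight is ignored, and the helper name fit_ols is hypothetical. Variances are computed by dividing by the number of respondents, as described above.

```python
import math


def fit_ols(x, y):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b


# Invented data: birth year and years of education for five hypothetical persons.
birth_year = [1920, 1935, 1950, 1960, 1970]
education = [8.0, 9.0, 13.0, 11.0, 14.0]

a, b = fit_ols(birth_year, education)
predicted = [a + b * xi for xi in birth_year]

n = len(education)
mean_y = sum(education) / n

# Total variance of the dependent variable.
total_variance = sum((yi - mean_y) ** 2 for yi in education) / n
# Variance accounted for by the regression line.
explained_variance = sum((pi - mean_y) ** 2 for pi in predicted) / n
# Variance of the residuals (the part left unexplained).
residual_variance = sum((yi - pi) ** 2 for yi, pi in zip(education, predicted)) / n

r_squared = explained_variance / total_variance
r = math.sqrt(r_squared)

print("R squared:", r_squared)
print("R:", r)
# Explained and residual variance add up to the total variance.
print(explained_variance + residual_variance, total_variance)
```

The last line illustrates why the proportion lies between 0 and 1: for an OLS line with a constant, the total variance splits into an explained part and a residual part.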

Table 2. SPSS output: Simple linear regression goodness of fit


  1. Perform the same regression analysis as in the example presented above on data from the Polish (or another country’s) ESS sample.
  2. Perform a regression analysis with ‘How happy are you’ as the dependent variable and ‘Subjective general health’ as the independent variable. (These variables are not metric, but they can, at least as an exercise, still be used in OLS regression.) Use data from a country of your own choice. What do the results tell you?