Chapter 2: Simple linear regression: The regression equation and the regression coefficient
Visual inspection of regression lines may be convenient, but their steepness and direction are usually indicated by numbers rather than figures. These numbers are called regression coefficients. This chapter will teach you how to compute them.
The regression equation
A good understanding of regression coefficients presupposes a basic knowledge of the mathematical representation of regression lines. The data and variables are the same as in the second example in chapter 1. The observed education length of person i (any person in the sample) is symbolised by yi. The value yi can be expressed as the sum of person i’s predicted education length (ŷi) and a residual (ei). We express this as follows:
yi = ŷi + ei
Furthermore, as we know from chapter 1, the regression line is a graphical expression of the association between people’s observed independent variable values and their predicted dependent variable values. Since the linear regression line is straight, this association can be expressed as a linear function. In such functions, the predicted education length is seen as changing in constant proportion to increases or decreases in people’s birth year variable values. We can express such functions as follows:
ŷi = a + b∙xi
Here, xi is person i’s birth year, while a and b symbolise constants (fixed numbers). These constants are the regression coefficients, or, to be more exact, the a is often called the constant or the intercept, while the b is called variable x’s regression coefficient because it determines how the predicted y values (the ŷi) change as the value of xi changes. A more thorough exposition of these points can be found here.
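The role of the two constants can be illustrated with a minimal Python sketch. The coefficient values below are hypothetical and chosen only for illustration; they are not the estimates from our data, and the function name predict is our own. The sketch shows that raising the birth year by one year changes the predicted education length by exactly b.

```python
def predict(a: float, b: float, x: float) -> float:
    """Return the predicted value y-hat = a + b*x."""
    return a + b * x


# Hypothetical coefficients (assumptions, not estimates from the chapter's data)
a = -150.0  # the constant (intercept)
b = 0.083   # the regression coefficient of birth year

# Predicted education length for two adjacent birth years
y_hat_1950 = predict(a, b, 1950)
y_hat_1951 = predict(a, b, 1951)

# The difference between the two predictions equals b
difference = y_hat_1951 - y_hat_1950
print(round(y_hat_1950, 2), round(difference, 3))
```

Because the function is linear, the same difference appears between any two adjacent birth years, which is what "changing in constant proportion" means.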
If we let the right-hand side of the function ŷi = a + b∙xi replace ŷi in the function yi = ŷi + ei, we get:
yi = a + b∙xi + ei
That is, the observed values of the dependent variable (in our example length of education) are conceived as being determined by four factors: the coefficients a and b, the independent variable xi (year of birth), and the residual ei. The latter is assumed to vary randomly and, hence, to be independent of xi. In Figure 6, this is illustrated for the persons with identification numbers 4 and 1892.
The line that slopes upwards from left to right is identical to the regression line we saw in Figure 2. The two circles that represent our selected persons are not positioned on the regression line, which means that their observed dependent variable values deviate from their predicted values. The observed values y4 and y1892 are marked on the right and left vertical axis, respectively. The predicted values ŷ4 and ŷ1892 are the y-variable values of the points on the regression line that lie immediately above person 4’s position and below person 1892’s position in the figure. These values are also marked on the vertical axes. The residuals, e4 and e1892, are the vertical distances between the persons’ circles and the regression line, i.e. they are the differences between the observed and the predicted dependent variable values (e4 = y4 - ŷ4, etc.). Persons whose circles are positioned below the regression line have negative residuals, while those whose circles lie above the regression line have positive residuals.
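The relationship between observed values, predicted values and residuals can also be sketched in a few lines of Python. All numbers below (the coefficients, the birth years and the observed education lengths) are hypothetical stand-ins for persons 4 and 1892; they are not read from Figure 6.

```python
# Hypothetical coefficients (assumptions for illustration only)
a, b = -150.0, 0.083

# Hypothetical birth years and observed education lengths for two persons
people = {
    4: {"birth_year": 1960, "observed": 9.0},
    1892: {"birth_year": 1930, "observed": 14.0},
}

residuals = {}
for person_id, person in people.items():
    predicted = a + b * person["birth_year"]  # point on the regression line
    residuals[person_id] = person["observed"] - predicted  # e = y - y-hat
    position = "below" if residuals[person_id] < 0 else "above"
    print(person_id, round(residuals[person_id], 2), position, "the line")
```

With these made-up values, person 4 ends up below the line (a negative residual) and person 1892 above it (a positive residual), mirroring the situation described for the figure.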
However, in order to know the sign and size of a person’s residual, we have to know where the regression line is positioned. Hence, we need a method for simultaneous computation of regression coefficients and residuals. In fact, as explained in chapter 1, the OLS (ordinary least squares) method does just this by computing those a and b values that minimise the sum of squared residuals. We do not have to worry about the computational procedures. SPSS does all the necessary computations for us. Note, however, that when the equation includes a constant, as it does here, this method always causes the sum, and hence the mean value, of the residuals to be 0, apart from rounding error.
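For readers who want to see what such a computation involves, the following Python sketch applies the standard closed-form OLS solution for one independent variable to a tiny, made-up data set. It is not a description of how SPSS works internally, and the data and the helper function sum_of_squared_residuals are assumptions introduced for illustration.

```python
# Made-up data: birth years (x) and education lengths in years (y)
x = [1930, 1940, 1950, 1960, 1970]
y = [8.0, 10.0, 9.0, 12.0, 13.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form OLS solution for simple linear regression
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
b = s_xy / s_xx          # regression coefficient
a = mean_y - b * mean_x  # constant (intercept)


def sum_of_squared_residuals(a_value: float, b_value: float) -> float:
    """Sum of squared residuals for a line with the given constant and slope."""
    return sum((yi - (a_value + b_value * xi)) ** 2 for xi, yi in zip(x, y))


residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print("b:", round(b, 4), "a:", round(a, 4))
print("sum of residuals:", round(sum(residuals), 6))
print("sum of squared residuals:", round(sum_of_squared_residuals(a, b), 4))
```

Any other choice of a or b, for example sum_of_squared_residuals(a + 0.1, b), gives a larger sum of squared residuals on the same data, and the residuals of the fitted line sum to 0 up to floating-point rounding.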