Regression

It is often the ambition of the social scientist to explain a particular phenomenon: Why does something occur, why does a person choose to become a mechanic, what factors influence the level of political participation?

Explanation implies causality. It is common to speak about independent and dependent variables. The independent variable is the variable that causes change in the dependent variable:

Figure 1: Diagram of a causal model

Figure 2

There are at least three requirements that must be fulfilled before the relationship between an independent and a dependent variable is truly causal.

  1. Time order: the cause must precede the effect (smoke does not create fire).
  2. There must be a statistical relationship between the variables.
  3. The relationship must not be spurious (caused by another factor).

In Nesstar WebView you can analyse causal relationships by linear regression, with either one independent variable (bivariate regression) or several (multivariate regression). Linear regression was developed for the analysis of metric variables, such as age, income or height.

Let us illustrate regression with an example. We want to find factors that can explain why someone uses the Internet. Age is clearly a relevant variable: Young people are open to experimenting with new technologies, and information technology has become an important subject in school. So it is reasonable to expect that young people use the Internet more often than older people do.

In the ESS data we find the variables "age" and "Personal use of internet/e-mail/www". If we perform a regression analysis with "age" as the independent variable and "Personal use of internet…" as the dependent variable, we get the results shown in Table 6:


Table 6: Regression analysis (independent variable: age; dependent variable: personal use of Internet)

Variable          B      SE B   Beta   T       Significance  Tolerance
Age, years 2002   -0.05  0.00   -0.35  -66.00  0.000         1.00

Intercept            4.74
Valid N              30,323.60
Multiple R           0.354
Multiple R Squared   0.126
Adjusted R Squared   0.126
F value              4,355.76
F sign               0.000

Weight is on


The interpretation of this output goes through several phases.

  1. Are the results significant?
  2. What is the relationship between the variables: positive or negative direction?
  3. How much variation is explained by the independent variable?

The "F sign" value in Table 6 gives the level of significance for the regression model as a whole. The "Significance" value in the row for the independent variable applies to that variable only. Both values are 0.000, which indicates that the results are significant. This is not surprising: when the valid N (the number of cases included in the analysis) is very large, the results will almost always be significant.
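
With a single independent variable, the F test for the model and the t test for the coefficient are equivalent, which is why the two significance values agree. A minimal check in Python, using only the T and F values reported in Table 6 (the variable names are illustrative):

```python
# Values copied from Table 6.
t_value = -66.00   # T for "Age, years 2002"
f_value = 4355.76  # F value for the whole model

# In a regression with one independent variable, F equals T squared.
t_squared = t_value ** 2
print(f"T squared = {t_squared:.2f}, reported F = {f_value:.2f}")
```

The small gap between the two numbers is expected, because the table reports T rounded to two decimals.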

The equation for the regression line is:
y = a + bx
where y is the dependent variable, x is the independent variable, and a and b are constants (the regression coefficients).

The task of regression analysis is to estimate the values of the two regression coefficients, a and b, from the observed data. The resulting regression line is the line that best fits the points in a scatterplot of x against y.
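
As a rough illustration of how a and b can be estimated, the sketch below fits a line by ordinary least squares using only the Python standard library. The function name `fit_line` and the small data set are assumptions made up for this example; they are not ESS data and this is not necessarily how Nesstar WebView computes its results.

```python
from statistics import mean


def fit_line(x, y):
    """Estimate a (intercept) and b (slope) of y = a + b*x by least squares."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # change in y per one-unit increase in x
    a = y_bar - b * x_bar    # value of y where the line crosses x = 0
    return a, b


# Hypothetical data, invented for illustration only (not taken from the ESS).
ages = [20, 30, 40, 50, 60, 70]
use = [4, 3, 3, 2, 2, 1]

a, b = fit_line(ages, use)
print(f"Intercept a = {a:.2f}, slope b = {b:.3f}")
```

The slope is the covariation of x and y divided by the variation in x, and the intercept is then chosen so that the line passes through the point of the two means.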

The constant "a" is called "Intercept" in the output. It shows the point at which the regression line crosses the Y-axis when the value of X equals zero.

The constant "b" is called "B" in the output. This measures the amount of increase or decrease in the dependent variable for a one-unit increase in the independent variable.

Based on the results in Table 6, we get the following equation:
Personal use of Internet = 4.7 - 0.05*age

Interpretation of a = 4.7
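The dependent variable "personal use of Internet" is a categorical variable with 0-7 as the valid values. The regression predicts this value to be 4.7 when a person is 0 years old. This is simply the point where the line crosses the Y-axis; it has little substantive meaning for a newborn.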

Interpretation of b = - 0.05
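To interpret the regression coefficient, or the direction of the relationship, we must know how the variables are coded. Here, age is measured in years, and higher values on "Personal use of Internet" indicate more frequent use.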

The regression coefficient is negative, which implies that an increase in the independent variable is associated with a decrease in the dependent variable.

A one-year increase in "age" lowers the predicted value of "Personal use of Internet" by 0.05. This implies that older people tend to use the Internet less than younger people.
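
The interpretation of a and b can be checked by plugging a few ages into the fitted equation. The sketch below uses the intercept (4.74) and B (-0.05) from Table 6; the function name `predicted_use` and the chosen ages are illustrative assumptions.

```python
def predicted_use(age, intercept=4.74, slope=-0.05):
    """Predicted score on "Personal use of Internet" for a given age in years."""
    return intercept + slope * age


# Each extra year of age lowers the predicted score by 0.05.
for age in (20, 40, 60, 80):
    print(f"age {age}: predicted use = {predicted_use(age):.2f}")
```

Because the coefficients in the table are rounded, these predictions are approximate.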

So far, we have seen that the relationship between the variables is significant, and that younger people use the Internet more often than older people do. But we also need to know how good this model is: how much of the total variation in the dependent variable is explained by the independent variable?

The measure R², or R-square, indicates the proportion of the total variation in the dependent variable that is "determined" by its linear relationship to the independent variable. In Table 6 the multiple R-square is 0.126, or about 0.13. This means that about 13% of the variation is explained, and that about 87% of the variation is unexplained, or caused by other factors. In the social sciences it is rare to find relationships that yield a very high R-square.
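
To make the idea of "explained variation" concrete, the sketch below computes R-square as one minus the ratio of unexplained variation (squared residuals) to total variation. The function name `r_squared` and the data are hypothetical and invented for illustration; they are not the ESS data behind Table 6.

```python
from statistics import mean


def r_squared(x, y):
    """Share of the variation in y explained by a least-squares line on x."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    # Unexplained variation: squared distances from the points to the line.
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation: squared distances from the points to the mean of y.
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data, invented for illustration only.
ages = [20, 30, 40, 50, 60, 70]
use = [4, 3, 3, 2, 2, 1]

r2 = r_squared(ages, use)
print(f"Explained: {r2:.1%}, unexplained: {1 - r2:.1%}")
```

Such a tiny, tidy data set gives a far higher R-square than the 0.126 in Table 6; real survey data are much noisier.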