Correlation
By investigating a cross table, as described above, one is often able to detect whether there is a relationship between two variables, and also what the relationship is. Often this is not enough. Sometimes we would like to know how strong the connection is, or we would like to compare several relationships.
A correlation coefficient is a statistical measure that summarises the relationship between two variables in a single number. Different coefficients have been developed for different types of variables. We will limit ourselves to Pearson's r, which was developed for metric variables [1]. Pearson's r ranges from -1 to 1 and measures the strength and direction of the linear relationship between two variables: values close to -1 or 1 indicate a strong relationship, and values close to 0 indicate a weak or absent linear relationship. With survey data, it is very rare to find correlations above 0.6.
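To make the definition concrete, here is a minimal sketch of how Pearson's r can be computed by hand: the co-variation of the two variables divided by the square root of the product of their individual variations. The function name `pearson_r` and the example values are invented for illustration and are not taken from the Trust dataset.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r for two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations of each observation from its variable's mean
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    co_variation = sum(a * b for a, b in zip(dx, dy))
    variation_x = sum(a * a for a in dx)
    variation_y = sum(b * b for b in dy)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return co_variation / sqrt(variation_x * variation_y)

# Made-up values: y rises (or falls) in exact step with x
perfect_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
perfect_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(perfect_positive, perfect_negative)
```

The two toy calls show the end points of the scale: a perfectly linear increasing relationship gives 1, and a perfectly linear decreasing relationship gives -1.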
In the following example, we will investigate the relationship between the variables "age", "gender" and "personal use of Internet/e-mail". The interesting correlations here are between age and use of Internet, and between gender and use of Internet.
To establish the correlations in Table 5:
- Open the Trust dataset.
- Switch the combined weight on.
- Select the "Analysis" tab.
- Find the variable gender, left-click and select "Add to correlation".
- Find the variables age and personal use of Internet and add them to the correlation.
| | Age | Gender | Personal use of Internet/E-mail |
|---|---|---|---|
| Age | - | 0.011 | -0.354 |
| Gender | 0.011 | - | -0.109 |
| Personal use of Internet/E-mail | -0.354 | -0.109 | - |
Weight is on
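The table above was produced in Nesstar WebView. For readers who work outside that tool, a weighted correlation matrix could be sketched roughly as follows. This is an assumption about one common way of applying weights (each respondent counts in proportion to the weight); the text does not say how Nesstar computes it internally. The column names, values and weights are all hypothetical.

```python
from math import sqrt

def weighted_r(x, y, w):
    """Pearson's r where each respondent counts in proportion to its weight."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns, w):
    """Correlate every variable with every other variable."""
    names = list(columns)
    return {a: {b: weighted_r(columns[a], columns[b], w) for b in names} for a in names}

# Hypothetical respondents; values and column names are invented for illustration.
data = {
    "age": [25, 40, 60, 33, 71, 52],
    "gender": [0, 1, 1, 0, 1, 0],            # 0 = male, 1 = female
    "internet_use": [6, 5, 2, 6, 1, 4],      # higher value = more frequent use
}
weights = [1.0, 0.8, 1.2, 1.0, 0.9, 1.1]     # hypothetical combined weights

matrix = correlation_matrix(data, weights)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

As in the table, the matrix is symmetric, and each variable correlates perfectly with itself (shown as "-" in the table).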
From Table 5 we can see that the correlation between age and personal use of Internet is -0.35, while the correlation between gender and personal use of Internet is -0.11. Both correlations are statistically significant, that is, they are unlikely to be due to chance alone.
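Whether a correlation is significant depends on its size and on the number of respondents. The sketch below shows one standard approximation, Fisher's z-transformation, for testing whether a correlation differs from zero. The sample sizes used here are purely hypothetical, since the text does not state the number of respondents, and the calculation ignores the effect of weighting, so it is an illustration of the idea and not a reproduction of the test behind Table 5.

```python
from math import atanh, erfc, sqrt

def approx_p_value(r, n):
    """Approximate two-sided p-value for the hypothesis that the true correlation is 0.

    Uses Fisher's z-transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal when there is no correlation in the population.
    """
    if n <= 3:
        raise ValueError("n must be larger than 3")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))

# Assumption: 1000 and 50 are hypothetical sample sizes, not the size of the Trust dataset.
for n in (1000, 50):
    for label, r in [("age vs Internet use", -0.354), ("gender vs Internet use", -0.109)]:
        print(f"n={n:5d}  {label:24s} r={r:+.3f}  p={approx_p_value(r, n):.4g}")
```

With the larger hypothetical sample both coefficients fall below the conventional 0.05 threshold, while with only 50 respondents the weaker correlation of -0.109 would not. The same coefficient can therefore be significant in a large survey and non-significant in a small one.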
It is not straightforward to compare the size of these coefficients. Gender is a variable with only two categories, while age is a metric variable with many possible values. When a variable has few categories, it is nearly impossible to obtain a very high correlation using survey data.
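One way to illustrate why few categories hold the correlation down is a small simulation: take a metric variable, collapse it into two categories, and compare how strongly each version correlates with the same outcome. Everything below is simulated; the variable names and the chosen strength of the relationship are assumptions made for the illustration.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson's r for two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 20000

# A metric variable and an outcome that depends on it plus random noise
metric = [rng.gauss(0, 1) for _ in range(n)]
outcome = [value + rng.gauss(0, 1) for value in metric]

# The same variable collapsed into two categories (below / above zero)
two_categories = [1 if value > 0 else 0 for value in metric]

r_metric = pearson_r(metric, outcome)
r_two_categories = pearson_r(two_categories, outcome)
print(f"metric version:       r = {r_metric:.2f}")
print(f"two-category version: r = {r_two_categories:.2f}")
```

The underlying relationship is identical in both cases, yet the two-category version yields a clearly lower coefficient because the information about how far each respondent lies from the cut-off has been discarded.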
To interpret the direction of the correlations (in our example both are negative), we need to know how the variables are coded:
- Gender: 0 = male, 1 = female
- Personal use of Internet/e-mail: 1 = No access, 2 = Never use, 3 = Less than once a month, 4 = Once a month, 5 = Several times a week, 6 = Every day
- Age: age in years in 2002 (20, 21, etc.)
A negative correlation means that respondents with a low value on one variable tend to have a high value on the other variable. (A positive correlation means that a high value on one variable tends to go together with a high value on the other.) Thus, the negative correlation between gender and use of Internet means that being male (low value) is associated with frequent use (high value), and being female with less frequent use. Similarly, the negative correlation between age and use of Internet shows that young people (low value) use the Internet more often (high value) than old people.
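Because the sign depends entirely on the coding, reversing the codes of one variable reverses the sign of the correlation without changing its size. The sketch below shows this with a handful of invented respondents; the values are made up and only the coding scheme (0 = male, 1 = female; higher value = more frequent use) follows the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r for two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Invented respondents: in this toy data the men report more frequent use
gender = [0, 0, 0, 1, 1, 1]          # 0 = male, 1 = female
internet_use = [6, 5, 6, 4, 2, 5]    # 1 = No access ... 6 = Every day

r_original = pearson_r(gender, internet_use)

# Reverse the coding: 1 = male, 0 = female
gender_reversed = [1 - code for code in gender]
r_reversed = pearson_r(gender_reversed, internet_use)

print(f"0 = male, 1 = female: r = {r_original:+.2f}")
print(f"1 = male, 0 = female: r = {r_reversed:+.2f}")
```

The data and the substantive finding are the same in both runs; only the sign changes. This is why a coefficient should never be interpreted without checking which category or end of the scale carries the high values.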
Footnotes
- [1] In research it is common to use Pearson's r for nominal or ordinal variables as well. To do so, it is necessary to assume that the variable has metric characteristics. For nominal variables with two categories (male, female) this is done by recoding the variable into a dummy variable: either you are a man (= 1), or you are not (= 0), in which case we know you are a woman. Which category receives the value 1 is arbitrary; in the example above, female is coded 1. For ordinal variables (very positive, quite positive, quite negative, very negative), we assume that the categories form a scale with equal distances between them.