The relationship between errors and quality of questions
In the first chapter, we showed that response distributions and correlations between variables can differ considerably across different forms of the same question. Therefore, some or all of these questions must contain errors. We expect random errors due to unintended mistakes, and systematic errors as a consequence of stable behaviour on the part of respondents confronted with a specific form of a question. In this chapter, we indicate where these errors come from and how they can be related to the quality of survey questions.
Given that errors are made in the measurement process, the response variables will not be the same as the variables we want to measure. Looking only at the random errors, psychometricians [Lor68] have suggested the model shown in Figure 2.1.
This model suggests that the response variable (y) is determined directly by two other variables: the so-called true score (t) and the random errors, represented by the random variable (e). The true score variable is the observed response variable corrected for random measurement error. For example, imagine that we measure the concept ‘the evaluation of the democratic character of the elections’ (i.e. democratic elections) by a direct request for an answer¹, the question ‘To what extent do you think that the elections in the UK are free and fair?’ answered on an 11-point scale, and that we assume there are only random mistakes in the answers. Some people will mistakenly pick a score of 6 while their true score is 5; others will choose a 3 by mistake while their true score is 4; and so on. Therefore, (y) represents the observed answer to the request, (e) represents the random errors (+1 and −1, respectively, for these two people), and (t) is equal to the observed answer to the 11-point question corrected for random errors. The effect of these two variables on the observed responses is indicated by arrows pointing in the direction of the variable that is influenced by them. Formally, we can write:
y = t + e    (equation 2.1)
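Equation 2.1 and the classical assumptions about the random errors can be illustrated with a small simulation. This is only a sketch; the distributions of the true scores and the errors are assumptions made for illustration, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical true opinions on an 11-point scale (illustrative assumption)
t = rng.integers(0, 11, size=n).astype(float)
# Random measurement errors: zero mean, generated independently of t
e = rng.normal(0.0, 1.0, size=n)

y = t + e  # equation 2.1

bias = y.mean() - t.mean()       # close to 0: no systematic difference
corr = np.corrcoef(t, e)[0, 1]   # close to 0: errors unrelated to the true score
print(round(bias, 2), round(corr, 2))
```

Because the errors have mean zero and are unrelated to the true score, the observed mean reproduces the true mean: random error adds noise but no bias.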
Because the errors are random, one can assume that these errors are not related to the opinions of the respondents represented in the true score (t) and that the mean of the errors over all persons is zero. If the latter holds true, then the mean of the observed variable (y) is equal to the mean of the true score (t). So there is no systematic difference or bias between the two variables. If we standardize the observed variable and the true score, we get:
ys = r·ts + e*    (equation 2.2)
where ys and ts are the standardized observed variable and the standardized true score, respectively, and e* is the error term divided by the standard deviation of the observed variable. The coefficient r is the standardized effect of the true score on the observed variable. If σy and σt represent the standard deviations of the observed variable and the true score, respectively, it can be shown that:
r = σt / σy    (equation 2.3)
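The step from equation 2.1 to equations 2.2 and 2.3 can be made explicit. A sketch of the derivation: dividing equation 2.1 through by the standard deviation of the observed variable gives

```latex
y_s \;=\; \frac{y}{\sigma_y}
    \;=\; \frac{\sigma_t}{\sigma_y}\cdot\frac{t}{\sigma_t} \;+\; \frac{e}{\sigma_y}
    \;=\; r\,t_s + e^{*},
\qquad \text{with } r = \frac{\sigma_t}{\sigma_y}
```

so the reliability coefficient appears simply as the ratio of the two standard deviations.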
This coefficient (r) is called the reliability coefficient, and its square (r²), called the reliability, is equal to the ratio of the variance of the true score to the variance of the observed variable. Taking into account that the error is unrelated to the true score, it can also be derived from equation 2.2 that:
var(ys) = 1 = r² + var(e*)    (equation 2.4)
So, the variance of the standardized observed variable is equal to the reliability (the explained variance) plus the error variance (the unexplained variance). Thus, the reliability can also be seen as the strength of the relationship between the observed variable and the true score. It also follows that:
r² = 1 − var(e*)    (equation 2.5)
Thus, reliability is equal to 1 minus the proportion of error variance: it is the complement of the error variance. If the random error variance increases, the reliability decreases, and vice versa. So, in order to say something about the amount of random error in a question, we can also look at the reliability of that question.
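Equations 2.3 to 2.5 can be checked numerically. A minimal sketch, where the distributions of the true score and the error are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed distributions, chosen only for illustration
t = rng.normal(5.0, 2.0, size=n)   # true scores
e = rng.normal(0.0, 1.0, size=n)   # random errors, independent of t
y = t + e                          # equation 2.1

r = t.std() / y.std()              # reliability coefficient, equation 2.3
reliability = r**2                 # = var(t) / var(y)

# equation 2.5: reliability is the complement of the standardized error variance
e_star_var = (e.std() / y.std())**2
print(round(reliability, 2), round(reliability + e_star_var, 2))
```

With true-score variance 4 and error variance 1, the reliability comes out near 0.8, and reliability and standardized error variance sum to 1, as equation 2.4 requires.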
However, as we saw in the previous chapter, this model is too simple, because systematic errors also occur as a result of the method used. The response on an 11-point scale will be systematically different from the response on a 4-point scale or any other measure. It can therefore be argued that the method used also has an effect on the responses. This is certainly the case, but Campbell and Fiske [Cam59] have also argued that respondents may react in a systematic way to the formulation of the questions: some people may use understatements and others overstatements. In general, this reaction of people to the method used has been denoted the ‘method effect’, although it would have been better to call it the ‘reaction to the method’. We will not change this practice, and therefore extend the model presented in Figure 2.1 with two extra variables that both influence the true score: the variable we are really interested in, for example ‘the democratic level of the elections’, denoted by (f), and the method (m), which could, for example, be the reaction of people to a question asked on an 11-point scale. Note that the variable of interest (f) is the opinion the person has in mind, while the true score (t) is the opinion derived when it has to be expressed on an 11-point scale. It should be clear that the researcher is not so much interested in the opinion as expressed on an 11-point scale or a 4-point scale, but in the opinion of the person (f), no matter how it is expressed. This idea is presented in Figure 2.2.
In this case, the relationships between the true score and the two new variables can again be expressed in a simple equation. Going directly to the standardized equation, we can say that:
ts = v·fs + m*    (equation 2.6)
This equation is similar to equation 2.2, but there is an important difference: the reaction to the method (m*) will not necessarily have a mean of zero. The method effect can therefore create a systematic difference between the true score and the variable of interest (f). This was observed in the previous chapter, where the reactions were much more positive for one method than for the other, even though we were measuring the same concept. The reaction of people to the method (m) will also vary across persons. In addition, we can again assume that this reaction is not related to people’s opinion (f).
In equation 2.6, it is assumed that the method factor is the only source of invalidity in the true score. We should mention that this assumption cannot be made in general: normally, we can expect that factors other than the method used will also create a difference between the variable of interest and the true score. However, in the specific application of the MTMM experiments, the basic question always remains the same; only the form of the question (called ‘the method’ by us) varies, and it is therefore acceptable to assume that the only difference between the variable of interest and the true score is the reaction of people to the method. We cannot generalize this assumption outside the context of the MTMM experiments.
The size of the effect of the variable of interest on the true score is called the validity coefficient (v) [Sar91], because it indicates how far the true score represents what one wishes to measure. The square of this coefficient is called the validity (v²). This is not a standard definition of validity: specific to this formulation is that the only variable that invalidates the response is the method factor (m). We will return to this issue below. It can again be derived from equation 2.6 that:
var(ts) = 1 = v² + var(m*)    (equation 2.7)
This means that the variance of the standardized true score is equal to the validity (the explained variance) plus the variance due to the method (the unexplained variance). As before, it follows directly that:
v² = 1 − var(m*)    (equation 2.8)
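Equations 2.7 and 2.8 can also be checked by simulation. A sketch under illustrative assumptions: the opinion is standardized, and the method reaction has a nonzero mean (the source of systematic bias) and its own variance:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

f = rng.normal(0.0, 1.0, size=n)   # variable of interest (standardized)
m = rng.normal(0.5, 0.5, size=n)   # method reaction: nonzero mean allowed

t = f + m                          # true score, cf. equation 2.6

v = np.corrcoef(t, f)[0, 1]        # validity coefficient: effect of f on ts
validity = v**2

m_star_var = (m.std() / t.std())**2
print(round(validity + m_star_var, 2))   # equation 2.7: sums to 1
```

With these assumed variances the validity comes out near 0.8 and the standardized method variance near 0.2; their sum is 1, as equation 2.7 requires, even though the method reaction also shifts the mean of the true score.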
So, we see that the validity is equal to 1 minus the method variance. The validity will therefore be smaller if the reactions to the method vary more, and vice versa. It should be mentioned that there are many different definitions of validity [Bol89], also within the tradition of MTMM experiments. Andrews [And84] called the relationship between the variable of interest and the observed variable the construct validity. However, Saris and Andrews [Sar91] suggested that it was better to use the definition given above, because it allows reliability and validity to be studied separately, which is not possible with the original definition of [And84], as we will show below. Finally, substituting the right-hand side of equation 2.6 for (ts) in equation 2.2, we get:
ys = r(v·fs + m*) + e*    (equation 2.9)

ys = r·v·fs + r·m* + e*    (equation 2.10)
This equation shows that the effect of the variable of interest (fs) on the observed variable (ys) is equal to the product of the reliability coefficient and the validity coefficient. This product is denoted by a new coefficient (q), called the quality coefficient; its square is the quality of a question (q²), which is equal to r²v². [And84] called q² the construct validity. We call this coefficient the quality of the question and have split it into the two parts, reliability and validity, which can be evaluated separately from each other, because some questions can be good on one criterion and bad on the other.
As before, it will be clear that if the error variance and/or the method variance increases, the quality of the question decreases, and vice versa. This indicates the relationship between the quality of a question and the random and systematic errors in it.
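Putting the two parts of the model together, the decomposition in equation 2.10 can be verified numerically. A sketch, again with illustrative distributional assumptions for all three components:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

f = rng.normal(0.0, 1.0, size=n)   # variable of interest
m = rng.normal(0.0, 0.5, size=n)   # reaction to the method
e = rng.normal(0.0, 1.0, size=n)   # random error

t = f + m                          # cf. equation 2.6
y = t + e                          # equation 2.1

r = t.std() / y.std()              # reliability coefficient
v = np.corrcoef(t, f)[0, 1]        # validity coefficient
q = r * v                          # quality coefficient

# equation 2.10: the total effect of f on y equals the product r * v
print(round(q, 2), round(np.corrcoef(y, f)[0, 1], 2))
```

The simulated correlation between the variable of interest and the observed variable matches the product of the reliability and validity coefficients, which is exactly what equation 2.10 asserts.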
To continue our example: if we are not so much interested in the variable ‘democratic elections’ as in the more general concept ‘quality of democracy’, and this concept is measured not by a direct request but by the indicator ‘democratic elections’, then there is a difference between the variable we would like to measure and the variable actually measured. While we certainly expect ‘democratic elections’ to be a good indicator of ‘quality of democracy’, the relationship will not be perfect, because ‘democratic elections’ is just one specific aspect of democracy. This can be modelled as shown in Figure 2.3.
This situation occurs when, for a complex concept such as the quality of democracy, we use several indicators, each with its own unique characteristics. In this case, the quality criteria derived for the question about democratic elections are certainly not a good criterion for the quality of the response to this question as a measure of the concept ‘quality of democracy’, because more questions are needed to measure this concept. We then also need to take into account the relationship between the variable democratic elections (f2) and quality of democracy (f1). We will return to this situation in Chapter 7. Until then, we assume that the concept is measured by a question that directly measures the concept of interest. In that case, the relationship between f1 and f2 is perfect and we are back in Figure 2.2.
Exercises
1. Specify a measurement model, as in Figure 2.2, to indicate where the errors come from for the question ‘equality by law’ (E25), introduced in Table 0.2.
2. Imagine that we want to measure ‘evaluation of democracy’ using three indicators: free and fair elections (E17), freedom to criticise (E20) and equality by law (E25). Specify a measurement model for this complex concept and indicate where the errors come from in this case.
1. A request for an answer, whatever form the request may take, presents a text that implies that the respondent is expected to give an answer. A question is a specific form of a request for an answer. Other forms are ‘the imperative’ and ‘the assertion’. For more details, see [Sar14].
- [And84] Andrews, F. M. (1984). Construct validity and error components of survey measures: A structural modelling approach. Public Opinion Quarterly, 48, 409-442.
- [Bol89] Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
- [Cam59] Campbell, D. T. and Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
- [Lor68] Lord, F. and Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
- [Sar91] Saris, W. E. and Andrews, F. M. (1991). Evaluation of measurement instruments using a structural modeling approach. In P. P. Biemer et al. (eds.), Measurement errors in surveys (pp. 575-599). New York: Wiley.