The estimation of the quality of survey questions
There are many different procedures for estimating the quality of questions and of measures for complex concepts. The best known is perhaps the test-retest design [Lor68] for estimating the reliability of questions. An adjustment of this approach was the quasi-simplex model [Hei69], [Wil70], used by [Alw91] and [Alw07]. The multitrait-multimethod (MTMM) design was suggested in order to take into account the effects of the method used [Cam59]; it was further developed for survey questions by Andrews [And84] and others. For concepts with multiple indicators, different procedures have been developed based on latent variable models such as factor analysis [Law71], [Har76] and latent class analysis [Hag88], [Ver03], [Bie11]. Furthermore, scaling methods have been developed, such as the Thurstone and Likert scales [Tor58], the Guttman and Mokken scales [Mok71], the unfolding scale [Sch97], the Rasch scale [Ras60] and Item Response Theory [Ham91]. For the advantages and disadvantages of these different procedures, we refer to this literature.
All these procedures require at least two questions per concept to estimate its quality. This means that the number of questions has to be at least twice the number of concepts one wishes to take into account in the analysis. As a result, these procedures lead to rather costly and time-consuming research involving rather complex analyses. Moreover, they provide quality estimates only for the specific question formulations used in a specific questionnaire and context, so generalization is not easily possible.
This means that a great deal of research has to be done before the final data collection in order to correct for measurement errors in all variables in the study. This is so much work that it is seldom done. The question, therefore, is whether there is a less time-consuming and expensive procedure for estimating the quality of survey questions and of composite scores for concepts with multiple indicators.
From the very start of the European Social Survey (ESS), Saris has emphasized that the measures will contain errors and that, without correction for these errors, the results will be questionable and incomparable across countries. Therefore, since the beginning of 2002, each survey in the ESS has contained four to six MTMM experiments to evaluate the quality of the questions.1 These MTMM experiments were carried out in most countries and all rounds. An example of such an experiment was presented in the first chapter.
In the standard MTMM experiment suggested by [Cam59], the respondent has to provide responses to three different questions (i.e. traits), each measured using three different methods [And84]. Because people thus have to answer essentially the same question three times, memory effects can be expected. In order to cope with these memory effects, [Sar04] suggested randomly splitting the sample into subgroups and asking the same question only twice in each group. This design was named the Split-Ballot Multitrait-Multimethod (SB-MTMM) design. [Sar04] also showed that this design still enables estimation of the reliability, the validity (the complement of the method effect) and the quality of each question.2 In recent years, all experiments in the ESS rounds have been analysed using the SB-MTMM procedure. After the first three ESS rounds, more than 250 SB-MTMM experiments had been conducted in more than 20 countries (languages), covering approximately 2,700 questions. Combining these results with those of previous and simultaneous MTMM analyses carried out by other research agencies, the reliability, validity and quality of 3,726 questions are now known. However, this information is not enough because, at the same time, more than 62,000 questions were asked in the ESS about values, norms, policy preferences, feelings etc., the quality of which was not analysed. Thus, a different approach was required.
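As a sketch, the random split just described can be illustrated as follows. The three-group pairing of method variants and all names in this snippet are illustrative assumptions, not the actual ESS implementation:

```python
import random

# Illustrative two-of-three split-ballot allocation (assumed design, not
# the ESS's): each trait (question) exists in three method variants
# (M1, M2, M3), and every respondent is assigned to one group so that
# they answer each trait with only two methods, never all three.

METHOD_PAIRS = {1: ("M1", "M2"), 2: ("M2", "M3"), 3: ("M1", "M3")}

def assign_groups(respondent_ids, seed=0):
    """Randomly split the sample into the three split-ballot groups."""
    rng = random.Random(seed)
    return {rid: METHOD_PAIRS[rng.choice(list(METHOD_PAIRS))]
            for rid in respondent_ids}
```

With three groups each seeing two of the three method variants, every pair of methods is still observed in some subgroup, which is what allows the reliability and validity parameters to be estimated, as [Sar04] shows, even though no respondent answers the same question three times.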
- 1. The ESS aims to achieve high methodological standards, striving for optimal comparability of the data collected across all countries. A key part of this consists of maximizing the reliability and validity of the final questionnaire across the participating countries and making the quality (reliability times validity) as comparable as possible across countries.
- 2. Quality is defined as the product of the reliability and the validity.
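Written out, this definition amounts to a simple product. The symbols below (r for the reliability coefficient, v for the validity coefficient, q for quality) are shorthand chosen here for illustration, not notation fixed by the text:

```latex
% quality as the product of reliability and validity
q = r \times v, \qquad \text{e.g. } r = 0.9,\; v = 0.8 \;\Rightarrow\; q = 0.9 \times 0.8 = 0.72
```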
- [Alw07] Alwin, D. F. (2007). Margins of error: A study of reliability in survey measurement. Hoboken: Wiley.
- [Alw91] Alwin, D. F. and Krosnick, J. A. (1991). The reliability of survey attitude measurement: The influence of question and respondent attributes. Sociological Methods and Research, 20, 139-181.
- [And84] Andrews, F. M. (1984). Construct validity and error components of survey measures: a structural modelling approach. Public Opinion Quarterly, 48, 409-442.
- [Bie11] Biemer, P. P. (2011). Latent class analysis of survey error. Hoboken: Wiley.
- [Cam59] Campbell, D. T. and Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
- [Hag88] Hagenaars, J. (1988). Latent structure models with direct effects between indicators: Local dependence models. Sociological Methods and Research, 379-405.
- [Ham91] Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
- [Har76] Harman, H. H. (1976). Modern factor analysis. Chicago: University of Chicago Press.
- [Hei69] Heise, D. R. (1969). Separating reliability and stability in test-retest correlation. American Sociological Review, 34, 93-101.
- [Law71] Lawley, D. N. and Maxwell, A. E. (1971). Factor analysis as a statistical method. London: Butterworth.
- [Lor68] Lord, F. and Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
- [Mok71] Mokken, R. J. (1971). A theory and procedure of scale analysis with applications in political research. The Hague: Mouton.
- [Ras60] Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
- [Sar04] Saris, W. E., Satorra, A. and Coenders, G. (2004). A new approach for evaluating the quality of measurement instruments: Split Ballot MTMM design. Sociological Methodology, 34, 331-347.
- [Sch97] Van Schuur, W. H. (1997). Nonparametric IRT models for dominance and proximity data. In: Wilson, M., Engelhard, G., Jr. and Draney, K. (Eds.), Objective measurement: Theory into practice, Volume 4. Greenwich, CT/London: Ablex Publishing Corporation, 313-331.
- [Tor58] Torgerson, W. S. (1958). Theory and methods of scaling. London: Wiley.
- [Ver03] Vermunt, J. K. (2003). Multilevel latent class models. Sociological Methodology, 33, 213-239.
- [Wil70] Wiley, D. E. and Wiley, J. A. (1970). The estimation of measurement error in panel data. American Sociological Review, 35, 112-117.