# Population

When we talk about sample designs, it is important to keep in mind that any sample is drawn from a (possibly much) larger set of elements called the **population** or **universe**. The upper case letter U denotes a universe of size N. In the sampling literature, upper case letters usually denote quantities of the universe and lower case letters quantities of the sample. A universe U has exactly N elements (persons, businesses, countries, etc.): not more, not fewer, and not approximately N. A universe U of size N = 10 people thus contains exactly 10 people.

To be able to identify the elements of U, we enumerate them U_{1}, U_{2},..., U_{i},..., U_{N}, i = 1,..., N. For convenience of notation, we generally write U_{i} and only specify which i we mean. You will soon become familiar with this notation.

Every person in our little universe of size N=10 thus has a unique number i=1,..., 10. The first person is referred to as U_{1}, the second one as U_{2}, the third as U_{3} and so on until the tenth person, whom we refer to as U_{10}. Please note that this enumeration does not imply any order. We could just as well refer to U_{3} as U_{1} or U_{7} as U_{5}. However, once the enumeration is fixed, we have to stick to it, of course.

A **study variable**, denoted Y, is associated with each element of U. Y therefore has the same length as U. In our example, Y has length 10. The value of the i-th element of the study variable is denoted by Y_{i}. The Y-value of the first person in our universe is denoted by Y_{1}, the Y-value of the fourth person by Y_{4}, and so on. If the study variable in our universe were **age**, Y could look like this:

Y = (Y_{1}, Y_{2},..., Y_{10})’ = ( 37, 37, 63, 31, 39, 45, 59, 22, 53, 18)’.

This means that the first and second persons are both 37 years old, the third person is 63 years old and the tenth person is 18 years old. Of course, we usually do not know the values of a study variable for all elements of the population. That is precisely why we select a sample. We pretend to know them here just for illustrative purposes.
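The notation above can be mirrored in a few lines of Python. This is only an illustrative sketch: the names `N`, `U`, `ages` and `Y` are chosen here for convenience, and a dictionary keyed by the labels 1, ..., N is used (an assumption of this sketch) so that `Y[i]` corresponds directly to Y_{i} without shifting to zero-based indices.

```python
# Toy universe of N = 10 people, labelled 1..N as in the text.
N = 10
U = list(range(1, N + 1))

# Study variable "age": Y[i] holds the value Y_i belonging to element U_i.
ages = [37, 37, 63, 31, 39, 45, 59, 22, 53, 18]
Y = dict(zip(U, ages))

# Y has exactly one value per element of U.
assert len(Y) == N

# Ages of the first, fourth and tenth person.
print(Y[1], Y[4], Y[10])
```

Relabelling the elements would only change the keys of `Y`, not the set of people or their ages, which reflects the point that the enumeration does not imply any order.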

You may have asked yourself what the apostrophe after the closing bracket means. Usually, study variables are defined as so-called **column vectors**. A column vector is a mathematical object comparable to a column of a table in a data set. Consequently, we would have to write Y as

$$
Y = \begin{pmatrix} 37 \\ 37 \\ 63 \\ 31 \\ 39 \\ 45 \\ 59 \\ 22 \\ 53 \\ 18 \end{pmatrix},
$$

which is a bit cumbersome. Thus, to save space, we write Y as in the first equation (as a row vector) and use the apostrophe to indicate that Y is actually a column vector and must be treated as such.

In the ESS, the size of the population from which we draw samples differs from country to country. The basic definition of the elements that belong to the population is the same in all countries, however. The ESS includes "all persons aged 15 and over (no upper age limit) resident within private households in each country, regardless of their nationality, citizenship or language" [Ess09]. Any sample drawn from a population that meets this definition will therefore contain only elements with these characteristics. For example, persons younger than 15 do not fall within the definition and are thus excluded from the population, and hence from the sample. That is why we refer to the population we target with our sample as the **target population**. Should the sampling process, for whatever reason, systematically exclude elements of the target population, they will have no chance of being included in the sample. We are therefore unable to make any statements about these elements. The population about which we can reasonably make inferences is thereby reduced. We call this reduced population the **inference population**.

In order to actually draw a sample from a population, we need an accessible list of all the elements in the population, a so-called **sampling frame**. Today, sampling frames are usually digital lists that can be processed by computers. Ideally, the sampling frame contains exactly the elements of the target population. However, due to practical problems, the sampling frame can contain elements that are not part of the target population (e.g. people younger than 15 years). If this is the case, we speak of **over-coverage**. If the sampling frame does not contain all elements of the target population (e.g. it lists only people older than 18 years), we speak of **under-coverage**. Over- and under-coverage can occur at the same time. Figure 2.1 illustrates the relationship between over- and under-coverage and the inference and target populations.

Figure 2.1. Interdependence of coverage and inference and target population
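The relationship between these concepts can be expressed with simple set operations. The following sketch uses entirely hypothetical person IDs, ages and frame contents (none of them are ESS data), and it assumes that gaps in the sampling frame are the only reason why elements of the target population are excluded.

```python
# Hypothetical person IDs mapped to ages; these values are made up for illustration.
people = {1: 14, 2: 16, 3: 17, 4: 25, 5: 40, 6: 71}

# Target population: everyone aged 15 and over.
target = {pid for pid, age in people.items() if age >= 15}

# Hypothetical sampling frame: it wrongly lists person 1 (aged 14)
# and misses persons 2 and 3 (aged 16 and 17).
frame = {1, 4, 5, 6}

over_coverage = frame - target          # on the frame, but not in the target population
under_coverage = target - frame         # in the target population, but not on the frame
inference_population = target & frame   # elements that can actually be sampled

print(sorted(over_coverage))            # [1]
print(sorted(under_coverage))           # [2, 3]
print(sorted(inference_population))     # [4, 5, 6]
```

In this toy setting, the better the frame matches the target population, the smaller the two difference sets become and the closer the inference population gets to the target population.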

We aim for a high overlap between the target and inference populations. In real-world sampling practice, the size of the overlap depends on the quality of the sampling frame. In the ESS, sampling frames are often electoral lists. However, these lists only cover people in a country who are eligible to vote: in most European countries, these are citizens aged 18 and older. If only electoral lists were used as a sampling frame, we would have strong under-coverage of people aged 15 to 17. Therefore, we must either resort to another sampling frame or supplement the electoral lists with an additional frame that covers the missing elements.

#### References

- [Ess09] ESS (2009). *European Social Survey, Round 5: Specification for participating countries*. London: Centre for Comparative Surveys. http://www.europeansocialsurvey.org/index.php?option=com_docman&task=doc_download&gid=602&itemid=80