# Sampling with Probability Proportional to Size

When information on a size measure G exists for every element in the population and this size measure carries valuable information about the ‘importance’ of including element i in the sample, we can use this information in the sample design. Sample designs that make explicit use of such size measures are called **probability proportional to size** (pps) sample designs. With G_{i} denoting the size measure of element i and N the number of elements in the population, the inclusion probability of element i in a pps sample of size n is

π_{i} = n * G_{i} / Σ_{l=1}^{N} G_{l}

Sample designs with pps are often used in business surveys when it is important to include the largest firms in an industry in the sample, since they contribute a large share of the industry’s production of goods or services. However, pps can also be combined with cluster sample designs or general multi-stage sample designs, which are introduced in the next section.
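
To make the role of the size measure concrete, the following minimal sketch computes pps inclusion probabilities under the usual definition, in which the inclusion probability of element i is n times its size measure divided by the total of the size measure over the population. The firm sizes and the function name `pps_inclusion_probabilities` are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def pps_inclusion_probabilities(sizes, n):
    """Return pi_i = n * G_i / sum(G) for every element (usual pps definition)."""
    total = sum(sizes)
    return [Fraction(n * g, total) for g in sizes]

# Hypothetical size measure G_i, e.g. the number of employees of six firms.
firm_sizes = [500, 200, 150, 100, 30, 20]
probs = pps_inclusion_probabilities(firm_sizes, n=2)

for g, p in zip(firm_sizes, probs):
    # A value of 1 means the element enters every possible sample.
    note = "  (selected with certainty)" if p >= 1 else ""
    print(f"G_i = {g:4d}  pi_i = {float(p):.3f}{note}")
```

In this illustration the largest firm accounts for half of the total size, so with n=2 it is included with certainty. If the formula yields a value above 1 for some element, that element would typically be included with certainty and the remaining probabilities recomputed; this sketch does not cover that case.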

## Cluster Sampling and Multi-Stage Sampling

A multi-stage or cluster sample is drawn either because no population-wide sampling frame of ultimate sampling units exists or because fieldwork management wants a geographical distribution of interviewers that minimises travel between and within geographical clusters. Clustering can, however, lead to a severe loss of precision in estimators, as we will see later.

A **cluster sample design** (clu) is any sample design in which ultimate sample units are not selected directly but are taken from a sample of superordinate non-overlapping clusters. A **cluster**, or primary sampling unit (PSU), denotes a subset of population elements that belong to this subset due to some specific well-defined attributes (e.g. a person's address).

Each ultimate sampling element belongs to exactly one PSU and each PSU contains one or more ultimate sampling units. A clustered population consists of M PSUs, which are of size N_{i}, i=1, ... , M. We shall assume that a complete frame of PSUs exists from which a sample of m PSUs is drawn. The set of possible samples of m of the M clusters is denoted by S and a specific sample of m PSUs is denoted by s. The cluster sample design is defined as p(s). The inclusion probability of each of the M clusters is denoted by π_{i}; its value depends on the characteristics of the sample design. After s has been obtained, y is surveyed for each of the n = Σ_{i∈s} N_{i} ultimate sample elements (ignoring contact and non-response issues for the time being). This sampling scheme is referred to as **cluster sampling**, or single-stage (one-stage) sampling. It is a special case of a wider class of so-called multi-stage sample designs.
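
As a minimal sketch of one-stage cluster sampling, the following code selects m clusters and takes every element of each selected cluster, so that the sample size is the sum of the sizes of the selected clusters. The text leaves the design p(s) general; selecting the clusters by simple random sampling without replacement is an assumption made here, and the cluster sizes and the function name `draw_cluster_sample` are hypothetical.

```python
import random

def draw_cluster_sample(cluster_sizes, m, seed=0):
    """Select m of the M clusters by simple random sampling without
    replacement (an assumption) and take every element of each one."""
    rng = random.Random(seed)
    M = len(cluster_sizes)
    selected = sorted(rng.sample(range(M), m))
    # Each sampled element is identified by (cluster index i, element index j).
    elements = [(i, j) for i in selected for j in range(cluster_sizes[i])]
    return selected, elements

cluster_sizes = [20, 40, 20, 10, 10, 15, 25, 20, 15, 25]  # hypothetical N_i
selected, elements = draw_cluster_sample(cluster_sizes, m=3)
n = len(elements)

print("selected PSUs:", selected)
print("sample size n:", n)
```

Because whole clusters are taken, n is not fixed in advance when the clusters differ in size: it varies with the clusters that happen to be selected.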

A **multi-stage sample design** (mul) is any sample design in which ultimate sample elements are selected through subsequent sampling in two or more superordinate stages. In two-stage sampling (mul2), for example, ultimate sampling units are nested directly within superordinate clusters. Under mul2, m of M clusters are selected at the first stage. The set of possible samples of m primary sampling units is denoted by S^{(1)}. A specific sample of m primary sampling units is denoted by s^{(1)}; inclusion probabilities for each of the M PSUs are denoted by π_{i}, i=1, ... ,M. At the second stage, n_{i} **secondary sampling units (SSUs)** of the ith PSU of size N_{i} are selected within each selected PSU. Thus,
Elements of the ith cluster are denoted by 1, ... ,j, ... ,n_{i}. The set of possible samples of n_{i} from N_{i} SSUs in the ith PSU is denoted by S_{i}^{(2)} and a specific sample by s_{i}^{(2)}.

The union of all S_{i}^{(2)} is S and the union of all s_{i}^{(2)} is s.

The inclusion probability of the jth element, given that the ith PSU has been selected, is denoted by π_{j|i}. The magnitude of π_{j|i} depends on the sample design that is used to select the elements within the PSU. We need a convenient notation system for the ultimate sample elements selected into the sample. Say there are n = Σ_{i=1}^{m} n_{i} elements selected in total. We will then refer to the jth element of the ith PSU as the kth element, k=1, ... , n. Table 2.3 illustrates this notational scheme.

Table 2.3. Notation scheme

Following this notation, the overall inclusion probability of the jth element in the ith PSU is denoted by π_{ij} or simply by π_{k} and is the product of the inclusion probabilities in the two stages, which is expressed as

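π_{ij} = π_{k} = π_{i} * π_{j|i}

and the design weight, being the inverse of the inclusion probability, is expressed as

w_{ij} = w_{k} = 1 / π_{k} = 1 / (π_{i} * π_{j|i})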

An analogous notation is generally used for multi-stage sampling with three or more stages.
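
The following sketch illustrates the notation with a small hypothetical two-stage sample: each sampled element (i, j) receives a running index k, its overall inclusion probability is the product of the two stage probabilities, and its design weight is taken as the inverse of that product (the usual definition of a design weight, assumed here). All probabilities are invented for illustration.

```python
from fractions import Fraction

# Hypothetical two-stage sample: two selected PSUs with first-stage
# inclusion probabilities pi_i and, within each PSU, n_i sampled elements
# with conditional inclusion probabilities pi_{j|i}.
sample = {
    1: {"pi_i": Fraction(1, 4), "pi_j_given_i": [Fraction(1, 2)] * 2},
    2: {"pi_i": Fraction(1, 2), "pi_j_given_i": [Fraction(1, 5)] * 3},
}

rows = []
k = 0
for i, psu in sample.items():
    for j, pi_j_given_i in enumerate(psu["pi_j_given_i"], start=1):
        k += 1  # running index over all sampled elements
        pi_k = psu["pi_i"] * pi_j_given_i  # product of both stages
        rows.append({"k": k, "i": i, "j": j, "pi_k": pi_k, "weight": 1 / pi_k})

for row in rows:
    print(row)
```

Here n = 2 + 3 = 5, so k runs from 1 to 5; elements of the first PSU have π_{k} = 1/8 and weight 8, elements of the second have π_{k} = 1/10 and weight 10.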

Figure 2.3 shows examples of cluster sampling with M=16, N_{i}=16 and m=5 and two-stage sampling with m=5 and n_{i}=3.

Figure 2.3. Cluster and two-stage sampling scheme
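
The two schemes can also be sketched in code with the parameters given for Figure 2.3 (M=16, N_{i}=16, m=5, n_{i}=3). The selection method behind the figure is not stated; simple random sampling without replacement at both stages and the seed are assumptions made for this illustration.

```python
import random

M, N_i, m, n_i = 16, 16, 5, 3
rng = random.Random(42)  # arbitrary seed, for reproducibility only

# First stage (both schemes): m of the M PSUs, here by srswor (assumption).
selected_psus = sorted(rng.sample(range(1, M + 1), m))

# Cluster sampling: every element of each selected PSU is surveyed.
cluster_sample = [(i, j) for i in selected_psus for j in range(1, N_i + 1)]

# Two-stage sampling: n_i elements are drawn by srswor within each selected PSU.
two_stage_sample = [
    (i, j)
    for i in selected_psus
    for j in sorted(rng.sample(range(1, N_i + 1), n_i))
]

print(len(cluster_sample))    # m * N_i = 80 elements
print(len(two_stage_sample))  # m * n_i = 15 elements
```

With the same five PSUs, cluster sampling yields 80 elements and two-stage sampling 15, which is why two-stage designs are used when surveying entire clusters would be too costly.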

It is a commonly held belief that one of the most striking advantages of cluster sampling for social surveys is that it guarantees reduced travel costs. Interviewers can be sent into the field within closely defined geographical boundaries. A primary sampling unit is often defined as a municipality or a city district, making travel from address to address relatively inexpensive. We will later see that this advantage may hold if estimates based on data from a geographically clustered sample design are computed naively, but that it can become negligible once the effects of the sample design are incorporated in the estimation process.

A further explanation for the widespread use of cluster sample designs is the unavailability of alternative sampling frames for ultimate sample elements (e.g. population registers). In fact, many European countries either lack such a list or do not allow researchers to draw a sample from it. This is also reflected in the sample designs used by ESS countries, of which only about half are not multi-stage designs [Ess05b] [Häd07]. In the ESS, the guidelines for the selection of a sample design follow the recommendation of Kish [1]: ‘Sample designs may be chosen flexibly and there is no need for similarity of sample designs. Flexibility of choice is particularly advisable for multinational comparisons, because the foundations for sampling differ between countries. All this flexibility assumes probability selection methods: known probabilities of selection for all population elements’.

### Example 2

Multi-stage sample designs are often combined with pps sampling as described above. A very commonly used sample design in the ESS is the following: at the first stage, m of M clusters are drawn with probability proportional to their size. Then, at the second stage, a fixed number c of persons is selected within each sampled cluster using srswor.

This particular sample design has the very desirable property that the overall inclusion probabilities are constant. This can be seen fairly easily: the inclusion probability for the ith PSU sampled by pps is π_{i} = N_{i} / N (strictly, this is the probability for a single draw; drawing m PSUs multiplies it by the constant m, which does not affect the argument). Once it has been sampled, the inclusion probability of the jth element in the ith PSU is π_{j|i} = c / N_{i}. As explained above, denote the jth element of the ith PSU by k. The overall inclusion probability of the jth element in the ith cluster is simply the product of π_{i} and π_{j|i}, which is

π_{ij} = π_{k} = N_{i}/N * c/N_{i} = c/N

and hence constant for all elements.
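
The constancy of the overall inclusion probability can be checked by simulation. The sketch below assumes that a single PSU is drawn at the first stage (m=1), so that its selection probability equals N_{i}/N exactly, followed by srswor of c elements within it. The cluster sizes, the number of replications, the seed and the function name are all hypothetical.

```python
import random
from collections import Counter

def simulate_inclusion_frequencies(cluster_sizes, c, reps, seed=1):
    """Empirical inclusion frequency of every population element (i, j)."""
    rng = random.Random(seed)
    psu_ids = list(range(len(cluster_sizes)))
    counts = Counter()
    for _ in range(reps):
        # First stage: draw ONE PSU with probability N_i / N (assumption: m = 1).
        i = rng.choices(psu_ids, weights=cluster_sizes, k=1)[0]
        # Second stage: c elements by srswor within the selected PSU.
        for j in rng.sample(range(cluster_sizes[i]), c):
            counts[(i, j)] += 1
    # Elements that were never drawn get frequency 0 (Counter default).
    return {
        (i, j): counts[(i, j)] / reps
        for i in psu_ids
        for j in range(cluster_sizes[i])
    }

cluster_sizes = [20, 40, 10, 30]  # hypothetical N_i, so N = 100
c = 5
freq = simulate_inclusion_frequencies(cluster_sizes, c, reps=20000)

print("theoretical c/N:", c / sum(cluster_sizes))
print("smallest simulated frequency:", min(freq.values()))
print("largest simulated frequency:", max(freq.values()))
```

Up to simulation noise, elements of small and large clusters are included equally often: a large cluster is selected more often, but each of its elements is then less likely to be among the c chosen.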

### Exercise 5

Assume a population consists of M=10 PSUs of the following sizes:

i | N_{i} | π_{i} | π_{j\|i} | π_{ij} |
---|---|---|---|---|
1 | 20 | | | |
2 | 40 | | | |
3 | 20 | | | |
4 | 10 | | | |
5 | 10 | | | |
6 | 15 | | | |
7 | 25 | | | |
8 | 20 | | | |
9 | 15 | | | |
10 | 25 | | | |

We sample m=5 of the 10 PSUs by pps. In each sampled PSU, we select c=5 elements randomly.

- Calculate the first stage inclusion probabilities π_{i}.
- Calculate the second stage inclusion probabilities π_{j|i}.
- Calculate the overall inclusion probabilities π_{ij}.

- Due to the pps selection in the first stage, inclusion probabilities are defined as N_{i}/N. Thus the inclusion probability of the first PSU is 20/200=0.1.
- Randomly selecting five secondary sampling units within each selected PSU means that the second stage inclusion probabilities are 5/N_{i}. For example, the inclusion probabilities of the elements of the first PSU are all 5/20=0.25.
- The overall inclusion probability of an element is simply the product of its inclusion probabilities in all stages. In our example, the overall inclusion probability of all elements belonging to the first PSU is N_{i}/N * 5/N_{i} = 5/N = 5/200 = 0.025.

i | N_{i} | π_{i} = N_{i}/N | π_{j\|i} = c/N_{i} | π_{ij} = c/N |
---|---|---|---|---|
1 | 20 | 0.100 | 0.250 | 0.025 |
2 | 40 | 0.200 | 0.125 | 0.025 |
3 | 20 | 0.100 | 0.250 | 0.025 |
4 | 10 | 0.050 | 0.500 | 0.025 |
5 | 10 | 0.050 | 0.500 | 0.025 |
6 | 15 | 0.075 | 0.333 | 0.025 |
7 | 25 | 0.125 | 0.200 | 0.025 |
8 | 20 | 0.100 | 0.250 | 0.025 |
9 | 15 | 0.075 | 0.333 | 0.025 |
10 | 25 | 0.125 | 0.200 | 0.025 |
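
The solution table can be reproduced with a few lines of code. This is a minimal sketch that follows the formulas used in the solution (π_{i} = N_{i}/N, π_{j|i} = c/N_{i}); the variable names are chosen here for illustration.

```python
sizes = [20, 40, 20, 10, 10, 15, 25, 20, 15, 25]  # N_i from the exercise
c = 5            # elements selected per sampled PSU
N = sum(sizes)   # total number of elements in the population

table = []
for i, N_i in enumerate(sizes, start=1):
    pi_i = N_i / N                # first stage, as defined in the solution
    pi_j_given_i = c / N_i        # second stage
    pi_ij = pi_i * pi_j_given_i   # overall inclusion probability
    table.append((i, N_i, pi_i, pi_j_given_i, pi_ij))
    print(f"{i:2d} | {N_i:2d} | {pi_i:.3f} | {pi_j_given_i:.3f} | {pi_ij:.3f}")
```

Every row ends in the same overall inclusion probability, c/N = 5/200, regardless of the PSU size.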

#### Footnotes

- [1] Kish 1994:173.