# Chapter 3: From the Sample to the Population: Estimation and Design Weighting

## Population and Sample Quantities

Once the sample is selected and (ideally) all respondents have answered the questions in the survey, we are interested in making statements about the data. However, we are not only interested in the distribution of study variables in the sample but even more so in the distribution of certain parameters in the population. Generally, a **population parameter** is denoted by θ and is a function of the values of the study variable Y. The population total of Y, for example, can be expressed as

$$Y = \sum_{i=1}^{N} Y_i \tag{3}$$

and the population mean can be expressed as

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i \tag{4}$$

Following the notation in Lohr 1999 [1], the variation of the population values around the mean is

$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(Y_i - \bar{Y}\right)^2 \tag{5}$$

These parameters are usually unknown and we have to use sample data to estimate them. An estimator for the population parameter θ is denoted by $\hat{\theta}$. It is a function of the observed values of the study variable in the sample. An **estimate** is the numerical value that an estimator yields.

We have already seen that the same estimator can produce different values when calculated using data from different samples. Each of the 210 possible samples of size n = 4 that can be drawn without replacement from a population of size N = 10 can yield a different estimate of the population mean $\bar{Y}$. If the estimator for $\bar{Y}$ is unbiased, most of the estimates will be scattered closely around $\bar{Y}$. Furthermore, if we take the average of the means of all possible samples of size n, we get $\bar{Y}$. However, a few estimates will be much smaller and a few will be much larger than $\bar{Y}$. This (theoretical) distribution is sufficiently described by two parameters: its expected value and its variance. These two parameters translate into the quality criteria of an estimator's bias and precision in the following way:
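The enumeration described above can be carried out directly for a small population. The following sketch uses ten made-up population values (an assumption for illustration only, not data from the text), lists every sample of size n = 4 drawn without replacement, and compares the average of all sample means with the population mean.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Hypothetical population values Y_1, ..., Y_10 (illustrative only).
population = [3, 7, 8, 12, 15, 18, 21, 25, 30, 41]
N, n = len(population), 4

# Every possible sample of size n under sampling without replacement.
samples = list(combinations(population, n))

# Sample mean of each sample, kept exact with Fraction.
sample_means = [Fraction(sum(s), n) for s in samples]

population_mean = Fraction(sum(population), N)
mean_of_means = sum(sample_means) / len(sample_means)

print("number of samples:", len(samples), "binomial coefficient:", comb(N, n))
print("population mean:", population_mean)
print("average of all sample means:", mean_of_means)
print("smallest and largest sample mean:", min(sample_means), max(sample_means))
```

Exact fractions are used so that the comparison between the two means is not blurred by floating-point rounding.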

**Bias** refers to the extent to which the expected value of $\hat{\theta}$ over- or underestimates the population parameter θ:

$$\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

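**Precision** is measured by the variance of $\hat{\theta}$ and indicates how closely the estimates scatter around their expected value (which equals θ if the estimator is unbiased) over all possible samples:

$$\text{Var}(\hat{\theta}) = E\left[\left(\hat{\theta} - E(\hat{\theta})\right)^2\right]$$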

Figure 3.1 graphically illustrates the concept of bias and precision.

Figure 3.1. Bias and precision of an estimator

Unbiased estimators should be preferred over biased estimators and precise estimators over imprecise ones. However, the magnitude of bias and precision always depends on both the sample design and the estimator under consideration.
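To make the dependence on the estimator concrete, the sketch below computes bias and variance over all possible samples for two estimators of the population mean. The population values are hypothetical, and the sample median is included only as a contrast. It is not an estimator proposed in the text.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical, right-skewed population (illustrative only).
population = [3, 7, 8, 12, 15, 18, 21, 25, 30, 41]
N, n = len(population), 4
theta = Fraction(sum(population), N)  # target parameter: the population mean

def sample_mean(s):
    return Fraction(sum(s), len(s))

def sample_median(s):
    ordered = sorted(s)
    return Fraction(ordered[1] + ordered[2], 2)  # middle two of n = 4 values

def bias_and_variance(estimator):
    """Bias and variance of an estimator over all equally likely samples."""
    estimates = [estimator(s) for s in combinations(population, n)]
    expected = sum(estimates) / len(estimates)
    variance = sum((e - expected) ** 2 for e in estimates) / len(estimates)
    return expected - theta, variance

mean_bias, mean_var = bias_and_variance(sample_mean)
median_bias, median_var = bias_and_variance(sample_median)

print("sample mean:   bias =", float(mean_bias), " variance =", float(mean_var))
print("sample median: bias =", float(median_bias), " variance =", float(median_var))
```

Under this equal probability design the sample mean has zero bias, whereas the sample median underestimates the mean of this skewed population on average.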

Under srs or srswr, both the inclusion probabilities π_{i} and the design weights w_{i} of all elements in the sample are constant. This means that all elements are equally ‘important’: each contributes the same weight w_{i} = N/n to any estimator. Thus, we can rescale w_{i} = N/n so that w_{i} = 1. We construct an estimator for the population total and the population mean based on the sample data by substituting Y_{i} in (3) by y_{i} and N in (4) by $\hat{N}$. Thus, as an estimator for the population total under equal probability sampling we have

$$\hat{Y} = \frac{N}{n} \sum_{i=1}^{n} y_i = N\bar{y} \tag{6}$$

where $\bar{y}$ is the sample mean under equal probability sampling, which can be expressed as

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \tag{7}$$

The variance of $\hat{Y}$ under srs is then

$$\text{Var}(\hat{Y}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S^2}{n} \tag{8}$$

which cannot be calculated directly, since S^{2} is unknown and has to be estimated from the sample by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \tag{9}$$

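Generally, an estimator for the variance of $\hat{\theta}$ is called a **variance estimator**. The variance estimator for $\hat{Y}$ is then

$$\widehat{\text{Var}}(\hat{Y}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s^2}{n} \tag{10}$$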

Analogously, the variance of $\bar{y}$ can be expressed as

$$\text{Var}(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{S^2}{n} \tag{11}$$

Again, the above equation cannot be calculated directly and we have to use s^{2} as an estimator of S^{2}, obtaining

$$\widehat{\text{Var}}(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{s^2}{n} \tag{12}$$

as an estimator for $\text{Var}(\bar{y})$.
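The point and variance estimators for equal probability sampling can be sketched in a few lines. The function name `srs_estimates`, the sample values and the population size are assumptions made for illustration. The formulas are the usual ones for simple random sampling without replacement, including the finite population correction 1 − n/N.

```python
def srs_estimates(y, N):
    """Point and variance estimates under simple random sampling
    without replacement (illustrative helper, not from the text)."""
    n = len(y)
    y_bar = sum(y) / n                               # sample mean
    total_hat = N * y_bar                            # estimated population total
    s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)  # sample variance s^2
    fpc = 1 - n / N                                  # finite population correction
    return {
        "mean": y_bar,
        "total": total_hat,
        "s2": s2,
        "var_mean": fpc * s2 / n,            # variance estimator of the mean
        "var_total": N ** 2 * fpc * s2 / n,  # variance estimator of the total
    }

# Hypothetical sample of n = 4 observations from a population of N = 10.
sample = [8, 12, 15, 25]
result = srs_estimates(sample, N=10)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Note that the variance estimator of the total is simply N² times the variance estimator of the mean.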

Both the point estimators $\hat{Y}$ and $\bar{y}$ and their corresponding variance estimators $\widehat{\text{Var}}(\hat{Y})$ and $\widehat{\text{Var}}(\bar{y})$ assume constant inclusion probabilities. This assumption holds for simple random sampling with and without replacement. Generally, when equal probability sample designs are used, the (expanded) sample total and the sample mean are unbiased estimators for the population total and the population mean, and their variances can be estimated from sample data using the above formulas. However, if inclusion probabilities are not constant, we need a more sophisticated estimator that takes this variation into account, both in the point estimator and in the variance estimator. An estimator that meets these criteria is the Horvitz-Thompson estimator, which is introduced in the next subsection.

- [1] Lohr 1999:29.

## The Horvitz-Thompson Estimator

Under unequal probability sampling, the Horvitz-Thompson estimator (HT estimator) is an unbiased estimator of the population total. It is defined as

$$\hat{Y}_{HT} = \sum_{i=1}^{n} w_i y_i = \sum_{i=1}^{n} \frac{y_i}{\pi_i} \tag{13}$$

where w_{i} is the design weight of the ith element as defined above. The HT estimator of the population mean can be expressed as

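$$\hat{\bar{Y}}_{HT} = \frac{\hat{Y}_{HT}}{\hat{N}} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \tag{14}$$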

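where, as before, $\hat{N} = \sum_{i=1}^{n} w_i$ is the estimated population size. As we can see, the functional form of the point estimators has not changed. If we rescale the weights so that $\sum_{i=1}^{n} w_i = n$, we can rewrite (14) as

$$\hat{\bar{Y}}_{HT} = \frac{1}{n} \sum_{i=1}^{n} w_i y_i \tag{15}$$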
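The HT point estimators can be illustrated with a small unequal probability design. In the sketch below, everything is hypothetical: a population of six elements is split into two groups and one element is drawn at random from each group, so that inclusion probabilities are 1/2 in the first group and 1/4 in the second. Enumerating all possible samples shows that the HT estimates of the total average to the true total.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population in two groups; one element is drawn per group.
group_a = [10, 14]        # each element has inclusion probability 1/2
group_b = [3, 5, 6, 10]   # each element has inclusion probability 1/4
pi = [Fraction(1, 2), Fraction(1, 4)]

def ht_total(y, pi):
    """HT estimate of the population total: sum of w_i * y_i with w_i = 1 / pi_i."""
    return sum(Fraction(v) / p for v, p in zip(y, pi))

def ht_mean(y, pi):
    """HT total divided by the estimated population size, the sum of the weights."""
    n_hat = sum(1 / p for p in pi)
    return ht_total(y, pi) / n_hat

true_total = sum(group_a) + sum(group_b)

# All 2 * 4 = 8 possible samples are equally likely under this design.
estimates = [ht_total(s, pi) for s in product(group_a, group_b)]
expected_total = sum(estimates) / len(estimates)

print("true total:", true_total)
print("average of the HT estimates over all samples:", expected_total)
print("HT mean for the sample (10, 3):", ht_mean((10, 3), pi))
```

An unweighted sample total or mean would not have this property here, because elements of the second group are selected with a lower probability than elements of the first group.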

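The above formulas for the Horvitz-Thompson estimators of the population total and the population mean look very similar to the formulas that apply under equal probability sampling. The variances of the HT estimators for the population total and the population mean, however, look a bit different. For the population total, we have

$$\text{Var}(\hat{Y}_{HT}) = \sum_{i=1}^{N} \frac{1-\pi_i}{\pi_i} Y_i^2 + \sum_{i=1}^{N} \sum_{q \neq i} \frac{\pi_{iq} - \pi_i \pi_q}{\pi_i \pi_q} Y_i Y_q \tag{16}$$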

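Here π_{iq} denotes the second order inclusion probability, that is, the probability that both elements i and q are selected into the sample. For sample designs with a fixed sample size, the above formula can also be expressed in the so-called Sen-Yates-Grundy form as

$$\text{Var}(\hat{Y}_{HT}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{q \neq i} \left(\pi_i \pi_q - \pi_{iq}\right) \left(\frac{Y_i}{\pi_i} - \frac{Y_q}{\pi_q}\right)^2 \tag{17}$$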

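Since the variance $\text{Var}(\hat{Y}_{HT})$ in (16) and (17) is unknown, we have to estimate it using sample data. The variance estimator for the HT estimator of the population total is given by

$$\widehat{\text{Var}}(\hat{Y}_{HT}) = \sum_{i=1}^{n} \frac{1-\pi_i}{\pi_i^2} y_i^2 + \sum_{i=1}^{n} \sum_{q \neq i} \frac{\pi_{iq} - \pi_i \pi_q}{\pi_{iq}} \, \frac{y_i}{\pi_i} \, \frac{y_q}{\pi_q} \tag{18}$$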

and, correspondingly, in Sen-Yates-Grundy form by

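$$\widehat{\text{Var}}(\hat{Y}_{HT}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{q \neq i} \frac{\pi_i \pi_q - \pi_{iq}}{\pi_{iq}} \left(\frac{y_i}{\pi_i} - \frac{y_q}{\pi_q}\right)^2 \tag{19}$$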
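As a plausibility check, both forms of the variance estimator can be coded directly and applied to the special case of simple random sampling without replacement, where π_{i} = n/N and π_{iq} = n(n − 1)/(N(N − 1)). The function names and the sample values below are assumptions for illustration. In this special case both forms should reproduce the equal probability variance estimator of the total, N²(1 − n/N)s²/n.

```python
def ht_variance_estimate(y, pi, pi_joint):
    """Variance estimator of the HT total in Horvitz-Thompson form.

    pi[i] is the inclusion probability of sampled element i and
    pi_joint[i][q] the joint inclusion probability of elements i and q.
    """
    n = len(y)
    v = sum((1 - pi[i]) / pi[i] ** 2 * y[i] ** 2 for i in range(n))
    for i in range(n):
        for q in range(n):
            if i != q:
                v += ((pi_joint[i][q] - pi[i] * pi[q]) / pi_joint[i][q]
                      * (y[i] / pi[i]) * (y[q] / pi[q]))
    return v

def syg_variance_estimate(y, pi, pi_joint):
    """Variance estimator of the HT total in Sen-Yates-Grundy form."""
    n = len(y)
    v = 0.0
    for i in range(n):
        for q in range(n):
            if i != q:
                v += ((pi[i] * pi[q] - pi_joint[i][q]) / pi_joint[i][q]
                      * (y[i] / pi[i] - y[q] / pi[q]) ** 2)
    return v / 2

# Hypothetical srs sample of n = 4 from N = 10.
y = [8, 12, 15, 25]
N, n = 10, len(y)
pi = [n / N] * n
joint = n * (n - 1) / (N * (N - 1))
pi_joint = [[joint] * n for _ in range(n)]  # diagonal entries are never used

y_bar = sum(y) / n
s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
srs_formula = N ** 2 * (1 - n / N) * s2 / n

print("HT form: ", ht_variance_estimate(y, pi, pi_joint))
print("SYG form:", syg_variance_estimate(y, pi, pi_joint))
print("srs form:", srs_formula)
```

The double loops make the cost of the closed form visible: the estimator needs a joint inclusion probability for every pair of sampled elements.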

Formulas for the variance estimator of the HT estimator for the population mean can be found in Kish [1]. Estimating the variance in this closed form requires the second order inclusion probabilities π_{iq}, whose computation, as Münnich notes [2], can be cumbersome or even impossible. In practice, approximations therefore have to be found that avoid the use of second order inclusion probabilities.

What is important here is that, with unequal probability sample designs, the variances of the HT estimators for both the population total and the population mean are typically increased compared with equal probability sampling. This inflation of variance must be taken into account, for example, in statistical testing. If data analysts treat the data from a complex sample as having arisen from a simple random sample and therefore use formula (10) instead of (19), they underestimate the variance of the estimator for the population total. As a consequence of this underestimation, substantive researchers will, for example, find differences that appear more significant than they would if the correct variance estimator had been used.

Modern statistical software packages such as Stata, R or SUDAAN make use of approximation methods for variance estimation, for example Taylor series approximation [3] or jackknife repeated replication [4]. In Stata, the data analyst can use the appropriate svy commands when analysing data from unequal probability samples.
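The idea behind replication methods can be sketched with a deliberately simplified delete-one jackknife for the weighted mean. This is an illustration under strong assumptions (no strata, no clusters, each element treated as its own sampling unit, no finite population correction). It is not a description of what the packages named above compute internally. The function names and data are hypothetical.

```python
def weighted_mean(y, w):
    """Weighted mean: sum of w_i * y_i divided by the sum of the weights."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def jackknife_variance(y, w):
    """Delete-one jackknife variance of the weighted mean
    (simplified: no strata, no clusters)."""
    n = len(y)
    # Recompute the estimator n times, leaving out one element each time.
    replicates = [
        weighted_mean(y[:i] + y[i + 1:], w[:i] + w[i + 1:]) for i in range(n)
    ]
    centre = sum(replicates) / n
    return (n - 1) / n * sum((r - centre) ** 2 for r in replicates)

# Hypothetical sample values and design weights.
y = [8.0, 12.0, 15.0, 25.0]
w = [2.0, 2.0, 4.0, 4.0]

print("weighted mean:", weighted_mean(y, w))
print("jackknife variance:", jackknife_variance(y, w))
```

With equal weights this jackknife reproduces s²/n, the familiar variance estimator of the sample mean without finite population correction, and it needs no second order inclusion probabilities.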