# The Horvitz-Thompson Estimator

Under unequal probability sampling, the Horvitz-Thompson estimator (HT estimator) is an unbiased estimator of the population total. It is defined as

$$\hat{Y}_{HT} = \sum_{i=1}^{n} w_i y_i = \sum_{i=1}^{n} \frac{y_i}{\pi_i} \tag{13}$$

where w_{i} is the design weight of the ith element as defined above. The HT estimator of the population mean can be expressed as

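$$\hat{\bar{Y}}_{HT} = \frac{\hat{Y}_{HT}}{\hat{N}} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \tag{14}$$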

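where, as before, $\hat{N} = \sum_{i=1}^{n} w_i$ is the estimated population size. As we can see, the functional form of the point estimators has not changed. If we rescale the weights to $w_i^{*} = n\, w_i / \hat{N}$, so that $\sum_{i=1}^{n} w_i^{*} = n$, we can rewrite (14) as

$$\hat{\bar{Y}}_{HT} = \frac{1}{n} \sum_{i=1}^{n} w_i^{*} y_i \tag{15}$$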
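As a minimal sketch of the point estimators, the following Python snippet computes the HT estimates of the total and the mean from sample values and first-order inclusion probabilities. The function names and the toy data are hypothetical, and the snippet assumes that the design weight is the inverse inclusion probability ($w_i = 1/\pi_i$) and that the rescaled weights sum to the sample size $n$.

```python
import numpy as np

def ht_total(y, pi):
    """HT estimate of the population total: sum of y_i / pi_i over the sample."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)  # assumed design weights w_i = 1 / pi_i
    return float(np.sum(w * y))

def ht_mean(y, pi):
    """HT estimate of the population mean: estimated total over estimated size."""
    w = 1.0 / np.asarray(pi, dtype=float)
    n_hat = np.sum(w)                      # estimated population size
    return ht_total(y, pi) / float(n_hat)

def ht_mean_rescaled(y, pi):
    """Same mean, written with weights rescaled to sum to the sample size n."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)
    n = len(y)
    w_star = n * w / np.sum(w)             # rescaled weights, sum(w_star) == n
    return float(np.sum(w_star * y) / n)

# Hypothetical sample of three elements with unequal inclusion probabilities.
y_sample = [10.0, 20.0, 30.0]
pi_sample = [0.5, 0.25, 0.25]

total_hat = ht_total(y_sample, pi_sample)
mean_hat = ht_mean(y_sample, pi_sample)
mean_hat_rescaled = ht_mean_rescaled(y_sample, pi_sample)
```

Rescaling leaves the estimated mean unchanged because the common factor cancels between numerator and denominator; only the estimated total depends on the absolute scale of the weights.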

The above formulas for the Horvitz-Thompson estimators of the population total and the population mean look very similar to the formulas that apply under equal probability sampling. The variances of the HT estimators for the population total and the population mean, however, look a bit different. For the population total, the variance is

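$$\mathrm{Var}(\hat{Y}_{HT}) = \sum_{i=1}^{N} \sum_{q=1}^{N} (\pi_{iq} - \pi_i \pi_q)\, \frac{y_i}{\pi_i}\, \frac{y_q}{\pi_q} \tag{16}$$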

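Here $\pi_{iq}$ denotes the second-order (joint) inclusion probability, that is, the probability that both elements $i$ and $q$ are selected into the sample, with $\pi_{ii} = \pi_i$. For sample designs with a fixed sample size, the above formula can also be expressed in the so-called Sen-Yates-Grundy form as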

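$$\mathrm{Var}(\hat{Y}_{HT}) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{q=1}^{N} (\pi_{iq} - \pi_i \pi_q) \left( \frac{y_i}{\pi_i} - \frac{y_q}{\pi_q} \right)^{2} \tag{17}$$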
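To make the two variance expressions concrete, the following sketch enumerates every possible sample of a tiny, entirely hypothetical fixed-size design and compares the brute-force variance of the HT estimator with the double-sum form and the Sen-Yates-Grundy form. The population values and the sample probabilities are invented for illustration; the inclusion probabilities are derived from them.

```python
import numpy as np

# Hypothetical population values and a hypothetical fixed-size (n = 2) design,
# given as a mapping from each possible sample to its selection probability.
y = np.array([3.0, 5.0, 8.0, 12.0])
design = {
    (0, 1): 0.10, (0, 2): 0.10, (0, 3): 0.20,
    (1, 2): 0.15, (1, 3): 0.20, (2, 3): 0.25,
}
N = len(y)

# Inclusion probabilities implied by the design: the diagonal of pi2 holds the
# first-order probabilities pi_i, the off-diagonal entries hold pi_iq.
pi2 = np.zeros((N, N))
for sample, p in design.items():
    for i in sample:
        for q in sample:
            pi2[i, q] += p
pi = np.diag(pi2).copy()

z = y / pi                          # expanded values y_i / pi_i
delta = pi2 - np.outer(pi, pi)      # pi_iq - pi_i * pi_q

# Double-sum form of the variance of the HT total.
var_double_sum = float(z @ delta @ z)

# Sen-Yates-Grundy form (valid here because every sample has size 2).
diff = z[:, None] - z[None, :]
var_syg = float(-0.5 * np.sum(delta * diff**2))

# Brute force: HT estimate for every possible sample, then the exact
# expectation and variance over the sampling distribution.
estimates = np.array([z[list(sample)].sum() for sample in design])
probs = np.array(list(design.values()))
expectation = float(np.sum(probs * estimates))
var_enumerated = float(np.sum(probs * (estimates - expectation) ** 2))
```

In this toy design the expectation of the estimator equals the true population total, which illustrates design unbiasedness, and all three variance computations agree up to floating-point error.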

Since the variances $\mathrm{Var}(\hat{Y}_{HT})$ in both (16) and (17) depend on the values of all population elements, they are unknown and have to be estimated from sample data. The resulting variance estimator for the HT estimator of the population total is given by

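$$\widehat{\mathrm{Var}}(\hat{Y}_{HT}) = \sum_{i=1}^{n} \sum_{q=1}^{n} \frac{\pi_{iq} - \pi_i \pi_q}{\pi_{iq}}\, \frac{y_i}{\pi_i}\, \frac{y_q}{\pi_q} \tag{18}$$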

and, correspondingly, in Sen-Yates-Grundy form by

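$$\widehat{\mathrm{Var}}(\hat{Y}_{HT}) = -\frac{1}{2} \sum_{i=1}^{n} \sum_{q=1}^{n} \frac{\pi_{iq} - \pi_i \pi_q}{\pi_{iq}} \left( \frac{y_i}{\pi_i} - \frac{y_q}{\pi_q} \right)^{2} \tag{19}$$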
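The sample-based variance estimators can be sketched in the same way. The function below is a hypothetical helper, not part of any survey package; it expects the second-order inclusion probabilities of the sampled elements as a matrix whose diagonal holds the first-order probabilities. As a sanity check, the example uses simple random sampling without replacement, where both forms should reduce to the familiar $N^2 (1 - n/N)\, s^2 / n$.

```python
import numpy as np

def ht_variance_estimators(y, pi, pi2):
    """Return the double-sum and Sen-Yates-Grundy variance estimates
    for the HT estimator of the population total.

    y   : sampled values
    pi  : first-order inclusion probabilities of the sampled elements
    pi2 : matrix of second-order inclusion probabilities, with pi on the diagonal
    """
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    pi2 = np.asarray(pi2, dtype=float)
    z = y / pi
    ratio = (pi2 - np.outer(pi, pi)) / pi2   # (pi_iq - pi_i pi_q) / pi_iq
    v_double_sum = float(z @ ratio @ z)
    diff = z[:, None] - z[None, :]
    v_syg = float(-0.5 * np.sum(ratio * diff**2))
    return v_double_sum, v_syg

# Hypothetical simple random sample without replacement: n = 5 out of N = 50.
N, n = 50, 5
y_s = np.array([4.0, 7.0, 1.0, 9.0, 6.0])
pi_s = np.full(n, n / N)
pi2_s = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi2_s, n / N)

v_double_sum, v_syg = ht_variance_estimators(y_s, pi_s, pi2_s)

# Textbook variance estimator of the total under simple random sampling.
v_srs = N**2 * (1 - n / N) * y_s.var(ddof=1) / n
```

For a general unequal probability design the matrix of joint inclusion probabilities is the hard part to obtain, and the two forms need not coincide for a given sample unless the sample size is fixed.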

Formulas for the variance estimator of the HT estimator for the population mean can be found in Kish [1]. Estimating the variance in this closed form requires the second-order inclusion probabilities $\pi_{iq}$, whose computation, as Münnich notes [2], can be cumbersome or impossible. Practical approximations that avoid second-order inclusion probabilities are therefore needed.

What is important here is that, with unequal probability sample designs, the variances of the HT estimators for both the population total and the population mean are increased compared with equal probability sampling. This inflation of variance must be taken into account, for example, in statistical testing. If data analysts treat the data from a complex sample as having arisen from a simple random sample and therefore use formula (10) instead of (19), they underestimate the variance of the estimator for the population total. As a consequence, standard errors are too small, and substantive researchers will, for example, find differences that appear more significant than they would if the correct variance estimator had been used.

Modern statistical software packages such as Stata, R or SUDAAN use approximation methods for variance estimation, for example Taylor series approximation [3] or jackknife repeated replication [4]. In Stata, the data analyst can use the appropriate svy commands when analysing data from unequal probability samples.
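To give a flavour of replication-based variance estimation, here is a deliberately simplified delete-one jackknife for the weighted mean. It is only a sketch: it treats every sampled element as its own replicate unit and ignores strata and clusters, whereas the jackknife repeated replication implemented in survey software deletes primary sampling units within strata. The function name and data are hypothetical.

```python
import numpy as np

def jackknife_variance_of_weighted_mean(y, w):
    """Delete-one jackknife variance estimate for the weighted mean.

    Simplifying assumption: elements are treated as independent replicate
    units, with no stratification or clustering.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    total = np.sum(w * y)
    weight_sum = np.sum(w)
    # Weighted mean recomputed with each element left out in turn.
    theta = (total - w * y) / (weight_sum - w)
    return float((n - 1) / n * np.sum((theta - theta.mean()) ** 2))

# Hypothetical data: with equal weights the result matches s^2 / n.
y_s = np.array([2.0, 4.0, 4.0, 7.0, 9.0])
v_equal = jackknife_variance_of_weighted_mean(y_s, np.ones(5))

# The same values with hypothetical unequal design weights.
w_s = np.array([1.0, 3.0, 2.0, 5.0, 1.0])
v_unequal = jackknife_variance_of_weighted_mean(y_s, w_s)
```

No second-order inclusion probabilities are needed here, which is the practical appeal of replication methods; the price is that the result is an approximation whose quality depends on how well the replicates mirror the actual sample design.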