# Stratified Sampling

In stratified sampling (str), the population of interest is divided into H non-overlapping sub-populations, or **strata**, of size N_{h} (h = 1, ... , H) according to a stratification variable Z. The stratification variable is either discrete or has to be recoded into a discrete variable with as many unique values as the desired number of strata. The values of Z are denoted by Z_{h}. The total sample size n is then allocated to the strata, so that

n_{1} + ... + n_{H} = n.

Samples of size n_{h} are drawn within each of the H strata. To summarise, there are four choices to make when planning a stratified sample:

- What variable to use for stratification?
- How are the stratum boundaries defined?
- By what method should the total sample size be allocated to the strata?
- What sample design is used to draw the samples within the strata?

The answer to the first question depends to a large extent on the availability of population information. Very often no such additional information is available, and the overall population figure N and the stratum population figures N_{h} are therefore used for Z and Z_{h}, so that n_{h} = n * (N_{h}/N).

Answering the second question is not always easy. Researchers often plan to stratify the sample geographically. In our example, it would make sense to stratify the sample according to the 19 regions of Norway. Hence, the stratum boundaries are clearly defined by the geographical location of the people living in Norway. But if a non-discrete variable, for example age, is to be used, the researcher has to decide how many discrete values the recoded variable should have and where to draw the stratum boundaries [1].
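
The recoding of a non-discrete variable can be sketched as follows. This is only an illustration: the cut points, the labels and the helper `age_to_stratum` are hypothetical and are not taken from the text, which does not prescribe any particular boundaries.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut points: a person aged exactly 30 falls into "30-44".
AGE_BOUNDARIES = [30, 45, 60]
STRATUM_LABELS = ["under 30", "30-44", "45-59", "60 and over"]


def age_to_stratum(age):
    """Return the label of the stratum that contains the given age."""
    # bisect_right counts how many boundaries are less than or equal to age,
    # which is exactly the index of the matching label.
    return STRATUM_LABELS[bisect_right(AGE_BOUNDARIES, age)]


ages = [18, 29, 30, 44, 45, 59, 60, 83]  # made-up ages
stratum_counts = Counter(age_to_stratum(a) for a in ages)
print(stratum_counts)
```

Three boundaries produce four strata, so the number of cut points is always one less than the desired number of strata.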

Concerning the third question, there are basically two methods for allocating n to the H strata and thus determining n_{h}. First, n can be allocated to the strata proportionally (strp) to Z_{h} so that

n_{h} = n * Z_{h}/Z.

If we know the stratum population figures N_{h} for h = 1, ... , H, we would use them for stratification and set Z_{h} = N_{h}. Now we are able to allocate n to the H strata proportionally to their size:

n_{h} = n * N_{h}/N.
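
As a minimal sketch of this formula, the following snippet allocates a sample to three strata. The function name `allocate_proportionally` and the toy stratum sizes are assumptions made for illustration only.

```python
def allocate_proportionally(n, stratum_sizes):
    """Return the unrounded n_h = n * N_h / N for every stratum."""
    total = sum(stratum_sizes.values())  # N
    return {h: n * size / total for h, size in stratum_sizes.items()}


# Hypothetical population with three strata (N = 10,000).
toy_population = {"A": 5000, "B": 3000, "C": 2000}
allocation = allocate_proportionally(200, toy_population)
print(allocation)
```

With these round numbers the stratum sample sizes come out as whole numbers (100, 60 and 40), which is rarely the case with real population figures.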

One advantage of proportional allocation is that the inclusion probabilities are constant. Generally, the inclusion probability of the jth element in stratum h can be expressed as

π_{hj}^{(str)} = n_{h}/N_{h}.

Since, in proportional allocation, the stratum sample sizes n_{h} are by definition proportional to N_{h}, the above ratio is constant and hence

π_{hj}^{(strp)} = n_{h}/N_{h} = n/N.

There can, however, be reasons to decide not to use proportional allocation of the total sample size, for example if the researcher wants to make sure that a minimum sample size is drawn within each stratum. We refer to any stratified sample design where n_{h} is not decided by proportional allocation as **disproportional stratified sampling** (strd).
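
The effect of the allocation method on the inclusion probabilities n_{h}/N_{h} can be checked with a small sketch. The toy stratum sizes and the choice of an equal sample size per stratum as the disproportional design are assumptions for illustration.

```python
def inclusion_probabilities(allocation, stratum_sizes):
    """Return n_h / N_h for every stratum."""
    return {h: allocation[h] / stratum_sizes[h] for h in stratum_sizes}


toy_population = {"A": 5000, "B": 3000, "C": 2000}  # hypothetical N_h
n = 300
N = sum(toy_population.values())

# Proportional allocation (strp): n_h = n * N_h / N.
proportional = {h: n * N_h / N for h, N_h in toy_population.items()}
# One disproportional design (strd): the same sample size in every stratum.
equal = {h: n / len(toy_population) for h in toy_population}

pi_strp = inclusion_probabilities(proportional, toy_population)
pi_strd = inclusion_probabilities(equal, toy_population)
print(pi_strp)
print(pi_strd)
```

Under proportional allocation every stratum has the inclusion probability n/N = 0.03, whereas under equal allocation elements in the smallest stratum are more likely to be sampled than elements in the largest one.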

Question number four also depends on the availability of information within strata. If, for example, no stratum-wise lists of people are available but only a list of households, one would be forced to opt for a multi-stage sample design (see the Section on multi-stage sample design) instead of a simple random sample within strata.

One final question remains: why use stratification in the first place? One important aspect is that stratified samples can have a lower variance than simple random samples (srs). However, the magnitude of the reduction or increase in variance depends on the degree of homogeneity of elements within the strata and on heterogeneity between strata. Thus, a well-informed choice of stratification characteristics is essential to achieve the gains in efficiency that stratification generally offers [2].
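
The dependence on homogeneity within strata can be illustrated with a small calculation. The sketch below uses the standard textbook formulas for the variance of the sample mean under simple random sampling without replacement and under stratified sampling with a simple random sample in every stratum. These formulas are not stated in the text above, and the population values and function names are made up for illustration.

```python
from statistics import variance


def var_srs_mean(values, n):
    """Variance of the sample mean under srs without replacement."""
    N = len(values)
    return (1 - n / N) * variance(values) / n


def var_stratified_mean(strata, n_h):
    """Variance of the stratified mean with an srs of size n_h per stratum."""
    N = sum(len(values) for values in strata.values())
    return sum(
        (len(values) / N) ** 2 * (1 - n_h / len(values)) * variance(values) / n_h
        for values in strata.values()
    )


# Made-up population of 30 values in three clearly separated groups.
population = list(range(10, 20)) + list(range(50, 60)) + list(range(90, 100))

# Informative stratification: similar values share a stratum.
homogeneous = {
    "low": population[:10],
    "mid": population[10:20],
    "high": population[20:],
}
# Uninformative stratification: every stratum mixes low, mid and high values.
mixed = {k: population[k::3] for k in range(3)}

v_srs = var_srs_mean(population, n=6)
v_homogeneous = var_stratified_mean(homogeneous, n_h=2)
v_mixed = var_stratified_mean(mixed, n_h=2)
print(round(v_srs, 2), round(v_homogeneous, 2), round(v_mixed, 2))
```

Because all strata have the same size, two elements per stratum is a proportional allocation of n = 6. With homogeneous strata the variance drops from about 148 to about 1.2, while the mixed strata give a slightly larger variance than srs.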

### Exercise 4

Suppose you want to draw a stratified sample of size n=2,750 from the Norwegian population. You know the population figures in the 19 regions of Norway. They are shown in the following table:

Region | Population |
---|---|
Total | 4 794 619 |
Akershus | 523272 |
Aust-Agder | 106842 |
Buskerud | 253006 |
Finnmark | 72560 |
Hedmark | 189586 |
Hordaland | 469681 |
Møre og Romsdal | 247933 |
Nordland | 235124 |
Nord-Trøndelag | 130192 |
Oppland | 183851 |
Oslo | 586860 |
Østfold | 267039 |
Rogaland | 420574 |
Sogn og Fjordane | 106389 |
Sør-Trøndelag | 284773 |
Telemark | 167102 |
Troms | 155061 |
Vest-Agder | 166976 |
Vestfold | 227798 |

- Allocate the sample size to the regions using proportional allocation.
- Allocate the sample size to the regions using the same sample size in each stratum.
- What practical problems do you have and how do you solve them?

- Find the share of the total population living in each stratum (people living in the stratum/4794619). Multiply this share by the total sample size to allocate the 2,750 persons proportionally to the strata (share of stratum*2750). In the Oslo stratum, for example, the stratum sample size under this allocation scheme is 586860/4794619*2750 = 336.6.
- Divide the total sample size by the number of regions. The sample size in every stratum is then 2750/19 = 144.7.
- In practical applications, you will almost always run into non-integer stratum sample sizes. Allocating a total sample size proportionally to strata hardly ever results in whole numbers. This is a problem because rounding each stratum sample size by a fixed rule can leave the stratum sample sizes summing to more or less than the total sample size. One solution to this problem is to use Cox controlled rounding [Cox89].
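
The three answers above can be reproduced with a short script. Note that the rounding step below is not Cox controlled rounding. It uses the simpler largest remainder method as a stand-in, which also guarantees that the rounded stratum sample sizes add up to n. The population figures are copied from the table in the exercise, and the function names are hypothetical.

```python
import math

# Population figures of the 19 regions, taken from the exercise table.
POPULATION = {
    "Akershus": 523272, "Aust-Agder": 106842, "Buskerud": 253006,
    "Finnmark": 72560, "Hedmark": 189586, "Hordaland": 469681,
    "Møre og Romsdal": 247933, "Nordland": 235124, "Nord-Trøndelag": 130192,
    "Oppland": 183851, "Oslo": 586860, "Østfold": 267039,
    "Rogaland": 420574, "Sogn og Fjordane": 106389, "Sør-Trøndelag": 284773,
    "Telemark": 167102, "Troms": 155061, "Vest-Agder": 166976,
    "Vestfold": 227798,
}
N_SAMPLE = 2750


def largest_remainder_round(allocation, n):
    """Round down, then hand the leftover units to the largest fractions."""
    rounded = {h: math.floor(x) for h, x in allocation.items()}
    leftover = n - sum(rounded.values())
    by_fraction = sorted(
        allocation, key=lambda h: allocation[h] - rounded[h], reverse=True
    )
    for h in by_fraction[:leftover]:
        rounded[h] += 1
    return rounded


total = sum(POPULATION.values())
exact_proportional = {h: N_SAMPLE * N_h / total for h, N_h in POPULATION.items()}
exact_equal = {h: N_SAMPLE / len(POPULATION) for h in POPULATION}

# Conventional rounding of the equal allocation overshoots the total.
naive_equal_total = sum(round(x) for x in exact_equal.values())

rounded_proportional = largest_remainder_round(exact_proportional, N_SAMPLE)
rounded_equal = largest_remainder_round(exact_equal, N_SAMPLE)
print(naive_equal_total, sum(rounded_proportional.values()), sum(rounded_equal.values()))
```

Rounding 144.7 to 145 in all 19 strata would give 2,755 persons instead of 2,750. The largest remainder method instead assigns 145 persons to 14 strata and 144 to the remaining five. Which strata receive the extra unit when the fractions are tied is arbitrary in this sketch.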

Region | Population | % of total population | Proportional n | Equal sized n | Difference in n (equal minus proportional) |
---|---|---|---|---|---|
Total | 4 794 619 | 100 | 2750 | 2750 | 0 |
Akershus | 523272 | 10.9 | 300.1 | 144.7 | -155.4 |
Aust-Agder | 106842 | 2.2 | 61.3 | 144.7 | 83.5 |
Buskerud | 253006 | 5.3 | 145.1 | 144.7 | -0.4 |
Finnmark | 72560 | 1.5 | 41.6 | 144.7 | 103.1 |
Hedmark | 189586 | 4 | 108.7 | 144.7 | 36 |
Hordaland | 469681 | 9.8 | 269.4 | 144.7 | -124.7 |
Møre og Romsdal | 247933 | 5.2 | 142.2 | 144.7 | 2.5 |
Nordland | 235124 | 4.9 | 134.9 | 144.7 | 9.9 |
Nord-Trøndelag | 130192 | 2.7 | 74.7 | 144.7 | 70.1 |
Oppland | 183851 | 3.8 | 105.4 | 144.7 | 39.3 |
Oslo | 586860 | 12.2 | 336.6 | 144.7 | -191.9 |
Østfold | 267039 | 5.6 | 153.2 | 144.7 | -8.4 |
Rogaland | 420574 | 8.8 | 241.2 | 144.7 | -96.5 |
Sogn og Fjordane | 106389 | 2.2 | 61 | 144.7 | 83.7 |
Sør-Trøndelag | 284773 | 5.9 | 163.3 | 144.7 | -18.6 |
Telemark | 167102 | 3.5 | 95.8 | 144.7 | 48.9 |
Troms | 155061 | 3.2 | 88.9 | 144.7 | 55.8 |
Vest-Agder | 166976 | 3.5 | 95.8 | 144.7 | 49 |
Vestfold | 227798 | 4.8 | 130.7 | 144.7 | 14.1 |

Only one decimal is shown.

#### Footnotes

- [1] Defining stratum boundaries is not necessarily an arbitrary decision. Methods exist for optimally stratifying a sample.
- [2] For a more detailed overview of stratification techniques, see Särndal et al. (1992, chapter 3.7), Cochran (1977), Lehtonen and Pahkinen (2004, pp. 61) or Münnich (2003).

#### References

- [Cox89] Cox, L. W. and George, J. A. (1989). Controlled Rounding For Tables With Subtotals. *Annals of Operations Research*, 20:141-157.