Measures of Homogeneity

In expressions (23), (24) and (28), ρ is used in the definition of the design effect (and its estimator) due to clustering. It serves as a measure of homogeneity. As such, it indicates the degree to which elements of the same cluster are more similar to each other than to all other elements in the population or sample, respectively. For the population, ρ is defined as


with $S_y^2 = SST/(MB-1)$. This can also be written as


where . The domain of ρ ranges from $-1/(B-1)$ to one. The value of ρ can be negative when most or all of the total variation can be attributed to variation within clusters. This is possible in theory but will almost never occur in practical applications, where ρ typically takes small positive values of about 0.02 to 0.15, depending on the variable under study.

The population quantities in (30) and (31) are unknown and have to be estimated from sample data. Numerous estimators have been proposed in the literature, but the so-called ANOVA (or AOV) estimator of ρ has proven to be reliable in many studies. With MSB and MSW denoting the between-cluster and within-cluster mean squares and K the number of elements per cluster, it can be expressed as

$$\hat{\rho} = \frac{MSB - MSW}{MSB + (K-1)\,MSW} \qquad (32)$$

Most statistical software packages use this estimator by default. Even if a software package does not provide a pre-defined function, the estimator can be constructed quite easily, because MSB and MSW can be obtained from any function that yields an analysis of variance table.
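The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch for clusters of equal size: the function name `anova_rho` and the small data set are hypothetical and do not come from the text, and the mean squares are computed directly instead of being read off an analysis of variance table.

```python
from statistics import mean


def anova_rho(clusters):
    """ANOVA estimator of rho for m sampled clusters of equal size K.

    `clusters` is a list of lists, one inner list per sampled cluster.
    (Hypothetical helper for illustration; not taken from the text.)
    """
    m = len(clusters)
    K = len(clusters[0])
    if m < 2 or K < 2 or any(len(c) != K for c in clusters):
        raise ValueError("need at least 2 clusters of equal size >= 2")

    grand_mean = mean(v for c in clusters for v in c)
    cluster_means = [mean(c) for c in clusters]

    # Between-cluster and within-cluster sums of squares
    ssb = K * sum((cm - grand_mean) ** 2 for cm in cluster_means)
    ssw = sum((v - cm) ** 2
              for c, cm in zip(clusters, cluster_means)
              for v in c)

    # Mean squares with the usual ANOVA degrees of freedom
    msb = ssb / (m - 1)
    msw = ssw / (m * (K - 1))

    # Raises ZeroDivisionError if all observations are identical
    return (msb - msw) / (msb + (K - 1) * msw)


# Made-up example: two very homogeneous clusters that differ strongly
example = [[1, 2, 3], [7, 8, 9]]
print(anova_rho(example))
```

In this made-up example almost all variation lies between the clusters, so the estimate is close to one.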

Exercise 6

Based on the data from Example 5, calculate ρ twice: once with the columns defining the clusters and once with the rows defining the clusters. Explain the results.

Now assume that the rows are defined as clusters and that a cluster sample of m = 2 clusters with total size n = 10 is selected. Assume that the sample drawn under this design consists of all elements in the first and the third row. Calculate


Substituting this in (31) gives,

This is caused by the high degree of heterogeneity of the column-wise means and the high homogeneity of the row-wise means. If columns are sampled and the selected columns are not exactly symmetrical about the third column (i.e. first and fifth, or second and fourth), the difference between the sample mean and the population parameter can be very large. If rows are sampled, the variance of the sample mean is very low, because the row-wise means differ very little. Even in one of the worst cases, in which the two upper rows are selected, the sample mean is 11.5, which is closer to the population mean than in the corresponding worst cases of column-wise selection (i.e. if the first two or the last two columns are selected), where the sample mean is 5.5 and 20.5, respectively.
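The comparison of row-wise and column-wise selection can be checked by enumerating all possible samples of two clusters. The sketch below assumes, hypothetically, that the population of Example 5 is a 5×5 layout holding the values 1 to 25 filled column by column. This layout is consistent with the sample means quoted for the worst cases, but it is not given in this section, and the helper name `sample_means` is likewise made up.

```python
from itertools import combinations
from statistics import mean, pvariance

# ASSUMPTION: 5x5 population, values 1..25 filled column by column,
# so row r (r = 1..5) holds r, r+5, r+10, r+15, r+20.
rows = [[r + 5 * c for c in range(5)] for r in range(1, 6)]
cols = [list(col) for col in zip(*rows)]  # transpose to get the columns


def sample_means(clusters, m=2):
    """Sample mean for every possible selection of m whole clusters."""
    return [mean(v for c in chosen for v in c)
            for chosen in combinations(clusters, m)]


row_means = sample_means(rows)
col_means = sample_means(cols)

print("rows as clusters:   ", min(row_means), max(row_means))
print("columns as clusters:", min(col_means), max(col_means))

# Variance of the sample mean over all 10 equally likely samples
print("variance (rows):   ", pvariance(row_means))
print("variance (columns):", pvariance(col_means))
```

Under this assumed layout the sample mean stays close to the population mean of 13 when rows are selected, whereas column-wise selection spreads it much more widely.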

Under the specified sample design, the estimated value of ρ is obtained by substituting the sample mean squares into (32). With MSB = 10, MSW = 62.5 and K = 5 we obtain the estimate $\hat{\rho} = (MSB - MSW)/(MSB + (K-1) \cdot MSW) = (10 - 62.5)/(10 + 4 \cdot 62.5) = -0.2019$.
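The arithmetic of this last step can be reproduced with a short Python sketch. The values for MSB, MSW and K are the ones stated above; the function name `rho_from_mean_squares` is only an illustrative choice.

```python
def rho_from_mean_squares(msb, msw, k):
    """Plug between- and within-cluster mean squares into the ANOVA estimator.

    k is the (common) number of elements per cluster.
    """
    return (msb - msw) / (msb + (k - 1) * msw)


# Values quoted in the text: MSB = 10, MSW = 62.5, K = 5
rho_hat = rho_from_mean_squares(msb=10.0, msw=62.5, k=5)
print(round(rho_hat, 4))  # -0.2019
```

The negative sign reflects that the within-cluster mean square is larger than the between-cluster mean square in this sample.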