Suppose that, following the design of the previous well contamination study, a new group of researchers wants to conduct a survey to characterize the effect that distance to the nearest safe drinking water source has on well switching behavior. For the next study, we would like to estimate the number of residents that need to be polled in order to achieve high power in our test for the distance effect. Classical methods of power calculation assume that observations are independent of one another. Our analysis of the data from the previous well study called this assumption into question. We propose using a modified bootstrap sampling method called the Moving Block Bootstrap (MBB) to appropriately preserve the dependence structure in the data during power calculation. We compare the MBB results to those obtained from parametric and non-parametric alternatives and show that taking into account the additional dependence increases the number of samples required to achieve a given power.
How many residents do researchers need to include in their next study in order to reliably detect the association of distance to the nearest safe drinking water source with the decision to switch wells? To properly answer this question, we need to make clear a few points. In the context of this analysis, reliable detection of a phenomenon means that a specific statistical test will reject the hypothesis that the phenomenon is absent with high probability, given that the phenomenon in fact exists. In this narrow sense, the probability of detecting the phenomenon is called “power.” When we speak about the power (we will denote it as \(P\)) of a future study, we assume that the methods used to determine statistical power exactly match the method that will be used to determine statistical significance in the future study.
Rather than using direct simulation, we choose to use the existing well data to calculate power. That means our results implicitly assume that data obtained in the future study will be sampled from the same distribution as the one sampled in the previous study. Under the assumption that Wald statistics derived from logistic regression coefficients are used to determine statistical significance, we follow the prescription of the authors in Hsieh et al. (1998), who derived a simple formula for calculating necessary sample size \(n\) given a desired power level \(P\) and a significance level \(1-\alpha\):
\[n = \frac{(Z_{1 - \alpha/2} + Z_P)^2}{\hat{p}(1 - \hat{p})\hat{\beta}^2} \cdot \frac{1}{1 - \rho^2}\]Here, the \(Z\)’s are the normal quantiles, \(\hat{\beta}\) is the regression coefficient for distance, and \(\hat{p}\) is the fitted switch probability at the column-wise mean of the covariate matrix. In our case, the re-scaling parameter \(\rho^2\) is the proportion of variance explained by regressing the distance variable on the arsenic and education variables.
The Wald-based power calculation relies on distributional assumptions about the data, which, in practice, may not hold. While a sampling method like bootstrapping is free from distributional assumptions, even it relies on the assumption that the switch responses are conditionally independent and identically distributed given the covariates. Since our previous analysis of these data showed that the iid assumption does not hold, we propose using a method called the Moving Block Bootstrap (MBB) (Künsch 1989) to estimate power.
Let \(\mathcal{N}\) be a set of predetermined sample sizes, let \(\alpha\) be the desired false positive rate under the null, and let \(p\) be a chosen block size. Then for each \(n \in \mathcal{N}\), our method (1) samples \(n/p\) blocks of observations with replacement to create a bootstrap sample of size \(n\) (2) fits a logistic regression to the bootstrap sample, modeling the decision to switch as a function of distance and the other covariates (3) calculates the p-value of the distance coefficient and (4) reports “reject” or “fail to reject” according to \(\alpha\). After sampling at each prospective sample size, the MBB’s estimate of power at the \(1-\alpha\) level is simply the proportion of bootstrap samples for which the distance p-value was less than \(\alpha\). One limitation of our method is that all power values must be interpreted as the probability of detecting an effect of similar magnitude to that observed in the sampled data. To be conservative, we make no claims about power at different scales of effect size.
Figure 1: Power versus sample size for three methods of calculation. The blue curve follows formula described in Hsieh (1998). The red and green curves are linear interpolations of means and standard errors obtained from non-parametric bootstrap sampling.
A full discussion of the theory underlying the MBB is outside the scope of this work. However, the intuition is that sampling contiguous chunks of data guarantees that every bootstrap sample maintains the neighbor-dependent response structure present in the full dataset. Sampling without blocking provides no such guarantees and is therefore likely to yield overly optimistic estimates of power. Our results indicate that, indeed, preserving the structure of the data in bootstrap samples leads to higher estimates of sample size required to achieve a given \(P\).
To justify using the MBB results over the others, it is important to emphasize that even if the researchers are aware of the geographic dependence phenomenon, the next set of data will still contain it. The geographic dependence was not an artifact of the data collection procedure. Rather, it was the neighbors’ interactions that created this dependence. The fact that the data were stored in the order they were collected allowed us to see the structure, but it would have been there even if the order of observations had been randomly scrambled from the start. It is unlikely – perhaps even undesirable – that the next group of researchers should attempt some contrived study design in order to decouple their responses. We believe we should instead adopt a method of calculating power that is most faithful to the data-generating distribution the researchers are likely to encounter in their next study.
Our results demonstrate that although geographic dependence does reduce the effective size of a given sample, it does so only modestly at the \(P = 0.8\) level. Researchers should increase their sample size by roughly \(100\) residents at this level of power over what the parametric or naive bootstrapping results would dictate. In a clinical trial setting, this recommendation might be devastating to a prospective study. For a survey such as this, it is presumably feasible to find an additional \(100\) respondents. However, to achieve higher \(P\) levels, the necessary increase in sample size becomes a more substantial hurdle.
Hsieh, F. Y., Bloch, D. A. and Larsen, M. D. (1998), A simple method of sample size calculation for linear and logistic regression. Statist. Med., 17: 1623-1634. doi:10.1002/(SICI)1097-0258(19980730)17:14<1623::AID-SIM871>3.0.CO;2-S
Künsch, Hans R. The Jackknife and the Bootstrap for General Stationary Observations. Ann. Statist. 17 (1989), no. 3, 1217–1241. doi:10.1214/aos/1176347265. https://projecteuclid.org/euclid.aos/1176347265