Jordan Bryan

Structural dependence in binary regression

Drinking water contamination affects many communities in the developing world. When safe water sources become available to these communities, various factors may determine whether residents modify their drinking habits. We analyze data gathered from \(3020\) residents in a developing country with Frequentist and Bayesian methods in order to understand why certain households switch to safer sources of drinking water and why others do not.

These data were obtained from and are featured in Andrew Gelman and Jennifer Hill’s book Data Analysis Using Regression and Multilevel/Hierarchical Models. The authors of that work perform an analysis that focuses on interactions between the explanatory covariates. Our approach focuses on a thus far unexamined artifact present in the dataset.

Using a standard probit model, we find some evidence that the distance to the nearest safe water source and the degree of contamination in the current water source are associated with a household’s propensity to switch. When checking the iid assumptions underlying our model, however, we find an apparent dependence between successive observations in the data. We demonstrate that incorporating this structure into our model increases our model quality as measured by AIC.

We propose a scenario in which the researchers in this study collected data in geographic order, thereby inducing the dependence structure we observe. We hypothesize that neighboring households’ decisions to switch to safe water sources are the most important determinants of whether a given household will switch. Accordingly, we present two analyses of the data, one that ignores the dependence between successive observations and one that exploits this dependence.

Inference under naive probit model

To collect these data, researchers interested in drinking water contamination in the developing world visited a region of Bangladesh to measure the concentration of arsenic in 3020 household wells. In addition to measuring arsenic levels, they recorded residents’ years of education, asked whether they were involved in their communities, and calculated the distance to the nearest safe drinking water source. Several years later, the researchers returned to the region to ask residents whether they had switched their source of drinking water since the last visit. We choose to model residents’ binary switch response using a probit regression:

\[\begin{aligned} y_i &= \mathbf{1}_{z_i > 0} \\ z_i &= X_i \cdot \beta + \epsilon_i \\ \epsilon_i &\overset{\text{iid}}\sim N(0,1) \end{aligned}\]

Here, \(y_i\) is the binary response of the \(i\)th household, \(X_i\) is a \(1 \times 4\) vector containing the explanatory variables for household \(i\), and \(\beta\) are the model coefficients. The variable \(z_i\) is household \(i\)’s latent switch propensity score. For Frequentist inference, we fit the model using maximum likelihood estimation and obtain 95% confidence intervals for the coefficients \(\beta\).

  Estimate 2.5% 97.5%
Intercept 0.2382 0.1773 0.2992
arsenic 0.3062 0.2549 0.3582
dist -0.2101 -0.2582 -0.1624
assoc -0.0796 -0.1726 0.0133
educ 0.1068 0.0605 0.1532

Table 1: Coefficient estimates from Frequentist naive probit regression

Our Bayesian procedure places a prior distribution on the coefficients \(\beta\) and constructs point estimates and 95% credible intervals from the posterior distribution of \(\beta | Z, X\). Under the prior \(\beta \sim N(b_0,B_0)\), the probit model allows us to find closed-form posterior distributions for all parameters given by

\[\begin{aligned} \beta|Z,X &\sim N(\beta_{\text{post}}, B_\text{post}) \\ z_i|X_i,\beta,y_i &\sim \left\{ \begin{array}{ll} TN_{[0,\infty)}(X_i \cdot \beta,1) & \text{ if } y_i=1 \\ TN_{(-\infty,0)}(X_i \cdot \beta,1) & \text{ if } y_i=0 \end{array} \right. \end{aligned}\]

where \( B_\text{post} = (B_0^{-1} + X^TX)^{-1} \), \(\beta_\text{post} = B_\text{post}(B_0^{-1} b_0+X^TZ)\), and \(TN_{(a,b)}\) denotes the normal density function truncated to lie within the interval \((a,b)\). Since we have relatively few covariates, relatively many samples, and no strong prior beliefs about the data, we set \(b_0\) and \(B_0\) to be the \(0\) vector and \(10^3\) times the identity matrix, respectively.

The coefficient estimates and intervals from the Frequentist and Bayesian procedures are quite similar. They suggest that a resident’s involvement in the community is not predictive of whether he or she will switch wells. By contrast, high concentrations of arsenic in the resident’s household water are associated with higher propensity to switch wells. Short distances to the nearest safe water source are also weakly associated with higher propensity to switch, as are higher levels of household education.

  Estimate 2.5% 97.5%
Intercept 0.2381 0.1769 0.2984
arsenic 0.3064 0.2541 0.3586
dist -0.2103 -0.2577 -0.1627
assoc -0.0793 -0.1717 0.0137
educ 0.1073 0.0599 0.1532

Table 2: Coefficient estimates from Bayesian naive probit regression

However, our assumption of independence of the residual terms in the Frequentist and Bayesian analyses appears to be violated (see Figure 1). The results are striking in both cases: even when all explanatory variables are held constant, there appears to be correlation between the responses of successive entries in the dataset.

alt text

Figure 1: Autocorrelation in residuals

Extending the probit model to exploit structural dependence

The dependence we observe suggests the following hypothesis: researchers collected (and stored) their data in geographic order, surveying neighboring households one by one. Households with neighbors that switched drinking wells were more likely to switch, perhaps because of social pressure to conform, perhaps due to advice or other information that flowed quickly from neighbor to neighbor. Whatever the mechanism that caused it, the dependence between successive observations motivates a new model, one that explicitly incorporates neighbor behavior into a resident’s response.

Accordingly, we introduce an auto-regressive component to the probit model. Instead of assuming that a resident’s decision to switch is a function of the explanatory variables alone, we include the \(p\) previous residents’ switch responses as additional covariates.

\[\begin{aligned} y_i &= \mathbf{1}_{z_i > 0} \\ z_i &= X_i \cdot \beta+\gamma_1 y_{i-1}+\cdots+\gamma_p y_{i-p} + \epsilon_i \\ \epsilon_i &\overset{\text{iid}} \sim N(0,1) \end{aligned}\]

The number of lag terms, \(p\), may be selected using the AIC criterion, cross-validation accuracy or other model selection procedures. We choose the model with minimal AIC to set \(p = 7\). We then apply the Bayesian and frequentist procedures outlined above to the modified probit model.

alt text

Figure 2: Removal of autocorrelation in residuals after introducing auto-regressive lag terms

Both procedures yield similar results, which (1) improve model quality over the naive probit model, (2) eliminate the autocorrelation structure in the residuals (see Figure 2) and (3) demonstrate that although the explanatory variables highlighted in the previous section remain significant, the most important predictors of switch behavior are indeed the neighboring switch terms with lags \(1\) and \(2\).

alt text

Figure 3: AIC versus number of lag terms included in AR probit model

  Estimate 2.5% 97.5%
(Intercept) -0.5962 -0.7127 -0.4803
arsenic 0.2323 0.1796 0.2857
dist -0.1501 -0.2002 -0.1003
assoc -0.0653 -0.1621 0.0315
educ 0.1006 0.0522 0.1492
switch_lag1 0.5371 0.4350 0.6392
switch_lag2 0.3775 0.2732 0.4819
switch_lag3 0.1242 0.0181 0.2300
switch_lag4 0.1014 -0.0052 0.2079
switch_lag5 0.1594 0.0532 0.2656
switch_lag6 0.0474 -0.0583 0.1529
switch_lag7 0.1160 0.0125 0.2193

Table 3: Coefficient estimates from Frequentist AR probit regression

  Estimate 2.5% 97.5%
(Intercept) -0.5950 -0.7107 -0.4766
arsenic 0.2324 0.1791 0.2874
dist -0.1507 -0.2005 -0.1002
assoc -0.0633 -0.1599 0.0314
educ 0.1002 0.0519 0.1477
switch_lag1 0.5403 0.4393 0.6427
switch_lag2 0.3769 0.2716 0.4850
switch_lag3 0.1199 0.0133 0.2262
switch_lag4 0.1040 -0.0025 0.2085
switch_lag5 0.1614 0.0555 0.2663
switch_lag6 0.0475 -0.0582 0.1533
switch_lag7 0.1137 0.0083 0.2141

Table 4: Coefficient estimates from Bayesian AR probit regression

Conclusion and recommendations

Though our hypothesis that the data were collected in geographic order is still just a hypothesis, we find clear evidence that incorporating lag-terms into the probit model improves our model quality. A future study might explicitly record household location to verify that the dependence we observe is spatial in nature.

Additional studies might also focus on recording explanatory covariates with more actionable policy remedies. Altering the education level of the head of the household, for example, seems like an indirect way of ensuring that a family obtains safe drinking water in the near future. By contrast, recording whether residents have access to a car or other means of transportation seems easier to remedy in the short term, perhaps through targeted subsidy or loan programs.

Given these data alone, the naive probit model suggests that building more safe drinking water sources closer to residents might encourage more households to change their behavior. However, if we are correct about the spatial dependence in the data, the AR probit model suggests that appropriate policy interventions should exploit the effect of geographic proximity. We suggest a few possibilities: (1) Select a subset of residents with optimal geographic coverage in the region and give these households direct financial incentives to switch wells. (2) Encourage households that have switched to tell their neighbors that they have done so. (3) Hold town meetings and ask volunteers to give positive testimony about switching to a safe drinking water source.


This work was done in collaboration with Bora Jin, Naoki Awaya, and Shounak Chattopadhyay.


Albert, James H., and Siddhartha Chib. “Bayesian analysis of binary and polychotomous response data.” Journal of the American statistical Association 88, no. 422 (1993): 669-679.

Gelman, Andrew, and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge university press, 2006.