For many applied social science problems (for example, relatively simple linear regressions, or descriptive statistics where you still use a model), the choice between frequentist and Bayesian inference doesn’t seem to matter much.

Frequentists use Maximum Likelihood Estimation (MLE) to obtain point estimates for parameters. Bayesians, on the other hand, combine the likelihood with a prior distribution using Bayes’ theorem to obtain a posterior distribution. The result is roughly a weighted average of the MLE and the prior, with weights that depend on how informative each one is.
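To make that “weighted average” intuition concrete, here is a minimal sketch with a conjugate normal model and made-up numbers (not the regression example below): the posterior mean is a precision-weighted average of the sample mean (the MLE) and the prior mean, and with a very vague prior it essentially collapses to the MLE.

# Toy normal-normal conjugate model with known variance; all numbers are made up.
y <- c(9.8, 10.4, 10.1, 9.9, 10.3)    # data
sigma2 <- 1                           # assumed known data variance
n <- length(y); ybar <- mean(y)       # ybar is also the MLE of the mean
post <- function(mu0, tau2) {         # prior: N(mu0, tau2)
  w <- (n / sigma2) / (n / sigma2 + 1 / tau2)   # weight on the data
  c(posterior_mean = w * ybar + (1 - w) * mu0,
    posterior_sd   = sqrt(1 / (n / sigma2 + 1 / tau2)))
}
ybar                        # MLE
post(mu0 = 0, tau2 = 1)     # informative prior pulls the estimate toward 0
post(mu0 = 0, tau2 = 1e6)   # vague prior: essentially the MLE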

In social science, Bayesians often employ vague or weakly informative priors. Consequently, their posterior estimates tend to align closely with the MLE estimates.

There are packages that make Bayesian versions of lm(), glm(), lmer(), and glmer() very easy to fit (mostly rstanarm and brms).

The results are often similar to what you would get with lm(), glm()… Importantly, if the results are not similar, you should try to understand why. If the bootstrap or frequentist confidence interval, or the Bayesian credible interval, is very wide or sits close to zero, then you simply don’t have a lot of good evidence.

In practice, I’ll often use frequentist methods because they are much faster on large datasets. Most people are also trained in frequentist statistics, so that’s what they know.

I’ll sometimes use rstanarm and brms when I need the full posterior distribution of the parameters, because I want to propagate uncertainty to downstream quantities like poststratified predictions. You can also get that uncertainty with the bootstrap or the delta method, but there’s something nice about having the full posterior.
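As a rough sketch of what I mean (with simulated data and a made-up derived quantity, not any real analysis): with the Bayesian fit you compute the downstream quantity on each posterior draw; with the bootstrap you refit on resampled rows and recompute it.

library(rstanarm)
set.seed(1)
sim <- data.frame(x = rnorm(200))
sim$y <- 1 + 2 * sim$x + rnorm(200)
# Bayesian: compute the derived quantity (here, intercept/slope) on each draw.
fit_b <- stan_glm(y ~ x, data = sim, refresh = 0)
draws <- as.data.frame(fit_b)
quantile(draws$`(Intercept)` / draws$x, c(0.025, 0.975))
# Bootstrap: refit on resampled rows and recompute the same quantity.
boot_ratio <- replicate(500, {
  i <- sample(nrow(sim), replace = TRUE)
  b <- coef(lm(y ~ x, data = sim[i, ]))
  b[[1]] / b[[2]]
})
quantile(boot_ratio, c(0.025, 0.975))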

My high-level view is something like section 5.5.5 of Kevin Murphy’s book, especially the quote by Donald Rubin at the top of p. 203:

The applied statistician should be Bayesian in principle and calibrated to the real world in practice. [They] should attempt to use specifications that lead to approximately calibrated procedures under reasonable deviations from [their assumptions]. [They] should avoid models that are contradicted by observed data in relevant ways — frequency calculations for hypothetical replications can monitor a model’s adequacy and help to suggest more appropriate models.

Both for Frequentists and Bayesians, the parameter to estimate “may have been fixed from the start or may have been generated from a physically random mechanism”. The difference is that Bayesians estimate a probability distribution for the parameters given the model and their priors on those parameters, while Frequentists estimate the most likely point estimate given the model and use the standard error to quantify the uncertainty or variability of that estimate (e.g., if we replicated the experiment 100 times, the true parameter would be contained in the 95% confidence interval about 95 times). To me it seems like a pragmatic choice.
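To illustrate that frequentist interpretation, here is a quick simulation with made-up data-generating values: refit the “same experiment” many times and count how often the 95% confidence interval for the slope contains the true value.

set.seed(123)
true_slope <- 0.5
covered <- replicate(1000, {
  x <- rnorm(50)
  y <- 1 + true_slope * x + rnorm(50)
  ci <- confint(lm(y ~ x))["x", ]          # 95% CI for the slope
  ci[1] < true_slope && true_slope < ci[2]
})
mean(covered)   # should be close to 0.95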

Here is a toy regression example, though for many applied problems it won’t matter much either:

library(rstanarm)
# UCLA teaching dataset (elemapi2): school-level academic performance index (api00),
# enrollment, percent free meals, and percent fully credentialed teachers.
d <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv")
# Same model fit two ways: OLS/MLE with lm(), Bayesian with stan_glm() and default priors.
fit <- lm(api00 ~ enroll + meals + full, data = d)
fitBayes <- stan_glm(api00 ~ enroll + meals + full, data = d, seed = 222)
coef(summary(fit))[,1:2]
               Estimate  Std. Error
(Intercept) 801.8298337 26.42660282
enroll       -0.0514556  0.01383741
meals        -3.6597345  0.10879550
full          1.0810944  0.23945269
fitBayes[1:2]
$coefficients
 (Intercept)       enroll        meals         full 
801.96170047  -0.05139102  -3.65802369   1.08278020 

$ses
(Intercept)      enroll       meals        full 
 25.7709027   0.0138499   0.1080426   0.2365597 

For example, if we look at the intercept, it’s 801.83 for MLE and 801.96 for the Bayesian fit. The prior chosen automatically by rstanarm for that coefficient is \(N(648, 356)\) (location 648, scale 356; see below). That’s what is meant by a weighted average of the prior and the likelihood: with this much data, as in most applied regressions of this kind, the likelihood strongly dominates the prior, so the two estimates barely differ.

prior_summary(fitBayes)
Priors for model 'fitBayes' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 648, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 648, scale = 356)

Coefficients
  Specified prior:
    ~ normal(location = [0,0,0], scale = [2.5,2.5,2.5])
  Adjusted prior:
    ~ normal(location = [0,0,0], scale = [ 1.57,11.14,23.79])

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.007)
------
See help('prior_summary.stanreg') for more details
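As I understand rstanarm’s default autoscaling (see the package’s prior vignette and ?priors), those “adjusted” scales come from rescaling by the standard deviations of the outcome and of each predictor, so a back-of-the-envelope check should roughly reproduce the numbers above:

# Rough check of the adjusted prior scales under rstanarm's autoscaling
# (my understanding of the defaults, not output from the package itself).
2.5 * sd(d$api00)                    # intercept scale, about 356
2.5 * sd(d$api00) / sd(d$enroll)     # enroll scale, about 1.57
2.5 * sd(d$api00) / sd(d$meals)      # meals scale, about 11.14
2.5 * sd(d$api00) / sd(d$full)       # full scale, about 23.79
1 / sd(d$api00)                      # sigma rate, about 0.007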

Of course, you can also set an implausibly precise prior, though it’s not clear why you would want to…

fitBayes <- stan_glm(
  api00 ~ enroll + meals + full,
  data = d,
  prior_intercept = normal(0, 1)
)
fitBayes[1:2]
$coefficients
 (Intercept)       enroll        meals         full 
140.85916753  -0.04714781  -3.59955054   1.20609481 

$ses
(Intercept)      enroll       meals        full 
306.7225878   0.1554933   1.2277531   2.7602487 

One case where it’s sometimes said that Bayesian statistics is superior is noisy data: when studies are small and data are indirect or highly variable. Partial pooling adjusts individual group estimates based on the overall data: if a particular group has little data, its estimate is pulled (or “shrunk”) towards the grand mean (the overall average across groups). This is beneficial because, when we have limited data for a group, the overall average provides information that can be more reliable than that group’s limited data on its own. We set structured priors and the model will fit even with little data. For example, in Canada, you could still get estimates for the territories in addition to the provinces (though with much more uncertainty around the parameters) using a hierarchical model with partial pooling. Frequentists would often simply not say anything about small areas like these, where we don’t have a lot of information. It comes down to a difference in philosophy: do we want to model something and accept a lot of uncertainty by pooling towards the grand mean, or do we simply want to refrain from saying anything?
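Before the real example, here is a small simulated sketch of that shrinkage (group sizes and means are made up): the group with only a handful of observations gets pulled noticeably toward the overall mean, while the large groups barely move.

library(rstanarm)
set.seed(42)
group_means <- c(A = 0, B = 1, C = 2, D = 5)
n_per_group <- c(A = 200, B = 200, C = 200, D = 3)   # group D is tiny
sim <- data.frame(
  g = rep(names(group_means), n_per_group),
  y = rnorm(sum(n_per_group), mean = rep(group_means, n_per_group), sd = 2)
)
fit_pp <- stan_glmer(y ~ (1 | g), data = sim, refresh = 0)
# Raw (no pooling) group means vs. partially pooled estimates:
# D shrinks toward the grand mean, A-C stay close to their raw means.
cbind(raw    = tapply(sim$y, sim$g, mean),
      pooled = coef(fit_pp)$g[, "(Intercept)"])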

There’s a good summary of all this on pages 14-16 of Regression and Other Stories:

So, in that sense, we have a choice: classical inference, leading to pure summaries of data which can have limited value as predictions; or Bayesian inference, which in theory can yield valid predictions even with weak data, but relies on additional assumptions. There is no universally correct answer here; we should just be aware of our options. There is also a practical advantage of the Bayesian approach, which is that all its inferences are probabilistic and thus can be represented by random simulations. For this reason, whenever we want to summarize uncertainty in estimation beyond simple confidence intervals, and whenever we want to use regression models for predictions, we go Bayesian. As we discuss in Chapter 9, we can perform Bayesian inference using noninformative or weakly informative priors and obtain results similar to classical estimates, along with simulation draws that can be used to express predictive uncertainty, or we can use informative priors if so desired. To the extent that we have relevant information that is not in our model (for example, awareness of bias, selection on unmeasured characteristics, prior information on effect sizes, etc), then we have a duty to account for this as well as we can when interpreting our data summaries.

Here is an example using the 2021 Canadian Election Study v1.0.dta available here.

We look at the estimate of the mean of cps21_pos_carbon by province/territory.

cps21_pos_carbon: “To help reduce greenhouse gas emissions, the federal government should continue the carbon tax.”

Let’s remove responses coded (6) and treat the variable as continuous.

library(haven)
library(tidyverse)
library(srvyr)
library(rstanarm)  # for stan_glmer() below
df <- read_stata("~/Downloads/dataverse_files_CES2021/2021 Canadian Election Study v1.0.dta")
to_model <- df |>
  select(cps21_ResponseId,cps21_pos_carbon,provcode,weight=cps21_weight_general_restricted) |>
  # sketchy but weights for territories are all NA ...
  mutate(weight=ifelse(is.na(weight),1,weight)) |>
  drop_na() |>
  filter(cps21_pos_carbon!=6)
to_model$provcode <- as_factor(to_model$provcode)
# Classical design-based survey estimates (weighted mean and CI) by province/territory.
results_survey_province <- to_model %>%
  as_survey_design(ids = 1, weight = weight) %>%
  group_by(provcode) %>%
  summarise(survey_mean(cps21_pos_carbon), n = n(),
            survey_mean(cps21_pos_carbon, vartype = "ci"))
to_model$provcode <- as.character(to_model$provcode)
to_model$cps21_pos_carbon <- as.numeric(to_model$cps21_pos_carbon)
# Hierarchical model with partial pooling: random intercept by province/territory.
bayesian_fit <- stan_glmer(cps21_pos_carbon ~ (1 | provcode),
                           data = to_model, family = 'gaussian', weights = weight,
                           cores = 4, chains = 4)
to_predict <- to_model |> select(provcode) |> unique()
# Posterior distribution of the expected value in each province/territory,
# summarised by its 2.5%, 50%, and 97.5% quantiles.
pred <- apply(posterior_epred(bayesian_fit, newdata = to_predict), 2,
              function(x) quantile(x, c(0.025, 0.5, 0.975)))
to_predict$min <- pred[1,]
to_predict$estimate <- pred[2,]
to_predict$max <- pred[3,]
to_plot <- bind_rows(results_survey_province|>
                        select(provcode,
                               estimate=coef,
                               min=`_low`,
                               max=`_upp`)|>mutate(type='classical survey\nestimate'),
                      to_predict|>
                        select(provcode,
                               estimate,min,max)|>mutate(type='Bayesian with\npartial pooling'))
to_plot |>
  ggplot(aes(x=provcode,y=estimate,ymin=min,ymax=max,color=type,group=type)) +
  geom_errorbar(position=position_dodge(width=.85)) +
  geom_point(position=position_dodge(width=.85)) +
  coord_flip()

There is no confidence interval for Nunavut because it has only one observation. The Bayesian estimates for Yukon, PEI, and NT seem more plausible. Of course, with a good sample size, the results are very similar.

So that’s what they mean above when they say: “classical inference, leading to pure summaries of data which can have limited value as predictions; or Bayesian inference, which in theory can yield valid predictions even with weak data, but relies on additional assumptions”. The classical summary has limited value as a prediction for Nunavut, for instance.

We also see that the Bayesian estimate (the red point) is always closer to the overall mean, especially where the sample size is small. That’s what is meant by partial pooling.