Week 10: Working with weighted data

PUBPOL 750 Data Analysis for Public Policy I

Justin Savoie

MPP-DS McMaster

2022-07-22


Working with weighted data


Weighted data

  • Often, the data you work with will be weighted.
  • In theory, weights can apply to any type of dataset.
  • Policy cases could be weighted; counties could have weights.
  • In practice, weighted data is usually data collected on individuals. Intuitively, some individuals are harder to reach and therefore less likely to appear in your data, so you increase their weight in your analysis.
  • Weights correct a dataset so that results more accurately represent the population being studied.
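
As a minimal sketch of the intuition above, suppose young people make up 30% of a population but only 1 of 10 respondents in a sample. Giving each respondent a weight equal to their group's population share divided by its sample share restores the balance. All numbers and variable names here are made up for illustration.

```r
# Hypothetical sample of 10 respondents: 1 young, 9 older
group   <- c("young", rep("older", 9))
support <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)  # 1 = supports some policy

# Assumed population shares (illustrative only)
pop_share    <- c(young = 0.3, older = 0.7)
sample_share <- prop.table(table(group))

# Weight = population share / sample share for the respondent's group
w <- as.numeric(pop_share[group]) / as.numeric(sample_share[group])

mean(w)                    # these weights have a mean of 1
mean(support)              # unweighted share of supporters
weighted.mean(support, w)  # weighted share of supporters
```

The single young respondent counts for three respondents, and each older respondent counts for a little less than one.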

Sampling

  • Why can we claim that a poll of 100 respondents represents the views of 38M Canadians?
  • Random sampling: if we select respondents truly at random, the results are close to the true population numbers, within a margin of error that depends on the sample size.
  • Random digit dialing + random selection within the household
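
The claim about random sampling can be checked with a small simulation. This is only a sketch: the population, the 40% figure, and the seed are arbitrary assumptions, not real polling data.

```r
set.seed(750)  # arbitrary seed, for reproducibility only

# Hypothetical population of 1,000,000 people; about 40% hold some opinion
population <- rbinom(1e6, size = 1, prob = 0.4)
true_p <- mean(population)

# Draw 500 simple random samples of 100 respondents each
n <- 100
estimates <- replicate(500, mean(sample(population, n)))

true_p                           # true population proportion
mean(estimates)                  # the sample estimates centre on it
sd(estimates)                    # spread of the estimates across samples
sqrt(true_p * (1 - true_p) / n)  # textbook standard error for comparison
```

Any one sample can miss by a few points, but the estimates cluster around the true value, and the spread shrinks as n grows.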

Working with existing weights

  • Weights are often calculated by raking: adjusting weights algorithmically until the weighted sample matches the population's marginal distributions.
  • A marginal distribution is simply the overall distribution of one variable in the population (for example, the distribution of age). We call it "marginal" to distinguish it from a "conditional" distribution, such as the distribution of age given gender or education.
  • To calculate weights: pick the variables you want to weight on, provide the true population numbers for each of those variables (the marginal distributions), and let R create the weights algorithmically. An example is provided at the end.
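
To make the idea of raking concrete, here is a hand-rolled sketch in base R on a toy sample. It is not the routine used later in these slides; the data, the target marginals, and the fixed 100 iterations are all illustrative assumptions.

```r
# Hypothetical sample: gender and age group for 8 respondents
samp <- data.frame(
  gender = c("F", "F", "F", "F", "F", "M", "M", "M"),
  age    = c("young", "old", "old", "old", "old", "young", "young", "old"),
  stringsAsFactors = FALSE
)

# Assumed population marginals (illustrative numbers)
target_gender <- c(F = 0.5, M = 0.5)
target_age    <- c(young = 0.4, old = 0.6)

# Start with equal weights that sum to 1
w <- rep(1 / nrow(samp), nrow(samp))

for (i in 1:100) {
  # Rescale so the weighted gender shares match the gender target
  cur <- tapply(w, samp$gender, sum)
  w <- w * as.numeric(target_gender[samp$gender]) / as.numeric(cur[samp$gender])
  # Rescale so the weighted age shares match the age target
  cur <- tapply(w, samp$age, sum)
  w <- w * as.numeric(target_age[samp$age]) / as.numeric(cur[samp$age])
}

tapply(w, samp$gender, sum)  # weighted gender shares after raking
tapply(w, samp$age, sum)     # weighted age shares after raking
```

Each pass fixes one marginal and slightly disturbs the other; repeating the passes brings both close to their targets at the same time.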

Working with existing weights

  • Almost all the time, weights either have a mean of 1 or sum to 1. Sometimes weights sum to the population size.
  • Weights with a mean of 1 indicate how many respondents each respondent is worth in the sample.
  • It's not very important which scale the weights are on (whether they sum to 1, to the sample size, or to the population size). You can always rescale them: if the weights sum to 1, multiplying them by the sample size N gives them a mean of 1, multiplying them by the population size makes them sum to the population, and vice versa.
  • Importantly, the scale will not affect analyses such as weighted means and proportions.
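
A quick check of the rescaling point, using simulated values. The outcome, the raw weights, and the 38M population size are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed

x <- rnorm(200, mean = 50, sd = 10)  # hypothetical outcome
w <- runif(200, 0.5, 2)              # hypothetical raw weights

w_sum1  <- w / sum(w)          # weights that sum to 1
w_mean1 <- w_sum1 * length(w)  # weights with a mean of 1
w_pop   <- w_sum1 * 38e6       # weights that sum to an assumed population of 38M

# The weighted mean is the same on all three scales
c(weighted.mean(x, w_sum1),
  weighted.mean(x, w_mean1),
  weighted.mean(x, w_pop))
```

The three results agree because a weighted mean divides by the sum of the weights, so any constant multiplier cancels out.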

Simple weighted mean

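library(tidyverse)    # data manipulation and pipes
library(readstata13)  # read.dta13() for Stata 13+ files
library(survey)       # survey designs and weighted estimates
# Load the 2019 Canadian Election Study phone survey
df <- read.dta13("~/Downloads/2019 Canadian Election Study - Phone Survey v1.0.dta") %>%
  as_tibble()
# Unweighted mean of q14
mean(df$q14)
## [1] 46.36732
# Weighted mean of q14, using the CES weights
weighted.mean(df$q14, df$weight_CES)
## [1] 46.75191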
# Weighted (wp) and unweighted (p) proportions for each answer to q31
df %>%
  group_by(q31) %>%
  summarise(nweighted = sum(weight_CES), n = n()) %>%
  ungroup() %>%
  mutate(wp = nweighted / sum(nweighted), p = n / sum(n))
## # A tibble: 5 × 5
##   q31                nweighted     n      wp       p
##   <fct>                  <dbl> <int>   <dbl>   <dbl>
## 1 (-9) Don't know       113.     117 0.0282  0.0291
## 2 (-8) Refused            8.22     7 0.00204 0.00174
## 3 (1) Better            853.     818 0.212   0.203
## 4 (2) Worse            1270.    1316 0.316   0.327
## 5 (3) About the same   1776.    1763 0.442   0.438
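# Declare the survey design: no clusters (ids = ~0), weights in weight_CES
CESDesign <- svydesign(ids = ~0,
                       weights = ~weight_CES,
                       data = df)
# Weighted mean of q14 with its standard error
svymean(~q14, CESDesign)
##       mean     SE
## q14 46.752 0.5594
# 95% confidence interval for the weighted mean
q14mean <- svymean(~q14, CESDesign)
confint(q14mean)
##        2.5 %   97.5 %
## q14 45.65546 47.84836
# 95% confidence interval for the unweighted mean
t.test(df$q14)$conf.int
## [1] 45.44254 47.29210
## attr(,"conf.level")
## [1] 0.95
# Width of the unweighted interval, then of the weighted interval
47.29210 - 45.4425; 47.84836 - 45.65546
## [1] 1.8496
## [1] 2.1929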
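
The weighted interval above is wider than the unweighted one. One common rule of thumb for why is Kish's approximate effective sample size, (sum of weights)^2 / (sum of squared weights): unequal weights behave roughly like a smaller equal-weight sample. The sketch below uses simulated weights, not the CES weights, and it is a rough approximation, not necessarily how the survey package computes its standard errors.

```r
set.seed(2)  # arbitrary seed

w <- runif(4000, 0.3, 3)  # hypothetical weights for 4,000 respondents

# Kish's approximate effective sample size
n_eff <- sum(w)^2 / sum(w^2)

length(w)                # nominal sample size
n_eff                    # effective sample size, never above the nominal one
sqrt(length(w) / n_eff)  # rough factor by which standard errors grow
```

With perfectly equal weights the effective sample size equals the nominal one; the more the weights vary, the more it drops.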
# Weighted counts for each answer to q31
svytable(~q31, CESDesign)
## q31
## (-9) Don't know (-8) Refused (-7) Skipped (1) Better
## 113.385743 8.218755 0.000000 852.964985
## (2) Worse (3) About the same
## 1270.314932 1776.115585
# Weighted proportions of q31 answers, with standard errors, within each age group
svyby(~q31, by = ~age_range, CESDesign, svymean)
## age_range q31(-9) Don't know q31(-8) Refused
## (1) 18-24 years old (1) 18-24 years old 0.09858770 0.003671925
## (2) 25-34 years old (2) 25-34 years old 0.05292624 0.003662916
## (3) 35-44 years old (3) 35-44 years old 0.03218542 0.001306650
## (4) 45-54 years old (4) 45-54 years old 0.01587141 0.000000000
## (5) 55+ years old (5) 55+ years old 0.01493389 0.002451560
## q31(-7) Skipped q31(1) Better q31(2) Worse
## (1) 18-24 years old 0 0.1356857 0.3535161
## (2) 25-34 years old 0 0.2031205 0.3338878
## (3) 35-44 years old 0 0.1922825 0.3447326
## (4) 45-54 years old 0 0.2099687 0.3304475
## (5) 55+ years old 0 0.2334403 0.2888716
## q31(3) About the same se.q31(-9) Don't know
## (1) 18-24 years old 0.4085385 0.021733816
## (2) 25-34 years old 0.4064025 0.011124884
## (3) 35-44 years old 0.4294928 0.007988759
## (4) 45-54 years old 0.4437124 0.005838565
## (5) 55+ years old 0.4603027 0.003182128
## se.q31(-8) Refused se.q31(-7) Skipped se.q31(1) Better
## (1) 18-24 years old 0.003668756 0 0.02578584
## (2) 25-34 years old 0.003654493 0 0.02000676
## (3) 35-44 years old 0.001306358 0 0.01704271
## (4) 45-54 years old 0.000000000 0 0.01774903
## (5) 55+ years old 0.001395807 0 0.01192614
## se.q31(2) Worse se.q31(3) About the same
## (1) 18-24 years old 0.03518463 0.03584606
## (2) 25-34 years old 0.02347812 0.02450288
## (3) 35-44 years old 0.02076485 0.02173208
## (4) 45-54 years old 0.02006696 0.02150164
## (5) 55+ years old 0.01248310 0.01393730

Making your own weights

# Keep the outcome and the three weighting variables, stored as character
df_sub <- df %>% select(q31, q4, q3, age_range)
df_sub <- df_sub %>% mutate(across(everything(), as.character))
# Target marginals, as proportions, one per column of the model matrix built
# from ~ 1 + q4 + q3 + age_range. The first level of each variable is the
# reference category, so it is commented out.
marginals <- c(`(Intercept)` = 1,
               # Province (q4)
               # `q4(1) Newfoundland and Labrador` = 0.0115942028985507,
               `q4(10) British Columbia` = 0.144927536231884,
               `q4(2) Prince Edward Island` = 0.00289855072463768,
               `q4(3) Nova Scotia` = 0.0289855072463768,
               `q4(4) New Brunswick` = 0.0289855072463768,
               `q4(5) Quebec` = 0.231884057971014,
               `q4(6) Ontario` = 0.376811594202899,
               `q4(7) Manitoba` = 0.0289855072463768,
               `q4(8) Saskatchewan` = 0.0289855072463768,
               `q4(9) Alberta` = 0.115942028985507,
               # Gender (q3)
               # "q3(1) Male" = 0.48,
               "q3(2) Female" = 0.48, "q3(3) Other" = 0.04,
               # Age group (age_range)
               # "age_range(1) 18-24 years old" = 0.1,
               "age_range(2) 25-34 years old" = 0.2,
               "age_range(3) 35-44 years old" = 0.15,
               "age_range(4) 45-54 years old" = 0.2,
               "age_range(5) 55+ years old" = 0.35)
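# Calibration formula: intercept plus the three weighting variables
tmp_form <- "~ 1 + q4 + q3 + age_range"
# Start from a design with equal weights and no clustering
surveyDesign <- svydesign(ids = ~1,
                          weights = ~1,
                          data = df_sub)
# Rake the weights until the weighted marginals match the targets
surveyDesign <- calibrate(design = surveyDesign,
                          formula = as.formula(tmp_form),
                          calfun = "raking",
                          population = marginals,
                          maxit = 2000)
# Extract the calibrated weights (they sum to 1, the intercept target)
df_sub$weight <- weights(surveyDesign)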
# Compare the new weights (rescaled to a mean of 1) with the CES weights
plot(df_sub$weight * nrow(df_sub), df$weight_CES)

# Sort by weight, largest first, to spot extreme weights
df_sub %>% arrange(desc(weight))
## # A tibble: 4,021 × 5
## q31 q4 q3 age_range weight
## <chr> <chr> <chr> <chr> <dbl>
## 1 (3) About the same (10) British Columbia (3) Other (2) 25-34 years … 4.00e-2
## 2 (3) About the same (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## 3 (1) Better (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## 4 (2) Worse (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## 5 (1) Better (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## 6 (-9) Don't know (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## 7 (3) About the same (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## 8 (3) About the same (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## 9 (2) Worse (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## 10 (3) About the same (6) Ontario (2) Female (1) 18-24 years … 8.88e-4
## # … with 4,011 more rows
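# Trim extreme weights: cap them at the 99th percentile
cap <- quantile(df_sub$weight, 0.99)
df_sub$weight[df_sub$weight > cap] <- cap
# Compare the trimmed weights (rescaled by the sample size) with the CES weights
plot(df_sub$weight * nrow(df_sub), df$weight_CES)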


When to use weights?

  • As a heuristic, it's OK to use them all the time.
  • Always use them for description (for inferring a population proportion from the sample).
  • It is less clear whether to use them when running regressions. See: https://www.jstor.org/stable/24735988#metadata_info_tab_contents
  • If you run a regression where you control for the weighting variables, it's OK not to use weights.
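
The points above about description versus regression can be illustrated with a simulation. This is a sketch under strong assumptions: the data are simulated, the model is correctly specified, and the only weighting variable (g) is included as a control. Variable names are hypothetical.

```r
set.seed(3)  # arbitrary seed

# Hypothetical population: g is the weighting variable (about 50/50)
n <- 5000
g <- rbinom(n, 1, 0.5)
x <- rnorm(n)
y <- 1 + 2 * x + 3 * g + rnorm(n)

# Group g == 1 is harder to reach: only about 25% of its members respond
keep <- g == 0 | runif(n) < 0.25
s <- data.frame(y, x, g)[keep, ]
s$w <- ifelse(s$g == 1, 4, 1)  # weight = 1 / probability of responding

# Description: the unweighted mean is biased, the weighted mean is not
mean(y)                  # population mean
mean(s$y)                # unweighted sample mean
weighted.mean(s$y, s$w)  # weighted sample mean

# Regression controlling for g: both versions recover a slope near 2
fit_u <- lm(y ~ x + g, data = s)
fit_w <- lm(y ~ x + g, data = s, weights = w)
coef(fit_u)["x"]
coef(fit_w)["x"]
```

In this setup the weights matter a lot for the mean of y, but hardly at all for the slope on x once g is in the model.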

Let's look at the code: Weights Example.R
