Complex Surveys: A Guide to Analysis Using R by Thomas Lumley

Author

Justin Savoie

These are my notes from the book. I don’t distinguish between direct quotes from the book and my own comments. For illustrative purposes, I replicate some of the analysis using the package using basic algebra. This is a work in progress.

1 Basic Tools

1.1 Goals of inference

1.1.1 Population or process

Most of statistics is model-bused (think regression,MRP). The analysis of complex survey samples, in contrast, is usually design-based. The model represents the process; can be generalized to other situations. The goal of the design-based analysis is to estimate features of the fixed population.

1.1.2 Probability samples

The fundamental statistical concept in design-based inference is the probability sample or random sample. “Taking a random sample” of 1000 individuals means a sampling procedure when any subset of 1000people from the population is equally likely to be selected.

A stratified random sample is still a probability sample (e.g., taking a simple random sample of 20 people from each state).

What makes aprobability sample is the procedure for taking samples from a population.

  • Every individual in the population must have a non-zero probability of ending up in the sample (written \(\pi_i\) for individual \(i\))
  • The probability \(\pi_i\) must be known for every individual
  • Every pair of individuals in the sample must have a non-zero probability of both ending up in the sample (\(\pi_{ij}\))
  • The probability \(\pi_{ij}\) must be known for every pair that does end up in the sample.

The first two properties are necessary in order to get valid population estimates; the last two are necessary to work out the accuracy of the estimates. If individuals were sampled independently of each other the first two properties would guarantee the last two (which is the case in the surveys I work with).

1.1.3 Sampling weights

The fundamental statistical idea behind all of design-based inference is that an individual sampled with a sampling probability of \(\pi_i\) represents \(1/\pi_i\) individuals in the population. The value \(1/\pi_i\) is called the sampling weight.

Given a sample of size n,the (Horvitz-Thompson) estimator \(\hat{T}_X\) for the population total \(T_X\) of \(X\) is \[\hat{T}_X=\sum_{i=1}^{n}\frac{1}{\pi_i}X_i\]

The variance estimate is

\[\mathrm{var}[\hat{T}_X] = \sum_{i,j} \left( \frac{X_i X_j}{\pi_{ij}} - \frac{X_i}{\pi_i} \frac{X_j}{\pi_j} \right)\] Knowing the formula for the variance estimator is less important to the applied user, but it is useful to note two things. The first is that the formula applies to any design, however complicated, where \(\pi_i\) and \(\pi_ij\) are known for the sampled observations. The second is that the formula depends on the pairwise sampling probabilities \(\pi_ij\), not just on the sampling weights; this is how correlations in the sampling design enter the computations.

Statisticians use the term ‘weight’ to mean at least three different things.

  1. sampling weights: A sampling weight of 1000 means that the observation represents 1000 individuals in the population.

  2. precision weights: A precision (or inverse-variance) weight of 1000 means that the observation has 1000 times lower variance than an observation with a weight of 1.

  3. frequency weights: A frequency weight of 1000 means that the sample contains 1000 identical observations and space is being saved by using only one record in the data set to represent them.

In this book, weights are always sampling weights. They’ll all produce the same point estimates but not same standard errors.

1.1.4 Design effects

A complex survey will not have the same standard errors for estimates as a simple random sample of the same size, but many sample size calculations are only conveniently available for simple random samples.

The design effect was defined by Kish (1965) as the ratio of a variance of an estimate in a complex sample to the variance of the same estimate in a simple random sample.

Design effects are greater than 1.0 when the sample is not simple random, implying that a larger sample size is needed.

1.2 An introduction to the data

The book uses some datasets. I think they are all available through the survey package.

1.3 Obtaining the software

The book uses R.

1.4 Using R

R is very nice, really!

2 Simple and stratified sampling

2.1 Analyzing simple random samples

The Horvitz-Thompson estimator of the population total of a variable X is

\[\hat{T}_X=\sum_{i=1}^{n}\frac{1}{\pi_i}X_i\]

A sampling weight \(\pi_i\) is a value assigned to each unit in a sample, representing the number of units in the total population that the sampled unit represents. Let’s say there are 31 million Canadian adults. Let’s say, implausibly, that income is a normal centered at 50k with sd of 5k.

weight <- 31000000/1000
population <- rnorm(31000000,50000,5000)
sample <- sample(population,1000)

(T_hat_X <- (31000000/1000)*sum(sample))
[1] 1.550859e+12

Which is a ok estimate of that true number

(T_X <- sum(population))
[1] 1.550036e+12

The variance of this Horvitz-Thompson estimator is

\[\mathrm{var}[\bar{X}] = \frac{N - n}{N} \times N^2 \times \frac{\mathrm{var}[X]}{n}\]

(var_T_hat_X <- ((31000000-1000)/31000000)*31000000^2*(var(sample)/1000))
[1] 2.293811e+19

For illustration purposes, you can get this estimator using simulations

get_sum_one_sample <- function() (31000000/1000)*sum(sample(population,1000))
var(replicate(100000,get_sum_one_sample()))
[1] 2.40975e+19

The standard error of our estimator

sqrt(var_T_hat_X)
[1] 4789375170

The first term in the variance formula is the inite population correction. When the population is large, you can drop it.

The population mean of X can be estimated by dividing the estimated total by the population size, N

(mu_x <- T_hat_X / 31000000)
[1] 50027.7

The variance estimate is obtained by dividing the variance estimate for the total by \(N^2\) (get the square root for standard error)

sqrt(var_T_hat_X / 31000000^2)
[1] 154.496

Which is equal to getting the standard error of the mean with the usual formula

sd(sample)/sqrt(1000)
[1] 154.4985

The Horvitz-Thompson estimator of the population size is the sum of the sampling weights, in large surveys where the population size is known, the sampling weights are adjusted so that the estimated population size equals the true sample size.

2.1.1 Confidence intervals

Confidence intervals for estimates are computed by using a Normal distribution for the estimate, ie, for a 95% confidence interval adding and subtracting 1.96 standard errors.

2.1.2 Describing the sample to R

The Academic Performance Index is computed for all California schools based on standardised testing of students. The data sets contain information for all schools with at least 100 students and for various probability samples of the data. ?apisrs for more info.

library(survey)
Loading required package: grid
Loading required package: Matrix
Loading required package: survival

Attaching package: 'survey'
The following object is masked from 'package:graphics':

    dotchart
data(api)
dim(apisrs)
[1] 200  39

The variable fpc in this data set contains the number 6194, the number of schools in California. id=~1 says that individual schools were sampled. fpc=~fpc says that the variable called fpc in the data set contains the population size.

srs_design <- svydesign(id=~1,fpc=~fpc,data=apisrs)

We can replicate the estimates and standard errors by hand:

svytotal(~enroll, srs_design)
         total     SE
enroll 3621074 169520
(6194/200)*sum(apisrs$enroll)
[1] 3621074
(svytotal_SE <- sqrt( ((6194-200)/6194)*6194^2*(var(apisrs$enroll)/200)) )
[1] 169519.7
svymean(~enroll, srs_design)
         mean     SE
enroll 584.61 27.368
mean(apisrs$enroll)
[1] 584.61
sqrt(svytotal_SE^2 / (6194^2))
[1] 27.36837

If the population size is not specified it is necessary to specify the sampling probabilities or sampling weights. If we have the sampling weight but not the population, the standard error are slightly larger.

svytotal(~stype, srs_design)
         total     SE
stypeE 4397.74 196.00
stypeH  774.25 142.85
stypeM 1022.01 160.33
(svytotal_SE_E <- sqrt( ((6194-200)/6194)*6194^2*(var(as.numeric(apisrs$stype=="E"))/200)) )
[1] 195.9953
(svytotal_SE_H <- sqrt( ((6194-200)/6194)*6194^2*(var(as.numeric(apisrs$stype=="H"))/200)) )
[1] 142.8488
(svytotal_SE_M <- sqrt( ((6194-200)/6194)*6194^2*(var(as.numeric(apisrs$stype=="M"))/200)) )
[1] 160.3255
svymean(~stype, srs_design)
        mean     SE
stypeE 0.710 0.0316
stypeH 0.125 0.0231
stypeM 0.165 0.0259
sqrt(svytotal_SE_E^2 / (6194^2))
[1] 0.03164276
sqrt(svytotal_SE_H^2 / (6194^2))
[1] 0.02306244
sqrt(svytotal_SE_M^2 / (6194^2))
[1] 0.025884

2.2 Stratified sampling

Stratified sampling involves dividing the population up into groups called strata and drawing a separate probability sample from each one. Since a stratified sample is just a set of simple random samples from each stratum, the Horvitz-Thompson estimator of the total is just the sum of the estimated totals in each stratum and its variance is the sum of the estimated variances in each stratum.

strat_design <- svydesign(id=~1,strata=~stype,fpc=~fpc,data=apistrat)
svytotal(~enroll,strat_design)
         total     SE
enroll 3687178 114642

Off because use of stratas (treating like random sample)

(6194/200)*sum(apistrat$enroll)
[1] 4624967

Of course, this is ugly and impossible to read but the point is that it works: total is just the sum of the estimated totals in each stratum and its variance is the sum of the estimated variances in each stratum

(apistrat$fpc[apistrat$stype=="E"][1]/sum(apistrat$stype=="E")) *
  sum(apistrat$enroll[apistrat$stype=="E"]) +
(apistrat$fpc[apistrat$stype=="H"][1]/sum(apistrat$stype=="H")) *
  sum(apistrat$enroll[apistrat$stype=="H"]) +
(apistrat$fpc[apistrat$stype=="M"][1]/sum(apistrat$stype=="M")) *
  sum(apistrat$enroll[apistrat$stype=="M"])
[1] 3687178
sqrt( ((apistrat$fpc[apistrat$stype=="E"][1]-sum(apistrat$stype=="E"))/apistrat$fpc[apistrat$stype=="E"][1])*apistrat$fpc[apistrat$stype=="E"][1]^2*(var(apistrat$enroll[apistrat$stype=="E"])/sum(apistrat$stype=="E")) +
   ((apistrat$fpc[apistrat$stype=="H"][1]-sum(apistrat$stype=="H"))/apistrat$fpc[apistrat$stype=="H"][1])*apistrat$fpc[apistrat$stype=="H"][1]^2*(var(apistrat$enroll[apistrat$stype=="H"])/sum(apistrat$stype=="H")) +
     ((apistrat$fpc[apistrat$stype=="M"][1]-sum(apistrat$stype=="M"))/apistrat$fpc[apistrat$stype=="M"][1])*apistrat$fpc[apistrat$stype=="M"][1]^2*(var(apistrat$enroll[apistrat$stype=="M"])/sum(apistrat$stype=="M")))
[1] 114641.7

2.3 Replicate Weights

The standard error of a mean or any other population summary is, by definition, the standard deviation of that estimated summary across many independent samples of data.

population <- rnorm(1000000)
draw_1000 <- function(){sample(population,1000)}
sample1 <- draw_1000()
sd(sample1)/sqrt(1000)
[1] 0.03156248
sd(replicate(1000,mean(draw_1000())))
[1] 0.03035286

The replicate-weight approach to estimating standard errors computes the standard deviation of the estimated summary across many partially independent subsets of the one sample, and extrapolates from this to the standard deviation between completely independent samples. Replicate weight methods require more computation than using the Horvitz- Thompson estimator but are easier to generalize to statistics other than the mean and total (at least for software designers). Are given as example: Balanced repeated replication, Fay’s method, jackkknife and bootstrap.

An example with the bootstrap: take a sample of observations (or clusters) with replacement from each stratum and the sampling weight is multiplied by the number of times the observation appears in this sample.

strat_design <- svydesign(id=~1,strata=~stype,fpc=~fpc,data=apistrat)
svymean(~enroll,strat_design)
         mean     SE
enroll 595.28 18.509
enroll_E <- apistrat$enroll[apistrat$stype=="E"]
fpc_E <- apistrat$fpc[apistrat$stype=="E"][1]
enroll_H <- apistrat$enroll[apistrat$stype=="H"]
fpc_H <- apistrat$fpc[apistrat$stype=="H"][1]
enroll_M <- apistrat$enroll[apistrat$stype=="M"]
fpc_M <- apistrat$fpc[apistrat$stype=="M"][1]

draw1 <- function() {(fpc_E*mean(sample(enroll_E,length(enroll_E),replace=TRUE)) +
  fpc_H*mean(sample(enroll_H,length(enroll_H),replace=TRUE)) +
  fpc_M*mean(sample(enroll_M,length(enroll_M),replace=TRUE))) /
  sum(fpc_E+fpc_H+fpc_M)}

draws1000 <- replicate(1000,draw1())
mean(draws1000);sd(draws1000)
[1] 596.3182
[1] 19.0782

Very close. I love the bootstrap, it’s so intuitive.

2.3.1 Specifying replicate weights to R

Replicate-weight approach: computes the standard deviation of the estimated summary across many partially independent subsets of the one sample, and extrapolate from this to the standard deviation between completely independent samples. Let’s skip for now.

2.3.2 Creating replicate weights to R

boot_design <- as.svrepdesign(strat_design,type='bootstrap',replicates=1000)
svymean(~enroll,boot_design)
         mean     SE
enroll 595.28 17.866

But we’ve already done something similar by hand, above.

Again, the most obvious advantage of designs with replicate weights in the survey package is that it is possible to compute standard errors for differences between subpopulations for arbitrary statistics; to perform an analysis that has not been implemented in the survey package.

Using replicate weights (or the bootstrap as I show) it is only necessary to write code to compute the weighted point estimates; this code is re-run on all the sets of replicate weights to estimate standard errors, which is usually easier than writing code for the linearization standard error estimates.

There are other notes, for example, the bootstrap is straightforward only when all the strata are large.

2.4 Other population summaries

Estimators, such as the median or a regression coefficient, are obtained by solving equations that can be written as population totals or means. Standard errors for these estimates can be obtained with replicate weights. Alternatively we can use linearization, dicussed in Appendix.

2.4.1 Quantiles

The median is defined implicitly rather than explicitly: the estimated population size above and below the median are equal.

Bootstrapping them:

chis_adult <- foreign::read.dta("~/Downloads/adult.dta") # get on book website
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('sragef')
for 'srage_p' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('povgwd')
for 'povgwd_p' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw0' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw1' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw2' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw3' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw4' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw5' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw6' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw7' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw8' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw9' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw10' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw11' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw12' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw13' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw14' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw15' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw16' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw17' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw18' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw19' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw20' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw21' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw22' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw23' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw24' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw25' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw26' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw27' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw28' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw29' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw30' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw31' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw32' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw33' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw34' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw35' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw36' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw37' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw38' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw39' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw40' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw41' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw42' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw43' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw44' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw45' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw46' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw47' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw48' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw49' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw50' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw51' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw52' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw53' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw54' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw55' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw56' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw57' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw58' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw59' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw60' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw61' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw62' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw63' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw64' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw65' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw66' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw67' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw68' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw69' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw70' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw71' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw72' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw73' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw74' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw75' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw76' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw77' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw78' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw79' are missing
Warning in foreign::read.dta("~/Downloads/adult.dta"): value labels ('rakedf')
for 'rakedw80' are missing
chis <- svrepdesign(variables=chis_adult[,1:418] ,
repweights=chis_adult [,420:499] , weights=chis_adult[,419], combined.weights=TRUE, type="other", scale=1, rscales=1)
svyquantile(~bmi_p,design=chis,quantiles=c(0.25,0.5,0.75))
$bmi_p
     quantile ci.2.5 ci.97.5         se
0.25    22.68  22.66   22.81 0.03767982
0.5     25.75  25.69   25.80 0.02763161
0.75    29.18  29.12   29.29 0.04270393

attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"
sum_w <- sum(chis_adult$rakedw0)
bmi_order <- order(chis_adult$bmi_p)
median_loc <- which(cumsum(chis_adult$rakedw0[bmi_order])>(sum_w)/2)[1]
chis_adult$bmi_p[bmi_order][median_loc]
[1] 25.75
median_loc <- which(cumsum(chis_adult$rakedw0[bmi_order])>(sum_w)/4)[1]
chis_adult$bmi_p[bmi_order][median_loc]
[1] 22.68
boot <- function(){
  temp <- chis_adult[,c("rakedw0","bmi_p")]
  temp_boot <- temp[sample(1:nrow(temp),replace=TRUE),]
  sum_w <- sum(temp_boot$rakedw0)
  bmi_order <- order(temp_boot$bmi_p)
  median_loc <- which(cumsum(temp_boot$rakedw0[bmi_order])>(sum_w)/2)[1]
  temp_boot$bmi_p[bmi_order][median_loc]
}
v_means <- replicate(500,boot())
mean(v_means);sd(v_means)
[1] 25.72364
[1] 0.04076844

SE is bigger than the result from svyquantile() but actually in line with what’s in the book. Strange.

I’m skipping many chapters. This is what I’m using most often.

3 Post-stratification raking calibration

This is chapter 7.

Raking by hand:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand() masks Matrix::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ tidyr::pack()   masks Matrix::pack()
✖ tidyr::unpack() masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
gender <- sample(c("M","F"),100,replace=TRUE)
age <- sample(c("Y","O"),100,replace=TRUE)
educ <- sample(c("HS","C","U"),100,replace=TRUE)

df <- data.frame(gender,age,educ)

design <- svydesign(ids = ~1, data = df, weights = ~rep(1, nrow(df)))

gender_marg <- data.frame(gender = c("M", "F"), Freq = c(40, 60))
age_marg <- data.frame(age = c("Y", "O"), Freq = c(30, 70))
educ_marg <- data.frame(educ = c("HS", "C", "U"), Freq = c(40, 40, 20))

raked_design <- rake(design, 
                     sample.margins = list(~gender, ~age, ~educ), 
                     population.margins = list(gender_marg, age_marg, educ_marg))

df$weights <- weights(raked_design)
stored_w <- df$weights
df$weights <- 1
store_weights <- df$weights
for (i in 1:100){
  mult_gender <- df %>% 
    group_by(gender) %>%
    summarise(sw=sum(weights)) %>%
    left_join(gender_marg,"gender") %>% mutate(mult=Freq/sw)
  mult_age <- df %>% 
    group_by(age) %>%
    summarise(sw=sum(weights)) %>%
    left_join(age_marg,"age") %>% mutate(mult=Freq/sw)
  mult_educ <- df %>% 
    group_by(educ) %>%
    summarise(sw=sum(weights)) %>%
    left_join(educ_marg,"educ") %>% mutate(mult=Freq/sw)
  
  df <- df %>%
    left_join(mult_gender,"gender") %>%
    left_join(mult_age,"age") %>%
    left_join(mult_educ,"educ") %>%
    mutate(weights=weights*mult.x*mult.y*mult) %>%
    select(gender,age,educ,weights) %>%
    mutate(weights=100*(weights/sum(weights)))
  if (sum((df$weights-store_weights)^2)<0.00001){
    break
  }
  store_weights <- df$weights
  
}

cor(stored_w,df$weights)
[1] 1

Standard errors: