In this project, we use data from the 2019 Canadian Election Study (CES) to produce an exploratory data analysis. We start with a univariate exploratory data analysis, then move to bivariate analysis. The data are available here (version 1.1, Stata 14 format) or directly on the class website.
Section 1 is a code along. You just have to run the code. There’s no code to write. However, there are questions (marked QUESTION: in red) to answer. Answer them directly in the qmd file.
Do not start a new Quarto file from scratch. Work from the Quarto file available on the website. Add inline text to answer the questions in the qmd file.
In Section 2, I ask you to run an analysis similar to the one in Section 1, but on other variables of your choice. You can pick any variables from the 2019 CES, or from another dataset if you prefer.
Project 1 is due on November 8. When you are done, knit this Quarto file to html. Submit both the html file and this .qmd (Quarto) file.
library(tidyverse)
# The CES data provided is in Stata format, so we need haven
library(haven)
# We need e1071 for kurtosis and skewness
library(e1071)
# We need kableExtra to produce nice html data tables
library(kableExtra)
# Read in the data, assign to df
df <- read_stata("2019 Canadian Election Study - Phone Survey v1.1.dta")
# To convert numeric value labels to factor
library(labelled)
Use the glimpse function on the dataset.
glimpse(df)
## Rows: 4,021
## Columns: 275
## $ sample_id <dbl> 18, 32, 39, 59, 61, 69, 157, 158, 165, 167, 185…
## $ survey_end_CES <chr> "2019-09-23 15:48:29-06", "2019-09-12 18:02:30-…
## $ survey_end_month_CES <dbl> 9, 9, 9, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 9, …
## $ survey_end_day_CES <dbl> 23, 12, 10, 10, 12, 17, 12, 14, 10, 12, 16, 12,…
## $ num_attempts_CES <dbl> 5, 1, 1, 6, 1, 1, 1, 4, 1, 1, 4, 1, 9, 2, 1, 1,…
## $ interviewer_id_CES <dbl> 161182, 151152, 161182, 147601, 151152, 2503, 2…
## $ interviewer_gender_CES <chr> "Female", "Male", "Female", "Female", "Male", "…
## $ language_CES <dbl+lbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2…
## $ phonetype_CES <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ survey_end_PES <chr> "2019-11-08 14:24:14-07", "", "2019-11-09 13:08…
## $ survey_end_month_PES <dbl+lbl> 11, NA, 11, NA, 11, 10, 10, 11, NA, 10, 11,…
## $ survey_end_day_PES <dbl+lbl> 8, NA, 9, NA, 4, 28, 28, 7, NA, 25, 6,…
## $ num_attempts_PES <dbl+lbl> 4, NA, 4, NA, 3, 1, 3, 3, NA, 0, 2,…
## $ interviewer_id_PES <dbl+lbl> 2503, NA, 161182, NA, 164893, 2…
## $ interviewer_gender_PES <chr> "Female", "", "Female", "", "Female", "Female",…
## $ language_PES <dbl+lbl> 2, NA, 2, NA, 2, 2, 2, 2, NA, 2, 2,…
## $ phonetype_PES <dbl+lbl> 2, NA, 2, NA, 2, 2, 2, 2, NA, NA, 2,…
## $ mode_PES <dbl+lbl> 1, NA, 1, NA, 1, 1, 1, 1, NA, 2, 1,…
## $ phone_type <dbl+lbl> 2, 2, 2, 3, 2, 2, 2, 3, 2, 3, 3, 3, 3, 2, 2…
## $ weight_CES <dbl> 0.9019529, 0.9019529, 0.9019529, 1.2334642, 0.9…
## $ weight_PES <dbl+lbl> 1.030709, NA, 1.030709, NA, 1.0…
## $ c1 <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ c2a <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ c3 <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ q1 <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ q2 <dbl+lbl> 1963, 1973, 1994, 2000, 1984, 1939, 1999, 1…
## $ q3 <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ q4 <dbl+lbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ q6 <dbl+lbl> 3, 2, 1, 2, 4, 3, 3, 2, 2, 2, 2, 2, 2, 3, 4…
## $ q7 <chr> "economie", "Finances", "agriculture", "l'envir…
## $ q8 <dbl+lbl> 1, 8, 3, 5, 3, 4, -9, 3, 2, 1, 6,…
## $ q8_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q9 <dbl+lbl> 8, 10, 10, 6, 10, 10, 6, 8, 7, 7, 8,…
## $ q10 <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 3…
## $ q11 <dbl+lbl> -9, -9, 1, 4, 3, 4, 5, 4, 2, -9, 6,…
## $ q11_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q12 <dbl+lbl> -9, -9, NA, NA, NA, NA, NA, NA, NA, 1, NA,…
## $ q12_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q13 <dbl+lbl> 2, 2, 2, 2, 4, 4, 4, 2, 3, 2, 3, 3, 2, 2, 1…
## $ q14 <dbl+lbl> 60, 70, 70, 75, 10, 0, 50, 65, 50, 70, 15,…
## $ q15 <dbl+lbl> 40, 55, 60, 40, 10, 30, 20, 25, 80, 10, 50,…
## $ q16 <dbl+lbl> 40, 40, 55, 85, 90, 0, 70, 75, 10, 40, 20,…
## $ q17 <dbl+lbl> 40, 10, 50, 80, 49, 100, 40, 80, 50…
## $ q18 <dbl+lbl> 30, 40, 50, 75, 10, 30, 70, 75, 0, 0, 0,…
## $ q19 <dbl+lbl> 10, 15, -6, 40, 0, 0, -6, 0, 0, -6, 95,…
## $ q20 <dbl+lbl> 70, 50, 70, 70, 25, 0, 35, 70, 50, 70, 10,…
## $ q21 <dbl+lbl> 40, 50, 40, 55, 25, 30, 35, 15, 80, 40, 30,…
## $ q22 <dbl+lbl> 30, 45, 70, 90, 80, 0, 65, 80, -6, 60, 25,…
## $ q23 <dbl+lbl> 50, 10, 80, 50, -9, 100, -6, 85, 50…
## $ q24 <dbl+lbl> 70, 40, 60, 70, 40, 30, 50, 77, -6, 40, 10,…
## $ q25 <dbl+lbl> 70, 15, 30, 50, 20, 0, -6, 5, 10, 40, 98,…
## $ q27_a <dbl+lbl> NA, 3, 1, NA, 1, 3, 1, 1, 1, 1, 1,…
## $ q27_b <dbl+lbl> 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1…
## $ q27_c <dbl+lbl> 3, 3, 1, 3, 1, 3, 3, -8, 1, 3, 1,…
## $ q27_d <dbl+lbl> 3, 3, 3, 2, 1, 3, 2, 2, 3, 3, 3,…
## $ q27_e <dbl+lbl> 2, 3, 3, 1, 3, 2, 3, 1, 3, 3, 3, 1, 3, 1, 2…
## $ q31 <dbl+lbl> 2, 3, 1, 3, 1, 2, -9, 1, 3, 3, 2,…
## $ q32 <dbl+lbl> 3, 3, 1, 1, 2, 2, 1, 1, 3, 3, 2,…
## $ q33 <dbl+lbl> 1, 2, 1, 1, 3, 2, 2, 1, 2, 1, 6,…
## $ q33_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q34 <dbl+lbl> 5, 1, 5, 5, 3, -9, 5, 5, -8, 1, 6,…
## $ q34_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q35 <dbl+lbl> 1, 2, 1, 1, 1, 2, 3, 1, 2, 1, 2,…
## $ q35_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q36 <dbl+lbl> 2, 1, 2, 2, 2, 1, 5, 2, 1, 3, 1,…
## $ q36_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q37 <dbl+lbl> 1, 1, 1, 3, 1, 4, 5, 1, 1, 1, 6,…
## $ q37_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q38 <dbl+lbl> 2, 2, 2, 1, 2, 1, 3, 3, 2, 2, 2,…
## $ q38_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q39 <dbl+lbl> 3, 1, 3, 1, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 2…
## $ q40 <dbl+lbl> 3, 3, 3, 1, 1, 3, 3, 1, 3, 3, 3, 3, 3, 1, 3…
## $ q75 <dbl+lbl> 3, 3, 3, 5, 4, 4, 2, 4, 3, 4, 4,…
## $ q44 <dbl+lbl> 3, 3, 4, 2, 6, 5, 2, 5, 5, 5, 6, 2, 3, 5, 1…
## $ q76 <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 2…
## $ q45 <dbl+lbl> 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2…
## $ q46 <dbl+lbl> 3, 3, 3, 2, 4, 4, 4, 3, 2, 2, 2,…
## $ q47 <dbl+lbl> 3, 3, 3, 1, 3, 3, 3, 3, 2, 1, 3, 1, 3, 1, 2…
## $ q48 <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 4, 1, 1, 3…
## $ q49 <dbl+lbl> 4, 4, 4, 4, 1, 1, 4, 1, 4, 3, 4, 4, 4, 4, 4…
## $ q52 <dbl+lbl> 1, 1, 3, 3, 3, 4, 3, 8, 2, 1, 6, 4, 2, 8, 8…
## $ q52_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q53 <dbl+lbl> 2, 2, 2, 2, 2, 2, 3, NA, 2, 3, 2,…
## $ q54 <dbl+lbl> 2, 2, 2, 3, 2, 1, 2, 3, 1, 1, 2,…
## $ q59 <dbl+lbl> 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1, 2…
## $ q60 <dbl+lbl> 1, 1, 3, NA, 3, -9, NA, 4, 1, 1, 2,…
## $ q60_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q77 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ q43 <dbl+lbl> 2, 4, 3, 2, 3, 1, 3, 1, 2, 3, 4,…
## $ q61 <dbl+lbl> 9, 8, 9, 8, 10, 4, 6, 10, 4, 7, 11,…
## $ q62 <dbl+lbl> 6, 6, 21, 21, 21, 6, 22, 21, 6, 6, 6,…
## $ q62_22_ <chr> "", "", "", "", "", "", "Beisme", "", "", "", "…
## $ q63 <dbl+lbl> 1, 3, NA, NA, NA, 3, 4, NA, 2, 2, 1,…
## $ q64 <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ q64_13_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q65 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, …
## $ q66a_1 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 1, 0, 1,…
## $ q66a_2 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_3 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_4 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_5 <dbl+lbl> 0, 0, 1, 0, 0, -9, 1, 1, 0, 0, 0,…
## $ q66a_6 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_7 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_8 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_9 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_10 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_11 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_12 <dbl+lbl> 0, 1, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_13 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_14 <dbl+lbl> 1, 0, 0, 0, 0, -9, 0, 0, 0, 0, 0,…
## $ q66a_15 <dbl+lbl> 0, 0, 0, 0, 1, -9, 0, 0, 0, 0, 0,…
## $ q66a_16 <dbl+lbl> 0, 0, 0, 0, 0, -9, 0, 0, 0, 1, 0,…
## $ q66a_17 <dbl+lbl> 0, 0, 0, 1, 0, -9, 1, 0, 0, 0, 0,…
## $ q66a_17_ <chr> "", "", "", "colon français, autochtone, canadi…
## $ q66_1 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_3 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_4 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_5 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_6 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_7 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_8 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_9 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_10 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_11 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_12 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_13 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_14 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_15 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 1,…
## $ q66_16 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_17 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_18 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, -9, NA, 0,…
## $ q66_18_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q67 <dbl+lbl> 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ q67_31_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q68 <dbl+lbl> 1, 1, 6, 9, 1, 4, 6, 1, 8, 1, 1,…
## $ q68_12_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q69 <dbl+lbl> 104000, 75000, 20000, 120000, 95000, 38…
## $ q70 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4,…
## $ q71 <dbl+lbl> 2, 4, 1, 5, 1, 1, 1, 2, 2, 4, 3, 5, 3, 2, 4…
## $ q26a <dbl+lbl> 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2…
## $ q26b <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ r1 <dbl+lbl> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1…
## $ age <dbl> 56, 46, 25, 19, 35, 80, 20, 24, 56, 49, 41, 20,…
## $ age_range <dbl+lbl> 5, 4, 2, 1, 3, 5, 1, 1, 5, 4, 3, 1, 5, 4, 2…
## $ q71r <dbl+lbl> 2, 4, 1, 5, 1, 1, 1, 2, 2, 4, 3, 5, 3, 2, 4…
## $ q70r <dbl+lbl> 5, 4, 2, 6, 5, 3, 6, 3, 3, 7, 4,…
## $ q14r <dbl+lbl> 3, 4, 4, 4, 1, 1, 3, 4, 3, 4, 1, 2, 3, 3, 5…
## $ q15r <dbl+lbl> 2, 3, 3, 2, 1, 2, 1, 2, 4, 1, 3,…
## $ q16r <dbl+lbl> 2, 2, 3, 5, 5, 1, 4, 4, 1, 2, 1, 1, 2, 4, 7…
## $ q17r <dbl+lbl> 2, 1, 3, 4, 3, 5, 2, 4, 3, 4, 1,…
## $ q18r <dbl+lbl> 2, 2, 3, 4, 1, 2, 4, 4, 1, 1, 1,…
## $ q19r <dbl+lbl> 1, 1, 7, 2, 1, 1, 7, 1, 1, 7, 5, 2, 2, 1, 7…
## $ q20r <dbl+lbl> 4, 3, 4, 4, 2, 1, 2, 4, 3, 4, 1, 1, 3, 3, 3…
## $ q21r <dbl+lbl> 2, 3, 2, 3, 2, 2, 2, 1, 4, 2, 2, 2, 3, 1, 7…
## $ q22r <dbl+lbl> 2, 3, 4, 5, 4, 1, 4, 4, 7, 3, 2, 1, 2, 5, 7…
## $ q23r <dbl+lbl> 3, 1, 4, 3, 6, 5, 7, 5, 3, 7, 1,…
## $ q24r <dbl+lbl> 4, 2, 3, 4, 2, 2, 3, 4, 7, 2, 1, 3, 7, 4, 4…
## $ q25r <dbl+lbl> 4, 1, 2, 3, 1, 1, 7, 1, 1, 2, 5, 3, 3, 1, 2…
## $ vote <dbl+lbl> 11, 11, 1, 4, 3, 4, 5, 4, 2, 1, 6,…
## $ q77eng <dbl+lbl> NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA,…
## $ q77fr <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pc1 <dbl+lbl> 1, NA, 1, NA, 1, 1, 1, 1, NA, 1, 1,…
## $ p1 <chr> "l'écologie", "", "laicité", "", "L'environneme…
## $ p2 <dbl+lbl> 1, NA, 1, NA, 1, 1, 2, 1, NA, 1, 1,…
## $ p3 <dbl+lbl> 1, NA, 4, NA, 3, 4, NA, 3, NA, 4, 6,…
## $ p3_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ p4 <dbl+lbl> 2, NA, 1, NA, 4, 2, 3, 3, NA, 2, 1,…
## $ p5 <dbl+lbl> 2, NA, 2, NA, 3, 4, 3, 2, NA, 2, 2,…
## $ p6 <dbl+lbl> 2, NA, 6, NA, 0, 3, 2, 2, NA, 4, 6,…
## $ p7 <dbl+lbl> 9, NA, 7, NA, 1, 4, 5, 5, NA, 8, 5,…
## $ p8 <dbl+lbl> 2, NA, 8, NA, 9, 0, 8, 9, NA, 4, 4,…
## $ p9 <dbl+lbl> 6, NA, 7, NA, 8, 1, 7, 5, NA, 0, 0,…
## $ p10 <dbl+lbl> 7, NA, 8, NA, 7, 10, 5, 7, NA, 9, 2,…
## $ p11 <dbl+lbl> 1, NA, 4, NA, 0, 0, 11, 0, NA, 0, 8,…
## $ p12 <dbl+lbl> 2, NA, 5, NA, 0, 4, 2, 0, NA, 5, 4,…
## $ p13 <dbl+lbl> 9, NA, 6, NA, 0, 5, 4, 6, NA, 7, 4,…
## $ p14 <dbl+lbl> 7, NA, 7, NA, 9, 5, 7, 9, NA, 7, 5,…
## $ p15 <dbl+lbl> 6, NA, 5, NA, 8, 0, 9, 3, NA, 0, 1,…
## $ p16 <dbl+lbl> 7, NA, 8, NA, 10, 8, 11, 9, NA, 9, 2,…
## $ p17 <dbl+lbl> 8, NA, 4, NA, 0, 0, 11, 0, NA, 0, 8,…
## $ p18 <dbl+lbl> 1, NA, 1, NA, 3, 1, 3, 1, NA, 3, 3,…
## $ p19 <dbl+lbl> 3, NA, 3, NA, 3, -9, 3, 3, NA, 1, 2,…
## $ p20_a <dbl+lbl> 4, NA, 5, NA, 5, 1, 5, 5, NA, 4, 3,…
## $ p20_b <dbl+lbl> 5, NA, 4, NA, 5, 3, 3, 5, NA, 2, 4,…
## $ p20_c <dbl+lbl> 1, NA, 1, NA, 1, 1, 1, 2, NA, 1, 1,…
## $ p20_d <dbl+lbl> 1, NA, 1, NA, 2, 1, 3, 1, NA, 2, 3,…
## $ p20_e <dbl+lbl> 3, NA, 5, NA, 5, 4, 5, 5, NA, 5, 5,…
## $ p20_f <dbl+lbl> 4, NA, 4, NA, 2, 1, 2, 5, NA, 3, 4,…
## $ p20_g <dbl+lbl> 2, NA, 2, NA, 1, 1, 2, 1, NA, 3, 2,…
## $ p20_h <dbl+lbl> 5, NA, 4, NA, 5, 1, 4, 5, NA, 2, 4,…
## $ p20_i <dbl+lbl> 4, NA, 4, NA, 1, 4, 4, 4, NA, 2, 4,…
## $ p20_j <dbl+lbl> 4, NA, 2, NA, 4, 4, 4, 3, NA, 4, 2,…
## $ p20_k <dbl+lbl> 4, NA, 4, NA, 3, 3, 4, 5, NA, 3, 4,…
## $ p20_l <dbl+lbl> 3, NA, 5, NA, 4, 2, 2, 3, NA, 4, 5,…
## $ p20_m <dbl+lbl> 2, NA, 4, NA, 2, 1, 2, 2, NA, 2, 4,…
## $ p20_n <dbl+lbl> 4, NA, 4, NA, 1, 1, 3, 3, NA, 1, 4,…
## $ p21_a <dbl+lbl> 1, NA, 2, NA, 2, 1, 4, 4, NA, 1, 2,…
## $ p21_b <dbl+lbl> 5, NA, 4, NA, 4, 1, 4, 3, NA, 3, 3,…
## $ p22_a <dbl+lbl> 2, NA, 1, NA, 2, 2, 2, 1, NA, 2, 2,…
## $ p22_b <dbl+lbl> 4, NA, 4, NA, 4, 1, 4, 5, NA, 1, 5,…
## $ p22_c <dbl+lbl> 4, NA, 4, NA, 3, 3, 4, 5, NA, 3, 5,…
## $ p23 <dbl+lbl> 1, NA, 1, NA, 1, 1, 1, 1, NA, 1, 2,…
## $ p24 <dbl+lbl> 1, NA, 3, NA, 3, 4, 3, 3, NA, 4, NA,…
## $ p24_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ p25_a <dbl+lbl> 2, NA, 4, NA, 3, 3, 2, 4, NA, 2, 4,…
## $ p25_b <dbl+lbl> 2, NA, 4, NA, 4, 1, 3, 4, NA, 2, 4,…
## $ p25_c <dbl+lbl> 1, NA, 1, NA, 1, 1, 1, 2, NA, 1, 2,…
## $ p25_d <dbl+lbl> 1, NA, 2, NA, 2, 2, 3, 3, NA, 1, 2,…
## $ p26 <dbl+lbl> 2, NA, 3, NA, 1, 1, 3, 3, NA, 1, 3,…
## $ p27 <dbl+lbl> 8, NA, 9, NA, 10, 7, 6, 9, NA, 6, 9,…
## $ p28 <dbl+lbl> 2, NA, 1, NA, 1, 1, 3, 1, NA, 3, 2,…
## $ p29_a <dbl+lbl> 4, NA, 3, NA, 4, 1, 4, 4, NA, 1, 1,…
## $ p29_b <dbl+lbl> 1, NA, 1, NA, 3, 1, 1, 1, NA, 1, 1,…
## $ p29_c <dbl+lbl> 2, NA, 3, NA, 3, 1, 1, 2, NA, 1, 1,…
## $ p30 <dbl+lbl> 3, NA, 3, NA, 3, 3, 3, 3, NA, 3, 3,…
## $ p31 <dbl+lbl> 2, NA, 1, NA, 1, 2, 2, 1, NA, 2, 2,…
## $ p32 <dbl+lbl> 3, NA, 2, NA, 3, 2, 4, 2, NA, 3, 3,…
## $ p33 <dbl+lbl> 4, NA, 5, NA, 4, 3, 4, 2, NA, 1, 4,…
## $ p34 <dbl+lbl> 4, NA, 5, NA, 1, 4, 5, 2, NA, 5, 5,…
## $ p35_a <dbl+lbl> 3, NA, 3, NA, 3, 5, 2, 1, NA, 3, 3,…
## $ p35_b <dbl+lbl> 3, NA, 2, NA, 1, 5, 2, 1, NA, 1, 3,…
## $ p35_c <dbl+lbl> 3, NA, 2, NA, 3, 5, 3, 1, NA, 1, 3,…
## $ p36 <dbl+lbl> 8, NA, 4, NA, 8, 12, 7, 5, NA, 5, 5,…
## $ p37 <dbl+lbl> 7, NA, 7, NA, 9, 12, 9, 7, NA, 7, 6,…
## $ p38 <dbl+lbl> 7, NA, 2, NA, 3, 12, 4, 3, NA, 3, 4,…
## $ p39 <dbl+lbl> 4, NA, 4, NA, 4, 12, 11, 4, NA, 8, 2,…
## $ p40 <dbl+lbl> 3, NA, 2, NA, 4, 12, 3, 5, NA, 7, 5,…
## $ p41 <dbl+lbl> 7, NA, 10, NA, 10, 12, 5, 9, NA, 11, 8,…
## $ p42 <dbl+lbl> 7, NA, 4, NA, 0, 12, 4, 1, NA, 7, 5,…
## $ p43 <dbl+lbl> 3, NA, 3, NA, 5, 1, 1, 2, NA, 1, 2,…
## $ p44 <dbl+lbl> 2, NA, 1, NA, 1, 1, 3, 1, NA, 1, 3,…
## $ p45 <dbl+lbl> 2, NA, 2, NA, 1, 2, 2, 2, NA, 2, 2,…
## $ p46 <dbl+lbl> 2, NA, 2, NA, NA, 1, 1, 1, NA, 2, 1,…
## $ p47 <dbl+lbl> NA, NA, NA, NA, 3, 4, 3, 3, NA, NA, 6,…
## $ p47_7_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ p48 <dbl+lbl> NA, NA, NA, NA, 1, 1, 2, 2, NA, NA, 2,…
## $ p49 <dbl+lbl> 6, NA, 7, NA, 12, 7, 11, 6, NA, 2, 7,…
## $ p50 <dbl+lbl> 1, NA, 6, NA, 6, 5, 6, 6, NA, 1, 2,…
## $ p51 <dbl+lbl> 2, NA, 2, NA, 1, 2, 1, 2, NA, 2, 2,…
## $ p52 <chr> "Gestionnaire dans le domaine funéraire", "", "…
## $ p53 <dbl+lbl> 3, NA, NA, NA, 1, NA, NA, -9, NA, 1, 2,…
## $ p54 <dbl+lbl> 7, NA, 1, NA, 1, 1, 1, 3, NA, 2, 5,…
## $ p55 <dbl+lbl> 2, NA, 2, NA, 2, 2, 2, 2, NA, 2, 2,…
## $ p56_1 <dbl+lbl> 0, NA, 0, NA, 0, 0, 0, 0, NA, 0, 0,…
## $ p56_2 <dbl+lbl> 1, NA, 1, NA, 1, 1, 1, 1, NA, 1, 1,…
## $ p56_3 <dbl+lbl> 0, NA, 0, NA, 0, 0, 0, 0, NA, 0, 0,…
## $ p56_4 <dbl+lbl> 0, NA, 0, NA, 0, 0, 0, 0, NA, 0, 0,…
## $ p56_5 <dbl+lbl> 0, NA, 0, NA, 0, 0, 0, 0, NA, 0, 0,…
## $ p56_6 <dbl+lbl> 0, NA, 0, NA, 0, 0, 0, 0, NA, 0, 0,…
## $ p56_7 <dbl+lbl> 0, NA, 0, NA, 0, 0, 0, 0, NA, 0, 0,…
## $ p56_8 <dbl+lbl> 0, NA, 0, NA, 0, 0, 0, 0, NA, 0, 0,…
## $ p56_9 <dbl+lbl> 0, NA, 0, NA, 0, 0, 0, 0, NA, 0, 0,…
## $ p56_9_ <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ p57 <dbl+lbl> 5, NA, 5, NA, 5, 5, 5, 5, NA, 3, 1,…
## $ p6r <dbl+lbl> 2, NA, 6, NA, 0, 3, 2, 2, NA, 4, 6,…
## $ p7r <dbl+lbl> 9, NA, 7, NA, 1, 4, 5, 5, NA, 8, 5,…
## $ p8r <dbl+lbl> 2, NA, 8, NA, 9, 0, 8, 9, NA, 4, 4,…
## $ p9r <dbl+lbl> 6, NA, 7, NA, 8, 1, 7, 5, NA, 0, 0,…
## $ p10r <dbl+lbl> 7, NA, 8, NA, 7, 10, 5, 7, NA, 9, 2,…
## $ p11r <dbl+lbl> 1, NA, 4, NA, 0, 0, -9, 0, NA, 0, 8,…
## $ p12r <dbl+lbl> 2, NA, 5, NA, 0, 4, 2, 0, NA, 5, 4,…
## $ p13r <dbl+lbl> 9, NA, 6, NA, 0, 5, 4, 6, NA, 7, 4,…
## $ p14r <dbl+lbl> 7, NA, 7, NA, 9, 5, 7, 9, NA, 7, 5,…
## $ p15r <dbl+lbl> 6, NA, 5, NA, 8, 0, 9, 3, NA, 0, 1,…
## $ p16r <dbl+lbl> 7, NA, 8, NA, 10, 8, -9, 9, NA, 9, 2,…
## $ p17r <dbl+lbl> 8, NA, 4, NA, 0, 0, -9, 0, NA, 0, 8,…
## $ p36r <dbl+lbl> 8, NA, 4, NA, 8, -5, 7, 5, NA, 5, 5,…
## $ p37r <dbl+lbl> 7, NA, 7, NA, 9, -5, 9, 7, NA, 7, 6,…
## $ p38r <dbl+lbl> 7, NA, 2, NA, 3, -5, 4, 3, NA, 3, 4,…
## $ p39r <dbl+lbl> 4, NA, 4, NA, 4, -5, -9, 4, NA, 8, 2,…
## $ p40r <dbl+lbl> 3, NA, 2, NA, 4, -5, 3, 5, NA, 7, 5,…
## $ p41r <dbl+lbl> 7, NA, 10, NA, 10, -5, 5, 9, NA, -9, 8,…
## $ p42r <dbl+lbl> 7, NA, 4, NA, 0, -5, 4, 1, NA, 7, 5,…
## $ feduid <dbl> 24015, 24046, 24059, 24011, 24027, 24045, 24037…
## $ fedname <chr> "Bourassa", "Manicouagan", "Québec", "Beloeil--…
QUESTION: How many individuals are there in the dataset? How many variables? What are the column types present in the data (they are between “<>” in the output of the glimpse() function)? What is a dbl+lbl? You can read the first section of this document (up until section Variable labels).
Let’s look at the distribution of age.
ggplot(df,aes(x=age)) +
geom_histogram()
Let’s calculate the number of values for which age is not missing, the mean and the median.
sample_size_age <- df |>
summarise(sample_size_age=sum(!is.na(age))) |>
pull(sample_size_age)
# we could also use the tidyverse to get the mean and median but since it's simple
# let's just use the compact way
my_mean <- mean(df$age,na.rm=TRUE)
my_median <- median(df$age,na.rm=TRUE)
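The comment above mentions that the tidyverse could also produce these numbers. A minimal sketch of that route follows, assuming dplyr is installed; the data frame `toy` and its ages are made up for illustration and are not CES values.

```r
library(dplyr)

# Hypothetical toy data standing in for df; one age is missing on purpose.
toy <- data.frame(age = c(19, 25, 41, 56, NA, 80))

# One summarise() call returns all three statistics in a one-row data frame.
age_stats <- toy |>
  summarise(
    n_age      = sum(!is.na(age)),          # non-missing observations
    mean_age   = mean(age, na.rm = TRUE),   # NA is ignored
    median_age = median(age, na.rm = TRUE)
  )

age_stats
```

The base R calls used in this section give the same numbers; the summarise() version is mainly convenient when several statistics should end up in one table.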
Let’s redo our histogram, this time adding a vertical line at the mean. We can also add a caption that reports the sample size programmatically.
ggplot(df,aes(x=age)) +
geom_histogram(binwidth=1,fill="white",color="black") +
theme_classic() +
labs(
x="Age",
y="Count (in survey)",
title="Age distribution in Canada",
# You can read on ?paste0 and ?format
caption=paste0("Data from CES 2019; n = ",format(sample_size_age,big.mark = ","))
) +
geom_vline(aes(xintercept=my_mean),linetype=2) +
annotate("text", x = my_mean-2, y = 90, label = "mean",angle = 90)
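The caption in the plot above is built with paste0() and format(). As a small standalone sketch, here is what those two calls return on their own; the variable `n` is a stand-in for sample_size_age and is set to the number of rows reported by glimpse().

```r
# Stand-in for sample_size_age.
n <- 4021

# format() with big.mark inserts a thousands separator and returns a string.
n_pretty <- format(n, big.mark = ",")

# paste0() concatenates its arguments without any separator.
caption_text <- paste0("Data from CES 2019; n = ", n_pretty)

caption_text
```

Because the caption is computed from the data, it stays correct if rows are later filtered out.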
QUESTION: In your own words, how would you describe the distribution of age?
In the data, there’s a variable called age_range. Let’s look at it with group_by and count.
df |>
group_by(age_range) |>
count()
## # A tibble: 5 × 2
## # Groups: age_range [5]
## age_range n
## <dbl+lbl> <int>
## 1 1 [(1) 18-24 years old] 256
## 2 2 [(2) 25-34 years old] 561
## 3 3 [(3) 35-44 years old] 694
## 4 4 [(4) 45-54 years old] 728
## 5 5 [(5) 55+ years old] 1782
The output above shows that age_range is a labelled double (<dbl+lbl>): a numeric variable in which each value carries a text label. In that respect it is similar to a factor.
You can convert it to a factor like this:
df$age_range <- to_factor(df$age_range)
Let’s look at the possible levels age_range can take.
levels(df$age_range)
## [1] "(-9) Don't know" "(-8) Refused" "(-7) Skipped"
## [4] "(1) 18-24 years old" "(2) 25-34 years old" "(3) 35-44 years old"
## [7] "(4) 45-54 years old" "(5) 55+ years old"
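To see why levels() can list categories that never appear in count(), here is a minimal sketch on a toy labelled vector. It assumes the haven and labelled packages are installed; the values and labels are invented for illustration.

```r
library(haven)     # labelled() constructor
library(labelled)  # to_factor()

# Toy labelled vector: three labels are declared, but -9 never occurs.
x <- labelled(
  c(1, 2, 2, 1),
  labels = c("Don't know" = -9, "18-24" = 1, "25-34" = 2)
)

f <- to_factor(x)
levels(f)   # every declared label becomes a level, used or not
table(f)    # the unused level has a count of zero

# droplevels() keeps only the levels that actually occur in the data.
levels(droplevels(f))
```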
QUESTION: How many missing values are there? (go back to the count() above ) How many Don’t know’s, Refused, Skipped? Why do you think this is the case? People could refuse to answer, it’s an option, so why are there none?
Now, plot age_range.
ggplot(df,aes(x=age_range)) +
geom_bar()
Now, imagine that these are not the groups you want. Rather, you want 18-34, 35-54, 55+. Recode the groups using the following code, which uses the cut() function.
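df <- df |>
mutate(
# cut() intervals are right-closed: (-Inf,17], (17,34], (34,54], (54,Inf)
# the argument is spelled labels= in full rather than relying on partial matching
age_group=cut(age,breaks=c(-Inf,17,34,54,Inf),
labels=c("0-17","18-34","35-54","55+")),
# drop the "0-17" level, which is empty because respondents are 18 or older
age_group=droplevels(age_group))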
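It can help to check how cut() treats values that sit exactly on a break. The sketch below uses the same breaks and labels as above on a handful of made-up ages chosen to fall on the boundaries.

```r
# Made-up ages placed on and around each break.
ages <- c(17, 18, 34, 35, 54, 55, 80)

groups <- cut(
  ages,
  breaks = c(-Inf, 17, 34, 54, Inf),
  labels = c("0-17", "18-34", "35-54", "55+")
)

# By default each interval includes its upper bound (right = TRUE),
# so 34 falls in "18-34" and 35 falls in "35-54".
data.frame(ages, groups)
```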
Plot this using a bar graph.
ggplot(df,aes(x=age_group)) +
geom_bar()
You can add labels with the count number like this.
ggplot(df, aes(x = age_group)) +
geom_bar() + labs(x = "", y = "") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, colour = "white")
Let’s drop the empty levels from the age_range factor. You can use the recode() function if you want to recode (e.g. clean) them.
df <- df |>
mutate(age_range = droplevels(age_range))
df <- df |>
mutate(age_range=recode(age_range,
"(1) 18-24 years old"="18-24",
"(2) 25-34 years old"="25-34",
"(3) 35-44 years old"="35-44",
"(4) 45-54 years old"="45-54",
"(5) 55+ years old"="55+"))
That can be plotted too.
ggplot(df,aes(x=age_range)) +
geom_bar() +
labs(x="",y="")
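The recode() call above lists every level by hand. As an alternative sketch (not the method used in this project), the same cleaning can be done with a pattern in base R's sub(). The regular expressions assume that every label has the form "(k) ... years old", which is an assumption about the label format; the factor `f` is a toy example.

```r
# Toy factor with labels shaped like the CES age_range labels.
f <- factor(c("(1) 18-24 years old", "(5) 55+ years old", "(1) 18-24 years old"))

# Remove the leading "(k) " prefix, then the trailing " years old".
clean <- sub("^\\([0-9]+\\) ", "", levels(f))
clean <- sub(" years old$", "", clean)

# Assigning to levels() renames the levels and leaves the data unchanged.
levels(f) <- clean
levels(f)
```

A pattern is shorter when there are many levels, but recode() is easier to audit because each old and new label is written out.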
Lastly, instead of visualizing age with a graph, let’s use a table to get all the summary statistics. Use kable() to output these numbers.
age_summary <- df |>
summarize(
mean_age = mean(age, na.rm = TRUE),
sd_age = sd(age, na.rm = TRUE),
min_age = min(age, na.rm = TRUE),
max_age = max(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE),
skew_age = skewness(age, na.rm = TRUE),
kurtosis_age = kurtosis(age, na.rm = TRUE),
n_age = sum(!is.na(age))
)
age_summary |>
kable(format = "simple")
mean_age | sd_age | min_age | max_age | median_age | skew_age | kurtosis_age | n_age |
---|---|---|---|---|---|---|---|
50.89033 | 16.83581 | 18 | 100 | 51 | -0.0053535 | -0.871748 | 4021 |
age_summary |>
t() |>
kable(format = "simple")
mean_age | 50.8903258 |
sd_age | 16.8358082 |
min_age | 18.0000000 |
max_age | 100.0000000 |
median_age | 51.0000000 |
skew_age | -0.0053535 |
kurtosis_age | -0.8717480 |
n_age | 4021.0000000 |
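To make the skewness and kurtosis columns less of a black box, the sketch below computes both by hand in base R. It assumes the e1071 defaults (type = 3): skewness is the third central moment divided by the cubed sample standard deviation, and kurtosis is the fourth central moment divided by the fourth power of the sample standard deviation, minus 3 (excess kurtosis). The vector `x` is a toy example.

```r
# Toy data: perfectly symmetric around 3.
x <- c(1, 2, 3, 4, 5)

n <- length(x)
m <- mean(x)
s <- sd(x)  # sample standard deviation (n - 1 denominator)

# Third and fourth central moments, divided by n.
m3 <- sum((x - m)^3) / n
m4 <- sum((x - m)^4) / n

skew_by_hand <- m3 / s^3      # 0 for symmetric data
kurt_by_hand <- m4 / s^4 - 3  # excess kurtosis; negative means flatter than a normal

c(skewness = skew_by_hand, kurtosis = kurt_by_hand)
```

If e1071 is loaded, skewness(x) and kurtosis(x) should agree with these values under the stated assumption about the default type.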
QUESTION: What’s the mean/sd/min/max/median/skewness/kurtosis? Interpret the skewness and kurtosis.
QUESTION: The t() function stands for transpose. What does t() do in practice?
Now, let’s look at the variable household income.
summary(df$q69)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9 0 60000 80331 120000 2120000
Looking at the codebook, we see that -8 and -9 should be coded as NA.
df <- df |>
mutate(hincome=ifelse(q69 %in% c(-8,-9), NA, q69))
ggplot(df,aes(x=hincome)) +
geom_histogram(binwidth = 5000)
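The effect of recoding the negative codes can be seen on a small example. The sketch below applies the same ifelse() logic to a made-up income vector and compares the mean before and after; the numbers are illustrative only.

```r
# Made-up incomes; -9 and -8 stand for "Don't know" and "Refused".
income <- c(-9, -8, 0, 60000, 120000)

# Same logic as the mutate() call above: codes become NA, the rest is kept.
income_clean <- ifelse(income %in% c(-8, -9), NA, income)

mean(income)                       # distorted: the codes are treated as real incomes
mean(income_clean, na.rm = TRUE)   # computed on valid answers only
sum(is.na(income_clean))           # number of values recoded to NA
```

Note that ifelse() returns a plain numeric vector, so any value labels attached to the original variable are dropped in the new one.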