class: center, middle, inverse, title-slide # Week 4: Importing data and tidy data ## PUBPOL 750 Data Analysis for Public Policy I ### Justin Savoie ### MPP-DS McMaster ### 2022-06-03 --- class: inverse, center, middle # Homework 1 ### Solution code for HW1 posted on the website --- class: inverse, center, middle # The pipe (%>%) --- # The pipe (%>%) - We use the pipe to emphasise a sequence of actions - The pipe means "THEN" ```r mtcars_grouped <- group_by(mtcars,cyl) count(mtcars_grouped) ``` ``` ## # A tibble: 3 × 2 ## # Groups: cyl [3] ## cyl n ## <dbl> <int> ## 1 4 11 ## 2 6 7 ## 3 8 14 ``` --- # The pipe (%>%) ```r mtcars %>% group_by(cyl) %>% count() ``` ``` ## # A tibble: 3 × 2 ## # Groups: cyl [3] ## cyl n ## <dbl> <int> ## 1 4 11 ## 2 6 7 ## 3 8 14 ``` In plain language: take the mtcars dataset THEN group by "cyl" THEN count. --- ```r mtcars %>% filter(cyl==4) %>% select(mpg,hp,vs,cyl) ``` ``` ## mpg hp vs cyl ## Datsun 710 22.8 93 1 4 ## Merc 240D 24.4 62 1 4 ## Merc 230 22.8 95 1 4 ## Fiat 128 32.4 66 1 4 ## Honda Civic 30.4 52 1 4 ## Toyota Corolla 33.9 65 1 4 ## Toyota Corona 21.5 97 1 4 ## Fiat X1-9 27.3 66 1 4 ## Porsche 914-2 26.0 91 0 4 ## Lotus Europa 30.4 113 1 4 ## Volvo 142E 21.4 109 1 4 ``` --- # Another way to think of the pipe (%>%) ```r my_vector <- c(1,2,3,4,8) mean(my_vector) ``` ``` ## [1] 3.6 ``` ```r my_vector %>% mean() ``` ``` ## [1] 3.6 ``` More formally, the pipe passes what comes before the pipe as the first argument in the function after the pipe. --- # Now that we know about the pipe, let's take a look at the homework --- class: inverse, center, middle # Data importing --- # Reading a csv ```r library(tidyverse) Parliamentarians_subset100 <- read_csv("https://www.justinsavoie.com/pubpol750-2022/homework1/Parliamentarians_subset100.csv") Parliamentarians_subset100 <- read_csv("~/Downloads/Parliamentarians_subset100.csv") Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Downloads/Parliamentarians_subset100.csv") Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Documents/personal_repos/personal_website/pubpol750-2022/Homework1/Parliamentarians_subset100.csv") Parliamentarians_subset100 <- read_csv("C:\\Users\\justinsavoie\\Downloads/Parliamentarians_subset100.csv") ``` # Reading an excel file ```r library(readxl) Parliamentarians_subset100 <- read_excel("~/Downloads/Parliamentarians.xlsx") ``` --- # Reading spss, sas, stata files ```r library(haven) #read_spss() #read_sas() #read_stata() library(readstata13) #readstata13::read.dta13() ``` --- class: inverse, center, middle # Tidy data --- # Tidy data (A) - Each variable must have its own column. - Each observation must have its own row. - Each value must have its own cell. (B) - Put each dataset in a tibble. - Put each variable in a column. --- .pull-left[ ```r table1 ``` ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ```r table2 ``` ``` ## # A tibble: 12 × 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` ] .pull-right[ ```r table3 ``` ``` ## # A tibble: 6 × 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` ```r table4a ``` ``` ## # A tibble: 3 × 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ```r table4b ``` ``` ## # A tibble: 3 × 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 ``` ] --- # Pivoting - Pivoting wider - Pivoting longer --- # Pivoting wider - column names are not names of variables, but values of a variable .pull-left[ ```r table4a ``` ``` ## # A tibble: 3 × 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ] .pull-right[ ```r table4a %>% pivot_longer(cols=c(`1999`,`2000`),names_to = 'year', values_to = 'cases') ``` ``` ## # A tibble: 6 × 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766 ``` ] --- ```r tidy4a <- table4a %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases") tidy4b <- table4b %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population") left_join(tidy4a, tidy4b) ``` ``` ## Joining, by = c("country", "year") ``` ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <chr> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- # Pivoting wider .pull-left[ ```r table2 ``` ``` ## # A tibble: 12 × 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` ] .pull-right[ ```r table2 %>% pivot_wider(names_from = type, values_from = count) ``` ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ] --- # Separating .pull-left[ ```r table3 ``` ``` ## # A tibble: 6 × 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` ] .pull-right[ ```r table3 %>% separate(rate, into = c("cases", "population")) ``` ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <int> <chr> <chr> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ] --- # Uniting .pull-left[ ```r table5 ``` ``` ## # A tibble: 6 × 4 ## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272 ## 6 China 20 00 213766/1280428583 ``` ] .pull-right[ ```r table5 %>% unite(new, century, year) ``` ``` ## # A tibble: 6 × 3 ## country new rate ## <chr> <chr> <chr> ## 1 Afghanistan 19_99 745/19987071 ## 2 Afghanistan 20_00 2666/20595360 ## 3 Brazil 19_99 37737/172006362 ## 4 Brazil 20_00 80488/174504898 ## 5 China 19_99 212258/1272915272 ## 6 China 20_00 213766/1280428583 ``` ] --- # Missing values - Explicit vs Implicit ```r (stocks <- tibble( year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016), qtr = c( 1, 2, 3, 4, 2, 3, 4), return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66) )) ``` ``` ## # A tibble: 7 × 3 ## year qtr return ## <dbl> <dbl> <dbl> ## 1 2015 1 1.88 ## 2 2015 2 0.59 ## 3 2015 3 0.35 ## 4 2015 4 NA ## 5 2016 2 0.92 ## 6 2016 3 0.17 ## 7 2016 4 2.66 ``` --- - Force explicit ```r stocks %>% pivot_wider(names_from = year, values_from = return) ``` ``` ## # A tibble: 4 × 3 ## qtr `2015` `2016` ## <dbl> <dbl> <dbl> ## 1 1 1.88 NA ## 2 2 0.59 0.92 ## 3 3 0.35 0.17 ## 4 4 NA 2.66 ``` --- ```r stocks %>% pivot_wider(names_from = year, values_from = return) %>% pivot_longer(cols=c(`2015`,`2016`),names_to='year',values_to='return') ``` ``` ## # A tibble: 8 × 3 ## qtr year return ## <dbl> <chr> <dbl> ## 1 1 2015 1.88 ## 2 1 2016 NA ## 3 2 2015 0.59 ## 4 2 2016 0.92 ## 5 3 2015 0.35 ## 6 3 2016 0.17 ## 7 4 2015 NA ## 8 4 2016 2.66 ``` --- ```r stocks %>% pivot_wider(names_from = year, values_from = return) %>% pivot_longer(cols=c(`2015`,`2016`),names_to='year',values_to='return',values_drop_na = TRUE) ``` ``` ## # A tibble: 6 × 3 ## qtr year return ## <dbl> <chr> <dbl> ## 1 1 2015 1.88 ## 2 2 2015 0.59 ## 3 2 2016 0.92 ## 4 3 2015 0.35 ## 5 3 2016 0.17 ## 6 4 2016 2.66 ``` --- .pull-left[ ```r stocks ``` ``` ## # A tibble: 7 × 3 ## year qtr return ## <dbl> <dbl> <dbl> ## 1 2015 1 1.88 ## 2 2015 2 0.59 ## 3 2015 3 0.35 ## 4 2015 4 NA ## 5 2016 2 0.92 ## 6 2016 3 0.17 ## 7 2016 4 2.66 ``` ] .pull-left[ ```r stocks %>% complete(year, qtr) ``` ``` ## # A tibble: 8 × 3 ## year qtr return ## <dbl> <dbl> <dbl> ## 1 2015 1 1.88 ## 2 2015 2 0.59 ## 3 2015 3 0.35 ## 4 2015 4 NA ## 5 2016 1 NA ## 6 2016 2 0.92 ## 7 2016 3 0.17 ## 8 2016 4 2.66 ``` ]