+ - 0:00:00
Notes for current slide
Notes for next slide

Week 4: Importing data and tidy data

PUBPOL 750 Data Analysis for Public Policy I

Justin Savoie

MPP-DS McMaster

2022-06-03

1 / 25

Homework 1

Solution code for HW1 posted on the website

2 / 25

The pipe (%>%)

3 / 25

The pipe (%>%)

  • We use the pipe to emphasise a sequence of actions
  • The pipe means "THEN"
mtcars_grouped <- group_by(mtcars,cyl)
count(mtcars_grouped)
## # A tibble: 3 × 2
## # Groups: cyl [3]
## cyl n
## <dbl> <int>
## 1 4 11
## 2 6 7
## 3 8 14
4 / 25

The pipe (%>%)

mtcars %>%
group_by(cyl) %>%
count()
## # A tibble: 3 × 2
## # Groups: cyl [3]
## cyl n
## <dbl> <int>
## 1 4 11
## 2 6 7
## 3 8 14

In plain language: take the mtcars dataset THEN group by "cyl" THEN count.

5 / 25
mtcars %>%
filter(cyl==4) %>%
select(mpg,hp,vs,cyl)
## mpg hp vs cyl
## Datsun 710 22.8 93 1 4
## Merc 240D 24.4 62 1 4
## Merc 230 22.8 95 1 4
## Fiat 128 32.4 66 1 4
## Honda Civic 30.4 52 1 4
## Toyota Corolla 33.9 65 1 4
## Toyota Corona 21.5 97 1 4
## Fiat X1-9 27.3 66 1 4
## Porsche 914-2 26.0 91 0 4
## Lotus Europa 30.4 113 1 4
## Volvo 142E 21.4 109 1 4
6 / 25

Another way to think of the pipe (%>%)

my_vector <- c(1,2,3,4,8)
mean(my_vector)
## [1] 3.6
my_vector %>%
mean()
## [1] 3.6

More formally, the pipe passes what comes before the pipe as the first argument in the function after the pipe.

7 / 25

Now that we know about the pipe, let's take a look at the homework

8 / 25

Data importing

9 / 25

Reading a csv

library(tidyverse)
Parliamentarians_subset100 <- read_csv("https://www.justinsavoie.com/pubpol750-2022/homework1/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("~/Downloads/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Downloads/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Documents/personal_repos/personal_website/pubpol750-2022/Homework1/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("C:\\Users\\justinsavoie\\Downloads/Parliamentarians_subset100.csv")

Reading an excel file

library(readxl)
Parliamentarians_subset100 <- read_excel("~/Downloads/Parliamentarians.xlsx")
10 / 25

Reading spss, sas, stata files

library(haven)
#read_spss()
#read_sas()
#read_stata()
library(readstata13)
#readstata13::read.dta13()
11 / 25

Tidy data

12 / 25

Tidy data

(A)

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

(B)

  • Put each dataset in a tibble.
  • Put each variable in a column.
13 / 25
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 × 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
14 / 25

Pivoting

  • Pivoting wider
  • Pivoting longer
15 / 25

Pivoting wider

  • column names are not names of variables, but values of a variable
    table4a
    ## # A tibble: 3 × 3
    ## country `1999` `2000`
    ## * <chr> <int> <int>
    ## 1 Afghanistan 745 2666
    ## 2 Brazil 37737 80488
    ## 3 China 212258 213766
    table4a %>%
    pivot_longer(cols=c(`1999`,`2000`),names_to = 'year', values_to = 'cases')
    ## # A tibble: 6 × 3
    ## country year cases
    ## <chr> <chr> <int>
    ## 1 Afghanistan 1999 745
    ## 2 Afghanistan 2000 2666
    ## 3 Brazil 1999 37737
    ## 4 Brazil 2000 80488
    ## 5 China 1999 212258
    ## 6 China 2000 213766
16 / 25
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
17 / 25

Pivoting wider

table2
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>%
pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
18 / 25

Separating

table3
## # A tibble: 6 × 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
19 / 25

Uniting

table5
## # A tibble: 6 × 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table5 %>%
unite(new, century, year)
## # A tibble: 6 × 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
20 / 25

Missing values

  • Explicit vs Implicit
(stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
))
## # A tibble: 7 × 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 2 0.92
## 6 2016 3 0.17
## 7 2016 4 2.66
21 / 25
  • Force explicit
stocks %>%
pivot_wider(names_from = year, values_from = return)
## # A tibble: 4 × 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66
22 / 25
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(cols=c(`2015`,`2016`),names_to='year',values_to='return')
## # A tibble: 8 × 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 1 2016 NA
## 3 2 2015 0.59
## 4 2 2016 0.92
## 5 3 2015 0.35
## 6 3 2016 0.17
## 7 4 2015 NA
## 8 4 2016 2.66
23 / 25
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(cols=c(`2015`,`2016`),names_to='year',values_to='return',values_drop_na = TRUE)
## # A tibble: 6 × 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 2 2016 0.92
## 4 3 2015 0.35
## 5 3 2016 0.17
## 6 4 2016 2.66
24 / 25
stocks
## # A tibble: 7 × 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 2 0.92
## 6 2016 3 0.17
## 7 2016 4 2.66
stocks %>%
complete(year, qtr)
## # A tibble: 8 × 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
25 / 25

Homework 1

Solution code for HW1 posted on the website

2 / 25
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow