Week 4: Importing data and tidy dataPUBPOL 750 Data Analysis for Public Policy IJustin SavoieMPP-DS McMaster2022-06-031 / 25

Homework 1Solution code for HW1 posted on the website2 / 25

The pipe (%>%)3 / 25

The pipe (%>%)

We use the pipe to emphasise a sequence of actions
The pipe means "THEN"

mtcars_grouped <- group_by(mtcars,cyl)
count(mtcars_grouped)

## # A tibble: 3 × 2
## # Groups:   cyl [3]
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14

4 / 25

The pipe (%>%)

mtcars %>%
  group_by(cyl) %>%
  count()

## # A tibble: 3 × 2
## # Groups:   cyl [3]
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14

In plain language: take the mtcars dataset THEN group by "cyl" THEN count.

5 / 25

mtcars %>%
  filter(cyl==4) %>%
  select(mpg,hp,vs,cyl)

##                 mpg  hp vs cyl
## Datsun 710     22.8  93  1   4
## Merc 240D      24.4  62  1   4
## Merc 230       22.8  95  1   4
## Fiat 128       32.4  66  1   4
## Honda Civic    30.4  52  1   4
## Toyota Corolla 33.9  65  1   4
## Toyota Corona  21.5  97  1   4
## Fiat X1-9      27.3  66  1   4
## Porsche 914-2  26.0  91  0   4
## Lotus Europa   30.4 113  1   4
## Volvo 142E     21.4 109  1   4

6 / 25

Another way to think of the pipe (%>%)

my_vector <- c(1,2,3,4,8)
mean(my_vector)

## [1] 3.6

my_vector %>%
  mean()

## [1] 3.6

More formally, the pipe passes what comes before the pipe as the first argument in the function after the pipe.

7 / 25

Now that we know about the pipe, let's take a look at the homework8 / 25

Data importing9 / 25

Reading a csv

library(tidyverse)
Parliamentarians_subset100 <- read_csv("https://www.justinsavoie.com/pubpol750-2022/homework1/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("~/Downloads/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Downloads/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Documents/personal_repos/personal_website/pubpol750-2022/Homework1/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("C:\\Users\\justinsavoie\\Downloads/Parliamentarians_subset100.csv")

Reading an excel file

library(readxl)
Parliamentarians_subset100 <- read_excel("~/Downloads/Parliamentarians.xlsx")

10 / 25

Reading spss, sas, stata files

library(haven)
#read_spss()
#read_sas()
#read_stata()
library(readstata13)
#readstata13::read.dta13()

11 / 25

Tidy data12 / 25

Tidy data

(A)

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

(B)

Put each dataset in a tibble.
Put each variable in a column.

13 / 25

table1

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3

## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table4a

## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b

## # A tibble: 3 × 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
14 / 25

PivotingPivoting wider
Pivoting longer
15 / 25

Pivoting widercolumn names are not names of variables, but values of a variabletable4a

## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4a %>%
pivot_longer(cols=c(`1999`,`2000`),names_to = 'year', values_to = 'cases')

## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
16 / 25

tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)

## Joining, by = c("country", "year")

## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

17 / 25

Pivoting widertable2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table2 %>%
    pivot_wider(names_from = type, values_from = count)

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
18 / 25

Separatingtable3

## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table3 %>% 
  separate(rate, into = c("cases", "population"))

## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
19 / 25

Unitingtable5

## # A tibble: 6 × 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table5 %>% 
  unite(new, century, year)

## # A tibble: 6 × 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583
20 / 25

Missing values

Explicit vs Implicit

(stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
))

## # A tibble: 7 × 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     2   0.92
## 6  2016     3   0.17
## 7  2016     4   2.66

21 / 25

Force explicit

stocks %>% 
  pivot_wider(names_from = year, values_from = return)

## # A tibble: 4 × 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88  NA   
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  NA      2.66

22 / 25

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(cols=c(`2015`,`2016`),names_to='year',values_to='return')

## # A tibble: 8 × 3
##     qtr year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     1 2016   NA   
## 3     2 2015    0.59
## 4     2 2016    0.92
## 5     3 2015    0.35
## 6     3 2016    0.17
## 7     4 2015   NA   
## 8     4 2016    2.66

23 / 25

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(cols=c(`2015`,`2016`),names_to='year',values_to='return',values_drop_na = TRUE)

## # A tibble: 6 × 3
##     qtr year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     2 2016    0.92
## 4     3 2015    0.35
## 5     3 2016    0.17
## 6     4 2016    2.66

24 / 25

stocks

## # A tibble: 7 × 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     2   0.92
## 6  2016     3   0.17
## 7  2016     4   2.66
stocks %>% 
  complete(year, qtr)

## # A tibble: 8 × 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     1  NA   
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66
25 / 25

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Week 4: Importing data and tidy data

PUBPOL 750 Data Analysis for Public Policy I

Justin Savoie

MPP-DS McMaster

2022-06-03

Homework 1

Solution code for HW1 posted on the website

The pipe (%>%)

The pipe (%>%)

The pipe (%>%)

Another way to think of the pipe (%>%)

Now that we know about the pipe, let's take a look at the homework

Data importing

Reading a csv

Reading an excel file

Reading spss, sas, stata files

Tidy data

Tidy data

Pivoting

Pivoting wider

Pivoting wider

Separating

Uniting

Missing values

Homework 1

Solution code for HW1 posted on the website

Help