class: center, middle, inverse, title-slide # Week 8: Strings, dates and times, joining and binding ## PUBPOL 750 Data Analysis for Public Policy I ### Justin Savoie ### MPP-DS McMaster ### 2022-07-08 --- class: inverse, center, middle # Strings, dates and times, joining and binding --- # Data wrangling ? - We've discussed select(), filter(), group(), mutate() etc. - A lot of data wrangling is about fixing messy strings, dates; joining data from different sources (from different datasets). --- # Plan for today - Joining and binding - Dealing with character strings - Dealing with dates - Two examples + two questions I had on Project 1 --- # Binding (columsn or rows) means appending new lines or new columns ## Binding rows with `bind_rows()` ```r df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline")) df2 <- tibble(age=c(55,30,12),name=c("Hillary","Anna","Luis")) (df <- bind_rows(df1,df2)) ``` ``` ## # A tibble: 6 × 2 ## age name ## <dbl> <chr> ## 1 25 Justin ## 2 29 Ian ## 3 22 Rosaline ## 4 55 Hillary ## 5 30 Anna ## 6 12 Luis ``` --- ```r df1 <- read_csv("mydata1.csv") df2 <- read_csv("mydata2.csv") df <- bind_rows(df1,df2) ``` - Sometimes, you can use a loop to read in numerous csvs. We'll cover this next week. Think of data for 125 riding. - Need to make sure columns names are the same. --- ## Binding columns with `bind_cols()` - `bind_cols()` is often used if you want to stack variables where the individuals are the same ```r df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline")) df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline")) (df <- bind_cols(df1,df2 %>% select(-name))) ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` ```r (df <- bind_cols(df1,df2)) ``` ``` ## New names: ## • `name` -> `name...2` ## • `name` -> `name...4` ``` ``` ## # A tibble: 3 × 4 ## age name...2 income name...4 ## <dbl> <chr> <dbl> <chr> ## 1 25 Justin 55 Justin ## 2 29 Ian 30 Ian ## 3 22 Rosaline 12 Rosaline ``` --- ```r (df <- bind_rows(df1,df2 %>% select(-name))) ``` ``` ## # A tibble: 6 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin NA ## 2 29 Ian NA ## 3 22 Rosaline NA ## 4 NA <NA> 55 ## 5 NA <NA> 30 ## 6 NA <NA> 12 ``` ```r bind_rows(tibble(x=c(1,2),y=c(2,3)), tibble(y=c(9,8),x=c(10,11))) ``` ``` ## # A tibble: 4 × 2 ## x y ## <dbl> <dbl> ## 1 1 2 ## 2 2 3 ## 3 10 9 ## 4 11 8 ``` --- # Joins Instead of "stacking" like binds, joins "joins" on one or more columns. If you have riding names in two different datasets, you can join them. - `inner_join()`: includes all rows in x and y. - `left_join()`: includes all rows in x. - `right_join()`: includes all rows in y. - `full_join()`: includes all rows in x or y. --- ```r left_join(df1,df2) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` ```r #left_join(df1,df2,c("name")) #left_join(df1,df2,c("name"="name")) ``` --- ```r df1 <- tibble(age=c(25,29,22,23),name=c("Justin","Ian","Rosaline","Jack")) df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline")) inner_join(df1,df2) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` ```r left_join(df1,df2) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 4 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ## 4 23 Jack NA ``` --- ```r right_join(df1,df2) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` ```r full_join(df1,df2) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 4 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ## 4 23 Jack NA ``` --- ```r left_join(df2,df1) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 3 × 3 ## income name age ## <dbl> <chr> <dbl> ## 1 55 Justin 25 ## 2 30 Ian 29 ## 3 12 Rosaline 22 ``` --- ```r df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline")) df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin")) left_join(df1,df2) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 4 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 25 Justin 33 ## 3 29 Ian 30 ## 4 22 Rosaline 12 ``` --- ```r df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"),last=c("Savoie","Smith","Azevedo")) df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin"),last=c("Savoie","Smith","Azevedo","Gajevic")) left_join(df1,df2) ``` ``` ## Joining, by = c("name", "last") ``` ``` ## # A tibble: 3 × 4 ## age name last income ## <dbl> <chr> <chr> <dbl> ## 1 25 Justin Savoie 55 ## 2 29 Ian Smith 30 ## 3 22 Rosaline Azevedo 12 ``` ```r # or explicitly left_join(df1,df2, c("name"="name","last"="last")) ``` --- # Dealing with character strings - Useful, but potentially frustrating - Very cryptic at first (not just at first) - replacing, substituting, extracting with `stringr` package --- ```r my_string <- c("Justin Savoie (2019)","Daniel Ortega (2019)", "Natalia Burina (2019)") str_replace(my_string,"\\(2019\\)","") %>% trimws() ``` ``` ## [1] "Justin Savoie" "Daniel Ortega" "Natalia Burina" ``` ```r my_string <- c("POL2517-U","POL2517-V","POL2699-S") str_sub(my_string,4,-1) ``` ``` ## [1] "2517-U" "2517-V" "2699-S" ``` ```r str_sub(my_string,4,7) ``` ``` ## [1] "2517" "2517" "2699" ``` ```r my_string <- c("Justin Savoie (2019)","Daniel Ortega (2017)", "Natalia Burina (2018)") str_extract(my_string,"(?<=\\().+?(?=\\))") ``` ``` ## [1] "2019" "2017" "2018" ``` ```r str_replace(my_string,"\\((?<=\\().+?(?=\\))\\)","") %>% trimws() ``` ``` ## [1] "Justin Savoie" "Daniel Ortega" "Natalia Burina" ``` ```r # https://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r ``` --- ```r my_string <- c("justin savoie","micheal Gillet","daniela Smith-jones") str_to_title(my_string) ``` ``` ## [1] "Justin Savoie" "Micheal Gillet" "Daniela Smith-Jones" ``` ```r str_length(my_string) ``` ``` ## [1] 13 14 19 ``` ```r str_count(my_string,"j") ``` ``` ## [1] 1 0 1 ``` --- # Dates and times ```r library(lubridate) ``` ``` ## ## Attaching package: 'lubridate' ``` ``` ## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union ``` ```r today() ``` ``` ## [1] "2022-07-08" ``` ```r now()+60 ``` ``` ## [1] "2022-07-08 12:30:40 EDT" ``` ```r today()+1 ``` ``` ## [1] "2022-07-09" ``` --- ```r adate <- "2022-06-01" ymd(adate) ``` ``` ## [1] "2022-06-01" ``` ```r lubridate::dmy("26-06-19") ``` ``` ## [1] "2019-06-26" ``` ```r adatetime <- "2001-10-10 02:30:30" ymd_hms(adatetime) ``` ``` ## [1] "2001-10-10 02:30:30 UTC" ``` ```r as.numeric(ymd_hms(adatetime)) ``` ``` ## [1] 1002681030 ``` --- ```r to_plot <- tibble(date=c("2019-01-01","2019-03-01","2020-01-01","2020-06-01"),value=rnorm(4,10,3)) ggplot(to_plot,aes(x=date,y=value)) + geom_point() ``` <!-- --> --- ```r to_plot <- to_plot %>% mutate(date=ymd(date)) ggplot(to_plot,aes(x=date,y=value)) + geom_point() + geom_line() + scale_x_date(breaks = as_date(c("2019-01-01","2019-03-01","2020-01-01","2020-06-01")), labels=c("Jan 2019","March 2019","Jan 2020","June 202")) + theme(axis.text.x = element_text(angle=45)) + scale_y_continuous(limits=c(0,30)) ``` <!-- --> --- # An example ```r library(tidyverse) library(readxl) library(lubridate) df <- read_excel("~/Downloads/Parliamentarians.xlsx") profession <- read_excel("~/Downloads/Parliamentarians Job.xlsx") df <- left_join(df,profession) head(df$`Province/Territory`) df <- df %>% mutate(province=str_extract(`Province/Territory`, "[^\r]+")) head(df$`Type of Parliamentarian`) df <- df %>% mutate(start_date=str_extract(`Type of Parliamentarian`, "\\([^()]+\\)") %>% str_sub(2,11) %>% ymd()) df %>% select(Name,start_date,Profession,province) ``` --- ## Two questions from Project 1 ```r hsum <- df %>% summarize(mean_age = mean(hincome, na.rm = TRUE), sd_age = sd(hincome, na.rm = TRUE), min_age = min(hincome, na.rm = TRUE), max_age = max(hincome, na.rm = TRUE), median_age = median(hincome, na.rm = TRUE), skew_age = skewness(hincome, na.rm = TRUE), kurtosis_age = kurtosis(hincome, na.rm = TRUE), n_age = sum(!is.na(hincome))) hsum %>% kable(format = "simple") hsum %>% t() %>% kable(format = "simple") hsum %>% pivot_longer(cols = everything()) %>% mutate(value=as.character(round(value))) %>% kable(format = "simple") ``` --- ```r kurtosis(df$hincome,na.rm = TRUE) options(scipen=999) unname(quantile(df$hincome,0.99,na.rm=TRUE)) temp <- df$hincome[df$hincome <=500000] kurtosis(temp,na.rm = TRUE) temp <- df$hincome kurtosis(temp,na.rm = TRUE) ``` --- # A worked example See Script Code Example July 8