class: center, middle, inverse, title-slide

# Week 2: Data visualisation
## PUBPOL 750 Data Analysis for Public Policy I
### Justin Savoie
### MPP-DS McMaster
### 2023-09-20

---
background-image: url(https://research.mcmaster.ca/app/uploads/2019/11/20180706-152629-McMaster-University-Campus-0004-1.jpg)

---
class: inverse, center, middle

# Summary Week 1

---

# R and RStudio

<div class="figure">
<img src="images/motor.png" alt="Source: Modern Dive Chapter 1" width="90%" />
<p class="caption">Source: Modern Dive Chapter 1</p>
</div>

---

# RStudio

<div class="figure">
<img src="images/4panes.png" alt="Source: Modern Dive Chapter 1" width="80%" />
<p class="caption">Source: Modern Dive Chapter 1</p>
</div>

---

Every time you open R:

```r
library(tidyverse)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
```

```
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.2     ✔ forcats 0.5.2
```

```
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
```

This output shows that loading worked. It tells you that 8 packages are attached. It also tells you that the function `filter()` from the stats package is now masked by `filter()` from dplyr, so a bare `filter()` call runs the dplyr version.
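To make the masking message concrete, here is a minimal sketch (not from the original slides). It assumes only that dplyr is installed; the built-in `mtcars` data and the 3-point moving average are arbitrary illustrative choices.

```r
library(dplyr)

# With dplyr attached, a bare filter() call is dplyr::filter():
# it keeps the rows of a data frame that satisfy a condition.
four_cyl <- filter(mtcars, cyl == 4)
nrow(four_cyl)

# The masked function is still reachable through its package prefix.
# stats::filter() applies a linear filter, here a 3-point moving average.
moving_avg <- stats::filter(1:5, rep(1/3, 3))
moving_avg
```

Writing `package::function()` is a common way to stay explicit whenever two attached packages share a function name.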
---

```r
View(mpg)
```

<img src="images/View.png" width="60%" />

---

```r
glimpse(mpg)
```

```
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
```

---
class: inverse, middle, center

# Two different philosophies when learning R (one very quick word about it)

### <p style="color:grey;">math + base R (not really covered)</p>

### <p style="color:green;">data analysis with examples (the focus of this course) + tidyverse</p>

---

.pull-left[

base R

```r
(df <- data.frame(x=c(1,2),y=c(6,7)))
```

```
##   x y
## 1 1 6
## 2 2 7
```

```r
df$z <- c(10,11)
df
```

```
##   x y  z
## 1 1 6 10
## 2 2 7 11
```

```r
df[,c("x")]
```

```
## [1] 1 2
```
]

.pull-right[

tidyverse

```r
# library(tidyverse)
(df <- tibble(x=c(1,2),y=c(6,7)))
```

```
## # A tibble: 2 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     6
## 2     2     7
```

```r
(df <- df |> mutate(z=c(10,11)))
```

```
## # A tibble: 2 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     6    10
## 2     2     7    11
```

```r
df |> select(x)
```

```
## # A tibble: 2 × 1
##       x
##   <dbl>
## 1     1
## 2     2
```
]

---

# The “whole game” of data science

<div class="figure">
<img src="images/tidy.png" alt="Source: RFDS2" width="70%" />
<p class="caption">Source: RFDS2</p>
</div>

---
class: inverse, middle, center

# Data visualisation

---

<img src="Slides_files/figure-html/unnamed-chunk-10-1.png" width="50%" />

---

## Empty graph

```r
ggplot(data = mpg)
```

<img src="Slides_files/figure-html/unnamed-chunk-11-1.png" width="30%" />

---

## Adding structure through mapping

```r
# penguins comes from the palmerpenguins package
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
```

<img src="Slides_files/figure-html/unnamed-chunk-12-1.png" width="30%" />

---

## Adding points

```r
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
```

```
## Warning: Removed 2 rows containing missing values (`geom_point()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-13-1.png" width="30%" />

---

## Adding colors for groups

```r
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()
```

```
## Warning: Removed 2 rows containing missing values (`geom_point()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-14-1.png" width="30%" />

---

## Adding lines of fit

```r
ggplot(data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm")
```

```
## `geom_smooth()` using formula = 'y ~ x'
```

```
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
```

```
## Warning: Removed 2 rows containing missing values (`geom_point()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-15-1.png" width="25%" />

---

## The full graph

```r
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    color = "Species", shape = "Species"
  ) +
  # scale_color_colorblind() is provided by the ggthemes package
  scale_color_colorblind()
```

<img src="Slides_files/figure-html/unnamed-chunk-16-1.png" width="25%" />

---

## ggplot2 calls

.pull-left[

arguments are named

```r
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
```

<img src="Slides_files/figure-html/unnamed-chunk-17-1.png" width="50%" />
]

.pull-right[

arguments are not named (`data` and `mapping` are matched by position)

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

<img src="Slides_files/figure-html/unnamed-chunk-18-1.png" width="50%" />
]

---

## Visualizing distributions for categorical variables

You could reorder the bars by frequency with `fct_infreq()`.

```r
ggplot(penguins, aes(x = species)) +
  geom_bar()
```

<img src="Slides_files/figure-html/unnamed-chunk-19-1.png" width="35%" />

---

## Visualizing distributions for a continuous variable

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
```

```
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-20-1.png" width="35%" />

---

## Visualizing distributions using a density

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()
```

```
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-21-1.png" width="35%" />

---

## Visualizing the relationship between a categorical and a continuous variable

```r
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()
```

```
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-22-1.png" width="35%" />

---

## Visualizing the relationship between a categorical and a continuous variable

```r
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.5)
```

```
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-23-1.png" width="35%" />

---

## Visualizing the relationship between two categorical variables

```r
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()
```

<img src="Slides_files/figure-html/unnamed-chunk-24-1.png" width="35%" />

---

## Visualizing the relationship between two categorical variables

```r
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")
```

<img src="Slides_files/figure-html/unnamed-chunk-25-1.png" width="35%" />

---

## Visualizing the relationship between two categorical variables

<img src="Slides_files/figure-html/unnamed-chunk-26-1.png" width="720" />

---

<img src="Slides_files/figure-html/unnamed-chunk-27-1.png" width="864" />

---

## Two numerical variables

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

```
## Warning: Removed 2 rows containing missing values (`geom_point()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-28-1.png" width="35%" />

---

## Three variables (or more!)

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)
```

```
## Warning: Removed 2 rows containing missing values (`geom_point()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-29-1.png" width="35%" />

---

## Saving a plot

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

```
## Warning: Removed 2 rows containing missing values (`geom_point()`).
```

<img src="Slides_files/figure-html/unnamed-chunk-30-1.png" width="30%" />

```r
ggsave(filename = "penguin-plot.png")
```

```
## Saving 7 x 7 in image
```

```
## Warning: Removed 2 rows containing missing values (`geom_point()`).
```

---

## Problems

Plus sign on the wrong line: `+` must end a line, not start the next one.

```r
ggplot(data = mpg)
* + geom_point(aes(x=displ,y=hwy,size=12))
```

One too many `)` on line 1 and one missing on line 2.

```r
*ggplot(data = mpg)) +
*  geom_point(aes(x=displ,y=hwy,size=12)
```

---
class: inverse, middle, center

# Exercises

### 2.2.5, 2.4.3, 2.5.5

### Solutions: https://r4ds-solutions.nhsrcommunity.com/data-visualize.html

---

## Note on making these slides

I use R Markdown (the source files have the extension rmd) and xaringan to make these slides: https://bookdown.org/yihui/rmarkdown/xaringan.html

R Markdown is a reporting tool that enables the integration of R code, its output (such as figures and tables), and narrative text within a single document, which can then be exported to various formats including HTML, PDF, and Microsoft Word.

Recently people have started moving to Quarto (extension is qmd). You can read more if you are curious: https://yihui.org/en/2022/04/quarto-r-markdown/

I'll say more about this later in the semester. Tools exist for reporting (e.g., writing reports, decks, etc.).

To get the syntax (the R Markdown code files) for these slides, you can change the url to ".../slides.rmd".
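
As a rough illustration of the R Markdown workflow described above, the sketch below writes a tiny document that mixes narrative text with one R chunk, then renders it to HTML. It is not how these slides were built: the file contents, the `demo_rmd` and `fence` names, and the use of a temporary file are assumptions made for the example, and it needs the rmarkdown package plus pandoc (bundled with RStudio).

```r
library(rmarkdown)

# Build the chunk delimiter programmatically (three backticks)
fence <- strrep("`", 3)

# A minimal R Markdown document: YAML header, narrative text, one R chunk
demo_rmd <- tempfile(fileext = ".Rmd")
writeLines(
  c(
    "---",
    "title: \"Demo report\"",
    "output: html_document",
    "---",
    "",
    "Some narrative text about the `cars` data.",
    "",
    paste0(fence, "{r}"),
    "summary(cars)",
    fence
  ),
  demo_rmd
)

# render() runs the chunk, weaves its output into the text,
# and returns the path of the file it produced
out_file <- render(demo_rmd, quiet = TRUE)
out_file
```

Changing the `output:` field (for example to `word_document`) is the usual way to target another format from the same source file.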