class: center, middle, inverse, title-slide # Week 3: Data transformation ## PUBPOL 750 Data Analysis for Public Policy I ### Justin Savoie ### MPP-DS McMaster ### 2023-09-27 --- class: inverse, center, middle # Summary Week 2 Data Viz --- ### Basic visualization using ggplot2 ```r library(tidyverse) library(palmerpenguins) library(ggthemes) # initializes a ggplot object ggplot(penguins,mapping = aes(x = flipper_length_mm, y = body_mass_g)) + # point geom is used to create scatterplot geom_point(aes(color = species, shape = species)) + # add a visual line of best fit geom_smooth(method = "lm") + # add labels labs( title = "Body mass and flipper length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x = "Flipper length (mm)", y = "Body mass (g)", color = "Species", shape = "Species" ) + # change the color palette scale_color_colorblind() ``` --- ![](Slides_files/figure-html/unnamed-chunk-2-1.png)<!-- --> --- class: inverse, center, middle # Coding basics --- ### Objects ```r x <- c(1,2,3) x ``` ``` ## [1] 1 2 3 ``` ```r x <- "my name is justin" x ``` ``` ## [1] "my name is justin" ``` ```r z <- sum(c(1,2,3))*2 z ``` ``` ## [1] 12 ``` ```r x1 <- c(2,4,6) x2 <- c(4,5,1) x12 <- x1+x2 x12 ``` ``` ## [1] 6 9 7 ``` --- ### Assigning a value the value of something to an object ```r object_name <- value ``` ```r x1 <- c(2,4,6) print(x1) ``` ``` ## [1] 2 4 6 ``` ```r x1 ``` ``` ## [1] 2 4 6 ``` --- ### Comments ```r # you can leave comments using the like this using the # # create vector of primes primes <- c(2, 3, 5, 7, 11, 13) ``` **Great advice:** Use comments to explain the why of your code, not the how or the what. Why did you use this model instead of that model. Why did you use binwidth of 2.3 instead of 2.8. If when you code you make a decision that is not fully trivial, it makes sense to write a comment. In general, if you start using R more frequently, you should try to make things clear and replicable. There's nothing as frustrating as going back in old code when you need it and not understanding or not remebering why you did something. --- ### naming objects You can give different names to objects ```r xYUBVChh.1 <- c(1,2,3) xyubvchh_1 <- c(1,2,3) ``` The recommendation is to use "snake case" where you use lowercase words with "`_`" to separate them. These are ok: ```r x <- c(1,2,3) my_vector <- c(1,2,3) ``` --- ### Functions ```r my_vector <- c(1,2,3,4,5) mean(my_vector) ``` ``` ## [1] 3 ``` ```r rnorm(10,mean=100,sd=25) ``` ``` ## [1] 130.77001 90.58477 95.73495 109.52756 132.09524 111.60374 97.10145 ## [8] 136.75363 82.72128 121.48660 ``` ```r seq(from=0,to=10,by=2) ``` ``` ## [1] 0 2 4 6 8 10 ``` ```r seq(0,10,2) ``` ``` ## [1] 0 2 4 6 8 10 ``` --- class: inverse, center, middle # Data transformation --- ### Five key verbs (or functions) in dplyr - It's very rare we get the data in the form we need it. We need to transform it first. - dplyr has verbs: verbs are just functions that do things - These verbs do things on **rows** or on **columns** or on **groups** or on **tables** (or data frames). This week we talk about rows, columns and groups. - For example, on rows it can filter, on columns in can select or create a new column. - The pipe (it is not a verb) "`|>`" (it used to be `%>%` so you will see that often) allows to combine verbs. --- ### verbs on rows `filter()` (keep some rows) `arrange()` (reorder the rows) `distinct()` (keep only unique rows) --- ### Let's look at the flights data frame ```r library(nycflights13) flights ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 336,766 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### filter() ```r flights |> filter(dep_delay > 120) ``` ``` ## # A tibble: 9,723 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 848 1835 853 1001 1950 ## 2 2013 1 1 957 733 144 1056 853 ## 3 2013 1 1 1114 900 134 1447 1222 ## 4 2013 1 1 1540 1338 122 2020 1825 ## 5 2013 1 1 1815 1325 290 2120 1542 ## 6 2013 1 1 1842 1422 260 1958 1535 ## 7 2013 1 1 1856 1645 131 2212 2005 ## 8 2013 1 1 1934 1725 129 2126 1855 ## 9 2013 1 1 1938 1703 155 2109 1823 ## 10 2013 1 1 1942 1705 157 2124 1830 ## # ℹ 9,713 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### filter() ```r flights |> filter(month == 1 & day == 1) ``` ``` ## # A tibble: 842 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 832 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### filter() ```r flights |> filter(month == 1 | month == 2) ``` ``` ## # A tibble: 51,955 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 51,945 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### filter() ```r flights |> filter(month %in% c(1, 2)) ``` ``` ## # A tibble: 51,955 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 51,945 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### filter() ```r # Prints this to console, doesn't do anything flights |> filter(month == 1 & day == 1) # Creates new jan1 object jan1 <- flights |> filter(month == 1 & day == 1) # Overwrites flights (note: I'm not running this here) flights <- flights |> filter(month == 1 & day == 1) ``` --- ### common mistake ```r flights |> filter(month = 1) flights |> filter(month == 1) ``` --- ### arrange() ```r flights |> arrange(year, month, day, dep_time) ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 336,766 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ```r flights |> arrange(desc(dep_delay)) ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 9 641 900 1301 1242 1530 ## 2 2013 6 15 1432 1935 1137 1607 2120 ## 3 2013 1 10 1121 1635 1126 1239 1810 ## 4 2013 9 20 1139 1845 1014 1457 2210 ## 5 2013 7 22 845 1600 1005 1044 1815 ## 6 2013 4 10 1100 1900 960 1342 2211 ## 7 2013 3 17 2321 810 911 135 1020 ## 8 2013 6 27 959 1900 899 1236 2226 ## 9 2013 7 22 2257 759 898 121 1026 ## 10 2013 12 5 756 1700 896 1058 2020 ## # ℹ 336,766 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### distinct() ```r flights |> distinct() ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 336,766 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### distinct() ```r flights |> distinct(origin, dest) ``` ``` ## # A tibble: 224 × 2 ## origin dest ## <chr> <chr> ## 1 EWR IAH ## 2 LGA IAH ## 3 JFK MIA ## 4 JFK BQN ## 5 LGA ATL ## 6 EWR ORD ## 7 EWR FLL ## 8 LGA IAD ## 9 JFK MCO ## 10 LGA ORD ## # ℹ 214 more rows ``` --- ### count() ```r flights |> count(origin, dest, sort = TRUE) ``` ``` ## # A tibble: 224 × 3 ## origin dest n ## <chr> <chr> <int> ## 1 JFK LAX 11262 ## 2 LGA ATL 10263 ## 3 LGA ORD 8857 ## 4 JFK SFO 8204 ## 5 LGA CLT 6168 ## 6 EWR ORD 6100 ## 7 JFK BOS 5898 ## 8 LGA MIA 5781 ## 9 JFK MCO 5464 ## 10 EWR BOS 5327 ## # ℹ 214 more rows ``` --- ### verbs on columns `mutate()` (creates a new variable) `select()` (keeps only some variables) `rename()` (renames variables) `relocate()` (changes the positions of variables) --- ### mutate() ```r flights |> mutate( gain = dep_delay - arr_delay, speed = distance / air_time * 60 ) ``` ``` ## # A tibble: 336,776 × 21 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 336,766 more rows ## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl> ``` --- ### mutate() ```r flights |> mutate( gain = dep_delay - arr_delay, speed = distance / air_time * 60, .before = 1 ) ``` ``` ## # A tibble: 336,776 × 21 ## gain speed year month day dep_time sched_dep_time dep_delay arr_time ## <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int> ## 1 -9 370. 2013 1 1 517 515 2 830 ## 2 -16 374. 2013 1 1 533 529 4 850 ## 3 -31 408. 2013 1 1 542 540 2 923 ## 4 17 517. 2013 1 1 544 545 -1 1004 ## 5 19 394. 2013 1 1 554 600 -6 812 ## 6 -16 288. 2013 1 1 554 558 -4 740 ## 7 -24 404. 2013 1 1 555 600 -5 913 ## 8 11 259. 2013 1 1 557 600 -3 709 ## 9 5 405. 2013 1 1 557 600 -3 838 ## 10 -10 319. 2013 1 1 558 600 -2 753 ## # ℹ 336,766 more rows ## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### select() ```r flights |> select(year, month, day) ``` ``` ## # A tibble: 336,776 × 3 ## year month day ## <int> <int> <int> ## 1 2013 1 1 ## 2 2013 1 1 ## 3 2013 1 1 ## 4 2013 1 1 ## 5 2013 1 1 ## 6 2013 1 1 ## 7 2013 1 1 ## 8 2013 1 1 ## 9 2013 1 1 ## 10 2013 1 1 ## # ℹ 336,766 more rows ``` --- ### select() ```r flights |> select(year:day) ``` ``` ## # A tibble: 336,776 × 3 ## year month day ## <int> <int> <int> ## 1 2013 1 1 ## 2 2013 1 1 ## 3 2013 1 1 ## 4 2013 1 1 ## 5 2013 1 1 ## 6 2013 1 1 ## 7 2013 1 1 ## 8 2013 1 1 ## 9 2013 1 1 ## 10 2013 1 1 ## # ℹ 336,766 more rows ``` --- ### select() ```r flights |> select(!year:day) ``` ``` ## # A tibble: 336,776 × 16 ## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ## <int> <int> <dbl> <int> <int> <dbl> <chr> ## 1 517 515 2 830 819 11 UA ## 2 533 529 4 850 830 20 UA ## 3 542 540 2 923 850 33 AA ## 4 544 545 -1 1004 1022 -18 B6 ## 5 554 600 -6 812 837 -25 DL ## 6 554 558 -4 740 728 12 UA ## 7 555 600 -5 913 854 19 B6 ## 8 557 600 -3 709 723 -14 EV ## 9 557 600 -3 838 846 -8 B6 ## 10 558 600 -2 753 745 8 AA ## # ℹ 336,766 more rows ## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### select() ```r flights |> select(where(is.character)) ``` ``` ## # A tibble: 336,776 × 4 ## carrier tailnum origin dest ## <chr> <chr> <chr> <chr> ## 1 UA N14228 EWR IAH ## 2 UA N24211 LGA IAH ## 3 AA N619AA JFK MIA ## 4 B6 N804JB JFK BQN ## 5 DL N668DN LGA ATL ## 6 UA N39463 EWR ORD ## 7 B6 N516JB EWR FLL ## 8 EV N829AS LGA IAD ## 9 B6 N593JB JFK MCO ## 10 AA N3ALAA LGA ORD ## # ℹ 336,766 more rows ``` --- ### rename() ```r flights |> rename(tail_num = tailnum) ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 336,766 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### relocate() ```r flights |> relocate(time_hour, air_time) ``` ``` ## # A tibble: 336,776 × 19 ## time_hour air_time year month day dep_time sched_dep_time ## <dttm> <dbl> <int> <int> <int> <int> <int> ## 1 2013-01-01 05:00:00 227 2013 1 1 517 515 ## 2 2013-01-01 05:00:00 227 2013 1 1 533 529 ## 3 2013-01-01 05:00:00 160 2013 1 1 542 540 ## 4 2013-01-01 05:00:00 183 2013 1 1 544 545 ## 5 2013-01-01 06:00:00 116 2013 1 1 554 600 ## 6 2013-01-01 05:00:00 150 2013 1 1 554 558 ## 7 2013-01-01 06:00:00 158 2013 1 1 555 600 ## 8 2013-01-01 06:00:00 53 2013 1 1 557 600 ## 9 2013-01-01 06:00:00 140 2013 1 1 557 600 ## 10 2013-01-01 06:00:00 138 2013 1 1 558 600 ## # ℹ 336,766 more rows ## # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, ## # dest <chr>, distance <dbl>, hour <dbl>, minute <dbl> ``` --- ### relocate() ```r flights |> relocate(year:dep_time, .after = time_hour) ``` ``` ## # A tibble: 336,776 × 19 ## sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 515 2 830 819 11 UA 1545 ## 2 529 4 850 830 20 UA 1714 ## 3 540 2 923 850 33 AA 1141 ## 4 545 -1 1004 1022 -18 B6 725 ## 5 600 -6 812 837 -25 DL 461 ## 6 558 -4 740 728 12 UA 1696 ## 7 600 -5 913 854 19 B6 507 ## 8 600 -3 709 723 -14 EV 5708 ## 9 600 -3 838 846 -8 B6 79 ## 10 600 -2 753 745 8 AA 301 ## # ℹ 336,766 more rows ## # ℹ 12 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>, ## # month <int>, day <int>, dep_time <int> ``` --- ### the pipe `|>` Allows to combine multiple verbs. ```r flights |> filter(dest == "IAH") |> mutate(speed = distance / air_time * 60) |> select(year:day, dep_time, carrier, flight, speed) |> arrange(desc(speed)) ``` ``` ## # A tibble: 7,198 × 7 ## year month day dep_time carrier flight speed ## <int> <int> <int> <int> <chr> <int> <dbl> ## 1 2013 7 9 707 UA 226 522. ## 2 2013 8 27 1850 UA 1128 521. ## 3 2013 8 28 902 UA 1711 519. ## 4 2013 8 28 2122 UA 1022 519. ## 5 2013 6 11 1628 UA 1178 515. ## 6 2013 8 27 1017 UA 333 515. ## 7 2013 8 27 1205 UA 1421 515. ## 8 2013 8 27 1758 UA 302 515. ## 9 2013 9 27 521 UA 252 515. ## 10 2013 8 28 625 UA 559 515. ## # ℹ 7,188 more rows ``` --- ```r arrange( select( mutate( filter( flights, dest == "IAH" ), speed = distance / air_time * 60 ), year:day, dep_time, carrier, flight, speed ), desc(speed) ) flights1 <- filter(flights, dest == "IAH") flights2 <- mutate(flights1, speed = distance / air_time * 60) flights3 <- select(flights2, year:day, dep_time, carrier, flight, speed) arrange(flights3, desc(speed)) ``` --- ### working with groups with group_by() ```r flights |> group_by(month) ``` ``` ## # A tibble: 336,776 × 19 ## # Groups: month [12] ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 336,766 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ### group_by() and summarize() ```r flights |> group_by(month) |> summarize( delay = mean(dep_delay, na.rm = TRUE) ) ``` ``` ## # A tibble: 12 × 2 ## month delay ## <int> <dbl> ## 1 1 10.0 ## 2 2 10.8 ## 3 3 13.2 ## 4 4 13.9 ## 5 5 13.0 ## 6 6 20.8 ## 7 7 21.7 ## 8 8 12.6 ## 9 9 6.72 ## 10 10 6.24 ## 11 11 5.44 ## 12 12 16.6 ``` --- ### group_by() and summarize() .pull-left[ ```r flights |> group_by(month) |> summarize( delay = mean(dep_delay) ) ``` ``` ## # A tibble: 12 × 2 ## month delay ## <int> <dbl> ## 1 1 NA ## 2 2 NA ## 3 3 NA ## 4 4 NA ## 5 5 NA ## 6 6 NA ## 7 7 NA ## 8 8 NA ## 9 9 NA ## 10 10 NA ## 11 11 NA ## 12 12 NA ``` ] .pull-right[ ```r x <- c(1,2,3,NA) mean(x) ``` ``` ## [1] NA ``` ```r mean(x,na.rm=TRUE) ``` ``` ## [1] 2 ``` ] --- ### group_by() and summarize() ```r flights |> group_by(month) |> summarize( delay = mean(dep_delay, na.rm = TRUE), n = n() ) ``` ``` ## # A tibble: 12 × 3 ## month delay n ## <int> <dbl> <int> ## 1 1 10.0 27004 ## 2 2 10.8 24951 ## 3 3 13.2 28834 ## 4 4 13.9 28330 ## 5 5 13.0 28796 ## 6 6 20.8 28243 ## 7 7 21.7 29425 ## 8 8 12.6 29327 ## 9 9 6.72 27574 ## 10 10 6.24 28889 ## 11 11 5.44 27268 ## 12 12 16.6 28135 ``` --- ### group_by() and summarize() ```r flights |> group_by(month) |> summarize( delay = mean(dep_delay, na.rm = TRUE), n = n() ) ``` ``` ## # A tibble: 12 × 3 ## month delay n ## <int> <dbl> <int> ## 1 1 10.0 27004 ## 2 2 10.8 24951 ## 3 3 13.2 28834 ## 4 4 13.9 28330 ## 5 5 13.0 28796 ## 6 6 20.8 28243 ## 7 7 21.7 29425 ## 8 8 12.6 29327 ## 9 9 6.72 27574 ## 10 10 6.24 28889 ## 11 11 5.44 27268 ## 12 12 16.6 28135 ``` --- ### `slice_` ```r df |> slice_head(n = 1) # first df |> slice_min(x, n = 1) # smallest value ``` ```r flights |> group_by(dest) |> slice_max(arr_delay, n = 1) |> relocate(dest) ``` ``` ## # A tibble: 108 × 19 ## # Groups: dest [105] ## dest year month day dep_time sched_dep_time dep_delay arr_time ## <chr> <int> <int> <int> <int> <int> <dbl> <int> ## 1 ABQ 2013 7 22 2145 2007 98 132 ## 2 ACK 2013 7 23 1139 800 219 1250 ## 3 ALB 2013 1 25 123 2000 323 229 ## 4 ANC 2013 8 17 1740 1625 75 2042 ## 5 ATL 2013 7 22 2257 759 898 121 ## 6 AUS 2013 7 10 2056 1505 351 2347 ## 7 AVL 2013 8 13 1156 832 204 1417 ## 8 BDL 2013 2 21 1728 1316 252 1839 ## 9 BGR 2013 12 1 1504 1056 248 1628 ## 10 BHM 2013 4 10 25 1900 325 136 ## # ℹ 98 more rows ## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- class: inverse, center, middle # Homework 1 --- class: inverse, center, middle # Exercices ### 4.2.5, 4.3.5, 4.5.7