# Introduction to R, the tidyverse way
## SSMW 2022 University of Toronto
### Justin Savoie

---

## Acknowledgments

Thanks to Thomas Mock, Customer Enablement Lead at RStudio.

https://www.youtube.com/watch?v=MKwyauo8nSI&ab_channel=ThomasMock

Some parts of my presentation are inspired by his.
---
This presentation is available at https://www.justinsavoie.com/ssmw2022
---
## Today's agenda
.pull-left[
* R and RStudio
  * Writing code
  * File manipulation
  * Package control

* R coding basics
  * Math
  * Assignment
  * Functions
  * Load and install packages
]

.pull-right[
* The tidyverse
  * Read data in with readr
  * Tidy data with tidyr
  * Transform data with dplyr

* Putting it together: two examples
  * cleaning data; linear model with predicted values and marginal effects
  * working with text from SCC cases

]

---

## R and RStudio

* R is a programming language for statistical computing and graphics
* RStudio is an IDE (integrated development environment)
* A place to write:
  * Console
  * R scripts
  * R Markdown
  * Code completion
* A place to:
  * work with folders and paths
  * visualize plots, data, files

---

## Common mistakes

Error messages are usually informative.

You can also google the error message.

---

# R coding basics (before tidyverse)

---

## R coding basics

Assignment

```r
x <- 2.5
```

```r
y <- 10
```

```r
z <- x*y
z
```

```
## [1] 25
```

```r
zz <- z + 2*x
zz
```

```
## [1] 30
```

---

```r
x1 <- c(2,3,4,10,12,46)
x2 <- c(-1,3,4,10,5,3)
x1+x2
```

```
## [1]  1  6  8 20 17 49
```

```r
x1 <- c(2,3,4,10,12,46)
x2 <- c(-1,3,4,10,5,3,1,1,1.5)
x1+x2
```

```
## Warning in x1 + x2: longer object length is not a multiple of shorter object
## length
```

```
## [1]  1.0  6.0  8.0 20.0 17.0 49.0  3.0  4.0  5.5
```
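
The warning above comes from R's recycling rule: when two vectors differ in length, the shorter one is repeated until it matches the longer one. A minimal sketch of the same rule when the lengths do divide evenly (the names `short_vec` and `long_vec` are made up for this example):

```r
# Recycling: the shorter vector is repeated to match the longer one.
short_vec <- c(1, 2)
long_vec <- c(10, 20, 30, 40)

# 4 is a multiple of 2, so short_vec is used twice and there is no warning.
long_vec + short_vec   # same as c(10, 20, 30, 40) + c(1, 2, 1, 2)

# A single number is a vector of length 1, recycled across every element.
long_vec * 2
```

If two vectors are supposed to line up element by element, comparing `length()` of both before adding them is a cheap way to catch silent recycling.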

---

## Dataframes

```r
my_dataframe <- data.frame(x=c(1,2,3),y=c(3.5,4.5,5.5))
my_dataframe
```

```
##   x   y
## 1 1 3.5
## 2 2 4.5
## 3 3 5.5
```

```r
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

---

## Functions

```r
mean(c(2,3,4))
```

```
## [1] 3
```

```r
random_vector <- rnorm(n=100,mean=0,sd=1)
mean(random_vector)
```

```
## [1] 0.1090938
```

```r
sd(random_vector)
```

```
## [1] 0.9015466
```

```r
IQR(random_vector)
```

```
## [1] 1.265599
```
---

```r
summary(random_vector)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.8532 -0.5379  0.1892  0.1091  0.7277  2.3159
```

```r
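set.seed(232) # fix the seed so the random draws are reproducible
random_vector <- rnorm(n=5,mean=0,sd=1) # arguments matched by name
set.seed(232)
random_vector <- rnorm(5,0,1) # same call, arguments matched by position
mean(x=random_vector)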
```

```
## [1] 0.233712
```

```r
mean(random_vector)
```

```
## [1] 0.233712
```
---

```r
mtcars$cyl[1:5]
```

```
## [1] 6 6 4 6 8
```

```r
my_quadratic_function <- function(x){
  return(x^2+6*x+14.5)
}

my_quadratic_function(mtcars$cyl[1:5])
```

```
## [1]  86.5  86.5  54.5  86.5 126.5
```

```r
mtcars$cylQUADRATIC <- my_quadratic_function(mtcars$cyl)

head(mtcars,3)
```

```
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb cylQUADRATIC
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         86.5
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         86.5
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         54.5
```

---
## Other useful functions

```r
seq(from=1,to=3.5,by=0.5)
```

```
## [1] 1.0 1.5 2.0 2.5 3.0 3.5
```

```r
seq(from=1,to=3.5,by=0.51)
```

```
## [1] 1.00 1.51 2.02 2.53 3.04
```

```r
seq(from=-2,to=2,length.out=7)
```

```
## [1] -2.0000000 -1.3333333 -0.6666667  0.0000000  0.6666667  1.3333333  2.0000000
```
---
## Other useful functions

```r
table(mtcars$cyl)
```

```
## 
##  4  6  8 
## 11  7 14
```

```r
ifelse(c(3,3,6,8)>5,1,0)
```

```
## [1] 0 0 1 1
```
---
## Other useful functions

```r
plot(mtcars$mpg,mtcars$hp)
```

<img src="ssmw2022_files/figure-html/unnamed-chunk-22-1.png" width="40%" />
---
## indexing

```r
x <- c(1,2,3)
x[2]
```

```
## [1] 2
```

```r
x[2:3]
```

```
## [1] 2 3
```

```r
mtcars$mpg[c(1,3)]
```

```
## [1] 21.0 22.8
```
---
## indexing

```r
mtcars[1:3,4:5]
```

```
##                hp drat
## Mazda RX4     110 3.90
## Mazda RX4 Wag 110 3.90
## Datsun 710     93 3.85
```
---
## indexing

```r
x[3] <- NA
x
```

```
## [1]  1  2 NA
```

```r
mtcars$mpg[mtcars$mpg>20]
```

```
##  [1] 21.0 21.0 22.8 21.4 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
```
---
## indexing

```r
x <- mtcars$mpg
x
```

```
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
```

```r
x[x>20] <- 1000
x
```

```
##  [1] 1000.0 1000.0 1000.0 1000.0   18.7   18.1   14.3 1000.0 1000.0   19.2
## [11]   17.8   16.4   17.3   15.2   10.4   10.4   14.7 1000.0 1000.0 1000.0
## [21] 1000.0   15.5   15.2   13.3   19.2 1000.0 1000.0 1000.0   15.8   19.7
## [31]   15.0 1000.0
```
---
## logical operators

```r
2>3 # 2 is bigger than 3
```

```
## [1] FALSE
```

```r
3>2 # 3 is bigger than 2
```

```
## [1] TRUE
```

```r
2==2 # 2 is equal to 2
```

```
## [1] TRUE
```

```r
2.00000001==2
```

```
## [1] FALSE
```

```r
3!=2 # 3 is not equal to 2
```

```
## [1] TRUE
```
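
The `2.00000001==2` comparison above shows that `==` is exact. With decimal numbers this can surprise you, because most decimals cannot be stored exactly. A minimal sketch of comparing with a tolerance instead, using base R's `all.equal()`:

```r
# == compares exactly, so tiny rounding differences make it FALSE.
0.1 + 0.2 == 0.3

# all.equal() compares up to a small tolerance (about 1.5e-8 by default).
# Wrap it in isTRUE() because it returns a message, not FALSE, on failure.
isTRUE(all.equal(0.1 + 0.2, 0.3))

# With the default tolerance, the two numbers from the slide count as equal.
isTRUE(all.equal(2.00000001, 2))
```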
---
## logical operators

```r
4 %in% c(2,4,5) # 4 is in the vector
```

```
## [1] TRUE
```

```r
!(4 %in% c(2,4,5)) # 4 is not in the vector
```

```
## [1] FALSE
```

```r
(5==6|6==6) # 5 or 6 is equal to 6
```

```
## [1] TRUE
```

```r
(5==5&6==7) # 5 is equal to 5 and 6 is equal to 7
```

```
## [1] FALSE
```

```r
is.na(c(2,3,NA)) # value is NA
```

```
## [1] FALSE FALSE  TRUE
```
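
`is.na()` exists because `NA` cannot be tested with `==`. A short sketch of how missing values propagate, and of the `na.rm` argument that many summary functions accept (the vector `v` is made up for this example):

```r
v <- c(2, 3, NA)

# Comparing with NA gives NA, not TRUE or FALSE.
v == NA

# Most summaries return NA as soon as one value is missing...
mean(v)

# ...unless the missing values are removed first.
mean(v, na.rm = TRUE)

# Count the missing values, or keep only the non-missing ones.
sum(is.na(v))
v[!is.na(v)]
```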

---
## working directory

```r
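# Option 1: give the full (absolute) path to the file
my_data <- read_delim("/Users/justinsavoie/Downloads/2019 Canadian Election Study - Phone Survey v1.0.tab")

# Option 2: set the working directory once, then use a path relative to it
setwd("~/Dropbox (Personal)/UofT/thisprojectimworkingon/")
my_data <- read_delim("data/2019 Canadian Election Study - Phone Survey v1.0.tab")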
```

To find a file's location:

On Mac: select the file, press Cmd-I, and copy the path shown under 'Where'.

On PC: right-click the file, choose 'Properties', and copy the path shown under 'Location'.
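
Before reading a file, it can help to check where R is looking. A minimal sketch using base R only; the `data` subfolder is an assumption about how the project is organised:

```r
# Where is R currently working?
getwd()

# Build a path relative to the working directory.
# file.path() inserts the separators, so the path stays on one logical line.
data_path <- file.path("data", "2019 Canadian Election Study - Phone Survey v1.0.tab")

# TRUE if the file is where we think it is, FALSE otherwise (no error).
file.exists(data_path)
```

If `file.exists()` returns `FALSE`, the path or the working directory is wrong, and the read function would fail too.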
---
## working directory
<img src="images/wd.png" width="85%" />
---
## working directory
<img src="images/wd2.png" width="85%" />
---
## The %>% (the 'pipe') and intro to tidyverse

```r
did_something <- do_something(data)
did_another_thing <- do_another_thing(did_something)
final_thing <- do_last_thing(did_another_thing)
```

```r
final_thing <- do_last_thing(
  do_another_thing(
    do_something(data)
  )
)
```

```r
final_thing <- data %>%
  do_something() %>%
  do_another_thing() %>%
  do_last_thing()
```
---
## The %>%

```r
mean(c(1,2,3))
```

```
## [1] 2
```

```r
c(1,2,3) %>% mean()
```

```
## [1] 2
```
---
## The %>%

```r
xplus6 <- function(x) x+6
xplus2 <- function(x) x+2
xtotwothird <- function(x) x^(2/3)
my_vector <- c(3,4,2)
my_vector %>%
  xplus6() %>%
  xplus2() %>%
  xtotwothird() %>%
  mean()
```

```
## [1] 4.943053
```

```r
mean(xtotwothird(xplus2(xplus6(my_vector))))
```

```
## [1] 4.943053
```
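
In every pipe so far, the left-hand side becomes the first argument of the next call. When the value has to go somewhere else, the magrittr pipe accepts a `.` placeholder. A minimal sketch, assuming the magrittr package is installed (it comes with the tidyverse); `lm()` is used only because its data argument is not the first one:

```r
library(magrittr)  # provides %>%

# First-argument piping: these two lines do the same thing.
c(1, 4, 9) %>% sqrt()
c(1, 4, 9) %>% sqrt(.)

# lm() takes the formula first and the data second,
# so the placeholder says where the piped data frame goes.
fit <- mtcars %>% lm(mpg ~ wt, data = .)
coef(fit)
```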
---
## The %>%
do_something(data) is equivalent to:

> - data %>% do_something(data=.)
> - data %>% do_something(.)
> - data %>% do_something()
---
## The %>%
<img src="images/ceswebsite.png" width="100%" />
---

```r
ces <- read_csv("https://www.justinsavoie.com/data/dataces1.txt")
head(ces) 
```

```
## # A tibble: 6 × 9
##   q2_birthyear q3_gender q4_province q6_satisfied_democracy   q9_interest_elect…
##          <dbl> <chr>     <chr>       <chr>                    <chr>             
## 1         1963 (1) Male  (5) Quebec  (3) Not very satisfied   (8)               
## 2         1973 (1) Male  (5) Quebec  (2) Fairly satisfied     (10) Great deal o…
## 3         1994 (1) Male  (5) Quebec  (1) Very satisfied       (10) Great deal o…
## 4         2000 (1) Male  (5) Quebec  (2) Fairly satisfied     (6)               
## 5         1984 (1) Male  (5) Quebec  (4) Not satisfied at all (10) Great deal o…
## 6         1939 (1) Male  (5) Quebec  (3) Not very satisfied   (10) Great deal o…
## # … with 4 more variables: q10_certain_vote <chr>, q11_vote_intention <chr>,
## #   q14_feeling_liberal_party <dbl>, q15_feeling_cons_party <dbl>
```
---
## The %>%

```r
ces %>%
  filter(q4_province=="(5) Quebec") %>%
  group_by(q3_gender) %>%
  summarise(mean_birthyear=mean(q2_birthyear),
            sd=sd(q2_birthyear))
```

```
## # A tibble: 2 × 3
##   q3_gender  mean_birthyear    sd
##   <chr>               <dbl> <dbl>
## 1 (1) Male            1972.  15.9
## 2 (2) Female          1970.  16.4
```
---
## The %>%

```r
ces %>%
  group_by(q4_province,q3_gender) %>%
  summarise(mean_birthyear=mean(q2_birthyear),
            sd=sd(q2_birthyear)) %>%
  head(5)
```

```
## `summarise()` has grouped output by 'q4_province'. You can override using the
## `.groups` argument.
```

```
## # A tibble: 5 × 4
## # Groups:   q4_province [2]
##   q4_province                   q3_gender  mean_birthyear    sd
##   <chr>                         <chr>               <dbl> <dbl>
## 1 (1) Newfoundland and Labrador (1) Male            1967.  15.7
## 2 (1) Newfoundland and Labrador (2) Female          1965.  16.9
## 3 (10) British Columbia         (1) Male            1967.  17.3
## 4 (10) British Columbia         (2) Female          1966.  16.6
## 5 (10) British Columbia         (3) Other           1992   NA
```
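
The message above says the result is still grouped by the first variable. A minimal sketch of the `.groups` argument it mentions, assuming dplyr 1.0.0 or later and using the built-in `mtcars` data so it runs without the CES file:

```r
library(dplyr)

by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean_mpg = mean(mpg),
            .groups = "drop")  # return an ungrouped tibble, no message

# FALSE: no grouping is left on the result.
is_grouped_df(by_cyl_am)
```

The `NA` standard deviation in the last row of the output above has a simpler cause: that group contains a single respondent, and `sd()` of one value is `NA`.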
---

# The tidyverse
---
## The tidyverse
- The tidyverse is an opinionated collection of R packages designed for data science.

- All packages share an underlying design philosophy, grammar, and data structures.

- tidyverse is an R package. But it's also a package of packages.

- Core packages: `readr, tidyr, dplyr, ggplot2`
---
## tidyverse
<img src="images/tidyverse.png" width="100%" />
---
## tidyverse
<img src="images/tidyverse2.png" width="100%" />
---
## Statistical inference using the tidyverse
<img src="images/moderndive.png" width="100%" />
---
## tidyverse
<img src="images/tidyverse3.png" width="85%" />
---
## Install and load packages
* Install package once on your computer.

```r
install.packages('tidyverse')
```

* Each time you run R, load the package.

```r
library(tidyverse)
```
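
If a script is shared with other people, one common pattern is to install the package only when it is missing. A small sketch using base R's `requireNamespace()`; this is one possible approach, not the only one:

```r
# Install the tidyverse only if it is not already on this computer.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Loading still has to happen in every new R session.
library(tidyverse)
```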
---
## tidyverse vs Base R

* People often contrast tidyverse and "base R"
* Many things can be done either in base R or with the tidyverse
* In practice, people use both. In particular, tidyverse code still relies on base R for the basics.

```r
read.csv("...")
read_csv("...")

plot(data$x,data$y)
ggplot(data,aes(x=x,y=y)) + geom_point()
```
---
## tidyverse vs Base R

```r
tapply(mtcars$mpg,mtcars$cyl,mean)
```

```
##        4        6        8 
## 26.66364 19.74286 15.10000
```

```r
mtcars %>% group_by(cyl) %>% summarise(mean(mpg))
```

```
## # A tibble: 3 × 2
##     cyl `mean(mpg)`
##   <dbl>       <dbl>
## 1     4        26.7
## 2     6        19.7
## 3     8        15.1
```
---
## tidyverse vs Base R

```r
mtcars$new_var <- rnorm(nrow(mtcars),0,1)
mtcars <- mtcars %>%
  mutate(new_var=rnorm(nrow(.),0,1))
```
---
## tidyverse core principles

* Built around two-dimensional data (data.frame or tibble)
* Built around tidy data
  * Each variable in its own column
  * Each observation in its own row
  * Each type of observational unit forms a table
---
## Tidy data

```r
head(ces)
```

```r
(untidy_df <- tibble(age=18:30,
       male_2016=round(rnorm(13,50000,5000)),
       female_2016=round(rnorm(13,50000,5000)),
       male_2017=round(rnorm(13,50000,5000)),
       female_2017=round(rnorm(13,50000,5000)),
       male_2018=round(rnorm(13,50000,5000)),
       female_2018=round(rnorm(13,50000,5000)))) %>% head(10)
```

```
## # A tibble: 10 × 7
##      age male_2016 female_2016 male_2017 female_2017 male_2018 female_2018
##    <int>     <dbl>       <dbl>     <dbl>       <dbl>     <dbl>       <dbl>
##  1    18     52579       52796     55592       50408     50659       49530
##  2    19     52374       53110     56601       53242     45912       52298
##  3    20     44102       48351     53827       55603     46277       54563
##  4    21     50885       38245     38620       45510     54871       58411
##  5    22     44634       45495     51560       45302     53563       52324
##  6    23     50467       46760     51060       48077     39407       56501
##  7    24     51462       47773     50152       44989     53284       54753
##  8    25     54106       46685     47309       50160     57431       59393
##  9    26     57848       46243     44781       47118     48008       52496
## 10    27     56968       45185     59887       56179     48006       53883
```
---
## Tidy data

```r
(tidy_df <- untidy_df %>%
  pivot_longer(-age,names_to = c("gender","year"),
               values_to = "value",names_sep = "_"))
```

```
## # A tibble: 78 × 4
##      age gender year  value
##    <int> <chr>  <chr> <dbl>
##  1    18 male   2016  52579
##  2    18 female 2016  52796
##  3    18 male   2017  55592
##  4    18 female 2017  50408
##  5    18 male   2018  50659
##  6    18 female 2018  49530
##  7    19 male   2016  52374
##  8    19 female 2016  53110
##  9    19 male   2017  56601
## 10    19 female 2017  53242
## # … with 68 more rows
```
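
`pivot_wider()` goes in the opposite direction. A minimal sketch that rebuilds two of the wide columns from a few long rows; the four values are copied from the output above and `long_df` is a made-up name:

```r
library(tidyverse)

# Four rows in long (tidy) form, taken from the first rows shown above.
long_df <- tibble(age    = c(18, 18, 19, 19),
                  gender = c("male", "female", "male", "female"),
                  year   = c("2016", "2016", "2016", "2016"),
                  value  = c(52579, 52796, 52374, 53110))

# Spread gender and year back into column names such as male_2016.
wide_df <- long_df %>%
  pivot_wider(names_from = c(gender, year),
              values_from = value,
              names_sep = "_")
wide_df
```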
---
## Read data

Read in data with `readr, haven, readxl`.

I've also used `readstata13`, which is not part of the tidyverse.

* `readr`
  * `read_csv(),read_tsv(),read_delim()`

* `haven`
  * `read_sas(),read_spss(),read_stata()`

* `readxl`
  * `read_xls(),read_xlsx(),read_excel()`

For example:

```r
df <- read_csv("~/Desktop/mydata.csv")
```
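
By default `read_csv()` guesses the type of each column. A minimal sketch of declaring the types yourself with `col_types`; the temporary file and its three columns are invented so the example runs anywhere:

```r
library(readr)

# Write a tiny made-up CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,province,birthyear",
             "1,Quebec,1963",
             "2,Ontario,1973"), tmp)

# Declare the column types instead of letting readr guess them.
df <- read_csv(tmp,
               col_types = cols(id = col_integer(),
                                province = col_character(),
                                birthyear = col_double()))
df
```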
  
---

## dplyr package

* 6 main verbs
  * `filter()` keep only certain rows
  * `arrange()` order rows by the values of a variable
  * `select()` keep only certain variables
  * `mutate()` create or modify a variable
  * `group_by()` group rows by the values of a variable
  * `summarise()` compute summaries (e.g. mean, sd) for each group

* simple functions
  * `pull()` extract one variable as a vector
  * `n()` and `count()` count rows
  * `glimpse()` give summary of data
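
The verbs are designed to be chained. A short sketch that uses all six in one pipeline on the built-in `mtcars` data; the column names `hp_per_mpg`, `n_cars` and `mean_mpg` are made up for this example:

```r
library(dplyr)

verbs_demo <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%        # keep 4- and 6-cylinder cars
  select(mpg, cyl, hp) %>%            # keep three variables
  mutate(hp_per_mpg = hp / mpg) %>%   # create a new variable
  group_by(cyl) %>%                   # one group per number of cylinders
  summarise(n_cars = n(),             # n() counts rows in each group
            mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))             # highest average mpg first
verbs_demo

# count() is a shortcut for group_by() followed by summarise(n = n()).
mtcars %>% count(cyl)

# pull() returns a plain vector that base R functions can use.
mtcars %>% pull(mpg) %>% mean()
```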

---
## dplyr package

* advanced iterations
  * `summarize_at` summarise on many variables
  * `mutate_at` create/modify many variables
  * `summarize_all` summarise every variable
  * `mutate_all` create/modify every variable

* for more info
  * dplyr.tidyverse.org
  * R for Data Science
  * Google
  * Stack Overflow
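
In dplyr 1.0.0 and later, the `_at` and `_all` variants listed above are superseded by `across()`. They still work, but new code is usually written with `across()`. A minimal sketch, assuming a recent dplyr; the names `means_by_cyl` and `rounded` are made up:

```r
library(dplyr)

# Mean of several variables per group, similar to summarize_at().
means_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), mean, .names = "mean_{.col}"),
            .groups = "drop")
means_by_cyl

# Modify several variables at once, similar to mutate_at().
rounded <- mtcars %>%
  mutate(across(c(mpg, wt), round))
head(rounded, 3)
```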
---

<img src="images/wickham.png" width="50%" />
---
## mtcars dataset

```r
class(mtcars)
```

```
## [1] "data.frame"
```

```r
mtcars_tbl <- as_tibble(mtcars)
class(mtcars_tbl)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```
---
## mtcars dataset

```r
mtcars
```

```
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
```
---
## mtcars dataset

```r
mtcars_tbl
```

```
## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows
```
---
## mtcars dataset

```r
mtcars_tbl <- mtcars %>%
  mutate(name=row.names(.)) %>%
  as_tibble()

head(mtcars_tbl,3)
```

```
## # A tibble: 3 × 12
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name        
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>       
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4 Mazda RX4   
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4 Mazda RX4 W…
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1 Datsun 710
```
---
## mtcars dataset

```r
mtcars <- mtcars_tbl
```
---
## dplyr::slice()

```r
mtcars %>% slice(c(1,2,3))
```

```r
mtcars %>% slice(c(1,4,5))
```

```
## # A tibble: 3 × 12
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name        
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>       
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4 Mazda RX4   
## 2  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1 Hornet 4 Dr…
## 3  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2 Hornet Spor…
```
---
## dplyr::glimpse()

```r
mtcars %>% glimpse()
```

```
## Rows: 32
## Columns: 12
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
## $ name <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Ho…
```
---
## dplyr::filter()

```r
mtcars %>% filter(cyl==4)
```

```
## # A tibble: 11 × 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name       
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      
##  1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1 Datsun 710 
##  2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2 Merc 240D  
##  3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2 Merc 230   
##  4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1 Fiat 128   
##  5  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2 Honda Civic
##  6  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1 Toyota Cor…
##  7  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1 Toyota Cor…
##  8  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1 Fiat X1-9  
##  9  26       4 120.     91  4.43  2.14  16.7     0     1     5     2 Porsche 91…
## 10  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2 Lotus Euro…
## 11  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2 Volvo 142E
```
---
## dplyr::filter()

```r
mtcars %>% filter(cyl!=4)
```

```
## # A tibble: 21 × 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name       
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 Mazda RX4  
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 Mazda RX4 …
##  3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 Hornet 4 D…
##  4  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 Hornet Spo…
##  5  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 Valiant    
##  6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 Duster 360 
##  7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 Merc 280   
##  8  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4 Merc 280C  
##  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3 Merc 450SE 
## 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3 Merc 450SL 
## # … with 11 more rows
```
---
## dplyr::filter()

```r
mtcars %>% filter(cyl %in% c(4,6))
```

```
## # A tibble: 18 × 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name       
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      
##  1  21       6 160     110  3.9   2.62  16.5     0     1     4     4 Mazda RX4  
##  2  21       6 160     110  3.9   2.88  17.0     0     1     4     4 Mazda RX4 …
##  3  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1 Datsun 710 
##  4  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1 Hornet 4 D…
##  5  18.1     6 225     105  2.76  3.46  20.2     1     0     3     1 Valiant    
##  6  24.4     4 147.     62  3.69  3.19  20       1     0     4     2 Merc 240D  
##  7  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2 Merc 230   
##  8  19.2     6 168.    123  3.92  3.44  18.3     1     0     4     4 Merc 280   
##  9  17.8     6 168.    123  3.92  3.44  18.9     1     0     4     4 Merc 280C  
## 10  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1 Fiat 128   
## 11  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2 Honda Civic
## 12  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1 Toyota Cor…
## 13  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1 Toyota Cor…
## 14  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1 Fiat X1-9  
## 15  26       4 120.     91  4.43  2.14  16.7     0     1     5     2 Porsche 91…
## 16  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2 Lotus Euro…
## 17  19.7     6 145     175  3.62  2.77  15.5     0     1     5     6 Ferrari Di…
## 18  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2 Volvo 142E
```
---
## dplyr::select()

```r
select(mtcars,hp,mpg,cyl)
```

```
## # A tibble: 32 × 3
##       hp   mpg   cyl
##    <dbl> <dbl> <dbl>
##  1   110  21       6
##  2   110  21       6
##  3    93  22.8     4
##  4   110  21.4     6
##  5   175  18.7     8
##  6   105  18.1     6
##  7   245  14.3     8
##  8    62  24.4     4
##  9    95  22.8     4
## 10   123  19.2     6
## # … with 22 more rows
```
---
## dplyr::select()

```r
mtcars %>% select(-mpg)
```

```
## # A tibble: 32 × 11
##      cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name             
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>            
##  1     6  160    110  3.9   2.62  16.5     0     1     4     4 Mazda RX4        
##  2     6  160    110  3.9   2.88  17.0     0     1     4     4 Mazda RX4 Wag    
##  3     4  108     93  3.85  2.32  18.6     1     1     4     1 Datsun 710       
##  4     6  258    110  3.08  3.22  19.4     1     0     3     1 Hornet 4 Drive   
##  5     8  360    175  3.15  3.44  17.0     0     0     3     2 Hornet Sportabout
##  6     6  225    105  2.76  3.46  20.2     1     0     3     1 Valiant          
##  7     8  360    245  3.21  3.57  15.8     0     0     3     4 Duster 360       
##  8     4  147.    62  3.69  3.19  20       1     0     4     2 Merc 240D        
##  9     4  141.    95  3.92  3.15  22.9     1     0     4     2 Merc 230         
## 10     6  168.   123  3.92  3.44  18.3     1     0     4     4 Merc 280         
## # … with 22 more rows
```
---
## dplyr::select()

```r
mtcars %>% select(starts_with("c"),starts_with("h"))
```

```
## # A tibble: 32 × 3
##      cyl  carb    hp
##    <dbl> <dbl> <dbl>
##  1     6     4   110
##  2     6     4   110
##  3     4     1    93
##  4     6     1   110
##  5     8     2   175
##  6     6     1   105
##  7     8     4   245
##  8     4     2    62
##  9     4     2    95
## 10     6     4   123
## # … with 22 more rows
```
---
## dplyr::arrange()

```r
mtcars %>% arrange(mpg)
```

```
## # A tibble: 32 × 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name       
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      
##  1  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4 Cadillac F…
##  2  10.4     8  460    215  3     5.42  17.8     0     0     3     4 Lincoln Co…
##  3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4 Camaro Z28 
##  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 Duster 360 
##  5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4 Chrysler I…
##  6  15       8  301    335  3.54  3.57  14.6     0     1     5     8 Maserati B…
##  7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3 Merc 450SLC
##  8  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2 AMC Javelin
##  9  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2 Dodge Chal…
## 10  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4 Ford Pante…
## # … with 22 more rows
```
---
## dplyr::arrange()

```r
mtcars %>% arrange(desc(mpg))
```

```
## # A tibble: 32 × 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name       
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      
##  1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1 Toyota Cor…
##  2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1 Fiat 128   
##  3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2 Honda Civic
##  4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2 Lotus Euro…
##  5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1 Fiat X1-9  
##  6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2 Porsche 91…
##  7  24.4     4 147.     62  3.69  3.19  20       1     0     4     2 Merc 240D  
##  8  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1 Datsun 710 
##  9  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2 Merc 230   
## 10  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1 Toyota Cor…
## # … with 22 more rows
```
---
## dplyr::arrange()

```r
mtcars %>% arrange(desc(cyl),disp)
```

```
## # A tibble: 32 × 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name       
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      
##  1  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3 Merc 450SE 
##  2  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3 Merc 450SL 
##  3  15.2     8  276.   180  3.07  3.78  18       0     0     3     3 Merc 450SLC
##  4  15       8  301    335  3.54  3.57  14.6     0     1     5     8 Maserati B…
##  5  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2 AMC Javelin
##  6  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2 Dodge Chal…
##  7  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4 Camaro Z28 
##  8  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4 Ford Pante…
##  9  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 Hornet Spo…
## 10  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 Duster 360 
## # … with 22 more rows
```

---
## dplyr::mutate()

```r
mtcars %>% 
  mutate(hpsquare=hp^2) %>%
  select(mpg,cyl,disp,hp,hpsquare)
```

```
## # A tibble: 32 × 5
##      mpg   cyl  disp    hp hpsquare
##    <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1  21       6  160    110    12100
##  2  21       6  160    110    12100
##  3  22.8     4  108     93     8649
##  4  21.4     6  258    110    12100
##  5  18.7     8  360    175    30625
##  6  18.1     6  225    105    11025
##  7  14.3     8  360    245    60025
##  8  24.4     4  147.    62     3844
##  9  22.8     4  141.    95     9025
## 10  19.2     6  168.   123    15129
## # … with 22 more rows
```
---
## dplyr::mutate()

```r
mtcars %>% 
  mutate(randomnoise=rnorm(nrow(.),mean=0,sd=1),
         mpg_with_random_noise = mpg+randomnoise) %>%
  select(mpg,cyl,disp,randomnoise,mpg_with_random_noise)
```

```
## # A tibble: 32 × 5
##      mpg   cyl  disp randomnoise mpg_with_random_noise
##    <dbl> <dbl> <dbl>       <dbl>                 <dbl>
##  1  21       6  160       -0.391                  20.6
##  2  21       6  160        3.10                   24.1
##  3  22.8     4  108       -1.88                   20.9
##  4  21.4     6  258        0.565                  22.0
##  5  18.7     8  360        1.31                   20.0
##  6  18.1     6  225        0.734                  18.8
##  7  14.3     8  360       -0.848                  13.5
##  8  24.4     4  147.       1.32                   25.7
##  9  22.8     4  141.      -0.505                  22.3
## 10  19.2     6  168.      -0.652                  18.5
## # … with 22 more rows
```
---
## dplyr::mutate()

```r
mtcars %>% 
  mutate(cyl=factor(cyl,levels=c(4,6,8),
                    labels=c("4 Cyl","6 Cyl","8 Cyl")))
```

```
## # A tibble: 32 × 12
##      mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb name       
##    <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      
##  1  21   6 Cyl  160    110  3.9   2.62  16.5     0     1     4     4 Mazda RX4  
##  2  21   6 Cyl  160    110  3.9   2.88  17.0     0     1     4     4 Mazda RX4 …
##  3  22.8 4 Cyl  108     93  3.85  2.32  18.6     1     1     4     1 Datsun 710 
##  4  21.4 6 Cyl  258    110  3.08  3.22  19.4     1     0     3     1 Hornet 4 D…
##  5  18.7 8 Cyl  360    175  3.15  3.44  17.0     0     0     3     2 Hornet Spo…
##  6  18.1 6 Cyl  225    105  2.76  3.46  20.2     1     0     3     1 Valiant    
##  7  14.3 8 Cyl  360    245  3.21  3.57  15.8     0     0     3     4 Duster 360 
##  8  24.4 4 Cyl  147.    62  3.69  3.19  20       1     0     4     2 Merc 240D  
##  9  22.8 4 Cyl  141.    95  3.92  3.15  22.9     1     0     4     2 Merc 230   
## 10  19.2 6 Cyl  168.   123  3.92  3.44  18.3     1     0     4     4 Merc 280   
## # … with 22 more rows
```
---
## more on factors

```r
vect <- c("Much less","About the same","Much more")
class(vect)
```

```
## [1] "character"
```

```r
table(vect)
```

```
## vect
## About the same      Much less      Much more 
##              1              1              1
```

```r
vect <- factor(vect,levels=c("Much less","About the same","Much more"))
class(vect)
```

```
## [1] "factor"
```

```r
table(vect)
```

```
## vect
##      Much less About the same      Much more 
##              1              1              1
```
---
## more on factors

```r
vect <- factor(vect,levels=c("Much less","About the same","Much more"),
               labels=c("ml","abs","mm"))
vect
```

```
## [1] ml  abs mm 
## Levels: ml abs mm
```

```r
as.numeric(vect)
```

```
## [1] 1 2 3
```
---
## more on factors

```r
vect <- 1:22
factor(vect)
```

```
##  [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
```

```r
vect <- as.character(1:22)
(fct_vect <- factor(vect))
```

```
##  [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22
## Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9
```

```r
as.numeric(fct_vect)
```

```
##  [1]  1 12 16 17 18 19 20 21 22  2  3  4  5  6  7  8  9 10 11 13 14 15
```
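---
## more on factors

The output above shows the pitfall: `as.numeric()` on a factor returns the internal level codes, not the labels. A minimal sketch of the usual workaround, converting to character first (the object names `codes` and `recovered` are only for illustration):

```r
# Character labels "1".."22" turned into a factor, as on the previous slide
fct_vect <- factor(as.character(1:22))

# as.numeric() returns the position of each value in the (alphabetical) levels
codes <- as.numeric(fct_vect)

# Converting to character first recovers the original numbers
recovered <- as.numeric(as.character(fct_vect))

head(codes)
head(recovered)
```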
---
## more on factors

```r
mtcars %>%
  mutate(cyl=factor(cyl,levels=c(4,6,8),
                    labels=c("4 Cyl","6 Cyl","8 Cyl")),
         cyl2 = fct_recode(cyl,"Small"="4 Cyl",
                            "Big"="6 Cyl",
                            "Big"="8 Cyl")) %>%
  select(cyl,cyl2)
```

```
## # A tibble: 32 × 2
##    cyl   cyl2 
##    <fct> <fct>
##  1 6 Cyl Big  
##  2 6 Cyl Big  
##  3 4 Cyl Small
##  4 6 Cyl Big  
##  5 8 Cyl Big  
##  6 6 Cyl Big  
##  7 8 Cyl Big  
##  8 4 Cyl Small
##  9 4 Cyl Small
## 10 6 Cyl Big  
## # … with 22 more rows
```
---
## more on factors

```r
temp <- mtcars %>%
  mutate(cyl_f=factor(cyl,levels=c(4,6,8),
                    labels=c("4 Cyl","6 Cyl","8 Cyl")),
         cyl2 = fct_recode(cyl_f,"Small"="4 Cyl","Big"="6 Cyl","Big"="8 Cyl"),
         cyl2REV=fct_relevel(cyl2,"Big","Small"))
table(temp$cyl2)
```

```
## 
## Small   Big 
##    11    21
```

```r
table(temp$cyl2REV)
```

```
## 
##   Big Small 
##    21    11
```
---
## dplyr::mutate()

```r
mtcars %>%
  mutate(miles_per_liter = mpg/3.785, # 1 US gallon is about 3.785 liters
         miles_per_gallon=miles_per_liter*3.785) %>%
  select(miles_per_liter,mpg,miles_per_gallon)
```

```
## # A tibble: 32 × 3
##    miles_per_liter   mpg miles_per_gallon
##              <dbl> <dbl>            <dbl>
##  1            5.55  21               21  
##  2            5.55  21               21  
##  3            6.02  22.8             22.8
##  4            5.65  21.4             21.4
##  5            4.94  18.7             18.7
##  6            4.78  18.1             18.1
##  7            3.78  14.3             14.3
##  8            6.45  24.4             24.4
##  9            6.02  22.8             22.8
## 10            5.07  19.2             19.2
## # … with 22 more rows
```
---
## dplyr::group_by()

```r
mtcars %>%
  group_by(cyl)
```

```
## # A tibble: 32 × 12
## # Groups:   cyl [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name       
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 Mazda RX4  
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 Mazda RX4 …
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 Datsun 710 
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 Hornet 4 D…
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 Hornet Spo…
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 Valiant    
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 Duster 360 
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2 Merc 240D  
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 Merc 230   
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 Merc 280   
## # … with 22 more rows
```
---
## dplyr::group_by()

```r
mtcars %>%
  group_by(cyl) %>%
  mutate(mean_mpg_per_cyl=mean(mpg)) %>%
  select(mpg,cyl,disp,mean_mpg_per_cyl)
```

```
## # A tibble: 32 × 4
## # Groups:   cyl [3]
##      mpg   cyl  disp mean_mpg_per_cyl
##    <dbl> <dbl> <dbl>            <dbl>
##  1  21       6  160              19.7
##  2  21       6  160              19.7
##  3  22.8     4  108              26.7
##  4  21.4     6  258              19.7
##  5  18.7     8  360              15.1
##  6  18.1     6  225              19.7
##  7  14.3     8  360              15.1
##  8  24.4     4  147.             26.7
##  9  22.8     4  141.             26.7
## 10  19.2     6  168.             19.7
## # … with 22 more rows
```
---
## dplyr::group_by()

```r
mtcars %>%
  group_by(cyl) %>%
  mutate(max_mpg_per_cyl=max(mpg))%>%
  select(mpg,cyl,disp,max_mpg_per_cyl)
```

```
## # A tibble: 32 × 4
## # Groups:   cyl [3]
##      mpg   cyl  disp max_mpg_per_cyl
##    <dbl> <dbl> <dbl>           <dbl>
##  1  21       6  160             21.4
##  2  21       6  160             21.4
##  3  22.8     4  108             33.9
##  4  21.4     6  258             21.4
##  5  18.7     8  360             19.2
##  6  18.1     6  225             21.4
##  7  14.3     8  360             19.2
##  8  24.4     4  147.            33.9
##  9  22.8     4  141.            33.9
## 10  19.2     6  168.            21.4
## # … with 22 more rows
```
---
## dplyr::group_by()

```r
mtcars %>%
  arrange(cyl,desc(mpg)) %>%
  group_by(cyl) %>%
  mutate(n=1:n()) %>%
  filter(n==1)
```

```
## # A tibble: 3 × 13
## # Groups:   cyl [3]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name       n
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <int>
## 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1 Toyot…     1
## 2  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1 Horne…     1
## 3  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2 Ponti…     1
```
---
## dplyr::group_by()

```r
mtcars %>%
  arrange(cyl,desc(mpg)) %>% # highest mpg first within each cyl
  group_by(cyl) %>%
  mutate(n=1:n()) %>%
  mutate(type=ifelse(n==1,"Best in class","Other")) %>%
  select(cyl,mpg,type,name)
```

```
## # A tibble: 32 × 4
## # Groups:   cyl [3]
##      cyl   mpg type          name          
##    <dbl> <dbl> <chr>         <chr>         
##  1     4  33.9 Best in class Toyota Corolla
##  2     4  32.4 Other         Fiat 128      
##  3     4  30.4 Other         Honda Civic   
##  4     4  30.4 Other         Lotus Europa  
##  5     4  27.3 Other         Fiat X1-9     
##  6     4  26   Other         Porsche 914-2 
##  7     4  24.4 Other         Merc 240D     
##  8     4  22.8 Other         Datsun 710    
##  9     4  22.8 Other         Merc 230      
## 10     4  21.5 Other         Toyota Corona 
## # … with 22 more rows
```
---
## dplyr::group_by()

```r
mtcars %>%
  group_by(cyl) %>%
  filter(hp==max(hp))
```

```
## # A tibble: 3 × 12
## # Groups:   cyl [3]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb name        
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>       
## 1  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2 Lotus Europa
## 2  19.7     6 145     175  3.62  2.77  15.5     0     1     5     6 Ferrari Dino
## 3  15       8 301     335  3.54  3.57  14.6     0     1     5     8 Maserati Bo…
```
---
## dplyr::group_by()

```r
mtcars %>%
  group_by(cyl) %>%
  top_n(1,hp)
```

```r
mtcars %>%
  group_by(cyl) %>%
  arrange(desc(hp)) %>%
  slice(1)
```

```r
mtcars %>%
  group_by(cyl,am) %>%
  count()
```

```
## # A tibble: 6 × 3
## # Groups:   cyl, am [6]
##     cyl    am     n
##   <dbl> <dbl> <int>
## 1     4     0     3
## 2     4     1     8
## 3     6     0     4
## 4     6     1     3
## 5     8     0    12
## 6     8     1     2
```
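---
## dplyr::slice_max()

Since dplyr 1.0.0, `top_n()` is superseded by `slice_max()` and `slice_min()`. A minimal sketch of the same "highest hp per cyl" query, assuming the built-in `mtcars` data frame (without the `name` column used in these slides); `best_hp` is an illustrative name:

```r
library(dplyr)

best_hp <- mtcars %>%
  group_by(cyl) %>%
  slice_max(hp, n = 1) %>%  # keeps ties by default (with_ties = TRUE)
  ungroup()

best_hp %>% select(cyl, hp, mpg)
```

Unlike `arrange()` followed by `slice(1)`, this returns every row tied for the maximum unless `with_ties = FALSE` is set.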
---
## dplyr::summarize()

```r
mtcars %>%
  summarise(mean=mean(mpg))
```

```
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1  20.1
```
---
## dplyr::summarize()

```r
mtcars %>%
  group_by(cyl) %>%
  summarise(mean=mean(mpg))
```

```
## # A tibble: 3 × 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1
```
---
## dplyr::summarize()

```r
mtcars %>%
  group_by(cyl) %>%
  mean(mpg)
```

```
## Warning in mean.default(., mpg): argument is not numeric or logical: returning
## NA
```

```
## [1] NA
```
---
## dplyr::summarize()

```r
mtcars %>%
  group_by(cyl) %>% 
  pull(mpg) %>%
  mean()
```

```
## [1] 20.09062
```
---
## dplyr::summarize()

```r
mtcars %>%
  group_by(cyl) %>%
  summarise(median_mpg=median(mpg),
            mean(mpg),
            sd_mpg=sd(mpg),
            n=n())
```

```
## # A tibble: 3 × 5
##     cyl median_mpg `mean(mpg)` sd_mpg     n
##   <dbl>      <dbl>       <dbl>  <dbl> <int>
## 1     4       26          26.7   4.51    11
## 2     6       19.7        19.7   1.45     7
## 3     8       15.2        15.1   2.56    14
```
---
## dplyr::mutate_at()

```r
mtcars %>%
  mutate_at(.vars = vars(cyl,am),factor) %>%
  select(cyl,am) %>%
  glimpse()
```

```
## Rows: 32
## Columns: 2
## $ cyl <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, …
## $ am  <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
```
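---
## dplyr::across()

In dplyr 1.0.0 and later, the `_at` variants are superseded by `across()`. A sketch of the same conversion written with `across()`, assuming the built-in `mtcars`; `converted` is an illustrative name:

```r
library(dplyr)

converted <- mtcars %>%
  mutate(across(c(cyl, am), factor)) %>%  # apply factor() to both columns
  select(cyl, am)

glimpse(converted)
```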
---
## dplyr::summarise_at()

```r
mtcars %>%
  group_by(cyl) %>%
  summarise_at(.vars=vars(mpg,disp,qsec),.funs=mean)
```

```
## # A tibble: 3 × 4
##     cyl   mpg  disp  qsec
##   <dbl> <dbl> <dbl> <dbl>
## 1     4  26.7  105.  19.1
## 2     6  19.7  183.  18.0
## 3     8  15.1  353.  16.8
```
---
## dplyr::summarise_at()

```r
mtcars %>%
  group_by(cyl) %>%
  summarise_at(.vars=vars(mpg,disp,qsec),
               .funs=list(mean = mean, median = median, max=max,min=min))
```

```
## # A tibble: 3 × 13
##     cyl mpg_mean disp_mean qsec_mean mpg_median disp_median qsec_median mpg_max
##   <dbl>    <dbl>     <dbl>     <dbl>      <dbl>       <dbl>       <dbl>   <dbl>
## 1     4     26.7      105.      19.1       26          108         18.9    33.9
## 2     6     19.7      183.      18.0       19.7        168.        18.3    21.4
## 3     8     15.1      353.      16.8       15.2        350.        17.2    19.2
## # … with 5 more variables: disp_max <dbl>, qsec_max <dbl>, mpg_min <dbl>,
## #   disp_min <dbl>, qsec_min <dbl>
```
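---
## dplyr::across()

The same kind of grouped summary can be sketched with `across()`, which supersedes `summarise_at()`. This assumes the built-in `mtcars` and uses only two functions to keep the output narrow; `by_cyl` is an illustrative name:

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp, qsec),
                   list(mean = mean, median = median),
                   .names = "{.col}_{.fn}"))  # e.g. mpg_mean, mpg_median

names(by_cyl)
```

Note that `across()` orders the new columns by variable first (`mpg_mean`, `mpg_median`, `disp_mean`, ...), whereas `summarise_at()` above orders them by function.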
---
## tidyr

tidyr is a package for cleaning data and reshaping it into tidy form. In tidy data:

* Each variable is a column
* Each observation is a row
* Each value is a cell
---

## This is not tidy

```r
untidy_df
```

```
## # A tibble: 13 × 7
##      age male_2016 female_2016 male_2017 female_2017 male_2018 female_2018
##    <int>     <dbl>       <dbl>     <dbl>       <dbl>     <dbl>       <dbl>
##  1    18     52579       52796     55592       50408     50659       49530
##  2    19     52374       53110     56601       53242     45912       52298
##  3    20     44102       48351     53827       55603     46277       54563
##  4    21     50885       38245     38620       45510     54871       58411
##  5    22     44634       45495     51560       45302     53563       52324
##  6    23     50467       46760     51060       48077     39407       56501
##  7    24     51462       47773     50152       44989     53284       54753
##  8    25     54106       46685     47309       50160     57431       59393
##  9    26     57848       46243     44781       47118     48008       52496
## 10    27     56968       45185     59887       56179     48006       53883
## 11    28     55828       53403     52976       51957     53399       42629
## 12    29     47639       45426     59319       47554     43280       46296
## 13    30     54327       58727     37977       45432     59564       54647
```

---

## tidy from wide to long

```r
tidy_df <- untidy_df %>%
  pivot_longer(-age,names_to = c("gender","year"),
               values_to = "value",names_sep = "_")

tidy_df  
```

```r
pivot_wider(tidy_df,
            names_from=c("gender","year"),
            values_from="value")
```
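---

## tidy from wide to long

A self-contained sketch of the round trip on a toy table. The counts below are invented for illustration and the names `toy_untidy`, `toy_tidy` and `toy_wide` are hypothetical; only the column layout mirrors `untidy_df`.

```r
library(dplyr)
library(tidyr)

# Invented counts, same layout as untidy_df: one column per gender and year
toy_untidy <- tibble(
  age         = c(18L, 19L),
  male_2016   = c(100, 110),
  female_2016 = c(105, 115),
  male_2017   = c(120, 130),
  female_2017 = c(125, 135)
)

# Wide to long: split each column name at "_" into gender and year
toy_tidy <- toy_untidy %>%
  pivot_longer(-age, names_to = c("gender", "year"),
               values_to = "value", names_sep = "_")

# Long back to wide: the column names are rebuilt as gender_year
toy_wide <- toy_tidy %>%
  pivot_wider(names_from = c("gender", "year"), values_from = "value")

toy_tidy
```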

---

## ggplot2

* Build the base with `ggplot()`, then add layers with `geom_point()`, `geom_line()`, `geom_bar()`, `geom_boxplot()`, etc.
* Add labels and themes, change the axis ticks, etc.
* Extremely flexible
---

## ggplot2

```r
ggplot(data=mtcars,aes(x=hp,y=mpg)) +
  geom_point() 
```

<img src="ssmw2022_files/figure-html/unnamed-chunk-101-1.png" style="display: block; margin: auto;" />
---
## ggplot2

```r
ggplot(data=mtcars,aes(x=hp,y=mpg,color=factor(cyl))) +
  geom_point() 
```

<img src="ssmw2022_files/figure-html/unnamed-chunk-102-1.png" style="display: block; margin: auto;" />
---
## ggplot2

```r
(g <- ggplot(data=mtcars,aes(x=hp,y=mpg,color=factor(cyl))) +
   geom_point() +
   geom_smooth(method="lm",se=FALSE) +
   geom_smooth(aes(x=hp,y=mpg,color="1"),color="black",method = "lm",se=FALSE) +
   labs(x="Some X-ax title",y="Some Y-ax title",title="Some title",
       color="Cyl",subtitle = "Some subtitle",caption="Source: My source") +
   scale_x_continuous(limits=c(0,400)) +
   scale_y_continuous(limits=c(0,50)) +
   theme_dark() +
   theme(legend.title = element_text(size=12)) +
   scale_color_manual(values=c("grey","green","pink"))) + facet_wrap(~factor(am))
```

---
## ggplot2

If you are interested in Data Viz, I highly recommend:

**Data Visualization and Data Exploration using R**

**with Dr. Alicia Eads**

Thursday, April 28, 1-4pm, SSMW 2022
---

# Putting it together: two examples

---

# The 2019 Canadian Election Study

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8RHLG1

```r
library(readstata13)
df <- read.dta13("data/2019 Canadian Election Study - Phone Survey v1.0.dta")

ces_2019_subset <- df %>%
  select(q2,q3,q4,q6,q9,q10,q11,q14,q15,q31,q32,weight_CES) %>%
  as_tibble()

ces_2019_subset %>% glimpse()
```

```
## Rows: 4,021
## Columns: 12
## $ q2         <int> 1963, 1973, 1994, 2000, 1984, 1939, 1999, 1995, 1963, 1970,…
## $ q3         <fct> (1) Male, (1) Male, (1) Male, (1) Male, (1) Male, (1) Male,…
## $ q4         <fct> (5) Quebec, (5) Quebec, (5) Quebec, (5) Quebec, (5) Quebec,…
## $ q6         <fct> (3) Not very satisfied, (2) Fairly satisfied, (1) Very sati…
## $ q9         <fct> (8), (10) Great deal of interest, (10) Great deal of intere…
## $ q10        <fct> (1) Certain, (1) Certain, (1) Certain, (1) Certain, (1) Cer…
## $ q11        <fct> "(-9) Don't know / Undecided", "(-9) Don't know / Undecided…
## $ q14        <int> 60, 70, 70, 75, 10, 0, 50, 65, 50, 70, 15, 40, 50, 50, 90, …
## $ q15        <int> 40, 55, 60, 40, 10, 30, 20, 25, 80, 10, 50, 25, 75, 0, -6, …
## $ q31        <fct> (2) Worse, (3) About the same, (1) Better, (3) About the sa…
## $ q32        <fct> (3) Not made much difference, (3) Not made much difference,…
## $ weight_CES <dbl> 0.9019529, 0.9019529, 0.9019529, 1.2334642, 0.9019529, 0.90…
```
---

```r
ces_2019_subset <- ces_2019_subset %>%
  rename(q2_birthyear=q2,q3_gender=q3,q4_province=q4,
         q6_satisfied_democracy=q6,q9_interest_election=q9,
         q10_certain_vote=q10,q11_vote_intention=q11,
         q14_feeling_liberal_party=q14,q15_feeling_cons_party=q15,
         q31_ecnchange=q31,q32_policies_fed_gov_ecn=q32)

ces_2019_subset <- ces_2019_subset %>%
  # Q14 and Q15 have negative values which should be NAs
  # See codebook
  mutate(q14_feeling_liberal_party=ifelse(q14_feeling_liberal_party<0,
                                          NA,q14_feeling_liberal_party),
         q15_feeling_cons_party=ifelse(q15_feeling_cons_party<0,
                                          NA,q15_feeling_cons_party))
```
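
---

The same recoding can be written once for several columns. A sketch on a toy tibble, assuming (as in the codebook note above) that negative values are missing-data codes; the column names `feel_a` and `feel_b` are hypothetical:

```r
library(dplyr)

# Invented thermometer scores; negative values stand for "don't know" / "refused"
toy <- tibble(feel_a = c(60L, -9L, 75L),
              feel_b = c(-6L, 40L, 10L))

# Replace every negative value with NA in both columns at once
toy_clean <- toy %>%
  mutate(across(c(feel_a, feel_b), ~ ifelse(.x < 0, NA, .x)))

toy_clean
```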

---

```r
ces_2019_subset <- ces_2019_subset %>%
  mutate(q11_vote_intention=fct_recode(q11_vote_intention,
      "(4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)"=
      "(4) Bloc QuÃ©bÃ©cois (BQ, PQ, Bloc, Parti QuÃ©bÃ©cois)"))
```

---

```r
ces_2019_subset %>%
  group_by(q3_gender) %>%
  summarise(avg_feeling_lib=mean(q14_feeling_liberal_party,na.rm=TRUE))
```

```
## # A tibble: 3 × 2
##   q3_gender  avg_feeling_lib
##   <fct>                <dbl>
## 1 (1) Male              44.3
## 2 (2) Female            51.8
## 3 (3) Other             40
```

---

```r
to_plot <- ces_2019_subset %>%
  group_by(q3_gender,q4_province) %>%
  summarise(avg_feeling_lib=mean(q14_feeling_liberal_party,na.rm=TRUE),.groups = 'drop')
to_plot
```

```
## # A tibble: 21 × 3
##    q3_gender q4_province                   avg_feeling_lib
##    <fct>     <fct>                                   <dbl>
##  1 (1) Male  (1) Newfoundland and Labrador            45.3
##  2 (1) Male  (2) Prince Edward Island                 46.5
##  3 (1) Male  (3) Nova Scotia                          49.4
##  4 (1) Male  (4) New Brunswick                        47.8
##  5 (1) Male  (5) Quebec                               49.2
##  6 (1) Male  (6) Ontario                              50.3
##  7 (1) Male  (7) Manitoba                             40.7
##  8 (1) Male  (8) Saskatchewan                         32.1
##  9 (1) Male  (9) Alberta                              24.7
## 10 (1) Male  (10) British Columbia                    43.3
## # … with 11 more rows
```

---

```r
to_plot %>%
  filter(q3_gender!="(3) Other") %>%
  ggplot(aes(x=q3_gender,y=avg_feeling_lib,fill=q4_province)) +
  geom_bar(stat = "identity",position="dodge")
```

![](ssmw2022_files/figure-html/unnamed-chunk-109-1.png)
---

```r
to_plot %>%
  filter(q3_gender!="(3) Other") %>%
  ggplot(aes(x=q3_gender,y=avg_feeling_lib,fill=q4_province)) + 
  geom_bar(stat = "identity",position="dodge") + 
  geom_text(aes(label=round(avg_feeling_lib)),
            position = position_dodge(width = .9),vjust=-0.25)
```

![](ssmw2022_files/figure-html/unnamed-chunk-110-1.png)

---

```r
to_plot <- ces_2019_subset %>%
  mutate(q4_province_2=fct_recode(q4_province,"Atlantic"="(1) Newfoundland and Labrador",
                                  "Atlantic"="(2) Prince Edward Island",
                                  "Atlantic"="(3) Nova Scotia",
                                  "Atlantic"="(4) New Brunswick")) %>%
  group_by(q3_gender,q4_province_2) %>%
  summarise(avg_feeling_lib=mean(q14_feeling_liberal_party,na.rm=TRUE),
            .groups = 'drop')
```

---

```r
to_plot %>%
  filter(q3_gender!="(3) Other") %>%
  ggplot(aes(x=q3_gender,y=avg_feeling_lib,fill=q4_province_2)) +
  geom_bar(stat = "identity",position="dodge") +
  geom_text(aes(label=round(avg_feeling_lib)),position =
              position_dodge(width = .9),vjust=-0.25)
```

![](ssmw2022_files/figure-html/unnamed-chunk-112-1.png)

---

```r
to_plot <- ces_2019_subset %>%
  mutate(q4_province_2=fct_recode(q4_province,"Atlantic"="(1) Newfoundland and Labrador",
                                  "Atlantic"="(2) Prince Edward Island",
                                  "Atlantic"="(3) Nova Scotia",
                                  "Atlantic"="(4) New Brunswick")) %>%
  group_by(q3_gender,q4_province_2) %>%
  summarise(avg_feeling_lib=weighted.mean(q14_feeling_liberal_party,
                                          weight_CES,na.rm=TRUE),.groups = 'drop')
```
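
---

To see what `weighted.mean()` does with weights and missing values, here is a small worked example with invented numbers (`scores` and `weights` are illustrative names):

```r
scores  <- c(10, 20, NA)
weights <- c(1, 3, 2)

# Unweighted mean of the non-missing scores: (10 + 20) / 2
plain <- mean(scores, na.rm = TRUE)

# Weighted mean: the NA score is dropped together with its weight,
# leaving (1 * 10 + 3 * 20) / (1 + 3)
weighted <- weighted.mean(scores, weights, na.rm = TRUE)

c(plain = plain, weighted = weighted)
```

`na.rm = TRUE` only handles missing values in the first argument; a missing weight would still give `NA`.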

---

```r
to_plot %>%
  filter(q3_gender!="(3) Other") %>% ggplot(aes(x=q3_gender,y=avg_feeling_lib,
    fill=q4_province_2)) + geom_bar(stat = "identity",position="dodge") +
  geom_text(aes(label=round(avg_feeling_lib)), 
            position =position_dodge(width = .9),vjust=-0.25)
```

![](ssmw2022_files/figure-html/unnamed-chunk-114-1.png)

---

```r
to_plot <- ces_2019_subset %>%
  mutate(q4_province_2=fct_recode(q4_province,"Atlantic"="(1) Newfoundland and Labrador",
                                  "Atlantic"="(2) Prince Edward Island",
                                  "Atlantic"="(3) Nova Scotia",
                                  "Atlantic"="(4) New Brunswick")) %>%
  group_by(q3_gender,q4_province_2) %>%
  summarise(avg_feeling_cons=weighted.mean(q15_feeling_cons_party,weight_CES,na.rm=TRUE),
            avg_feeling_lib=weighted.mean(q14_feeling_liberal_party,weight_CES,na.rm=TRUE),.groups = 'drop')
```

---

```r
to_plot %>%
  pivot_longer(c("avg_feeling_cons","avg_feeling_lib"),names_to = "party",values_to = "value") %>% filter(q3_gender!="(3) Other") %>%
  ggplot(aes(x=q3_gender,y=value,fill=party)) + geom_bar(stat = "identity",position="dodge") +
  geom_text(aes(label=round(value)),position = position_dodge(width = .9),vjust=-0.25) +
  facet_wrap(~q4_province_2) + scale_fill_manual(values=c("darkblue","red"))
```

![](ssmw2022_files/figure-html/unnamed-chunk-116-1.png)

---

```r
to_plot <- ces_2019_subset %>%
  filter(q31_ecnchange %in% c("(1) Better", "(2) Worse", "(3) About the same")) %>%
  mutate(q31_ecnchange=factor(q31_ecnchange,c("(2) Worse","(3) About the same","(1) Better"),
                              labels=c("Worse","About the same","Better"))) %>%
  mutate(q4_province_2=fct_recode(q4_province,"Atlantic"="(1) Newfoundland and Labrador",
                                  "Atlantic"="(2) Prince Edward Island",
                                  "Atlantic"="(3) Nova Scotia",
                                  "Atlantic"="(4) New Brunswick")) %>%
  group_by(q3_gender,q4_province_2,q31_ecnchange) %>%
  summarise(sum_weightsQ31=sum(weight_CES),.groups = 'drop') %>%
  group_by(q3_gender,q4_province_2) %>%
  mutate(pc=sum_weightsQ31/sum(sum_weightsQ31))
```
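
---

The two-step logic above (sum the weights per answer, then divide by the group total) can be checked on a toy example. The data and the names `toy` and `shares` are invented for illustration:

```r
library(dplyr)

toy <- tibble(
  group  = c("a", "a", "a", "b", "b"),
  answer = c("Better", "Worse", "Worse", "Better", "Worse"),
  weight = c(1, 2, 1, 3, 1)
)

shares <- toy %>%
  group_by(group, answer) %>%
  summarise(sum_w = sum(weight), .groups = "drop") %>%  # weight per answer
  group_by(group) %>%
  mutate(pc = sum_w / sum(sum_w)) %>%                   # share within group
  ungroup()

shares
```

Within each group the `pc` column sums to 1, so the values can be read as weighted proportions.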

---

```r
to_plot %>%
  ggplot(aes(x=q31_ecnchange,y=pc,fill=q3_gender)) + facet_wrap(~q4_province_2) +
  geom_bar(stat = "identity",position="dodge") +
  geom_text(aes(label=round(pc*100)),position = position_dodge(width = .9),vjust=-0.25) +
  labs(title="Feeling about the economy by gender and province")
```

---

`$$\text{FeelingLibParty}_i = \beta_0+\beta_1\,\text{EcnChange}_i+\beta_2\,\text{age}_i+\beta_3\,\text{gender}_i+\sum_{p}\gamma_p\,\text{province}_{p,i}+u_i$$`

```r
ces_2019_subset <- ces_2019_subset %>%
  filter(!(q31_ecnchange %in%
  c("(-9) Don't know", "(-8) Refused", "(-7) Skipped"))) %>%
  mutate(q31_ecnchange_numeric=
           # Level labels must match q31 exactly; unmatched labels become NA
           as.numeric(factor(q31_ecnchange,
  c("(2) Worse","(3) About the same","(1) Better"))))

to_model <- ces_2019_subset %>%
  select(q14_feeling_liberal_party,q31_ecnchange_numeric,
         q2_birthyear,q3_gender,q4_province) %>%
  filter(complete.cases(.)) %>%
  mutate(age=2019-q2_birthyear)

m <- lm(q14_feeling_liberal_party~q31_ecnchange_numeric+
          age+q3_gender+q4_province,to_model)
```

---

```r
summary(m)
```

```
## 
## Call:
## lm(formula = q14_feeling_liberal_party ~ q31_ecnchange_numeric + 
##     age + q3_gender + q4_province, data = to_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.498 -21.108   1.342  18.442  69.697 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          21.83572    3.28780   6.641 3.95e-11 ***
## q31_ecnchange_numeric                16.39937    0.60896  26.930  < 2e-16 ***
## age                                  -0.14609    0.03379  -4.324 1.61e-05 ***
## q3_gender(2) Female                   7.47247    1.13553   6.581 5.91e-11 ***
## q4_province(2) Prince Edward Island   0.87570    3.63661   0.241  0.80973    
## q4_province(3) Nova Scotia           -4.54362    3.70761  -1.225  0.22053    
## q4_province(4) New Brunswick         -2.47037    3.70780  -0.666  0.50532    
## q4_province(5) Quebec                -0.58027    2.89823  -0.200  0.84133    
## q4_province(6) Ontario               -0.75340    2.89516  -0.260  0.79471    
## q4_province(7) Manitoba              -6.74018    3.46396  -1.946  0.05181 .  
## q4_province(8) Saskatchewan         -10.10906    3.25871  -3.102  0.00195 ** 
## q4_province(9) Alberta              -13.76282    3.21729  -4.278 1.97e-05 ***
## q4_province(10) British Columbia     -2.08257    2.86832  -0.726  0.46788    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.41 on 2082 degrees of freedom
## Multiple R-squared:  0.3289,	Adjusted R-squared:  0.325 
## F-statistic: 85.03 on 12 and 2082 DF,  p-value: < 2.2e-16
```

```r
library(emmeans)

emmeans(m,specs = "q31_ecnchange_numeric",
        at=list(q31_ecnchange_numeric=c(1,2,3)))
```

```
##  q31_ecnchange_numeric emmean    SE   df lower.CL upper.CL
##                      1   30.5 0.795 2082     29.0     32.1
##                      2   46.9 0.698 2082     45.5     48.3
##                      3   63.3 1.041 2082     61.3     65.3
## 
## Results are averaged over the levels of: q3_gender, q4_province 
## Confidence level used: 0.95
```

---

```r
library(emmeans)

emmeans(m,specs = "q31_ecnchange_numeric",
        at=list(q31_ecnchange_numeric=c(1,3))) %>%
  contrast( method = "pairwise", infer=TRUE)
```

```
##  contrast estimate   SE   df lower.CL upper.CL t.ratio p.value
##  1 - 3       -32.8 1.22 2082    -35.2    -30.4 -26.930  <.0001
## 
## Results are averaged over the levels of: q3_gender, q4_province 
## Confidence level used: 0.95
```
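
---

Because `q31_ecnchange_numeric` enters the model linearly, the contrast above is just the slope times the difference between the two values. A quick check, with the slope and its standard error copied from the `summary(m)` output (`b1` and `se_b1` are illustrative names):

```r
# Estimate and standard error of q31_ecnchange_numeric from summary(m)
b1    <- 16.39937
se_b1 <- 0.60896

# Moving from 3 ("Better") to 1 ("Worse") changes the prediction by b1 * (1 - 3)
contrast_1_vs_3 <- b1 * (1 - 3)
se_contrast     <- abs(1 - 3) * se_b1

round(c(estimate = contrast_1_vs_3, SE = se_contrast), 2)
```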

---

```r
to_model$prediction <-  predict(m)

myrmse <- round(sqrt(mean((to_model$q14_feeling_liberal_party-
                             to_model$prediction)^2)),1)
```
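
---

The RMSE computed above is the square root of the mean squared difference between observed and predicted values. A small helper, with invented numbers to make the arithmetic easy to follow (`rmse` is a hypothetical function name, not part of any package used here):

```r
# Root mean squared error between two numeric vectors of equal length
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Errors are 0, 0 and -2, so the RMSE is sqrt((0 + 0 + 4) / 3)
toy_rmse <- rmse(c(1, 2, 3), c(1, 2, 5))
toy_rmse
```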

---

```r
ggplot(to_model,aes(x=q14_feeling_liberal_party,y=prediction)) +
  geom_jitter() +
  labs(title=paste0("predictions and true values plotted: rmse=",myrmse))
```

![](ssmw2022_files/figure-html/unnamed-chunk-124-1.png)
---
## SCC cases

58 cases from the SCC in 2021. They can be accessed through the link below and can also be downloaded as a single zip file.

```r
library(tidyverse)
library(textreadr)
library(ldatuning)
library(topicmodels)
library(tidytext)
# https://decisions.scc-csc.ca/scc-csc/scc-csc/en/2021/nav_date.do
download.file("https://justinsavoie.com/data/2021.zip","~/Downloads/2021.zip")
# Unzip the archive so that the .docx files end up in ~/Downloads/2021
```

See https://www.tidytextmining.com/topicmodeling.html for more info on tidy text analysis. 
---

```r
folder_2021 <- "~/Downloads/2021"
paths <- file.path(folder_2021,list.files(folder_2021))
my_list <- list()

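for (k in seq_along(paths)){
  doc <- textreadr::read_docx(paths[k])
  citation <- str_replace(doc[2],"Citation: ","")
  
  # Keep the text between the "Present: " line and the "Cases Cited" heading
  start <- which(grepl("Present: ",doc))+3
  end <- which(grepl("Cases Cited",doc))-1
  
  # Skip documents where either marker is missing
  if (length(start)==0 || length(end)==0){
    next
  }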
  
  doc_sub <- doc[start:end]
  df <- tibble(paragraph=paste0(doc_sub,collapse=" "),citation=citation)
  
  my_list[[k]] <- df %>%
    unnest_tokens(word, paragraph) %>%
    mutate(word=str_replace(word,"\\.","")) %>%
    filter(!grepl('[0-9]',word)) %>%
    filter(nchar(word)>4)
}

df_words <- bind_rows(my_list)
```
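---

To see what the tokenising and filtering steps inside the loop do, here is a sketch on a single invented sentence (the citation and text are made up; `toy` and `tokens` are illustrative names):

```r
library(dplyr)
library(tidytext)

toy <- tibble(
  citation  = "Toy case",
  paragraph = "The Court held in 2021 that the appeal is dismissed."
)

tokens <- toy %>%
  unnest_tokens(word, paragraph) %>%  # one lowercase word per row
  filter(!grepl("[0-9]", word)) %>%   # drop tokens containing digits
  filter(nchar(word) > 4)             # keep words longer than 4 characters

tokens
```

Only "court", "appeal" and "dismissed" survive, which shows how aggressively the length filter removes short function words.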
---

```r
df_words %>%
  group_by(citation,word) %>%
  summarise(n=n()) %>%
  group_by(citation) %>%
  mutate(pc=n/sum(n)) %>%
  arrange(citation,desc(pc)) %>%
  group_by(citation) %>%
  slice(1:5) %>%
  ungroup() %>%
  ggplot(aes(x=word,y=pc)) +
  geom_bar(stat="identity") +
  facet_wrap(~citation,scales = "free")
```

---

```r
dtm <- cast_dtm(df_words%>%group_by(citation,word)%>%summarise(n=n()),
         citation,word,n)

modelLDA <- LDA(dtm,k=3)
wtp <- tidy(modelLDA,matrix="beta")

wtp %>%
  arrange(topic,desc(beta)) %>%
  group_by(topic) %>%
  slice(1:5) %>%
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(y=term,fill=factor(topic),x=beta)) +
  geom_bar(stat = "identity",position = "dodge") +
  facet_wrap(~topic,ncol=1,scales = "free") + scale_y_reordered()
```
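---

LDA starts from a random initialisation, so the topics can change from run to run. A sketch of fixing the seed through the `control` argument of `LDA()`, on a tiny invented corpus (the counts and the names `toy_counts`, `toy_dtm`, `fit_a` and `fit_b` are illustrative):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Invented word counts for four made-up documents
toy_counts <- tibble(
  citation = c("Case A", "Case A", "Case B", "Case B",
               "Case C", "Case C", "Case D", "Case D"),
  word     = c("appeal", "evidence", "appeal", "sentence",
               "contract", "damages", "contract", "evidence"),
  n        = c(4L, 3L, 5L, 2L, 6L, 3L, 4L, 1L)
)

toy_dtm <- cast_dtm(toy_counts, citation, word, n)

# Same seed, same data: the two fits should give the same word probabilities
fit_a <- LDA(toy_dtm, k = 2, control = list(seed = 1234))
fit_b <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

beta_a <- tidy(fit_a, matrix = "beta")
beta_b <- tidy(fit_b, matrix = "beta")
beta_a
```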
---
For a discussion of how to choose the number of topics, see https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html

```r
result <- FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 50, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  verbose = TRUE
)

FindTopicsNumber_plot(result)

modelLDA <- LDA(dtm,k=9)
wtp <- tidy(modelLDA,matrix="beta")
```
---

```r
wtp %>%
  arrange(topic,desc(beta)) %>%
  group_by(topic) %>%
  slice(1:10) %>%
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(y=term,x=beta)) + geom_bar(stat = "identity",position = "dodge") +
  facet_wrap(~topic,scales = "free") + scale_y_reordered()
```

---

# Thank you! Questions, comments, discussions ...