---
title: "Project1"
author: "PUBPOL 750 McMaster"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```

## Project 1 - Exploring data from the 2019 Canadian Election Study

In this project, we use data from the [2019 Canadian Election Study](http://www.ces-eec.ca/) (CES) to produce an exploratory data analysis. We start with a univariate exploratory data analysis. Then we move to bivariate analysis. Data available [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8RHLG1) (version 1.1 Stata14), or directly on the class website.

Section 1 is a code along. You just have to run the code. There's no code to write. However, there are questions (marked [QUESTION]{style="color: red;"}: in red) to answer. Answer them directly in the qmd file.

**Do not start a new Quarto file from scratch.** Work from the available Quarto file available on the website. Add inline text to answer the text in the qmd file. 

In Section 2, I ask you to run the an analysis similar to the one in section 1, but on some other variables of your choice. You can pick any variables from the 2019 CES, or from another dataset if you prefer.

Project 1 is due on November 8. When you are done, knit this Quarto file to html. Submit both the html file and this .qmd (Quarto) file.

### Section 1 - Exploring variables (code along)

#### Loading packages, loading the data

```{r}
#| warning: false
#| message: false
library(tidyverse)
# The CES data provided is in Stata format, so we need haven
library(haven)
# We need e1071 for kurtosis and skewness
library(e1071)
# We need kableExtra to produce nice html data tables
library(kableExtra)
# Read in the data, assign to df
df <- read_stata("2019 Canadian Election Study - Phone Survey v1.1.dta")
# To convert numeric value labels to factor
library(labelled)
```

Use the glimpse function on the dataset.

```{r}
glimpse(df)
```

[QUESTION]{style="color: red;"}: How many individuals are there in the dataset? How many variables? What are the column types present in the data (they are between "\<\>" in the output of the glimpse() function? What is a dbl+lbl? You can read the first section of [this document](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html) (up until section *Variable labels*).

#### Univariate analysis

Let's look at the distribution of age.

```{r}
#| fig-cap: "Histogram age"
ggplot(df,aes(x=age)) +
  geom_histogram()
```

Let's calculate the number of values for which age is not missing, the mean and the median.

```{r}
sample_size_age <- df |>
  summarise(sample_size_age=sum(!is.na(age))) |>
  pull(sample_size_age)
# we could also use the tidyverse to get the mean and median but since it's simple
# let's just use the compact way
my_mean <- mean(df$age,na.rm=TRUE)
my_median <- median(df$age,na.rm=TRUE)
```

Let's redo our histogram, but adding a vertical line where the median is. We can add a caption to programmatically indicate the sample size.

```{r}
ggplot(df,aes(x=age)) +
  geom_histogram(binwidth=1,fill="white",color="black") +
  theme_classic() +
  labs(
    x="Age",
    y="Count (in survey)",
    title="Age distribution in Canada",
    # You can read on ?paste0 and ?format
    caption=paste0("Data from CES 2019; n = ",format(sample_size_age,big.mark   = ","))
  ) +
  geom_vline(aes(xintercept=my_mean),linetype=2) +
  annotate("text", x = my_mean-2, y = 90, label = "mean",angle = 90)
```

[QUESTION]{style="color: red;"}: In your own words, how would you describe the distribution of age?

In the data, there's a variable called age_range. Let's look at it with group_by and count.

```{r}
df |>
  group_by(age_range) |>
  count()
```

Let's print the first five values of age_range. We see age_range is a labelled_double. That means the variable is a number, but it has a label associated with it. It's similar to a factor: it's a number with a label associated with it.

You can convert it to a factor like this:

```{r}
df$age_range <- to_factor(df$age_range)
```

Let's look at the possible levels age_range can take.

```{r}
levels(df$age_range)
```

[QUESTION]{style="color: red;"}: How many missing values are there? (go back to the count() above ) How many Don't know's, Refused, Skipped? Why do you think this is the case? People could refuse to answer, it's an option, so why are there none?

Now, plot age_range.

```{r}
ggplot(df,aes(x=age_range)) +
  geom_bar()
```

Now, imagine that these are not the groups you want. Rather, you want 18-34, 35-54, 55+. Recode the groups using the following code. You are using the `cut()` function.

```{r}
df <- df |>
  mutate(
    age_group=cut(age,breaks=c(-Inf,17,34,54,Inf),
                  label=c("0-17","18-34","35-54","55+")),
    age_group=droplevels(age_group))
```

Plot this using a bar graph.

```{r}
ggplot(df,aes(x=age_group)) +
  geom_bar()
```

You can add labels with the count number like this.

```{r}
ggplot(df, aes(x = age_group)) +
  geom_bar() + labs(x = "", y = "") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, colour = "white")
```

Let's drop the empty levels from the age_range factor. You can use the `recode()` function if you want to recode (e.g. clean) them.

```{r}
df <- df |>
  mutate(age_range = droplevels(age_range))
df <- df |>
  mutate(age_range=recode(age_range,
                          "(1) 18-24 years old"="18-24",
                          "(2) 25-34 years old"="25-34",
                          "(3) 35-44 years old"="35-44",
                          "(4) 45-54 years old"="45-54",
                          "(5) 55+ years old"="55+"))
```

That can be plotted too.

```{r}
ggplot(df,aes(x=age_range)) +
  geom_bar() +
  labs(x="",y="")
```

Lastly, instead of visualizing age with a graph, let's use a table to get all the summary statistics. Use `kable()` to output these numbers.

```{r}
age_summary <- df |>
  summarize(
    mean_age = mean(age, na.rm = TRUE), 
    sd_age = sd(age, na.rm = TRUE), 
    min_age = min(age, na.rm = TRUE), 
    max_age = max(age, na.rm = TRUE), 
    median_age = median(age, na.rm = TRUE), 
    skew_age = skewness(age, na.rm = TRUE), 
    kurtosis_age = kurtosis(age, na.rm = TRUE), 
    n_age =  sum(!is.na(age))
  )

age_summary |>
  kable(format = "simple") 

age_summary |>
  t() |>
  kable(format = "simple") 
```

[QUESTION]{style="color: red;"}: What's the mean/sd/min/max/median/skewnewss/kurtosis? Interpret the skewness and kurtosis?

[QUESTION]{style="color: red;"}: `t()` function stands for transpose. What does `t()` do in practice?

Now, let's look at the variable household income.

```{r}
summary(df$q69)
```

Looking at the codebook, we see that -8 and -9 should be coded as NA.

```{r}
df <- df |>
  mutate(hincome=ifelse(q69 %in% c(-8,-9), NA, q69))
```

```{r}
ggplot(df,aes(x=hincome)) +
  geom_histogram(binwidth = 5000)
```

It's sort of annoying that the data is all clustered to the left because of those "1%" rich outliers. Let's zoom in. Let's also pick bins of 1000.

```{r}
ggplot(df,aes(x=hincome)) +
  geom_histogram(binwidth = 1000) +
  coord_cartesian(xlim=c(0,500000)) +
  scale_x_continuous(breaks=seq(0,500000,by=100000),labels=c("0","100k","200k","300k","400k","500k"))
```

[QUESTION]{style="color: red;"}: What do you notice? What's the shape of the distribution? Why the weird spikes?

[QUESTION]{style="color: red;"}: By copy and pasting the code above for the summary table using `kable()` try to replicate the statistical analysis but on household income instead of age. Comment very briefly.

[QUESTION]{style="color: red;"}: Now on to another variable. Check q11 and q12 in the codebook. What are those variables?

Take a look at these three frequency tables.

```{r}
df |>
  group_by(q11) |>
  count()
df |>
  group_by(q12) |>
  count()
df |>
  group_by(q11,q12) |>
  count() |>
  pivot_wider(names_from = q11,values_from = n)
```

So, given this, if we want to merge these two variables, we should start with q12, and if q12 is NA, then put q11. This can be done using the `coalesce()` function. Think of two columns in excel. Column A and column B. `coalesce(a,b)` creates a new variable equal to a if a is not missing and equal to b if b is missing.\
\
Specifically, we are coalescing q12 and q11 so that when q12 is missing q11 is used. Equivalently, we could check if q11 = "(-9) Don't know / Undecided" and put values from q12 in using the `ifelse()` function.\
\
In other words, for those who said "(-9) Don't know / Undecided", we want to plug in their answer to the follow asking "Is there a party you are leaning to?"

```{r}
df <- df |>
  mutate(vote_coalesced = coalesce(q12,q11))
```

[QUESTION]{style="color: red;"}: This returns an error message. It's not the end of the world, nothing breaks but we should understand why. Can you identify why? You can print the labels by doing `df$q11` and `df$q12`.

Let's clean this variable. You can use what we call a named vector with the recode function.

Recode is neat. If for instance you wanted to remove the Refused and Skipped, you could replace the "OTH" there by NA.

Note that in recode, the syntax is "old value"="new value". This is different than usually. Usually (e.g. using mutate) syntax is "new value"="old value".\
\
When you use a function for the first time, make sure you read the help file or copy paste code using that function from a trusted resource (e.g. a stackoverflow answer with a lot of up votes).

```{r}

df <- df |>
  mutate(vote_coalesced=to_factor(vote_coalesced))

my_cleaning_mapping <- c("(-9) Don't know"="DK",
  "(-8) Refused"="OTH",
  "(-7) Skipped"="OTH",
  "(1) Liberal (Grits)"="LIB", 
  "(2) Conservatives (Tory, PCs, Conservative Party of Canada)"="CONS", 
  "(3) NDP (New Democratic Party, New Democrats, NDPers)"="NDP",
  "(4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)"="BQ", 
  "(5) Green Party (Greens)"="GRN",
  "(6) People's Party"="PPC",
  "(7) Other"="OTH", 
  "(8) Will not vote"="OTH",
  "(9) None of these"="OTH",
  "(10) Will spoil ballet"="OTH", 
  "(-9) Don't know / Undecided"="DK")

df <- df |>
  mutate(
    vote_clean=recode(vote_coalesced,!!!my_cleaning_mapping)
  )
```

When you are going to plot this, it's nice if OTH and DK are at the end. You can set the factor levels like this.

```{r}
df <- df |>
  mutate(
    vote_clean = factor(vote_clean,levels=c("LIB", "CONS", "NDP", "BQ", "GRN", "PPC","OTH","DK"))
  )
```

Plot this.

```{r}
party_colors <- c("LIB"="red",
  "CONS"="darkblue",
  "NDP"="orange",
  "BQ"="lightblue",
  "GRN"="darkgreen",
  "PPC"="purple",
  "DK"="darkgrey",
  "OTH"="grey")

df |>
  count(vote_clean) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = vote_clean, y = prop, fill = vote_clean)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values=party_colors) +
  scale_y_continuous(limits=c(0,0.35))
```

[QUESTION]{style="color: red;"} Given what you know about the 2019 election (or using Wikipedia), does this distribution appear reasonable?

Up to now, we've looked at two continuous variable (age and household income) and one nominal categorical variable (vote intention).

Let's look at another categorical variable, this time an ordered categorical variable.

Q46 asks: Do you strongly agree, somewhat agree, somewhat disagree, or strongly disagree with the following statement: "Justin Trudeau kept the election promises he made in 2015."

```{r}
ggplot(df,aes(x=q46)) +
  geom_bar()
```

Let's clean it like this:

```{r}

df$q46 <- to_factor(df$q46)

my_cleaning_mapping <- c(
  "(-9) Don't know"="Don't know",
  "(-8) Refused"=NA,
  "(-7) Skipped"=NA,
  "(1) Strongly agree"="Strongly agree", 
  "(2) Somewhat agree"="Somewhat agree",
  "(3) Somewhat disagree"="Somewhat disagree",
  "(4) Strongly disagree"="Strongly disagree"
)

df <- df |>
  mutate(q46=recode(q46,!!!my_cleaning_mapping),
         q46=factor(q46,levels=c("Strongly disagree","Somewhat disagree",
                                 "Somewhat agree","Strongly agree","Don't know")),
         q46=droplevels(q46))

df <- df %>%
  mutate(q46 = recode(q46, !!!my_cleaning_mapping),
         q46 = factor(q46, levels=c("Strongly disagree", "Somewhat disagree",
                                    "Somewhat agree", "Strongly agree", "Don't know")))


```

Let's plot it again.

```{r}
ggplot(df,aes(x=q46)) +
  geom_bar()
```

Note that in the social science, an ordinal categorical variable is often be converted to numeric. Here, you would have Strongly disagree becomes 1, Somewhat disagree becomes 2, Somewhat agree becomes 3. Strongly agree becomes 4. Note, "lowest / disagree" always come first: we want the numeric scale from "less/disagree" to "more/agree. You also have to decide what to do with the DK. You could drop them (put NAs or code them as 2.5 (middle of scale).

Up to now, we did univariate analysis on age, household income, vote intention and q46 (a question asking if Trudeau held his promises).

#### Bivariate analysis

Now, let's do bivariate analysis. We will cover this more in the next weeks.

When doing bivariate analysis, you look at the distribution of two variables against each other. Bivariate analysis will be different depending on variable type. This is discussed in the book. You can have:

1 - two continuous variables

2 - one continuous variable and one categorical variable

3 - two categorical variables

Let's explore the three possibilities in turn.

#### 1 - two continuous variables

This is, in a way, the simplest case. We've all seen scatterplots. In the book, they look at carat and price.

Let's plot age on x and household income on y.

```{r}
ggplot(df,aes(x=age,y=hincome)) +
  geom_point()
```

All the points are clustered on top of each other and we don't see much. You can use "alpha" to play with the opacity. Where many points overlap the color will be darker. Where not many points overlap the color will be lighter. This way, you see where the density is.

```{r}
ggplot(df,aes(x=age,y=hincome)) +
  geom_point(alpha=0.1)
```

In the social sciences, we often ask questions of the type: "who's richer, the young or the old". One (extremely naive) way to ask this question is to fit a linear model where income = b0 + b1 \* age + error. You'll see details of this in the sequel to this class. In other words, you ask: when age increases, does household income increase or decrease (or stays the same). You can fit that straight line using a linear model "lm" smooth.

```{r}
ggplot(df,aes(x=age,y=hincome)) +
  geom_point(alpha=0.1) + geom_smooth(method="lm", se=FALSE)
```

This suggests that income decreases with age. But, you can notice that this fit is terrible. Looking at the data, it actually seems that income first increases and then decreases with age.

Fitting a straight line is what we call a linear model of degree 1.

We can fit, instead, a local regression. The idea is that instead of having one straightline, you have many local straightlines so the line "moves". This captures non-linearities.

```{r}
ggplot(df,aes(x=age,y=hincome)) +
  geom_point(alpha=0.1) + geom_smooth(se=FALSE)
```

This seems more reasonable, though this statistical model still seems to be poor in the sense that you probably do not explain a lot of the variation in household income using only one variable (age). For now, that's enough for two continuous variables. More on this next semester.

#### 2 - one continuous variable and one categorical variable

One good way to visualize the relation between one continuous variable and one categorical variable is the boxplot.

```{r}
ggplot(data = df, mapping = aes(x = vote_clean, y = hincome)) +
  geom_boxplot()
```

Above, we see the outliers really well, but the bottom is a bit clustered and hard to read. Let's try to transform the y axis using a square root transformation.

```{r}
ggplot(data = df, mapping = aes(x = vote_clean, y = hincome)) +
  geom_boxplot() +
  scale_y_sqrt(breaks=c(0,25000,50000,100000,500000,1000000,2000000))
```

Better, we still hard to read. Let's just cut off the rich above 250,000 so we see clearly the distribution of the normal folks.

```{r}
ggplot(data = df, mapping = aes(x = vote_clean, y = hincome)) +
  geom_boxplot() + coord_cartesian(ylim = c(0,250000))
```

[QUESTION]{style="color: red;"} How would you interpret these findings from those three boxplots?

#### 3 - two categorical variables

In the book, they use geom_count().

```{r}
ggplot(data = df) +
  geom_count(mapping = aes(x = vote_clean, y = q46))
```

This is interesting in the sense that we that many liberals think Trudeau held is promises and many conservatives think he did not.

In the social sciences, what we usually want to know is: "how many % of liberals think he did", "how many % of conservatives think he did" etc. The simplest way to do this is as follow:

```{r}
df |>
  group_by(vote_clean,q46) |>
  count() |>
  group_by(vote_clean) |>
  mutate(percent=100*n/sum(n)) |>
  ggplot(aes(x=q46,y=percent,fill=vote_clean)) + 
  geom_bar(position="dodge",stat="identity") +
  geom_text(aes(label=round(percent)),position=position_dodge(width=0.9),vjust=-0.1)
```

[QUESTION]{style="color: red;"} If you focus on the NDP, the Conservatives and the Liberal only (ignore the other parties/choices), for each party, how many % of respondents agree Trudeau his promises (agree = strongly agree + somewhat agree)?

### Section 2 - Your turn

Pick two variables of your choice. The variables can be in this dataset, or they can be in another dataset of your choice. The variables can be continuous or categorical.

Describe in a few words the variables you picked.

Make three figures. The first two figures are univariate analyses of your two variables (bar chart if categorical, histogram if continuous). Add informative figure captions (see R4DS 29.6.2). There's also an example above in the first figure in this document.

The third figure should depict the two variables together (bivariate analysis). This is not something we've explicitly covered in the class yet but you should be able to adapt the code from above. 

If you two variables are continuous, you can use geom_point() and identify x and y.

If your two variables are categorical, you can imitate the analysis above where we breakdown q46 by vote.

If one variable is continuous and one variable is categorical you can replicate the boxplot analysis.

Very succinctly, comment each figure. Try to make the figures as aesthetically pleasing as possible.