
PUBPOL 750 Data Analysis for Public Policy I: Week 9

Justin Savoie

MPP-DS McMaster

2023-11-15

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.

It’s a fundamental tool of data analysis across the social sciences.

It quantifies the average effect of changes in independent variables on a dependent variable.

It can be used for prediction, description or causal explanation.

- Models the relationship between a dependent variable and one independent variable in a linear way: \(Y = \beta_0 + \beta_1X_1 + \epsilon\)
- \(Y\) is the dependent variable
- \(X_1\) is the independent variable
- \(\beta_0\) is the intercept
- \(\beta_1\) is the slope coefficient
- \(\epsilon\) represents the error term
- Here, it’s a *simple* linear regression because there is one independent variable, \(X_1\)

For example, for an income of $50,000, the predicted value for life satisfaction is 3.37. Of course, having an income of $50,000 does not mean your life satisfaction is exactly 3.37. That’s why there’s an error term \(\epsilon\) in the model.

*This is made-up data. The true relationship would not be as clear.*

Typically, \(Y = \beta_0 + \beta_1X_1 + \epsilon\) will be used for the general (population) model, while \(y = b_0 + b_1x_1 + e\) refers to the estimated regression equation based on sample data, where \(e\) is the residual.
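As a minimal sketch, the estimation step looks like this in R. The simulated data and its parameters are assumptions, made up in the same spirit as the slides’ example:

```r
# Simulate data loosely mirroring the slides' example (an assumption):
# income on a 0-10 scale, satisfaction on a roughly 0-10 scale.
set.seed(42)
df <- data.frame(income = runif(100, 0, 10))
df$satisfaction <- 0.9 + 0.5 * df$income + rnorm(100, sd = 1)

# Estimate b0 and b1 by fitting y = b0 + b1*x1 + e
fit <- lm(satisfaction ~ income, data = df)
coef(fit)  # named vector: (Intercept) and income
```

With 100 observations, the estimated slope will land close to the true value of 0.5 used in the simulation.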

The residual is the distance between the prediction (i.e. the line of best fit) and the true observed value. The residuals are shown with the black lines.

The terms “error” and “residual” are sometimes used interchangeably, but there’s a subtle difference. The residual is the observed difference once you fit the line. The error is its counterpart in the theoretical model, and it is unobservable.

The most common method for estimating the coefficients of a linear regression model is Ordinary Least Squares (OLS). This method minimizes the sum of the squared differences between the observed values and the values predicted by the model: the sum of the squared residuals.

This is the best model of the form \(y = b_0 + b_1x_1 + \epsilon\) because it minimizes the sum of the squared residuals.

In contrast, this is NOT the best model of the form \(y = b_0 + b_1x_1 + \epsilon\) because it does not minimize the sum of the squared residuals.
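This minimization can be checked numerically. A small sketch on simulated data (an assumption): compute the sum of squared residuals for the OLS line and for an arbitrary alternative line.

```r
set.seed(1)
income <- runif(100, 0, 10)
satisfaction <- 0.9 + 0.5 * income + rnorm(100)
fit <- lm(satisfaction ~ income)

# Sum of squared residuals for any candidate line b0 + b1*x
ssr <- function(b0, b1) sum((satisfaction - (b0 + b1 * income))^2)

ssr_ols   <- ssr(coef(fit)[1], coef(fit)[2])  # same as sum(resid(fit)^2)
ssr_other <- ssr(1.5, 0.3)                    # an arbitrary alternative line
ssr_ols < ssr_other                           # TRUE: OLS minimizes the SSR
```

Any intercept/slope pair other than the OLS estimates produces a larger sum of squared residuals.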

```
Call:
lm(formula = satisfaction ~ income, data = df)

Coefficients:
(Intercept)       income
     0.9311       0.5247
```

On average, when income increases by one unit, satisfaction increases by 0.52. When income is 0, the predicted satisfaction is 0.93.

```
Call:
lm(formula = satisfaction ~ income, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-2.5868 -0.4679  0.0993  0.5708  3.2122

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.93111    0.19094   4.876 4.17e-06 ***
income       0.52470    0.03232  16.232  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9682 on 98 degrees of freedom
Multiple R-squared:  0.7289,  Adjusted R-squared:  0.7261
F-statistic: 263.5 on 1 and 98 DF,  p-value: < 2.2e-16
```

The standard error measures the uncertainty around the estimate. When it’s small, you have more confidence in the estimate. We usually call an estimate statistically significant if Pr(>|t|) (the “p-value”) is below 0.05. The p-value is the probability of obtaining an estimate at least as extreme as the one actually observed, if there were no true effect. We can get an approximate 95% confidence interval around the estimate by adding ± 1.96 × the standard error.
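For the income coefficient above, that interval is roughly 0.52470 ± 1.96 × 0.03232 ≈ (0.46, 0.59). In R, `confint()` does this directly, using the exact t quantile rather than 1.96. A sketch on simulated data (an assumption):

```r
set.seed(1)
income <- runif(100, 0, 10)
satisfaction <- 0.9 + 0.5 * income + rnorm(100)
fit <- lm(satisfaction ~ income)

s   <- summary(fit)$coefficients
est <- s["income", "Estimate"]
se  <- s["income", "Std. Error"]

c(lower = est - 1.96 * se, upper = est + 1.96 * se)  # normal approximation
confint(fit, "income", level = 0.95)                 # exact t-based interval
```

With 98 degrees of freedom, the t quantile (about 1.98) is so close to 1.96 that the two intervals are nearly identical.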

Key: treat categorical variables as numbers (0/1 dummy variables).

```
Call:
lm(formula = satisfaction ~ community, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-3.397 -1.377  0.023  1.249  4.833

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.0769     0.2799  14.563   <2e-16 ***
communityurban  -0.8178     0.3676  -2.225   0.0284 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.814 on 98 degrees of freedom
Multiple R-squared:  0.04808,  Adjusted R-squared:  0.03837
F-statistic: 4.95 on 1 and 98 DF,  p-value: 0.02838
```
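Under the hood, R codes the two-level factor as a single 0/1 dummy: the reference level (here `rural`, absorbed into the intercept) becomes 0 and `urban` becomes 1. A sketch with `model.matrix()`, using the level names implied by the output above:

```r
# A two-level factor becomes one 0/1 column in the design matrix;
# the reference level ("rural", first alphabetically) is the baseline.
community <- factor(c("rural", "urban", "urban", "rural"))
m <- model.matrix(~ community)
m[, "communityurban"]  # 0 1 1 0
```

So the intercept (4.0769) is the average satisfaction for rural respondents, and 4.0769 − 0.8178 is the average for urban respondents.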

```
Call:
lm(formula = satisfaction ~ education, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5709 -1.4955  0.0437  1.3485  5.2117

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                     3.0946     0.3086  10.026   <2e-16 ***
educationCollege Trade School   0.6032     0.4630   1.303   0.1957
educationUniversity             0.9163     0.4306   2.128   0.0359 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.826 on 97 degrees of freedom
Multiple R-squared:  0.0456,  Adjusted R-squared:  0.02592
F-statistic: 2.317 on 2 and 97 DF,  p-value: 0.104
```
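In this model, both education coefficients are contrasts against the omitted reference category, the level captured by the intercept. R orders factor levels alphabetically by default; `relevel()` lets you pick a different baseline. A sketch (the label "High School" is an assumed reference category for illustration, not necessarily the one in the course data):

```r
# The baseline level is absorbed into the intercept; other levels get dummies.
education <- factor(c("High School", "College Trade School", "University"))
levels(education)  # alphabetical by default

education2 <- relevel(education, ref = "University")
levels(education2)[1]  # "University" is now the reference category
```

Changing the reference category does not change the model’s fit, only which group the coefficients are compared against.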

In practice, linear regression will have multiple predictors: \[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + ... +\epsilon\] Life satisfaction will be modelled as a function of multiple factors.

Perhaps, we could have: \[life\_satisfaction = \beta_0 + \beta_1*income + \beta_2*physical\_health + \\ \beta_3*mental\_health + \beta_4*quality\_of\_infrastructures + ... +\epsilon\]

In our example, we can have: \[life\_satisfaction = \beta_0 + \beta_1*income + \beta_2*communityurban + \\ \beta_3*educationCollegeTradeSchool + \beta_4*educationUniversity + ... +\epsilon\]

```
Call:
lm(formula = satisfaction ~ income + community + education, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-2.54264 -0.44395  0.03309  0.54851  3.03890

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                    1.17894    0.25357   4.649 1.07e-05 ***
income                         0.52328    0.03445  15.190  < 2e-16 ***
communityurban                -0.28474    0.19988  -1.425    0.158
educationCollege Trade School -0.06170    0.24983  -0.247    0.805
educationUniversity           -0.15727    0.23983  -0.656    0.514
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9701 on 95 degrees of freedom
Multiple R-squared:  0.7362,  Adjusted R-squared:  0.7251
F-statistic: 66.27 on 4 and 95 DF,  p-value: < 2.2e-16
```

For multiple linear regression, we can plot the marginal effect: the average effect of one variable, holding all other variables constant.
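One way to build such a plot is to predict over a grid of income values while fixing the other predictors at chosen values. A sketch on simulated data (the simulation and the level names are assumptions):

```r
set.seed(1)
df <- data.frame(
  income    = runif(100, 0, 10),
  community = factor(sample(c("rural", "urban"), 100, replace = TRUE)),
  education = factor(sample(c("High School", "University"), 100, replace = TRUE))
)
df$satisfaction <- 1 + 0.5 * df$income -
  0.3 * (df$community == "urban") + rnorm(100)

fit <- lm(satisfaction ~ income + community + education, data = df)

# Vary income; hold community and education fixed.
grid <- data.frame(income = 0:10, community = "rural", education = "High School")
pred <- predict(fit, newdata = grid)
plot(grid$income, pred, type = "l",
     xlab = "income", ylab = "predicted satisfaction")
```

Because the model is linear and additive, the resulting line has the same slope whichever values the other predictors are held at; only its height shifts.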

- We can still fit a linear regression model if the dependent variable is binary. This is called the **linear probability model**.
- Here, everything is interpreted as probabilities: it’s the probability of being satisfied (where being satisfied means answering 5 or more).

```
Call:
lm(formula = satisfaction_yes ~ income, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.73789 -0.24289  0.01137  0.29121  0.61695

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.22910    0.06634  -3.454 0.000818 ***
income       0.09803    0.01123   8.729 6.89e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3364 on 98 degrees of freedom
Multiple R-squared:  0.4374,  Adjusted R-squared:  0.4317
F-statistic: 76.2 on 1 and 98 DF,  p-value: 6.888e-14
```
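Interpreting the output above: each extra unit of income raises the probability of being satisfied by about 0.098, roughly 10 percentage points. One caveat worth checking in code (simulated data, an assumption) is that linear probability model predictions are not bounded between 0 and 1:

```r
set.seed(1)
income <- runif(100, 0, 10)
satisfaction <- 1 + 0.5 * income + rnorm(100)
satisfaction_yes <- as.numeric(satisfaction >= 5)  # "satisfied" = 5 or more

lpm <- lm(satisfaction_yes ~ income)
coef(lpm)["income"]  # change in P(satisfied) per unit of income
range(fitted(lpm))   # may extend below 0 or above 1
```

When predicted probabilities outside [0, 1] are a problem, logistic regression is the usual alternative.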

- Validity. The data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient.
- Representativeness. A regression model is fit to data and is used to make inferences about a larger population, hence the implicit assumption in interpreting regression coefficients is that the sample is representative of the population.
- Additivity and linearity. The model’s deterministic component is a linear function of the separate predictors, i.e., it *actually makes sense* to model it like this: \(y = \beta_0+\beta_1*X_1+\beta_2*X_2 ...\)
- Independence of errors. If you have repeated observations on some individuals, this assumption is violated and you will have to use other (related) models.
- Equal variance of errors (violations are called heteroscedasticity). This assumption fails if, say, life satisfaction is much more variable for people with high income than for people with low income.
- Normality of errors. The distribution of the error term is relevant when predicting individual data points.
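Several of these assumptions can be eyeballed from residual plots. A sketch on simulated data (an assumption):

```r
set.seed(1)
income <- runif(100, 0, 10)
satisfaction <- 1 + 0.5 * income + rnorm(100)
fit <- lm(satisfaction ~ income)

# Linearity and equal variance: residuals vs fitted should show no pattern.
plot(fitted(fit), resid(fit)); abline(h = 0, lty = 2)

# Normality of errors: points should hug the QQ line.
qqnorm(resid(fit)); qqline(resid(fit))
```

`plot(fit)` produces a fuller set of built-in diagnostic plots along the same lines.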

- Linear regression can be used for prediction: as a simple machine learning model

- Linear regression can be used for description: for group summaries or correlation
- Linear regression can be used for data summary: you have several independent variables and you look at how they each affect the dependent variable
- Linear regression can sometimes be used for the causal analysis of the effect of x on y: \[y = \beta_0+\beta_1*x + controls ... \]

- YaRrr! The Pirate’s Guide to R, Chapter 15: Regression
- Modern Dive, Chapters 5-6
- Regression and Other Stories, Chapters 6-12: Regression
- Telling Stories with Data, Chapter 14: Causality from observational data
- Regression and Other Stories, Chapter 20: Observational studies with all confounders assumed to be measured
- Marginal effects, marginal means, etc.
- The Linear Probability Model