PUBPOL 750 Data Analysis for Public Policy I: Week 9
MPP-DS McMaster
2023-11-15
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.
It’s a fundamental tool in data analysis across the social sciences.
It quantifies the average effect of changes in independent variables on a dependent variable.
It can be used for prediction, description or causal explanation.
For example, for an income of $50,000, the predicted value of life satisfaction is 3.37. Of course, having an income of $50,000 does not mean your life satisfaction is exactly 3.37; that is why the model includes an error term \(\epsilon\).
These are made-up data; a real relationship would not be this clean.
Typically, \(Y = \beta_0 + \beta_1X_1 + \epsilon\) denotes the general (population) model, while \(y = b_0 + b_1x_1 + e\) refers to the estimated regression equation based on sample data.
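To make this concrete, here is a minimal sketch of how made-up data like this could be simulated and fit in R. The true intercept, slope, and income scale below are assumptions chosen only to roughly mimic the output shown later, not the actual course data.

# Simulate made-up data and fit a simple linear regression with lm()
set.seed(123)
n <- 100
income <- runif(n, 0, 12)                               # hypothetical income scale
satisfaction <- 0.9 + 0.5 * income + rnorm(n, sd = 1)   # true line plus random error
df <- data.frame(income, satisfaction)

fit <- lm(satisfaction ~ income, data = df)
coef(fit)   # estimates b0 and b1 will be close to, but not exactly, 0.9 and 0.5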
The residual is the distance between the prediction (i.e. the line of best fit) and the true observed value. The residuals are shown with the black lines.
The terms “error” and “residual” are sometimes used interchangeably, but there is a subtle difference: the residual is the observed difference once you fit the line, while the error is its counterpart in the theoretical model and is unobservable.
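Continuing the simulated example above, the residuals can be computed by hand as the observed values minus the fitted values; R’s residuals() returns the same thing.

fitted_values <- predict(fit)                       # points on the line of best fit
resid_manual  <- df$satisfaction - fitted_values    # observed minus predicted
all.equal(unname(resid_manual), unname(residuals(fit)))   # TRUE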
The most common method for estimating the coefficients of a linear regression model is Ordinary Least Squares (OLS). This method minimizes the sum of the squared differences between the observed values and the values predicted by the model: the sum of the squared residuals.
This is the best model of the form \(y = b_0 + b_1x_1 + e\) because it minimizes the sum of the squared residuals.
In contrast, this is NOT the best model of the form \(y = b_0 + b_1x_1 + e\) because it does not minimize the sum of the squared residuals.
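As a sketch of what “least squares” means, the helper below (hypothetical, reusing the simulated df and fit from above) computes the sum of squared residuals for any candidate intercept and slope; the OLS estimates give the smallest value.

# Sum of squared residuals for a candidate line b0 + b1 * income
ssr <- function(b0, b1) sum((df$satisfaction - (b0 + b1 * df$income))^2)

ssr(coef(fit)[1], coef(fit)[2])   # SSR at the OLS estimates
ssr(0.5, 0.8)                     # any other line gives a larger SSR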
Call:
lm(formula = satisfaction ~ income, data = df)
Coefficients:
(Intercept) income
0.9311 0.5247
On average, when income increases by one unit, satisfaction increases by 0.52. When income is 0, the predicted satisfaction is 0.93 (the intercept).
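As a quick check on this interpretation (using the simulated fit from the sketch above, not the course data), predicting at two incomes one unit apart shows that the difference between the predictions equals the income coefficient.

preds <- predict(fit, newdata = data.frame(income = c(4, 5)))
preds
diff(preds)   # equals coef(fit)["income"]: the effect of a one-unit increase in income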
Call:
lm(formula = satisfaction ~ income, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.5868 -0.4679 0.0993 0.5708 3.2122
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.93111 0.19094 4.876 4.17e-06 ***
income 0.52470 0.03232 16.232 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9682 on 98 degrees of freedom
Multiple R-squared: 0.7289, Adjusted R-squared: 0.7261
F-statistic: 263.5 on 1 and 98 DF, p-value: < 2.2e-16
The standard error measures the uncertainty around the estimate: the smaller it is, the more confidence you have in the estimate. We usually call an estimate statistically significant if Pr(>|t|) (the “p-value”) is below 0.05. The p-value is the probability of obtaining an estimate at least as extreme as the one actually observed if there were no true effect. We can get an approximate 95% confidence interval by taking the estimate ± 1.96 × the standard error.
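A sketch of how that confidence interval can be computed from the summary output, again using the simulated fit; confint() is R’s built-in version.

est <- summary(fit)$coefficients["income", "Estimate"]
se  <- summary(fit)$coefficients["income", "Std. Error"]
c(lower = est - 1.96 * se, upper = est + 1.96 * se)   # approximate 95% CI

confint(fit)   # built-in version, using the exact t quantile instead of 1.96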
Key: R handles categorical variables by converting them into 0/1 dummy variables, so everything is still treated as numbers.
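Here is a minimal sketch of what that means in R, adding a hypothetical community variable to the simulated df (the “rural”/“urban” levels mirror the output below but are assumptions): model.matrix() shows the 0/1 dummy column R builds behind the scenes.

set.seed(456)
df$community <- factor(sample(c("rural", "urban"), nrow(df), replace = TRUE))

head(model.matrix(~ community, data = df))   # one 0/1 column: communityurban
lm(satisfaction ~ community, data = df)      # the dummy's coefficient is the urban - rural difference in mean satisfaction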
Call:
lm(formula = satisfaction ~ community, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.397 -1.377 0.023 1.249 4.833
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0769 0.2799 14.563 <2e-16 ***
communityurban -0.8178 0.3676 -2.225 0.0284 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.814 on 98 degrees of freedom
Multiple R-squared: 0.04808, Adjusted R-squared: 0.03837
F-statistic: 4.95 on 1 and 98 DF, p-value: 0.02838
Call:
lm(formula = satisfaction ~ education, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.5709 -1.4955 0.0437 1.3485 5.2117
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0946 0.3086 10.026 <2e-16 ***
educationCollege Trade School 0.6032 0.4630 1.303 0.1957
educationUniversity 0.9163 0.4306 2.128 0.0359 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.826 on 97 degrees of freedom
Multiple R-squared: 0.0456, Adjusted R-squared: 0.02592
F-statistic: 2.317 on 2 and 97 DF, p-value: 0.104
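With a categorical variable that has three levels, R creates two 0/1 dummies and drops one level as the reference category; each coefficient is that group's difference from the reference. A sketch, adding a hypothetical education variable to the simulated df (the level names mirror the output above; treating “High school” as the reference level is an assumption).

set.seed(789)
df$education <- factor(sample(c("High school", "College Trade School", "University"),
                              nrow(df), replace = TRUE),
                       levels = c("High school", "College Trade School", "University"))

head(model.matrix(~ education, data = df))   # two dummy columns; "High school" is the omitted reference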
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + ... +\epsilon\] In practice, linear regression will have multiple predictors. Life satisfaction will be modelled as a function of multiple factors.
Perhaps, we could have: \[life\_satisfaction = \beta_0 + \beta_1*income + \beta_2*physical\_health + \\ \beta_3*mental\_health + \beta_4*quality\_of\_infrastructures + ... +\epsilon\]
In our example, we can have: \[life\_satisfaction = \beta_0 + \beta_1*income + \beta_2*communityurban + \\ \beta_3*educationCollegeTradeSchool + \beta_4*educationUniversity + \epsilon\]
Call:
lm(formula = satisfaction ~ income + community + education, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.54264 -0.44395 0.03309 0.54851 3.03890
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.17894 0.25357 4.649 1.07e-05 ***
income 0.52328 0.03445 15.190 < 2e-16 ***
communityurban -0.28474 0.19988 -1.425 0.158
educationCollege Trade School -0.06170 0.24983 -0.247 0.805
educationUniversity -0.15727 0.23983 -0.656 0.514
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9701 on 95 degrees of freedom
Multiple R-squared: 0.7362, Adjusted R-squared: 0.7251
F-statistic: 66.27 on 4 and 95 DF, p-value: < 2.2e-16
For multiple linear regression, we can plot the marginal effect of a predictor, that is, its average effect on the outcome while holding all other variables constant.
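A minimal sketch of such a plot with base R, reusing the simulated df, community, and education variables from the sketches above: fit the multiple regression, predict satisfaction over a grid of incomes while holding community and education fixed, and plot the predictions.

mfit <- lm(satisfaction ~ income + community + education, data = df)

grid <- data.frame(income    = seq(min(df$income), max(df$income), length.out = 50),
                   community = "rural",        # held constant (assumed level)
                   education = "University")   # held constant (assumed level)
grid$predicted <- predict(mfit, newdata = grid)

plot(grid$income, grid$predicted, type = "l",
     xlab = "income", ylab = "predicted satisfaction")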
Call:
lm(formula = satisfaction_yes ~ income, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.73789 -0.24289 0.01137 0.29121 0.61695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.22910 0.06634 -3.454 0.000818 ***
income 0.09803 0.01123 8.729 6.89e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3364 on 98 degrees of freedom
Multiple R-squared: 0.4374, Adjusted R-squared: 0.4317
F-statistic: 76.2 on 1 and 98 DF, p-value: 6.888e-14