Linear Regression

Suppose we consider a model

\[\hat{Y_i} = b_1 X_i + b_0\]

Our hope is that our estimates are close to the actual values. That is, we hope that:

\[Y_i = b_1 X_i + b_0 + \epsilon\]

  • We think of the error term \(\epsilon\) as encapsulating some random variation in the response variable.

  • In a hypothesis test, this model will be our alternate hypothesis.

  • The null model will be our null hypothesis. In the null model, there is no relationship between the explanatory variable and the response variable (so \(b_1=0\)).

\(H_0: Y_i = b_0 + \epsilon\): the null model is better at predicting \(Y_i\)
\(H_A: Y_i = b_1 X_i + b_0 + \epsilon\): the alternate model is better at predicting \(Y_i\)
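To make the role of the error term concrete, here is a minimal simulation sketch (the true coefficients \(b_1 = 2\), \(b_0 = 5\) and the noise level are made up for illustration):

# Simulate data from Y_i = b_1 X_i + b_0 + epsilon, then recover b_1 and b_0 with lm()
set.seed(1)
X <- runif(100, 0, 10)                   # explanatory variable
epsilon <- rnorm(100, mean = 0, sd = 2)  # random variation in the response
Y <- 2 * X + 5 + epsilon                 # true model: b_1 = 2, b_0 = 5
coef(lm(Y ~ X))                          # estimates should land close to 5 and 2

The estimated intercept and slope should come out close to the true values, with the discrepancy driven entirely by \(\epsilon\).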

load("~/Documents/GitHub/MAT104-Fall24/Week12/parenthood.Rdata")
dansleep <- parenthood$dan.sleep
babysleep <- parenthood$baby.sleep
grump <- parenthood$dan.grump
set.seed(33)
GPA <- rnorm(100,3,.3)

\(H_0:\) Danielle’s grumpiness is due to random variation.
\(H_A:\) Danielle’s sleep predicts her grumpiness better than random variation.

# To find the line of best fit we use the lm() function:
summary(lm(grump ~ dansleep))
## 
## Call:
## lm(formula = grump ~ dansleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.025  -2.213  -0.399   2.681  11.750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 125.9563     3.0161   41.76   <2e-16 ***
## dansleep     -8.9368     0.4285  -20.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.332 on 98 degrees of freedom
## Multiple R-squared:  0.8161, Adjusted R-squared:  0.8142 
## F-statistic: 434.9 on 1 and 98 DF,  p-value: < 2.2e-16
  • Since the p-value is very close to 0, there is statistically significant evidence that Danielle’s sleep predicts her grumpiness better than random variation.
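As a check on where these estimates come from, the least-squares slope and intercept can also be computed by hand from the correlation, standard deviations, and means (a quick sketch; the results should match the Estimate column in the lm() output above):

# Least-squares formulas: b1 = r * sd(y)/sd(x),  b0 = mean(y) - b1 * mean(x)
b1 <- cor(dansleep, grump) * sd(grump) / sd(dansleep)
b0 <- mean(grump) - b1 * mean(dansleep)
c(intercept = b0, slope = b1)  # compare with 125.9563 and -8.9368 above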

Random Variation

  • Let’s test whether Danielle’s grumpiness is better explained by random variation or by 100 randomly generated student GPAs.
summary(lm(parenthood$dan.grump~GPA))
## 
## Call:
## lm(formula = parenthood$dan.grump ~ GPA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.043  -6.896  -1.833   7.431  27.722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   59.117     10.172   5.812 7.71e-08 ***
## GPA            1.522      3.354   0.454    0.651    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.09 on 98 degrees of freedom
## Multiple R-squared:  0.002097,   Adjusted R-squared:  -0.008086 
## F-statistic: 0.2059 on 1 and 98 DF,  p-value: 0.651

Since we get a p-value of \(.651\), there is not statistically significant evidence that these 100 random student GPAs predict Danielle’s grumpiness better than random variation. In fact, we can see this in a scatter plot:

library(ggplot2)  # needed for ggplot()
df <- data.frame(grump, GPA)
ggplot(df, aes(x = GPA, y = grump)) + geom_point() +
  geom_abline(intercept = 59.117, slope = 1.522)

By contrast, the line of best fit from the significant model (using Danielle’s sleep to predict her grumpiness) is

\[\hat{Y_i} = -8.9368 X_i + 125.9563\]
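For comparison, here is a sketch of the analogous scatter plot for Danielle’s sleep, with this fitted line drawn in using the coefficients from the lm() output above:

df2 <- data.frame(dansleep, grump)
ggplot(df2, aes(x = dansleep, y = grump)) + geom_point() +
  geom_abline(intercept = 125.9563, slope = -8.9368)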

Pitfalls

  • Just because a linear model performs better than random variation does not mean the model performs well. Compare the following two linear regressions:
summary(lm(grump ~ dansleep))
## 
## Call:
## lm(formula = grump ~ dansleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.025  -2.213  -0.399   2.681  11.750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 125.9563     3.0161   41.76   <2e-16 ***
## dansleep     -8.9368     0.4285  -20.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.332 on 98 degrees of freedom
## Multiple R-squared:  0.8161, Adjusted R-squared:  0.8142 
## F-statistic: 434.9 on 1 and 98 DF,  p-value: < 2.2e-16
summary(lm(grump ~ babysleep))
## 
## Call:
## lm(formula = grump ~ babysleep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4190  -5.0049  -0.0587   4.9567  23.7275 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  85.7817     3.3528  25.585  < 2e-16 ***
## babysleep    -2.7421     0.4035  -6.796 8.45e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.327 on 98 degrees of freedom
## Multiple R-squared:  0.3203, Adjusted R-squared:  0.3134 
## F-statistic: 46.18 on 1 and 98 DF,  p-value: 8.448e-10
  • They both have low p-values, so they both predict Danielle’s grumpiness better than randomness.
  • Looking at the \(R^2\) value reveals a bit more information: Danielle’s sleep explains about \(81\%\) of the variation in her grumpiness, while the baby’s sleep accounts for only about \(31\%\) of the variation.
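These \(R^2\) values can be pulled directly out of the model summaries (a quick sketch using the same two models as above):

# Multiple R-squared is stored as r.squared in the summary object
summary(lm(grump ~ dansleep))$r.squared    # 0.8161 in the summary above
summary(lm(grump ~ babysleep))$r.squared   # 0.3203 in the summary above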

Multiple Regression Model

summary(lm(grump ~ dansleep + babysleep))
## 
## Call:
## lm(formula = grump ~ dansleep + babysleep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0345  -2.2198  -0.4016   2.6775  11.7496 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 125.96557    3.04095  41.423   <2e-16 ***
## dansleep     -8.95025    0.55346 -16.172   <2e-16 ***
## babysleep     0.01052    0.27106   0.039    0.969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.354 on 97 degrees of freedom
## Multiple R-squared:  0.8161, Adjusted R-squared:  0.8123 
## F-statistic: 215.2 on 2 and 97 DF,  p-value: < 2.2e-16

Looking at the multiple regression model, we see that the p-value for the baby’s sleep coefficient specifically is very high. So, once Danielle’s sleep is taken into account, the baby’s sleep does not help predict Danielle’s grumpiness better than random variation.
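One way to make this comparison explicit is an F-test on the nested models with anova() (a sketch; output omitted here, but the p-value for adding the baby’s sleep should be essentially the 0.969 seen above):

# Does adding baby's sleep improve on the model that already uses Danielle's sleep?
anova(lm(grump ~ dansleep), lm(grump ~ dansleep + babysleep))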

Confidence Intervals for the Coefficients

  • We can actually find confidence intervals for each coefficient in our model.
  • Note that in the simple regression, the coefficient for Danielle’s sleep is estimated as \(-8.9368\) with standard error \(0.4285\). From this we can compute a 95% confidence interval:
# 95% CI: estimate ± (t critical value) × (standard error)
-8.9368 - qt(.975,97)*.4285
## [1] -9.787254
-8.9368 + qt(.975,97)*.4285
## [1] -8.086346
# We are 95% confident that the coefficient for Danielle's sleep is between -9.79 and -8.09
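R can also compute these intervals directly with confint(), which uses the model’s exact residual degrees of freedom (shown here without its output):

# 95% confidence intervals for both coefficients of the simple regression
confint(lm(grump ~ dansleep), level = 0.95)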