Linear Regression

load("~/MAT104-Fall2024/Week12/parenthood.Rdata")
sleep <- parenthood$dan.sleep
grump <- parenthood$dan.grump

Let’s go back to the parenthood data we’ve been using:

# To find the line of best fit we use the lm() function:
summary(lm(grump ~ sleep))
## 
## Call:
## lm(formula = grump ~ sleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.025  -2.213  -0.399   2.681  11.750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 125.9563     3.0161   41.76   <2e-16 ***
## sleep        -8.9368     0.4285  -20.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.332 on 98 degrees of freedom
## Multiple R-squared:  0.8161, Adjusted R-squared:  0.8142 
## F-statistic: 434.9 on 1 and 98 DF,  p-value: < 2.2e-16
library(ggplot2)
ggplot(parenthood, aes(x=dan.sleep,y=dan.grump)) +
  geom_point() + 
  geom_abline(intercept = 125.9563, slope = -8.9368, color="green")
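
Rather than typing the estimates in by hand, we can pull them out of the fitted model with coef(), or let ggplot2 fit and draw the line for us with geom_smooth(). A minimal sketch (the name model is ours):

# store the fitted model and extract its coefficients
model <- lm(grump ~ sleep)
coef(model)

# geom_smooth(method = "lm") fits and draws the same line
ggplot(parenthood, aes(x=dan.sleep,y=dan.grump)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color="green")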

So, our model is

\[\hat{Y}_i = -8.9368\,X_i + 125.9563\]
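
We can use the model to make predictions with predict(). A minimal sketch (the name model and the input of 5 hours are ours, just for illustration):

# predict Dan's grumpiness after 5 hours of sleep
model <- lm(grump ~ sleep)
predict(model, newdata = data.frame(sleep = 5))

# same as plugging into the equation by hand: about 81.27
-8.9368*5 + 125.9563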

Properties of the Linear Regression Model

  • the variable on the \(x\)-axis is often called the explanatory variable or independent variable
  • the variable on the \(y\)-axis is often called the response variable or dependent variable
  • the linear regression model minimizes the sum of the squared residuals (the least squares solution or least squares line); see the check after the code below
  • the least squares line always goes through the point \((\bar{x},\bar{y})\)
# show that (mean(x),mean(y)) is a point on the model
mean(sleep)
## [1] 6.9652
mean(grump)
## [1] 63.71
# (6.9652,63.71)

# plug mean(sleep) into the model; the small discrepancy from 63.71
# comes from rounding the slope
-8.936*(mean(sleep)) + 125.9563
## [1] 63.71527
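
As noted above, the least squares line is the one that minimizes the sum of the squared residuals. A quick check of the residuals (a sketch; the name model is ours):

# residuals of the least squares fit always sum to (essentially) zero
model <- lm(grump ~ sleep)
sum(resid(model))

# the quantity the fit minimizes: the sum of the squared residuals
sum(resid(model)^2)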

Connections to the correlation \(r_{XY}=R\)

  • we can estimate the slope with the formula

\[b_1 = \frac{s_y}{s_x}R\]

# compute the slope of the model with this shortcut
sd(grump)/sd(sleep)*cor(sleep,grump)
## [1] -8.936756
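
Since the line passes through \((\bar{x},\bar{y})\), the intercept then follows from the slope (the name b1 below is ours):

\[b_0 = \bar{y} - b_1 \bar{x}\]

# compute the intercept of the model with this shortcut
b1 <- sd(grump)/sd(sleep)*cor(sleep,grump)
mean(grump) - b1*mean(sleep)   # matches the intercept reported by lm()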

Using \(R^2\)

  • the correlation coefficient \(R\) tells us how well our data fits a line
  • though, it’s a bit more common to report the \(R^2\) value
  • \(R^2\) is more common because it is easier to interpret (e.g., \(R^2 = .75\) means that \(75\%\) of the variability in the response variable can be explained by the explanatory variable)
  • large \(R^2\) values mean that the response variable is very well explained by the explanatory variable; see the check below
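
For the parenthood data, a quick check (a sketch; the name model is ours):

# squaring the correlation gives R^2
cor(sleep,grump)^2

# this matches the Multiple R-squared reported by summary(lm())
model <- lm(grump ~ sleep)
summary(model)$r.squared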
Exercises

  1. Using the penguins data set, perform a simple linear regression to model body mass using the explanatory variable bill length. What values do you get for \(\hat{b}_0\) and \(\hat{b}_1\)? Graph a scatter plot for the data and include your regression model in the plot.
library(palmerpenguins)   # the penguins data set comes from this package
penguins <- na.omit(penguins)

\(\hat{b}_0 =\)

\(\hat{b}_1 =\)

  2. Use cor() to find \(R^2\) and compare this to what the linear model function lm() says it should be.

  3. Use the formula for estimating \(b_1\) with the penguin flipper length data and verify that it gives the same value as lm().

  4. Use simple linear regression to model the relationship between flipper length and body mass for the penguin data. What values do you get for \(\hat{b}_0\) and \(\hat{b}_1\)? Plot a scatter plot for the data with the line you found.

  5. Use multiple linear regression to model the body mass using the explanatory variables flipper length and bill length for the penguin data. Assuming \(X_1\) is flipper length and \(X_2\) is bill length, what values do you get for \(\hat{b}_0\), \(\hat{b}_1\), and \(\hat{b}_2\)? (A sketch of the lm() syntax for two predictors appears below.)
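
We haven’t used lm() with two predictors yet, so here is a minimal sketch of the syntax using the parenthood data (this assumes parenthood also has a baby.sleep column, as in the original data set; the penguin exercise works the same way, with the two explanatory variables separated by +):

# multiple regression: list the predictors on the right of ~, separated by +
# (baby.sleep is assumed to be a column of the parenthood data frame)
summary(lm(dan.grump ~ dan.sleep + baby.sleep, data = parenthood))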