For the following section we will use data from: “Learning Statistics with R” by Danielle Navarro:
load("~/MAT104-Spring25/Week12/parenthood.Rdata")
load("~/MAT104-Spring25/Week12/pearson_correlations.RData")
load("~/MAT104-Spring25/Week12/effort.Rdata")
Consider the following plots:
ggplot(parenthood, aes(x=dan.sleep, y=dan.grump)) + geom_point()
ggplot(parenthood, aes(x=baby.sleep, y=dan.grump)) + geom_point()
The Pearson correlation coefficient \(r_{XY}\) is a standardized covariance measure:
\[r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{X_i-\bar{X}}{s_X}\frac{Y_i-\bar{Y}}{s_Y} \]
where \(s_X\) and \(s_Y\) are the sample standard deviations of \(X\) and \(Y\).
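As a sketch (assuming `parenthood` is loaded as above), the formula can be computed by hand: standardize each variable, multiply the pairs, and average with an \(N-1\) denominator. Note that R's `sd()` already uses the \(N-1\) sample denominator.

```r
x <- parenthood$dan.sleep
y <- parenthood$dan.grump
N <- length(x)
# Standardize each variable, multiply pairwise, and divide by N - 1
sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (N - 1)
# identical to cor(x, y)
```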
The R code for the Pearson correlation coefficient is:
cor()
#correlation for sleep and grump
cor(parenthood$dan.sleep,parenthood$dan.grump)
## [1] -0.903384
# Pearson correlation coefficient is -.9
# -.9 is close to -1, which means there is a strong negative association between the variables
cor(parenthood$baby.sleep, parenthood$dan.grump)
## [1] -0.5659637
# -0.57 Pearson correlation coefficient: still a negative association, but not as strong as the previous relationship.
Since the first pair of variables has a correlation coefficient closer to \(-1\), Danielle’s sleep is more strongly associated with her grumpiness than the baby’s sleep is.
The cor() function in R is actually more powerful than how we previously used it. We can find all the Pearson coefficients at once by plugging in the whole data frame:
cor(parenthood)
## dan.sleep baby.sleep dan.grump day
## dan.sleep 1.00000000 0.62794934 -0.90338404 -0.09840768
## baby.sleep 0.62794934 1.00000000 -0.56596373 -0.01043394
## dan.grump -0.90338404 -0.56596373 1.00000000 0.07647926
## day -0.09840768 -0.01043394 0.07647926 1.00000000
Below is data with various Pearson correlation coefficients:
ggplot(outcomes, aes(x=V1,y=V2))+geom_point() + facet_wrap(~pearson)
Correlation | Strength | Direction |
---|---|---|
\(-1\) to \(-0.9\) | Very Strong | Negative |
\(-0.9\) to \(-0.7\) | Strong | Negative |
\(-0.7\) to \(-0.4\) | Moderate | Negative |
\(-0.4\) to \(-0.2\) | Weak | Negative |
\(-0.2\) to \(0\) | Negligible | Negative |
\(0\) to \(0.2\) | Negligible | Positive |
\(0.2\) to \(0.4\) | Weak | Positive |
\(0.4\) to \(0.7\) | Moderate | Positive |
\(0.7\) to \(0.9\) | Strong | Positive |
\(0.9\) to \(1\) | Very Strong | Positive |
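The table above can be turned into a small helper function (hypothetical, written here for illustration) that labels a coefficient with its strength and direction:

```r
describe_cor <- function(r) {
  a <- abs(r)
  strength <- if (a >= 0.9) "Very Strong" else
              if (a >= 0.7) "Strong" else
              if (a >= 0.4) "Moderate" else
              if (a >= 0.2) "Weak" else "Negligible"
  direction <- if (r >= 0) "Positive" else "Negative"
  paste(strength, direction)
}
describe_cor(-0.903384)
## [1] "Very Strong Negative"
```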
Consider the following data set:
cor(anscombe$x1,anscombe$y1)
ggplot(anscombe, aes(x=x1,y=y1))+geom_point()
Now let’s check the other three pairs:
cor(anscombe$x2,anscombe$y2)
## [1] 0.8162365
cor(anscombe$x3,anscombe$y3)
## [1] 0.8162867
cor(anscombe$x4,anscombe$y4)
## [1] 0.8165214
ggplot(anscombe, aes(x=x2,y=y2))+geom_point()
# correlation coefficients measure LINEAR relationships
ggplot(anscombe, aes(x=x3,y=y3))+geom_point()
ggplot(anscombe, aes(x=x4,y=y4))+geom_point()
ggplot(parenthood, aes(x=dan.sleep,y=dan.grump))+geom_point() +geom_abline(slope=-12,intercept = 144, color="blue")
# How grumpy would Danielle be according to our model if she got 8 hours of sleep?
-12*8 + 144
## [1] 48
# We predict she would have 48 grumps
# You can answer other questions like: How much sleep does Danielle need to have 0 grumps? Set y =0 and solve for x (find the x-intercept)
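Following the hint above, setting \(y = 0\) gives \(0 = -12x + 144\), which we can solve directly:

```r
# Solve 0 = -12*x + 144 for x (the x-intercept)
144 / 12
## [1] 12
# According to our model, 12 hours of sleep gives Danielle 0 grumps
```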
The data looks like there is a linear relationship between dan.sleep and dan.grump, so let’s add a line to our plot that represents this relationship.
GOAL: What do we mean by BEST fitting line and how do we find it?
To make an equation for a line we use the slope-intercept formula:
\[y=mx+b\] where \(m\) is the slope of the line and \(b\) is the \(y\)-intercept.
In statistics, we typically use different letters but the equation is the same: \[\hat{Y_i} = b_1 X_i + b_0\] where \(b_1\) is the slope and \(b_0\) is the \(y\)-intercept. The values \(b_1\) and \(b_0\) are called coefficients.
\(b_1\) is \(-12\): for every additional hour of sleep, Danielle has 12 fewer grumps. \(b_0\) is 144: Danielle has 144 grumps when she stays up all night.
The graph below displays the error between our prediction and the true data in red:
# Use our slope and intercept guesses to predict the y-values
predictions <- -12*parenthood$dan.sleep+144
# Plot segments joining the actual points and the predicted points
ggplot(parenthood, aes(x=dan.sleep,y=dan.grump)) +
geom_point() +
geom_abline(intercept = 144, slope = -12, color="blue") +
geom_segment(aes(xend = dan.sleep, yend = predictions, color = "residual"))
The lengths of these little red segments are called residuals. We can find how big each residual is by taking the difference between the actual value and the predicted value:
\[\epsilon_i = Y_i - \hat{Y}_i\]
Putting these ideas together, we can see that our real data is equal to our prediction plus the residuals:
\[Y_i = \hat{Y}_i + \epsilon_i = b_1X_i + b_0 + \epsilon_i\]
The BEST fitting line is the one that makes the residuals as small as possible overall; specifically, it minimizes the sum of the squared residuals:
\[ \sum \epsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2\]
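As a sketch (using our guessed slope of \(-12\) and intercept of \(144\) from above), the sum of squared residuals can be computed directly:

```r
# Predicted grumpiness for each day under our guessed line
predictions <- -12 * parenthood$dan.sleep + 144
# Residuals: actual minus predicted
residuals <- parenthood$dan.grump - predictions
# Sum of squared residuals; the best fitting line makes this as small as possible
sum(residuals^2)
```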
The model we use for two predictors is: \[\hat{Y}_i = b_2 X_{i,2} + b_1 X_{i,1} + b_0\]
where \(X_{i,2}\) is the amount of sleep the baby got on day \(i\) and \(X_{i,1}\) is the amount of sleep Danielle got on day \(i\).
To find the best fitting regression line we use lm():
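For example (a sketch, assuming `parenthood` is loaded as above), both the one-predictor and two-predictor models can be fit with lm(), which chooses the coefficients that minimize the sum of squared residuals:

```r
# Simple regression: grumpiness as a function of Danielle's sleep
simple.model <- lm(dan.grump ~ dan.sleep, data = parenthood)
coef(simple.model)    # b0 (intercept) and b1 (slope)

# Multiple regression: add baby.sleep as a second predictor
multiple.model <- lm(dan.grump ~ dan.sleep + baby.sleep, data = parenthood)
coef(multiple.model)  # b0, b1, and b2
```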
# 3-D scatter plot of the two-predictor model (requires the car and rgl packages)
scatter3d(dan.grump ~ dan.sleep + baby.sleep, parenthood)
rglwidget()
\(\hat{b}_0 = \)
\(\hat{b}_1 = \)
Find the sum of the squared residuals for the model in exercise 1.
Use cor() to find the strength of the correlation between bill length and body mass.
Use simple linear regression to model the relationship between flipper length and body mass for the penguin data. What values do you get for \(\hat{b}_0\) and \(\hat{b}_1\)? Plot a scatter plot for the data with the line you found.
Use multiple linear regression to model the body mass using the predictors flipper length and bill length for the penguin data. Assuming \(X_1\) is flipper length and \(X_2\) is bill length, what values do you get for \(\hat{b}_0\), \(\hat{b}_1\), and \(\hat{b}_2\)?