Correlation

For the following section we will use data from: “Learning Statistics with R” by Danielle Navarro:

load("~/MAT104-Spring25/Week12/parenthood.Rdata")
load("~/MAT104-Spring25/Week12/pearson_correlations.RData")
load("~/MAT104-Spring25/Week12/effort.Rdata")
  • This data captures how grumpy Danielle is, how much she slept in a day, and how much her baby slept in a day.

Consider the following plots:

ggplot(parenthood, aes(x=dan.sleep, y=dan.grump)) + geom_point()

ggplot(parenthood, aes(x=baby.sleep, y=dan.grump)) + geom_point()

Correlation coefficient

The Pearson correlation coefficient \(r_{XY}\) is a standardized covariance measure:

\[r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{X_i-\bar{X}}{s_X}\frac{Y_i-\bar{Y}}{s_Y} \]

where \(s_x\) and \(s_y\) are the sample standard deviations.

The R code for the Pearson correlation coefficient is: cor()

#correlation for sleep and grump
cor(parenthood$dan.sleep,parenthood$dan.grump)
## [1] -0.903384
# Pearson correlation coefficient is -.9
# -.9 is close to -1, which means there is a strong negative association between the variables

cor(parenthood$baby.sleep, parenthood$dan.grump)
## [1] -0.5659637
#-.56 pearson correlation coefficient, still some negative association, but not as strong as the previous relationship.

Properties of \(r_{XY}\)

  • always between \(-1\) and \(1\)
  • \(-1\) is a perfect negative relationship
  • \(1\) is a perfect positive relationship
  • \(0\) is no relationship at all
  • Numbers closer to \(0\) represent a weaker relationship
  • Numbers closer to \(1\) or \(-1\) represent a stronger relationship

Since \(X\) and \(Y\) have a correlation coefficient closer to \(-1\) that means Danielle’s sleep and her grumpiness level are more strongly associated than the Danielle’s sleep and the baby’s sleep.

The cor() function in R is actually more powerful than how we previously used it. We can find all Pearson coefficients at once by plugging in the data frame:

cor(parenthood)
##              dan.sleep  baby.sleep   dan.grump         day
## dan.sleep   1.00000000  0.62794934 -0.90338404 -0.09840768
## baby.sleep  0.62794934  1.00000000 -0.56596373 -0.01043394
## dan.grump  -0.90338404 -0.56596373  1.00000000  0.07647926
## day        -0.09840768 -0.01043394  0.07647926  1.00000000

Interpreting the Pearson correlation coefficient

Below is data with various Pearson correlation coefficients:

ggplot(outcomes, aes(x=V1,y=V2))+geom_point() + facet_wrap(~pearson)

  • Exactly what constitutes as a strong correlation depends on the context.
  • You can, however, use these general guidlines:
Correlation Strength Direction
\(-1\) to \(-0.9\) Very Strong Negative
\(-0.9\) to \(-0.7\) Strong Negative
\(-0.7\) to \(-0.4\) Moderate Negative
\(-0.4\) to \(-0.2\) Weak Negative
\(-0.2\) to \(0\) Negligible Negative
\(0\) to \(0.2\) Negligible Positive
\(0.2\) to \(0.4\) Weak Positive
\(0.4\) to \(0.7\) Moderate Positive
\(0.7\) to \(0.9\) Strong Positive
\(0.9\) to \(1\) Very Strong Positive

Caution

  • Use caution when interpreting a pearson correlation coefficient.
  • The correlation may not tell you what you think it does about the data.

Consider the following data set:

cor(anscombe$x1,anscombe$y1)
  • Based on the correlation coefficient we might imagine a scatter plot with a slight positive linear association.
ggplot(anscombe, aes(x=x1,y=y1))+geom_point()

Now let’s check

cor(anscombe$x2,anscombe$y2)
## [1] 0.8162365
cor(anscombe$x3,anscombe$y3)
## [1] 0.8162867
cor(anscombe$x4,anscombe$y4)
## [1] 0.8165214
  • The same correlation coefficient! We should get a similar graph right?
ggplot(anscombe, aes(x=x2,y=y2))+geom_point()
# correlation coefficients measure LINEAR relationships
ggplot(anscombe, aes(x=x3,y=y3))+geom_point()
ggplot(anscombe, aes(x=x4,y=y4))+geom_point()

  • You should always make a scatter plot before using the pearson correlation coefficient to conclude any thing about the shape of your data.
  • The Pearson correlation coefficient measures how close the data is to fitting on a specific line.
  • It is looking for a linear relationship.
  • The correlation coefficient is not the slope

Linear Regression

ggplot(parenthood, aes(x=dan.sleep,y=dan.grump))+geom_point() +geom_abline(slope=-12,intercept = 144, color="blue")

# How grumpy would Danielle be according to our model if she got 8 hours of sleep?
-12*8 + 144
## [1] 48
# We predict she would have 48 grumps

# You can answer other questions like: How much sleep does Danielle need to have 0 grumps? Set y =0 and solve for x (find the x-intercept)

The data looks like there is a linear relationship between dan.sleep and dan.grump, so let’s add a line to our plot that represents this relationship.

GOAL: What do we mean by BEST fitting line and how do we find it?

Meaning of the Parameters

To make an equation for a line we use the slope-intercept formula:

\[y=mx+b\] where \(m\) is the slope of the line and \(b\) is the \(y\)-intercept.

In statistics, we typically use different letters but the equation is the same: \[\hat{Y_i} = b_1 X_i + b_0\] where \(b_1\) is the slope and \(b_0\) is the \(y\)-intercept. The values \(b_1\) and \(b_0\) are called coefficients.

  1. Interpret \(b_1\), \(b_0\), \(X_i\), and \(\hat{Y}_i\) in the context of the parenthood data. Make a best guess for \(b_0\) and \(b_1\).

\(b_1\) is -12: for every additional hour of sleep Daniele has 12 less grumps. \(b_0\) is 144: Danielle has 144 grumps when she stays up all night

  • the X variable is often called the explanatory variable
  • the Y variable is often called the response variable

Residuals

  • We can do better than estimating by eye though.
  • We should choose the line that minimize the error in our predictions.

The graph below displays the error between our prediction and the true data in red:

# Use our slope and intercept guesses to predict the y-values

predictions <- -12*parenthood$dan.sleep+144

# Plot segments joining the actual points and the predicted points
ggplot(parenthood, aes(x=dan.sleep,y=dan.grump)) +
  geom_point() + 
  geom_abline(intercept = 144, slope = -12, color="blue") + 
  geom_segment(aes(xend = dan.sleep, yend = predictions, color = "residual"))

To lengths of these little red segments are called residuals. We can find how big the residuals are by finding the difference between the predicted value and the actual value:

\[\epsilon_i = Y_i - \hat{Y}_i\]

  1. Calculate the residual corresponding to the first day, \(\epsilon_1\).
  • the residual is negative when our prediction is an overestimate
  • the residual is positive when our prediction is an underestimate.

Putting these ideas together, we can see that our real data is equal to our prediction plus the residuals:

\[Y_i = \hat{Y}_i + \epsilon_i = b_1X_i + b_0 + \epsilon_i\]


Minimizing error

  • One approach for this is to minimize the so called sum of the squared residuals:

\[ \sum \epsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2\]


Using R to find the fits

Multiple Regression

The model we use for two predictors is: \[\hat{Y}_i = b_2 X_{i,2} + b_1 X_{i,1} + b_0\]

where \(X_{i,2}\) is the amount of sleep the baby got on day \(i\) and \(X_{i,1}\) is the amount of sleep Danielle got on day \(i\).

  • If we had three predictors the equation would be: \[\hat{Y}_i = b_3 X_{i,3} + b_2 X_{i,2} + b_1 X_{i,1} + b_0\]

To find the best fitting regression line we use lm():

scatter3d(dan.grump ~ dan.sleep + baby.sleep, parenthood)
rglwidget()
  1. Find the residual corresponding to the first day using this new model with two predictors.

Class Activity

  1. Using the penguins data set, perform a simple linear regression to model body mass using the predictor variable bill length. What values do you get for \(\hat{b}_0\) and \(\hat{b}_1\)? Graph a scatter plot for the data and include your regression model in the plot.

$_0 = $

$_1 = $

  1. Find the sum of the square residuals for the model in exercise 1.

  2. Use cor() to find the strength of the correlation between bill length and body mass.

  3. Use simple linear regression to model the relationship between flipper length and body mass for the penguin data. What values do you get for \(\hat{b}_0\) and \(\hat{b}_1\)? Plot a scatter plot for the data with the line you found.

  4. Use multiple linear regression to model the body mass using the predictors flipper length and bill length for the penguin data. Assuming \(X_1\) is flipper length and \(X_2\) is bill length, what values do you get for \(\hat{b}_0\), \(\hat{b}_1\), and \(\hat{b}_2\)?