Correlation

This section uses data from “Learning Statistics with R” by Danielle Navarro:

library(ggplot2)  # needed for the scatter plots below
load("~/MAT104-Fall2024/Week12/parenthood.Rdata")
load("~/MAT104-Fall2024/Week12/pearson_correlations.RData")
load("~/MAT104-Fall2024/Week12/effort.Rdata")
  • The parenthood data record, for each day, how grumpy Danielle was (dan.grump), how much she slept (dan.sleep), and how much her baby slept (baby.sleep).

Consider the following plots:

ggplot(parenthood, aes(x=dan.sleep, y=dan.grump)) + geom_point()

ggplot(parenthood, aes(x=baby.sleep, y=dan.grump)) + geom_point()

  • From the graphs we see that Danielle’s sleep is a better predictor of her grumpiness than her baby’s sleep: the points in the first plot cluster more tightly around a line.

Correlation coefficient

  • We would like a single number that describes how tightly the data cluster around a line.
  • That number should describe the strength of the association between the two variables (its predictive power).

The Pearson correlation coefficient \(r_{XY}\) is a standardized covariance measure:

\[r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{X_i-\bar{X}}{s_X}\frac{Y_i-\bar{Y}}{s_Y} \]

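where \(s_X\) and \(s_Y\) are the sample standard deviations of \(X\) and \(Y\), and \(\bar{X}\) and \(\bar{Y}\) are the sample means.

  • In words: \(r_{XY}\) is the average (with \(N-1\) in the denominator) of the products of the standardized deviations of \(X\) and \(Y\).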

The R code for the Pearson correlation coefficient is:

cor(parenthood$dan.sleep, parenthood$dan.grump)
## [1] -0.903384
# rounds to -0.90
cor(parenthood$baby.sleep, parenthood$dan.grump)
## [1] -0.5659637
# rounds to -0.57

cor(parenthood)
##              dan.sleep  baby.sleep   dan.grump         day
## dan.sleep   1.00000000  0.62794934 -0.90338404 -0.09840768
## baby.sleep  0.62794934  1.00000000 -0.56596373 -0.01043394
## dan.grump  -0.90338404 -0.56596373  1.00000000  0.07647926
## day        -0.09840768 -0.01043394  0.07647926  1.00000000
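
To connect cor() back to the formula above, the following sketch computes the coefficient by hand. It uses R's built-in anscombe data rather than parenthood, so that it runs without the course files; the variable names (z_x, z_y, r_manual) are only illustrative.

```r
# Built-in example data, so no course files are needed
x <- anscombe$x1
y <- anscombe$y1
n <- length(x)

# Standardized deviations (z-scores); sd() uses the N - 1 denominator
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum of the products of the z-scores, divided by N - 1
r_manual <- sum(z_x * z_y) / (n - 1)

r_manual
cor(x, y)                    # built-in Pearson correlation
cov(x, y) / (sd(x) * sd(y))  # covariance rescaled by both standard deviations
```

All three expressions should agree up to floating-point rounding, which is why the coefficient can be described as a standardized covariance.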

Properties of \(r_{XY}\)

  • The Pearson correlation coefficient is always between -1 and 1, inclusive.
  • Values closer to -1 or 1 indicate a stronger negative or positive correlation, respectively.
  • A correlation of -1 means every point falls exactly on a line with negative slope: a perfect negative linear relationship.
  • A correlation of 1 means every point falls exactly on a line with positive slope: a perfect positive linear relationship.
  • Values closer to 0 indicate weaker correlations: knowing one variable gives little linear predictive power for the other.
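
These properties can be checked on small made-up examples. The sketch below uses simulated values, not the course data, and the seed is arbitrary.

```r
x <- 1:10

r_pos <- cor(x,  2 * x + 3)  # points exactly on a line with positive slope
r_neg <- cor(x, -2 * x + 3)  # points exactly on a line with negative slope

set.seed(1)                  # arbitrary seed, only for reproducibility
noise <- rnorm(1000)         # random values unrelated to their position
r_noise <- cor(seq_len(1000), noise)

c(r_pos, r_neg, r_noise)
```

The first two values should be 1 and -1 (up to floating-point rounding), while the third should be close to 0.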

Interpreting the Pearson correlation coefficient

Below is data with various Pearson correlation coefficients:

ggplot(outcomes, aes(x=V1,y=V2))+geom_point() + facet_wrap(~pearson)

Correlation              Strength      Direction
\(-1\) to \(-0.9\)       Very Strong   Negative
\(-0.9\) to \(-0.7\)     Strong        Negative
\(-0.7\) to \(-0.4\)     Moderate      Negative
\(-0.4\) to \(-0.2\)     Weak          Negative
\(-0.2\) to \(0\)        Negligible    Negative
\(0\) to \(0.2\)         Negligible    Positive
\(0.2\) to \(0.4\)       Weak          Positive
\(0.4\) to \(0.7\)       Moderate      Positive
\(0.7\) to \(0.9\)       Strong        Positive
\(0.9\) to \(1\)         Very Strong   Positive
  • Whether a correlation counts as strong depends on the context and the data. The table above is only a rule of thumb.
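
As a sketch, the rule-of-thumb table can be turned into a small lookup function. The helper describe_correlation() is hypothetical (it is not part of R or the course files). Because the table's ranges share their endpoints, the code assumes that a boundary value belongs to the stronger category and that a correlation of exactly 0 is labeled "Positive".

```r
# Hypothetical helper: label a correlation using the rule-of-thumb table.
# Assumptions: boundary values go to the stronger category
# (so |r| = 0.9 is "Very Strong"), and r = 0 is labeled "Positive".
describe_correlation <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  strength <- cut(
    abs(r),
    breaks = c(0, 0.2, 0.4, 0.7, 0.9, 1),
    labels = c("Negligible", "Weak", "Moderate", "Strong", "Very Strong"),
    right = FALSE,         # intervals are [lower, upper)
    include.lowest = TRUE  # the last interval also includes 1
  )
  direction <- ifelse(r < 0, "Negative", "Positive")
  paste(as.character(strength), direction)
}

# Correlations of dan.grump with dan.sleep, baby.sleep and day (from cor(parenthood))
describe_correlation(c(-0.903384, -0.5659637, 0.07647926))
# "Very Strong Negative" "Moderate Negative" "Negligible Positive"
```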

Caution

  • Use caution when interpreting a Pearson correlation coefficient.
  • The coefficient alone may not tell you what you think it does about the shape of the data.

Consider the anscombe data set, which is built into R:

cor(anscombe$x1,anscombe$y1)
## [1] 0.8164205
  • Based on the correlation coefficient we might imagine a scatter plot with a fairly strong positive linear association.
  • We would be correct!
ggplot(anscombe, aes(x=x1,y=y1))+geom_point()

Now let’s check the second pair of variables:

cor(anscombe$x2,anscombe$y2)
## [1] 0.8162365
  • Nearly the same correlation coefficient! We should get a similar graph, right?
ggplot(anscombe, aes(x=x2,y=y2))+geom_point()

  • Nope: these points follow a curve, not a line. What about the others?
cor(anscombe$x3,anscombe$y3)
## [1] 0.8162867
cor(anscombe$x4,anscombe$y4)
## [1] 0.8165214
ggplot(anscombe, aes(x=x3,y=y3))+geom_point()

ggplot(anscombe, aes(x=x4,y=y4))+geom_point()
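
The four correlations above can also be computed in one step. This is a minimal sketch that relies only on the column naming of the built-in anscombe data (x1 to x4, y1 to y4); the name r_all is illustrative.

```r
# anscombe ships with R: columns x1..x4 and y1..y4
r_all <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r_all) <- paste0("set", 1:4)

round(r_all, 3)
```

All four values round to about 0.82, even though the four scatter plots look nothing alike.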


Shortcomings and Alternatives

  • Always make a scatter plot before using the Pearson correlation coefficient to conclude anything about the shape of your data.
  • The Pearson correlation coefficient measures how closely the data follow a straight line.
  • It only detects linear relationships, so a strong nonlinear relationship can still produce a small coefficient.
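
A small made-up example shows how a linear measure can miss a nonlinear relationship. In this sketch y is completely determined by x, yet the Pearson correlation is essentially zero because the relationship is symmetric rather than linear.

```r
x <- -5:5
y <- x^2   # y depends perfectly on x, but not linearly

r_curve <- cor(x, y)
r_curve    # essentially 0: no linear association

# Base R scatter plot; the points trace out a parabola
plot(x, y)
```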

Class Activity

  1. Make a scatter plot of penguin body mass vs. flipper length, faceted by species. Looking only at the plots, can you tell for which species body mass has the strongest relationship to flipper length? What about the weakest?

  2. Use the filter() and cor() functions to find the Pearson correlation coefficient between body mass and flipper length for each species. Which species has the strongest relationship?