CA 21: Correlation

Correlation

For this section we will use data from “Learning Statistics with R” by Danielle Navarro:

library(ggplot2)  # provides ggplot(), used for the plots below

load("~/MAT104-Fall2024/Week12/parenthood.Rdata")
load("~/MAT104-Fall2024/Week12/pearson_correlations.RData")
load("~/MAT104-Fall2024/Week12/effort.Rdata")

This data captures how grumpy Danielle is, how much she slept in a day, and how much her baby slept in a day.

Consider the following plots:

ggplot(parenthood, aes(x=dan.sleep, y=dan.grump)) + geom_point()

ggplot(parenthood, aes(x=baby.sleep, y=dan.grump)) + geom_point()

From the graphs we see that Danielle's sleep is a better predictor of her grumpiness than her baby's sleep: the points in the first plot cluster much more tightly around a line.

Correlation coefficient

We would like a single number that describes how tightly the data cluster around a line.
That number should describe the strength of the association (predictive power) between the two variables.

The Pearson correlation coefficient \(r_{XY}\) is a standardized covariance measure:

\[r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{X_i-\bar{X}}{s_X}\frac{Y_i-\bar{Y}}{s_Y} \]

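where \(\bar{X}\) and \(\bar{Y}\) are the sample means and \(s_X\) and \(s_Y\) are the sample standard deviations.

In words, \(r_{XY}\) is the average product of the standardized deviations (with \(N-1\) rather than \(N\) in the denominator).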
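
The formula can be checked by hand against cor(). The following is a minimal sketch that assumes the built-in mtcars data as a stand-in (it is not the parenthood data), and the names zx, zy, and r_manual are chosen only for illustration:

```r
# Built-in data set, used here only for illustration.
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Standardize each variable using the sample mean and standard deviation.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Sum the products of the standardized deviations and divide by N - 1.
r_manual <- sum(zx * zy) / (n - 1)

r_manual
cor(x, y)

# TRUE if the hand calculation agrees with cor() up to rounding error.
isTRUE(all.equal(r_manual, cor(x, y)))
```

Because sd() already divides by \(N-1\), the hand calculation and cor() should agree up to floating-point rounding.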

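In R, the cor() function computes the Pearson correlation coefficient:

# Danielle's sleep vs. her grumpiness: r is about -0.90
cor(parenthood$dan.sleep, parenthood$dan.grump)

## [1] -0.903384

# Baby's sleep vs. Danielle's grumpiness: r is about -0.57
cor(parenthood$baby.sleep, parenthood$dan.grump)

## [1] -0.5659637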

Calling cor() on the whole data frame returns the correlation for every pair of variables:

cor(parenthood)

##              dan.sleep  baby.sleep   dan.grump         day
## dan.sleep   1.00000000  0.62794934 -0.90338404 -0.09840768
## baby.sleep  0.62794934  1.00000000 -0.56596373 -0.01043394
## dan.grump  -0.90338404 -0.56596373  1.00000000  0.07647926
## day        -0.09840768 -0.01043394  0.07647926  1.00000000

Properties of \(r_{XY}\)

The Pearson correlation coefficient is always between -1 and 1.
Values closer to -1 indicate a stronger negative correlation; values closer to 1 indicate a stronger positive correlation.
A correlation of -1 means the points fall exactly on a line with negative slope: a perfect linear relationship with no scatter.
A correlation of 1 means the points fall exactly on a line with positive slope: a perfect linear relationship with no scatter.
Values closer to 0 indicate weaker correlations: one variable has little linear predictive power for the other.
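
These properties can be seen on a small made-up example. The sketch below uses invented vectors (x, y_pos, y_neg are illustrative names, not part of the course data) that lie exactly on a line:

```r
# Five made-up x values.
x <- c(1, 2, 3, 4, 5)

# Points exactly on a line with positive slope.
y_pos <- 2 * x + 1
# Points exactly on a line with negative slope.
y_neg <- -3 * x + 10

r_pos <- cor(x, y_pos)
r_neg <- cor(x, y_neg)

# Both comparisons are TRUE: perfect lines give r = 1 and r = -1.
isTRUE(all.equal(r_pos, 1))
isTRUE(all.equal(r_neg, -1))

# The order of the two variables does not matter.
isTRUE(all.equal(cor(x, y_pos), cor(y_pos, x)))
```

Note that the size of the slope (2 versus -3) does not change the correlation; only the sign of the slope and the amount of scatter matter.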

Interpreting the Pearson correlation coefficient

Below is data with various Pearson correlation coefficients:

ggplot(outcomes, aes(x=V1,y=V2))+geom_point() + facet_wrap(~pearson)

Correlation	Strength	Direction
\(-1\) to \(-0.9\)	Very Strong	Negative
\(-0.9\) to \(-0.7\)	Strong	Negative
\(-0.7\) to \(-0.4\)	Moderate	Negative
\(-0.4\) to \(-0.2\)	Weak	Negative
\(-0.2\) to \(0\)	Negligible	Negative
\(0\) to \(0.2\)	Negligible	Positive
\(0.2\) to \(0.4\)	Weak	Positive
\(0.4\) to \(0.7\)	Moderate	Positive
\(0.7\) to \(0.9\)	Strong	Positive
\(0.9\) to \(1\)	Very Strong	Positive

Whether a correlation counts as strong depends on the context and the data, so treat the table above as a rule of thumb rather than a strict rule.
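
The rule-of-thumb table can be turned into a small helper. This is only a sketch: describe_r is a hypothetical function written for illustration, and it assumes that a value exactly on a boundary (for example 0.9) gets the weaker label, since the table itself leaves the boundaries ambiguous:

```r
# Hypothetical helper: label a correlation using the rule-of-thumb table.
describe_r <- function(r) {
  # Bin the absolute value; a boundary value falls into the lower (weaker) bin.
  strength <- cut(
    abs(r),
    breaks = c(0, 0.2, 0.4, 0.7, 0.9, 1),
    labels = c("Negligible", "Weak", "Moderate", "Strong", "Very Strong"),
    include.lowest = TRUE
  )
  # Following the table, 0 is grouped with the positive values.
  direction <- ifelse(r < 0, "Negative", "Positive")
  paste(strength, direction)
}

# The two correlations computed earlier for the parenthood data.
describe_r(c(-0.903384, -0.5659637))
# [1] "Very Strong Negative" "Moderate Negative"
```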

Caution

Use caution when interpreting a Pearson correlation coefficient.
The correlation may not tell you what you think it does about the data.

Consider R's built-in anscombe data set (Anscombe's quartet):

cor(anscombe$x1,anscombe$y1)

## [1] 0.8164205

Based on the correlation coefficient, we might imagine a scatter plot with a strong positive linear association.
We would be correct!

ggplot(anscombe, aes(x=x1,y=y1))+geom_point()

Now let’s check the second pair of variables:

cor(anscombe$x2,anscombe$y2)

## [1] 0.8162365

Almost exactly the same correlation coefficient! We should get a similar graph, right?

ggplot(anscombe, aes(x=x2,y=y2))+geom_point()

Nope: these points follow a smooth curve, not a line. What about the others?

cor(anscombe$x3,anscombe$y3)

## [1] 0.8162867

cor(anscombe$x4,anscombe$y4)

## [1] 0.8165214

ggplot(anscombe, aes(x=x3,y=y3))+geom_point()

ggplot(anscombe, aes(x=x4,y=y4))+geom_point()
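
The four correlations above can also be computed in one step. A minimal sketch, relying only on the built-in anscombe data (r_all is an illustrative name):

```r
# anscombe ships with base R; its columns are x1..x4 and y1..y4.
r_all <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

# All four values round to 0.82 even though the four scatter plots differ.
round(r_all, 2)
```

Seeing four nearly identical numbers side by side makes the point of the plots: the coefficient alone cannot distinguish these data sets.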

Shortcomings and Alternatives

You should always make a scatter plot before using the Pearson correlation coefficient to conclude anything about the shape of your data.
The Pearson correlation coefficient measures how closely the data follow a straight line.
It looks only for a linear relationship, so it can miss or understate a relationship that is curved.
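
To see what "only linear" means in practice, here is a small sketch on made-up data. The second case uses Spearman's rank correlation (method = "spearman" in cor()) as one possible alternative; that choice is an assumption for illustration and is not something the section above prescribes:

```r
# Case 1: a perfect but curved relationship (y is completely determined by x).
x <- -5:5
y_curve <- x^2
r_curve <- cor(x, y_curve)
r_curve      # essentially 0: Pearson finds no linear trend

# Case 2: a perfect, always-increasing but non-linear relationship.
u <- 1:10
v <- exp(u)
r_pearson  <- cor(u, v)                       # below 1, since the points are not on a line
r_spearman <- cor(u, v, method = "spearman")  # 1, since the ranks agree perfectly

r_pearson
r_spearman
```

In both cases one variable predicts the other exactly, yet the Pearson coefficient does not report a perfect relationship, which is why the scatter plot should come first.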

Class Activity

Make a scatter plot for penguin body mass vs. flipper length with facets by species. Looking only at the plots, can you tell for which species the body mass has the strongest relationship to flipper length? What about the weakest?
Use the filter() and cor() functions to find the Pearson correlation coefficient between body mass and flipper length for each species. Which species has the strongest relationship? Does this agree with what you saw in the plots?
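
As a hint for the second part, the same filter-then-cor pattern is sketched below on the built-in iris data rather than the penguin data. It assumes filter() refers to the dplyr function; the object names setosa and r_setosa are illustrative:

```r
library(dplyr)

# iris is built into R and stands in for the penguin data here.
# Keep only the rows for one species.
setosa <- filter(iris, Species == "setosa")

# Pearson correlation between two measurements within that species.
r_setosa <- cor(setosa$Petal.Length, setosa$Sepal.Length)
r_setosa
```

If the columns you pass to cor() contain missing values, the result is NA; passing use = "complete.obs" tells cor() to drop the incomplete rows first.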