In this section we use data from “Learning Statistics with R” by Danielle Navarro:
load("~/MAT104-Fall2024/Week12/parenthood.Rdata")
load("~/MAT104-Fall2024/Week12/pearson_correlations.RData")
load("~/MAT104-Fall2024/Week12/effort.Rdata")
Consider the following plots:
ggplot(parenthood, aes(x=dan.sleep, y=dan.grump)) + geom_point()
ggplot(parenthood, aes(x=baby.sleep, y=dan.grump)) + geom_point()
The Pearson correlation coefficient \(r_{XY}\) is a standardized covariance measure:
\[r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{X_i-\bar{X}}{s_X}\frac{Y_i-\bar{Y}}{s_Y} \]
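where \(\bar{X}\) and \(\bar{Y}\) are the sample means and \(s_X\) and \(s_Y\) are the sample standard deviations.
In R, cor() computes the Pearson correlation coefficient for two vectors, or the full correlation matrix when given a data frame: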
cor(parenthood$dan.sleep, parenthood$dan.grump)
## [1] -0.903384
#-.90
cor(parenthood$baby.sleep, parenthood$dan.grump)
## [1] -0.5659637
#-.57
cor(parenthood)
##              dan.sleep  baby.sleep   dan.grump         day
## dan.sleep   1.00000000  0.62794934 -0.90338404 -0.09840768
## baby.sleep  0.62794934  1.00000000 -0.56596373 -0.01043394
## dan.grump  -0.90338404 -0.56596373  1.00000000  0.07647926
## day        -0.09840768 -0.01043394  0.07647926  1.00000000
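As a check on the formula for \(r_{XY}\) above, the following sketch computes the coefficient step by step and compares it with cor(). The vectors x and y are made-up illustrative values, not taken from the parenthood data, and the names z_x, z_y, and r_manual are introduced here only for the example.

```r
# Made-up illustrative vectors (not from the parenthood data)
x <- c(2, 4, 5, 7, 9)
y <- c(10, 8, 7, 3, 1)

n <- length(x)

# Standardize each variable using its sample mean and sample standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum the products of the standardized values and divide by N - 1
r_manual <- sum(z_x * z_y) / (n - 1)

# Compare the hand computation with R's built-in function
r_manual
cor(x, y)
all.equal(r_manual, cor(x, y))
```

Because sd() uses the same \(N-1\) denominator as the formula, the two results should agree up to floating-point rounding.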
The plots below show data with various Pearson correlation coefficients:
ggplot(outcomes, aes(x=V1,y=V2))+geom_point() + facet_wrap(~pearson)
Correlation | Strength | Direction |
---|---|---|
\(-1\) to \(-0.9\) | Very Strong | Negative |
\(-0.9\) to \(-0.7\) | Strong | Negative |
\(-0.7\) to \(-0.4\) | Moderate | Negative |
\(-0.4\) to \(-0.2\) | Weak | Negative |
\(-0.2\) to \(0\) | Negligible | Negative |
\(0\) to \(0.2\) | Negligible | Positive |
\(0.2\) to \(0.4\) | Weak | Positive |
\(0.4\) to \(0.7\) | Moderate | Positive |
\(0.7\) to \(0.9\) | Strong | Positive |
\(0.9\) to \(1\) | Very Strong | Positive |
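The table can be turned into a small lookup function. This is only a sketch: describe_cor is a hypothetical helper name, and since the table's ranges share their endpoints, the code assumes that a boundary value such as 0.9 falls into the weaker category and that a correlation of exactly 0 is labeled positive.

```r
# Hypothetical helper: label a correlation using the cutoffs from the table above
describe_cor <- function(r) {
  # Strength depends only on the absolute value of r
  strength <- cut(abs(r),
                  breaks = c(0, 0.2, 0.4, 0.7, 0.9, 1),
                  labels = c("Negligible", "Weak", "Moderate",
                             "Strong", "Very Strong"),
                  include.lowest = TRUE)
  # Direction depends only on the sign of r (assumption: 0 counts as positive)
  direction <- ifelse(r < 0, "Negative", "Positive")
  paste(strength, direction)
}

# The two correlations computed earlier for the parenthood data
describe_cor(c(-0.903384, -0.5659637))
## [1] "Very Strong Negative" "Moderate Negative"
```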
Consider the following data set, anscombe, which is built into R:
cor(anscombe$x1,anscombe$y1)
## [1] 0.8164205
ggplot(anscombe, aes(x=x1,y=y1))+geom_point()
Now let’s check the second pair of variables:
cor(anscombe$x2,anscombe$y2)
## [1] 0.8162365
ggplot(anscombe, aes(x=x2,y=y2))+geom_point()
cor(anscombe$x3,anscombe$y3)
## [1] 0.8162867
cor(anscombe$x4,anscombe$y4)
## [1] 0.8165214
ggplot(anscombe, aes(x=x3,y=y3))+geom_point()
ggplot(anscombe, aes(x=x4,y=y4))+geom_point()
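The four correlations can also be computed in one step, which makes it easy to see that they are nearly identical even though the scatter plots look very different. This is a minimal sketch using R's built-in anscombe data frame; the name r_all is introduced here only for the example.

```r
# Correlation of each (x_i, y_i) pair in the built-in anscombe data frame
r_all <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

# Name the results so each value can be matched to its pair
names(r_all) <- paste0("pair", 1:4)
r_all

# Spread between the largest and smallest of the four correlations
diff(range(r_all))
```

A correlation coefficient alone therefore does not describe the shape of a relationship, so it is worth plotting the data as well.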
Make a scatter plot for penguin body mass vs. flipper length with facets by species. Looking only at the plots, can you tell for which species the body mass has the strongest relationship to flipper length? What about the weakest?
Use the filter() and cor() functions to find the Pearson correlation coefficient for body mass and flipper length for each species. Which species has the strongest relationship?