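In this lab you will explore how to compare means across many groups. We have already learned ways to test for the difference between two means. We will learn about a new tool called ANalysis Of VAriance, or ANOVA for short. Before performing ANOVA we must check three criteria: the observations are independent, the data within each group are approximately normal, and the variability is roughly equal across groups.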
Load the palmerpenguins library and save the penguins data set locally, removing the data points with no flipper length data. We can assume that the penguin observations are independent.
library(palmerpenguins)
penguins <- filter(penguins, !is.na(flipper_length_mm))
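A short aside on the missing-value condition. In R a missing value is NA, and comparing it with anything, including the string "NA", returns NA rather than TRUE or FALSE; is.na() is the direct test for missingness. The sketch below uses a hypothetical toy data frame (the values are made up) and base R only, so it runs without dplyr.

```r
# Hypothetical toy data: one flipper length is missing
toy <- data.frame(
  id = 1:4,
  flipper_length_mm = c(181L, NA, 195L, 210L)
)

# Comparing with the string "NA" gives NA (not FALSE) for the missing entry
cmp <- toy$flipper_length_mm != "NA"
print(cmp)

# is.na() tests for missingness directly and always returns TRUE or FALSE
keep <- !is.na(toy$flipper_length_mm)
print(keep)

# Base-R row subsetting that keeps only the rows with a recorded value
toy_complete <- toy[keep, ]
print(nrow(toy_complete))
```

dplyr's filter() drops rows where the condition is FALSE or NA, so both conditions end up keeping the same rows, but the is.na() version states the intent explicitly.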
The code above uses the filter() function. The first argument is the data set that you want to filter from and the second argument is the condition you want to filter by.
The filter() function is part of a package called dplyr. This package is included in the very large package called tidyverse that we already loaded in line 19.
tidyverse methods are much more common among statisticians.
Any ANOVA test we run will always have the same hypotheses:
\(H_0\): the mean is the same across all groups
\(H_A\): at least one mean is different
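The same pair of hypotheses can be written in symbols. The notation here is an assumption introduced for illustration and is not taken from the lab: \(k\) is the number of groups and \(\mu_1, \dots, \mu_k\) are the population means of the groups.

```latex
\[
H_0:\ \mu_1 = \mu_2 = \cdots = \mu_k
\qquad \text{versus} \qquad
H_A:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
\]
```

Note that \(H_A\) does not say which means differ, or that all of them do; it only says that the means are not all equal.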
\(H_0:\)
\(H_A:\)
Make side-by-side boxplots with \(x\)-axis species and \(y\)-axis flipper_length_mm. Do the data seem approximately normal? Does it look like there is a significant difference between the median flipper lengths of the species?
# insert code here
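As an illustration of this kind of plot on a different data set, here is a minimal sketch using PlantGrowth, which ships with base R (plant weight for three treatment groups). It uses base R's boxplot() to avoid extra packages; the lab itself may intend a ggplot2 plot, so treat this as one possible approach rather than the expected answer.

```r
# PlantGrowth ships with base R: plant weight for three treatment groups
data(PlantGrowth)

# Side-by-side boxplots: one box per group
bp <- boxplot(weight ~ group, data = PlantGrowth,
              xlab = "group", ylab = "weight")

# boxplot() invisibly returns the five-number summaries; row 3 holds the medians
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```

Roughly symmetric boxes with few outliers are consistent with approximate normality, and boxes of similar height suggest similar variability across groups.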
Using filter(), save three new data sets, one for each species of penguin: Adelie, Chinstrap, and Gentoo.
# insert code here
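The same idea, one data set per group, can be sketched on the built-in PlantGrowth data. This example uses base R's subset(), which plays a similar role to dplyr's filter(); the object names are chosen here for illustration only.

```r
# PlantGrowth ships with base R: plant weight for three treatment groups
data(PlantGrowth)

# One data frame per group, analogous to one data set per species
ctrl <- subset(PlantGrowth, group == "ctrl")
trt1 <- subset(PlantGrowth, group == "trt1")
trt2 <- subset(PlantGrowth, group == "trt2")

# Sample size and mean for each group
group_summary <- data.frame(
  group = c("ctrl", "trt1", "trt2"),
  n     = c(nrow(ctrl), nrow(trt1), nrow(trt2)),
  mean  = c(mean(ctrl$weight), mean(trt1$weight), mean(trt2$weight))
)
print(group_summary)
```

The group sizes and group means computed this way are the basic ingredients of the ANOVA calculation.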
# insert code here
# insert code here
# insert code here
# insert code here
# insert code here
Since the \(p\)-value is very small (it’s so small that R says it is zero), we reject the null hypothesis that the average flipper length is the same for every species. The data support a statistically significant difference in average flipper length across species.
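One possible reason R prints exactly zero is rounding: if the \(p\)-value is computed as one minus a lower-tail probability, a tiny upper tail is lost. The sketch below uses the F value (594.8) and degrees of freedom (2 and 339) from the ANOVA table for these data in this lab; the variable names are chosen here for illustration, and whether the lab's own calculation used this form is an assumption.

```r
# F statistic and degrees of freedom as reported for the penguin ANOVA in this lab
f_value   <- 594.8
df_groups <- 2
df_resid  <- 339

# Subtracting the lower tail from 1 loses the tiny upper tail to rounding
p_naive <- 1 - pf(f_value, df_groups, df_resid)

# Asking pf() for the upper tail directly keeps the small value
p_upper <- pf(f_value, df_groups, df_resid, lower.tail = FALSE)

print(p_naive)
print(p_upper)
```

Either way the conclusion is the same; the second form simply shows that the \(p\)-value is extremely small rather than literally zero.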
Luckily, we don’t need to go through all of those steps in order to calculate ANOVA in R. The following code will calculate all of the values we calculated above. Run this code and make sure the values above match with the values below.
summary(aov(flipper_length_mm ~ species, data=penguins))
##              Df Sum Sq Mean Sq F value Pr(>F)
## species       2  52473   26237   594.8 <2e-16 ***
## Residuals   339  14953      44
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
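To see where the columns of such a table come from, here is a minimal sketch that rebuilds them by hand and compares the result with aov(). It uses the built-in PlantGrowth data rather than the penguins so that the lab's own calculation is left to do; the variable names (ssg, sse, msg, mse) are chosen here for illustration.

```r
# PlantGrowth ships with base R: plant weight for three treatment groups
data(PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

n <- length(y)    # total number of observations
k <- nlevels(g)   # number of groups

group_n    <- tapply(y, g, length)
group_mean <- tapply(y, g, mean)
grand_mean <- mean(y)

# Between-group sum of squares: group means around the grand mean
ssg <- sum(group_n * (group_mean - grand_mean)^2)

# Within-group sum of squares: observations around their own group mean
sse <- sum((y - ave(y, g))^2)

# Mean squares: sums of squares divided by their degrees of freedom
msg <- ssg / (k - 1)
mse <- sse / (n - k)

f_stat  <- msg / mse
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# The same quantities from aov() for comparison
tab <- summary(aov(weight ~ group, data = PlantGrowth))[[1]]
print(c(F_by_hand = f_stat, F_aov = tab[["F value"]][1]))
print(c(p_by_hand = p_value, p_aov = tab[["Pr(>F)"]][1]))
```

In the table, Df holds \(k - 1\) and \(n - k\), Sum Sq holds the two sums of squares, Mean Sq is Sum Sq divided by Df, and F value is the ratio of the two mean squares.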
Load mlb_players_18 and filter the data to only contain outfielders (outfielders have position LF, RF, or CF).
# insert code here
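Keeping rows whose value matches any of several options is usually done with the %in% operator. The sketch below uses a hypothetical toy roster (the players and the column names are made up for illustration and are not the real mlb_players_18 data) and base R subsetting.

```r
# Hypothetical toy roster; names and positions are made up for illustration
roster <- data.frame(
  player   = c("A", "B", "C", "D", "E"),
  position = c("LF", "1B", "CF", "P", "RF")
)

outfield_positions <- c("LF", "RF", "CF")

# %in% is TRUE when a row's position matches any of the listed values
outfielders <- roster[roster$position %in% outfield_positions, ]
print(outfielders)
```

The same logical condition, position %in% outfield_positions, can be passed as the second argument of filter().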
AVG. Compute the variance of the three groups (LF, RF, CF). Do the data satisfy the criteria to perform ANOVA?
# insert code here
# insert code here
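Group variances can be computed in one call with tapply(). The sketch below again uses the built-in PlantGrowth data as a stand-in. The comparison of the largest and smallest standard deviations is an informal rule of thumb added here as an assumption, not a criterion stated in this lab.

```r
# PlantGrowth ships with base R: plant weight for three treatment groups
data(PlantGrowth)

# Variance of the response within each group
group_var <- tapply(PlantGrowth$weight, PlantGrowth$group, var)
group_sd  <- sqrt(group_var)
print(group_var)

# Ratio of the largest to the smallest standard deviation;
# a ratio far above 2 is often read as a warning sign for unequal variability
sd_ratio <- max(group_sd) / min(group_sd)
print(sd_ratio)
```

A ratio close to 1 suggests the equal-variability condition is reasonable; it is worth reading it alongside the boxplots rather than on its own.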