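In this lab you will explore how to compare means across many groups. We have already learned ways to test for the difference between two means. Now we will learn about a new tool called ANalysis Of VAriance, or ANOVA for short. Before performing ANOVA we must check three criteria: the observations are independent, the data within each group are approximately normal, and the variance is approximately equal across groups.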

  1. Run the code below to load the palmerpenguins library and save the penguins data set locally, removing the observations with no flipper length measurement. We can assume that the penguin observations are independent.
library(palmerpenguins)
library(dplyr)  # provides filter()
# Keep only the rows where flipper length was recorded
penguins <- filter(penguins, !is.na(flipper_length_mm))

Hypotheses for ANOVA

Any ANOVA test we run will always have the same hypotheses:

\(H_0\): the mean is the same across all groups

\(H_A\): at least one mean is different

  1. Write appropriate hypotheses (within the context of the problem) for an ANOVA test trying to identify if different species of penguins have different mean flipper lengths.

\(H_0:\)

\(H_A:\)

  1. Make a box plot with \(x\)-axis species and \(y\)-axis flipper_length_mm. Do the data seem approximately normal? Does it look like there is a significant difference between the median flipper lengths of the species?
# insert code here
  1. Using filter() save three new data sets, one for each species of penguin: Adelie, Chinstrap, and Gentoo.
# insert code here
  1. Find the mean and variance of flipper length for each group. For the variance across groups to be considered approximately equal we ask that no variance is more than double another. Do the variances satisfy this criterion?
# insert code here
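The equal-variance rule of thumb can be checked in a few lines. The sketch below is only an illustration: it uses a small made-up data set (the data frame toy and its columns group and value are hypothetical, not the penguin data) and base R only.

```r
# Hypothetical toy data: three groups of four measurements each
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  value = c(3, 4, 5, 4,  5, 6, 7, 6,  7, 8, 9.5, 7.5)
)

# Mean and variance of value within each group
group_means <- tapply(toy$value, toy$group, mean)
group_vars  <- tapply(toy$value, toy$group, var)

# Rule of thumb: the largest variance is at most double the smallest
var_ratio    <- max(group_vars) / min(group_vars)
equal_enough <- var_ratio <= 2
```

Comparing only the largest and smallest variances is enough, because every other pair of variances has a ratio no bigger than that one.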
  1. A first step to perform ANOVA is to find the sum of squares between groups (SSG) of flipper lengths using the following formula: \[SSG = \sum_{i=1}^k n_i(\bar{x}_i-\bar{x})^2 \] where \(k\) is the number of groups (in this case \(k=3\) for the three species), \(n_i\) is the sample size of group \(i\), \(\bar{x}_i\) is the average for that group, and \(\bar{x}\) is the average of the whole data set (all the penguins). Find the SSG for flipper length below:
# insert code here
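As a sketch of how the SSG formula translates into R, here is the calculation on a small made-up data set. The data frame toy and the names n_i, xbar_i, and xbar are hypothetical and chosen to mirror the symbols in the formula; they are not the penguin data.

```r
# Hypothetical toy data: three groups of four measurements each
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  value = c(3, 4, 5, 4,  5, 6, 7, 6,  7, 8, 9.5, 7.5)
)

n_i    <- tapply(toy$value, toy$group, length)  # sample size of each group
xbar_i <- tapply(toy$value, toy$group, mean)    # mean of each group
xbar   <- mean(toy$value)                       # mean of all observations

# Sum over groups of n_i * (group mean - overall mean)^2
SSG <- sum(n_i * (xbar_i - xbar)^2)
```

Here the group means are 4, 6, and 8 and the overall mean is 6, so the sum is 4(4) + 4(0) + 4(4) = 32.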
  1. The next step is to find the sum of squared errors (SSE) of flipper lengths using the following formula: \[SSE = \sum_{i=1}^k (n_i-1)\cdot s_i^2 \] where \(n_i\) is the sample size of group \(i\), and \(s_i\) is the standard deviation of that group (so \(s_i^2\) is its variance). Find the SSE below:
# insert code here
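The SSE formula can be sketched the same way. The example below again uses a made-up data set (toy, n_i, and s_i are hypothetical names, not the penguin data) and also checks the result against the direct definition: the squared distance of every observation from its own group mean.

```r
# Hypothetical toy data: three groups of four measurements each
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  value = c(3, 4, 5, 4,  5, 6, 7, 6,  7, 8, 9.5, 7.5)
)

n_i <- tapply(toy$value, toy$group, length)  # sample size of each group
s_i <- tapply(toy$value, toy$group, sd)      # standard deviation of each group

# Sum over groups of (n_i - 1) * s_i^2
SSE <- sum((n_i - 1) * s_i^2)

# Equivalent check: squared deviations of each value from its group mean
# (ave() returns the group mean for every row)
SSE_direct <- sum((toy$value - ave(toy$value, toy$group))^2)
```

The two values agree because \((n_i-1)\cdot s_i^2\) is, by the definition of the sample variance, the sum of squared deviations within group \(i\).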
  1. SSG and SSE each have an associated degrees of freedom. For SSG, the degrees of freedom, \(df_G\), is the number of groups minus \(1\). For SSE, the degrees of freedom, \(df_E\), is the total number of observations minus the number of groups. We can use the degrees of freedom to calculate the mean square between groups (MSG) and the mean square error (MSE) by \(MSG = \frac{SSG}{df_G}\) and \(MSE = \frac{SSE}{df_E}\). Calculate MSG and MSE for flipper length.
# insert code here
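A minimal sketch of this step, using made-up inputs rather than the penguin values: all four starting numbers below are hypothetical and stand in for whatever SSG, SSE, group count, and sample size you found.

```r
# Hypothetical inputs for illustration (not the penguin values)
SSG <- 32    # sum of squares between groups
SSE <- 7.5   # sum of squared errors
k   <- 3     # number of groups
n   <- 12    # total number of observations

df_G <- k - 1   # degrees of freedom for groups
df_E <- n - k   # degrees of freedom for error

MSG <- SSG / df_G   # mean square between groups
MSE <- SSE / df_E   # mean square error
```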
  1. Finally, we calculate the \(F\)-statistic as \(F=MSG/MSE\). Calculate the \(F\)-statistic below, then use the code 1 - pf(F, df_G, df_E) to calculate the \(p\)-value: the probability of observing an \(F\)-statistic at least this large, assuming that there is no difference between the group means.
# insert code here
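As an illustration of this last step, the sketch below starts from made-up mean squares and degrees of freedom (hypothetical numbers, not the penguin values). The statistic is stored as F_stat rather than F, since F is also shorthand for FALSE in R; that naming is a choice made here, not a requirement.

```r
# Hypothetical mean squares and degrees of freedom (not the penguin values)
MSG  <- 16
MSE  <- 7.5 / 9
df_G <- 2
df_E <- 9

# F-statistic: variability between groups relative to variability within groups
F_stat <- MSG / MSE

# p-value: area under the F(df_G, df_E) curve to the right of F_stat
p_value <- 1 - pf(F_stat, df_G, df_E)
```

A large F-statistic means the group means are spread out far more than the within-group noise would explain, which is why only the right tail counts toward the p-value.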

Since the \(p\)-value is very small (it’s so small that R says it is zero), we reject the null hypothesis that the mean flipper length is the same for every species. The data support that at least one species has a different mean flipper length.
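The reported zero is a rounding effect, not a true zero. The sketch below assumes the F-statistic and degrees of freedom printed in the aov() output in this lab (594.8, 2, and 339) and compares the subtraction approach with asking pf() for the upper tail directly through its lower.tail argument.

```r
# F-statistic and degrees of freedom as printed in the aov() output in this lab
F_stat <- 594.8
df_G   <- 2
df_E   <- 339

# Subtracting from 1 loses the tiny tail probability to rounding
p_subtract <- 1 - pf(F_stat, df_G, df_E)

# Computing the upper tail directly keeps the tiny value
p_upper <- pf(F_stat, df_G, df_E, lower.tail = FALSE)
```

Either way the conclusion is the same, since both values are far below any usual significance level.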

Luckily, we don’t need to go through all of those steps in order to calculate ANOVA in R. The following code will calculate all of the values we calculated above. Run this code and make sure the values above match with the values below.

summary(aov(flipper_length_mm ~ species, data=penguins))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## species       2  52473   26237   594.8 <2e-16 ***
## Residuals   339  14953      44                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
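One way to do this comparison in code rather than by eye is sketched below on a small made-up data set (the data frame toy is hypothetical, not the penguin data). It assumes that the first element of summary(aov(...)) is a table with columns named Df, Sum Sq, and F value, matching the printed header above.

```r
# Hypothetical toy data: three groups of four measurements each
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  value = c(3, 4, 5, 4,  5, 6, 7, 6,  7, 8, 9.5, 7.5)
)

# Hand calculation, following the steps of this lab
n_i    <- tapply(toy$value, toy$group, length)
xbar_i <- tapply(toy$value, toy$group, mean)
s_i    <- tapply(toy$value, toy$group, sd)
SSG    <- sum(n_i * (xbar_i - mean(toy$value))^2)
SSE    <- sum((n_i - 1) * s_i^2)
df_G   <- length(n_i) - 1
df_E   <- nrow(toy) - length(n_i)
F_stat <- (SSG / df_G) / (SSE / df_E)

# The same quantities pulled out of the aov() summary table
tab <- summary(aov(value ~ group, data = toy))[[1]]
sums_match <- isTRUE(all.equal(tab[["Sum Sq"]], c(SSG, SSE)))
F_matches  <- isTRUE(all.equal(tab[["F value"]][1], F_stat))
```

all.equal() is used instead of == so that tiny floating-point differences between the two calculations do not count as a mismatch.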
  1. Load the data set mlb_players_18 and filter the data to only contain outfielders (outfielders have position LF, RF, or CF).
# insert code here
  1. Make a box plot of the data with \(x\)-axis position and \(y\)-axis AVG. Compute the variance of the three groups (LF, RF, CF). Do the data satisfy the criteria to perform ANOVA?
# insert code here
  1. Perform ANOVA and use the \(p\)-value to determine if there is a statistically significant difference between the mean batting average (AVG) for the three positions using a significance level of \(0.10\).
# insert code here