Today’s Agenda

  • Chi-Square tests can be used in three ways:
    • independence of variables
    • goodness of fit
    • homogeneity
  • In every case you want to compute the chi-squared statistic:

\[\chi^2 = \sum \frac{(\text{observed}-\text{expected})^2}{\text{expected}}\]

  • In a goodness of fit test you KNOW the expected values.
  • For example:
    • is rolling a die uniform?
    • does a roulette table seem fair?
    • do your data match expectations from a population?
  • In a test of homogeneity you don’t know the expected values in advance; you need to calculate them from the data.
  • For example:
    • do North Carolinians and Californians have the same distribution of TVs in their homes (we don’t know the distribution, we are just asking if they are similar)?
    • is the distribution of time spent on the computer per day the same for Apple and Android users?
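As a minimal sketch of the die question above, the following R code runs a goodness of fit test on made-up counts from 60 rolls. The counts and variable names are illustrative assumptions, not data from the lecture; base R's `chisq.test()` is used only as a cross-check on the hand calculation.

```r
# Hypothetical counts for faces 1 to 6 from 60 rolls (illustrative only).
observed <- c(8, 12, 9, 11, 6, 14)

# A fair die puts probability 1/6 on each face, so the expected counts are known.
fair_probs <- rep(1 / 6, 6)
expected <- sum(observed) * fair_probs

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_sq <- sum((observed - expected)^2 / expected)

# Goodness of fit uses (number of bins - 1) degrees of freedom.
dof <- length(observed) - 1
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Cross-check with the built-in test.
builtin <- chisq.test(observed, p = fair_probs)
```

A small `p_value` would suggest the die is not uniform; `builtin$statistic` and `builtin$p.value` should agree with the hand-computed values.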

Compare and contrast the two questions:

The table below gives the number of people in each of two groups, by average time spent on the computer per day:

Avg. Hours per day   Group A   Group B
\(<1\) hour          \(30\)    \(26\)
\(1-3\) hours        \(35\)    \(42\)
\(3-5\) hours        \(25\)    \(22\)
\(>5\) hours         \(10\)    \(10\)
Totals               \(100\)   \(100\)
  1. If Group A is Apple users and Group B is Windows users, do the data provide evidence that the groups follow different distributions?

\(H_0:\) the groups are homogeneous (both samples come from the same distribution)

\(H_A:\) the groups are not homogeneous (the samples come from different distributions)

# This is a test of homogeneity: we want to know whether A and B come from the same distribution.
# Expected count for each cell = row total * column total / overall total.
e_11 <- 56*100/200
e_12 <- 56*100/200
e_21 <- 77*100/200
e_22 <- 77*100/200
e_31 <- 47*100/200
e_32 <- 47*100/200
e_41 <- 20*100/200
e_42 <- 20*100/200
chi <- (30-e_11)^2/e_11 + (26-e_12)^2/e_12 +
  (35-e_21)^2/e_21 + (42-e_22)^2/e_22 +
  (25-e_31)^2/e_31 + (22-e_32)^2/e_32 +
  (10-e_41)^2/e_41 + (10-e_42)^2/e_42
# Degrees of freedom: (4 rows - 1) * (2 columns - 1) = 3.
p_value <- pchisq(chi, df = 3, lower.tail = FALSE)

# The p-value is large, so our data do not provide statistically significant evidence that the distributions are different from one another.
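The same homogeneity calculation can be written more compactly by storing the counts in a matrix. This is a sketch rather than the lecture's own code: the object names are assumptions, and `chisq.test()` from base R is included only to confirm the hand calculation.

```r
# Observed counts from the table (rows: time bins, columns: groups).
observed <- matrix(
  c(30, 26,
    35, 42,
    25, 22,
    10, 10),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("<1", "1-3", "3-5", ">5"), c("A", "B"))
)

# Expected count for each cell: row total * column total / overall total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic, degrees of freedom, and upper-tail p-value.
chi_sq <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Cross-check with the built-in test.
builtin <- chisq.test(observed)
```

Here `outer()` builds the full grid of row total times column total in one step, which avoids typing each expected count by hand.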
  2. If Group A is Apple users and Group B is the expected percentage in the entire population, do the data provide evidence that Group A does not follow the expected distribution?
# Goodness of fit test: we want to know whether Group A matches the expected distribution, which is given by Group B.
# Group A has 100 observations, so the expected counts equal the Group B percentages.

chi <- (30-26)^2/26 + (35-42)^2/42 + (25-22)^2/22 + (10-10)^2/10
# Degrees of freedom: 4 bins - 1 = 3.
p_value <- pchisq(chi, df = 3, lower.tail = FALSE)

# These data do not provide statistically significant evidence that Apple users differ from the expected distribution.
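For comparison, the goodness of fit calculation can also be sketched with vectors and checked against `chisq.test()`, passing the population percentages through its `p` argument. The variable names are assumptions introduced for illustration.

```r
# Observed counts for Group A and the expected population shares (Group B).
observed <- c(30, 35, 25, 10)
population_share <- c(26, 42, 22, 10) / 100

# Expected counts: sample size times the expected proportion in each bin.
expected <- sum(observed) * population_share

# Chi-squared statistic, degrees of freedom (bins - 1), and p-value.
chi_sq <- sum((observed - expected)^2 / expected)
dof <- length(observed) - 1
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Cross-check with the built-in test.
builtin <- chisq.test(observed, p = population_share)
```

Unlike the homogeneity test, nothing is estimated from the data here: the expected counts come entirely from the stated population shares.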

Recap of the three tests

  • Test of independence
    • the null is always that two variables are independent of each other (think of grade level and what students found important)
    • the alternate is always that the two variables are dependent
    • the expected values are always row total * column total / overall total
    • dof (degrees of freedom) are (# of rows - 1) * (# of columns - 1)
  • Test of homogeneity
    • the null is always that two distributions match each other
    • the alternate is always that the two distributions are different
    • the expected values are always row total * column total / overall total
    • dof are (# of rows - 1) * (# of columns - 1)
  • Goodness of fit test
    • the null is that the data match an expected distribution
    • the alternate is that they don’t
    • the expected values are always given to you (as counts, or as proportions to multiply by the sample size)
    • dof are number of bins - 1
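The degrees-of-freedom rules in the recap can be turned into a rejection threshold with `qchisq()`. This sketch assumes a 5% significance level, which these notes do not specify, and the helper function names are hypothetical.

```r
# Degrees of freedom for a test of independence or homogeneity.
dof_table <- function(n_rows, n_cols) (n_rows - 1) * (n_cols - 1)

# Degrees of freedom for a goodness of fit test.
dof_gof <- function(n_bins) n_bins - 1

alpha <- 0.05  # assumed significance level

# Critical value: reject the null when the chi-squared statistic exceeds it.
critical_table <- qchisq(1 - alpha, df = dof_table(4, 2))
critical_gof <- qchisq(1 - alpha, df = dof_gof(4))
```

The 4-by-2 computer-time table and the 4-bin goodness of fit test both have 3 degrees of freedom, so they share the same critical value (about 7.81).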

Correlation

For the following section we will use data from: “Learning Statistics with R” by Danielle Navarro:

load("~/MAT104-Spring25/Week12/parenthood.Rdata")
load("~/MAT104-Spring25/Week12/pearson_correlations.RData")
load("~/MAT104-Spring25/Week12/effort.Rdata")
  • This data captures how grumpy Danielle is, how much she slept in a day, and how much her baby slept in a day.

Consider the following plots:

library(ggplot2)  # provides ggplot(), aes(), geom_point(), facet_wrap()
ggplot(parenthood, aes(x=dan.sleep, y=dan.grump)) + geom_point()

ggplot(parenthood, aes(x=baby.sleep, y=dan.grump)) + geom_point()

Correlation coefficient

The Pearson correlation coefficient \(r_{XY}\) is a standardized covariance measure:

\[r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{X_i-\bar{X}}{s_X}\frac{Y_i-\bar{Y}}{s_Y} \]

where \(s_X\) and \(s_Y\) are the sample standard deviations of \(X\) and \(Y\).

The R code for the Pearson correlation coefficient is:
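The code that originally followed this sentence is not preserved here. As a hedged sketch, base R's `cor()` computes the Pearson coefficient by default; the small vectors below are made-up illustration data, and the manual version simply follows the formula above.

```r
# Illustrative data (not from the parenthood data set).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Manual calculation: average product of standardized scores,
# dividing by N - 1 and using sample standard deviations (sd()).
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r_manual <- sum(z_x * z_y) / (n - 1)

# Built-in calculation; method = "pearson" is the default.
r_builtin <- cor(x, y)
```

With the lecture data loaded, the same call would be `cor(parenthood$dan.sleep, parenthood$dan.grump)`.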

Properties of \(r_{XY}\)

Interpreting the Pearson correlation coefficient

Below is data with various Pearson correlation coefficients:

ggplot(outcomes, aes(x=V1,y=V2))+geom_point() + facet_wrap(~pearson)

  • Exactly what constitutes a strong correlation depends on the context.
  • You can, however, use these general guidelines:
Correlation            Strength      Direction
\(-1\) to \(-0.9\)     Very Strong   Negative
\(-0.9\) to \(-0.7\)   Strong        Negative
\(-0.7\) to \(-0.4\)   Moderate      Negative
\(-0.4\) to \(-0.2\)   Weak          Negative
\(-0.2\) to \(0\)      Negligible    Negative
\(0\) to \(0.2\)       Negligible    Positive
\(0.2\) to \(0.4\)     Weak          Positive
\(0.4\) to \(0.7\)     Moderate      Positive
\(0.7\) to \(0.9\)     Strong        Positive
\(0.9\) to \(1\)       Very Strong   Positive
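The guideline table can be encoded as a small lookup function. This is a sketch with a hypothetical function name, and it makes one assumption the table leaves open: a value exactly on a boundary (for example \(0.7\)) is assigned to the stronger category.

```r
# Label a correlation coefficient using the guideline table.
# Assumption: boundary values go to the stronger category, and r = 0 is "Positive".
describe_correlation <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  strength <- cut(
    abs(r),
    breaks = c(0, 0.2, 0.4, 0.7, 0.9, 1),
    labels = c("Negligible", "Weak", "Moderate", "Strong", "Very Strong"),
    right = FALSE,         # intervals are closed on the left: [0, 0.2), [0.2, 0.4), ...
    include.lowest = TRUE  # with right = FALSE this keeps |r| = 1 in the last interval
  )
  direction <- ifelse(r < 0, "Negative", "Positive")
  paste(strength, direction)
}

labels <- describe_correlation(c(-0.95, -0.5, 0.1, 0.3, 0.8))
```

Because the thresholds are only rules of thumb, the labels are best read as a rough description rather than a formal classification.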