Day 34 – Chi-Square IV: Practice

Today’s Agenda

Chi-Square tests can be used in three ways:
- independence of variables
- goodness of fit
- homogeneity
In every case you want to compute the chi-squared statistic:

\[\chi^2 = \sum \frac{(\text{observed}-\text{expected})^2}{\text{expected}}\]
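
As a minimal sketch, the statistic can be wrapped in a small helper. The function name `chi_sq_stat` and the die-roll counts are made up for illustration; they are not part of the course data.

```r
# Hypothetical helper: chi-squared statistic from observed and expected counts
chi_sq_stat <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Made-up example: 60 rolls of a die, so a fair die expects 10 rolls per face
rolls <- c(8, 12, 9, 11, 13, 7)
expected <- rep(sum(rolls) / 6, 6)
chi_stat <- chi_sq_stat(rolls, expected)
chi_stat
```

A larger value means the observed counts sit further from the expected counts, relative to the size of those expected counts.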

In a goodness of fit test you KNOW the expected values.
For example:
- is a die fair (are all six faces equally likely)?
- does a roulette table seem fair?
- do your data match expectations from a population?
In a test of homogeneity you don’t know the expected values in advance; you need to calculate them from the data.
For example:
- do North Carolinians and Californians have the same distribution of TVs in their homes (we don’t know the distribution, we are just asking if they are similar)?
- is the distribution of time spent on the computer per day the same for Apple and Android users?

Compare and contrast the two questions:

Below is a table describing the average amount of time spent on the computer per day in two groups:

Avg. Hours per day	Group A	Group B
\(<1\) hour	\(30\)	\(26\)
\(1-3\) hours	\(35\)	\(42\)
\(3-5\) hours	\(25\)	\(22\)
\(>5\) hours	\(10\)	\(10\)
Totals	\(100\)	\(100\)

If Group A is Apple users and Group B is Windows users, do the data provide evidence that the groups follow different distributions?

\(H_0:\) they are homogeneous (both groups come from the same or similar distributions)

\(H_A:\) they are not homogeneous (the groups seem to come from different distributions)

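# This is a test of homogeneity: we want to know whether A and B come from the same distribution.
# Expected count = row total * column total / overall total
e_11 <- 56*100/200
e_12 <- 56*100/200
e_21 <- 77*100/200
e_22 <- 77*100/200
e_31 <- 47*100/200
e_32 <- 47*100/200
e_41 <- 20*100/200
e_42 <- 20*100/200
# Sum (observed - expected)^2 / expected over all eight cells
chi <- (30-e_11)^2/e_11 + (26-e_12)^2/e_12 +
  (35-e_21)^2/e_21 + (42-e_22)^2/e_22 +
  (25-e_31)^2/e_31 + (22-e_32)^2/e_32 +
  (10-e_41)^2/e_41 + (10-e_42)^2/e_42
# p-value with (4-1)*(2-1) = 3 degrees of freedom
pchisq(chi, df = 3, lower.tail = FALSE)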

# Our data does not provide statistically significant evidence that the distributions are different from one another.
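
The hand computation can be cross-checked with base R's `chisq.test()`, which builds the expected counts from the row and column totals itself. This is a sketch rather than the course's official solution; the object names `observed` and `homog` are chosen here for illustration.

```r
# Observed counts from the table above (rows: time bins, columns: groups)
observed <- matrix(c(30, 26,
                     35, 42,
                     25, 22,
                     10, 10),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("<1", "1-3", "3-5", ">5"),
                                   c("A", "B")))

homog <- chisq.test(observed)  # continuity correction only applies to 2x2 tables
homog$expected                 # row total * column total / overall total
homog$statistic                # should match the hand-computed chi
homog$parameter                # degrees of freedom: (4-1)*(2-1) = 3
homog$p.value                  # compare to your significance level, e.g. 0.05
```

A p-value above the significance level means you fail to reject \(H_0\), which is the conclusion stated above.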

If Group A is Apple users and Group B is the expected percentage in the entire population, do the data provide evidence that Group A does not follow the expected distribution?

# Goodness of fit test: we want to know whether Group A matches the expected distribution, which is Group B

chi <- (30-26)^2/26 + (35-42)^2/42 + (25-22)^2/22 + (10-10)^2/10
# p-value with 4 - 1 = 3 degrees of freedom
pchisq(chi, df = 3, lower.tail = FALSE)

# These data do not provide statistically significant evidence that Apple users differ from the expected distribution
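
The same goodness of fit test can be sketched with `chisq.test()` by passing the expected proportions through its `p` argument. The names `group_a`, `population`, and `gof` are illustrative choices, not names from the course files.

```r
group_a <- c(30, 35, 25, 10)     # observed counts for Group A
population <- c(26, 42, 22, 10)  # Group B, read as population percentages

# p must be proportions that sum to 1, so divide the percentages by their total
gof <- chisq.test(group_a, p = population / sum(population))
gof$statistic  # should match the hand-computed chi
gof$parameter  # degrees of freedom: 4 bins - 1 = 3
gof$p.value
```

Note the difference from the homogeneity test: here the expected counts come straight from the known population distribution instead of being estimated from row and column totals.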

Recap of the three tests

Test of independence
- the null is always that two variables are independent of each other (think grade level and what they found important)
- the alternate is always that the two variables are dependent
- the expected values are always row total * column total / overall total
- the degrees of freedom are (# of rows - 1) * (# of columns - 1)
Test of homogeneity
- the null is always that two distributions match each other
- the alternate is always that the two distributions are different
- the expected values are always row total * column total / overall total
- the degrees of freedom are (# of rows - 1) * (# of columns - 1)
Goodness of fit test
- the null is that the data match an expected distribution
- the alternate is that they don’t
- the expected values are always given to you
- the degrees of freedom are (number of bins - 1)
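
The expected-value and degrees-of-freedom rules in this recap can be sketched as small helpers. The function names and the 2x3 table are hypothetical, made up only to show the arithmetic.

```r
# Hypothetical helpers summarizing the recap
# Expected counts for independence/homogeneity: row total * column total / overall total
expected_counts <- function(tab) {
  outer(rowSums(tab), colSums(tab)) / sum(tab)
}

# Degrees of freedom for a two-way table and for a goodness of fit test
dof_table <- function(tab) (nrow(tab) - 1) * (ncol(tab) - 1)
dof_gof <- function(observed) length(observed) - 1

# Made-up 2x3 table of counts
tab <- matrix(c(20, 30, 50,
                30, 30, 40), nrow = 2, byrow = TRUE)
expected_counts(tab)
dof_table(tab)                    # (2-1)*(3-1)
dof_gof(c(8, 12, 9, 11, 13, 7))   # 6 bins - 1
```

The tests of independence and homogeneity share the same arithmetic; they differ only in how the data were collected and how the hypotheses are worded.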

Correlation

For the following section we will use data from “Learning Statistics with R” by Danielle Navarro:

load("~/MAT104-Spring25/Week12/parenthood.Rdata")
load("~/MAT104-Spring25/Week12/pearson_correlations.RData")
load("~/MAT104-Spring25/Week12/effort.Rdata")

This data captures how grumpy Danielle is, how much she slept in a day, and how much her baby slept in a day.

Consider the following plots:

library(ggplot2)
ggplot(parenthood, aes(x=dan.sleep, y=dan.grump)) + geom_point()

ggplot(parenthood, aes(x=baby.sleep, y=dan.grump)) + geom_point()

Correlation coefficient

The Pearson correlation coefficient \(r_{XY}\) is a standardized covariance measure:

\[r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{X_i-\bar{X}}{s_X}\frac{Y_i-\bar{Y}}{s_Y} \]

where \(\bar{X}\) and \(\bar{Y}\) are the sample means and \(s_X\) and \(s_Y\) are the sample standard deviations.

The R code for the Pearson correlation coefficient is:
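
A minimal sketch of one way to do this: base R's `cor()` returns the Pearson coefficient by default, and the same number comes out of translating the formula above directly. The built-in `mtcars` data set is used as a stand-in (an assumption, so the example runs without the course files), and the names `x`, `y`, `r_manual`, and `r_builtin` are illustrative.

```r
# Stand-in data: built-in mtcars (the notes use the parenthood data frame instead)
x <- mtcars$wt
y <- mtcars$mpg

# Direct translation of the formula: sum of products of standardized values over N - 1
n <- length(x)
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Built-in equivalent (method = "pearson" is the default)
r_builtin <- cor(x, y)

c(manual = r_manual, builtin = r_builtin)
```

With the parenthood data loaded, the analogous call would be `cor(parenthood$dan.sleep, parenthood$dan.grump)`.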

Properties of \(r_{XY}\)

Interpreting the Pearson correlation coefficient

Below is data with various Pearson correlation coefficients:

ggplot(outcomes, aes(x=V1,y=V2))+geom_point() + facet_wrap(~pearson)

Exactly what constitutes a strong correlation depends on the context.
You can, however, use these general guidelines:

Correlation	Strength	Direction
\(-1\) to \(-0.9\)	Very Strong	Negative
\(-0.9\) to \(-0.7\)	Strong	Negative
\(-0.7\) to \(-0.4\)	Moderate	Negative
\(-0.4\) to \(-0.2\)	Weak	Negative
\(-0.2\) to \(0\)	Negligible	Negative
\(0\) to \(0.2\)	Negligible	Positive
\(0.2\) to \(0.4\)	Weak	Positive
\(0.4\) to \(0.7\)	Moderate	Positive
\(0.7\) to \(0.9\)	Strong	Positive
\(0.9\) to \(1\)	Very Strong	Positive
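
The guideline table can be turned into a lookup, sketched here with `cut()`. The function name `describe_correlation` is hypothetical. The table lists each boundary value in two rows, so this sketch assumes a boundary belongs to the stronger category and that \(r = 0\) is labeled positive.

```r
# Hypothetical helper: label correlations using the guideline table above
describe_correlation <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  # Intervals are [0, 0.2), [0.2, 0.4), [0.4, 0.7), [0.7, 0.9), [0.9, 1]
  strength <- cut(abs(r),
                  breaks = c(0, 0.2, 0.4, 0.7, 0.9, 1),
                  labels = c("Negligible", "Weak", "Moderate",
                             "Strong", "Very Strong"),
                  right = FALSE, include.lowest = TRUE)
  direction <- ifelse(r < 0, "Negative", "Positive")
  data.frame(r = r,
             strength = as.character(strength),
             direction = direction,
             stringsAsFactors = FALSE)
}

res <- describe_correlation(c(-0.95, 0.5, 0.1, -0.3))
res
```

These labels are only rules of thumb; as noted above, what counts as strong depends on the context.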