CA 19: Chi-Square

Today’s Agenda

Fitting a distribution
Chi-square test of independence

Fitting a distribution

Often we have data that we expect will fit a certain distribution.
We can use a statistical test to determine whether the data really do fit that distribution.

Example: On a fair six-sided die, each side is expected to be equally likely (uniform distribution). However, many dice are constructed so that the numbers are marked by hollowed-out pips. This should, in theory, make the sides with more pips lighter than the sides with fewer pips. A person conjectures that the five and six should show up more often when rolling a six-sided die, since those are the lightest sides and the heavier sides should end up on the bottom. Over many weeks, they roll a die \(300,000\) times and record their results:

Outcome	Observed	Expected
\(1\)	\(50,611\)	\(50,000\)
\(2\)	\(49,523\)	\(50,000\)
\(3\)	\(49,812\)	\(50,000\)
\(4\)	\(49,924\)	\(50,000\)
\(5\)	\(49,672\)	\(50,000\)
\(6\)	\(50,458\)	\(50,000\)
Total:	\(300,000\)	\(300,000\)

\(H_0\): The die follows a uniform distribution. No side of the die is favored.

\(H_A\): The die does not follow a uniform distribution. Some sides are favored.

We need a statistic that summarizes how far our observed counts are from the counts expected under the uniform distribution.
The statistic we will use is called chi-square (or chi-squared).
Once we have the statistic, we calculate its p-value and compare it to the significance level.

\[ \displaystyle \chi^2 = \sum \frac{ (\text{observed}- \text{expected})^2}{\text{expected}}\]

Outcome	Observed	Expected	\(\frac{ (\text{observed}- \text{expected})^2}{\text{expected}}\)
\(1\)	\(50,611\)	\(50,000\)	\(7.46642\)
\(2\)	\(49,523\)	\(50,000\)	\(4.55058\)
\(3\)	\(49,812\)	\(50,000\)	\(0.70688\)
\(4\)	\(49,924\)	\(50,000\)	\(0.11552\)
\(5\)	\(49,672\)	\(50,000\)	\(2.15168\)
\(6\)	\(50,458\)	\(50,000\)	\(4.19528\)
Total:	\(300,000\)	\(300,000\)	\(19.18636\)
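
The last column of the table can be reproduced in a few lines of R. This is a minimal sketch; the variable names `observed`, `expected`, `contributions`, and `chi_sq` are chosen here for illustration and do not appear elsewhere in these notes.

```r
# Observed counts from the table above; a fair die predicts 300000 / 6 per face.
observed <- c(50611, 49523, 49812, 49924, 49672, 50458)
expected <- rep(sum(observed) / length(observed), length(observed))

# Per-face contributions (observed - expected)^2 / expected, then their sum.
contributions <- (observed - expected)^2 / expected
chi_sq <- sum(contributions)

round(contributions, 5)  # one value per face, matching the last table column
chi_sq                   # the chi-square statistic
```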

# our chi-square statistic is 19.18636
# we place the statistic on the chi-square distribution with the correct degrees of freedom
# the p-value is the area to the right of the chi-square statistic
# for a goodness-of-fit test the degrees of freedom is the number of bins minus 1 (in this case 6 - 1 = 5)

1-pchisq(19.186,5)

## [1] 0.001774658

# we obtain a p-value of about 0.00177
# this is extremely rare: if the die were fair, only about 0.177% of experiments with 300,000 rolls would give a chi-square statistic at least this large
# this data provides statistically significant evidence that the die does not follow a uniform distribution.
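
The same goodness-of-fit test can also be run with R's built-in `chisq.test()`, which computes the statistic, the degrees of freedom, and the p-value in one call. The sketch below assumes the observed counts are typed in by hand; the names `observed` and `fit` are illustrative.

```r
# Observed counts for faces 1 through 6.
observed <- c(50611, 49523, 49812, 49924, 49672, 50458)

# Goodness-of-fit test against equal probabilities for the six faces.
fit <- chisq.test(observed, p = rep(1 / 6, 6))

fit$statistic  # chi-square statistic
fit$parameter  # degrees of freedom
fit$p.value    # area to the right of the statistic

# Equivalent upper-tail area computed directly from the statistic.
pchisq(unname(fit$statistic), df = 5, lower.tail = FALSE)
```

Using `lower.tail = FALSE` gives the same area as `1 - pchisq(...)` without the subtraction.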

How to tell if variables are independent

Example: Students in grades 4-6 were asked whether good grades, athletic ability, or popularity was most important to them. A table separating the students by grade and by choice of most important factor is shown below. Do these data provide evidence to suggest that goals vary by grade?

	Grades	Popular	Sports	Total
\(4^{th}\)	\(63\)	\(31\)	\(25\)	\(119\)
\(5^{th}\)	\(88\)	\(55\)	\(33\)	\(176\)
\(6^{th}\)	\(96\)	\(55\)	\(32\)	\(183\)
Totals:	\(247\)	\(141\)	\(90\)	\(478\)

\(H_0:\) The variables are independent. Grade level does not affect what a student finds most important.

\(H_A:\) The variables are dependent. Grade level does affect what a student finds most important.

To test this we again want to compute a chi-square statistic \(\chi^2\):

\[ \displaystyle \chi^2= \sum \frac{ (\text{observed}- \text{expected})^2}{\text{expected}}\]

where the expected count for each cell of the table is

\[ \displaystyle \text{expected} = \frac{ \text{row total} \cdot \text{column total}}{\text{table total}}\]
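
As a worked instance of the row-total-times-column-total formula, take the cell for \(4^{th}\) graders who chose grades (row total \(119\), column total \(247\), table total \(478\)):

\[ \displaystyle \text{expected} = \frac{119 \cdot 247}{478} = \frac{29393}{478} \approx 61.49\]

The observed count in that cell is \(63\), so it contributes roughly \(\frac{(63 - 61.49)^2}{61.49} \approx 0.037\) to \(\chi^2\).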

# expected count for each cell: (row total * column total) / table total
grades_4 <- (119*247)/478
pop_4 <- (119*141)/478
sport_4 <- (119*90)/478
grades_5 <- (176*247)/478
pop_5 <- (176*141)/478
sport_5 <- (176*90)/478
grades_6 <- (183*247)/478
pop_6 <- (183*141)/478
sport_6 <- (183*90)/478

chisq <- (63-grades_4)^2/grades_4 + (31-pop_4)^2/pop_4 + (25-sport_4)^2/sport_4 +
  (88-grades_5)^2/grades_5 + (55-pop_5)^2/pop_5 + (33-sport_5)^2/sport_5 +
  (96-grades_6)^2/grades_6 + (55-pop_6)^2/pop_6 + (32-sport_6)^2/sport_6

# for a test of independence the degrees of freedom is (# of rows - 1) * (# of columns - 1), in this case (3 - 1) * (3 - 1) = 4
1-pchisq(chisq,4)

## [1] 0.8593185

# we have a p-value of 85.9%; at any reasonable significance level we fail to reject the null hypothesis.
# Our data does not provide statistically significant evidence that grade level affects what students find important.

# typing out every expected count is tedious, we need another way! chisq.test() does the whole calculation for us
df <- data.frame(grades = c(63,88,96), popular = c(31,55,55), sports = c(25,33,32))
chisq.test(df)

## 
##  Pearson's Chi-squared test
## 
## data:  df
## X-squared = 1.3121, df = 4, p-value = 0.8593
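
The hand calculation and `chisq.test()` can be checked against each other. The sketch below rebuilds the expected counts with `outer()` and compares them with the `expected` component that `chisq.test()` returns. The matrix layout and the names `observed`, `expected`, `chi_sq`, `p_value`, and `result` are illustrative choices, not part of the original notes.

```r
# Observed counts: rows are grade levels, columns are the stated goals.
observed <- matrix(
  c(63, 31, 25,
    88, 55, 33,
    96, 55, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(grade = c("4th", "5th", "6th"),
                  goal  = c("Grades", "Popular", "Sports"))
)

# Expected count for each cell: row total * column total / table total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic and upper-tail p-value with (3 - 1) * (3 - 1) = 4 df.
chi_sq <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 4, lower.tail = FALSE)

# chisq.test() returns the expected counts it used; they should agree.
result <- chisq.test(observed)
all.equal(unname(expected), unname(result$expected))
```

Inspecting `result$expected` is also a quick way to confirm that every expected count is reasonably large before trusting the chi-square approximation.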

Class Activity

A college is interested in the relationship between anxiety levels and pressure to succeed in school. A random sample of \(400\) students responded in the following way:

Pressure to Succeed	High Anxiety	Medium-High Anxiety	Medium Anxiety	Medium-Low Anxiety	Low Anxiety	Total
High	\(35\)	\(42\)	\(53\)	\(15\)	\(10\)	\(155\)
Medium	\(18\)	\(48\)	\(63\)	\(33\)	\(31\)	\(193\)
Low	\(4\)	\(5\)	\(11\)	\(15\)	\(17\)	\(52\)
Total	\(57\)	\(95\)	\(127\)	\(63\)	\(58\)	\(400\)

Is there sufficient evidence to conclude that a student’s anxiety level depends on the pressure to succeed?

\(H_0:\)

\(H_A:\)

Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of \(60\) managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as:

	Monday	Tuesday	Wednesday	Thursday	Friday
Number of absences	15	12	9	9	15

Suppose there are \(60\) absences in an average week. Test the goodness of fit of this data to a uniform distribution at a significance level of \(0.05\).