Today’s Agenda

  • Fitting a distribution
  • Chi-square test of independence

Fitting a distribution

  • Often we have data that we expect will fit a certain distribution.
  • We can use a statistical test (a goodness-of-fit test) to determine whether the data really do fit that distribution.

Example: On a fair six-sided die, each side is expected to be equally likely (uniform distribution). However, many dice are constructed so that the numbers are marked by hollowed-out pips. This should, in theory, make the sides with more pips lighter than the sides with fewer pips. A person conjectures that the five and six should show up more often when rolling a six-sided die, since those are the lightest sides and the heavier sides should end up on the bottom. Over many, many weeks, they roll a die \(300,000\) times and record their results:

Outcome Observed Expected
\(1\) \(50,611\) \(50,000\)
\(2\) \(49,523\) \(50,000\)
\(3\) \(49,812\) \(50,000\)
\(4\) \(49,924\) \(50,000\)
\(5\) \(49,672\) \(50,000\)
\(6\) \(50,458\) \(50,000\)
Total: \(300,000\) \(300,000\)

\(H_0\): the outcomes follow a uniform distribution, so each face of the die is equally likely

\(H_A\): the distribution is not uniform.

\[ \displaystyle \chi^2 = \sum \frac{ (\text{observed}- \text{expected})^2}{\text{expected}}\]

Outcome Observed Expected \(\frac{ (\text{observed}- \text{expected})^2}{\text{expected}}\)
\(1\) \(50,611\) \(50,000\) \(7.46642\)
\(2\) \(49,523\) \(50,000\) \(4.55058\)
\(3\) \(49,812\) \(50,000\) \(0.70688\)
\(4\) \(49,924\) \(50,000\) \(0.11552\)
\(5\) \(49,672\) \(50,000\) \(2.15168\)
\(6\) \(50,458\) \(50,000\) \(4.19528\)
Total: \(300,000\) \(300,000\) \(19.18636\)
# p-value for a goodness-of-fit test: 1 - pchisq(X^2, number of bins - 1)
1-pchisq(19.18636,5)
## [1] 0.001774384
# The p-value of 0.00177 is much smaller than the 0.05 significance level, so the data provide statistically significant evidence that the die does not follow a uniform distribution.
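The table and p-value above can be reproduced directly from the observed counts. The following is a minimal sketch rather than the only way to do it: the variable names (`observed`, `expected`, `contributions`, `x2`, `fit`) are chosen here for illustration, and the built-in `chisq.test()` is used only as a cross-check on the hand calculation.

```r
# Observed counts from the die-rolling table above (faces 1 through 6)
observed <- c(50611, 49523, 49812, 49924, 49672, 50458)

# Under H0 every face has probability 1/6, so each expected count is n / 6
expected <- sum(observed) * rep(1 / 6, 6)

# Per-face contributions (observed - expected)^2 / expected, and their sum
contributions <- (observed - expected)^2 / expected
x2 <- sum(contributions)

# Goodness-of-fit degrees of freedom: number of bins - 1
df <- length(observed) - 1

# Upper-tail probability; equivalent to 1 - pchisq(x2, df)
p_manual <- pchisq(x2, df, lower.tail = FALSE)

# Cross-check with R's built-in goodness-of-fit test
fit <- chisq.test(observed, p = rep(1 / 6, 6))

print(round(contributions, 5))
print(c(manual = x2, builtin = unname(fit$statistic)))
print(c(manual = p_manual, builtin = fit$p.value))
```

The hand calculation and `chisq.test()` should agree on both the statistic and the p-value, which is a quick way to catch arithmetic slips in the table.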

Properties of a chi-square distribution

  • unimodal
  • right skewed
  • non-negative
  • a different distribution for each value of the degrees of freedom
  • as the degrees of freedom increase it gets closer to a normal distribution (but it is always slightly right-skewed)
  • for a goodness of fit test, use df=number of bins - 1
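The approach toward a normal distribution can be checked numerically. The sketch below relies on standard facts that are not stated in these notes: a chi-square distribution with \(df\) degrees of freedom has mean \(df\), variance \(2 \cdot df\), and skewness \(\sqrt{8/df}\). The chosen degrees of freedom and the column names are arbitrary illustrations.

```r
# Degrees of freedom to compare (arbitrary illustrative values)
df <- c(1, 5, 20, 100, 1000)

# Theoretical skewness of a chi-square distribution: sqrt(8 / df)
skewness <- sqrt(8 / df)

# 95th percentile of the chi-square distribution versus a normal
# distribution with the same mean (df) and variance (2 * df)
q_chisq <- qchisq(0.95, df)
q_normal <- qnorm(0.95, mean = df, sd = sqrt(2 * df))

comparison <- data.frame(df, skewness, q_chisq, q_normal,
                         ratio = q_chisq / q_normal)
print(comparison)
```

The skewness shrinks as the degrees of freedom grow but never reaches zero, and the ratio of the two percentiles moves toward 1, which matches the last two shape properties in the list.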

How to tell if variables are independent

Example: Students in grades 4-6 were asked whether good grades, athletic ability, or popularity was most important to them. A table separating the students by grade and by choice of most important factor is shown below. Do these data provide evidence to suggest that goals vary by grade?

Grade Grades Popular Sports Total
\(4^{th}\) \(63\) \(31\) \(25\) \(119\)
\(5^{th}\) \(88\) \(55\) \(33\) \(176\)
\(6^{th}\) \(96\) \(55\) \(32\) \(183\)
Totals: \(247\) \(141\) \(90\) \(478\)

For tests of independence, your hypotheses always look like:

\(H_0:\) grade level and most important thing are independent of each other

\(H_A:\) what you find most important does depend on your grade level


To test this we again want to compute a chi-square statistic \(\chi^2\):

\[ \displaystyle \chi^2= \sum \frac{ (\text{observed}- \text{expected})^2}{\text{expected}}\]

\[ \displaystyle \text{expected} = \frac{ (\text{row total} \cdot \text{column total})}{\text{table total}}\]
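The expected-count formula can be applied to every cell of the grade-level table at once. This is a small sketch under the assumption that the three data columns are good grades, popularity, and sports, as in the example text; the object names `observed` and `expected` are illustrative.

```r
# Observed counts (rows: grade level, columns: most important goal)
observed <- matrix(
  c(63, 31, 25,
    88, 55, 33,
    96, 55, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(grade = c("4th", "5th", "6th"),
                  goal = c("Grades", "Popular", "Sports"))
)

# expected = (row total * column total) / table total, for every cell:
# outer() forms all row-total by column-total products in one step
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(round(expected, 2))
```

A useful sanity check is that the expected counts have the same row and column totals as the observed table.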

Calculating \(\chi^2\)
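One way to carry out the calculation for the grade-level example is sketched below. It assumes the standard degrees of freedom for a test of independence, \((\text{rows} - 1)(\text{columns} - 1)\), which these notes do not state explicitly; the object names are illustrative, and `chisq.test()` is included only as a cross-check.

```r
# Observed counts (rows: grade level, columns: most important goal)
observed <- matrix(
  c(63, 31, 25,
    88, 55, 33,
    96, 55, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(grade = c("4th", "5th", "6th"),
                  goal = c("Grades", "Popular", "Sports"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
x2 <- sum((observed - expected)^2 / expected)

# Assumed degrees of freedom for independence: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value
p_value <- pchisq(x2, df, lower.tail = FALSE)

# Cross-check with the built-in test of independence
fit <- chisq.test(observed)

print(c(x2 = x2, df = df, p_value = p_value))
print(fit)
```

With these counts the statistic comes out near 1.31 on 4 degrees of freedom, giving a p-value of roughly 0.86. That is far above 0.05, so these data do not provide evidence that goals vary by grade.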


Class Activity

  1. A college is interested in the relationship between anxiety levels and pressure to succeed in school. A random sample of \(400\) students responded in the following way:
Pressure to Succeed High Anxiety Medium-High Anxiety Medium Anxiety Medium-Low Anxiety Low Anxiety Total
High \(35\) \(42\) \(53\) \(15\) \(10\) \(155\)
Medium \(18\) \(48\) \(63\) \(33\) \(31\) \(193\)
Low \(4\) \(5\) \(11\) \(15\) \(17\) \(52\)
Total \(57\) \(95\) \(127\) \(63\) \(58\) \(400\)

Is there sufficient evidence to conclude that a student’s anxiety level depends on the pressure to succeed?

\(H_0:\)

\(H_A:\)

  2. Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of \(60\) managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as:
Monday Tuesday Wednesday Thursday Friday
Number of absences 15 12 9 9 15

Suppose there are \(60\) absences in an average week. Test the goodness of fit of this data to a uniform distribution with a significance level of \(.05\).