Day27 - Two Proportions and T-distributions

\[\hat{p}_1 - \hat{p}_2 \sim N \left( p_1 - p_2, \sqrt{ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right) \]
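
The formula above can be checked with a quick simulation. This is only a sketch: the values of `p1`, `p2`, `n1`, and `n2` are hypothetical numbers picked for illustration and are not from any of the problems in these notes.

```r
# Hypothetical population proportions and sample sizes (illustration only)
set.seed(27)
p1 <- 0.30; n1 <- 400
p2 <- 0.25; n2 <- 500

# Standard error of the difference, from the formula above
se_formula <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Simulate many pairs of samples and record the difference in sample proportions
reps <- 20000
phat1_sim <- rbinom(reps, size = n1, prob = p1) / n1
phat2_sim <- rbinom(reps, size = n2, prob = p2) / n2
diff_sim <- phat1_sim - phat2_sim

mean(diff_sim)   # should be close to p1 - p2
sd(diff_sim)     # should be close to se_formula
se_formula
```

If the normal approximation is reasonable, `hist(diff_sim)` should also look roughly bell-shaped and centered near `p1 - p2`.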

A survey asked \(827\) randomly sampled registered voters in California “Do you support? Or do you oppose? Drilling for oil and natural gas off the Coast of California? Or do you not know enough to say?” Below is the distribution of responses, separated based on whether or not the respondent graduated from college. Test, at a \(5\%\) significance level, whether the data provide statistically significant evidence that the proportions of college grads and non-college grads who don’t have an opinion on the matter are different.

	College Grad	Not College Grad
Support	\(154\)	\(132\)
Oppose	\(180\)	\(126\)
Don’t know	\(104\)	\(131\)
Total	\(438\)	\(389\)

\(H_0: p_1 - p_2 = 0\) or \(p_1 = p_2\)

\(H_A: p_1 - p_2 \neq 0\) or \(p_1 \neq p_2\)

Since we are assuming under \(H_0\) that the two proportions are equal, we need a single “best” estimate of that common proportion when computing the standard error. This estimate is called the pooled proportion.

\[\hat{p}_{\text{pooled}} = \frac{\text{number of successes in first group} + \text{number of successes in second group}}{\text{total number of observations in both groups}}\]

pooled <- (104+131)/(438+389)

se <- sqrt((pooled*(1-pooled)/438) + (pooled*(1-pooled)/389))

sample_diff <- (104/438)- (131/389)

z <- (sample_diff - 0)/se

# two-sided p-value; -abs(z) makes this work whether z is negative or positive
2*pnorm(-abs(z))

## [1] 0.001573334

# Our p-value is about 0.0016, which is below the 0.05 significance level, so we reject H0
# Our data provide statistically significant evidence that there is a difference in the two proportions
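
As a cross-check on the hand calculation, base R's `prop.test()` can run the same two-proportion test. This is a sketch rather than part of the original solution; the names `no_opinion`, `totals`, and `res` are made up here. Setting `correct = FALSE` turns off the continuity correction so that the result lines up with the z-test above.

```r
# "Don't know" counts and group totals from the table above
no_opinion <- c(104, 131)
totals <- c(438, 389)

# Two-sided test of equal proportions, without continuity correction
res <- prop.test(x = no_opinion, n = totals, correct = FALSE)

res$p.value                   # two-sided p-value
sqrt(unname(res$statistic))   # |z|: the reported statistic is chi-squared = z^2
```

`prop.test()` reports a chi-squared statistic rather than a z-score, which is why the square root is taken to compare against `z`.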

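When computing a confidence interval, you no longer use a pooled proportion, because the interval does not assume that the two proportions are equal. Each sample proportion goes into the standard error separately.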

Now calculate a \(95 \%\) confidence interval for the difference between the proportion of college grads and non-college grads that don’t have an opinion on the matter.

phat1 <- 104/438

phat2 <- 131/389

diff <- phat1-phat2

# we add and subtract the margin of error (critical z-score times the standard error) from diff to get an interval that we are 95% confident contains the true difference

z <- 1.96
se <- sqrt(phat1*(1-phat1)/438 + phat2*(1-phat2)/389)

diff + z*se

## [1] -0.03772409

diff - z*se

## [1] -0.1609119

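# We are 95% confident that the true difference in the proportions (college grads minus non-college grads) is between -0.16 and -0.038.
# In other words, the proportion of non-college grads that don't have an opinion on the matter is between 3.8 and 16 percentage points higher than the proportion of college grads.
# The interval does not contain 0, which agrees with the result of the hypothesis test.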
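
The critical value 1.96 is the 97.5th percentile of the standard normal distribution, which `qnorm()` returns directly. The sketch below wraps the interval calculation in a small helper; `two_prop_ci` is a hypothetical function written for these notes, not something built into R.

```r
# Critical value for a 95% interval: 2.5% in each tail of N(0, 1)
z_star <- qnorm(0.975)   # roughly 1.96

# Hypothetical helper: unpooled confidence interval for p1 - p2
two_prop_ci <- function(x1, n1, x2, n2, conf = 0.95) {
  phat1 <- x1 / n1
  phat2 <- x2 / n2
  crit <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)
  (phat1 - phat2) + c(-1, 1) * crit * se   # lower bound, upper bound
}

ci <- two_prop_ci(104, 438, 131, 389)
ci
```

Changing `conf` (for example to `0.90`) gives an interval at a different confidence level without looking up a new critical value by hand.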

According to a report on sleep deprivation by the Centers for Disease Control and Prevention, the proportion of California residents who reported insufficient rest or sleep during each of the preceding \(30\) days is \(8.0 \%\), while this proportion is \(8.8 \%\) for Oregon residents. These data are based on simple random samples of \(11,545\) California and \(4,691\) Oregon residents. Determine, at a \(10 \%\) significance level, whether the data provide evidence that the proportions are different.

Calculate a \(95 \%\) confidence interval for the difference between the proportions of Californians and Oregonians who are sleep deprived and interpret it in context of the data.

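
One way to set up the sleep deprivation problem follows the same steps as the survey example. This is a sketch under an assumption: the problem gives only the sample proportions, so the success counts are reconstructed as proportion times sample size. All variable names here are made up for illustration.

```r
# Reported sample proportions and sample sizes from the problem statement
phat_ca <- 0.080; n_ca <- 11545
phat_or <- 0.088; n_or <- 4691

# Hypothesis test of H0: p_ca = p_or, using a pooled proportion
# (assumption: counts are approximated by phat * n, since raw counts are not given)
pooled_sleep <- (phat_ca * n_ca + phat_or * n_or) / (n_ca + n_or)
se_pooled <- sqrt(pooled_sleep * (1 - pooled_sleep) * (1 / n_ca + 1 / n_or))
z_sleep <- (phat_ca - phat_or) / se_pooled
p_value_sleep <- 2 * pnorm(-abs(z_sleep))
p_value_sleep < 0.10   # TRUE means reject H0 at the 10% level

# 95% confidence interval, using the unpooled standard error
se_unpooled <- sqrt(phat_ca * (1 - phat_ca) / n_ca + phat_or * (1 - phat_or) / n_or)
ci_sleep <- (phat_ca - phat_or) + c(-1, 1) * qnorm(0.975) * se_unpooled
ci_sleep
```

The test rejects \(H_0\) at the \(10 \%\) level when `p_value_sleep` is below 0.10. The interval is read the same way as in the survey example: if it contains 0, a difference of zero is plausible at the \(95 \%\) confidence level.
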
Suppose a baker claims that his cookie diameter is more than \(10\) cm, on average. Several of his customers do not believe him. To persuade his customers that he is right, the baker decides to do a hypothesis test. He bakes \(10\) cookies. The mean diameter of the sample is \(12\) cm with a standard deviation of \(0.5\) cm, and the distribution of diameters is normal. Perform a hypothesis test with a \(5 \%\) significance level.

Since we only know the sample standard deviation and not the population standard deviation, we have to use a t-test, here with \(n - 1 = 9\) degrees of freedom.

\(H_0: \mu \leq 10\)

\(H_A: \mu > 10\)

t <- (12-10)/(0.5/sqrt(10))

# The critical t-score for a one-sided test at the 5% significance level with 9 degrees of freedom is about 1.83.
# Since it is smaller than the t-score of our observation (about 12.65), our data provide statistically significant evidence that the true mean is bigger than 10.
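
The critical t-score mentioned above is not computed in the notes. A minimal sketch of how it could be obtained with `qt()`, along with the matching one-sided p-value from `pt()`, is shown below; the names `t_stat`, `t_crit`, and `p_value` are introduced here for illustration.

```r
n <- 10
t_stat <- (12 - 10) / (0.5 / sqrt(n))   # same test statistic as above

# Critical value for an upper-tail test at the 5% level, df = n - 1
t_crit <- qt(0.95, df = n - 1)

# One-sided p-value: chance of a t-score at least this large if H0 is true
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

t_stat > t_crit   # TRUE means reject H0
p_value < 0.05    # the same decision, made with the p-value
```

Both comparisons always lead to the same decision; the critical-value version matches how a t-table is used by hand.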

Suppose you take a sample of 25 observations and get a sample mean of 20 with a sample standard deviation of 6. Test the hypotheses below at a 10% significance level.

\(H_0: \mu \geq 21\)

\(H_A: \mu < 21\)
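
This last problem can be set up the same way as the cookie example, except that the alternative is lower-tailed. The following is a sketch, not a solution from the notes, and the variable names are made up for illustration.

```r
xbar <- 20; s <- 6; n_obs <- 25; mu0 <- 21

# Test statistic: how many standard errors the sample mean is from mu0
t_stat2 <- (xbar - mu0) / (s / sqrt(n_obs))

# Lower-tail critical value at the 10% level, df = n_obs - 1
t_crit2 <- qt(0.10, df = n_obs - 1)

# One-sided p-value: chance of a t-score at least this small if H0 is true
p_value2 <- pt(t_stat2, df = n_obs - 1)

t_stat2 < t_crit2   # TRUE means reject H0
p_value2 < 0.10     # the same decision, made with the p-value
```

For a lower-tailed test, \(H_0\) is rejected only when the t-score falls below the (negative) critical value, so a sample mean only slightly under 21 may not be enough evidence.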