Day25_Notes : Difference in means and proportions

Today’s Agenda

Two Means
Two Proportions

Difference of means

\[\bar{X_1} - \bar{X_2} \sim N \left( \mu_1-\mu_2, \sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}} \right) \]

When the population standard deviations are unknown for both groups, we use the sample standard deviations in their place and the \(t\)-distribution with degrees of freedom equal to the smaller of \(n_1-1\) and \(n_2-1\).
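Written out, the test statistic that the examples in these notes compute can be summarized as follows. Here \(s_1\) and \(s_2\) denote the sample standard deviations and \(\delta_0\) the difference assumed under the null hypothesis (0 in both examples); these symbols are notation chosen for this sketch, not taken from the original notes.

\[T = \frac{(\bar{X_1} - \bar{X_2}) - \delta_0}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}}, \qquad df = \min(n_1 - 1, \; n_2 - 1)\]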

A scientific experiment measured the change in blood pressure due to a medication in a control group and a treatment group. In the measurements, negative values indicate a decrease in blood pressure. The control group had an average change of \(-1.4\) and the treatment group had an average change of \(-4\). With \(9\) people in each group and sample standard deviations of \(5.2\) and \(2.4\) in the control and treatment groups respectively, do these data provide statistically significant evidence that the medication is effective? Use a significance level of \(.05\).

\(H_0: \mu_T = \mu_C\)

\(H_A: \mu_T < \mu_C\)

true_diff <- 0
observed_diff <- (-4)- (-1.4)

se <- sqrt(5.2^2/9 + 2.4^2/9)

# find the standardized score: t-score

t <- (observed_diff - true_diff)/se
# left-tailed p-value, with df = min(9 - 1, 9 - 1) = 8
pt(t,8)

## [1] 0.1051635

# our p-value is .105, so our observed difference is not very rare if the null hypothesis is true
# since our p-value of .105 is not smaller than the significance level of .05, we fail to reject the null hypothesis
# there is not statistically significant evidence that the medication works.
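The same steps can be wrapped in a small helper so they are easy to reuse with other summary statistics. This is only a minimal sketch: the function name `two_mean_test` and its arguments are hypothetical, the null difference is assumed to be 0, and it uses the conservative degrees of freedom from these notes (the smaller of \(n_1-1\) and \(n_2-1\)) rather than the Welch approximation that R's `t.test` uses by default.

```r
# Hypothetical helper: two-sample t-test from summary statistics,
# using the conservative df = min(n1 - 1, n2 - 1) from these notes.
two_mean_test <- function(mean1, mean2, sd1, sd2, n1, n2,
                          alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  t_score <- (mean1 - mean2) / se      # null difference assumed to be 0
  deg_free <- min(n1 - 1, n2 - 1)
  p_value <- switch(alternative,
    less      = pt(t_score, deg_free),
    greater   = pt(t_score, deg_free, lower.tail = FALSE),
    two.sided = 2 * pt(-abs(t_score), deg_free)
  )
  list(t = t_score, df = deg_free, p_value = p_value)
}

# Blood pressure example: treatment vs. control, H_A: mu_T < mu_C
bp <- two_mean_test(-4, -1.4, 2.4, 5.2, 9, 9, alternative = "less")
bp$p_value   # about 0.105, matching the calculation above
```

The `alternative` argument mirrors the direction of \(H_A\): `"less"` for a left-tailed test, `"greater"` for a right-tailed test, and `"two.sided"` when \(H_A\) uses \(\neq\).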

It is thought that middle school age boys and girls spend, on average, an equal amount of time watching TV. A study is done with \(25\) randomly selected children: \(16\) boys and \(9\) girls. The \(16\) boys watched TV for an average of \(3.22\) hours per day with a sample standard deviation of \(1\). The \(9\) girls watched an average of \(2\) hours of TV per day with a sample standard deviation of \(.866\). Does the study suggest a statistically significant difference in the two population means at a significance level of \(.05\)?

\(H_0: \mu_B = \mu_G\)

\(H_A: \mu_B \neq \mu_G\)

true_diff <- 0 
observed_diff <- 3.22 - 2
se <- sqrt(1^2/16 + .866^2/9)

# find t-score
t <- (observed_diff - true_diff)/se

# two-sided p-value, with df = min(16 - 1, 9 - 1) = 8
(1-pt(t,8))*2

## [1] 0.01271201

# our p-value is .0127
# since our p-value of .0127 is smaller than the significance level of .05, we reject the null hypothesis
# there is statistically significant evidence that the average amount of time the groups spend watching tv is different.
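A 95% confidence interval for \(\mu_B - \mu_G\) gives a complementary view of the same result. The sketch below assumes the same standard error and the same conservative 8 degrees of freedom as the test above; the variable names are illustrative.

```r
# 95% confidence interval for mu_B - mu_G (TV example)
observed_diff <- 3.22 - 2
se <- sqrt(1^2/16 + .866^2/9)
deg_free <- min(16 - 1, 9 - 1)      # conservative df = 8

t_star <- qt(0.975, deg_free)       # critical value for 95% confidence
ci <- observed_diff + c(-1, 1) * t_star * se
ci   # roughly (0.34, 2.10)
```

The interval does not contain 0, which agrees with rejecting the null hypothesis at the .05 level.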

Difference of proportions

\[\hat{p_1} - \hat{p_2} \sim N \left( p_1- p_2, \sqrt{ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right) \]
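When the null hypothesis is \(p_1 = p_2\), the two unknown proportions in the standard error are replaced by a single pooled estimate. Written out, with \(x_1\) and \(x_2\) denoting the number of successes in each sample (notation assumed here for illustration):

\[\hat{p}_{pool} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad SE = \sqrt{\frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_2}}, \qquad Z = \frac{\hat{p_1} - \hat{p_2}}{SE}\]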

A survey asked \(827\) randomly sampled registered voters in California “Do you support? Or do you oppose? Drilling for oil and natural gas off the Coast of California? Or do you not know enough to say?” Below is the distribution of responses, separated by whether or not the respondent graduated from college. Use a hypothesis test to determine whether there is a true difference between the proportion of college grads and the proportion of non-college grads who support drilling.

	College Grad	Not College Grad
Support	\(154\)	\(132\)
Oppose	\(180\)	\(126\)
Don’t know	\(104\)	\(131\)
Total	\(438\)	\(389\)

\(H_0: p_c = p_n\)

\(H_A: p_c \neq p_n\)

phat_c <- 154/438
phat_n <- 132/389
# since we are assuming p_c=p_n we need to use a pooled proportion
pooled <- (154+132)/(438+389)

# check the success-failure condition using the pooled proportion
# the expected number of successes in the college grad group
pooled*438

## [1] 151.4728

# the expected number of failures in the college grad group
(1-pooled)*438

## [1] 286.5272

# the expected number of successes in the non-college grad group
pooled*389

## [1] 134.5272

# the expected number of failures in the non-college grad group
(1-pooled)*389

## [1] 254.4728

# all four counts are bigger than 10, so the success-failure condition is satisfied

observed_diff <- phat_c-phat_n
true_diff <- 0
se <- sqrt(pooled*(1-pooled)/438 + pooled*(1-pooled)/389)

# find the z-score
z <- (observed_diff - true_diff)/se
# two-sided p-value
(1-pnorm(z))*2

## [1] 0.7112531

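# our p-value is large, so the observed difference is not rare if the null hypothesis is true
# since the p-value of .71 is bigger than the significance level of .05, we fail to reject the null hypothesis
# there is not statistically significant evidence of a difference in support for drilling between college grads and non-college grads.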
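As a cross-check, base R's `prop.test` should give the same p-value when its continuity correction is turned off, since its chi-square statistic is then the square of the pooled z-score. This is a sketch for verification only; the object name `check` is illustrative.

```r
# Cross-check of the hand calculation with prop.test (stats package).
# correct = FALSE turns off the continuity correction so the result
# matches the pooled two-proportion z-test.
check <- prop.test(x = c(154, 132), n = c(438, 389), correct = FALSE)
check$p.value            # about 0.711, same as the z-test above
sqrt(check$statistic)    # absolute value of the z-score
```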

According to a report on sleep deprivation by the Centers for Disease Control and Prevention, the proportion of California residents who reported insufficient rest or sleep during each of the preceding \(30\) days is \(8.0 \%\), while this proportion is \(8.8 \%\) for Oregon residents. These data are based on simple random samples of \(11,545\) California and \(4,691\) Oregon residents. Calculate a \(95 \%\) confidence interval for the difference between the proportions of Californians and Oregonians who are sleep deprived and interpret it in the context of the data.
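One way to set up this interval is sketched below. Because a confidence interval does not assume the two proportions are equal, the sketch uses the unpooled standard error, plugging each sample proportion into the formula for the distribution of \(\hat{p_1} - \hat{p_2}\), and checks the success-failure condition with each sample's own proportion. The variable names are illustrative.

```r
phat_ca <- 0.080; n_ca <- 11545
phat_or <- 0.088; n_or <- 4691

# success-failure check with each sample's own proportion (all at least 10)
c(phat_ca*n_ca, (1-phat_ca)*n_ca, phat_or*n_or, (1-phat_or)*n_or)

observed_diff <- phat_ca - phat_or
# unpooled standard error: no assumption that the proportions are equal
se <- sqrt(phat_ca*(1-phat_ca)/n_ca + phat_or*(1-phat_or)/n_or)

z_star <- qnorm(0.975)              # critical value for 95% confidence
ci <- observed_diff + c(-1, 1) * z_star * se
ci   # roughly (-0.0175, 0.0015)
```

Under these assumptions, we are 95% confident that the proportion of Californians who are sleep deprived is between about 1.75 percentage points lower and 0.15 percentage points higher than the proportion of Oregonians. Since the interval contains 0, the data do not show a clear difference between the two states.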