Suppose you are trying to predict the proportion of Davidson students that plan to vote for Kamala Harris.
There is some true proportion, which is not known to us; for the sake of this example, let’s say it is \(50 \%\).
If we sample \(400\) students, what is the expected value for the proportion of students voting for Harris?
There are two things we can say with some confidence: the value we expect to see, and roughly how far off we are likely to be.
How much we are likely to be wrong by is called the standard error.
The standard error measures how precisely we are able to predict some value.
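For the 400-student sample, both quantities can be worked out directly. The sketch below assumes the true proportion really is 0.5 (the value chosen for this example) and uses the standard formula for the standard error of a sample proportion, sqrt(p(1 - p)/n). The variable names are only for illustration.

```r
p <- 0.5   # assumed true proportion (unknown in practice)
n <- 400   # sample size

expected_prop <- p                  # expected value of the sample proportion
se_prop <- sqrt(p * (1 - p) / n)    # standard error of the sample proportion

expected_prop  # 0.5
se_prop        # 0.025
```

Under these assumptions, a sample of 400 students is typically off from the true 50% by about 2.5 percentage points.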
Here is what happens when we increase our sample size:
# simulate 100 voters, each equally likely to pick either candidate
candidates <- c("H", "T")
table(sample(candidates, 100, replace = TRUE))
##
## H T
## 45 55
# how many Harris voters do I expect? 50
# the observed error in the number of Harris voters is 5 (45 vs. 50)
# the expected proportion of Harris voters is .5
# the observed error in the proportion is .05 (.45 vs. .5)
table(sample(candidates, 10000, replace = TRUE))
##
## H T
## 5011 4989
# how many Harris voters do I expect? 5000
# the observed error in the number of Harris voters is 11 (5011 vs. 5000)
# the expected proportion of Harris voters is .5
# the observed error in the proportion is .0011 (.5011 vs. .5)
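As the sample grows, the error in the count tends to grow while the error in the proportion shrinks. That pattern can be checked against theory. A minimal sketch, assuming each vote is an independent draw with p = 0.5: the standard error of the count is sqrt(n p (1 - p)), and the standard error of the proportion is sqrt(p (1 - p) / n). The helper name `se_table` is made up for this illustration.

```r
# standard errors of the count and of the proportion for given sample sizes
se_table <- function(n, p) {
  data.frame(
    n        = n,
    se_count = sqrt(n * p * (1 - p)),   # typical error in the number of Harris voters
    se_prop  = sqrt(p * (1 - p) / n)    # typical error in the proportion
  )
}

se_table(c(100, 10000), p = 0.5)
# n = 100:   se_count = 5,  se_prop = 0.05
# n = 10000: se_count = 50, se_prop = 0.005
```

Multiplying the sample size by 100 makes the count error 10 times larger but the proportion error 10 times smaller.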
die <- 1:6
# hist(die, breaks = 20, prob = TRUE)
# let's roll 10 dice and compute the mean, repeated 10000 times
# the expected value of a single die roll is 3.5
experiment <- data.frame(results = replicate(10000, mean(sample(die, 10, replace = TRUE))))
# look at the distribution of our results
hist(experiment$results, breaks = 25)
The central limit theorem says that as we increase our sample size, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the population distribution (provided the population has finite variance).
We need to figure out which normal distribution: where is it centered, and how much spread does it have?
The spread of the sampling distribution is described by the standard error.
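For the dice experiment, both questions have concrete answers. A sketch, assuming fair dice and independent rolls: the sampling distribution is centered at the population mean (3.5), and its standard error is the population standard deviation divided by sqrt(n). The seed is arbitrary and is set only so the simulation is repeatable.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

die <- 1:6
n <- 10                                     # dice rolled per experiment

pop_mean <- mean(die)                       # 3.5
pop_sd   <- sqrt(mean((die - pop_mean)^2))  # population sd, about 1.71
se_mean  <- pop_sd / sqrt(n)                # theoretical standard error, about 0.54

# simulate 10000 experiments and compare the results with theory
results <- replicate(10000, mean(sample(die, n, replace = TRUE)))
c(center = mean(results), spread = sd(results))

# overlay the matching normal curve on the histogram of simulated means
hist(results, breaks = 25, freq = FALSE)
curve(dnorm(x, mean = pop_mean, sd = se_mean), add = TRUE)
```

The simulated center and spread should land close to 3.5 and 0.54, and the histogram should roughly follow the overlaid normal curve.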