Today’s Agenda

  • Law of Large Numbers
  • Central Limit Theorem

Law of Large Numbers

  • Suppose you are trying to predict the proportion of Davidson students that plan to vote for Kamala Harris.

  • There is some true proportion (this is not known to us; for the sake of this example, let’s say it is \(50\%\))

  • If we sample \(400\) students, what is the expected value for the proportion of students voting for Harris?

  • There are two things we can say with some confidence

    • the sample proportion is likely to be “close” to the true proportion
    • the sample proportion is almost definitely not equal to the true proportion.
  • how much we are likely to be wrong by is called the standard error

  • the standard error is a measurement of how well we are able to predict some value

    • a large standard error means we can’t predict the value well
    • a small standard error means we can
  • when we increase our sample size:

    • predicting the number of people that will vote for Kamala Harris gets harder (the standard error of the count grows)
    • predicting the proportion of people that will vote for Kamala Harris gets easier (the standard error of the proportion shrinks)
candidates <- c("H","T")
table(sample(candidates, 100,replace=TRUE))
## 
##  H  T 
## 45 55
# how many Harris voters do I expect? 50
# the error in the number of Harris voters is 5 (50 expected, 45 observed)
# the expected proportion of Harris voters is .5
# the error in the proportion is .05


table(sample(candidates, 10000,replace=TRUE))
## 
##    H    T 
## 5011 4989
# how many Harris voters do I expect? 5000
# the error in the number of Harris voters is 11 (5000 expected, 5011 observed)
# the expected proportion of Harris voters is .5
# the error in the proportion is .0011
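
A single simulated sample only shows one possible error, so it can help to compute the typical error directly. The sketch below is not from the original notes: it assumes the true proportion is 0.5 (as in the example) and uses the standard binomial formulas, sqrt(n p (1 - p)) for the count and sqrt(p (1 - p) / n) for the proportion. The names `p`, `n`, `se_count`, and `se_prop` are chosen here for illustration.

```r
# Assumption for illustration: the true proportion is 0.5, as in the example above
p <- 0.5
n <- c(100, 400, 10000)

# standard error of the NUMBER of Harris voters in a sample of size n
se_count <- sqrt(n * p * (1 - p))

# standard error of the PROPORTION of Harris voters in a sample of size n
se_prop <- sqrt(p * (1 - p) / n)

# side-by-side comparison: the count error grows with n, the proportion error shrinks
data.frame(n, se_count, se_prop)
```

Under these assumptions the standard error of the count is 5, 10, and 50 for the three sample sizes, while the standard error of the proportion is .05, .025, and .005. The count gets harder to pin down as the sample grows, and the proportion gets easier.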

Central Limit Theorem (CLT)

  • Now let’s investigate what happens to the sampling distribution of the sample mean as the sample size increases.
die <- c(1:6)
# hist(die,breaks=20,prob=TRUE)

# let's roll 10 dice and compute the mean, repeated 10000 times
# expected value of a single die roll is 3.5
experiment<-data.frame(results = replicate(10000,mean(sample(die,10,replace=TRUE))))
# look at the distribution of our results
hist(experiment$results,breaks=25)

  • the central limit theorem says that as we increase our sample size, the sampling distribution of the sample mean approaches a normal distribution (regardless of the shape of the population distribution, as long as it has finite variance)

  • we need to figure out which normal distribution: where is it centered, and how much spread does it have?

  • the spread of the sampling distribution is described by the standard error
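
One way to explore the center and spread is to repeat the dice experiment for several sample sizes. The sketch below is an illustration, not part of the original notes. It assumes the usual result that the sample mean is centered at the population mean with standard error sigma / sqrt(n), and checks that against simulation. The sample sizes, the seed, and the names `sim_means` and `comparison` are arbitrary choices made here.

```r
set.seed(1)  # arbitrary seed so the run is reproducible
die <- 1:6

mu <- mean(die)                      # population mean of one roll: 3.5
sigma <- sqrt(mean((die - mu)^2))    # population standard deviation of one roll

sample_sizes <- c(10, 40, 160)

# for each sample size, simulate 10000 sample means
sim_means <- lapply(sample_sizes, function(n) {
  replicate(10000, mean(sample(die, n, replace = TRUE)))
})

# compare the simulated center and spread with mu and sigma / sqrt(n)
comparison <- data.frame(
  n = sample_sizes,
  center = sapply(sim_means, mean),
  simulated_se = sapply(sim_means, sd),
  theoretical_se = sigma / sqrt(sample_sizes)
)
comparison
```

If the assumption holds, the `center` column should stay near 3.5 for every sample size, and `simulated_se` should track `theoretical_se`, roughly halving each time the sample size is multiplied by four.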