One continuous probability distribution that comes up often is the normal distribution.
Many natural processes approximately follow a normal distribution.
Consider running a simulation where you take a sample of a fixed size over and over.
The collection of all your sampling errors will be approximately normally distributed.
The standard normal distribution is centered at \(0\) with standard deviation \(1\).
Changing the population mean changes where the normal distribution is centered,
and a different standard deviation changes how spread out it is.
Image source: OpenIntro
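The effect of the mean and standard deviation can also be checked numerically. The following is a minimal sketch using R's normal density function dnorm(); the mean of 10 and standard deviation of 3 are arbitrary illustrative values, not taken from the text.

```r
# Illustrative values only (assumptions, not from the text)
mu <- 10
sigma <- 3

# Evaluate the density on a fine grid around the mean
x <- seq(mu - 4 * sigma, mu + 4 * sigma, by = 0.01)
dens <- dnorm(x, mean = mu, sd = sigma)

# The curve peaks at the mean, so changing mu moves the center
peak_x <- x[which.max(dens)]
peak_x

# A larger sd gives a lower peak, because the same total area
# is spread over a wider range
peak_standard <- dnorm(0, mean = 0, sd = 1)
peak_wide <- dnorm(mu, mean = mu, sd = sigma)
peak_standard > peak_wide
```

Rerunning with a different mu or sigma should move the peak or change its height accordingly.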
Probability distributions are used to find the probability of certain events occurring. To do this, we need to compute areas under the curve.
pnorm(1)
## [1] 0.8413447
The probability that a randomly selected observation is greater than 1 is:
1-pnorm(1)
## [1] 0.1586553
By default, the pnorm() function gives the area to the left of the input value on a standard normal curve. For the area to the right, use 1-pnorm() or set lower.tail=FALSE:
pnorm(1,lower.tail=FALSE)
## [1] 0.1586553
z <- (23-19)/4
pnorm(z)
## [1] 0.8413447
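The z-score step can also be left to pnorm() itself through its mean and sd arguments. The sketch below assumes the numbers above describe an observation of 23 from a normal distribution with mean 19 and standard deviation 4; the text does not state the setup, so that reading is an assumption.

```r
# Assumed setup: value 23, mean 19, standard deviation 4
z <- (23 - 19) / 4

# Standardize by hand, then use the standard normal curve
by_hand <- pnorm(z)

# Or pass the mean and sd directly and skip the z-score
direct <- pnorm(23, mean = 19, sd = 4)

# Both routes give the same area to the left
all.equal(by_hand, direct)
```

Either form works; computing z by hand makes it explicit how many standard deviations the value lies from the mean.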
Use R to find the area between -1 and 1 on the standard normal curve
pnorm(1)-pnorm(-1)
## [1] 0.6826895
# About 68%
Area between -2 and 2
pnorm(2)-pnorm(-2)
## [1] 0.9544997
# About 95%
Area between -3 and 3
pnorm(3)-pnorm(-3)
## [1] 0.9973002
# About 99.7%
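Because pnorm() is vectorized, the three areas above can be computed in one call. This is just a compact restatement of the same calculations; the variable names k and areas are chosen here for illustration.

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Area between -k and k on the standard normal curve, for each k
areas <- pnorm(k) - pnorm(-k)
round(areas, 3)
## [1] 0.683 0.954 0.997
```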
library(ggplot2)  # needed for ggplot() and geom_histogram()
# fake data with some very tall people
fake_data <- data.frame(height = 57 + 50 * rbeta(10000, 3, 10))
ggplot(fake_data, aes(x = height)) + geom_histogram(binwidth = 1)
# suppose this is the population of interest
mean_height <- mean(fake_data$height)
sd_height <- sd(fake_data$height)
# average height of 68.5
# and standard deviation of 5.6
samp_1 <- sample(fake_data$height,50,replace=TRUE)
mean(samp_1)
## [1] 69.96653
samp_2 <- sample(fake_data$height,50,replace=TRUE)
mean(samp_2)
## [1] 68.02546
# 10,000 sample means, each from a sample of size 50 drawn with replacement
sampling_dist <- data.frame(samp_mean = replicate(10000, mean(sample(fake_data$height, 50, replace = TRUE))))
ggplot(sampling_dist, aes(x=samp_mean))+geom_histogram(binwidth = .2)
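One way to check numerically that the sample means behave like a normal distribution is to compare their center and spread with what theory suggests. The sketch below rebuilds the same kind of fake population so that it runs on its own; the seed and the names heights, samp_means, and within_1sd are assumptions added for illustration. It relies on the standard result that sample means have a spread of roughly the population standard deviation divided by the square root of the sample size.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Same construction as the fake population above
heights <- 57 + 50 * rbeta(10000, 3, 10)

# 10,000 sample means, each from a sample of size 50
samp_means <- replicate(10000, mean(sample(heights, 50, replace = TRUE)))

# Center: the sample means should average close to the population mean
mean(samp_means)
mean(heights)

# Spread: should be close to the population sd divided by sqrt(50)
sd(samp_means)
sd(heights) / sqrt(50)

# Share of sample means within 1 sd of their center;
# roughly 0.68 if the distribution is close to normal
within_1sd <- mean(abs(samp_means - mean(samp_means)) < sd(samp_means))
within_1sd
```

Even though the population itself is skewed toward tall heights, the sample means should come out close to symmetric and bell-shaped.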