One continuous probability distribution that comes up often is the normal distribution.
Many natural processes approximately follow a normal distribution.
Consider running a simulation where you take a sample of a fixed size over and over.
The collection of all your sampling errors will be approximately normally distributed.
The standard normal distribution is centered at 0 with a standard deviation of 1.
Changing the population mean changes where the normal distribution is centered,
and changing the standard deviation changes how spread out it is.
Image source: OpenIntro
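As a rough sketch of how the mean and standard deviation shape the curve, the code below evaluates dnorm() on a grid for three normal distributions. The specific values (a mean of 3, a standard deviation of 3) and the variable names are chosen only for illustration.

```r
# Grid of x values on which to evaluate each density
x <- seq(-10, 10, by = 0.01)

# Standard normal, a curve with a shifted mean, and a curve with a larger sd
# (the mean of 3 and the sd of 3 are illustrative choices, not from the text)
standard <- dnorm(x, mean = 0, sd = 1)
shifted  <- dnorm(x, mean = 3, sd = 1)
wider    <- dnorm(x, mean = 0, sd = 3)

# Location of each peak: changing the mean moves the center
peak_standard <- x[which.max(standard)]
peak_shifted  <- x[which.max(shifted)]
c(peak_standard, peak_shifted)

# Height of each peak: a larger sd gives a lower, more spread out curve
height_standard <- max(standard)
height_wider    <- max(wider)
c(height_standard, height_wider)
```

The peak moves with the mean, while the larger standard deviation lowers the peak because the same total area of 1 is spread over a wider range.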
Probability distributions are used to find the probability of certain events occurring. To do this, we need to compute areas under the curve.
pnorm(1)
## [1] 0.8413447
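The value returned by pnorm(1) is the area under the standard normal curve to the left of 1. One way to check this, as a minimal sketch, is to integrate the density dnorm() numerically with integrate() and compare the two results. The variable names are only illustrative.

```r
# Area under the standard normal density to the left of 1,
# computed by numerical integration of dnorm()
area_by_integration <- integrate(dnorm, lower = -Inf, upper = 1)$value

# The same area from the cumulative distribution function
area_by_pnorm <- pnorm(1)

# The two should agree up to numerical integration error
all.equal(area_by_integration, area_by_pnorm, tolerance = 1e-3)
```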
The probability that a randomly selected observation is greater than 1 is:
1-pnorm(1)
## [1] 0.1586553
By default, the pnorm() function gives the area to the left of the input value on a standard normal curve. As an alternative to 1-pnorm(), you can set lower.tail=FALSE to get the area to the right:
pnorm(1,lower.tail=FALSE)
## [1] 0.1586553
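# standardize: z = (value - mean) / standard deviation
z <- (23-19)/4
# area to the left of z on the standard normal curve
pnorm(z)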
## [1] 0.8413447
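The z-score calculation above can also be handed to pnorm() directly through its mean and sd arguments. The sketch below assumes, reading off the formula, that 23 is the observed value, 19 the mean, and 4 the standard deviation. The variable names are illustrative.

```r
# Standardize by hand, then use the standard normal curve
z <- (23 - 19) / 4
by_z_score <- pnorm(z)

# Let pnorm() do the standardizing via its mean and sd arguments
direct <- pnorm(23, mean = 19, sd = 4)

# Both routes give the same area to the left of 23
all.equal(by_z_score, direct)
```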
Use R to find the area between -1 and 1 on the standard normal curve
pnorm(1)-pnorm(-1)
## [1] 0.6826895
# About 68%
Area between -2 and 2
pnorm(2)-pnorm(-2)
## [1] 0.9544997
# About 95%
Area between -3 and 3
pnorm(3)-pnorm(-3)
## [1] 0.9973002
# About 99.7%
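The three calculations above follow the same pattern, so they can be collected into one small function. This is only a sketch: area_within() is a hypothetical helper name, not something from base R.

```r
# Hypothetical helper: area within k standard deviations of the mean
# on the standard normal curve
area_within <- function(k) pnorm(k) - pnorm(-k)

# pnorm() is vectorized, so k = 1, 2, 3 can be handled in one call
areas <- area_within(1:3)
round(areas, 4)
## [1] 0.6827 0.9545 0.9973
```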
library(ggplot2)
# fake data with some very tall people
fake_data <- data.frame("height" = 57+50*(rbeta(10000,3,10)))
ggplot(fake_data, aes(x=height))+geom_histogram(binwidth = 1)
# suppose this is the population of interest
mean_height <- mean(fake_data$height)
sd_height <- sd(fake_data$height)
# average height of 68.5
# and standard deviation of 5.6
samp_1 <- sample(fake_data$height,50,replace=TRUE)
mean(samp_1)
## [1] 69.96653
samp_2 <- sample(fake_data$height,50,replace=TRUE)
mean(samp_2)
## [1] 68.02546
sampling_dist <- data.frame(samp_mean = replicate(10000, mean(sample(fake_data$height,50,replace=TRUE))))
ggplot(sampling_dist, aes(x=samp_mean))+geom_histogram(binwidth = .2)
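The histogram of sample means looks roughly bell-shaped even though the population itself is skewed. As a hedged numerical check, the sketch below rebuilds a similar population and compares the center and spread of the sample means with the population mean and with sd / sqrt(n), the usual standard error of a mean. The seed, the number of replicates, and the variable names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Rebuild a skewed population like the one above
population <- 57 + 50 * rbeta(10000, 3, 10)
n <- 50

# Means of many samples of size n, drawn with replacement
samp_means <- replicate(5000, mean(sample(population, n, replace = TRUE)))

# Center: the sample means should cluster around the population mean
c(mean(samp_means), mean(population))

# Spread: compare with sd / sqrt(n), the usual standard error of a mean
c(sd(samp_means), sd(population) / sqrt(n))
```

If the two numbers in each pair are close, the sampling distribution is centered at the population mean and is much narrower than the population itself.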