Day 13: Normal Distribution and CLT

The Normal Distribution

One continuous probability distribution that comes up often is the normal distribution. It is:
- continuous
- symmetric
- unimodal (it has one mode: a single value that is more common than the others)
- bell shaped
Many natural processes approximately follow a normal distribution, for example:
- the diameter of tree trunks for a particular tree species
- height
- errors in a factory process
Consider running a simulation where you take a sample of a fixed size over and over.
The collection of the errors across all of those samples will be approximately normally distributed.
The standard normal distribution is centered at \(0\) with standard deviation \(1\).
Changing the population mean changes where the normal distribution is centered,
and changing the standard deviation changes how spread out it is.
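
The effect of the mean and the standard deviation can be checked numerically. The following is a minimal sketch: the grid of x values and the particular means and standard deviations (3 and 2) are illustrative choices, not values from these notes.

```r
# Grid of x values on which to evaluate each density
x <- seq(-10, 10, by = 0.01)

standard <- dnorm(x, mean = 0, sd = 1)  # standard normal
shifted  <- dnorm(x, mean = 3, sd = 1)  # same spread, centered at 3
wider    <- dnorm(x, mean = 0, sd = 2)  # same center, more spread out

# Location of each peak: it sits at the mean
peak_standard <- x[which.max(standard)]
peak_shifted  <- x[which.max(shifted)]
peak_wider    <- x[which.max(wider)]
c(peak_standard, peak_shifted, peak_wider)

# Height of each peak: a larger standard deviation gives a lower, flatter curve
c(max(standard), max(wider))
```

Changing the mean moves the peak without changing its height, while increasing the standard deviation lowers the peak because the same total area of 1 is spread over a wider range.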

Image source: OpenIntro

Finding areas

Probability distributions are used to find the probability of certain events occurring. To do this, we need to compute areas under the curve.

E.g., find the probability that a randomly selected observation from a standard normal distribution is less than 1.

pnorm(1)

## [1] 0.8413447

The probability that a randomly selected observation is greater than 1 is:

1-pnorm(1)

## [1] 0.1586553

By default, the pnorm() function gives the area to the left of the input value on a standard normal curve.
To find the area to the right, compute 1-pnorm(), or equivalently set lower.tail=FALSE:

pnorm(1,lower.tail=FALSE)

## [1] 0.1586553

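To find areas for non-standard normal curves, first find the z-score (the value minus the mean, divided by the standard deviation) and then use the standard normal curve. For example, for the value 23 on a normal curve with mean 19 and standard deviation 4: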

z <- (23-19)/4
pnorm(z)

## [1] 0.8413447
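
As a cross-check on the z-score approach, pnorm() also accepts mean and sd arguments, so the same area can be computed without standardizing by hand. This sketch reads the numbers in the calculation above as a value of 23, a mean of 19, and a standard deviation of 4; the variable names are illustrative.

```r
x     <- 23  # value of interest
mu    <- 19  # mean of the non-standard normal curve
sigma <- 4   # standard deviation of the non-standard normal curve

# Approach 1: standardize, then use the standard normal curve
z <- (x - mu) / sigma
area_via_z <- pnorm(z)

# Approach 2: give pnorm() the mean and standard deviation directly
area_direct <- pnorm(x, mean = mu, sd = sigma)

# The two approaches give the same area to the left of x
all.equal(area_via_z, area_direct)
```

Both approaches describe the same area; the z-score version makes explicit how many standard deviations the value lies from the mean.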

Use R to find the area between -1 and 1 on the standard normal curve

pnorm(1)-pnorm(-1)

## [1] 0.6826895

# About 68%

Area between -2 and 2

pnorm(2)-pnorm(-2)

## [1] 0.9544997

# About 95%

Area between -3 and 3

pnorm(3)-pnorm(-3)

## [1] 0.9973002

# About 99.7%

On the standard normal curve, about 68% of observations are within 1 standard deviation of 0,
about 95% are within 2 standard deviations,
and about 99.7% are within 3 standard deviations.
This is sometimes referred to as the 68-95-99.7 rule.
People use these percentages as rough guidelines even with non-normal distributions, although they are only guaranteed for normal distributions.
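
The three areas can be computed in one vectorized call, and the same percentages hold for any normal curve, not only the standard one. This is a small sketch; the mean of 19 and standard deviation of 4 are illustrative values chosen for the example.

```r
k <- 1:3  # number of standard deviations from the mean

# Standard normal: area within k standard deviations of 0
within_standard <- pnorm(k) - pnorm(-k)

# A non-standard normal curve (illustrative mean and standard deviation)
mu    <- 19
sigma <- 4
within_other <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)

round(within_standard, 4)
all.equal(within_standard, within_other)
```

The rule is stated in units of standard deviations, which is why it carries over unchanged to a normal curve with any mean and standard deviation.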

Sampling distribution

library(ggplot2)

# fake data with some very tall people
fake_data <- data.frame(height = 57 + 50 * rbeta(10000, 3, 10))
ggplot(fake_data, aes(x = height)) + geom_histogram(binwidth = 1)

# suppose this is the population of interest
mean_height <- mean(fake_data$height)
sd_height <- sd(fake_data$height)
# average height of 68.5
# and standard deviation of 5.6

samp_1 <- sample(fake_data$height,50,replace=TRUE)

mean(samp_1)

## [1] 69.96653

samp_2 <- sample(fake_data$height,50,replace=TRUE)

mean(samp_2)

## [1] 68.02546

Sampling Distribution

# draw 10,000 samples of size 50 and record the mean of each one
sampling_dist <- data.frame(samp_mean = replicate(10000, mean(sample(fake_data$height, 50, replace = TRUE))))
ggplot(sampling_dist, aes(x=samp_mean))+geom_histogram(binwidth = .2)

We started with a skewed population, which has a mean and a standard deviation.
We collected repeated samples of size 50 and computed the mean of each sample; these are called sample means.
We looked at the distribution of the sample means; this is called the sampling distribution.
The central limit theorem says that, when the sample size is large enough, the sampling distribution of the sample mean will be approximately normal, even though the population is skewed.
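
The central limit theorem also describes the center and spread of the sampling distribution: it is centered at the population mean, with a standard deviation of roughly the population standard deviation divided by the square root of the sample size. The sketch below checks this on the same kind of skewed population; the seed and the variable names are illustrative assumptions, and the simulated numbers will differ slightly with a different seed.

```r
set.seed(13)  # arbitrary seed, only so the sketch is reproducible

# Skewed population, generated the same way as the fake heights above
population <- 57 + 50 * rbeta(10000, 3, 10)
n <- 50  # sample size

# 10,000 sample means, each from a sample of size n drawn with replacement
sample_means <- replicate(10000, mean(sample(population, n, replace = TRUE)))

# Center: the mean of the sample means is close to the population mean
c(mean(population), mean(sample_means))

# Spread: the sd of the sample means is close to sd(population) / sqrt(n)
predicted_se <- sd(population) / sqrt(n)
c(predicted_se, sd(sample_means))
```

The sampling distribution is much narrower than the population itself, because averaging 50 observations smooths out most of the variation between individuals.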