The Normal Distribution

  • One continuous probability distribution that comes up often is the normal distribution.

    • it is continuous
    • it is symmetric
    • it is unimodal (it has one mode: a single value around which observations are most common)
    • it is bell shaped
  • many natural processes approximately follow a normal distribution

    • the diameter of tree trunks for a particular tree species
    • height
    • measurement or production errors in a factory
  • Consider running a simulation where you take a sample of a fixed size over and over.

  • the collection of all your sampling errors (how far each sample's estimate lands from the true value) will be approximately normally distributed

  • the standard normal distribution is centered at \(0\) with standard deviation \(1\).

  • so changing the population mean will change where the normal distribution is centered

  • and a different standard deviation affects how spread out it is.

Image source: OpenIntro
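To see how the mean and standard deviation change the curve, here is a minimal sketch using base R's dnorm(), which gives the height of the normal curve at each point. The specific means and standard deviations (0, 3, 1, 2) and the variable names are arbitrary choices for illustration.

```r
# grid of x values to evaluate the curves on
x <- seq(-6, 9, by = 0.01)

# heights of three normal curves (the parameter values are illustrative)
standard <- dnorm(x, mean = 0, sd = 1)  # standard normal
shifted  <- dnorm(x, mean = 3, sd = 1)  # same spread, centered at 3
wider    <- dnorm(x, mean = 0, sd = 2)  # same center, more spread out

# each curve peaks at its mean
x[which.max(standard)]
x[which.max(shifted)]

# a larger standard deviation gives a lower, wider curve
max(wider) < max(standard)

# draw all three curves on one plot
plot(x, standard, type = "l", ylab = "density")
lines(x, shifted, lty = 2)
lines(x, wider, lty = 3)
```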

Finding areas

Probability distributions are used to find the probability of certain events occurring. To do this, we need to compute areas under the curve.

  • e.g., find the probability that a randomly selected observation from a standard normal distribution is less than 1.
pnorm(1)
## [1] 0.8413447

The probability that a randomly selected observation is greater than 1 is:

1-pnorm(1)
## [1] 0.1586553
  • by default, the pnorm() function gives the area to the left of the input value on a standard normal curve
  • to find the area to the right, either compute 1-pnorm() or set lower.tail=FALSE
pnorm(1,lower.tail=FALSE)
## [1] 0.1586553
  • to find areas for non-standard normal curves, first find the z-score and then use the standard normal curve.
# z-score of 23 on a normal curve with mean 19 and standard deviation 4
z <- (23-19)/4
pnorm(z)
## [1] 0.8413447
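
As a cross-check on the z-score approach, pnorm() also accepts mean and sd arguments, so the same area can be computed without standardizing by hand. This is a small sketch; the names x, mu, and sigma are introduced here for illustration, with values read off the z-score calculation above.

```r
# value of interest, with the mean and standard deviation from the z-score above
x <- 23
mu <- 19
sigma <- 4

# route 1: standardize, then use the standard normal curve
z <- (x - mu) / sigma
area_from_z <- pnorm(z)

# route 2: give pnorm() the mean and standard deviation directly
area_direct <- pnorm(x, mean = mu, sd = sigma)

# the two areas agree
all.equal(area_from_z, area_direct)
```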

Use R to find the area between -1 and 1 on the standard normal curve

pnorm(1)-pnorm(-1)
## [1] 0.6826895
# About 68%

Area between -2 and 2

pnorm(2)-pnorm(-2)
## [1] 0.9544997
# About 95%

Area between -3 and 3

pnorm(3)-pnorm(-3)
## [1] 0.9973002
# About 99.7%
  • On the standard normal curve, about 68% of observations are within 1 standard deviation of 0.
  • about 95% are within 2 standard deviations
  • about 99.7% are within 3 standard deviations
  • This is sometimes referred to as the 68-95-99.7 rule
  • people use these guidelines as rough approximations even with non-normal distributions.
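
The rule can be checked by simulation. The sketch below is one way to do it, not a standard recipe: prop_within() is a hypothetical helper written for this example, the seed and sample sizes are arbitrary, and the exponential distribution is used only as a convenient example of strongly skewed data.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# proportion of values within k standard deviations of the mean
# (prop_within is a hypothetical helper written for this example)
prop_within <- function(values, k) {
  mean(abs(values - mean(values)) <= k * sd(values))
}

normal_draws <- rnorm(100000, mean = 19, sd = 4)  # a non-standard normal
skewed_draws <- rexp(100000)                      # a strongly right-skewed distribution

# normal data: close to 0.68, 0.95, 0.997
normal_props <- sapply(1:3, function(k) prop_within(normal_draws, k))
normal_props

# skewed data: the rule is only a rough guide here
skewed_props <- sapply(1:3, function(k) prop_within(skewed_draws, k))
skewed_props
```

For an exponential distribution with rate 1 the theoretical proportions are about 86%, 95%, and 98%, so the 1-standard-deviation figure in particular is well off from 68%.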

Sampling distribution

library(ggplot2)
# fake data with some very tall people
fake_data <- data.frame(height = 57 + 50 * rbeta(10000, 3, 10))
ggplot(fake_data, aes(x = height)) + geom_histogram(binwidth = 1)

# suppose this is the population of interest
mean_height <- mean(fake_data$height)  # average height of about 68.5
sd_height <- sd(fake_data$height)      # standard deviation of about 5.6
# draw two samples of size 50 and compare their means
samp_1 <- sample(fake_data$height,50,replace=TRUE)
mean(samp_1)
## [1] 69.96653
samp_2 <- sample(fake_data$height,50,replace=TRUE)
mean(samp_2)
## [1] 68.02546

Sampling Distribution

# draw 10000 samples of size 50 and record the mean of each
sampling_dist <- data.frame(samp_mean = replicate(10000, mean(sample(fake_data$height, 50, replace = TRUE))))
ggplot(sampling_dist, aes(x=samp_mean))+geom_histogram(binwidth = .2)

  • we started with a skewed population; it has a mean and a standard deviation
  • we collected repeated samples of size 50 and looked at the mean of each sample; these are called sample means
  • we looked at the distribution of the sample means; this is called the sampling distribution
  • the central limit theorem says the sampling distribution of the mean will be approximately normal when the sample size is large enough, even though the population is skewed
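
The summary above can be checked numerically. The following is a self-contained sketch under stated assumptions: it rebuilds a skewed population with the same rbeta() recipe used earlier, and the seed and the names population, n, samp_means, and centered are choices made for this example. If the sampling distribution is roughly normal, about 68% and 95% of the sample means should fall within 1 and 2 standard deviations of its center.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

# rebuild a skewed population like the one used above
population <- 57 + 50 * rbeta(10000, 3, 10)

# means of 10000 samples of size 50
n <- 50
samp_means <- replicate(10000, mean(sample(population, n, replace = TRUE)))

# the sampling distribution is centered near the population mean
c(mean(samp_means), mean(population))

# and it is much narrower than the population
c(sd(samp_means), sd(population))

# share of sample means within 1 and 2 standard deviations of the center
centered <- abs(samp_means - mean(samp_means)) / sd(samp_means)
c(mean(centered <= 1), mean(centered <= 2))
```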