Today’s Agenda

  • Normal Distribution
  • Difference between vectors and data frames
  • Sampling Distribution

Normal Distribution

  • Recall that a normally distributed random variable is denoted by

\[X \sim N(\mu,\sigma)\]

where \(\mu\) and \(\sigma\) are the mean and standard deviation of \(X\) respectively.

  • \(\mu\) is the center and shifts the distribution
  • \(\sigma\) is the spread and it will flatten or compress the curve

Image source: OpenIntro
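
The effect of the two parameters can also be checked numerically with dnorm. The sketch below uses made-up values of the mean and standard deviation (illustrations only, not course data): changing the mean moves the peak, while increasing the standard deviation lowers and widens the curve.

```r
# grid of x values for evaluating the normal density (range chosen for illustration)
x <- seq(-10, 10, by = 0.01)

# same sd, different mean: the peak moves from about 0 to about 3
peak_location_mu0 <- x[which.max(dnorm(x, mean = 0, sd = 1))]
peak_location_mu3 <- x[which.max(dnorm(x, mean = 3, sd = 1))]

# same mean, different sd: the peak gets lower as the curve spreads out
peak_height_sd1 <- dnorm(0, mean = 0, sd = 1)
peak_height_sd2 <- dnorm(0, mean = 0, sd = 2)

c(peak_location_mu0, peak_location_mu3)
c(peak_height_sd1, peak_height_sd2)
```

Doubling the standard deviation halves the height of the peak, because the total area under the curve must stay equal to 1.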

  • for any normally distributed random variable, we can standardize a score by computing its z-score, \(z = (x - \mu)/\sigma\)
  • standardizing scores (finding the z-score) translates any normal curve into the standard normal curve, \(N(0, 1)\)
  • Example from Review 1
    • Leo completed a race in 4900 seconds in a group with mean 4300 and sd 500
    • Mary completed a race in 5500 seconds in a group with mean 5200 and sd 600
# computing the proportion of racers faster than Leo (finishing times below 4900 seconds)
pnorm(4900,4300,500)
## [1] 0.8849303
# or, we can standardize the score first: z = (4900 - 4300)/500 = 1.2
pnorm(1.2)
## [1] 0.8849303
# Standardize your scores first and then use pnorm
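
The same steps work for Mary. A short sketch using the numbers from the example above (the variable names are only for illustration):

```r
# z-score: how many standard deviations a time is from its group mean
z_leo  <- (4900 - 4300) / 500
z_mary <- (5500 - 5200) / 600

z_leo
z_mary

# proportion of each group that finished faster (lower time) than Leo / Mary
pnorm(z_leo)
pnorm(z_mary)

# standardizing first gives the same answer as passing the mean and sd to pnorm
all.equal(pnorm(z_mary), pnorm(5500, 5200, 600))
```

Lower times are better in a race, so Mary's smaller z-score means she ranked better within her group than Leo did within his.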

Finding the area between two values

# to find the area between two values, subtract the area below the smaller value from the area below the larger value

pnorm(1)-pnorm(-1)
## [1] 0.6826895
# this gives the area within 1 standard deviation of the mean (between z = -1 and z = 1) on the standard normal curve

# to find the area to the right of a value do 1-pnorm(value)
1-pnorm(1)
## [1] 0.1586553
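
The same subtraction works on the original scale, without standardizing. A minimal sketch using Leo's group (mean 4300, sd 500); the bounds 4000 and 4900 are chosen only for illustration:

```r
# proportion of racers in Leo's group finishing between 4000 and 4900 seconds
between <- pnorm(4900, 4300, 500) - pnorm(4000, 4300, 500)
between

# the same area using z-scores: (4900 - 4300)/500 = 1.2 and (4000 - 4300)/500 = -0.6
between_z <- pnorm(1.2) - pnorm(-0.6)
all.equal(between, between_z)

# area to the right of a value, computed two equivalent ways
right_a <- 1 - pnorm(1)
right_b <- pnorm(1, lower.tail = FALSE)
all.equal(right_a, right_b)
```

The lower.tail = FALSE argument asks pnorm for the upper tail directly, which avoids the subtraction from 1.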

Difference between vectors and data frames

# a vector is an ordered sequence of values of the same type, like the body mass of the penguins
library(palmerpenguins)
penguins <- penguins
mass <- penguins$body_mass_g
# a data frame is a table: rows are observations and columns are variables (each column is a vector)
# we can turn vectors into data frames
df_mass <- data.frame(weights = mass)
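
A few base R functions show which kind of object you have. The sketch below uses a short made-up vector (hypothetical values, not taken from palmerpenguins) so it runs without any packages:

```r
# made-up body masses in grams (hypothetical values)
mass_example <- c(3750, 3800, 3250, 3450)

# turn the vector into a one-column data frame
df_example <- data.frame(weights = mass_example)

is.vector(mass_example)      # TRUE
is.data.frame(df_example)    # TRUE

length(mass_example)         # 4: number of values in the vector
dim(df_example)              # 4 rows, 1 column
length(df_example)           # 1: a data frame's length is its number of columns

# pulling a column out of a data frame gives back a vector
identical(df_example$weights, mass_example)
```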

Why use vectors vs. data frames

# some R functions are written for vectors and others for data frames; given the wrong type, they may run without an error but return something unexpected

# say we want to take 5 penguin body masses at random from our data set
sample(penguins$body_mass_g,5)
## [1] 4200 3450 3500 6000 4675
# maybe we can sample from the data frame instead of the vector
sample(penguins,5)
## # A tibble: 344 × 5
##    island    body_mass_g  year species bill_length_mm
##    <fct>           <int> <int> <fct>            <dbl>
##  1 Torgersen        3750  2007 Adelie            39.1
##  2 Torgersen        3800  2007 Adelie            39.5
##  3 Torgersen        3250  2007 Adelie            40.3
##  4 Torgersen          NA  2007 Adelie            NA  
##  5 Torgersen        3450  2007 Adelie            36.7
##  6 Torgersen        3650  2007 Adelie            39.3
##  7 Torgersen        3625  2007 Adelie            38.9
##  8 Torgersen        4675  2007 Adelie            39.2
##  9 Torgersen        3475  2007 Adelie            34.1
## 10 Torgersen        4250  2007 Adelie            42  
## # … with 334 more rows
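# this does not give us 5 penguins: sample() treats a data frame as a list of columns,
# so it returns 5 randomly chosen columns with all 344 rows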

sample_n(penguins,5)
## # A tibble: 5 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Chinst… Dream            46.9          16.6              192        2700 fema…
## 2 Chinst… Dream            43.5          18.1              202        3400 fema…
## 3 Chinst… Dream            42.4          17.3              181        3600 fema…
## 4 Gentoo  Biscoe           47.6          14.5              215        5400 male 
## 5 Adelie  Dream            40.6          17.2              187        3475 male 
## # … with 1 more variable: year <int>
# sample_n() lives in the dplyr package, which is part of the tidyverse
# you might need to run library(tidyverse) (or library(dplyr)) to use it
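
Rows can also be sampled with base R alone, by drawing row numbers and then indexing. A minimal sketch on a small made-up data frame (the name birds and its values are hypothetical, standing in for penguins):

```r
# small made-up data frame standing in for penguins (hypothetical values)
birds <- data.frame(
  id = 1:8,
  body_mass_g = c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475)
)

# draw 5 row numbers at random, then keep those rows and all columns
rows <- sample(nrow(birds), 5)
birds_sample <- birds[rows, ]

dim(birds_sample)   # 5 rows, 2 columns
```

Recent versions of dplyr also provide slice_sample(penguins, n = 5), which the dplyr documentation lists as the replacement for sample_n().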

Sampling Distribution

# Make a uniform distribution
# a school has 5000 students; suppose every GPA between 0 and 4 is equally likely to occur
students <- data.frame(id = seq(1, 5000, 1), GPA = runif(5000, 0, 4))
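
One possible next step, sketched under assumptions: draw many samples from this population and look at how the sample means are distributed. The sample size (30), the number of repetitions (1000), and the seed are assumed values for this sketch, not part of the original example.

```r
set.seed(1)  # seed added only so the sketch is reproducible

# population: 5000 students with uniformly distributed GPAs, as above
students <- data.frame(id = seq(1, 5000, 1), GPA = runif(5000, 0, 4))

# take 1000 samples of 30 students each and record each sample's mean GPA
sample_means <- replicate(1000, mean(sample(students$GPA, 30)))

mean(students$GPA)   # population mean, close to 2
mean(sample_means)   # center of the sampling distribution, also close to 2
sd(students$GPA)     # spread of individual GPAs
sd(sample_means)     # spread of the sample means, much smaller
```

Running hist(sample_means) afterwards plots the sampling distribution, which looks roughly bell-shaped even though the population itself is uniform.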