Today’s Agenda

  • Recap
  • Variance
  • Sample vs Population
  • Standard Deviation
  • Standard Scores

Recap

  • two main characteristics of data that we care about:
    • where is the center of the data? (mean, median, mode)
      • mean is often advantageous when there aren’t many outliers
      • median is often better for dealing with outliers
      • mode is good for categorical data ( eg. if you are analyzing majors at Davidson, there is no way to take an “average” or “median” major, instead, you might care about the most common major: the mode )
    • how spread out is the data?
      • in the lab you learned about mean absolute deviation and median absolute deviation
      • these introduced you to the idea that, if we want to measure how spread out the data is, we can find each deviation from the center, take its absolute value, and then average those absolute deviations
      • these are not the most common ways to measure spread
      • the most common way is to square each deviation and then take the average of the squared deviations.
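
As a minimal sketch of the two lab measures, assuming a made-up vector `hours` (not from the lab) and assuming deviations are measured from the mean and from the median respectively:

```r
# hypothetical data, for illustration only
hours <- c(2, 4, 4, 5, 7, 9, 11)

# mean absolute deviation: average distance from the mean
mean_abs_dev <- mean(abs(hours - mean(hours)))

# median absolute deviation: median distance from the median
median_abs_dev <- median(abs(hours - median(hours)))

mean_abs_dev
median_abs_dev

# base R's mad() rescales by 1.4826 by default; constant = 1 turns that off
mad(hours, constant = 1)
```

In both cases the absolute value is applied to each deviation before averaging; averaging the raw deviations from the mean would always give zero.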

Variance

\[\text{variance} = \sigma^2 = \displaystyle \frac{\sum_{i=1}^N (\mu-x_i)^2}{N}\]

# create toy data to experiment with
toy_data <- c(6, 5, 6, 10, 9, 8, 6)
toy_data
## [1]  6  5  6 10  9  8  6

Let’s compute the variance.

# find the mean of the toy_data
mean(toy_data)
## [1] 7.142857
# find all the deviations to get an idea of how far data is from the mean
mean(toy_data)-toy_data
## [1]  1.1428571  2.1428571  1.1428571 -2.8571429 -1.8571429 -0.8571429  1.1428571
# square all the deviations
# why? Because we want every deviation to be positive,
# and squaring makes points close to the mean count less and points far from the mean count more.

(mean(toy_data) - toy_data)^2
## [1] 1.3061224 4.5918367 1.3061224 8.1632653 3.4489796 0.7346939 1.3061224
# now find the average of the squared deviations
# variance could more accurately be called "mean squared deviation" 

mean((mean(toy_data) - toy_data)^2)
## [1] 2.979592

Of course, R has a built in function for this:

# compute the variance with the built-in var() function

var(toy_data)
## [1] 3.47619
# but it doesn't give the same answer as our calculation (2.979592)!

Population vs. Sample

  • We usually want to know a characteristic of the WHOLE population.
  • For example, we want to know how effective a new medication is for everyone.
  • Or, we want to know if people are more in favor of Kamala Harris or Donald Trump in all the US.
  • But, we usually only have access to a sample of the population
  • We want to know \(\mu\), the population average, but we only have access to \(\bar{x}\), the sample average.
  • So, we try to use characteristics of the sample to predict characteristics of the population as closely as possible.
    • This is the entire goal of inferential statistics (use information about a sample to infer things about the population)
  • So, let’s return to our discussion of variance (spread of the data)
  • We usually want to know the spread of the data in the population, \(\sigma^2\), but we only have access to the spread of the data in the sample, \(s^2\).
  • In a sample, the deviations are measured from the sample mean \(\bar{x}\), which sits in the middle of the sample itself, so the sample data tends to look less spread out than the population.
  • The bigger the group is, the more variety you should expect.
  • So, when we find the variance of a sample by dividing by \(n\), we should expect our calculation to be smaller, on average, than the variance in the population.
  • If we know that our calculation tends to give an estimate that is too small, we should correct it.
  • Dividing by \(n-1\) instead of \(n\) makes our estimate a little bigger, which corrects for this on average.
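
The claim that dividing by \(n\) comes out too small on average can be checked with a small simulation. This is only a sketch: the population (normal with variance 4), the sample size, the number of repetitions, and the seed are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

true_var <- 4   # assumed population variance (sd = 2)
n <- 5          # size of each sample
reps <- 10000   # number of samples to draw

# variance of one simulated sample, dividing by n and by n - 1
one_sample <- function() {
  x <- rnorm(n, mean = 0, sd = sqrt(true_var))
  ss <- sum((mean(x) - x)^2)  # sum of squared deviations
  c(divide_by_n = ss / n, divide_by_n_minus_1 = ss / (n - 1))
}

estimates <- replicate(reps, one_sample())

# average estimate from each formula, to compare with true_var
rowMeans(estimates)
```

Under these assumptions, dividing by \(n\) should average close to \(\frac{n-1}{n}\sigma^2 = 3.2\), while dividing by \(n-1\) should average close to the true value of 4.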

Variance in a sample

\[\text{variance} = s^2 = \displaystyle \frac{\sum_{i=1}^n (\bar{x}-x_i)^2}{n-1}\]

Variance in a population

\[\text{variance} = \sigma^2 = \displaystyle \frac{\sum_{i=1}^N (\mu-x_i)^2}{N}\]
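
Both formulas can be written out by hand and compared on the toy data. The helper names `pop_var` and `sample_var` are hypothetical, introduced only for this sketch; base R provides `var()` for the sample version only.

```r
toy_data <- c(6, 5, 6, 10, 9, 8, 6)

# hypothetical helper: variance when the data is the whole population
pop_var <- function(x) {
  sum((mean(x) - x)^2) / length(x)
}

# hypothetical helper: variance when the data is a sample
sample_var <- function(x) {
  sum((mean(x) - x)^2) / (length(x) - 1)
}

pop_var(toy_data)     # the "mean squared deviation", 2.979592
sample_var(toy_data)  # the same value as var(toy_data), 3.47619

# the two differ only by the factor n / (n - 1)
n <- length(toy_data)
all.equal(pop_var(toy_data) * n / (n - 1), var(toy_data))
```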

  • R assumes we are working with a sample (because we almost always are)
  • For larger sample sizes and populations, it makes very little difference.
  • If you are asked to find the variance of some data you should assume that your data is a sample from a population, unless specifically told otherwise
  • so the var() function is okay to use unless you are specifically told otherwise
# if the toy data represents how many hours I slept each night of a week, and I want to know how spread out my overall sleep is then I should use
var(toy_data)
## [1] 3.47619
# on average, my sleep has a squared distance of about 3.476 "hours squared" from the mean, which is hard to interpret

Standard Deviation

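  • the units of variance are awkward: they are the square of the data's units (hours squared in the sleep example)
  • we don’t want all of our units to be squared.
  • So, we take the square root of the variance to get the standard deviation, which is back in the original units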

\[\text{standard deviation} = \sqrt{\text{variance}} = \sqrt{\sigma^2} = \sigma = \sqrt{\displaystyle \frac{\sum_{i=1}^N (\mu-x_i)^2}{N}}\]

  • The same story goes for standard deviation: the formula above is for a population.
  • For a sample, we divide by \(n-1\) instead of \(N\).

Standard deviation in a sample

\[\text{standard deviation} = \sqrt{\text{variance}} = \sqrt{s^2} = s = \sqrt{\displaystyle \frac{\sum_{i=1}^n (\bar{x}-x_i)^2}{n-1}}\]

  • sample standard deviation is what R computes
sd(toy_data)
## [1] 1.864454
# my sleep typically differs from the mean by about 1.86 hours
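
As a quick check of the relationship between the two functions, the sketch below takes the square root of `var()` and compares it with `sd()`. The helper `pop_sd` is hypothetical and is only needed when the data is known to be a whole population.

```r
toy_data <- c(6, 5, 6, 10, 9, 8, 6)

# sample standard deviation is the square root of the sample variance
sqrt(var(toy_data))
all.equal(sqrt(var(toy_data)), sd(toy_data))

# hypothetical helper for the population formula (divide by N)
pop_sd <- function(x) {
  sqrt(mean((mean(x) - x)^2))
}
pop_sd(toy_data)
```

The population version is always a little smaller than `sd()` on the same data, because it divides by \(N\) rather than \(n-1\).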

68-95-99.7

  • a good rule of thumb is that about 68% of all data is within one standard deviation of the mean, about 95% is within two, and about 99.7% is within three
# about 68% of the time my sleep is between
mean(toy_data) - sd(toy_data)
## [1] 5.278403
mean(toy_data) + sd(toy_data)
## [1] 9.007312
# 5.28 and 9.01 hours

# about 95% of the time my sleep is between
mean(toy_data) - 2*sd(toy_data)
## [1] 3.413948
mean(toy_data) + 2*sd(toy_data)
## [1] 10.87177
# 3.41 and 10.87 hours

# about 99.7% of the time, my sleep is between
mean(toy_data) - 3*sd(toy_data)
## [1] 1.549494
mean(toy_data) + 3*sd(toy_data)
## [1] 12.73622
# 1.55 and 12.74 hours
  • these guidelines work best if your data is symmetric and bell-shaped (approximately normal)
  • the more skewed the data, the less accurate this estimate is.
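
One way to see how much skew matters is to count the share of values within 1, 2, and 3 standard deviations for simulated data. This is a sketch under assumptions: the helper `share_within`, the two simulated data sets, and the seed are all made up for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# hypothetical helper: share of values within k standard deviations of the mean
share_within <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

bell   <- rnorm(100000)  # simulated symmetric, bell-shaped data
skewed <- rexp(100000)   # simulated right-skewed data

# shares within 1, 2, and 3 standard deviations for each data set
sapply(1:3, function(k) share_within(bell, k))
sapply(1:3, function(k) share_within(skewed, k))

# exact shares for a perfectly normal distribution
pnorm(1:3) - pnorm(-(1:3))
```

For the bell-shaped data the shares should land near 68%, 95%, and 99.7%. For the skewed data the share within one standard deviation should come out noticeably higher (about 86% in theory for an exponential distribution), so the rule of thumb would be misleading there.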