Day 6 Notes: Standard Deviation

Today’s Agenda

Recap
Variance
Sample vs Population
Standard Deviation
Standard Scores

Recap

two main characteristics of data that we care about:
- where is the center of the data? (mean, median, mode)
  - mean is often advantageous when there aren’t many outliers
  - median is often better for dealing with outliers
  - mode is good for categorical data (e.g., if you are analyzing majors at Davidson, there is no way to take an “average” or “median” major; instead, you might care about the most common major: the mode)
- how spread out is the data?
  - in the lab you learned about mean absolute deviation and median absolute deviation
  - these introduced you to the idea that, to measure how spread out the data is, we can find each point's deviation from the center, take the absolute value of each deviation, and then average those absolute deviations
  - these are not the most common ways to measure spread
  - the most common way is to square each deviation and then take the average of the squared deviations.
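As a quick sketch of the two lab measures next to the squared version, the snippet below uses a small made-up vector (`recap_data` is an illustrative name, not data from the lab). One thing to watch: R's built-in `mad()` rescales its result by a constant by default, so `constant = 1` is passed here to get the plain median absolute deviation.

```r
# hypothetical data, chosen only so the arithmetic is easy to follow
recap_data <- c(2, 4, 4, 6, 9)

# deviation of each point from the mean
deviations <- recap_data - mean(recap_data)

# mean absolute deviation: take absolute values first, then average
mean_abs_dev <- mean(abs(deviations))

# median absolute deviation: median distance from the median
# (constant = 1 turns off mad()'s default rescaling)
median_abs_dev <- mad(recap_data, constant = 1)

# mean squared deviation: square first, then average
mean_sq_dev <- mean(deviations^2)

mean_abs_dev
median_abs_dev
mean_sq_dev
```

In all three cases the order matters: transform each deviation first (absolute value or square), then summarize. Averaging the raw deviations from the mean would always give zero.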

Variance

\[\text{variance} = \sigma^2 = \displaystyle \frac{\sum_{i=1}^N (\mu-x_i)^2}{N}\]

# create toy data to experiment with
toy_data <- c(6, 5, 6, 10, 9, 8, 6)
toy_data

## [1]  6  5  6 10  9  8  6

Let’s compute the variance.

# find the mean of the toy_data
mean(toy_data)

## [1] 7.142857

# find all the deviations to get an idea of how far data is from the mean
mean(toy_data)-toy_data

## [1]  1.1428571  2.1428571  1.1428571 -2.8571429 -1.8571429 -0.8571429  1.1428571

# square all the deviations
# why? Because we want every deviation to count as positive
# squaring also makes points that are close to the mean less important and points far from the mean more important.

(mean(toy_data) - toy_data)^2

## [1] 1.3061224 4.5918367 1.3061224 8.1632653 3.4489796 0.7346939 1.3061224

# now find the average of the squared deviations
# variance could more accurately be called "mean squared deviation" 

mean((mean(toy_data) - toy_data)^2)

## [1] 2.979592

Of course, R has a built in function for this:

# use R's built-in variance function on the same data

var(toy_data)

## [1] 3.47619

# but it doesn't give the same answer!

Population vs. Sample

We usually want to know a characteristic of the WHOLE population.
For example, we want to know how effective a new medication is for everyone.
Or, we want to know if people are more in favor of Kamala Harris or Donald Trump in all the US.
But we usually only have access to a sample of the population.
We want to know \(\mu\), the population average, but we only have access to \(\bar{x}\), the sample average.
So, we try to use characteristics of the sample to predict characteristics of the population as closely as possible.
- This is the entire goal of inferential statistics (use information about a sample to infer things about the population)
So, let’s return to our discussion of variance (the spread of the data).
We usually want to know the spread of the data in the population, \(\sigma^2\), but we only have access to the spread of the data in the sample, \(s^2\).
In a typical sample, the data will be less spread out than the population.
The bigger the group is, the more variety you should expect.
So, any time we find the variance for the sample by dividing by \(n\), we should expect our calculation to be smaller, on average, than the variance in the population.
If we know that our calculation is going to give an estimate that is too small, we should correct it.
Dividing by \(n-1\) instead of \(n\) makes our estimate a little bigger.
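One way to see this effect is a small simulation. The sketch below is only an illustration under stated assumptions: the "population" is a made-up normal distribution with variance 4, the sample size of 5 and the 10,000 repetitions are arbitrary choices, and the names (`true_variance`, `estimates`, `average_estimate`) are not from the notes.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# assumed population for illustration: normal with variance 4
true_variance <- 4
n <- 5

# draw many small samples; for each one, compute the variance both ways
estimates <- replicate(10000, {
  x <- rnorm(n, mean = 0, sd = sqrt(true_variance))
  sum_sq <- sum((x - mean(x))^2)
  c(divide_by_n = sum_sq / n, divide_by_n_minus_1 = sum_sq / (n - 1))
})

# average estimate across all the samples, one value per method
average_estimate <- rowMeans(estimates)
average_estimate
```

If the reasoning above holds, the `divide_by_n` average should land noticeably below 4, while the `divide_by_n_minus_1` average should land close to 4.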

Variance in a sample

\[\text{variance} = s^2 = \displaystyle \frac{\sum_{i=1}^n (\bar{x}-x_i)^2}{n-1}\]

Variance in a population

\[\text{variance} = \sigma^2 = \displaystyle \frac{\sum_{i=1}^N (\mu-x_i)^2}{N}\]
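
To see how much the denominator matters, here is the arithmetic for the seven toy values from earlier, whose mean is \(50/7 \approx 7.14\). This is a worked check: the same sum of squared deviations is divided first by 7 (treating the toy data as a whole population) and then by 6 (treating it as a sample).

\[\sum_{i=1}^{7} (\bar{x}-x_i)^2 = \frac{146}{7} \approx 20.857, \qquad \frac{20.857}{7} \approx 2.980, \qquad \frac{20.857}{7-1} \approx 3.476\]

These match the two values computed in R earlier: 2.979592 from the step-by-step calculation and 3.47619 from var().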

R assumes we are working with a sample (because we almost always are).
For larger sample sizes, it makes very little difference.
If you are asked to find the variance of some data, you should assume that your data is a sample from a population unless specifically told otherwise,
so the var() function is the right one to use by default.
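
If you ever do need the population version, one possible approach is to rescale the output of var(). The sketch below assumes a helper named `pop_var`, which is a hypothetical name and not a base R function.

```r
toy_data <- c(6, 5, 6, 10, 9, 8, 6)
n <- length(toy_data)

# hypothetical helper (not part of base R): variance that divides by n
pop_var <- function(x) mean((x - mean(x))^2)

# population-style variance, computed directly
pop_var(toy_data)

# the same number, recovered from R's sample variance
var(toy_data) * (n - 1) / n
```

Multiplying by \((n-1)/n\) undoes the \(n-1\) denominator, so both lines should print the same value.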

# if the toy data represents how many hours I slept each night of a week, and I want to know how spread out my overall sleep is then I should use
var(toy_data)

## [1] 3.47619

# My sleep is 3.476 hours squared away from the mean on average?

Standard Deviation

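The units of variance are awkward: they are the square of the original units (here, hours squared).
We don’t want all of our units to be squared.
So we take the square root of the variance, which brings us back to the original units. The result is the standard deviation.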

\[\text{standard deviation} = \sqrt{\text{variance}} = \sqrt{\sigma^2} = \sigma = \sqrt{\displaystyle \frac{\sum_{i=1}^N (\mu-x_i)^2}{N}}\]

As with variance, the formula above is for a population. For a sample, we divide by \(n-1\) instead.

Standard deviation in a sample

\[\text{standard deviation} = \sqrt{\text{variance}} = \sqrt{s^2} = s = \sqrt{\displaystyle \frac{\sum_{i=1}^n (\bar{x}-x_i)^2}{n-1}}\]

The sample standard deviation is what R's sd() function computes.

sd(toy_data)

## [1] 1.864454

# my sleep typically differs from the mean by about 1.86 hours
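
As a quick check that sd() really is the square root of var(), the sketch below computes it both ways. The names `sample_sd` and `pop_sd` are illustrative, and the population version is included only for comparison.

```r
toy_data <- c(6, 5, 6, 10, 9, 8, 6)

# sample standard deviation, built from the sample variance
sample_sd <- sqrt(var(toy_data))
sample_sd

# R's built-in version should agree
sd(toy_data)

# population-style standard deviation (divides by n), for comparison only
pop_sd <- sqrt(mean((toy_data - mean(toy_data))^2))
pop_sd
```

The population-style value comes out a little smaller, for the same reason the population-style variance does.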

68-95-99.7

a good rule of thumb is that about 68% of all data is within one standard deviation of the mean

# 68% of the time my sleep is 
mean(toy_data) - sd(toy_data)

## [1] 5.278403

mean(toy_data) + sd(toy_data)

## [1] 9.007312

# between 5.28 and 9.01 hours

# 95% of the time my sleep is between 
mean(toy_data) - 2*sd(toy_data)

## [1] 3.413948

mean(toy_data) + 2*sd(toy_data)

## [1] 10.87177

# 3.41 and 10.87 hours

# 99.7% of the time, my sleep is between 
mean(toy_data) - 3*sd(toy_data)

## [1] 1.549494

mean(toy_data) + 3*sd(toy_data)

## [1] 12.73622

# 1.55 and 12.74 hours

These guidelines work best if your data is symmetric and bell-shaped.
The more skewed the data, the less accurate these estimates are.
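
The rule can be checked empirically. The sketch below is only an illustration: `share_within` is a hypothetical helper, and the two simulated data sets (a normal distribution and an exponential distribution) are assumptions chosen to contrast bell-shaped data with strongly skewed data.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# hypothetical helper: share of values within k standard deviations of the mean
share_within <- function(x, k) mean(abs(x - mean(x)) <= k * sd(x))

bell_shaped <- rnorm(10000, mean = 7, sd = 2)  # symmetric, simulated
skewed <- rexp(10000, rate = 1)                # strongly right-skewed, simulated

# shares within 1, 2, and 3 standard deviations
bell_shares <- sapply(1:3, function(k) share_within(bell_shaped, k))
skewed_shares <- sapply(1:3, function(k) share_within(skewed, k))
bell_shares
skewed_shares

# the same check on the toy sleep data
toy_data <- c(6, 5, 6, 10, 9, 8, 6)
share_within(toy_data, 1)
```

For the bell-shaped data the three shares should sit close to 68%, 95%, and 99.7%. For the skewed data the share within one standard deviation should come out well above 68%, which is why the rule is only a rough guide when data is not symmetric. With just seven toy values, 5 of the 7 nights fall within one standard deviation of the mean.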