Today’s Agenda
- Recap
- Variance
- Sample vs Population
- Standard Deviation
- Standard Scores
Recap
- two main characteristics of data that we care about:
- where is the center of the data? (mean, median, mode)
- mean is often advantageous when there aren’t many outliers
- median is often better for dealing with outliers
- mode is good for categorical data (e.g., if you are analyzing majors
at Davidson, there is no way to take an “average” or “median” major;
instead, you might care about the most common major:
the mode)
- how spread out is the data?
- in the lab you learned about mean absolute deviation and median
absolute deviation (see the sketch after this list)
- these introduced you to the idea that: if we want to measure how
spread out the data is, we can find each deviation from the center,
take its absolute value, and then average (or take the median of)
those absolute deviations
- these are not the most common ways to measure spread
- the most common way is to square each deviation and then take the
average of the squared deviations.
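Here is a small R sketch of these recap ideas (the majors vector is made up just for illustration, and each spread measure is computed from one common definition):
# hypothetical categorical data: the mode is simply the most common value
majors <- c("Bio", "Econ", "Bio", "Math", "Econ", "Bio")
names(which.max(table(majors)))   # the most frequent major, i.e. the mode
# a small made-up numeric vector to illustrate the deviation-based measures
x <- c(6, 5, 6, 10, 9, 8, 6)
# mean absolute deviation: average absolute distance from the mean
mean(abs(x - mean(x)))
# median absolute deviation: median absolute distance from the median
median(abs(x - median(x)))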
Variance
\[\text{variance} = \sigma^2 =
\displaystyle \frac{\sum_{i=1}^{N} (\mu-x_i)^2}{N}\]
# create toy data to experiment with
toy_data <- c(6, 5, 6, 10, 9, 8, 6)
toy_data
## [1] 6 5 6 10 9 8 6
Let’s compute the variance.
# find the mean of the toy_data
mean(toy_data)
## [1] 7.142857
# find all the deviations to get an idea of how far data is from the mean
mean(toy_data)-toy_data
## [1] 1.1428571 2.1428571 1.1428571 -2.8571429 -1.8571429 -0.8571429 1.1428571
# square all the deviations
# why? because we want all deviations to be positive
# squaring also makes points close to the mean less important and points far from the mean more important
(mean(toy_data) - toy_data)^2
## [1] 1.3061224 4.5918367 1.3061224 8.1632653 3.4489796 0.7346939 1.3061224
# now find the average of the squared deviations
# variance could more accurately be called "mean squared deviation"
mean((mean(toy_data) - toy_data)^2)
## [1] 2.979592
Of course, R has a built-in function for this:
# R's built-in variance function
var(toy_data)
## [1] 3.47619
# but it doesn't give the same answer!
Population vs. Sample
- We usually want to know a characteristic of the WHOLE
population.
- For example, we want to know how effective a new medication is for
everyone.
- Or, we want to know whether people across the whole US are more in
favor of Kamala Harris or Donald Trump.
- But, we usually only have access to a sample of the population
- We want to know \(\mu\), the population average, but we only have
access to \(\bar{x}\), the sample average.
- So, we try to use characteristics of the sample to predict
characteristics of the population as closely as possible.
- This is the entire goal of inferential statistics
(use information about a sample to infer things about the
population)
- So, let’s return to our discussion of variance (spread of the
data)
- We usually want to know the spread of the data in the population,
\(\sigma^2\), but we only have access to the spread of the data in the
sample, \(s^2\)
- In any sample you take, your data will tend to be less spread
out than the population.
- The bigger the group is, the more variety you should expect.
- So, any time we find the variance for the sample, we should expect
that our calculation is smaller than the variance in the
population.
- If we know that our calculation is going to give an estimate that is
too small, we should correct the estimate to make it better
- dividing by \(n-1\) makes our estimate a little bigger than if we
divided by \(n\) (the simulation sketch below illustrates this).
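As an aside, here is a minimal simulation sketch of this idea (the population values, sample size, and number of repetitions are arbitrary choices for illustration): draw many small samples from a known population and compare the average divide-by-\(n\) estimate with the average divide-by-\(n-1\) estimate.
# simulation sketch: compare dividing by n with dividing by n - 1
set.seed(1)                                    # for reproducibility
population <- rnorm(100000, mean = 7, sd = 2)  # made-up population with true variance 4
n <- 5          # small sample size
reps <- 10000   # number of repeated samples
var_n   <- numeric(reps)   # estimates that divide by n
var_nm1 <- numeric(reps)   # estimates that divide by n - 1
for (i in 1:reps) {
  s <- sample(population, n)
  var_n[i]   <- sum((s - mean(s))^2) / n
  var_nm1[i] <- sum((s - mean(s))^2) / (n - 1)
}
mean(var_n)    # tends to come out below the true variance of 4
mean(var_nm1)  # tends to come out close to 4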
Variance in a sample
\[\text{variance} = s^2 = \displaystyle
\frac{\sum_{i=1}^n (\bar{x}-x_i)^2}{n-1}\]
Variance in a population
\[\text{variance} = \sigma^2 =
\displaystyle \frac{\sum_{i=1}^{N} (\mu-x_i)^2}{N}\]
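One way to check the sample formula by hand (using toy_data from before) is to divide the sum of squared deviations by \(n-1\); this should reproduce the built-in result that puzzled us earlier:
# sample variance by hand: divide by n - 1 instead of n
sum((toy_data - mean(toy_data))^2) / (length(toy_data) - 1)
# this should agree with var(toy_data), about 3.476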
R assumes we are working with a sample (because we almost always are)
- For larger sample sizes and populations, it makes very little
difference.
- If you are asked to find the variance of some data you should assume
that your data is a sample from a population, unless specifically told
otherwise
- so the var() function is okay to use unless you are specifically
told otherwise
# if the toy data represents how many hours I slept each night for a week,
# and I want to know how spread out my overall sleep is, then I should use
var(toy_data)
## [1] 3.47619
# My sleep is 3.476 hours squared away from the mean on average?
Standard Deviation
- the units of variance are awkward
- we don’t want all of our units to be squared (hours squared, in the
sleep example)
- So, the square root of the variance is the standard deviation
\[\text{standard deviation} =
\sqrt{\text{variance}} = \sqrt{\sigma^2} = \sigma = \sqrt{\displaystyle
\frac{\sum_{i=1}^{N} (\mu-x_i)^2}{N}}\]
- The same story goes for standard deviation: the formula above is for
a population.
- For a sample, we again divide by \(n-1\):
Standard Deviation in a Sample
\[\text{standard deviation} =
\sqrt{\text{variance}} = \sqrt{s^2} = s = \sqrt{\displaystyle
\frac{\sum_{i=1}^n (\bar{x}-x_i)^2}{n-1}}\]
- sample standard deviation is what R computes
sd(toy_data)
## [1] 1.864454
# my sleep has a standard deviation of 1.86 hours from the mean
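As a quick sanity check, the standard deviation is just the square root of the variance that R reports:
# standard deviation is the square root of the sample variance
sqrt(var(toy_data))
# this should match sd(toy_data), about 1.864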
68-95-99.7
- a good rule of thumb is that about 68% of all data falls within one
standard deviation of the mean, about 95% within two, and about 99.7%
within three
# 68% of the time my sleep is
mean(toy_data) - sd(toy_data)
## [1] 5.278403
mean(toy_data) + sd(toy_data)
## [1] 9.007312
# between 5.28 and 9 hours
# 95% of the time my sleep is between
mean(toy_data) - 2*sd(toy_data)
## [1] 3.413948
mean(toy_data) + 2*sd(toy_data)
## [1] 10.87177
# 3.4 and 10.87 hours
# 99.7% of the time, my sleep is between
mean(toy_data) - 3*sd(toy_data)
## [1] 1.549494
mean(toy_data) + 3*sd(toy_data)
## [1] 12.73622
# 1.55 and 12.74 hours
- these guidelines work best if your data is symmetric
(bell-shaped)
- the more skewed the data, the worse this approximation is.
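As a rough check of this rule on the toy data itself (with only seven values the percentages are very approximate), we can compute the fraction of points within one and two standard deviations of the mean:
# fraction of toy_data within one standard deviation of the mean
mean(abs(toy_data - mean(toy_data)) <= sd(toy_data))
# about 0.71 here (5 of the 7 values)
# fraction of toy_data within two standard deviations of the mean
mean(abs(toy_data - mean(toy_data)) <= 2 * sd(toy_data))
# 1 here (all 7 values)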