You may work on this in groups of two. When finished, knit to PDF and submit to Moodle. You should both submit identical PDFs.
When we talk about the dispersion of a data set, we are concerned with how spread out it is.
Both range and IQR are coarse measurements of dispersion, because each is computed from only two values of the data set.
Instead, we want a measurement that tells us how spread out all the data is.
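To illustrate why range and IQR can miss differences in spread, the sketch below compares two made-up data sets (hypothetical values, not part of this lab's data). Both have the same range and the same IQR, even though one of them has more values sitting at the extremes.

```r
# Two hypothetical data sets, chosen only for illustration
set_a <- c(0, 4, 4, 5, 5, 5, 6, 6, 10)
set_b <- c(0, 0, 4, 4, 5, 6, 6, 10, 10)

# Range (maximum minus minimum) is the same for both
range_a <- diff(range(set_a))
range_b <- diff(range(set_b))

# IQR (third quartile minus first quartile) is also the same
iqr_a <- IQR(set_a)
iqr_b <- IQR(set_b)

# Yet set_b has more values at the extremes 0 and 10
extremes_a <- sum(set_a %in% c(0, 10))
extremes_b <- sum(set_b %in% c(0, 10))

c(range_a = range_a, range_b = range_b)
c(iqr_a = iqr_a, iqr_b = iqr_b)
c(extremes_a = extremes_a, extremes_b = extremes_b)
```

Neither summary notices the extra values at the edges of set_b, which is the motivation for a measurement that uses every data point.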
One idea is to measure how far each point is from the center of the data.
Suppose there is a data set with \(n\) data points (\(n\) can be any whole number).
So, our data set has values \(X_1, X_2, X_3, \dots, X_n\).
For every point \(X_i\) in our data set, we will measure how far it is (how much it deviates) from the mean:
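\[ \bar{X}-X_i \]
Next, we take the absolute value of each deviation, so that points above and below the mean both count as positive distances:
\[ | \bar{X}-X_i | \]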
Finally, we take the average of all the absolute deviations from the mean (hence the name mean absolute deviation):
\[ \displaystyle \frac{\sum_{i=1}^n | \bar{X}-X_i |}{n}\]
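The three steps above can be sketched in R as follows. This is a minimal illustration on a hypothetical toy vector (toy_data is not the lab's data set), and the variable names are chosen only for this example.

```r
# Hypothetical toy data (not the lab's example_data)
toy_data <- c(1, 3, 5, 7)

# Step 1: deviation of each point from the mean
toy_mean <- mean(toy_data)
deviations <- toy_mean - toy_data

# Step 2: absolute value of each deviation
abs_deviations <- abs(deviations)

# Step 3: average of the absolute deviations
mean_abs_dev <- sum(abs_deviations) / length(toy_data)
mean_abs_dev
```

The signed deviations in step 1 always sum to zero, so averaging them directly would say nothing about spread. That is why step 2 takes absolute values first.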
Let’s start with a small data set. Run the code below to load the data set:
example_data <- c(2, 0, 6, 28, 19, 65)
example_data
Find the mean of example_data and save it in a local variable called mean_example_data.
# insert code here
data index \(i\) | value \(X_i\) | deviation from mean \(\bar{X}-X_i\) | absolute deviation \(\lvert \bar{X}-X_i \rvert\) |
---|---|---|---|
1 | 2 | ? | ? |
2 | 0 | ? | ? |
3 | 6 | ? | ? |
4 | 28 | ? | ? |
5 | 19 | ? | ? |
6 | 65 | ? | ? |
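A table with this layout can also be built in R. The sketch below does so for a hypothetical three-value vector (toy_data and the column names are assumptions made for illustration, not the lab's data), so the same pattern could be adapted when filling in the table above.

```r
# Hypothetical toy data (not the lab's example_data)
toy_data <- c(4, 8, 12)

# One row per data point: index, value, and deviation from the mean
deviation_table <- data.frame(
  index = seq_along(toy_data),
  value = toy_data,
  deviation = mean(toy_data) - toy_data
)

# Add the absolute deviation as a fourth column
deviation_table$abs_deviation <- abs(deviation_table$deviation)

deviation_table
```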
# insert code here
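Now for the example data, we will compute an absolute deviation with more help from R. The single line of code below averages the absolute deviations from the median; note that it centers on median(example_data) rather than mean(example_data), so its result can differ from the mean absolute deviation defined above:
mean(abs(median(example_data) - example_data))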
## [1] 17.33333
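Because several similar-sounding measures exist, the sketch below compares three of them on a hypothetical toy vector (toy_data is made up for this illustration). It uses the base R function mad(), which by default multiplies its result by 1.4826; passing constant = 1 turns that rescaling off.

```r
# Hypothetical toy data (not the lab's example_data)
toy_data <- c(1, 2, 3, 4, 10)

# Average absolute deviation, centered on the mean
dev_about_mean <- mean(abs(mean(toy_data) - toy_data))

# Average absolute deviation, centered on the median
dev_about_median <- mean(abs(median(toy_data) - toy_data))

# Median of the absolute deviations from the median
# (constant = 1 disables the default 1.4826 rescaling in mad())
median_abs_dev <- mad(toy_data, constant = 1)

c(
  about_mean = dev_about_mean,
  about_median = dev_about_median,
  median_abs_dev = median_abs_dev
)
```

The three results differ for this vector, so it is worth checking which center and which averaging step a given line of code actually uses.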
ANSWER HERE
Load the openintro package. Also save the yrbss data from the openintro package to your local environment as teen_data.
#insert code here
Filter teen_data so that you can focus on the ninth graders. Save the data set with only ninth grade data as ninth_graders.
#insert code here
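One possible way to keep only the rows for a single grade is base R's subset(). The sketch below uses a small hypothetical stand-in data frame (toy_teens); the column name grade and its coding as text such as "9" are assumptions, so inspect the real data with str(teen_data) before adapting it.

```r
# Hypothetical stand-in for teen_data; the column names and the coding of
# grade as text are assumptions made for this illustration
toy_teens <- data.frame(
  grade = c("9", "10", "9", "12", NA),
  weight = c(55.0, 62.5, 70.3, 81.2, 64.0),
  stringsAsFactors = FALSE
)

# subset() keeps rows where the condition is TRUE
# and drops rows where the condition is NA
toy_ninth <- subset(toy_teens, grade == "9")
toy_ninth
```

Using subset() rather than logical indexing with square brackets avoids carrying along all-NA rows when the grade value is missing.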
#insert code here
#insert code here
ANSWER HERE
#insert code here
#insert code here
ANSWER HERE
#insert code here
ANSWER HERE
#insert code here
ANSWER HERE
yrbss teen data.
#insert code here
Explain some reasons why it makes sense that fewer of the outliers are underweight. (Think about how far the whisker extends and how much someone would need to weigh to be considered an outlier.)
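To make the whisker limits concrete, the sketch below computes outlier cutoffs for a hypothetical vector of weights (toy_weights is made up, with units assumed to be kilograms). It assumes the common rule that a point is an outlier when it lies more than 1.5 times the IQR beyond the quartiles, which matches the default setting of R's boxplot().

```r
# Hypothetical weights in kilograms (not the lab's data)
toy_weights <- c(45, 50, 52, 55, 58, 60, 63, 66, 70, 110, 125)

# First and third quartiles
q1 <- quantile(toy_weights, 0.25, names = FALSE)
q3 <- quantile(toy_weights, 0.75, names = FALSE)

# Cutoffs under the assumed 1.5 * IQR rule
fence_width <- 1.5 * (q3 - q1)
lower_fence <- q1 - fence_width
upper_fence <- q3 + fence_width

# Points beyond each cutoff
low_outliers <- toy_weights[toy_weights < lower_fence]
high_outliers <- toy_weights[toy_weights > upper_fence]

c(lower_fence = lower_fence, upper_fence = upper_fence)
low_outliers
high_outliers
```

Comparing the two cutoffs with weights that are realistic for a teenager is one way to start on the question above.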
Make a scatter plot with the weight data on the \(x\)-axis and the height data on the \(y\)-axis.
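A base R scatter plot can be drawn with plot(). The sketch below uses a hypothetical stand-in data frame (toy_teens); the column names weight and height and the units in the axis labels are assumptions, so adjust them to match the real data.

```r
# Hypothetical stand-in data; column names and units are assumptions
toy_teens <- data.frame(
  weight = c(50, 58, 64, 72, 85),
  height = c(1.55, 1.62, 1.68, 1.75, 1.80)
)

# First argument goes on the x-axis, second on the y-axis
plot(
  toy_teens$weight, toy_teens$height,
  xlab = "Weight (kg)",
  ylab = "Height (m)",
  main = "Height versus weight (toy data)"
)
```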
List at least three things that a data scientist may want to know about the data that you can see in the scatter plot.
ANSWER HERE