In Part 1 of this question, we explored how the correlation coefficient is calculated, and how that calculation relies heavily on the covariance between two quantitative variables. We left off with a few questions: why is r bounded between -1 and 1, and why does a value of r near 0 indicate a weak association (and a value near either extreme indicate a strong one)? In this post, we will answer these questions!
A student asked me a really interesting question recently; a pair of questions, really. We had just discussed the correlation coefficient as a measure of the direction and strength of a linear association between two quantitative variables, and I had demonstrated in class that this quantity, referred to by the letter r, can be found by the formula

$$r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} \, z_{y_i}$$
In other words, for each point of a scatterplot, find the z-score for the x-coordinate and the y-coordinate of that point and multiply those together. Do this for all of the points in your scatterplot, add them together, and divide by n-1 to get your correlation coefficient.
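To make the recipe above concrete, here is a minimal Python sketch of it. The function name `correlation` and the sample data are my own, purely for illustration. The sketch assumes the z-scores use the sample standard deviation (the one that divides by n-1), which is the convention that goes with the n-1 in the final step.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r computed as the sum of z-score products over n-1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sample standard deviations (divide by n-1); assumed convention for the z-scores.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when one variable has no spread")

    # For each point, multiply the z-score of x by the z-score of y, then add them up.
    total = sum(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    )

    # Divide by n-1 to get r.
    return total / (n - 1)


# Made-up illustrative data, not taken from any real data set.
r = correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(r, 3))
```

For this made-up data the result is approximately 0.8, a fairly strong positive linear association. Points that fall exactly on an upward-sloping line, such as (1, 2), (2, 4), (3, 6), give a value of 1 up to floating-point rounding.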
We discussed various properties of this quantity, and my student asked me the question that teachers always hope for (though sometimes with a bit of dread!): “Why?” Why does this formula produce a quantity that measures the strength of a linear association? Also, why must the value of r necessarily be bounded between -1 and 1? In this post, I seek to start an answer to these questions.