# Mathematical Musing: What is r?

A student recently asked me a really interesting question; a pair of questions, really. We had just discussed the correlation coefficient as a measure of the direction and strength of a linear association between two quantitative variables, and I had demonstrated in class that this quantity, referred to by the letter r, can be calculated with the formula

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

In other words, for each point of a scatterplot, find the z-score for the x-coordinate and the z-score for the y-coordinate of that point and multiply those together. Do this for all of the points in your scatterplot, add the products together, and divide by n-1 to get your correlation coefficient.
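For readers who like to see the arithmetic run, here is a minimal Python sketch of that recipe. The function name and the five sample points are made up for illustration, and the code assumes the sample standard deviation (the one that divides by n-1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

def correlation_from_z_scores(xs, ys):
    """Compute r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    # One product per point: (z-score of x) * (z-score of y)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data, chosen only so the numbers are easy to check by hand
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_from_z_scores(xs, ys)
print(round(r, 4))  # 0.7746
```

Points above both means or below both means contribute positive products, while points above one mean and below the other contribute negative ones, which is a first hint at why the sign of r tracks the direction of the association.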

We discussed various properties of this quantity, and my student asked me the question that teachers always hope for (if not without a bit of dread sometimes!): “Why?” Why does this formula produce a quantity that measures the strength of a linear association? Also, why must the value of r necessarily be bounded between -1 and 1? In this post, I seek to start an answer to these questions.

The correlation coefficient that we use in AP Statistics is actually something called the PCC or Pearson Correlation Coefficient (or, if you really want to impress your friends, the “Pearson Product-Moment Correlation Coefficient” or PPMCC). The formula for this value can be expressed in a number of different ways, and the version above is probably the most succinct. Most conventionally, though, the formula for the PCC of a sample of data is given as:

$$r = \frac{\operatorname{cov}(X,Y)}{s_x \, s_y}$$

where the values in the denominator represent the standard deviations of the variables on the x- and y-axis, and cov(X,Y) represents the covariance of the two variables. The covariance of a sample is a measure of the joint variability of the two variables in that sample, and is found with the formula:

$$\operatorname{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
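As a quick numerical check, the sketch below computes the sample covariance as the sum of the products of the deviations from each mean, divided by n-1, and then divides by the product of the two sample standard deviations. The function names and data are hypothetical, chosen only for illustration; the result should match the value you get from averaging z-score products.

```python
from statistics import mean, stdev

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation_from_covariance(xs, ys):
    """Covariance scaled by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical data, small enough to verify by hand
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
cov_xy = sample_covariance(xs, ys)
r_cov = correlation_from_covariance(xs, ys)
print(cov_xy)           # 1.5
print(round(r_cov, 4))  # 0.7746
```

Dividing by the standard deviations is what strips the units off the covariance: rescaling x or y (say, inches to centimeters) changes the covariance but leaves r alone.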