Why the correlation gives the best slope#
In which we return to the simple regression problem, and work through the mathematics to show why the correlation is the best least-squares slope for two z-scored arrays.
As background, you may also want to read the regression notation page.
The example regression problem#
As you remember, we have (fake) measured scores for a “psychopathy” personality trait in 12 students. We also have a measure of how much sweat each student had on their palms, and we call this a “clammy” score.
# Import numerical and plotting libraries
import numpy as np
import numpy.linalg as npl
import matplotlib.pyplot as plt
# Only show 6 decimals when printing
np.set_printoptions(precision=6)
psychopathy = np.array([11.416, 4.514, 12.204, 14.835,
                        8.416, 6.563, 17.343, 13.02,
                        15.19, 11.902, 22.721, 22.324])
clammy = np.array([0.389, 0.2, 0.241, 0.463,
                   4.585, 1.097, 1.642, 4.972,
                   7.957, 5.585, 5.527, 6.964])
plt.plot(psychopathy, clammy, '+')
plt.xlabel('Psychopathy')
plt.ylabel('Clammy');
To get ourselves into a mathematical mood, we will rename our x-axis values to x and the y-axis values to y. We will store the number of observations in both as n.
x = psychopathy
y = clammy
n = len(x)
n
12
x and y are Numpy arrays, but we can also call them vectors. Vector is just a technical and mathematical term for a one-dimensional array.
We are on a quest to find why it turned out that the correlation coefficient is always the best (least squared error) slope for the z-scored versions of these vectors (1D arrays). To find out why, we need to express these x and y vectors in the most general way possible — independent of the numbers we have here.
\(\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}} \newcommand{\xbar}{\bar{x}} \newcommand{\ybar}{\bar{y}} \newcommand{\zxvec}{\vec{z_x}} \newcommand{\zyvec}{\vec{z_y}}\)
First, in mathematical notation, we will call our x array \(\xvec\), to remind us that x is a vector of numbers, not just a single number.

Next we generalize \(\xvec\) to refer to any set of \(n\) numbers. So, in our x we have n specific numbers, but we generalize \(\xvec\) to mean a series of \(n\) numbers, where \(n\) can be any number. We can therefore write \(\xvec\) mathematically as:

$$
\xvec = [x_1, x_2, x_3, \dots, x_n]
$$

By this we mean that we take \(\xvec\) to be a vector (1D array) of any \(n\) numbers.
By the same logic, we generalize the array y as the vector \(\yvec\):

$$
\yvec = [y_1, y_2, y_3, \dots, y_n]
$$
Writing sums#
We are next going to do the z-score transform on \(\xvec\) and \(\yvec\). As you remember, this involves subtracting the mean and dividing by the standard deviation. We need to express these mathematically.
We will use sum notation for this task. This uses the symbol \(\Sigma\) to mean adding up the values in a vector. \(\Sigma\) is a capital "S" in the Greek alphabet. It is just a shorthand for the operation we are used to calling sum in code. So:

$$
\Sigma_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n
$$
The \(i\) above is the index for the vector, so \(x_i\) means the element from the vector at position \(i\). So, if \(i = 3\) then \(x_i\) is \(x_3\) — the element at the third position in the vector.
Therefore, in code, we can write \(\Sigma_{i=1}^{n} x_i\) as:
np.sum(x)
160.448
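If it helps to see the notation in terms of more basic code, here is a sketch of the same sum written as an explicit loop; the loop total should match np.sum(x) up to floating point error.

# A sketch of the sum written as an explicit loop.
total = 0
for i in range(n):
    total = total + x[i]
# The loop total should match np.sum up to floating point error.
assert np.isclose(total, np.sum(x))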
The \(i=1\) and \(n\) at the bottom and top of the \(\Sigma\) mean that adding should start at position 1 (\(x_1\)) and go up to position \(n\) (\(x_n\)). Of course, this means all the numbers in the vector. In this page, we will always sum across all the numbers in the vector, and we will miss off the \(i=1\) and \(n\) above and below the \(\Sigma\), so we will write:

$$
\Sigma x_i
$$

when we mean:

$$
\Sigma_{i=1}^{n} x_i
$$
Mean and standard deviation#
Now we have a notation for adding things up in a vector, we can write the mathematical notation for the mean and standard deviation.
We write the mean of a vector \(\xvec\) as \(\xbar\). Say \(\xbar\) as “x bar”. The definition of the mean is:

$$
\xbar = \frac{1}{n} \Sigma x_i
$$
Read this as add up all the elements in \(\xvec\) and divide by \(n\). In code:
# Calculation of the mean.
x_bar = np.sum(x) / n
x_bar
13.370666666666667
The mean of \(\yvec\) is:

$$
\ybar = \frac{1}{n} \Sigma y_i
$$
y_bar = np.sum(y) / n
y_bar
3.301833333333333
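As an aside, NumPy also has a shortcut for this calculation, np.mean. A quick check that it matches the sums above:

# np.mean is NumPy's shortcut for sum-divided-by-n.
assert np.isclose(np.mean(x), x_bar)
assert np.isclose(np.mean(y), y_bar)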
Next we are going to calculate the standard deviation. In order to do that, we start with the deviations. The deviation for each element is the result of subtracting the mean from the element.
deviations = x - x_bar
deviations
array([-1.954667, -8.856667, -1.166667, 1.464333, -4.954667, -6.807667,
3.972333, -0.350667, 1.819333, -1.468667, 9.350333, 8.953333])
In mathematical notation, we can write the deviations as:

$$
x_i - \xbar
$$
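One thing worth noticing in passing, because it comes back later with the z-scores, is that the deviations always add up to (almost exactly) zero:

# The deviations add up to zero, bar floating point error.
assert np.isclose(np.sum(deviations), 0)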
We have the deviations but we want something called the standard deviation, a measure of how far the typical (standard) value deviates from the mean.
To start, we turn all the deviations to positive values by squaring them. These are the squared deviations:
squared_deviations = deviations ** 2
squared_deviations
array([ 3.820722, 78.440544, 1.361111, 2.144272, 24.548722, 46.344325,
15.779432, 0.122967, 3.309974, 2.156982, 87.428733, 80.162178])
We can now get an average squared deviation. We can call this the mean squared deviation, but this value more often goes by another name: the variance.
# Variance is the mean squared deviation.
variance = np.sum(squared_deviations) / n
variance
28.80166355555556
The standard deviation is the square root of the mean squared deviation — the square root of the variance. We often write the standard deviation as \(\sigma\). Read this as “sigma”. \(\sigma\) is a small “s” in the Greek alphabet.
# Standard deviation
sigma = np.sqrt(variance)
sigma
5.366718136399149
These are simple calculations, but Numpy also has a short-cut to do them, called np.std (NumPy STandard Deviation):
# Numpy calculation of the standard deviation.
np.std(x)
5.366718136399149
We often find ourselves writing equations with the standard deviation \(\sigma\) and the variance (mean squared deviation). Because the variance is the squared standard deviation, we usually write the variance as \(\sigma^2\):

$$
\sigma^2 = \frac{1}{n} \Sigma (x_i - \xbar)^2
$$

And:

$$
\sigma = \sqrt{\frac{1}{n} \Sigma (x_i - \xbar)^2}
$$
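As with the standard deviation, NumPy has a shortcut for the variance, np.var. A quick check that it matches the calculation above:

# np.var is NumPy's shortcut for the mean squared deviation.
assert np.isclose(np.var(x), variance)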
We can be more specific about which vector we are referring to, by adding the vector name as a subscript. For example, if we mean the standard deviation for \(\xvec\), we could write this as \(\sigma_x\):
sigma_x = np.std(x)
sigma_x
5.366718136399149
Similarly for \(\yvec\):
sigma_y = np.std(y)
sigma_y
2.78136974149229
The z-score transformation#
The z-score transformation is to subtract the mean and divide by the standard deviation. Put another way, the z-scores are the deviations divided by the standard deviation:
z_x = (x - x_bar) / sigma_x
z_x
array([-0.36422 , -1.650295, -0.217389, 0.272855, -0.923221, -1.268497,
0.740179, -0.065341, 0.339003, -0.273662, 1.742281, 1.668307])
z_y = (y - y_bar) / sigma_y
z_y
array([-1.047266, -1.115218, -1.100477, -1.02066 , 0.461343, -0.792715,
-0.596768, 0.600484, 1.673696, 0.820879, 0.800025, 1.316677])
Write the z-scores corresponding to \(\xvec\) as \(\zxvec\):

$$
\zxvec = \frac{\xvec - \xbar}{\sigma_x}
$$

That is the same as saying that each element \(z_{x_i}\) of \(\zxvec\) is:

$$
z_{x_i} = \frac{x_i - \xbar}{\sigma_x}
$$
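If you happen to have SciPy installed, scipy.stats.zscore does the same transformation (with its default ddof=0, matching np.std above). This is just an optional cross-check, not something we need for the rest of the page.

# Optional cross-check, assuming SciPy is installed.
import scipy.stats as sps
# scipy.stats.zscore subtracts the mean and divides by the standard deviation.
assert np.allclose(sps.zscore(x), z_x)
assert np.allclose(sps.zscore(y), z_y)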
Correlation#
You may remember that the correlation calculation is:
r = np.sum(z_x * z_y) / n
r
0.5178777312914015
In mathematical notation:

$$
r = \frac{1}{n} \Sigma z_{x_i} z_{y_i}
$$
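NumPy's np.corrcoef calculates the same quantity; it returns a 2-by-2 correlation matrix, where the off-diagonal element is the correlation of x with y:

# The off-diagonal element of the correlation matrix is r.
assert np.isclose(np.corrcoef(x, y)[0, 1], r)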
Sum of squared error for z-scored lines#
We remind ourselves of the sum of squared error (SSE) for a given slope b and intercept c.
Here is a function to calculate the sum of squared error for a given x and y vector, slope and intercept:
def calc_sse(x_vals, y_vals, b, c):
    """ Sum of squared error for slope `b`, intercept `c`
    """
    predicted = x_vals * b + c
    errors = y_vals - predicted
    return np.sum(errors ** 2)
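For example, here is the sum of squared error for the z-scored vectors, with a slope of 0.6 and an intercept of 0:

# SSE for the z-scored data, slope 0.6, intercept 0.
calc_sse(z_x, z_y, 0.6, 0)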
Let us start by calculating the SSE values for 1000 intercepts between -1 and 1, and slope of 0.6, using the z-scored x and y values.
n_intercepts = 1000
intercepts = np.linspace(-1, 1, n_intercepts)
sse_inters_b0p6 = np.zeros(n_intercepts)
for i in range(n_intercepts):
    sse_inters_b0p6[i] = calc_sse(z_x, z_y, 0.6, intercepts[i])
plt.plot(intercepts, sse_inters_b0p6)
plt.xlabel('intercept')
plt.ylabel('SSE for intercept, b=0.6');
As before, it looks as though \(c=0\) is the intercept that minimizes the SSE, for a slope of 0.6.
Now let us try many slopes, for an intercept of 0.
n_slopes = 1000
slopes = np.linspace(-1, 1, n_slopes)
sse_slopes_c0 = np.zeros(n_slopes)
for i in range(n_slopes):
    sse_slopes_c0[i] = calc_sse(z_x, z_y, slopes[i], 0)
plt.plot(slopes, sse_slopes_c0)
plt.xlabel('slope')
plt.ylabel('SSE for slope, c=0');
This suggests that the slope minimizing the SSE is around 0.52, at least, for an intercept of 0. It turns out that the correlation value \(r\) above is always the best SSE slope for the z-score line. Our remaining task is to prove this with the mathematical notation we have built up.
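As a quick numerical check of that claim (not a proof), we can ask which of our candidate slopes gave the smallest SSE, and compare it to r; with a grid spacing of about 0.002, it should land very close to the correlation.

# The slope from the grid of candidates with the smallest SSE.
best_grid_slope = slopes[np.argmin(sse_slopes_c0)]
# It is within the grid spacing of the correlation r.
assert np.isclose(best_grid_slope, r, atol=0.01)
best_grid_slope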
Some interesting characteristics of z-scores#
Z-score vectors have a couple of interesting and important mathematical properties.
They have a sum of zero (within the calculation precision of the computer):
np.sum(z_x)
6.661338147750939e-16
Therefore they also have a mean of (near as dammit) zero.
np.mean(z_x)
5.551115123125783e-17
The z-scores have a standard deviation of (near-as-dammit) 1, and therefore, a variance of 1:
np.std(z_x)
0.9999999999999999
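The same checks pass for the z-scores of y:

# The z-scores of y also have mean (about) 0 and standard deviation (about) 1.
assert np.isclose(np.mean(z_y), 0)
assert np.isclose(np.std(z_y), 1)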
This is true for any vector \(\xvec\) (or \(\yvec\)). Why?
To answer that question, we need the results from the algebra of sums. If you read that short page, you will find that these general results hold:
Addition inside sum#

$$
\Sigma (x_i + y_i) = \Sigma x_i + \Sigma y_i
$$
Multiplication by constant inside sum#

$$
\Sigma c x_i = c \Sigma x_i
$$

where \(c\) is some constant (number).
Sum of constant value#

$$
\Sigma_{i=1}^{n} c = n c
$$
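We will take these results on trust here, but we can at least check them numerically on our own vectors, using an arbitrary constant (3, say):

# Numerical check of the three rules above, with a constant of 3.
c_constant = 3
assert np.isclose(np.sum(x + y), np.sum(x) + np.sum(y))
assert np.isclose(np.sum(c_constant * x), c_constant * np.sum(x))
assert np.isclose(np.sum(np.full(n, c_constant)), n * c_constant)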
Characteristics of z-scores#
With the results above, we can prove z-scores have a sum of 0:

$$
\Sigma z_{x_i} = \Sigma \frac{x_i - \xbar}{\sigma_x}
= \frac{1}{\sigma_x} \Sigma (x_i - \xbar)
= \frac{1}{\sigma_x} (\Sigma x_i - n \xbar)
= \frac{1}{\sigma_x} (\Sigma x_i - \Sigma x_i)
= 0
$$

The last steps follow because \(\Sigma \xbar = n \xbar\) (sum of a constant value), and \(n \xbar = n \frac{1}{n} \Sigma x_i = \Sigma x_i\).
Therefore, it is always true that the sum and mean of a z-score vector are 0.
Next we show z-scores have a variance, and therefore a standard deviation, of 1. Because the mean of the z-scores is 0, the deviations of the z-scores are just the z-scores themselves, so:

$$
\sigma^2_{z_x} = \frac{1}{n} \Sigma (z_{x_i} - 0)^2
= \frac{1}{n} \Sigma \frac{(x_i - \xbar)^2}{\sigma_x^2}
= \frac{1}{\sigma_x^2} \cdot \frac{1}{n} \Sigma (x_i - \xbar)^2
= \frac{\sigma_x^2}{\sigma_x^2}
= 1
$$
Thus we have learned that:

$$
\Sigma z_{x_i} = 0
$$

and:

$$
\sigma^2_{z_x} = \frac{1}{n} \Sigma (z_{x_i})^2 = 1
$$

Because \(\sigma^2_{z_x} = \frac{1}{n} \Sigma (z_{x_i})^2 = 1\):

$$
\Sigma (z_{x_i})^2 = n
$$

The same results hold for \(z_{y_i}\).
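We can check that last result numerically for our z_x vector:

# The sum of squared z-scores is (very close to) n.
assert np.isclose(np.sum(z_x ** 2), n)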
The least-squares line for z-scores#
Let us say we have a data vector \(\yvec\). We have somehow calculated an estimated (fitted) value \(\hat{y}_i\) for every corresponding value \(y_i\) in \(\yvec\). Then the errors at each value \(i\) are given by:

$$
\varepsilon_i = y_i - \hat{y}_i
$$
The sum of squared errors is:

$$
SSE = \Sigma \varepsilon_i^2 = \Sigma (y_i - \hat{y}_i)^2
$$
Remembering that \((a + b)^2 = (a + b)(a + b) = a^2 + 2ab + b^2\) we get:

$$
SSE = \Sigma (y_i^2 - 2 y_i \hat{y}_i + \hat{y}_i^2)
$$
Simplifying with the rules of sums above:

$$
SSE = \Sigma y_i^2 - 2 \Sigma y_i \hat{y}_i + \Sigma \hat{y}_i^2
$$
Now we introduce our fitted value \(\hat{y}_i\), which we get from our straight line with slope \(b\) and intercept \(c\):

$$
\hat{y}_i = b x_i + c
$$
Substituting, then simplifying:

$$
SSE = \Sigma y_i^2 - 2 \Sigma y_i (b x_i + c) + \Sigma (b x_i + c)^2
$$

$$
SSE = \Sigma y_i^2 - 2 b \Sigma x_i y_i - 2 c \Sigma y_i + b^2 \Sigma x_i^2 + 2 b c \Sigma x_i + n c^2
$$
Now, let us assume our \(y_i\) and \(x_i\) values are z-scores. To indicate this, we will use \(z_{y_i}\) and \(z_{x_i}\) instead of \(y_i\) and \(x_i\). As you remember from the results above:

$$
\Sigma z_{x_i} = 0; \quad \Sigma z_{y_i} = 0; \quad \Sigma (z_{x_i})^2 = n; \quad \Sigma (z_{y_i})^2 = n
$$
In that case, the equation above simplifies to:

$$
SSE_z = n - 2 b \Sigma z_{x_i} z_{y_i} + b^2 n + n c^2
$$
Reality check#
Let us pause at this point for a reality check. If we have done our derivation correctly, then the formula above should give the same answer as the calc_sse function above. Let’s check.
def calc_sse_simplified(z_x, z_y, b, c):
    """ Implement simplified z-score SSE equation above
    """
    n = len(z_x)
    return (n - 2 * b * np.sum(z_x * z_y) +
            b ** 2 * n + n * c ** 2)
Does this give the same values as we saw before, for our example intercepts and slope of 0.6?
sse_simplified_b0p6 = np.zeros(n_intercepts)
for i in range(n_intercepts):
    sse_simplified_b0p6[i] = calc_sse_simplified(z_x, z_y, 0.6, intercepts[i])
# Does it give the same results as we got from the original function?
assert np.allclose(sse_simplified_b0p6, sse_inters_b0p6)
How about the example slopes and intercept of 0?
sse_simplified_c0 = np.zeros(n_slopes)
for i in range(n_slopes):
    sse_simplified_c0[i] = calc_sse_simplified(z_x, z_y, slopes[i], 0)
# Does it give the same results as we got from the original function?
assert np.allclose(sse_simplified_c0, sse_slopes_c0)
No errors — we appear to be on the right mathematical track.
Finding the trough in the SSE function#
We want to find the slope, intercept pair \(b, c\) that gives us the lowest SSE value.
To do this, we remind ourselves that the SSE equation above is a function that varies with respect to \(b\) and with respect to \(c\).
Here we plot this function, again, for a variety of values of \(b\):
plt.plot(slopes, sse_simplified_c0)
plt.xlabel('b')
plt.ylabel('SSE for given b, c=0');
We would like to find the value of the slope \(b\) corresponding to the minimum of the SSE function. The standard mathematical way to do that is to first make a new function that gives the gradient of the SSE curve at every value of, in our case, \(b\). This new gradient function is called the derivative of the SSE function. Then we find the value of \(b\) for which the gradient (derivative function) is 0. You can see from the graph that the gradient goes to 0 when the function reaches its minimum point, because the gradient is turning from negative to positive. So, when we have found the gradient=0 point, and confirmed it is at the bottom of a trough, as we see in the graph, then we have found the value of \(b\) that minimizes the original SSE function.
We can see the shape of our function curve, and it looks like a simple parabola, with one minimum. When we take the derivative of the function, which gives us the gradient, we expect to see a single value of \(b\) where the gradient is 0, and we can see from the graph that this is a minimum point.

But, to be sure about this in the general case, we will first want to check that there is in fact only one value for which the derivative is 0. Next, we will want to check whether this point is a minimum point or a maximum point. You can probably imagine that a peak (maximum point) also has a derivative of 0, as the slope goes from positive to negative. To check whether a 0 in the derivative is a maximum or a minimum, we take the gradient of the gradient (the derivative of the derivative). This is the second derivative. Think of it as the rate of change of the gradient. If we are at a minimum point, this rate of change of the gradient is positive, as the gradient goes from negative to positive. If we are at a maximum point, this rate of change of the gradient is negative, as the gradient goes from positive to negative.
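Before doing the calculus, we can sketch these ideas numerically. np.gradient estimates the gradient of the SSE curve at each of our candidate slopes, and the estimated gradient crosses zero very close to the slope with the smallest SSE. This is just an illustration, not part of the proof.

# Numerical estimate of the gradient of the SSE curve at each slope.
sse_gradient = np.gradient(sse_slopes_c0, slopes)
# The slope at which the estimated gradient is closest to zero.
slopes[np.argmin(np.abs(sse_gradient))]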
Taking the derivative is also called differentiating the function.
You need to know some rules for taking the derivative, but these rules are not hard to derive, once you are used to the idea of taking the derivative. The key rules here are:

$$
\frac{d}{dx} c x = c
$$

$$
\frac{d}{dx} c x^2 = 2 c x
$$

where \(c\) is some constant.
You need to know too, that when differentiating with respect to some variable, say \(x\), you can ignore (set to 0) all terms that do not vary with \(x\).
Let us first take the derivative of the \(SSE_z\) function above, with respect to \(c\):

$$
\frac{\partial SSE_z}{\partial c} = 2 n c
$$
This is zero only when \(c = 0\). Differentiate again to give \(2n\); the second derivative is always positive, so the zero point of the first derivative is a trough (minimum) rather than a peak (maximum).
We have discovered that, regardless of the slope \(b\), the intercept \(c\) that minimizes the sum (and mean) squared error is 0.
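Again, as a rough numerical check, the intercept from our earlier grid with the smallest SSE is indeed very close to 0:

# The intercept from the grid of candidates with the smallest SSE.
intercepts[np.argmin(sse_inters_b0p6)]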
Now we have established the always-gives-SSE-minimum value for \(c\), we can substitute \(c = 0\) back into the equation above:

$$
SSE_z = n - 2 b \Sigma z_{x_i} z_{y_i} + b^2 n
$$
Differentiate with respect to \(b\):

$$
\frac{\partial SSE_z}{\partial b} = -2 \Sigma z_{x_i} z_{y_i} + 2 b n
$$
This is zero when:

$$
b = \frac{1}{n} \Sigma z_{x_i} z_{y_i}
$$
Differentiate again to get:

$$
\frac{\partial^2 SSE_z}{\partial b^2} = 2 n
$$
The second derivative is always positive, so the zero for the first derivative corresponds to a trough (minimum) rather than a peak (maximum).
We have discovered that the slope of the line that minimizes \(SSE_z\) is \(\frac{1}{n} \Sigma z_{x_i} z_{y_i}\) — AKA the correlation.
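If you have SymPy installed, you can also check the differentiation above symbolically. This is an optional cross-check, and the names here are our own: S_xy is just a symbol standing in for \(\Sigma z_{x_i} z_{y_i}\).

# Optional symbolic check of the calculus, assuming SymPy is installed.
import sympy as sp

b, c, n_sym, S_xy = sp.symbols('b c n S_xy')
# The simplified SSE for z-scored vectors, from above.
sse_z = n_sym - 2 * b * S_xy + b ** 2 * n_sym + n_sym * c ** 2
# Derivative with respect to c: 2*c*n, zero only at c = 0.
print(sp.diff(sse_z, c))
# Value of b at which the derivative with respect to b is zero: S_xy/n.
print(sp.solve(sp.diff(sse_z, b), b))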