A practical interpretation of the Pearson correlation coefficient

\(\rho=1\) means perfect positive correlation, \(\rho=-1\) means perfect negative correlation, \(\rho=0\) means no correlation. But what does \(\rho=0.72\) mean?
Author

Tom Shlomo

Published

January 20, 2024

\[ \renewcommand{\E}[1]{\operatorname{E}\left[#1\right]} \renewcommand{\var}[1]{\operatorname{Var} \left[#1 \right]} \renewcommand{\cov}[1]{\operatorname{Cov} \left[#1 \right] } \] My goal is to explain the Pearson correlation coefficient without using the word correlation, which is often used to describe it.
The Pearson correlation coefficient of two random variables \(X\) and \(Y\) is \[ \rho := \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \] where \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\), respectively, and \(\sigma_{XY}\) is their covariance.
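As a quick illustration (a minimal numpy sketch of my own, not from the post itself; the synthetic data is just for demonstration), the plug-in version of this formula agrees with numpy's built-in `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(0)

# two correlated variables (any joint distribution would do; this one is made up)
x = rng.normal(size=10_000)
y = 0.7 * x + rng.normal(size=10_000)

# plug-in estimate: covariance divided by the product of the standard deviations
sigma_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = sigma_xy / (x.std() * y.std())

print(rho, np.corrcoef(x, y)[0, 1])  # the two values agree
```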

A motivation for the definition of \(\rho\) comes from the problem of estimating \(Y\) from an observation of \(X\). It turns out that under the optimal (lowest MSE) linear estimator, the estimated number of standard deviations \(Y\) is above its mean is \(\rho\) times the number of standard deviations \(X\) is above its mean.
For example, consider a population of people in which height and weight are correlated with \(\rho=0.72\), heights are distributed with mean \(170\) cm and standard deviation \(10\) cm, and weights are distributed with mean \(70\) kg and standard deviation \(20\) kg. If we know that a certain person's height is \(190\) cm, i.e. \(2\) standard deviations above the mean, a good guess for their weight is \(70 + 0.72 \cdot 2 \cdot 20 = 98.8\) kg.
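Here is the same arithmetic as a tiny Python sketch (the numbers are the ones assumed in the example above; the variable names are mine):

```python
mu_h, sigma_h = 170, 10   # height: mean and std [cm]
mu_w, sigma_w = 70, 20    # weight: mean and std [kg]
rho = 0.72

height = 190
z_height = (height - mu_h) / sigma_h           # 2 standard deviations above the mean
weight_estimate = mu_w + rho * z_height * sigma_w
print(weight_estimate)                         # 98.8
```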

The proof is very simple. Since we are dealing with linear (actually, affine) estimators, we need to show that the \(a\) and \(b\) that minimize \[ \text{MSE} := \E{ \left( \hat{Y} - Y \right) ^2}, \] where \(\hat{Y} := a (X - \mu_X) + b\), are \(\rho \sigma_Y / \sigma_X\) and \(\mu_Y\), respectively.
The MSE is the sum of the squared bias and the variance of the error. The variance doesn't depend on \(b\), and the bias is \(\E{ \hat{Y} - Y } = b - \mu_Y\), which doesn't depend on \(a\), so \(b=\mu_Y\). To minimize the variance, we expand: \[ \begin{align*} \var{\hat{Y} - Y} &= \var{\hat{Y}} + \var{Y} - 2 \cov{\hat{Y}, Y} \\&= \sigma_X ^ 2 a^2 + \sigma_Y ^2 -2 \sigma_{XY} a. \end{align*} \] This is just a parabola in \(a\), so the optimal \(a\) is \[ a=\frac{2 \sigma_{XY}} {2 \sigma_X ^2} = \rho \frac{\sigma_Y } {\sigma_X } \] (which is what we wanted to show).

The estimator is unbiased, so its MSE is equal to its variance. Plugging \(a = \rho \sigma_Y / \sigma_X\) into the variance above gives \[ \text{MSE} = \rho^2 \sigma_Y^2 + \sigma_Y ^2 - 2 \rho^2 \sigma_Y^2 = \sigma_Y ^2 (1 - \rho ^ 2). \] This equation gives another concrete interpretation of \(\rho\): if \(X\) and \(Y\) are correlated with coefficient \(\rho\), observing \(X\) reduces the standard deviation of the estimation error from \(\sigma_Y\) to at most \(\sigma_Y \sqrt{1 - \rho^2}\).
“At most” since the optimal linear estimator is no better than the overall optimal estimator, which may do even better.
In the example above, knowing the height reduces the weight estimation standard deviation from \(20\) kg to at most \(20 \sqrt{1 - 0.72^2} \approx 13.9\) kg.
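To see both facts numerically, here is a small Monte Carlo sketch (the jointly Gaussian population and the seed are assumptions made purely for illustration): the derived estimator is unbiased, and its error's standard deviation comes out close to \(\sigma_Y \sqrt{1-\rho^2}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

mu_x, sigma_x = 170.0, 10.0   # height
mu_y, sigma_y = 70.0, 20.0    # weight
rho = 0.72

# jointly Gaussian (X, Y) with the desired moments
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=n).T

# the optimal linear estimator derived above
a = rho * sigma_y / sigma_x
y_hat = a * (x - mu_x) + mu_y

err = y_hat - y
print(err.mean())                      # ~0 (unbiased)
print(err.std())                       # ~ sigma_y * sqrt(1 - rho**2)
print(sigma_y * np.sqrt(1 - rho**2))   # ~13.9
```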

Randomly ordered notes:

  1. If \(X\) and \(Y\) are jointly Gaussian, the optimal linear estimator is also the optimal estimator.

  2. The “mean” in “MSE” is an average over the joint distribution of \(X\) and \(Y\), which is different from an average over the conditional distribution of \(Y\) given \(X\); with respect to that conditional distribution, our estimator is generally biased and not the optimal linear estimator.
    In our example, we estimated the weight to be \(98.8\) kg, with an error variance of about \(13.9^2\). It doesn’t mean that if we sample random people with height \(190\) cm, we will get a mean weight of \(98.8\) kg and a variance smaller than \(13.9^2\). It means that if we sample random people and estimate their weight from their height using the optimal linear estimator, our error will be zero on average, with variance \(13.9^2\). If we use the optimal estimator, \(13.9^2\) is an upper bound on the error variance.

  3. The sentence “\(X\) and \(Y\) are not correlated” now has a concrete meaning: the optimal linear estimator of \(Y\) from \(X\) is simply the mean of \(Y\), ignoring \(X\) completely.

  4. The discussion above is “Bayesian”, in the sense that it assumes you have some knowledge about the joint distribution of \(X\) and \(Y\). In practice we usually get \(n\) sample pairs of \(X\) and \(Y\), and we use plug-in estimators for the means, variances, and covariance, which we then use to build our linear estimator of \(Y\) from \(X\).
    Machine learning people would say: we can use the samples to train a linear regression model to predict \(Y\) from \(X\) directly. Sounds better, more “end-to-end”-y, but it actually gives exactly the same result¹. Proof:
    We denote by \(x\) and \(y\) the vectors of samples of \(X\) and \(Y\), by \(\mathbf{1}\) a vector of ones, and by \(A\) the matrix whose first column is \(x\) and whose second column is \(\mathbf{1}\). Note that dividing both \(A^TA\) and \(A^Ty\) by \(n\) leaves \(\left(A^TA\right)^{-1}A^Ty\) unchanged and lets us write their entries in terms of the sample means, variances, and covariance (denoted below by the same \(\mu\) and \(\sigma\) symbols). The coefficients of the linear model are given by: \[ \begin{align*} \begin{bmatrix} \theta_{\text{slope}} \\ \theta_{\text{intercept}} \end{bmatrix} &:= \text{argmin}_\theta \| A \theta - y \|^2 \\&= \left( A ^T A \right)^{-1} A^T y \\&= \begin{bmatrix} \|x\|^2 & \mathbf{1}^Tx \\ \mathbf{1}^T x & \mathbf{1}^T \mathbf{1} \end{bmatrix} ^{-1} \begin{bmatrix} x^T y \\ \mathbf{1} ^T y \end{bmatrix} \\&= \begin{bmatrix} \sigma_X^2 + \mu_X^2 & \mu_X \\ \mu_X & 1 \end{bmatrix} ^{-1} \begin{bmatrix} \sigma_{XY} + \mu_X \mu_Y \\ \mu_Y \end{bmatrix} \\&= \frac{1}{\sigma_X ^2} \begin{bmatrix} 1 & -\mu_X \\ -\mu_X & \sigma_X^2 + \mu_X^2 \end{bmatrix} \begin{bmatrix} \sigma_{XY} + \mu_X \mu_Y \\ \mu_Y \end{bmatrix} \\&= \frac{1}{\sigma_X ^2} \begin{bmatrix} \sigma_{XY} \\ -\mu_X \sigma_{XY} + \sigma_X^2 \mu_Y \end{bmatrix} \\&= \begin{bmatrix} a \\ -\mu_X a + b \end{bmatrix}. \end{align*} \] Note also that the \(r^2\) score of this fit is equal to \(\rho^2\): \[ r^2 := 1 - \frac{\text{MSE}}{\sigma_Y^2} = 1 - \frac{\sigma_Y ^2 \left(1-\rho^2\right)}{\sigma_Y ^2} = \rho^2. \] A numerical check of this equivalence is sketched below.
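The following sketch checks the equivalence numerically (synthetic data and variable names are mine; sample moments are computed without Bessel's correction, as in the footnote): the plug-in coefficients coincide with an ordinary least-squares fit, and the resulting \(r^2\) equals \(\rho^2\).

```python
import numpy as np

rng = np.random.default_rng(0)

# any paired samples would do; these are synthetic, for illustration only
x = rng.normal(170, 10, size=10_000)
y = 1.44 * (x - 170) + 70 + rng.normal(0, 13.9, size=10_000)

# plug-in estimates of the moments (no Bessel's correction)
mu_x, mu_y = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()
sigma_xy = np.mean((x - mu_x) * (y - mu_y))
rho = sigma_xy / (sigma_x * sigma_y)

a = rho * sigma_y / sigma_x   # slope of the optimal linear estimator
b = mu_y                      # its value at x = mu_x

# ordinary least-squares linear regression of y on x
slope, intercept = np.polyfit(x, y, deg=1)
print(np.allclose([slope, intercept], [a, -mu_x * a + b]))  # True

# the r^2 of the fit equals rho^2
y_hat = slope * x + intercept
r2 = 1 - np.mean((y_hat - y) ** 2) / sigma_y**2
print(np.allclose(r2, rho**2))  # True
```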

Footnotes

  1. Assuming we don’t use Bessel’s correction.