A practical interpretation of the Pearson correlation coefficient

\(\rho=1\) means perfect positive correlation, \(\rho=-1\) means perfect negative correlation, \(\rho=0\) means no correlation. But what does \(\rho=0.72\) mean?
Author

Tom Shlomo

Published

January 20, 2024

\[ \renewcommand{\E}[1]{\operatorname{E}\left[#1\right]} \renewcommand{\var}[1]{\operatorname{Var} \left[#1 \right]} \renewcommand{\cov}[1]{\operatorname{Cov} \left[#1 \right] } \]

My goal is to explain the Pearson correlation coefficient without resorting to the word “correlation,” which is how it is usually described.

The Pearson correlation coefficient of two random variables \(X\) and \(Y\) is \[ \rho := \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \] where \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\) respectively, and \(\sigma_{XY}\) is their covariance.
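
To make the definition concrete, here is a minimal NumPy sketch (with made-up data) that computes \(\rho\) straight from the definition and checks it against `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)  # made-up, positively correlated data

# rho = Cov(X, Y) / (sigma_X * sigma_Y), using population (ddof=0) moments
cov_xy = np.cov(x, y, ddof=0)[0, 1]
rho = cov_xy / (x.std() * y.std())

# np.corrcoef computes the same quantity
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(rho)
```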

A motivation for this definition stems from the problem of estimating \(Y\) from an observation of \(X\). It turns out that, for the optimal (lowest-MSE) linear estimator, the estimate of \(Y\) lies \(\rho\) times as many standard deviations above the mean of \(Y\) as the observed \(X\) lies above the mean of \(X\).

For example, consider a population where height and weight are correlated with \(\rho=0.72\), heights are distributed with a mean of \(170\)cm and a standard deviation of \(10\)cm, and weights are distributed with a mean of \(70\)kg and a standard deviation of \(20\)kg. If we know that a certain person’s height is \(190\)cm, i.e. two standard deviations above the mean, a good estimate for their weight would be \(70 + 2 \cdot 0.72 \cdot 20 = 98.8\)kg.
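
As a quick sanity check of that arithmetic, here is the same computation in Python, using only the numbers from the example:

```python
# Numbers from the height/weight example above
mu_h, sigma_h = 170, 10   # height: mean and standard deviation, in cm
mu_w, sigma_w = 70, 20    # weight: mean and standard deviation, in kg
rho = 0.72

height = 190
z = (height - mu_h) / sigma_h        # 2 standard deviations above the mean height
weight_estimate = mu_w + rho * z * sigma_w
print(weight_estimate)               # 98.8
```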

The proof is straightforward. Since we are dealing with linear (actually, affine) estimators, we need to show that the \(a\) and \(b\) that minimize \[ \text{MSE} := \E{ \left( \hat{Y} - Y \right) ^2}, \] where \(\hat{Y} := a (X - \mu_X) + b\), are \(a = \rho \sigma_Y / \sigma_X\) and \(b = \mu_Y\).

The MSE is the sum of the squared bias and the variance of the error. The variance doesn’t depend on \(b\), and the bias is \(\E{ \hat{Y} - Y } = b - \mu_Y\), which doesn’t depend on \(a\), so \(b=\mu_Y\). To minimize the variance, we expand: \[ \begin{align*} \var{\hat{Y} - Y} &= \var{\hat{Y}} + \var{Y} - 2 \cov{\hat{Y}, Y} \\&= \sigma_X ^ 2 a^2 + \sigma_Y ^2 -2 \sigma_{XY} a. \end{align*} \] This is simply an upward-opening parabola in \(a\), minimized at its vertex: \[ a=\frac{2 \sigma_{XY}} {2 \sigma_X ^2} = \frac{\sigma_{XY}}{\sigma_X^2} = \rho \frac{\sigma_Y } {\sigma_X } \] (which is what we wanted to show).
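
If you prefer not to take the derivative by hand, here is a small SymPy sketch that minimizes the same parabola symbolically (the symbol names are mine, introduced just for this check):

```python
import sympy as sp

a, sigma_x, sigma_y, sigma_xy = sp.symbols("a sigma_x sigma_y sigma_xy", positive=True)

# Var[Y_hat - Y] as a function of a, from the expansion above
variance = sigma_x**2 * a**2 + sigma_y**2 - 2 * sigma_xy * a

# Vertex of the parabola: set the derivative to zero and solve for a
a_opt = sp.solve(sp.diff(variance, a), a)[0]
print(a_opt)  # sigma_xy/sigma_x**2, i.e. rho * sigma_y / sigma_x
```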

The estimator is unbiased, so its MSE is equal to its variance: \[ \text{MSE} = \sigma_Y ^2 (1 - \rho ^ 2). \] This equation provides another concrete interpretation of \(\rho\): *if \(X\) and \(Y\) are correlated with coefficient \(\rho\), observing \(X\) shrinks the standard deviation of the \(Y\) estimate to at most \(\sqrt{1 - \rho^2}\) times its prior value* (“at most” since the optimal linear estimator is equal to or worse than the optimal estimator). In the example above, knowing the height decreases the weight estimate’s standard deviation from \(20\)kg to \(20 \sqrt{1 - 0.72^2} \approx 13.9\)kg.
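
Here is a short Monte Carlo sketch of both claims, assuming (only for the purpose of this simulation) that heights and weights in the example are jointly Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
rho = 0.72
mu_h, sigma_h = 170.0, 10.0   # height: cm
mu_w, sigma_w = 70.0, 20.0    # weight: kg

# Jointly Gaussian heights and weights with the stated moments
cov = np.array([[sigma_h**2, rho * sigma_h * sigma_w],
                [rho * sigma_h * sigma_w, sigma_w**2]])
heights, weights = rng.multivariate_normal([mu_h, mu_w], cov, size=n).T

# Optimal linear estimator: a = rho * sigma_Y / sigma_X, b = mu_Y
a = rho * sigma_w / sigma_h
weights_hat = a * (heights - mu_h) + mu_w

err = weights_hat - weights
print(err.mean())                     # ~0: the estimator is unbiased
print(err.std())                      # ~13.9: matches sigma_Y * sqrt(1 - rho^2)
print(sigma_w * np.sqrt(1 - rho**2))  # 13.88...
```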

Randomly ordered notes:

  1. If \(X\) and \(Y\) are jointly Gaussian, the optimal linear estimator is also the optimal estimator.

  2. The “mean” in “MSE” is an average over the joint distribution of \(X\) and \(Y\). This differs from the conditional distribution of \(Y\) given \(X\), with respect to which our estimator is generally not the optimal linear estimator (and is biased).

    In our example, we estimated the weight to be \(98.8\)kg with an error variance of about \(13.9^2\). This doesn’t mean that if we sample random people with a height of \(190\)cm, we would get a mean weight of \(98.8\)kg and a variance smaller than \(13.9^2\). Instead, it means that if we sample random people and estimate their weight from their height using the optimal linear estimator, our error will be zero on average, with a variance of about \(13.9^2\). If we use the optimal estimator, \(13.9^2\) is an upper bound on the error variance.

  3. The statement “\(X\) and \(Y\) are not correlated” now has a concrete meaning: it means the optimal linear estimator of \(Y\) from \(X\) will be the mean of \(Y\), completely ignoring \(X\).

  4. The discussion above is “Bayesian” in the sense that it assumes prior knowledge about the distribution of \(X\) and \(Y\). In practice, we typically obtain \(n\) samples of \(X\) and \(Y\) pairs and use plug-in estimators to estimate the means, variances, and covariance, which we then use to construct a linear estimator of \(Y\) from \(X\).

    Machine learning practitioners would say: we can use the samples to train a linear regression model to predict \(Y\) from \(X\) directly. This sounds more “end-to-end,” but it actually yields exactly the same result¹ (see the sketch after these notes).

    Proof:
    Let \(x\) and \(y\) denote the vectors of samples of \(X\) and \(Y\), \(\mathbf{1}\) a vector of ones, and \(A\) the matrix whose first column is \(x\) and whose second column is \(\mathbf{1}\). Dividing both \(A^TA\) and \(A^Ty\) by \(n\) turns their entries into the plug-in moments, so the coefficients of the linear model are: \[ \begin{align*} \begin{bmatrix} \theta_{\text{slope}} \\ \theta_{\text{intercept}} \end{bmatrix} &:= \text{argmin}_\theta \| A \theta - y \|^2 \\&= \left( A ^T A \right)^{-1} A^T y \\&= \begin{bmatrix} \|x\|^2 & \mathbf{1}^Tx \\ \mathbf{1}^T x & \mathbf{1}^T \mathbf{1} \end{bmatrix} ^{-1} \begin{bmatrix} x^T y \\ \mathbf{1} ^T y \end{bmatrix} \\&= \begin{bmatrix} \sigma_X^2 + \mu_X^2 & \mu_X \\ \mu_X & 1 \end{bmatrix} ^{-1} \begin{bmatrix} \sigma_{XY} + \mu_X \mu_Y \\ \mu_Y \end{bmatrix} \\&= \frac{1}{\sigma_X ^2} \begin{bmatrix} 1 & -\mu_X \\ -\mu_X & \sigma_X^2 + \mu_X^2 \end{bmatrix} \begin{bmatrix} \sigma_{XY} + \mu_X \mu_Y \\ \mu_Y \end{bmatrix} \\&= \frac{1}{\sigma_X ^2} \begin{bmatrix} \sigma_{XY} \\ -\mu_X \sigma_{XY} + \sigma_X^2 \mu_Y \end{bmatrix} \\&= \begin{bmatrix} a \\ -\mu_X a + b \end{bmatrix}. \end{align*} \] Note also that the \(r^2\) score of this fit is equal to \(\rho^2\): \[ r^2 := 1 - \frac{\text{MSE}}{\sigma_Y^2} = 1 - \frac{\sigma_Y ^2 \left(1-\rho^2\right)}{\sigma_Y ^2} = \rho^2. \]
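
Here is a numerical check of this equivalence, using made-up data; the least-squares fit uses NumPy’s `polyfit`, and all plug-in moments use `ddof=0`, i.e. no Bessel’s correction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n) + 3.0   # made-up data

# Plug-in moments (no Bessel's correction)
mu_x, mu_y = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()      # ddof=0 by default
sigma_xy = np.cov(x, y, ddof=0)[0, 1]
rho = sigma_xy / (sigma_x * sigma_y)

# Estimator from the derivation above: Y_hat = a * (X - mu_X) + b
a = rho * sigma_y / sigma_x
b = mu_y

# Ordinary least-squares linear regression
slope, intercept = np.polyfit(x, y, deg=1)
print(np.isclose(slope, a), np.isclose(intercept, b - a * mu_x))  # True True

# The r^2 score of the fit equals rho^2
y_hat = slope * x + intercept
r2 = 1 - np.mean((y_hat - y) ** 2) / y.var()
print(np.isclose(r2, rho**2))  # True
```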

Footnotes

  1. Assuming we don’t use Bessel’s correction.