The sparse approximation algorithm no one talks about

A somewhat different introduction to greedy algorithms for the sparse approximation problem, proposing an obvious algorithm that seems to be overlooked.
Author

Tom Shlomo

Published

December 6, 2024

In the problem of sparse approximation, we are trying to approximate a given vector \(y\) as a linear combination of a few columns of a given matrix \(A\). It is useful in many applications, such as machine learning (feature selection), image processing (denoising, inpainting, deblurring, compression), and signal processing (spectral estimation, DOA estimation, compressive sensing).

In this post, we will discuss greedy algorithms for solving the sparse approximation problem, focusing on the well-known Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP) algorithms, as well as a “new” algorithm that I call the Most Obvious Matching Pursuit (MOMP).

Notation

We will consider the following flavor of the sparse approximation problem: \[\begin{equation} \begin{array}{ll} \underset{x}{\mbox{minimize}} & \| Ax - y\|^2 \\ \mbox{subject to} & \text{$x$ has at most $k$ non-zero entries}. \end{array} \label{e-opt-prob} \end{equation}\] Where

  • \(A\) is the \(m \times n\) dictionary matrix with columns \(a_1, \ldots, a_n\),

  • \(y\) is the \(m\)-vector we are trying to approximate as the linear combination of the columns of \(A\), and

  • \(x\) is the \(n\)-vector of coefficients.

When \(S\) is an ordered subset of \(1, \ldots, n\), we denote by \(A_S\) the sub-matrix of \(A\) obtained by keeping only the columns in \(S\), and by \(x_S\) the sub-vector of \(x\) obtained by keeping only entries in \(S\).

For a matrix \(M\), \(\text{Range}(M)\) denotes its column space, and \(M^H\) denotes its conjugate transpose (if you don’t care about complex data, you can think of it as the transpose).

Support Recovery

If the support of \(x\) is known and denoted by \(S\), then since \(A x = A_S x_S\), finding the optimal \(x_S\) is a simple least squares problem: \[ \underset{x_S}{\text{minimize}} \| A_S x_S - y \|^2. \] This means that in practice, the problem is to recover the support of \(x\). The following results will be useful in this context¹:

  • An optimal support \(S\) maximizes \(y^H P_{A_S} y\), where \(P_{A_S}\) is the projection matrix onto \(\text{Range}(A_S)\).

  • An optimal \(x_S\) is given by any solution to \(A_S^H A_S x_S = A_S^H y\) (aka the normal equations).

First, we rewrite \(y\) as \(y^{||} + y^{\perp}\), where \(y^{||}\) is in \(\text{Range}(A_S)\) and \(y^{\perp}\) is orthogonal to it. Since \(A_S x_S\) is in \(\text{Range}(A_S)\), we can apply the Pythagorean theorem (twice) to get: \[\begin{align} \| A_S x_S - y \|^2 &= \| \left(A_S x_S - y^{||}\right) - y^{\perp} \|^2 \\&= \| A_S x_S - y^{||} \|^2 + \| y^{\perp} \|^2 \\&= \| A_S x_S - y^{||} \|^2 + \| y \|^2 - \| y^{||} \|^2 . \end{align}\] Minimizing over \(x_S\), the first term vanishes (since \(y^{||}\) is in \(\text{Range}(A_S)\), there is by definition an \(x_S\) such that \(y^{||} = A_S x_S\)).
Thus, an optimal \(S\) maximizes \(\| y^{||} \|^2\).
Since \(P_{A_S}\) is a projection matrix, it is Hermitian and idempotent, so \[ \| y^{||} \|^2 = \| P_{A_S} y \|^2 = y^H P_{A_S}^H P_{A_S} y = y^H P_{A_S} y, \] which proves the first item.

For the second item, note that an optimal \(x_S\) satisfies \(y^{||} = A_S x_S\).
As the projection of \(y\) onto \(\text{Range}(A_S)\), \(y^{||}\) is characterized by \(A_S ^H \left(y - y^{||} \right) = 0\).
Combining the two gives \(A_S^H \left( y - A_S x_S \right) = 0\), i.e. \(A_S^H A_S x_S = A_S^H y\), which is the second result.
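
To make these two results concrete, here is a minimal numerical sanity check (a toy example of my own, with random data and an arbitrary support, not taken from the rest of the post): the score \(y^H P_{A_S} y\) should equal \(\|y\|^2\) minus the minimal value of \(\| A_S x_S - y \|^2\), and any solution of the normal equations should give the same fit as a least squares solver.

import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 50
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
S = [3, 7, 19, 21, 40]  # an arbitrary candidate support
A_S = A[:, S]

# First result: y^H P_{A_S} y == ||y||^2 - min_x ||A_S x - y||^2.
Q, _ = np.linalg.qr(A_S)
score = y.conj() @ Q @ Q.T.conj() @ y
x_ls, res, *_ = np.linalg.lstsq(A_S, y, rcond=None)
assert np.isclose(score, y.conj() @ y - res[0])

# Second result: a solution of the normal equations attains the same fit.
x_ne = np.linalg.solve(A_S.T.conj() @ A_S, A_S.T.conj() @ y)
assert np.allclose(A_S @ x_ne, A_S @ x_ls)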

Greedy Algorithms

Exactly solving the sparse approximation problem is NP-hard; it turns out there isn’t a significantly better way than brute-force checking all \(\binom{n}{k}\) possible supports. Thus we often turn to greedy algorithms², which are faster but not guaranteed to find the optimal solution. We will present three such algorithms and their Python implementations. We will rely on the following helper functions:

import numpy as np


def project_y_onto_range_A_S(A: np.ndarray, y: np.ndarray, S: list[int]) -> np.ndarray:
    """
    Returns $P_{A_S} y$
    """
    A_S = A[:, S]
    Q, _ = np.linalg.qr(A_S)
    return Q @ Q.T.conj() @ y


def score_S(A: np.ndarray, y: np.ndarray, S: list[int]) -> float:
    """
    Returns $y^H P_{A_S} y$
    """
    A_S = A[:, S]
    Q, _ = np.linalg.qr(A_S)
    z = Q.T.conj() @ y
    return np.real(z.T.conj() @ z)  # the product is mathematically real, but its dtype is complex for complex inputs

You may wonder why we use the QR decomposition to compute the projection, rather than solving the normal equations directly to get \(x_S\) and then forming \(A_S x_S\). The reason is that the normal equations are not always well-conditioned, so the computed \(x_S\) may suffer from numerical errors that propagate into the projection. But we don’t actually care about \(x_S\) here, only about the projection, and the QR decomposition computes it directly in a more numerically stable way, without the intermediate \(x_S\).
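
To see why this matters, here is a small, self-contained illustration (synthetic data with two nearly collinear columns, not part of the post’s pipeline). The Gram matrix \(A_S^H A_S\) squares the condition number of \(A_S\), so the projection computed through the normal equations can drift noticeably away from the QR-based one:

# Conditioning illustration: two nearly collinear columns.
rng_demo = np.random.default_rng(0)
m_demo = 50
a1 = rng_demo.standard_normal(m_demo)
a2 = a1 + 1e-7 * rng_demo.standard_normal(m_demo)  # almost parallel to a1
A_S = np.column_stack([a1, a2])
y_demo = rng_demo.standard_normal(m_demo)

# Projection via the normal equations: solve A_S^H A_S x_S = A_S^H y, then form A_S x_S.
x_S = np.linalg.solve(A_S.T.conj() @ A_S, A_S.T.conj() @ y_demo)
proj_normal = A_S @ x_S

# Projection via QR, as in project_y_onto_range_A_S above.
Q, _ = np.linalg.qr(A_S)
proj_qr = Q @ Q.T.conj() @ y_demo

# The discrepancy is typically orders of magnitude above machine precision.
print(np.linalg.norm(proj_normal - proj_qr))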

Matching Pursuit (MP)

MP is a simple and popular iterative algorithm for the sparse approximation problem. We start with an empty support. Then, at each iteration:

  1. Add the column \(i\) that maximizes \(y^H P_{a_i} y\) to the support.

  2. Update \(y\) by projecting it onto the space orthogonal to \(a_i\).

In Python, it looks like this:

def mp(A: np.ndarray, y: np.ndarray, k: int) -> set[int]:
    S = set()
    n = A.shape[1]
    while len(S) < k:
        i = max(range(n), key=lambda i: score_S(A, y, [i]))
        S.add(i)
        y = y - project_y_onto_range_A_S(A, y, [i])
    return S

In MP, we are always considering a support of size 1, so the normal equations are actually scalar equations: \[\begin{align*} a_i^H a_i x_i = a_i^H y \end{align*}\] If we normalize the columns of \(A\) to have unit norm, we get very simple equations for \(x_i\) and \(y^H P_{a_i} y\): \[\begin{align*} x_i &= a_i^H y, \\ y^H P_{a_i} y &= | x_i |^2. \end{align*}\] which can be utilized to make the algorithm more efficient. In Python, this can be implemented as follows:

def mp(A: np.ndarray, y: np.ndarray, k: int) -> set[int]:
    S = set()
    A = A / np.linalg.norm(A, axis=0)
    while len(S) < k:
        x = A.T.conj() @ y
        i = np.argmax(np.abs(x))
        S.add(i)
        y = y - A[:, i] * x[i]
    return S

Orthogonal Matching Pursuit (OMP)

OMP is a popular variant of MP. While in MP we project \(y\) onto the space orthogonal to the selected column in the current iteration, in OMP we project \(y\) onto the space orthogonal to all the selected columns so far:

def omp(A: np.ndarray, y: np.ndarray, k: int) -> set[int]:
    S = set()
    n = A.shape[1]
    while len(S) < k:
        i = max(range(n), key=lambda i: score_S(A, y, [i]))
        S.add(i)
        y = y - project_y_onto_range_A_S(A, y, list(S))
    return S

Like in MP, if the columns are normalized, the selected column is the one with the largest absolute inner product with \(y\). Unlike MP, we can’t reuse the inner product to compute the projection of \(y\) onto the space orthogonal to the selected columns. However, there are ways to speed up the computation of the projection by using the solution of the previous iteration, using e.g. incremental QR factorizations. I plan to discuss this in a future post.
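
For concreteness, here is a sketch of OMP with this faster selection step: with unit-norm columns, the candidate scores are just the absolute inner products with the current residual, so only the projection step needs a factorization. This is my own variant of the omp function above (without the incremental speedups mentioned), not a reference implementation:

def omp_fast_select(A: np.ndarray, y: np.ndarray, k: int) -> set[int]:
    # Same selection rule as omp above, but computed with one matrix-vector
    # product per iteration instead of calling score_S for every candidate column.
    A = A / np.linalg.norm(A, axis=0)  # unit-norm columns
    S: set[int] = set()
    r = y
    while len(S) < k:
        scores = np.abs(A.T.conj() @ r)  # |a_i^H r| for every candidate column
        scores[list(S)] = -np.inf        # never re-select a column already in S
        S.add(int(np.argmax(scores)))
        r = y - project_y_onto_range_A_S(A, y, list(S))  # residual w.r.t. all selected columns
    return S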

The Most Obvious Matching Pursuit (MOMP)

Start with an empty support. At each iteration, add the column \(i\) that maximizes \(y^H P_{A_{S \cup \left\{ i \right\}}} y\):

def momp(A: np.ndarray, y: np.ndarray, k: int) -> set[int]:
    S = set()
    n = A.shape[1]
    while len(S) < k:
        S.add(max(set(range(n)) - S, key=lambda i: score_S(A, y, list(S) + [i])))
    return S

That’s it, there are no \(y\) updates here.
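
Before comparing the selection rules in more detail, here is a quick toy usage check (random data I made up, separate from the experiment below): run all three implementations on a small instance and compare the support scores \(y^H P_{A_S} y\) they reach (higher means a better fit).

# Toy comparison of the three algorithms on a small random instance.
rng_toy = np.random.default_rng(42)
m_toy, n_toy, k_toy = 20, 60, 4
A_toy = rng_toy.standard_normal((m_toy, n_toy))
y_toy = rng_toy.standard_normal(m_toy)

for algo in [mp, omp, momp]:
    S_toy = algo(A_toy, y_toy, k_toy)
    print(algo.__name__, sorted(int(i) for i in S_toy), score_S(A_toy, y_toy, list(S_toy)))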

OMP aims to enhance MP by optimizing the coefficients of all the selected columns at each iteration. However, this optimization occurs only after a new column has been added. During the selection of the new column, OMP—like MP—assumes \(k=1\), optimizing only the coefficient of the candidate column while keeping the coefficients of the previously selected columns fixed.

In contrast, MOMP simultaneously optimizes the coefficients of all selected columns, including the new candidate.

To clarify this distinction, consider the \(j\)-th iteration:

  • MOMP adds to the support \(S\) the column \(i\) that minimizes \[ \underset{z}{\text{min}} \left\{ \| A_{S \cup \left\{ i \right\}} z - y \|^2 \right\} \]
  • Meanwhile, OMP adds to the support \(S\) the column \(i\) that minimizes \[ \underset{z}{\text{min}} \left\{ \| A_{S \cup \left\{ i \right\}} z - y \|^2 \: | \: z_S = \text{argmin}_{z'} \| A_S z' - y \|^2 \right\}, \] which imposes additional constraints.

MP and OMP are well-known and widely used algorithms. However, I have not encountered MOMP in the literature (though I’d be happy to be proven wrong). Despite this, I believe MOMP is the most natural greedy algorithm to consider.

A possible reason for MOMP’s absence in the literature could be its computational complexity. At each iteration, MOMP requires solving a least squares problem for every candidate column, whereas OMP solves it only for the newly added column.

That said, there are ways to leverage the solution from the previous iteration to expedite these computations — for example, by using incremental Cholesky or QR factorizations. I plan to explore this in a future post.
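
To give a flavor of the idea (a sketch of my own, with made-up function names, not the future post’s implementation): if an orthonormal basis \(Q\) of \(\text{Range}(A_S)\) is already available from the previous iteration, then \(y^H P_{A_{S \cup \{i\}}} y\) equals the current score plus \(|q_i^H y|^2\), where \(q_i\) is \(a_i\) orthogonalized against \(Q\) and normalized. This lets all candidate scores be updated with a couple of matrix products instead of a fresh factorization per candidate:

def momp_scores_incremental(A: np.ndarray, y: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """
    Scores y^H P_{A_{S ∪ {i}}} y for every candidate column i, given an
    orthonormal basis Q of Range(A_S) (e.g. the Q factor from a previous
    iteration). Sketch only: columns that are (nearly) inside Range(A_S)
    would need special care to avoid dividing by ~0.
    """
    base = np.linalg.norm(Q.T.conj() @ y) ** 2  # current score y^H P_{A_S} y
    A_perp = A - Q @ (Q.T.conj() @ A)           # orthogonalize every column against Q
    norms = np.linalg.norm(A_perp, axis=0)
    gains = np.abs(A_perp.T.conj() @ y) ** 2 / norms**2  # |q_i^H y|^2 per candidate
    return base + gains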

Numerical Experiment

In this section, we compare the performance of the three algorithms on the classic problem of decomposing a signal into a sum of sinusoids.

For each value of \(k\), we generate a signal of 25 samples, composed of \(k\) sinusoids with random frequencies, phases, and amplitudes. Noise is added to the signal, and the three algorithms are then used to approximate it. We evaluate their performance by measuring the root mean squared error (RMSE) of the approximation over 1000 repetitions.

from itertools import product
from tqdm import tqdm
import pandas as pd

rng = np.random.default_rng(0)


def generate_data(m: int, n: int, k: int) -> tuple[np.ndarray, np.ndarray]:
    t = np.arange(m)
    f = np.arange(n) / n
    A = np.exp(2j * np.pi * np.outer(t, f)) / np.sqrt(m)
    S = rng.choice(n, size=k, replace=False)
    x = (rng.standard_normal(k) + 1j * rng.standard_normal(k)) / np.sqrt(2 * k)
    y = A[:, S] @ x + 0.1 * (
        rng.standard_normal(m) + 1j * rng.standard_normal(m)
    ) / np.sqrt(2)
    return A, y


m = 25
n = 200
results = []
for trial, k in tqdm(product(range(1000), [10, 15, 20]), total=3000):
    A, y = generate_data(m, n, k)
    for algo in [mp, omp, momp]:
        S = algo(A, y, k)
        y_est = project_y_onto_range_A_S(A, y, list(S))
        rmse = np.linalg.norm(y - y_est) / np.sqrt(m)
        results.append((trial, k, algo.__name__, rmse))


df = pd.DataFrame(results, columns=["trial", "k", "algo", "rmse"])
100%|██████████| 3000/3000 [07:39<00:00,  6.52it/s]

We now plot the distribution of the RMSE for each algorithm, for each value of \(k\).

import plotly.express as px
import plotly.io as pio

pio.renderers.default = "notebook"

px.ecdf(
    df,
    x="rmse",
    color="algo",
    facet_row="k",
    title="ECDF of the sparse approximation RMSE",
    height=700,
)

There is a clear pattern: MP performs the worst, MOMP performs the best, and OMP falls in between. The differences between the algorithms become more pronounced as \(k\) increases.

Conclusion

In this post, we explored greedy algorithms for tackling the sparse approximation problem. We introduced a scoring method to evaluate the quality of a proposed support, which naturally led to the development of a new greedy algorithm, MOMP. This algorithm is arguably simpler to understand than both MP and OMP and demonstrated superior performance in the example presented.

Theoretical results exist for MP and OMP, such as exact support recovery guarantees under specific conditions. In the future, I plan to investigate whether similar results can be extended to MOMP and to discuss strategies for accelerating its computation.

Footnotes

  1. While these are standard results for least squares problems, I find that they are not always presented in a clear way. I like proving such results without using gradients (which obscure the intuition, in my opinion) and without inverting matrices (which adds unnecessary caveats about rank and conditioning).↩︎

  2. Convex relaxation (e.g. Basis Pursuit, LASSO) is another approach, but it is not the focus of this post.↩︎