Augmentation is Regularization

On the equivalence of training data augmentation and quadratic regularization for linear models - a very useful (but not well known) result.
Author

Tom Shlomo

Published

January 15, 2024

Training data augmentation enhances the training dataset by applying transformations to existing training instances. The specific transformations vary depending on the type of data involved, and this flexibility lets us leverage domain knowledge, such as known invariances, effectively. The goal is to introduce variability and increase the diversity of the training set, allowing the model to generalize better to unseen data and exhibit improved robustness. Despite these advantages, training data augmentation carries an inherent computational cost: the increased volume of data requires additional computational resources, impacting both training time and memory requirements.

As we will show below, for linear models with the sum-of-squares loss, training data augmentation is equivalent to adding a quadratic regularization term, which implies that the computational cost of fitting a model to an augmented dataset is the same as using no augmentation at all!

This link between augmentation and regularization is useful in the other direction as well: it gives a concrete interpretation to the values of regularization hyperparameters, which can be used to avoid costly hyperparameter tuning (np.logspace(-6, 6, 100) much?) and to design regularizers that are more appropriate to the data than the simple ones (i.e. the sum-of-squares regularization used in ridge regression).

Notation

Suppose we have a training dataset comprising \(n\) pairs \(x_i,\,y_i\) for \(i=0, \dots, n-1\), where \(x_i\) is the \(d\)-dimensional feature vector of the \(i\)’th training example, and \(y_i\) is the corresponding label. Here we will assume \(y_i \in \mathrm{R}\); however, our results can be easily extended to the vector-labeled (aka multi-output) case. We will also denote by \(X\) the \(n\)-by-\(d\) matrix with rows \(x_0^T, \dots, x_{n-1}^T\) and by \(y\) the \(n\)-vector with entries \(y_0, \dots, y_{n-1}\).

Let \(a:\mathrm{R}^d \times \mathcal{P} \mapsto \mathrm{R}^d\) denote the augmentation function that, given the augmentation parameters \(p \in \mathcal{P}\), maps a feature vector \(x\) to a transformed feature vector. The augmentation parameters \(p\) are usually sampled randomly from a given distribution. For example, for image data, \(a\) is often a composition of small shifts, rotations, brightness changes, etc., while \(p\) specifies the amount of shifting, rotation, and brightness change.
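To make the notation concrete, here is a toy augmentation function for tabular data (a throwaway illustration; the jitter scales below are arbitrary and not used anywhere else in the post):

import numpy as np

rng = np.random.default_rng(0)


def a(x: np.ndarray, p: np.ndarray) -> np.ndarray:
    # A toy augmentation: additive feature-wise jitter, with p playing the
    # role of the randomly sampled augmentation parameters.
    return x + p


x = np.array([3.0, 2080.0])  # e.g. [bedrooms, sqft_living]
p = rng.normal(scale=[0.25, 100.0])  # sample augmentation parameters
x_augmented = a(x, p)  # one augmented copy of x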

Ordinary least squares (OLS)

Let’s quickly go over OLS so we can compare its equations with the augmented version we will derive next.
To fit an OLS model, we find a vector of coefficients \(\theta_\text{OLS}\) that minimizes the sum of squared training errors: \[\begin{align*} \theta_\text{OLS} &:= \text{argmin} _\theta \sum_{i=0} ^{n-1} \left( x_i ^T \theta - y_i \right) ^2 \\&= \text{argmin} _\theta \| X \theta - y \|^2 \tag{1} \end{align*}\] To solve the optimization problem (1), we solve the equation \(X^TX \theta_\text{OLS} = X^T y\), which has time complexity \(O(n d^2)\).
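In code, solving the normal equations is a couple of lines of NumPy. A minimal sketch with synthetic data (in practice np.linalg.lstsq or a QR-based solver is numerically safer):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Solve X^T X theta = X^T y; forming X^T X dominates the cost at O(n d^2).
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)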

Augmented least squares

We will now fit a model by finding coefficients \(\theta_\text{ALS}\) that minimize the expected error over the augmented training dataset: \[\begin{align*} \theta_\text{ALS} &:= \text{argmin} _\theta \mathrm{E} \left[ \sum_{i=0} ^{n-1} \left( a\left(x_i, p_i\right) ^T \theta - y_i \right) ^2 \right], \tag{2} \end{align*}\] where the expectation is over \(p_0,\dots, p_{n-1}\), the random augmentation parameters. As we will see below, \(\theta_\text{ALS}\) depends on \(a\) and the distribution of \(p\) only through the 2nd order moments, which we denote by \[\begin{align*} \mu_i &:= \mathrm{E} \left[a(x_i, p_i) \right]\\ R_i &:= \mathrm{C}\text{ov} \left[ a(x_i, p_i) \right]. \end{align*}\]

Continuing from (2), we use the standard trick of subtracting and adding the mean: \[\begin{align*} \theta_\text{ALS} &= \text{argmin} _\theta \mathrm{E} \left[ \sum_{i=0} ^{n-1} \left( \left( \mu_i^T \theta - y_i \right) + \left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta \right) ^2 \right] \end{align*}\] Note that the first term \(\mu_i^T \theta - y_i\) is deterministic, while the second term \(\left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta\) has zero mean, so the cross term vanishes in expectation. Therefore \[\begin{align*} \theta_\text{ALS} &= \text{argmin} _\theta \sum_{i=0} ^{n-1} \left( \mu_i^T \theta - y_i \right)^2 + \mathrm{E} \left[\left( \left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta \right)^2 \right] \\ &= \text{argmin} _\theta \| M \theta - y\|^2 + \theta ^T R \theta, \tag{3} \end{align*}\] where \(M\) is the \(n\)-by-\(d\) matrix whose rows are \(\mu_0^T, \dots, \mu_{n-1}^T\), and \[ R := \sum_{i=0} ^{n-1} R_i \] (the last step uses \(\mathrm{E}\left[\left(\left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta\right)^2\right] = \theta^T R_i \theta\)). Equation (3) shows exactly what we set out to prove: fitting a model on an augmented training dataset is equivalent to fitting a non-augmented, but quadratically regularized, least squares model. We just replace each feature vector with its mean under the augmentation, and use the sum of all the covariances as the regularization matrix.

To solve the optimization problem (3), we solve the equation \((M^T M + R) \theta_\text{ALS} = M^T y\), which has the same \(O(n d^2)\) complexity as OLS.
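Continuing the synthetic sketch from above: once \(M\) and \(R\) are in hand, the extra work over plain OLS is negligible. Here, purely for illustration, the augmentation is additive i.i.d. noise with standard deviation 0.1, so \(M = X\) and \(R = n \cdot 0.1^2 \, I\).

# Rows of M are the post-augmentation means mu_i; R is the sum of the covariances.
M = X
R = n * 0.1**2 * np.eye(d)
theta_als = np.linalg.solve(M.T @ M + R, M.T @ y)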

Ridge regression

Ridge regression (aka Tikhonov regularization) has the form (3) with \(M=X\) and \(R=\lambda I\). As an augmentation, it can be interpreted as follows: perturb each feature vector by zero-mean noise with variance \(\lambda/n\), uncorrelated across features.
This interpretation of \(\lambda\) can be used to set it (at least roughly): just think about what level of perturbation \(\sigma\) is reasonable for your features, and set \(\lambda = n \sigma^2\).
This also shows that when different features are scaled differently, ridge regression is perhaps not the best fit. A standard deviation of 100 might be reasonable for a feature with values on the order of millions, but it is probably not suitable for a feature with values on the order of 1. In these cases, we may use a diagonal \(R\): \[\begin{align*} R= n \, \text{diag} \left( \sigma_0^2, \dots, \sigma_{d-1}^2 \right) \end{align*}\] where \(\sigma_i\) is the standard deviation of the perturbation of feature \(i\).
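A minimal sketch of this per-feature variant, reusing the synthetic \(X\), \(y\) from above (the perturbation scales are made up for illustration):

# One perturbation standard deviation per feature:
sigmas = np.array([0.1, 5.0, 0.01, 0.5, 100.0])
R = n * np.diag(sigmas**2)
theta = np.linalg.solve(X.T @ X + R, X.T @ y)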

Another option is to scale the features before fitting, e.g. using sklearn’s StandardScaler. With all features scaled to unit variance, setting \(\lambda = n \, 10 ^{-6}\) is a sensible rule of thumb, as it is often reasonable to assume a \(0.1\%\) perturbation.
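With sklearn, this rule of thumb could look like the sketch below (Ridge’s alpha multiplies the sum-of-squares penalty, so it plays the role of \(\lambda\) here); the synthetic \(X\), \(y\) from above stand in for real data:

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(
    [
        ("scale", StandardScaler()),
        ("ridge", Ridge(alpha=X.shape[0] * 1e-6)),
    ]
).fit(X, y)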

Note that the model often includes an intercept (aka constant) term by adding a column of ones to \(X\). Since this column remains unchanged under any augmenting transformation, the corresponding row and column of \(R\) should be all zeros.
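For example, if the intercept is handled by appending a ones column to the synthetic \(X\) above, the matching row and column of the diagonal \(R\) from the previous sketch simply stay zero:

X1 = np.hstack([X, np.ones((n, 1))])  # append the intercept column
R1 = np.pad(R, ((0, 1), (0, 1)))  # extend R with a zero row and column for the intercept
theta1 = np.linalg.solve(X1.T @ X1 + R1, X1.T @ y)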

Example

For the example we are gonna use the House Sales in King County, USA dataset. Each row describes a house sold between May 2014 and May 2015. Our goal will be to predict the log price given features like the number of rooms, area, and location.

Note: several decisions outlined below weren’t necessarily the most effective; rather, they were chosen to showcase different modelling techniques in the context of augmentation via regularization.

Let’s begin by importing everything we will need, loading our data, and adding some columns.

from typing import Callable, Hashable, Self

import numpy as np
from numpy.typing import NDArray
import pandas as pd
import scipy.linalg
import scipy.sparse
import scipy.spatial
from sklearn.base import BaseEstimator
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Array = NDArray[np.float64]

df = pd.read_csv("data/kc_house_data.csv.zip", parse_dates=["date"])
df["long_scaled"] = df["long"] * np.mean(
    np.abs(np.cos(df["lat"] * np.pi / 180))
)  # earth-curvature correction for (approximate) distance calculations
df.describe()
(output: df.describe() summary statistics table, 8 rows × 22 columns)

We will want a polynomial (rather than linear) dependency on the age of the house:

df["age"] = df["date"].dt.year - df["yr_built"]
age_cols = ["age"]
for power in range(2, 5):
    col = f"age ^ {power}"
    df[col] = df["age"] ** power
    age_cols.append(col)

We do a 10-90 train-test split to demonstrate the effectiveness of augmentation when we have little data.

y = np.log(df["price"])
X = df.drop(columns=["price"])
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.9, random_state=42
)
print(f"{x_train.shape=}, {x_test.shape=}")
x_train.shape=(2161, 25), x_test.shape=(19452, 25)

There is no reason to expect a linear relationship between a house’s geographical coordinates and its price.
However, we do expect a strong dependency between price and location, in the sense that houses with similar features should have similar prices when they are in close geographical proximity.
One way to model this is to cluster the data geographically, and tag each house with the cluster it belongs to using one hot encoding:

# We need this class mainly because the transform method of sklearn's KMeans returns distances to the cluster centers, and we want a one-hot encoding of the assigned cluster.
class OneHotEncodedKMeansTransformer:
    def __init__(self, k: int, columns: list[str], name: str) -> None:
        self.columns = columns
        self.k = k
        self.name = name

    def fit(self, X: pd.DataFrame) -> Self:
        self.kmeans_ = KMeans(n_clusters=self.k, n_init="auto", random_state=42)
        self.kmeans_.fit(X[self.columns])
        return self

    def column_names(self) -> list[str]:
        return [f"{self.name}_{i}" for i in range(self.k)]

    def transform(self, X: pd.DataFrame):
        cluster_index = self.kmeans_.predict(X[self.columns])
        return pd.concat(
            [
                X,
                pd.DataFrame(
                    np.eye(self.k)[cluster_index],
                    columns=self.column_names(),
                    index=X.index,
                ),
            ],
            axis=1,
        )

    def clusters_adjacency_matrix(self):
        edges = np.array(
            scipy.spatial.Voronoi(self.kmeans_.cluster_centers_).ridge_points
        ).T
        a = scipy.sparse.coo_matrix(
            (np.ones(edges.shape[1]), (edges[0], edges[1])),
            shape=(self.k, self.k),
        )
        return a + a.T


kmeans_transformer = OneHotEncodedKMeansTransformer(
    k=500,
    columns=["lat", "long_scaled"],
    name="geo_cluster",
)
x_train = kmeans_transformer.fit(x_train).transform(x_train)
x_test = kmeans_transformer.transform(x_test)

We will evaluate our models by their R squared score. From a quick glance over Kaggle, it seems that sophisticated models (e.g. XGBoost) can achieve a score of about 0.9. Let’s see if we can get there using a linear model.

def evaluate_model(model) -> None:
    y_train_pred = model.fit(x_train, y_train).predict(x_train)
    r2_train = r2_score(y_train, y_train_pred)
    y_test_pred = model.predict(x_test)
    r2_test = r2_score(y_test, y_test_pred)
    print(f"{r2_train=:.3f}, {r2_test=:.3f}")

Let’s start with a vanilla linear model, without any regularization/augmentations.

columns = (
    [
        "bedrooms",
        "bathrooms",
        "floors",
        "waterfront",
        "view",
        "condition",
        "grade",
        "sqft_living",
        "sqft_lot",
        "sqft_above",
        "sqft_basement",
        "sqft_lot15",
        "sqft_living15",
    ]
    + age_cols
    + kmeans_transformer.column_names()
)
columns_selector = ColumnTransformer(
    [("selector", "passthrough", columns)],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

simple_linear = Pipeline(
    [
        ("selector", columns_selector),
        ("linear", LinearRegression(fit_intercept=False)),
    ]
)
evaluate_model(simple_linear)
r2_train=0.918, r2_test=0.862

Not bad, but we do have some overfitting. Let’s see if we can improve generalization with regularization/augmentation. First we try ridge regression:

ridge = Pipeline(
    [
        ("selector", columns_selector),
        ("scale", StandardScaler()),
        (
            "linear",
            RidgeCV(
                fit_intercept=True,
                alphas=x_train.shape[0] * np.logspace(-9, -2, 100),
            ),
        ),
    ]
)
evaluate_model(ridge)
r2_train=0.918, r2_test=0.863

That didn’t really help, which makes sense: a diagonal regularization matrix is a poor match for our correlated features.
Let’s see if we can do better by using augmentations that are more appropriate for our data.

First let’s build a class for linear models with augmentation via regularization.
We will pass to the constructor a callable that takes the input features and returns their mean and (sum of) covariance after augmentation, since, as we showed above, these moments are all we need from the augmentations.
Since the transformations of different features are often uncorrelated, it is convenient to specify features in groups and assume zero covariance between features that are not in the same group (i.e. a block-diagonal covariance matrix).

class AugmentedLinearModel:
    def __init__(
        self,
        augmentation_moments: list[  # one item for each group of features
            tuple[
                list[Hashable],  # column names of features in the group
                Callable[[pd.DataFrame], tuple[Array, Array]],  # maps X to M and R
            ]
        ],
    ) -> None:
        self.augmentation_moments = augmentation_moments

    def fit(self, X: pd.DataFrame, y: pd.Series) -> Self:
        # Compute the post-augmentation means and (summed) covariances for each
        # feature group, then assemble the full M matrix and block-diagonal R.
        means, covs = zip(
            *(
                moments(X.loc[:, columns])
                for columns, moments in self.augmentation_moments
            )
        )
        M = np.hstack(means)
        # https://scikit-learn.org/stable/developers/develop.html#estimated-attributes
        self.R_ = scipy.linalg.block_diag(*covs)
        # Regularized normal equations: (M^T M + R) theta = M^T y
        self.theta_ = np.linalg.solve(M.T @ M + self.R_, M.T @ y)
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        cols = [col for cols, _ in self.augmentation_moments for col in cols]
        return X.loc[:, cols] @ self.theta_

Here are the augmentations we are gonna use:
With 10% probability, half a bathroom is re-counted as an additional bedroom (applied only to houses with at least one bathroom).

def bedrooms_bathrooms_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    p = 0.1
    v = np.array([1, -0.5])  # the jump: +1 bedroom, -0.5 bathroom
    mask = X[["bathrooms"]] >= 1  # only houses with at least one bathroom
    M = np.where(mask, X + p * v, X)  # mean of a Bernoulli(p) jump along v
    R = p * (1 - p) * np.outer(v, v) * mask.values.sum()  # covariance, summed over affected rows
    return M, R


augmentation_moments = [(["bedrooms", "bathrooms"], bedrooms_bathrooms_moments)]

A 5% relative perturbation for the features sqft_living, sqft_lot, sqft_above, sqft_basement, sqft_lot15, and sqft_living15, uncorrelated across the features.

def relative_perturbation_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    # x -> x * (1 + eps) with E[eps] = 0 and std(eps) = 5%: the mean is x itself,
    # and the summed variance over all samples is 0.05^2 * sum_i x_i^2.
    return X.values, np.sum(X.values**2) * 0.05**2


augmentation_moments.extend(
    ([column], relative_perturbation_moments)
    for column in [
        "sqft_living",
        "sqft_lot",
        "sqft_above",
        "sqft_basement",
        "sqft_lot15",
        "sqft_living15",
    ]
)

A perturbation with standard deviation 0.01 for the features floors, waterfront, view, condition, and grade, uncorrelated across the features.

def absolute_perturbation_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    # x -> x + eps with E[eps] = 0 and std(eps) = 0.01: the mean is x itself,
    # and the summed variance over all n samples is n * 0.01^2.
    return X.values, X.shape[0] * 0.01**2


augmentation_moments.extend(
    ([column], absolute_perturbation_moments)
    for column in ["floors", "waterfront", "view", "condition", "grade"]
)

A perturbation of age, uniformly distributed between -1 and 1. We need to calculate the moments of the powers of age accordingly.

def age_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    # The augmented age is U ~ Uniform[age - 1, age + 1]; the columns of X are
    # age^1, ..., age^max_power, so we need E[U^k] for k = 1, ..., 2 * max_power.
    a = X[["age"]].values - 1
    b = X[["age"]].values + 1
    max_power = X.shape[1]
    np1 = np.arange(2, 2 * max_power + 2)  # k + 1 for k = 1, ..., 2 * max_power
    # https://en.wikipedia.org/wiki/Continuous_uniform_distribution#Moments
    mu = (b**np1 - a**np1) / (np1 * (b - a))
    mu_sum = mu[:, 1:].sum(axis=0)  # sum over samples of E[U^k], k = 2, ..., 2 * max_power
    mu = mu[:, :max_power]  # E[U^k] for k = 1, ..., max_power: the rows of M
    idx = np.add.outer(np.arange(max_power), np.arange(max_power))
    # Cov(U^p, U^q) = E[U^(p+q)] - E[U^p] E[U^q], summed over all samples
    c = mu_sum[idx] - mu.T @ mu
    return mu, c


augmentation_moments.append((age_cols, age_moments))

And finally, with probability 50%, the geo cluster is changed to one of its neighbors.

def geo_cluster_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    p = 0.5
    adj_mat = kmeans_transformer.clusters_adjacency_matrix()
    P = scipy.sparse.eye(adj_mat.shape[0]) * p + (adj_mat / adj_mat.sum(axis=1)) * (
        1 - p
    )  # transition probabilities matrix
    M = scipy.sparse.csr_array(X.values) @ P
    # https://en.wikipedia.org/wiki/Multinomial_distribution#Matrix_notation
    R = scipy.sparse.diags(M.sum(axis=0)) - M.T @ M
    return M.toarray(), R.toarray()


augmentation_moments.append((kmeans_transformer.column_names(), geo_cluster_moments))

Let’s fit the augmented model and see how we did:

augmented_linear = AugmentedLinearModel(augmentation_moments)
evaluate_model(augmented_linear)
r2_train=0.903, r2_test=0.882

We managed to improve the test accuracy and reduce the overfitting.

Beyond least squares

Is it possible to extend the result to models that use a non-quadratic loss (e.g. logistic regression)? Well, the proof relies heavily on the loss being quadratic, so probably not, but let’s see if we can at least get an approximate result using a second-order Taylor approximation of the loss.

The goal is to (approximately) express \[\begin{align*} \mathrm{E} \left[ \sum_{i=0} ^{n-1} l \left( a\left(x_i, p_i\right) ^T \theta \,;\, y_i \right) \right], \end{align*}\] as a sum of a non-augmented loss term and a regularization term. Here, \(l(\hat{y}\,;\,y)\) is the loss: it measures how bad the prediction \(\hat{y}\) is, given the true value \(y\).
For example, for logistic regression we use the logistic loss \[ l(\hat{y}; y) = \log \left( 1 + \exp \left(-y \, \hat{y} \right) \right) \] (with \(y \in \{ -1, 1 \}\)).
Let’s expand \(l \left( a\left(x_i, p_i\right) ^T \theta\,;\,y_i \right)\) to second order around \(\mu_i ^T \theta\), take the expectation, and simplify: \[\begin{align*} \mathrm{E} \left[ \sum_{i=0} ^{n-1} l \left( a\left(x_i, p_i\right) ^T \theta \,;\, y_i \right) \right] \approx \sum_{i=0} ^{n-1} l(\mu_i ^T \theta\,;\,y_i) + \frac{1}{2} l'' \left( \mu_i ^T \theta\,;\,y_i \right) \theta^T R_i \theta \end{align*}\] (the order-1 term vanishes since \(a\left(x_i, p_i\right) - \mu_i\) has zero mean, similar to the delta method, and the expectation of the quadratic term uses \(\mathrm{E}\left[\left(\left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta\right)^2\right] = \theta^T R_i \theta\)).
So, as in the least squares case, in the loss term we just replace each \(x_i\) with its mean. But the regularization term is no longer quadratic, since the second-derivative factor is not constant (unless the loss is quadratic…).

I use this result to tell myself that it is ok to select an \(R\) for a quadratic regularization based on the covariance of an augmentation, as long as the covariance is small (usually correct for augmentations), and \(l''\) is bounded (correct for logistic regression).
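For the logistic loss specifically, a quick computation confirms that \(l''\) is indeed uniformly bounded: \[\begin{align*} l'(\hat{y}\,;\,y) &= -y \, \sigma\left(-y \, \hat{y}\right), \\ l''(\hat{y}\,;\,y) &= \sigma\left(y \, \hat{y}\right) \left( 1 - \sigma\left(y \, \hat{y}\right) \right) \le \tfrac{1}{4}, \end{align*}\] where \(\sigma(t) = 1/(1+e^{-t})\) is the logistic sigmoid, and we used \(y^2 = 1\) since \(y \in \{-1, 1\}\).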