Augmentation is Regularization

On the equivalence of training data augmentation and quadratic regularization for linear models - a very useful (but not well known) result.
Author

Tom Shlomo

Published

January 15, 2024

Training data augmentation enhances the training dataset by applying transformations to existing training instances. The specific transformations vary depending on the type of data involved, and this flexibility lets us leverage domain knowledge, such as known invariances, effectively. The goal is to introduce variability and increase the diversity of the training set, allowing the model to generalize better to unseen data and exhibit improved robustness. Despite these advantages, training data augmentation carries an inherent computational cost: the increased volume of data requires additional computational resources, impacting both training time and memory requirements.

As we will show below, for linear models with the sum-of-squares loss, training data augmentation is equivalent to adding a quadratic regularization term, which implies that the computational cost of fitting a model to an augmented dataset is the same as using no augmentation at all!

This link between augmentation and regularization is useful in the other direction as well: it gives a concrete interpretation to the values of regularization hyperparameters, which can be used to avoid costly hyperparameter tuning (np.logspace(-6, 6, 100) much?) and to design regularizers that are more appropriate to the data than the simple ones (i.e. the sum-of-squares regularization used in ridge regression).

Notation

Suppose we have a training dataset comprising \(n\) pairs \(x_i,\,y_i\) for \(i=0, \dots, n-1\), where \(x_i\) is the \(d\)-dimensional feature vector of the \(i\)’th training example, and \(y_i\) is the corresponding label. Here we will assume \(y_i \in \mathrm{R}\); however, our results can be easily extended to the vector-labeled (aka multi-output) case. We will also denote by \(X\) the \(n\)-by-\(d\) matrix with rows \(x_0^T, \dots, x_{n-1}^T\) and by \(y\) the \(n\)-vector with entries \(y_0, \dots, y_{n-1}\).

Let \(a:\mathrm{R}^d \times \mathcal{P} \mapsto \mathrm{R}^d\) denote the augmentation function that, given the augmentation parameters \(p \in \mathcal{P}\), maps a feature vector \(x\) to a transformed feature vector. The augmentation parameters \(p\) are usually sampled randomly from a given distribution. For example, for image data, \(a\) is often a composition of small shifts, rotations, brightness changes, etc., while \(p\) specifies the amount of shifting, rotation, and brightness change.
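To make the notation concrete, here is a toy augmentation function for tabular data (a throwaway illustration; the jitter scales below are arbitrary and not used anywhere else in the post):

import numpy as np

rng = np.random.default_rng(0)


def a(x: np.ndarray, p: np.ndarray) -> np.ndarray:
    # A toy augmentation: additive feature-wise jitter, with p playing the
    # role of the randomly sampled augmentation parameters.
    return x + p


x = np.array([3.0, 2080.0])  # e.g. [bedrooms, sqft_living]
p = rng.normal(scale=[0.25, 100.0])  # sample augmentation parameters
x_augmented = a(x, p)  # one augmented copy of x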

Ordinary least squares (OLS)

Let’s quickly go over OLS so we can compare its equations with the augmented version we will derive next.
To fit an OLS model, we find a vector of coefficients \(\theta_\text{OLS}\) that minimizes the sum of squared training errors: \[\begin{align*} \theta_\text{OLS} &:= \text{argmin} _\theta \sum_{i=0} ^{n-1} \left( x_i ^T \theta - y_i \right) ^2 \\&= \text{argmin} _\theta \| X \theta - y \|^2 \tag{1} \end{align*}\] To solve the optimization problem (1), we solve the equation \(X^TX \theta_\text{OLS} = X^T y\), which has time complexity \(O(n d^2)\).
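In code, solving the normal equations is a couple of lines of NumPy. A minimal sketch with synthetic data (in practice np.linalg.lstsq or a QR-based solver is numerically safer):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Solve X^T X theta = X^T y; forming X^T X dominates the cost at O(n d^2).
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)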

Augmented least squares

We will now fit a model by finding coefficients \(\theta_\text{ALS}\) that minimize the expected error over the augmented training dataset: \[\begin{align*} \theta_\text{ALS} &:= \text{argmin} _\theta \mathrm{E} \left[ \sum_{i=0} ^{n-1} \left( a\left(x_i, p_i\right) ^T \theta - y_i \right) ^2 \right], \tag{2} \end{align*}\] where the expectation is over \(p_0,\dots, p_{n-1}\), the random augmentation parameters. As we will see below, \(\theta_\text{ALS}\) depends on \(a\) and the distribution of \(p\) only through the 2nd order moments, which we denote by \[\begin{align*} \mu_i &:= \mathrm{E} \left[a(x_i, p_i) \right]\\ R_i &:= \mathrm{C}\text{ov} \left[ a(x_i, p_i) \right]. \end{align*}\]

Continuing from (2), we use the standard trick of subtracting and adding the mean: \[\begin{align*} \theta_\text{ALS} &= \text{argmin} _\theta \mathrm{E} \left[ \sum_{i=0} ^{n-1} \left( \left( \mu_i^T \theta - y_i \right) + \left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta \right) ^2 \right] \end{align*}\] Note that the first term \(\mu_i^T \theta - y_i\) is deterministic, while the second term \(\left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta\) has zero mean, so the cross term vanishes in expectation. Therefore \[\begin{align*} \theta_\text{ALS} &= \text{argmin} _\theta \sum_{i=0} ^{n-1} \left( \mu_i^T \theta - y_i \right)^2 + \mathrm{E} \left[\left( \left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta \right)^2 \right] \\ &= \text{argmin} _\theta \| M \theta - y\|^2 + \theta ^T R \theta, \tag{3} \end{align*}\] where \(M\) is the \(n\)-by-\(d\) matrix whose rows are \(\mu_0^T, \dots, \mu_{n-1}^T\), and \[ R := \sum_{i=0} ^{n-1} R_i \] (the last step uses \(\mathrm{E}\left[\left(\left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta\right)^2\right] = \theta^T R_i \theta\)). Equation (3) shows exactly what we set out to prove: fitting a model on an augmented training dataset is equivalent to fitting a non-augmented, but quadratically regularized, least squares model. We just replace each feature vector with its mean under the augmentation, and use the sum of all the covariances as the regularization matrix.

To solve the optimization problem (3), we solve the equation \((M^T M + R) \theta_\text{ALS} = M^T y\), which has the same \(O(n d^2)\) complexity as OLS.
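Continuing the synthetic sketch from above: once \(M\) and \(R\) are in hand, the extra work over plain OLS is negligible. Here, purely for illustration, the augmentation is additive i.i.d. noise with standard deviation 0.1, so \(M = X\) and \(R = n \cdot 0.1^2 \, I\).

# Rows of M are the post-augmentation means mu_i; R is the sum of the covariances.
M = X
R = n * 0.1**2 * np.eye(d)
theta_als = np.linalg.solve(M.T @ M + R, M.T @ y)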

Ridge regression

Ridge regression (aka Tikhonov regularization) has the form (3) with \(M=X\) and \(R=\lambda I\). As an augmentation, it can be interpreted as follows: perturb each feature vector by zero-mean noise with variance \(\lambda/n\), uncorrelated across features.
This interpretation of \(\lambda\) can be used to set it (at least roughly): just think about what level of perturbation \(\sigma\) is reasonable for your features, and set \(\lambda = n \sigma^2\).
This also shows that when different features are scaled differently, ridge regression is perhaps not the best fit. A standard deviation of 100 might be reasonable for a feature with values on the order of millions, but it is probably not suitable for a feature with values on the order of 1. In these cases, we may use a diagonal \(R\): \[\begin{align*} R= n \, \text{diag} \left( \sigma_0^2, \dots, \sigma_{d-1}^2 \right) \end{align*}\] where \(\sigma_i\) is the standard deviation of the perturbation of feature \(i\).
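A minimal sketch of this per-feature variant, reusing the synthetic \(X\), \(y\) from above (the perturbation scales are made up for illustration):

# One perturbation standard deviation per feature:
sigmas = np.array([0.1, 5.0, 0.01, 0.5, 100.0])
R = n * np.diag(sigmas**2)
theta = np.linalg.solve(X.T @ X + R, X.T @ y)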

Another option is to scale the features before fitting, e.g. using sklearn’s StandardScaler. With all features scaled to unit variance, setting \(\lambda = n \, 10 ^{-6}\) is a sensible rule of thumb, as it is often reasonable to assume a \(0.1\%\) perturbation.
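With sklearn, this rule of thumb could look like the sketch below (Ridge’s alpha multiplies the sum-of-squares penalty, so it plays the role of \(\lambda\) here); the synthetic \(X\), \(y\) from above stand in for real data:

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(
    [
        ("scale", StandardScaler()),
        ("ridge", Ridge(alpha=X.shape[0] * 1e-6)),
    ]
).fit(X, y)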

Note that the model often includes an intercept (aka constant) term by adding a column of ones to \(X\). Since this column remains unchanged under any augmenting transformation, the corresponding row and column of \(R\) should be all zeros.
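For example, if the intercept is handled by appending a ones column to the synthetic \(X\) above, the matching row and column of the diagonal \(R\) from the previous sketch simply stay zero:

X1 = np.hstack([X, np.ones((n, 1))])  # append the intercept column
R1 = np.pad(R, ((0, 1), (0, 1)))  # extend R with a zero row and column for the intercept
theta1 = np.linalg.solve(X1.T @ X1 + R1, X1.T @ y)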

Example

For the example we are gonna use the House Sales in King County, USA dataset. Each row describes a house sold between May 2014 and May 2015. Our goal will be to predict the log price given features like the number of rooms, area, and location.

Note: several decisions outlined below weren’t necessarily the most effective; rather, they were chosen to showcase different modelling techniques in the context of augmentation via regularization.

Let’s begin by importing everything we will need, loading our data, and adding some columns.

from typing import Callable, Hashable, Self

import numpy as np
from numpy.typing import NDArray
import pandas as pd
import scipy.linalg
import scipy.sparse
import scipy.spatial
from sklearn.base import BaseEstimator
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Array = NDArray[np.float64]

df = pd.read_csv("data/kc_house_data.csv.zip", parse_dates=["date"])
df["long_scaled"] = df["long"] * np.mean(
    np.abs(np.cos(df["lat"] * np.pi / 180))
)  # earth-curvature correction for (approximate) distance calculations
df.describe()
(output: df.describe() summary statistics table, 8 rows × 22 columns)

We will want a polynomial (rather than linear) dependency on the age of the house:

df["age"] = df["date"].dt.year - df["yr_built"]
age_cols = ["age"]
for power in range(2, 5):
    col = f"age ^ {power}"
    df[col] = df["age"] ** power
    age_cols.append(col)

We do a 10-90 train-test split to demonstrate the effectiveness of augmentation when we have little data.

y = np.log(df["price"])
X = df.drop(columns=["price"])
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.9, random_state=42
)
print(f"{x_train.shape=}, {x_test.shape=}")
x_train.shape=(2161, 25), x_test.shape=(19452, 25)

There is no reason to expect a linear relationship between a house’s geographical coordinates and its price.
However, we do expect a strong dependency between price and location, in the sense that houses with similar features should have similar prices when they are in close geographical proximity.
One way to model this is to cluster the data geographically, and tag each house with the cluster it belongs to using one hot encoding:

# We need this class mainly because the transform method of sklearn's KMeans returns distances to the cluster centers, and we want a one-hot encoding of the assigned cluster.
class OneHotEncodedKMeansTransformer:
    def __init__(self, k: int, columns: list[str], name: str) -> None:
        self.columns = columns
        self.k = k
        self.name = name

    def fit(self, X: pd.DataFrame) -> Self:
        self.kmeans_ = KMeans(n_clusters=self.k, n_init="auto", random_state=42)
        self.kmeans_.fit(X[self.columns])
        return self

    def column_names(self) -> list[str]:
        return [f"{self.name}_{i}" for i in range(self.k)]

    def transform(self, X: pd.DataFrame):
        cluster_index = self.kmeans_.predict(X[self.columns])
        return pd.concat(
            [
                X,
                pd.DataFrame(
                    np.eye(self.k)[cluster_index],
                    columns=self.column_names(),
                    index=X.index,
                ),
            ],
            axis=1,
        )

    def clusters_adjacency_matrix(self):
        edges = np.array(
            scipy.spatial.Voronoi(self.kmeans_.cluster_centers_).ridge_points
        ).T
        a = scipy.sparse.coo_matrix(
            (np.ones(edges.shape[1]), (edges[0], edges[1])),
            shape=(self.k, self.k),
        )
        return a + a.T


kmeans_transformer = OneHotEncodedKMeansTransformer(
    k=500,
    columns=["lat", "long_scaled"],
    name="geo_cluster",
)
x_train = kmeans_transformer.fit(x_train).transform(x_train)
x_test = kmeans_transformer.transform(x_test)

We will evaluate our models by their R squared score. From a quick glance over Kaggle, it seems that sophisticated models (e.g. XGBoost) can achieve a score of about 0.9. Let’s see if we can get there using a linear model.

def evaluate_model(model) -> None:
    y_train_pred = model.fit(x_train, y_train).predict(x_train)
    r2_train = r2_score(y_train, y_train_pred)
    y_test_pred = model.predict(x_test)
    r2_test = r2_score(y_test, y_test_pred)
    print(f"{r2_train=:.3f}, {r2_test=:.3f}")

Let’s start with a vanilla linear model, without any regularization/augmentations.

columns = (
    [
        "bedrooms",
        "bathrooms",
        "floors",
        "waterfront",
        "view",
        "condition",
        "grade",
        "sqft_living",
        "sqft_lot",
        "sqft_above",
        "sqft_basement",
        "sqft_lot15",
        "sqft_living15",
    ]
    + age_cols
    + kmeans_transformer.column_names()
)
columns_selector = ColumnTransformer(
    [("selector", "passthrough", columns)],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

simple_linear = Pipeline(
    [
        ("selector", columns_selector),
        ("linear", LinearRegression(fit_intercept=False)),
    ]
)
evaluate_model(simple_linear)
r2_train=0.918, r2_test=0.862

Not bad, but we do have some overfitting. Let’s see if we can improve generalization with regularization/augmentation. First we try ridge regression:

ridge = Pipeline(
    [
        ("selector", columns_selector),
        ("scale", StandardScaler()),
        (
            "linear",
            RidgeCV(
                fit_intercept=True,
                alphas=x_train.shape[0] * np.logspace(-9, -2, 100),
            ),
        ),
    ]
)
evaluate_model(ridge)
r2_train=0.918, r2_test=0.863

That didn’t really help, which makes sense: a diagonal regularization matrix is a poor match for our correlated features.
Let’s see if we can do better by using augmentations that are more appropriate for our data.

First let’s build a class for linear models with augmentation via regularization.
We will pass to the constructor a callable that takes the input features and returns their mean and (sum of) covariance after augmentation, since, as we showed above, these moments are all we need from the augmentations.
Since the transformations of different features are often uncorrelated, it is convenient to specify features in groups and assume zero covariance between features that are not in the same group (i.e. a block-diagonal covariance matrix).

class AugmentedLinearModel:
    def __init__(
        self,
        augmentation_moments: list[  # one item for each group of features
            tuple[
                list[Hashable],  # column names of features in the group
                Callable[[pd.DataFrame], tuple[Array, Array]],  # maps X to M and R
            ]
        ],
    ) -> None:
        self.augmentation_moments = augmentation_moments

    def fit(self, X: pd.DataFrame, y: pd.Series) -> Self:
        # Compute the post-augmentation means and (summed) covariances for each
        # feature group, then assemble the full M matrix and block-diagonal R.
        means, covs = zip(
            *(
                moments(X.loc[:, columns])
                for columns, moments in self.augmentation_moments
            )
        )
        M = np.hstack(means)
        # https://scikit-learn.org/stable/developers/develop.html#estimated-attributes
        self.R_ = scipy.linalg.block_diag(*covs)
        # Regularized normal equations: (M^T M + R) theta = M^T y
        self.theta_ = np.linalg.solve(M.T @ M + self.R_, M.T @ y)
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        cols = [col for cols, _ in self.augmentation_moments for col in cols]
        return X.loc[:, cols] @ self.theta_

Here are the augmentations we are gonna use:
With 10% probability, half a bathroom is re-counted as an additional bedroom (applied only to houses with at least one bathroom).

def bedrooms_bathrooms_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    p = 0.1
    v = np.array([1, -0.5])  # the jump: +1 bedroom, -0.5 bathroom
    mask = X[["bathrooms"]] >= 1  # only houses with at least one bathroom
    M = np.where(mask, X + p * v, X)  # mean of a Bernoulli(p) jump along v
    R = p * (1 - p) * np.outer(v, v) * mask.values.sum()  # covariance, summed over affected rows
    return M, R


augmentation_moments = [(["bedrooms", "bathrooms"], bedrooms_bathrooms_moments)]

A 5% relative perturbation for the features sqft_living, sqft_lot, sqft_above, sqft_basement, sqft_lot15, and sqft_living15, uncorrelated across the features.

def relative_perturbation_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    # x -> x * (1 + eps) with E[eps] = 0 and std(eps) = 5%: the mean is x itself,
    # and the summed variance over all samples is 0.05^2 * sum_i x_i^2.
    return X.values, np.sum(X.values**2) * 0.05**2


augmentation_moments.extend(
    ([column], relative_perturbation_moments)
    for column in [
        "sqft_living",
        "sqft_lot",
        "sqft_above",
        "sqft_basement",
        "sqft_lot15",
        "sqft_living15",
    ]
)

A perturbation with standard deviation 0.01 for the features floors, waterfront, view, condition, and grade, uncorrelated across the features.

def absolute_perturbation_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    # x -> x + eps with E[eps] = 0 and std(eps) = 0.01: the mean is x itself,
    # and the summed variance over all n samples is n * 0.01^2.
    return X.values, X.shape[0] * 0.01**2


augmentation_moments.extend(
    ([column], absolute_perturbation_moments)
    for column in ["floors", "waterfront", "view", "condition", "grade"]
)

A perturbation of age, uniformly distributed between -1 and 1. We need to calculate the moments of the powers of age accordingly.

def age_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    # The augmented age is U ~ Uniform[age - 1, age + 1]; the columns of X are
    # age^1, ..., age^max_power, so we need E[U^k] for k = 1, ..., 2 * max_power.
    a = X[["age"]].values - 1
    b = X[["age"]].values + 1
    max_power = X.shape[1]
    np1 = np.arange(2, 2 * max_power + 2)  # k + 1 for k = 1, ..., 2 * max_power
    # https://en.wikipedia.org/wiki/Continuous_uniform_distribution#Moments
    mu = (b**np1 - a**np1) / (np1 * (b - a))
    mu_sum = mu[:, 1:].sum(axis=0)  # sum over samples of E[U^k], k = 2, ..., 2 * max_power
    mu = mu[:, :max_power]  # E[U^k] for k = 1, ..., max_power: the rows of M
    idx = np.add.outer(np.arange(max_power), np.arange(max_power))
    # Cov(U^p, U^q) = E[U^(p+q)] - E[U^p] E[U^q], summed over all samples
    c = mu_sum[idx] - mu.T @ mu
    return mu, c


augmentation_moments.append((age_cols, age_moments))

And finally, with probability 50%, the geo cluster is changed to one of its neighbors.

def geo_cluster_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    p = 0.5
    adj_mat = kmeans_transformer.clusters_adjacency_matrix()
    P = scipy.sparse.eye(adj_mat.shape[0]) * p + (adj_mat / adj_mat.sum(axis=1)) * (
        1 - p
    )  # transition probabilities matrix
    M = scipy.sparse.csr_array(X.values) @ P
    # https://en.wikipedia.org/wiki/Multinomial_distribution#Matrix_notation
    R = scipy.sparse.diags(M.sum(axis=0)) - M.T @ M
    return M.toarray(), R.toarray()


augmentation_moments.append((kmeans_transformer.column_names(), geo_cluster_moments))

Let’s fit the augmented model and see how we did:

augmented_linear = AugmentedLinearModel(augmentation_moments)
evaluate_model(augmented_linear)
r2_train=0.903, r2_test=0.882

We managed to improve the test accuracy and reduce the overfitting.

Beyond least squares

Is it possible to extend the result to models that use a non-quadratic loss (e.g. logistic regression)? Well, the proof relies heavily on the loss being quadratic, so probably not, but let’s see if we can at least get an approximate result using a second-order Taylor approximation of the loss.

The goal is to (approximately) express \[\begin{align*} \mathrm{E} \left[ \sum_{i=0} ^{n-1} l \left( a\left(x_i, p_i\right) ^T \theta \,;\, y_i \right) \right], \end{align*}\] as a sum of a non-augmented loss term and a regularization term. Here, \(l(\hat{y}\,;\,y)\) is the loss: it measures how bad the prediction \(\hat{y}\) is, given the true value \(y\).
For example, for logistic regression we use the logistic loss \[ l(\hat{y}; y) = \log \left( 1 + \exp \left(-y \, \hat{y} \right) \right) \] (with \(y \in \{ -1, 1 \}\)).
Let’s expand \(l \left( a\left(x_i, p_i\right) ^T \theta\,;\,y_i \right)\) to second order around \(\mu_i ^T \theta\), take the expectation, and simplify: \[\begin{align*} \mathrm{E} \left[ \sum_{i=0} ^{n-1} l \left( a\left(x_i, p_i\right) ^T \theta \,;\, y_i \right) \right] \approx \sum_{i=0} ^{n-1} l(\mu_i ^T \theta\,;\,y_i) + \frac{1}{2} l'' \left( \mu_i ^T \theta\,;\,y_i \right) \theta^T R_i \theta \end{align*}\] (the order-1 term vanishes since \(a\left(x_i, p_i\right) - \mu_i\) has zero mean, similar to the delta method, and the expectation of the quadratic term uses \(\mathrm{E}\left[\left(\left( a\left(x_i, p_i\right) - \mu_i \right)^T\theta\right)^2\right] = \theta^T R_i \theta\)).
So, as in the least squares case, in the loss term we just replace each \(x_i\) with its mean. But the regularization term is no longer quadratic, since the second-derivative factor is not constant (unless the loss is quadratic…).

I use this result to tell myself that it is ok to select an \(R\) for a quadratic regularization based on the covariance of an augmentation, as long as the covariance is small (usually correct for augmentations), and \(l''\) is bounded (correct for logistic regression).
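For the logistic loss specifically, a quick computation confirms that \(l''\) is indeed uniformly bounded: \[\begin{align*} l'(\hat{y}\,;\,y) &= -y \, \sigma\left(-y \, \hat{y}\right), \\ l''(\hat{y}\,;\,y) &= \sigma\left(y \, \hat{y}\right) \left( 1 - \sigma\left(y \, \hat{y}\right) \right) \le \tfrac{1}{4}, \end{align*}\] where \(\sigma(t) = 1/(1+e^{-t})\) is the logistic sigmoid, and we used \(y^2 = 1\) since \(y \in \{-1, 1\}\).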