On the equivalence of training data augmentation and quadratic regularization for linear models - a very useful (but not well known) result.
Author
Tom Shlomo
Published
January 15, 2024
Training data augmentation enhances the training dataset by applying transformations to existing training data instances. The specific transformations vary depending on the type of data involved, and this flexibility allows us to leverage domain knowledge, such as known invariances, effectively. The goal is to introduce variability and increase the diversity of the training set, allowing the model to generalize better to unseen data and exhibit improved robustness. Despite these advantages, training data augmentation introduces an inherent computational cost: the increased volume of data requires additional computational resources, impacting both training time and memory requirements.
As we will show below, for linear models with the sum-of-squares loss, training data augmentation is equivalent to adding a quadratic regularization term, which implies that the computational cost of fitting a model to an augmented dataset is the same as using no augmentation at all!
This link between augmentation and regularization is useful in the other direction as well: it gives a concrete interpretation to the value of regularization hyperparameters, which can be used to avoid costly hyperparameter tuning (np.logspace(-6, 6, 100) much?), and to design regularizers that are better suited to the data than the simple ones (e.g. the sum-of-squares regularization used in ridge regression).
Notation
Suppose we have a training dataset comprising \(n\) pairs \(x_i,\,y_i\) for \(i=0, \dots, n-1\), where \(x_i\) is the \(d\)-dimensional feature vector of the \(i\)’th training example, and \(y_i\) is the corresponding label. Here we will assume \(y_i \in \mathrm{R}\); however, our results can be easily extended to the vector-labeled (aka multi-output) case. We will also denote by \(X\) the \(n\)-by-\(d\) matrix with rows \(x_0^T, \dots, x_{n-1}^T\) and by \(y\) the \(n\)-vector with entries \(y_0, \dots, y_{n-1}\).
Let \(a:\mathrm{R}^d \times \mathcal{P} \mapsto \mathrm{R}^d\) denote the augmentation function that, given the augmentation parameters \(p \in \mathcal{P}\), maps a feature vector \(x\) to a transformed feature vector. The augmentation parameters \(p\) are usually sampled randomly from a given distribution. For example, for image data, \(a\) is often a composition of small shifts, rotations, brightness changes, etc., while \(p\) specifies the amount of shifting, rotation, and brightness change.
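To make the abstraction concrete, here is a minimal sketch (the names below are mine, not from the original code) of an additive-noise augmentation and how it would be sampled:

import numpy as np

rng = np.random.default_rng(0)

def a(x: np.ndarray, p: np.ndarray) -> np.ndarray:
    # The augmentation simply shifts the feature vector by p.
    return x + p

x = np.array([1.0, 2.0, 3.0])
sigma = 0.1
p = rng.normal(scale=sigma, size=x.shape)  # sample the augmentation parameters
x_augmented = a(x, p)
# For this augmentation, E[a(x, p)] = x and Cov[a(x, p)] = sigma**2 * I.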
Ordinary least squares (OLS)
Let’s quickly go over OLS so we can compare its equations with the augmented version we will derive next.
To fit an OLS model, we find a vector of coefficients \(\theta_\text{OLS}\) that minimizes the sum of squared training errors: \[\begin{align*}
\theta_\text{OLS} &:= \text{argmin} _\theta \sum_{i=0} ^{n-1} \left(
x_i ^T \theta - y_i
\right) ^2
\\&= \text{argmin} _\theta \| X \theta - y \|^2 \tag{1}
\end{align*}\] To solve the optimization problem (1), we solve the normal equations \(X^TX \theta_\text{OLS} = X^T y\), which costs \(O(n d^2)\) (forming \(X^T X\) dominates when \(n \ge d\)).
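In code this is a one-liner. A minimal numpy sketch (solving the normal equations directly, rather than going through scikit-learn):

import numpy as np

def fit_ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Solve the normal equations X^T X theta = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)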
Augmented least squares
We will now fit a model by finding coefficients \(\theta_\text{ALS}\) that minimize the expected error over the augmented training dataset: \[\begin{align*}
\theta_\text{ALS} &:= \text{argmin} _\theta \mathrm{E}
\left[
\sum_{i=0} ^{n-1} \left(
a\left(x_i, p_i\right) ^T \theta - y_i
\right) ^2
\right], \tag{2}
\end{align*}\] where the expectation is over \(p_0,\dots, p_{n-1}\), the random augmentation parameters. As we will see below, \(\theta_\text{ALS}\) depends on \(a\) and the distribution of \(p\) only through the first- and second-order moments, which we denote by \[\begin{align*}
\mu_i &:= \mathrm{E} \left[a(x_i, p_i) \right]\\
R_i &:= \mathrm{Cov} \left[ a(x_i, p_i) \right].
\end{align*}\]
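When these moments are hard to derive in closed form, they can be estimated by sampling the augmentation parameters. A small sketch (the helper names here are mine, assuming the augmentation and a sampler for \(p\) are available as Python callables):

import numpy as np

def empirical_moments(x, a, sample_p, n_samples: int = 10_000):
    # Estimate mu_i = E[a(x_i, p_i)] and R_i = Cov[a(x_i, p_i)] by Monte Carlo.
    samples = np.stack([a(x, sample_p()) for _ in range(n_samples)])
    return samples.mean(axis=0), np.cov(samples, rowvar=False)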
Continuing from (2), we use the standard trick of subtracting and adding the mean: \[\begin{align*}
\theta_\text{ALS} &= \text{argmin} _\theta \mathrm{E}
\left[
\sum_{i=0} ^{n-1} \left(
\left(
\mu_i^T \theta - y_i
\right)
+ \left(
a\left(x_i, p_i\right) -
\mu_i
\right)^T\theta
\right) ^2
\right]
\end{align*}\] Note that the first term \(\mu_i^T \theta - y_i\) is deterministic, while the second term \(\left(a\left(x_i, p_i\right) - \mu_i\right)^T\theta\) has zero mean, so when we expand the square the cross term vanishes in expectation. Therefore \[\begin{align*}
\theta_\text{ALS} &= \text{argmin} _\theta
\sum_{i=0} ^{n-1}
\left(
\mu_i^T \theta - y_i
\right)^2
+
\mathrm{E} \left[\left(
\left(
a\left(x_i, p_i\right) -
\mu_i
\right)^T\theta
\right)^2 \right] \\
&= \text{argmin} _\theta
\| M \theta - y\|^2 + \theta ^T R \theta,
\tag{3}
\end{align*}\] where \(M\) is the \(n\)-by-\(d\) matrix whose rows are \(\mu_0^T, \dots, \mu_{n-1}^T\), and \[
R := \sum_{i=0} ^{n-1} R_i.
\] Equation (3) shows exactly what we set out to prove: fitting a model on an augmented training dataset is equivalent to fitting a non-augmented, but quadratically regularized, least squares model. We just replace each feature vector with its post-augmentation mean (i.e. \(X\) with \(M\)), and use the sum of all the covariances as the regularization matrix.
To solve the optimization problem (3), we solve the equation \((M^T M + R) \theta_\text{ALS} = M^T y\), which has the same \(O(n d^2)\) complexity as OLS.
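A minimal sketch of this fit, assuming \(M\) and \(R\) have already been computed from the augmentation’s moments:

import numpy as np

def fit_augmented_ls(M: np.ndarray, R: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Solve (M^T M + R) theta = M^T y; same cost as the OLS normal equations.
    return np.linalg.solve(M.T @ M + R, M.T @ y)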
Ridge regression
Ridge regression (aka Tikhonov regularization) has the form (3) with \(M=X\) and \(R=\lambda I\). As an augmentation, it can be interpreted as follows: perturb each feature vector by zero-mean noise with variance \(\lambda/n\), uncorrelated across features (and leaving the mean unchanged, since \(M=X\)).
This interpretation of \(\lambda\) can be used to set it (at least roughly): just think what level of perturbation \(\sigma\) is reasonable for your features, and set \(\lambda = n \sigma^2\).
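For example (a sketch; X_train and y_train are placeholder names, and note that sklearn’s Ridge minimizes \(\| X w - y \|^2 + \alpha \|w\|^2\), so its alpha plays the role of \(\lambda\) here):

import numpy as np
from sklearn.linear_model import Ridge

sigma = 0.05  # a perturbation level that seems reasonable for these features
n = X_train.shape[0]
model = Ridge(alpha=n * sigma**2)  # lambda = n * sigma^2
model.fit(X_train, y_train)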
This interpretation also shows that when different features are scaled differently, ridge regression is perhaps not suitable. A standard deviation of 100 might be reasonable for perturbing a feature with values on the order of millions, but it is probably not suitable for a feature with values on the order of 1. In these cases, we may use a diagonal \(R\): \[\begin{align*}
R= n \, \text{diag} \left(
\sigma_0^2, \dots, \sigma_{d-1}^2
\right)
\end{align*}\] where \(\sigma_i\) is the standard deviation of the perturbation of feature \(i\).
Another option is to scale the features before fitting, e.g. using sklearn’s StandardScaler transformer. With all features scaled to have unit variance, setting \(\lambda = n \, 10 ^{-6}\) is a sensible rule of thumb, as it is often reasonable to assume a perturbation of at least \(0.1\%\) (i.e. \(\sigma = 10^{-3}\)).
Note that the model often includes an intercept (aka constant) term, implemented by adding a column of ones to \(X\). Since this column remains unchanged under any augmenting transformation, the corresponding row and column of \(R\) should be all zeros.
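Putting the last two points together, here is a sketch of building such a diagonal \(R\) and solving the regularized normal equations (the perturbation levels are made up for illustration, and the last column of X is assumed to be the intercept’s column of ones):

import numpy as np

n = X.shape[0]
# Per-feature perturbation standard deviations; 0 for the intercept column,
# so its row and column of R stay all zeros.
sigmas = np.array([100.0, 0.5, 0.0])
R = n * np.diag(sigmas**2)
theta = np.linalg.solve(X.T @ X + R, X.T @ y)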
Example
For the example we will use the House Sales in King County, USA dataset. Each row describes a house sold between May 2014 and May 2015. Our goal is to predict the log price given features like the number of rooms, area, and location.
Note: several decisions outlined below weren’t necessarily the most effective; rather, they were chosen to showcase different modelling techniques in the context of augmentation via regularization.
Let’s begin by importing everything we will need, loading our data, and adding some columns.
There is no reason to expect a linear relationship between a house’s geographical coordinates and its price.
However, we do expect a strong dependency between price and location in the sense that houses with similar features should share similar prices when located in close geographical proximity.
One way to model this is to cluster the data geographically, and tag each house with the cluster it belongs to using one hot encoding:
# We need this class mainly since the transform method of sklearn's k-means class
# yields distances to the cluster centers, and we want one hot encoding.
class OneHotEncodedKMeansTransformer:
    def __init__(self, k: int, columns: list[str], name: str) -> None:
        self.columns = columns
        self.k = k
        self.name = name

    def fit(self, X: pd.DataFrame) -> Self:
        self.kmeans_ = KMeans(n_clusters=self.k, n_init="auto", random_state=42)
        self.kmeans_.fit(X[self.columns])
        return self

    def column_names(self) -> list[str]:
        return [f"{self.name}_{i}" for i in range(self.k)]

    def transform(self, X: pd.DataFrame):
        cluster_index = self.kmeans_.predict(X[self.columns])
        return pd.concat(
            [
                X,
                pd.DataFrame(
                    np.eye(self.k)[cluster_index],
                    columns=self.column_names(),
                    index=X.index,
                ),
            ],
            axis=1,
        )

    def clusters_adjacency_matrix(self):
        edges = np.array(
            scipy.spatial.Voronoi(self.kmeans_.cluster_centers_).ridge_points
        ).T
        a = scipy.sparse.coo_matrix(
            (np.ones(edges.shape[1]), (edges[0], edges[1])),
            shape=(self.k, self.k),
        )
        return a + a.T


kmeans_transformer = OneHotEncodedKMeansTransformer(
    k=500,
    columns=["lat", "long_scaled"],
    name="geo_cluster",
)
x_train = kmeans_transformer.fit(x_train).transform(x_train)
x_test = kmeans_transformer.transform(x_test)
We will evaluate our models by their R squared score. From a quick glance over Kaggle, it seems that sophisticated and advanced models (e.g. XGBoost) can achieve a score of about 0.9. Let’s see if we can get there using a linear model.
That didn’t really help, which is not surprising: a diagonal regularization matrix is not a good fit for our correlated features.
Let’s see if we can do better by using augmentations that are more appropriate for our data.
First let’s build a class for linear models with augmentation via regularization.
We will pass to the constructor a callable that takes the input features and returns their mean and (sum of) covariance after the augmentation, since, as we showed above, these are all we need from the augmentation.
Since often the transformations of different features are uncorrelated, it is convenient to specify features in groups, and assume zero covariance for features that are not in the same group (i.e. a block diagonal covariance matrix).
class AugmentedLinearModel:
    def __init__(
        self,
        augmentation_moments: list[  # one item for each group of features
            tuple[
                list[Hashable],  # column names of features in the group
                Callable[[pd.DataFrame], tuple[Array, Array]],  # maps X to M and R
            ]
        ],
    ) -> None:
        self.augmentation_moments = augmentation_moments

    def fit(self, X: pd.DataFrame, y: pd.Series) -> Self:
        means, covs = zip(
            *(
                moments(X.loc[:, columns])
                for columns, moments in self.augmentation_moments
            )
        )
        M = np.hstack(means)
        # https://scikit-learn.org/stable/developers/develop.html#estimated-attributes
        self.R_ = scipy.linalg.block_diag(*covs)
        self.theta_ = np.linalg.solve(M.T @ M + self.R_, M.T @ y)
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        cols = [col for cols, _ in self.augmentation_moments for col in cols]
        return X.loc[:, cols] @ self.theta_
Now we will use this class to fit a model, but with a more appropriate regularization.
For the bedrooms and bathrooms features, we set a 10% probability that half a bathroom is recounted as an extra bedroom (for houses with at least one bathroom).
def bedrooms_bathrooms_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    p = 0.1
    v = np.array([1, -0.5])
    mask = X[["bathrooms"]] >= 1
    M = np.where(mask, X + p * v, X)
    R = p * (1 - p) * np.outer(v, v) * mask.values.sum()
    return M, R


augmentation_moments = [(["bedrooms", "bathrooms"], bedrooms_bathrooms_moments)]
Next we set a 5% perturbation for the features sqft_living, sqft_lot, sqft_above, sqft_basement, sqft_lot15, sqft_living15, uncorrelated across the features.
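The code for this step isn’t reproduced here, but following the same pattern as bedrooms_bathrooms_moments, a zero-mean 5% relative perturbation could look roughly like this (a sketch under that assumption):

def sqft_moments(X: pd.DataFrame) -> tuple[Array, Array]:
    rel = 0.05
    M = X.values  # zero-mean perturbation, so the means are unchanged
    # Uncorrelated across features: the summed covariance is diagonal, with each
    # entry being the sum over rows of (0.05 * feature value)^2.
    R = np.diag(((rel * X.values) ** 2).sum(axis=0))
    return M, R


sqft_columns = [
    "sqft_living", "sqft_lot", "sqft_above",
    "sqft_basement", "sqft_lot15", "sqft_living15",
]
augmentation_moments.append((sqft_columns, sqft_moments))

The remaining feature groups (e.g. the geo-cluster one-hot columns) would be appended to augmentation_moments in the same way before fitting AugmentedLinearModel.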
We managed to improve the test score and reduce overfitting.
Beyond least squares
In this section we extend the result to models that use a non-quadratic loss (e.g. logistic regression). The proof above doesn’t work for non-quadratic losses, but we can obtain an approximate result using a second-order Taylor approximation of the loss.
The goal is to (approximately) express \[\begin{align*}
\mathrm{E}
\left[
\sum_{i=0} ^{n-1} l \left(
a\left(x_i, p_i\right) ^T \theta \,;\, y_i
\right)
\right],
\end{align*}\] as a sum of a non-augmented loss term and a regularization term. Here, \(l(\hat{y}_i\,;\,y_i)\) is the loss function that measures how bad the prediction \(\hat{y}_i\) is, given the true label \(y_i\).
For example, for logistic regression we use the logistic loss \[
l(\hat{y}; y) = \log \left( 1 + \exp \left(-y \, \hat{y} \right) \right)
\] (with \(y \in \{ -1, 1 \}\)).
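For reference (we will need the second derivative below), the first two derivatives of this loss with respect to \(\hat{y}\) are \[
l'(\hat{y}; y) = -y \, \sigma(-y \hat{y}),
\qquad
l''(\hat{y}; y) = \sigma(y \hat{y}) \, \sigma(-y \hat{y}) \leq \frac{1}{4},
\] where \(\sigma(t) := 1 / \left(1 + e^{-t}\right)\) is the sigmoid function; in particular, \(l''\) is bounded.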
First, we expand each loss term around \(\mu_i ^T \theta\): \[\begin{align*}
l(\hat{y}_i; y_i) \approx
l(\mu_i ^T \theta\,;\,y_i)
+ l'(\mu_i ^T \theta\,;\,y_i) \left( \hat{y}_i - \mu_i ^T \theta \right)
+ \frac{1}{2} l''(\mu_i^T \theta\,;\,y_i) \left( \hat{y}_i - \mu_i ^T \theta \right)^2
\end{align*}\] (The derivatives here are with respect to the first argument of \(l\)).
Substituting this into the expectation, we get: \[\begin{align*}
\mathrm{E} \left[
\sum_{i=0} ^{n-1}
l \left(
a\left(x_i, p_i\right) ^T \theta \,;\, y_i
\right)
\right]
&\approx
\mathrm{E} \left[
\sum_{i=0} ^{n-1}
l(\mu_i ^T \theta\,;\,y_i)
+ l'(\mu_i ^T \theta\,;\,y_i) \left(
a\left(x_i, p_i\right) ^T \theta - \mu_i ^T \theta
\right)
+ \frac{1}{2} l''(\mu_i^T \theta\,;\,y_i) \left(
a \left(x_i, p_i \right) ^T \theta
- \mu_i ^T \theta
\right)^2
\right]
\\&=
\sum_{i=0} ^{n-1}
l(\mu_i ^T \theta\,;\,y_i)
+ l'(\mu_i ^T \theta\,;\,y_i) \mathrm{E}\left[
a\left(x_i, p_i\right)
- \mu_i
\right] ^T \theta
+ \frac{1}{2} l''(\mu_i^T \theta\,;\,y_i) \mathrm{E} \left[
\left(
\left(
a\left(x_i, p_i\right)
- \mu_i
\right) ^T \theta
\right)^2
\right]
\\&=
\sum_{i=0} ^{n-1}
l(\mu_i ^T \theta\,;\,y_i)
+ \frac{1}{2} l''(\mu_i^T \theta\,;\,y_i)
\theta^T R_i \theta
\end{align*}\] So, as in the least squares case, in the loss term we just replace each \(x_i\) with its mean. But the regularization term is not quadratic, since it carries the second-derivative factor, which is not constant (unless the loss is quadratic).
I use this result to convince myself that it is reasonable to design a quadratic regularizer based on augmentation covariances, even for models that don’t use the squared loss, as long as the covariances are small and \(l''\) is bounded (as in logistic regression).
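As a sketch of what that could look like in practice (this is my own illustration, not code from above; it replaces the \(\theta\)-dependent factor \(l''\) with its upper bound \(1/4\), which is a heuristic choice):

import numpy as np
from scipy.optimize import minimize

def fit_logistic_with_augmentation_reg(M: np.ndarray, y: np.ndarray, R: np.ndarray) -> np.ndarray:
    # Logistic loss evaluated on the augmentation means M (labels y in {-1, +1}),
    # plus a quadratic penalty built from the summed augmentation covariances R,
    # weighted by the bound l'' <= 1/4.
    def objective(theta: np.ndarray) -> float:
        margins = y * (M @ theta)
        return np.logaddexp(0.0, -margins).sum() + 0.5 * 0.25 * theta @ R @ theta

    theta0 = np.zeros(M.shape[1])
    return minimize(objective, theta0, method="L-BFGS-B").x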
References
I actually don’t know of any reference that discusses this connection between augmentation and regularization, although I’m sure it’s out there. Let me know if you know of one!