Adam, and even Muon, optimize attention’s query and key matrices as if they were independent. Treating them as the single bilinear form they jointly define yields a family of Muon-style update rules.
A somewhat unconventional introduction to greedy algorithms for the sparse approximation problem, proposing an obvious algorithm that seems to have been overlooked.
The non-quadratic case
The derivation and implementation of a method for leave-one-out cross-validation with negligible extra runtime compared to fitting alone.
A unique introduction to the MUSIC algorithm as a general method for solving the multi-snapshot sparse decomposition problem.
\(\rho=1\) means perfect positive correlation, \(\rho=-1\) means perfect negative correlation, \(\rho=0\) means no correlation. But what does \(\rho=0.72\) mean?
On the equivalence of training data augmentation and quadratic regularization for linear models, a very useful (but not well-known) result.