Hello

Welcome to my research blog. I’m interested in basic questions about neural network training.

Muon: An optimizer for hidden layers in neural networks

Muon is an optimizer for the hidden layers in neural networks. It is used in the current training speed records for both NanoGPT and CIFAR-10 speedrunning. Many empirical results using Muon have already been posted, so this writeup will focus mainly on Muon’s design. First we will define Muon and provide an overview of the empirical results it has achieved so far. Then we will discuss its design in full detail, including connections to prior research and our best understanding of why it works....
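For readers who want the flavor before the full post: below is a minimal sketch of the core update as I understand it, namely SGD with momentum in which each 2D hidden-layer update is approximately orthogonalized by a Newton-Schulz iteration before being applied. The function names here are mine, and details such as Nesterov momentum, shape-dependent scaling, and the handling of non-matrix parameters are omitted; see the post for the actual definition.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately replace the singular values of G with 1 while keeping its
    # singular vectors, i.e. map G toward the nearest semi-orthogonal matrix.
    # Quintic iteration; coefficients follow the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    # One simplified Muon update for a single 2D hidden-layer weight:
    # accumulate momentum, orthogonalize the accumulated update, apply it.
    # (In practice, call this under torch.no_grad().)
    momentum_buf.mul_(beta).add_(grad)
    weight.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```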

December 8, 2024 · 15 min

Selected X posts

- Accuracy on 115K cifar-10 train runs
- A crucial detail in Git Re-Basin 😇
- We improve merged NN performance for any choice of norm layer
- My New Year’s resolution
- Something amusing in neural network optimization
- I decided to annotate this table from the GaLore paper with p-values 😇
- I’m interested in this recent ICLR 2024 spotlight paper 😇
- Horizontal flipping augmentation can be improved for free
- Here’s an SGD-Nesterov that outperforms both 😇
- Variance in neural network training has a simple statistical structure
- A simple lower-bound on the variance of neural network training
- First NanoGPT speedrun
- I had a thought about one of the baselines in the Sophia paper 😇
- Warming up the learning rate for X steps too long just delays training by X/2 steps (earlier thread on the same)
- It’s about how long you let the poor little fella think
- The effect of the learning rate on model outputs is locally linear if we average over repeated runs
- If “self-distillation is performing implicit ensemble,” then why do ensembles of self-distilled models underperform regular ensembles?...

December 7, 2024 · 2 min

[notes] NTK analysis for an MLP with one hidden layer and infinite data

Motivation: I am interested in understanding the learning dynamics of neural networks. In particular, I’d like to understand the difference between the neural network “NTK parametrization” and the “mu parametrization” (the latter from the Tensor Programs series of papers). I understand that both parametrizations are justified by various analyses of the behavior of training as the width goes to infinity. It is known that the NTK parametrization can essentially already fit any finite amount of data perfectly....

September 13, 2024 · 31 min

BatchNorm adaptation can be extended to BatchNorm-free networks

A new result regarding test-time adaptation which is practically obsolete but may be of some theoretical interest.

May 25, 2024 · 5 min

Empirical law 4: Obtaining examplewise influences with 1.5e7 runs of training

In the last post we showed that the mean hypothesis of neural network training is influenced by the choice of training data in a locally additive manner. That is, given a large random base dataset $D$ and an extra dataset $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, there exists a set of hypotheses $\bar f_1, \dots, \bar f_n \in \mathcal H$ such that, for every subset $S' \subseteq S$, \begin{equation} \bar f_{D \sqcup S'} = \bar f_D + \sum_{i=1}^n \bar f_i \cdot \mathbf{1}\{(x_i, y_i) \in S'\} \end{equation}...
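As a concrete (toy) illustration of how such influences can be estimated, here is a small numpy sketch of my own: the mean hypothesis is approximated by averaging the probe-set outputs of many independently seeded runs, and the influence of a single extra example is read off as the difference of two such averages. The model (a logistic regression trained by SGD), the data, and the function names are placeholders, not the post’s actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X_probe = rng.normal(size=(32, 5))                       # fixed probe inputs
X_base = rng.normal(size=(64, 5))                        # base dataset D
y_base = (X_base @ rng.normal(size=5) > 0).astype(float)
x_extra, y_extra = rng.normal(size=5), 1.0               # one candidate extra example

def train_and_eval(X, y, seed, steps=200, lr=0.1):
    # Toy stand-in for "one run of training": a logistic regression fit by SGD
    # from a random init, evaluated (as raw scores) on the fixed probe set.
    r = np.random.default_rng(seed)
    w = r.normal(scale=0.1, size=X.shape[1])
    for _ in range(steps):
        i = r.integers(len(X))
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (p - y[i]) * X[i]
    return X_probe @ w

def mean_hypothesis(X, y, n_runs=200):
    # Estimate the mean hypothesis by averaging over independently seeded runs.
    return np.mean([train_and_eval(X, y, s) for s in range(n_runs)], axis=0)

# Under the additive model, the influence of the extra example is the
# difference between the mean hypothesis with and without it.
f_base = mean_hypothesis(X_base, y_base)
f_plus = mean_hypothesis(np.vstack([X_base, x_extra]), np.append(y_base, y_extra))
print((f_plus - f_base)[:5])
```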

May 15, 2024 · 5 min

Empirical law 2: Data influence is locally additive

In this post we explore a second empirical law of neural network training, this time regarding the influence of training data. The main result is that, if we get rid of the inherent noise in the training process by averaging over many repeated runs, then the effect of each additional training example is approximately additive. Setup: For a machine learning problem where the goal is to map inputs from a space $\mathcal X$ to $k$ real-valued outputs, we call functions of the form $f: \mathcal X \to \mathbb R^k$ hypotheses....
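To make “averaging over many repeated runs” concrete (my notation; the full post may state this differently), the mean hypothesis associated with a training dataset $D$ can be read as the expected hypothesis over the randomness of training, \begin{equation} \bar f_D(x) := \mathbb E_{\theta \sim \mathrm{Train}(D)} \left[ f_\theta(x) \right], \end{equation} estimated in practice by averaging the outputs of many independently seeded training runs.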

May 14, 2024 · 15 min

Mean behavior is differentiable in the learning rate

In this post I will contribute a result about neural network training which is surprising but probably not useful. Introduction: Neural network training is an inherently random process (Summers & Dineen 2021). Every run of training, with the same hyperparameters, produces a unique network with unique behavior. This variation makes it difficult to precisely study the effects of subtle changes in hyperparameters, like the learning rate. For example, if we compare two runs of training with learning rates 0....

May 11, 2024 · 8 min

94% on CIFAR-10 in 3.29 seconds

Check out my latest paper on speedrunning CIFAR-10! (Twitter announcement) I built this method not just for the sake of competition, but rather as a telescope to find new phenomena within neural network training. Something else: although many new optimizers have been proposed, I’ve found that none of them work better than Lookahead SGD-Nesterov for CIFAR-10 training, except for one I’ve been playing with, which (a) actually works slightly better than any other optimizer for CIFAR-10 speedrunning and (b) completely gets rid of weight decay....

April 4, 2024 · 1 min

Regularities in ROC curve behavior

The Receiver Operating Characteristic (ROC) curve of a binary classification model describes all operating points obtainable by shifting the prediction threshold. Each operating point is characterized by a true positive rate (TPR, i.e. the accuracy on positive examples) and a true negative rate (TNR, i.e. accuracy on negative examples). For example, the following are ROC curves for a small ResNet trained and evaluated on the cat/dog subset of CIFAR-10. The traditional presentation is with TPR on the y-axis and false positive rate (1 - TNR) on the x-axis, but in this post I’ll prefer TNR vs TPR....
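To make the threshold-sweeping concrete, here is a minimal numpy sketch (toy classifier scores and a function name of my own, not the post’s ResNet experiment) that records the (TPR, TNR) operating point at each threshold:

```python
import numpy as np

def roc_points(labels, scores):
    # Sweep the decision threshold over all observed scores and record the
    # (TPR, TNR) operating point obtained at each threshold.
    pos, neg = labels == 1, labels == 0
    points = []
    for t in np.unique(scores):
        pred_pos = scores >= t
        tpr = (pred_pos & pos).sum() / pos.sum()    # accuracy on positive examples
        tnr = (~pred_pos & neg).sum() / neg.sum()   # accuracy on negative examples
        points.append((tpr, tnr))
    return np.array(points)

# Toy scores standing in for a binary classifier's outputs.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 500)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])
print(roc_points(labels, scores)[::100])  # a few (TPR, TNR) points along the curve
```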

May 17, 2023 · 7 min

Training on only correctly predicted examples can surpass the original model

In which a simple model will do the job, but SGD learns a more complex and accurate one anyway.

May 11, 2023 · 2 min