Hello

Welcome to my research blog. I’m interested in basic questions about neural network training.

Muon: An optimizer for hidden layers in neural networks

Muon is an optimizer for the hidden layers in neural networks. It is used in the current training speed records for both NanoGPT and CIFAR-10 speedrunning. Many empirical results using Muon have already been posted, so this writeup will focus mainly on Muon’s design. First we will define Muon and provide an overview of the empirical results it has achieved so far. Then we will discuss its design in full detail, including connections to prior research and our best understanding of why it works....
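For readers who want the flavor before the full post: below is a minimal sketch of the core update as I understand it, namely SGD with momentum in which each 2D hidden-layer update is approximately orthogonalized by a Newton-Schulz iteration before being applied. The function names here are mine, and details such as Nesterov momentum, shape-dependent scaling, and the handling of non-matrix parameters are omitted; see the post for the actual definition.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately replace the singular values of G with 1 while keeping its
    # singular vectors, i.e. map G toward the nearest semi-orthogonal matrix.
    # Quintic iteration; coefficients follow the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    # One simplified Muon update for a single 2D hidden-layer weight:
    # accumulate momentum, orthogonalize the accumulated update, apply it.
    # (In practice, call this under torch.no_grad().)
    momentum_buf.mul_(beta).add_(grad)
    weight.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```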

December 8, 2024 · 15 min

Selected X posts

- Accuracy on 115K cifar-10 train runs
- A crucial detail in Git Re-Basin 😇
- We improve merged NN performance for any choice of norm layer
- My New Year’s resolution
- Something amusing in neural network optimization
- I decided to annotate this table from the GaLore paper with p-values 😇
- I’m interested in this recent ICLR 2024 spotlight paper 😇
- Horizontal flipping augmentation can be improved for free
- Here’s an SGD-Nesterov that outperforms both 😇
- Variance in neural network training has a simple statistical structure
- A simple lower-bound on the variance of neural network training
- First NanoGPT speedrun
- I had a thought about one of the baselines in the Sophia paper 😇
- Warming up the learning rate for X steps too long just delays training by X/2 steps (earlier thread on the same)
- It’s about how long you let the poor little fella think
- The effect of the learning rate on model outputs is locally linear if we average over repeated runs
- If “self-distillation is performing implicit ensemble,” then why do ensembles of self-distilled models underperform regular ensembles?...

December 7, 2024 · 2 min

[notes] NTK analysis for an MLP with one hidden layer and infinite data

Motivation: I am interested in understanding the learning dynamics of neural networks. In particular, I’d like to understand the difference between the neural network “NTK parametrization” and the “mu parametrization” (the latter from the Tensor Programs series of papers). I understand that both parametrizations are justified by various analyses of the behavior of training as the width goes to infinity. It is known that the NTK parametrization can essentially already fit any finite amount of data perfectly....

September 13, 2024 · 31 min

BatchNorm adaptation can be extended to BatchNorm-free networks

A new result regarding test-time adaptation which is practically obsolete but may be of some theoretical interest.

May 25, 2024 · 5 min

Empirical law 4: Obtaining examplewise influences with 1.5e7 runs of training

In the last post we showed that the mean hypothesis of neural network training is influenced by the choice of training data in a locally additive manner. That is, given a large random base dataset $D$ and an extra dataset $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, there exists a set of hypotheses $\bar f_1, \dots, \bar f_n \in \mathcal H$ such that, for every subset $S' \subseteq S$, \begin{equation} \bar f_{D \sqcup S'} = \bar f_D + \sum_{i=1}^n \bar f_i \cdot \mathbf{1}\{(x_i, y_i) \in S'\} \end{equation}...
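As a concrete (toy) illustration of how such influences can be estimated, here is a small numpy sketch of my own: the mean hypothesis is approximated by averaging the probe-set outputs of many independently seeded runs, and the influence of a single extra example is read off as the difference of two such averages. The model (a logistic regression trained by SGD), the data, and the function names are placeholders, not the post’s actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X_probe = rng.normal(size=(32, 5))                       # fixed probe inputs
X_base = rng.normal(size=(64, 5))                        # base dataset D
y_base = (X_base @ rng.normal(size=5) > 0).astype(float)
x_extra, y_extra = rng.normal(size=5), 1.0               # one candidate extra example

def train_and_eval(X, y, seed, steps=200, lr=0.1):
    # Toy stand-in for "one run of training": a logistic regression fit by SGD
    # from a random init, evaluated (as raw scores) on the fixed probe set.
    r = np.random.default_rng(seed)
    w = r.normal(scale=0.1, size=X.shape[1])
    for _ in range(steps):
        i = r.integers(len(X))
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (p - y[i]) * X[i]
    return X_probe @ w

def mean_hypothesis(X, y, n_runs=200):
    # Estimate the mean hypothesis by averaging over independently seeded runs.
    return np.mean([train_and_eval(X, y, s) for s in range(n_runs)], axis=0)

# Under the additive model, the influence of the extra example is the
# difference between the mean hypothesis with and without it.
f_base = mean_hypothesis(X_base, y_base)
f_plus = mean_hypothesis(np.vstack([X_base, x_extra]), np.append(y_base, y_extra))
print((f_plus - f_base)[:5])
```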

May 15, 2024 · 5 min

Empirical law 2: Data influence is locally additive

In this post we explore a second empirical law of neural network training, this time regarding the influence of training data. The main result is that, if we get rid of the inherent noise in the training process by averaging over many repeated runs, then the effect of each additional training example is approximately additive. Setup: For a machine learning problem where the goal is to map inputs from a space $\mathcal X$ to $k$ real-valued outputs, we call functions of the form $f: \mathcal X \to \mathbb R^k$ hypotheses....
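To make “averaging over many repeated runs” concrete (my notation; the full post may state this differently), the mean hypothesis associated with a training dataset $D$ can be read as the expected hypothesis over the randomness of training, \begin{equation} \bar f_D(x) := \mathbb E_{\theta \sim \mathrm{Train}(D)} \left[ f_\theta(x) \right], \end{equation} estimated in practice by averaging the outputs of many independently seeded training runs.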

May 14, 2024 · 15 min

Mean behavior is differentiable in the learning rate

In this post I will contribute a result about neural network training which is surprising but probably not useful. Introduction: Neural network training is an inherently random process (Summers & Dineen 2021). Every run of training, with the same hyperparameters, produces a unique network with unique behavior. This variation makes it difficult to precisely study the effects of subtle changes in hyperparameters, like the learning rate. For example, if we compare two runs of training with learning rates 0....

May 11, 2024 · 8 min

94% on CIFAR-10 in 3.29 seconds

Check out my latest paper on speedrunning CIFAR-10! (Twitter announcement) I built this method not just for the sake of competition, but rather as a telescope to find new phenomena within neural network training. Something else: although many new optimizers have been proposed, I’ve found that none of them work better than Lookahead SGD-Nesterov for CIFAR-10 training, except for one I’ve been playing with, which (a) actually works slightly better than any other optimizer for CIFAR-10 speedrunning and (b) completely gets rid of weight decay....

April 4, 2024 · 1 min

Regularities in ROC curve behavior

The Receiver Operating Characteristic (ROC) curve of a binary classification model describes all operating points obtainable by shifting the prediction threshold. Each operating point is characterized by a true positive rate (TPR, i.e. the accuracy on positive examples) and a true negative rate (TNR, i.e. accuracy on negative examples). For example, the following are ROC curves for a small ResNet trained and evaluated on the cat/dog subset of CIFAR-10. The traditional presentation is with TPR on the y-axis and false positive rate (1 - TNR) on the x-axis, but in this post I’ll prefer TNR vs TPR....
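To make the threshold-sweeping concrete, here is a minimal numpy sketch (toy classifier scores and a function name of my own, not the post’s ResNet experiment) that records the (TPR, TNR) operating point at each threshold:

```python
import numpy as np

def roc_points(labels, scores):
    # Sweep the decision threshold over all observed scores and record the
    # (TPR, TNR) operating point obtained at each threshold.
    pos, neg = labels == 1, labels == 0
    points = []
    for t in np.unique(scores):
        pred_pos = scores >= t
        tpr = (pred_pos & pos).sum() / pos.sum()    # accuracy on positive examples
        tnr = (~pred_pos & neg).sum() / neg.sum()   # accuracy on negative examples
        points.append((tpr, tnr))
    return np.array(points)

# Toy scores standing in for a binary classifier's outputs.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 500)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])
print(roc_points(labels, scores)[::100])  # a few (TPR, TNR) points along the curve
```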

May 17, 2023 · 7 min

Training on only correctly predicted examples can surpass the original model

In which a simple model will do the job, but SGD learns a more complex and accurate one anyway.

May 11, 2023 · 2 min