- Accuracy on 115K CIFAR-10 training runs
- A crucial detail in Git Re-Basin
- We improve merged NN performance for any choice of norm layer
- My New Year’s resolution
- Something amusing in neural network optimization
- I decided to annotate this table from the GaLore paper with p-values
- I’m interested in this recent ICLR 2024 spotlight paper
- Horizontal flipping augmentation can be improved for free
- Here’s an SGD-Nesterov that outperforms both
- Variance in neural network training has a simple statistical structure
- A simple lower-bound on the variance of neural network training
- First NanoGPT speedrun
- I had a thought about one of the baselines in the Sophia paper
- Warming up the learning rate for X steps too long just delays training by X/2 steps (earlier thread on the same topic; see the back-of-the-envelope calculation after this list)
- It’s about how long you let the poor little fella think
- The effect of the learning rate on model outputs is locally linear if we average over repeated runs
- If “self-distillation is performing implicit ensemble,” then why do ensembles of self-distilled models underperform regular ensembles?
- Three reasons to be skeptical regarding the claim that ternary weights are just as good as full precision
- Ternary weights are outperformed by septernary
- Introduction of the Muon optimizer (see the optimizer sketch after this list)
- Scaling the NanoGPT speedrun to 1.5B parameters yields GPT-2 XL-level HellaSwag accuracy in 10 8xH100-hours
- Repeating data apparently becomes benign when the token-to-parameter ratio gets large enough (wrt https://arxiv.org/abs/2305.16264)
- BatchNorm has the same behavior under distribution shift as other norm layers
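
A back-of-the-envelope reading of the warmup item above, assuming a linear warmup from 0 to a peak learning rate $\eta$ over the $X$ extra steps and treating training progress as roughly proportional to the cumulative learning rate applied (both assumptions are mine, not from the original post):

$$\sum_{t=1}^{X} \eta \cdot \frac{t}{X} \;\approx\; \frac{\eta X}{2}$$

So the $X$ unnecessary warmup steps contribute about as much total learning rate as $X/2$ steps at the full rate, leaving the run roughly $X/2$ steps behind an otherwise identical run that skipped the extra warmup.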
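
A minimal sketch of a Muon-style update step, for the Muon entry above. It assumes the publicly described recipe of momentum accumulation followed by Newton-Schulz orthogonalization of the update for 2D weight matrices; the function names, default hyperparameters, and iteration count here are illustrative assumptions rather than the released implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix via an
    odd-polynomial Newton-Schulz iteration (coefficients assumed from the public write-up)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon-style update for a single 2D weight matrix."""
    momentum_buf.mul_(momentum).add_(grad)              # heavy-ball momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalize the search direction
    param.add_(update, alpha=-lr)
```

The orthogonalization is the distinguishing step versus plain SGD with momentum: it pushes every singular value of the 2D update toward 1, so the step carries roughly equal energy along all directions of the weight matrix.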