- Accuracy on 115K CIFAR-10 training runs
- A crucial detail in Git Re-Basin
- We improve merged NN performance for any choice of norm layer
- My New Year’s resolution
- Something amusing in neural network optimization
- I decided to annotate this table from the GaLore paper with p-values
- I’m interested in this recent ICLR 2024 spotlight paper
- Horizontal flipping augmentation can be improved for free
- Here’s an SGD-Nesterov that outperforms both
- Variance in neural network training has a simple statistical structure
- A simple lower-bound on the variance of neural network training
- First NanoGPT speedrun
- I had a thought about one of the baselines in the Sophia paper
- Warming up the learning rate for X steps too long just delays training by X/2 steps (earlier thread on the same topic; see the back-of-the-envelope calculation after this list)
- It’s about how long you let the poor little fella think
- The effect of the learning rate on model outputs is locally linear if we average over repeated runs
- If “self-distillation is performing implicit ensemble,” then why do ensembles of self-distilled models underperform regular ensembles?
- Three reasons to be skeptical regarding the claim that ternary weights are just as good as full precision
- Ternary weights are outperformed by septernary
- Introduction of the Muon optimizer (see the optimizer sketch after this list)
- Scaling the NanoGPT speedrun to 1.5B parameters yields GPT-2 XL-level HellaSwag accuracy in 10 8xH100-hours
- Repeating data apparently becomes benign when the token-to-parameter ratio gets large enough (wrt https://arxiv.org/abs/2305.16264)
- BatchNorm has the same behavior under distribution shift as other norm layers
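
A back-of-the-envelope reading of the warmup item above, assuming a linear warmup from 0 to a peak learning rate $\eta$ over the $X$ extra steps and treating training progress as roughly proportional to the cumulative learning rate applied (both assumptions are mine, not from the original post):

$$\sum_{t=1}^{X} \eta \cdot \frac{t}{X} \;\approx\; \frac{\eta X}{2}$$

So the $X$ unnecessary warmup steps contribute about as much total learning rate as $X/2$ steps at the full rate, leaving the run roughly $X/2$ steps behind an otherwise identical run that skipped the extra warmup.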
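
A minimal sketch of a Muon-style update step, for the Muon entry above. It assumes the publicly described recipe of momentum accumulation followed by Newton-Schulz orthogonalization of the update for 2D weight matrices; the function names, default hyperparameters, and iteration count here are illustrative assumptions rather than the released implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix via an
    odd-polynomial Newton-Schulz iteration (coefficients assumed from the public write-up)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon-style update for a single 2D weight matrix."""
    momentum_buf.mul_(momentum).add_(grad)              # heavy-ball momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalize the search direction
    param.add_(update, alpha=-lr)
```

The orthogonalization is the distinguishing step versus plain SGD with momentum: it pushes every singular value of the 2D update toward 1, so the step carries roughly equal energy along all directions of the weight matrix.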