  1. Accuracy on 115K CIFAR-10 training runs
  2. A crucial detail in Git Re-Basin 😇
  3. We improve merged NN performance for any choice of norm layer
  4. My New Year’s resolution
  5. Something amusing in neural network optimization
  6. I decided to annotate this table from the GaLore paper with p-values 😇
  7. I’m interested in this recent ICLR 2024 spotlight paper 😇
  8. Horizontal flipping augmentation can be improved for free
  9. Here’s an SGD-Nesterov that outperforms both 😇
  10. Variance in neural network training has a simple statistical structure
  11. A simple lower-bound on the variance of neural network training
  12. First NanoGPT speedrun
  13. I had a thought about one of the baselines in the Sophia paper 😇
  14. Warming up the learning rate for X steps too long just delays training by X/2 steps (see the arithmetic sketch after this list; earlier thread on the same topic)
  15. It’s about how long you let the poor little fella think
  16. The effect of the learning rate on model outputs is locally linear if we average over repeated runs
  17. If “self-distillation is performing implicit ensemble,” then why do ensembles of self-distilled models underperform regular ensembles? 😇
  18. Three reasons to be skeptical regarding the claim that ternary weights are just as good as full precision 😇
  19. Ternary weights are outperformed by septernary
  20. Introduction of the Muon optimizer
  21. Scaling the NanoGPT speedrun to 1.5B parameters yields GPT-2 XL-level HellaSwag in 10 8xH100-hours
  22. Repeating data apparently becomes benign when the token-to-parameter ratio gets large enough (😇 wrt https://arxiv.org/abs/2305.16264)
  23. BatchNorm has the same behavior under distribution shift as other norm layers
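
A minimal arithmetic sketch for item 14 (my reading of the claim, not necessarily the post's own argument), assuming the warmup ramps the learning rate linearly from 0 to its peak value \eta over X steps:

  \sum_{t=1}^{X} \eta \cdot \frac{t}{X} = \eta \cdot \frac{X+1}{2} \approx \frac{\eta X}{2}

Under this assumption, the warmup phase delivers roughly the same cumulative learning rate as X/2 steps at the full rate \eta, so an unnecessarily long X-step warmup behaves like a delay of about X/2 steps.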