Check out my latest paper on speedrunning CIFAR-10!

Twitter announcement

I built this method not just for the sake of competition, but as a telescope for finding new phenomena in neural network training.

Something else: although many new optimizers have been proposed, I’ve found that none of them work better than Lookahead SGD-Nesterov (sketched at the end of this post) for CIFAR-10 training.

Except for one that I’ve been playing with, which both (a) actually works slightly better than any other optimizer for CIFAR-10 speedrunning, and (b) does away with weight decay entirely.

I don’t plan to publish anything about it until I get the chance to try it on LLM training, which is of course what really matters, and which is in all likelihood where it will fail, hopefully in an instructive way.
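
For readers who haven’t seen the baseline mentioned above, here is a minimal PyTorch sketch of Lookahead wrapped around SGD with Nesterov momentum, following the scheme from Zhang et al.’s Lookahead paper: the inner optimizer takes k fast steps, after which the slow weights move a fraction alpha toward the fast weights and the fast weights are reset. The hyperparameters here (k=5, alpha=0.5, lr=0.1, momentum=0.9) are illustrative defaults, not the settings used in the speedrun.

```python
# Minimal Lookahead-over-SGD-Nesterov sketch (illustrative hyperparameters,
# not the speedrun's exact configuration).
import torch

class Lookahead:
    def __init__(self, base_optimizer, k=5, alpha=0.5):
        self.base = base_optimizer   # inner "fast" optimizer (SGD-Nesterov here)
        self.k = k                   # number of fast steps between syncs
        self.alpha = alpha           # interpolation factor toward fast weights
        self.step_count = 0
        # Keep one slow-weight copy per parameter.
        self.params = [p for g in base_optimizer.param_groups for p in g["params"]]
        self.slow = [p.detach().clone() for p in self.params]

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)

    @torch.no_grad()
    def step(self):
        self.base.step()             # one fast SGD-Nesterov step
        self.step_count += 1
        if self.step_count % self.k == 0:
            for p, s in zip(self.params, self.slow):
                s.add_(p - s, alpha=self.alpha)  # slow += alpha * (fast - slow)
                p.copy_(s)                       # reset fast weights to slow

model = torch.nn.Linear(32, 10)      # stand-in model
inner = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
opt = Lookahead(inner, k=5, alpha=0.5)

# Standard training step: zero_grad, backward, step.
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
opt.zero_grad()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
```

The periodic sync-and-reset is the whole trick: the slow weights average over the fast optimizer’s trajectory, which is what gives Lookahead its stabilizing effect.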