First, I have to say: wow, this is impressive work, not just as a discovery but as a brilliant paper that explains the idea well with good examples. It also doesn't lean too heavily on mathematical formulas, which I feel makes it accessible to a wider audience. That said, this is definitely a paper for readers with a deep understanding of vision models in machine learning.


A relatively new way of training models has emerged in which the model serves as the encoder of a masked auto-encoder. In this setup it is trained to take an image with some pixels masked out and reproduce the original, unmasked image. The encoder can then be cut out and trained on the actual problem in the traditional way. This gives the model a head start and better accuracy.
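
To make the pre-training objective concrete, here is a rough sketch of my own (not the paper's code). It assumes the image is masked in square patches and that the reconstruction error is measured only on the hidden pixels; both choices, and names such as `make_patch_mask` and `reconstruction_loss`, are illustrative assumptions. The "model" is a stand-in that returns its input unchanged.

```python
import random


def make_patch_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Return a grid of booleans; True means the patch is hidden."""
    rng = random.Random(seed)
    n_patches = grid_h * grid_w
    hidden = set(rng.sample(range(n_patches), int(n_patches * mask_ratio)))
    return [[(r * grid_w + c) in hidden for c in range(grid_w)]
            for r in range(grid_h)]


def mask_image(img, patch_mask, patch):
    """Zero out every pixel that falls inside a hidden patch."""
    return [[0.0 if patch_mask[y // patch][x // patch] else v
             for x, v in enumerate(row)]
            for y, row in enumerate(img)]


def reconstruction_loss(original, reconstructed, patch_mask, patch):
    """Mean squared error over the hidden pixels only (an assumption)."""
    total, count = 0.0, 0
    for y, row in enumerate(original):
        for x, v in enumerate(row):
            if patch_mask[y // patch][x // patch]:
                total += (reconstructed[y][x] - v) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 8x8 "image" split into a 4x4 grid of 2x2-pixel patches.
image = [[float(x + y) for x in range(8)] for y in range(8)]
patch_mask = make_patch_mask(4, 4, mask_ratio=0.5)
masked = mask_image(image, patch_mask, patch=2)

# Stand-in model: returns the masked input as its "reconstruction".
loss_untrained = reconstruction_loss(image, masked, patch_mask, patch=2)
# A perfect reconstruction gives zero loss.
loss_perfect = reconstruction_loss(image, image, patch_mask, patch=2)
```

In real pre-training, an encoder-decoder network would replace the stand-in and be optimised to push this loss down; afterwards only the encoder is kept for the downstream task.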

This methodology presents an issue for convolutional networks, though: the convolution operation tends to spread information into the masked regions in a way that is unhelpful. This has led to vision transformers displacing convolutional neural networks in the machine vision space. This paper develops a technique to overcome the issue and puts convolutional neural networks back as state-of-the-art vision models. The authors use a special sparse convolution network prior to the encoder that allows masking in a way that is friendly to convolutional neural networks and therefore allows them to be trained with a masked auto-encoder.
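
The spreading problem is easy to see on a toy example. The sketch below is my own illustration, not the paper's implementation: it uses a plain 3x3 averaging filter as a stand-in for a learned convolution, and it emulates a sparse convolution by simply re-applying the mask after each layer. That emulation is an assumption about the general idea; the helper names are hypothetical.

```python
def conv3x3_mean(img):
    """Dense 3x3 averaging filter with zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def apply_mask(img, keep):
    """Zero out every position where keep is False."""
    return [[v if k else 0.0 for v, k in zip(row, krow)]
            for row, krow in zip(img, keep)]


def masked_conv(img, keep):
    """Emulated sparse convolution: outputs exist only at visible positions."""
    return apply_mask(conv3x3_mean(apply_mask(img, keep)), keep)


def count_nonzero(img):
    return sum(1 for row in img for v in row if v != 0.0)


# 6x6 image of ones; the left three columns are visible, the rest masked.
image = [[1.0] * 6 for _ in range(6)]
keep = [[x < 3 for x in range(6)] for _ in range(6)]
visible = apply_mask(image, keep)

# Dense convolutions: the masked region fills in by one column per layer.
dense_1 = conv3x3_mean(visible)
dense_2 = conv3x3_mean(dense_1)
dense_counts = [count_nonzero(visible), count_nonzero(dense_1), count_nonzero(dense_2)]

# Mask-preserving convolutions: the masked region stays empty.
sparse_1 = masked_conv(visible, keep)
sparse_2 = masked_conv(sparse_1, keep)
sparse_counts = [count_nonzero(visible), count_nonzero(sparse_1), count_nonzero(sparse_2)]
```

With dense layers the number of non-zero positions grows from 18 to 24 to 30, so after a few layers the mask pattern has effectively vanished. With the mask-preserving version it stays at 18 at every depth, which is the property that makes masked pre-training workable for a convolutional encoder.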

This feels like a breakthrough discovery. The paper is extremely well written, and I highly recommend it.