MAE stands for masked autoencoder, a technique for pre-training a neural network for image recognition. An autoencoder in this context takes in an image and has the task of simply reproducing it. At some point it encodes the image into a smaller-dimensional space before expanding it back out into the original image. A simple way to picture an autoencoder, then, is as a network that compresses or zips an image file and then decompresses it. Autoencoders use two sections to achieve this: an encoder section that shrinks the image down into the smaller-dimensional space (the compression part of the network), and a decoder section that expands it back out into the larger-dimensional space (the decompression part).
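The encode-then-decode idea can be sketched with plain linear maps. This is a minimal illustration of the shapes involved, not the paper's architecture; the 64-pixel image, 16-dimensional bottleneck, and random weights are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a flattened 8x8 grayscale patch (64 values).
image = rng.standard_normal(64)

# Encoder: a linear map into a smaller 16-dimensional space
# (the compression half). Weights are random, for illustration only.
W_enc = rng.standard_normal((16, 64)) * 0.1

# Decoder: a linear map back out to 64 dimensions (the decompression half).
W_dec = rng.standard_normal((64, 16)) * 0.1

latent = W_enc @ image           # compressed representation
reconstruction = W_dec @ latent  # attempt to reproduce the input

# Training would minimise this reconstruction error.
loss = np.mean((reconstruction - image) ** 2)
```

The bottleneck (16 values standing in for 64) is what forces the network to learn a compact representation rather than copying pixels through.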
Masking means blanking out part of the image and then having the autoencoder attempt to reproduce the image with the now-missing details filled in. In this way the autoencoder is also generating new image data based on what it can see of the image. The paper I read today splits this task into two parts: the network is first judged on its ability to recreate the parts of the image it can see (the "Mimic" of the title), and then goes on to reconstruct the image by filling in the details that were masked out.
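The two-part objective can be sketched by splitting the loss over masked and visible patches. A hedged sketch: the patch count, mask ratio, and fake "prediction" are illustrative, and here the visible patches are compared against their own pixels, whereas the paper's mimic target is features from a pre-trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Treat the image as 16 patches of 4 values each (shapes are illustrative).
patches = rng.standard_normal((16, 4))

# Randomly mask 75% of the patches, MAE-style.
mask = np.zeros(16, dtype=bool)
mask[rng.choice(16, size=12, replace=False)] = True

# Pretend this is the network's output for every patch.
prediction = patches + 0.1 * rng.standard_normal((16, 4))

# "Mimic": judged on the patches the network can see.
mimic_loss = np.mean((prediction[~mask] - patches[~mask]) ** 2)

# "Reconstruct": fill in the patches that were blanked out.
reconstruct_loss = np.mean((prediction[mask] - patches[mask]) ** 2)

total_loss = mimic_loss + reconstruct_loss
```

Keeping the two terms separate is what lets the training judge the visible and masked regions differently.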
After an autoencoder is trained in this way, the encoder section can be separated from the network and further trained on other image-perception tasks like image classification. In this way the masked-autoencoder training stage serves as a jump-start that makes the neural network perform better at a specific task.
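Reusing the encoder looks like this in sketch form: the decoder is thrown away and a new task head is attached to the encoder's output. The linear encoder, 10-class head, and random weights are hypothetical stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder weights learned during masked pre-training.
W_enc = rng.standard_normal((16, 64)) * 0.1

# The decoder is discarded; a fresh classification head is attached
# to the encoder's 16-dimensional output (10 hypothetical classes).
W_head = rng.standard_normal((10, 16)) * 0.1

image = rng.standard_normal(64)
features = W_enc @ image     # encoder reused as a feature extractor
logits = W_head @ features   # fine-tuned for the downstream task
predicted_class = int(np.argmax(logits))
```

Fine-tuning would then update both pieces (or just the head) on labelled data, starting from the representations the masked pre-training produced.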
Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
Peng Gao, Renrui Zhang, Rongyao Fang, Ziyi Lin, Hongyang Li, Hongsheng Li, and Yu Qiao
Shanghai AI Laboratory, Shanghai, China.