There are now large open-source models, with published weights, for generating images from text prompts. The results from networks like the popular Stable Diffusion are quite impressive and, but for a few issues, they are capable of producing some interesting artworks. I’ve used them myself to make abstract artwork in my own experiments, and I think most people with any interest in machine learning have at least seen the results of these new image generation techniques. These networks are trained on an enormous number of images and generally need huge datasets to work effectively. The computational resources used in training such models are also enormous, and training one would generally require the sponsorship of a medium to large corporation.

In the following fascinating paper, the authors use a pre-trained text-to-image diffusion model as the starting point for solving image perception tasks like depth estimation and image segmentation. They do this by essentially setting up the pre-trained generation model as the first processing stage of the network and then feeding its output to a new model that is trained for the visual perception task. In this way they are able to reuse the substantial visual knowledge already stored in the image generation model to help solve quite different image perception tasks. This is a great find and well worth reading. They have achieved something interesting here and also report strong results on some challenging datasets.

Unleashing Text-to-Image Diffusion Models for Visual Perception
Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu
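
To make the idea of a pre-trained first stage plus a new task model concrete, here is a minimal toy sketch in plain Python. It is not the paper's architecture: the "backbone" is just a fixed random feature map standing in for a large pre-trained generator, the "head" is a small linear model, and the data and target are made up. All names (`backbone`, `head`, `train_head`, `BACKBONE_W`) are hypothetical. The only point it illustrates is that the first stage stays frozen while the new model on top of it is the part that gets trained.

```python
import math
import random

random.seed(0)

IN_DIM, FEAT_DIM = 4, 8

# Stand-in for the pre-trained generative model (assumption: a real backbone
# would be a large pre-trained network, not a random projection).
BACKBONE_W = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]


def backbone(x):
    """Frozen first stage: maps an input to features; weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]


def head(weights, bias, feats):
    """New task-specific model: a linear readout on the backbone features."""
    return sum(w * f for w, f in zip(weights, feats)) + bias


def mse(weights, bias, data):
    """Mean squared error of the full pipeline (backbone, then head)."""
    return sum((head(weights, bias, backbone(x)) - y) ** 2 for x, y in data) / len(data)


def train_head(data, epochs=200, lr=0.05):
    """Train only the head with plain SGD; the backbone is left untouched."""
    weights, bias = [0.0] * FEAT_DIM, 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = backbone(x)
            err = head(weights, bias, feats) - y
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias


# Made-up "perception" task: predict a scalar from each input.
data = []
for _ in range(50):
    x = [random.uniform(-1, 1) for _ in range(IN_DIM)]
    data.append((x, x[0] - 0.5 * x[1]))

backbone_before = [row[:] for row in BACKBONE_W]
loss_before = mse([0.0] * FEAT_DIM, 0.0, data)
weights, bias = train_head(data)
loss_after = mse(weights, bias, data)
```

After training, `loss_after` should be lower than `loss_before` while `BACKBONE_W` is unchanged, which is the whole trick in miniature: the expensive part is reused as-is and only the small new model has to learn the task.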