A very interesting paper that goes through the process of building reversible blocks within a neural network. The idea here is that the output of such a block can be inverted with a few simple operations to recover its input. This means that instead of storing enormous amounts of activations in GPU memory for the backward pass, they can be recomputed on the fly, giving a much lower memory footprint at the cost of a few minor extra operations that appear to have little impact on training speed. These blocks need to have output of the same dimension as their input, a limitation that means not all blocks in the network have the reversible property, but a significant proportion do.
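As a rough sketch of how such a block can work, here is the additive-coupling construction commonly used for reversible blocks: the input is split into two halves, each half is updated using a function of the other, and the inverse just subtracts the same terms in reverse order. The specific residual functions `F` and `G` below (random tanh layers) are stand-ins for whatever sub-networks a real block would use, not anything from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
# Stand-in residual functions; in practice these would be learned sub-networks.
W_f = rng.standard_normal((dim, dim))
W_g = rng.standard_normal((dim, dim))

def F(x):
    return np.tanh(x @ W_f)

def G(x):
    return np.tanh(x @ W_g)

def forward(x1, x2):
    # Additive coupling: each half is shifted by a function of the other half.
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    # Undo the shifts in reverse order; no activations need to be stored.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1 = rng.standard_normal((2, dim))
x2 = rng.standard_normal((2, dim))
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
# The inputs are reconstructed exactly (up to float rounding) from the outputs.
print(np.allclose(r1, x1) and np.allclose(r2, x2))
```

Note why the dimension constraint shows up: the inverse only works because the outputs `(y1, y2)` carry exactly as much information as the inputs `(x1, x2)`, so the block cannot change the shape of its input.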
I will say this idea doesn't seem to have caught on, but I think it will become a bigger deal as transformer parameter counts continue to explode larger and larger.