This is a master’s thesis, which is quite a departure from the journal papers I’ve become used to reading, but it is a high-quality one and well worth a read for anyone with a passing interest in the audio generation of music. The explanation is detailed, the diagrams are useful and the work is interesting. I wish my own thesis had been half this quality. I’m sure the author, Flavio Schneider, has a great future ahead of them.

On the subject of AI-generated music, I find it highly interesting, but I question whether it will have a large impact. I think people listen to music on a deeply human level, one involving empathy and a sense of connection with other living beings. The library of music already stored digitally is extremely large, probably more than one person could explore in a lifetime. If humanity never produced another piece of music, would it actually affect a person’s ability to listen to music, explore their changing tastes and discover new things? I don’t think so. AI music, while hugely interesting, doesn’t seem to me to present any threat to current artists, because music doesn’t really work as a product once you remove the human element.

That aside, the exploration of AI music generation is a hugely interesting field for me because it feels like the beginning of synthesizers happening in my lifetime. Is there a Moog-like figure who will emerge in this field? Have they already emerged, with creations currently unknown to me? It’s impossible not to give in to some excitement. The truth is that these developments will likely have far more effect behind the scenes of music, in the area of production, just as the development of synthesizers did generations ago.

Getting back to the thesis itself, this is a work that takes pains to explain each step of the research undertaken, in a way that very much makes me appreciate the work that goes into a thesis. It describes a text-to-music process inspired by Stable Diffusion’s text-to-image process. The code being available on GitHub is great. The work has also been done in a way that is broadly accessible from a hardware perspective, using models that can be used on a single-GPU setup. This isn’t just a paper but a foundation for further work in the area.

It is perhaps unfortunate that the music produced is terrible, but that feels par for the course when exploring new spaces.

My primitive rating of the paper is below:

Explanation: 9/10
Novelty: 7/10
Breakthrough: 1/10
Interest: 9/10
Accessibility: 9/10