I wanted to know a little more about the history of AI-generated music, mainly to get more background, and to separate the work I’m going to do from algorithmic generation. Algorithmic generation is software that uses rules or formulae to generate its results. A good example is Verbasizer (@taoski, no date), created for David Bowie to generate song lyrics from randomised words. It doesn’t use AI, just a random number generator to put lyrics together.
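To illustrate the distinction, here is a minimal sketch of how a Verbasizer-style cut-up generator might work. It is not the original software’s logic: the source lines, function name and parameters are placeholders of my own, and the point is simply that everything here is rules plus randomness, with no model and no training.

```python
import random

# Hypothetical source text, in the spirit of the cut-up technique.
# The real Verbasizer worked from text the user typed in; these are placeholders.
source_lines = [
    "the city sleeps under electric rain",
    "a stranger hums a song of glass",
    "tomorrow folds itself into yesterday",
]

def cut_up(lines, words_per_line=5, num_lines=4, seed=None):
    """Shuffle every word from the source lines into new pseudo-lyric lines.

    Rule-based randomisation, not AI: a random number generator
    recombining existing words according to a fixed formula.
    """
    rng = random.Random(seed)
    words = [word for line in lines for word in line.split()]
    rng.shuffle(words)
    limit = min(len(words), words_per_line * num_lines)
    return [
        " ".join(words[i:i + words_per_line])
        for i in range(0, limit, words_per_line)
    ]

if __name__ == "__main__":
    for lyric in cut_up(source_lines, seed=42):
        print(lyric)
```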
The first academic texts on neural networks applied to music were published in 1988: Peter Todd’s “A sequential network design for musical applications” and J. P. Lewis’s “Creation by refinement: A creativity paradigm for gradient descent learning networks”. While I can get access to the paper Lewis authored, I can only find Todd’s paper cited quite widely as a reference, with no source available online. That’s unfortunate, as it is one of the first examples of an RNN being used to generate music.
The next major advancement in music AI came in 2002, with the first use of LSTMs for music. In their paper, Eck and Schmidhuber used LSTM recurrent networks to find temporal structure in blues music and improvise over it.
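As a rough illustration of that kind of architecture (not Eck and Schmidhuber’s actual implementation), here is a minimal PyTorch sketch of an LSTM that predicts the next note in a symbolic sequence. The vocabulary size, layer sizes and class name are assumptions of mine.

```python
import torch
import torch.nn as nn

class NextNoteLSTM(nn.Module):
    """Minimal LSTM that predicts the next token in a symbolic music sequence.

    A sketch only: vocab_size, embed_size and hidden_size are assumptions,
    not the configuration used in the 2002 paper.
    """

    def __init__(self, vocab_size=128, embed_size=64, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, time) integer note ids
        x = self.embed(tokens)
        out, _ = self.lstm(x)      # (batch, time, hidden)
        return self.head(out)      # logits over the next note at each step

if __name__ == "__main__":
    model = NextNoteLSTM()
    dummy = torch.randint(0, 128, (2, 32))   # two sequences of 32 note ids
    print(model(dummy).shape)                # torch.Size([2, 32, 128])
```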
Also in 2002 came the use of spectrograms as the input to a neural network for music (Marolt, 2002). Using spectrograms to train a music model may not sound remarkable, but it was a speculative leap: the idea of using images, rather than just sound, to train a neural network for music opened up the field of study and allowed advances in image processing to be applied to music recognition and categorisation. This was also the period when GPU transistor counts and hardware-accelerated maths functions really came into their own, with parallel processing letting GPUs outperform CPUs for this kind of work (Merritt, 2023).
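For context, this is roughly what “using a spectrogram as the input” means in practice: a minimal sketch using the librosa library, where the file path and STFT parameters are placeholders of my own, not values from Marolt’s paper.

```python
import numpy as np
import librosa

# Placeholder path; any mono audio file would do.
AUDIO_PATH = "example.wav"

# Load audio and compute a magnitude spectrogram via the short-time Fourier
# transform. Each column is a time frame and each row a frequency bin, so the
# result can be treated like an image and fed to image-style networks.
y, sr = librosa.load(AUDIO_PATH, sr=22050, mono=True)
stft = librosa.stft(y, n_fft=2048, hop_length=512)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

print(spectrogram_db.shape)  # (frequency_bins, time_frames), e.g. (1025, N)
```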
To help visualise these papers over time, I’ve created a handy timeline (Figure 1):
As you can see from the timeline, the pace of development in AI technology has increased dramatically over the last few years.
For ST4RT+, most of the technology that I’m working with is based on the 2002 paper by Eck & Schmidhuber. Their use of RNNs and LSTMs is the closest in structure and concept to the ST4RT+ model. While the technology I’m using was first developed in the early 2000s, later work focused on generating and analysing waveform data rather than representational (MIDI) data. My reasoning for using representational data was that I didn’t need to spend a lot of money on processing to create the artefacts. It also has the added benefit of giving me more creative freedom to choose instruments that work with the melodies that ST4RT+ generates. This allows me to be the “Music Producer” in the collaboration with ST4RT+, rather than have the model generate full music sequences.
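To show what I mean by representational data, here is a minimal sketch using the pretty_midi library; the note values and file name are placeholders, not output from ST4RT+. The model only has to deal with compact note events like these, and I can assign any instrument to them afterwards in the producer role.

```python
import pretty_midi

# Build a short melody as note events rather than audio samples.
melody = pretty_midi.PrettyMIDI()
instrument = pretty_midi.Instrument(program=0)  # the program is just a label; it can be re-voiced later

# Each note is (pitch, start, end): a small, editable representation.
for i, pitch in enumerate([60, 62, 64, 67]):  # C, D, E, G as MIDI note numbers
    note = pretty_midi.Note(velocity=100, pitch=pitch,
                            start=i * 0.5, end=(i + 1) * 0.5)
    instrument.notes.append(note)

melody.instruments.append(instrument)
melody.write("melody_sketch.mid")  # placeholder file name
```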
Timeline Bibliography
Lewis and Todd papers from the 1980s:
Todd, P. (1988) “A sequential network design for musical applications,” in Proceedings of the 1988 connectionist models summer school, 1988, pp. 76-84.
Lewis, J. P. (1988) “Creation by refinement: a creativity paradigm for gradient descent learning networks.” IEEE 1988 International Conference on Neural Networks (1988): 229-233 vol.2.
The first time someone used LSTMs for music:
Eck, D., & Schmidhuber, J. (2002). “Finding temporal structure in music: blues improvisation with LSTM recurrent networks”. Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, 747-756.
The first time someone processed spectrograms with neural networks:
Marolt, M., Kavčič, A., & Privosnik, M. (2002). “Neural Networks for Note Onset Detection in Piano Music” in International Computer Music Conference (ICMC).
The first time someone built a music genre classifier with neural networks — based on Hinton’s deep belief networks for unsupervised pre-training:
Lee, H., Pham, P.T., Largman, Y., & Ng, A. (2009). “Unsupervised feature learning for audio classification using convolutional deep belief networks” in Neural Information Processing Systems (NIPS).
Hinton, G.E., Osindero, S., & Teh, Y.W. (2006). “A Fast Learning Algorithm for Deep Belief Nets” in Neural Computation, 18, 1527-1554.
The first time someone built an end-to-end music classifier:
Dieleman, S., & Schrauwen, B. (2014). “End-to-end learning for music audio” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6964-6968.
A study run at Pandora Radio showing the potential of end-to-end learning at scale:
Pons, J., Nieto, O., Prockup, M., Schmidt, E.M., Ehmann, A.F., & Serra, X. (2017). “End-to-end Learning for Music Audio Tagging at Scale”. ArXiv, abs/1711.02520.
Humphrey and Bello did some work on chord recognition and wrote the deep learning for music manifesto:
Humphrey, E.J., & Bello, J.P. (2012). “Rethinking Automatic Chord Recognition with Convolutional Neural Networks” in 2012 11th International Conference on Machine Learning and Applications, 2, 357-362.
Humphrey, E.J., Bello, J.P., & LeCun, Y. (2012). “Moving Beyond Feature Design: Deep Architectures and Automatic Feature Learning in Music Informatics” in International Society for Music Information Retrieval Conference.
Discussion on how to improve current architectures:
Choi, K., Fazekas, G., & Sandler, M.B. (2016). “Automatic Tagging Using Deep Convolutional Neural Networks” in International Society for Music Information Retrieval Conference.
Pons, J., Lidy, T., & Serra, X. (2016). “Experimenting with musically motivated convolutional neural networks” in 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), 1-6.
Lee, J., Park, J., Kim, K.L., & Nam, J. (2017). “Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms”. ArXiv, abs/1703.01789.
Some modern generative models for algorithmic composition (GANs and VAEs, basically):
Yang, L., Chou, S., & Yang, Y. (2017). “MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation”. ArXiv, abs/1703.10847.
Roberts, A., Engel, J., Raffel, C., Hawthorne, C., & Eck, D. (2018). “A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music”. ArXiv, abs/1803.05428.
And some works directly synthesizing music audio (waveGAN and Wavenet, basically):
Donahue, C., McAuley, J., & Puckette, M. (2018). “Synthesizing Audio with Generative Adversarial Networks”. ArXiv, abs/1802.04208.
Oord, A.V., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W., & Kavukcuoglu, K. (2016). “WaveNet: A Generative Model for Raw Audio” in Speech Synthesis Workshop.
Dieleman, S., Oord, A.V., & Simonyan, K. (2018). “The challenge of realistic music generation: modelling raw audio at scale”. ArXiv, abs/1806.10474.
Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., & Simonyan, K. (2017). “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders”. ArXiv, abs/1704.01279.
Bibliography
Merritt, R. (2023) “Why GPUs Are Great for AI”. Available at: https://blogs.nvidia.com/blog/why-gpus-are-great-for-ai/ (Accessed: 20 January 2025)
@taoski (no date) “Verbasizer”. Available at: https://verbasizer.com (Accessed: 20 January 2025)