To better prepare for this project I researched various types of Machine Learning (ML) models. This post is a distillation of my research into the models and why I have chosen to use RNNs (Recurrent Neural Networks) and VAEs (Variational Autoencoders).
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a type of neural network that processes sequences of data, which in the case of music is a time series. What sets an RNN apart from other types of neural network is its "hidden memory" of previous inputs: the network contains loops that carry information from one step in a sequence to the next.
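To make the looping idea concrete, here is a minimal sketch in plain Python of a single-unit recurrent step. The weights (`w_xh`, `w_hh`) and the input sequence are made-up scalar values for illustration, not part of any real model:

```python
import math

def rnn_step(x, h, w_xh, w_hh):
    # The new hidden state mixes the current input with the previous
    # hidden state, so information from earlier steps persists.
    return math.tanh(w_xh * x + w_hh * h)

# Feed a toy sequence through the loop; h is the "hidden memory".
h = 0.0
for x in [0.5, -0.2, 0.9]:
    h = rnn_step(x, h, w_xh=0.8, w_hh=0.5)
```

After the loop, `h` depends on every input seen so far, which is exactly the property that makes RNNs suited to sequences.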

Because of this looping structure, RNNs are good at time-series prediction. Their weakness is long-term dependencies: in the case of music, repeated motifs may well disappear from longer generated passages. To combat this, the recurrent layers can use LSTM (Long Short-Term Memory) cells, which let the model "remember" previous material for longer, giving the music a more pleasing structure.
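A rough sketch of what an LSTM cell adds, again in plain Python with a single unit and made-up scalar weights: alongside the hidden state `h` it keeps a separate cell state `c`, and learned gates control what gets forgotten, written, and exposed at each step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w):
    f = sigmoid(w["f"] * x + w["fh"] * h)    # forget gate: keep old memory?
    i = sigmoid(w["i"] * x + w["ih"] * h)    # input gate: write new memory?
    o = sigmoid(w["o"] * x + w["oh"] * h)    # output gate: expose memory?
    g = math.tanh(w["g"] * x + w["gh"] * h)  # candidate update
    c = f * c + i * g          # cell state: the long-term memory channel
    h = o * math.tanh(c)       # hidden state: the short-term output
    return h, c

# Run a toy sequence through the cell (weights are assumed values).
w = {"f": 0.8, "fh": 0.1, "i": 0.8, "ih": 0.1,
     "o": 0.8, "oh": 0.1, "g": 0.8, "gh": 0.1}
h, c = 0.0, 0.0
for x in [0.5, -0.2, 0.9]:
    h, c = lstm_step(x, h, c, w)
```

The cell state `c` is only ever scaled and added to, which is what lets information survive many steps, e.g. a motif from earlier in a passage.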
For this project I chose to use an RNN as it is well suited both to generating sequences and to continuing from previous input.
Variational Autoencoders
Variational Autoencoders (VAEs) work differently and are designed to generate new data. A VAE has two main parts: an encoder, which compresses the input data into a smaller, abstract representation called a latent space, and a decoder, which tries to reconstruct the input data from that compressed representation (Figure 2).
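The encoder/decoder pair can be sketched with a toy one-dimensional example in plain Python. The encoder and decoder here are made-up linear functions standing in for neural networks; the sampling step is the standard reparameterisation trick (z = mu + sigma * eps):

```python
import math
import random

def encode(x):
    # Toy encoder: maps the input to the mean and log-variance
    # of a 1-D Gaussian in the latent space (weights are assumed).
    mu = 0.5 * x
    log_var = -1.0
    return mu, log_var

def sample(mu, log_var):
    # Draw a latent point from the learnt distribution rather than
    # using a single fixed point; this is where randomness enters.
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def decode(z):
    # Toy decoder: tries to invert the encoder.
    return 2.0 * z

x = 1.2
mu, log_var = encode(x)
z = sample(mu, log_var)
x_hat = decode(z)
```

Because `sample` draws from a distribution around `mu`, each run produces a slightly different reconstruction, which is the source of a VAE's generative variety.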

Because the encoder maps the data to a probability distribution rather than a single point in the latent space, you can introduce some randomness by sampling from the learnt distribution.

VAEs are usually trained to balance two terms: reconstruction loss, which measures how well the VAE can reconstruct the original data from the latent space, and KL divergence, which keeps the latent space organised and smooth so that newly sampled points produce good results.
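For a 1-D Gaussian latent, both terms have a simple closed form; the sketch below combines squared-error reconstruction with the standard KL divergence between N(mu, sigma²) and N(0, 1). The `beta` weighting is an assumption on my part, commonly used to trade the two terms off:

```python
import math

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    # Reconstruction term: how far the output is from the input.
    recon = (x - x_hat) ** 2
    # KL divergence term: pulls the latent distribution towards
    # a standard Gaussian, keeping the latent space smooth.
    kl = -0.5 * (1.0 + log_var - mu ** 2 - math.exp(log_var))
    return recon + beta * kl
```

A perfect reconstruction from a standard-Gaussian latent (mu = 0, log_var = 0) gives a loss of zero; anything else is penalised.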
I chose to use a VAE as it is a true generative model. It can also produce interpolations and variations by sampling points near the original point in the latent space.
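Interpolation in the latent space is just a blend between two encoded points, which the decoder then turns back into data. A minimal sketch, assuming 1-D latent values:

```python
def interpolate(z_a, z_b, t):
    # Linear blend between two latent points:
    # t = 0 gives z_a, t = 1 gives z_b, values between morph smoothly.
    return (1.0 - t) * z_a + t * z_b
```

Decoding `interpolate(z_a, z_b, 0.5)` gives material "halfway between" two inputs, which is what makes latent-space interpolation musically interesting.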
Why both?
Well, VAEs are generative but aren't good at time series (notes over time). I could use a Transformer-based architecture for this, but for music models an RNN combined with LSTM has the advantage of keeping longer-term structure in the music.