Categories
Development

ST4RT+ now in a handy RNN form factor!

I started with a variational autoencoder model but ended up deciding to do the hard work first. So, I decided to create a test version of “me” as a small model, using only six midi files of various works that I’ve created over the last two years (Figure 1).

Figure 1: Screenshot of the six midi files I used to create the test version of ST4RT+

Using Google Colab, I first needed to do my setup and install fluidsynth (a midi synth) and pretty_midi (used to handle midi files). From there I was then able to upload my dataset (Figure 2).

Figure 2: Code to upload the dataset (after removing any old versions if they exist).
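
For reference, a minimal sketch of what that setup-and-upload cell looks like in Colab; the cleanup of old copies follows the figure caption, and the exact file handling is an assumption on my part:

```python
# Minimal Colab setup sketch: install the midi synth and midi library,
# clear out any old uploads, then upload the six files by hand.
!apt-get install -y fluidsynth
!pip install pretty_midi

import glob
from google.colab import files

!rm -f *.mid          # remove any old versions if they exist
uploaded = files.upload()
midi_files = sorted(glob.glob('*.mid'))
print(f'{len(midi_files)} midi files uploaded: {midi_files}')
```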

I then grab the first file, in this case “Theft By Rhythm (Split)2.mid”, as the main file. (In a later version I will separate the model from the input; in this instance it was easier and quicker to put the two tasks together.)

From there I extract the notes from my input (Figure 3).

Figure 3: Extracting the notes from my sample file.
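
A sketch of the note-extraction step, along the lines of the TensorFlow music-generation tutorial I followed; the midi_to_notes helper name and the choice of the first instrument track are my assumptions:

```python
import collections
import pandas as pd
import pretty_midi

def midi_to_notes(midi_file: str) -> pd.DataFrame:
    """Extract pitch, step, and duration for every note in the first instrument."""
    pm = pretty_midi.PrettyMIDI(midi_file)
    instrument = pm.instruments[0]
    notes = collections.defaultdict(list)

    # Sort by start time so 'step' (time since the previous note) makes sense.
    sorted_notes = sorted(instrument.notes, key=lambda note: note.start)
    prev_start = sorted_notes[0].start

    for note in sorted_notes:
        notes['pitch'].append(note.pitch)
        notes['step'].append(note.start - prev_start)
        notes['duration'].append(note.end - note.start)
        prev_start = note.start

    return pd.DataFrame(notes)

raw_notes = midi_to_notes('Theft By Rhythm (Split)2.mid')
raw_notes.head()
```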

I then can visualize the piano roll so I can double check that I’m using the correct midi data (Figure 4).

Figure 4: Piano Roll of the sample midi file.
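
A rough sketch of how a piano roll like Figure 4 can be drawn with matplotlib, assuming the pitch/step/duration table from the step above; plot_piano_roll is just an illustrative helper name:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_piano_roll(notes, count=None):
    """Draw each note as a horizontal line: x = time in seconds, y = MIDI pitch."""
    count = count or len(notes)
    starts = np.cumsum(notes['step'][:count])   # rebuild absolute start times
    ends = starts + notes['duration'][:count]
    plt.figure(figsize=(20, 4))
    plt.hlines(notes['pitch'][:count], starts, ends, linewidth=2)
    plt.xlabel('Time [s]')
    plt.ylabel('MIDI pitch')
    plt.title(f'First {count} notes')
    plt.show()

plot_piano_roll(raw_notes)   # raw_notes from the extraction step above
```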

After this the geek in me decided to look at the distribution of each note variable (Figure 5). This allows me to see the number of times that a pitch occurs in the midi, the step timing (which is the time elapsed from the previous note or start of the track) and the duration of the notes.

Figure 5: Distribution of Pitch, Step, and Duration of the sample midi file.
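
Something like the following produces the three histograms in Figure 5; the bin ranges are arbitrary choices for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Histograms of the three note variables (pitch, step, duration), as in Figure 5.
# raw_notes comes from the extraction step above.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(raw_notes, x='pitch', bins=20, ax=axes[0])
sns.histplot(raw_notes, x='step', bins=np.linspace(0, 1, 21), ax=axes[1])
sns.histplot(raw_notes, x='duration', bins=np.linspace(0, 2, 21), ax=axes[2])
plt.tight_layout()
plt.show()
```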

Next, I create a dataset using all the midi files uploaded previously. With a small dataset of only six files this is super quick. From here we create a dataset in TensorFlow from all the notes in the uploaded files. This is where I will need to make adjustments to get the best results from my model, either with magic numbers (trial and error to find what works best) or with a technique called hyperparameter tuning using Keras Tuner (https://www.tensorflow.org/tutorials/keras/keras_tuner). Simply put, it is a library that helps pick the optimal set of parameters for the ML model rather than relying on trial and error.
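
A hedged sketch of that dataset step, reusing the midi_to_notes helper above; seq_length and vocab_size are exactly the kind of knobs I would hand-tune or hand to Keras Tuner, and the values here are only examples:

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Combine the notes from all six uploaded files into one table.
all_notes = pd.concat([midi_to_notes(f) for f in midi_files])
key_order = ['pitch', 'step', 'duration']
train_notes = np.stack([all_notes[key] for key in key_order], axis=1)
notes_ds = tf.data.Dataset.from_tensor_slices(train_notes)

# Window the note stream into (input sequence, next-note label) pairs.
seq_length = 25    # example value: one of the knobs to tune
vocab_size = 128   # number of possible MIDI pitches

def create_sequences(dataset, seq_length, vocab_size):
    seq_length = seq_length + 1  # inputs plus one label note
    windows = dataset.window(seq_length, shift=1, drop_remainder=True)
    sequences = windows.flat_map(lambda w: w.batch(seq_length, drop_remainder=True))

    def split_labels(seq):
        inputs = seq[:-1] / [vocab_size, 1.0, 1.0]  # scale pitch into 0-1
        labels = {'pitch': seq[-1][0], 'step': seq[-1][1], 'duration': seq[-1][2]}
        return inputs, labels

    return sequences.map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)

seq_ds = create_sequences(notes_ds, seq_length, vocab_size)
```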

We then train the model (Figure 6). What we are looking for is that the loss plot goes down (it doesn’t need to hit zero, but it does need to flatten out).

Figure 6: The model as it trains; the loss gets closer to zero and flattens out, which shows that the training is working.
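
For context, a simplified version of the model and training, in the spirit of the TensorFlow tutorial; the layer sizes, learning rate, and epoch count are placeholders, and the tutorial’s custom step/duration losses are swapped here for plain mean squared error:

```python
import matplotlib.pyplot as plt
import tensorflow as tf

# A small LSTM with three heads: pitch (classification), step and duration (regression).
inputs = tf.keras.Input(shape=(seq_length, 3))
x = tf.keras.layers.LSTM(128)(inputs)
outputs = {
    'pitch': tf.keras.layers.Dense(128, name='pitch')(x),
    'step': tf.keras.layers.Dense(1, name='step')(x),
    'duration': tf.keras.layers.Dense(1, name='duration')(x),
}
model = tf.keras.Model(inputs, outputs)

model.compile(
    loss={
        'pitch': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        'step': 'mse',
        'duration': 'mse',
    },
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
)

train_ds = (seq_ds.shuffle(len(all_notes) - seq_length)
                  .batch(64, drop_remainder=True)
                  .cache()
                  .prefetch(tf.data.AUTOTUNE))
history = model.fit(train_ds, epochs=50)

# The check from Figure 6: total loss per epoch should fall and then flatten out.
plt.plot(history.epoch, history.history['loss'], label='total loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```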

Finally, we start to generate notes from the model we have just built (Figure 7). We can use the “temperature” parameter: higher values create more chaotic note choices, while lower values preserve more of what the training data looks like.

Figure 7: Generating notes from the model.
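
A sketch of how temperature factors into generation, assuming the model above; the predict_next_note helper and the seed/loop details are illustrative rather than my exact code:

```python
import numpy as np
import tensorflow as tf

def predict_next_note(notes, model, temperature=1.0):
    """Sample the next (pitch, step, duration) from the trained model.

    Dividing the pitch logits by a higher temperature flattens the
    distribution (more chaotic choices); a lower temperature stays
    closer to what the training data looks like.
    """
    inputs = tf.expand_dims(notes, 0)
    predictions = model.predict(inputs, verbose=0)

    pitch_logits = predictions['pitch'] / temperature
    pitch = tf.squeeze(tf.random.categorical(pitch_logits, num_samples=1), axis=-1)
    step = tf.maximum(0.0, tf.squeeze(predictions['step'], axis=-1))
    duration = tf.maximum(0.0, tf.squeeze(predictions['duration'], axis=-1))
    return int(pitch), float(step), float(duration)

# Seed with the first seq_length notes of the sample file, then roll forward.
sample_notes = raw_notes[['pitch', 'step', 'duration']].to_numpy()
input_notes = sample_notes[:seq_length] / np.array([vocab_size, 1, 1])

generated = []
for _ in range(120):
    pitch, step, duration = predict_next_note(input_notes, model, temperature=2.0)
    generated.append((pitch, step, duration))
    next_input = np.array([[pitch / vocab_size, step, duration]])
    input_notes = np.concatenate([input_notes[1:], next_input], axis=0)
```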

I then look at the midi file output (Figure 8); I can regenerate it, or change the temperature and regenerate. For this generation it looks like the original input file (Figure 4) but with more influence from the other midi data that I added to the training dataset.

Figure 8: Generated midi file from the input against the model.
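
And a sketch of how the generated notes get written back out as a midi file with pretty_midi; the output filename and instrument choice are placeholders:

```python
import pandas as pd
import pretty_midi

def notes_to_midi(notes: pd.DataFrame, out_file: str,
                  instrument_name: str = 'Acoustic Grand Piano') -> pretty_midi.PrettyMIDI:
    """Turn a pitch/step/duration table back into a playable midi file."""
    pm = pretty_midi.PrettyMIDI()
    instrument = pretty_midi.Instrument(
        program=pretty_midi.instrument_name_to_program(instrument_name))

    prev_start = 0.0
    for _, note in notes.iterrows():
        start = prev_start + note['step']
        end = start + note['duration']
        instrument.notes.append(
            pretty_midi.Note(velocity=100, pitch=int(note['pitch']), start=start, end=end))
        prev_start = start

    pm.instruments.append(instrument)
    pm.write(out_file)
    return pm

# 'generated' comes from the sampling loop above; the filename is a placeholder.
generated_notes = pd.DataFrame(generated, columns=['pitch', 'step', 'duration'])
notes_to_midi(generated_notes, 'st4rtplus_output.mid')
```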

This was a lot of work to complete, but I can confirm that it works as expected even considering the small dataset. I’ll work on getting a larger set of data together and making a few other modifications to make it easier to use. 

I like how it modified the input, though it isn’t yet an easy process to iterate with the model. These future improvements will make this an interesting tool/collaborator.

References

FluidSynth (2024) FluidSynth (Version 2.4.3) [Computer program]. Available at: https://www.fluidsynth.org/ (Accessed: 2 October 2024)

Pretty_midi (2023) pretty_midi (Version 0.2.10) [Computer program]. Available at: https://github.com/craffel/pretty-midi (Accessed: 2 October 2024)

Categories
Research

Like a circle in a spiral, like a wheel within a wheel…

For the Final Major Project (FMP), I have decided to combine my other passion, technology, with my music practice.

Introducing ST4RT+! An AI version of me, to help me co-create with an AI for composition.

To do this I have done a bit of a dive into Machine Learning (ML) models, with a view to being able to do this project without having to train on an exceptionally large dataset. This is important for a few reasons: I do not want to spend most of my time creating training data (I want to be able to feasibly do the project in the time given), and I also want to make sure that the model data is ethical.

So, to find a model that has these attributes, I looked first at Recurrent Neural Networks (RNNs) and Variational AutoEncoders (VAEs). Both can utilise the “latent space” of a pre-trained model. The latent space is the compressed representation that a pre-trained model learns; approaches like MidiMe train a much smaller model on top of that space using a much smaller set of data (Figure 1).

Figure 1: Using latent space in a VAE model to reduce the amount of training data needed to create a working ML model. (Source: https://magenta.tensorflow.org/midi-me)
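
To make the idea concrete, here is a toy sketch (not Magenta’s actual MidiMe code) of training a tiny VAE on the latent vectors a big pre-trained model would produce; all the sizes and the random stand-in data are made up:

```python
import numpy as np
import tensorflow as tf

# Pretend these are latent vectors a big pre-trained VAE (e.g. MusicVAE)
# produced from my own pieces; in reality they would come from its encoder.
big_dim, small_dim = 256, 4
my_latents = np.random.normal(size=(64, big_dim)).astype('float32')

class TinyVAE(tf.keras.Model):
    """A much smaller VAE trained only on the big model's latent vectors."""
    def __init__(self):
        super().__init__()
        self.enc = tf.keras.layers.Dense(small_dim * 2)  # mean and log-variance
        self.dec = tf.keras.layers.Dense(big_dim)

    def call(self, x):
        mean, logvar = tf.split(self.enc(x), 2, axis=-1)
        z = mean + tf.exp(0.5 * logvar) * tf.random.normal(tf.shape(mean))
        # KL penalty keeps the tiny latent space well behaved.
        self.add_loss(-0.5 * tf.reduce_mean(
            1 + logvar - tf.square(mean) - tf.exp(logvar)))
        return self.dec(z)

vae = TinyVAE()
vae.compile(optimizer='adam', loss='mse')   # reconstruction + KL
vae.fit(my_latents, my_latents, epochs=10, verbose=0)
```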

For the RNN model there are two different types, Lookback and Attention, both created to give the model the ability to create long-term structure in the music it produces. The Lookback RNN can recognize patterns that occur over a 1-2 bar range; both variants are built on an LSTM (Long Short-Term Memory) network.

To learn longer phrases, the Attention RNN applies an attention mechanism at every output step for each new note. Simply put, with every new note added to the sequence, the model looks back over the last n steps to evaluate what the next note should be.
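
As a toy illustration of that idea (again, not Magenta’s actual code): score the last n steps of the RNN’s output, softmax the scores into attention weights, and blend the steps into a summary used to pick the next note. All sizes here are made up:

```python
import tensorflow as tf

# Stand-in for the LSTM's outputs over the last n steps of the melody so far.
n_steps, hidden = 16, 64
rnn_outputs = tf.random.normal((1, n_steps, hidden))

score_layer = tf.keras.layers.Dense(1)
scores = score_layer(rnn_outputs)                        # one score per recent step
weights = tf.nn.softmax(scores, axis=1)                  # attention weights over n steps
context = tf.reduce_sum(weights * rnn_outputs, axis=1)   # weighted summary of the past

next_note_logits = tf.keras.layers.Dense(128)(context)   # logits over possible notes
```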

My work from here is to play with some of the code and produce proofs of concept for both an RNN-based and a VAE-based model, then see whether the results are acceptable after training both types on my own input in the latent space to create a model that sounds more like me.

From there I should have the tools to use the AI model as a creative partner to bounce ideas off, a form of co-creation, which is the aim of my project.

I was attempting to use only JavaScript for most of the melody generation, but on further research, I may have to use Python for the RNN, though I should still be able to use JavaScript for the VAE model. This is due to the complexity of implementing either the Lookback or Attention models, which require vector math that isn’t as efficient in JavaScript.

Bibliography

Dinculescu, M., Engel, J. and Roberts, A. (2019) MidiMe: Personalizing a MusicVAE model with user data. [Online]. Google Research. Available at: https://research.google/pubs/midime-personalizing-a-musicvae-model-with-user-data/ (Accessed: 10 October 2024)

Waite, E. (2016) Generating Long-Term Structure in Songs and Stories. [Online]. Google TensorFlow. Available at: https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn (Accessed: 12 October 2024)