I wanted to know a little more about the history of AI-generated music, mainly to get more background, and to separate the work I’m going to do from algorithmic generation. Algorithmic generation is software that uses rules or formulae to generate its results. A good example is Verbasizer (@taoski, no date), created for David Bowie to generate song lyrics from randomised words. It doesn’t use AI, just a random number generator to put lyrics together.
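To illustrate the distinction, here is a minimal sketch of how a Verbasizer-style cut-up generator might work. It is not the original software’s logic: the source lines, function name and parameters are placeholders of my own, and the point is simply that everything here is rules plus randomness, with no model and no training.

```python
import random

# Hypothetical source text, in the spirit of the cut-up technique.
# The real Verbasizer worked from text the user typed in; these are placeholders.
source_lines = [
    "the city sleeps under electric rain",
    "a stranger hums a song of glass",
    "tomorrow folds itself into yesterday",
]

def cut_up(lines, words_per_line=5, num_lines=4, seed=None):
    """Shuffle every word from the source lines into new pseudo-lyric lines.

    Rule-based randomisation, not AI: a random number generator
    recombining existing words according to a fixed formula.
    """
    rng = random.Random(seed)
    words = [word for line in lines for word in line.split()]
    rng.shuffle(words)
    limit = min(len(words), words_per_line * num_lines)
    return [
        " ".join(words[i:i + words_per_line])
        for i in range(0, limit, words_per_line)
    ]

if __name__ == "__main__":
    for lyric in cut_up(source_lines, seed=42):
        print(lyric)
```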
The first academic texts on neural networks applied to music were published in 1988: Peter Todd’s “A sequential network design for musical applications” and J. P. Lewis’s “Creation by refinement: A creativity paradigm for gradient descent learning networks”. While I can get access to the paper Lewis authored, I can only find Todd’s paper cited quite widely as a reference, with no source available online. That’s unfortunate, as it is one of the first examples of an RNN being used to generate music.
The next major advancement in music AI came in 2002, with the first use of LSTMs for music. In their paper, Eck and Schmidhuber used LSTM recurrent networks to find temporal structure in blues music and improvise over it.
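As a rough illustration of that kind of architecture (not Eck and Schmidhuber’s actual implementation), here is a minimal PyTorch sketch of an LSTM that predicts the next note in a symbolic sequence. The vocabulary size, layer sizes and class name are assumptions of mine.

```python
import torch
import torch.nn as nn

class NextNoteLSTM(nn.Module):
    """Minimal LSTM that predicts the next token in a symbolic music sequence.

    A sketch only: vocab_size, embed_size and hidden_size are assumptions,
    not the configuration used in the 2002 paper.
    """

    def __init__(self, vocab_size=128, embed_size=64, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, time) integer note ids
        x = self.embed(tokens)
        out, _ = self.lstm(x)      # (batch, time, hidden)
        return self.head(out)      # logits over the next note at each step

if __name__ == "__main__":
    model = NextNoteLSTM()
    dummy = torch.randint(0, 128, (2, 32))   # two sequences of 32 note ids
    print(model(dummy).shape)                # torch.Size([2, 32, 128])
```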
Also in 2002 came the use of spectrograms as the input to a neural network for music (Marolt, 2002). Using spectrograms to train a music model may not sound remarkable, but it was a speculative leap: the idea of using images, rather than just sound, to train a neural network for music opened up the field of study and allowed advances in image processing to be applied to music recognition and categorisation. This was also the period when GPU transistor counts and hardware-accelerated maths functions really came into their own, with parallel processing letting GPUs outperform CPUs for this kind of work (Merritt, 2023).
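For context, this is roughly what “using a spectrogram as the input” means in practice: a minimal sketch using the librosa library, where the file path and STFT parameters are placeholders of my own, not values from Marolt’s paper.

```python
import numpy as np
import librosa

# Placeholder path; any mono audio file would do.
AUDIO_PATH = "example.wav"

# Load audio and compute a magnitude spectrogram via the short-time Fourier
# transform. Each column is a time frame and each row a frequency bin, so the
# result can be treated like an image and fed to image-style networks.
y, sr = librosa.load(AUDIO_PATH, sr=22050, mono=True)
stft = librosa.stft(y, n_fft=2048, hop_length=512)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

print(spectrogram_db.shape)  # (frequency_bins, time_frames), e.g. (1025, N)
```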
To help visualise these papers over time, I’ve created a handy timeline (Figure 1):
As you can see from the timeline, the pace of development in AI technology has increased dramatically over the last few years.
For ST4RT+, most of the technology that I’m working with is based on the 2002 paper by Eck & Schmidhuber. Their use of RNNs and LSTMs is the closest in structure and concept to the ST4RT+ model. While the technology I’m using was first developed in the early 2000s, later work focused on generating and analysing waveform data rather than representational (MIDI) data. My reasoning for using representational data was that I didn’t need to spend a lot of money on processing to create the artefacts. It also has the added benefit of giving me more creative freedom to choose instruments that work with the melodies that ST4RT+ generates. This allows me to be the “Music Producer” in the collaboration with ST4RT+, rather than have the model generate full music sequences.
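To show what I mean by representational data, here is a minimal sketch using the pretty_midi library; the note values and file name are placeholders, not output from ST4RT+. The model only has to deal with compact note events like these, and I can assign any instrument to them afterwards in the producer role.

```python
import pretty_midi

# Build a short melody as note events rather than audio samples.
melody = pretty_midi.PrettyMIDI()
instrument = pretty_midi.Instrument(program=0)  # the program is just a label; it can be re-voiced later

# Each note is (pitch, start, end): a small, editable representation.
for i, pitch in enumerate([60, 62, 64, 67]):  # C, D, E, G as MIDI note numbers
    note = pretty_midi.Note(velocity=100, pitch=pitch,
                            start=i * 0.5, end=(i + 1) * 0.5)
    instrument.notes.append(note)

melody.instruments.append(instrument)
melody.write("melody_sketch.mid")  # placeholder file name
```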
Timeline Bibliography
Lewis and Todd papers from the 1980s:
Todd, P. (1988) “A sequential network design for musical applications,” in Proceedings of the 1988 connectionist models summer school, 1988, pp. 76-84.
Lewis, J. P. (1988) “Creation by refinement: a creativity paradigm for gradient descent learning networks.” IEEE 1988 International Conference on Neural Networks (1988): 229-233 vol.2.
The first time someone used LSTMs for music:
Eck, D., & Schmidhuber, J. (2002). “Finding temporal structure in music: blues improvisation with LSTM recurrent networks”. Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, 747-756.
The first time someone processed spectrograms with neural networks:
Marolt, M., Kavčič, A., & Privosnik, M. (2002). “Neural Networks for Note Onset Detection in Piano Music” in International Computer Music Conference (ICMC).
The first time someone built a music genre classifier with neural networks — based on Hinton’s deep belief networks for unsupervised pre-training:
Lee, H., Pham, P.T., Largman, Y., & Ng, A. (2009). “Unsupervised feature learning for audio classification using convolutional deep belief networks” in Neural Information Processing Systems (NIPS).
Hinton, G.E., Osindero, S., & Teh, Y.W. (2006). “A Fast Learning Algorithm for Deep Belief Nets” in Neural Computation, 18, 1527-1554.
The first time someone built an end-to-end music classifier:
Dieleman, S., & Schrauwen, B. (2014). “End-to-end learning for music audio” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6964-6968.
A study run at Pandora Radio showing the potential of end-to-end learning at scale:
Pons, J., Nieto, O., Prockup, M., Schmidt, E.M., Ehmann, A.F., & Serra, X. (2017). “End-to-end Learning for Music Audio Tagging at Scale”. ArXiv, abs/1711.02520.
Humphrey and Bello did some work on chord recognition and wrote the deep learning for music manifesto:
Humphrey, E.J., & Bello, J.P. (2012). “Rethinking Automatic Chord Recognition with Convolutional Neural Networks” in 2012 11th International Conference on Machine Learning and Applications, 2, 357-362.
Humphrey, E.J., Bello, J.P., & LeCun, Y. (2012). “Moving Beyond Feature Design: Deep Architectures and Automatic Feature Learning in Music Informatics” in International Society for Music Information Retrieval Conference.
Discussion on how to improve current architectures:
Choi, K., Fazekas, G., & Sandler, M.B. (2016). “Automatic Tagging Using Deep Convolutional Neural Networks” in International Society for Music Information Retrieval Conference.
Pons, J., Lidy, T., & Serra, X. (2016). “Experimenting with musically motivated convolutional neural networks” in 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), 1-6.
Lee, J., Park, J., Kim, K.L., & Nam, J. (2017). “Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms”. ArXiv, abs/1703.01789.
Some modern generative models for algorithmic composition (GANs and VAEs, basically):
Yang, L., Chou, S., & Yang, Y. (2017). “MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation”. ArXiv, abs/1703.10847.
Roberts, A., Engel, J., Raffel, C., Hawthorne, C., & Eck, D. (2018). “A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music”. ArXiv, abs/1803.05428.
And some works directly synthesizing music audio (waveGAN and Wavenet, basically):
Donahue, C., McAuley, J., & Puckette, M. (2018). “Synthesizing Audio with Generative Adversarial Networks”. ArXiv, abs/1802.04208.
Oord, A.V., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W., & Kavukcuoglu, K. (2016). “WaveNet: A Generative Model for Raw Audio” in Speech Synthesis Workshop.
Dieleman, S., Oord, A.V., & Simonyan, K. (2018). “The challenge of realistic music generation: modelling raw audio at scale”. ArXiv, abs/1806.10474.
Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., & Simonyan, K. (2017). “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders”. ArXiv, abs/1704.01279.
Bibliography
Merritt, R. (2023) “Why GPUs Are Great for AI”. Available at: https://blogs.nvidia.com/blog/why-gpus-are-great-for-ai/ (Accessed: 20 January 2025)
@taoski (no date) “Verbasizer”. Available at: https://verbasizer.com (Accessed: 20 January 2025)