Categories
Creative Development Planning Research

Can you collaborate with an ANT?

Just kidding. But it does raise the question: how do I collaborate with a non-human entity? Is it even possible?

Well, from my perspective it is. However, let’s look at what others in the fields of creativity and sociology say.

I’ll start with a sociological perspective. A definition of Actor Network Theory (ANT) is a good place to start, so that we can break down how this relates to my Final Major Project (FMP) and what insights arise from working with a non-human collaborator.

Actor Network Theory (ANT) is a theoretical framework within the field of science and technology studies (STS) that examines how social, technical, and material entities (referred to as “actors”) interact to form complex networks that shape and influence outcomes. ANT challenges traditional distinctions between human and non-human agents by treating them symmetrically as participants in these networks.

This definition, while never using the term “collaborator” for non-human entities, states that human and non-human “actors” interact with and influence one another.

ANT contrasts with Technological Determinism (TD), the idea that technology develops independently of society and itself drives social change (Bimber, 1990). For example, Karl Marx believed that the railway in colonial India changed social hierarchies by introducing new economic activities (ibid.). While TD can look like a good starting point when you take a cursory view of any technology and its impact on how people use it, I believe a more nuanced approach leads to a better understanding of how we as humans interact with technology, and how we in turn shape it. To gain a more holistic view of collaboration I’ll bring in Fraser’s “Collaboration bites back” (2022), in which she sets out a manifesto for collaboration as a tool for change. So, I thought it best to go through her 10-point manifesto and explain how working with ST4RT+ meets her points.

  1.  Collaboration should not be predictable:
    This is an easy one. While ST4RT+ is based on my melodies and data, it doesn’t create melodies that are 100% what I would do.
  2. Collaboration should not be clean:
    This one is a little more nuanced. I will say that when I was struggling with the outputs of the model at the start of this project, I had to get my hands dirty and get to the point where I started thinking more like a music producer and less like a developer. 
  3. Collaboration should not be safe:
    This whole project was a risk: using technology I’d never used before, with no guarantee it would work, left me thinking I’d be lucky to generate anything worthwhile.
  4. Collaboration requires consent:
    This is harder with a non-human collaborator; however, if the original generation of a set of melodies is objectively awful (all the notes overlapped and on bar one), I simply regenerate.
  5. Collaboration requires trust:
    This point is interesting: for me it was about trusting myself and the process. When I was fighting the model’s output it was because I wasn’t trusting my skills as a music producer. I wanted the model to generate clean melody lines. Trusting myself has really helped to get this project working.
  6. Collaboration requires time, and time (usually) costs money:
    This project has taken time to get working (far more time in the beginning than I anticipated). It has needed experimentation and failure to get to a point where the process and methodology are working.
  7. Collaboration requires vigilance:
    Even with a non-human collaborator this still applies, though it relies on me to do most of that work.
  8. Collaboration is not compulsory:
    Nothing to see here… in this case it was compulsory.
  9. Collaboration is not cool:
    I disagree here, if only because under an ANT framework almost everything is a collaboration, even if you aren’t aware of it.
  10. Collaboration is a tool for change:
    I agree that any collaboration should challenge the status quo. For me the idea of creating an ethical use for AI trained only on the data that I have given it challenges how AI is being used and the data it is trained on. For me this is important and a point of difference with this project.

Looking back at Fraser’s 10-point manifesto, I think this project still works in terms of meeting what she defines as collaboration.

Bibliography

Bimber, B. (1990) “Karl Marx and the Three Faces of Technological Determinism”, Social Studies of Science, 20(2), pp. 333-352. Available at: https://www.jstor.org/stable/285094 (Accessed: 2 December 2024).

Fraser, J. (2022) Collaboration bites back. Available at: https://www.julietfraser.co.uk/app/download/11414030/Collaboration+bites+back.pdf (Accessed: 18 October 2024)

Categories
Creative Research

Powers of Ten

For this project I’d like to emulate some of the sounds of Stephan Bodzin, as I love his work and his aesthetic. Hence the need to research how he got the sounds on his album Powers of Ten (Bodzin, 2015).

A live set by Stephan Bodzin:

Most of the information I can find online covers what Bodzin uses for his live shows (McGlynn, 2021). His main setup is also confirmed by his Equipboard profile (Equipboard, 2024), though it is important to note that Equipboard describes it as “a community-built gear list for Stephan Bodzin.”

What does he use?

  • Moog Sub37: Serves as the foundation of his music and live set. He uses it to play basslines and as the timing master for the Matriarch and the Modor (ibid.).
  • Modor DR-2 Drum Machine: A digital drum machine that uses DSP synthesis rather than samples to produce its sounds; there is no sample memory, it’s all synth-based (Modor, 2024).
  • Moog Matriarch Semi-Modular Synthesizer: A “patchable 4-note paraphonic analog synthesizer” (Moog, 2024).
  • Ableton Live: Live (Ableton, 2024) is used to trigger sounds and MIDI, including drums and basslines. It also serves as the hub for syncing his hardware synthesizers and drum machine.

Stephan’s music

It is safe to say that his main genres are trance and techno, sitting on the melodic side of techno. This can be heard in the opening track of Powers of Ten, “Singularity” (Bodzin, 2015), where a main melody line runs throughout the track and the analog Moog Matriarch is used to full advantage. It’s almost organic in the way he plays with both the filter and the pitch to generate interest in the main melody as it evolves across the track.

In the title track “Powers of Ten” (Bodzin, 2015) Stephan uses a technique where the main rhythmic sound heard across the whole track is modified by adding noise and changing both the filter cutoff and the filter resonance.

I’m also interested in the kick he has used on this album. It has a sound that really cuts through his mixes. This was harder to find, but thankfully over at KVR Audio’s forums there was a solution (KVRAudio, 2016). From that thread it sounds like it is a saturated 808 kick with some EQ and compression, and I can confirm that this works to create a kick that sounds right.

How I’m going to create some of these types of sounds

Since I don’t have these specific pieces of hardware, I’m going to use a combination of Ableton’s Analog and a series of LFOs to modulate the filter cutoffs and resonances (without syncing them to the tempo, to keep them more organic).
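To show what I mean by unsynced modulation, here is a minimal Python sketch of the maths (nothing to do with Ableton itself, and the rates and ranges are just illustrative guesses): two free-running LFOs whose rates don’t divide into a bar length, summed to drift a filter cutoff around.

```python
import numpy as np

# Minimal sketch of "unsynced" modulation: two free-running sine LFOs whose
# rates don't line up with the tempo, summed to drift a filter cutoff around.
# All rates and ranges below are illustrative, not settings from any track.
control_rate = 100                        # control-signal updates per second
t = np.arange(0, 30, 1 / control_rate)    # 30 seconds of modulation
lfo_a = np.sin(2 * np.pi * 0.13 * t)      # 0.13 Hz, never locks to a bar
lfo_b = np.sin(2 * np.pi * 0.057 * t)     # a second, even slower drift
cutoff_hz = 800 + 400 * (0.6 * lfo_a + 0.4 * lfo_b)   # wanders roughly 400-1200 Hz
```

In Ableton this is effectively two unsynced LFO devices mapped to Analog’s cutoff and resonance, so the movement never repeats exactly with the bars.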

For the drum sound, while I have used 909 kicks for some of the tracks as they work better, I’m going to use an 808 for some of the other tracks. The method found online creates something that is indistinguishable from his kick once mixed.
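For reference, here is a rough Python sketch of that recipe as I understand it from the forum thread: an 808-style kick (a pitch-swept sine with an exponential amplitude decay) driven into saturation. It is only an approximation of the idea, with guessed values; the EQ and compression still happen in Ableton.

```python
import numpy as np

# Rough sketch of the forum recipe: an 808-style kick (pitch-swept sine with
# an exponential amplitude decay) pushed into tanh saturation for grit.
# Every value here is a guess for illustration, not Bodzin's actual settings.
sr = 44100
t = np.arange(0, 0.6, 1 / sr)
freq = 45 + 50 * np.exp(-t * 12)           # sweep from ~95 Hz down to 45 Hz
phase = 2 * np.pi * np.cumsum(freq) / sr   # integrate frequency to get phase
kick = np.sin(phase) * np.exp(-t * 6)      # amplitude decay gives the thump
kick = np.tanh(kick * 4.0)                 # saturation rounds and thickens it
kick /= np.abs(kick).max()                 # normalise before EQ/compression
```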

I have a friend who will let me borrow his Sub37, so I’ll use that to track and record some of the bass sounds (I’ve looked for a VST/virtual instrument to do this and there are none available). 

While I don’t think I’ll copy his style completely, even looking at how he has created his tracks is giving me ideas to further my own tracks. 

Bibliography

McGlynn, D. (2021) Stephan Bodzin: How I Play Live. Available at: https://djmag.com/longreads/stephan-bodzin-how-i-play-live (Accessed: 12 November 2024)

Modor (2024) MODOR Digital Polyphonic Synths | DR-2. Available at: https://www.modormusic.com/dr2.html (Accessed: 12 November 2024)

Moog (2024) Matriarch | Moog. Available at: https://www.moogmusic.com/products/matriarch (Accessed: 12 November 2024)

Ableton (2024) What’s new in Live 12 | Ableton. Available at: https://www.ableton.com/en/live/ (Accessed: 12 November 2024)

Equipboard (2024) Stephan Bodzin | Equipboard. Available at: https://equipboard.com/pros/stephan-bodzin (Accessed: 12 November 2024)

Bodzin, S. (2015) Powers of Ten. Available at: Apple Music (Accessed: 14 November 2024)

KVRAudio (2016) Stephan Bodzin kick. Available at: https://www.kvraudio.com/forum/viewtopic.php?t=469969 (Accessed: 8 January 2025)

Categories
Development Research

Weights for an AI model

Weights

Some types of AI models use weights to help tune the output they produce. Simply put, weights are numbers that help define the internal behaviour of a machine learning model: they tell the model what we consider important when creating an output. A weight close to 1.0 indicates higher importance to the model, and the closer it gets to zero, the less important that output becomes. In ST4RT+ we have Pitch, Step (the start of a note), and Duration (the end of a note). These work together to help the AI decide which notes to predict based on the input and the training data.

So how do these work to create an output?

Well, for the most part it’s the weights that give a level of control over what the model sees as the “right” predictions.
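To make that concrete, here is a minimal sketch of how weights like these can be wired into a Keras-style model with separate Pitch, Step, and Duration outputs. It shows the mechanism only; the layer sizes and structure are placeholders rather than the actual ST4RT+ code.

```python
import tensorflow as tf

# Hypothetical sketch: a sequence model with three output heads, one each for
# pitch, step, and duration, weighted against each other at compile time.
inputs = tf.keras.Input(shape=(None, 3))      # (time, [pitch, step, duration])
x = tf.keras.layers.LSTM(128)(inputs)         # the model's "memory" over the window

outputs = {
    "pitch":    tf.keras.layers.Dense(128, name="pitch")(x),   # one logit per MIDI note
    "step":     tf.keras.layers.Dense(1, name="step")(x),
    "duration": tf.keras.layers.Dense(1, name="duration")(x),
}
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer="adam",
    loss={
        "pitch":    tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        "step":     "mse",
        "duration": "mse",
    },
    # These are the weights discussed above: all 1.0 treats every output as
    # equally important; dropping one (e.g. pitch to 0.005) tells the model
    # to care far less about getting that prediction "right".
    loss_weights={"pitch": 1.0, "step": 1.0, "duration": 1.0},
)
```

The experiments below amount to changing single entries in that loss_weights dictionary and regenerating.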

With the weights for the ST4RT+ model all at “1.0”, every variable carries equal weight: nothing is more important than anything else. So, when we give the model an input, Pitch, Step, and Duration are all equally important. This results in what we can see in Figure 1.

Figure 1: All weights are set to “1.0”

With all weights set to “1.0” we can see that, because Step and Duration are just as important as Pitch, there is less variation in the pitches predicted. Also of note, around halfway through the output the Duration of the notes (their length) suddenly truncates. This is due to the Long Short-Term Memory (LSTM) window dropping out. The LSTM is the AI model’s memory and allows it to remember melodic phrases. I used a shorter window because of the memory usage on Google Colab: as you increase the LSTM window you increase the processing requirements exponentially (mHelpMe, 2020), since the software needs to remember not only the previous steps but also the predictions.
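For context, this is roughly how the window length enters the training pipeline. The names and the sequence length are placeholders rather than the exact ST4RT+ values.

```python
import tensorflow as tf

SEQ_LENGTH = 25   # the "window": how many previous notes the model sees (placeholder value)

# Stand-in training data: one row of [pitch, step, duration] per note.
notes = tf.random.uniform((1000, 3))
dataset = tf.data.Dataset.from_tensor_slices(notes)

# Slice the note list into overlapping windows of SEQ_LENGTH + 1 notes:
# SEQ_LENGTH notes of context plus the single note the model must predict.
windows = dataset.window(SEQ_LENGTH + 1, shift=1, drop_remainder=True)
sequences = windows.flat_map(lambda w: w.batch(SEQ_LENGTH + 1))
train = sequences.map(lambda seq: (seq[:-1], seq[-1]))
```

Every extra note in the window makes each training example larger and the backpropagation through the LSTM longer, which is why I kept the window short on Colab.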

With the weights set as follows: Pitch “0.005”, Step “1.0”, and Duration “1.0” we get what you can see in Figure 2.

Figure 2: Weights set to Pitch “0.005”, Step “1.0”, and Duration “1.0”

This creates a more random fluctuation of pitches. The model is told it can worry less about the predicted pitch values and instead concentrate on the Step and Duration parameters. In this instance it created a wider-ranging melody, going down as low as MIDI note 40 (E2, E in the 2nd octave), whereas when Pitch was set to “1.0” the lowest note was 72 (C5, C in the 5th octave). This gives me some control over the melodic range of the output.

When we set the weights as follows: Pitch “1.0”, Step “0.005”, and Duration “1.0” we get what you can see in Figure 3.

Figure 3: Weights set to Pitch “1.0”, Step “0.005”, and Duration “1.0”

With the same input as the previous examples, you can see that a low Step weight tells the model to prioritise Pitch and Duration. Step, which controls where a note should start, now creates a large overlap between notes. This is due to the large difference between the start of a note and its end (the Duration): the model places the end of each note correctly according to the input data, but the start of each note is wildly early, which creates these overlaps.

Finally, when we set the weights as follows: Pitch “1.0”, Step “1.0”, and Duration “0.005” we get what you can see in Figure 4.

Figure 4: Weights set to Pitch “1.0”, Step “1.0”, and Duration “0.005”

In this result we can see that Pitch and Step are even, with not too much pitch variation. And while there is some variation in note length, the notes tend to get shorter as the LSTM window starts to fill. Because Duration and Step are co-dependent, we can see how they influence one another.

How does this help me?

When I look at any output generated by ST4RT+, I can now make an educated determination about how to get something more musical from the model by making small modifications to the weights. This type of fine-tuning isn’t as necessary when you have a larger dataset, but as I’m working with a micro-sized dataset, I sometimes need these small adjustments to get the best from the model.

Bibliography

mHelpMe (2020) “LSTM network window size selection and effect”. StackExchange. Available at: https://stats.stackexchange.com/questions/465807/lstm-network-window-size-selection-and-effect (Accessed: 4 February 2025)

Categories
Research

A little AI music history in academia

I wanted to know a little more about the history of AI-generated music, mainly to get more background, and to separate the work I’m going to do from algorithmic generation. Algorithmic generation is software that uses rules or formulae to generate its results. A good example is the Verbasizer (@taoski, no date), created for David Bowie to generate song lyrics from randomised words. It doesn’t use AI, but a random number generator to put lyrics together.

The first published academic texts on music using neural networks are from Todd (1988), “A sequential network design for musical applications”, and Lewis (1988), “Creation by refinement: a creativity paradigm for gradient descent learning networks”. While I can access the paper Lewis authored, I can only find Todd’s paper cited quite widely as a reference, with no sources available online. It’s unfortunate that I cannot access Todd’s paper, as it is one of the first examples of an RNN being used to generate music.

The next major advancement in music AI came in 2002, with the use of LSTMs for music. In their paper, Eck and Schmidhuber (2002) used LSTM recurrent networks to find temporal structure in blues music.

Also in 2002 came the use of spectrograms as the input to a neural network for music (Marolt, 2002). While using spectrograms as training data for a music model might not sound amazing, it was a speculative leap: the idea of using images, rather than just sound, to train a neural network for music opened up the field of study and allowed advances in image processing to be applied to music recognition and categorisation. This is also the period when GPUs’ transistor counts and their acceleration of maths functions came into their own, with parallel processing making GPUs outperform CPUs for this kind of work (Merritt, 2023).

To help visualise these papers over time, I’ve created a handy timeline (Figure 1):

As you can see from the timeline, the technology around AI has advanced dramatically in the last few years.

For ST4RT+, most of the technology that I’m working with is based on the 2002 paper by Eck and Schmidhuber. Their use of an RNN with LSTM is the closest in structure and concept to the ST4RT+ model. While the technology I’m using was first developed in the early 2000s, later work focused on generating and analysing waveform data rather than representational (MIDI) data. My reasoning for using representational data was that I didn’t need to spend a lot of money on processing to create the artefacts. Plus, it has the added benefit of allowing me more creative freedom to choose instruments that work with the melodies ST4RT+ generates. This lets me be the “Music Producer” in the collaboration with ST4RT+ rather than having the model generate full music sequences.

Timeline Bibliography

Lewis and Todd papers from the 80s:

Todd, P. (1988) “A sequential network design for musical applications,” in Proceedings of the 1988 connectionist models summer school, 1988, pp. 76-84. 

Lewis, J. P.  (1988) “Creation by refinement: a creativity paradigm for gradient descent learning networks.” IEEE 1988 International Conference on Neural Networks (1988): 229-233 vol.2.

The first time someone used LSTMs for music:

Eck, D., & Schmidhuber, J. (2002). “Finding temporal structure in music: blues improvisation with LSTM recurrent networks”. Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, 747-756. 

The first time someone processed spectrograms with neural networks:

Marolt, M., Kavčič, A., & Privosnik, M. (2002). “Neural Networks for Note Onset Detection in Piano Music” in International Computer Music Conference (ICMC).

The first time someone built a music genre classifier with neural networks — based on Hinton’s deep belief networks for unsupervised pre-training:

Lee, H., Pham, P.T., Largman, Y., & Ng, A. (2009). “Unsupervised feature learning for audio classification using convolutional deep belief networks” in Neural Information Processing Systems (NIPS).

Hinton, G.E., Osindero, S., & Teh, Y.W. (2006). “A Fast Learning Algorithm for Deep Belief Nets” in Neural Computation, 18, 1527-1554. 

The first time someone built an end-to-end music classifier:

Dieleman, S., & Schrauwen, B. (2014). End-to-end learning for music audio. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6964-6968.

A study run at Pandora Radio showing the potential of end-to-end learning at scale:

Pons, J., Nieto, O., Prockup, M., Schmidt, E.M., Ehmann, A.F., & Serra, X. (2017). End-to-end Learning for Music Audio Tagging at Scale. ArXiv, abs/1711.02520

Humphrey and Bello did some work on chord recognition and wrote the deep learning for music manifesto:

Humphrey, E.J., & Bello, J.P. (2012). “Rethinking Automatic Chord Recognition with Convolutional Neural Networks” in 2012 11th International Conference on Machine Learning and Applications, 2, 357-362. 

Humphrey, E.J., Bello, J.P., & LeCun, Y. (2012). “Moving Beyond Feature Design: Deep Architectures and Automatic Feature Learning in Music Informatics” in International Society for Music Information Retrieval Conference.

Discussion on how to improve current architectures:

Choi, K., Fazekas, G., & Sandler, M.B. (2016). “Automatic Tagging Using Deep Convolutional Neural Networks” in International Society for Music Information Retrieval Conference

Pons, J., Lidy, T., & Serra, X. (2016). Experimenting with musically motivated convolutional neural networks. 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), 1-6.

Lee, J., Park, J., Kim, K.L., & Nam, J. (2017). Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms. ArXiv, abs/1703.01789.

Some modern generative models for algorithmic composition (GANs and VAEs, basically):

Yang, L., Chou, S., & Yang, Y. (2017). MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation. ArXiv, abs/1703.10847.

Roberts, A., Engel, J., Raffel, C., Hawthorne, C., & Eck, D. (2018). A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. ArXiv, abs/1803.05428.

And some works directly synthesizing music audio (waveGAN and Wavenet, basically):

Donahue, C., McAuley, J., & Puckette, M. (2018). Synthesizing Audio with Generative Adversarial Networks. ArXiv, abs/1802.04208.

Oord, A.V., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W., & Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. Speech Synthesis Workshop.

Dieleman, S., Oord, A.V., & Simonyan, K. (2018). The challenge of realistic music generation: modelling raw audio at scale. ArXiv, abs/1806.10474.

Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., & Simonyan, K. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. ArXiv, abs/1704.01279

Bibliography

Merritt, R. (2023) “Why GPUs Are Great for AI”. Available at: https://blogs.nvidia.com/blog/why-gpus-are-great-for-ai/ (Accessed: 20 January 2025)

@taoski (no date) “Verbasizer”. Available at: https://verbasizer.com (Accessed: 20 January 2025)

Categories
Research

Got a little ahead of myself… what are RNNs and VAEs?

To better prepare for this project I researched various types of Machine Learning (ML) models. This post is a distillation of my research into the models and why I have chosen to use RNNs (Recurrent Neural Networks) and VAEs (Variational Autoencoders).

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of neural network that processes sequences of data; in the case of music, a time series. The difference between an RNN and other types of neural network is that RNNs keep a “hidden memory” of previous inputs. This is because they have loops that allow information to pass from one step in a sequence to the next.

Figure 1: How an RNN works as an overview.

Because of the way RNNs loop, they are good at time-series prediction. The issue is that they aren’t so good with long-term dependencies: in the case of music, repeated motifs may well disappear from longer generated passages because of this limitation. To combat this, RNNs can be combined with LSTM (Long Short-Term Memory) cells so the model can “remember” previous outputs for longer, giving the music a more pleasing structure.
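To make the “loop” idea concrete, here is a toy Python sketch of a single recurrent step: the output at each step depends on the current input and on a hidden state carried over from all previous steps. It is purely illustrative; real models use LSTM cells rather than this plain tanh update.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One plain RNN step: mix the current input with the carried-over state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                        # the hidden "memory" starts empty
for x_t in rng.normal(size=(8, 3)):    # a short sequence of 8 inputs (e.g. notes)
    h = rnn_step(x_t, h, W_x, W_h, b)  # each step passes information to the next
```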

For this project I chose to use an RNN as it is a good model for generating sequences based on previous input.

Variational Autoencoders

Variational Autoencoders (VAEs) work differently and are designed to generate new data. A VAE has two main parts: an Encoder, which compresses the input data into a smaller, abstract representation called the latent space, and a Decoder, which tries to reconstruct the input data from that compressed representation (Figure 2).

Figure 2: How a VAE works as an overview.

By encoding the data into a probability distribution rather than a single point in the latent space, you can introduce some randomness by sampling the learnt distribution in the latent space. 

Figure 3: Showing the movement of the point in latent space to produce a novel image output. (Source: IRCAM via YouTube) (IRCAM 2022).

VAEs are usually trained to balance two things: Reconstruction Loss, which measures how well the VAE can rebuild the original data from the latent space, and KL Divergence, which ensures that the latent space is organised and smooth, so that any new samples drawn from it produce good results.
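Here is a small Python sketch of those two terms, assuming a Gaussian latent space; the function names and shapes are mine for illustration rather than taken from any particular library.

```python
import tensorflow as tf

def sample_latent(z_mean, z_log_var):
    """Reparameterisation trick: draw a point from the learnt distribution,
    which is where the controlled randomness mentioned above comes from."""
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def vae_loss(x, x_reconstructed, z_mean, z_log_var):
    """Reconstruction term plus KL divergence term."""
    # Reconstruction loss: how closely the decoder rebuilds the original input.
    reconstruction = tf.reduce_mean(
        tf.reduce_sum(tf.square(x - x_reconstructed), axis=-1)
    )
    # KL divergence: keeps each latent distribution close to a standard normal,
    # so the latent space stays smooth and good to sample from.
    kl = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
    )
    return reconstruction + kl
```

Training minimises the sum of the two, trading reconstruction accuracy against how well organised the latent space stays.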

I chose to use a VAE as it is a true generative model. It can also handle interpolations and variations by sampling points near the original point in the latent space.

Why both?

Well, VAEs are generative but aren’t good at time series (notes over time). I could use Transformer-based architectures to do this, but for my purposes RNNs combined with LSTM have the advantage of keeping longer-term structure in the music.

Bibliography

IRCAM (2022) “IRCAM Tutorials / Rave and nn~” [Online]. Available at: https://www.youtube.com/watch?v=o09BSf9zP-0 (Accessed: 1 November 2024).

Categories
Research

Like a circle in a spiral, like a wheel within a wheel…

For the Final Major Project (FMP), I have decided to combine my other passion, technology, with my music practice.

Introducing ST4RT+! An AI version of me, built so that I can co-create compositions with an AI.

To do this I have done a bit of a dive into Machine Learning (ML) models, with a view to doing this project without having to train on an exceptionally large dataset. This is important for a few reasons: I do not want to spend most of my time creating training data (I want to be able to feasibly do the project in the time given), and I also want to make sure the model data is ethical.

So, to find a model with these attributes, I looked first at Recurrent Neural Networks (RNNs) and Variational Autoencoders (VAEs). Both can make use of the “latent space” of a pre-trained model. The latent space is the compressed internal representation a pre-trained model learns; approaches like MidiMe train a much smaller model on top of that latent space using only a small set of personal data (Dinculescu, Engel and Roberts, 2019) (Figure 1).

Figure 1: Using latent space in a VAE model to reduce the amount of training data needed to create a working ML model. (Source: https://magenta.tensorflow.org/midi-me)

For the RNN model there are two different types, Lookback and Attention, both created to give the model the ability to create long-term structure in the music it produces (Waite, 2016). The Lookback RNN can recognise patterns that occur over a 1-2 bar range, building on LSTM (Long Short-Term Memory).

To learn longer phrases, the Attention RNN instead looks at the outputs of previous steps when generating each new note. Simply put, with every new note the model adds to the sequence, it weighs up the last n steps to evaluate what the next note should be.
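As a toy sketch of that generation loop (placeholder names and values, not the Magenta implementation): each new note is predicted from a fixed window over the most recent steps only.

```python
import numpy as np

WINDOW = 16  # hypothetical context length, not Magenta's actual setting

def generate(predict_next, seed_notes, length):
    """Grow a melody one note at a time, always looking back at the last
    WINDOW steps to decide what comes next."""
    sequence = list(seed_notes)
    for _ in range(length):
        context = sequence[-WINDOW:]            # the last n steps
        sequence.append(predict_next(context))
    return sequence

# Stand-in predictor just for the demo: next note = rounded mean of the context.
demo = generate(lambda ctx: int(round(np.mean(ctx))), seed_notes=[60, 62, 64], length=8)
```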

My work from here is to play with some of the code and produce proofs of concept for both an RNN-based and a VAE-based model, then see whether the results are acceptable after training both types on my own input in the latent space, to create a model that sounds more like me.

From there I should have the tools to use the AI model as a creative partner to bounce ideas off, and as a form of co-creation, which is the aim of my project.

I was planning to use only JavaScript for most of the melody generation, but on further research I may have to use Python for the RNN, while still being able to use JavaScript for the VAE model. This is due to the complexity of implementing either the Lookback or Attention models, which require vector maths that isn’t as efficient in JavaScript.

Bibliography

Dinculescu, M., Engel, J. and Roberts, A. (2019) MidiMe: Personalizing a MusicVAE model with user data. [Online]. Google Research. Available at: https://research.google/pubs/midime-personalizing-a-musicvae-model-with-user-data/ (Accessed: 10 October 2024)

Waite, E. (2016) Generating Long-Term Structure in Songs and Stories. [Online]. Google TensorFlow. Available at: https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn (Accessed: 12 October 2024)