This program aims to generate a tune using TensorFlow.
To start off with, I acquired a dataset of tunes. I used https://thesession.org/ for this, which has a huge collection of traditional music and provides a handy download in a text-based format known as ABC notation. I picked ABC because it means I can generate the output with predictive text, which is far easier than attempting to have the AI produce audio signals! The script for this is below.
import json
import wget


def fetch_abc() -> str:
    filename = "tunes.json"
    if input("refetch? (y/n)") == "y":
        print("downloading")
        wget.download("https://github.com/adactio/TheSession-data/raw/main/json/tunes.json", filename)
        print("done")
    print("parsing")
    with open(filename, "r", encoding="utf8") as f:
        content = json.load(f)
    print("reformatting")
    with open("dataset.txt", "w") as f:
        # Keep only tunes in C major and 4/4 so everything shares the same key and metre.
        data = list(filter(lambda x: x["mode"] == "Cmajor" and x["meter"] == "4/4", content))
        print(f"{len(data)}/{len(content)} tunes applicable")
        for tune in data:
            # Swap unicode sharp/flat signs for ASCII equivalents and strip line-separator characters.
            f.write(tune["abc"].replace("\u266f", "#").replace("\u266d", "b").replace("\u2028", "") + "\n")
    print("done")
    with open("dataset.txt", "rb") as f:
        return f.read().decode("cp1252")
The script also features some extra code to filter the tunes down by key and time signature, so they all share the same metre and scale, which should help whatever the model produces have some musical soundness 🙂
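For a rough sense of what ends up in dataset.txt: each line holds the ABC body of one tune, with bars separated by | characters, letters for pitches and numbers for note lengths. A made-up fragment (not taken from the dataset) looks something like this:

|:GABc d2ed|cAGE D2CD|GABc d2ed|cAGE G4:|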
From this, I used the TensorFlow docs to produce a basic predictive text model. The way this works is by analysing the preceding characters to attempt to predict the next one (in the final script, the training sequences are 100 characters long). It didn’t need many changes, so I set it running, but found it took an absurdly long time. After doing some testing on Google Colab, I determined that TensorFlow needs to run on the GPU, as opposed to the CPU, if you want to get anywhere close to decent performance out of it. I found it quite difficult to set this up – you need to install a separate package (I ended up needing Miniconda too) – but I did get there in the end; this link https://www.tensorflow.org/install/pip was very useful. Interestingly, while testing, I saw that the TPU (tensor processing unit) provided by Google in Colab actually ran the script slower – although this could just be due to the nature of the model.
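If you want to check whether TensorFlow can actually see your GPU before committing to a long training run, a quick sanity check along these lines (just the standard tf.config call, nothing specific to this project) does the job:

import tensorflow as tf

# An empty list here means TensorFlow will silently fall back to the CPU.
print(tf.config.list_physical_devices("GPU"))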
I ended up with the following script:
print("preparing...") import tensorflow as tf import numpy as np import os import time from fetch_dataset import fetch_abc print("starting...") text = fetch_abc() vocab = sorted(set(text)) vectorise = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None) devectorise = tf.keras.layers.StringLookup( vocabulary=vectorise.get_vocabulary(), invert=True, mask_token=None ) def reassemble(vector) -> str: return tf.strings.reduce_join(devectorise(vector), axis=-1).numpy() vectorised_text = vectorise(tf.strings.unicode_split(text, "UTF-8")) vector_dataset = tf.data.Dataset.from_tensor_slices(vectorised_text) seq_length = 100 sequences = vector_dataset.batch(seq_length + 1, drop_remainder=True) def split_sequence(sequence): return sequence[:-1], sequence[1:] # trims last letter and first buffer_size = 10000 batch_size = 64 dataset = ( sequences.map(split_sequence) .shuffle(buffer_size) .batch(batch_size, drop_remainder=True) .prefetch(tf.data.experimental.AUTOTUNE) ) class MusicModel(tf.keras.Model): def __init__(self, vocab_size, embedding_dimension=256, rnn_units=1024): super().__init__(self) self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dimension) self.gru = tf.keras.layers.GRU( rnn_units, return_sequences=True, return_state=True ) self.dense = tf.keras.layers.Dense(vocab_size) @tf.function def call(self, inputs, states=None, return_state=False, training=False): x = inputs x = self.embedding(x, training=training) if states is None: states = self.gru.get_initial_state(x) x, states = self.gru(x, initial_state=states, training=training) x = self.dense(x, training=training) if return_state: return x, states else: return x vocab_size = len(vectorise.get_vocabulary()) print(f"vocabulary size {vocab_size}") model = MusicModel(vocab_size=vocab_size) # [(input_example_batch, target_example_batch)] = dataset.take(1) # example_batch_predictions = model(input_example_batch) # print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)") # sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1) # sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy() # print("Input:\n", reassemble(input_example_batch[0])) # print() # print("Next Char Predictions:\n", reassemble(sampled_indices)) loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True) # example_batch_mean_loss = loss(target_example_batch, example_batch_predictions) # print( # "Prediction shape: ", # example_batch_predictions.shape, # " # (batch_size, sequence_length, vocab_size)", # ) # print("Mean loss: ", example_batch_mean_loss) # print() #model.summary() model.compile(optimizer="adam", loss=loss) checkpoint_dir = "./training_checkpoints" checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}") checkpoint_callback = tf.keras.callbacks.ModelCheckpoint( filepath=checkpoint_prefix, save_weights_only=True ) if input("load past? (y/n)") == "y": [(input_example_batch, target_example_batch)] = dataset.take(1) model(input_example_batch) model.load_weights(checkpoint_dir+"/ckpt_"+input("checkpoint? ")) if input("train? (y/n)") == "y": epochs = int(input("epochs? ")) history = model.fit(dataset, epochs=epochs, callbacks=[checkpoint_callback]) class OneStep(tf.keras.Model): def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0): super().__init__() self.temperature = temperature self.model = model self.chars_from_ids = chars_from_ids self.ids_from_chars = ids_from_chars # Create a mask to prevent "[UNK]" from being generated. 
skip_ids = self.ids_from_chars(["[UNK]"])[:, None] sparse_mask = tf.SparseTensor( # Put a -inf at each bad index. values=[-float("inf")] * len(skip_ids), indices=skip_ids, # Match the shape to the vocabulary dense_shape=[len(ids_from_chars.get_vocabulary())], ) self.prediction_mask = tf.sparse.to_dense(sparse_mask) @tf.function def generate_one_step(self, inputs, states=None): # Convert strings to token IDs. input_chars = tf.strings.unicode_split(inputs, "UTF-8") input_ids = self.ids_from_chars(input_chars).to_tensor() # Run the model. # predicted_logits.shape is [batch, char, next_char_logits] predicted_logits, states = self.model( inputs=input_ids, states=states, return_state=True ) # Only use the last prediction. predicted_logits = predicted_logits[:, -1, :] predicted_logits = predicted_logits / self.temperature # Apply the prediction mask: prevent "[UNK]" from being generated. predicted_logits = predicted_logits + self.prediction_mask # Sample the output logits to generate token IDs. predicted_ids = tf.random.categorical(predicted_logits, num_samples=1) predicted_ids = tf.squeeze(predicted_ids, axis=-1) # Convert from token ids to characters predicted_chars = self.chars_from_ids(predicted_ids) # Return the characters and model state. return predicted_chars, states one_step_model = OneStep(model, devectorise, vectorise) start = time.time() states = None next_char = tf.constant(['A']) result = [next_char] for n in range(1000): next_char, states = one_step_model.generate_one_step(next_char, states=states) result.append(next_char) result = tf.strings.join(result) end = time.time() print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80) print('\nRun time:', end - start)
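One thing worth pointing out in the script above: the OneStep wrapper takes a temperature argument, which divides the logits before sampling, and I left it at the default of 1.0. Purely as a possible tweak (not something the script above actually does), a lower temperature should give safer, more repetitive tunes and a higher one stranger output:

# Hypothetical variations on the one_step_model line – lower temperature = more conservative sampling.
cautious_model = OneStep(model, devectorise, vectorise, temperature=0.5)
adventurous_model = OneStep(model, devectorise, vectorise, temperature=1.5)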
Running it produced the following outputs (the first one seems to be far better, as I trained that model for longer):
AGE E3D|CEGc Acdc|BGAG D2 D2| ABcd ecBA|GdBd gdBG|ABcd ecBA|B2B2 B2:| |:A,2DF ECEG|BABc deBc|dcAc BDGA| BDGA BcdB|AGAB cedB|cAGE D4:| |:e2 de^fgaa|gede cAde|gf ge ed (3ded|cAGE D3:| |:e|dBcA GE EE|DG E2 GEDE|CE GE- GA:| |:GF ~E2 FAGA|cdec dcAB|cdcA GAcA|GEDE C2|| |:D2|DGBd c2 GE|FDCD FGAG|EGcB AF ~z2| cdcA GE E2|DCEG c3dB|1 GEcE EDDG:|2 GEDE Cc c2:| |:c'2 bc acgc|cdeg aged|cBcd edeg|aged cAAB| c2 gc acgc|c2 gc BAGB|AGAB cdea|gedf ec c2| g2 ag agea|gedc AGAB|c2 ec gcec|Addc dcAB| c2 gc acgc|c2 gc BAGB|cBcd edeg|aged cAAB| c2 gc acgc|c2 gc BAGB|AGAB cdea|gedg AcAB| c2 gc acgc|c2 gc BAGB|cBcd edeg|aged cAAB| c2 gc acgc|c2 gc BAGB|AGAB c3d:| c2gc ecgc|c2ac gcea|gece d2cA|| cdef gage|abaf g2ge|defg ecdc|A2c2 G2 (3GAB| c2ec Gcec|d2Bd AGED|C~E3 GAcd|egdB c2 cA| G3E GAcd|edcB cA G2|GA AB Ad cd|ed ec GABG| c3d/e dcAB|cAGE cdec|dcAc GEDE| CDEG AGBG|(GAGA ceda|gece dcAG|1 EGFD ECCB,:|2 GEFG ECCD|| EG G2 G3 E|FD (3EFG cdef|cAAG cABc|AGEC DC D/F/G:| |:(3EF^F
G4 c3c|B2 A3/2c4 B2B3|{88}3A4|G3G23A|G6 G2| c3e c3A|d3c AGED|G3E GECE|D2D2 D2CD| G,A,CD EDDC|EGAB cGcd|e/f/c dc BGGA|| |:e2ec GEce|agee cAGE|Dcdc Bcdc|B2G2 G2(3BAB|:cGEG AcGc|EDCD EC C2:| ecgc agec|GdBd fgga|gege fecd|ec'ba gcgc|ecgc acgc| acgc acgc|ecge f2ec|AccB c2ec|Acdb g2c'2|a4-agge|dcAcd A2c2|e8| B,2D2F2B2|ABdc d2fd|edc2 B2A2|BdBdBg|g2g2- gga| g2ef g2fe|Adfa gedc|B2d2 Baag|ec(3cBc Gcef| gece ac'ag|ecBc ABGF|ECDC B,DG,A,B,|C2 ((3A,B, CG, A,2(3E,FEG|AcBd cd (3=fgf)| ~c2ec gcec|~c3G AG~c2|dc~G2 Aage|ec~c2 acca|| eggc agag|f2ag ffac|fafg eg~c2|dcBA GGEG|1 dedB c2cd:|2 (3Bcd Bc d2 (df)|Tf2gf^ffd|c2Bc dcec|| |:(3EFG|:AGFE GAcd|eged cAGE|ADDC DCEG| g2ag agce|fgfe fefa|bcgc f2fa|gedg ac~c2| ageg fage|dcBc dcGc|(3gga ge a2ga|egde c2:| G|[E3G]|E3G cCEG|c3A cded|cBcd ec (3cBG|(3Ade fg (3fed cd| AcGc BGG^F|G2 DC G,CDC|F/F/F CB, A,2DF|EAGA FAdc|BGFD G2cEG|AdcB c2 ec| gc{/e} c/B/c Bcdf|eedc BcGF|EGcG EGcd|edce d2cd| e2d/c/c dcAc|cGAc G2EC|1 GAcd cAGE:|2 CDDE CDGE
You can listen to them using this site (https://abc.rectanglered.com/) if you like.
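Alternatively, if you'd rather render a tune locally, something along these lines should work – this is an untested sketch that assumes the music21 package is installed, and it prepends the minimal ABC headers (matching the dataset's C major / 4/4 filter) that the generated bodies are missing:

from music21 import converter

# Paste a generated tune body here; this fragment is the start of the first output above.
abc_body = "AGE E3D|CEGc Acdc|BGAG D2 D2|"

# The model only generates tune bodies, so add minimal ABC headers before parsing.
abc_tune = "X:1\nM:4/4\nL:1/8\nK:C\n" + abc_body

score = converter.parse(abc_tune, format="abc")
score.write("midi", fp="generated_tune.mid")  # playable in any MIDI player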
It’s not perfect – the output is often musically or syntactically off – but I was quite pleasantly surprised to see it achieve correct bar lengths most of the time. If I had a larger dataset (it was quite small after filtering down on key and meter) or trained it for longer, I might have seen better results 🙂