Music Genre Clustering #5 – Extracting MFCCs from My iTunes Library

Training & Classification

Last post, I referenced a great MIR resource that had already outlined a process for what I’m currently trying to do.

To recap some of the methods I learned: each training song was broken up into frames, and the MFCCs of each frame became a training sample. Each song yielded ~5000 training samples (5000 frames), and all of them were labelled with the song’s genre.

Test samples came from a shorter excerpt of the song and consisted of ~430 frames. Each frame became a test sample and was classified individually. This means that a single song could have been classified as multiple classes. I could see a world where an R&B or Hip Hop song has a nice Jazz break in the middle, and if the author just happened to take the test sample from that break, those frames could be classified as Jazz.
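To make the "one song, multiple classes" point concrete, here is a small sketch with made-up frame predictions. The 300/130 split and the `summarize_frames` helper are purely hypothetical, and the majority vote is just one possible way to collapse frame labels into a song label, not something the MIR resource necessarily does.

```python
from collections import Counter

# Hypothetical per-frame predictions for one ~430-frame test excerpt:
# most frames look like R&B, but a jazzy break fools the classifier.
frame_predictions = ["R&B"] * 300 + ["Jazz"] * 130


def summarize_frames(predictions):
    """Return (label counts, most common label) for one song's frames."""
    counts = Counter(predictions)
    majority_label = counts.most_common(1)[0][0]
    return counts, majority_label


counts, majority_label = summarize_frames(frame_predictions)
print(dict(counts))      # how many frames landed in each class
print(majority_label)    # one possible song-level label
```

If the excerpt had been taken entirely from the break, the majority vote would flip too, which is exactly the risk described above.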

The consequences here are two-fold. Not only do we run the risk of misclassifying a song, we run the risk of actually feeding incorrect training data to our model! If we happen to catch the Jazz break of that R&B song and we label it as R&B, it will cloud our model a bit.

Generating Data

Before I start my script, I just want to think a bit more about how much data I’m actually going to have. It’s a bit crazy because I didn’t realize every single frame would be a different training sample. The 1 minute librosa example clip yielded 2657 frames… That’s about 44 frames per second, which comes out to about 1 frame every ~23ms or so.
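As a sanity check on that frame rate, here is the arithmetic under the assumption that librosa's defaults were used (a 22050 Hz sample rate from `librosa.load` and a hop length of 512 samples between frames). If the clip was loaded with other settings, the numbers shift accordingly.

```python
# Assumed librosa defaults; the actual settings used for the clip are not stated.
SAMPLE_RATE = 22050   # samples per second (librosa.load default)
HOP_LENGTH = 512      # samples between consecutive frames (librosa default)

frames_per_second = SAMPLE_RATE / HOP_LENGTH
ms_per_frame = 1000 / frames_per_second

# Working backwards: how long would a clip with 2657 frames be?
clip_seconds = 2657 * HOP_LENGTH / SAMPLE_RATE

print(f"{frames_per_second:.2f} frames per second")
print(f"{ms_per_frame:.2f} ms per frame")
print(f"{clip_seconds:.1f} seconds of audio for 2657 frames")
```

Under these assumptions the rate comes out a touch under the rounded 44 frames per second used in the estimates, which is close enough for back-of-the-envelope purposes.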

The average song in my library is about 5 minutes, and I have just shy of 4000 songs. We’re talking

4000\ Songs\times5\ Minutes\times\ 60\ Seconds\times44\ Frames=52,800,000\ Samples

Lol ok, and each of those samples has however many MFCC coefficients I want to use… at 10 MFCCs we would literally have over 500M individual data points. If we used 8 bytes to store each number (these MFCCs are floats, so a 64-bit float each), we’d have about 4GB of data. I’m not so much worried about the storage space as about the amount of data our model will have to handle. Generally, I’ve dealt with <2GB of data in a data analysis type of setting because I usually don’t have more than 8GB of memory to work with (and on the laptop I’m using now, I only have 4), so I’ll have to truncate this data set a bit further.

This is, of course, just an estimate, but what would this number look like if I only took 30 second extracts of each song? If I think of an average song, how much of it is… you know… the same beat… same words… same instruments… not for all songs, but if we’re looking at Pop and R&B we’d definitely see a lot of repetition. We’d probably only need like 15 seconds of a song to recognize what genre it is! Maybe not even that!

4000\ Songs\times15\ Seconds\times44\ Frames=2,640,000\ Samples

This, my computer can probably handle, but I’m not so sure the models themselves can handle it… This will be a good opportunity to try out different models and see which ones really get bogged down by computation. Let’s just go with 15 seconds for now.
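The two estimates are easy to reproduce in a few lines. This is just a back-of-the-envelope sketch: the helper names are made up, and the 8 bytes per value assumes each MFCC is stored as a 64-bit float.

```python
FRAMES_PER_SECOND = 44   # rounded rate from the librosa example clip
N_SONGS = 4000           # "just shy of 4000 songs"


def n_samples(n_songs, seconds_per_song, fps=FRAMES_PER_SECOND):
    """One training sample per frame."""
    return n_songs * seconds_per_song * fps


def n_bytes(samples, n_mfcc=10, bytes_per_value=8):
    """Raw in-memory size, assuming float64 values (an assumption)."""
    return samples * n_mfcc * bytes_per_value


full_songs = n_samples(N_SONGS, 5 * 60)   # average 5-minute song
excerpts = n_samples(N_SONGS, 15)         # 15-second excerpt per song

for name, samples in [("full songs", full_songs), ("15s excerpts", excerpts)]:
    gigabytes = n_bytes(samples) / 1e9
    print(f"{name}: {samples:,} samples, ~{gigabytes:.2f} GB")
```

Cutting each song down to 15 seconds shrinks the data by a factor of 20, which is the whole point of taking an excerpt.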

Next issue… where do I take this 15 second sample? Probably not right at the beginning or end, because the intros and outros of songs often sound quite different from the rest of the song (arguably the main parts of the song that identify it with its genre). There are generally breaks and bridges that may also take on a bit of a different tone (that is, for those genres which even have breaks and bridges). Because I listen to a ton of dance music, I know that generally the first 30 seconds are an intro, then the real feel of the song kicks in. For R&B and pop, there are many examples I can think of where the real beat starts right away and others where there is an intro. Hip Hop generally gets right to the point. Folk and Rock are quite variable and I probably don’t listen to enough of those genres to really generalize the structure. Ambient is the same throughout. I feel like if I start about 45 seconds in, I’d be skipping the intro in most cases and we’d probably be somewhere in the first verse, bridge, or chorus for those songs where this type of structure applies. I’ll just go with this for now and tune this later if need be.
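In librosa terms, "start 45 seconds in and take 15 seconds" maps onto the `offset` and `duration` arguments of `librosa.load`. The sketch below is one way this might look, not the actual script: the function names and constants are hypothetical, and the sample rate and hop length are librosa's defaults rather than settings confirmed by this post.

```python
SAMPLE_RATE = 22050       # librosa default (assumption)
HOP_LENGTH = 512          # librosa default (assumption)
OFFSET_SECONDS = 45.0     # skip the intro
DURATION_SECONDS = 15.0   # length of the excerpt
N_MFCC = 10               # coefficients per frame


def expected_frames(duration_seconds, sr=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Approximate frame count for a clip, given librosa's centred framing."""
    return 1 + int(duration_seconds * sr) // hop_length


def extract_excerpt_mfccs(path):
    """Load a 15-second excerpt and return MFCCs, one row per frame."""
    import librosa  # imported here so the helper above works without librosa

    y, sr = librosa.load(
        path, sr=SAMPLE_RATE, offset=OFFSET_SECONDS, duration=DURATION_SECONDS
    )
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC, hop_length=HOP_LENGTH)
    return mfcc.T  # librosa returns (n_mfcc, n_frames); transpose to frames x MFCCs


print(expected_frames(DURATION_SECONDS), "frames expected per excerpt")
```

Songs shorter than 60 seconds would come back with fewer frames (or none at all), so a real script would need to decide how to handle those.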

With this, I should be able to start writing my script.

Writing Feature Generation Script

Ok, so I’ve written the _load_song_and_extract_features.py_ script in the root directory of this project, and it should do just that.

A few technology sidenotes here:


In this script, I make use of the multiprocessing library to speed up the script a little bit. Multiprocessing is a library that spawns the analysis of each song as a separate system process. It took me a while to get, but multiprocessing is quite different from multithreading (not that I really knew what multithreading even was before tbh…).

In the case of multithreading, a single process can use multiple threads to jump back and forth between tasks. Say we are handling a job where we need to pass multiple tasks off to another program, wait for each response, and then process it. In a single-threaded world our code would run serially: it would send off one task, wait for the response, and process it before even sending the next task off. With multithreading, we can use the time the process spends waiting for a response to send off another task. The key here, however, is that the process is not processing in parallel; it is merely filling the dead space with whatever it has to do next. While we’re waiting for the response, the process isn’t doing anything, right? The process itself is not being utilized, although the rest of the OS may be handling other tasks. When the first “thread” receives its response from the external program, the process has to drop whatever it’s doing to focus back on that thread. So while multithreading may give the appearance of parallel processing, in Python (CPython, at least) the global interpreter lock means only one thread runs Python code at any given moment, and the interpreter switches back and forth between threads dynamically. Keep in mind that these threads all live inside a single process, so CPU-heavy Python work in threads is effectively limited to a single core, and therein lies the difference between multithreading and multiprocessing.
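Here is a toy illustration of threads filling the dead space. `fake_request` is a made-up stand-in for "send a task to another program and wait": it just sleeps, which is exactly the kind of waiting that lets another thread take over.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(task_id, wait_seconds=0.2):
    """Pretend to wait on an external program, then return the task id."""
    time.sleep(wait_seconds)  # sleeping lets other threads run in the meantime
    return task_id


def run_serial(task_ids):
    """Send one task, wait for it, then send the next."""
    return [fake_request(task_id) for task_id in task_ids]


def run_threaded(task_ids, n_threads=4):
    """Overlap the waiting by handing tasks to a small pool of threads."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(fake_request, task_ids))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


serial_result, serial_seconds = timed(run_serial, range(4))
threaded_result, threaded_seconds = timed(run_threaded, range(4))
print(f"serial:   {serial_seconds:.2f}s")
print(f"threaded: {threaded_seconds:.2f}s")
```

The threaded version finishes sooner only because the tasks spend their time waiting. Swap the sleep for heavy number crunching in pure Python and the advantage largely disappears, which is where multiprocessing comes in.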

Multiprocessing spawns a completely separate process in the OS altogether. This has pros and cons vs multithreading, with the pro obviously being speed. Multiple processes aren’t limited by each other, only by the resources they share, e.g. a database, a file, the CPU cores, and memory. The con here is that this speed doesn’t come automatically. Tweaking the number of processes is not only an integral part of ensuring multiprocessing is working the way you anticipated, it is required, in the sense that you could literally be harming your system and program if you put too much strain on CPU and memory.

The script took about 30 minutes running on 4 processes. I built in some capabilities using the psutil python package to monitor the CPU and memory with each process I added. After 3 processes, I was hitting close to 100% CPU. At 4 processes, I was consistently hitting 100% CPU with every song I processed. Memory was nowhere near capacity so I didn’t have to worry too much about that (at most, we’d be opening 4 songs at a time, averaging ~12-16MB each, and each additional process seemed to add about 3-4% in memory on my 4GB machine). In the grand scheme of things, if I wanted to do this job the quick and dirty way, I never would have looked into multiprocessing, because implementing it probably took me as long as the script would have taken to run on a single process, but


I make use of queueing here to work in tandem with multiprocessing. Instead of running every single song as a separate process at once, we control it with a queue. We use our main process to scan through our iTunes library and put songs (or filepaths, specifically) into the queue. The queue has a maximum of 4 slots, and the 4 parallel processes take from it in a FIFO manner.
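A minimal sketch of that queue-plus-workers pattern might look like the following. It is not the actual script: `analyze_song`, `worker`, and `run_pool` are hypothetical names, the "analysis" is a trivial placeholder for the MFCC extraction, and the `None` sentinel is just one common way to tell workers to stop.

```python
import multiprocessing as mp

N_WORKERS = 4     # number of parallel processes
SENTINEL = None   # marker telling a worker there is no more work


def analyze_song(path):
    """Placeholder for the real feature extraction (hypothetical)."""
    return len(path)


def worker(task_queue, result_queue):
    """Pull file paths off the queue in FIFO order until the sentinel arrives."""
    while True:
        path = task_queue.get()
        if path is SENTINEL:
            break
        result_queue.put((path, analyze_song(path)))


def run_pool(paths):
    """Feed paths through a bounded queue to N_WORKERS separate processes."""
    task_queue = mp.Queue(maxsize=N_WORKERS)  # put() blocks while the queue is full
    result_queue = mp.Queue()
    workers = [
        mp.Process(target=worker, args=(task_queue, result_queue))
        for _ in range(N_WORKERS)
    ]
    for process in workers:
        process.start()
    for path in paths:
        task_queue.put(path)
    for _ in workers:
        task_queue.put(SENTINEL)  # one stop signal per worker
    results = [result_queue.get() for _ in paths]  # drain before joining
    for process in workers:
        process.join()
    return results
```

Calling `run_pool(list_of_paths)` should happen under an `if __name__ == "__main__":` guard, since on some platforms each new process re-imports the main module. Results come back in completion order, not necessarily the order the paths went in.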


The last technology sidenote rabbithole I will go down is logging. Logging essentially allows us to output different levels of alerts / information out to different outputs. Outputs can be standard out console, standard error, a file, syslog… the python logging module seems quite comprehensive. I’m also learning about the different levels of logging that are available too, and how each stream you output to can have a filter to show you all messages you’ve programmed or only the more / most critical ones.

I think everybody does this in some way, shape, or form: you have to print something so you know your script is doing what it’s supposed to do. Even a quick “helloworld” in certain places triggers an understanding in your mind of what’s happening. The issue is that I used to just put prints everywhere…

print('Analyzing song #{}'.format(song_id))

will let me know which song the script is currently on, and in general, I would like to see this message in every scenario because it’s pretty important to know how far into the script I am (do I need to wait another 5 minutes? 10 minutes? an hour? It’ll at least give me an idea). There are other things, however, that I will print. For example, when I was generating the MFCCs, I wanted to see how many frames librosa broke my songs into:

print('There are {} MFCCs in this song'.format(mfcc.shape[0]))

But do I need to see this every time, for every song? If I’m choosing a static extract length for every song, all I need is to check the first MFCC length and I’ll have an idea. I don’t need to output this 4000 times, once for each song, or even with every run of the script.

This is where logging helps us. I can set these two messages to different log levels, with the first message as more “critical”. Then I can simply tweak the level of logs I want to see before I run the script each time and control my output this way. In this script, I also only log to the console and never to a file, but a file is quite common as well if you’re running something in production. My measly little program is just sputtering along.
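As a rough sketch of the idea using Python's standard `logging` module: the progress message goes out at INFO and the frame-count message at DEBUG, so changing one level setting decides whether the second one shows up. The logger name, the format string, the example values, and the choice of these two particular levels are all assumptions for illustration.

```python
import logging
import sys


def build_logger(level=logging.INFO):
    """Console-only logger; messages below `level` are dropped."""
    logger = logging.getLogger("feature_extraction")  # hypothetical name
    logger.setLevel(level)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep messages from also hitting the root logger
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger


logger = build_logger(logging.INFO)

song_id, n_frames = 42, 646  # made-up values for the example
logger.info("Analyzing song #%s", song_id)                  # shown at INFO
logger.debug("There are %s frames in this song", n_frames)  # hidden at INFO
```

Calling `build_logger(logging.DEBUG)` instead would surface both messages without touching any of the logging calls themselves.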

Okay, end of technology sidenote. Let’s get back to it. I now have a features.csv which should contain my list of songs with 10 MFCCs as features. Woot woot. The resulting file contains 3825 songs and amounts to 385 MB. Not bad!

I’m hoping it’s not too crazy to load into memory, but let’s continue the training and testing in the next post.
