Librosa
I learned about LibROSA while watching a scipy video:
Seems pretty cool. The presenter comes across as a huge music nerd (in the sense of being a nerd about music, and just a nerd in general), and he seems to get who I am and what I want to do, so why not give it a try?
I’ll just try to mimic his code here and get a feel for the librosa tool.
Let’s install and load librosa first. The librosa docs recommend installing a C++ compiler for Python as well as ffmpeg, so I did both before installing librosa.
# Enable plots in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
# Seaborn makes our plots prettier
import seaborn
seaborn.set(style='ticks')
# Import the audio playback widget
from IPython.display import Audio
import numpy as np
import pandas as pd
import librosa
import librosa.display
Librosa comes out of the box with an example audio file (OGG format; I’m on a Windows machine here, so I had to restart my computer after adding ffmpeg to PATH… that caused me a bit of confusion while troubleshooting!). Let’s load that file.
# Load example file
example_filepath = librosa.util.example_audio_file()
y, sr = librosa.load(example_filepath)
# Play audio using jupyter audio widget
Audio(data=y, rate=sr)
Man… that’s just super cool in general. We’ve loaded a song and we’re playing it in jupyter. What a time to be alive.
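One detail worth calling out before reading any plots: librosa.load resamples everything to mono at 22050 Hz by default, which we can verify with a couple of quick checks.
# librosa.load defaults: mono, 22050 Hz, float32 samples
print(sr)                                # 22050
print(y.shape, y.dtype)                  # 1-D float32 array of samples
print(librosa.get_duration(y=y, sr=sr))  # track length in seconds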
# Display waveform of song
librosa.display.waveplot(y, sr=sr)
Frequency Domain & The Fourier Transform
Now, I remember being in digital signal processing in University. I didn’t say I was good at it, and I didn’t say I remember even remotely all of what I learned, but I remember being in the class lol. Even before that, in continuous & discrete signals and systems, we touched on the frequency domain. What is the frequency domain? I will start off with this video:
First of all, my goodness, that music makes the Fourier transform so frickin epic. But yeah, the Fourier transform… I definitely had that smashed into my head in University. I’ll be honest: even with how much they jammed it down our throats, I never had a really good appreciation for what it was and how it’s used. I really didn’t care at the time because, well, University was University to me and I wasn’t too excited about it. I’ll say that many of my professors in this space never really conveyed the information in ways I could easily appreciate, and I think that’s a two-way street: I perhaps lacked the interest and knowledge to really appreciate their methods as well. All in all, the fact that the Fourier transform sounds familiar to me and that the base concept is stuck in my head is a success!
The video above, though, oh my god, what a masterpiece. It visualizes so well how a signal is made up of various frequencies, and how every single signal can be decomposed into groups of sine waves. This gif from wikipedia is a great example as well:
I don’t even want to put any formulas in this section of the post, because the basic idea, that any signal can be decomposed into sine waves (also I’m dumb), is the most important thing. Once we understand that fundamental concept, the idea of the frequency domain explains itself. If every signal can be built from sine waves (some requiring infinitely many of them), we can take all those waves and break down which frequencies exist in that sound. A spectrogram does just that: the short-time Fourier transform runs this decomposition on small, overlapping windows of the signal, so the x-axis stays time, the y-axis becomes frequency, and the color shows how strong each frequency is at each moment. The result would look something like this:
# Calculate short-time Fourier transform
example_stft = librosa.stft(y)
# Convert the squared magnitude (power) to a log (dB) scale
example_stft_log_power = librosa.power_to_db(np.abs(example_stft)**2, ref=np.max)
# Plot spectrogram
librosa.display.specshow(example_stft_log_power, sr=sr, x_axis='time', y_axis='log')
plt.colorbar()
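Before moving on, here’s a tiny NumPy sketch I put together to convince myself of the “sums of sine waves” idea (the 220 Hz and 660 Hz frequencies are made up for the demo): build a signal out of two known sines and let the FFT find them.
# Build a 1-second signal from a 220 Hz sine and a quieter 660 Hz sine
sr_demo = 22050
t = np.arange(sr_demo) / float(sr_demo)
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
# Ask the FFT which frequencies the signal contains
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr_demo)
# The two strongest bins should land right at 220 Hz and 660 Hz
print(sorted(freqs[np.argsort(spectrum)[-2:]]))  # -> [220.0, 660.0]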


Constant-Q Transform
There’s apparently something called the constant-Q transform as well: instead of the y-axis being laid out in linearly spaced frequencies, its bins are spaced geometrically, so the scale lines up with musical notes. Let’s check it out.
# Calculate constant-Q transform
example_cqt = librosa.cqt(y, sr=sr)
# Plot spectrogram with a note-labeled y-axis
librosa.display.specshow(librosa.amplitude_to_db(np.abs(example_cqt), ref=np.max), x_axis='time', y_axis='cqt_note')
plt.colorbar(format='%+2.0f dB')
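To see why the y-axis can be labeled with notes instead of raw frequencies, here’s a quick sketch listing where the first CQT bins fall (the 84 bins starting at C1 mirror librosa.cqt’s defaults, as far as I can tell):
# CQT bins are spaced geometrically, 12 per octave, one per semitone
cqt_freqs = librosa.cqt_frequencies(n_bins=84, fmin=librosa.note_to_hz('C1'))
for f in cqt_freqs[:13]:
    print('{:.2f} Hz -> {}'.format(f, librosa.hz_to_note(f)))
# bin 0 is C1 (~32.7 Hz), and bin 12 lands exactly one octave up at C2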

Squinting at the CQT plot, I can pick out bands sitting around a few notes:
- F#2
- F3
- C3
- E1
There’s kind of one at C5 too, but it seems to fade in and out. These bands also seem to span more than a single note; some are centered on one note, but some I had trouble deciphering. I’m not sure if that’s the actual audio or something off about how I’m reading librosa’s scale, so these are just estimates for now. What do I make of this… I just listened to the first 10 seconds of the song at least 12 times, and I’m picking up a bit on the notes at F#2 and the sets of two notes at C2 around the 3, 7, and 10 second marks. This is definitely the bassline; we can match that up almost perfectly. Are they F#2 and C2 exactly? I’m not sure, but C2 is definitely about where the bassline sits. We noticed something else at E1, right? What is that? I’m not quite sure… I wonder if it’s just some kind of white noise or background noise; I can’t make out a solid beat lower than the bassline. Okay, let’s break this down a bit more by instrument:
- 0:00 – 0:07: Bass, some kind of synth (?, the one that goes boop, boop), and what I think are strings or could be another synth
- 0:07 – 0:15: Add in what seem like hi-hats and rides
- 0:15 – 0:30: Add in snares and a xylophone; the bass changes rhythms
- 0:30 – 0:45: Add in what sounds like some potentially modulated voices
- 0:45 – 1:00: Cut everything else out, keeping only the bass and xylophone
Well, let’s start with the end, because there are only two instruments there and it seems quite simple to decipher:
0:45 – 1:00
Here, only the bass and the xylophone are playing. We’ve already seen the bass in that C2 – C3 range, but if we contrast it with the F3 band, it takes on a very similar pattern. I really want to say that both the F#2 and F3 bands are the bass! Somehow, the bass is taking on multiple notes in our CQT plot. There seems to be another line around F4 as well, but I don’t think that’s the bass, because it’s not present in the first 10 seconds while the bass is playing. Perhaps F4 and up is the xylophone?
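Here’s a quick back-of-the-envelope check on that “bass at multiple notes” suspicion: if the fundamental really is around F#2 (my guess from the plot, not a measurement), its harmonics sit at integer multiples of that frequency, and the second harmonic lands a full octave up.
# Harmonics of a (hypothetical) F#2 fundamental, ~92.5 Hz
f0 = librosa.note_to_hz('F#2')
for k in range(1, 5):
    h = k * f0
    print('harmonic {}: {:.1f} Hz -> {}'.format(k, h, librosa.hz_to_note(h)))
# harmonic 2 comes out around F#3 -- one semitone from the F3 band I was
# squinting at, which is well within my margin of error reading the plot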
0:07 – 0:15
I’ll jump back toward the beginning, because I think I can decipher the hi-hats and rides here. The only thing that really differs between this section and the first 7 seconds is that we see more of a fingerprint in the C7 – C8 range and the C1 – C2 range… Are the hi-hats making their mark both high and low? I’m not sure, but it seems that way.
0:15 – 0:30
Here, the xylophone comes in, very similar to the 0:45 – 1:00 section.
0:30 – 0:45
Here, the only difference from the last section is the modulated voice that comes in. It seems to play in that C5 – C8 range. So what I’m seeing is that C3 – C4 roughly marks the break between bass- and treble-type sounds. This online app actually lets us hear what C1 – C8 sound like, and C3 and C4 certainly sound like the break! Okay, so my impression is still a bit cloudy, but at least I have a better sense of how this scale is broken out.
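As a sanity check on that break point, here’s where those C’s actually sit in Hz (middle C is C4):
# Frequencies of the C's bracketing my suspected bass/treble break
for note in ['C1', 'C2', 'C3', 'C4', 'C5']:
    print(note, round(float(librosa.note_to_hz(note)), 1), 'Hz')
# C3 is ~130.8 Hz and C4 is ~261.6 Hz, which squares with my guess
# that the bass/treble break sits right around there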
Chromagram
In the Constant-Q Transform section, I saw a few F’s / F#’s come into play (F#2, F3) and decided that this was likely the fingerprint of the bassline. I mentioned that from 0:45 – 1:00, the bassline seems to be present at both F#2 and F3, and wondered if both were from the same instrument. And now, I am just discovering the concept of timbre, which is the
perceived sound quality of a musical note, sound, or tone that distinguishes different types of sound production, such as choir voices and musical instruments, such as string instruments, wind instruments, and percussion instruments, and which enables listeners to hear even different instruments from the same category as different (e.g. a viola and a violin)
One attribute of timbre, as outlined on the wikipedia page as well, is that different sounds have different harmonics. A C played on a violin will be different from a C played on a piano, which will be different from a C sung by an opera singer. The harmonics differentiate these sounds even though the fundamental frequency being played is still a C. A viola may have lower fundamental frequencies in the C2 – C4 range (I am TOTALLY making this up right now btw) whereas an opera singer may span higher frequency ranges. A chromagram, then, folds all of those octaves down onto a single scale of 12 pitch classes and… well… let’s just take a look at it, why don’t we:
# Calculate the condensed chroma CQT
example_chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr)
# Plot chromagram
librosa.display.specshow(example_chroma_cqt, y_axis='chroma', x_axis='time')
plt.colorbar()
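A quick sanity check on what chroma_cqt just did: the matrix has exactly 12 rows, one per pitch class, so every octave of a note folds into the same row. (The note_to_midi trick below is just my own way of finding a note’s row; chroma rows run C=0 through B=11.)
# Chromagram has 12 rows -- one per pitch class, octaves folded together
print(example_chroma_cqt.shape[0])  # 12
for note in ['F#1', 'F#2', 'F#3']:
    print(note, '-> chroma row', librosa.note_to_midi(note) % 12)
# all three octaves of F# map to row 6 -- octave information is gone by design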
Librosa can also track the beat and then aggregate the chromagram between consecutive beats, giving one chroma vector per beat instead of one per frame:
# Extract tempo (BPM) and beat frame positions
tempo, beat_f = librosa.beat.beat_track(y=y, sr=sr, trim=False)
# Make sure the chromagram's final frame is included as a boundary
beat_f = librosa.util.fix_frames(beat_f, x_max=example_chroma_cqt.shape[1])
# Aggregate the chroma columns between beats with the median
example_chroma_sync = librosa.util.sync(example_chroma_cqt, beat_f, aggregate=np.median)
# Convert beat frames to timestamps for the x-axis
beat_t = librosa.frames_to_time(beat_f, sr=sr)
# Plot beat-synchronous chromagram
librosa.display.specshow(example_chroma_sync, y_axis='chroma', x_axis='time', x_coords=beat_t)
plt.colorbar()
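librosa.util.sync was new to me, so here’s a toy example of what it’s doing above: aggregating the columns that fall between consecutive boundary frames (with the median, same as the chroma code):
# 2 features x 6 frames, split into three 2-frame segments
toy = np.arange(12).reshape(2, 6)
boundaries = np.asarray([0, 2, 4, 6])
print(librosa.util.sync(toy, boundaries, aggregate=np.median))
# -> [[ 0.5  2.5  4.5]
#     [ 6.5  8.5 10.5]]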

Tempogram
The last one that I want to explore is the tempogram. Easy enough: it shows, over time, which tempos (in BPM) the song’s onsets line up with.
# Generate tempogram
example_tgram = librosa.feature.tempogram(y=y, sr=sr)
# Show tempogram
librosa.display.specshow(example_tgram, sr=sr, x_axis='time', y_axis='tempo')
plt.colorbar()
# Overlay the tempo that beat_track() estimated earlier
plt.axhline(tempo, color='w', linestyle='--', alpha=1, label='Estimated tempo={:g}'.format(tempo))
plt.legend(frameon=True, framealpha=0.75)
And there you go… there’s another feature of the DJ software I use, haha: a BPM estimate, so you can get an even better sense of which songs will mix well with other songs.
I’m a bit confused by this tempogram, though. We obviously see that it has multiple lines, and I get that, because something that’s 120 BPM is also 240 BPM and also 60 BPM. Similarly, for this song I see something at 128 BPM, 256 BPM, and 64 BPM, but then many traces of BPMs in between as well. In fact, there are like 3 lines between 64 and 128…?! I’m not quite understanding that.
Listening to the song again, it’s easy to get tripped up if you’re just listening to the hi-hats or the xylophone… I can see places where onset detection might think the tempo is a bit more varied.
Looking at the intensity in the third dimension (the color), however, we see very clearly that the 32 / 64 / 128 BPM lines are prevalent throughout most of the song, and perhaps this is what made librosa’s beat_track() function land on 129.199 BPM. Great stuff.
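To double-check my reading of the tempogram’s axis, here’s a rough way to pull a single global tempo out of it by hand (the 60 – 180 BPM search window is my own arbitrary choice):
# Each tempogram row corresponds to a BPM; row 0 is lag zero (infinite BPM)
bpms = librosa.tempo_frequencies(example_tgram.shape[0], sr=sr)
# Average each tempo bin over time, then take the strongest in-range bin
strength = example_tgram.mean(axis=1)
in_range = (bpms >= 60) & (bpms <= 180)
best_bpm = bpms[in_range][np.argmax(strength[in_range])]
print('Strongest tempo between 60 and 180 BPM: {:.1f}'.format(best_bpm))
# I'd expect this to land near the ~129 BPM that beat_track() reported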
Conclusion
Basically, I’m blown away that people are charging for software that does this shit, and that librosa’s community is nice enough to break it down to a level where an idiot like me can understand it…
Where do I go from here? I might try tampering with a few sounds or songs… maybe check out the waveforms of some other songs… explore a bit more of what librosa has to offer so I can actually generate features for my downstream clustering or supervised learning analysis.
Again, my one objective here is to simply learn something new that I didn’t know before in either tools, math, ML, or music. Easy enough!