Music Genre Clustering #1 – Intro

Yello

Yello. My name is Chi, and this is my second set of posts on this blog. In my first post about basketball, I went into a bit of an intro of myself. I won’t rehash everything, but maybe a small intro will help again. First of all, I work as an IT consultant by day, and I play a lot of basketball and listen to a lot of music by night. Honestly, that pretty much sums up this blog so far haha.

The short of it is that I’m trying to learn a bit more about this zoo of a world called data science. (Insert the gif that’s quickly becoming my favourite gif.)

I generally help clients with data warehousing / basic dashboarding needs. Rarely have I actually tried to implement some type of model before, but I have to get my practice from somewhere! Why not start with my own interests? My objective throughout my first post has simply been to try to learn something more than I knew before, about basketball or about data science. Again, the data science world is messy… What the hell is data science, even? Even the term “data science” just makes you want to cringe these days.

Data Science

I’ve been called a data scientist at work, and I immediately felt inept, like I was being extremely oversold. There are people out there using millions of users’ data to predict traffic patterns, or developing algorithms for image recognition, or predicting which movies you’d like to watch next, while I sit here writing extremely simple SQL statements and piping the results into a Tableau graph… I mean, you can’t help but think “what am I? Sure as hell not a data scientist!”.

But, then, what is data science, and how can one be considered a data scientist without that sinking, incapable feeling in their heart? A data scientist is part engineer, part mathematician, part software developer, part statistician, part business analyst, part logical thinker, part smart person, part person who will not jump off a cliff after nothing works… I mean, can one person even be all of these? Really, for me, I think what it comes down to is whether you can solve the problem. And in the case of my projects at work… Yes, I work with data… And yes, I solve problems, or try to anyways. By this definition, I guess I can be a data scientist! But there’s something about simple addition, subtraction, YoY calculations, and simple means and standard deviations that just doesn’t scream “SCIENCE” to me. A data scientist solves problems, and I would like to be one who can not only solve a straightforward business problem, but also take a more scientific approach to problems.

My first loves (other than my lovely girlfriend) have always been split between basketball and music. In my first post, I tackled basketball. I came in with almost no objective other than to learn, and I think I’ve definitely done that. If I were to build a list of the domains (in hindsight) that I wanted to learn more about, I would’ve listed them as:

  • Development Tools
  • Mathematics
  • Machine Learning
  • Basketball

Let’s break these down one by one based on my first post:

Development Tools

In terms of development tools, this was the first time I used Jupyter. This was also a huge learning experience for me within R. Jupyter gave me the superpower to switch between Python and R and pass dataframes back and forth, just an amazing capability. I had done some modeling in the past with R, so I felt comfortable with the Y ~ X1 + X2 syntax. Honestly, it seems that most folks are pretty split, and I believe Python has even overtaken R in Kaggle competitions now, but regardless, I wanted to pick a language and just learn it. I had already gotten a lot of ETL / data transformation experience in Python from my job, but I felt I had the initial knowledge for R, and it doesn’t hurt to know multiple languages anyways! Working in Jupyter has been great as well. Jupyter seems to make the code itself a bit less automatable, because it lives in a notebook rather than in a script you can simply execute, but the interactivity that Jupyter brings has clearly improved my workflow compared with running entire scripts over and over again to see the result of one command.

Mathematics

Mathematical knowledge goes hand in hand with machine learning, but they are really two different domains. Mathematics, removed from machine learning, is a theoretical science. Entire domains of knowledge like linear algebra make up the fundamental thought process of how to interpret something like linear regression. More complex ideas like the singular value decomposition or eigenvectors are the lifeblood of the PCA algorithm. Machine learning takes this knowledge and applies it to a problem or solution, but there’s no chicken and egg here… the mathematical concepts came first for sure. When we creep a bit into probability and statistics, there are so many words we saw over and over while working through all our models… distributions, covariance, collinearity, log-odds, entropy… All of these were fundamental to some step of some algorithm, and you honestly don’t even know how to tune your models if you don’t have this underlying knowledge. As I was making my last post, I was also finishing up the ESL book, which was vital in tying together mathematics and machine learning.
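To make the eigenvector and PCA connection a bit more concrete, here is a minimal sketch in plain Python. It is not the code from the basketball posts: the data points are made up, the helper names (covariance_matrix, top_eigenvector) are hypothetical, and it uses power iteration rather than a library SVD routine to find the first principal component, which is the eigenvector of the covariance matrix with the largest eigenvalue.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenvector(matrix, iterations=200):
    """Approximate the dominant eigenvector with power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        # Multiply by the matrix, then rescale back to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Made-up 2-D data lying roughly along the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

cov = covariance_matrix(points)
pc1 = top_eigenvector(cov)  # direction of greatest variance
print(pc1)
```

Because the points sit close to the diagonal, the first principal component should point roughly along it, with both coordinates about equal. In practice a library routine would do this job, but the idea underneath is the same.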

Machine Learning

Again, you gotta give it up for ESL here. We’ve already covered math, so let’s cover machine learning logic. This was probably the domain where I learned the most over the entire post. I literally went through 10 posts outlining models as simple as me manually picking a decision boundary with the eye test, to something basic like a logistic regression, to way more complex ideas like the neural network and gradient boosted trees. I got a sense of how easy it was to play with and tune each model, how long certain models took to run, the pros and cons of certain models, and understood why trees were just so goddamn useful. At the end of the day, my gradient boosted trees correctly predicted 13 / 15 all-NBA players for the 2016-2017 season! This was for sure the sexy part of this whole experience, and it really did live up to it as I obsessed over AUC metrics, sensitivity rates, and specificity rates.

Basketball

Ah yes, who could forget basketball… the whole topic that started it all. With basketball, I have a working knowledge. I’ve played for the better part of 14 years now, I love to watch basketball, and I spend way too much time on r/nba every single day. I like to think that I have a pretty good intuition compared to your average fan. What the post did for me was help me look at the data and summarize ideas in ways that I had never thought about. Just looking at simple bar graphs, for example, helped me understand just how much more prevalent the 3P shot is in today’s game compared to the 90’s and 00’s, with teams like Houston shooting almost 50% of their FGA as 3PA! Performing PCA on players’ stats and seeing the data broken down by position on the PCA bi-plot was another super cool way of making sense of data that I already had ideas about in my head. Plus, this actually arranges complex amounts of data into a lower dimension that we humans can make more sense of, act on, and automate if we so choose. Lastly, there was looking at all the individual statistics and advanced metrics and throwing them into the models to predict all-NBA players. That was a doozie, and I learned so much from it. I learned a bit more about advanced stats like WS and VORP, among others, and got a sense of how these measurements came into play when the media voted for their all-NBA selections. The gradient boosted tree model provided a means to make sense of all these advanced stats and showed really clearly that there is some logical connection between how the media thinks when making these votes and how these advanced stats summarize the play of an individual.

Music

Alright… music. So, what about music? Well, to begin, I love music. I grew up listening to a lot of R&B. I used to follow… man what was it called… I think rnbexclusive.com every day to download the newest songs. A lot of excitement in my life at that time came from music. I was a super shy kid, not amazing at school, and I didn’t enjoy school too much because of how shy I was and how shitty I was at it. Going to and from school, I would always listen to my MD player, and as soon as I got home from school, I would go straight to rnbexclusive to download the latest beats. At the time, the industry revolved around big names like Usher and Ciara, with more underground producers and artists like Ryan Leslie and Claude Kelly. I loved R&B, and dabbled in a bit of rap as well. When “Confessions” by Usher came out, I was on a high school band trip to Seattle, WA. It was probably the first time I’d travelled across borders without my parents, and it was super exciting. On the bus ride from Edmonton to Seattle, I listened to Confessions for the first time and, to this day, it’s probably one of the best albums I’ve ever heard. The excitement of going to Seattle and the excitement of how good Confessions was solidified that moment in my head, and I remember that bus ride and trip super vividly.

That, right there, is the power of music to me. Music is a vehicle to explain complex feelings and situations that a word or paragraph just can’t. A point in my life could be influenced by hundreds of factors, and a song would be able to capture a snapshot of all of them simultaneously. Back then, it was a craving and an addiction to listen to and discover more music. That extends to the current day, except I’m listening to more disco and house music. R&B is still there as well, but more so in the form of the Marvin Gayes and Stevie Wonders of the world. Jazz is very much in the mix as well, firing up some Nat King Cole or Stacey Kent. Don’t even get me started on remixes… I love house-style remixes of old disco, R&B, and jazz tracks as well. Just like how the sax, drums, and bass are the lifeblood of jazz, a house record encapsulates the same love for music, but in the form of an 808 or a synth, and just as the jazz club aims to be a cozy environment for jazz lovers to enjoy great acoustics and great musicianship, a house record aims to fill a club up with people letting go of their problems and just dancing for hours on end. It’s awesome how there is a style of music for any situation. Music is like a photo album to me… thinking about all these genres and the timeline of when I started getting into all these artists, I remember pretty vividly the point in my life that I was at. To me, music nowadays is not only a vehicle to capture snapshots of my own life; it’s also a window for me to look into the lives of others, the artists as they created these albums and records, and to be exposed to and even learn from their experiences. It’s such a beautiful art form, and it gets even more beautiful when you transfer it over to the digital world!

Now… to snap back to the reality of this blog…

Okay, so… music. What about music. Well, if I think in terms of the 4 domains of data science learning that I want to explore more:

  • Development Tools: I definitely would like to become familiar with some tools that can perform digital signal processing within Python. MATLAB seems to be the winner here historically, but I’ve come across some interesting Python libraries like LibROSA
  • Mathematics: Man… flashbacks to all the digital signal processing back in Electrical Engineering… I mean, there is a TON you can do with a digital signal in terms of mathematical algorithms and manipulations… one obvious algorithm that I will probably get into pretty much right away is the fast Fourier transform, which switches between the time domain and the frequency domain, the frequency domain being somewhat of a “fingerprint” for sound and music in general. In this category, I’m going to keep an open mind and come in with the mentality of just learning as well
  • Machine Learning: Here, I’m coming in with an open mind as well… I have a general idea that I probably want to try some type of clustering and perhaps even some type of supervised learning to map the fingerprint properties of songs to the genre they belong in… surely disco has a much different fingerprint than techno, right? I don’t know, but that’s what I aim to find out!
  • Music: I dunno what to say here that I haven’t already said… I’ll get to listen to a few of my favourite songs and that’s enough for me 🙂
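As a taste of the time domain to frequency domain switch mentioned in the list above, here is a minimal sketch of a radix-2 fast Fourier transform in plain Python. It is only an illustration and not what LibROSA does internally: the test signal, the sample rate, and the fft helper are all made up for this example, and a real project would lean on a library implementation instead.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of even-indexed samples
    odd = fft(samples[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# Made-up test signal: a loud 5 Hz tone plus a quieter 12 Hz tone,
# sampled for one second at a hypothetical 64 Hz sample rate.
SAMPLE_RATE = 64
N = 64
signal = [
    math.sin(2 * math.pi * 5 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 12 * t / SAMPLE_RATE)
    for t in range(N)
]

spectrum = fft(signal)
# Keep only the first half: for a real signal the second half mirrors it.
magnitudes = [abs(c) for c in spectrum[: N // 2]]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N
print(peak_hz)
```

The two tones are invisible as separate things in the raw samples, but in the magnitude spectrum they show up as two distinct spikes, with the louder tone giving the taller one. That is the sense in which the frequency domain acts as a “fingerprint”.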

Maybe that’s enough for now. Let’s fire up Python and try to play with LibROSA for a bit in the next post.
