Hello! My name is Chi, and this is my third post on this blog. In my last post, I looked at machine learning models that could differentiate between genres of music. If you’ve read any of my previous posts, you’ll already understand that I have no understanding of data science haha. Maybe that’s a bit harsh, but realistically I’m an IT consultant working with clients on simpler data tasks such as data warehousing and general dashboarding. In my day-to-day work, I don’t get to be very creative with modern “data science” methods and techniques. That’s why I wanted to start this blog in the first place… I really only have one goal in mind… learn something.
In my first post, about predicting the 2016-2017 All-NBA teams, I introduced myself to a slew of machine learning models, ranging from your non-parametric KNN, to your simple parametric Linear / Logistic Regressions, to your much more complex and much less interpretable Neural Networks and Gradient Boosted Trees.
In my second post, about classifying musical genres using my own iTunes library, I learned a ton about music information retrieval and got to play around a bit more with complex models that capture highly nonlinear relationships.
In my previous posts, I’ve broken down my learning into the following 4 domains:
- Development Tools
- Mathematics & Statistics
- Machine Learning
- Domain Knowledge
Let’s stop and review the concepts that have stuck with me so far from the last post:
Becoming Less Useless Every Single Day
Development Tools
I’ve been working in Jupyter notebooks this whole time and have constantly been amazed by their capabilities. Since my last post was on music, I got to see Jupyter’s Audio widget provide embedded audio playback! The librosa library in Python showed off a ton of audio signal processing features… and this is where I scratched the surface on domain knowledge for music as well. I also tried out 2 new machine learning tools: xgboost for Gradient Boosted Trees, and Tensorflow for Neural Networks. I’m just constantly amazed at all the libraries Python has… goodness.
Mathematics & Statistics
This section is kinda intertwined with machine learning and domain knowledge, so I’ll likely touch on the next 2 sections here, but I’ll try to separate what I can. In terms of mathematics and statistics, I definitely picked up a bit more on measuring error. In my first post, I trained and tested on the same data the entire time. That’s not the most robust statistical approach, as we aren’t considering out-of-sample data or errors at all. The classic example is a 1-NN classifier: tested on its own training set, it will always yield 100% accuracy, but it will likely overfit on any other set of data. I had never tried cross validation or anything like that, so I wanted to give these validation methods a try to prevent overfitting. I first tried a single train-test split, with 90% of the data for training and 10% for testing, and that provided an initial benchmark. I then looked at the GridSearchCV function in the sklearn library to tune the parameters of my xgboost model, and boy did that make it easy. This leads directly into the next section!
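For reference, that 90 / 10 split is a one-liner with sklearn’s train_test_split (toy data here stands in for the real frame features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix and labels
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Hold out 10% of the data for testing, train on the remaining 90%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

print(len(X_train), len(X_test))  # 90 10
```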
Machine Learning
I didn’t quite learn any new models per se, but I definitely got familiar with a few new tools in xgboost and Tensorflow. The xgboost library provided a feature-rich, multi-threaded platform for gradient boosting. Combined with GridSearchCV, it made for an easy way to find the best model parameters. In terms of gradient boosting, I saw the effects of the learning rate, the max depth of the trees, the number of trees / iterations in the model, and subsampling. In the end, my model predicted 23 / 25 songs correctly (92%!) on a test set of Rock and Easy Listening… two genres that I hand-picked because of their distinct styles, but I was extremely happy with how that turned out because I started out as someone with little to no knowledge of signal processing outside of a Fourier transform. This, again, leads nicely into the next section!
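The tuning pattern looked roughly like this. As a sketch, I’m using sklearn’s GradientBoostingClassifier on toy data as a stand-in (it happens to share the learning_rate / max_depth / n_estimators / subsample parameter names), but the same GridSearchCV call works with xgboost’s XGBClassifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the MFCC frame features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The same knobs discussed above: learning rate, tree depth,
# number of trees, and subsampling
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
    "subsample": [0.8, 1.0],
}

# Exhaustively try every combination with 3-fold cross validation
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV handles the cross validation for every parameter combination, which is exactly what made it so easy compared to tuning by hand.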
Domain Knowledge
This is probably where most of the learning in the last project happened. I come from an Engineering background and do remember taking digital signal processing classes, but knowing how little I cared back then, those classes really didn’t add to the knowledge I had coming into this exercise. Maybe I’m underestimating how much having once learned the Fourier transform subconsciously helped me pick up further concepts, but I really felt at a loss when I was searching for what would eventually be the MFCC method. The biggest light bulb that went off in my head over the 2-3 week course of the project was that the training samples were created from 20 ms frames! Going into the project, I had thought I would simply be classifying song by song, but once I read up on MFCCs and was exposed to spectrograms, it became clear to me that music and sound are often analyzed frame by frame. This ties into the machine learning domain as well, because it was an important lesson in feature generation and provided insight into how messy some of the data / methods for classification can be. At the end of the day, I had to break each song up into frames, train by frame, predict on each frame, and take a majority vote or a percentage threshold to say “Well, if 25% of the frames were predicted as Easy Listening, then the song should be Easy Listening, otherwise Rock”… a train of thought that’s not at all obvious going into the project. I’m sure the world of machine learning is filled with many different ways of approaching problems, and I’m interested in exploring more perspectives!
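That per-song decision rule boils down to just a few lines. The function name, the 25% threshold, and the 0 / 1 label encoding below are all just illustrative:

```python
import numpy as np

def classify_song(frame_preds, threshold=0.25):
    """Label a song from its per-frame predictions.

    frame_preds: 0/1 labels, one per ~20 ms frame
                 (1 = Easy Listening, 0 = Rock in this sketch).
    If at least `threshold` of the frames vote Easy Listening,
    the whole song is Easy Listening; otherwise it's Rock.
    """
    frac = np.mean(frame_preds)
    return "Easy Listening" if frac >= threshold else "Rock"

# 2 of 8 frames (25%) voted Easy Listening, which meets the threshold
print(classify_song([1, 0, 0, 0, 0, 0, 0, 1]))  # Easy Listening
```

The asymmetric threshold (25% rather than a straight majority) is the kind of knob you only discover you need once you see how noisy frame-level predictions are.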
Edmonton Property Assessment
Interpretability Vs. Accuracy
One thing I learned in the last project that I wanted to shine a spotlight on was the idea of interpretability vs accuracy. In the first post, about basketball, we were dealing with an inherently simpler problem. I was asking questions like “At what Win Share and / or VORP values does a player make All-NBA?”. We found that, as a rule of thumb, a Win Share above 7.5 generally meant an All-NBA selection. These are players who statistically carried their team to 7.5 or more wins. Given that a very good team wins around 50 games, 7.5 wins attributed to a single person is saying a lot. Basically, 7% of your roster (one player) generated 15% of your wins. That’s a nice ratio. You can see how I’m able to talk through this logically, and anyone with a basic understanding of Win Shares can follow along (you may not even need to know / care and you’d probably still get the idea).
When we looked at music, we QUICKLY got into a ton of details that would be sure to lose anybody without a technical background… starting with the Fourier transform! Just understanding the time and frequency domains isn’t intuitive without having studied the topic at least a little bit. What’s worse, we ended up with 10 MFCCs as features. 10 was just a number I chose to be cognizant of the hardware I’m working with in terms of disk, CPU, and RAM, but I could’ve easily taken the first 100 MFCCs! Who knows if the last 90 would’ve been important (we did end up getting 23 / 25 right with only 10), but the fact that we could pick up 90 extra features in the blink of an eye was not something we encountered in the NBA project. The NBA project was solving an inherently simpler problem, a problem that people created. People created the game of basketball with relatively easy-to-understand rules, and people voted on the All-NBA team based on what they saw. At the end of the day, the point of basketball is to put the ball in the hole, and the team with more points wins. Easy, isn’t it?
In music, on the other hand, we’re dealing with natural science and physics that exist on earth. Sciences that we, as a human race, do not understand completely. We chip away at problems here and there, and people dedicate their entire lives to learning new things in the area, so how does a guy like me stand a chance? I actually find it pretty interesting that someone like me was able to put a model like that together, and that’s not me boasting about myself at all. It’s really a commentary on interpretability vs accuracy.
In basketball, we really cared about interpretability. Being able to put together the sentence “Players with more than 7.5 Win Shares generally make the All-NBA team” is something our tiny human brains can grasp. It’s awesome, because the next time I look at Win Shares, they’ll have a totally different meaning for me. Saying “MFCCs #3 and #5 have the largest impact when deciding between Rock and Easy Listening” obviously doesn’t have the same ring. Now, this could EASILY be because I’m just not knowledgeable on the topic, and perhaps if I had studied it formally it would ring a different bell for me, but when training the music genre model, I found myself tweaking parameters and trialling different models mostly in search of the best accuracy. In a sense, I almost didn’t care what the model was doing on the inside; I just knew it was modeling a non-linear relationship as best it could, and as long as it didn’t overfit, I was good.
In the consulting world, depending on who you’re talking to and the scope of the project, that would more or less never fly. How can you sit in front of someone and tell them you don’t understand what the model you just built is doing? Why would they believe you? If I were in their shoes, I’d be saying the same thing! Just because I predicted 23 out of 25 measly songs correctly, does that mean I’ve solved the problem? Likely not. We have to understand the caveats, assumptions, and intricacies, because that’s the exact service we’re being paid for. That’s how I got the idea for this project.
The inception of this project was two-fold. First, while I loved looking at music and learning about digital signal processing, I wanted something in my portfolio that showed how interpretable machine learning could be. Second, I had always wanted to work with some government open data. Since I live in Edmonton, AB, why not start with some data from my hometown!
Property Assessment Data
My parents just sold their house and bought a new one, a bunch of my friends are looking for condos and houses, and many have just signed up for mortgages at the bank. Christ… I’m getting old… lol. I find myself having real estate discussions more often than I’d like. Partially because I’m just too lazy to think about that stuff, and partially because I’m spending all my time doing this stuff haha. The Edmonton Open Data portal has property assessment data for the city, and I thought it would be cool to see what the data consists of and either run some models or build some visualizations to understand the topic a bit better.