In the last post, we briefly went over some examples of distributed computing and took a narrow survey of the EMR and Hadoop landscape. In this post, I want to actually spin up a cluster. I don’t have much else to say up front: I’m new to this area and have read about as much as I can on the product… so let’s just dive right into it.
AWS EMR Console
Let’s open up EMR and try to create a cluster. 4 sections are presented to us:
Pretty simple here.
- I’ll keep the default name
- I’ll turn off logging (this feature enables automatic logging to S3). Note that with this off, I’ll still have local logs on the EC2 instances themselves
- I’ll keep the launch mode at its default, cluster, as well. Step execution means you want to spin up the cluster, run some automated job, and terminate the cluster all in one fell swoop; I’d like to play around with the cluster live
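For anyone who would rather script this than click through the console, here is a minimal sketch of the same three choices. It assumes the request shape used by boto3’s `run_job_flow` call; the helper name and the cluster name are hypothetical placeholders, and nothing here actually talks to AWS.

```python
# Sketch only: builds the "general" part of an EMR request as a plain dict.
# general_settings and "my-spark-cluster" are hypothetical, for illustration.

def general_settings(name, log_uri=None, keep_alive=True):
    """Return request fields mirroring the console's general options."""
    settings = {
        "Name": name,
        # Cluster launch mode: stay up when there are no steps left, so the
        # cluster can be used interactively. False mimics step execution.
        "Instances": {"KeepJobFlowAliveWhenNoSteps": keep_alive},
    }
    # Leaving LogUri out means no logs are archived to S3.
    if log_uri is not None:
        settings["LogUri"] = log_uri
    return settings


settings = general_settings("my-spark-cluster")
print(settings)
```

The dict would still need software, hardware and security fields before it could be passed to EMR.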
Okay, a bit more complicated here because I’m not 100% familiar with all the services yet haha. It’s pretty obvious I’d like to use the Spark option as that’s what I set out to do. It looks like EMR can be sliced and diced in quite a few ways. Core Hadoop seems to provide us with HDFS and Hive, and beyond that, I’m not really sure what the other services are. I absolutely don’t feel comfortable speaking to HBase and Presto, so I’m going to skip these altogether.
The Spark option gives us:
- Hadoop (HDFS)
- Ganglia (Cluster resource monitoring)
- Zeppelin (A Spark-compatible notebook, similar to Jupyter)
This one seems to have all the stuff we just went over plus some bells and whistles as tools layered on top of Hadoop and Spark. Let’s go with this.
Oh yeah, I also have no clue what AWS Glue Metastore is so I’m going to ignore this too (a theme is starting to develop, no?). This blog is after all about trying, FAILING, and sometimes succeeding at data science.
These options are relatively simple as well, but we need to get into a bit of math here and review the scope of our task. Our data, again, is around 1.5GB (yes, we’ve already made the argument that using EMR for this is slightly overkill), so I don’t think we need any more than 2 worker nodes. Worker nodes are defined as “Core” nodes within EMR. The default of 1 master + 2 workers sounds good to me.
Now, of what type? The default instance type is a general purpose m3.xlarge (4 CPU, 15 GB RAM, 26 cents / hr). Two of these as workers is probably overkill, not to mention that the 3 nodes spun up this way will cost ~80 cents / hr in total. Not sure if I want to spend a dollar an hour on this just yet.
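As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. It uses only the 26 cents / hr price quoted above, which is a point-in-time number; the function name is my own.

```python
# Sketch: hourly cost of a cluster where every node is the same type.
# cluster_cost_per_hour is a hypothetical helper, not an AWS API.

def cluster_cost_per_hour(price_per_node, workers, masters=1):
    """Price per node (USD/hr) times the total number of nodes."""
    return price_per_node * (masters + workers)


M3_XLARGE_ON_DEMAND = 0.26  # USD/hr, as quoted in this post

default_cost = cluster_cost_per_hour(M3_XLARGE_ON_DEMAND, workers=2)
print(f"1 master + 2 workers on m3.xlarge: ${default_cost:.2f}/hr")
```

That works out to 78 cents an hour, which is where the rough 80 cents comes from.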
Let’s take a look at the cheapest instance in the general purpose category of EC2s: m4.large. I’m now ignoring m3 altogether because m3 gives you an SSD, which I will not need because I’ll theoretically be loading data from S3 straight into the cluster’s RAM resources.
The m4.large has 2 CPUs and 8GB RAM per node, and costs 10 cents / hr. I think I can deal with this, although at 30 cents an hour (1 master + 2 workers), that’s still a bit steeper than I’d like. Remember when we rented out the p2.xlarge box to run our Neural Network, we were paying about 20 cents / hr for our spot instance. Actually, that reminds me… I was looking at the on-demand prices for the m4.large. A spot instance actually comes out to around 2.8 cents / hr against 10 cents on-demand. I don’t really have to be a data scientist to do that math.
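Spelled out as a quick sketch, using only the two per-node prices quoted above (both are point-in-time figures and will have moved since):

```python
# Sketch: spot vs on-demand for the m4.large, using this post's prices.
ON_DEMAND = 0.10  # USD/hr per node, on-demand
SPOT = 0.028      # USD/hr per node, approximate spot price at the time
NODES = 3         # 1 master + 2 workers

ratio = SPOT / ON_DEMAND  # fraction of the on-demand price we would pay
discount = 1 - ratio      # fraction saved

print(f"spot is {ratio:.0%} of on-demand, a {discount:.0%} discount")
print(f"cluster: ${ON_DEMAND * NODES:.2f}/hr on-demand "
      f"vs ${SPOT * NODES:.3f}/hr on spot")
```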
Seriously though, at 28% of the original price (a 72% discount), it would really make a difference if we started using EMR often in the future. I’m not really going into production, so if I lose a spot instance, all good, whatever. Let’s go with that for now. I really don’t know what the overhead of Spark and YARN will be, but I truly believe the m4.large will be just fine.
One problem though – I don’t see an option to request spot instances as my nodes, but I’ve read multiple blogs of folks who have used spot instances for EMR nodes.
— 5 minutes later —
Ah, okay, we have to go into advanced options… ugh. Alright well let’s just finish the next section and then I guess I’ll have to dive into the advanced options.
Security & Access
This one is relatively easy as well. I’m going to use my ec2-user ssh key that I already created in previous projects. I won’t play around with the IAM roles either because I don’t really care about security right now. I just want a cluster up and running.
AWS EMR Console – Advanced Options
Because I want to optimize for cost, I’m going to go ahead and explore the advanced options. I’ll skim what I don’t know / need and focus on the spot instances.
Software & Steps
Nothing too crazy here other than us being able to actually choose the packages we want. I’ve selected everything that was in the Spark package, plus Hue in case I choose to use HDFS or Hive.
I will ignore the rest of the options right now; I don’t think they’re needed.
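If this selection were scripted instead, it might look like the sketch below. It assumes the `Applications` list format that boto3’s `run_job_flow` accepts and that the Spark quick option bundles Spark itself with Hadoop, Ganglia and Zeppelin. The constant and function names are mine.

```python
# Sketch: the software selection as a list of EMR application entries.
SPARK_BUNDLE = ["Hadoop", "Spark", "Ganglia", "Zeppelin"]  # assumed bundle
EXTRAS = ["Hue"]  # the one addition made in the advanced options


def applications(names):
    """Wrap each application name in the {"Name": ...} shape EMR expects."""
    return [{"Name": name} for name in names]


apps = applications(SPARK_BUNDLE + EXTRAS)
print(apps)
```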
This is exactly what I was looking for. First of all, I can actually choose the VPC and security group here as well. I’m going to choose the ones that I created with my Terraform script in my previous project. These VPCs and Security Groups are wide open and anyone on the internet can access them. This just gives me ease in troubleshooting as this is all open data I’m working with anyways. Nothing really to be compromised here.
Secondly, I’m able to choose spot instances now, and I’ve set my bid to 3 cents / hr. Got my 3 m4.larges and I should be off and running. Everything else I’ve left as default.
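Here is a rough sketch of what that hardware selection could look like as instance groups, in the shape boto3’s `run_job_flow` takes under `Instances["InstanceGroups"]`. I’m assuming all three nodes, master included, are requested as spot; `instance_group` is a hypothetical helper.

```python
# Sketch: 1 master + 2 core nodes, all m4.large, all bid at 3 cents / hr.

def instance_group(role, instance_type, count, bid_price=None):
    """Build one instance group; pass bid_price (USD/hr) to request spot."""
    group = {
        "Name": f"{role.title()} nodes",
        "InstanceRole": role,  # "MASTER", "CORE" or "TASK"
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": "ON_DEMAND",
    }
    if bid_price is not None:
        group["Market"] = "SPOT"
        # EMR takes the bid as a string in USD.
        group["BidPrice"] = f"{bid_price:.2f}"
    return group


groups = [
    instance_group("MASTER", "m4.large", 1, bid_price=0.03),
    instance_group("CORE", "m4.large", 2, bid_price=0.03),
]
print(groups)
```

Worker nodes go in with the `CORE` role, matching EMR’s own name for them.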
General Cluster Settings
I don’t see anything here that I need to tamper with either, other than turning off logging to save myself unnecessary usage of S3 space.
Here, I’ve assigned my ssh key again, ignored the IAM users, and directed the clusters to sit in my pre-defined security groups so, again, I don’t have to deal with restricting ports and IPs. WIDE OPEN BABY.
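For completeness, a sketch of the same security choices as request fields, again assuming the boto3 `run_job_flow` shape. The key name, subnet ID and security group ID are placeholders (my real ones aren’t shown here), and the two role names are EMR’s standard defaults, which I’m assuming is what leaving the roles alone amounts to.

```python
# Sketch: SSH key, network placement and roles for the cluster.
# All IDs passed in below are placeholders, not real resources.

def security_settings(key_name, subnet_id, security_group_id):
    """Reuse one pre-made security group for both master and workers."""
    return {
        "Instances": {
            "Ec2KeyName": key_name,
            "Ec2SubnetId": subnet_id,
            "EmrManagedMasterSecurityGroup": security_group_id,
            "EmrManagedSlaveSecurityGroup": security_group_id,
        },
        # Default EMR roles, left untouched.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


security = security_settings("my-ec2-key", "subnet-EXAMPLE", "sg-EXAMPLE")
print(security)
```

A wide-open security group like mine is only reasonable because this is public data on a throwaway cluster.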
Spinning Up The Cluster
Welp… without further ado… let’s spin up the cluster! Actually, this post has been long enough. Let’s do it in the next post.