# NYPD Crime #2 – Distributed Computing & Amazon EMR

## Distributed Computing

Before we jump straight into EMR (Elastic Map Reduce (The “Map Reduce” refers to essentially a distributed computing computational engine that is more or less obsolete these days)) or Hadoop, let’s just spend a few minutes (well, paragraphs) talking about distributed computing in general.

Distributed computing is the idea of distributing your computing over multiple capable workers. At the very basic level, if you’re moving houses and you need to move all your furniture, you’ll be able to move it much faster with 2 buddies than by yourself. In computing, the baseline premise is the same. I would be stupid to ignore the details behind the concept, but my brain is dumb so I’ll take baby steps at a time.

If I think about my friends helping me move (which I have absolutely done before), it is obviously faster – Obviously! 3 bodies is better than 1! However, it’s not without its complications either. Here’s a few things that I’ve come across personally:

• Even before moving, you have to call your friends, ask and bribe, and ensure they know the time and location
• Each person needs to be in a (relatively) well-rested condition, and have their own car to get to my place and help store items to transport
• I’ve come across times where I don’t actually need to move everything in the old house at that moment, or some stuff will be going back to my parents’ / friends’ houses rather than my new house – that info needs to be communicated to my friends or we’ll have items ending up at incorrect destinations!
• Ever had a friend stub his toe or get a cut moving? Stuff happens like that all the time, and if and when it does, it needs to be tended to… the more people you have, the more opportunities to get hurt as well
• Some of my friends are bigger and stronger than some of my other friends! Some of them are 5-foot girls who likely will not be the ones lifting a couch haha

My point here being the more people you have, the more complicated it now gets to actually organize or efficiently utilize all the resources. Sometimes it’s not that much more to think about, other times I wish I’d just straight up done it all myself. In the distributed computing world, we see not necessarily the same complications, but we see the same concept of trade-off.

If we have 1 computer, we have let’s say 2 cores, 4GB RAM, maybe 200GB of disk space etc. If we have 3 of those, we now have, in aggregate, 6 cores, 12GB RAM, and 600 GB of disk space. I know, I know, we literally just spun up a machine with 4 CPUs, a GPU, and 61GB RAM and that’s probably good enough for most applications involving a dataset under 2GB like I’m doing in this project, but I’m purposely trying to dive into distributed computing here in a controlled way haha. Anyways, yes, we get the aggregate firepower of all 3 computers, but here comes the trade-off!

• Where is the data stored? Should it be stored on all 3 machines?
• If so, should the data be replicated across all 3 machines? Or evenly split up and distributed among all the HDDs?
• How do we actually make these decisions? Are one of these machines the proverbial “guy who is asked for help moving and knows exactly how and where everything should be moved”? AKA somewhat of a “boss” or “master” node?
• Will the communication between the nodes be fast?
• What if one machine crashes? How will the other machines handle that?
• What if another machine wants to join in on the fun? How will the distribution of tasks change with this new addition?

Although, to no surprise, I’m no expert on distributed computing (we can get into some really complicated consensus algorithms), but I wanted to give some context around why EMR is even a service. Why do we actually need to give a name to what essentially comes down to 3 EC2s? The answer lies in the software that sits on top of these EC2s that have answers to all the problems we just stated. I wish I knew the nuts and bolds, but unfortunately I just don’t have the time nor brainpower to do that! Maybe something to stick on the to-do list.

### Distributed Computing Example #1: Sum

Let’s say we have 6 numbers $[1, 2, 3, 4, 5, 6]$ and we want to find its sum. Obviously we can do a bit of foreshadowing… the answer is 21.

Not very difficult to do on one machine, but let’s complicate our lives an unnecessary amount and try to compute this over 3 machines. This isn’t to difficult either luckily – We’d probably just distribute the data among the 3 machines evenly (this is coordinated by a master node which can be randomly designed as any of the 4):

• Machine #1 gets $[1, 2]$
• Machine #2 gets $[3, 4]$
• Machine #3 gets $[5, 6]$

Each machine does an addition of their own data (traditionally called the map step of map-reduce and they get the following results:

• Machine #1 calculates a sum of $1+2=3$
• Machine #2 calculates a sum of $3+4=7$
• Machine #3 calculates a sum of $5+6=11$

Each machine would then send it back to the master node, let’s say machine #1 in this case, and machine #1 would calculate the final sum (traditionally called the reduce step of the map-reduce framework):

• Machine #1 calculates a sum of $3+7+11=21$

The master node would then return the final result to the users. Let’s look at another example.

### Distributed Computing Example #2: Average

Let’s say we have the same 6 numbers $[1, 2, 3, 4, 5, 6]$ and we want to find the average this time. We can probably distribute the data in the same way, but how exactly do we calculate the average?

$Average=\frac{Sum}{Number\ of\ Entities}$

It seems that the master node needs 2 pieces of information now to calculate the average. Each slave node now needs to pass the master node 2 pieces of information. We’ve already calculated the sum, and the number of entities is relatively easy to calculate for each node as well:

• Machine #1 calculates a sum of $1+2=3$ and counts a total of $2\ entities$
• Machine #2 calculates a sum of $3+4=7$ and counts a total of $2\ entities$
• Machine #3 calculates a sum of $5+6=11$ and counts a total of $2\ entities$

Now the master has 2 final calculations it needs to do instead of 1

• Machine #1 calculates a sum of $3+7+11=21$
• Machine #1 calculates a total number of $2+2+2=6\ entities$

Then the master calculates the final answer

• Machine #1 calculates an average of $\frac{Sum}{Number\ of\ Entities}=\frac{21}{6}=3.5$

A bit harder, but we got through it!

### Distributed Computing Example #3: Add A Constant

This one’s even easier. Let’s just add the number 5 Ao all numbers and store that as another result, maybe in another column of the dataframe that contains our original data. We don’t even have a proverbial “reduce” step here. If we were doing this type of calculation with a single processor, the simplest way to do that is to just go line by line, perform the addition, store the result, and go to the next line. Something like this:

• Machine #1 calculates $1+5=6$ and $2+5=7$
• Machine #2 calculates $3+5=8$ and $4+5=9$
• Machine #3 calculates $5+5=10$ and $6+5=11$

After storing the results, we’re done!

### Distributed Computing Example #4: Cumulative Sum

Alrighty – last one. Let’s try the cumulative sum. This will be similar to the last example where each “row” of data will have a result. If we perform the cumulative sum calculation on $[1, 2, 3, 4, 5, 6]$, we should come out with $[1, 3, 6, 10, 15, 21]$.

Let’s start out the same way with the data distributed evenly amongst the 3 machines.

• Machine #1 calculates $1+0=1$ and $2+1=3$ and stores those results
• Machine #2 calculates $3+x=y$

Hmm… How does machine 2 calculate $y$? It has the number $3$… but what’s $x$? Well $x$ actually depended on the data that machine 1 had… In order for machine 2 to perform its calculations, it has to wait until machine 1 has finished its calculations. In this case, there’s no difference between distributing this calculation and not distributing. In fact, it will be slower to perform this calculation because machine 1 now needs to communicate its calculation to machine 2 through a wired or wireless network and that will cause the unnecessary overhead that a single processor just won’t have to deal with. Bottom line: There are functions which we cannot perform in a distributed manner.

We see the types of coordination and organization that the different nodes have to go through to perform to perform more complex functions like standard deviations or joins with other data sets!

Alright. EMR. What is EMR? First of all, EMR stands for Elastic Map Reduce. Elastic means the architecture is flexible and scalable – we can add and remove nodes on the fly and it’ll integrate (hopefully seamlessly) into the family. Map Reduce we’ve covered a bit, refers to one of the original mainstream distributed computing frameworks. It’s more or less obsolete now so I won’t go into it any more than I already have with the examples.

Hadoop is a suite of tools that deliver a tangible distributed computing architecture to a user. Some of the main components of Hadoop are:

• HDFS (Hadoop Distributed File System): A single abstracted file system that lies on top of multiple nodes but we can navigate as a single unified file system (a single file can lie on multiple nodes, but we see it as a single file in HDFS, we don’t worry about how it’s distributed underneath)
• HIVE: A single abstracted relational database that lies on top of multiple nodes, similarly to HDFS
• YARN (Yet Another Resource Negotiator): This wise-ass-named tool is the algorithm that facilitates the consensus logic amongst the nodes

Hadoop has way more modules, but these are the only 3 that may relate to Spark for the scope of my project, so I’ll stop here and dive a bit more into HDFS to make the distributed concept a bit more tangible.

### HDFS

HDFS stands for Hadoop Distributed File System. What happens when I store a file on my Macbook? Let’s pretend we’re downloading a file from the internet…

1. As you browse a website, your processor is receiving HTML code via your ethernet card and it transforms it to a website which you can digest visually
2. Once you click on a file to download it, the HTTP protocol begins to transfer the file to your laptop via packets of bytes (generally using the TCP and IP protocols that I briefly talk about in this post)
3. As packets are loaded into RAM, your processor stores them to free HDD space, in chunks, until the whole file is stored on your HDD
4. Note that, while the bytes are stored on the HDD itself, the “file” that you see in your OSX Finder or Windows Explorer or BASH console is nothing but an abstracted pointer that the computer is using to know where to find that file on your HDD… the pointer says this file somefile.txt is represented by x, y, and z HDD memory sectors
5. When you delete a file, you are merely deleting the pointer to that file, and the computer no longer recognizes that somefile.txt lives in sectors x, y, and z – while the data still lives on the disk, the acknowledgement that the file exists is gone and the processor is free to write back to those HDD sectors

Now… how does this change when we want to store files in a distributed manner? In our examples above when we had the simple array $[1, 2, 3, 4, 5, 6]$, we stored 2 array elements per node. There are a few problems with this, though:

• What if one node goes down? We would literally be missing 1/3 of the file!
• What if one of the nodes HDDs fills up? We can no longer distribute across all 3.

This is where HDFS steps in. HDFS is designed for robustness and scalability. Replication is one the mechanisms in which HDFS provides solutions to these problems. By default, HDFS replicates each file by a factor of 3. Or, more specifically, each file is broken down into blocks, and each block is replicated by a factor of 3:

Notice here that, with 4 nodes and a replication factor of 3, any one node can go down and we would still have all the blocks required to recreate the file. In fact, any two can go down and we’d still be good. This would change with the scale of your cluster and the number of nodes you spin up, but you get the idea. HDFS also performs this replication and distribution of blocks based on available resources as well, namely hard drive space. From the HDFS architecture guide:

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold.

Once again, another domain that I’m nowhere near an expert on, or even knowledgeable about, so I will leave it at that for now.

Hadoop generally includes Hue, a web-based HDFS file browser that provides a simple user interface that we can use to avoid getting caught up in all of HDFS’ details. Hue looks something like this:

Pretty much like a normal file browser!

### YARN

YARN is, well annoyingly it’s yet another resource manager *eyeroll*

I’m sorry… how could I resist? ANYWAYS… YEAH… YARN = Yet Another Resource Manager. I suspect this means that there were other resource managers before this one in the world of distributed computing. For the 8000th time, I’m no expert in any of this, YARN is really the first one that I’m working with (and even then, it will all work in the background, transparent to me), but I think it’s worth me at least exploring so I can understand the inner workings in hopes that it will help me design and debug when the time comes!

YARN is the protocol that the nodes communicate using to determine which machines will do what. Remember in our examples, we had our map phases and reduce phases. I found this article from the folks over at Cloudera extremely helpful because it was written for folks like me with half a brain (just kidding, just people with less experience in distributed computing). I’ll be using some of their illustrations in this section.

A “cluster” in hadoop indicates machines who are working together as one distributed computing platform. There exists 1 master node, and however many worker nodes the user wants (depending on the task at hand).

In this example, we have $n$ nodes, each with their own processing and RAM. YARN utilizes a Resource Manager daemon and $n$ Node Manager daemons to logically make up this Hadoop “cluster”.

Each node in the cluster has its own resources, right? The master node then takes a global view of the available resources in the cluster in aggregate:

In the example above, we have 100 worker nodes each with 64 cores and 128GB RAM each, making a cluster with

$64\ CPU\ cores\times 100=6400\ CPU\ cores$$128\ GB\ RAM\times 100=12800\ GB\ RAM$

Easy enough. Notice here that the master node does not seem to be one of the worker nodes in the cluster, but the duties of the master could be put on one of the workers.

#### YARN Workflow

When a cluster is first kicked off, the client (whether it be an end-user or an application) will first communicate the task to the YARN master node. The YARN master node has access and communication to worker nodes within that cluster.

Various hadoop configuration parameters and the current workload of the nodes will help guide the master node in determining how much resources to ask for from the cluster of workers. If there is only one worker node, or if only one worker is available, the cluster will operate in a design like this:

The Node Manager daemon on the worker node spawns a logical container with the worker process running within it (could be a map or reduce job here). If more nodes are summoned or become available, the Resource Manager will continue to spawn workers until the optimal number is reached.

Here, the another worker is spawned and the master understands that it has an aggregate of 2 workers, 120 CPU cores and 180GB of RAM. As workers finish their tasks, their Node Manager daemon terminates and, as all the workers finish their tasks, the Resource Manager daemon terminates, but not before the result is presented back to the requesting client.

### HIVE

Right off the bat, I don’t know enough about HIVE and haven’t used HIVE enough to really comment on its inner workings, but it’s basically a distributed relational database. We saw, with HDFS, that just storing a single files or doing a simple sum takes quite a bit of organization and communication. You extend this out to all the functionalities that make a relational database work… indexes… table metadata… transformation of SQL joins and group bys into a distributed framework… I’ll need quite a bit more education before I really start to understand how HIVE is built.

Just wanted to mention it though, because Spark can pull data from HDFS or HIVE directly!

Anyways, that’s probably enough for this post. I think I’ll try to spin a cluster up in the next post.