Storing Big Data
Whoa, the week is getting away from me fast. I thought I had more time than I did. I’m now on Chapter 13 of Algorithms for Dummies by John Paul Mueller and Luca Massaron.
This post represents my notes on the book, so it is not written to be informative so much as to capture my own note-taking as I learn. It’s my thinking out loud as I process the information, so grammar and spelling have not been carefully checked, and I might not even be correct in my understanding of the material.
My last set of notes centered on sampling/streaming and sketching/hashing. Those algorithms are great when you are okay with an approximation of the deluge of data, but there are times when you need to be more precise, such as when you are running a keyword query, searching for an image, or trying to match DNA sequences. In those situations, you probably need to store the data and then perform your calculations. But storing the data requires a lot of hardware and other resources to handle the big data.
Computer engineers came up with distributed computing (or parallelism) to handle storing big data.
What distributed computing does is break the streaming data up into packages and store them on a network of computers. This storage probably relies on the data satisfying associative and commutative properties, much like those found in numbers, though the book doesn’t go into any detail on those properties. The other thing about these packages of data is that they are copied at least two or three times to provide redundancy or backups. Finally, an index called a master node (actually multiple indexes) is created to store the packages’ addresses.
Storing the data packages is quick; retrieving them takes longer.
The basic set of steps to retrieve the needed data is to go to the master node to get the addresses of the data, and then send an order to retrieve/move that data from multiple computers onto a single computer before beginning any analytical work. All of this transmission of data takes up processing time and eats up resources, though.
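To make this concrete for myself, here is a toy sketch of the master-node idea (my own illustration, not from the book): the node names, block IDs, and replication factor are all made up.

```python
# Toy sketch of a master-node index over replicated data packages.
# Storing records which nodes hold each block; retrieving looks up
# the addresses first, then pulls one copy of each block together.
import random

NODES = ["node-1", "node-2", "node-3", "node-4", "node-5"]
REPLICAS = 3  # each package/block is copied for redundancy

def store(blocks):
    """Assign each block to REPLICAS nodes and record the addresses."""
    master_index = {}
    node_storage = {node: {} for node in NODES}
    for block_id, data in blocks.items():
        homes = random.sample(NODES, REPLICAS)
        master_index[block_id] = homes
        for node in homes:
            node_storage[node][block_id] = data
    return master_index, node_storage

def retrieve(block_ids, master_index, node_storage):
    """Look up each block's addresses, then fetch one copy of each."""
    return [node_storage[master_index[b][0]][b] for b in block_ids]

master_index, node_storage = store({"blk-0": "to be or", "blk-1": "not to be"})
print(retrieve(["blk-0", "blk-1"], master_index, node_storage))
```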
MapReduce
MapReduce provides an engineering solution that eliminates the massive movement of data onto a single computer. Instead of moving the data to a single computer for processing/calculations, MapReduce does the processing on the multiple computers and then moves (or maybe retrieves) only the “reduced” answers.
One key thing to note: MapReduce is actually not an algorithm but more like a process or a “framework”. And I thought MapReduce was a Google thing; it turns out Google introduced the idea, and Apache Hadoop is the widely used open-source implementation of it.
- Map: a function is first applied to the data, which is supposed to be in a one-dimensional array, and a new array containing the transformed data is created. Each value in the array should be independent of the others; this independence is what enables parallelism.
- Shuffle and sort: when the mapping is done, you have a set of (key, value) pairs. A hash function is applied to each key to determine which computer the pair will go to for the reduce part.
- Reduce: the reduce is another function(?) that applies to each element in the array, acting as if the data is a data stream, and stores the interim results. So, for example, the reduce applies a function to the first element in the array and stores the result. Then the function is applied to the second element, and that is added to the result from the first element. Next, the function is applied to the third element, and that result is added to the summed-up result of elements one and two. The process continues with elements four, five, and so on until all elements have been processed.
Or maybe the reduce first applies the calculating function to all elements in an array and stores the interim results, and then a summation part of the reduce sums them up. I still have a few questions on the exact approach: is the calculating function (the one where you get the answer) in the map part or in the reduce part? If the calculating function is in the map part, then is reduce just summing up the answers? If the calculating part is in the reduce, then what’s the point of the function in the map?
I think the calculating for your answer is in the map part and the reduce just sums up the elements to “reduce” the data; the sketch below is how I picture it.
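To settle my own question, I wrote a tiny single-machine sketch of the three steps (my illustration, not the book’s code; the keywords and sample text are made up). The per-element calculation happens in map, which emits a (keyword, 1) pair for every hit; shuffle hashes each key to pick a reducer; reduce just sums.

```python
# Minimal map -> shuffle -> reduce sketch on one machine.
from collections import defaultdict

KEYWORDS = {"love", "war", "peace", "honor"}  # made-up example keywords
NUM_REDUCERS = 2

def map_step(words):
    # The "calculation": emit (keyword, 1) for every keyword occurrence.
    return [(w, 1) for w in words if w in KEYWORDS]

def shuffle_step(pairs):
    # Hash each key to a reducer, then group values by key per reducer.
    partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for key, value in pairs:
        partitions[hash(key) % NUM_REDUCERS][key].append(value)
    return partitions

def reduce_step(partition):
    # Sum the values for each key (the running-total idea from above).
    return {key: sum(values) for key, values in partition.items()}

words = "war and peace and love and war and honor".split()
counts = {}
for partition in shuffle_step(map_step(words)):
    counts.update(reduce_step(partition))
print(counts)  # e.g. {'war': 2, 'peace': 1, 'love': 1, 'honor': 1}
```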
Some common uses of MapReduce: processing text; graphing and graph algorithms; data mining and machine learning.
My Understanding of the Python Program Provided
The Python program provided counts specific keywords in a book; the example used a set of four keywords.
I played around with the Python program, printed out some variables to see what they were producing, and it looks like the program “splits and distributes” the data (basically a list of words from a book) to the computers (actually, in my case, the Windows cores standing in for computers). It appears the “calculation” function is also “distributed” to the computers/cores, but I’m not 100% certain. Each computer/core creates a list of the requested keywords pulled from its share of the data; that list is just a jumbled list of keywords, and somehow those separate lists from each computer/core combine into a single master list. So, lists within a master list.
Remember, those lists are really just lists of keywords that were divvied up amongst the Windows cores. The program just pulled out the keywords, dumped them into each computer/core’s list, and those lists were then combined to form a master list.
That’s the Mapping/distributing piece.
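My guess at what that “splits and distributes” piece looks like, written as my own small multiprocessing sketch rather than the book’s program; the chunk size, keyword set, and sample text are all made up.

```python
# Sketch: split a word list into chunks and map each chunk on a
# separate core, producing lists-within-a-master-list.
from multiprocessing import Pool

KEYWORDS = {"love", "war", "peace", "honor"}  # made-up keywords

def map_chunk(chunk):
    # Each core pulls out only the keywords from its share of the data.
    return [w for w in chunk if w in KEYWORDS]

def split(words, n_chunks):
    # Divvy the word list up into roughly equal chunks.
    size = max(1, len(words) // n_chunks)
    return [words[i:i + size] for i in range(0, len(words), size)]

if __name__ == "__main__":  # required for multiprocessing on Windows
    words = "war and peace and love and war and honor and love".split()
    with Pool(processes=4) as pool:
        master_list = pool.map(map_chunk, split(words, 4))
    print(master_list)  # a jumbled list of keyword lists, one per chunk
```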
In the book example, the program was looking for a count of four keywords, and those keywords are still jumbled up inside the lists within the master list. They were just pulled and divvied up from the original data. We still need to count how many of each keyword were found in the book, and the current master list (or the separate lists within it) won’t get us there. We need to do some more work.
Here’s where the shuffling comes in: the data is rearranged into the shape you need. In this Python program for counting keywords, the lists within the master list are rearranged into separate per-keyword lists within the master list. Once those lists are “shuffled” into lists of keywords, the “reduce” piece comes into play. It looks like, from the Python program, the “reduce” action is distributed amongst the computers/cores (I’ll admit I’m fuzzy here) to perform the reduction activity, which is basically to sum up each element in a keyword’s list to get the count. So, let’s pretend one of the lists in the master list is for the keyword “Mary Davenport” (I’m making up names). The reduce function takes the first element in the list, which is “Mary Davenport”, and counts that as 1. Next, the program takes the second element in the list, which is also “Mary Davenport”, and adds its count of 1 to the first to make 2. The program stores that interim result of 2, moves on to the third element, “Mary Davenport”, and adds that count to make 3. The program continues until the list runs out of “Mary Davenport”, at which point it has the count for “Mary Davenport”.
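And here’s my own sketch of that shuffle-then-count idea (again not the book’s code; the second name is made up too): regroup the per-core lists into per-keyword lists, then run a running total over each one.

```python
# Sketch of the shuffle + reduce steps described above.
from collections import defaultdict
from functools import reduce

# Pretend output of the mapping step: one keyword list per core.
master_list = [["Mary Davenport", "John Smith"],
               ["Mary Davenport"],
               ["John Smith", "Mary Davenport"]]

def shuffle(master_list):
    # Rearrange lists-per-core into lists-per-keyword.
    grouped = defaultdict(list)
    for core_list in master_list:
        for keyword in core_list:
            grouped[keyword].append(1)  # each occurrence counts as 1
    return grouped

def count(ones):
    # Running total: 1, then 2, then 3, ... until the list runs out.
    return reduce(lambda total, one: total + one, ones, 0)

counts = {keyword: count(ones) for keyword, ones in shuffle(master_list).items()}
print(counts)  # {'Mary Davenport': 3, 'John Smith': 2}
```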
Wrap Up
That’s pretty much it for Chapter 13. Looking ahead in the book, data compression is next.