
Learning about Algorithms for Big Data

[Note: this post is part of a series of posts being written about what I’m learning in Algorithms for Dummies by John Paul Mueller and Luca Massaron. These posts are written more as notes than as instruction for readers. Thinking, taking down notes, and playing around with the Python programs are how I learn.]

The next chapter in Algorithms for Dummies pertains to big data. So far, in previous chapters, I’ve read about sorting and searching in general and various things to do with graphs: traversing graphs, finding centrality or importance in graphs, shortest distance in graphs, and PageRank to summarize off the top of my head.

Big data can be characterized by the four “V” characteristics: 1) immense volume of data; 2) the velocity or speed of the data generation; 3) the variety of data that goes beyond just text or numerals to include images and audio and maybe more; and 4) the veracity of the data, meaning not everything is true or high quality so you have a lot of noise or garbage floating around in the data that you have to discard.

An interesting comment made by the authors is that big data challenges the scientific method: one does not have to deploy the scientific method to make discoveries. No theories are needed in the big data world.

The other critical element about big data is that the value of data does not lie within the data itself, because data has become so prevalent that it’s commoditized; rather, the value comes from how you use the data, and that use is often driven by algorithms.

There are three noted ways of handling the large amount of fast-streaming data coming from all types of devices in the world: 1) storing it for use or analysis later; 2) summarizing only the most important data, which means making a decision about which data is truly important; and 3) consuming the data, I assume as it goes by (I think the algorithms consume only a sample of the passing stream, but I’m not sure yet).

These algorithms are called streaming algorithms, and because of the large volume and fast-moving nature of the data stream, they give an “approximate” or “good enough” answer. They are also simple and have low computational complexity. Because of the volume and velocity, you have to concentrate only on the things that matter. The following are the main algorithmic tools mentioned by the authors:

  • Sampling: instead of taking in the entire dataset, you sample the stream or you “observe within a data window”, much like the concept of sampling a population in statistics.
  • Hashing: here you “reduce infinite stream variety to a limited set of simple integer numbers”. Previously, I read about hashing used to efficiently store and retrieve data without having to sort it first. That hashing was done through a formula called a hash function that creates a number representing a position in a hash table.
  • Sketching: similar to sketching your environment, such as a park or city scene, sketching here means creating a summary of the streaming data or some measurement without all of the details. Sketching allows you to use simple storage such as the computer’s main memory or hard disk.

Those are the generalities about streaming algorithms. Now we get into using sampling to cull or measure data in a data stream; in other words, we’ll dive into the more complicated concepts.

SAMPLING

In the census, you try to get data from every person in the entire population, but in doing polls, you sample a much smaller subset of the population and make inferences about the larger population from the smaller sample. This sampling is done within “an acceptable margin of error”. That’s the sampling method from statistics. A very important part of this sampling is that each element in the population must have an equal chance of being chosen for the sample. The sample can’t be skewed towards certain selections in the population.

The same logic is being applied to streaming data, with some slight adjustments made to overcome the fact that the data is constantly streaming. The book talks about two kinds of sampling: reservoir sampling and window sampling. I understand reservoir sampling as randomly pulling a selection of data into a “reservoir” until some desired number of samples is reached. Once that number is reached, any new data that comes into the “reservoir”, again randomly, pushes out old data from the reservoir. Window sampling is more often used for time-based sampling, such as reporting on how many users request a page during a minute. Under window sampling, a queue is created: data goes into the queue, you count it, and then release it.
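Here is a minimal sketch of that windowed counting idea, using a queue over a made-up stream of timestamped page requests (my own illustration, not the book’s code; the 60-second window and the event format are assumptions):

```python
from collections import deque

def requests_in_last_minute(events, window_seconds=60):
    """For each (timestamp, page) event, report how many requests fall in the trailing window."""
    window = deque()                     # the queue holding recent timestamps
    for timestamp, page in events:
        window.append(timestamp)
        # Release data that has slid out of the time window.
        while window and window[0] <= timestamp - window_seconds:
            window.popleft()
        yield page, len(window)

# Hypothetical usage: (seconds, page) pairs arriving in time order.
events = [(1, "/home"), (20, "/about"), (50, "/home"), (75, "/home"), (130, "/contact")]
for page, count in requests_in_last_minute(events):
    print(page, count)
```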

“Windowing looks at samples using a sliding window – it shows the elements under the window, which represents a certain time slice or a certain segment of the stream. Reservoir sampling represents the entire stream scope instead by offering a manageable amount of data, which is a statistical sample of the stream.”

Algorithms for Dummies, John Paul Mueller and Luca Massaron, 2017, p. 236, hardcopy.

Based upon the Python algorithm given in the book, it looks like the algorithm first fills up a “reservoir” with data elements – the first “x” elements of the data stream go straight into the reservoir, where “x” is your chosen sample size (the data stream needs to be in a list datatype for this algorithm to work). Once the “reservoir” reaches your chosen sample size, each new element that “comes in” is assigned a random number by a random generator, drawn between zero and the count of elements seen so far, and that number is checked against the sample size. If the random number is less than the sample size, the new element goes into the “reservoir” and replaces the older element sitting at that random position in the reservoir.
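Here is a minimal sketch of reservoir sampling as I understand it (my own code, not the book’s listing); the stream and the sample size are made up:

```python
import random

def reservoir_sample(stream, sample_size):
    """Keep a uniform random sample of `sample_size` items from a stream."""
    reservoir = []
    for index, element in enumerate(stream):
        if index < sample_size:
            reservoir.append(element)          # fill the reservoir first
        else:
            # Draw a random position over everything seen so far; if it lands
            # inside the reservoir, the new element replaces the item there.
            position = random.randint(0, index)
            if position < sample_size:
                reservoir[position] = element
    return reservoir

print(reservoir_sample(range(1000), sample_size=10))
```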

This sampling method will not tell you whether you’ve seen a data element before, nor will it count how many of a particular data element there are (its frequency). You need either hashing or sketching for that.

HASHING

At first, I had problems figuring out what the authors were trying to say, but I think I got it now, in a very superficial way. The hash table that I first learned about, back when I was reading about storing and searching for specific data or information, was basically a table of stored information accessed and found via a mathematical function called a hash function. The hash function is first applied to the data, whatever it may be – text, numbers, images – and a specific number (a hash) comes back that tells where to store the information in the hash table. To search for that information later, you run it through the same formula to get the same position in the hash table.
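As a toy illustration of that idea (my own example, not the book’s code), Python’s built-in hash() can stand in for the hash function, with the table position taken modulo the table size:

```python
table_size = 11
table = [None] * table_size

def hash_position(key):
    """Map any hashable key to a slot index in the table."""
    return hash(key) % table_size

def store(key, value):
    # Ignoring collisions for simplicity; a real table would handle them.
    table[hash_position(key)] = (key, value)

def lookup(key):
    # Recompute the same hash position to find the stored value.
    slot = table[hash_position(key)]
    return slot[1] if slot and slot[0] == key else None

store("apple", 3)
print(lookup("apple"))   # 3
```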

The hashing for big data sort of works in a similar way. Hash functions are apparently a quick and efficient way of storing streams of data, but there is a limit to how much you can store – the size of your hash table, or the number of slots in it. With hash functions, you trade time and space for allowable errors (false positives, where the algorithm erroneously thinks it has already seen a data element before, due to collisions in the table).

Instead of hash tables though, we have Bloom filters. According to the authors, Bloom filters are a “frugal way of storing traces of data elements” without having to store them the way a hash table does. Bloom filters have two parts:

  • Bit vectors – a list of bits represented as 0 or 1. “The list is a long number of bits called m.” I take “m” to be the length of the bit vector, that is, how many bit positions it has. I’m not sure yet if the bit vector is a single vector or if there can be multiple vectors; at this time, I think it is a single bit vector.
  • A series of hash functions: I’m kind of vague here, but I think these are mathematical functions that map a data element to different positions in the bit vector.

A Bloom filter of a certain size is first created and then elements are added to the filter. I suspect the Bloom filter starts out with all zeroes and as you add elements, the zeros turn into ones due to the hash functions. The series of hash functions determines the “position” in the bit vector. Using the example given in the book, here’s how I understand it starting from all zeros to adding elements X and Y:

[The book illustrates this with three bit-vector diagrams: the start position with all zeroes, the vector after adding element X, and the vector after adding element Y.]

The example seems to suggest three hash functions giving 3 different position values in the bit vector for each data element. You’ll notice that X and Y both have a one in the same position 7 in the bit vector in this example. This is the ‘collision’ which could lead to a false positive.

In this hashing method, to search for an element, you check that element’s hash positions for zeroes in the bit vector. If, after crunching the hash functions for a particular data element, you find that one of its positions in the bit vector is still zero, then that data element has not been encountered yet. In the book’s example of searching for Z, position 2 is found to be zero in the Bloom filter, so it is presumed that element Z has not been encountered yet.

There is a Python algorithm in the book for this Bloom filter where you can add to the filter and search the filter. The authors briefly mention that crawlers use these Bloom filters to hash the contents of a web page and then check for changes in the web page later on.
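The book’s listing isn’t reproduced here, but a minimal Bloom filter along the lines described above might look like this (the bit-vector size, the number of hash functions, and the trick of deriving the hash functions by salting MD5 are my own assumptions):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits            # the bit vector, all zeroes at the start

    def _positions(self, element):
        # Derive several "hash functions" by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, element):
        for pos in self._positions(element):
            self.bits[pos] = 1                # flip this element's positions to 1

    def might_contain(self, element):
        # A zero at any computed position means the element was never added;
        # all ones means "probably seen" (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(element))

bf = BloomFilter()
bf.add("X")
bf.add("Y")
print(bf.might_contain("X"), bf.might_contain("Z"))   # True, (most likely) False
```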

There is a short discussion on how to figure out the size of the bit vector and the ideal number of hash functions to use in order to reduce the probability of collision, but I’m not going to go into that section. I can always go back to that page if I need to use Bloom filters and need to figure out exactly how to use them.

Okay, my understanding of the remainder of the chapter, which is sketching, gets fuzzier because there are no Python programs for me to explore.

SKETCHING

Bloom filters can’t count how many distinct elements there are in a stream (you might want to track how many distinct users visit a site, or know the number of distinct search engine queries). To do that, you do what they call a numeric sketch, which still uses some form of hash function.

Sketching is an approximation and the algorithm mentioned in the book to do this sketching is HyperLogLog. No Python program of HyperLogLog was provided in the book; the authors provided only a top-level description of the concept of HyperLogLog.

Generally, every element in a stream is first put through a hash function and converted to a number. Then this number is written in binary (base 2), a string of zeros and ones. The algorithm counts the number of leading zeros in each hashed element and tracks only the maximum number of zeros seen so far (call it “n”). Finally, the algorithm estimates the number of distinct elements as 2^n, where “n” is the aforementioned maximum number of leading zeros. This works best when you perform the estimate many times, using different hash functions, and average the answers; however, that takes time. As an alternative, the authors discuss splitting the stream into groups and tracking the maximum number of leading zeros for each group. In the end, you compute the estimated number of distinct elements in each group and then take an arithmetic average of all of the estimates (this is called stochastic averaging).
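The book doesn’t give code for this, but here is a rough toy version of just the leading-zeros idea (closer to the simple Flajolet–Martin counting trick than to full HyperLogLog; the 32-bit MD5-based hash is my own assumption):

```python
import hashlib

def hash32(element):
    """Hash an element down to a 32-bit integer."""
    return int(hashlib.md5(str(element).encode()).hexdigest(), 16) & 0xFFFFFFFF

def leading_zeros(value, bits=32):
    # Zeros before the first 1 when the value is written as a fixed-width binary number.
    return bits - value.bit_length()

def rough_distinct_count(stream):
    """Estimate the number of distinct elements as 2**(maximum leading zeros seen)."""
    max_zeros = 0
    for element in stream:
        max_zeros = max(max_zeros, leading_zeros(hash32(element)))
    return 2 ** max_zeros

# A single estimate is very noisy; averaging over several hash functions
# (or over groups, as described above) is what makes it usable.
print(rough_distinct_count(f"user{i}" for i in range(5000)))
```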

Finally, there is yet another sketching algorithm, the Count-Min Sketch, which is used to count objects in a stream. This algorithm is often used for finding the most frequent items or ranking usual or unusual events. Examples provided in the book are finding the most frequent queries in a search engine, bestselling items from an online retailer, the most popular pages on a website, or the most volatile stocks.

As best as I understand the algorithm, the Count-Min Sketch uses multiple vectors of counters (not just single bits), and each vector has its own hash function. You start out by initializing all of the counters to zero. Then, when an object comes in, you apply each vector’s hash function to the object to get a numeric address in that vector and increment (+1) the counter at that address. This is the counting as the stream goes by. To retrieve the count of a specific element, you apply the hash functions to that element, look up the counter at each resulting address, and take the smallest number as your count. You take the smallest because collisions can only inflate the counts.
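Here is a minimal sketch of that counting-and-querying process (the table dimensions and the salted-MD5 hash functions are my own assumptions, not the book’s code):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=50, depth=4):
        self.width = width                                 # counters per row
        self.depth = depth                                 # one row per hash function
        self.counts = [[0] * width for _ in range(depth)]

    def _position(self, row, element):
        digest = hashlib.md5(f"{row}:{element}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, element):
        # Counting as the stream goes by: bump one counter in every row.
        for row in range(self.depth):
            self.counts[row][self._position(row, element)] += 1

    def estimate(self, element):
        # Collisions can only inflate a counter, so the smallest value is safest.
        return min(self.counts[row][self._position(row, element)]
                   for row in range(self.depth))

cms = CountMinSketch()
for query in ["cats", "dogs", "cats", "cats", "news"]:
    cms.add(query)
print(cms.estimate("cats"))   # 3 (or slightly more if collisions occur)
```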

WRAP UP

So, I reached the end of the chapter and I’m ready to wrap up this post. Perusing the next chapter, I see that it will be a discussion about how to store all of the streaming data that is passing by. This might get into how Google and other large companies capture and store data that is streaming. The notes will probably be done in the next two weeks.
