Big Data Compression Algorithms
[Note: this post is part of a series of posts about what I’m learning in Algorithms for Dummies by John Paul Mueller and Luca Massaron. These posts are written more as notes for myself than as instruction for readers. Thinking, taking down notes, and playing around with the Python programs are how I learn.]
Now I move to compressing data, still part of the big data series in Algorithms for Dummies by John Paul Mueller and Luca Massaron. I’m in Chapter 14. Even though storage technologies have become much cheaper, we still use data compression, especially for mobile devices and for instances where data growth is greater than storage space.
Computer data is made up of bits (zeroes and ones). Our normal counting system is the decimal system; computers use the binary system. Computers store our language and numbers in bits by encoding each character as a sequence of zeroes and ones. This reminds me of cryptography, where you convert the real message into an encrypted one by replacing the real characters with substitute characters.
You may have seen the terms ASCII or UTF-8. These are character encoding systems. Under ASCII, the letter A is 65 and a is 97. In Excel, you can get the ASCII code by using CODE(“character”). To go from ASCII back to the character (65 to A), use CHAR(“ASCII code”). Unicode (UTF-8) is the default encoding in Python 3.
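Here’s a quick way to poke at character codes in Python (my own little sketch, not from the book): ord() and chr() do what Excel’s CODE and CHAR do, and .encode() shows the actual bytes.

# My own quick check of character codes in Python (not from the book).
print(ord("A"))    # 65  -- like Excel's CODE("A")
print(ord("a"))    # 97
print(chr(65))     # 'A' -- like Excel's CHAR(65)

# Python 3 strings are Unicode; encoding to UTF-8 shows the raw bytes.
print("A".encode("utf-8"))   # b'A' (one byte, same value as ASCII)
print("é".encode("utf-8"))   # b'\xc3\xa9' (two bytes)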
Some ingenious compressions take advantage of the fact that not all letters are used, or that some letters are used more frequently than others, so a fixed-size encoding may be needlessly taking up space and can be turned into a variable-size encoding (a frequent character might be encoded in, say, 3 bits while a rare one takes 10 bits, just to throw out my understanding).
Images and sounds work the same way. There is also an additional wrinkle: the human eye or ear cannot perceive certain details, so those details can be removed without a noticeable loss of quality. These compressions are lossy compressions. A key thing to note: removing details means you can’t reconstitute the original data. Lossy compression for screens will usually be okay, but for print you may compromise quality, since printers print at 300 dpi whereas screens display at 72 or 96 dpi.
Examples of lossy compressions: JPEG, DjVu, MPEG, MP3, WMA.
Four Types of Compression Strategies
Shrinking character encoding: If you only use some of the character set, then you can shrink the number of bits used for encoding. This method works well with DNA-type data. With DNA, there are only 4 characters, so the encoding shrinks dramatically: those 4 characters can be encoded in 2 bits each instead of 8, since 2 bits are enough to distinguish 4 values. (A small sketch of this appears right after this list of strategies.)
Shrinking long sequences of identical bits: Here, if you have long runs of zeroes or ones, you can replace each run with a single zero (or one) along with the number of times to repeat that zero (or one). This method works well with images and DNA data.
Leveraging statistics: For characters that show up more frequently, you can code those characters with fewer bits. At first I couldn’t see how this differs from the first method; my guess is that the first method still uses a fixed (just smaller) number of bits per character, while this one varies the code length per character based on frequency. The authors mention the Huffman coding algorithm, which uses this method, but did not elaborate on it much other than to say that Chapter 15 will discuss the algorithm further.
Encoding frequent long sequences of characters more efficiently: This one is similar to shrinking long sequences of identical bits, but instead of bits (ones or zeroes) we’re talking about sequences of characters. The LZW algorithm works this way, learning about the data on the fly and encoding longer and longer sequences of characters. The book provides this algorithm to play with.
The authors do stress the importance of choosing your encoding because certain ones work better with certain types of data.
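To see the first strategy (shrinking character encoding) in action, here’s a rough sketch I wrote myself (not the book’s code) that packs DNA letters into 2 bits each instead of the 8 bits ASCII would use.

# Rough sketch (mine, not the book's): pack a DNA string into 2 bits per letter
# instead of the 8 bits per character that ASCII would use.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_dna(seq):
    """Return the sequence as a string of 2-bit codes (kept as text for readability)."""
    return "".join(format(CODES[base], "02b") for base in seq)

dna = "GATTACA"
packed = pack_dna(dna)
print(packed)                                                # 10001111000100
print(len(packed), "bits vs", len(dna) * 8, "bits in ASCII")  # 14 bits vs 56 bits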
Algorithms
1. Run-length encoding (RLE): works best with data with “many long repetitions”. This is the shrinking of long sequences of zeroes and ones. So, say you have a string of zeroes and ones (after the initial encoding of the original set of characters) such that the first 17 bits are zeroes, the next 16 bits are ones, the next 5 are zeroes, and the remaining 10 are ones (17, 16, 5, 10). Take those counts and convert them into binary to get your compressed encoding.
Example of some set of characters being encoded into zeroes and ones bits:
00000000 00000000 01111111 11111111 10000011 11111111
The counts are 17, 16, 5, and 10.
Converting those counts to binary gives you 00010001 00010000 00000101 00001010, a much shorter string (32 bits instead of the original 48). The Excel function =DEC2BIN will do this trick.
Here’s where things get a bit mumbo jumbo for me… Apparently, this method caps each count at 255; once a run reaches 255, you insert a zero count (for the opposite bit) and start counting again. The book provides some “rules” for RLE but I’m fuzzy on them. I think these rules exist so that the algorithm can decode (decompress) the encoded data unambiguously. I think if the data actually starts with a 1 rather than a zero, then the counts start off with a zero (in place of the 17) and then you count the ones (that count takes the place of the 16).
This type of encoding is supposed to work well for DNA compression and JPEG/MPEG compression.
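Here’s a little run-length encoder I sketched to check the worked example above (my own version, not the book’s; it follows the convention that the first count refers to zeroes, and it ignores the 255-count rule).

# A minimal run-length encoder for a bit string (my sketch, not the book's code).
# Convention: the first count always refers to zeroes; the 255 cap is ignored here.
from itertools import groupby

def rle_counts(bits):
    """Return the lengths of consecutive runs of identical bits."""
    counts = [len(list(group)) for _, group in groupby(bits)]
    if bits.startswith("1"):
        counts.insert(0, 0)   # data starts with a 1, so the zero-run count is 0
    return counts

bits = "0" * 17 + "1" * 16 + "0" * 5 + "1" * 10    # the 48-bit example above
counts = rle_counts(bits)
print(counts)                                      # [17, 16, 5, 10]
print(" ".join(format(c, "08b") for c in counts))  # 00010001 00010000 00000101 00001010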
2. Huffman compression: the keys to Huffman coding are: a) the more frequent characters are encoded with shorter sequences of bits; b) those bit sequences are unique in the sense that no code is the beginning (prefix) of a longer code, so the decoding never gets confused; and c) the various bit sequences are managed using “Huffman” trees (binary trees built with the help of a heap).
Huffman encoding is an iterative process that deploys a greedy strategy (making the locally optimal choice at each stage). Encoding works from the leaves (the bottom of the tree) while decoding works from the root (the top). Again, less frequent characters are encoded with longer binary sequences. To decompress, you need both the compressed binary sequence and the Huffman tree.
Huffman encoding works best on larger data files.
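To make this concrete for myself, here’s a small Huffman coder I put together with Python’s heapq (my own sketch, not the book’s Chapter 15 code). It builds the code table: frequent characters end up with shorter bit sequences, and no code is a prefix of another.

# A small Huffman coder I sketched with heapq (not the book's implementation).
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a {character: bit string} table; frequent characters get shorter codes."""
    # Start with one leaf per character, weighted by its frequency.
    heap = [[freq, [[char, ""]]] for char, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # greedy step: merge the two lightest nodes
        hi = heapq.heappop(heap)
        for pair in lo[1]:
            pair[1] = "0" + pair[1]    # one branch prepends a 0 to its codes
        for pair in hi[1]:
            pair[1] = "1" + pair[1]    # the other branch prepends a 1
        heapq.heappush(heap, [lo[0] + hi[0], lo[1] + hi[1]])
    return {char: code for char, code in heap[0][1]}

text = "abracadabra"
codes = huffman_codes(text)
print(codes)                             # 'a' (most frequent) gets the shortest code
print("".join(codes[c] for c in text))   # the compressed bit string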
3. LZW algorithm: While Huffman works by assigning shorter binary sequences to the most frequently used characters, the LZW algorithm assigns shorter binary sequences to the most frequently used character sequences. To re-phrase, instead of single letters, now it’s the most frequent words (or other repeated strings of characters). Huffman encoding cannot be used on streaming data because of its use of trees (the tree requires you to have the entire dataset upfront in order to build the mapping); LZW, on the other hand, can be used on streaming data.
LZW has been used in UNIX compression, the GIF image format, and for compressing documents and book text.
As best as I can tell, LZW starts off encoding single characters using a “symbol” table (usually ASCII). Then, as more data comes in, LZW adds entries to the table for longer sequences as it encounters them. It sounds like LZW builds on top of the original table. Because the algorithm “learns”, it works best on larger datasets.
The algorithm does not need to store the table for decompression. Apparently, the decompressor can rebuild (“re-learn”) the same table as it works through the compressed codes. Although I perused and played with the example provided by the authors, I only have a superficial “feel” for what is going on.
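Here’s a bare-bones LZW compressor I sketched (mine, not the authors’ version) just to watch the table grow as longer sequences show up.

# A bare-bones LZW compressor (my sketch, not the authors' code).
def lzw_compress(text):
    """Return a list of integer codes; the table grows as new sequences appear."""
    table = {chr(i): i for i in range(256)}   # start with single characters (like ASCII)
    next_code = 256
    current = ""
    output = []
    for char in text:
        candidate = current + char
        if candidate in table:
            current = candidate            # keep extending a sequence we already know
        else:
            output.append(table[current])
            table[candidate] = next_code   # "learn" the new, longer sequence
            next_code += 1
            current = char
    if current:
        output.append(table[current])
    return output

print(lzw_compress("ABABABA"))   # [65, 66, 256, 258] -- 'AB' and 'ABA' got their own codes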
Wrap Up
I’m ready to wrap up my note-taking. This chapter ends the big data series. The next set of chapters deals with “challenging difficult problems”. Oh boy, as if the chapters I’ve already encountered weren’t challenging enough.
The very next chapter discusses greedy algorithms.