5.5   Data Compression


This section under major construction.


Data compression reduces the size of a file to save space when storing it and to save time when transmitting it. Moore's law: the number of transistors on a chip doubles every 18-24 months. Parkinson's law: data expands to fill the space available. Applies to text, images, sound, video, etc. Wikipedia provides public dumps of all its content for academic research and republishing; it uses bzip and SevenZip's LZMA, and compressing 300GB of data can take a week.

Ancient ideas.

Morse code, decimal number system, natural language, rotary phones (lower numbers were quicker to dial, so New York was 212 and Chicago 312).

Binary input and output streams.

We use BinaryStdIn.java, BinaryStdOut.java, BinaryDump.java, HexDump.java, and PictureDump.java.

Fixed length codes.

A fixed-length code needs ceil(lg R) bits to specify one of R symbols. Genome.java implements a fixed-length code for the 4-character DNA alphabet; it uses Alphabet.java.
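
As a concrete illustration (independent of Genome.java and Alphabet.java, with illustrative names), the sketch below packs the 4-symbol DNA alphabet into ceil(lg 4) = 2 bits per symbol:

    // Sketch: fixed-length 2-bit code for the DNA alphabet { A, C, G, T }.
    public class FixedLengthSketch {
        private static final String ALPHABET = "ACGT";

        // encode each character as its 2-bit index in the alphabet
        public static String encode(String genome) {
            StringBuilder bits = new StringBuilder();
            for (char c : genome.toCharArray()) {
                int index = ALPHABET.indexOf(c);                 // 0, 1, 2, or 3
                bits.append((index >> 1) & 1).append(index & 1);
            }
            return bits.toString();
        }

        // decode two bits at a time back to characters
        public static String decode(String bits) {
            StringBuilder genome = new StringBuilder();
            for (int i = 0; i < bits.length(); i += 2) {
                int index = 2 * (bits.charAt(i) - '0') + (bits.charAt(i + 1) - '0');
                genome.append(ALPHABET.charAt(index));
            }
            return genome.toString();
        }

        public static void main(String[] args) {
            String genome = "ATAGATGCATAG";
            String bits = encode(genome);
            System.out.println(bits.length() + " bits: " + bits);   // 24 bits instead of 96 (8-bit chars)
            System.out.println(decode(bits).equals(genome));        // true
        }
    }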

Run length encoding.

RunLength.java implements run-length encoding: it represents a bitstream by the lengths of its alternating runs of 0s and 1s.
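
The sketch below (illustrative names; character-based rather than the bit-stream version in RunLength.java) conveys the idea on a bitstring, recording the lengths of alternating runs and starting, by convention, with a run of 0s:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of run-length encoding over a bitstring: record the lengths of
    // alternating runs, by convention starting with a run of '0's.
    public class RunLengthSketch {
        public static List<Integer> encode(String bits) {
            List<Integer> runs = new ArrayList<>();
            char current = '0';
            int length = 0;
            for (char bit : bits.toCharArray()) {
                if (bit == current) length++;
                else {                          // run ended: emit its length, start a new run
                    runs.add(length);
                    current = bit;
                    length = 1;
                }
            }
            runs.add(length);                   // emit the final run
            return runs;
        }

        public static String decode(List<Integer> runs) {
            StringBuilder bits = new StringBuilder();
            char current = '0';
            for (int run : runs) {
                for (int i = 0; i < run; i++) bits.append(current);
                current = (current == '0') ? '1' : '0';
            }
            return bits.toString();
        }

        public static void main(String[] args) {
            String bits = "0000000000000001111111000000011111111111";
            System.out.println(encode(bits));                       // [15, 7, 7, 11]
            System.out.println(decode(encode(bits)).equals(bits));  // true
        }
    }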

Variable-length codes.

We desire uniquely decodable codes. One way to achieve this is to append a special stop symbol to each codeword. A better approach is to use prefix-free codes: codes in which no codeword is a prefix of another. For example, { 01, 10, 0010, 1111 } is prefix-free, but { 01, 10, 0010, 1010 } is not because 10 is a prefix of 1010.
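
A quick way to check the prefix-free property is to sort the codewords, since any prefix relationship then appears between adjacent codewords; the sketch below (illustrative names; see also the prefix-free codes exercise) does exactly that:

    import java.util.Arrays;

    // Sketch: a set of distinct codewords is prefix-free iff, after sorting them
    // lexicographically, no codeword is a prefix of the codeword that follows it.
    public class PrefixFreeCheck {
        public static boolean isPrefixFree(String[] codewords) {
            String[] sorted = codewords.clone();
            Arrays.sort(sorted);
            for (int i = 0; i + 1 < sorted.length; i++)
                if (sorted[i + 1].startsWith(sorted[i])) return false;
            return true;
        }

        public static void main(String[] args) {
            System.out.println(isPrefixFree(new String[] { "01", "10", "0010", "1111" }));  // true
            System.out.println(isPrefixFree(new String[] { "01", "10", "0010", "1010" }));  // false: 10 is a prefix of 1010
        }
    }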

Give fax machine example.

Huffman codes.

A specific way to construct an optimal prefix-free code, invented by David Huffman while a student at MIT in the early 1950s. Huffman.java implements the Huffman algorithm.
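
For orientation, here is a minimal sketch of the construction (illustrative names, not the booksite's Huffman.java, which also serializes the trie and the bitstream): repeatedly merge the two tries with the smallest total frequency.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    // Sketch of Huffman's algorithm: build the trie bottom-up with a priority queue,
    // then read off codewords (left = 0, right = 1). Assumes at least two distinct symbols.
    public class HuffmanSketch {
        private static class Node implements Comparable<Node> {
            final char symbol;       // meaningful only in leaves
            final int freq;
            final Node left, right;
            Node(char symbol, int freq, Node left, Node right) {
                this.symbol = symbol; this.freq = freq; this.left = left; this.right = right;
            }
            boolean isLeaf() { return left == null && right == null; }
            public int compareTo(Node that) { return Integer.compare(this.freq, that.freq); }
        }

        public static Map<Character, String> buildCode(Map<Character, Integer> freq) {
            PriorityQueue<Node> pq = new PriorityQueue<>();
            for (Map.Entry<Character, Integer> e : freq.entrySet())
                pq.add(new Node(e.getKey(), e.getValue(), null, null));
            while (pq.size() > 1) {
                Node x = pq.poll(), y = pq.poll();                 // two smallest frequencies
                pq.add(new Node('\0', x.freq + y.freq, x, y));     // merged internal node
            }
            Map<Character, String> code = new HashMap<>();
            collect(pq.poll(), "", code);
            return code;
        }

        private static void collect(Node node, String prefix, Map<Character, String> code) {
            if (node.isLeaf()) { code.put(node.symbol, prefix); return; }
            collect(node.left,  prefix + '0', code);
            collect(node.right, prefix + '1', code);
        }

        public static void main(String[] args) {
            Map<Character, Integer> freq = new HashMap<>();
            for (char c : "ABRACADABRA".toCharArray()) freq.merge(c, 1, Integer::sum);
            // A (frequency 5) always gets a 1-bit codeword; the codewords for
            // B, C, D, R depend on how ties are broken in the priority queue.
            System.out.println(buildCode(freq));
        }
    }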

Property A. No prefix-free code uses fewer bits than the Huffman code.

LZW compression.

Using the longest-prefix-match code from TST.java, LZW.java implements LZW compression.
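
A minimal sketch of the encoding step (illustrative names; a HashMap stands in for the TST that LZW.java uses for longest-prefix match, and integer codes stand in for 12-bit codewords; assumes 8-bit input characters):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of LZW encoding: emit the code for the longest dictionary string that
    // matches the remaining input, then add that string plus one more character.
    public class LZWSketch {
        private static final int R = 256;    // codes 0..255 are reserved for single characters

        public static List<Integer> encode(String input) {
            Map<String, Integer> dictionary = new HashMap<>();
            for (int i = 0; i < R; i++) dictionary.put("" + (char) i, i);
            int nextCode = R;

            List<Integer> output = new ArrayList<>();
            int i = 0;
            while (i < input.length()) {
                // longest dictionary string that is a prefix of the unscanned input
                String match = "" + input.charAt(i);
                while (i + match.length() < input.length()
                        && dictionary.containsKey(match + input.charAt(i + match.length())))
                    match = match + input.charAt(i + match.length());
                output.add(dictionary.get(match));
                // add match + next input character to the dictionary
                if (i + match.length() < input.length())
                    dictionary.put(match + input.charAt(i + match.length()), nextCode++);
                i += match.length();
            }
            return output;
        }

        public static void main(String[] args) {
            System.out.println(encode("ABABABA"));   // [65, 66, 256, 258]
        }
    }

The decoder rebuilds the same dictionary from the code stream; the tricky cScSc case is the subject of an exercise below.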

Real world: Pkzip = LZW + Shannon-Fano, GIF, TIFF, V.42bis modem, Unix compress. Practical issues: limit the number of elements in the symbol table.

Summary.

Huffman: variable-length code for fixed-length symbols. LZW: fixed-length code for variable-length strings.

Universal compression algorithm.

Impossible to compress all files (proof by simple counting argument). Intuitive argument: compress the life's work of Shakespeare, then compress the result, then compress the result again. If each file strictly shrinks, eventually you will be left with a single bit.

References.

Guy Blelloch of CMU has an excellent chapter on data compression.

Error correction / detection.

Suppose the channel for sending information is noisy and each bit gets flipped with probability p. Send each bit 3 times; to decode, take the majority of the 3 received bits. The decoded bit is incorrect with probability 3p^2 - 2p^3, which is less than p (if p < 1/2); for example, if p = 0.1, the probability of decoding a bit incorrectly drops to 0.028. We can further reduce the probability of decoding a bit incorrectly by sending each bit k times, but this is wasteful in terms of the transmission rate.
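
A small sketch (illustrative names) of majority-of-3 decoding and of the resulting error probability for a few values of p:

    // Sketch: probability that majority-of-3 decoding is wrong, versus the raw
    // bit-error probability p. An error requires 2 or 3 flips: 3p^2(1-p) + p^3 = 3p^2 - 2p^3.
    public class RepetitionCode {
        public static int decode(int b1, int b2, int b3) {
            return (b1 + b2 + b3 >= 2) ? 1 : 0;     // majority vote of the three received bits
        }

        public static void main(String[] args) {
            for (double p : new double[] { 0.01, 0.05, 0.1, 0.25 }) {
                double error = 3 * p * p - 2 * p * p * p;
                System.out.printf("p = %.2f  decoded-bit error = %.6f%n", p, error);
            }
        }
    }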

Reed-Solomon codes.

Reference. Used in mass storage systems (CDs and DVDs) and satellite transmissions (Voyager space probe, Mars Pathfinder) when the errors are bursty. Think of the data to send as a degree-d polynomial. Only d+1 points are needed to uniquely specify the polynomial; send more points to enable error correction and detection. If the message we want to send is a0, a1, ..., a(m-1) (each element of a finite field K), think of it as the polynomial p(x) = a0 + a1 x + ... + a(m-1) x^(m-1). Send p(0), p(b), p(b^2), ..., where b is a generator of the multiplicative cyclic group of K.
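
As a toy illustration only: the sketch below (illustrative names) performs the encoding step over the small prime field Z_17 with generator 3, evaluating the message polynomial at the points 1, b, b^2, ...; these particular choices, and the use of a prime field rather than the GF(2^8) arithmetic used in practice, are assumptions made for readability.

    // Toy illustration of the Reed-Solomon encoding step over the prime field Z_17.
    public class ReedSolomonSketch {
        private static final int P = 17;    // field size (prime); an assumption for this sketch
        private static final int G = 3;     // 3 generates the multiplicative group of Z_17

        // evaluate p(x) = a[0] + a[1] x + ... + a[m-1] x^(m-1) mod P by Horner's rule
        private static int eval(int[] a, int x) {
            int result = 0;
            for (int i = a.length - 1; i >= 0; i--)
                result = (result * x + a[i]) % P;
            return result;
        }

        // send evaluations at 1, G, G^2, ..., G^(n-1); any m of them determine the polynomial
        public static int[] encode(int[] message, int n) {
            int[] codeword = new int[n];
            int x = 1;
            for (int j = 0; j < n; j++) {
                codeword[j] = eval(message, x);
                x = (x * G) % P;
            }
            return codeword;
        }

        public static void main(String[] args) {
            int[] message = { 2, 5, 11 };            // 3 coefficients: a degree-2 polynomial
            int[] codeword = encode(message, 6);     // 6 points; the redundancy enables correction/detection
            System.out.println(java.util.Arrays.toString(codeword));
        }
    }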

Shannon's coding theorem.

Roughly speaking, if channel capacity is C, then we can send bits at a rate slightly less than C with an encoding scheme that will reduce probability of a decoding error to any desired level. Proof is nonconstructive.

Q+A

Exercises

  1. Which of the following codes are prefix free? Uniquely decodable? For those that are uniquely decodable, give the encoding of 1000000000000.
        code 1    code 2    code 3   code 4
    A     0         0          1       1
    B     100       1         01      01
    C     10       00        001     001
    D     11       11       0001     000
    
  2. Give an example of a uniquely decodable code that is not prefix-free.

    Solution. Any suffix-free code is uniquely decodable, e.g., { 0, 01 }.

  3. Give an example of a uniquely decodable code that is neither prefix-free nor suffix-free.

    Solution. { 0011, 011, 11, 1110 } or { 01, 10, 011, 110 }.

  4. Are { 1, 100000, 00 }, { 01, 1001, 1011, 111, 1110 }, and { 1, 011, 01110, 1110, 10011 } uniquely decodable? If not, find a string with two encodings.

    Solution. The first set of codewords is uniquely decodable. The second set is not uniquely decodable because 111-01-1110-01 and 1110-111-1001 are two decodings of 11101111001. The third set is not uniquely decodable because 01110-1110-011 and 011-1-011-10011 are two decodings of 011101110011.

  5. Test for unique decodability. Implement the Sardinas-Patterson algorithm for testing whether a set of codewords is uniquely decodable: Add all of the codewords to a list. Examine all pairs of codewords to see if any one is a prefix of another; if so, extract the dangling suffix (i.e., the part of the longer string that remains after removing the shorter one). If the dangling suffix is a codeword, then the code is not uniquely decodable; otherwise, add the dangling suffix to the list (provided it is not already there). Repeat this process with the larger list until there are no remaining new dangling suffixes. (A Java sketch of this procedure appears after the exercises.)

    The algorithm is finite because all dangling suffixes added to the list are suffixes of a finite set of codewords, and a dangling suffix can be added at most once.

    • { 0, 01, 11 }. The codeword 0 is a prefix of 01, so add the dangling suffix 1 to the list: { 0, 01, 11, 1 }. Now 0 is a prefix of 01 and 1 is a prefix of 11, but in both cases the dangling suffix 1 is already in the list. There are no other dangling suffixes, so conclude that the code is uniquely decodable.
    • { 0, 01, 10 }. The codeword 0 is a prefix of 01, so add the dangling suffix 1 to the list: { 0, 01, 10, 1 }. Now 1 is a prefix of 10, and the dangling suffix 0 is a codeword. So, conclude that the code is not uniquely decodable.
  6. Kraft-McMillan inequality. Consider a code C with N codewords of lengths n1, n2, ..., nN. Prove that if the code is uniquely decodable, then K(C) = sum_i = 1 to N 2^(-ni) ≤ 1.
  7. Kraft-McMillan construction. Suppose that we have a set of integers n1, n2, ..., nN that satisfy the inequality sum_i = 1 to N 2^(-ni) ≤ 1. Prove that it is always possible to find a prefix-free code with codeword lengths n1, n2, ..., nN. Thus, by restricting attention to prefix-free codes (instead of uniquely decodable codes), we do not lose much.
  8. Kraft-McMillan equality for optimal prefix-free codes. Prove that if C is an optimal prefix-free code then the Kraft-McMillan inequality is an equality: K(C) = sum_i = 1 to N 2^(-ni) = 1.
  9. Suppose that all of the symbol probabilities are negative powers of 2. Describe the Huffman code.
  10. Suppose that all of the symbol frequencies are equal. Describe the Huffman code.
  11. Find a Huffman code where the length of a symbol with probability pi is greater than ceil(-lg pi).

    Solution. .01 (000), .30 (001), .34 (01), .35 (1). The codeword 001 has length greater than ceil(-lg .30).

  12. True or false. Any optimal prefix-free code can be obtained via Huffman's algorithm.

    Solution. False. Consider the following set of symbols and frequencies (A 26, B 24, C 14, D 13, E 12, F 11).

           C1   C2   C3
    A 26   01   10   00
    B 24   10   01   01
    C 14  000  111  100
    D 13  001  110  101
    E 12  110  001  110
    F 11  111  000  111
    
    In any Huffman code, the codings for A and B must begin with different bits, but the code C3 does not have this property (yet it is an optimal prefix-free code).
  13. What is the LZW encoding of the following inputs?
    • T O B E O R N O T T O B E
    • Y A B B A D A B B A D A B B A D O O
    • A A A A A A A A A A A A A A A A A A A A A
  14. Characterize the tricky situation in LZW coding.

    Solution. Whenever the decoder encounters cScSc, where c is a symbol and S is a string, and cS is in the dictionary but cSc is not.

  15. As a function of N, how many bits are needed to encode N copies of the symbol A? N copies of the sequence ABC?
  16. Let F(i) be the ith Fibonacci number. Consider N symbols, where the ith symbol has frequency F(i). Note that F(1) + F(2) + ... + F(N) = F(N+2) - 1. Describe the Huffman code.

    Solution. Longest codeword has length N-1.

  17. Show that there are at least 2^(N-1) different Huffman codes corresponding to a given set of N symbols.

    Solution. There are N-1 internal nodes and each one has an arbitrary choice to assign its left and right children.

  18. Give a Huffman code where the frequency of 0s in the output is much much higher than the frequency of 1s.

    Solution. If the character 'A' occurs one million times and the character 'B' occurs once, the codeword for 'A' will be 0 and the codeword for 'B' will be 1.

  19. Prove the following facts about Huffman tries.

    • The two longest codewords have the same length.

    • If the frequency of symbol i is strictly larger than the frequency of symbol j, then the length of the codeword for symbol i is less than or equal to the length of the codeword for symbol j.
  20. Describe how to transmit a Huffman code (or optimal prefix-free code) on a set of symbols { 0, 1, ..., N-1 } using 2N - 1 + N ceil(lg N) bits.

    Hint: use 2N-1 bits to specify the structure of the corresponding trie.

  21. Suppose that in an extended ASCII file (8-bit characters), the maximum character frequency is at most twice the minimum character frequency. Prove that a fixed-length 8-bit extended ASCII code is optimal.
  22. Shannon-Fano coding. Prove that the following top-down version of Huffman's algorithm is not optimal. Split the set of codewords C into two subsets C1 and C2 with (almost) equal frequencies. Recursively build the tree for C1 and C2, starting all codewords for C1 with 0 and all codewords for C2 with 1. To implement the first step, Shannon and Fano propose sorting the codewords by frequency and breaking the set up into two subarrays as best as possible.

    Solution. S 32, H 25, A 20, N 18, O 5.

  23. LZMW coding (Miller-Wegman 1985). LZ variant: search the input for the longest string already in the dictionary (the current match); add the concatenation of the previous match and the current match to the dictionary. Dictionary entries grow more rapidly. Can also delete low-frequency entries when the dictionary fills up. Hard to implement.
  24. LZAP coding. Similar to LZMW: instead of adding just the concatenation of the previous match with the current match, add the concatenation of the previous match with all prefixes of the current match. Easier than LZMW to implement, but even more dictionary entries.
  25. Identify an optimal code that is not prefix-free.

    Hint: only need 3 symbols with equal frequencies.

  26. Identify two optimal prefix-free codes for the same input that have a different distribution of codeword lengths.

    Hint: only need 4 symbols.

  27. Minimum variance Huffman coding. Due to the nondeterminism associated with tiebreaking, Huffman's algorithm may produce codes with different distributions of codeword lengths. When transmitting the compressed stream as it is being generated, it is desirable to transmit bits at a (near) constant rate. Find a Huffman code that minimizes sum_i p_i (l_i - l_average(T))^2.

    Solution. When combining tries, break ties by picking the earliest produced trie with the smallest probability.

  28. Two-queue algorithm for Huffman coding. Prove that the following algorithm computes a Huffman code (and runs in linear time if the input symbols are already sorted by frequency). Maintain two FIFO queues: the first queue contains the input symbols, in ascending order of frequency; the second queue contains the internal nodes with combined weights. As long as there is more than one node in the two queues, dequeue the two nodes with the smallest weights by examining the fronts of both queues. Create a new internal node (left and right children = the two dequeued nodes, weight = sum of their weights) and enqueue it on the second queue.

    To obtain a minimum variance Huffman code, break ties by choosing nodes from the first queue.

    Hint: prove that the second queue is sorted in ascending order of frequency.

  29. Sibling property. A binary tree has the sibling property if (i) every node (except the root) has a sibling and (ii) the binary tree can be listed in non-increasing order of probability such that, in the list, all siblings are adjacent. Prove that a binary tree represents a Huffman tree if and only if it has the sibling property.
  30. Relative coding. Instead of compressing each pixel in an image, consider the difference between a pixel and the previous one and encode the difference. Intuition: usually the pixels don't change much. Use with LZW over color table alphabet.
  31. Variable-width LZW codes. Increase the codeword width from p to p+1 bits after the 2^p-th codeword is inserted into the table. Used with color table alphabet.
  32. Adaptive Huffman coding. One-pass algorithm and don't need to send prefix-free code. Build Huffman tree based on frequency of characters read in so far. Update tree after reading in each character. Encoder and decoder need to coordinate on tie-breaking conventions.
  33. Shannon entropy. The entropy H of a discrete random variable X with possible values x1, ..., xN that occur with probability p1, ..., pN is defined as H(X) = -p1 lg p1 - p2 lg p2 - ... - pN lg pN, where 0 lg 0 = 0 is consistent with the limit.
    • What is the entropy of a fair coin?
    • What is the entropy of a coin where both sides are heads?
    • What is the entropy of a six-sided die?

      Solution. -lg (1/6) which is about 2.584962.

    • What is the entropy of the sum of two fair dice?
    • Given a random variable that takes on N values, which distribution maximizes the entropy?
    The entropy is a fundamental concept in information theory. Shannon's source coding theorem asserts that to compress the data from a stream of independent and identically distributed random variables requires at least H(X) bits per symbol in the limit. For example, to send the results of a sequence of fair die tosses requires at least 2.584962 bits per die toss.
  34. Empirical entropy. The empirical entropy of a piece of text is obtained by computing the frequency of occurrence of each symbol and using these as the probabilities for a discrete random variable. Compute the empirical entropy of your favorite novel. Compare it to the data compression rate achieved by a Huffman code.
  35. Shannon experiment. Perform the following experiment. Give a subject a sequence of k letters from a piece of text (or Leipzig corpus) and ask them to predict the next letter. Estimate the fraction of times the subject gets the answer right for k = 1, 2, 5, 100.
  36. True or false. Fixed-length codes are uniquely decodable.

    Solution. True, they are prefix free.

  37. Give two different Huffman trees for the string ABCCDD, with different heights.
  38. Prefix-free codes. Design an efficient algorithm to determine if a set of binary code words is prefix-free. Hint: use a binary trie or sort.
  39. Uniquely decodable code. Devise a uniquely decodable code that is not a prefix-free code. Hint: suffix-free codes are the reverses of prefix-free codes, so you can decode by reading the compressed message in reverse order. Not very convenient.
  40. Huffman tree. Modify Huffman.java so that the encoder prints out the lookup table instead of the preorder traversal, and modify the decoder so that it constructs the tree by reading in the lookup table.
  41. True or false. In an optimal prefix-free ternary code, the three symbols that occur least frequently have the same length.

    Solution. False.

  42. Ternary Huffman codes. Generalize the Huffman algorithm to codewords over the ternary alphabet (0, 1, and 2) instead of the binary alphabet. That is, given a bytestream, find a prefix-free ternary code that uses as few trits (0s, 1s, and 2s) as possible. Prove that it yields optimal prefix-free ternary code.

    Solution. Combine smallest 3 probabilities at each step (instead of smallest 2). This works when there are 3 + 2k symbols for some integer k. To reduce to this case, add 1 or 2 dummy symbols of probability 0. (Alternatively, combine fewer than 3 symbols in the first step if the number of symbols is not 3 + 2k.) Ex: { 0.1, 0.2, 0.2, 0.5 }.

  43. Nonbinary Huffman codes. Extend the Huffman algorithm to codewords over the m-ary alphabet (0, 1, 2, ..., m-1) instead of the binary alphabet.
  44. Consider the following 21-character message that consists of 3 a's, 7 c's, 6 t's, and 5 g's.
    a a c c c c a c t t g g g t t t t c c g g 
    
    Are the following 43 bits a possible Huffman encoding of the message above?
    0000001111000101010010010010101010111001001 
    
    Justify your answer as concisely and rigorously as possible.

    Solution. A Huffman encoding for a message produces an encoding that uses the fewest bits among any prefix free code. The 2-bit binary code a = 00, c = 01, g = 10, t = 11 is a prefix free code that uses 21 * 2 = 42 bits. Thus, a Huffman code would use fewer than 43 bits.

  45. A binary tree is full if every node that is not a leaf has two children. Prove that any binary tree corresponding to an optimal prefix-free code is full.

    Hint: if an internal node has only one child, replace that internal node with its unique child.

  46. Move-to-front coding (Bentley, Sleator, Tarjan, and Wei 1986). Write a program MoveToFront that implements move-to-front encoding and decoding. Maintain the alphabet of symbols in a list, where frequently occurring symbols are toward the front. A symbol is encoded as the number of symbols that precede it in the list. After encoding a symbol, move it to the front of the list.
  47. Move-ahead-k coding. Same as move-to-front coding, but move symbol k positions toward the front.
  48. Wait-c-and-move. Same as move-to-front coding, but move symbol to the front only after it has been encountered c times since the last time it was moved to the front.
  49. Double Huffman compression. Find an input for which applying the compress() method in Huffman.java twice leads to a strictly smaller output than applying compress() only once.
  50. Merging k sorted arrays. You have k sorted lists, of lengths n1, n2, ..., nk. Suppose that the only operation you can perform to combine lists is a 2-way merge: given one sorted array of length n1 and another sorted array of length n2, replace them with a sorted array of length n = n1 + n2. Moreover, the 2-way merge operation takes exactly n units of time. What is the optimal way to merge the k sorted arrays?

    Solution. Sort the list lengths so that n1 < n2 < ... < nk. Repeatedly take the two smallest lists and apply the 2-way merge operation. The proof of optimality is the same as the proof of optimality of Huffman codes: repeatedly applying 2-way merge operations induces a binary tree in which each leaf node corresponds to one of the original sorted lists and each internal node corresponds to a 2-way merge operation. The contribution of any original list to the overall cost is the length of the list multiplied by its tree depth (because that is the number of times its elements are involved in a 2-way merge).
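
The following is a minimal Java sketch of the Sardinas-Patterson procedure described in Exercise 5 (illustrative names; a plain fixed-point loop over a HashSet, with no attempt at efficiency):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the Sardinas-Patterson test from Exercise 5: grow a list of dangling
    // suffixes until no new ones appear; declare the code not uniquely decodable as
    // soon as a dangling suffix coincides with a codeword.
    public class SardinasPatterson {
        public static boolean isUniquelyDecodable(Set<String> codewords) {
            Set<String> list = new HashSet<>(codewords);
            boolean changed = true;
            while (changed) {
                changed = false;
                Set<String> newSuffixes = new HashSet<>();
                for (String shorter : list) {
                    for (String longer : list) {
                        if (longer.length() > shorter.length() && longer.startsWith(shorter)) {
                            String dangling = longer.substring(shorter.length());
                            if (codewords.contains(dangling)) return false;   // not uniquely decodable
                            if (!list.contains(dangling)) newSuffixes.add(dangling);
                        }
                    }
                }
                if (!newSuffixes.isEmpty()) { list.addAll(newSuffixes); changed = true; }
            }
            return true;    // no dangling suffix is a codeword
        }

        public static void main(String[] args) {
            // the two examples worked in Exercise 5
            System.out.println(isUniquelyDecodable(new HashSet<>(java.util.Arrays.asList("0", "01", "11"))));  // true
            System.out.println(isUniquelyDecodable(new HashSet<>(java.util.Arrays.asList("0", "01", "10"))));  // false
        }
    }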