Lempel-Ziv-Welch (LZW) decompression
The output stream, shown in the last column of the table, is the only “data” available to the decompression program. The LZW decoder must reconstruct the string table on its own, but, as the last column shows, it is always one code behind the compression algorithm. For example, after receiving ‘Y’, the decoder must first receive the second code, ‘E’, before it can enter the string “YE” into the string table with a code of 257. This process continues until all the codes are received. Thus, a situation can arise in which the decompression algorithm receives a code that it has not yet defined. The last transmitted string code in the table (code 270) demonstrates precisely this case in a decompression process.
The question mark (?) in the table marks a special case that arises when the decoder encounters a code (for outputting a string) that is not yet defined (in this case, code 270). This happens when the compression program has just defined a string and immediately sees that same string again, thereby transmitting its code right away. Because the decompression algorithm is always one code behind, it has not yet defined the code it just received and so apparently cannot translate this new code into the correct string. However, this situation occurs only when a string “PATTERN” (with “PATTERN” already defined previously) appears twice in a row in the data source and a character equal to the first letter of the string (‘P’) immediately follows the second occurrence. The decompression algorithm is specifically designed to handle this situation:
Figure 3: The LZW Decompression Algorithm
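The decoding rule can be sketched as follows. This is a minimal Python illustration in the spirit of Figure 3, not the figure's exact pseudocode; the function and variable names are ours:

```python
# Minimal LZW decoder sketch. Codes 0..255 stand for single characters;
# string codes are assigned starting at 256, exactly one per received code.
def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]        # the first code is always a single character
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Special case: the code is not yet defined. The unknown string
            # is always the previous string plus its first character.
            entry = prev + prev[0]
        else:
            raise ValueError("corrupt LZW stream")
        out.append(entry)
        # Define the next string: previous string + first char of this one.
        table[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)
```

For instance, the code stream [97, 98, 256, 258] (produced by compressing “abababa”) contains the not-yet-defined code 258, and the special-case rule recovers it as “aba”.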
In the source message example, the string segment “HUFYHUFYHUFYH” triggers an undefined code (notice the three overlapping “HUFYH” strings). This happens when the string “HUFY” is first defined as code 268 (the search then restarts at ‘Y’, defining the string “YH” as code 269), and its code is transmitted when the compression algorithm encounters the second “HUFYH” string. The algorithm then defines code 270 for the pattern “HUFYH” and restarts the search at the last ‘H’ of this newly defined string (the second “HUFYH”). Thus, it immediately encounters the third “HUFYH” string and transmits its newly defined code, 270.
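The compression-side behavior just described can be illustrated with a companion sketch (our own Python illustration, not the article's code). On a repeating input, the encoder ends up emitting a code it has only just defined, which is exactly what forces the decoder's special case:

```python
# Minimal LZW compressor sketch: greedily extend the current string w;
# when w + ch is unknown, emit the code for w and define w + ch.
def lzw_encode(data):
    table = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""
    out = []
    for ch in data:
        wc = w + ch
        if wc in table:
            w = wc                      # keep extending the match
        else:
            out.append(table[w])        # emit code for longest known string
            table[wc] = next_code       # define the new string
            next_code += 1
            w = ch                      # restart the search at ch
    if w:
        out.append(table[w])            # flush the final match
    return out
```

Compressing “abababa” yields [97, 98, 256, 258]: the final code, 258 (“aba”), was defined only moments before being emitted, so the decoder receives it before its own table contains it.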
Complex Decoding. As shown in the table, when the decompression algorithm receives code 268, it must first define code 269. Immediately after defining code 269, however, it receives code 270, which it has not yet entered into the string table. The decompression algorithm handles this by translating the previous code it received (i.e., code 268 = "HUFY") and tacking the first character of that translated string (the character 'H') onto it. The decompressor then transmits or “writes” the newly translated string to the output file and enters it into the string table. Thus, the unknown string is always the previous string plus its first character.
Improving LZW Codes
One of the most important considerations in designing compression programs is optimizing the lengths of the transmitted codes, since these ultimately dictate the compression ratio of the file. The smaller the codewords, the smaller the resulting compressed file.
In line with this, we can improve LZW by assigning smaller code lengths at the beginning of the compression process. Since the LZW code count starts at code 256, we can begin with 9-bit codes (code 256 already requires 9 bits, so this length accommodates integer codes from 0 to 511), and then add more bits as necessary. Specifically, once code 512 has been defined, the code length must grow to 10 bits to output the new codes properly.
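The refinement can be sketched like this (an illustrative Python fragment, not a full bit-packer; in practice the encoder and decoder must widen their code size in lockstep):

```python
# Variable-width LZW output sketch: start at 9 bits and widen the code
# size as soon as the next code to be defined no longer fits.
def pack_codes(codes):
    bits = []
    width = 9            # 9 bits cover integer codes 0..511 at the start
    next_code = 256      # first string code the encoder will define
    for code in codes:
        bits.append(format(code, f"0{width}b"))
        next_code += 1   # the encoder defines one new string per code
        if next_code == (1 << width):
            width += 1   # e.g. once code 512 exists, switch to 10 bits
    return "".join(bits)
```

After 256 codes have gone out, the table reaches code 512 and every subsequent code occupies 10 bits instead of 9.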
This “variable-length” LZW coding method does require an “increase in logic complexity” in the encoder and decoder [Welch 1984]. In return, the technique improves the compression ratio by about 2 to 3%, and is definitely a welcome improvement.
An “accelerated loading” of strings into the dictionary is also possible, with marked improvement in compression [Horspool 1991]. The effect on computing speed is minimal compared to the benefit of the added compression.
For LZW coding, a hash data structure is best in terms of both speed and memory consumption. The Unix compress program (also known as LZC) uses hashing to implement LZW. The simple hash function in LZC (i.e., SHL XOR) is very good, and the design of its probing function also teaches the reader a great deal about hashing in general: the probing function of any hashing scheme should itself be a function of the previously computed hash value, to take advantage of the computation “already performed” by the first hash function.
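That principle can be illustrated with the following sketch (Python; the table size and shift-and-XOR hash are our own assumptions in the spirit of the LZC design, not its actual code):

```python
# Open-addressing table for (prefix_code, character) pairs, with a probe
# step derived from the hash value already computed.
def make_table(size=5003):                      # a prime-sized table
    return [None] * size

def find_slot(table, prefix_code, ch):
    size = len(table)
    h = ((prefix_code << 4) ^ ord(ch)) % size   # shift-and-XOR style hash
    step = size - h if h != 0 else 1            # probe reuses the hash value
    # Walk the probe sequence until we find the key or an empty slot.
    while table[h] is not None and table[h][0] != (prefix_code, ch):
        h = (h - step) % size
    return h

# Usage: store the code for string (prefix 256 + 'a'), then look it up.
t = make_table()
i = find_slot(t, 256, "a")
t[i] = ((256, "a"), 257)                        # key -> assigned code
```

Because the probe step is computed from the initial hash rather than being a fixed constant, no extra hashing work is needed on a collision.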