Compression and Decompression using LZW Algorithm

Compression is the art or Science which represents information in a compact form. Compression is the process of reducing the size of a file. The aim of the data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Compression is used in the case of text, video, audio data etc. In audio and video compression it is used to reduce the number of bits required to represent an image or video.

Data Compression and Decompression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of effective coding and its consequences, in the form of speed of transmission and probability of errors. Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of memory that is to be used to store the data. The purpose of this is to present a data compression and decompression software to reduce the amount memory used for storing data

There are mainly two types of compression methods are used: lossless compression and lossy compression

In lossless compression there is no loss of data during compression and decompression. It is mainly used in compressing text documents because in documents if it losses even a bit of data or if there is a change in word it becomes a large change in the meaning. Example for lossless compression is  Arithmetic coding, Substitutional Compressors, Huffman coding etc.

In lossy compression there is a lose of data during compression. It is mainly used in video compression etc.. In this, we change some features of the picture or video for compression so there a loss of actual data. So we cant get the exact decompressed file. Eg: Audio compression, Fractal compression etc. In lossy compression the compression rate is higher than that of lossless compression. But it is least efficient

There are many compression techniques used for many applications but only the most commonly used and simplest compression and decompression technique that is  LZW compression algorithm is explained here.

 

1.  LZW Compression and Decompression

This is the most commonly used and the simplest type of compression and decompression algorithm. The original Lempel Ziv approach to data was published in 1977.Terry Welch’s refinements to the algorithm were published in 1984. This algorithm starts with a dictionary of characters. In this algorithm, it replaces strings of characters with single codes.

There are mainly two types of implementations in LZW compression. They are: Static and Dynamic.

 

1.1.   Static Compression

It use when we use a static number of bits to make the compression. The total number of bits we have to our disposal is 32, which gives a total of 232 that is 4,294,967,295 possible entries in the dictionary. But in almost all cases we don’t get that much entries so most of the cases we use a constant number of bits commonly set the number of bits as 14. In 14 bit format ‘A ‘ is represented by 00000001000001.

 

1.2.   Dynamic Compression

Dynamic compression changes the number of bits used to compress the data. It starts with 9 bits for each new value, and goes up until it reaches 32 or until the file ends. In this type of compression we can set the size of the dictionary we have. The dictionary begins with all the 256 ASCII codes. In dynamic compression there is only one leading 0 in front of each number. In dynamic compression ‘A’ is represented as 001000001.If we change the character in the ASCII to a string (group of characters) it changes the number of bits used to represent the string. Because when we use 9 bits only up to 255 is possible.

Advantage of dynamic compression over static compression is that we get a more compressed file when we use the former. The disadvantage is that the dynamic compression takes more time to compress a file. Even if it takes more time, the most commonly used method is dynamic compression.

 

2.  ALGORITHMS:

2.1.   LZW Algorithm for Compression

The LZW Compression algorithm starts with a “Dictionary” of all the single characters with indexes 0-255. It then starts to expand the dictionary as information get sent through. The redundant strings will be coded as a single bit so the compression has occurred.

Algorithm

STRING = get input character

WHILE there are still input characters DO

CHARACTER = get input character

IF STRING + CHARACTER is in the string table then

STRING = STRING + CHARACTER

ELSE

output the code for STRING

add STRING + CHARACTER to the string table

STRING = CHARACTER

END of IF

END of WHILE

output the code for STRING

 

The program reads one character at a time. If the code is in the dictionary, then it adds the character to the current work string, and waits for the next one. This occurs on the first character as well. If the work string is not in the dictionary, (such as when the second character comes across), it adds the work string to the dictionary and sends over the wire the work string without the new character. It then sets the work string to the new character.

 

Explanation with Example

 

A sample string used to demonstrate the algorithm is shown in Figure 1. Add all ASCII characters in the table from 0-255. you can see that the first pass through the loop, a check is performed to see if the string “/A” is in the table. Since it isn’t, the code for ‘/’ is output, and the string “/A” is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, ‘B’, has been read in, the second string code, “AB” is added to the table, and the code for letter ‘A’ is output. This continues until in the second word, the characters ‘/’ and ‘A’ are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.

 

The sample output for the string is shown in Figure 1 along with the resulting string table. As can be seen, the string table fills up rapidly, since a new string is added to the table each time a code is output. In this highly redundant input, 5 code substitutions were output, along with 7 characters. If we were using 9 bit codes for output, the 19 character input string would be reduced to a 13.5 byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn’t begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.

 

2.2   LZW Algorithm for Decompression

Algorithm

 

Read OLD_CODE

output OLD_CODE

WHILE there are still input characters DO

Read NEW_CODE

STRING = get translation of NEW_CODE

output STRING

CHARACTER = first character in STRING

add OLD_CODE + CHARACTER to the translation table

OLD_CODE = NEW_CODE

END of WHILE

 

To read a file after compression it must be decompressed. In LZW this has a companion algorithm for decompression. It needs to be able to take the stream of codes output from the compression algorithm, and use them to exactly recreate the input stream. The decompressor builds its own dictionary on its side, that matches exactly the compressor’s, so that only the codes need to be sent.

 

Explanation with example

The algorithm is shown in Figure2. Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output.

Figure 2 shows the output of the algorithm given the input created by the compression earlier in the article. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.

The decompression algorithm shown in Figure 2 is just a little too simple. There is a single exception case in the LZW compression algorithm when there is a string consisting of a (STRING,CHARACTER) pair already defined in the table, and the input stream then sees a sequence of STRING, CHARACTER, STRING, CHARACTER, STRING, the compression algorithm will output a code before the decompressor gets a chance to define it.

 

A simple example will illustrate the point. Imagine the the string MNOPQ is defined in the table as code 280. Later on, the sequence MNOPQMNOPQMNOP occurs in the table. The compression output will look like that shown in Figure 3.

When the decompression algorithm sees this input stream, it first decodes the code 280 and outputs the MNOPQ string. After doing the output, it will add the definition for code 299 to the table, whatever that may be. It then reads the next input code, 300, and finds that it is not in the table. Fortunately, this is the only case where the decompression algorithm will encounter an undefined code. Since it is in fact the only case, we can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code, and handles it. In the example in Figure3, the decompression routine sees a code of 300, which is undefined. Since it is undefined, it translates the value of OLD_CODE, which is code 300. It then adds the CHARACTER value, which is ‘M’, to the string. This results in the correct translation of code 300 to string “MNOPQMN”.

If we want to create a compression and decompression software it is most important to calculate the CRC32 values. It is used to check whether there is any lose of data during compression and also used to check the data lose during transmission. For that first calculate the CRC32 of the original file and store it in somewhere in memory and then compress the file and when reading the decompressed file, find the CRC32 of the decompressed file and check whether the both CRC‘s are equal. If these are same the data decompressed correctly because for all files even the words it have its own CRC32 value. It is unique value.

The Modified Decompression Algorithm:

 

Read OLD_CODE

output OLD_CODE

CHARACTER = OLD_CODE

WHILE there are still input characters DO

Read NEW_CODE

IF NEW_CODE is not in the translation table THEN

STRING = get translation of OLD_CODE

STRING = STRING + CHARACTER

ELSE

STRING = get translation of NEW_CODE

END of IF

output STRING

CHARACTER = first character in STRING

add OLD_CODE + CHARACTER to the translation table

OLD_CODE = NEW_CODE

END of WHILE

References:

(1) Some parts of the text and examples taken from: Mark Nelson’s Pages. The page contains example implementation of LZW in C.

(2) Visit the Wikipedia page for compression: http://en.wikipedia.org/wiki/Data_compression

»

Copyright © 2017 Hoozi Resources