File Compression: How Does It Work? Information compression methods.

GORKOFF February 24, 2015 at 11:41 AM


My scientific advisor and I are preparing a small monograph on image processing. I decided to submit the chapter on image compression algorithms to the judgment of the Habr community. Since the whole chapter is hard to fit into a single post, I decided to split it into three:
1. Data compression methods;
2. Lossless image compression;
3. Lossy image compression.
Below you can read the first post of the series.

At the moment there is a large number of lossless compression algorithms, which can be roughly divided into two large groups:
1. Stream and dictionary algorithms. This group includes algorithms of the RLE (run-length encoding), LZ*, and other families. A peculiarity of all algorithms in this group is that the encoding uses not the frequencies of characters in the message, but information about sequences encountered earlier.
2. Statistical (entropy) compression algorithms. These algorithms compress information by exploiting the uneven frequencies with which different characters occur in the message. They include arithmetic coding and prefix coding (using Shannon-Fano, Huffman, or secant trees).
Information transformation algorithms can be singled out as a separate group. Algorithms of this group do not compress information directly, but applying them greatly simplifies further compression with stream, dictionary, and entropy algorithms.

Streaming and dictionary algorithms

Run-length encoding

Run-Length Encoding (RLE) is one of the simplest and most common data compression algorithms. In this algorithm, a sequence of repeated characters is replaced by a character and the number of times it is repeated.
For example, the string "AAAAA", which requires 5 bytes to store (assuming that a byte is allocated for storing one character), can be replaced with "5A", consisting of two bytes. Obviously, the longer the repetition series, the more efficient this algorithm is.
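A minimal sketch of this basic scheme might look as follows (an illustration rather than the article's listing: emitting the count as a single decimal digit, and hence capping runs at 9, are simplifying assumptions):

  function NaiveRLE(const S: ShortString): ShortString;
  var
    i, Count: integer;
    Res: ShortString;
  begin
    Res := '';
    i := 1;
    while i <= length(S) do
    begin
      { count the run of identical characters starting at position i }
      Count := 1;
      while (i + Count <= length(S)) and (S[i + Count] = S[i]) and (Count < 9) do
        Count := Count + 1;
      { emit the run length as a digit followed by the character, e.g. "5A" }
      Res := Res + chr(ord('0') + Count) + S[i];
      i := i + Count;
    end;
    NaiveRLE := Res;
  end;

With this sketch, NaiveRLE('AAAAA') returns '5A'.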

The main disadvantage of this algorithm is its extremely low efficiency on sequences of non-repeating characters. For example, if we consider the sequence "ABABAB" (6 bytes), then after applying the RLE algorithm it will turn into "1A1B1A1B1A1B" (12 bytes). There are various methods to deal with the problem of non-repeating characters.

The simplest method is the following modification: the byte encoding the number of repeats stores not only the repeat count but also a flag for its presence. If the first bit is 1, the next 7 bits hold the number of repetitions of the following character; if the first bit is 0, the next 7 bits hold the number of characters to be copied without repetition. Encoding "ABABAB" with this modification gives "-6ABABAB" (7 bytes), where "-6" denotes a single count byte. Clearly, this technique can significantly improve the efficiency of the RLE algorithm on non-repeating character sequences. An implementation of this approach is shown in Listing 1:

type
  TRLEEncodedString = array of byte;

function RLEEncode(InMsg: ShortString): TRLEEncodedString;
var
  MatchFl: boolean;
  MatchCount: shortint;
  EncodedString: TRLEEncodedString;
  N, i: byte;
begin
  N := 0;
  SetLength(EncodedString, 2 * length(InMsg));
  while length(InMsg) >= 1 do
  begin
    { does a run of repeats start at the first character? }
    MatchFl := (length(InMsg) > 1) and (InMsg[1] = InMsg[2]);
    MatchCount := 1;
    while (MatchCount <= 126) and (MatchCount < length(InMsg)) and ((InMsg[MatchCount] = InMsg[MatchCount + 1]) = MatchFl) do
      MatchCount := MatchCount + 1;
    if MatchFl then
    begin
      { repeat run: count byte (high bit set), then the character }
      N := N + 2;
      EncodedString[N - 2] := MatchCount + 128;
      EncodedString[N - 1] := ord(InMsg[1]);
    end
    else
    begin
      { literal run: count byte (high bit clear), then the characters }
      if MatchCount <> length(InMsg) then
        MatchCount := MatchCount - 1;
      N := N + 1 + MatchCount;
      EncodedString[N - 1 - MatchCount] := -MatchCount + 128;
      for i := 1 to MatchCount do
        EncodedString[N - 1 - MatchCount + i] := ord(InMsg[i]);
    end;
    delete(InMsg, 1, MatchCount);
  end;
  SetLength(EncodedString, N);
  RLEEncode := EncodedString;
end;

Decoding a compressed message is very simple and boils down to a single pass over the compressed message, see Listing 2:
type
  TRLEEncodedString = array of byte;

function RLEDecode(InMsg: TRLEEncodedString): ShortString;
var
  RepeatCount: shortint;
  i, j: word;
  OutMsg: ShortString;
begin
  OutMsg := '';
  i := 0;
  while i < length(InMsg) do
  begin
    RepeatCount := InMsg[i] - 128;
    i := i + 1;
    if RepeatCount < 0 then
    begin
      { literal run: copy the next abs(RepeatCount) characters as-is }
      RepeatCount := abs(RepeatCount);
      for j := i to i + RepeatCount - 1 do
        OutMsg := OutMsg + chr(InMsg[j]);
      i := i + RepeatCount;
    end
    else
    begin
      { repeat run: output the next character RepeatCount times }
      for j := 1 to RepeatCount do
        OutMsg := OutMsg + chr(InMsg[i]);
      i := i + 1;
    end;
  end;
  RLEDecode := OutMsg;
end;
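Assuming the two listings are compiled together, a quick round-trip check might look like this (the byte values in the comment follow the count-byte convention described above):

  var
    Enc: TRLEEncodedString;
  begin
    { 'AAAAAABCD' encodes to [134, 65, 125, 66, 67, 68]:
      134 = 128 + 6 repeats of 'A'; 125 = 128 - 3 literals 'B', 'C', 'D' }
    Enc := RLEEncode('AAAAAABCD');
    writeln(RLEDecode(Enc));  { prints AAAAAABCD }
  end.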

The second method for increasing the efficiency of the RLE algorithm is to use information transformation algorithms that do not compress the data directly, but bring it to a form more convenient for compression. As an example of such an algorithm, we will consider the Burrows-Wheeler transform (BWT), a permutation named after its inventors. This permutation does not change the characters themselves, but only changes their order in the string; after applying the permutation, repeated substrings gather into dense groups, which are compressed much better by the RLE algorithm. The forward BWT transform consists of the following steps:
1. Adding a special end-of-string character, which occurs nowhere else, to the original string;
2. Obtaining all cyclic permutations of the original string;
3. Sorting the resulting strings in lexicographic order;
4. Returning the last column of the resulting matrix.
The implementation of this algorithm is shown in Listing 3.
const
  EOMsg = '|';

function BWTEncode(InMsg: ShortString): ShortString;
var
  OutMsg: ShortString;
  LastChar: ANSIChar;
  ShiftTable: array [1 .. 256] of ShortString;  { declaration missing in the original listing }
  N, i: word;
begin
  InMsg := InMsg + EOMsg;
  N := length(InMsg);
  ShiftTable[1] := InMsg;
  for i := 2 to N do
  begin
    { rotate the string right by one character to get the next permutation }
    LastChar := InMsg[N];
    InMsg := LastChar + copy(InMsg, 1, N - 1);
    ShiftTable[i] := InMsg;
  end;
  Sort(ShiftTable);  { lexicographic sort of the first N rows; implementation not shown }
  OutMsg := '';
  for i := 1 to N do
    OutMsg := OutMsg + ShiftTable[i][N];
  BWTEncode := OutMsg;
end;

The easiest way to explain this transformation is with a concrete example. Let us take the string "ANANAS" and agree that the end-of-string character will be "|". All cyclic permutations of this string and the result of their lexicographic sorting are shown in Table 1.
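The contents of the table are easy to reproduce; note that the "|" character sorts after the letters here:

  Cyclic permutations    Sorted
  ANANAS|                ANANAS|
  NANAS|A                ANAS|AN
  ANAS|AN                AS|ANAN
  NAS|ANA                NANAS|A
  AS|ANAN                NAS|ANA
  S|ANANA                S|ANANA
  |ANANAS                |ANANAS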

That is, the result of the forward transform is the string "|NNAAAS". It is easy to see that this string is compressed by the RLE algorithm much better than the original one, since it contains long subsequences of repeated letters.
A similar effect can be achieved with other transformations, but the advantage of the BWT is that it is reversible; the inverse transform, however, is more complex than the forward one. To restore the original string, you need to do the following:
1. Create an empty n*n matrix, where n is the number of characters in the encoded message;
2. Fill the rightmost empty column with the encoded message;
3. Sort the table rows in lexicographic order;
4. Repeat steps 2-3 while empty columns remain;
5. Return the row that ends with the end-of-string character.

At first glance, the inverse transformation seems straightforward, and one possible implementation is shown in Listing 4.

const
  EOMsg = '|';

function BWTDecode(InMsg: ShortString): ShortString;
var
  OutMsg: ShortString;
  ShiftTable: array of ShortString;
  N, i, j: word;
begin
  OutMsg := '';
  N := length(InMsg);
  SetLength(ShiftTable, N + 1);
  for i := 0 to N do
    ShiftTable[i] := '';
  for i := 1 to N do
  begin
    { prepend the encoded message as a new column, then re-sort the rows }
    for j := 1 to N do
      ShiftTable[j] := InMsg[j] + ShiftTable[j];
    Sort(ShiftTable);
  end;
  { the original string is the row ending with the end-of-string character }
  for i := 1 to N do
    if ShiftTable[i][N] = EOMsg then
      OutMsg := ShiftTable[i];
  delete(OutMsg, N, 1);
  BWTDecode := OutMsg;
end;

In practice, however, the efficiency depends on the chosen sorting algorithm. Trivial algorithms with quadratic complexity will obviously have an extremely negative impact on performance, so it is recommended to use efficient ones.

After sorting the table obtained at the seventh step, you need to select from it the row ending with the "|" character; it is easy to see that this row is unique. Thus, we have examined the BWT transformation using a concrete example.

Summing up, we can say that the main advantage of the RLE group of algorithms is the simplicity and speed of operation (including the speed of decoding), and the main disadvantage is inefficiency on non-repeating character sets. The use of special permutations increases the efficiency of the algorithm, but also greatly increases the running time (especially decoding).

Dictionary compression (LZ algorithms)

The group of dictionary algorithms, unlike the RLE group, encodes not the number of character repetitions but previously encountered character sequences. While the algorithms under consideration run, they dynamically build a table listing previously encountered sequences and their corresponding codes. This table is often called a dictionary, and the corresponding group of algorithms is called dictionary algorithms.

The simplest version of a dictionary algorithm is described below:
1. Initialize the dictionary with all characters that appear in the input string;
2. Find in the dictionary the longest sequence (S) that matches the beginning of the message being encoded;
3. Output the code of the found sequence and remove it from the beginning of the message;
4. If the end of the message has not been reached, read the next character c, add Sc to the dictionary, and go to step 2; otherwise, exit.

For example, the freshly initialized dictionary for the phrase "KUKUSHKAKUKUSHONKUKUPILAKAPYUSHON" (a Russian tongue-twister) is shown in Table 3:

During compression, the dictionary is extended with the sequences found in the message. The process of replenishing the dictionary is shown in Table 4.

When describing the algorithm, the situation when the dictionary becomes completely full was deliberately left out. Depending on the variant of the algorithm, different behavior is possible: fully or partially emptying the dictionary, stopping its replenishment, or expanding it with a corresponding increase in code width. Each of these approaches has certain disadvantages. For example, stopping the replenishment of the dictionary can lead to a situation where the dictionary stores sequences that occur at the beginning of the compressed string but not later on, while clearing the dictionary can remove frequent sequences. Most implementations in use begin to track the compression ratio once the dictionary is full, and when it drops below a certain level, the dictionary is rebuilt. Next, we will consider the simplest implementation, which stops replenishing the dictionary when it is full.

First, let us define the dictionary as a record that stores not only the found substrings but also the number of substrings stored in it:
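A minimal sketch of such a record, consistent with the listings below, might look like this; the InitDict procedure shown here is an assumed variant that prefills the dictionary with the distinct characters of the message, as step 1 of the algorithm requires:

  type
    TDictionary = record
      WordCount: integer;            { number of entries currently stored }
      Words: array of ShortString;   { the code of an entry is its index here }
    end;
    TEncodedString = array of byte;

  { Hypothetical InitDict: fills the dictionary with every distinct
    character of the message. }
  procedure InitDict(var D: TDictionary; const Msg: ShortString);
  var
    i, j: integer;
    Found: boolean;
  begin
    D.WordCount := 0;
    SetLength(D.Words, 0);
    for i := 1 to length(Msg) do
    begin
      Found := false;
      for j := 0 to D.WordCount - 1 do
        if D.Words[j] = Msg[i] then
          Found := true;
      if not Found then
      begin
        D.WordCount := D.WordCount + 1;
        SetLength(D.Words, D.WordCount);
        D.Words[D.WordCount - 1] := Msg[i];
      end;
    end;
  end;

Note that the listings below call InitDict(D) with a single argument, so in a real program the message would have to be passed in as above, or the dictionary prefilled with the whole byte alphabet together with a larger MAX_DICT_LENGTH.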

Previously encountered subsequences are stored in the Words array, and their codes are the indices of those subsequences in the array.
We also define functions for searching the dictionary and adding to it:

const
  MAX_DICT_LENGTH = 256;

function FindInDict(D: TDictionary; str: ShortString): integer;
var
  r: integer;
  i: integer;
  fl: boolean;
begin
  r := -1;
  fl := false;  { must be initialized here: the original left it undefined for an empty dictionary }
  if D.WordCount > 0 then
  begin
    { scan the dictionary from the last entry to the first }
    i := D.WordCount;
    while (not fl) and (i > 0) do
    begin
      i := i - 1;
      fl := D.Words[i] = str;
    end;
  end;
  if fl then
    r := i;
  FindInDict := r;
end;

procedure AddToDict(var D: TDictionary; str: ShortString);
begin
  if D.WordCount < MAX_DICT_LENGTH then
  begin
    D.WordCount := D.WordCount + 1;
    SetLength(D.Words, D.WordCount);
    D.Words[D.WordCount - 1] := str;
  end;
end;

Using these functions, the coding process according to the described algorithm can be implemented as follows:
function LZWEncode(InMsg: ShortString): TEncodedString;
var
  OutMsg: TEncodedString;
  tmpstr: ShortString;
  D: TDictionary;
  N: byte;
begin
  SetLength(OutMsg, length(InMsg));
  N := 0;
  InitDict(D);
  while length(InMsg) > 0 do
  begin
    { greedily grow tmpstr while it is still found in the dictionary }
    tmpstr := InMsg[1];
    while (FindInDict(D, tmpstr) >= 0) and (length(InMsg) > length(tmpstr)) do
      tmpstr := tmpstr + InMsg[length(tmpstr) + 1];
    if FindInDict(D, tmpstr) < 0 then
      delete(tmpstr, length(tmpstr), 1);
    { output the code of the longest match, then extend the dictionary }
    OutMsg[N] := FindInDict(D, tmpstr);
    N := N + 1;
    delete(InMsg, 1, length(tmpstr));
    if length(InMsg) > 0 then
      AddToDict(D, tmpstr + InMsg[1]);
  end;
  SetLength(OutMsg, N);
  LZWEncode := OutMsg;
end;

The result of the encoding is the sequence of word numbers in the dictionary.
Decoding reduces to direct decoding of the codes, and there is no need to transmit the constructed dictionary: it is enough to initialize the dictionary during decoding in the same way as during encoding. The dictionary is then fully restored in the course of decoding by concatenating the previous subsequence with the current symbol.

The only possible problem arises in the following situation: when a subsequence that is not yet in the dictionary has to be decoded. It is easy to see that this can happen only when the string to be decoded is the very one that should be added at the current step. This means that the string matches the pattern cSc, i.e. it starts and ends with the same character, and cS is the string added at the previous step. This is the only situation in which a string that has not yet been added must be decoded. With the above in mind, we can propose the following variant of decoding a compressed string:

function LZWDecode(InMsg: TEncodedString): ShortString;
var
  D: TDictionary;
  OutMsg, tmpstr: ShortString;
  i: byte;
begin
  OutMsg := '';
  tmpstr := '';
  InitDict(D);
  for i := 0 to length(InMsg) - 1 do
  begin
    if InMsg[i] >= D.WordCount then
      { the cSc case: reconstruct the entry from the previous one }
      tmpstr := D.Words[InMsg[i - 1]] + D.Words[InMsg[i - 1]][1]
    else
      tmpstr := D.Words[InMsg[i]];
    OutMsg := OutMsg + tmpstr;
    if i > 0 then
      AddToDict(D, D.Words[InMsg[i - 1]] + tmpstr[1]);
  end;
  LZWDecode := OutMsg;
end;

The advantages of dictionary algorithms include their higher compression efficiency compared to RLE. Nevertheless, it should be understood that using these algorithms in practice involves certain implementation difficulties.

Entropy coding

Encoding with Shannon-Fano trees

The Shannon-Fano algorithm is one of the first compression algorithms ever developed. It is based on the idea of representing more frequent characters with shorter codes. The codes obtained with the Shannon-Fano algorithm have the prefix property: no code word is the beginning of any other code word. The prefix property guarantees that the encoding is one-to-one. Shannon-Fano codes are constructed as follows (a sketch of the procedure is given after the list):
1. Divide the alphabet into two parts whose total symbol probabilities are as close to each other as possible.
2. Append 0 to the prefix codes of the first part of the characters and 1 to the prefix codes of the second part.
3. For each part containing at least two characters, recursively repeat steps 1-3.
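Below is a minimal sketch of this procedure (not from the original article). The symbols are assumed to be pre-sorted by descending frequency, and the split rule, taking the first point where the left part reaches half the total, is a simple approximation of step 1:

  type
    TSFSymbol = record
      Ch: char;
      Freq: integer;
      Code: ShortString;  { accumulated prefix code as a string of '0'/'1' }
    end;
    TSFSymbols = array of TSFSymbol;

  procedure BuildShannonFano(var Symbols: TSFSymbols; Lo, Hi: integer);
  var
    Total, Acc, Split, i: integer;
  begin
    if Lo >= Hi then
      exit;  { a part with a single symbol needs no further splitting }
    Total := 0;
    for i := Lo to Hi do
      Total := Total + Symbols[i].Freq;
    { find a split so the left part's total is close to half }
    Acc := 0;
    Split := Lo;
    for i := Lo to Hi - 1 do
    begin
      Acc := Acc + Symbols[i].Freq;
      Split := i;
      if 2 * Acc >= Total then
        break;
    end;
    { append 0 to the codes of the first part, 1 to the second }
    for i := Lo to Split do
      Symbols[i].Code := Symbols[i].Code + '0';
    for i := Split + 1 to Hi do
      Symbols[i].Code := Symbols[i].Code + '1';
    BuildShannonFano(Symbols, Lo, Split);
    BuildShannonFano(Symbols, Split + 1, Hi);
  end;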
Despite its comparative simplicity, the Shannon-Fano algorithm is not without drawbacks, the most significant of which is the non-optimality of the encoding. Although the split at each step is optimal, the algorithm does not guarantee an optimal result as a whole. Consider, for example, the string "AAAABVGDEZH". The corresponding Shannon-Fano tree and the codes derived from it are shown in Fig. 1:

Without encoding, the message occupies 40 bits (assuming each character is encoded with 4 bits), while with the Shannon-Fano algorithm it takes 4*2 + 2 + 4 + 4 + 3 + 3 + 3 = 27 bits. The message size has decreased by 32.5%, but it will be shown below that this result can be improved.

Encoding with Huffman Trees

The Huffman coding algorithm, developed a few years after the Shannon-Fano algorithm, also possesses the prefix property and, in addition, has provably minimal redundancy, which is precisely why it is so widespread. Huffman codes are obtained with the following algorithm (a sketch of the tree construction is given after the list):
1. All symbols of the alphabet are represented as free nodes, the weight of a node being proportional to the frequency of the symbol in the message;
2. From the set of free nodes, the two nodes with the minimum weight are selected, and a new (parent) node is created with a weight equal to the sum of the weights of the selected nodes;
3. The selected nodes are removed from the free list, and the parent node created from them is added to it;
4. Steps 2-3 are repeated while the free list contains more than one node;
5. Based on the constructed tree, a prefix code is assigned to each character of the alphabet;
6. The message is encoded with the resulting codes.
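Below is a minimal sketch of the tree construction (not from the original article). The Nodes array is assumed to be pre-sized to 2 * LeafCount - 1 entries, with the first LeafCount entries initialized as free leaf nodes:

  type
    THuffNode = record
      Weight: integer;
      Left, Right: integer;  { child indices; -1 for leaves }
      Free: boolean;         { still in the "free node" list }
    end;

  { Merges free nodes until one remains; returns the index of the root. }
  function BuildHuffman(var Nodes: array of THuffNode; LeafCount: integer): integer;
  var
    Count, i, Min1, Min2: integer;
  begin
    Count := LeafCount;
    Min1 := -1;
    while true do
    begin
      { find the two free nodes with the smallest weights }
      Min1 := -1;
      Min2 := -1;
      for i := 0 to Count - 1 do
        if Nodes[i].Free then
          if (Min1 < 0) or (Nodes[i].Weight < Nodes[Min1].Weight) then
          begin
            Min2 := Min1;
            Min1 := i;
          end
          else if (Min2 < 0) or (Nodes[i].Weight < Nodes[Min2].Weight) then
            Min2 := i;
      if Min2 < 0 then
        break;  { only one free node left: it is the root }
      { create the parent node and remove the children from the free list }
      Nodes[Count].Weight := Nodes[Min1].Weight + Nodes[Min2].Weight;
      Nodes[Count].Left := Min1;
      Nodes[Count].Right := Min2;
      Nodes[Count].Free := true;
      Nodes[Min1].Free := false;
      Nodes[Min2].Free := false;
      Count := Count + 1;
    end;
    BuildHuffman := Min1;
  end;

The prefix codes of step 5 are then read off by walking from the root to each leaf, appending 0 for one child and 1 for the other.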

Consider the same example as in the case of the Shannon-Fano algorithm. The Huffman tree and the codes obtained for the "AAAABVGDEZH" message are shown in Fig. 2:

It is easy to calculate that the size of the encoded message will be 26 bits, which is less than with the Shannon-Fano algorithm. Separately, it should be noted that, owing to the popularity of the Huffman algorithm, there are currently many variants of Huffman coding, including adaptive coding, which does not require transmitting the symbol frequencies.
Among the disadvantages of the Huffman algorithm, a significant share are problems related to the complexity of implementation. Using real-valued variables to store symbol frequencies involves a loss of precision, so in practice integer variables are often used; but since the weights of parent nodes grow constantly, sooner or later an overflow occurs. Thus, despite the simplicity of the algorithm, implementing it correctly can still cause some difficulties, especially for large alphabets.
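One common way to mitigate the overflow, given here as a general technique rather than the article's recipe, is to periodically halve all the integer frequencies while keeping them positive:

  procedure RescaleFreqs(var Freq: array of integer);
  var
    i: integer;
  begin
    for i := 0 to High(Freq) do
      if Freq[i] > 0 then
        Freq[i] := (Freq[i] + 1) div 2;  { stays >= 1, so no symbol disappears }
  end;

Rescaling slightly distorts the statistics but bounds the node weights, so the tree can be rebuilt safely at any point.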

Encoding with secant function trees

Encoding with secant functions is an algorithm developed by the authors that produces prefix codes. The algorithm is based on the idea of building a tree, each node of which contains a secant function. To describe the algorithm in more detail, a few definitions need to be introduced.
A word is an ordered sequence of m bits (the number m is called the word length).
A secant literal is a pair of the form (bit position, bit value). For example, the literal (4,1) means that bit 4 of the word must be equal to 1. If the literal's condition is satisfied, the literal is considered true, otherwise it is false.
A k-bit secant function is a set of k literals. If all the literals are true, then the secant function itself is true, otherwise it is false.
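To make the definitions concrete, here is a minimal sketch of evaluating a secant function on a word (not from the original article; the 0-based bit numbering is an assumption):

  type
    TLiteral = record
      Bit: byte;    { bit position within the word }
      Value: byte;  { required bit value: 0 or 1 }
    end;

  { True only if every literal holds for the given word. }
  function SecantTrue(W: word; const Literals: array of TLiteral): boolean;
  var
    i: integer;
  begin
    SecantTrue := true;
    for i := 0 to High(Literals) do
      if ((W shr Literals[i].Bit) and 1) <> Literals[i].Value then
      begin
        SecantTrue := false;
        exit;
      end;
  end;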

The tree is built so that each node splits the alphabet into parts that are as close as possible. An example of a secant tree is shown in Fig. 3:

In the general case, a tree of secant functions does not guarantee optimal coding, but it provides an extremely high speed of operation due to the simplicity of the operation performed at each node.

Arithmetic coding

Arithmetic coding is one of the most efficient ways to compress information. Unlike the Huffman algorithm, arithmetic coding allows messages to be encoded with an entropy of less than 1 bit per symbol. Because most arithmetic coding algorithms are protected by patents, only the basic ideas will be described below.
Suppose the alphabet in use contains N symbols a_1, ..., a_N with frequencies p_1, ..., p_N, respectively. Then the arithmetic coding algorithm looks as follows:
1. Take [0; 1) as the working half-interval;
2. Split the working half-interval into subintervals with lengths proportional to the frequencies of the symbols;
3. Take the subinterval corresponding to the current symbol of the message as the new working half-interval;
4. Repeat steps 2-3 until the end of the message;
5. Output any number from the final working half-interval as the code of the message.
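As a toy illustration of these steps (an assumption-laden sketch, not the article's method: real coders work with integer arithmetic and renormalization to avoid losing precision), suppose Cum[0..N] holds the cumulative frequencies with Cum[0] = 0 and Cum[N] = 1, and Msg holds symbol indices:

  function ToyArithEncode(const Msg: array of byte; const Cum: array of double): double;
  var
    L, H, Range: double;
    i: integer;
  begin
    L := 0.0;
    H := 1.0;
    for i := 0 to High(Msg) do
    begin
      { narrow the working half-interval to the current symbol's subinterval }
      Range := H - L;
      H := L + Range * Cum[Msg[i] + 1];
      L := L + Range * Cum[Msg[i]];
    end;
    ToyArithEncode := (L + H) / 2;  { any number in [L; H) identifies the message }
  end;

A decoder with the same Cum table repeatedly determines which subinterval the received number falls into and emits the corresponding symbol.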