
Data compression methods


My supervisor and I are preparing a short monograph on image processing. I decided to present a chapter on image compression algorithms to the Habr community. Since it is hard to fit a whole chapter into one post, I decided to split it into three:
1. Data compression methods;
2. Lossless image compression;
3. Lossy image compression.
You can read the first post in the series below.

At the moment there is a large number of lossless compression algorithms, which can be roughly divided into two large groups:
1. Stream and dictionary algorithms. This group includes algorithms of the RLE (run-length encoding) and LZ* families, among others. A distinctive feature of all algorithms in this group is that the encoding uses not information about the frequencies of symbols in the message, but information about sequences encountered earlier.
2. Statistical (entropy) compression algorithms. This group compresses information by exploiting the uneven frequencies with which different characters occur in a message. It includes arithmetic coding and prefix coding algorithms (using Shannon-Fano, Huffman, or secant trees).
Information transformation algorithms can be singled out as a separate group. Algorithms of this group do not compress information directly, but applying them greatly simplifies further compression with stream, dictionary, and entropy algorithms.

Stream and Dictionary Algorithms

Run-length encoding

Run-Length Encoding (RLE) is one of the simplest and most common data compression algorithms. In this algorithm, a sequence of repeated characters is replaced by a character and the number of times it repeats.
For example, the string "AAAAA", which requires 5 bytes to store (assuming one byte is allocated per character), can be replaced by "5A", consisting of two bytes. Obviously, the longer the run of repetitions, the more efficient this algorithm is.

The main disadvantage of this algorithm is its extremely low efficiency on sequences of non-repeating characters. For example, if we consider the sequence "ABABAB" (6 bytes), then after applying the RLE algorithm, it will turn into "1A1B1A1B1A1B" (12 bytes). There are various methods to solve the problem of non-repeating characters.

The simplest method is the following modification: the byte encoding the number of repetitions must store information not only about the number of repetitions, but also about their presence. If the first bit is 1, then the next 7 bits indicate the number of repetitions of the corresponding character, and if the first bit is 0, then the next 7 bits indicate the number of characters to be taken without repetition. If we encode "ABABAB" using this modification, we get "-6ABABAB" (7 bytes). Obviously, the proposed technique can significantly improve the efficiency of the RLE algorithm on non-repeating character sequences. The implementation of the proposed approach is shown in Listing 1:

type
  TRLEEncodedString = array of byte;

function RLEEncode(InMsg: ShortString): TRLEEncodedString;
var
  MatchFl: boolean;
  MatchCount: shortint;
  EncodedString: TRLEEncodedString;
  N, i: word;
begin
  N := 0;
  SetLength(EncodedString, 2 * length(InMsg));
  while length(InMsg) >= 1 do
  begin
    { Does the message start with a run of repeated characters? }
    MatchFl := (length(InMsg) > 1) and (InMsg[1] = InMsg[2]);
    MatchCount := 1;
    while (MatchCount <= 126) and (MatchCount < length(InMsg)) and
          ((InMsg[MatchCount] = InMsg[MatchCount + 1]) = MatchFl) do
      MatchCount := MatchCount + 1;
    if MatchFl then
    begin
      { Run: store (count + 128), then the repeated character }
      N := N + 2;
      EncodedString[N - 2] := MatchCount + 128;
      EncodedString[N - 1] := ord(InMsg[1]);
    end
    else
    begin
      { Literal block: store (128 - count), then the characters themselves }
      if MatchCount <> length(InMsg) then
        MatchCount := MatchCount - 1;
      N := N + 1 + MatchCount;
      EncodedString[N - 1 - MatchCount] := -MatchCount + 128;
      for i := 1 to MatchCount do
        EncodedString[N - 1 - MatchCount + i] := ord(InMsg[i]);
    end;
    delete(InMsg, 1, MatchCount);
  end;
  SetLength(EncodedString, N);
  RLEEncode := EncodedString;
end;

Decoding is very simple and comes down to a single pass over the compressed message; see Listing 2:
function RLEDecode(InMsg: TRLEEncodedString): ShortString;
var
  RepeatCount: shortint;
  i, j: word;
  OutMsg: ShortString;
begin
  OutMsg := '';
  i := 0;
  while i < length(InMsg) do
  begin
    { Counts above 128 mark runs, counts below 128 mark literal blocks }
    RepeatCount := InMsg[i] - 128;
    i := i + 1;
    if RepeatCount < 0 then
    begin
      { Literal block: copy the next |RepeatCount| characters as-is }
      RepeatCount := abs(RepeatCount);
      for j := i to i + RepeatCount - 1 do
        OutMsg := OutMsg + chr(InMsg[j]);
      i := i + RepeatCount;
    end
    else
    begin
      { Run: repeat the next character RepeatCount times }
      for j := 1 to RepeatCount do
        OutMsg := OutMsg + chr(InMsg[i]);
      i := i + 1;
    end;
  end;
  RLEDecode := OutMsg;
end;
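
For a quick check, the two functions can be combined into a round trip (a minimal usage sketch; the test string is arbitrary):

var
  Enc: TRLEEncodedString;
begin
  Enc := RLEEncode('AAAAABABAB');
  writeln(RLEDecode(Enc)); { prints AAAAABABAB }
end.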

The second way to increase the efficiency of the RLE algorithm is to use information transformation algorithms, which do not compress the data directly but convert it to a form more convenient for compression. As an example, we will consider the Burrows-Wheeler transform (BWT), a permutation named after its inventors. The permutation does not change the characters themselves, only their order in the string, and after it is applied, repeating substrings gather into dense groups that are compressed much better by the RLE algorithm. The forward BWT comes down to the following steps:
1. Adding to the source string a special end-of-line character that does not occur anywhere else;
2. Getting all cyclic permutations of the original string;
3. Sorting the received strings in lexicographic order;
4. Returning the last column of the resulting matrix.
The implementation of this algorithm is shown in Listing 3.
const
  EOMsg = '|';

function BWTEncode(InMsg: ShortString): ShortString;
var
  OutMsg: ShortString;
  LastChar: ANSIChar;
  ShiftTable: array of ShortString;
  N, i: word;
begin
  InMsg := InMsg + EOMsg;
  N := length(InMsg);
  SetLength(ShiftTable, N + 1);
  ShiftTable[1] := InMsg;
  for i := 2 to N do
  begin
    { Each next row is the previous one rotated right by one character }
    LastChar := InMsg[N];
    InMsg := LastChar + copy(InMsg, 1, N - 1);
    ShiftTable[i] := InMsg;
  end;
  Sort(ShiftTable); { lexicographic sort of the rows; implementation not shown }
  OutMsg := '';
  for i := 1 to N do
    OutMsg := OutMsg + ShiftTable[i][N];
  BWTEncode := OutMsg;
end;

The easiest way to explain this transformation is with a concrete example. Let's take the string "ANANAS" (pineapple) and agree that the symbol "|" marks the end of the string. All cyclic permutations of this string and the result of their lexicographic sorting are given in Table 1.

That is, the result of the forward transform is the string "|NNAAAS". It is easy to see that the RLE algorithm compresses this string much better than the original, because it contains long runs of repeated letters.
A similar effect can be achieved with other transformations, but the advantage of BWT is that it is reversible, although the inverse transform is more complicated than the forward one. To restore the original string, perform the following steps:
1. Create an empty matrix of size n*n, where n is the number of characters in the encoded message;
2. Fill the rightmost empty column with the encoded message;
3. Sort the table rows in lexicographic order;
4. Repeat steps 2-3 as long as there are empty columns;
5. Return the row that ends with the end-of-line character.

The implementation of the inverse transform is straightforward; one possible implementation is shown in Listing 4.

const
  EOMsg = '|';

function BWTDecode(InMsg: ShortString): ShortString;
var
  OutMsg: ShortString;
  ShiftTable: array of ShortString;
  N, i, j: word;
begin
  OutMsg := '';
  N := length(InMsg);
  SetLength(ShiftTable, N + 1);
  for i := 0 to N do
    ShiftTable[i] := '';
  for i := 1 to N do
  begin
    { Prepend the encoded column to every row, then re-sort the rows }
    for j := 1 to N do
      ShiftTable[j] := InMsg[j] + ShiftTable[j];
    Sort(ShiftTable);
  end;
  { The original string is the row ending with the terminator }
  for i := 1 to N do
    if ShiftTable[i][N] = EOMsg then
      OutMsg := ShiftTable[i];
  delete(OutMsg, N, 1);
  BWTDecode := OutMsg;
end;

In practice, however, efficiency depends on the chosen sorting algorithm. Trivial algorithms with quadratic complexity will obviously have an extremely negative impact on performance, so it is recommended to use efficient sorts.

After the sort performed at the seventh iteration (the example string contains seven characters, counting the terminator), it remains to select the row that ends with the symbol "|". It is easy to see that this is the only such row. Thus, we have traced the BWT transform on a concrete example.

Summing up, we can say that the main advantage of the RLE algorithm group is simplicity and speed of operation (including decoding speed), and the main disadvantage is inefficiency on non-repeating character sets. The use of special permutations increases the efficiency of the algorithm, but also greatly increases the running time (especially decoding).

Dictionary compression (LZ algorithms)

The group of dictionary algorithms, unlike the RLE group, encodes not counts of repeated characters but previously encountered sequences of characters. While these algorithms run, they dynamically build a table listing already encountered sequences and their corresponding codes. This table is often called a dictionary, and the corresponding group of algorithms is called dictionary algorithms.

The simplest version of a dictionary algorithm is described below:
1. Initialize the dictionary with all the characters occurring in the input string;
2. Find in the dictionary the longest sequence S that matches the beginning of the encoded message;
3. Output the code of the found sequence and remove the sequence from the beginning of the encoded message;
4. If the end of the message has not been reached, read the next character c, add the sequence Sc to the dictionary, and go to step 2. Otherwise, exit.

For example, a freshly initialized dictionary for the phrase "CUCKOOCOOKOOHOOD" is shown in Table 3:

During compression, the dictionary is supplemented with sequences encountered in the message. The process of updating the dictionary is given in Table 4.

When describing the algorithm, we deliberately omitted the situation when the dictionary is completely filled. Depending on the variant of the algorithm, different behavior is possible: clearing the dictionary completely or partially, stopping its replenishment, or expanding it with a corresponding increase in the code width. Each of these approaches has certain disadvantages. For example, stopping replenishment can leave the dictionary storing sequences that occur at the beginning of the compressed string but never later, while clearing the dictionary may remove frequent sequences. Most practical implementations track the compression ratio while filling the dictionary and rebuild the dictionary once the ratio drops below a certain level. Below we consider the simplest implementation, which stops replenishing the dictionary once it is full.

First, we define a dictionary as a record that stores not only the substrings encountered, but also the number of substrings stored in the dictionary:
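A minimal definition consistent with the functions below (the field names follow their usage in FindInDict and AddToDict; the exact types are assumed):

type
  TDictionary = record
    Words: array of ShortString; { previously encountered subsequences }
    WordCount: integer;          { number of entries actually stored }
  end;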

Previously encountered subsequences are stored in the Words array, and their codes are their indices in this array.
We also define functions for searching the dictionary and adding entries to it:

const
  MAX_DICT_LENGTH = 256;

function FindInDict(D: TDictionary; str: ShortString): integer;
var
  r, i: integer;
  fl: boolean;
begin
  r := -1;
  fl := false;
  { Scan the dictionary from the most recently added entry backwards }
  i := D.WordCount;
  while (not fl) and (i > 0) do
  begin
    i := i - 1;
    fl := D.Words[i] = str;
  end;
  if fl then
    r := i;
  FindInDict := r;
end;

procedure AddToDict(var D: TDictionary; str: ShortString);
begin
  { Stop replenishing the dictionary once it is full }
  if D.WordCount < MAX_DICT_LENGTH then
  begin
    D.WordCount := D.WordCount + 1;
    SetLength(D.Words, D.WordCount);
    D.Words[D.WordCount - 1] := str;
  end;
end;
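
The listings below also call an InitDict helper that is not shown in the text. A minimal sketch, assuming the alphabet is agreed upon in advance so that the decoder can initialize the dictionary identically:

procedure InitDict(var D: TDictionary);
var
  c: char;
begin
  D.WordCount := 0;
  SetLength(D.Words, 0);
  { Seed the dictionary with every single-character string of the
    agreed-upon alphabet; all 256 byte values are assumed here. With
    MAX_DICT_LENGTH = 256 this leaves no room for growth, so a real
    implementation would use a larger code space. }
  for c := #0 to #255 do
    AddToDict(D, c);
end;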

Using these functions, the encoding process according to the described algorithm can be implemented as follows:
type
  TEncodedString = array of byte; { codes fit in a byte, since MAX_DICT_LENGTH = 256 }

function LZWEncode(InMsg: ShortString): TEncodedString;
var
  OutMsg: TEncodedString;
  tmpstr: ShortString;
  D: TDictionary;
  N: word;
begin
  SetLength(OutMsg, length(InMsg));
  N := 0;
  InitDict(D);
  while length(InMsg) > 0 do
  begin
    { Grow tmpstr while it is still found in the dictionary }
    tmpstr := InMsg[1];
    while (FindInDict(D, tmpstr) >= 0) and (length(InMsg) > length(tmpstr)) do
      tmpstr := tmpstr + InMsg[length(tmpstr) + 1];
    if FindInDict(D, tmpstr) < 0 then
      delete(tmpstr, length(tmpstr), 1);
    { Output the code of the longest match }
    OutMsg[N] := FindInDict(D, tmpstr);
    N := N + 1;
    delete(InMsg, 1, length(tmpstr));
    { Add the match extended by the next character to the dictionary }
    if length(InMsg) > 0 then
      AddToDict(D, tmpstr + InMsg[1]);
  end;
  SetLength(OutMsg, N);
  LZWEncode := OutMsg;
end;

The result of encoding is the sequence of the words' numbers in the dictionary.
The decoding process reduces to directly decoding the codes, and there is no need to transmit the constructed dictionary: it is enough to initialize the dictionary during decoding the same way as during encoding. The dictionary is then fully restored directly during decoding, by concatenating the previous subsequence with the first character of the current one.

A problem can arise in only one situation: when a subsequence that is not yet in the dictionary must be decoded. It is easy to see that this is possible only when the substring to be extracted is the very one that should be added at the current step. This means the substring matches the pattern cSc, i.e. it starts and ends with the same character, where cS is the substring added at the previous step. This is the only situation in which a string that has not yet been added must be decoded. With this in mind, we can propose the following variant of decoding a compressed string:

function LZWDecode(InMsg: TEncodedString): ShortString;
var
  D: TDictionary;
  OutMsg, tmpstr: ShortString;
  i: word;
begin
  OutMsg := '';
  tmpstr := '';
  InitDict(D);
  for i := 0 to length(InMsg) - 1 do
  begin
    { Special case: the code is not in the dictionary yet (pattern cSc) }
    if InMsg[i] >= D.WordCount then
      tmpstr := D.Words[InMsg[i - 1]] + D.Words[InMsg[i - 1]][1]
    else
      tmpstr := D.Words[InMsg[i]];
    OutMsg := OutMsg + tmpstr;
    { Rebuild the dictionary exactly as the encoder did }
    if i > 0 then
      AddToDict(D, D.Words[InMsg[i - 1]] + tmpstr[1]);
  end;
  LZWDecode := OutMsg;
end;

The advantages of dictionary algorithms include their greater compression efficiency compared to RLE. Nevertheless, it should be understood that the actual use of these algorithms is associated with some implementation difficulties.

Entropy coding

Encoding with Shannon-Fano Trees

The Shannon-Fano algorithm is one of the first compression algorithms ever developed. It is based on the idea of representing more frequent characters with shorter codes. The codes obtained by the Shannon-Fano algorithm have the prefix property: no code is the beginning of any other code. The prefix property guarantees that decoding is unambiguous. The algorithm for constructing Shannon-Fano codes is presented below (a code sketch follows the list):
1. Divide the alphabet into two parts whose total symbol probabilities are as close to each other as possible.
2. Add 0 to the prefix code of the first part of symbols, add 1 to the prefix code of the second part of symbols.
3. For each part (which has at least two characters), recursively perform steps 1-3.
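
As an illustration, here is a minimal sketch of this construction. It assumes the symbol table is pre-sorted by descending frequency and uses a simple greedy split to approximate step 1; the names TSymbolInfo, TSymbolTable, and BuildCodes are illustrative, not from the article:

type
  TSymbolInfo = record
    Symbol: char;
    Freq: integer;
    Code: string;
  end;
  TSymbolTable = array of TSymbolInfo;

procedure BuildCodes(var Table: TSymbolTable; Lo, Hi: integer);
var
  i, Split, Total, Acc: integer;
begin
  if Lo >= Hi then
    exit; { zero or one symbol: nothing to split }
  Total := 0;
  for i := Lo to Hi do
    Total := Total + Table[i].Freq;
  { Greedily grow the first part while its total stays at most half }
  Split := Lo + 1; { the first part gets at least one symbol }
  Acc := Table[Lo].Freq;
  while (Split < Hi) and (2 * (Acc + Table[Split].Freq) <= Total) do
  begin
    Acc := Acc + Table[Split].Freq;
    Split := Split + 1;
  end;
  { Append 0 to the codes of the first part and 1 to the second }
  for i := Lo to Hi do
    if i < Split then
      Table[i].Code := Table[i].Code + '0'
    else
      Table[i].Code := Table[i].Code + '1';
  BuildCodes(Table, Lo, Split - 1);
  BuildCodes(Table, Split, Hi);
end;
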
Despite its relative simplicity, the Shannon-Fano algorithm is not without drawbacks, the most significant being non-optimal coding. Although the partition at each step is optimal, the algorithm does not guarantee an optimal result overall. Consider, for example, the string "AAAABVGDEZH". The corresponding Shannon-Fano tree and the codes derived from it are shown in Fig. 1:

Without encoding, the message will take 40 bits (provided that each character is encoded with 4 bits), and using the Shannon-Fano algorithm 4*2+2+4+4+3+3+3=27 bits. The message volume decreased by 32.5%, but it will be shown below that this result can be significantly improved.

Encoding with Huffman Trees

The Huffman coding algorithm, developed a few years after the Shannon-Fano algorithm, also has the prefix property and, in addition, provably minimal redundancy; this is the reason for its extremely wide use. Huffman codes are obtained with the following algorithm (a sketch of the tree-building steps follows the list):
1. All symbols of the alphabet are represented as free nodes, with the weight of the node proportional to the frequency of the symbol in the message;
2. From the set of free nodes, two nodes with a minimum weight are selected and a new (parent) node is created with a weight equal to the sum of the weights of the selected nodes;
3. The selected nodes are removed from the free list, and the parent node created on their basis is added to this list;
4. Steps 2-3 are repeated as long as more than one node remains in the free list;
5. Based on the constructed tree, each character of the alphabet is assigned a prefix code;
6. The message is encoded with the received codes.
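
As an illustration, below is a minimal sketch of steps 2-4, assuming the leaf nodes of step 1 have already been created. The names THuffNode and BuildHuffmanTree are illustrative; the two lightest nodes are found by linear scan, where a practical implementation would use a priority queue:

type
  PHuffNode = ^THuffNode;
  THuffNode = record
    Weight: integer;
    Symbol: char;           { meaningful only for leaf nodes }
    Left, Right: PHuffNode; { nil for leaf nodes }
  end;

function BuildHuffmanTree(var Free: array of PHuffNode; Count: integer): PHuffNode;
var
  i, a, b: integer;
  Parent: PHuffNode;
begin
  while Count > 1 do
  begin
    { Select the two free nodes with minimum weight (a is the lightest) }
    if Free[0]^.Weight <= Free[1]^.Weight then
    begin a := 0; b := 1; end
    else
    begin a := 1; b := 0; end;
    for i := 2 to Count - 1 do
      if Free[i]^.Weight < Free[a]^.Weight then
      begin b := a; a := i; end
      else if Free[i]^.Weight < Free[b]^.Weight then
        b := i;
    { Create a parent whose weight is the sum of the two }
    New(Parent);
    Parent^.Weight := Free[a]^.Weight + Free[b]^.Weight;
    Parent^.Left := Free[a];
    Parent^.Right := Free[b];
    Parent^.Symbol := #0;
    { Remove the two nodes from the free list and add the parent }
    Free[a] := Parent;
    Free[b] := Free[Count - 1];
    Count := Count - 1;
  end;
  BuildHuffmanTree := Free[0];
end;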

Consider the same example as in the case of the Shannon-Fano algorithm. The Huffman tree and codes received for the message "AAAABVGDEZH" are shown in Fig. 2:

It is easy to calculate that the size of the encoded message will be 26 bits, which is less than with the Shannon-Fano algorithm. It is worth noting separately that, owing to the popularity of the Huffman algorithm, there are now many variants of Huffman coding, including adaptive coding, which does not require transmitting the symbol frequencies.
A significant share of the Huffman algorithm's disadvantages are problems related to implementation complexity. Storing the frequencies in real (floating-point) variables entails a loss of precision, so in practice integer variables are often used; but since the weights of parent nodes grow constantly, sooner or later an overflow occurs. Thus, despite the simplicity of the algorithm, implementing it correctly can still cause difficulties, especially for large alphabets.

Encoding with secant trees

Coding with secant functions is an algorithm developed by the authors that produces prefix codes. The algorithm is based on the idea of building a tree in which every node contains a secant function. To describe the algorithm in more detail, several definitions are needed.
A word is an ordered sequence of m bits (the number m is called the word length).
A secant literal is a (bit, value) pair. For example, the literal (4,1) means that bit 4 of the word must be equal to 1. If the literal's condition is satisfied, the literal is considered true; otherwise it is false.
A k-bit secant is a set of k literals. If all literals are true, then the secant function itself is true, otherwise it is false.
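
Following these definitions, evaluating a secant function reduces to checking its literals against a word; a minimal sketch (the names TLiteral, TSecant, and EvalSecant are illustrative):

type
  TLiteral = record
    BitPos: byte; { which bit of the word is tested }
    Value: byte;  { the required bit value, 0 or 1 }
  end;
  TSecant = array of TLiteral; { k literals form a k-bit secant }

function EvalSecant(const S: TSecant; w: longword): boolean;
var
  i: integer;
begin
  { The secant is true iff every literal's condition holds }
  EvalSecant := true;
  for i := 0 to length(S) - 1 do
    if ((w shr S[i].BitPos) and 1) <> S[i].Value then
      EvalSecant := false;
end;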

The tree is built so that each node splits the alphabet into parts as close in total probability as possible. Fig. 3 shows an example of a secant tree:

The tree of secant functions in the general case does not guarantee optimal coding, but it provides an extremely high speed of operation due to the simplicity of the operation at the nodes.

Arithmetic coding

Arithmetic coding is one of the most efficient ways to compress information. Unlike the Huffman algorithm, arithmetic coding can compress messages to fewer than one bit per symbol. Since most arithmetic coding algorithms are protected by patents, only the main ideas are described below.
Suppose the alphabet in use contains N symbols a_1,…,a_N with frequencies p_1,…,p_N, respectively. Then the arithmetic coding algorithm looks like this (a toy sketch follows the list):
1. Take the half-interval [0; 1) as the working half-interval;
2. Divide the working half-interval into sections whose lengths are proportional to the frequencies of the symbols;
3. Select the section corresponding to the next symbol of the message and take it as the new working half-interval;
4. Repeat steps 2-3 until the end of the message;
5. Output any number from the final working half-interval as the code of the message.
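
For illustration only, here is a toy floating-point sketch of this interval narrowing; real implementations use integer arithmetic with renormalization to avoid precision loss, and the names TFreqTable and ArithEncode are illustrative:

type
  TFreqTable = array[char] of double; { symbol frequencies, summing to 1 }

function ArithEncode(const Msg: ShortString; const Freq: TFreqTable): double;
var
  Lo, Hi, Range, AccLo: double;
  i: integer;
  c, s: char;
begin
  Lo := 0.0;
  Hi := 1.0;
  for i := 1 to length(Msg) do
  begin
    c := Msg[i];
    Range := Hi - Lo;
    { Sum the frequencies of all symbols that precede c in the alphabet }
    AccLo := 0.0;
    s := #0;
    while s < c do
    begin
      AccLo := AccLo + Freq[s];
      s := succ(s);
    end;
    { Narrow the working half-interval to the section of symbol c }
    Hi := Lo + Range * (AccLo + Freq[c]);
    Lo := Lo + Range * AccLo;
  end;
  ArithEncode := Lo; { any number in [Lo; Hi) identifies the message }
end;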