
Lecture number 4. Information compression

Information compression principles

The purpose of data compression is to provide a compact representation of the data generated by a source, so that the data can be stored and transmitted over communication channels more economically.

Let's say we have a 1 (one) megabyte file. We need to get a smaller file out of it. Nothing complicated: we run an archiver, for example WinZip, and get, say, a 600-kilobyte file. Where did the remaining 424 kilobytes go?

Compression of information is one of the ways to encode it. In general, codes fall into three large groups: compression (efficient) codes, error-correcting (noise-resistant) codes and cryptographic codes. Codes designed to compress information are divided, in turn, into lossless and lossy codes. Lossless encoding implies absolutely accurate recovery of the data after decoding and can be used to compress any information. Lossy encoding usually achieves a much higher compression ratio than lossless encoding, but allows some deviation of the decoded data from the original.

Types of compression

All information compression methods can be conditionally divided into two large non-overlapping classes: lossy compression and lossless compression.

Compression without loss of information.

We are primarily interested in these compression methods, since they are used when transferring text documents and programs, when issuing completed work to a customer, or when creating backup copies of information stored on a computer.

The compression methods of this class cannot allow any loss of information, so they are based solely on eliminating its redundancy, and information almost always has redundancy (unless someone has already compressed it before). If there were no redundancy, there would be nothing to compress.

Here is a simple example. The Russian language has 33 letters, ten digits, and about a dozen punctuation marks and other special characters. For text written only in capital Russian letters (as in telegrams and radiograms), sixty different values would suffice. However, each character is usually encoded as a byte, which contains 8 bits and can represent 256 different codes. This is the first reason for redundancy. For our "telegraphic" text, six bits per character would be enough.

Here is another example. The international ASCII character encoding allocates the same number of bits (8) to every character, although it has long been well known that it makes sense to encode the most frequently occurring characters with fewer bits. For example, in Morse code the common letters "E" and "T" are encoded with one symbol each (a dot and a dash, respectively), while rare letters such as "Ю" (..--) and "Ц" (-.-.) are encoded with four symbols. Inefficient encoding is the second reason for redundancy. Programs that compress information can introduce their own encoding (different for different files) and attach to the compressed file a table (dictionary) from which the decompressing program learns how particular characters or groups of characters are encoded in this file. Algorithms based on such recoding of information are called Huffman algorithms.

The presence of repeated fragments is the third reason for redundancy. This is rare in texts, but in tables and graphics repetition of codes is common. For example, if the byte 0 is repeated twenty times in a row, it makes no sense to store twenty zero bytes. Instead, a single zero is stored together with a counter of 20. Algorithms based on detecting such repetitions are called RLE (Run Length Encoding) methods.

Graphic illustrations are especially rich in long sequences of identical bytes: not photographs (they contain a lot of noise, and neighboring points differ noticeably), but images that artists draw with flat color, as in animated films.

Lossy compression.

Lossy compression means that after decompressing the compressed archive, we will get a document that is somewhat different from the one that was at the very beginning. It is clear that the greater the compression ratio, the greater the loss and vice versa.

Of course, such algorithms are inapplicable to text documents, database tables and especially to programs. Minor distortions in plain unformatted text can somehow be tolerated, but distorting even one bit of a program makes it completely inoperable.

At the same time, there are materials in which it is worth sacrificing a few percent of information in order to get tenfold compression. These include photographic illustrations, video footage, and musical compositions. The loss of information during compression and subsequent decompression in such materials is perceived as the appearance of some additional "noise". But since a certain “noise” is still present when creating these materials, its small increase does not always look critical, and the gain in file sizes is huge (10-15 times for music, 20-30 times for photo and video materials).

Lossy compression algorithms include such well-known algorithms as JPEG and MPEG. The JPEG algorithm is used when compressing still images. Graphic files compressed with this method have a JPG extension. MPEG algorithms are used in video and music compression. These files may have different extensions, depending on the specific program, but the most well-known are .mpg for video and .mp3 for music.

Lossy compression algorithms are used only for consumer purposes. This means, for example, that if a photo is transmitted for viewing and music for playback, such algorithms can be applied. If the materials are transferred for further processing, for example for editing, then any loss of information in the source material is unacceptable.

The amount of compression loss tolerable can usually be controlled. This allows you to experiment and achieve the optimal size / quality ratio. In photographic illustrations intended for display on a screen, a loss of 5% of information is usually not critical, and in some cases 20-25% can be tolerated.

Lossless compression algorithms

Shannon-Fano code

For further reasoning, it will be convenient to represent our source text file as a source of characters appearing one at a time at its output. We do not know in advance which character will be next, but we know that the letter "a" will appear with probability p1, the letter "b" with probability p2, and so on.

In the simplest case, we will consider all characters of the text to be independent of each other, i.e. the probability of the next symbol appearing does not depend on the value of the previous symbol. Of course, this is not the case for a meaningful text, but now we are considering a very simplified situation. In this case, the statement "the symbol carries the more information, the less the probability of its occurrence" is true.

Let's imagine a text whose alphabet consists of only 16 letters: А, Б, В, Г, Д, Е, Ж, З, И, К, Л, М, Н, О, П, Р. Each of these characters can be encoded with just 4 bits: from 0000 to 1111. Now imagine that the probabilities of occurrence of these characters are distributed as follows:

The sum of these probabilities is, of course, equal to one. Let us divide these symbols into two groups in such a way that the total probability of the symbols in each group is about 0.5 (Fig.). In our example these are the groups А-В and Г-Р. The circles in the figure denoting groups of symbols are called vertices or nodes, and the construction built from these nodes is a binary tree. Let's assign a code to each node, designating one node with the digit 0 and the other with the digit 1.

Again, let's divide the first group (А-В) into two subgroups so that their total probabilities are as close to each other as possible. Let's append the digit 0 to the code of the first subgroup and the digit 1 to the code of the second.

We will repeat this operation until there is only one symbol left at each vertex of our "tree". The complete tree for our alphabet will have 31 nodes.

The character codes (the rightmost nodes, i.e. the leaves of the tree) have unequal lengths. Thus the letter А, which has probability p = 0.2 in our imaginary text, is encoded with only two bits, while the letter Р (not shown in the figure), which has probability p = 0.013, is encoded with a six-bit combination.

So, the principle is obvious: frequently occurring characters are encoded with fewer bits, rarely occurring ones with more. As a result, the average number of bits per character is equal to

n_avg = n1·p1 + n2·p2 + … + n16·p16,

where ni is the number of bits encoding the i-th character and pi is the probability of occurrence of the i-th character.
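For example, if a four-symbol alphabet has probabilities 0.5, 0.25, 0.15 and 0.10 and its symbols are assigned codes of 1, 2, 3 and 3 bits (illustrative numbers, not the ones from the figure), the average is 0.5·1 + 0.25·2 + 0.15·3 + 0.10·3 = 1.75 bits per character instead of the 2 bits a fixed-length code would need.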

Huffman code.

The Huffman algorithm elegantly implements the general idea of ​​entropy coding using prefix sets and works as follows:

1. We write out in a row all the characters of the alphabet in ascending or descending order of the probability of their occurrence in the text.

2. Successively combine the two symbols with the lowest probabilities of occurrence into a new composite symbol, whose probability of occurrence is taken to be the sum of the probabilities of its constituents. Eventually we build a tree in which each node holds the total probability of all nodes below it.

3. We trace the path to each leaf of the tree, recording the direction at each node (for example, right = 1, left = 0). The resulting sequence gives the code word corresponding to each character (Fig.).

Let's build a code tree for a message with the following alphabet:

Disadvantages of Methods

The biggest difficulty with these codes, as the previous discussion suggests, is the need to have probability tables for each type of data to be compressed. This is not a problem if we know that English or Russian text is being compressed: we simply provide the encoder and decoder with a code tree suitable for English or Russian text. In the general case, when the symbol probabilities of the input data are unknown, static Huffman codes work inefficiently.

The solution to this problem is the statistical analysis of the encoded data, performed during the first pass through the data, and the compilation of a code tree based on it. In this case, the actual encoding is performed by the second pass.

Another disadvantage of these codes is that the minimum code word length cannot be less than one bit, while the entropy of a message may well be 0.1 or 0.01 bits per letter. In this case the code becomes substantially redundant. The problem is solved by applying the algorithm to blocks of characters, but then the encoding/decoding procedure becomes more complicated and the code tree grows considerably, and it must eventually be stored along with the code.

These codes also do not take into account the relationships between characters that are present in almost any text. For example, if in an English text we encounter the letter q, we can say with confidence that the letter u will come after it.

Run Length Encoding (RLE) is one of the oldest and simplest archiving algorithms. Compression in RLE occurs by replacing chains of identical bytes with "counter, value" pairs. ("red, red, ..., red" is written as "N reds").

One of the implementations of the algorithm is as follows: find the least frequently occurring byte, call it a prefix, and replace chains of identical characters with triples "prefix, counter, value". If the prefix byte itself occurs in the source file once or twice in a row, it is replaced by the pair "prefix, 1" or "prefix, 2". The one unused pair, "prefix, 0", can be used as a terminator of the packed data.

When encoding .exe files, you can search for and pack sequences like AxAyAzAwAt..., which are often found in resources (Unicode strings)

The positive aspects of the algorithm are that it does not require additional memory during operation and is fast. The algorithm is used in the PCX, TIFF and BMP formats. An interesting feature of run-length encoding in PCX is that the degree of compression for some images can be significantly increased simply by changing the order of colors in the image palette.

The LZW code (Lempel-Ziv & Welch) is one of the most common lossless compression codes today. It is with the help of the LZW code that compression is carried out in such graphic formats as TIFF and GIF, with the help of LZW modifications, many universal archivers perform their functions. The operation of the algorithm is based on searching in the input file for repeated sequences of characters, which are encoded by combinations of 8 to 12 bits in length. Thus, this algorithm is most effective on text files and on graphic files that have large single-color areas or repeating pixel sequences.

The absence of information loss during LZW encoding led to the widespread use of the TIFF format based on it. This format does not impose any restrictions on the size or color depth of the image and is widely used, for example, in printing. Another LZW-based format, GIF, is more primitive: it can store images with a color depth of no more than 8 bits per pixel. At the beginning of a GIF file there is a palette, a table that establishes a correspondence between a color index (a number in the range from 0 to 255) and a true 24-bit color value.

Lossy compression algorithms

The JPEG algorithm was developed by a group of companies called the Joint Photographic Experts Group. The goal of the project was to create a highly efficient compression standard for both black and white and color images, and this goal was achieved by the developers. Currently, JPEG is widely used where a high degree of compression is required - for example, on the Internet.

Unlike the LZW algorithm, JPEG encoding is lossy. The encoding algorithm itself is based on rather complex mathematics, but in general terms it can be described as follows: the image is divided into squares of 8×8 pixels, and each square is then converted into a sequential chain of 64 pixels. Each such chain is subjected to the so-called DCT transform, one of the varieties of the discrete Fourier transform. The idea is that the input sequence of pixels can be represented as a sum of sine and cosine components with multiple frequencies (so-called harmonics). We only need to know the amplitudes of these components in order to reconstruct the input sequence with sufficient accuracy. The more harmonic components we keep, the smaller the discrepancy between the original and the compressed image. Most JPEG encoders allow you to adjust the degree of compression, and this is achieved very simply: the higher the compression ratio, the fewer harmonics are kept for each 64-pixel block.
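As an illustration of the idea (this is not the JPEG reference implementation, which also quantizes the coefficients and entropy-codes them), here is a small Pascal sketch of a one-dimensional 8-point DCT-II, the building block that is applied to the rows and columns of each 8×8 block; the input samples are made up:

program DctSketch;
const
  N = 8;
var
  Sample: array[0..N - 1] of double;   { one row of an 8x8 block (made-up values) }
  Coeff: array[0..N - 1] of double;    { amplitudes of the harmonics }
  k, j: integer;
  s, alpha: double;
begin
  for j := 0 to N - 1 do
    Sample[j] := 128 + 20 * j;         { a smooth brightness ramp }
  for k := 0 to N - 1 do
  begin
    s := 0;
    for j := 0 to N - 1 do
      s := s + Sample[j] * cos(pi * (2 * j + 1) * k / (2 * N));
    if k = 0 then
      alpha := sqrt(1 / N)
    else
      alpha := sqrt(2 / N);
    Coeff[k] := alpha * s;
    writeln('harmonic ', k, ': ', Coeff[k]:0:2);
  end;
end.

Keeping only the first few of these amplitudes and discarding the rest is exactly the "fewer harmonics per block" knob mentioned above.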

Of course, the strong point of this type of encoding is its large compression ratio while maintaining the original color depth. It is this property that has led to its widespread use on the Internet, where reducing file size is of paramount importance, and in multimedia encyclopedias, where as many graphics as possible must be stored in a limited amount of space.

The negative property of this format is the inherent degradation of image quality that cannot be eliminated by any means. It is this sad fact that does not allow it to be used in printing, where quality is at the forefront.

However, the JPEG format is not the limit of perfection in the effort to reduce the size of the final file. Recently, intensive research has been carried out in the field of the so-called wavelet transform. Based on complex mathematical principles, wavelet encoders make it possible to obtain more compression than JPEG with less loss of information. Despite the complexity of the mathematics, a software implementation of the wavelet transform is simpler than JPEG. Although wavelet compression algorithms are still at an early stage of development, they have a great future ahead of them.

Fractal compression

Fractal image compression is a lossy image compression algorithm based on applying systems of iterated functions (IFS, usually affine transformations) to images. The algorithm is known for the fact that in some cases it allows very high compression ratios to be obtained (the best examples reach up to 1000 times with acceptable visual quality) for real photographs of natural objects, which is not achievable in principle for other image compression algorithms. Because of a difficult patent situation, the algorithm is not widely used.

Fractal archiving is based on the fact that with the help of the coefficients of the system of iterated functions, the image is represented in a more compact form. Before looking at the archiving process, let's look at how IFS builds an image.

Strictly speaking, IFS is a set of three-dimensional affine transformations that transform one image into another. Points in three-dimensional space (x-coordinate, y-coordinate, brightness) are subjected to transformation.

The basis of the fractal encoding method is the detection of self-similar regions in an image. The possibility of applying the theory of iterated function systems (IFS) to the problem of image compression was first explored by Michael Barnsley and Alan Sloan. They patented their idea in 1990 and 1991. Jacquin introduced a fractal coding method that uses a system of domain and range subimage blocks, square-shaped blocks covering the entire image. This approach became the basis for most of the fractal encoding methods used today. It has been improved by Yuval Fisher and a number of other researchers.

In accordance with this method, the image is divided into a set of non-overlapping rank subimages (range subimages) and a set of overlapping domain subimages (domain subimages) is determined. For each rank block, the encoding algorithm finds the most appropriate domain block and an affine transformation that translates this domain block into the given rank block. The structure of the image is mapped into a system of rank blocks, domain blocks and transformations.

The idea is this: suppose that the original image is a fixed point of some contraction mapping. Then, instead of the image itself, this mapping can be remembered in some way, and to restore it, it is enough to repeatedly apply this mapping to any starting image.

By Banach's fixed-point theorem, such iterations always converge to the fixed point, that is, to the original image. In practice, the whole difficulty lies in finding the most suitable contraction mapping for the image and in storing it compactly. Typically, the mapping-search algorithms (i.e., the compression algorithms) involve heavy brute force and are computationally expensive. At the same time, the recovery algorithms are quite efficient and fast.

Briefly, the method proposed by Barnsley can be described as follows. The image is encoded by several simple transformations (in our case, affine), that is, it is determined by the coefficients of these transformations (in our case, A, B, C, D, E, F).

For example, the image of the Koch curve can be encoded with four affine transformations, so it is uniquely defined by only 24 coefficients.

As a result, the point will necessarily end up somewhere inside the black area of the original image. Repeating this operation many times, we fill the entire black area, thereby restoring the picture.

The two best-known IFS images are the Sierpinski triangle and the Barnsley fern. The first is defined by three and the second by five affine transformations (or, in our terminology, lenses). Each transformation is specified by literally a few bytes, while the image built with their help can occupy several megabytes.
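To see how an IFS restores an image from a handful of coefficients, here is a minimal Pascal sketch (an illustration, not an archiver) of the classic "chaos game" for the Sierpinski triangle: each of its three affine transformations moves the current point halfway toward one of the triangle's vertices, and applying them in random order fills in the attractor:

program ChaosGame;
const
  VX: array[0..2] of double = (0.0, 1.0, 0.5);   { x-coordinates of the vertices }
  VY: array[0..2] of double = (0.0, 0.0, 1.0);   { y-coordinates of the vertices }
var
  x, y: double;
  i, v: integer;
begin
  randomize;
  x := 0.3;                        { an arbitrary starting point }
  y := 0.3;
  for i := 1 to 10000 do
  begin
    v := random(3);                { pick one of the three transformations at random }
    x := (x + VX[v]) / 2;          { the affine map: halfway toward vertex v }
    y := (y + VY[v]) / 2;
    if i > 20 then                 { skip a few transient points }
      writeln(x:0:4, ' ', y:0:4);  { these coordinates can be plotted or rasterized }
  end;
end.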

It becomes clear how the archiver works and why it takes so much time. In fact, fractal compression is a search for self-similar areas in an image and determining the parameters of affine transformations for them.

In the worst case, if an optimizing algorithm is not applied, it will be necessary to enumerate and compare all possible image fragments of different sizes. Even for small images, taking into account discreteness, we will get an astronomical number of options to be sorted out. Even a sharp narrowing of the transformation classes, for example, by scaling only a certain number of times, will not allow us to achieve an acceptable time. In addition, the quality of the image is lost. The vast majority of research in the field of fractal compression is now aimed at reducing the archiving time required to obtain a high-quality image.

For the fractal compression algorithm, as for other lossy compression algorithms, mechanisms for controlling the degree of compression and the degree of loss are very important. By now a fairly large set of such methods has been developed. First, the number of transformations can be limited, ensuring that the compression ratio is not lower than a fixed value. Second, it can be required that, when the difference between the fragment being processed and its best approximation is above a certain threshold, the fragment be subdivided (several lenses are assigned to it). Third, fragments smaller than, say, four points can be forbidden from being subdivided further. By changing the threshold values and the priority of these conditions, the compression ratio can be controlled very flexibly: from bit-for-bit matching to any degree of compression.

Comparison with JPEG

Today, the most common graphics archiving algorithm is JPEG. Compare it with fractal compression.

First, note that both algorithms operate on 8-bit (grayscale) and 24-bit full-color images. Both are lossy compression algorithms and provide close archiving ratios. Both the fractal algorithm and JPEG have the ability to increase the compression ratio by increasing the loss. In addition, both algorithms parallelize very well.

The differences start when we consider the time it takes the algorithms to archive/unarchive. So, the fractal algorithm compresses hundreds and even thousands of times longer than JPEG. Unpacking the image, on the contrary, will happen 5-10 times faster. Therefore, if the image is compressed only once, but transmitted over the network and decompressed many times, then it is more profitable to use the fractal algorithm.

JPEG uses cosine decomposition of the image, so the loss in it (even with a given minimum loss) appears in waves and halos at the border of sharp color transitions. It is for this effect that they do not like to use it when compressing images that are prepared for high-quality printing: there this effect can become very noticeable.

The fractal algorithm is free from this shortcoming. Moreover, each time you print an image, you have to perform a scaling operation, since the raster (or lineature) of the printing device does not match the raster of the image. During the conversion, several unpleasant effects can also occur, which can be dealt with either by scaling the image programmatically (for cheap printing devices such as conventional laser and inkjet printers), or by supplying the printing device with its own processor, hard drive and a set of image processing programs (for expensive phototypesetting machines). As you might guess, when using the fractal algorithm, such problems practically do not arise.

The displacement of JPEG by the fractal algorithm in widespread use will not happen soon (if only because of the latter's low archiving speed), but in multimedia applications and computer games its use is justified.

ARCHIVERS

Information compression is the process of transforming the information stored in a file by reducing data redundancy. The purpose of this process is to reduce the volume occupied by data.

An archive file is a specially created file containing one or more compressed files.

Compression ratio: Kc = Vc / Vo · 100%,

where Kc is the compression ratio, Vc is the size of the compressed file, and Vo is the original size of the file.
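For example, for the file from the beginning of the lecture (1 megabyte, i.e. about 1024 kB, compressed to 600 kB) the ratio is Kc = 600 / 1024 · 100% ≈ 59%.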

The compression ratio depends on:

1) the program used - the archiver,

2) compression method,

3) type of source file: text, graphic, video, audio, etc.

Programs that package and unpack files are called archivers. The most common are: ARJ, ZIP, RAR. The extension of the archive files matches the name of the archiver used to create them.

Archivers allow you to create self-extracting archive files, i.e. to unpack them you do not need to run the archiver program, because they contain the unpacking program themselves. Such archives are called SFX archives (SelF-eXtracting). The extension of such files is *.EXE.


Information compression principles

1. Any text contains repeated characters. Instead of the repetitions, a single character and the number of its repetitions can be stored. The efficiency of this algorithm is even higher for graphic files: looking at the monitor, you can see a lot of repeated dots of the same color. The PCX graphic file format is based on this principle of information compression. Modern archivers single out not only repeated characters but also chains of characters and individual words.

2. If not all characters of the computer alphabet are used in a text, they can be encoded with fewer than 8 bits (one byte). This principle is used in the telegraph, where only capital Russian letters are used: 5 bits are enough to represent them, which allows three characters to be written in two bytes.

3. The next principle exploits the regularity that letters occur in text with different frequencies. For example, in this text the space is the most common character, and the characters "а" and "и" are very common. These frequently occurring characters can be represented by short combinations of bits, while the remaining characters are encoded by longer sequences. For instance:
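One possible assignment (purely illustrative): the space is encoded as 0, "а" as 10, "и" as 110, and every remaining character as a longer combination beginning with 111. Such a code is prefix-free, so the boundaries between characters can be recovered unambiguously.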

4. Physically, the PC allocates disk space for files in clusters, blocks of 4 kB; it is impossible to allocate less. For example, if a file is 8193 bytes long (8 kB and 1 byte), it will physically occupy 12 kB, or 12288 bytes (three clusters). Combining a group of files into one archive allows you to save on these leftovers. Packing many small files gives big savings.

In total, with separate placement of the files, 6 kB are unused, which is 100% of the files' contents. In the second case, 2 kB, or 33%, remains unused.


Zip archiver

Packing files: pkzip [switches] <archive name> [file paths]

Switches: -rp archive subdirectories as well, preserving the directory structure

-s PWD protect the archive with the password PWD

-a add files to the archive

-m move files to the archive

-v view the contents of the archive

If you are archiving all the files in the directory, then you must specify the mask *.*

Unpacking files: pkunzip [switches] <archive name> [file names]

Switches: -d unpack subdirectories as well, preserving the directory structure

-s PWD archive password (PWD)


arj archiver

arj <command> [switches] <archive name> [file names]

For the arj archiver, a single program performs both packing and unpacking operations.

Commands: a add files to an archive (pack)

e unpack without preserving the directory structure

x unpack, preserving the directory structure

l view the contents of an archive

m move files to an archive

d delete files from an archive

Switches: -r pack subdirectories as well, preserving the directory structure

-v[vol] split the archive into volumes of size vol (if specified)

the size for standard floppy disks (360, 720, 1200, 1440) is given in kilobytes, the size for non-standard floppy disks is given in bytes

-v must also be specified when unpacking a multi-volume archive

-g PWD archive password (PWD)

Packing files

Unpacking files


GORKOFF February 24, 2015 at 11:41 am

Data compression methods


My supervisor and I are preparing a short monograph on image processing. I decided to offer the Habr community a chapter on image compression algorithms for review. Since it is hard to fit a whole chapter into one post, I decided to break it into three posts:
1. Data compression methods;
2. Lossless image compression;
3. Lossy image compression.
You can read the first post in the series below.

At the moment, there are a large number of lossless compression algorithms, which can be conditionally divided into two large groups:
1. Stream and dictionary algorithms. This group includes algorithms of the families RLE (run-length encoding), LZ*, etc. A feature of all the algorithms of this group is that the encoding uses not information about the frequency of symbols in the message, but information about the sequences encountered earlier.
2. Algorithms for statistical (entropy) compression. This group of algorithms compresses information by exploiting the uneven frequencies with which different characters occur in a message. Algorithms of this group include arithmetic and prefix coding algorithms (using Shannon-Fano, Huffman, or secant trees).
Information transformation algorithms can be singled out as a separate group. Algorithms of this group do not directly compress information, but their application greatly simplifies further compression using streaming, dictionary and entropy algorithms.

Stream and Dictionary Algorithms

Run length coding

Run-Length Encoding (RLE) is one of the simplest and most common data compression algorithms. In this algorithm, a sequence of repeated characters is replaced by a character and the number of times it repeats.
For example, the string "AAAAA", which requires 5 bytes to store (assuming a byte is allocated for storing one character), can be replaced by "5A", consisting of two bytes. Obviously, this algorithm is more efficient, the longer the series of repetitions.

The main disadvantage of this algorithm is its extremely low efficiency on sequences of non-repeating characters. For example, if we consider the sequence "ABABAB" (6 bytes), then after applying the RLE algorithm, it will turn into "1A1B1A1B1A1B" (12 bytes). There are various methods to solve the problem of non-repeating characters.

The simplest method is the following modification: the byte encoding the number of repetitions must store information not only about the number of repetitions, but also about their presence. If the first bit is 1, then the next 7 bits indicate the number of repetitions of the corresponding character, and if the first bit is 0, then the next 7 bits indicate the number of characters to be taken without repetition. If we encode "ABABAB" using this modification, we get "-6ABABAB" (7 bytes). Obviously, the proposed technique can significantly improve the efficiency of the RLE algorithm on non-repeating character sequences. The implementation of the proposed approach is shown in Listing 1:

type
  TRLEEncodedString = array of byte;

function RLEEncode(InMsg: ShortString): TRLEEncodedString;
var
  MatchFl: boolean;
  MatchCount: shortint;
  EncodedString: TRLEEncodedString;
  N, i: word;
begin
  N := 0;
  SetLength(EncodedString, 2 * length(InMsg));
  while length(InMsg) >= 1 do
  begin
    { MatchFl = true means the head of the message is a run of equal characters }
    MatchFl := (length(InMsg) > 1) and (InMsg[1] = InMsg[2]);
    MatchCount := 1;
    while (MatchCount <= 126) and (MatchCount < length(InMsg)) and
          ((InMsg[MatchCount] = InMsg[MatchCount + 1]) = MatchFl) do
      MatchCount := MatchCount + 1;
    if MatchFl then
    begin
      { a run: control byte = count + 128, followed by the repeated character }
      N := N + 2;
      EncodedString[N - 2] := MatchCount + 128;
      EncodedString[N - 1] := ord(InMsg[1]);
    end
    else
    begin
      { a block of non-repeating characters: control byte = 128 - count, followed by the characters }
      if MatchCount <> length(InMsg) then
        MatchCount := MatchCount - 1;
      N := N + 1 + MatchCount;
      EncodedString[N - 1 - MatchCount] := -MatchCount + 128;
      for i := 1 to MatchCount do
        EncodedString[N - 1 - MatchCount + i] := ord(InMsg[i]);
    end;
    delete(InMsg, 1, MatchCount);
  end;
  SetLength(EncodedString, N);
  RLEEncode := EncodedString;
end;

Decoding a compressed message is very simple and comes down to a single pass through the compressed message, see Listing 2:
type
  TRLEEncodedString = array of byte;

function RLEDecode(InMsg: TRLEEncodedString): ShortString;
var
  RepeatCount: shortint;
  i, j: word;
  OutMsg: ShortString;
begin
  OutMsg := '';
  i := 0;
  while i < length(InMsg) do
  begin
    RepeatCount := InMsg[i] - 128;
    i := i + 1;
    if RepeatCount < 0 then
    begin
      { a block of non-repeating characters: copy |RepeatCount| characters as they are }
      RepeatCount := abs(RepeatCount);
      for j := i to i + RepeatCount - 1 do
        OutMsg := OutMsg + chr(InMsg[j]);
      i := i + RepeatCount;
    end
    else
    begin
      { a run: repeat the next character RepeatCount times }
      for j := 1 to RepeatCount do
        OutMsg := OutMsg + chr(InMsg[i]);
      i := i + 1;
    end;
  end;
  RLEDecode := OutMsg;
end;

The second method of improving the efficiency of the RLE algorithm is to use information transformation algorithms that do not compress the data directly, but convert it into a form that is more convenient for compression. As an example of such an algorithm we will consider the Burrows-Wheeler transform (BWT), a permutation named after its inventors. This permutation does not change the characters themselves, it only changes their order in the string; after the permutation, repeated substrings are collected into dense groups that are compressed much better by the RLE algorithm. The direct BWT transform comes down to the following sequence of steps:
1. Adding to the source string a special end-of-line character that does not occur anywhere else;
2. Getting all cyclic permutations of the original string;
3. Sorting the received strings in lexicographic order;
4. Returning the last column of the resulting matrix.
The implementation of this algorithm is shown in Listing 3.
const
  EOMsg = '|';

function BWTEncode(InMsg: ShortString): ShortString;
var
  OutMsg: ShortString;
  LastChar: AnsiChar;
  ShiftTable: array of ShortString;
  N, i: word;
begin
  InMsg := InMsg + EOMsg;
  N := length(InMsg);
  SetLength(ShiftTable, N + 1);    { index 0 is left unused }
  ShiftTable[1] := InMsg;
  for i := 2 to N do
  begin
    { build the next cyclic shift of the string }
    LastChar := InMsg[N];
    InMsg := LastChar + copy(InMsg, 1, N - 1);
    ShiftTable[i] := InMsg;
  end;
  Sort(ShiftTable);                { lexicographic sort of the rotations; Sort is assumed to be provided }
  OutMsg := '';
  for i := 1 to N do
    OutMsg := OutMsg + ShiftTable[i][N];
  BWTEncode := OutMsg;
end;

The easiest way to explain this transformation is with a specific example. Let's take the string "ANANAS" (the Russian word for pineapple) and agree that the symbol "|" marks the end of the string. All cyclic permutations of this string and the result of their lexicographic sorting are given in Table 1.

That is, the result of the direct transform is the string "|NNAAAS". It is easy to see that this string is compressed by the RLE algorithm much better than the original one, because it contains long subsequences of repeated letters.
A similar effect can be achieved with other transformations, but the advantage of BWT is that it is reversible; the reverse transformation, however, is more complicated than the direct one. To restore the original string, the following steps must be performed:
Create a table of n empty rows, where n is the number of characters in the encoded message;
Prepend the encoded message to the table as a new column (the i-th character goes in front of the i-th row);
Sort the rows of the table in lexicographic order;
Repeat steps 2-3 until every row contains n characters;
Return the row that ends with the end-of-line character.

The implementation of the inverse transformation is not difficult at first glance, and one of the implementations is shown in Listing 4.

const
  EOMsg = '|';

function BWTDecode(InMsg: ShortString): ShortString;
var
  OutMsg: ShortString;
  ShiftTable: array of ShortString;
  N, i, j: word;
begin
  OutMsg := '';
  N := length(InMsg);
  SetLength(ShiftTable, N + 1);    { index 0 is left unused }
  for i := 0 to N do
    ShiftTable[i] := '';
  for i := 1 to N do
  begin
    { prepend the encoded message as a new column, then sort the rows }
    for j := 1 to N do
      ShiftTable[j] := InMsg[j] + ShiftTable[j];
    Sort(ShiftTable);              { lexicographic sort; Sort is assumed to be provided }
  end;
  for i := 1 to N do
    if ShiftTable[i][N] = EOMsg then
      OutMsg := ShiftTable[i];
  delete(OutMsg, N, 1);            { drop the end-of-line marker }
  BWTDecode := OutMsg;
end;

But in practice, the efficiency depends on the selected sorting algorithm. Trivial algorithms with quadratic complexity will obviously have an extremely negative impact on performance, so it is recommended to use efficient algorithms.

After the table obtained at the seventh step has been sorted, we need to select from it the row that ends with the symbol "|". It is easy to see that this is the only such row. Thus, we have walked through the BWT transform on a concrete example.

Summing up, we can say that the main advantage of the RLE algorithm group is simplicity and speed of operation (including decoding speed), and the main disadvantage is inefficiency on non-repeating character sets. The use of special permutations increases the efficiency of the algorithm, but also greatly increases the running time (especially decoding).

Dictionary compression (LZ algorithms)

The group of dictionary algorithms, in contrast to the algorithms of the RLE group, encodes not the number of repetitions of characters, but sequences of characters encountered earlier. During the operation of the algorithms under consideration, a table is dynamically created with a list of already encountered sequences and their corresponding codes. This table is often called a dictionary, and the corresponding group of algorithms is called dictionary.

The simplest version of the dictionary algorithm is described below:
Initialize the dictionary with all characters found in the input string;
Find in the dictionary the longest sequence (S) that matches the beginning of the encoded message;
Issue the code of the found sequence and remove it from the beginning of the encoded message;
If the end of the message is not reached, read the next character and add Sc to the dictionary, go to step 2. Otherwise, exit.

For example, a newly initialized dictionary for the phrase "CUCKOOCOOKOOHOOD" is shown in Table 3:

During the compression process, the dictionary will be supplemented by the sequences encountered in the message. The process of updating the dictionary is given in Table 4.
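As a small illustration of the same process, here is a trace for the short made-up string "ABABA" (not the original phrase), assuming the dictionary is initialized with its two characters, "A" = 0 and "B" = 1:

Step 1: the longest known prefix is "A"; output 0; add "AB" with code 2.
Step 2: the longest known prefix is "B"; output 1; add "BA" with code 3.
Step 3: the longest known prefix is "AB"; output 2; add "ABA" with code 4.
Step 4: the longest known prefix is "A"; output 0; the end of the message is reached.

The result is the code sequence 0, 1, 2, 0.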

When describing the algorithm, the description of the situation when the dictionary is completely filled was deliberately omitted. Depending on the variant of the algorithm, different behavior is possible: complete or partial clearing of the dictionary, termination of filling the dictionary, or expansion of the dictionary with a corresponding increase in code capacity. Each of these approaches has certain drawbacks. For example, stopping dictionary replenishment can lead to a situation where the dictionary stores sequences that occur at the beginning of the compressed string, but do not occur later. At the same time, clearing the dictionary may result in the removal of frequent sequences. Most of the implementations used when filling the dictionary begin to track the degree of compression, and when it decreases below a certain level, the dictionary is rebuilt. Next, we will consider the simplest implementation that stops replenishing the dictionary when it is full.

First, we define a dictionary as a record that stores not only the substrings encountered, but also the number of substrings stored in the dictionary:
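One possible definition, consistent with how the listings below use it, is given here (the exact field types, as well as the TEncodedString type used later, are assumptions; the InitDict procedure called in the listings is assumed to pre-fill Words with the single-character strings of the alphabet, in the same agreed order on the encoding and the decoding side):

type
  TEncodedString = array of byte;   { assumed: the sequence of output codes }

  TDictionary = record
    WordCount: integer;             { how many entries of Words are in use }
    Words: array of ShortString;    { Words[i] holds the sequence whose code is i }
  end;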

The previously encountered subsequences are stored in the Words array, and a subsequence's code is simply its index in this array.
We also define the dictionary search and add to dictionary functions:

const
  MAX_DICT_LENGTH = 256;

function FindInDict(D: TDictionary; str: ShortString): integer;
var
  r: integer;
  i: integer;
  fl: boolean;
begin
  r := -1;
  fl := false;
  if D.WordCount > 0 then
  begin
    { scan the dictionary from the end towards the beginning }
    i := D.WordCount;
    while (not fl) and (i > 0) do
    begin
      i := i - 1;
      fl := D.Words[i] = str;
    end;
  end;
  if fl then
    r := i;
  FindInDict := r;
end;

procedure AddToDict(var D: TDictionary; str: ShortString);
begin
  if D.WordCount < MAX_DICT_LENGTH then
  begin
    D.WordCount := D.WordCount + 1;
    SetLength(D.Words, D.WordCount);
    D.Words[D.WordCount - 1] := str;
  end;
end;

Using these functions, the encoding process according to the described algorithm can be implemented as follows:
function LZWEncode(InMsg: ShortString): TEncodedString;
var
  OutMsg: TEncodedString;
  tmpstr: ShortString;
  D: TDictionary;
  N: word;
begin
  SetLength(OutMsg, length(InMsg));
  N := 0;
  InitDict(D);
  while length(InMsg) > 0 do
  begin
    { find the longest prefix of the message that is already in the dictionary }
    tmpstr := InMsg[1];
    while (FindInDict(D, tmpstr) >= 0) and (length(InMsg) > length(tmpstr)) do
      tmpstr := tmpstr + InMsg[length(tmpstr) + 1];
    if FindInDict(D, tmpstr) < 0 then
      delete(tmpstr, length(tmpstr), 1);
    { output the code of the found sequence }
    OutMsg[N] := FindInDict(D, tmpstr);
    N := N + 1;
    delete(InMsg, 1, length(tmpstr));
    { extend the dictionary with the found sequence plus the next character }
    if length(InMsg) > 0 then
      AddToDict(D, tmpstr + InMsg[1]);
  end;
  SetLength(OutMsg, N);
  LZWEncode := OutMsg;
end;

The result of encoding will be the numbers of words in the dictionary.
The decoding process is reduced to direct decoding of codes, and there is no need to transfer the created dictionary, it is enough that the dictionary be initialized during decoding in the same way as during encoding. Then the dictionary will be completely restored directly in the decoding process by concatenating the previous subsequence and the current character.

A problem can arise only in the following situation: when we need to decode a subsequence that is not yet in the dictionary. It is easy to see that this is possible only when we need to extract the substring that would be added at the current step. This means the substring matches the pattern cSc, i.e. it starts and ends with the same character, where cS is the substring added at the previous step. This is the only situation in which a string that has not yet been added has to be decoded. Given the above, we can propose the following variant of decoding a compressed string:

function LZWDecode(InMsg: TEncodedString): ShortString;
var
  D: TDictionary;
  OutMsg, tmpstr: ShortString;
  i: word;
begin
  OutMsg := '';
  tmpstr := '';
  InitDict(D);
  for i := 0 to length(InMsg) - 1 do
  begin
    if InMsg[i] >= D.WordCount then
      { the cSc case: the code is not in the dictionary yet }
      tmpstr := D.Words[InMsg[i - 1]] + D.Words[InMsg[i - 1]][1]
    else
      tmpstr := D.Words[InMsg[i]];
    OutMsg := OutMsg + tmpstr;
    if i > 0 then
      AddToDict(D, D.Words[InMsg[i - 1]] + tmpstr[1]);
  end;
  LZWDecode := OutMsg;
end;

The advantages of dictionary algorithms include their greater compression efficiency compared to RLE. Nevertheless, it should be understood that the actual use of these algorithms is associated with some implementation difficulties.

Entropy coding

Encoding with Shannon-Fano Trees

The Shannon-Fano algorithm is one of the first developed compression algorithms. The algorithm is based on the idea of ​​representing more frequent characters with shorter codes. In this case, the codes obtained using the Shannon-Fano algorithm have the prefix property: i.e. no code is the beginning of any other code. The prefix property ensures that the encoding is one-to-one. The algorithm for constructing Shannon-Fano codes is presented below:
1. Divide the alphabet into two parts, the total probabilities of symbols in which are as close as possible to each other.
2. Add 0 to the prefix code of the first part of symbols, add 1 to the prefix code of the second part of symbols.
3. For each part (which has at least two characters), recursively perform steps 1-3.
Despite its relative simplicity, the Shannon-Fano algorithm is not without drawbacks, the most significant of which is non-optimal coding. Although the partitioning at each step is optimal, the algorithm does not guarantee an optimal result as a whole. Consider, for example, the following string: "AAAABVGDEZH" (ten characters: four "A"s and six characters that each occur once). The corresponding Shannon-Fano tree and the codes derived from it are shown in Fig. 1:

Without encoding, the message takes 40 bits (assuming each character is encoded with 4 bits), while with the Shannon-Fano algorithm it takes 4·2 + 2 + 4 + 4 + 3 + 3 + 3 = 27 bits. The message size decreased by 32.5%, but it will be shown below that this result can be significantly improved.
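A minimal Pascal sketch of this construction is shown below. The symbols are assumed to be pre-sorted by frequency in descending order, and the frequencies (4, 1, 1, 1, 1, 1, 1) are an assumption matching the message above: one symbol occurring four times and six occurring once. For these frequencies the resulting code lengths add up to the same 27 bits as in the calculation above.

program ShannonFanoSketch;
const
  NSym = 7;
  Freq: array[1..NSym] of integer = (4, 1, 1, 1, 1, 1, 1);  { assumed frequencies }
var
  Code: array[1..NSym] of string;
  i: integer;

{ recursively split the range [lo..hi] so that the two halves
  have total weights as close to each other as possible }
procedure Split(lo, hi: integer);
var
  total, acc, best, bestDiff, j: integer;
begin
  if lo >= hi then
    exit;
  total := 0;
  for j := lo to hi do
    total := total + Freq[j];
  acc := 0;
  best := lo;
  bestDiff := total;
  for j := lo to hi - 1 do
  begin
    acc := acc + Freq[j];
    if abs(total - 2 * acc) < bestDiff then
    begin
      bestDiff := abs(total - 2 * acc);
      best := j;
    end;
  end;
  for j := lo to best do
    Code[j] := Code[j] + '0';      { the first half gets 0 appended }
  for j := best + 1 to hi do
    Code[j] := Code[j] + '1';      { the second half gets 1 appended }
  Split(lo, best);
  Split(best + 1, hi);
end;

begin
  for i := 1 to NSym do
    Code[i] := '';
  Split(1, NSym);
  for i := 1 to NSym do
    writeln('symbol ', i, ' (frequency ', Freq[i], '): ', Code[i]);
end.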

Encoding with Huffman Trees

The Huffman coding algorithm, developed a few years after the Shannon-Fano algorithm, also has the prefix property and, in addition, provably minimal redundancy; this explains its extremely wide use. Huffman codes are obtained with the following algorithm:
1. All symbols of the alphabet are represented as free nodes, with the weight of the node proportional to the frequency of the symbol in the message;
2. From the set of free nodes, two nodes with a minimum weight are selected and a new (parent) node is created with a weight equal to the sum of the weights of the selected nodes;
3. The selected nodes are removed from the free list, and the parent node created on their basis is added to this list;
4. Steps 2-3 are repeated as long as more than one node remains in the free list;
5. Based on the constructed tree, each character of the alphabet is assigned a prefix code;
6. The message is encoded with the received codes.

Consider the same example as in the case of the Shannon-Fano algorithm. The Huffman tree and codes received for the message "AAAABVGDEZH" are shown in Fig. 2:

It is easy to calculate that the size of the encoded message will be 26 bits, which is less than with the Shannon-Fano algorithm. Separately, it should be noted that, because of the popularity of the Huffman algorithm, there are at the moment many variants of Huffman coding, including adaptive coding, which does not require transmitting the symbol frequencies.
Among the disadvantages of the Huffman algorithm, a significant part are the problems associated with the complexity of implementation. Using real variables to store the symbol frequencies leads to a loss of accuracy, so in practice integer variables are often used; but since the weights of parent nodes grow constantly, sooner or later an overflow occurs. Thus, despite the simplicity of the algorithm, its correct implementation can still cause some difficulties, especially for large alphabets.
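Below is a minimal Pascal sketch of this procedure that computes only the code lengths, using integer weights as suggested above. The frequencies are again an assumption matching the message "AAAABVGDEZH"; tie-breaking may differ from the figure, but the total encoded length comes out the same:

program HuffmanSketch;
const
  NSym = 7;
  Freq: array[1..NSym] of integer = (4, 1, 1, 1, 1, 1, 1);  { assumed frequencies }
var
  Weight, Parent: array[1..2 * NSym - 1] of integer;
  Alive: array[1..2 * NSym - 1] of boolean;
  CodeLen: array[1..NSym] of integer;
  i, j, NodeCount, m1, m2, Total: integer;
begin
  NodeCount := NSym;
  for i := 1 to NSym do
  begin
    Weight[i] := Freq[i];
    Parent[i] := 0;
    Alive[i] := true;
  end;
  { repeatedly merge the two free (alive) nodes with minimal weight }
  while NodeCount < 2 * NSym - 1 do
  begin
    m1 := 0;
    m2 := 0;
    for i := 1 to NodeCount do
      if Alive[i] then
        if (m1 = 0) or (Weight[i] < Weight[m1]) then
        begin
          m2 := m1;
          m1 := i;
        end
        else if (m2 = 0) or (Weight[i] < Weight[m2]) then
          m2 := i;
    NodeCount := NodeCount + 1;
    Weight[NodeCount] := Weight[m1] + Weight[m2];
    Parent[NodeCount] := 0;
    Alive[NodeCount] := true;
    Parent[m1] := NodeCount;
    Parent[m2] := NodeCount;
    Alive[m1] := false;
    Alive[m2] := false;
  end;
  { the code length of a symbol equals the depth of its leaf }
  Total := 0;
  for i := 1 to NSym do
  begin
    CodeLen[i] := 0;
    j := i;
    while Parent[j] <> 0 do
    begin
      CodeLen[i] := CodeLen[i] + 1;
      j := Parent[j];
    end;
    Total := Total + Freq[i] * CodeLen[i];
    writeln('symbol ', i, ': ', CodeLen[i], ' bits');
  end;
  writeln('encoded message length: ', Total, ' bits');  { 26 bits for these frequencies }
end.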

Encoding with secant trees

Coding using secant functions is an algorithm developed by the authors that allows obtaining prefix codes. The algorithm is based on the idea of ​​building a tree, each node of which contains a secant function. To describe the algorithm in more detail, it is necessary to introduce several definitions.
A word is an ordered sequence of m bits (the number m is called the word length).
A literal of a secant is a (bit, value) pair. For example, the literal (4, 1) means that bit 4 of the word must be equal to 1. If the condition of the literal holds, the literal is considered true, otherwise it is false.
A k-bit secant is a set of k literals. If all literals are true, then the secant function itself is true, otherwise it is false.

The tree is built so that each node divides the alphabet into parts that are as close as possible. Fig. 3 shows an example of a secant tree:

The tree of secant functions in the general case does not guarantee optimal coding, but it provides an extremely high speed of operation due to the simplicity of the operation at the nodes.

Arithmetic coding

Arithmetic coding is one of the most effective ways to compress information. Unlike the Huffman algorithm, arithmetic coding makes it possible to encode messages with an entropy of less than 1 bit per symbol. Because most arithmetic coding algorithms are protected by patents, only the main ideas are described below.
Let's assume that the alphabet used contains N characters a_1, ..., a_N with frequencies p_1, ..., p_N, respectively. Then the arithmetic coding algorithm looks like this:
As the working half-interval, take [0; 1).
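In outline, the classic scheme (a textbook sketch, not the author's original continuation) proceeds as follows: the current half-interval is divided into N parts whose lengths are proportional to the frequencies p_1, ..., p_N, taken in a fixed order of the symbols; the part corresponding to the next symbol of the message becomes the new working half-interval; this is repeated for every symbol of the message; finally, any number from the resulting half-interval (together with the message length) is output and uniquely identifies the whole message.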