Finite Context Modelling

Dario Phong/Hugi


This is an introduction to finite context modelling, which discusses most of the topics needed to understand ppm variants. However, this article talks about some topics which are used in other modelling techniques, and which may be useful for any programmer interested in compression. I personally recommend to fully read this article, as it includes some basic knowledge, like measures of compression, a brief introduction to entropy, why we can't have magic compressors, and some definitions that are widely used.

The need for compression

Nowadays there are bigger hard disks than ever before. However, I doubt that someone would reject to have twice the space on them. And that's where compression does its job. Having more disk space is not the only usage of compression, though. Communications via Internet are slow, and therefore if we can reduce the amount of data to send, we are going to need less time to send it. Therefore we are going to reduce the cost of transmission. So in fact data compression can help us to get much more space from our hard disk or greatly reduce our phone bills, but compression is not magic. It's a science that has its roots in a branch of mathematics, information theory, which discusses data storage and communication. Information Theory was founded around 1940 by Claude Shannon. It has theoretical limits which can't be surpassed and theories which can't be ignored. So don't expect to find the magic compression algorithm which codes all the files to a given length. Also don't believe anyone who tells you that he is capable of doing so.

Among all the compression algorithms, finite context modelling is widely used, and both the state-of-art and fastest compressors rely on it. Ppm variants are the most known application of finite context modelling. Lzp is based on context modelling for selecting one offset. Bwt is also based on it. Not to mention simple models for widely used coders, like Huffman or arithmetic coding. Such an important kind of model should be learned from the roots. This should let you understand current algorithms or papers about them, and in case you want to, you may be able to do improvements on it by researching.

Measuring compression

There are different ways to know the compression achieved by a given algorithm between a file and a compressed version of it. You can do it with %, (compressed_length / original_length) * 100, for example if we had a file with a size of 400 bytes and compress it down to 100 bytes, the ratio will be (100/400)*100 = 25%, so the output compressed file is only 25% of the original. However, there's another method: (1-(compressed_length / original_length)) * 100. In this case the ratio is 75%, meaning that we have subtracted 75% of the original file. In both cases the compression is the same, but the ratios are different, thus this is not a satisfactory way of measuring compression, as we'll need to say the formula used.

In advertisements usually a measure is used which doesn't have this problem, it's the following one: 2:1, meaning that the file has been compressed to the half, or 4:1, meaning that we have compressed it to one quarter. However, while we can have no doubt understanding this way of measuring, it still has a problem, it has low precision. How do we write that a file has been compressed to 40%? Thus this is neither satisfactory.

Fortunately there's one which does not present any of those problems: Bits Per Byte (bpb). The formula is quite easy: (compressed_length / original_length) * 8. In the previous example (100/400)*8 = 2bpb, meaning that we need only 2 bits to represent one byte. It's also accurate enough: (47/134)*8 = 2.805970149254 bpb, usually we only get three digits, like 2.806 bpb. Note that when we know the expected compression of a given algorithm we can know the supposed length of the output: (input_length / 8) * bpb = output length. Also this way of measuring has another good point, it's widely used, you'll find in most of the papers.

Also you can find bpc, Bit Per Char, which in some cases is the same as bpb, and when specified, it refers to a given alphabet, which usually only contains characters that appear in text data. (But then this is not suitable for compressing binary data.) In the compression function we code a file, and if the coded file is smaller than the original file we have achieved compression, otherwise we have expanded it. For image files one can use the term of bpp, which means Bits Per Pixel.

There are also some measures for speed. Like kbyps, Kilo BYtes Per Second, you can compute it with: file_length / seconds. Obviously the file length must be in kilobytes (divide its size in bytes by 1024) and seconds should have more precision than a second, using hundredths of seconds is enough. Unfortunately this has a big problem, a program runs faster on a faster computer. Obviously we can't use the same computer for testing all the programs (at least not for technical papers). The only solution which seems like a good idea is comparing the speed of our compression program with a standard program. In The Context-Tree Weighting Project they test the speed of different algorithms against the speed of the utility compress.

To be able to compare the bpb of an algorithm with another, a standard set of files is used. The Calgary Corpus is still widely used. Though I would recommend to use also the large Canterbury Corpus. Both are available on the site mentioned above.

Messages and symbols

A message is a sequence of symbols. A symbol can represent any kind of data, it may be a byte, a character, or a pixel, it depends on your application. An alphabet is the set of symbols which may appear in our message. The alphabetsize is the number of symbols in the alphabet, usually it is represented by the symbol "Z". The message has a finite length, it's usually represented by "N". The frequency of a symbol is the number of times it has appeared in the message.

For example, let's say we have a message like "987654321012". The alphabet contains all the numbers in decimal ("0", "1", "2", "3", "4", "5", "6", "7", "8", "9") so the alphabetsize is 10. In this case, the length (N) of the message is 12. In our example "9" is the first symbol in the message, "8" the second and so on, till "2", which is the last. In our example the frequency of the symbol "1" is two and the frequency of "5" is one. Optionally our alphabet could have another symbol, let's say '$', which would mark the end of the message. However, we could avoid this symbol by transmitting the length of the message.

We have to compress a given file (the message) with a given length (known beforehand), whose symbols are bytes. Some data like sound or image can be lossy compressed, that is we can choose to only code the information which is more relevant and throw away the rest of it, thus introducing error on it. On the other hand, the loss of data is something which we can't afford with text or binary data. Just think about the mess of a text with almost random characters, or corrupted binary data which our programs may need to run. Due to the need of a decompressed file being identical to the original one, we have to use lossless compression for this kind of data. In most of the cases we name this kind of compression (the compression of files without any loss) text compression, commonly English text compression, as English has an structure which has been taken as representative for any language. Also this structure is not very different of that of binary files, and thus if a given compressor performs well with text, it's going to have a good performance with binary data.

Information content and entropy

Let's have a look at the following message: "theretherethere". We can easily see that we could easily substitute "there" with only a reference to the previous one. This is usually what Lz variants do (lz77, lz78, lzw, lzp,...). And let's see another one, "0123456789", here we could rely on the fact that there's a little difference between the current symbol and the previous, to try to recode it as the difference "0111111111". This is called differential coding (coded_symbol = current_symbol - last_symbol). Both of them use a model. A model is a set of rules and parameters which, based on the statistics of the message, tries to predict its structure and therefore tries to predict the exact probability of the next symbol.

As we see we are using different models, which have differently results, but none of them seems optimal for all the messages (in fact there's no model optimal for all the messages). Both of them exploit the redundancy (the similarities, correlations) of the symbols in a message. If we take a look at the compression of a given message, we'll see two stages, the first gets the probabilities of the symbols in the message, this is the model, and the second, based on the probabilities of the model, generates codes, and this is the coder (or encoder, both names mean the same). Together they make a compressed file. Needless to say we also have a decompressor, with a decoder and the same model, so it can reconstruct the original file based on the compressed file.

Let's see two different messages: "abcabcabc", "odwtmcbey". We can notice that both of them have the same length (and thus they take the same space). But we can clearly see that they have different structures, the first repeating a pattern ("abc") and the second being rather random. To analyse it we have the most powerful model, our brain (unfortunately we have not developed models better than our brain for text compression yet) so we can see that the length of a given message isn't a very reliable measure of how much can we compress it. A message has an internal structure with a probability for every symbol, however, both are unknown to us, and the only information that a model has about the message is the message itself. Our model then assigns a probability to every symbol. Then the job of the coder is, assign a code to every symbol based on a probability distribution. (In theory, we would say that it mapped the source message in codewords.) But what's the optimal code length for a given probability distribution?

Between 1940 and 1950 Claude Shannon created the information theory, a branch of mathematics, among the results of his job we can find the answer to the previous question. The entropy, which measures the content of information in a given message, also gives the optimal code length in average for a probability distribution. The entropy of a given symbol is defined as -Log2(P) where P is the probability of a given symbol (the one whose code we want to generate). Note that we do Log2 if we want the result to be expressed in bits, otherwise we only have to change the base of the logarithm, if we want it to be expressed in bytes we would use: -Log256(P). Just as a curiosity, the term entropy is also used in thermodynamics, a branch of physics, they are called the same because both are analogous.

Finite context modelling

Finite context models assign a probability to a symbol based on the frequencies of the symbols that happened before the current symbol to code. Before because when we code a symbol based on the previous symbols, we can be sure that the decoder will have this information too (as the previous symbols have been already transmitted). Finite context modelling is based on the fact that a symbol that has recently appeared will appear in the near future, so the more it appears, the higher is the probability that it appears again, so every time it's seen we increment its probability.

In this article I'll not deal with coders, but you can choose among Static Huffman, Arithmetic Coding, a variant of arithmetic coding called Range Coder and many others (read my articles on them in previous Hugi issues or on my homepage). Arithmetic coding approaches the entropy, so from this point on, we'll only use the entropy as a reference (assuming that we are using arithmetic coding).

Let's say we have a message like "aaabbc" and a finite context model which assigns a probability of symbol_frequency/total_frequency to every symbol, then we have:

Symbol Frequency Probability Optimal code length
a 3 3/6 -Log2(3/6) = 1 bit
b 2 1/6 -Log2(2/6) = 1.59 bits
c 1 2/6 -Log2(1/6) = 2.59 bits

In this example the original length of the symbol "c" was 8 bits (we assume it was a byte), however, our model assigned a code length of 2.59 bits, from this we can see that the redundancy was of 8-2.59 = 5.41 bits, and we can generalize the following: for any model the redundancy of a given symbol is original_code_length-code_length. Note that the probability 1/6 for 'c' is the expected probably of the next symbol being a 'c'.

Just notice from the previous example that every time a symbol appears we increment its probability, and 6 is the total number of symbols seen (total frequency), from there we can get the probability of a given symbol. When we implement this simple finite context model we have a table with Z entries (alphabetsize, in case of bytes: 256), which is initialised to 0, and then when a symbol appears we increment its value, something like: ++table[symbol].

Now I'm sure that you are wondering if we can do better than the entropy, the answer is no. The compressed length of the message is (1*3)+(1.59*2)+(2.59*1)= 8.77 bits. Let's try to give a higher probability to the most probable symbol:

Symbol Probability Fake probability Entropy
a 3/6 = 6/12 9/12 -Log2(9/12) = 0.42 bits
b 2/6 = 4/12 2/12 -Log2(2/12) = 2.59 bits
c 1/6 = 2/12 1/12 -Log2(1/12) = 3.59 bits

So the final length is: (0.42*3) + (2.59*2) + (3.59*1) = 10.03 bits. We have coded one symbol (the most probable one) with fewer bits, but the resulting output file is bigger, so on average we can't do better than the entropy. This can be mathematically proved by the Kraft Inequalities, for further reading see Charles Bloom's homepage. Of course we can do better models, which give higher probabilities to the symbols being coded, and thus we achieve higher compression.

Finite context models are also called finite Markov models (Markov, 1856-1922, was a Russian mathematician), because they assume the input file to be a Markovian sequence: every symbol depends only in the finite preceding sequence of symbols. Note that also exist models for a message of an infinite length, but in most of the cases this is not practical, see The Context-Tree Weighting Project Also the term Markov model is used to refer to finite state models, however this falls out of the scope of this article.

Context and order-n

As we have seen we can't do better than the entropy, so now the only problem is to make a finite context model which gives higher probability to the most probable symbols. The best solution is using the information of the contexts (which has proven to be very high).

The context of a given symbol is the finite length sequence of symbols before or after it. In finite context modelling we use the symbols before it. Because we can be sure that the decompressor's model will have this information. That ensures that it can have the same probabilities as the compressor's model and there would be no problem decoding the symbols. On the other hand if the compressor and decompressor have different probability distributions, the same code may mean different things, and thus we will get corrupted data. Due to this fact, a rule which you must follow is that both compressor and decompressor must always have the same model (if you want to do lossless compression).

We use context because the probability of a letter appearing in a text heavily depends on the symbols that have appeared before, wouldn't it be more probable to see a "h" after "whic" rather than a "q" or any other letter?

In the previous example model, the probability of a symbol was taken from the number of time that it appeared in the file, when using the information of context we also have to take care of what is the context of the current symbol, and thus we take only the probability of the symbols which have appeared under this context. Both text and binary data are context sensible, that is, under a given context the same symbols tend to appear, so we can get a more reliable probability. (Binary data tends to be context sensible, however, it shows a more random structure than text.)

I'll only prove this for text data: "the text" (the context is in red colour and the symbols used to make the probabilities are in green). We are currently at the end of the message, we take care of the last symbol, in this case a "t", the probability distribution under the context "t" is 1/2 for "e" and 1/2 for "h". (We can also reserve probability space to indicate that the symbol is another one, which would be an escape code.) If we don't take context information in account the probability of "h" would be 1/8 and thus it would be coded to 3 bits, instead of 1 bit using the last symbol as context. Implementing this is about like the previous example, but instead of one table we would use Z tables, and depending on the context we would choose one.

As you can see, using the information from the context we have symbols with higher probabilities than if we didn't take care of the context. This is called a skewed distribution, in the other case we have a flat distribution, which means that all the symbols have a probability of 1/Z where Z is the alphabetsize, or any other probability distribution which assigns the same probability to every symbol. Just as an example, a text file which is not compressed has a flat distribution, with every symbol having a probability of 1/256, thus every symbol is coded with 8 bits. The goal of a model is to get a skewed probability distribution, which hopefully gives high probability to the symbols which actually occur, and thus we can code them with fewer bits.

When we use context in a finite context model, we don't say that we have a model using one byte as a context, we say that we have an order-1 model. When we use no context at all (one table for all the symbol) we are using order-0. Of course if we use 3 symbols it's order-3, if we use 5 it's order-5 and so on. There's also an special order, order-(-1), which is a table with a flat distribution for all the symbols, used for ppm.

You have to note that using high orders gives high probabilities only if the data reassembles a Markov model, if it's context sensitive, like text is (under a given context the same symbols appear again).


Once we have different orders the question of how to use them arises. One of the solutions is blending, which is to combine the probabilities of all the orders to a single probability for every symbol. This is done by adding together the probabilities of the same symbol under different contexts. However, as we have seen, probabilities of higher orders tend to be more reliable and thus achieve higher compression, the way we reflect this in blending is by weighting higher orders, that is multiplying them to give them more importance when they are added together. Every context has a different weight, which we can choose beforehand, or change while compressing.

Now let's say we have a message like "baabcab" and an order-2-1-0 model, using the different weights values: 8, 2, 1 for order-2, order-1 and order-0 respectively. In this case once all the data has been processed the probability distribution will look like the following:

. . Probability Weighted .
Order Context a b c a b c Weight
2 ab 0 0 1 0 0 8 8
1 b 1 0 1 2 0 2 2
0 . 3 3 2 3 3 2 1
. Total: 4 3 4 5 3 12 .

Probability distribution for next symbol:

Symbol Probability Code length
a 5/20 -Log2(5/20) = 2 bits
b 3/20 -Log2(3/20) = 2.737 bits
c 12/20 -Log2(12/20) = 0.737 bits

That means that we would have 3 codes with the code lengths showed above.

However, using blending is slow, though it gives good compression. There's another method, which is faster. This is the use of escape codes, that's the method used by ppm. In the document "Text compression" by T.C. Bell, J.G. Cleary and I.H. Witten from 1990 you can find more about blending, they also recommend making the weights sum up to 1 (in order not to surpass the limit of a value that the arithmetic coder being used has.)

Zero frequency problem and escape codes

The zero information problem is when we have no information about the input, we haven't seen any symbols yet, and the problem is what probability distribution to choose. In this case the rational solution seems to be a flat probability, however, with a flat probability we achieve no compression.

The zero frequency problem especially arises when using high orders, look at this text: "we have a model using one byte". Let's say we are under a order-4 model, and we want to know the probability of the next symbol (the context is "byte"), as we have never seen before 'byte' the probability of the next symbol appearing is 0, then in a context where no symbol has appeared the most optimal solution seems to be a flat distribution, 1/Z (1/256), which gives the same probability to every new symbol. But should we assign an equal probability to all the symbols when more than one symbol have appeared? Look at the following example: If you have the following text "aa" and you have seen the symbol 'a' once and you start with a flat distribution, the probability of the next 'a' is 2/257, if you see it once again it will be 3/258, but you'll expect the probability to be much more higher than 3/258, maybe something like 2/3 (because its frequency is 2 and nothing else has appeared), but you still have to leave some probability for the case of a new (unseen in this context) symbol appearing. This is called an escape code, which means that the symbol which we are currently coding is not in the current context, and thus we have to try it in another context. We start in a table (order-0) with all the symbols (Z) and the escape code, all the symbols have a probability of 0 and the escape code has a probability of 1. Let's say we have to code a given symbol now, as it doesn't appear in the order-0 table (when only the escape codes have a probability) you have to fallback to order-(-1), where all the symbols have a flat probability and you can safely code the symbol. An escape code is used to make the decoder know that it has to use the next table, we always go from high orders to lower order, till we hit order-(-1), where we can code the symbol in case it hasn't appeared in higher orders.

For example, the case of "aa" will be the following: the first 'a' is not in the order-0 table, so we emit an escape code (which is in the order-0 table) and then code the 'a' symbol with the order-(-1) table, with a probability of 1/Z, once the symbol is coded we have to update the order-0 table to reflect the "a" that has appeared once, however, we don't update the order-(-1) table, as its only purpose is to code symbols which have never appeared, and symbols which have already appeared will be found in higher orders (probably resulting in more compression). Then the next 'a' is found in the order-0 table (which we have updated and now has two symbols on it, "a" and the escape code) and now it has a probability of 1/2, and thus we can code it to only one bit. In that case note that the real probability of the second "a" was 1, however, our model assigned it a probability of 1/2, and then the coder coded the "surprise".

As you see this scheme of making the escape code is a little bit inefficient, as it always has the same value, no matter how many symbols we have seen till the date. Fortunately there are some other ones. The first which I've described is method A (and thus a ppm implementation which uses it would be called ppma). Method B is based on "Don't believe it till you have seen it twice" by subtracting one from the count of every symbol, but this is kind of old stuff, so let's go to see the next one. In method C, the probability of the escape code is equal to the number of different symbols seen (also remember to add the escape code probability to the total of probabilities). Using escape codes is an alternative of blending, let's see the previous example ("baabcab") but using escape codes with method C instead of blending: (E is the escape code, whose probability is equal to all the different symbols in the current context, obviously it is not used in order-(-1) as it already has all the symbols.)

Order Context a b c d E Total
2 ab 0 0 1 0 1 2
1 b 1 0 1 0 2 4
0 . 3 3 2 0 3 11
-1 . 1 1 1 1 0 4

Probability distribution for next symbol:

Symbol Total code length (sum of code lengths)
a (-Log2(1/2))+(-Log2(1/4)) = 3 bits
b (-Log2(1/2))+(-Log2(2/4))+(-Log2(3/11)) = 3.874 bits
c (-Log2(1/2)) = 1 bit
d (-Log2(1/2))+(-Log2(2/4))+(-Log2(3/11))+(-Log2(1/4)) = 5.874 bits

After "baabcab" was coded, the tables would look like those, now let's see how we would code a "b": We start at the highest order, as in order-2 the probability of the symbol "b" is 0, we have to emit an escape code whose probability in the order-2 table is 1/2, then we analyse the order-1 table, here its probability is also 0, and so we have again another code, of course using the escape code probability of the order-1 table, which results in being 2/4 (because there are two different symbols in this context, "a" and "c"), once we are in order-0 we predict the "b" with a probability of 3/11, and thus finally we code it. Note that we have not coded "b" in one code but in three. In case a "d" appeared, we would emit escape codes in all the tables except in the order-(-1) table, which has a flat probability distribution for all the symbols (and thus it can predict all the symbols which may appear in the message but whose frequencies till the date are 0). What we have seen in this example is a simple ppmc implementation.

Higher orders produce better results, however, it has been proven that orders above 5 are not worth the processing time and space requirements in basic ppm implementations like ppma or ppmc, however, ppmd, ppm* or ppmz make use of higher orders to achieve more compression. Ppm (Prediction by Partial Matching) is a finite context adaptive model, which uses arithmetic coding as a coder and tries to accurately predict the next symbol, using the current context in a context tree (or any other data structure like a suffix tree or hash tables) for predicting the current symbol, using the highest order possible and going to a lower order (using escape codes) till the symbol is found. (Actually this is not true for all the versions, ppm* chooses the lowest deterministic context, and ppmz chooses the highest deterministic context, or any context via LOE.) Ppmc achieves around 2.4 bpb in the Calgary Corpus Set, and ppmz (by Charles Bloom) is the state of art with 1.9 bpb. Ppmz is the most promising model for lossless compression, in the case of infinite context modeling (which is not used) the state of art is Ctw. Unfortunately both of them (especially ctw) are very slow.

Models, probability updating

The difference between finite context models are not only the order or escape code method, but also how often the probabilities are updated. In this topic we have three different kinds of models:

- Static. An static model uses predefined probabilities, which both the coder and decoder know beforehand (and thus we don't need to transmit them), with those fixed probabilities (we never update them) we code the whole file. This of course has the drawback that if the file doesn't match our probabilities we are going to expand the file instead. For example, if we have probabilities for English text and code Japanese text, the result will be bad. Of course this is the fastest one. But this is not actually used except in a few applications. (For example it's not used in archivers, but it's used for coding the probability of the symbols of a Jpeg file.)

- Semi static. This models does two passes, the first to gather probabilities for all the symbols, then once this is known it outputs the probability distribution and starts to output the codes for every symbol. It's still fast, and in most of the cases it does a good job, however, it has some inefficiencies, like that it has to pass the whole probability distribution (usually this is only applied to order-0, as going to higher orders would need to pass a huge amount of probabilities) or that it does not locally adapt. Just an example if we have the file "aaaaaabbbbbb", it will code both 'a' and 'b' down to one bit. But we can do better. Also this kind of model is called semi adaptative, however, it's widely just called static model (as the purely static model is hardly used).

- Adaptative. It processes the file symbol per symbol, in the following way: code symbol, update model. Thus it has accurate probabilities and doesn't need to tell the decompressor which the probability distribution is before using it. Of course it has a drawback, it's slow, even if there are some schemes for speeding it up (both for Huffman and arithmetic coding), as we have to update the model for every symbol processed.

Of course there are also differences in how a model updates its probabilities (for example in most ppm implementations updating only probabilities at the same and higher orders where the symbol was found, this is called exclusion), but this is something which changes with every model and will be analysed when they are going to be used.


[1] Charles Bloom Kraft Inequality, (Compression : News Postings : Kraft Inequality)
[2] Bell, T.C., Cleary, J.G. and Witten, I.H. (1990) Text compression, Prentice Hall, Englewood Cliffs, NJ.
[3] Charles Bloom Solving the Problems of Context Modeling, (
[4] The Context-Tree Weighting Project,

Closing words

I finish this article hoping that this article gives you all the information you need to fully understand how ppm works. I've done my best with this article so it's as accurate and complete as possible, so I'd be very grateful hearing your suggestions for further improvements of it, or just comments on it, just write me an email.

I want to thank Michael Schindler, Sampo Syreeni and Stefaan for their help proofreading this article and answering my questions. I plan to do another article talking about the entropy formula only. The article on ppmc is already available. If I had more knowledge about information theory (I'm only 17, still in the institute) I'd write about it.

Dario Phong, Barcelona 2-Jan-2000