Information Theory: A 28k Crash Course



This article is about some basic information theory, it requires only minimal knowledge of data compression to understand. I wrote it because there is no internet site out there with all this info. on it (that I know of).

Most people find information theory interesting, so hopefully this article will be interesting to you. You don't need to know any compression algorithms to appreciate this. Also, for this basic (discrete) information theory you only need high school math. After reading this doc you should hopefully be able to see why people had the ideas for developing the various different types of compressors.

What Is Information Theory?

Information theory is a specialised section of math. concerned with describing properties of data.

You will surely know that some data compresses well, while some doesn't seem to compress at all. But why can't you compress all files? How much can you compress files? How can we tell if a code is decodable? How do we measure information?

All the above questions will be addressed in this little doc, after reading it you should be able to tackle lots of other questions too.

Some Definitions

Ok, bare with me, I'll only need a few of these.

(i) A message: A sequence of binary digits (bits).

[ comment: Often a message is a single ASCII symbol or a whole lot of them together ]

(ii) A source: Anything which emits messages with known probabilities.

[ comment: If we had a source like { A = 1/4, B = 1/4, C = 1/2 } Then we can work out the probabilities of various data from that source, e.g. AAA is less likely than CCC, assuming independence between symbols. ]

(iii) A source symbol: A message of fixed length from a source.

[ comment: Usually just an ASCII symbol, in our source above A would be a source symbol ]

(iv) A code word: A source symbol of unrestricted length.

[ comment: e.g. a Huffman code ]

(v) A code: A sequence of code words.

(vi) A compact code: A code which on average is shorter than all other codes.

(vii) q: The number of symbols in the alphabet (for ASCII, q = 256)

When I write X Y or XY I mean X * Y.
When I write log X I mean log2(X).

Quantifying Information: Entropy

This measure of information was invented by Claude Shannon in 1948, his original paper (part I) is possibly the best introduction to entropy, I know I really liked it. Also, it was (relatively) recently republished in postscript.

Part (i) - Intuitive Reasoning

What we're going to address here is: how much information does a message hold? Does a long sequence of the symbol 'A' convey a lot of information?

It's important to have an intuitive feel for information, so first I'll do some informal examples.

Suppose we had four messages, { s1, s2, s3, s4 }

Now suppose we wanted to make a choice from that set q, we could resolve all uncertainty about which one we are talking about by specifing 2 bits as an index.

s1 = 00, s2 = 01, s3 = 10, s4 = 11

As long as the sender and receiver agree to that system all uncertainty is completely resolved after two bits are sent.

This forms a very rough basis for a measure of information, call it A. Before any bits are sent we are completely uncertain as to which message will be specified, if A is a function of a number of bits then

A(0) = completely uncertain


A(0) = {4}

When I write A(x) = {y}, I really mean after receiving x bits, we must make a choice from y things. (The sender and receiver must also agree that there will be four possible messages.)

What about after receiving one bit, we now know it is one of two messages, (the ones beginning with 0 or the ones beginning with 1)

A(1) = {2}
A(2) = {1}

Our measure makes pretty good sense,
A(1) means we must choose between two further messages,
A(2) means we must choose from one message, i.e. we know which one it is.

Now the question arises, how much information do we need to resolve a choice? We already have the answer in this special case, two bits. What about in general? It's quite simple, if we have n events, then we need log2(n) bits to resolve any uncertainty between them all. We call these bits, which resolve all uncertainty, information, the number of bits needed to resolve uncertainty reflects the information content of the choice.

A few words about the logarithmic measure. It may be the next integer above or below log2(n) but that doesn't really matter, that is a practical issue.

Another example, take the ASCII table, to resolve all uncertainty between which symbol you are talking about you need 8 bits, which is equal to log2(256).

Consider now our messages again: { s1, s2, s3, s4 }

Suppose s1 occured much more than the others, imagine it occured 9/10 of half the time. We'd be less uncertain about what message is going to be transmitted, in fact, we can be correct in saying:

"It'll probably be message s1"

Since there is little uncertainty, we can resolve it with little information, on average. We could invent a scheme like this, where we ask the sender questions, in this case, the first one we'd ask is:

           Receiver: "Is it message s1?"
           Sender: yes/no (1 bit)
           yes: stop.
           no: continue to find out which message it is.

Most of the time we'll just be sent 1 bit saying yes. If we are told it isn't symbol s1, then we'll be more uncertain, since it could be any of three messages.

Moral: s1 is assigned little information, since it introduces little uncertainty. On average, little information is needed to reduce the uncertainty since messages s1 is so likely.

Part (ii) - Rigorous Reasoning

With the ideas of part one in mind, Shannon invented a function which given probabilities of occurences of messages, gives their information. The way he did this was, to state some intuitive and desirable properties for the measure and then derived it under them as constraints. I couldn't decide whether to include the derivation or not, I've included a brief derivation (if you are interested in full one either email me or check out Shannon's paper).

The properties which seem very handy for a measure of information to have are:

(i) It is a continuous function of the probabilities of messages, i.e. we should be able to consider a message with probablility 1/2 just as well as a message with probability 59/341.

(ii) If all events are equally likely then it should be a strictly monotonically increasing function of the number of events. Again, this makes a lot of sense, if our ASCII table was 512 symbols in size we'd need more information.

(iii) If the decision is made in a multi-stage process, then the entropy of the overall choice should be the weighted some of the entropies of the other choices.

Number (iii) is perhaps a little less obvious than the others: Shannon's example was (call our entropy H)

H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2H(1/3, 2/3)

Or as a tree:

                 1/2 1/3 1/6
                  /   |    \
                 s1   s2   s3

Is equivalent to:

                 1/2  1/2
                  /    \
                 s1   /  \
                     1/3 2/3

This is ok since the events retain the same relative probabilities.

So what should H be? If we have H(1/2n, 1/2n, ..., 1/2n) by property (iii) we have:

H(1/2n, 1/2n, ..., 1/2n) = H(1/2, 1/2) + H(1/n, 1/n, ..., 1/n)

I.e. we choose between either the first half or the second half.

If we continue in that way, the entropy will be the some of the number of times we can make a binary decision, or divide the number of events by two, we know that to be the log, so now,

H(1/n, 1/n, ..., 1/n) = H(1/2, 1/2)log(n)

We know the information content of H(1/2, 1/2) to be a bit so it is simply log(n) bits. We can apply arguments of continuity for a general case of n (when it isn't a power of two).

In Shannon's original paper he arrives at that result differently, but the above is, imho, simpler. Notice that we have arrived at an expression for H when all events are equally likely.

But what about when all events aren't equally likely? We can apply (iii) again. All we have to do is make sure the relative probabilities of events stay the same and their entropies stay the same, which brings us to the measure of information:

H(S) = S P(i) log 컴컴

That is, for n messages from a source, with probabilities P(i), the entropy of the source is H(S). The base of the logs is usually 2 for the information in bits.

I must stress that this is an average measure, H(S) denotes the information of the source : ON AVERAGE. It could be much more or much less. Entropy is just the expected number of bits sent from a source,

H(S) = S P(i) L(i)

That's just the expected number of bits sent, above L(i) = -log2[ P(i) ]

With the measure established, what properties does it have? What is the maximum entropy of a source? What is the minimum entropy of a source? When you read "entropy" you should think "average information".

Minimum Entropy

Mininum entropy is easy,

H(S) = 0

If and only if, P(i) = 1 for any one source symbol.

Intuitively, if we know a symbol will occur, there is no need to send it, since no information is conveyed.

Maximum Entropy

Now let's find out maximum entropy:

Consider two sources S and U,

S = { s1, s2, s3, ..., sn }
U = { u1, u2, u3, ..., un }

I denote the probability of symbol i from source S as S(i) and similarly with U.

We define the following equation:

X = S S(i) log 컴컴

Notice that log(x) <= x - 1

Now we know that

X <= S S(i) ( 컴컴 - 1)

Multiply in by S(i):

X <= S U(i) - S(i)

Seperate the sums:

X <= S U(i) - S S(i)

Both of those expressions sum to one, since they are probabilities, hence

X <= 0,

Now we have the following inequality by filling in:

S S(i) log 컴컴 <= 0

But what about maximum entropy?

Let all U(i) = 1/q, now:

S S(i) log 컴컴  <= 0

Seperate the sums:

           1                1
S S(i) log   - S S(i) log 컴컴 <= 0
           q               S(i)

Now simplify:

    1               1
log - - S S(i) log 컴컴 <= 0
    q              S(i)


S S(i) log 컴컴 <= log(q)

The LHS is just H(S) so,

H(S) <= log(q)

Which means: The average information (entropy) of source S is always less than the logarithm of the number of source symbols in source S.

Hence the maximum information is when all S(i) = 1/q

This is when all source symbols are equi-probable, this is also when we are most uncertain, because it is equally likely to be any source symbol. In Shannon's paper he points out that any transformation that makes two or more symbols have more balanced probabilities also increases the entropy of the source, e.g.

S = { 3/8, 1/4, 1/4, 1/8, }

If we transformed S to this:

S = { 2/8, 1/4, 1/4, 2/8 } -> { 1/4, 1/4, 1/4, 1/4 }

by averaging 3/8 and 2/8 then we increase the information, in this case, we see that we actually maximised the information by doing so, that need not always be the case.


We've considered two properties of entropy, the maximum and minimum. Now we'll prove by far the most far reaching result of entropy: it is the most compact measure of information. Of course, it seemed that this should be so when we derived it, but we need a proof.

Noiseless Source Coding Thereom: part (i)

Ok, this is a sort of warm up...

Call the average length of our representation of a source, S, L.

We'll show that:

 H(S) <= L(S)

Pretty cool eh? The expected length of our code will always be greater than or equal to the entropy.

Consider the difference in the two lengths:

H(S) - L(S) = S P(i) log 컴컴  - S P(i) l(i)

l(i) denotes the length of codeword i in code L.

Rewrite the rightmost sum:

H(S) - L(S) = S P(i) log 컴컴 - S P(i) log [ 2^l(i) ]

Difference of the logs is the log of the quotient so,

H(S) - L(S) = S P(i) log 컴컴컴컴컴
                          2^l(i) P(i)

By our inequality log(x) <= x - 1

H(S) - L(S) <= S P(i) ( 컴컴컴컴컴  - 1 )
                        2^l(i) P(i)

Simplify it a little:

H(S) - L(S) <= S 컴컴컴 - P(i)

By extracting the P(i) and summing it since it they are probabilities they sum to one so:

H(S) - L(S) <= S 컴컴컴 - 1

Now, fill in for H and L:

            1                         1
S P(i) log 컴컴  <= S P(i) l(i) + S 컴컴 - 1
           P(i)                     2^l(i)

Ok, so the entropy is always less than the expected length of our code plus the sum of 2^-l(i) minus 1. We could notice, that if P(i) = 2^-l(i) then we have equality, and hence, we have optimality! We notice that if all P(i) are anything but 2^-l(i) it is sub-optimal.

Later we'll be able to look at it like this: The rightmost sum appears in something called the Kraft-McMillan inequality, and it is always <= 1.

So that means that

H(S) - L(S) <= 0

H(S) <= L(S)

and H(S) = L(S) only when all P(i) = 2^l(i).


The expected length of our code will always be less than or equal to the entropy of the source. So we can say entropy is the minimum length of a source.

We can now comment on the efficiency of our code,

Code Efficiency

A = 컴컴컴

We define A to be the efficiency of the code L(S). (A isn't the normal symbol for efficiency, but the correct symbol isn't in the ASCII table.)

It is important to realise this is a constant, but it is, again, an average, it is the quotient of two expected values.

We can now define the redundancy of our code.

R = 1 - A

  = 1 - 컴컴

The definition of redundancy: A code is redundant if it has the ability to represent more information than the source it seeks to represent.

Noiseless Source Coding Thereom: part (ii)

Now this is Shannon's thereom.

This thereom shows the following:

(i) We show the upper bound on compact codes for a source.

(ii) Codings exist which come arbitrarily close to the entropy of the source.

This thereom proves everything very cleanly.

Suppose we define a restriction on the lengths of our code words as follows:

     1                   1
log 컴컴 <= L(i) <= log 컴컴 + 1
    P(i)                P(i)

That seems like a fair enough coding scheme, we assign codes as close to their entropy as we can, while still writing integer length codes. Call the code formed by these code words A. We can verify decodablity through the K-M inequality which we'll see later.

The upper bound gives us the following:

         L(i) <= log 컴컴  + 1

Multiply across by P(i)

    P(i)L(i) <= P(i) log 컴컴 + P(i)

Take some sums...

  S P(i)L(i) <= S P(i) log 컴컴 + S P(i)

Probabilities sum to one on the RHS, and fill in for our stuff.

H(S) <= A(S) <= H(S) + 1

So our code will always be within one bit (or Hartley or nat or whatever depending on the base of the entropy and average length) of the entropy.

Now consider the average length of the compact code for a source, we'll call it C(S). By definition we have:

H(S) <= C(S) <= A(S) <= H(S) + 1

(H(S) is always less than or equal to C(S) also)

So for all compact codes, we know that:

H(S) <= C(S) <= H(S) + 1

Now by considering what are called extensions to a source we can prove there are codings which come arbitrarily close to the entropy of the source. (Extensions use a simple property of entropy that H(S^n) = nH(S), think of it in terms of decision trees to understand it, then we just use the same inequality for our compact code, divide across by n and take the limit.)


The average length of a compact code is linked to entropy as follows:

H(S) <= C(S) <= H(S) + 1

Hence H(S) + 1 is an upper bound an a compact code.

Lim   C(S, n)
n->  컴컴컴 = H(S)

As n tends to infinity the compact code tends to the entropy. C(S, n) denotes the length of the compact code corresponding to the nth extension of source S.

Broad Results

Remember the questions we asked above, we have answered lots of them now:

Q: How to quantify information?
A: Entropy.

Q: Minimum average information?
A: Minimum entropy.

Q: Maximum average information?
A: Maximum entropy.

Q: Maximum compression on average of a source with known symbol probabilities?
A: Entropy.

Q: Can we compress all files?
A: No.

The reason we have answered this last question is a simple result of entropy.

The fact that entropy is "optimal" gives rise to the two fundamental problems of data compression:

(i) Coding symbols in their entropy. (source coding)
(ii) Estimating symbols' probabilities. (modelling)

Both of those are more or less solved, coding symbols in their entropy is achieved through arithmetic coders. Estimating the probabilities of symbols is achieved through a number of methods which are (in order of success) Prediction by Partial Match (PPM) Context Tree Weighting (CTW) and State Based Modelling (mainly just Dynamic Markov Coding). Also, you know when people often refer to "LZ77/78/SS" they really, really shouldn't do that, since LZ77 is a the first Ziv-Lempel compressor, and LZ78 is a Ziv-Lempel compressor which is asmyptotically optimal, that means as the length of the input tends to infinity LZ78 approaches the entropy (cool eh?) However the convergence of the algorithm is very slow, making it only a little better than LZ77. LZSS is Storer & Sysmanski (how d'ya spell that?) LZ and is quite different again.

The best known compressor is self tuning PPMZ (Charles Bloom) but like most other high performance compression algorithms is completely impractical for the layperson. However, imho, text compression has gotten a little silly of late, as people push for tiny improvements, really they should be doing other things, like trying to find faster algorithms that get the same performance.

Anyway -

We still have to answer:

Q: Can you compress any file?
Q: How can we tell if a set of code word lengths could be used to make a decodable code?

Yes we can compress ANY file at all. A good example (from an old comp.compression FAQ) is:

if(input is my data) output(bit equal to one)
else output(bit equal to zero), output(input)

Now, here's a neat little way to show the incompressibility of data:

Let's define "data" : data is a sequence of binary digits of length greater than 0.

That's perfectly straight forward. Now we have T and R, a compressor and decompressor, they both are defined on data, i.e. both their input and output must be data. (I denote the length of data, D, in bits as |D|.)

We say a compressor is lossless if and only if,

R(T(D)) = D for ALL D.

Now if we could compress all files, it's plausible we could write a compressor to shorten a file by just one bit. so consider such a transformation,

|T(D)| = |D| - 1 for ALL D.

If we run T on D |D| times we get some data, say, P and |P| = 0 WAIT! if |P| = 0 then it isn't data, and our decompressor is only defined on data! Hence we cannot compress all files.

Also, notice that T is not only a not a lossless compressor but not a compressor! (since they're defined as a transformation that maps from data to data)

Kraft Inequality

This is an inequality defined on instantaneous code words, but we'll see how McMillan's thereom generalises it.

Firstly though -
An instantaneous code word of length L is a code word which can be decoded through looking only at the code word's own L bits.

So if we had simple codes like these:


We would read a number of zeros delimited by a '1' and know which code it was. What about in general? Consider these codes:


They are also instantaneous. In general instantaneously decodable code words obey the following rule:

No code word is a prefix of any other code word.

Consider 0, 01 and 10, if we had to decode


To decode the first '0' we'd have to look all the way to the first '1'. In general this can be very hairy, and the prefix rule is very handy.

How do we construct such code words?

It's easy, we just do a binary count, suppose we wanted to construct code words of lengths 2, 2, 2 3, 3.

First we sort them smallest to biggest. Now we binary count for each digit of each code word,


What if someone said, construct code words of lengths 2, 2, 2, 2, 3, We'd start going:

00, 01, 10, 11, <- end of binary count, it can't be done!

So there should be some general way to tell whether code word lengths obey this, there is and we'll derive it now:

Assume the lengths of the codes are sorted smallest to biggest i.e.

L(1) <= L(2) <= ... <= L(n)

(i.e. L(i) is a montonically increasing function of the code word number)

Let's consider the number of possible code words for each length.

Number of possible code words for code word of length L(1) :


Number of possible code words for code word 2: 2^L(2)
Number of instantaneous code words for code word 2:

2^L(2) - 2^(L(2) - L(1))

[ Notice: Number of prefix free code words is the number of possible minus number with code word 1 as a prefix (= 2^(L(2) - L(1)) ]

Number of possbile code words for code word 3: 2^L(3)
Number of instantaneous code words for code word 3:

2^L(3) - 2^(L(3) - L(1)) - 2^(L(3) - L(2))

Number of possible code words for code word i: 2^L(i)
Number of instantaneous code words for code word i:

2^L(i) - 2^(L(i) - L(i - 1)) - 2^(L(i) - L(i - 2)) - ... - 2^(L(i) - L(1))

Now, to actually construct the code we need to have the number of possible code words for the ith code word greater than or equal to one, if we have it less than 1 then obviously it can't work, because we have to have at least one possible code to construct.

So, we require:

2^L(1) >= 1                               [1]

2^L(2) - 2^(L(2) - L(1)) >= 1             [2]

...                                      [...]

2^L(i) - 2^(L(i) - L(i - 1)) - 2^(L(i) - L(i - 2)) - ...
- 2^(L(i) - L(1)) >= 1

Now, divide the ith inequality by 2^L(i)
For inequality one, it is

1 >= 2^-L(i)

for inequality two it is,

1 - 2^-L(1) >= 2^-L(2)

By rearranging:

2^-L(1) + 2^-L(2) <= 1

We see that if the ith inequality is satisfied so must all the ones before it be satisfied.

So in general, a set of code words is instantaneously decodable if and only if:

S 2^-L(i) <= 1

This is the Kraft inequality.

Notice that non-integers also satisfy this inequality.

Things to note about Kraft's inequality:

All sets of uniquely decodable codes satisfy Kraft's inequality that is, uniquely decodable implies a satisfied inequality BUT a satisfied inequality does NOT imply uniquely decodable. This is because it only tells us whether the lengths of the code words we are using could make a uniquely decodable code - we could still construct insane little code words whose lengths satisfy the inequality.

A good question is as to whether restricting ourselves to instantaneous code words loses us any code space, i.e. is the code redundance increased over other uniquely decodable codes...check out McMillan's thereom:

Just before we finish, we can see that our code lengths have no redundance when we have equality, by definition of redundance.

McMillan's Thereom

McMillan's Thereom: Every uniquely decodable code satisfies Kraft's inequality.

[ this means that we don't increase the length of our codes by making them instantaneous ]

Hence, we can make a stronger statement than:

All instantaneous sets of code words satisfy:

S 2^-L(i) <= 1

We can say, all uniquely decodable codes satisfy

S 2^-L(i) <= 1

This leads us to the...

Kraft-McMillan Inequality

Every set of uniquely decodable code words satisfy:

S 2^-L(i) <= 1

This is the Kraft-McMillan inequality (looks pretty similar to Kraft's =)).

Relationship between K-M inequality and entropy

There is a very strong relationship between the K-M inequality and entropy. In fact, we can derive the K-M inequality from entropy (but if we do so, we musn't use it in our proof of entropy's optimality)

L(i) = -log2(P(i))
=> 2^-L(i) = P(i)

Now sum over both sides

S 2^-L(i) = S P(i)

But the RHS are probabilities which sum to one, so

S 2^-L(i) = 1

This means, when all code word lengths are their entropy (optimal) we get the sum of 2^- them as one. But, consider making the L(i) longer than that then this is a smaller quantity so,

S 2^-L(i) <= 1

Which means, as does the K-M inequality, that: All uniquely decodable codes satisfy the inequality, all codes which satisfy it aren't decodable.

We can now draw much more powerful results from the inequality quite easily, by thinking about our knowledge of entropy.

When the K-M inequality evaluates to one, it means:

A source exists which could be coded optimally with these code word lengths.

It does NOT mean they are optimal for your source. This is intuitive, since, the K-M is only influenced by lengths, not probabilities. The second result is perhaps more important, when the K-M evaluates to less than one it means:

No source exists for which this is the optimal coding for.

This is important, because a coding scheme can be analysed, and forgetting about modelling you can say, it sucks or it doesn't suck depending on the inequality. We were essentially told this above too, without playing with entropy.

Again the results:

(i) If the K-M evaluates to one:
A source exists which could be coded optimally with these code word lengths.

(ii) If the K-M evaluates to less than one:
No source exists which could be coded optimally with these code word lengths.

(iii) If the K-M evaluates to greater than one:
These code word lengths, could never be used to construct uniquely decodable codes.


Phew... this article was pretty damn long. I hope you found it interesting. If you want to contact me at all (questions, mistakes), you can email me.

-- Nicholas Nash.
-- paradox: