The concepts of information and entropy can be thought of as the mathematical equivalents of our everyday intuitions about order and chaos. It's a surprising fact that you can reason about the flow and storage of information in the same way that you can reason about the flow of porridge - despite information being a rather vague idea.
So What is Entropy?
Intuitively, the entropy of something is a measure of its intrinsic uncertainty, chaos, or disorder. In physical systems the primary source of entropy is heat - as things heat up, their atoms start to vibrate and move around, and the hotter they get, the more there is of this random unpredictable vibration.
This is the sort of entropy that is referred to in such things as the laws of thermodynamics, but it can be handy to talk about entropy in a more limited sense, where the 'universe' does not include absolutely everything. For example, the entropy of a newly formatted hard drive is, if you only look at the data you can read off it, zero. Whereas the entropy of your 100 page essay that was in it just before it was formatted was very high.
This doesn't mean that the entropy in the system has decreased - erasing the data has generated heat in the disk, and in all likelihood hair loss in the owner, so the physical entropy will have increased significantly. However, the entropy of the data will have decreased.
'Information' is one of those rare words in science which means almost exactly the same thing as it means to a non-scientist, albeit defined more specifically. (Scientists are generally fond of stealing phrases from normal English and giving them completely new meanings - such as 'stable marriages' which have nothing to do with the nuclear family.) The information we possess about something is a measure of how well we understand it, and how ordered it is.
However, the process of gaining information is exactly the same as the process of losing uncertainty, so they are essentially the same thing, differing only in sign.
The point of zero entropy or information is set at the case where we know precisely what state the item is in, and there is absolutely no doubt about it. The point of maximum entropy and minimum information occurs when we have absolutely no idea about the system. None at all - neither what state it might be in, nor how many possible states there are, nor what it had for breakfast. This point is set at infinity, though in practice it never occurs.
The units used to measure information or entropy are the bit, the nit, and joules per kelvin. The bit measures the amount of information contained in the answer to a single yes/no question which was equally likely to be true or false ('are you male?'), and the nit (also known as the nat), or 'natural/Napierian bit', is set such that one bit is approximately 0.693 nits. Joules per kelvin is a unit for very large quantities of entropy, such as those associated with heat in physical systems, and one joule per kelvin corresponds to around 10^23 bits.
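The conversions between these units can be sketched in a few lines of Python. This is only an illustration: it assumes the standard conversion in which a thermodynamic entropy S corresponds to S / (k ln 2) bits, where k is the Boltzmann constant (a value not mentioned in this entry), and the function names are made up for the example.

```python
import math

# Assumption: the exact SI value of the Boltzmann constant, in joules per kelvin.
BOLTZMANN_J_PER_K = 1.380649e-23

# One bit expressed in nits (nats) is the natural logarithm of 2.
NITS_PER_BIT = math.log(2)

# One joule per kelvin expressed in bits: 1 / (k * ln 2).
BITS_PER_JOULE_PER_KELVIN = 1 / (BOLTZMANN_J_PER_K * math.log(2))


def bits_to_nits(bits):
    """Convert an amount of information from bits to nits."""
    return bits * NITS_PER_BIT


def joules_per_kelvin_to_bits(entropy_j_per_k):
    """Convert a thermodynamic entropy in J/K to bits."""
    return entropy_j_per_k * BITS_PER_JOULE_PER_KELVIN


print(f"1 bit is about {bits_to_nits(1):.3f} nits")
print(f"1 J/K is about {joules_per_kelvin_to_bits(1):.2e} bits")
```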
Sending Informative Postcards
Every message we receive has a certain amount of information associated with it. The more improbable the message, the more information it has, while a message which tells us about a certainty has no information content at all.
This means that the following postcard is very uninformative, since each of the statements is highly probable, and we might well have already guessed them:
Dear Michael. I am having a lovely time. The weather is lovely, as is the food. I forgot my toothbrush. Wish you were here. Much love, Becks.
Whereas the following postcard has a huge information content, perhaps too high (hence the phrase 'too much information'):
Dear Michael. I've decided to join the Movementarian sect and donate all our worldly goods to them. They believe in truth above all else, so I should tell you that it's not normal, and it doesn't happen to every guy. Also, I was born a man, and it was me who killed your parents, not the inflatable clown. Much love, Becks.
Murphy's Law is in this sense built right into mathematics, since informative messages are precisely the messages that we are likely to disbelieve, whereas messages that seem trustworthy are the ones that are boring and useless.
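The rule that improbable messages carry more information is usually written as log2(1/p) bits for a message of probability p. Here is a minimal sketch, assuming that standard definition; the probabilities attached to the two postcard statements are invented purely for illustration.

```python
import math


def information_bits(probability):
    """Information, in bits, gained from a message with the given probability."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be greater than 0 and at most 1")
    return math.log2(1 / probability)


# Hypothetical probabilities, made up for this example.
weather_is_lovely = 0.5
joined_a_sect = 1 / 1_000_000

print(information_bits(1.0))                # a certainty tells us nothing
print(information_bits(weather_is_lovely))  # a coin-flip statement: one bit
print(information_bits(joined_a_sect))      # a very unlikely statement: many bits
```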
Entropy of the Unreceived
As well as knowing the information we can gain from a single postcard, it might be nice to know how much information we will receive from a postcard we haven't received yet, or how much uncertainty is in our lives as a result of not knowing what the postcard will say. Alternatively, it might be interesting to see how much information is transferred for each letter in an h2g2 entry.
Because mathematicians like to use long pointless words, they call this the Entropy of Ensembles. To find this out, you need to know the probability of every possible message, and hence find the average information, weighted according to the likelihood of it occurring.
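The weighted average described above can be sketched as follows. The four possible postcards and their probabilities are hypothetical, chosen so that the arithmetic comes out neatly.

```python
import math


def entropy_bits(probabilities):
    """Average information in bits: the sum of p * log2(1/p) over every message."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


# Hypothetical ensemble of postcards; the probabilities are invented and sum to 1.
postcards = {
    "having a lovely time": 0.5,
    "forgot my toothbrush": 0.25,
    "missed the flight": 0.125,
    "joined a sect": 0.125,
}

postcard_entropy = entropy_bits(postcards.values())
print(postcard_entropy)            # 1.75 bits per postcard, on average
print(entropy_bits([0.5, 0.5]))    # a fair yes/no question: 1 bit
```

Note that a message with probability zero contributes nothing to the sum, which is why the helper skips it.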
In English and most similar languages, each letter taken on its own provides about 4 bits of information. It just sometimes doesn't seem like it. Since a yes/no question provides at most 1 bit of information, a game of 20 questions provides you with no more than 20 bits of information - just enough to guess a five-letter word if you choose your questions right (but see Memory and State below).
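The arithmetic behind the 20 questions claim is short enough to check directly. The sketch below takes the entry's rough figure of 4 bits per letter at face value; the comparison with 26 equally likely letters is an added assumption, included only to show where a figure of that size comes from.

```python
import math

BITS_PER_LETTER = 4   # the rough figure quoted in this entry
QUESTIONS = 20        # each well-chosen yes/no question yields at most one bit

# Twenty bits can tell apart 2**20 different possibilities.
distinguishable_answers = 2 ** QUESTIONS

# At 4 bits per letter, 20 bits cover a five-letter word.
letters_covered = QUESTIONS // BITS_PER_LETTER

# Assumption for comparison: 26 equally likely letters carry log2(26) bits each.
uniform_bits_per_letter = math.log2(26)

print(distinguishable_answers)   # 1048576
print(letters_covered)           # 5
print(f"{uniform_bits_per_letter:.2f}")
```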
A Multitude of Chaos
So far, we've been dealing with single variables - literally, a thing which varies. A variable might stand for an unreceived postcard, or an unread letter, or any other question to which the answer is currently unknown, but often we have two or more pieces of information that are linked in some way, known as Joint Ensembles. For example, we might be interested in knowing whether somebody is 3ft, 4ft, 5ft, 6ft, or 7ft, and also what gender they are, and these two pieces of information are related. There are lots of ways of measuring how these two combine as a source of information.
The first one of these is the Joint Entropy. If we take X to be the height variable, and Y to be the gender variable, then the joint entropy of height and gender is simply the average information of each possible combination of height and gender (weighted by the likelihood of it occurring). This is the simplest measure, and is very similar to the entropy of a single variable.
The second is the Conditional Entropy. Conditional probability is about finding answers to questions like 'given that Sam is seven foot, what is the probability that (s)he is male?'. Conditional entropy, on the other hand, tells us the average uncertainty remaining in one variable once we know the other, answering questions like 'once we know someone's height, how much uncertainty is left about their gender?'.
Finally, the Mutual Information of X and Y is the amount of information you gain about Y from knowing X, or about X from knowing Y. Alternatively, you can think of it as the amount of reduction in entropy in Y, which occurred as a result of knowing X - as mentioned earlier, entropy and information are opposite sides of the same coin.
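All three measures can be computed from a table of joint probabilities. The sketch below uses an invented table for height and gender (the numbers are not real statistics), and it assumes the standard identities H(Y|X) = H(X,Y) - H(X) and I(X;Y) = H(X) + H(Y) - H(X,Y) rather than anything stated explicitly in this entry.

```python
import math


def entropy_bits(probabilities):
    """Average information in bits of a collection of probabilities."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


def marginal(joint, axis):
    """Sum a joint table over the other variable (axis 0 = height, 1 = gender)."""
    totals = {}
    for outcome, p in joint.items():
        totals[outcome[axis]] = totals.get(outcome[axis], 0.0) + p
    return totals


# Hypothetical joint probabilities P(height, gender); invented for illustration.
joint = {
    ("5ft", "female"): 0.30, ("5ft", "male"): 0.15,
    ("6ft", "female"): 0.15, ("6ft", "male"): 0.30,
    ("7ft", "female"): 0.02, ("7ft", "male"): 0.08,
}

h_xy = entropy_bits(joint.values())                # joint entropy H(X,Y)
h_x = entropy_bits(marginal(joint, 0).values())    # entropy of height alone
h_y = entropy_bits(marginal(joint, 1).values())    # entropy of gender alone

h_y_given_x = h_xy - h_x              # uncertainty left in gender, knowing height
h_x_given_y = h_xy - h_y              # uncertainty left in height, knowing gender
mutual_information = h_x + h_y - h_xy

print(f"H(X,Y) = {h_xy:.3f} bits")
print(f"H(Y|X) = {h_y_given_x:.3f} bits, H(X|Y) = {h_x_given_y:.3f} bits")
print(f"I(X;Y) = {mutual_information:.3f} bits")
```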
Sadly, just because the entropy involved in some quantity has decreased, doesn't mean that we are guaranteed to get the correct result. However, it does mean that the probability of error has decreased - the amount of entropy left in a system can be used to predict the probability of incorrect prediction when the prediction is made by a perfectly intelligent entity, although cynics will claim that no such entity exists.
The Information Diagram
In the following ASCII text, imagine, if you will, that the top left box is a blue circle, and the bottom right one is a red circle.

    +-----------+
    |     A     |
    |     +-----+-----+
    |     |  B  |     |
    +-----+-----+     |
          |     C     |
          +-----------+
In the diagram, the area of the blue circle represents the entropy of one variable (which we can call X), and the area of the red circle represents the entropy of a related variable (which we can call Y). Because X and Y are related, some of their entropy overlaps - the size of this overlap, marked B, is the mutual information of X and Y.
The other measures can also be seen - the size of the area A is the conditional entropy of X given Y, and the size of area C is the conditional entropy of Y given X. Meanwhile, the entire thing is the joint entropy of X and Y combined.
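Written out as equations, the areas in the diagram fit together as shown below. This is a sketch using the conventional symbols H for entropy and I for mutual information, which this entry does not itself introduce.

```latex
\begin{align*}
A &= H(X \mid Y), \qquad B = I(X;Y), \qquad C = H(Y \mid X) \\
H(X) &= A + B \\
H(Y) &= B + C \\
H(X,Y) &= A + B + C
\end{align*}
```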
Memory and State
Sometimes, in the sending of a sequence of events, the events that you receive are dependent on each other - for example, the next word in a sentence is (hopefully) dependent on the previous one, which is what distinguishes conversation from nonsense.
This actually reduces the information rate you get from the source - because it is more predictable, and hence the information for each message is reduced (remember, high probability means low information).
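To see the size of this effect, consider a hypothetical source that emits only two symbols and repeats its previous symbol 90% of the time. The sketch below compares its information rate with that of a source which picks each symbol independently; the 90% figure and the two-symbol alphabet are assumptions made up for the example.

```python
import math


def entropy_bits(probabilities):
    """Average information in bits of a collection of probabilities."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


# Hypothetical source with memory: the next symbol copies the last one 90% of the time.
STAY = 0.9
transition = {
    "a": {"a": STAY, "b": 1 - STAY},
    "b": {"a": 1 - STAY, "b": STAY},
}

# By symmetry, each symbol appears half the time in the long run.
long_run = {"a": 0.5, "b": 0.5}

# Ignoring the memory, each symbol looks like a fair coin flip.
rate_without_memory = entropy_bits(long_run.values())

# With memory, average the uncertainty of the next symbol over the current one.
rate_with_memory = sum(
    long_run[symbol] * entropy_bits(transition[symbol].values())
    for symbol in transition
)

print(f"{rate_without_memory:.3f} bits per symbol without memory")
print(f"{rate_with_memory:.3f} bits per symbol with memory")
```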
But it's not all bad news - such redundant messages help to avoid errors - consider this classic:
Send reinforcements, we're going to advance!
... and consider what happens if you introduce a few random errors:
Semd reingorcememtz, were goink 2 advans!
...but if you have enough errors, as a result of Chinese whispers perhaps, the redundancy won't help you:
Send three and fourpence, we're going to a dance!
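One way to picture how redundancy rescues the lightly garbled message is that only a few words are plausible, so each mangled word can be matched to its nearest plausible neighbour. The sketch below does this with Python's standard difflib module; the tiny vocabulary and the 0.6 similarity cutoff are assumptions for the example, and this is not a claim about how people actually read.

```python
import difflib

# Hypothetical vocabulary: the only words the receiver expects to see.
VOCABULARY = ["send", "reinforcements", "we're", "going", "to", "advance"]


def repair(word, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word


garbled = ["Semd", "reingorcememtz", "were", "goink", "2", "advans"]
repaired = [repair(word) for word in garbled]
print(" ".join(repaired))
```

The '2' survives unrepaired because it shares no characters with any vocabulary word, which is a reminder that redundancy only helps while enough of the original is left.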
One interesting question is to ask exactly how closely two variables are related. For example, the question of skirt-wearing is closely related to gender, whereas the question of colour of eyes is completely unrelated. Although ideas like Mutual Information can help in a general sense, one way of quantifying this more precisely is through the idea of distance.
Distance is defined as the difference between the joint entropy and the mutual information of X and Y. It is highest for pairs of ensembles which have a high individual entropy, and which are completely unrelated. Usefully, this distance shares all the common properties of geometrical distances - for example, it is symmetrical, and D(X,X) is zero for any ensemble X.
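The definition above translates directly into code. In this sketch the joint tables are invented: one compares a variable with itself, one pairs two unrelated fair coin-flips, and one is a loosely related pair used only to check symmetry.

```python
import math


def entropy_bits(probabilities):
    """Average information in bits of a collection of probabilities."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


def distance(joint):
    """D(X,Y) = H(X,Y) - I(X;Y) for a dict mapping (x, y) pairs to probabilities."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    h_xy = entropy_bits(joint.values())
    mutual = entropy_bits(p_x.values()) + entropy_bits(p_y.values()) - h_xy
    return h_xy - mutual


# A variable compared with itself: both coordinates always agree.
same = {("heads", "heads"): 0.5, ("tails", "tails"): 0.5}

# Two unrelated fair yes/no variables (hypothetical values).
unrelated = {(x, y): 0.25 for x in ("yes", "no") for y in ("blue", "brown")}

# A loosely related pair, and the same table with X and Y swapped.
related = {("yes", "blue"): 0.4, ("no", "blue"): 0.1, ("no", "brown"): 0.5}
swapped = {(y, x): p for (x, y), p in related.items()}

print(distance(same))        # zero: nothing separates X from itself
print(distance(unrelated))   # two bits: high entropy and no shared information
print(abs(distance(related) - distance(swapped)) < 1e-12)   # symmetry check
```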
If this distance seems too simple and straightforward, try Relative Entropy. Simply put, it's a measure of how inefficient it is to assume that some variable is X when it's actually Y. This is very important for finding the best way to transmit information around, but sadly is horrendously complicated to calculate.
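For small, fully known distributions the calculation is at least short to write down. The sketch below assumes the standard definition of relative entropy (also known as the Kullback-Leibler divergence), which this entry does not spell out, and uses two invented coin distributions; real sources are where the complications set in.

```python
import math


def relative_entropy_bits(actual, assumed):
    """Extra bits per symbol paid for treating `actual` as if it were `assumed`.

    Both arguments are lists of probabilities over the same outcomes, and
    `assumed` must be non-zero wherever `actual` is non-zero.
    """
    return sum(p * math.log2(p / q) for p, q in zip(actual, assumed) if p > 0)


# Hypothetical distributions: a fair coin and a heavily biased one.
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(relative_entropy_bits(fair, fair))      # no penalty for the right assumption
print(relative_entropy_bits(fair, biased))    # penalty for assuming bias wrongly
print(relative_entropy_bits(biased, fair))    # a different penalty the other way
```

Unlike the distance described earlier, this measure is not symmetrical, so the order of the two arguments matters.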
If you're asking what the point of all this is, then answers concerning the innate value of knowledge, or the beauty of pure maths, won't interest you. So instead here are a few examples of applications of these sorts of ideas.
The first application is in the realm of physics - entropy is the subject of the famous laws of thermodynamics, and a better understanding of entropy from a theoretical viewpoint means that we can better understand what properties the universe must have in order for these laws to hold - often very basic properties of the sort that physicists try and investigate. Alternatively, if people find holes in the laws, they might look at these kinds of ideas to try and work out what might have gone wrong.
The second application is in the realm of compressing data. It's a mathematical fact that the smallest size a collection of data can possibly be compressed to, without losing any of it, is its entropy - and this puts an upper bound on how good programs can possibly be, as well as giving us clues on how we might make them better.
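The compression limit can be checked against a real compressor. This sketch generates a hypothetical stream of two symbols, one much likelier than the other, estimates its entropy from the symbol counts, and compares that with what Python's standard zlib module manages. The 90/10 split, the stream length and the fixed random seed are all assumptions chosen for the example.

```python
import collections
import math
import random
import zlib

# Hypothetical source: 10,000 independent symbols, 'a' with probability 0.9.
rng = random.Random(0)   # fixed seed so that the run is repeatable
data = bytes(rng.choices(b"ab", weights=[0.9, 0.1], k=10_000))

# Estimate the entropy per symbol from how often each symbol actually occurred.
counts = collections.Counter(data)
entropy_per_symbol = sum(
    (count / len(data)) * math.log2(len(data) / count)
    for count in counts.values()
)
entropy_bytes = entropy_per_symbol * len(data) / 8

# Compress with zlib at its highest setting and measure the result.
compressed = zlib.compress(data, 9)
compressed_bytes = len(compressed)

print(f"original:      {len(data)} bytes")
print(f"entropy limit: {entropy_bytes:.0f} bytes")
print(f"zlib output:   {compressed_bytes} bytes")
```

A general-purpose compressor gets a long way towards the limit on data like this, but it should not be expected to beat it.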
The third application is in storing data safely - so that cosmic rays or random errors causing a few bits to change can be detected, and hopefully, corrected. We can even calculate the maximum rate that information can be passed over some noisy wire, and, again, gain some clues on how to do it well.
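Both halves of this idea can be sketched briefly. The first part below uses a simple repetition code, which sends every bit three times and takes a majority vote, and the second uses the standard formula for the capacity of a wire that flips each bit with a fixed probability. Neither the repetition code nor the formula comes from this entry; they are stand-ins chosen because they are easy to follow, and real systems use much more efficient codes.

```python
import math


def encode(bits, repeats=3):
    """Repeat every bit so that isolated errors can be outvoted."""
    return [bit for bit in bits for _ in range(repeats)]


def decode(received, repeats=3):
    """Recover each bit by majority vote over its group of copies."""
    return [
        int(sum(received[i:i + repeats]) > repeats // 2)
        for i in range(0, len(received), repeats)
    ]


def binary_entropy(p):
    """Entropy in bits of a yes/no event that happens with probability p."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def wire_capacity(flip_probability):
    """Maximum useful bits per transmitted bit when each bit may be flipped."""
    return 1 - binary_entropy(flip_probability)


message = [1, 0, 1, 1]
corrupted = encode(message)
corrupted[1] ^= 1   # a stray cosmic ray hits the first group
corrupted[5] ^= 1   # and another hits the second group

print(decode(corrupted) == message)   # True: one error per group is corrected
print(f"{wire_capacity(0.11):.2f}")   # roughly half a bit per bit sent
```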
The fourth application is, surprisingly, relativity theory. Currently, scientists think that the fastest way to transfer information is at the speed of light, and when they say that, they're talking about information in the sense it's meant in this entry.
Here's another application that is a little more speculative. Suppose you have a great big volume of air, like that floating over the world, and you want to transfer information through it using radio waves. Given that radio frequencies interfere with each other, and antennae placed in proximity to one another similarly get in the way, what's the limit on the total number of mobile phone calls you can carry simultaneously?
Amazingly, there are hard limits on this amount, and those limits can be derived, eventually, from the quantities given in this entry. Better news yet is that from those limits we can work out how to increase capacity by using, for example, signals reflected off nearby buildings, clouds, and the ionosphere.
On a more prosaic note, the discovery that highly compressed data 'looks like noise' was the inspiration behind the transmission method CDMA, which is used by many digital mobile phones. This has the nice property that there is no hard limit on the number of conversations you can carry, but the more there are, the worse the reception gets.
This view of information and entropy has a rather curious by-product; the amount of entropy in some systems will be different for different people - depending on how much they have been observing it. But entropy is widely believed to be a basic part of the universe, strictly governed by the laws of thermodynamics. So, at a fundamental level, Ultimate Truth changes depending on who's looking at it.
Don't be alarmed too much, though - it doesn't change very much...