This blog post is adapted from ex-OpenAI researcher and Anthropic co-founder <a href="https://colah.github.io/posts/2015-09-Visual-Information/">Christopher Olah’s wonderful work</a>. I removed parts that are generally common sense to a CS kid and added some of my own notes &amp; explanations.

<h2 id="visualizing-probability-distribution">Visualizing Probability Distribution</h2>
<p>The author did a really cool job visualizing probability distributions here, but it doesn’t provide any more knowledge than basic probability theory, so I removed this part.</p>
<h2 id="code">Code</h2>
<p>I want to communicate with my friend Bob, so we establish a code, mapping each possible word we may say to a sequence of bits.</p>

<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/code-2bit.png" alt="fixed length code" /><figcaption aria-hidden="true">fixed length code</figcaption>
<p>However, Bob is a dog lover: with very high probability, he talks about dogs.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/DogWordFreq.png" alt="Bob’s word frequency" /><figcaption aria-hidden="true">Bob’s word frequency</figcaption>
<p>Incorporating the code we defined above into this graph, we get the following diagram, with the vertical axis visualizing the probability of each word, p(x), and the horizontal axis visualizing the length of the corresponding codeword, L(x). Notice that the total area is the expected length of a codeword we send.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/OldCode.png" alt="Expected length of fixed-length code" /><figcaption aria-hidden="true">Expected length of fixed-length code</figcaption>
<h2 id="variable-length-code">Variable-Length Code</h2>
<p>Perhaps we could be very clever and make a variable-length code where codewords for common words are made especially short. The challenge is that there’s competition between codewords – making some shorter forces us to make others longer. To minimize the message length, we especially want the commonly used ones to be short. So the resulting code has shorter codewords for common words (like “dog”) and longer codewords for less common words (like “bird”).</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/code.png" alt="Variable length code" /><figcaption aria-hidden="true">Variable length code</figcaption>
<p>Looking at this code together with the word frequencies, on average the length of a codeword is now 1.75 bits!</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/NewCode.png" alt="Expected length of variable-length code" /><figcaption aria-hidden="true">Expected length of variable-length code</figcaption>
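<p>Here is a minimal sketch of this expected-length calculation, assuming the word frequencies and codewords shown in the figures (only “dog” → “0” and “bird” → “111” are stated explicitly later in the text; the rest are read off the images):</p>
<pre><code class="language-python">
# Sketch: expected bits per word under a fixed-length and a variable-length code.
# Word frequencies and codewords are assumptions read off the figures; the exact
# 2-bit assignments don't matter, since any fixed 2-bit code gives the same average.
p = {"dog": 1/2, "cat": 1/4, "fish": 1/8, "bird": 1/8}

fixed_code    = {"dog": "00", "cat": "01", "fish": "10",  "bird": "11"}
variable_code = {"dog": "0",  "cat": "10", "fish": "110", "bird": "111"}

def expected_length(code, p):
    """Average number of bits per word: sum of p(x) * L(x)."""
    return sum(p[w] * len(cw) for w, cw in code.items())

print(expected_length(fixed_code, p))     # 2.0  bits per word
print(expected_length(variable_code, p))  # 1.75 bits per word
</code></pre>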
<p>It turns out that this code is the best possible code. There is no code which, for this word frequency distribution, will give us an average codeword length of less than 1.75 bits.</p>
<p>There is simply a fundamental limit. Communicating what word was said, what event from this distribution occurred, requires us to communicate at least 1.75 bits on average. No matter how clever our code, it’s impossible to get the average message length to be less. We call this fundamental limit the entropy of the distribution – we’ll discuss it in much more detail shortly.</p>
<h2 id="the-space-of-codewords">The Space of Codewords</h2>
<p>To make our codewords uniquely decodable, we want them to follow the prefix property: no codeword should be the prefix of another codeword. That is, if we see a particular codeword, there shouldn’t be some longer version that is also a codeword. Codes that obey this property are also called prefix codes.</p>
<p>One useful way to think about this is that every codeword requires a sacrifice from the space of possible codewords. If we take the codeword 01, we lose the ability to use any codewords it’s a prefix of. We can’t use 010 or 011010110 anymore because of ambiguity – they’re lost to us. The following graph shows we in effect lost <span class="math inline">$\frac {1} {2^{L(01)}} = \frac {1} {2^2} = \frac {1}{4}$</span> of our codeword space.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/CodeSpaceUsed.png" alt="Code space sacrificed" /><figcaption aria-hidden="true">Code space sacrificed</figcaption>
<p>Since a quarter of all codewords start with 01, we’ve sacrificed a quarter of all possible codewords. That’s the price we pay in exchange for having one codeword that’s only 2 bits long! In turn this sacrifice means that all the other codewords need to be a bit longer. There’s always this sort of trade-off between the lengths of the different codewords. A short codeword requires you to sacrifice more of the space of possible codewords, preventing other codewords from being short. What we need to figure out is the right trade-off to make!</p>
<h2 id="cost-of-codeword">Cost of Codeword</h2>
<p>You can think of this like having a limited budget to spend on getting short codewords. In fact, we have a budget of 1 to spend, where 1 is the area of the whole codeword space. We pay for one codeword by sacrificing a fraction of possible codewords.</p>
<p>We define the cost c of having a codeword x with length L as <span class="math inline">$\frac 1 {2^{L(x)}}$</span>.</p>
<ul>
<li>The cost of buying a codeword of length 0 is 1, all possible codewords – if you want to have a codeword of length 0, you can’t have any other codeword. <span class="math inline">$x = \emptyset, c = \frac 1 {2^{L(\emptyset)}} = \frac 1 {2^0}$</span></li>
<li>The cost of a codeword of length 1, like “0”, is 1/2 because half of all possible codewords start with “0”. <span class="math inline">$x = 0, c = \frac 1 {2^{L(0)}} = \frac 1 {2^1}$</span></li>
<li>The cost of a codeword of length 2, like “01”, is 1/4 because a quarter of all possible codewords start with “01”. <span class="math inline">$x = 01, c = \frac 1 {2^{L(01)}} = \frac 1 {2^2}$</span></li>
<li>…</li>
</ul>
<p>(Ignore the gray area in the following graph; the function we plot is <span class="math inline">$\frac 1 {2^L}$</span> and the cost is the height.)</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/code-costonly.png" alt="Code cost" /><figcaption aria-hidden="true">Code cost</figcaption>
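<p>A small sketch of the budget bookkeeping, reusing the codeword lengths assumed above: each codeword costs 1/2^L(x), and a prefix code with those lengths exists only if the total cost stays within the budget of 1 (this is the classic Kraft inequality, which is just the budget idea stated formally):</p>
<pre><code class="language-python">
# Sketch: the "budget" spent by a set of codeword lengths is the sum of
# 1 / 2^L(x). A prefix code with these lengths exists only if this sum is
# at most 1; our variable-length code spends the whole budget exactly.
lengths = {"dog": 1, "cat": 2, "fish": 3, "bird": 3}  # assumed from the figures

budget_spent = sum(2 ** -L for L in lengths.values())
print(budget_spent)   # 1.0 -- the entire codeword space is used up
</code></pre>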
<p>To make the calculation simpler, instead of base 2, we imagine using the natural base e. It is no longer a binary code but a “natural” code, and the cost becomes <span class="math inline">$\frac 1 {e^L}$</span>. The cost is then both the height and the gray area: <span class="math inline">$\frac 1 {e^L} = \int^\infty_L \frac 1 {e^x} \; dx$</span></p>
<h2 id="optimal-encoding">Optimal Encoding</h2>
<p>Recall the expected codeword length we pictured above. If we look at each codeword’s contribution to this expected length, each codeword makes the average message length longer by its probability times its length. For example, if we need to send a codeword that is 4 bits long 50% of the time, our average message length is 2 bits longer than it would be if we weren’t sending that codeword. We can picture this as a rectangle.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/code-lengthcontrib.png" alt="Contribution to expected length" /><figcaption aria-hidden="true">Contribution to expected length</figcaption>
<p>Now we have two values related to the length of a codeword, and we picture them together in the following graph:</p>
<ol type="1">
<li>The amount we pay decides the length of the codeword: the gray part decides the width of the purple rectangle.</li>
<li>The length of the codeword controls how much it adds to the average message length: the width of the rectangle decides its area (the height is fixed for a given word because the height represents the word frequency).</li>
</ol>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/code-cost.png" alt="Contribution and cost" /><figcaption aria-hidden="true">Contribution and cost</figcaption>
<p>Short codewords reduce the average message length but are expensive, while long codewords increase the average message length but are cheap.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/code-cost-longshort.png" alt="short vs long" /><figcaption aria-hidden="true">short vs long</figcaption>
<p>What’s the best way to use our limited budget? Just like one wants to invest more in tools that one uses regularly, we want to spend more on frequently used codewords. There’s one particularly natural way to do this: distribute our budget in proportion to how common an event is – <span class="math inline">c(x) = p(x)</span>. The following graph shows such an encoding system.</p>
<p>In fact, that’s exactly what we did in our code with Bob: “dog” appears <span class="math inline">$\frac 1 2$</span> of the time, so we give it the codeword “0” with cost <span class="math inline">$c = \frac 1 {2^{L(0)}} = \frac 1 {2^1}$</span>; “bird” appears <span class="math inline">$\frac 1 8$</span> of the time, so we give it the codeword “111” with cost <span class="math inline">$c = \frac 1 {2^{L(111)}} = \frac 1 {2^3}$</span>. If an event only happens 1% of the time, we only spend 1% of our budget.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/code-auction-balanced-noderivs.png" alt="Contribution and cost line up" /><figcaption aria-hidden="true">Contribution and cost line up</figcaption>
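<p>As a sanity check that c(x) = p(x) really reproduces Bob’s code, here is a sketch computing the implied codeword lengths directly from the cost definition (word frequencies assumed from the figures):</p>
<pre><code class="language-python">
from math import log2

# Sketch: if we spend cost c(x) = p(x) on each word, the cost definition
# c = 1 / 2^L implies a codeword length of L(x) = log2(1 / p(x)).
p = {"dog": 1/2, "cat": 1/4, "fish": 1/8, "bird": 1/8}  # assumed from the figures

for word, prob in p.items():
    print(word, log2(1 / prob))   # dog 1.0, cat 2.0, fish 3.0, bird 3.0

# These match the lengths 1, 2, 3, 3 of the variable-length code above.
</code></pre>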
<p>The author then proved it is the optimal thing to do. Honestly I didn’t understand the proof, but the code system we’re talking about here is no more than a general version of Huffman Encoding. So I guess it’s fine as long as you understand how to prove the greedy Huffman Encoding is optimal.</p>
<h2 id="calculating-entropy">Calculating Entropy</h2>
<p>In the Cost of Codeword section, we defined the cost c of having a codeword x with length L(x) as <span class="math inline">$\frac 1 {2^{L(x)}}$</span>. By inverting this definition, given c, we can infer the length <span class="math inline">$L(x) = \log_2 \frac 1 c$</span>.</p>
<p>Furthermore, in the Optimal Encoding section, we showed the optimal cost to spend on each word is <span class="math inline">c(x) = p(x)</span>, so the codeword length in the optimal encoding system is <span class="math inline">$L(x) = \log_2 \frac 1 {p(x)}$</span>.</p>
<p>Earlier, we said: given a probability distribution of what we want to communicate, there is a fundamental limit on the expected code length we need to send, no matter how smart our code is. This limit, the expected code length using the best possible code, is called the entropy of p: H(p).</p>
<p>\(H(p) = \mathbb{E}_{p(x)} [L(x)] = \sum_x p(x) \log_2 \frac 1 {p(x)}\)</p>
<p>People usually write entropy in the following way, which makes it much less intuitive.</p>
<p><span class="math display">\[H(p) = -\sum_x p(x)\log_2 p(x)\]</span></p>
<p>The entropy, which by definition is the shortest possible average code length, has clear implications for compression. In addition, it describes how uncertain I am and gives a way to quantify information:</p>
<ul>
<li>If I knew for sure what was going to happen, I wouldn’t have to send a message at all! <span class="math inline">$p(x) = 1 \implies H(p) = 0$</span></li>
<li>If there are two things that could happen with 50% probability each, I only need to send 1 bit. <span class="math inline">$\forall x_i,\ p(x_i) = 0.5 \implies H(p) = 1$</span></li>
<li>If there are 64 different things that could happen with equal probability, I’d have to send 6 bits. <span class="math inline">$\forall x_i,\ p(x_i) = \frac 1 {64} \implies H(p) = 6$</span></li>
</ul>
<p>The more concentrated the probability, the more I can craft a clever code with short average messages. The more diffuse the probability, the longer my messages have to be.</p>
<h2 id="cross-entropy">Cross Entropy</h2>
<p>Imagine now that a cat lover, Alice, talks about animals with the following word frequencies:</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/DogCatWordFreq.png" alt="Alice loves cats" /><figcaption aria-hidden="true">Alice loves cats</figcaption>
<p>When Alice sends messages to Bob using Bob’s code, her messages are longer than they need to be. Bob’s code was optimized to his probability distribution. Alice has a different probability distribution, and the code is suboptimal for it: the expected codeword length when Bob uses his own code is 1.75 bits, while when Alice uses his code it’s 2.25 bits. It would be even worse if the two distributions weren’t so similar!</p>
<p>The expected length of communicating an event from distribution <span class="math inline">q(x)</span> with the optimal code for another distribution <span class="math inline">p(x)</span> is called the cross-entropy. Formally, we define cross-entropy as (people usually write <span class="math inline">H(q, p)</span> instead):</p>
<p>\(H_p(q) = \mathbb E_{q} \log_2\left(\frac{1}{p(x)}\right)= \sum_x q(x)\log_2\left(\frac{1}{p(x)}\right)\)</p>
<p>To lower the cost of Alice’s messages, I asked both of them to use Alice’s code instead, but this made it suboptimal for Bob – and surprisingly, it’s worse for Bob to use Alice’s code than for Alice to use his! Cross-entropy isn’t symmetric.</p>
<ul>
<li>Bob using his own code (<span class="math inline">$H(p) = 1.75$</span> bits)</li>
<li>Alice using Bob’s code (<span class="math inline">$H_p(q) = 2.25$</span> bits)</li>
<li>Alice using her own code (<span class="math inline">$H(q) = 1.75$</span> bits)</li>
<li>Bob using Alice’s code (<span class="math inline">$H_q(p) = 2.375$</span> bits)</li>
</ul>
<p>In the following diagram, if the messages are coming from the same distribution the plots are beside each other, and if they use the same codes they are on top of each other. This allows you to kind of visually slide the distributions and codes together.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/CrossEntropyCompare.png" alt="4 cases of cross entropy" /><figcaption aria-hidden="true">4 cases of cross entropy</figcaption>
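<p>As a quick numeric check of these four values, here is a minimal sketch. Bob’s and Alice’s word frequencies are read off the figures, so Alice’s exact numbers are an assumption, but they do reproduce the stated values:</p>
<pre><code class="language-python">
from math import log2

# Sketch reproducing the four numbers above. Alice's frequencies are an
# assumption read off the figure.
bob   = {"dog": 1/2, "cat": 1/4, "fish": 1/8, "bird": 1/8}   # p
alice = {"cat": 1/2, "fish": 1/4, "dog": 1/8, "bird": 1/8}   # q

def cross_entropy(q, p):
    """Expected bits when events come from q but the code is optimal for p."""
    return sum(q[w] * log2(1 / p[w]) for w in q)

print(cross_entropy(bob, bob))      # H(p)   = 1.75  (Bob, his own code)
print(cross_entropy(alice, bob))    # H_p(q) = 2.25  (Alice, Bob's code)
print(cross_entropy(alice, alice))  # H(q)   = 1.75  (Alice, her own code)
print(cross_entropy(bob, alice))    # H_q(p) = 2.375 (Bob, Alice's code)
</code></pre>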
<p>So why is one bigger than the other? When Bob uses Alice’s code (<span class="math inline">$H_q(p)$</span>), Bob says “dog” with high probability, but “dog” is one of the words Alice says the least, so her code assigns it one of the longest codewords. On the other hand, for <span class="math inline">$H_p(q)$</span>, Bob doesn’t dislike “cat” that much, so his code doesn’t give it too long a codeword.</p>
<h2 id="kl-divergence">KL Divergence</h2>
<p>Cross-entropy gives us a way to express how different two probability distributions are. When using <span class="math inline">q</span>’s codewords for <span class="math inline">p</span>’s distribution, the more different the distributions <span class="math inline">p</span> and <span class="math inline">q</span> are, the bigger the gap between the cross-entropy of <span class="math inline">p</span> with respect to <span class="math inline">q</span> and the entropy of <span class="math inline">p</span>. In math, <span class="math inline">$D_q(p) = H_q(p) - H(p)$</span> gets bigger.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/CrossEntropyQP.png" alt="H_q(p)" /><figcaption aria-hidden="true">H_q(p)</figcaption>
<p>Similarly when using <span class="math inline">p</span>’s codewords for <span class="math inline">q</span>’s distribution.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/CrossEntropyPQ.png" alt="H_p(q)" /><figcaption aria-hidden="true">H_p(q)</figcaption>
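<p>Here is a small numeric check of these two gaps, a sketch that reuses the Bob and Alice frequencies assumed earlier:</p>
<pre><code class="language-python">
from math import log2

# Sketch: the gap D_q(p) = H_q(p) - H(p), computed directly from the
# definition sum_x p(x) log2(p(x)/q(x)). Distributions as assumed earlier.
bob   = {"dog": 1/2, "cat": 1/4, "fish": 1/8, "bird": 1/8}   # p
alice = {"cat": 1/2, "fish": 1/4, "dog": 1/8, "bird": 1/8}   # q

def kl(p, q):
    """Extra bits paid when events from p are sent with a code optimal for q."""
    return sum(p[w] * log2(p[w] / q[w]) for w in p)

print(kl(bob, alice))   # D_q(p) = 2.375 - 1.75 = 0.625 bits
print(kl(alice, bob))   # D_p(q) = 2.25  - 1.75 = 0.5   bits -- not symmetric
</code></pre>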
<p>The really interesting thing is the difference <span class="math inline">$D_q(p)$</span>. That difference is how much longer our messages are because we used a code optimized for a different distribution <span class="math inline">q</span>. If the distributions are the same, this difference will be zero. As the distributions grow apart, it gets bigger.</p>
<p>This difference is the Kullback–Leibler divergence, or just the KL divergence. The really neat thing about KL divergence is that it’s like a distance between two distributions. It measures how different they are! (If you take that idea seriously, you end up with information geometry.)</p>
<p>People usually write it as <span class="math inline">$D_{KL}(p \| q)$</span>, but in our more intuitive notation:</p>
<p><span class="math display">\[D_q(p) = H_q(p) - H(p)\]</span></p>
<p>If you expand the definition of KL divergence, you get:</p>
<p>\(\begin{align}D_q(p) &amp;= \sum_x p(x)\log_2\left(\frac{p(x)}{q(x)} \right) \\&amp;= \sum_x p(x) \left[ \log_2\left(\frac{1}{q(x)} \right) -\log_2\left(\frac{1}{p(x)} \right)\right]\\&amp;= \sum_x p(x) \left[ L_q(x) - L_p(x)\right]\end{align}\)</p>
<p>And we see that <span class="math inline">$\log_2\left(\frac{p(x)}{q(x)} \right)$</span> is simply the difference between how many bits a code optimized for <span class="math inline">q</span> and a code optimized for <span class="math inline">p</span> would use to represent <span class="math inline">x</span>. The expression as a whole is the expected difference in how many bits the two codes would use.</p>
<h2 id="multiple-variables-and-joint-entropy">Multiple Variables and Joint Entropy</h2>
<p>Now consider a weather and clothing example. My mother, like many parents, worries that I don’t dress appropriately for the weather. So, she often wants to know both the weather and what clothing I’m wearing. How many bits do I have to send her to communicate this?</p>
<p>To send both pieces of information, we can flatten the probability distribution (calculate the joint probability):</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/prob-2D-factored1-flat.png" alt="flattened probability" /><figcaption aria-hidden="true">flattened probability</figcaption>
<p>Now we can figure out the optimal codewords for events of these probabilities and compute the average message length:</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/Hxy-flat.png" alt="joint word length" /><figcaption aria-hidden="true">joint word length</figcaption>
<p>Everything is the exact same as our normal definition, except with two variables instead of one. We call this the joint entropy of <span class="math inline">X</span> and <span class="math inline">Y</span>, defined as</p>
<p>\(H(X,Y) = \mathbb E_{p(x,y)} \log_2\left(\frac{1}{p(x,y)}\right) =\sum_{x,y} p(x,y) \log_2\left(\frac{1}{p(x,y)}\right)\)</p>
<p>It’s in fact more intuitive not to flatten it, but instead to keep the 2-dimensional square and add codeword length as a third dimension, height. Now the entropy is the volume.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/Hxy-3D.png" alt="3D code length" /><figcaption aria-hidden="true">3D code length</figcaption>
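<p>A minimal sketch of this joint-entropy computation; the joint probabilities are an assumption read off the figures, chosen to be consistent with the numbers used in the next section:</p>
<pre><code class="language-python">
from math import log2

# Sketch of the joint entropy H(X, Y). The joint probabilities are an
# assumption: P(raining) = 1/4, P(sunny) = 3/4, with a 1/4 vs 3/4 clothing
# split within each kind of weather.
joint = {
    ("sunny",   "t-shirt"): 3/4 * 3/4,   # 0.5625
    ("sunny",   "coat"):    3/4 * 1/4,   # 0.1875
    ("raining", "t-shirt"): 1/4 * 1/4,   # 0.0625
    ("raining", "coat"):    1/4 * 3/4,   # 0.1875
}

H_XY = sum(p * log2(1 / p) for p in joint.values())
print(H_XY)   # roughly 1.62 bits to send weather and clothing together
</code></pre>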
<h2 id="conditional-entropy">Conditional Entropy</h2>
<p>Suppose my mom already knows the weather. She can check it on the news. Now how much information do I need to provide? I actually need to send less, because the weather strongly implies what clothing I’ll wear!</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/HxCy-sep.png" alt="Conditional code length separated" /><figcaption aria-hidden="true">Conditional code length separated</figcaption>
<p>When it’s sunny, I can use a special sunny-optimized code, and when it’s raining I can use a rain-optimized code. In both cases, I send less information than if I used a generic code for both. For example, when it’s raining, I send a codeword of length <span class="math inline">$\log_2 4 = 2$</span> bits when I wear a t-shirt and of length <span class="math inline">$\log_2 \frac 4 3 \approx 0.42$</span> bits when I wear a coat.</p>
<p>\(H(X \mid \text{Y = raining})= \frac 1 4 \log_2 4 + \frac 3 4 \log_2 \frac 4 3= 0.81\)</p>
<p>(Don’t worry for now about why the entropy, representing the length of an optimal codeword, can be fractional; we will <a href="#fractional-bits">explain it later</a>.)</p>
<p>With a similar calculation, we show <span class="math inline">$H(X \mid \text{Y = sunny}) = 0.81$</span>. To get the average amount of information I need to send my mother, I just put these two cases together:</p>
<p>\(\begin{align}H(X \mid Y) &amp;= P(\text{Y = raining})\,H(X \mid \text{Y = raining})+P(\text{Y = sunny})\,H(X \mid \text{Y = sunny}) \\&amp;= \frac 1 4 \times 0.81 + \frac 3 4 \times 0.81 = 0.81\end{align}\)</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/HxCy.png" alt="Conditional code length together" /><figcaption aria-hidden="true">Conditional code length together</figcaption>
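<p>The same calculation as a short sketch, using the probabilities assumed above:</p>
<pre><code class="language-python">
from math import log2

# Sketch of the conditional entropy calculation above: P(raining) = 1/4,
# P(sunny) = 3/4, and a 1/4 vs 3/4 clothing split within each weather.
def H(probs):
    """Entropy of a distribution given as a list of probabilities."""
    return sum(p * log2(1 / p) for p in probs)

H_given_rain  = H([1/4, 3/4])   # roughly 0.81
H_given_sunny = H([3/4, 1/4])   # roughly 0.81

H_X_given_Y = 1/4 * H_given_rain + 3/4 * H_given_sunny
print(H_X_given_Y)              # roughly 0.81 bits
</code></pre>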
<p>We call this the conditional entropy:</p>
<p>\(\begin{align}H(X|Y) &amp;= \sum_y p(y) H(X \mid Y=y) \\&amp;= \sum_y p(y) \sum_x p(x|y) \log_2\left(\frac{1}{p(x|y)}\right) \\&amp;= \sum_{x,y} p(x,y) \log_2\left(\frac{1}{p(x|y)}\right)\end{align}\)</p>
<h2 id="mutual-information">Mutual Information</h2>
<p>In the previous section, we observed that knowing one variable can mean that communicating another variable requires less information.</p>
<p>One nice way to think about this is to imagine amounts of information as bars. These bars overlap if there’s shared information between them: some of the information in <span class="math inline">X</span> and <span class="math inline">Y</span> is shared between them, so <span class="math inline">H(X)</span> and <span class="math inline">H(Y)</span> are overlapping bars. And since <span class="math inline">H(X, Y)</span> is the information in both, it’s the union of the bars <span class="math inline">H(X)</span> and <span class="math inline">H(Y)</span>.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/Hxy-info-1.png" alt="Multi-variate bar 1" /><figcaption aria-hidden="true">Multi-variate bar 1</figcaption>
<p>We rank the information needed to communicate these 3 kinds of things in descending order:</p>
<ul>
<li>Both <span class="math inline">X</span> and <span class="math inline">Y</span>: joint entropy <span class="math inline">H(X, Y)</span></li>
<li><span class="math inline">X</span> alone: marginal entropy <span class="math inline">H(X)</span></li>
<li><span class="math inline">X</span> when <span class="math inline">Y</span> is known: conditional entropy <span class="math inline">H(X|Y)</span></li>
</ul>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/Hxy-overview.png" alt="Ranking information contained" /><figcaption aria-hidden="true">Ranking information contained</figcaption>
<p>In the bar perspective, <span class="math inline">H(X|Y)</span> is the information we need to send to communicate <span class="math inline">X</span> to someone who already knows <span class="math inline">Y</span> – it is the information in <span class="math inline">X</span> which isn’t also in <span class="math inline">Y</span>. Visually, that means <span class="math inline">H(X|Y)</span> is the part of the <span class="math inline">H(X)</span> bar which doesn’t overlap with <span class="math inline">H(Y)</span>.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/Hxy-info-4.png" alt="Multi-variate bar 2" /><figcaption aria-hidden="true">Multi-variate bar 2</figcaption>
<p>From this, we get another identity: the information in <span class="math inline">X</span> and <span class="math inline">Y</span> is the information in <span class="math inline">Y</span> plus the information in <span class="math inline">X</span> which is not in <span class="math inline">Y</span>. (Sounds like sets, doesn’t it?)</p>
<p><span class="math display">\[H(X, Y) = H(Y) + H(X|Y)\]</span></p>
<p>To wrap up, we have</p>
<ul>
<li>the information in each variable: <span class="math inline">H(X)</span> and <span class="math inline">H(Y)</span></li>
<li>the union of the information in both: <span class="math inline">H(X, Y)</span></li>
<li>the information in one but not the other: <span class="math inline">H(X|Y)</span> and <span class="math inline">H(Y|X)</span></li>
</ul>
<p>We can further define mutual information: the information in both <span class="math inline">X</span> and <span class="math inline">Y</span>, or in set terms, the intersection of information:</p>
<p><span class="math display">\[I(X, Y) = H(X) + H(Y) - H(X, Y)\]</span></p>
<blockquote>
<p>If you expand the definition of mutual information out, you get:</p>
<p>\(I(X,Y) = \sum_{x,y} p(x,y) \log_2\left(\frac{p(x,y)}{p(x)p(y)} \right)\)</p>
<p>That looks suspiciously like KL divergence!</p>
<p>Well, it is KL divergence. It’s the KL divergence of <span class="math inline">P(X, Y)</span> and its naive approximation <span class="math inline">P(X)P(Y)</span>. That is, it’s the number of bits you save representing <span class="math inline">X</span> and <span class="math inline">Y</span> if you understand the relationship between them instead of assuming they’re independent.</p>
</blockquote>
<p>We can also define the variation of information: the information which isn’t shared between the variables. It gives a metric of distance between different variables. The variation of information between two variables is zero if knowing the value of one tells you the value of the other, and increases as they become more independent.</p>
<p><span class="math display">\[V(X, Y) = H(X, Y) - I(X, Y)\]</span></p>
<p>How does this relate to KL divergence, which also gave us a notion of distance? Well, KL divergence gives us a distance between two distributions over the same variable or set of variables. In contrast, variation of information gives us a distance between two jointly distributed variables. KL divergence is between distributions; variation of information is within a distribution.</p>
<p>To summarize, we can put them all in a single diagram:</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/Hxy-info.png" alt="Multi-variate bar 3" /><figcaption aria-hidden="true">Multi-variate bar 3</figcaption>
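<p>As a numeric sketch tying these quantities together, using the joint distribution assumed in the weather and clothing example above:</p>
<pre><code class="language-python">
from math import log2

# Sketch: joint, marginal, conditional, mutual information, and variation of
# information, all from the weather/clothing joint distribution assumed above.
joint = {
    ("sunny",   "t-shirt"): 0.5625, ("sunny",   "coat"): 0.1875,
    ("raining", "t-shirt"): 0.0625, ("raining", "coat"): 0.1875,
}

def H(probs):
    return sum(p * log2(1 / p) for p in probs)

H_XY = H(joint.values())
H_Y  = H([0.75, 0.25])                        # weather marginal
H_X  = H([0.5625 + 0.0625, 0.1875 + 0.1875])  # clothing marginal

H_X_given_Y = H_XY - H_Y                      # chain rule: H(X,Y) = H(Y) + H(X|Y)
I_XY = H_X + H_Y - H_XY                       # mutual information
V_XY = H_XY - I_XY                            # variation of information

print(H_X_given_Y, I_XY, V_XY)   # roughly 0.81, 0.14, 1.48
</code></pre>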
<h2 id="fractional-bits">Fractional Bits</h2>
<p>A careful reader may have noticed that in <a href="#conditional-entropy">previous calculations</a>, we had fractional message lengths. Isn’t that weird? How should we interpret such a message? How is it done in real life?</p>
<p>The answer is: you can think of them as the expected length of a message. If half the time one sends a single bit, and half the time one sends two bits, on average one sends one and a half bits. There’s nothing strange about averages being fractional.</p>
<p>That’s a quick but vague answer. Let’s look at an example: consider a probability distribution where one event <span class="math inline">a</span> happens 71% of the time and another event <span class="math inline">b</span> occurs 29% of the time.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/halfbit-ab.png" alt="Fractional bit single" /><figcaption aria-hidden="true">Fractional bit single</figcaption>
<p>The optimal code would use 0.5 bits to represent <span class="math inline">a</span>, and 1.7 bits to represent <span class="math inline">b</span>. Well, if we want to send a single one of these codewords, it simply isn’t possible. We’re forced to round to a whole number of bits, and send on average 1 bit.</p>
<p>… But if we’re sending multiple messages at once, it turns out that we can do better. Let’s consider communicating two events from this distribution. If we send them independently, using the code we established for a single event, we’d need to send two bits. Can we do better?</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/halfbit-ab2.png" alt="Fractional bit double" /><figcaption aria-hidden="true">Fractional bit double</figcaption>
<p>Half the time, we need to communicate <span class="math inline">aa</span>, 21% of the time we need to send <span class="math inline">ab</span> or <span class="math inline">ba</span>, and 8% of the time we need to communicate <span class="math inline">bb</span>. Again, the ideal code involves fractional numbers of bits.</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/halfbit-ab-idealcode.png" alt="Fractional bit double code length" /><figcaption aria-hidden="true">Fractional bit double code length</figcaption>
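<p>A sketch of the pair distribution and its ideal fractional codeword lengths, using the probabilities stated above:</p>
<pre><code class="language-python">
from math import log2

# Sketch of the pair distribution and its ideal (fractional) codeword
# lengths. Probabilities follow the text: p(a) = 0.71, p(b) = 0.29.
p = {"a": 0.71, "b": 0.29}
pairs = {x + y: p[x] * p[y] for x in p for y in p}   # aa, ab, ba, bb

for pair, q in pairs.items():
    print(pair, round(q, 3), round(log2(1 / q), 2))
# aa 0.504 0.99   ab 0.206 2.28   ba 0.206 2.28   bb 0.084 3.57

entropy_per_pair = sum(q * log2(1 / q) for q in pairs.values())
print(entropy_per_pair, entropy_per_pair / 2)  # about 1.74 per pair, 0.87 per event
</code></pre>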
<p>If we round the codeword lengths, we’ll get something like this:</p>
<img src="https://colah.github.io/posts/2015-09-Visual-Information/img/halfbit-ab-code.png" alt="Fractional bit double code length rounded" /><figcaption aria-hidden="true">Fractional bit double code length rounded</figcaption>
<p>This code gives us an average message length of <span class="math inline">1.8</span> bits. That’s less than the <span class="math inline">2</span> bits when we send them independently. Another way of thinking of this is that we’re sending <span class="math inline">1.8/2 = 0.9</span> bits on average for each event. If we were to send more events at once, the average would become smaller still. As <span class="math inline">n</span>, the number of events we send together, tends to infinity, the overhead due to rounding our code would vanish, and the number of bits per event would approach the entropy.</p>
<p>Further, notice that the ideal codeword length for <span class="math inline">a</span> was 0.5 bits, and the ideal codeword length for <span class="math inline">aa</span> was 1 bit. Ideal codeword lengths add, even when they’re fractional! So, if we communicate a lot of events at once, the lengths will add.</p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Entropy is the optimal (shortest possible) average code length for communicating events from a distribution.</li>
<li>KL divergence is the expected number of extra bits we pay for using a code optimized for a different distribution – a kind of distance between distributions.</li>
<li>Entropy over multiple variables can be interpreted as sets of information.</li>
</ul>