Introduction

While reading this paper recently, I stumbled upon the notion of mutual information from information theory, so, as always, I took a trip around the internet and summarized my findings in a markdown file. Here they are:

We can see mutual information as a reduction in uncertainty for predicting parts of the outcome of a system of two random variables. Given a system with two random variables $X$ and $Y$, when we observe the outcome of $X$, there is a corresponding reduction in uncertainty for $Y$. Mutual information measures this reduction.

The formula

$$
I(X, Y) = \sum_{x} \sum_{y} p(x, y) \log \left( \frac{p(x, y)}{p(x)\, p(y)} \right)
$$

In plain English:

The mutual information $I(X, Y)$ between two random variables $X$ and $Y$ is equal to the sum, over all possible values $x$ of $X$ and $y$ of $Y$, of the joint probability $p(x, y)$ of the two events, times the $\log$ of the joint probability divided by the product of the marginal probabilities $p(x)$ and $p(y)$.

Note: this is the formula for discrete random variables (for continuous ones, the sums become integrals).
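
To make the formula concrete, here is a small Python sketch of the discrete case (this is just my own illustration; the function name and the layout of the `joint` array, rows for $X$ and columns for $Y$, are arbitrary choices):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a discrete joint distribution.

    `joint` is a 2-D array with joint[i, j] = P(X = x_i, Y = y_j).
    """
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)  # marginal P(X): sum over the values of Y
    p_y = joint.sum(axis=0)  # marginal P(Y): sum over the values of X
    mi = 0.0
    for i, px in enumerate(p_x):
        for j, py in enumerate(p_y):
            pxy = joint[i, j]
            if pxy > 0:  # convention: 0 * log(0) = 0
                mi += pxy * np.log2(pxy / (px * py))
    return mi
```

We will reuse this kind of computation on the examples below.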

Let’s quickly explain those two notions:

  • The marginal probability $P(x)$ gives the probability of an event occurring. It is not conditioned on another event. It is more formally denoted $P_{X}(x)$.
  • The joint probability $P(x, y)$ gives the probability of event $X = x$ and event $Y = y$ occurring together. It is the probability of the intersection of the two events.

The distribution versions arise when we consider these probabilities over all possible values of the random variables. The marginal distribution is then written $P(X)$ and the joint distribution $P(X, Y)$.

When the joint distribution of $X$ and $Y$ is known, the marginal distribution of $X$ is the probability distribution of $X$ obtained by averaging over information about $Y$. To calculate $P(X)$, we sum the joint probabilities over $Y$:

$$
P(X = x) = \sum_{y} P(X = x, Y = y)
$$

| $X \backslash Y$ | $y_1$ | $y_2$ | $y_3$ | $P(X)$ |
|------------------|-------|-------|-------|--------|
| $x_1$            | 2/16  | 1/16  | 2/16  | 5/16   |
| $x_2$            | 1/16  | 1/16  | 4/16  | 6/16   |
| $x_3$            | 1/16  | 3/16  | 1/16  | 5/16   |
| $P(Y)$           | 4/16  | 5/16  | 7/16  |        |
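
As a quick sanity check of the marginalization, here is a numpy sketch using the table above (the variable names are my own):

```python
import numpy as np

# Joint table P(X = x_i, Y = y_j): rows are x1..x3, columns are y1..y3.
joint = np.array([[2, 1, 2],
                  [1, 1, 4],
                  [1, 3, 1]]) / 16

p_x = joint.sum(axis=1)  # sum over Y -> 5/16, 6/16, 5/16
p_y = joint.sum(axis=0)  # sum over X -> 4/16, 5/16, 7/16

print(p_x)  # [0.3125 0.375  0.3125]
print(p_y)  # [0.25   0.3125 0.4375]
```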

What does it mean?

To grasp the meaning of the mutual information, we will consider the two extremes. First the case where our two random variables are completely independent, and then the case where they are completely dependent.

Independent random variables

$X$ and $Y$ have two values, $0$ or $1$, each with probability $\frac{1}{2}$.

| $X \backslash Y$ | $0$  | $1$  | $P(X)$ |
|------------------|------|------|--------|
| $0$              | 0.25 | 0.25 | 0.5    |
| $1$              | 0.25 | 0.25 | 0.5    |
| $P(Y)$           | 0.5  | 0.5  |        |

The inner values are the joint probabilities $P(x, y)$, and the values in the margins are the marginal probabilities $P(x)$ and $P(y)$.

Here we see that $P(x, y) = 0.25$ for all possible values of $x$ and $y$, and that the marginal probabilities are always $0.5$. This matches the definition of independence, where the joint probability is equal to the product of the marginal probabilities:

$$
P(x, y) = P(x)\, P(y)
$$

With this observation, calculating the mutual information is fairly easy: we can replace the numerator $P(x, y)$ in the mutual information formula above with $P(x)\, P(y)$:

$$
I(X, Y) = \sum_{x} \sum_{y} P(x, y) \log \left( \frac{P(x)\, P(y)}{P(x)\, P(y)} \right) = \sum_{x} \sum_{y} P(x, y) \log(1) = 0
$$

The values in the $\log$ cancel out and we’re left with $\log(1) = 0$: the mutual information of two independent variables is zero. This is exactly what we expect: if the two variables are independent, knowing the value of $X$ doesn’t reduce the uncertainty about $Y$ at all.
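
We can check this numerically with the joint table of the independent case (a small sketch, using the same direct-sum idea as the function above):

```python
import numpy as np

# Independent case: the joint table is exactly the product of the marginals.
joint = np.array([[0.25, 0.25],
                  [0.25, 0.25]])
p_x = joint.sum(axis=1)  # [0.5, 0.5]
p_y = joint.sum(axis=0)  # [0.5, 0.5]

# Every term is 0.25 * log2(0.25 / (0.5 * 0.5)) = 0.25 * log2(1) = 0.
mi = np.sum(joint * np.log2(joint / np.outer(p_x, p_y)))
print(mi)  # 0.0
```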

Let’s consider the opposite case where we have very strong dependence between the two variables.

Dependent random variables

Here, if we know the outcome of $X$, we also perfectly know the outcome of $Y$.

| $X \backslash Y$ | $0$ | $1$ | $P(X)$ |
|------------------|-----|-----|--------|
| $0$              | 0.5 | 0.0 | 0.5    |
| $1$              | 0.0 | 0.5 | 0.5    |
| $P(Y)$           | 0.5 | 0.5 |        |

We should expect the reduction in uncertainty in one of the variables to be equal to $1$ bit, because one value perfectly determines the other.

If we go ahead with the formula, we need to sum over all the possible values of $x$ and $y$. We only need to consider the diagonal values, because on the anti-diagonal the joint probability $P(x, y)$ is zero and so the corresponding term is zero (using the convention $0 \log 0 = 0$).

For the diagonal terms where $P(x, y) = 0.5$ we have:

$$
P(x, y) \log_2 \left( \frac{P(x, y)}{P(x)\, P(y)} \right) = 0.5 \log_2 \left( \frac{0.5}{0.5 \times 0.5} \right) = 0.5 \log_2(2) = 0.5
$$

Since we’re measuring information in bits, we use $\log_2$, and $\log_2(2) = 1$. Out of the four terms in the sum, two are zero and two are $0.5$, so the sum of all the terms is $1$ bit, as expected.
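
The same numerical check for the dependent case gives $1$ bit (again a sketch; the `np.where` implements the $0 \log 0 = 0$ convention):

```python
import numpy as np

# Perfectly dependent case: knowing X determines Y.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
p_x = joint.sum(axis=1)  # [0.5, 0.5]
p_y = joint.sum(axis=0)  # [0.5, 0.5]

ratio = joint / np.outer(p_x, p_y)
safe_ratio = np.where(joint > 0, ratio, 1.0)  # log2(1) = 0 where P(x, y) = 0
mi = np.sum(joint * np.log2(safe_ratio))
print(mi)  # 1.0, i.e. one bit
```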

Relation to the Kullback-Leibler divergence

As the two previous examples show, the mutual information is essentially a way of capturing the degree of dependence between two variables. If the two variables are strongly dependent, the mutual information is high, and it means we know a lot more by knowing the joint distribution than by knowing only the marginal distributions.

In both examples, the marginal distributions are exactly the same, even though the two systems are very different.

  • For the independent variables, the marginal distributions give us the joint distribution, since the joint is the product of the marginals in the independent case. The joint distribution is therefore a bit redundant and adds no extra information.
  • For the dependent variables, the marginal distributions are not very useful on their own; we know a lot more about the dynamics of the system by looking at the joint distribution.

So we kind of have a parallel here:

  • independent variables - no mutual information - we get our information from the marginal distributions
  • dependent variables - mutual information is $1$ bit - we get our information from the joint distribution

This gives the intuition that mutual information has something to do with the divergence between the joint distribution and the product of the marginal distributions: the informational cost of representing our system as the product of the marginals as opposed to the joint distribution. More precisely, we can express the mutual information using the Kullback-Leibler divergence from the joint distribution $P(x, y)$ to the product of the marginal distributions $P(x) \cdot P(y)$:

$$
I(X, Y) = D_{KL}\big(P(X, Y) \,\|\, P(X)\, P(Y)\big)
$$

where $D_{KL}$ is the Kullback-Leibler divergence, defined for two distributions $p$ and $q$ by:

$$
D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \left( \frac{p(x)}{q(x)} \right)
$$
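
To convince myself, here is a small numpy check of this identity on the dependent example from above (a sketch; `kl_bits` is just a helper name I chose):

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits, for discrete distributions given as arrays."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    terms = np.where(p > 0, p * np.log2(np.where(p > 0, p / q, 1.0)), 0.0)
    return terms.sum()

joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

# I(X, Y) = D_KL( P(X, Y) || P(X) P(Y) )
print(kl_bits(joint, np.outer(p_x, p_y)))  # 1.0
```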

The formula above expresses the mutual information using the divergence from the joint distribution to the product of the marginal distributions. It can also be useful to express the mutual information in terms of a divergence computed on one random variable only (instead of the two variables). To do this, we use conditional distributions:

$$
P(x, y) = P(x \mid y)\, P(y)
$$

Substituting this into the mutual information formula, we get:

$$
I(X, Y) = \sum_{x} \sum_{y} P(x \mid y)\, P(y) \log \left( \frac{P(x \mid y)\, P(y)}{P(x)\, P(y)} \right)
$$

We simplify (the $P(y)$ inside the $\log$ cancels out) and split the expression into its respective sums (expressions that depend on $y$ in the $y$ sum, and the same for $x$). We recognize a KL divergence (see the definition above), namely the divergence of the conditional distribution $P(x \mid y)$ from the univariate distribution $P(x)$:

$$
I(X, Y) = \sum_{y} P(y) \sum_{x} P(x \mid y) \log \left( \frac{P(x \mid y)}{P(x)} \right) = \sum_{y} P(y)\, D_{KL}\big(P(X \mid y) \,\|\, P(X)\big) = \mathbb{E}_{Y}\Big[ D_{KL}\big(P(X \mid y) \,\|\, P(X)\big) \Big]
$$

With this view, the mutual information can be seen as the expectation of the divergence between the conditional distribution $P(x \mid y)$ and the marginal $P(x)$. The more different these two distributions are on average, the greater the information gain. This makes sense: if $P(x \mid y)$ and $P(x)$ are the same, the information about $y$ is useless and the mutual information is indeed zero. At the other extreme, if $P(x \mid y)$ and $P(x)$ are very different, information about $y$ tells us a lot about $x$, and the mutual information is high.

Alternatively (as in the paper I was reading), we can write the same equation summing over the other variable, using $P(x, y) = P(y \mid x) \; P(x)$:

$$
I(X, Y) = \sum_{x} P(x)\, D_{KL}\big(P(Y \mid x) \,\|\, P(Y)\big) = \mathbb{E}_{X}\Big[ D_{KL}\big(P(Y \mid x) \,\|\, P(Y)\big) \Big]
$$
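
And the conditional form gives the same number on the dependent example (a last sketch, with my own variable names):

```python
import numpy as np

def kl_bits(p, q):
    terms = np.where(p > 0, p * np.log2(np.where(p > 0, p / q, 1.0)), 0.0)
    return terms.sum()

joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

# I(X, Y) = E_Y[ D_KL( P(X | y) || P(X) ) ]
mi = sum(p_y[j] * kl_bits(joint[:, j] / p_y[j], p_x) for j in range(len(p_y)))
print(mi)  # 1.0
```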

Wrapping up

The mutual information between random variables $X$ and $Y$ expresses the reduction in uncertainty about one variable when we have information about the other:

$$
I(X, Y) = \sum_{x} \sum_{y} P(x, y) \log \left( \frac{P(x, y)}{P(x)\, P(y)} \right)
$$

This formula involves the joint distribution and the marginal distributions, and we’ve seen that in our examples the mutual information can be read as an axis from $0$ to $1$ bit:

  • $I(X, Y) = 0$: no mutual information, the random variables are independent and all the information is contained in the marginal distributions
  • $I(X, Y) = 1$ bit (the maximum for our binary examples): the random variables are totally dependent and the information is contained in the joint distribution

This leads to the Kullback-Leibler divergence formulation of the mutual information:

$$
I(X, Y) = D_{KL}\big(P(X, Y) \,\|\, P(X)\, P(Y)\big)
$$

Using conditional probabilities, we can express the mutual information with a divergence involving only one of the random variables, which may be a handy formulation:

$$
I(X, Y) = \mathbb{E}_{Y}\Big[ D_{KL}\big(P(X \mid y) \,\|\, P(X)\big) \Big]
$$