Decompositions into “Unexplained” and “Explained” Uncertainty

Consider two random variables $Y$ and $Z$. We assume that we have access to side information $Z$ to help us “explain” $Y$. For example, $Z$ could be the model parameters (and it makes sense to keep that example in mind for the rest of this).

Information Theory. The entropy $\operatorname{H}[Y]$ captures the uncertainty about $Y$. The mutual information $\operatorname{I}[Y ; Z]$ captures how much $Z$ tells us about $Y$ (and vice versa). Similarly, the conditional entropy $\operatorname{H}[Y \mid Z]$ captures the uncertainty that is left in $Y$ after we have observed $Z$. Overall, we have the well-known decomposition, which follows directly from the definition of mutual information:

$$ \underbrace{\operatorname{H}[Y]}_{\text{(total) uncertainty}} = \underbrace{\operatorname{H}[Y \mid Z]}_{\text{unexplained uncertainty}} + \underbrace{\operatorname{I}[Y ; Z]}_{\text{explained uncertainty}}. $$
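As a quick sanity check, here is a minimal numerical sketch for a pair of discrete random variables (assuming NumPy; the 2×2 joint distribution is made up purely for illustration):

```python
import numpy as np

# Made-up joint distribution p(y, z) over two binary variables (rows: y, columns: z).
p_yz = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_y = p_yz.sum(axis=1)  # marginal p(y)
p_z = p_yz.sum(axis=0)  # marginal p(z)

def entropy(p):
    """Shannon entropy in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_Y = entropy(p_y)                                        # H[Y]: total uncertainty
H_Y_given_Z = entropy(p_yz.flatten()) - entropy(p_z)      # H[Y | Z] = H[Y, Z] - H[Z]
I_YZ = np.sum(p_yz * np.log(p_yz / np.outer(p_y, p_z)))   # I[Y; Z] via its KL form

# total uncertainty = unexplained + explained
assert np.isclose(H_Y, H_Y_given_Z + I_YZ)
```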

Law of Total Variance. Similarly, the law of total variance says for random variables $Y$ and $Z$ in the same probability space:

$$ \underbrace{\operatorname{Var}(Y)}_{\text{(total) uncertainty}} = \underbrace{\mathbb{E}[\operatorname{Var}(Y \mid Z)]}_{\text{unexplained uncertainty}} + \underbrace{\operatorname{Var}(\mathbb{E}[Y \mid Z])}_{\text{explained uncertainty}}. $$

Here, the overall variance captures the (total) uncertainty, the expected variance after conditioning captures the unexplained (residual) variance, and the variance of the conditional mean captures the explained variance.
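A minimal Monte Carlo sketch of this decomposition (assuming NumPy; the heteroscedastic toy model is made up purely for illustration), where we condition on a binned version of $Z$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up heteroscedastic toy model: Z is the side information and
# Y = Z + noise whose scale depends on Z.
n = 200_000
z = rng.normal(size=n)
y = z + rng.normal(scale=0.5 * np.abs(z) + 0.1, size=n)

# Approximate conditioning on Z by binning it into 100 quantile bins.
edges = np.quantile(z, np.linspace(0, 1, 101))
idx = np.clip(np.digitize(z, edges) - 1, 0, 99)

weights = np.array([(idx == k).mean() for k in range(100)])     # p(bin k)
cond_mean = np.array([y[idx == k].mean() for k in range(100)])  # E[Y | bin k]
cond_var = np.array([y[idx == k].var() for k in range(100)])    # Var(Y | bin k)

total = y.var()                                                 # Var(Y)
unexplained = np.sum(weights * cond_var)                        # E[Var(Y | Z)]
explained = np.sum(weights * (cond_mean - y.mean()) ** 2)       # Var(E[Y | Z])

print(total, unexplained + explained)  # the two numbers match
```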

More concretely, $\mathbb{E}[Y \mid Z]$ is the best prediction of $Y$ given $Z$ in the mean-squared-error sense, while $\mathbb{E}[\operatorname{Var}(Y \mid Z)]$ is exactly the unexplained error that this best prediction still incurs (again in terms of MSE; we see this when we expand the definition of $\operatorname{Var}$).
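This is easy to check numerically as well. Below is a small sketch (again NumPy, with a made-up example where $\mathbb{E}[Y \mid Z] = \sin(Z)$ and $\operatorname{Var}(Y \mid Z) = 0.25$ by construction): predicting with the conditional mean leaves exactly the unexplained error, while any other predictor of $Y$ from $Z$ has a larger mean squared error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up example: E[Y | Z] = sin(Z) and Var(Y | Z) = 0.25 by construction.
n = 500_000
z = rng.uniform(-3, 3, size=n)
y = np.sin(z) + rng.normal(scale=0.5, size=n)

mse_cond_mean = np.mean((y - np.sin(z)) ** 2)  # predict with E[Y | Z]
mse_other = np.mean((y - 0.3 * z) ** 2)        # predict with some other function of Z

print(mse_cond_mean)  # ~0.25 = E[Var(Y | Z)], the unexplained uncertainty
print(mse_other)      # strictly larger than mse_cond_mean
```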


See also https://en.wikipedia.org/wiki/Law_of_total_variance and https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_joint_entropy.


Do the decompositions agree?

The big question that arises from having two different decompositions is whether they agree:

Do $\operatorname{Var}(\mathbb{E}[Y \mid Z])$ and $\operatorname{I}[Y ; Z]$ capture the same quantities, and if not, how do they differ?

Now, variance and entropy are generally not linearly related. Instead, for any univariate random variable $X$ (with a density), we have:

$$ \operatorname{H}[X]\le\frac{1}{2}\log (2\pi e \, \operatorname{Var}(X)), $$

where we use the fact that the Gaussian distribution is the maximum entropy distribution for a given variance, with equality when $X$ follows a univariate normal distribution.
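For a quick sanity check of the bound, take $X \sim \operatorname{Uniform}(0, 1)$: its differential entropy is $\log 1 = 0$, its variance is $\tfrac{1}{12}$, and indeed

$$ \operatorname{H}[X] = 0 \le \frac{1}{2}\log\left(\frac{2\pi e}{12}\right) \approx 0.18 \text{ nats}. $$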

Univariate normal $Y, Z$. Let’s assume that $Y$ and $Y \mid Z$ follow univariate normal distributions (which is arguably the simplest modeling assumption). Then, we have:

$$ \begin{aligned} \operatorname{I}[Y;Z] &= \operatorname{H}[Y] - \operatorname{H}[Y \mid Z]\\ &=\frac{1}{2}\log (2\pi e \, \operatorname{Var}(Y))-\mathbb{E}_Z \left[\frac{1}{2}\log (2\pi e \, \operatorname{Var}(Y \mid Z))\right]\\ &=\frac{1}{2}\log \operatorname{Var}(Y) -\mathbb{E}_Z \left[\frac{1}{2}\log \operatorname{Var}(Y \mid Z)\right]. \end{aligned} $$
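As a sanity check, for jointly Gaussian $(Y, Z)$ with correlation $\rho$, we have $\operatorname{Var}(Y \mid Z) = (1 - \rho^2)\operatorname{Var}(Y)$ independently of the observed value of $Z$, so the expression above recovers the familiar closed form:

$$ \operatorname{I}[Y;Z] = \frac{1}{2}\log \operatorname{Var}(Y) - \frac{1}{2}\log\left((1 - \rho^2)\operatorname{Var}(Y)\right) = -\frac{1}{2}\log(1 - \rho^2). $$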

Going back to the general expression, we can now use Jensen’s inequality in two ways to obtain both an upper and a lower bound ($\log$ is concave and $-\log$ is convex):