<aside> ✍️ Andreas, March 5, 2021, with discussion feedback from Jannik
</aside>
Generally, the epistemic uncertainty of an ensemble is not a useful measure for OOD detection; in particular, it is model-dependent.
This also calls into question computing epistemic uncertainty from a single layer (the softmax outputs) while treating the underlying models as black boxes.
We view the members of an ensemble as being drawn from a distribution $\omega\sim\hat{p}(\omega)$. Then, we can use the well-known BALD equation:
$$ \underbrace{\operatorname{H}[Y|x]}_{\text{predictive entropy}} = \underbrace{\operatorname{I}[Y;\Omega|x]}_{\text{epistemic uncertainty}} + \underbrace{\mathbb{E}_{\hat{p}(\omega)}\operatorname{H}[Y|x,\omega]}_{\text{expected softmax entropy}}, $$
where $\operatorname{H}[Y|x,\omega]$ is the softmax entropy of a single specific model $\omega$.
Epistemic uncertainty, the mutual information $\operatorname{I}[Y;\Omega|x]$, measures the disagreement between the predictions of the different models within the ensemble.
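As a minimal sketch of this decomposition (assuming the ensemble's softmax outputs for a single input are stacked in a NumPy array of shape `[n_members, n_classes]`; the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def bald_decomposition(member_probs, eps=1e-12):
    """Split predictive entropy into epistemic and expected-entropy parts.

    member_probs: array of shape [n_members, n_classes], each row the
    softmax output p(y | x, omega) of one ensemble member.
    """
    # Ensemble prediction: p(y | x) = E_{p(omega)} p(y | x, omega).
    mean_probs = member_probs.mean(axis=0)

    # Predictive entropy H[Y | x].
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps))

    # Expected softmax entropy E_{p(omega)} H[Y | x, omega].
    expected_entropy = -np.sum(
        member_probs * np.log(member_probs + eps), axis=1
    ).mean()

    # Epistemic uncertainty I[Y; Omega | x] = H[Y | x] - E H[Y | x, omega].
    epistemic = predictive_entropy - expected_entropy
    return predictive_entropy, epistemic, expected_entropy
```

Note that if all members agree, the epistemic term is ≈ 0 even when each individual softmax has high entropy; only disagreement between members contributes to the mutual information.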
Empirically, the epistemic uncertainty of a Deep Ensemble is regarded as superior to that of MC Dropout.
Predictive entropy and epistemic uncertainty are seen as good metrics to detect OoD samples: a high value is indicative of an OoD sample.
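As a usage sketch (reusing `bald_decomposition` from above; the ensemble outputs and threshold are made up for illustration), OoD detection then amounts to thresholding one of these scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake softmax outputs: 5 ensemble members, 100 inputs, 10 classes.
logits = rng.normal(size=(5, 100, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Epistemic-uncertainty score per input (index 1 of the decomposition).
scores = np.array([bald_decomposition(probs[:, i])[1] for i in range(100)])

# Flag inputs whose score exceeds a threshold; the value 0.5 is arbitrary
# and would in practice be tuned on held-out in-distribution data.
is_ood = scores > 0.5
```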
<aside> 👉 DDU makes the claim that we can equate epistemic uncertainty with the sample density in feature space, which is also very useful for OoD detection.
</aside>
The SNGP paper shows that:
A classification model ought to output uniform predictions for OoD data.
$$ p(y \mid \mathbf{x}) = p\left(y \mid \mathbf{x}, \mathbf{x} \in \mathscr{X}_{\mathrm{IND}}\right) \cdot p^{*}\left(\mathbf{x} \in \mathscr{X}_{\mathrm{IND}}\right) + p_{\text{uniform}}\left(y \mid \mathbf{x}, \mathbf{x} \notin \mathscr{X}_{\mathrm{IND}}\right) \cdot p^{*}\left(\mathbf{x} \notin \mathscr{X}_{\mathrm{IND}}\right) $$
Entropy is maximal for a uniform distribution. Hence, this view aligns with high predictive entropy being indicative of OoD data.
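A quick numeric check (purely illustrative) that the uniform distribution attains the entropy maximum $\log K$, so pushing OoD predictions toward uniform drives predictive entropy to its maximum:

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

K = 10
uniform = np.full(K, 1.0 / K)   # the SNGP optimum for OoD inputs
one_hot = np.eye(K)[0]          # a maximally confident prediction

print(entropy(uniform), np.log(K))  # both ~2.303: uniform attains log K
print(entropy(one_hot))             # ~0: confident predictions minimize it
```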
Is epistemic uncertainty as defined above always a good metric for OoD detection, though?
"Quantifying OoD" refers to having a softmax entropy closer uniform for OoD data, as specified as optimum by SNGP above.
<aside> 👉 This can happen, for example, when we train on OoD data, as many current approaches to OoD detection do.
</aside>