## what is a large Kullback-Leibler divergence?

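**A** question that came up on X validated is about scaling a Kullback-Leibler divergence. A fairly interesting question, in my opinion, since this pseudo-distance is neither naturally nor universally scaled. Take for instance the divergence between two Gaussians,

$$\mathrm{KL}\big(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\big)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}$$
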
which is scaled by the standard deviation of the second Normal. There is no absolute bound on this distance beyond which it can be seen as large. Bypassing the coding analogy from signal processing, which has never been clear to me, the only calibration I can think of is statistical, namely to figure out which values are extreme for two samples from the same distribution, in the sense of the Kullback-Leibler divergence between the corresponding estimated distributions. The above is an illustration, providing the distribution of the Kullback-Leibler divergences between samples from a Gamma distribution, for sample sizes n=15 and n=150. The sample size obviously matters.
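A rough sketch of this calibration, in Python with the standard library only, could go as follows. Everything specific in it is an assumption made for illustration and not necessarily what produced the illustration: the Gamma(2,1) parameters, the 1000 replications, the method-of-moments fit (picked for simplicity over maximum likelihood), and the helper names `digamma`, `kl_gamma`, `fit_gamma_moments` and `simulate_kl`. The divergence between the two fitted Gammas uses the closed form in the shape and rate parametrisation.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def kl_gamma(a1, b1, a2, b2):
    """KL( Gamma(shape a1, rate b1) || Gamma(shape a2, rate b2) ), closed form."""
    return ((a1 - a2) * digamma(a1)
            - math.lgamma(a1) + math.lgamma(a2)
            + a2 * (math.log(b1) - math.log(b2))
            + a1 * (b2 - b1) / b1)


def fit_gamma_moments(sample):
    """Method-of-moments estimates (shape, rate); an assumed, simple estimator."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean * mean / var, mean / var


def simulate_kl(n, shape=2.0, rate=1.0, reps=1000, seed=0):
    """Sorted KL divergences between fits from two same-distribution samples."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        # random.gammavariate takes (shape, scale), hence 1 / rate
        x = [rng.gammavariate(shape, 1.0 / rate) for _ in range(n)]
        y = [rng.gammavariate(shape, 1.0 / rate) for _ in range(n)]
        values.append(kl_gamma(*fit_gamma_moments(x), *fit_gamma_moments(y)))
    return sorted(values)


if __name__ == "__main__":
    for n in (15, 150):
        kls = simulate_kl(n)
        median = kls[len(kls) // 2]
        upper = kls[int(0.95 * len(kls))]
        print(f"n={n}: median KL = {median:.4f}, 95% quantile = {upper:.4f}")
```

The 95% quantile then acts as a sample-size-specific yardstick: a divergence between two fitted distributions above it would be unusual if both samples came from the same Gamma.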

May 2, 2018 at 12:45 am

There are many confusing things about this post —

1. Yes, the KL divergence is not naturally scaled (unlike the TV or Hellinger distances). This is also a problem with the chi-squared distance.

2. Your illustration is confusing — in this sense the sample-size would matter even if you attempted to estimate the TV or Hellinger (which are scaled in a desirable way). This is just a statement about how difficult it is to estimate the distance (and not really a statement about the distance itself).

3. In terms of calibrating “how large is a large KL” two other results might be helpful:

(a) Theorem 2.2 in Tsybakov’s book — which roughly says that if the KL divergence is smaller than some universal constant the two distributions are indistinguishable in a statistical sense.

(b) Sanov’s theorem (and variants) relate the Type I and II errors of the LRT to the KL divergence between two distributions, i.e. if the KL is large the two distributions are “distinguishable” in a statistical sense.
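To give a feel for this kind of calibration, here is a small Python sketch on two equal-variance Gaussians, where everything is available in closed form. It relies on two standard inequalities, Pinsker and Bretagnolle-Huber, which are in the same spirit as (a) but are not claimed to be the exact statement of the theorem cited there. The smallest achievable sum of Type I and Type II errors is 1 - TV, and both inequalities bound it from below by a function of the KL divergence alone. The function names and the grid of mean shifts are hypothetical choices.

```python
import math


def kl_gauss(mu0, mu1, sigma):
    """KL divergence between N(mu0, sigma^2) and N(mu1, sigma^2)."""
    return (mu0 - mu1) ** 2 / (2.0 * sigma ** 2)


def tv_gauss(mu0, mu1, sigma):
    """Total variation between N(mu0, sigma^2) and N(mu1, sigma^2)."""
    return math.erf(abs(mu0 - mu1) / (2.0 * math.sqrt(2.0) * sigma))


def error_sum_lower_bound(kl):
    """Lower bound on Type I + Type II error using only the KL divergence."""
    pinsker = 1.0 - math.sqrt(kl / 2.0)     # from TV <= sqrt(KL / 2)
    bretagnolle_huber = 0.5 * math.exp(-kl)  # from 1 - TV >= exp(-KL) / 2
    return max(pinsker, bretagnolle_huber)


if __name__ == "__main__":
    for delta in (0.1, 0.5, 1.0, 2.0, 4.0):
        kl = kl_gauss(0.0, delta, 1.0)
        best = 1.0 - tv_gauss(0.0, delta, 1.0)  # error sum of the optimal test
        bound = error_sum_lower_bound(kl)
        print(f"shift={delta}: KL={kl:.3f}, best error sum={best:.4f}, bound={bound:.4f}")
```

For small shifts the bound stays close to 1, meaning no test does much better than guessing, while for large shifts it decays like exp(-KL) and no longer prevents the two distributions from being told apart.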

May 2, 2018 at 5:35 pm

I do not have access to Sacha’s book so cannot put an exact meaning to (a) but am confused myself at the impossibility of discriminating between two distributions.

May 3, 2018 at 6:34 am

Suppose you observe X drawn from P.

Consider the sum of the Type I and Type II errors of any test for H_0: P = P_0 versus H_1: P = P_1.

This sum of Type I and Type II error is at least some function of the KL divergence between P_0 and P_1. More precisely, if the KL divergence between these two is at most θ, then the sum of the Type I and Type II errors is at least,