In machine learning, a divergence measures how different two probability distributions are. Unlike a distance, it is not required to be symmetric. Kullback-Leibler (KL) divergence is the most widely used divergence, and we will explore the properties that make it so useful. For two probability distributions $A$ and $B$, it is defined as follows:

$$ D_{KL}(A||B) = \sum_xA(x)\log\frac{A(x)}{B(x)} $$

We consider only discrete distributions here; for continuous distributions, the summation is simply replaced with an integral.
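To make the definition concrete, here is a minimal sketch in NumPy that evaluates $D_{KL}(A||B)$ for two small, made-up discrete distributions. The values are purely illustrative, and terms with $A(x)=0$ are taken to contribute zero by convention.

```python
import numpy as np

def kl_divergence(a, b):
    """D_KL(A || B) = sum_x A(x) * log(A(x) / B(x)), in nats."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mask = a > 0  # terms with A(x) = 0 contribute 0 by convention
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

A = np.array([0.5, 0.3, 0.2])  # illustrative distributions
B = np.array([0.4, 0.4, 0.2])
print(kl_divergence(A, B))  # ~0.025 nats
```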

In ML, we typically use $P(x)$ for the true, real-world distribution, which is fixed, and $Q(x)$ for the model distribution that we fit to it. Because KL is asymmetric, the order of the two arguments matters.

Forward KL

$$ D_{KL}(P||Q) = \sum_xP(x)\log\frac{P(x)}{Q(x)} $$

(Figures illustrating forward KL; source: Eric Jang's blog)
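Since forward KL is an expectation under $P$, one way to see the formula at work is to estimate it by sampling from $P$ and averaging the log-ratio. The sketch below does this for made-up $P$ and $Q$ and compares against the exact sum.

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([0.7, 0.2, 0.1])  # "true" distribution (illustrative)
Q = np.array([0.4, 0.4, 0.2])  # model distribution (illustrative)

# Exact forward KL: sum_x P(x) * log(P(x) / Q(x))
exact = np.sum(P * np.log(P / Q))

# Monte Carlo estimate: average log(P(x) / Q(x)) over samples x ~ P
samples = rng.choice(len(P), size=100_000, p=P)
estimate = np.mean(np.log(P[samples] / Q[samples]))

print(exact, estimate)  # the two values should be close
```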

Reverse KL

$$ D_{KL}(Q||P) = \sum_xQ(x)\log\frac{Q(x)}{P(x)} $$

(Figures illustrating reverse KL; source: Eric Jang's blog)
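Because the order of the arguments matters, evaluating both directions on the same pair of (made-up) distributions gives two different numbers. A quick sketch:

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

P = np.array([0.8, 0.15, 0.05])  # illustrative distributions
Q = np.array([0.4, 0.3, 0.3])

print(kl(P, Q))  # forward KL, D_KL(P || Q)
print(kl(Q, P))  # reverse KL, D_KL(Q || P) -- generally a different value
```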

Information Theory Interpretation

KL has an information theory interpretation as relative entropy. When comparing distributions $A$ and $B$, $D_{KL}(A||B)$ represents the expected number of extra bits needed to encode samples from distribution $A$ using a code that was optimized for distribution $B$.

The value $\log\frac{A(x)}{B(x)}$ measures surprise in nats (or in bits, if the logarithm is base 2). It is the extra surprise of observing an event $x$ under $B$ relative to observing it under $A$. Weighting by $A(x)$ gives the expected value of this surprise over $A$.
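As a quick check on the units, the same divergence computed with the natural log (nats) and with log base 2 (bits) differs only by a factor of $\ln 2$. The distributions below are arbitrary examples.

```python
import numpy as np

A = np.array([0.5, 0.5])  # arbitrary example distributions
B = np.array([0.9, 0.1])

kl_nats = np.sum(A * np.log(A / B))   # natural log -> nats
kl_bits = np.sum(A * np.log2(A / B))  # log base 2  -> bits

print(kl_nats, kl_bits, kl_nats / np.log(2))  # last two values agree
```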

When we use forward KL, $D_{KL}(P||Q)$, we're measuring the expected number of extra bits needed to encode samples from the true distribution $P$ using a code optimized for our model distribution $Q$. Here the surprise is weighted by how likely each event is under the true distribution, so we care most about events that are likely under $P$.

In the reverse case, $D_{KL}(Q||P)$, we're measuring the extra bits needed to encode samples from our model $Q$ using a code optimized for the true distribution $P$. Because the surprise is now weighted by $Q(x)$, we care most about events that our model thinks are likely.
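A small sketch of this "extra bits" reading: the cross-entropy is the expected code length when encoding samples from one distribution with a code built for the other, the entropy is the best achievable code length, and their difference is exactly the KL divergence. This holds in both directions; the distributions below are made up for illustration.

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])  # illustrative "true" distribution
Q = np.array([0.25, 0.25, 0.5])  # illustrative model distribution

# Forward direction: encode samples from P with a code optimized for Q
entropy_P = -np.sum(P * np.log2(P))         # optimal bits per sample for P
cross_entropy_PQ = -np.sum(P * np.log2(Q))  # bits per sample using Q's code
print(cross_entropy_PQ - entropy_P, np.sum(P * np.log2(P / Q)))  # both equal D_KL(P||Q)

# Reverse direction: encode samples from Q with a code optimized for P
entropy_Q = -np.sum(Q * np.log2(Q))
cross_entropy_QP = -np.sum(Q * np.log2(P))
print(cross_entropy_QP - entropy_Q, np.sum(Q * np.log2(Q / P)))  # both equal D_KL(Q||P)
```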

This information-theoretic perspective illuminates why KL divergence is such a natural choice for training machine learning models. When we train a model, we're essentially trying to minimize the extra information (measured in bits or nats) needed to encode samples from the true data distribution using our model's distribution. The closer our model distribution gets to the true distribution, the fewer extra bits we need for encoding, and thus the lower the KL divergence.
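To tie this back to training, here is a hedged sketch that fits a model distribution $Q$ (a softmax over logits) to a fixed $P$ by gradient descent on the cross-entropy; since the entropy of $P$ is constant, this is the same as minimizing $D_{KL}(P||Q)$. The target distribution, learning rate, and step count are arbitrary choices for illustration.

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])  # fixed "true" distribution (illustrative)
logits = np.zeros(3)           # model starts at the uniform distribution

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(501):
    Q = softmax(logits)
    # gradient of the cross-entropy -sum_x P(x) log Q(x) w.r.t. the logits
    logits -= 0.1 * (Q - P)
    if step % 100 == 0:
        print(step, np.sum(P * np.log(P / Q)))  # forward KL shrinks toward 0
```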