# heavytailed

## August 19, 2012

### A PAC Bayesian lemma

Filed under: Uncategorized — heavytailed @ 7:49 pm

I’ve made the mistake of working on too many posts/ideas at one time. Each has ballooned into more of a time-sink than I had anticipated. In particular, a conceptually simple question about support vector machines has taken me down a rabbit-hole of interesting but somewhat esoteric concepts. Near the turn of the century, the drive for better error bounds (and a deeper understanding of how to characterize or describe concept classes) led to the development of PAC-Bayesian analysis, and its purer counterpart, luckiness functions.

While neither concept appears mainstream today (neither is referenced on Wikipedia that I can see), there are some interesting bits to both. Most fruitful for review, I found, is the framework of luckiness functions, which relies implicitly on the fat-shattering and level fat-shattering dimensions. With the advent of manifold methods (a.k.a. manifold machine learning), luckiness/unluckiness functions are likely to see a resurgence as a way of ordering or partitioning the hypothesis space. In particular, I would expect to see the luckiness framework used to introduce manifold curvature into the error bounds of well-known algorithms (e.g. manifold random forests).

In the spirit of “lemmas are more useful than theorems,” I provide the proof of a seminal PAC-Bayesian lemma, and state a related theorem on loss functions. I’ll move on to Luckiness functions in an upcoming post.

#### The Quantifier Reversal Lemma

This lemma enables one to adapt a PAC statement about a formula $\Phi(x,y,\delta): \mathcal{X}\times\mathcal{Y}\times[0,1] \rightarrow \{0,1\}$ (e.g. $\forall x, \forall \delta > 0, Pr(\Phi) > 1 - \delta$) into a more general relation between the input spaces. Specifically, for random variables $x$ and $y$ and $\delta \in [0,1]$, let $\Phi(x,y,\delta)$ be a formula such that, for each fixed $x$ and $y$, $\{\delta : \Phi(x,y,\delta) = 1\} = (0,\delta^*]$ for some $\delta^* = \delta^*(x,y)$. If

$\forall x, \forall \delta > 0, Pr_y(\Phi(x,y,\delta) = 1) > 1 - \delta$

then for any $\delta > 0$ and $0 < \beta < 1$ we have

$\forall \alpha > 0, Pr_x\left(Pr_y \Phi(x,y,(\alpha\beta\delta)^{\frac{1}{1-\beta}}) > 1 - \delta\right) > 1 - \alpha$

Thus the quantifiers are reversed. What this means intuitively depends on $\Phi$, and in that sense the lemma is general. The specific use here is to help quantify, among those competing hypotheses which agree on a test set, what proportion fall into a high-error category. For instance, if the error $\epsilon(h) > \frac{\ln\frac{1}{\delta}}{m}$ then $Pr_{|X|=m}(|h(x_i)-y_i| = 0\ \forall i) \leq \delta$ (since the probability of a consistent sample is $(1-\epsilon(h))^m \leq e^{-\epsilon(h) m} < \delta$). This statement can be written as

$\forall h, \forall \delta > 0, Pr_{X}[ h(x_i)=y_i\ \forall i \Rightarrow \epsilon(h) \leq \frac{\ln\frac{1}{\delta}}{m}] > 1 - \delta$

to which the above lemma can be applied (letting $x$ of the lemma refer to the hypothesis $h$ here, and $y$ of the lemma refer to the sample $X$, matching the quantifier pattern $\forall x, \forall \delta, Pr_y$). The proof of the lemma is straightforward: let $f(x,y) = 0$ if $\{ \delta: \Phi(x,y,\delta) = 1 \}$ is empty, and $\delta^*$ otherwise. Because $Pr_y( f(x,y) \geq \delta) > 1 - \delta$ (so $f$ stochastically dominates a uniform variable on $[0,1]$) and $z \mapsto z^{\beta-1}$ is monotone decreasing for $\beta \in (0,1)$, we can write

$\mathbb{E}_y[f(x,y)^{\beta-1}] \leq \int_0^1 z^{\beta -1}\,dz = \frac{1}{\beta}$

This holds for every $x$; combined with the Markov inequality (applied twice), it yields:

$\mathbb{E}_x\mathbb{E}_y[f(x,y)^{\beta-1}] \leq \frac{1}{\beta}$

$Pr_x\{\mathbb{E}_y[(f(x,y))^{\beta-1}] \leq \frac{1}{\alpha\beta}\} > 1 - \alpha$

$Pr_x\{ Pr_y\{ f(x,y)^{\beta -1} \leq \frac{1}{\alpha\beta\delta} \} > 1 - \delta \} > 1 - \alpha$

$Pr_x\{ Pr_y\{ f(x,y) \geq (\alpha\beta\delta)^{\frac{1}{1-\beta}} \} > 1 - \delta \} > 1 - \alpha$

The result follows by plugging back in: $f(x,y) \geq (\alpha\beta\delta)^{\frac{1}{1-\beta}}$ implies $\Phi(x,y,(\alpha\beta\delta)^{\frac{1}{1-\beta}}) = 1$, since $\Phi(x,y,\cdot)$ holds on all of $(0,\delta^*]$.
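To make the conclusion concrete, here is a quick Monte Carlo sanity check. The toy formula, the uniform choice of $y$, and the function name are all my own construction, not part of the lemma: take $y \sim \mathrm{Uniform}(0,1)$ and $\Phi(x,y,\delta) = 1$ iff $\delta \leq y$, so that $\{\delta : \Phi = 1\} = (0, y]$ and the premise $Pr_y(\Phi(x,y,\delta)) = 1-\delta$ holds for every $x$.

```python
import random

def reversed_bound_holds(alpha, beta, delta, n=200_000, seed=0):
    """Monte Carlo check of the quantifier-reversal conclusion for a toy formula.

    Toy setup (my choice, not from the lemma): y ~ Uniform(0,1) and
    Phi(x, y, d) = 1 iff d <= y, so {d : Phi = 1} = (0, y] and the premise
    Pr_y(Phi(x, y, d)) = 1 - d holds for every x.  The lemma then promises
    Pr_y(Phi(x, y, t)) > 1 - delta at t = (alpha*beta*delta)^(1/(1-beta)).
    """
    rng = random.Random(seed)
    t = (alpha * beta * delta) ** (1.0 / (1.0 - beta))
    hits = sum(1 for _ in range(n) if rng.random() >= t)  # empirical Pr_y(y >= t)
    return hits / n >= 1.0 - delta

# The conclusion should hold for any alpha, beta, delta in (0,1):
checks = [reversed_bound_holds(a, b, d)
          for a in (0.05, 0.5) for b in (0.1, 0.9) for d in (0.05, 0.5)]
print(all(checks))  # prints True
```

Since this toy $\Phi$ ignores $x$, the outer probability over $x$ is trivially one; the check exercises the inner bound $Pr_y(\Phi) > 1-\delta$ at the rescaled argument $(\alpha\beta\delta)^{\frac{1}{1-\beta}}$.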

#### Average loss among indistinguishable hypotheses

The Quantifier Reversal lemma can be used to prove a relation on the average loss over a collection $U$ of hypotheses from the same class which all agree on the sample/training data: for instance, the collection of hyperplanes within the margin for an SVM, or the collection of correctly-classifying weights for a neural network of fixed architecture. Given some distribution $P$ over the hypothesis space (e.g. a prior — this is PAC-Bayesian, after all) and a loss function $\ell: \mathcal{H} \times \mathcal{X} \rightarrow [0,1]$ (the loss of a trained hypothesis $h \in \mathcal{H}$ evaluated on $x_i$), with probability at least $1 - \delta$,

$\mathbb{E}_{h,x}[\ell(h,x)]\leq \mathbb{E}_h\ell(h,X) + \sqrt{\frac{\log \frac{1}{P(U)} + \log \frac{1}{\delta} + 2 \log m }{2m}} + \frac{1}{m}$

where $m = |X|$ and $P(U)$ is the prior mass of the collection $U$. Thus we can relate the expected loss on a sample (given a prior over concepts) to the total expected loss under that same prior. This relation can be used to establish the average risk of hypotheses that are indistinguishable given the sample $X$.
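To see the bound in action, here is a small simulation (entirely my own construction: the threshold classifiers, uniform prior over a grid, and helper names are illustrative, not from the theorem). Take thresholds $h_t(x) = \mathbb{1}[x \geq t]$ on uniform data labeled by $t^* = 0.5$, with $U$ the set of grid thresholds consistent with the sample, so the empirical-loss term is zero and the true error of $h_t$ is $|t - 0.5|$.

```python
import math
import random

def pac_bayes_rhs(p_U, delta, m, empirical_loss):
    """Right-hand side of the averaged-loss bound quoted above."""
    return empirical_loss + math.sqrt(
        (math.log(1 / p_U) + math.log(1 / delta) + 2 * math.log(m)) / (2 * m)
    ) + 1 / m

def run_trial(m=200, n_grid=1001, delta=0.05, seed=1):
    """Illustrative check: threshold classifiers h_t(x) = [x >= t] on
    Uniform(0,1) data labeled by t* = 0.5, a uniform prior over a grid of
    thresholds, and U = the thresholds consistent with the sample."""
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(m)]
    labels = [x >= 0.5 for x in sample]
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    U = [t for t in grid
         if all((x >= t) == y for x, y in zip(sample, labels))]
    p_U = len(U) / n_grid                      # prior mass of U (uniform prior)
    avg_true_loss = sum(abs(t - 0.5) for t in U) / len(U)
    return avg_true_loss, pac_bayes_rhs(p_U, delta, m, empirical_loss=0.0)

lhs, rhs = run_trial()
print(lhs <= rhs)  # prints True
```

With $m = 200$ the average true risk over $U$ is on the order of $1/m$, while the square-root term is roughly $0.2$, so the bound holds with plenty of room; the point is only to see the pieces ($P(U)$, $\delta$, $m$) fit together.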

#### A preview of luckiness

PAC-Bayesian theorems rely on some probability distribution over a measurable set of hypotheses. Given the strength of such an assumption, they’re a bit easier to handle. Luckiness, however, is a more heuristical (I’m inventing this word) alternative, where you measure the “luckiness” of a hypothesis-sample pair. For instance, the luckiness of a general sample might be the dimension of the subspace it spans; the luckiness of an SVM might be the number of support vectors it has, or the minimum or maximum margin. Basically, luckiness is some encoding of a heuristic about your sample and hypothesis space that measures how lucky (or unlucky) you were to achieve a large or small value for that heuristic.
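The first of these heuristics is easy to compute. A minimal sketch (pure Python via Gaussian elimination so it is self-contained; the function name is mine):

```python
def subspace_luckiness(points, tol=1e-9):
    """One luckiness heuristic from the post: the dimension of the subspace
    spanned by the sample points (a lower-dimensional sample is 'luckier').
    Computes the rank of the sample matrix by Gaussian elimination."""
    rows = [list(p) for p in points]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    col = 0
    while rank < len(rows) and col < n_cols:
        # find a pivot row with a nonzero entry in this column
        pivot = next((r for r in range(rank, len(rows))
                      if abs(rows[r][col]) > tol), None)
        if pivot is None:
            col += 1
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # eliminate the column below the pivot
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
        col += 1
    return rank

# Points lying in a plane of R^3 span a 2-dimensional subspace:
planar = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (2.0, 3.0, 0.0)]
print(subspace_luckiness(planar))  # prints 2
```

A luckiness framework would then ask how confident one can be that a low-dimensional (lucky) sample reflects the underlying distribution rather than chance.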

In other words, because PAC theorems establish lower bounds in full generality, better bounds can only come from a “lucky” matching between the probability distribution underlying the data and the hypothesis space used for classification. Luckiness functions provide a means of measuring such a match, and of quantifying the degree of assurance that the luckiness of the sample will generalize.