# heavytailed

## August 21, 2012

### Luckiness in Machine Learning

Filed under: Uncategorized — heavytailed @ 12:16 am

As mentioned in my previous post, PAC theorems are general results regarding the generalizability of learning algorithms in a particular setting. “In a particular setting” usually involves assumptions on the hypothesis class H (equivalently, concept class C), and nearly always involve the assumption “H contains an error-free classifier – the underlying decision boundary/formula/algorithm”.

In the last post I mentioned a brief lemma used in generating PAC-Bayesian theorems, including one which bounds the average generalization error of a subset of hypotheses. These bounds, by the way, don’t offer much improvement over Vapnik’s uniform convergence theorem: for a hypothesis class with VC-dimension $d$,

$\epsilon(\hat{h}) \leq \epsilon(h^*) + O\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right)$

However if your VC-dimension is very large, the PAC bounds give an alternate view of why learning is feasible: if a large component (measured by your prior) of your hypothesis space gives low classification error, then you’re likely to generalize well (as opposed to, perhaps a small collection of hypotheses overfit to your data – this is the 1/P(U) term).

Luckiness gives a related but slightly different view on generalizability. Whereas PAC theorems (and in particular Vapnik’s convergence theorem) provide uniform bounds over an entire hypothesis class, and PAC-Bayesian theorems provide an average bound over a prior-influenced subclass of hypothesis, luckiness provides nonuniform bounds over a hierarchy of the hypothesis class; and in particular provides better bounds for hypotheses where the data (distribution) is serendipitously aligned to the class.

Motivating Example: Kernel-SVM with a Gaussian RBF

I recall hearing (or reading) an exchange about Gaussian kernel SVMs, which teased apart two views of the subject. In the first view, because radial-basis SVMs can shatter any data set, they have VC-dimension infinity. Obviously, because of this, the uniform convergence theorem above (Vapnik’s) doesn’t apply. But they really do work well, so why should that be?

Enter the second view. SVMs in general work really well when there is a large margin, regardless of the VC dimension of the underlying hypothesis class. This view is the one taken by Vapnik for his generalization bounds for SVMs (both those involving the number of support vectors, and those contrasting the margin with the span of the data).

The second view is striking. The whole hypothesis class of RBFK-SVMs does not have an error bound, because of its VC-dimension. However, in the event that, serendipitously, the hypothesis class contains a subclass which, on the training data, has a small number of support vectors, or a large margin, we can say quite a bit about that subclass.

This approach, exploiting a serendipitous congruence between the data and the hypothesis class, falls into the framework of Luckiness functions. From this point of view, model and variable selection can be thought of as methods which attempt to maximize the degree of luckiness of the learning process.

What is Luck?

For many, luck is defined by unpredictability more than anything else. In Norse culture, the case is quite the opposite in that luck had nothing to do with what we would refer to as coincidence or chance. On the contrary, luck was a quality inherent in the man and his lineage, a part of his personality similar to his strength, intelligence, or skill with weapons, at once both the cause and the expression of the success, wealth, and power of a family. Luck expressed itself partially in skills, beauty, and other desirable characteristics, but also in events shaping themselves according to the wishes of the lucky man. One might have luck in specified areas but not in others, such as fishingluck or weatherluck for example. But the so-called “man of luck” was the man who possessed luck generally, not just in one specific area. People possessed luck in different measure and one was helpless against an opponent who had greater luck. Kings especially were great men of luck to the degree that they were able to send forth their luck to assist others. Luck was not a thing to be sought or found by coincidence; one had the luck that one was accorded by fate. Yet, in certain cases luck could be diminished or lost..

– Sommer, The Norse Concept of Luck

This particular view of luck (which brings back fond memories of reading Egil’s Saga) is actually pretty close to what is meant in terms of machine learning. Just change “person” to “hypothesis class”, and “Kings” to “analysts” (or data scientists or whatever you want to call yourself. It’s good to be the King). Some classes are “lucky” for certain kinds of distributions on the input space, while others can be “unlucky” on the same kinds of distributions.

Specifically, a luckiness function $L$ is a mapping:

$L : \mathcal{X}^M \times \mathcal{H} \rightarrow \mathbb{R}^+$

that’s all it is. Similarly you might define an unluckiness function $U$ in exactly the same way. Given a labeled sample $X: |X| = m$, the hypothesis class $\mathcal{H}$ can be ordered according to luckiness or unluckiness: a level $\ell$ is associated with each hypothesis (and parametrized by $x$):

$\ell(x,h) = |\{b \in \{0,1\}^m : \exists g \in \mathcal{H}, g(x) = b, L(x,g) \geq L(x,h)\}|$

that is, it’s the number of alternate classifications $b$ for which there’s some hypothesis which, given $b$, has greater luck than $h$ does on the input classification. For classes equipped with a Luckiness function, if the level is low, the degree of serendipity is high, and one might expect better-than-average performance. Put another way: Let $\mathcal{H}$ be the concept class of binary decision trees, and let $L(x,h) = |h|$, that is, the number of leaves in $h$. Let’s say you’ve trained $h$ and $|h| = t$. What then is the probability that, given some alternate classification on $x$, I’d have generated such a small tree? It’s exactly $2^m\ell(x,h)$. Thus the level of a hypothesis (with respect to a luckiness function $L$) can be thought of as a kind of probability over a uniform distribution of classifications of $x$. (Same with unluckiness). Those with a statistics background might recognize this as the probability resulting from a permutation test.

PAC-Bayes as Luckiness

Notice that the function $L(h,x)$ is not constrained to use information about $x$ (or even $h$ for that matter). In that sense, the probability prior $P(h \in \mathcal{H})$ as a mapping from $\mathcal{H} \rightarrow [0,1]$ can be viewed as a luckiness function. This interpretation reveals something a little bit amiss with the level of the hypothesis class: a bad prior means nothing is lucky, which could implicate bad performance, when in fact the performance can be good in spite of the prior. Basically the complaint is that with such freedom in the choice of $L$, there needs to be some kind of constraint in the analysis of $L$ to really define what luckiness entails.

Enter probable smoothness. I’m not sure the reason behind the nomenclature, I’d prefer something like “lucky with respect to,” but we’ll stick with the canonical term “Probably Smooth”. If a Luckiness function is probably smooth, it means that we can say something about how lucky we expect to be after doubling the number of data points. In particular, we can with high confidence bound the growth rate of the hypothesis level, and by extension, the generalization error.

Probable Smoothness is defined with respect to two functions, $\eta: \mathbb{N} \times \mathbb{R}^+ \times [0,1] \rightarrow [0,1]$ and $\phi : \mathbb{N} \times \mathbb{R}^+ \times [0,1] \rightarrow \mathbb{R}$. The latter, $\phi$, is the bound on the growth of the hypothesis level; while the former, $\eta$ is a kind of fudge factor (i.e. we can’t necessarily bound the hypothesis level of the whole double-set, but we can bound the hypothesis level of all subsets of a certain [large] size). Slightly more mathematically, given a hypothesis $h$ with luckiness $L$ on the first half of a sample (i.e. the first $m$ points), with high confidence ($1 - \delta$), for most (all but $\eta(m,L,\delta)$) of the points in the whole $2m$ sample, the level of the hypothesis can’t be more than $\phi(m,L,\delta)$. Completely symbolic:

$P\{(x,y) : \exists h \in \mathcal{H}, Er_x(h) = 0, \forall x'y' \subseteq_{\eta(m,L(x,y),\delta)} xy, \ell(x'y',h) > \phi(m,L(x,h),\delta)\} \leq \delta$

where the notation $w \subseteq_\alpha z$ means that $w$ is a subvector of $z$ missing at most a fraction $\alpha$ of $z$‘s coordinates.

So the solution to the view of PAC-Bayes priors as a kind of luckiness function is to indicate that not only must you find a mapping from the instance space and hypothesis space to $\mathbb{R}^+$, but it has to be probably smooth on the instance and hypothesis space. It’s probable smoothness, rather than Luckiness, that encodes serendipitous congruence between data and hypothesis; and so by observing that a Bayesian prior on the hypothesis space is a kind of luckiness function, asking that it be probably smooth is effectively asking that it encode (by magical intuition of the prior’s author) something about situations in which the hypothesis classes perform well.

The standard example is one of a decomposable class hierarchy

$\mathcal{H} = \mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \dots$

and the VC-dim of $\mathcal{H}_k$ is $k$. One could, for instance, construct a prior $p_k$ so that $\sum_{k=1}^\infty p_k = 1$. In such a case

$\mathrm{er}(h) \leq \epsilon(m,d,\delta) = \frac{4}{m}\left(d\log(\frac{2em}{d})+\log\frac{1}{p_d}) + \log\frac{4}{\delta}\right)$

Similarly, we can define an unluckiness function

$U(x,h) = \min\{d : h \in \mathcal{H}_d\}$

This unluckiness function is probably smooth with respect to $\eta = 0$ and $\phi = (\frac{2em}{U})^U = (\frac{2em}{d})^d$. This gives a similar bound:

$\mathrm{er}(h) \leq \epsilon(m,d,\delta) = \frac{2}{m}\left(1+d\log\frac{2em}{d} + \log\frac{8m}{\delta}\right)$

Thus the structure of the hypothesis class is captured by the unluckiness function U, and the function $\phi$ with respect to which it is smooth.

Some idle speculation

As mentioned in my previous post, an increasingly hot topic in machine learning is the use of data geometry to improve classifier and regression performance. The way in which manifold methods are used is effectively a transformation of the instance space (typically selecting the least significant eigenvectors of the graph laplacian). Thus this kind of structure can’t be directly incorporated into a luckiness framework (but can be viewed, along with model/variable selection as a method for trying to maximize luckiness). Nevertheless, if the transformation could be incorporated into the framework, one could put bounds on the generalization error given, say, the curvature of the underlying manifold and the number of labeled and unlabeled points available.

Edit: also here’s a lecture video I found