# heavytailed

## September 17, 2012

### Algorithmic Luckiness and LOO Stability

Filed under: Uncategorized — heavytailed @ 1:14 pm

I finally had a chance to read the Herbrich/Williamson paper on Algorithmic Luckiness, what with work picking up, a newfound penchant for not being a lazy/sedentary person (read: actually getting exercise), and studying for the GRE (my vocabulary is surprisingly deficient, given the amount that I read…at least I think so). The paper is a decade old now, which means that probably the key insights have already been extracted from it.

The reason it’s interesting, at least for me, has to do with the general concept of exploiting problem structure, and in particular correlations between an instance (e.g. a triplet $(\mathcal{X}\times\mathcal{Y},P,z)$ of a classification space, probability distribution, and sample) and the learning algorithm (or class of algorithms) $\mathcal{A}$. Techniques of this type, that characterize instance spaces and algorithms \textit{jointly}, will be those likely to resolve questions like $\mathrm{P}=\mathrm{NP}$, but that’s really a different topic.

The key inequality of Algorithmic Luckiness relates the set of possible *loss-labelings* of an instance space to the number of “unlucky” permutations of a sample drawn from that space. A *loss-labeling* (which is my own term for a definition in the paper) represents, for an input-label space $\mathcal{X} \times \mathcal{Y}$, the number of possible assignments of $(x,y,\ell(h(x),y))$ induced by hypothesis space $\mathcal{H}$ and loss function $\ell : \mathcal{Y}\rightarrow\mathbb{R}^+$. This is similar to the VC-dimension in that it reduces to the VC-dimension for 0-1 loss (i.e. the number of dichotomous assignments under the hypothesis class).

The fundamental idea of algorithmic luckiness is that doubling your sample $z$ to $z'$ shouldn’t give access to hypotheses that significantly differ on average for points in $z'$. Given a luckiness function $L$, the main inequality relates the following classes/sets:

$\mathcal{L}_\ell(\mathcal{H}_\mathcal{A}(L,z)|_z)$ – the set of loss labelings on the sample $z$ over $\mathcal{A}$-learnable lucky hypotheses.

$\mathcal{L}_\ell(\mathcal{H}_\mathcal{A}(L,z))$ – the set of loss labelings on $\mathcal{X}\times\mathcal{Y}$ over $\mathcal{A}$-learnable hypotheses.

$\mathcal{H}_\mathcal{A}(L,z)$ – the set of $\mathcal{A}$-learnable lucky hypotheses.

$\mathcal{I}_\mathcal{A}(L,z)$ – the set of permutations $\pi(z)$ of $z$ such that $\mathcal{A}[\pi(z)[1:\frac{m}{2}]]$ is luckier than $\mathcal{A}[z[1:\frac{m}{2}]]$.

Note that the double restriction $\mathcal{H}_\mathcal{A}(L,z)$ to both *lucky* hypotheses and hypothesis learnable by $\mathcal{A}$ is important. $\mathcal{H}_\mathcal{A}(z)$ differs from $\mathcal{H}_\mathcal{A}(L,z) \subset \mathcal{H}_\mathcal{A}(z)$ in that it restricts the latter class to only those hypotheses resulting from permutations of $z$ yielding

$L(\mathcal{A}[\pi(z)[1:\frac{m}{2}]],\pi(z)[1:\frac{m}{2}]) \geq L(A[z[1:\frac{m}{2}]],z[1:\frac{m}{2}])$

Note that for a sequence or vector $v$, the notation $v[1:n]$ means the first $n$ elements of $v$, notation borrowed from python.

The key inequality is:

$|\mathcal{L}_\ell(\mathcal{H}_\mathcal{A}(L,z)|_z)| \leq |\mathcal{L}_\ell(\mathcal{H}_\mathcal{A}(L,z))| \leq |\mathcal{H}_\mathcal{A}(L,z)| \leq |\mathcal{I}_\mathcal{A}(L,z)| \leq \frac{(2m)!}{m!}$
The final inequality depends on whether you take $|z| = 2m$ or $|z| = m$, in the second case it can be written $\frac{m!}{(\frac{m}{2})!}$. It follows from the fact that for each permutation that swaps elements of $z$ from the second half for the first half, there are $m!$ (or $(m/2)!$) permutations that only result in permuting elements in the second half of the sample, which have no effect on the luckiness (as this, by definition, is calculated only for the first half). The rest follow by definition (refer to the paper for specifics), but basically the mappings from the RHS to the LHS are obviously surjective.

Even with algorithmic luckiness one has to deal with this cumbersome growth function $\omega$ and its being a bound on the covering number of the hypothesis space. Recall that the idea is that doubling my sample from size $m$ to $2m$ shouldn’t introduce wildly different learnable hypotheses into my hypothesis space. Since we’re thinking about the growth of a set, intuitively we have to introduce some kind of a cover. We care about hypotheses not “differing significantly” on the larger sample, so we need some kind of measure. The idea is, under that measure, the hypotheses yielded by the double sample should be “close” to the hypotheses yielded by the half sample; that is I could probably choose a handful of hypotheses from the half-sample to cover all those in the double-sample under my measure.

Let that measure be the $\ell_1$ error

$\frac{1}{2m}\sum_{i=1}^{2m} |L(A[z[1:2m]](z_i),y_i) - L(A[z[1:m]](z_i),y_i)| = p_1(z)$

What we want is $\mathbb{P}[$I need lots of hypotheses from the half-sample to cover the double sample at scale $\tau$ under $p_1] < \delta$. Letting $\mathcal{N}[\tau,\mathcal{L}_\ell(\mathcal{H}_\mathcal{A}(L,z)),p_1]$ be the number of hypotheses in $\mathcal{H}_\mathcal{A}(L,z[1:m])$ required to $\tau$-cover the loss-labelings induced by $\mathcal{H}_\mathcal{A}(L,z[1:2m])$. We want that $\mathbb{P}[\mathcal{N}(\tau,\mathcal{L}_\ell(\mathcal{H}_\mathcal{A}(L,z),p_1)) > \omega] < \delta$ where “$\omega$” is some function representing “lots of hypotheses”: this is the growth function. If this PAC-like inequality holds, then the triplet $(\mathcal{A},\ell,L)$ is called $\omega$-small at scale $\tau$ (in general $\omega$ is a function of $\tau$).

If $\ell$ is bounded and $(\mathcal{A},\ell,L)$ is $\omega$-small at scale $\tau$, then the generalization error of $\mathcal{A}[z]$ can be bounded as:

$\mathrm{GenErr}[\mathcal{A}[z]] \leq \mathrm{TrainErr}(\mathcal{A}[z],z) + \sqrt{\frac{8}{m}\left(\lceil\log_2(\omega)\rceil + \log_2\frac{4}{\delta}\right)} + 4\tau$

The use of a half- or double- sample is more or less a convenience. One could use a quarter/quadruple or whatever kind of multiplier with only minor changes to the lemmas and theorems in the paper. A question one might ask is will this hold not for fractions of the training sample, but absolute numbers (can I leave out, for instance, just one data point)? Some reference to this was made in the paper in the form of Uniform Algorithmic Stability, which is a related topic. The answer, ultimately, is “yes,” and the result comes from the late Partha Niyogi’s 2004 paper, completely orthogonal to the luckiness framework, which discusses the sufficiency (and necessity) of Leave-One-Out stability for generalization.

As a note, this particular paper doesn’t give direct generalization bounds based on LOO-stability. However, Leave-One-Out schemes based on uniform stability have generated generalization error bounds. In fact, both Niyogi and Williamson cite Bosquet’s paper on uniform stability, and Williamson goes as far as to establish stronger bounds by combining Bosquet’s framework with Algorithmic Luckiness. LOO-stability, on the other hand, is proven to be equivalent to the universal Glivenko-Cantelli condition on the hypothesis space. Big words, but it just boils down to the convergence in probability of the training set error to the true error as the size of the training set is taken to the infinite limit. A cursory bit of googling suggests that if generalization bounds have been derived for Mukherjee/Niyogi’s definition of LOO-stability, they’re not widely known. I also doubt that they’re better than Algorithmic Likelihood bounds for reasons I’ll get into later.

Turning to Leave-One-Out stability, we turn from spaces of $\mathcal{A}$-learnable hypotheses $\mathcal{H}_\mathcal{A}$ towards general function spaces $\mathcal{F}$, upon which — in the nomenclature of Mukherjee/Niyogi — the Empirical Risk Minimization (ERM) procedure induces a learning map $L: \bigcup_{n \geq 1} Z^n \rightarrow \mathcal{H}$. Note that the selection $\mathcal{H}$ is part of the ERM procedure, and so presumably $\mathcal{H} \subset \mathcal{F}$.

Such a learning map is LOO-stable if

1. The predictions on a single point don’t change very much if you leave the point out of the learning sample (with high probability)
2. If you leave out each point one at a time and average the resulting losses, it’s not too far from the (true) expected loss of the learned hypothesis (with high probability)

So basically, you’re not allowed to grossly mislabel points if you leave them out of the training set, and the estimated LOO-risk can’t be too far from the true risk. Stated like this the result is kind of obvious, and requirement (2) seems pretty close to proof-by-definition. But it does look impressive in its symbolically-encoded form:

1. $\mathbb{P}_S\{|V(f_S,z_i)-V(f_{S^i},z_i)| \leq \beta_{\mathrm{CV}}\} \geq 1 - \delta_{\mathrm{CV}} \;\; \forall i$
2. $\mathbb{P}_S\left\{\left| I[f_S] - \frac{1}{n}\sum_{i=1}^n V(f_{S^i},z_i)\right| \leq \beta_{\mathrm{EL}}^{(n)}\right\} \geq 1 - \delta_{\mathrm{EL}}^{(n)} \forall i$

Condition (1) is referred to as $\mathrm{CV}_{\mathrm{loo}}$ stability, while condition 2 takes the name $\mathrm{Eloo}_{\mathrm{err}}$. To untangle this mess, $V(f,z)$ is just the loss of function $f$ on $z$, so the term in 1. is the difference in losses on the point $z_i$ for the learned function $f_S$ and $f_{S^i}$, where $S^i$ is the sample with the point $z_i$ left out. Lastly, $I[f] = \mathbb{E}_{P}[V(f,z)]$ is the (true) expected error.

For me, condition (2) is where I start to prefer luckiness to LOO-stability, because the $\mathrm{Eloo}_{\mathrm{err}}$ condition is defined over the true measure, while $\omega$-stability (/smallness) is defined empirically over a double sample. It’s quite likely that LOO-stability can be recast so that the $\mathrm{Eloo}_{\mathrm{err}}$ condition is replaced with an empirical one, perhaps a swap-in condition (as this would bring LOO in alignment with the framework of Algorithmic Luckiness).

So what does all this mean for the real world? Well, ALuckiness does generate quite a deal of evidence that cross-validation should work exceptionally well for investigating the performance of a learning algorithm, and allow one to bound the generalization error. Also, it provides a nice error bound, so that should generalization fail, one may be able to conclude that the training and evaluation data could be drawn from different distributions. LOO-stability nearly guarantees (though not with strictly empirical conditions) that cross-validation can be performed with single LOOs, and ask if condition (1) is holding for a small enough beta.

Unfortunately, neither of these things are quite what I would want. Ideally there would be some “fast” (say logtime) function $f$ that takes in an algorithm $\mathcal{A}$ and training sample $z$ such that:

$f(\mathcal{A},z) = \left\{\begin{array}{ccc} 1 & & \mathcal{A} \; \mathrm{works \; well \; for \;} z \\ \\ 0 & & \mathrm{otherwise}\end{array}\right.$

with high probability. At least for luckiness, one can identify a feature that explains good generalization; in general one must come up with this from sheer force of thought, or by having a sample that has that feature (and then one has to come up with $omega$), so there’s the possibility of saying “I expect this to generalize really well”. But it seems like a lot more work for an answer that can more easily be obtained through cross-validation.

I guess maybe these items are interesting from an intellectual standpoint. Algorithmic Luckiness reminded me (very obliquely) of the proof that $\mathrm{PSPACE} \subset \mathrm{IP}$ in that a single sample is extended to a double sample, and the size of the cover of hypothesis space can be bounded. In the proof, a boolean function is extended (algebrized), and the size of the cover of solutions can be bounded through queries to some oracle. This exploits only a property of the instance, and won’t be powerful enough to show refined results for complexity (like the collapse of the Polynomial Hierarchy), but AL takes a bit of a more remote view, and examines not just the instance (sample), but also the algorithm running on that sample. It might be worth fleshing out the similarities and differences, but that will have to be a different post.

Edit: sources: 1 2 3 4