# heavytailed

## November 3, 2012

### Manifold Hypothesis Part 1: Compression, Learning, and the Strong MH

Filed under: Uncategorized — heavytailed @ 8:25 pm

Manifold methods have received increasing attention in recent years, and although still a small part of machine learning research, I think it’ll be a fairly hot topic as big data drives it to merge with related fields like model selection, regularization, and compressed sensing. The manifold hypothesis is also one of the topics to which my mind regularly wanders while unoccupied. It’s a fun topic, and certainly worth exploring a bit for its insights.

I’ve recently started to think about the Manifold Hypothesis in two parts: a “weak” MH, and a “strong” MH. Both are hypotheses about domains and supports of distributions given a set of data, but they apply to slightly different settings. In particular, I see a connection between the “weak” MH and unsupervised and semi-supervised learning, and “strong” MH with semi-supervised and supervised learning. The strong MH is also deeply related to the Kernel trick.

Weak MH: A set $X \in \mathcal{X}^m$ where $\mathcal{X}$ is a space with intrinsic dimensionality $D$ (for instance $\mathbb{R}^D$) drawn from a probability distribution $\mathcal{P}^m$ is $\mathcal{L},\epsilon$-degenerate (for loss function $\mathcal{L}$ and $\epsilon \in \mathbb{R}$ – since this can be loss) if there exists a manifold $\mathcal{M}$ of dimension $d < D$ embedded in $\mathcal{X}$ such that

$\int \mathcal{L}(\mathcal{M},X)d\mathcal{P}^m(X) \leq \epsilon$

The weak Manifold Hypothesis (which is with regards to a particular dataset or data source) is basically that $X$ is $\mathbf{d}(\cdot,\cdot),\epsilon$-degenerate where $\mathbf{d}(\cdot,\cdot)$ is a distance function, $\epsilon << 1$ and $d << D$. In terms of probability it’s that $\mathcal{P}$ is concentrated on a manifold of dimension $d < D$, and thus the distribution is nearly degenerate.

Intuitively, weak MH states that the intrinsic dimensionality of the data is much smaller than the ambient space in which the data is embedded. In plain english, this means that, if the data was comprised of $N$ components  measurements, features, dimensions), that there’s some combination $n << N$ that closely approximates the data. In mathematical terms, there’s an underlying data manifold $\mathcal{M}$ of dimension $n << N$, upon which a probability distribution $f$ is defined, from which the data is drawn, although $\mathcal{M}$ is embedded in a $N$-dimensional space (say $\mathbb{R}^n$). Typical examples include a spiral or zig-zag:

This figure comes from a Microsoft Research paper on Decision Forests, in the subchapter which describes Manifold Forests. On the left: the ambient data distribution (in R2), and on the right, the unwinding of the spiral or zig-zag onto the number line, represented as a red-to-blue gradient. While the ambient distribution is two-dimensional, the data lie very close to a continuous line (1-dimensional manifold).

There are lots of methods to exploit this kind of structure in the unsupervised or semi-supervised setting, in particular transformations based on Laplacian eigenmaps, exploitation of local covariance matrices, or a nice trick called Kernel Warping. I’ll probably have a post about these methods sometime in the future. However, while machine learning algorithms that exploit manifold structure are a subclass of machine learning algorithms (by definition), the underlying hypothesis is not taken strongly enough. In my mind, there’s another form of the manifold hypothesis, one which is an extension of the general notion of learnability: the Strong Manifold Hypothesis.

Strong MH: I don’t have a good formalism for the strong Manifold Hypothesis yet, but the idea is that there is a manifold $\mathcal{M}$ with dimension $d$ (note that there’s no specified relation between $d$ and $D$) upon which the hypothesis class is defined. So while “swiss roll” might be a weak MH, “rectangles on a swiss roll” would be strong MH. Basically, where the weak MH describes the support of the data distribution, the strong MH describes the domain of the decision function.

Visually I think about MH like this:

Weak MH: My features can be compressed with little loss to a manifold of lower dimension

Strong MH: My hypothesis class is defined on a (in this case lower) manifold domain. This is really an example of both weak MH and strong MH holding: the manifold on which the hypothesis is defined (clearly P[circle] is increasing with distance from the origin along the spiral) is the same manifold as the one upon which the data generating distribution has support. How often we expect this to be the case is a philosophical consideration.

Notice that for the weak MH, while the data is clearly supported on a line (spiral), the decision boundary is a rectangle in the embedded space. Were we to represent the data in a linear fashion, the appropriate hypothesis class would be a union of intervals, which has a high VC dimension compared to rectangles in the plane. Indeed, higher-dimension analogues of this situation (say: parallelpiped boundary swiss roll data) get to be quite tough: the problem of learning a union of rectangles that explain data is NP-hard! On the other hand, weak MH is great for unsupervised tasks like density estimation.

The strong MH is really a claim that you know the space or manifold on which the hypothesis class is defined; indeed it could be (and should be!) a low-VC dimension hypothesis class on a very strange projective space. Because of this (and pretending I have a rigorous formalization of the strong MH), the strong manifold hypothesis is an equivalent characterization of learnability. (Of course, it’s just one step more complicated: I have a low-VC-dim class on a manifold, so if I have a good idea what the manifold is, I can learn by altering the representation of my data and applying every PAC theorem).

Equivalent to what, though? In previous posts I’ve mentioned Algorithmic Stability, Luckiness, and the PAC framework, all of which are descriptions of learnability. The latter two describe on the situation where I’ve found a hypothesis class that works well, and the former situation describes the situation where my classifications are robust with respect to my training set (and it suffices to leave out only one point for this, see previous posts). However, the Manifold Hypothesis concerns only an intrinsic property of the data itself — although the data includes both what you’re trying to predict, and what you’re using to generate the prediction.

An intuitive way to think about this is through the deep connection between learnability and data compression. The sky-high picture is that if you can compress and reconstruct your data with low error, and if one of the dimensions you can compress is what you’re trying to predict, then you can learn. In this kind of framework you can talk about learning multi-valued functions as well: given data in some ambient space with indexed dimensions, one can specify which dimensions are “targets” and ask how well they can be compressed, and with what reconstruction error. If they can be compressed with low reconstruction error, they can be learned (this is basically by definition).

Sample Compression: PAC error

The connection between compression and learning was first (rigorously) explored by Littlestone and Warmuth who noticed that a number of learned hypothesis functions were parameterized by a handfull of points from the input training sample. For instance, SVMs need only the $k << N$ support vectors to represent the learned hypothesis, and planar rectangles need only the four corners. In keeping with this observation, they define sample compression as a notion of learnability (like VC dimension). A hypothesis class is sample compressible if it can be defined by a compression function $\kappa$

$\kappa: \bigcup_{n=k+1}^{\infty}(\mathcal{X}\times\mathcal{Y})^n \rightarrow (\mathcal{X}\times\mathcal{Y})^k$

which describes how to pick a hypothesis parameterization of size $k$ from an input training set of size $n > k$. Along with this $\kappa$ compression function is a $\rho$ reconstruction function:

$\rho: (\mathcal{X}\times\mathcal{Y})^k\times \mathcal{X} \rightarrow \mathcal{Y}$

So basically, $\kappa$ takes the training set and computes a hypothesis which can be represented (or parametrized) by $k$ points. $\rho$ takes those $k$ points, recomputes the hypothesis, and applies it to a new point $x \in \mathcal{X}$, generating a prediction $y \in \mathcal{Y}$. If this can be done perfectly, that is, given a training set $\mathcal{S}$, $\rho(\kappa(\mathcal{S}),x_i)=y_i$ for every element $(x_i,y_i) \in \mathcal{S}$, then the hypothesis class represented by $(\kappa,\rho)$ is learnable. A SVM is a great example of a sample-compressible hypothesis class: the separating hyperplane can be parametrized only by its support vectors. Thus kappa is the algorithm which determines the support vectors, and rho is the algorithm which uses them to apply the halfspace classification.

Additional information not a part of the training set can also be specified. The extension is to allow the addition of a space $\mathcal{Q}$ to the range of $\kappa$ and domain of $\rho$. (This might be, for instance, the slope and intercept for a linear regression, or the support vectors for a maximum-margin SVM.)

$\kappa: \bigcup_{n=k+1}^{\infty}(\mathcal{X}\times\mathcal{Y})^n \rightarrow \mathcal{Q}\times(\mathcal{X}\times\mathcal{Y})^k$

and

$\rho: \mathcal{Q}\times(\mathcal{X}\times\mathcal{Y})^k\times \mathcal{X} \rightarrow \mathcal{Y}$

If this is the case, then the hypothesis represented by $(\kappa,\rho)$ has low error. In particular, if the training set is of size $m$, then the hypothesis error is $\epsilon$ or greater with probability $P^m(\mathrm{err} \geq \epsilon) \leq |Q|{m \choose k}(1-\epsilon)^{m-k}$. This can be extended to a standard $\delta,\epsilon$ bound without much trouble. The proof follows from the fact that you have perfect classification of the test set given the $k$ points, and the test set distribution is identical to the real distribution, so given some error rate $\epsilon$ and a training set set $X, |X|=m$ you need only count the number of $k$-point subsets that yield a correctness of $P(\rho(x_{h_1},\dots,x_{h_k},x_\mathrm{new})=y_\mathrm{new}) < 1-\epsilon$, then integrate over all training sets of size $m$, and bound by simple intersection. See here.

Sample Compression: Generalization Bounds

So the PAC bounds of Littlestone and Warmuth lay dormant for a few years. Those bounds depend on having no noise in the labels, that is having $y=f(x)$ and not $P(y)=f(x)$. But what if we have noise in the labels, or what if we care about an arbitrary risk (that is, loss)-based decision boundary, rather than a strict classification boundary? Sample compression provides a means of extracting generalization bounds as well.

One way of generating bounds (which isn’t very tight, but why not) is to bound the expected difference in risk and apply your choice of expectation/probability inequalities. We’re going to be given a sample set $\mathcal{S} \subset (\mathcal{X}\times\mathcal{Y})^m$ of size $m$, and a compressible class $\mathcal{C}$ where hypotheses can be indexed by $z = ((x_{t_1},y_{t_1}),(x_{t_2},y_{t_2}),\dots,(x_{t_k},y_{t_k}))$ and possibly some side information $Q$. Here, $t$ is a set of indeces of size $k$, so $t_i \in 1,...,m$. The space is equipped with a loss function $L(x,y_{\mathrm{act}},y_{\mathrm{pred}})$, with $0 \leq L(\cdot) \leq M$, which generates an optimal hypothesis $z^* \in \mathcal{C}$ which minimizes $\mathbb{E}_{P(x,y)}[L(x,y,\rho(z^*,x))] = L[[z^*]]$ (recall $\rho$ is the reconstruction function).

It’s actually quite difficult to get fully generalized bounds through this approach, and that difficulty arises from the fact that the loss function may not be bounded. The 0-1 loss is well understood (it varies very little from the previous: bound the proportion or number of errors, define the sets $\mathtt{BADSET}$ and $\mathtt{EGOOD}$ as the collection of training sets that yield bad hypotheses ($L[[z_\mathcal{S}]] - L[[z^*]] > \epsilon$), and the collection of training sets that have good behavior on the training set ($\sum_{(x_i,y_i) \in \mathcal{S}\setminus z_\mathcal{S}} L(x_i,y_i,z_\mathcal{S}(x_i)) < q$) (this is a bound on the number of misclassifications, as the loss is 0-1). Then just apply Hoeffding’s inequality to the intersection of the two collections (e.g. both events happen), and get

$P\left[L[[z_\mathcal{S}]] - L[[z^*]] > \epsilon \right ] \leq \exp\left[-2(m-k)\left(\epsilon-\frac{q}{m-k} \right ) \right ]$

A more indirect, but I think cooler way of obtaining a bound is by considering sample compression to be a form of Algorithmic Luckiness (see previous posts). In particular, the representation size $k$ or $|Q|k$ is exactly the luckiness function, and is $\omega$-small at any scale $\tau$, and

$\omega(L_0,l,m,\delta,\tau) = \sum_{i=0}^{L_0}{2m \choose i} \leq \left(\frac{em}{L_0} \right )^{L_0}$

Which gives an immediate generalization bound by (plugging in – see previous posts):

$L[[z^*]] \leq L[[z_\mathcal{S}]]+\sqrt{\frac{8}{m}\left\lceil k\log_2\frac{em}{k}\right\rceil+\log_2\frac{2m}{\delta}}$

But again, this is for bounded loss $L \in [0,1]$.

By the way, as a quick aside, there are tons of mismatches between theory and practice when it comes to loss functions. Not only are loss functions, in general, not bounded (L1 and L2 loss aren’t, hinge loss isn’t, etc) but most theoretical treatments treat loss as $L: \mathcal{Y}\times\mathcal{Y} \rightarrow [0,1]$ and not in reality $L: \mathcal{X}\times\mathcal{C}\times\mathcal{Y} \rightarrow \mathbb{R}^+$, that is hinge loss for a SVM is really: $L(x,f,y) = \max\{1-yf(x),0\}$. So if your classifier screws up on a point far from the margin, it imposes a good deal of loss. The hinge loss is really simple, yet doesn’t fit in with any of the theory above, as it’s not bounded. I think that a great deal of these theorems can be extended from bounded loss, to loss with bounded kth derivatives (so hinge loss, L2 loss). So there’s still significant work to do in this regard.

Strong MH, Learnability, and the Kernel Trick

So after that rambling digression, you’re probably wondering what this all has to do with the strong manifold hypothesis. Recall that my characterization of it is as the hypothesis that there is a manifold $\mathcal{M}$ upon which a hypothesis with low VC-dimension lives. Rather than delve into VC dimension, I’ll use the fact that for every class of VC dimension $d$, there exists a compressed representation of size $O(d)$; so I can equivalently say that there is a space on which I can learn a hypothesis with compression size $k$.

Note that we’re not talking about learning $\mathcal{M}$, but we know it a priori. In this case, if the hypothesis holds, every PAC compression bound applies directly. This is why the Kernel trick works: you’ve pushed your data into a high-dimensional space upon which the hypothesis class is defined. You compress your hypothesis to just the $k$ points you need, and you’re done. The fact that the data itself lay on a low-dimensional manifold in that space (for instance, you may have used a polynomial kernel, which takes $x \rightarrow (x,x^2,x^3)$, which clearly is a single curve in $\mathbb{R}^3$) doesn’t matter, which is why the separation between the data manifold and the hypothesis manifold is important.

1. I don’t know or understand enough yet about machine learning to comment meaningfully on the most of the post, although I did enjoy it.

This is a really dumb and pedantic question on my part, but in the statement of the weak manifold hypothesis, does it matter whether or not the manifold is just a topological manifold, or is it assumed to be even a differentiable or even smooth manifold?

Comment by krinsman — July 16, 2017 @ 7:06 am

• Wow. I forgot about my own blog here…

These ideas have been quite well refined in the literature on dimensionality reduction. From my perspective I basically view the weak notion as the existence of an unknown manifold M, with an embedding map f: M -> R^n. We need not endow M with anything more than the ability to support distributions (i.e. a topological manifold with a sigma-algebra). So discrete manifolds are acceptable.

One could formulate the stronger hypothesis (say for classification) as associating class probabilities with the open sets in M (i.e. inferring the distribution on its probability space). However in thinking about “neighborhoods” or “boundaries” *on* M, and their image in R^n (under f), ideally M would be Riemannian.

Comment by heavytailed — July 16, 2017 @ 6:39 pm