# heavytailed

## October 6, 2014

### Kernel Random Effects

Filed under: Uncategorized — heavytailed @ 10:57 pm

I ran into a couple of grad students studying for a big AI exam yesterday. I barged in wondering what they were doing with all the pretty inner products and norms on the board, precipitating a moment of intense awkwardness. One of them asked how much I knew about Mercer’s theorem, and if I could explain it. The best I could do was recall the gist of Mercer’s theorem: which is that if you have a Kernel $K: \mathcal{X}\times\mathcal{X} \rightarrow \mathbb{R}$ that obeys certain conditions (positive-definiteness and boundedness), then you can think of it as an inner product in a larger feature space. In other words (letting $\mu$ be a measure on $\mathcal{X}$):

$(1) \;\;\; \sum_{i,j=1}^n a_i a_j K(x_i,x_j) \geq 0 \; \; \forall x_i \in \mathcal{X}, a_i \in \mathbb{R}$

$(2) \;\;\;\; \int_X \int_X K(x,z)^2 d\mu(x)d\mu(z) < \infty$

$(1),(2) \Rightarrow K(x,z) = \sum_{k=1}^\infty \lambda_k\phi_k(x)\phi_k(z)$

Where the convergence is uniform and absolute. This is, basically, obvious, and is the extension of the linear algebraic notion that positive semidefinite matrices can be written as an inner product, and the point is that so long as you can construct a positive-definite K for which condition (2) holds, you can solve certain optimization functions over a Hilbert space without ever directly calculating the image of your data in that space. For support vector machines (the typical application of Kernels) this basically means if you magically pick a high-dimensional representation in which there is a good separating hyperplane, you can find a really good hypothesis.

Okay well what does it mean for there to be a good separating hyperplane in a large (possibly infinite) dimensional space? It means that there’s a $(w,b)$ such that

$w^\dagger x_i + b < 0 \;\; \mathrm{if} \; \; y_i = 0 \;\;\; w^\dagger x_i + b > 0 \; \mathrm{otherwise}$

Note that for Kernel methods, the inner product is replaced by the kernel, that is $K(w,x) + b < 0$ (etc). One way of interpreting this is as showing that there is a high-dimensional inner product space characterized by K on which the classification function is linear; another way is to interpret this as saying that in that high-dimensional space, the data cluster on either side of the dividing line. Perhaps there are many clusters, but on either side of that line: the explanatory data is somehow better aligned with the responses.

(Note that this is different from the Manifold hypotheses articulated in previous posts. The weak MH says that the data are distributed on a low-dimensional manifold; the Strong MH says that the hypothesis class has a simple representation in that very same manifold. The Kernel-MH says the hypothesis class has a simple representation on some manifold of dimensionality up to that of the data, in a potentially infinite ambient space. This is why Kernel methods don’t magically escape the curse of dimensionality.)

Obviously, a better alignment of features with response makes all manner of machine learning methods work more appropriately. In particular, in the regression setting, one would like to identify features better correlated with the response. The only issue is that while the predictors can fly into the infinite-dimensional Kernel space, the response can’t. And even if you could, all you get from $K(x,z)$ is the Kernel matrix $K_{ij} = K(x_i,x_j)$, with no access to $\phi(x_i)$ and $\phi(x_j)$.

But. For Random effects (see for instance here), $K$ is all you need. This is one of those (excellent, but few) non-trivial cases where you’re forced to use random effects. This blog post will provide some background and demonstrate some results. There’s not all that much to develop — merely pointing out that K is a matrix of appropriate dimensions is enough for anyone to take this forward and run with it — except to point out relationships between the current literature on Kernels and assumptions about random effects models (and these mostly have to do with normalization).

From the Annals of Bad Examples: Why Use Kernels

“Doing a good thing for a bad reason”

The reason to use a Kernel is to project the data into a space in which it is linearly separable. What this is really doing is choosing a space in which the image of the separating criterion function is linear, enabling linear classification in the image of the data, which corresponds to nonlinear classification of the unprojected data. This works really well for many types of hypotheses, but one type on which it tends to fail miserably are disjoint intervals. Yet somehow the canonical example is as follows:

Got some red points, some blue points (the y-axis is just 0 or 1 depending on the class, it’s a 1-dimensional classification) and they’re obviously not linearly separable since they’re disjoint intervals. Now there is a transformation $C: \mathbb{R}\rightarrow\mathbb{R}$ that will wind up separating the red points from the blue points, it’s the cheating kernel, where you simply multiply the endpoints together to form a polynomial, i.e. for endpoints $z_i \in Z$:

$C(x) = \prod_{i=1}^{|Z|}(x-z_i) = (x_i-3)(x_i-2)(x_i+4)(x_i+1)(x_i-6)(x_i-8)(x_i+11)(x^i-12)$

and “magically” by constructing a polynomial with zeros on the edges of the intervals the two classes are linearly separable!

Therefore the Kernel trick works! My beef is: this is really a kernel dirty trick: the Kernel function itself encodes the hypothesis, and so if we didn’t know it ahead of time, we’d’ve been screwed. Before we move on to realistic examples using “real” kernels, it’s worth asking what the Kernel Matrix looks like for this, and we can choose either to normalize within the kernel feature space or not (the normalization sets $\tilde{K}(x,y) = \frac{K(x,y)}{\sqrt{K(x,x)K(y,y)}}$)

For a reason to be determined, normalizing the Kernel (right) reveals the huge structure in the data (the fuzziness is due to a small constant I added to deal with 0 valued denominators), and very obviously the eigenspaces of these two kernel matrices will be different. The one on the right should be very easy to classify, the latter maybe less so. Looking at the projections, it’s clear there’s a much larger margin in the latter case

The latter case clearly indicates that a random effects model (or fixed effect using PCs) could easily pull signal out of this. This immediately leads to means of extending the Kernel trick to general regression problems. In part this is already well-known; for the general regression problem of solving:

$\mathrm{min}_w L(X^\dagger w)$

where L is a loss function (such as $L = (y-y_\mathrm{pred})^2$) we can rewrite as

$\mathrm{min}_v L(X^\dagger Xv)$

simply by letting $w = Xv$. For L2-penalized regression the transformation works as well (the penalty term is $+ \lambda v^\dagger X^\dagger Xv$). The story here is a little bit more restrictive, but at the same time a bit cooler: by being a bit more explicit about an underlying probability model (e.g. gaussian, probit, etc), the kernel matrix K is taken as a hypothesis about the covariance of observations. If you’ve found a “good” representation of your data – not just a space where it’s separable but where distances mean something, in terms of “close points are correlated, far points are not,” the Kernel matrix should have the property of explaining a large proportion of your data.

Let’s not over-trivialize: this is a pretty radical re-interpretation of the Kernel matrix and what it’s trying to accomplish. Usually we only care about separability: the fact that the magnitude of the positive points (in the first figure) is far smaller than that of the negative points, and that overall there’s a vast range not associated at all with the response, is in general ignored. It’s separable: let’s move on. By contrast, considering the Kernel matrix as though it were part of a variance components model means caring about the structure of the data, beyond smearing it out enough that the class is more or less separable. It may even be worth making normative statements: good Kernels provide both improved separability and discover/retain covariance in the predictors that translates into covariance in the response. (This is, at its core, a statement about scale).

Digression: Kernel Normalization

Before moving to the empirical section (which can be summarized by: blindly sticking K into generalized linear models sometimes works), it’s worth linking the way scale relates to linear models to the way the Kernel represents a map into a latent high-dimensional feature space.

One thing to notice is that Kernel normalization restricts the Kernel feature space to a unit sphere. This follows from Mercer’s theorem: we can just rewrite

$K(x,y)/\sqrt{K(x,x)K(y,y)} = \frac{\phi(x)}{||\phi(x)||}^\dagger \frac{\phi(y)}{||\phi(y)||} = \cos \theta_{\phi_x\phi_y}$

for suitably defined inner product (it may be an integral). Clearly you lose a dimension from this, and so if K corresponded to an N-dimensional space, then the normalized version is the (N-1)-dimensional sphere. This precisely explains the behavior observed above: restricting the one-dimensional polynomial transform $C(x)$ above to the 0-dimensional sphere (the points {1,-1}) effectively makes $K(x,y)/\sqrt{K(x,x)K(y,y)} = \mathrm{sgn}(C(x))\mathrm{sgn}(C(y))$: ergo the constant strobes in the normalized Kernel matrix.

Conceptually at least, normalization is a statement that the original kernel maps the data into a space where the angles (suitably defined) are predictive, but the position of the data in absolute terms is irrelevant. In some sense, there’s less smearing, and this can be amenable to methods not just looking for a separating hyperplane, but looking for a space where distance between points has some kind of meaning.

One way to examine the impact of normalization is by its effect on the canonical feature map. That is, given a kernel $K(x,y): X \times X \rightarrow \mathbb{R}$ it induces a number of objects. One such object is the functional $L_K: \ell^2(X) \rightarrow \ell^2(X)$ defined as $g(y) = L(f) = \int_X f(x)K(x,y)dx$. This can be thought of as a change of bases much like a Laplace transform; and the canonical feature map is the application of this functional to the delta-representation of the datapoint, i.e. $\phi_K(x) = K_x(y) = L_K(\delta(x)) = \int_X K(z,y)\delta(z-x)dz$.

There are many features consistent with a Kernel, not all of which draw from Mercer’s theorem like the canonical map does. However the exact feature depends on how you define your inner product, for instance if $\langle f(x),g(y)\rangle = \int\int f(x)K(x,y)g(y) dxdy$ then the above representation is good to go. By contrast you might want to factor this as $\int\int f(x)\sqrt{K(x,y)}\sqrt{K(x,y)}g(y)dxdy$, which is perfectly fine given that K is positive definite, in this case you might want the canonical feature to be $\int_X \sqrt{K(z,y)}\delta(z-x)dz$

So what does the canonical feature look like for a polynomial kernel? On the left is a standard quadratic kernel, and on the right, the same kernel but normalized. Remember, these are canonical features, in this case $\phi_z(x,y) = \int_x\int_y \sqrt{K(x,y)}\delta(x-z_x)\delta(y-z_y)$ is parametrized by the 2-vector $z = (z_x,z_y) = (1,-1)$, the black dot in the pictures.

Increasing the polynomial to degree six:

So Kernel normalization here appears to make the canonical features look more like radial functions. What are inner products of radial functions? Effectively similarities: radial functions whose centers are close by (in comparison to the function volume, say) have a large inner product, those with centers far apart have a small inner product. This is going to link us back to random effects, but it’s worth addressing (in our own poor fashion) what’s going on geometrically when one normalizes a Kernel.

Some Geometry of Kernel Normalization

So normalizing a kernel is basically rescaling so that the diagonal is one. Note that the standard RBF $\exp(-||x-y||^2)$ is automatically normalized, which is of course related to the intuitive connection between inner products of radial functions and similarities. But what really does it mean if $K(x,y) = (x^\dagger y + c)^d$, when normalized, results in a better classification that $K(x,y) = \exp(-||x-y||^2)$? (These are .gifs by the way, if they’re not animating click on them to view).

Since these kernels are just inner products, they should be rotation invariant; that is invariant under unitary operations on L2 (the space of the canonical representation). In the case of kernels that are functions of a norm or inner product (i.e. $K(x,y) = K(x^\dagger y)$ or $K(x,y) = K(||x-y||)$), rotations of feature space induce a unitary operator in Kernel space – and therefore the Kernel matrix will be invariant under such rotations.

But one big difference between the exponential and polynomial kernel is that the former is translation invariant, and the latter is not.

However, unlike the polynomial kernel, the exponential kernel is invariant under translations, while the polynomial kernel has a “minimum” (0) where $x^\dagger y = -c$. In some sense, while pure radial kernels encode “locality”, other kernels encode biases about the ambient feature space. The polynomial kernel treats points as similar if they are (1) far away from the minimum c, and (2) collinear (or nearly so) with each other and c. The ANOVA kernel (where rotations of feature space do not induce unitary transformations in Kernel space) encodes for axis-parallel similarities, and a wavelet kernel (which is admissible though not positive definite) encodes for a patterned similarity:

So basically, one kernel performing “better” than another means that the canonical feature of the “superior” kernel function encodes something about the classification geometry in the ambient feature space that is closer to the truth. So for instance, if the normalized polynomial kernel outperforms the standard RBF, it indicates that similarities are not well modeled by gaussian-scaled distances, but instead by distances that are biased by overall distance from (and angle to) the reference point c.

But what does it mean if normalization improves performance? What’s the difference between the normalized kernel and the unnormalized one? Well if a kernel is a generalized covariance, then a normalized kernel is some kind of a generalized correlation. That is, we go from some unknown space $\mathcal{D}$ with an inner product onto a sphere in that space: $\{x \in \mathcal{D}: \langle x, x \rangle = 0_D\}$ where $0_D$ is the 0 element of that space. This is a kind of poor man’s projective space: we’ve made our destination manifold scale-invariant (with respect to rescaling the data). Indeed for SVMs, this kind of normalization is recommended as theoretically it improves the margin.

Would we want to do this for a random effect model? Should we do this for Kernel-PCA? There are really two answers here: for RBF Kernels, there is no actual question, as these are automatically normalized. However for things like the polynomial kernel, the question is one of whether we want to equalize scales across the features in our Hilbert space. From a physical point of view, data has units associated with it; so if x is impressions/day then x^2 is impressions^2/day^2. This automatically places features on different scales, so it seems to me from this perspective normalization is appropriate. On the other hand, if the data is normalized to be unitless, then features in the Kernel space may be “physical” in some sense.

Ultimately, there’s nothing stopping a Kernel from being used in an old-school linear model. However, there are connections between Kernel machines and Linear Mixed models