# heavytailed

## August 24, 2012

### Boosting generative models

Filed under: Uncategorized — heavytailed @ 8:56 pm

A post on reddit’s r/machinelearning introduced me to the concept of gradient boosting. A quick google turned up the seminal paper. A number of ideas popped out of it that I hadn’t considered before, the most concrete of which being: Least-Angle Regression (or stagewise regression) as a boosted ensemble; and the possibility (nebulous still) of comparing high-influence points (in say logistic regression) with support vectors. The similarity is fairly tantalizing. Also, the nomenclature of “Influence Trimming” seems very close to the idea of influence curves in semiparametric regression. Unfortunately, papers where I’ve encountered the concept of influence curves have been near-impenetrable, so I have nothing to go on other than similarity of situation and similarity of name.

At any rate, I immediately thought about how I might apply gradient boosting to the kinds of data I deal with in Next-Gen Sequencing, in particular the problem of variant filtering. The setup for this is really quite simple: a set of algorithms process the raw data into a list of putative genetic variants (e.g. at position 2,864,228 of chromosome 1, the base T has mutated to base C in some number of the samples). Each of these putative mutations comes with an annotation vector, describing a number of features of the data: number of unique molecules (“reads”) at the position, the probability (under a Bayesian framework) that the site is variable, the average confidence in the positioning of the reads at the locus, etc. The goal is to separate the real variation from machine artifacts and statistical false-positives.

As of now, the standard method is to use a Gaussian Mixture Model to model the distributions of true and false positives. See here for a full presentation. The way it’s done is a bit interesting: because there are a large number of categorized polymorphisms (in particular in humans and mice), those that are re-discovered are used to build the positive model. Then, rather than have an explicit “negative model”, the variants are ranked according to their probability under the positive model, and the worst 7% (really this is a parameter) are assumed to be negative, and are used to build the negative model. This method obviously generates a bias, but it’s probably a helpful bias — (Consider the simplified case of two gaussians, one positive, one negative; same standard deviation, different means, and undergo the same procedure. The mean of the negative gaussian will be approximately (assuming a far enough spread in the gaussians):

$\int_{\infty}^{\mu_1-\Phi^{-1}(0.07)} x dF(x)$

where $F(x) = \frac{1}{2}\Phi(x-\mu_1) + \frac{1}{2}\Phi(x-\mu_2)$, and $\Phi$ is the standard error function. Noting that $\Phi^{-1}(0.07) \approx -2.46$ . Thus the mean of the negative model will be about $\mu_1-0.0193$, and larger for larger standard deviations ($\mu_1 - 0.0193\sigma$). I’ve let $\mu_1$ indicate the bad gaussian’s mean, and assumed that $\mu_1 < \mu_2$.) — at least from the perspective of minimizing false-negatives. It may actually be fairly straightforward to derive a correction for this scheme.

At any rate, when I thought about it for 25 seconds, I realized that it’s not at all clear how to boost generative (as opposed to discriminative) models. (It’s also not completely clear how to mix generative and distributive models — but that’s a subject for some other date). From a purely speculative standpoint, it shouldn’t be awfully difficult to extend, the difference between the two spheres seems to boil down only to the strength of the assumption about the underlying distributions. (For instance, a GMM makes a strong assumption about the moments of the data-generating distribution, while a SVM makes the weaker assumption that there is good separation — in return, generative models have a tendency to converge more quickly.) So I wondered if there was research in the direction of generative model boosting. Turns out, the “nothing new under the sun” theory continues to hold, and here’s how:

Loss Functionals for Generative Models

You’re not wrong, Walter, you’re just…

So you want to boost your generative model. How cute. You need a loss function to represent some penalty for getting the wrong label…except that you’re trying to estimate a density, so there are no labels. In this universe you can’t be wrong…but you can be an as.. erm, unlikely. So you need a loss function that’s related to the likelihood of the data you’re modeling. I’ve tried to find less-than-obvious loss functions, but sources are a scarce, so there are basically two contenders: $L^k$ loss:

$L(x,\hat{f}) = \int ||\hat{f}(x)-f(x)||^kdx$

and the likelihood loss:

$L(x,f) = -\sum \log(f(x))$

For an example of $L^2$ loss:

$\int (\hat{f}(x)-f(x))^2dx = \int \hat{f}(x)^2 - 2\int\hat{f}(x)f(x) + C$

where $C$ can’t be minimized by choice of $\hat{f}$. Keeping the regularization term and approximating the second integral by the data itself  gives:

$L_2(x,\hat{f}) = \int \hat{f}(x)^2dx - \frac{2}{n} \sum \hat{f}(x_i)$

And we find the $\hat {f}$ to minimize the above risk functional. Other loss functionals can be defined, but, as I said, I couldn’t find any others other than ML, MAP, and L2 actually used. (You might imagine, for instance, a weighted moment-matching loss function). Anyway, it seems to me that the only roadblock for boosting a generative model is the suitable definition of a loss function. Thereafter, it’s plug-and-play.

Generative Boosting

At this point gradient boosting will just work: it follows from basically the same argument:

$L(x,F_{k}(x)+\epsilon h(x)) = L(x,F_{k}(x)) + \epsilon h(x) \frac{\partial L(x,F_{k}(x))}{\partial F_{k}(x)} + O(\epsilon^2)$

with sums over everything as usual (that is $x$ is really $\sum x_i$). Thus, boosting amounts to finding the $h(x)$ that minimizes $h(x)\frac{\partial L(x,F_{k}(x))}{\partial F_{k}(x)}$. Once this $h^*$ is found, a line search is performed:

$\rho^* = \arg\min_\rho L(x,F_{k}(x) + \rho h(x))$

Note that $F_{k}(x) + \rho h(x)$ is no longer a probability distribution (it will integrate to $1 + \rho$). The easy fix is to instead use

$\rho^* = \arg\min_\rho L(x, (1-\rho)F_{k}(x) + \rho h(x))$

as you would for any mixture distribution. Then set $F_{k+1}(x) = F_{k}(x) + \rho^*h(x)$.

Some observations. First: while discriminative boosting comes with the really nice intuitive concept of “pseudo-residuals”, I can’t see an analogous “pseudo-density” hiding in the generative boosting equations. The closest I can come is to think of a re-weighted density: using the MLE loss function:

$\frac{\partial L(x,F_{k}(x))}{\partial F_{k}(x)} = -\frac{1}{F_k(x)}$

and thus to minimize the loss (and maximize this term), $h(x)$ should be weighted on those points least probable under $F_k(x)$. This is akin to generating the weighted density

$\tilde{f}_\omega(x_i) = \frac{\delta(x-x_i)}{F_k(x_i)}\left(\sum_i F_k(x_i)\right)^{-1}$

the delta function is Dirac’s. Is it something obvious enough to be called a pseudodensity? pseudosample? I’m not sure.

Second: Generative boosting is forgetful. At step $k$ the coefficient on the first hypothesis is $\Pi (1-\rho_i)$, so it’s possible to winnow initial bad hypotheses out of existence. It also gives a really nice justification for trivial methods of on-line estimation: if your greedy first estimates are bad, they’ll be forgotten, so you can continue being greedy. For some reason, KDE is really widely used for on-line generative modeling, so this justification is sort of useless (you have to store the datapoints anyway to construct the function: KDE offers zero compressibility). However, generative gradient boosting should be able to construct optimal local bandwidths for KDE, if one is willing to pass over the data a few times, and it definitely simplifies some of the work done in the area of boosting kernel density estimates. It’d probably be fun to implement.

Sources