# heavytailed

## May 2, 2013

### On Pleiotropy (or multiple response regression)

Filed under: Uncategorized — heavytailed @ 4:05 pm

Imagine the case where, instead of a set of predictors and a single response, there are multiple responses, and those responses themselves may be related in some way; the interest is in developing an accurate predictor of the response vector, and/or in identifying those predictors that have the largest effect. Since a predictor can now impact multiple response variables, the “effect” of a predictor is a vector, $\nabla_x y|x$, and so the effect size and direction can reasonably be thought of as the magnitude and direction of that vector.

In biology, this is referred to as pleiotropy, and the concept has existed for over a century. A bit of the history of the field can be found here. The idea is that you have a gene that affects two traits. In terms of evolutionary biology the question is: how does the degree of pleiotropy alter the evolutionary properties of the gene and the related traits? For instance, in certain circumstances, the degree of pleiotropy can increase the effect of selection on a gene; see here for an entry point into the literature. In terms of statistical genetics, pleiotropy is a phenomenon one might leverage to gain statistical power.

Note that little attention is paid to the mechanism of pleiotropy: that is, whether $x \rightarrow y_1$ and $x \rightarrow y_2$ independently, or whether $x \rightarrow y_1 \rightarrow y_2$ as a kind of network chain. This kind of inference is vitally important in other fields like economics: the dynamics of how a shock translates into an economic outcome (e.g. through which side channels) is of particular importance to entities like the Federal Reserve or the IMF.

A Brief History

The first detailed treatment of the prediction of multivariate responses from multivariate predictors was given by Breiman and Friedman in 1997. Their interest was in reducing prediction errors caused by correlated responses, and they proposed a number of methods to get around covarying response variables: a shrinkage method, a rank method, and ridge regression. A few years before, Jiang and Zeng used a similar approach to genetically map quantitative trait loci across multiple traits.

Fifteen years earlier, Friedman and Rafsky approached the problem of testing for correlation between sets of vectors, and proposed a class of statistics based on graphical measures. Prior to that, the clever statistician would have proposed using $X^\dagger X$ as the parameter for a Wishart distribution, transformed the $\mathrm{Cov}(Y)$ matrix into that eigenspace, and run a $\chi^2$ test on the sum of the eigenvalues.

Recent years have seen a resurgence of attention in this field, as $\ell_1$-penalized estimation (“sparse learning”) methods have become more widespread. Lee and Liu use this framework to estimate the inverse covariance matrix, while Kim and Xing extend it by adding $\ell_1$ penalties (priors) to select for tree-like relationships between gene expression and variants.

Multiple-response models can be built from first principles. Given a mixing matrix $\Delta$, which describes the correlative structure of the responses unconditional on the predictors (so both the predictor variables and the idiosyncratic errors are mapped through it to the responses), we can write

$\vec y \sim N( \Delta^{-1} \mathbf{B} \vec x, \Delta)$

The appearance of $\Delta$ both on the fixed and the random effects level is imperative: both the known fixed effects and the unknown random effects need to be mapped through the response correlation function, or to put it another way

$\vec y_i \sim \Delta^{-1}\mathbf{B}\vec x_i + \Delta \epsilon_i$
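As a concrete sketch, this generative model can be simulated directly. Everything below (the dimensions, the construction of $\Delta$) is illustrative, with $\epsilon_i$ taken as standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

k, p, n = 3, 5, 1000             # responses, predictors, observations

B = rng.normal(size=(k, p))      # effect matrix (illustrative values)
A = rng.normal(size=(k, k))
Delta = A @ A.T + k * np.eye(k)  # a well-conditioned mixing matrix
Delta_inv = np.linalg.inv(Delta)

X = rng.normal(size=(p, n))      # predictors, one column per observation

# y_i = Delta^{-1} B x_i + Delta eps_i, with eps_i standard normal:
# the fixed effects are damped through Delta^{-1}, the noise mapped through Delta
eps = rng.normal(size=(k, n))
Y = Delta_inv @ B @ X + Delta @ eps

print(Y.shape)  # (3, 1000)
```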

This is the fundamental modelization, and there’s a good deal of complexity lurking under the surface here that we will get to. The immediate question is why the fixed and random levels get inverses of one another. One way to think about it is $\Delta = \Sigma$: the random-effects level applies the covariance, while the fixed-effect level accounts for the covariance among responses (so, for instance, if $\mathbf{B}\vec x_i$ happens to be the principal eigenvector of $\Sigma$, it needs only a small magnitude to have a disproportionate response, so it is downweighted). A second way to view it is via a correspondence with conditional distributions. From the derivation of the BLUP, for jointly normal vectors $(w_1,w_2)$

$w_2|w_1 \sim N(\Sigma_{21}\Sigma_{11}^{-1}w_1,\ \Sigma_{22}-\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})$

The covariance is easily seen to be $[(\Sigma^{-1})_{22}]^{-1}$. The mean map $\Sigma_{21} {\Sigma_{11}}^{-1}$ can similarly be written entirely in block terms of $\Sigma^{-1}$: $\Sigma_{21}\Sigma_{11}^{-1} = -[(\Sigma^{-1})_{22}]^{-1}(\Sigma^{-1})_{21}$. Thus one might imagine $\mathbf{B} = -(\Sigma^{-1})_{21}$ and $\Delta = (\Sigma^{-1})_{22}$, so that the conditional mean takes exactly the form $\Delta^{-1}\mathbf{B}w_1$.
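The block identities invoked here can be checked numerically. A quick numpy sketch, with an arbitrary positive-definite $\Sigma$ (block sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random positive-definite covariance, partitioned into blocks
d1, d2 = 3, 2
A = rng.normal(size=(d1 + d2, d1 + d2))
S = A @ A.T + (d1 + d2) * np.eye(d1 + d2)

S11, S12 = S[:d1, :d1], S[:d1, d1:]
S21, S22 = S[d1:, :d1], S[d1:, d1:]

P = np.linalg.inv(S)                    # the precision matrix
P21, P22 = P[d1:, :d1], P[d1:, d1:]

# Conditional covariance of w2|w1 is the inverse of the (2,2) precision block
cond_cov = S22 - S21 @ np.linalg.inv(S11) @ S12
assert np.allclose(cond_cov, np.linalg.inv(P22))

# The conditional mean map in precision-block terms: S21 S11^{-1} = -P22^{-1} P21
mean_map = S21 @ np.linalg.inv(S11)
assert np.allclose(mean_map, -np.linalg.inv(P22) @ P21)
```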

A third way is purely practical: if they showed up with the same order, then $\Delta^{-1} y = \mathbf{B}x + \epsilon$ and we’re separable again. The fourth way concerns a realization of $\epsilon$ that “looks like” a genetic effect: if you manage to randomly draw $\epsilon_i = \Delta^{-1}\mathbf{B}\vec x_i$, it had better have the effect $\mathbf{B}\vec x_i$; if the fixed and random levels had the same exponent, this would compound rather than cancel.

Learning $\Delta$ is in general difficult; under the i.i.d. assumption, however, the likelihood function is simply a sum of the likelihoods of $n$ multivariate normals, and can be maximized. This is also why having $\Delta$ appear on both the fixed and random levels is important: if $\Delta$ does not appear somewhere on the random-effects level, this model degenerates into a standard general linear model, where one learns the (now inseparable) components $(\Delta \mathbf{B})_{ij}$; as an exercise, you can show that any such form for a multiple-response regression can be separated into $k$ single-response regressions.
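Concretely, the i.i.d. likelihood is just a sum of multivariate-normal log-densities over observations. A minimal numpy sketch, stacking the $\vec y_i$ as columns of $Y$ (under the model above one would plug in mean $\Delta^{-1}\mathbf{B}X$ and covariance $\Delta$, and maximize over $(\mathbf{B},\Delta)$):

```python
import numpy as np

def mvn_loglik_sum(Y, mean, cov):
    """Sum over observations (columns of Y) of N(mean, cov) log-densities."""
    k, n = Y.shape
    resid = Y - mean                      # broadcasts a k x 1 or k x n mean
    _, logdet = np.linalg.slogdet(cov)
    # column-wise quadratic forms resid_i^T cov^{-1} resid_i
    quad = np.einsum('in,in->n', resid, np.linalg.solve(cov, resid))
    return -0.5 * (quad.sum() + n * (logdet + k * np.log(2.0 * np.pi)))
```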

Conditional Random Fields

Though this model can be arrived at from first principles (i.e. direct modelization), the result is a class of models known as conditional random fields, which are a type of graphical model. That is, you can view the “mixing matrix” $\Delta$ as a set of weighted edges between the response variables, and the effect matrix $\mathbf{B}$ as weighted edges between predictors and responses. All else being equal (i.e. under controlled conditions), the only variation in $\vec y$ is due to changes in $\vec x$ and the matrices of effect and mixture $(\mathbf{B},\Delta)$, so

$P[y_i | \vec x, y_j] = P[y_i | \vec x, y_k] = P[y_i | \vec x]$

The literature on CRFs is not too extensive, but the literature on the related Markov random fields is, and a CRF is simply a subtype of an MRF, which is the general form of an undirected graphical model.

There are a couple of red flags here, which are probably obvious to some, but worth pointing out. First, it’s always a red flag when you see “under the i.i.d. assumption”: you then have to ask, “how do I let my responses not be i.i.d.?” So you ask: what if I have a “typical” random effect, that is, one that models correlations between observations and not responses (that is to say, $y_{ij}$ is indexed $i$ for observation, $j$ for response)? If $\mathbf{Y}$ is $k \times n$, then $\Delta$ is $k \times k$…but in the model above there’s no room for a random effect $\mathbf{R}$ which is $n \times n$. The full model will be

$y_{ij} = (\Delta^{-1} \mathbf{B} x_i)_j + \epsilon_{ij}$

Now we can specify both the response effects, $\mathbb{E}[\vec \epsilon_{k,\cdot} \vec \epsilon_{k,\cdot}^\dagger] = \Delta \;\; \forall k$, and the observation effects, $\mathbb{E}[\vec \epsilon_{\cdot,k}\vec \epsilon_{\cdot,k}^\dagger] = \mathbf{R} \;\; \forall k$. This gives exactly what you would expect: $\mathbb{E}[\epsilon_{ij}\epsilon_{kl}] = \mathbf{R}_{ik}\Delta_{jl}$.
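This separable error structure is exactly a Kronecker-product covariance on the vectorized errors. A small numpy check (with $i$ indexing observations and $j$ responses, so the error matrix is $n \times k$; the covariance matrices here are arbitrary positive-definite stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 3, 2                                                    # observations, responses
A = rng.normal(size=(n, n)); R = A @ A.T + n * np.eye(n)       # observation covariance
C = rng.normal(size=(k, k)); Delta = C @ C.T + k * np.eye(k)   # response covariance

# Draw eps with the separable structure by sandwiching white noise between
# Cholesky factors: eps = L_R Z L_Delta^T  =>  Cov(vec eps) = kron(R, Delta)
L_R = np.linalg.cholesky(R)
L_D = np.linalg.cholesky(Delta)
Z = rng.normal(size=(n, k))
eps = L_R @ Z @ L_D.T

# The matrix identity behind it: row-major vec(A Z B^T) = (A kron B) vec(Z)
print(np.allclose(eps.flatten(), np.kron(L_R, L_D) @ Z.flatten()))  # True
```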

Matrix Gaussian Distribution

So all we’ve done under the hood here is move from modeling single-trait, multiple-person response as a multivariate Gaussian to modeling multiple-trait, multiple-person response as a matrix Gaussian. This distribution takes three matrix parameters: $\mathbf{M}$, the mean matrix; $\mathbf{\Sigma}$, a covariance matrix within rows, among columns; and $\mathbf{\Omega}$, a covariance matrix within columns, among rows. Then:

$d\mathbb{P}[Y] = \exp\left[-\frac{1}{2}\mathrm{tr}(\mathbf{\Sigma}^{-1}(Y-\mathbf{M})^\dagger\mathbf{\Omega}^{-1}(Y-\mathbf{M}))\right]\left[(2\pi)^{kn/2}|\mathbf{\Sigma}|^{k/2}|\mathbf{\Omega}|^{n/2}\right]^{-1}$
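This density can be implemented and sanity-checked against the equivalent “flat” view: $\mathrm{vec}(Y)$ is an ordinary multivariate normal whose covariance is the Kronecker product of the two covariance matrices. A numpy sketch following the convention of the display above ($Y$ is $k \times n$, $\mathbf{\Sigma}$ among columns, $\mathbf{\Omega}$ among rows):

```python
import numpy as np

def matnorm_logpdf(Y, M, Sigma, Omega):
    """Matrix-normal log-density: Y is k x n, Sigma (n x n) is the among-column
    covariance, Omega (k x k) the among-row covariance."""
    k, n = Y.shape
    R = Y - M
    _, ldS = np.linalg.slogdet(Sigma)
    _, ldO = np.linalg.slogdet(Omega)
    quad = np.trace(np.linalg.solve(Sigma, R.T) @ np.linalg.solve(Omega, R))
    return -0.5 * (quad + k * n * np.log(2 * np.pi) + k * ldS + n * ldO)

rng = np.random.default_rng(3)
k, n = 2, 3
A = rng.normal(size=(n, n)); Sigma = A @ A.T + n * np.eye(n)
C = rng.normal(size=(k, k)); Omega = C @ C.T + k * np.eye(k)
Y = rng.normal(size=(k, n)); M = np.zeros((k, n))

# Equivalent flat view: vec(Y) (row-major) has covariance kron(Omega, Sigma)
flat_cov = np.kron(Omega, Sigma)
y = Y.flatten()
_, ld = np.linalg.slogdet(flat_cov)
flat_ll = -0.5 * (y @ np.linalg.solve(flat_cov, y) + k * n * np.log(2 * np.pi) + ld)
print(np.allclose(matnorm_logpdf(Y, M, Sigma, Omega), flat_ll))  # True
```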

And we replace $Y - \mathbf{M}$ with the fixed-effect model $\Delta^{-1} \mathbf{B} X$, identifying the among-column covariance $\mathbf{\Sigma}$ with the observation-level matrix $\mathbf{R}$, so we’d have

$d\mathbb{P}[Y] = \exp\left[-\frac{1}{2}\mathrm{tr}(\mathbf{R}^{-1}X^\dagger \mathbf{B}^\dagger \Delta^{-\dagger} \mathbf{\Omega}^{-1}\Delta^{-1} \mathbf{B} X)\right]\left[(2\pi)^{kn/2}|\mathbf{R}|^{k/2}|\mathbf{\Omega}|^{n/2}\right]^{-1}$

This further illustrates why you need $\Delta$ to show up on the fixed part of the equation. While failing to model the random part meant just “renaming” the effect-size variables by $\mathbf{B} \rightarrow (\Delta^{-1}\mathbf{B})$, failing to model the fixed part eliminates the mixing, and pleiotropic effects are simply ignored (e.g. $\mathbf{B}X$ is corrected by $\mathbf{\Omega}^{-1}$ rather than $(\Delta\mathbf{\Omega}\Delta^\dagger)^{-1}$). Note that none of the literature mentioned in the A Brief History section above takes this important step in the model.

This isn’t the full model of a matrix Gaussian. As above, this gives (for zero mean)

$\mathbb{E}[Y_{ij}Y_{kl}] = \mathbf{\Sigma}_{jl}\mathbf{\Omega}_{ik}$

In actuality these random variables need not be related through their marginal covariances. In full generality, if $Y$ is $n \times k$, then the covariance structure is given by a fourth-order tensor $\mathbf{\Xi}_{n\times n \times k \times k}$, which for the model above is $\Sigma \otimes \Omega$. The full distribution function must be different as well, as clearly $\Sigma \otimes \Omega$ cannot span the space of such tensors (in the same way that an outer product of vectors cannot span the space of matrices), and so the distribution function above is a distribution on a particular degenerate subspace.

This also suggests how to extend the model. Perhaps there is exogenous correlation not linked to the predictor variables; in such a case one could add an additional term $\Theta$ and have

$Y_{\cdot,i} = \Delta^{-1} B X_i + (\Delta + \Theta)\epsilon_{\cdot,i}$

In addition you can correlate the underlying errors; you might append this as another, separate error term that doesn’t get multiplied by $\Delta$ at all:

$Y_{i,\cdot} = \mathbf{G}\epsilon_{i,\cdot}$

Estimation and Elegance

This model suffers from two significant drawbacks. First, the estimation difficulty is high, particularly on genomic-scale datasets: in order to account for genomic perturbations when constructing the most likely mixing matrix, all of the predictors must be considered simultaneously. So no more getting away with single-SNP “scans” of the genome; they all have to go in (though there’s plenty of room for cleverness to avoid doing so). Secondly, while the mathematics are beautiful, the approach is fundamentally inelegant. Not all elegant solutions are simple, and this approach shouldn’t be condemned for complexity alone; it’s more a Mahler hammer where a tap hammer would do.

Estimating the model parameters is difficult for two reasons. Firstly: there are a lot of them. Secondly: the likelihood, especially when penalized in certain ways, is very tricky to optimize in a short enough time to publish a paper before your grant runs out, because you now have to fit a large number of explanatory variables (the smallest genotype chip is ~200,000 variants) against some number of responses. For the most part, this means doing inference in an underdetermined system, or at the very least in a system with only marginally more datapoints than parameters.

Enter sparse learning, a massive and quickly-growing field deserving of its own post. For an introduction, see Candes and Tao [PDF warning]. The Bayesian framing of sparse learning is to put a very strong prior on the parameters which forces them towards zero. A typical example is a Laplace prior with a fairly small deviation: for instance one might take $f(\mathbf{B}_{ij}) = \alpha_1\exp\left[-2\alpha_1 |\mathbf{B}_{ij}|\right]$, and the same for the elements of $\Delta$ (making sure to center the diagonal at 1). Likelihood maximization then requires marginalizing over hundreds of thousands of parameters hundreds of times. Brutal.
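The connection between that Laplace prior and $\ell_1$-penalized optimization is concrete: the MAP estimate adds $\lambda\sum_{ij}|\mathbf{B}_{ij}|$ to the negative log-likelihood, and the resulting problem is typically attacked with soft-thresholding. A minimal single-response sketch using plain proximal gradient (ISTA); the function names and toy setup are mine, not from any of the papers cited:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrink towards zero, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_pg(X, y, lam, step=None, iters=3000):
    """Single-response l1-penalized least squares by proximal gradient (ISTA):
    minimize 0.5*||Xb - y||^2 + lam*||b||_1, the MAP problem under a Laplace prior."""
    n, p = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ b - y)                 # gradient of the smooth part
        b = soft_threshold(b - step * grad, step * lam)
    return b
```

With `lam = 0` this recovers ordinary least squares; as `lam` grows, coefficients are driven exactly to zero, mirroring the mass the Laplace prior concentrates at zero.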

In addition, the asymptotics of penalized regression estimates are not yet fully understood, and thus there are no good estimates of the standard error. The current suggestions are to sample from the posterior (using e.g. Gibbs sampling), to perform a bootstrap, or simply to permute. So now marginalize over hundreds of thousands of parameters hundreds of times, thousands of times. Double brutal.

There are pure-optimization analogues of this which minimize a linear mixture of a risk function and $\ell_k$ penalties on the parameters. Software packages for solving such optimization problems have been developed, but suffer from somewhat painful scaling issues. Enter a recent paper of Chen and Lin, who propose a smoothing method to $\epsilon$-approximate general problems of this form. This enables scaling up to thousands of responses and thousands of predictors, but there’s still quite some distance to close before such methods can churn through Big Data. Nevertheless, as the Chen and Lin paper demonstrates, larger-scale analyses are far more feasible now than before.

Ultimately this kind of Network approach (“Systems Biology”) provides a valuable formulation, approach, method, and part of the puzzle, but it’s rather blunt for a number of fine tasks. For instance, sparsifying priors will fall flat in a community which is still wrestling with the Quantitative (“nearly every explanatory variable is weakly explanatory”) Genetics question. In addition, while I could explain this approach to a group of computer scientists, statisticians, and engineers, it’s all meaningless to pure biologists and clinicians. The lack of a clear visual diagnostic (like a quantile-quantile plot, or even just a boxplot) is a major drawback.

For now, at least, clever conditional analyses which look at enrichment, or consistency of estimated effects, are winning out. (For instance, these kinds of networks can be revealed through the dynamics of a system. This is done all the time in biology: introduce a perturbation, and identify the responses at various time points. This gives the initial molecular targets, followed by buffering, secondary effects, and so forth.) And if the cost of a little bit more clarity is a little bit less power, maybe this is a fair tradeoff. But I’m not sure.