# heavytailed

## May 2, 2016

### Genes and the geometry of equivalence classes

Filed under: Uncategorized — heavytailed @ 7:02 am

A recent trend in *omic research has been the quest to identify so-called “regulatory modules.” The search is for the molecular elements (Transcription factors, DNA binding sites, modification to DNA or histones, mRNA regulators such as miRNA species, post-transcriptional modifiers) which control the protein and/or mRNA concentration of one (or several) protein species. Which particular proteins are of interest depends on the biological application; but these typically have to do with cellular phenotypes (such as stem cell differentiation).

Much is already known about gene expression signatures between biological states: genes which are up-regulated in one state, and down-regulated in another. This is true both for cell-type indicators (for instance GFAP or S100B as hallmarks of Astrocytes), as well as disease-state indicators (case/control differential expression).

This is an example of what I’d term a marginal or (single-view or uni-omic) differential analysis. For certain biological phenotypes, many differential signatures have been identified, including gene expression, DNA methylation, DNAse hypersensitivity, and histone modifications.

Besides differential analysis, one can consider the measured feature (e.g. expression) as any other biological measurement, and seek to characterize its distribution in the population. The identification of biomeasure co-modules (such as coexpression modules) belongs to this category. For instance, gene expression exhibits natural variation, is under strong genetic control (as evinced by eQTL studies), and within a given cell state there can be strong correlations between genes. The fact that there are phenotypes which are strongly genetic as well as strongly correlated suggests that these are good candidates for pleiotropy. Module identification from a single feature is then a kind of marginal clustering.

These are not mutually exclusive categories, as it’s entirely possible (and quite common) to cluster biofeatures based on their differential profiles (i.e. fold-change).

The search for regulatory modules is a shift in perspective from uni-omic clustering to multi-omic clustering. This entails the aggregation of information from multiple biomeasures; but it also entails relating objects of different types. That is, while the marginal clustering of DNA methylation focuses on similarity between two CpG islands, and the marginal clustering of mRNA expression focuses on similarity between two mRNA species, the integrated clustering of mRNA expression with DNA methylation requires dealing with a similarity between an mRNA specie and a CpG island.

One well-motivated approach is the gene-centric approach, where the base unit for all biofeatures is a single gene. See Goode DK, et al: Dynamic Gene Regulatory Networks Drive Hematopoietic Specification and Differentiation for a quite wonderful (if oblique) example. While this approach relies heavily on genome annotation to associate e.g. a DNAse hypersensitive region with a gene promoter, it has the benefit of being both interpretable, and directly integrable. One potential approach comes from treating distances or similarities as weighted graphs, and either merging the marginal graphs to form a consensus graph, or clustering on the multigraph formed by the disjoint union of edge sets. This is only possible because all the marginal graphs share the same vertices (i.e. genes).

In the linked paper, the genes of interest (nodes) are a set of key stem-cell transcription factors. The “samples” across which data were measured are 2-ples: (gene, cell type). So a single observational unit is (for instance) ChIP-peak intensity for SOX2 [“gene”] in the promoter of ACE1 in HB cells [“sample”]. Peak intensity for SOX2 in the promoter of NUP32 in HB cells is another “sample” for SOX2, as is peak intensity for SOX2 in the promoter of ACE1 in ES cells. This enables a “target promoter peak similarity” measure to be computed between transcription factors, as the correlation of their peak intensity profiles across genes and cell types. Yet another TF-TF similarity is the peak-intensity/DHS-pattern overlap profile. First, genes are annotated by the DHS pattern of their promoter (e.g. open in all cell types, open only in ES cells, etc). There are ~30 of these. Then, for a given TF, an enrichment score can be calculated by asking how enriched genes of a given DHS-category are for TF peaks. These enrichment scores can be correlated, resulting in a third TF-TF similarity. Another similarity is the status of each TF’s own promoter (active, repressed, poised, or unmarked), and whether the promoter status of two TF’s correlate across cell types. Figure 4 of the paper is a (coarse) visualization of the multigraph that results from a combination of these similarities.

The challenge in the multigraph approach for multi-omic module identification is defining similarities or distances when a single gene has multiple features. Its promoter may have multiple CpG sites; or several ChIP or histone modification peaks. Its transcription may produce multiple isoforms. So how do we extend similarities (or distances) to the cases where two genes have different numbers of features? That is: How do you compare genes based on the expression of each of their transcripts? How do you do it when the genes have different numbers of isoforms? How do you make this comparison sensitive both to aggregate expression and isoform expression?

This post concerns itself with these questions.

## Aggregate Expression Comparisons

Let’s take another look at gene comparisons when each gene has the same number of features. This is true for gene expression (one gene TPM measurement per sample). It’s true for the linked paper’s treatment of promoter ChIP peaks (one TF ChIP-binding measurement per (gene, cell-type) pair; treated as a vector). Thus equipped, we then want to categorize two genes $g_1$ and $g_2$ as being “close together” or “far apart”. The obvious thing to do is treat the expression profiles $g_1$ and $g_2$ as vectors and compute the standard Euclidean distance:

$d(g_1, g_2) = ||g_1 - g_2|| = \sqrt{\sum_{i=1}^n (g_{1_i} - g_{2_i})^2}$

Obviously if $g_1 = g_2$ then the distance is zero, so the genes are close; otherwise they get further and further apart based on how different their profiles are. But there is a drawback: what happens if $g_1 = 2 + g_2$: exactly the same expression profile, but simply varying around a higher baseline expression level. In this case $d(g_1, g_2) = 2\sqrt{n}$. So even though the profiles are directly related — $g_1$ expression is higher exactly when $g_2$ expression is higher, and lower when $g_2$ expression is lower — Euclidean distance would suggest that these genes are “far apart”. In fact, further apart than genes whose expression profiles are random, but simply have the same mean and variance.

To be fair (to Euclid, I suppose): in most other cases this is what we want: it should be the case that, for two clusters with means $\mu_1$ and $\mu_2$ far apart, the within-cluster  Euclidean distances ought to be smaller than between-cluster distances. But in the case of expression profiles, this is not the behavior we want; we want to be invariant to changes in baseline expression.

Most bioinformaticians would tell you the solution is to take $\tilde g_1 = (g_1 - \mu_{g_1})/\sigma_{g_1}$, i.e. decenter and rescale the data by turning everything into a z-score. They would be right to do so. But I am not most bioinformaticians. There are some subtle points that merit belaboring. What we have here is really mismatch between our intuition about what we wanted, and how we formulated the problem. It turns out we didn’t want just any old distance, and so solving the problem of “find a simple distance measure” gave us the right answer to the wrong question. So instead of addressing behaviors we don’t want when they crop up, a more principled approach might be to be more precise about the behaviors we do want. In particular, the kinds of biofeature profiles we want to consider equivalent.

First up on the list, we have

Translation invariance

The first thing to do is temporarily to give up on the notion of a distance function. It’s tempting to look at a family of symmetric functions $d: \mathbb{R}^n \times \mathbb{R}^n \rightarrow [0, \infty)$ that satisfy $d(c + g_1, g_2) = d(g_1, g_2)$. But the required axioms for a distance function cannot be satisfied since one of these axioms is $d(x,y) = 0 \Leftrightarrow x = y$.

Instead, it is possible to refine the very space on which we model the problem. Instead of treating genes as data drawn from the space $\mathbb{R}^n$, they can be treated as elements of an equivalence class $\mathcal{S}$, defined by some relation. The relation, in this case, is translation:

$\mathcal{S} = \{x \in \mathbb{R}^n | x = a + c\}; \; a \in \mathbb{R}^n \; c \in \mathbb{R}$

Equivalence classes defined on $\mathbb{R}^n$ can be viewed quotient manifolds, and as such have defined on them a geometric notion of a distance. The approach therefore is to identify the appropriate equivalence relation that encapsulates some intuition; and then to solve for a suitable metric on the equivalence class. In the case of translation invariance: given an arbitrary vector $x \in \mathcal{S}$, any other element in its equivalence class can be obtained by $x' = x + \mathbf{1}\gamma$ for some constant $\gamma$. These are all parallel lines. The normal plane to these lines is given by $P_{\bot}(z) = x + (I - \mathbf{11}^T)z$. Thus $d(x,y)$ under this equivalence class is given by the solution to

$d(x,y) = ||z|| \; \mathrm{s.t.} \; x + (I - \mathbf{11}^T)z = y + \mathbf{1}\gamma$

Then

$d(x,y) = \mathrm{min}_{\gamma} ||(I - \mathbf{11}^T)^{-1}(y - x + \mathbf{1}\gamma)||$

It can be seen that $\gamma = \mu_x - \mu_y$ (note that $\mathbf{11}^T$ is the summing matrix, so $\mathbf{11}^Tx = n\mathbf{1}\mu_x$).

An equivalent algorithm to compute this distance is to simply subtract the means from x and y, and then calculate the standard Euclidean distance. But the point is that now we know this is a solution to our problem: we specified an equivalence class that represented our intuition, and defined a proper distance function on that equivalence class. In this case, the above distance is also the geodesic distance. This is because the normal plane happened to be constant.

Scale invariance

Let’s consider (for simplicity) our data as already centered at 0. We also care that genes which differ by scale as part of the same equivalence class; that is:

$\mathcal{S} = \{x \in \mathbb{R}^n | x = ac\} \;\; a \in \mathbb{R}^n, \; c \in \mathbb{R}$

As above, the equivalence classes are lines, but not parallel. Each line passes through the origin (and indeed, the origin is not a part of the space!) Given a point $x$, another member of its class can be generated by $x' = x + \gamma x = (I + \gamma)x$. Following the same derivation of above we can define the tangent distance (that is, distance in the tangent space) as

$d_{\bot}(x,y) = \mathrm{min}_\gamma ||(I - xx^T)^{-1}(y - x + y\gamma)||$

The ansatz here is to rescale $y$ to have the same length as $x$ (note the $xx^T$); ergo $\gamma = \frac{||x|| - ||y||}{||y||}$. An equivalent algorithm is to rescale x and y by $x \leftarrow x/||x|| = x/\sigma_x$ (as x is mean-0), and take standard Euclidean distance. This is the reason that data normalization (re-centering, setting standard deviation to 1) works: it is a metric on the translation and scale invariant equivalence class.

But: the tangent space is not constant. Unlike in the translation-invariant case, the tangent space is point-dependent. Thinking about distance geometrically as the length of a particular curve, once you move slightly away from the point $x$ towards the point $y$, the tangent space is different; that is, if you move half the distance towards $y$ in the tangent space of $x$, obtaining a new point $x'$, the distance $d(x', y)$ is not half the distance $d(x,y)$. So while this is a valid distance on the equivalence class $\mathcal{S}$, it is not necessarily ideal.

The property that, if $d(x,y) = D$, and you move $a$ units from $x$ towards $y$ to the point $x'$, and the remaining distance $d(x',y) = D-a$ is known as non-acceleration. Curves called geodesics have this property, and for a given manifold $\mathcal{M}$ and points $(x,y)$, the geodesic between them is unique.

Our equivalence class $\mathcal{S}$ has a name: $\mathbb{RP}^{n}$. As a quotient manifold of $\mathbb{R}^n$ it is a differentiable manifold with the geodesic

$\gamma(t; \vec x, \vec v) = \cos(t||v||^2)x + \sin(t||v||^2)\frac{v}{||v||^2}$

This is parameterized by an initial point x, and an initial direction v (note that v is not the destination point, but instead a direction vector at x). The corresponding distance between two points x and y along the geodesic connecting x to y is:

$d(x,y) = \cos^{-1}\left(\frac{|x^Ty|}{||x||\cdot ||y||}\right) = \cos^{-1} | \mathrm{cor}(x, y) |$

This is an axial distance; positive and negative correlations are equivalent. We can convert to a directional distance by removing the absolute value, which corresponds to a geodesic distance on a union of equivalence classes (depending on the sign of c); i.e. rays from the origin.

To sum up: by identifying an “intuitive problem” with standard Euclidean distance, and simply centering and normalizing the data, you do fix the problems with standard Euclidean distance by making the resulting measure scale and translation invariant (because you made your data all have the same mean and scale). At the same time, by approaching from the other direction, and defining an equivalence class corresponding to our intuition, we find that the rescaled-Euclidean distance is really a tangent space distance on the equivalence class. This distance is characterized by straight lines on the tangent space, and therefore exhibits acceleration, as opposed to the geodesic distance on the quotient manifold. In our case, the geodesic is known in closed form, and can be used in as a plug-in replacement for the data-rescaled distance.

So, to compare two vectors within the scale and translation invariant equivalence class, take $d(g_1, g_2) = \cos^{-1}(|\mathrm{cor}(g_1, g_2)|)$. If you want to be robust, you can use Spearman correlation in place of Pearson (which replaces the sphere by the permutohedron).

We will soon see that this approach has direct analogues for matrix-valued features; and the geometry of equivalence classes provides a solution even if the matrices differ in their number of rows.

Exercise:

For clarity, it was assumed that the data was magically zero-mean when we cared about a scale-invariant equivalence class. Show that $d_{\bot}(x,y)$, as implemented by $d_{\bot}(x,y) = ||\tilde x - \tilde y||$ where $\tilde z = (z - \mu_z)/\sigma_z$, is the tangent-space distance for the equivalence class

$\mathcal{S} = \{x \in \mathbb{R}^n | x = ac + b\}; \;\; a \in \mathbb{R}^n, c, b \in \mathbb{R}$

The equivalence classes are parallel projective spaces. I have no idea how to visualize such a thing.

## Transcript expression comparisons

Imagine comparing measurements from two genes $G_1$ and $G_2$, whose features take the form of matrices. A canonical example is transcript-level expression: $G_1$ has $k_1$ transcripts, and $G_2$ has $k_2$ transcripts, both of which have been quantified on the same $n$ samples. If $k_1 = k_2$ then the distance

$d(G_1, G_2) = ||G_1 - G_2||_F = \sqrt{\sum_{i,j} (G_{1_{ij}}-G_{2_{ij}})^2} = ||\mathrm{vec}(G_1) - \mathrm{vec}(G_2)||$

is well-defined. Already we run into trouble: this distance is dependent in how the rows are ordered. That is, if $G_2$ has identical quantifications to $G_1$, but the transcripts are simply shuffled, $G_2$ is suddenly very far from $G_1$. The row means haven’t changed, nor has the scale of each row, or the overall scale of the matrix.

Imagine trying to “fix” this behavior (as we did by centering and rescaling our vectors). How would we transform $G_1$ and $G_2$? And what would we do if $k_1 > k_2$? This is where equivalence classes really shine in modeling the situation.

The argument proceeds along the following lines:

What do we mean when we say two sets of isoforms are related? If there are only two isoforms for each gene, it means when one of the isoforms of one gene goes up, an isoform of the other gene also goes up. So each isoform of $G_1$ predicts some of part of the isoforms of $G_2$. Even if $k_1 \neq k_2$, we still mean that $G_1$ captures some information about $G_2$. If knowing $G_1$ allows me to predict  the isoform levels of $G_2$ exactly then I’d say those two genes are equivalent.

So let $\mathcal{F}$ be some class of functions $f: \mathbb{R}^{k_2 \times n} \rightarrow \mathbb{R}^{k_1 \times n}$. We can define equivalence classes as above by using

$\mathcal{S} = \{X| X=f(A)\} \;\; A \in \mathbb{R}^{k_2 \times n} \; f \in \mathcal{F}$

Translation-invariance, scale-invariance, rotation-invariance, or whatever other properties that are desired can be incorporated into the class $\mathcal{F}$. Without specifying $\mathcal{F}$ there’s not much we can do except ask for residual distances: let $\hat f = \mathrm{argmin}_f ||G_2 - f(G_1)||_F$ and we can take as a measure of distance

$d(G_1, G_2) = ||G_2 - \hat f(G_1)||_F = \min_{f \in \mathcal{F}} ||G_2 - f(G_1)||_F$

What kind of distance is this? We will see, soon enough, that if $\mathcal{F}$ is the class of multilinear functions (i.e. standard regression), then the above distance is the tangent-space distance. This should extend to any class $\mathcal{F}$, but I don’t have a proof.

The simplest thing to do is to let $\mathcal{F}$ be the space of multilinear functions; in which case

$\mathcal{S} = \{X | \Gamma X + \mathrm{diag}(\gamma) \mathbf{1}_{k_2 \times n} = A\}$

where $\Gamma$ is a $k_2 \times k_1$ coefficient matrix, and $\gamma$ is a $k_2$-vector of row means. Let’s assume that $k_2 > k_1$, in which case the equivalence classes are parallel hyperplanes of dimension $k_1 \times n$ in a space of dimension $k_2 \times n$.

Novel members of a class have to be generated in a rather interesting way. Given $X$ as $k_2 \times n$ with $C$ as $k_2 \times k_1$; then if $X = CZ$ we should have $(C^TC)^{-1}C^TX = Z$; and therefore a new member can be generated by $X' \leftarrow C_2(C_1^TC_1)^{-1}C_1^TX + \mathrm{diag}(\gamma)\mathbf{1}$. And of course it is possible to let $C_2 = C_1$. One can immediately recognize this as the least-squares solution:

$d(G_1, G_2) = \min_C || G_2 - CG_1||_F$

(we can augment $G_1$ with an extra row of 1s to take care of $\gamma$). This minimization is, of course, multiple regression; and

$\hat C = G_2G_1^T(G_1G_1^T)^{-1}$

and therefore

$d(G_1,G_2) = ||G_2(I - G_1^T(G_1G_1^T)^{-1}G_1)||_F = ||G_2 - \hat f(G_1)||_F$

which is precisely the squared residuals from multiple regression. The form here precisely corresponds to the tangent space at $G_2$, and therefore this is the matrix analogue of the tangent space distance for vectors. Therefore, to “do the same thing” as centering and scaling the data in the matrix case, regress one matrix on the other, and take the Euclidean norm of the residuals. But it is impossible to transform every matrix in exactly the same way and still describe a metric on the equivalence class $\mathcal{S}$.

But, exactly as before, the tangent space is point-dependent; and so the above distance exhibits some form of acceleration. It is not a geodesic distance. Can we find one? It turns out our equivalence class $\mathcal{S}$ has a name: the Grassmannian $\mathrm{Gr}(k_2, n)$. This is the space of all $k_2$-dimensional planes in $\mathbb{R}^n$, and is equivalent to the quotient space:

$\mathrm{Gr}(k_2, n) \equiv \{X \in \mathbb{R}_*^{k_2 \times n} | \mathrm{span}(X) = \mathrm{span}(A) \} \;\; A \in \mathbb{R}_*^{k_2 \times n}$

where $\mathbb{R}_*^{a \times b}$ means the space of maximal-rank $a \times b$ matrices. This is also known as the noncompact Stiefel manifold.

If $k_1 = k_2$, then we rather happily can consider the distance between equivalence classes $[G_1], [G_2]$ in $\mathrm{Gr}(k, n)$. As in the case of single-expression, the distance here is in the form of principal angles. Let $\tilde G_1, \tilde G_2$ be semi-orthogonal matrices belonging to the corresponding equivalence classes ($\tilde G_i \in [G_i]$). That is, $\tilde G_1\tilde G_1^T = I$. Clearly we can take $\tilde G_1 = (G_1G_1^T)^{-1/2}G_1$. Letting

$U\Sigma V^T = \tilde G_1\tilde G_2^T$

be the singular-value decomposition of the product of the two matrices, then

$d_{\mathrm{Gr}}(G_1, G_2) = \sqrt{\sum \cos^{-1}(\sigma_i)^2}$

This recovers the previous formula if $k=1$. This should be no surprise, as $\mathrm{Gr}(1, n) \equiv \mathbb{RP}^n$! This also underlies such typical metrics as $\sqrt{1 - |\det(XY^T)|}$, which is the old $\sqrt{1 - \mathrm{cor}(x,y)^2}$ in disguise (take the product of the cosines of the principal angles)!

When $k_1 \neq k_2$, the same approach can be done as before, by letting $\hat G_2 = \hat f(G_1) = G_2G_1^T(G_1G_1^T)^{-1}G_1$ and taking

$d(G_1, G_2) = d_{\mathrm{Gr}}(\hat G_2, G_2)$

## Combining aggregate and transcript expression

I do believe there is an equivalence class that may capture both transcript alignment and aggregate expression: the idea being that if the aggregate transcript expression between the two genes differs by significantly more than a constant, they should be considered ‘further apart’. Here, we constrain $\Gamma$ somewhat, requiring that $\Gamma^T\Gamma = \gamma I$ for some constant $\gamma$. So we have

$\mathcal{S} = \{X | \gamma \Gamma X + \mathrm{diag}(\zeta) \mathbf{1}_{k_2 \times n} = A\} \; \Gamma \in O(k_2, k_1)$

This entails more or less the same process as before, except that $\hat C$ should be replaced with $\hat \gamma\hat C (\hat C^T\hat C)^{-1/2}$. The rest follows.

In sum: clarifying what it means for two genes to be “equivalent” in their transcript expression lead to the formulation of the appropriate equivalence class, and identification of the corresponding geodesic distance.

One can rightly wonder whether the choice of a metric really matters. Euclidean, Mahalanobis, Jensen-Shannon, some appropriate Geodesic, and so on.

In certain cases, the metric is defined on subtly different spaces, as is the case with correlation-based metrics, mutual-information based metrics, and Euclidean distance. There are differences in equivalence classes; and part of the problem is that these differences are hidden. Rather, some property of the metric itself is considered desirable and it is chosen for that reason. Ideally, a suitable equivalence class should be defined an a natural metric used thereon.

On the other hand, some metrics are defined over the same equivalence class. $\sqrt{\frac{1}{2}(1-|\mathrm{cor}(x,y)|)}$, $\cos^{-1}|\mathrm{cor}(x,y)|$, $\sqrt{1 - \mathrm{cor}(x,y)^2}$ are all metrics on $\mathbb{RP}^n$. Therefore if $(x_1, y_1)$ are further apart than $(x_2, y_2)$ under one metric, they are further apart under the others. They’re all just monotonic transformations of one another, and rank inter-point distances in precisely the same order. Who’s to say one has more desirable properties than another?

I don’t have a good answer.