A recent trend in *omic research has been the quest to identify so-called “regulatory modules”: the molecular elements (transcription factors, DNA binding sites, modifications to DNA or histones, mRNA regulators such as miRNA species, post-transcriptional modifiers) which control the protein and/or mRNA concentration of one (or several) protein species. Which particular proteins are of interest depends on the biological application, but these typically have to do with cellular phenotypes (such as stem cell differentiation).

Much is already known about **gene expression signatures** between biological states: genes which are up-regulated in one state, and down-regulated in another. This is true both for cell-type indicators (for instance GFAP or S100B as hallmarks of astrocytes), and for disease-state indicators (case/control differential expression).

This is an example of what I’d term a *marginal* (or *single-view*, or *uni-omic*) analysis. For certain biological phenotypes, many *differential signatures* have been identified, including gene expression, DNA methylation, DNAse hypersensitivity, and histone modifications.

Besides differential analysis, one can consider the measured *feature* (e.g. expression) as any other biological measurement, and seek to characterize its distribution in the population. The identification of **biomeasure co-modules** (such as **coexpression modules**) belongs to this category. For instance, gene expression exhibits natural variation, is under strong genetic control (as evinced by eQTL studies), and within a given cell state there can be strong correlations between genes. The fact that there are phenotypes which are strongly genetic as well as strongly correlated suggests that these are good candidates for pleiotropy. Module identification from a single feature is then a kind of **marginal clustering**.

These are not mutually exclusive categories, as it’s entirely possible (and quite common) to *cluster* biofeatures based on their *differential* profiles (i.e. fold-change).

The search for regulatory modules is a shift in perspective from uni-omic clustering to *multi-omic clustering*. This entails the aggregation of information from multiple biomeasures; but it also entails relating objects of different types. That is, while the marginal clustering of DNA methylation focuses on similarity between two CpG islands, and the marginal clustering of mRNA expression focuses on similarity between two mRNA species, the **integrated clustering** of mRNA expression with DNA methylation requires dealing with a similarity between an mRNA species and a CpG island.

One well-motivated approach is the **gene-centric** approach, where the base unit for all biofeatures is a single gene. See Goode DK, et al: Dynamic Gene Regulatory Networks Drive Hematopoietic Specification and Differentiation for a quite wonderful (if oblique) example. While this approach relies heavily on genome annotation to associate e.g. a DNAse hypersensitive region with a gene promoter, it has the benefit of being both interpretable and directly integrable. One potential approach comes from treating distances or similarities as weighted graphs, and either merging the marginal graphs to form a **consensus graph**, or clustering on the **multigraph** formed by the disjoint union of edge sets. This is only possible because all the marginal graphs share the same vertices (i.e. genes).

In the linked paper, the genes of interest (nodes) are a set of key stem-cell transcription factors. The “samples” across which data were measured are 2-tuples: (gene, cell type). So a single observational unit is (for instance) ChIP-peak intensity for SOX2 [“gene”] in the promoter of ACE1 in HB cells [“sample”]. Peak intensity for SOX2 in the promoter of NUP32 in HB cells is another “sample” for SOX2, as is peak intensity for SOX2 in the promoter of ACE1 in ES cells. This enables a “target promoter peak similarity” measure to be computed between transcription factors, as the correlation of their peak-intensity profiles across genes and cell types. Yet another TF-TF similarity is the peak-intensity/DHS-pattern overlap profile. First, genes are annotated by the DHS pattern of their promoter (e.g. open in all cell types, open only in ES cells, etc.). There are ~30 of these. Then, for a given TF, an enrichment score can be calculated by asking how enriched genes of a given DHS category are for TF peaks. These enrichment scores can be correlated, resulting in a third TF-TF similarity. Another similarity is the status of each TF’s own promoter (*active, repressed, poised,* or *unmarked*), and whether the promoter statuses of two TFs correlate across cell types. Figure 4 of the paper is a (coarse) visualization of the multigraph that results from a combination of these similarities.

The challenge in the multigraph approach for multi-omic module identification is defining similarities or distances when a single gene has multiple features. Its promoter may have multiple CpG sites; or several ChIP or histone modification peaks. Its transcription may produce multiple isoforms. So how do we extend similarities (or distances) to the cases where two genes have different numbers of features? That is: How do you compare genes based on the expression of each of their transcripts? How do you do it when the genes have different numbers of isoforms? How do you make this comparison sensitive both to aggregate expression and isoform expression?

This post concerns itself with these questions.

**Aggregate Expression Comparisons**

Let’s take another look at gene comparisons when each gene has the same number of features. This is true for gene expression (one gene TPM measurement per sample). It’s true for the linked paper’s treatment of promoter ChIP peaks (one TF ChIP-binding measurement per (gene, cell-type) pair, treated as a vector). Thus equipped, we then want to categorize two genes $x$ and $y$ as being “close together” or “far apart”. The obvious thing to do is treat the expression profiles $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ as vectors and compute the standard Euclidean distance:

$$d(x, y) = \|x - y\| = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$$

Obviously if $x = y$ then the distance is zero, so the genes are close; otherwise they get further and further apart based on how different their profiles are. But there is a drawback: what happens if $y = x + c\mathbf{1}$: exactly the same expression profile, but simply varying around a higher baseline expression level? In this case $d(x, y) = |c|\sqrt{n}$. So even though the profiles are directly related ($y$ expression is higher exactly when $x$ expression is higher, and lower when $x$ expression is lower), Euclidean distance would suggest that these genes are “far apart”. In fact, for large $c$, *further* apart than genes whose expression profiles are random, but simply have the same mean and variance.
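A quick NumPy sketch of this failure mode (the sizes, seed, and baseline shift are arbitrary toy choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=50)   # baseline expression profile
y = x + 10.0                                  # identical profile, higher baseline
z = rng.normal(loc=5.0, scale=1.0, size=50)   # unrelated profile, same mean/variance

def d(a, b):
    # standard Euclidean distance between two profiles
    return np.linalg.norm(a - b)

# The shifted copy of x is *further* from x than an unrelated profile is:
print(d(x, y))   # = 10 * sqrt(50) ~ 70.7, exactly |c| * sqrt(n)
print(d(x, z))   # roughly sqrt(2 * 50) ~ 10
```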

To be fair (to Euclid, I suppose): in most other cases this is what we want: it *should* be the case that, for two clusters with means $\mu_1$ and $\mu_2$ far apart, the within-cluster Euclidean distances ought to be smaller than the between-cluster distances. But in the case of expression profiles, this is *not* the behavior we want; we want the distance to be invariant to changes in baseline expression.

*Most* bioinformaticians would tell you the solution is to take $z = (x - \bar{x})/s_x$, i.e. decenter and rescale the data by turning everything into a *z*-score. They would be right to do so. But I am not most bioinformaticians. There are some subtle points that merit belaboring. What we have here is really a mismatch between our intuition about what we wanted, and how we formulated the problem. It turns out we didn’t want just any old distance, and so solving the problem of “find a simple distance measure” gave us the right answer to the wrong question. So instead of addressing behaviors we don’t want when they crop up, a more principled approach might be to be more precise about the behaviors we *do* want. In particular, the kinds of biofeature profiles we want to consider *equivalent*.

First up on the list, we have

*Translation invariance*

The first thing to do is temporarily to give up on the notion of a distance function. It’s tempting to look at a family of symmetric functions that satisfy $f(x + c\mathbf{1}, y) = f(x, y)$. But the required axioms for a distance function cannot be satisfied, since one of these axioms is $d(x, y) = 0 \iff x = y$, while translation invariance forces $d(x, x + c\mathbf{1}) = 0$.

Instead, it is possible to refine the very *space* on which we model the problem. Instead of treating genes as data drawn from the space $\mathbb{R}^n$, they can be treated as elements of an **equivalence class** $[x]$, defined by some relation. The relation, in this case, is translation:

$$x \sim y \iff y = x + c\mathbf{1} \text{ for some } c \in \mathbb{R}$$

Equivalence classes defined on $\mathbb{R}^n$ can be viewed as *quotient manifolds*, and as such have defined on them a geometric notion of a distance. The approach therefore is to identify the appropriate **equivalence relation** that encapsulates some intuition, and then to solve for a suitable metric on the equivalence class. In the case of translation invariance: given an arbitrary vector $x$, any other element in its equivalence class can be obtained by $x + c\mathbf{1}$ for some constant $c$. These are all parallel lines. The normal plane to these lines is given by $\{v : \mathbf{1}^\top v = 0\}$. Thus the distance under this equivalence class is given by the solution to

$$d([x], [y]) = \min_{c_1, c_2} \left\|(x + c_1\mathbf{1}) - (y + c_2\mathbf{1})\right\|$$

Then

$$d([x], [y]) = \left\|\left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\right)(x - y)\right\|$$

It can be seen that the minimum is attained by centering each vector, i.e. $c_1 = -\bar{x}$ and $c_2 = -\bar{y}$ (note that $\mathbf{1}\mathbf{1}^\top$ is the summing matrix, so $\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top x = \bar{x}\mathbf{1}$).

An equivalent *algorithm* to compute this distance is to simply subtract the means from *x* and *y*, and then calculate the standard Euclidean distance. But the point is that now we *know* this is a solution to our problem: we specified an equivalence class that represented our intuition, and defined a proper distance function on that equivalence class. In this case, the above distance is also the geodesic distance. This is because the normal plane happened to be constant.
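The equivalence of the two computations (minimize over shifts vs. center and take Euclidean distance) is easy to check numerically; a small sketch, with function name and toy data of my own choosing:

```python
import numpy as np

def d_translation(x, y):
    """Metric on the translation-invariant classes [x] = {x + c*1}:
    center both profiles, then take the standard Euclidean distance."""
    return np.linalg.norm((x - x.mean()) - (y - y.mean()))

rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = rng.normal(size=20)

# Invariance: shifting either profile by a constant changes nothing.
print(np.isclose(d_translation(x + 3.0, y - 7.0), d_translation(x, y)))  # True

# Agreement with a brute-force minimization over the relative shift c:
grid = np.linspace(-20.0, 20.0, 40001)
brute = np.linalg.norm((x - y)[None, :] + grid[:, None], axis=1).min()
print(np.isclose(brute, d_translation(x, y), atol=1e-4))  # True
```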

*Scale invariance*

Let’s consider (for simplicity) our data as already centered at 0. We also want genes which differ by *scale* to be part of the same equivalence class; that is:

$$x \sim y \iff y = cx \text{ for some } c \neq 0$$

As above, the equivalence classes are lines, but not parallel. Each line passes through the origin (and indeed, the origin is *not* a part of the space!). Given a point $x$, another member of its class can be generated by $cx$. Following the same derivation as above, we can define the *tangent distance* (that is, distance in the tangent space) as

$$d_T([x], [y]) = \min_{s = \pm 1} \left\| x - s\,\frac{\|x\|}{\|y\|}\, y \right\|$$

The ansatz here is to rescale $y$ to have the same length as $x$ (note the $\pm$, which handles the sign of $c$); ergo the minimization is over the orientation only. An equivalent *algorithm* (up to an overall factor) is to rescale *x* and *y* by their standard deviations $s_x = \|x\|/\sqrt{n}$ (as *x* is mean-0), and take the standard Euclidean distance. *This* is the reason that data normalization (re-centering, setting standard deviation to 1) works: it is a metric on the translation- and scale-invariant equivalence class.
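A minimal sketch of the normalize-then-Euclid algorithm (my own function name; for simplicity this version is the directional one, i.e. it only identifies $y$ with $cx$ for $c > 0$):

```python
import numpy as np

def d_tangent(x, y):
    """Tangent-space distance on the scale-invariant classes [x] = {c*x}:
    rescale both (assumed mean-zero) profiles to unit length, then take
    the Euclidean distance. Directional version (c > 0)."""
    return np.linalg.norm(x / np.linalg.norm(x) - y / np.linalg.norm(y))

rng = np.random.default_rng(2)
x = rng.normal(size=30); x -= x.mean()
y = rng.normal(size=30); y -= y.mean()

# Invariance to positive rescaling of either argument:
print(np.isclose(d_tangent(5.0 * x, y), d_tangent(x, 0.1 * y)))  # True
```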

But: the tangent space is not constant. Unlike in the translation-invariant case, the tangent space is point-dependent. Thinking about distance geometrically as the length of a particular curve: once you move slightly away from the point $x$ towards the point $y$, the tangent space is different; that is, if you move half the distance towards $y$ in the tangent space of $x$, obtaining a new point $z$, the distance $d_T([z], [y])$ is *not* half the distance $d_T([x], [y])$. So while this is a valid distance on the equivalence class, it is not necessarily ideal.

The property that, if $d([x], [y]) = D$, and you move $t$ units from $[x]$ towards $[y]$ to the point $[z]$, then $d([x], [z]) = t$ and the remaining distance is $d([z], [y]) = D - t$, is known as *non-acceleration*. Curves called *geodesics* have this property, and for a given manifold and points $x, y$, the minimizing geodesic between them is generically unique.

Our equivalence class has a name: the *real projective space* $\mathbb{RP}^{n-1}$. As a quotient manifold of $\mathbb{R}^n \setminus \{0\}$ it is a differentiable manifold with the geodesic

$$\gamma(t) = \cos(t\|v\|)\, x + \sin(t\|v\|)\, \frac{v}{\|v\|}$$

(for $\|x\| = 1$ and $v \perp x$).

This is parameterized by an initial point *x*, and an initial direction *v* (note that *v* is not the destination point, but instead a direction vector *at x*). The corresponding distance between two points *x* and *y* along the geodesic connecting *x* to *y* is:

$$d([x], [y]) = \arccos\left(\frac{|\langle x, y \rangle|}{\|x\|\,\|y\|}\right)$$

This is an *axial* distance; positive and negative correlations are equivalent. We can convert to a *directional* distance by removing the absolute value, which corresponds to a geodesic distance on a union of equivalence classes (depending on the sign of *c*); i.e. *rays* from the origin.

To sum up: by identifying an “intuitive problem” with standard Euclidean distance, and simply centering and normalizing the data, you do fix the problems with standard Euclidean distance by making the resulting measure scale- and translation-invariant (because you made your data all have the same mean and scale). At the same time, by approaching from the other direction, and defining an equivalence class corresponding to our intuition, we find that the rescaled-Euclidean distance is really a *tangent space distance* on the equivalence class. This distance is characterized by straight lines on the tangent space, and therefore exhibits acceleration, as opposed to the geodesic distance on the quotient manifold. In our case, the geodesic is known in closed form, and can be used as a plug-in replacement for the data-rescaled distance.

So, to compare two vectors within the scale- and translation-invariant equivalence class, take $d(x, y) = \arccos(|\rho(x, y)|)$, where $\rho$ is the Pearson correlation. If you want to be robust, you can use Spearman correlation in place of Pearson (which replaces the sphere by the permutohedron).
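A sketch of this recipe (the function name and its flags are my own; Spearman is implemented here as Pearson on the ranks, which assumes no ties):

```python
import numpy as np

def corr_geodesic(x, y, axial=True, rank=False):
    """Geodesic distance arccos(|rho|) on the translation- and scale-invariant
    classes. rank=True swaps Pearson for Spearman (Pearson of the ranks);
    axial=False drops the absolute value, giving the directional version."""
    if rank:
        x = np.argsort(np.argsort(x)).astype(float)  # ranks (no ties assumed)
        y = np.argsort(np.argsort(y)).astype(float)
    rho = np.clip(np.corrcoef(x, y)[0, 1], -1.0, 1.0)  # guard arccos domain
    return np.arccos(abs(rho)) if axial else np.arccos(rho)

rng = np.random.default_rng(3)
x = rng.normal(size=40)

# Any affine transform a*x + b (a != 0) is at distance ~0, axially:
print(np.isclose(corr_geodesic(x, -2.0 * x + 5.0), 0.0, atol=1e-5))  # True
# ...and at distance ~pi in the directional version (anticorrelated):
print(np.isclose(corr_geodesic(x, -2.0 * x + 5.0, axial=False), np.pi,
                 atol=1e-5))  # True
```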

We will soon see that this approach has direct analogues for matrix-valued features; and the geometry of equivalence classes provides a solution even if the matrices differ in their number of rows.

Exercise:

For clarity, it was assumed that the data was magically zero-mean when we cared about a scale-invariant equivalence class. Show that $d(x, y) = \|z_x - z_y\|$, as implemented by $z_x = (x - \bar{x}\mathbf{1})/s_x$ where $s_x = \|x - \bar{x}\mathbf{1}\|$, is the tangent-space distance for the equivalence class

$$x \sim y \iff y = ax + b\mathbf{1} \text{ for some } a \neq 0,\ b \in \mathbb{R}$$

The equivalence classes are parallel projective spaces. I have no idea how to visualize such a thing.

**Transcript expression comparisons**

Imagine comparing measurements from two genes $X$ and $Y$, whose features take the form of *matrices*. A canonical example is transcript-level expression: $X$ has $p$ transcripts, and $Y$ has $q$ transcripts, both of which have been quantified on the same $n$ samples, giving $X \in \mathbb{R}^{p \times n}$ and $Y \in \mathbb{R}^{q \times n}$. If $p = q$ then the distance

$$d(X, Y) = \|X - Y\|_F$$

is well-defined. Already we run into trouble: this distance is dependent on how the rows are ordered. That is, if $Y$ has identical quantifications to $X$, but the transcripts are simply shuffled, $Y$ is suddenly very far from $X$. The row means haven’t changed, nor has the scale of each row, or the overall scale of the matrix.
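The shuffling problem in two lines of NumPy (toy sizes and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 25))   # gene X: 3 transcripts x 25 samples
Y = X[[2, 0, 1], :]            # identical quantifications, rows shuffled

# The entrywise (Frobenius) distance is blind to the fact that Y is just a
# relabeling of X's transcripts:
print(np.linalg.norm(X - Y) > 0)   # True: "identical" genes look far apart
```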

Imagine trying to “fix” this behavior (as we did by centering and rescaling our vectors). How would we transform $X$ and $Y$? And what would we do if $p \neq q$? This is where equivalence classes really shine in modeling the situation.

The argument proceeds along the following lines:

What do we mean when we say two sets of isoforms are *related*? If there are only two isoforms for each gene, it means that when one of the isoforms of one gene goes up, an isoform of the other gene also goes up. So each isoform of $X$ **predicts** some part of the isoforms of $Y$. Even if $p \neq q$, we still mean that $X$ captures some information about $Y$. If knowing $X$ allows me to predict the isoform levels of $Y$ *exactly*, then I’d say those two genes are equivalent.

So let $\mathcal{F}$ be some class of functions $f : \mathbb{R}^{p \times n} \to \mathbb{R}^{q \times n}$. We can define equivalence classes as above by using

$$X \sim Y \iff Y = f(X) \text{ for some } f \in \mathcal{F}$$

Translation-invariance, scale-invariance, rotation-invariance, or whatever other properties are desired can be incorporated into the class $\mathcal{F}$. Without specifying $\mathcal{F}$ there’s not much we can do except ask for residual distances: let $\hat{f} = \arg\min_{f \in \mathcal{F}} \|Y - f(X)\|_F$, and we can take as a measure of distance

$$d(X, Y) = \|Y - \hat{f}(X)\|_F$$

What kind of distance is this? We will see, soon enough, that if $\mathcal{F}$ is the class of multilinear functions (i.e. standard regression), then the above distance is the tangent-space distance. This should extend to any class $\mathcal{F}$, but I don’t have a proof.

The simplest thing to do is to let $\mathcal{F}$ be the space of multilinear functions; in which case

$$f(X) = BX + \mu\mathbf{1}^\top$$

where $B$ is a $q \times p$ coefficient matrix, and $\mu$ is a $q$-vector of row means. Let’s assume that $\mu = 0$, in which case the equivalence classes are parallel hyperplanes of dimension $p$ in a space of dimension $n$.

Novel members of a class have to be generated in a rather interesting way. Given $Z = BX$ with $B$ as $q \times p$; then if $B$ is square and invertible we should have $X = B^{-1}Z$; and therefore a new member can be generated by $Z' = CB^{-1}Z$ for any full-rank $C$. And of course it is possible to let $q \neq p$. One can immediately recognize the residual minimization as the least-squares solution:

$$\hat{B} = \arg\min_{B} \|Y - BX\|_F^2 = YX^\top(XX^\top)^{-1}$$

(we can augment $X$ with an extra row of *1*s to take care of $\mu$). This minimization is, of course, multiple regression; and

$$\hat{f}(X) = \hat{B}X = YX^\top(XX^\top)^{-1}X = YP_X$$

and therefore

$$d(X, Y) = \|Y - YP_X\|_F = \|Y(I - P_X)\|_F$$

which is precisely the square root of the sum of squared residuals from multiple regression. The form $I - P_X$ here precisely corresponds to the tangent space at $[X]$, and therefore this is the matrix analogue of the tangent-space distance for vectors. Therefore, to “do the same thing” as centering and scaling the data in the *matrix* case, regress one matrix on the other, and take the Euclidean norm of the residuals. But it is impossible to transform every matrix in exactly the same way and still describe a metric on the equivalence class.
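Here is a minimal sketch of this residual distance (my own function name; it assumes $X$ has full row rank, and note that as written it is asymmetric: it regresses $Y$ on $X$, not vice versa):

```python
import numpy as np

def residual_distance(X, Y):
    """|| Y (I - P_X) ||_F: the Frobenius norm of the residuals from
    regressing Y's rows onto X's rows. X must have full row rank."""
    n = X.shape[1]
    P = X.T @ np.linalg.solve(X @ X.T, X)   # projector onto the row space of X
    return np.linalg.norm(Y @ (np.eye(n) - P))

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 25))    # 3 transcripts x 25 samples
B = rng.normal(size=(4, 3))
Y = B @ X                       # 4 transcripts, rows lie in X's row space

# Y is in X's equivalence class, so its residual distance is (numerically)
# zero, even though the two genes have different numbers of transcripts:
print(np.isclose(residual_distance(X, Y), 0.0, atol=1e-8))  # True
```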

But, exactly as before, the tangent space is point-dependent; and so the above distance exhibits some form of acceleration. It is not a geodesic distance. Can we find one? It turns out our equivalence class has a name: the **Grassmannian** $\mathrm{Gr}(p, n)$. This is the space of all $p$-dimensional planes in $\mathbb{R}^n$, and is equivalent to the quotient space:

$$\mathrm{Gr}(p, n) = \mathbb{R}_*^{p \times n} / \mathrm{GL}_p$$

where $\mathbb{R}_*^{p \times n}$ means the space of maximal-rank matrices. This is also known as the **noncompact Stiefel manifold**.

If $p = q$, then we rather happily can consider the distance between equivalence classes in $\mathrm{Gr}(p, n)$. As in the case of single-expression, the distance here is in the form of **principal angles**. Let $U_X, U_Y$ be *semi-orthogonal* matrices belonging to the corresponding equivalence classes ($U_X \in [X]$, $U_Y \in [Y]$). That is, $U_X U_X^\top = I_p$. Clearly we can take $U_X = (XX^\top)^{-1/2}X$. Letting

$$U_X U_Y^\top = W \Sigma V^\top$$

be the singular-value decomposition of the product of the two matrices, then

$$d([X], [Y]) = \left(\sum_{i=1}^{p} \arccos^2 \sigma_i\right)^{1/2}$$

This recovers the previous formula if $p = 1$. This should be no surprise, as $\mathrm{Gr}(1, n) = \mathbb{RP}^{n-1}$! This also underlies such typical metrics as $1 - \prod_i \cos\theta_i$, which is the old $1 - |\rho|$ in disguise (take the product of the cosines of the principal angles)!

When $p \neq q$, the same approach can be done as before, by letting $k = \min(p, q)$ (so that $U_X U_Y^\top$ has $k$ singular values) and taking

$$d([X], [Y]) = \left(\sum_{i=1}^{k} \arccos^2 \sigma_i\right)^{1/2}$$
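A sketch of the Grassmannian distance (my own function name; orthonormal bases are obtained via QR rather than the $(XX^\top)^{-1/2}X$ form, which spans the same row space):

```python
import numpy as np

def grassmann_distance(X, Y):
    """Geodesic distance between the row spaces of X and Y: the 2-norm of
    the principal-angle vector, via the SVD of the product of orthonormal
    bases. Handles p != q: the SVD yields min(p, q) singular values."""
    Ux = np.linalg.qr(X.T)[0]   # n x p orthonormal basis for X's row space
    Uy = np.linalg.qr(Y.T)[0]
    sigma = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    theta = np.arccos(np.clip(sigma, -1.0, 1.0))   # principal angles
    return np.linalg.norm(theta)

rng = np.random.default_rng(6)
x = rng.normal(size=(1, 40))
y = rng.normal(size=(1, 40))

# For p = q = 1 this collapses to the axial distance arccos(|<x,y>|/(|x||y|)):
axial = np.arccos(abs((x @ y.T).item()) /
                  (np.linalg.norm(x) * np.linalg.norm(y)))
print(np.isclose(grassmann_distance(x, y), axial))  # True
```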

**Combining aggregate and transcript expression**

I do believe there is an equivalence class that may capture both transcript alignment and aggregate expression: the idea being that if the aggregate transcript expression between the two genes differs by significantly more than a constant, they should be considered ‘further apart’. Here, we constrain $\mathcal{F}$ somewhat, requiring that $\mathbf{1}^\top BX = \mathbf{1}^\top X + c\mathbf{1}^\top$ for some constant $c$. So we have

$$X \sim Y \iff Y = BX \text{ with } \mathbf{1}^\top BX = \mathbf{1}^\top X + c\mathbf{1}^\top$$

This entails more or less the same process as before, except that the unconstrained least-squares estimate $\hat{B}$ should be replaced with its constrained counterpart. The rest follows.

In sum: clarifying what it means for two genes to be “equivalent” in their transcript expression led to the formulation of the appropriate equivalence class, and the identification of the corresponding geodesic distance.

One can rightly wonder whether the choice of a metric really matters: Euclidean, Mahalanobis, Jensen-Shannon, some appropriate geodesic, and so on.

In certain cases, the metrics are defined on subtly different spaces, as is the case with correlation-based metrics, mutual-information-based metrics, and Euclidean distance. There are differences in equivalence classes, and part of the problem is that these differences are hidden. Rather, some property of the metric itself is considered desirable, and it is chosen for that reason. Ideally, a suitable equivalence class should be defined and a natural metric used thereon.

On the other hand, some metrics are defined over the same equivalence class. The tangent-space distance, the geodesic distance $\arccos|\rho|$, and $1 - |\rho|$ are all metrics on $\mathbb{RP}^{n-1}$. Therefore if $x, y$ are further apart than $x', y'$ under one metric, they are further apart under the others. They’re all just monotonic transformations of one another, and rank inter-point distances in precisely the same order. Who’s to say one has more desirable properties than another?
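The rank-agreement claim is easy to check empirically; a sketch (the metric names and toy pairs are mine; each metric is written as a monotone function of the principal angle $\theta$):

```python
import numpy as np

rng = np.random.default_rng(7)

def theta(x, y):
    # principal angle between the lines through x and y (axial)
    rho = np.clip(np.corrcoef(x, y)[0, 1], -1.0, 1.0)
    return np.arccos(abs(rho))

# Three metrics, each a monotone increasing function of the same angle:
metrics = {
    "geodesic": lambda t: t,                         # arccos|rho|
    "chord":    lambda t: np.sqrt(2 - 2 * np.cos(t)),
    "linear":   lambda t: 1 - np.cos(t),             # 1 - |rho|
}

pairs = [(rng.normal(size=15), rng.normal(size=15)) for _ in range(10)]
ts = np.array([theta(x, y) for x, y in pairs])

# All three metrics rank the ten pairs in exactly the same order:
orders = {name: tuple(np.argsort(f(ts))) for name, f in metrics.items()}
print(len(set(orders.values())))  # 1
```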

I don’t have a good answer.
