heavytailed

May 2, 2016

Genes and the geometry of equivalence classes

Filed under: Uncategorized — heavytailed @ 7:02 am

A recent trend in *omic research has been the quest to identify so-called “regulatory modules.” The search is for the molecular elements (Transcription factors, DNA binding sites, modification to DNA or histones, mRNA regulators such as miRNA species, post-transcriptional modifiers) which control the protein and/or mRNA concentration of one (or several) protein species. Which particular proteins are of interest depends on the biological application; but these typically have to do with cellular phenotypes (such as stem cell differentiation).

Much is already known about gene expression signatures between biological states: genes which are up-regulated in one state, and down-regulated in another. This is true both for cell-type indicators (for instance GFAP or S100B as hallmarks of Astrocytes), as well as disease-state indicators (case/control differential expression).

This is an example of what I’d term a marginal or (single-view or uni-omic) differential analysis. For certain biological phenotypes, many differential signatures have been identified, including gene expression, DNA methylation, DNAse hypersensitivity, and histone modifications.

Besides differential analysis, one can consider the measured feature (e.g. expression) as any other biological measurement, and seek to characterize its distribution in the population. The identification of biomeasure co-modules (such as coexpression modules) belongs to this category. For instance, gene expression exhibits natural variation, is under strong genetic control (as evinced by eQTL studies), and within a given cell state there can be strong correlations between genes. The fact that there are phenotypes which are strongly genetic as well as strongly correlated suggests that these are good candidates for pleiotropy. Module identification from a single feature is then a kind of marginal clustering.

These are not mutually exclusive categories, as it’s entirely possible (and quite common) to cluster biofeatures based on their differential profiles (i.e. fold-change).

The search for regulatory modules is a shift in perspective from uni-omic clustering to multi-omic clustering. This entails the aggregation of information from multiple biomeasures; but it also entails relating objects of different types. That is, while the marginal clustering of DNA methylation focuses on similarity between two CpG islands, and the marginal clustering of mRNA expression focuses on similarity between two mRNA species, the integrated clustering of mRNA expression with DNA methylation requires dealing with a similarity between an mRNA species and a CpG island.

One well-motivated approach is the gene-centric approach, where the base unit for all biofeatures is a single gene. See Goode DK, et al: Dynamic Gene Regulatory Networks Drive Hematopoietic Specification and Differentiation for a quite wonderful (if oblique) example. While this approach relies heavily on genome annotation to associate e.g. a DNAse hypersensitive region with a gene promoter, it has the benefit of being both interpretable, and directly integrable. One potential approach comes from treating distances or similarities as weighted graphs, and either merging the marginal graphs to form a consensus graph, or clustering on the multigraph formed by the disjoint union of edge sets. This is only possible because all the marginal graphs share the same vertices (i.e. genes).

In the linked paper, the genes of interest (nodes) are a set of key stem-cell transcription factors. The “samples” across which data were measured are 2-tuples: (gene, cell type). So a single observational unit is (for instance) ChIP-peak intensity for SOX2 [“gene”] in the promoter of ACE1 in HB cells [“sample”]. Peak intensity for SOX2 in the promoter of NUP32 in HB cells is another “sample” for SOX2, as is peak intensity for SOX2 in the promoter of ACE1 in ES cells. This enables a “target promoter peak similarity” measure to be computed between transcription factors, as the correlation of their peak intensity profiles across genes and cell types. Yet another TF-TF similarity is the peak-intensity/DHS-pattern overlap profile. First, genes are annotated by the DHS pattern of their promoter (e.g. open in all cell types, open only in ES cells, etc). There are ~30 of these. Then, for a given TF, an enrichment score can be calculated by asking how enriched genes of a given DHS-category are for TF peaks. These enrichment scores can be correlated, resulting in a third TF-TF similarity. Another similarity is the status of each TF’s own promoter (active, repressed, poised, or unmarked), and whether the promoter statuses of two TFs correlate across cell types. Figure 4 of the paper is a (coarse) visualization of the multigraph that results from a combination of these similarities.

The challenge in the multigraph approach for multi-omic module identification is defining similarities or distances when a single gene has multiple features. Its promoter may have multiple CpG sites; or several ChIP or histone modification peaks. Its transcription may produce multiple isoforms. So how do we extend similarities (or distances) to the cases where two genes have different numbers of features? That is: How do you compare genes based on the expression of each of their transcripts? How do you do it when the genes have different numbers of isoforms? How do you make this comparison sensitive both to aggregate expression and isoform expression?

This post concerns itself with these questions.

Aggregate Expression Comparisons

Let’s take another look at gene comparisons when each gene has the same number of features. This is true for gene expression (one gene TPM measurement per sample). It’s true for the linked paper’s treatment of promoter ChIP peaks (one TF ChIP-binding measurement per (gene, cell-type) pair; treated as a vector). Thus equipped, we then want to categorize two genes g_1 and g_2 as being “close together” or “far apart”. The obvious thing to do is treat the expression profiles g_1 and g_2 as vectors and compute the standard Euclidean distance:

d(g_1, g_2) = ||g_1 - g_2|| = \sqrt{\sum_{i=1}^n (g_{1_i} - g_{2_i})^2}

Obviously if g_1 = g_2 then the distance is zero, so the genes are close; otherwise they get further and further apart based on how different their profiles are. But there is a drawback: what happens if g_1 = 2 + g_2, i.e. exactly the same expression profile, simply varying around a higher baseline expression level? In this case d(g_1, g_2) = 2\sqrt{n}. So even though the profiles are directly related — g_1 expression is higher exactly when g_2 expression is higher, and lower when g_2 expression is lower — Euclidean distance would suggest that these genes are “far apart”; in fact, further apart than genes whose expression profiles are random but happen to have the same mean and variance.
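A quick numeric illustration of this, with simulated profiles standing in for real expression data:

set.seed(1)
n  <- 50
g1 <- rnorm(n)
g2 <- g1 + 2                                   # identical profile, higher baseline
g3 <- rnorm(n, mean = mean(g1), sd = sd(g1))   # unrelated profile, same mean and scale
sqrt(sum((g1 - g2)^2))   # = 2 * sqrt(n), about 14.1
sqrt(sum((g1 - g3)^2))   # about sqrt(2 * n), typically near 10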

To be fair (to Euclid, I suppose), in most other cases this is exactly what we want: for two clusters with means \mu_1 and \mu_2 far apart, the within-cluster Euclidean distances ought to be smaller than the between-cluster distances. But in the case of expression profiles, this is not the behavior we want; we want to be invariant to changes in baseline expression.

Most bioinformaticians would tell you the solution is to take \tilde g_1 = (g_1 - \mu_{g_1})/\sigma_{g_1}, i.e. center and rescale the data by turning everything into a z-score. They would be right to do so. But I am not most bioinformaticians. There are some subtle points that merit belaboring. What we have here is really a mismatch between our intuition about what we wanted, and how we formulated the problem. It turns out we didn’t want just any old distance, and so solving the problem of “find a simple distance measure” gave us the right answer to the wrong question. So instead of addressing behaviors we don’t want when they crop up, a more principled approach might be to be more precise about the behaviors we do want. In particular, the kinds of biofeature profiles we want to consider equivalent.

First up on the list, we have

Translation invariance

The first thing to do is to temporarily give up on the notion of a distance function. It’s tempting to look at a family of symmetric functions d: \mathbb{R}^n \times \mathbb{R}^n \rightarrow [0, \infty) that satisfy d(g_1 + c\mathbf{1}, g_2) = d(g_1, g_2). But the required axioms for a distance function cannot be satisfied since one of these axioms is d(x,y) = 0 \Leftrightarrow x = y.

Instead, it is possible to refine the very space on which we model the problem. Instead of treating genes as data drawn from the space \mathbb{R}^n, they can be treated as elements of an equivalence class \mathcal{S}, defined by some relation. The relation, in this case, is translation:

\mathcal{S} = \{x \in \mathbb{R}^n \,|\, x = a + c\mathbf{1}\}; \;\; a \in \mathbb{R}^n, \; c \in \mathbb{R}

Equivalence classes defined on \mathbb{R}^n can be viewed as quotient manifolds, and as such have a geometric notion of distance defined on them. The approach therefore is to identify the appropriate equivalence relation that encapsulates some intuition; and then to solve for a suitable metric on the equivalence class. In the case of translation invariance: given an arbitrary vector x \in \mathcal{S}, any other element in its equivalence class can be obtained by x' = x + \mathbf{1}\gamma for some constant \gamma. These are all parallel lines. The normal plane to these lines is given by P_{\bot}(z) = x + (I - \mathbf{11}^T)z. Thus d(x,y) under this equivalence class is given by the solution to

d(x,y) = ||z|| \; \mathrm{s.t.} \; x + (I - \mathbf{11}^T)z = y + \mathbf{1}\gamma

Then

d(x,y) = \mathrm{min}_{\gamma} ||(I - \mathbf{11}^T)^{-1}(y - x + \mathbf{1}\gamma)||

It can be seen that \gamma = \mu_x - \mu_y (note that \mathbf{11}^T is the summing matrix, so \mathbf{11}^Tx = n\mathbf{1}\mu_x).

An equivalent algorithm to compute this distance is to simply subtract the means from x and y, and then calculate the standard Euclidean distance. But the point is that now we know this is a solution to our problem: we specified an equivalence class that represented our intuition, and defined a proper distance function on that equivalence class. In this case, the above distance is also the geodesic distance. This is because the normal plane happened to be constant.
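As a minimal sketch in R, the equivalent algorithm is just mean-centering followed by the usual Euclidean distance:

d_translation <- function(x, y) {
  # subtract each vector's mean, then take the standard Euclidean distance
  sqrt(sum(((x - mean(x)) - (y - mean(y)))^2))
}
x <- rnorm(20)
d_translation(x, x + 2)       # 0: a baseline-shifted profile is in the same class
d_translation(x, rnorm(20))   # > 0 for an unrelated profile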

Scale invariance

Let’s consider (for simplicity) our data as already centered at 0. We also want genes which differ by scale to be part of the same equivalence class; that is:

\mathcal{S} = \{x \in \mathbb{R}^n | x = ac\} \;\; a \in \mathbb{R}^n, \; c \in \mathbb{R}

As above, the equivalence classes are lines, but not parallel ones. Each line passes through the origin (and indeed, the origin is not a part of the space!). Given a point x, another member of its class can be generated by x' = x + \gamma x = (1 + \gamma)x. Following the same derivation as above, we can define the tangent distance (that is, distance in the tangent space) as

d_{\bot}(x,y) = \mathrm{min}_\gamma ||(I - xx^T)^{-1}(y - x + y\gamma)||

The ansatz here is to rescale y to have the same length as x (note the xx^T); ergo \gamma = \frac{||x|| - ||y||}{||y||}. An equivalent algorithm is to rescale x and y by x \leftarrow x/||x|| (which, as x is mean-0, is proportional to x/\sigma_x), and take standard Euclidean distance. This is the reason that data normalization (re-centering, setting standard deviation to 1) works: it is a metric on the translation and scale invariant equivalence class.
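A minimal sketch of that recipe in R, i.e. the tangent-space distance on the translation and scale invariant class:

d_tangent <- function(x, y) {
  # z-score each vector, then take the standard Euclidean distance
  zx <- (x - mean(x)) / sd(x)
  zy <- (y - mean(y)) / sd(y)
  sqrt(sum((zx - zy)^2))
}
x <- rnorm(30)
d_tangent(x, 3 * x + 1)   # ~ 0: rescaling and shifting stay within the class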

But: the tangent space is not constant. Unlike in the translation-invariant case, the tangent space is point-dependent. Thinking about distance geometrically as the length of a particular curve, once you move slightly away from the point x towards the point y, the tangent space is different; that is, if you move half the distance towards y in the tangent space of x, obtaining a new point x', the distance d(x', y) is not half the distance d(x,y). So while this is a valid distance on the equivalence class \mathcal{S}, it is not necessarily ideal.

The property that, if d(x,y) = D and you move a units from x towards y to a point x', then the remaining distance is d(x',y) = D - a, is known as non-acceleration. Curves called geodesics have this property, and for a given manifold \mathcal{M} and sufficiently close points (x,y), the geodesic between them is unique.

Our equivalence class \mathcal{S} has a name: \mathbb{RP}^{n-1}. As a quotient manifold of \mathbb{R}^n it is a differentiable manifold with the geodesic

\gamma(t; \vec x, \vec v) = \cos(t||v||)x + \sin(t||v||)\frac{v}{||v||}

This is parameterized by an initial point x, and an initial direction v (note that v is not the destination point, but instead a direction vector at x). The corresponding distance between two points x and y along the geodesic connecting x to y is:

d(x,y) = \cos^{-1}\left(\frac{|x^Ty|}{||x||\cdot ||y||}\right) = \cos^{-1} | \mathrm{cor}(x, y) |

This is an axial distance; positive and negative correlations are equivalent. We can convert to a directional distance by removing the absolute value, which corresponds to a geodesic distance on a union of equivalence classes (depending on the sign of c); i.e. rays from the origin.

To sum up: by identifying an “intuitive problem” with standard Euclidean distance, and simply centering and normalizing the data, you do fix the problems with standard Euclidean distance by making the resulting measure scale and translation invariant (because you made your data all have the same mean and scale). At the same time, by approaching from the other direction, and defining an equivalence class corresponding to our intuition, we find that the rescaled-Euclidean distance is really a tangent space distance on the equivalence class. This distance is characterized by straight lines on the tangent space, and therefore exhibits acceleration, as opposed to the geodesic distance on the quotient manifold. In our case, the geodesic is known in closed form, and can be used as a plug-in replacement for the data-rescaled distance.

So, to compare two vectors within the scale and translation invariant equivalence class, take d(g_1, g_2) = \cos^{-1}(|\mathrm{cor}(g_1, g_2)|). If you want to be robust, you can use Spearman correlation in place of Pearson (which replaces the sphere by the permutohedron).
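In R, a minimal version of this geodesic distance (Pearson by default, with Spearman as the robust variant):

d_geodesic <- function(x, y, method = "pearson") {
  # arc-cosine of the absolute correlation; clamp at 1 for numerical safety
  acos(min(1, abs(cor(x, y, method = method))))
}
x <- rnorm(30)
d_geodesic(x, -2 * x + 5)   # 0: same equivalence class (the distance is axial)
d_geodesic(x, rnorm(30))    # near pi/2 for unrelated profiles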

We will soon see that this approach has direct analogues for matrix-valued features; and the geometry of equivalence classes provides a solution even if the matrices differ in their number of rows.

Exercise:

For clarity, it was assumed that the data was magically zero-mean when we cared about a scale-invariant equivalence class. Show that d_{\bot}(x,y), as implemented by d_{\bot}(x,y) = ||\tilde x - \tilde y|| where \tilde z = (z - \mu_z)/\sigma_z, is the tangent-space distance for the equivalence class

\mathcal{S} = \{x \in \mathbb{R}^n | x = ac + b\}; \;\; a \in \mathbb{R}^n, c, b \in \mathbb{R}

The equivalence classes are parallel projective spaces. I have no idea how to visualize such a thing.

Transcript expression comparisons

Imagine comparing measurements from two genes G_1 and G_2, whose features take the form of matrices. A canonical example is transcript-level expression: G_1 has k_1 transcripts, and G_2 has k_2 transcripts, both of which have been quantified on the same n samples. If k_1 = k_2 then the distance

d(G_1, G_2) = ||G_1 - G_2||_F = \sqrt{\sum_{i,j} (G_{1_{ij}}-G_{2_{ij}})^2} = ||\mathrm{vec}(G_1) - \mathrm{vec}(G_2)||

is well-defined. Already we run into trouble: this distance is dependent on how the rows are ordered. That is, if G_2 has identical quantifications to G_1, but the transcripts are simply shuffled, G_2 is suddenly very far from G_1. The row means haven’t changed, nor has the scale of each row, or the overall scale of the matrix.

Imagine trying to “fix” this behavior (as we did by centering and rescaling our vectors). How would we transform G_1 and G_2? And what would we do if k_1 > k_2? This is where equivalence classes really shine in modeling the situation.

The argument proceeds along the following lines:

What do we mean when we say two sets of isoforms are related? If there are only two isoforms for each gene, it means when one of the isoforms of one gene goes up, an isoform of the other gene also goes up. So each isoform of G_1 predicts some part of the isoforms of G_2. Even if k_1 \neq k_2, we still mean that G_1 captures some information about G_2. If knowing G_1 allows me to predict the isoform levels of G_2 exactly, then I’d say those two genes are equivalent.

So let \mathcal{F} be some class of functions f: \mathbb{R}^{k_2 \times n} \rightarrow \mathbb{R}^{k_1 \times n}. We can define equivalence classes as above by using

\mathcal{S} = \{X| X=f(A)\} \;\; A \in \mathbb{R}^{k_2 \times n} \; f \in \mathcal{F}

Translation-invariance, scale-invariance, rotation-invariance, or whatever other properties that are desired can be incorporated into the class \mathcal{F}. Without specifying \mathcal{F} there’s not much we can do except ask for residual distances: let \hat f = \mathrm{argmin}_f ||G_2 - f(G_1)||_F and we can take as a measure of distance

d(G_1, G_2) = ||G_2 - \hat f(G_1)||_F = \min_{f \in \mathcal{F}} ||G_2 - f(G_1)||_F

What kind of distance is this? We will see, soon enough, that if \mathcal{F} is the class of multilinear functions (i.e. standard regression), then the above distance is the tangent-space distance. This should extend to any class \mathcal{F}, but I don’t have a proof.

The simplest thing to do is to let \mathcal{F} be the space of multilinear functions; in which case

\mathcal{S} = \{X | \Gamma X + \mathrm{diag}(\gamma) \mathbf{1}_{k_2 \times n} = A\}

where \Gamma is a k_2 \times k_1 coefficient matrix, and \gamma is a k_2-vector of row means. Let’s assume that k_2 > k_1, in which case the equivalence classes are parallel hyperplanes of dimension k_1 \times n in a space of dimension k_2 \times n.

Novel members of a class have to be generated in a rather interesting way. Given X as k_2 \times n with C as k_2 \times k_1; then if X = CZ we should have (C^TC)^{-1}C^TX = Z; and therefore a new member can be generated by X' \leftarrow C_2(C_1^TC_1)^{-1}C_1^TX + \mathrm{diag}(\gamma)\mathbf{1}. And of course it is possible to let C_2 = C_1. One can immediately recognize this as the least-squares solution:

d(G_1, G_2) = \min_C || G_2 - CG_1||_F

(we can augment G_1 with an extra row of 1s to take care of \gamma). This minimization is, of course, multiple regression; and

\hat C = G_2G_1^T(G_1G_1^T)^{-1}

and therefore

d(G_1,G_2) = ||G_2(I - G_1^T(G_1G_1^T)^{-1}G_1)||_F = ||G_2 - \hat f(G_1)||_F

which is precisely the residual norm (the square root of the sum of squared residuals) from multiple regression. The form here precisely corresponds to the tangent space at G_2, and therefore this is the matrix analogue of the tangent space distance for vectors. Therefore, to “do the same thing” as centering and scaling the data in the matrix case, regress one matrix on the other, and take the Euclidean norm of the residuals. But it is impossible to transform every matrix in exactly the same way and still describe a metric on the equivalence class \mathcal{S}.
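A sketch of this matrix tangent-space distance in R; the example matrices are simulated, and note that the measure is not symmetric in its two arguments:

d_residual <- function(G1, G2) {
  X <- rbind(G1, 1)                          # augment with a row of 1s for the intercept term
  H <- t(X) %*% solve(X %*% t(X)) %*% X      # projection onto the row space of X
  R <- G2 %*% (diag(ncol(G2)) - H)           # residuals of each row of G2
  sqrt(sum(R^2))                             # Frobenius norm of the residuals
}
G1 <- matrix(rnorm(3 * 20), 3, 20)                     # 3 isoforms, 20 samples
G2 <- rbind(2 * G1[1, ] + G1[2, ], G1[3, ] - G1[1, ])  # exact linear functions of G1
d_residual(G1, G2)                                     # ~ 0
d_residual(G1, matrix(rnorm(2 * 20), 2, 20))           # > 0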

But, exactly as before, the tangent space is point-dependent; and so the above distance exhibits some form of acceleration. It is not a geodesic distance. Can we find one? It turns out our equivalence class \mathcal{S} has a name: the Grassmannian \mathrm{Gr}(k_2, n). This is the space of all k_2-dimensional planes in \mathbb{R}^n, and is equivalent to the quotient space:

\mathrm{Gr}(k_2, n) \equiv \{X \in \mathbb{R}_*^{k_2 \times n} | \mathrm{span}(X) = \mathrm{span}(A) \} \;\; A \in \mathbb{R}_*^{k_2 \times n}

where \mathbb{R}_*^{a \times b} means the space of maximal-rank a \times b matrices (this space of full-rank matrices is itself known as the noncompact Stiefel manifold).

If k_1 = k_2, then we rather happily can consider the distance between equivalence classes [G_1], [G_2] in \mathrm{Gr}(k, n). As in the case of single-expression, the distance here is in the form of principal angles. Let \tilde G_1, \tilde G_2 be semi-orthogonal matrices belonging to the corresponding equivalence classes (\tilde G_i \in [G_i]). That is, \tilde G_1\tilde G_1^T = I. Clearly we can take \tilde G_1 = (G_1G_1^T)^{-1/2}G_1. Letting

U\Sigma V^T = \tilde G_1\tilde G_2^T

be the singular-value decomposition of the product of the two matrices, then

d_{\mathrm{Gr}}(G_1, G_2) = \sqrt{\sum \cos^{-1}(\sigma_i)^2}

This recovers the previous formula if k=1. This should be no surprise, as \mathrm{Gr}(1, n) \equiv \mathbb{RP}^{n-1}! This also underlies such typical metrics as \sqrt{1 - |\det(XY^T)|}, which is the old \sqrt{1 - \mathrm{cor}(x,y)^2} in disguise (take the product of the cosines of the principal angles)!

When k_1 \neq k_2, the same approach can be done as before, by letting \hat G_2 = \hat f(G_1) = G_2G_1^T(G_1G_1^T)^{-1}G_1 and taking

d(G_1, G_2) = d_{\mathrm{Gr}}(\hat G_2, G_2)
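A sketch of the principal-angle computation for the k_1 = k_2 case (simulated matrices); for k_1 \neq k_2, plug the regression fit \hat G_2 from the previous sketch in as the first argument, as prescribed above:

d_grassmann <- function(G1, G2) {
  Q1 <- qr.Q(qr(t(G1)))               # orthonormal basis for the row space of G1
  Q2 <- qr.Q(qr(t(G2)))
  s  <- svd(t(Q1) %*% Q2)$d           # cosines of the principal angles
  sqrt(sum(acos(pmin(1, s))^2))       # clamp at 1 for numerical safety
}
G1 <- matrix(rnorm(2 * 30), 2, 30)
d_grassmann(G1, 5 * G1 + 2 * G1[c(2, 1), ])    # ~ 0: same row space
d_grassmann(G1, matrix(rnorm(2 * 30), 2, 30))  # > 0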

Combining aggregate and transcript expression

I do believe there is an equivalence class that may capture both transcript alignment and aggregate expression: the idea being that if the aggregate transcript expression between the two genes differs by significantly more than a constant, they should be considered ‘further apart’. Here, we constrain \Gamma somewhat, requiring that \Gamma^T\Gamma = \gamma I for some constant \gamma. So we have

\mathcal{S} = \{X | \gamma \Gamma X + \mathrm{diag}(\zeta) \mathbf{1}_{k_2 \times n} = A\} \; \Gamma \in O(k_2, k_1)

This entails more or less the same process as before, except that \hat C should be replaced with \hat \gamma\hat C (\hat C^T\hat C)^{-1/2} . The rest follows.

In sum: clarifying what it means for two genes to be “equivalent” in their transcript expression led to the formulation of the appropriate equivalence class, and the identification of the corresponding geodesic distance.

One can rightly wonder whether the choice of a metric really matters. Euclidean, Mahalanobis, Jensen-Shannon, some appropriate Geodesic, and so on.

In certain cases, the metric is defined on subtly different spaces, as is the case with correlation-based metrics, mutual-information based metrics, and Euclidean distance. There are differences in equivalence classes; and part of the problem is that these differences are hidden. Rather, some property of the metric itself is considered desirable and it is chosen for that reason. Ideally, a suitable equivalence class should be defined and a natural metric used thereon.

On the other hand, some metrics are defined over the same equivalence class. \sqrt{\frac{1}{2}(1-|\mathrm{cor}(x,y)|)}, \cos^{-1}|\mathrm{cor}(x,y)|, \sqrt{1 - \mathrm{cor}(x,y)^2} are all metrics on \mathbb{RP}^{n-1}. Therefore if (x_1, y_1) are further apart than (x_2, y_2) under one metric, they are further apart under the others. They’re all just monotonic transformations of one another, and rank inter-point distances in precisely the same order. Who’s to say one has more desirable properties than another?

I don’t have a good answer.

 

September 29, 2015

Iterators in R

Filed under: Uncategorized — heavytailed @ 4:25 am

I’ve got an absurd number of post stubs at the moment, which should start coming out over the next month. This is a short code-related one: how to take many files sorted (by, say, p-value) and stream records into R as though they had all been merged into one file. All this needs is an implementation of a peekable iterator. This is one of those things which is boilerplate for most languages, but is a bit of a struggle for R (especially due to needing to use the local/global assignment operator <<-).

The first little bit is to use the iterators and itertools packages to create a simple file iterator. I know there’s an iread.table, but it’s not exactly what I need.


require('iterators')
require('itertools')

fstream <- function(filename, format, header=T) {
  handle = file(filename)
  open(handle)                        # keep the connection open so successive reads advance the offset
  hdr <- NULL
  if ( header ) {
    hdr <- readLines(handle, n=1)     # consume (and stash) the header line
  }
  nxt <- function() {
    x <- scan(handle, format, nlines=1, quiet=T)   # read a single record in the given format
    if ( length(x[[1]]) == 0 ) {
      stop('StopIteration')           # signal exhaustion the way the iterators package expects
    }
    x
  }
  obj <- list(nextElem = nxt)
  class(obj) <- c('fstream', 'abstractiter', 'iter')
  obj
}

The file() function takes a path and turns it into a file handle; and open() lets you read from it (and maintain the offset). Note that function chaining (“handle = open(file(filename))”) won’t work here, nor will open(filename). Kind of ugly stuff.

The next part is to write the merging algorithm, that assumes the files are sorted by a (specified) comparison function. I’ve used a very simple one (sorted, ascending, by second column):


simple_compare <- function(rows) {
  row_vals = sapply(rows, function(t){t[[2]]})  # unlist
  which(row_vals == min(row_vals))[1]
}

iterfiles <- function(files, format, cmp) {
  streams <- lapply(files, function(f) { ihasNext(fstream(f, format)) })
  heads <- lapply(streams, function(t) { nextElem(t) })
  nxt <- function() {
    if ( length(heads) == 0 ) {
      return(NULL)                    # every stream is exhausted
    }
    best_idx <- cmp(heads)            # index of the stream whose head record sorts first
    to_ret <- heads[[best_idx]]
    if ( hasNext(streams[[best_idx]]) ) {
      heads[[best_idx]] <<- nextElem(streams[[best_idx]])
    } else {
      # this stream is exhausted: drop it (and its head) from the working lists,
      # but still return the record we just pulled off of it
      streams <<- streams[-best_idx]
      heads <<- heads[-best_idx]
    }
    to_ret
  }
  nxt
}

Note the extensive use of “<<-” to ensure the reassignment of the streams and heads variables, which live one level above the scope of the “nxt” function. Note also that the record pulled from an exhausted stream is still returned; NULL only comes back once every stream has run dry. Putting this all together:


write.table(file='.foo1', x=sort(runif(1000)))
write.table(file='.foo2', x=sort(runif(1000)))
write.table(file='.foo3', x=sort(runif(1000)))

nxtline <- iterfiles(c('.foo1', '.foo2', '.foo3'), format=list(character(), double()), cmp=simple_compare)

line <- nxtline()
merged_pvals <- c(NA, line[[2]])
line <- nxtline()
while ( ! is.null(line) ) {
  merged_pvals <- c(merged_pvals, line[[2]])
  line <- nxtline()
}

Produces the sorted, merged list of p-values, as expected. The idea, of course, would be that there’s a total of ~300,000,000 such values, and it’s only worthwhile plotting certain ones. This can be done with something like:


# fields come from an iterator over the real (multi-column) files, built with
# iterfiles() exactly as above; here the p-value sits in the 5th field
fields <- nxtline()
rank <- 1

ranks <- c(NA, rank)
pvals <- c(NA, fields[[5]])

KEEP_ALL_BEFORE <- 100
SAMPLING_POWER <- 1.1
skip <- 4

fields <- nxtline()
while ( ! is.null(fields) ) {
  rank <- rank + 1
  if ( rank <= KEEP_ALL_BEFORE ) {
    ranks <- c(ranks, rank)
    pvals <- c(pvals, fields[[5]])
  } else if ( rank %% skip == 1 ) {
    # past the first KEEP_ALL_BEFORE records, keep roughly geometrically-spaced ranks
    print(sprintf('%d  %e', rank, fields[[5]]))
    ranks <- c(ranks, rank)
    pvals <- c(pvals, fields[[5]])
    skip <- 1 + round(skip * SAMPLING_POWER)
  }
  fields <- nxtline()
}

and of course the expected pvalue will just be “ranks/rank”, since rank will, in the very end, be the total number of data points.

June 17, 2015

Estimating fraternity coefficients – Part 1

Filed under: Uncategorized — heavytailed @ 8:14 am

For my Ph.D. written qualifier (to demonstrate I’m capable of pursuing a Ph.D., I suppose), I have opted to develop algorithms for the estimation of fraternity coefficients (also known as dominance coefficients). While the coancestry (a.k.a. kinship) coefficient has received significant attention since Peter Visscher’s seminal 2006 paper, the fraternity coefficient has received far less attention. There are fairly good reasons for this:

1) Equation (11) in Visscher’s paper gives the correlation between coancestry \pi_a and fraternity \pi_d as

\left[\frac{1/(16L)}{5/(64L)}\right]^{1/2} = \sqrt{4/5} \approx 0.89

This means that, for your typical random effects model Y \sim N(\mu, \sigma^2_e I + \sigma^2_a \Pi_a + \sigma^2_d \Pi_d), the matrices \Pi_a and \Pi_d, in expectation, differ by a multiplicative constant; and therefore the model is in expectation close to non-identifiable, at least in siblings.

2) While “unrelated” individuals will share some small fraction of their genome IBD by chance, one needs to effectively square that chance to obtain the fraction of their genome where both alleles are shared IBD. That said, the fact that coancestry estimation in unrelated individuals has proven so effective suggests that fraternity estimation may be valuable.

3) For common, human diseases, very few genetic variants with dominance effects have been identified. I apologize that my reference for this is an unhelpful Google Scholar search; I am officially presenting the lack of findings as evidence that there’s very little to find (such is the nature of publication bias).

Furthermore, the basic analysis has already been done and was published by the Visscher lab this past March. Using the homozygous covariance sharing matrix (see here or here for derivations), Zhu and Yang demonstrate that across ~80 human traits (although not disease), dominance contributes little, if any, to trait heritability.

So why bother? Well there are a couple of reasons, none of which are particularly good on their own, but which together make a reasonable case for choosing this as a topic for what is effectively glorified coursework that is, per requirement, entirely unrelated to my actual thesis. First, the methods above make the assumption that individuals are not inbred. At all. At the limit where we care about estimating the very low probability of two individuals being homozygous for the same allele, it seems strange to make the assumption that their parents could only share alleles across pairs, but not within pairs. And there are populations (comprising perhaps 14% of all of humanity) where consanguineous marriage is common, and prevalence of neurogenetic disorders is increased in these populations. Hardy-Weinberg disequilibrium being expected in such pedigrees, it is entirely possible that dominant effects which have little impact in other populations may contribute somewhat more substantially in these clades. Second, the observation that dominance is not a significant factor for trait heritability applied to a number of quantitative physical traits; but the analysis was not extended to diseases, and in particular complex diseases with two-hit or multi-hit hypotheses. Requiring that two copies of a gene (or two copies of multiple genes) be affected by mutation is the very definition of dominance. Third, just as genome-wide heritability estimates can be improved (or made more accurate) by restricting the calculation to fixed segments (e.g. coding, regulatory), the same is true of fraternity coefficients, and may yield interesting results. Finally, the methods used for fraternity estimation estimate an allelic covariance, but not \pi_d. The latter is more interesting to me at the moment because it reflects a classical genetic idea; but more importantly requires (in conjunction with inbreeding) multi-allelic sites (or full haplotypes) to calculate. I’ve never had a good opportunity to deal directly with statistical phasing (mostly because there are great algorithms out there already); but the fact that the haplotype assignment is, in this case, a nuisance parameter, gives me a good excuse to finally implement an HMM.

Over the next few weeks, I’ll develop this idea into a mathematical estimator; then into a full algorithm; and finally test it on simulated and real data. We’ll see if it works.

November 17, 2014

The Libor Market Model

Filed under: Uncategorized — Tags: , , , — heavytailed @ 9:15 pm

Among statisticians (particularly biostatisticians), hearing ‘LMM’ immediately triggers “Linear Mixed Model” and scary thoughts about random effects. Among quant circles the association is completely different: LMM means the Libor Market Model, the most commonly used model of interest rates among the squalor of the fixed-income desks.

What is LIBOR?

Does it help if I tell you the acronym stands for London Interbank Offered Rate? No? Simply put, a selection of 18 banks tell the Intercontinental Exchange what they think a fair rate would be if they were to borrow money.

The Model

So after all that time describing LIBOR, it’s time to reveal that the LIBOR Market Model is a complete misnomer. The model is, itself, a simple model of forward contracts. In fact, it should really be called the NIRMM: Nominal Interest Rate Market Model, as any effective interest rate can be modeled so long as forward contracts can be observed.

Forward Rates

A forward contract is an up-front agreement for a future loan. No bank would do this in real life, but pretend you know you’re going to buy a house in three years. You walk into Local Municipal Bank and say “Three years from now, I’m going to get a €300,000, 5-year loan from you, let’s lock in a rate.” The bank says “OK we’ll give you 2.5% at that time.” Boom. Forward rate is 0.025 [1] In the actual market, there are many different forward rates, corresponding to different loan durations (“tenors”) and ending dates (“maturities”). Each of these contracts, naturally, has its own rate with which it is associated.

Forward rates don’t have a simple floating value based merely on supply-and-demand (like, say, onions): they face a significant no-arbitrage constraint: If I borrow €300,000 for 3 years, and also hash out a €300,000 forward contract which begins in 3 years and lasts for 5 years, then the total rate I pay ought to be the same as if I’d simply borrowed €300,000 for 8 years. Mathematically if P(a,b) is a bond from time a to time b, and F(t, a, b) is a forward contract at time t from time a to time b: P(t, T_1 + T_2) = P(t, T_1)F(t, T_1, T_2).

If this equality fails to hold, and P(t, T_1 + T_2) > P(t, T_1)F(t, T_1, T_2) then I can make money by borrowing money using the right-side structure (borrow from now to T_1, and engage a future contract from T_1 to T_2), and lending money using the left-side structure. Similarly, if P(t, T_1 + T_2) < P(t, T_1)F(t, T_1, T_2) then I can make  money by borrowing money using the left-side structure, and lending using the right-side structure.

If the equality above holds, then the forward contract rate has an easy definition in terms of zero-coupon bonds P: \;\; F(t, T_1, T_2) = \frac{P(t, T_1+T_2)}{P(t, T_1)}. The simply compounded interest rate L(t, T_1, T_2) is defined by

\frac{1}{1 + (T_2 - T_1)L(t, T_1, T_2)} = F(t, T_1, T_2)

This is a bit of a silly definition as written; let’s parse this out:

\frac{1}{1 + \delta L_i} = \frac{P_{i+1}}{P_i} \Rightarrow 1 + \delta L_i = \frac{P_i}{P_{i+1}} \Rightarrow L_i = \frac{P_{i} - P_{i+1}}{\delta P_{i+1}}

This gives the nifty equation

P_{i+1}(1 + \delta L_i) = P_i
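As a small numeric check of this identity, with made-up zero-coupon bond prices P_i = P(t, T_i) on a semi-annual grid:

P     <- c(0.995, 0.985, 0.972, 0.957, 0.940)   # hypothetical ZCB prices
delta <- 0.5
L <- (head(P, -1) - tail(P, -1)) / (delta * tail(P, -1))  # L_i = (P_i - P_{i+1}) / (delta * P_{i+1})
all.equal(tail(P, -1) * (1 + delta * L), head(P, -1))     # recovers P_i, so TRUE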

So we can calculate the market Libors from prices, spots, or forwards (e.g. using the market prices for each, without needing to convert). The Libor Market Model is a model for L, which models the way in which the forward rates respond to stochastic noise. It has the form:

dL = \mathbf{\Xi}Ldt + L \odot \Omega \; dW

where L is a vector, \mathbf{\Xi}, \Omega are matrices; all of these are functions of time. While \Omega(t) is an exogenous model of volatility, the no-arbitrage constraint will enable us to define the matrix \; \mathbf{\Xi}.

Consider a simple case where \; \mathbf{\Xi} = 0 and \Omega = I. Then each component of L evolves as independent geometric brownian motion, and so

L_i(t) = L_i(0)\exp\left(-\frac{1}{2}t + W_t^i\right) and \mathbb{E}[L_i(t)] = L_i(0)e^{-\frac{1}{2}t}\mathbb{E}[\exp(W_t^i)] = L_i(0)e^{-\frac{1}{2}t + \frac{1}{2}t} = L_i(0)
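A quick simulation of this \mathbf{\Xi} = 0, \Omega = I case (unit volatility, one-year horizon, made-up L_i(0)) confirms the martingale property:

set.seed(2)
t  <- 1
W  <- rnorm(1e5, mean = 0, sd = sqrt(t))   # W_t for 100,000 independent paths
L0 <- 0.03
Lt <- L0 * exp(-0.5 * t + W)               # driftless geometric Brownian motion
mean(Lt)                                   # ~ 0.03, up to Monte Carlo error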

There’s a problem here: no-arbitrage is not enforced. If L_i(0) is not in an arbitrage-free state, or if dW ever knocks the system out of an arbitrage-free position (which happens almost surely), then the relationship between forwards and spots (e.g. the L_i and L_j) evolves with no regard to market forces.

So how do we enforce our sample paths to be approximately zero-arbitrage? It turns out that simply baking this assumption in suffices to do so; that is: the assumption of geometric brownian motion under the risk-neutral measure is more powerful than one might initially think.

It’s important to note that the “risk-neutral measure” is one of those terms of art that has only a practical meaning. It almost always means whatever measure makes my thing a martingale. Consider that the definition of L_i(t) = L(t, T_i, T_{i+1}) gives

L_i(t)P_{i+1}(t) = \frac{1}{T_{i+1}-T_i}(P_i(t) - P_{i+1}(t)) \Rightarrow L_i(t) = \frac{1}{\delta}\left(\frac{P_i(t)}{P_{i+1}(t)} - 1\right)

Now we want to differentiate. Using the derivative from deterministic calculus will get us into trouble; so we need to use a stochastic derivative, and in particular the stochastic division rule.

Ito’s Lemma and the Stochastic Quotient

Stochastic calculus anywhere outside of engineering is overburdened with theory, acronyms, and poor notation. Most approaches to teaching it get far too technical far too quickly. Seriously, go look at Wikipedia and try not to get completely lost. Google around for some notes and see if you fare any better.

In my mind, the entirety of brownian stochastic calculus is summed up as follows:

dW^2 \sim dt

This is intuitive in the following sense: if X(0) = 0 and dX = \sigma dW we have that X(t) \sim N(0, \sigma^2 t). Of course then \left(\frac{X(t)}{\sigma\sqrt{t}}\right)^2 \sim \chi^2(1). In an infinitesimal step X = 0 + \sigma dW, and so \sigma^2 dt \chi^2(1) \sim X^2 = \sigma^2 dW^2. Then we make the approximation dt \chi^2(1) \approx \delta(dt) since \mathbb{P}[dt \chi > \epsilon] \rightarrow 0 for any \epsilon not depending on dt. Ergo dW^2 \sim dt.

This is very handwavy because we bypass all concern about the competing notions of convergence. We have a random variable converging to a distribution, and a function (that distribution) converging to an infinitesimal; and the infinitesimal converging to 0. Nevertheless, the above is exactly correct when used in a Taylor expansion (the only place you’d want to use it anyway), and can be used to derive many things, including the stochastic quotient.

Consider two geometric brownian motions

dA = A\alpha dt + A\sigma dW_a

dB = B\beta dt + B\gamma dW_b

with \mathbb{E}[W_aW_b] = \rho. Let f(a,b) = \frac{a}{b} and then

df = \frac{\partial f}{\partial a} da + \frac{\partial f}{\partial b}db + \frac{\partial^2 f}{\partial a \partial b}dadb + \frac{1}{2}\frac{\partial^2 f}{\partial a^2}da^2 + \frac{1}{2}\frac{\partial^2 f}{\partial b^2}db^2 + \dots

= \frac{1}{b}da - \frac{a}{b^2}db - \frac{1}{b^2}dadb + \frac{a}{b^3}db^2 +\dots

Which means that

d(A/B) = \frac{1}{B}(A \alpha dt + A \sigma dW_a) - \frac{A}{B} \frac{1}{B}(B \beta dt + B \gamma dW_b) - \dots

... - \frac{1}{B}(A\alpha dt + A\sigma dW_a)(\beta dt + \gamma dW_b) + \frac{A}{B^3}(B\beta dt + B\gamma dW_b)^2

We observe a lot of A/B terms in here. Writing C=A/B:

dC = C(\alpha-\beta) dt + C(\sigma - \gamma)dW - C\sigma\gamma dW_adW_b + C\gamma^2dW_b^2 + O(dt^2) + o(dtdW)

Dropping terms of order higher than dt, and setting dW^2 = dt, we have

dC = C(\alpha - \beta + \gamma^2 - \sigma\gamma\rho)dt + C(\sigma - \gamma)dW

The spot rates P_i follow (independent) geometric brownian motion (this is, again, the no-arbitrage assumption)

dP_i = P_i \mu_i(t)dt + P_i\sigma_i(t)dW

Using the quotient rule, and defining R_{i} = \frac{P_i}{P_{i+1}} we have

\frac{dR_{i}}{R_{i}} = (\mu_i - \mu_{i+1} + \sigma^2_{i+1} - \sigma_i\sigma_{i+1})dt + (\sigma_i - \sigma_{i+1})dW

This gives then

dL_i = \frac{1}{\delta}R_i\left[(\mu_i - \mu_{i+1} + \sigma^2_{i+1} - \sigma_i\sigma_{i+1})dt + (\sigma_i - \sigma_{i+1})dW\right]

The crux of the Libor Market Model is that L_i should follow brownian motion. This implies two things, first, that the volatility difference is very special: \sigma_i - \sigma_{i+1} = \frac{\delta}{R_i} L_i \xi_i(t) and also that the drift terms should be linked to the volatilities as \mu_i(t) = \sigma_i(t) r(t). These then give

dL_i = \frac{1}{\delta} R_i\left[((\sigma_i - \sigma_{i+1})r - \sigma_{i+1}(\sigma_i - \sigma_{i+1}))dt + (\sigma_i - \sigma_{i+1})dW\right]

dL_i = L_i\left[(r - \sigma_{i+1}) \xi_i dt + \xi_i dW\right]

And now we have brownian motion. It’s worth parsing out what the spot rates now have to look like based on the central assumptions we made in order to shove L_i into a brownian motion form. Because we have \mu_i(t) = \sigma_i(t)r(t), we therefore must have

dP_i = P_i\sigma_ir dt + P_i \sigma_i dW \Rightarrow \log \frac{P_t}{P_0} = (r\sigma_i - \frac{\sigma_i^2}{2})t + \sigma_i W_t = \sigma_i\left[(r - \frac{\sigma_i}{2})t + W_t\right]

with the right-hand equation assuming constant volatility. In addition, the required relationship

\sigma_{i+1} = \sigma_i - \frac{\delta}{R_i} L_i \xi_i

telescopes. One thing to keep in mind is that \sigma_i is the volatility of a bond that exercises at time T_i, so \sigma_i(t) = 0 for t > T_i. Letting j(t) be the first index i for which t < T_i, the above telescopes to

\sigma_{i+1} = \omega(t) - \sum_{k=j(t)}^i \frac{\delta}{R_k}L_k\xi_k

And since L = \frac{1}{\delta}(R - 1) \Rightarrow R = \delta L + 1 \Rightarrow \frac{\delta}{R_i}L_i = \frac{\delta L_i}{1 + \delta L_i}

which is the term one is most often used to seeing.

If you recall, above I claimed that the derivation of this model only required certain no-arbitrage assumptions. There are, however, seemingly unrelated-to-arbitrage assumptions on volatility differences, and a linkage between volatility and drift. However, if L_i were not martingales (e.g. if we couldn’t take a reference frame where dL_i = L_i \xi_i dW) then because P_{i+1}(1 + \delta L_i) = P_i, it would imply that there is an arbitrage in the P_i. The hand-wavy way to demonstrate this is to note that, by the stochastic quotient above, that if A is a martingale, and B is a martingale on the same space, then A/B must also be a martingale. Taking the contrapositive: if A/B is not a martingale at least one of: A is not a martingale, B is not a martingale, or A and B are not defined on the same space must be true. Since L_i is a ratio, then the failure of L_i to be  a martingale implies that there must be an arbitrage in the bond prices. Therefore the “strange” conditions on the drift and volatility difference terms are actually just no-arbitrage constraints. The mathematics of this is worked out in Heath, Jarrow, & Morton, in the section on ‘Existence of Market Prices for Risk.’ One of the drawbacks of HJM and BGM is that the notation used is that of stochastic integrals rather than stochastic differentials, which is different from the notation presented here.

Measures, Calibration, and Pricing

You may have noticed in the previous discussion my borrowing the term ‘reference frame’ to refer to dropping the drift term from the differential equation. This is typically referred to as a ‘change of measure’ or even more confusingly, ‘change of numeraire.’ There are appeals to Girsanov’s theorem and the Radon-Nikodym derivative, which involves developing the theory of filtrations. It’s all rather complicated for a very simple piece of intuition: you’ve got a particle moving with certain dynamics. Classically, you can write those dynamics in any number of suitable reference frames, including one in which the particle is not moving at all. The stochastic equivalent is to adopt a reference frame where the particle is moving only stochastically, and this amounts to adopting a reference frame that moves deterministically (and instantaneously) with the drift alone; that is if

\frac{dX}{X} = \alpha(t)dt + \sigma(t)dW

then we can adopt the reference frame

\frac{dF}{X} = \alpha(t)dt

and note the presence of X in the denominator. Sure this is now a system of linked equations, which is precisely the point. Establishing that this is a valid transformation probabilistically is a bit involved, but the actual transformation is intuitively simple. Girsanov’s theorem basically says “You can pick a reference frame.”

That rant aside, how many degrees of freedom do we have in this model? The \sigma_i(t) are fixed (possibly up to a single baseline function \omega(t)) by the preceding L_j. That just leaves the \xi_i; so there are basically n+1 degrees of freedom. The n of course can be increased arbitrarily by choosing \xi_i(t) to be of higher order than a constant term. Given that L_j is a martingale, with dL_j =_{\mathbb{Q}_j} L_j\xi_j(t)dW_j, the \xi_j are typically referred to (and can be interpreted as) Libor volatilities. Again, a reminder that the \xi_i are not volatilities of the actual LIBOR which is reported daily: they are volatilities of the forward rates. Of course, application of Ito’s rule brings them into line:

F_i = f(L_i) = (1 + \delta L_i)^{-1}

df = \frac{\partial f}{\partial x}dx + \frac{1}{2}\frac{\partial^2f}{\partial x^2}dx^2 + \dots

dF_i = -(1 + \delta L_i)^{-2}\delta L_i \xi_i dW + (1 + \delta L_i)^{-3}\delta^2\xi_i^2 L_i^2 dW^2

Recalling that \frac{\delta L_i}{1 + \delta L_i} = \frac{\sigma_i - \sigma_{i+1}}{\xi_i} we get out (well, what did you expect??)

dF_i =_{\mathbb{Q}_i} (\sigma_{i+1}-\sigma_i)^2F_idt + (\sigma_{i+1}-\sigma_i)F_idW

Despite the fact that the \sigma_i can be computed from the \xi_i (given L_i), and vice versa, which means that one could choose to model the forward, spot, or Libor volatilities, the philosophy of the LMM is that the “truth” lies with the Libor volatilities, \xi_i, and that spot and forward volatilities are derived quantities. The Libor volatilities can be modeled in much the same way forward volatilities would be (for instance Ho-Lee or HJM like models). For instance, we could let

\xi(t) = (a + bt)e^{-ct} + d

\xi_i(t) = k_i \xi(t)

which gives us n+4 free parameters. The Libor values L(t, T_i, T_{i+1}) are directly observable, as defined above, and thus their volatilities can be fit directly.
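A sketch of this parameterization, with made-up values of a, b, c, d, and k_i, plus a drift-free simulation of a single forward Libor under its own forward measure \mathbb{Q}_{i+1} (simulating all the L_i jointly under one measure would require the drift terms derived above):

xi_base <- function(t, a = 0.05, b = 0.10, c = 1.0, d = 0.10) (a + b * t) * exp(-c * t) + d
xi_i    <- function(t, k_i) k_i * xi_base(t)

simulate_libor <- function(L0, k_i, T_i, n_steps = 250) {
  dt <- T_i / n_steps
  L <- numeric(n_steps + 1)
  L[1] <- L0
  for (s in 1:n_steps) {
    t <- (s - 1) * dt
    # lognormal step with the (time-dependent) Libor volatility xi_i(t)
    L[s + 1] <- L[s] * exp(-0.5 * xi_i(t, k_i)^2 * dt + xi_i(t, k_i) * sqrt(dt) * rnorm(1))
  }
  L
}
path <- simulate_libor(L0 = 0.03, k_i = 1.2, T_i = 2)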

There are alternatives to direct fitting as well. In much the same way that one can use option prices to back out parameters of the underlying asset, one can back out parameters of the forward rates from caps and swaps. This method of calibration requires a slight digression on pricing these instruments.

Derivatives pricing under LMM

The standard interest rate derivatives are caps and swaps. There are of course others, including a plethora of exotics, but caps and swaps are the two which are germane to LMM calibration. We start with a swap agreement.

Interest Rate Swaps

An interest rate swap agreement references two dates: a “reset date” T_i and a “settlement date” T_{i+1}. The difference \delta_i = T_{i+1}-T_i is the time interval for the contractual fixed rate \kappa. The contract is as follows:

Party A is looking to hedge against volatility in forward interest rates, and wants to fix a set interest rate \kappa over the time period (T_i, T_{i+1}). Therefore at time T_{i+1} Party A disburses \delta_i \kappa to Party B.

Party B is looking to speculate on the volatility on the forward interest rates, and specifically wants to capitalize on the difference between \kappa and the short rate r(t=T_i, T_{i+1}). In exchange for receiving \delta_i \kappa from Party A, Party B disburses whatever the short rate was over that period, r(t=T_i, T_{i+1}).

The short rate r(t, T) can be calculated from zero-coupon bond prices as r(t, T) = \left(\frac{\mathrm{Face\;Value}}{P(t, T)}\right)^{1/\mathrm{num \; compounds}} - 1. We will take the face value to be \texttt{f}, and we’ll assume for simplicity that there is only the single application of the rate at T_{i+1}. Therefore the contract stipulates

\mathrm{Receive}_A = \frac{\texttt{f}}{P(t=T_i, T_{i+1})} - 1

\mathrm{Receive}_B = \delta_i \kappa

The present value of these are

\mathrm{PV}_A = \frac{P(t, T_{i+1})}{\texttt{f}}\left(\frac{\texttt{f}}{P(t=T_i, T_{i+1})} - 1\right)

\mathrm{PV}_B = \frac{P(t, T_{i+1}) \delta_i \kappa }{\texttt{f}}

There’s some simplification for \mathrm{PV}_A, since \frac{P(t, T_{i+1})}{\mathtt{f}} = \frac{P(t, T_i)P(t=T_i, T_{i+1})}{\mathtt{f}^2} (in expectation) then we can rewrite

\mathrm{PV}_A = \frac{P(t, T_i)}{\mathtt{f}} - \frac{P(t, T_{i+1})}{\texttt{f}}

The fair value of \kappa can be identified by setting \mathrm{PV}_A = \mathrm{PV}_B so that

\frac{P(t, T_i)}{\mathtt{f}} - \frac{P(t, T_{i+1})}{\texttt{f}} = \frac{P(t, T_{i+1})}{\texttt{f}} \cdot \delta_i \kappa

Yielding

\kappa = \frac{P(t, T_i) - P(t, T_{i+1})}{\delta_i P(t, T_{i+1})} = \frac{1}{\delta_i}\left(\frac{P_i}{P_{i+1}}-1\right) = L_i

This implies that the Libor parameters can be calibrated to the interest rate swap market. The details were worked out by Jamshidian, and extended to a model similar to the LMM, but using forward swap rates. A direct comparison can be found here. For us, we continue to Caps.

Interest Rate Caps

An interest rate cap is effectively a call option on an interest rate. For those already familiar with the lingo, I’m subsuming what’s typically called a “caplet” into this definition.

Party A is looking to hedge against an increase in forward interest rates, and has in mind a maximum rate payment \zeta over the period (T_i, T_{i+1}). Party A seeks a payment from Party B if the interest rates exceed \zeta during this period, using the start-of-period interest rate r(t=T_{i}, T_{i+1}) as a proxy for the interest rate of the whole period. Thus at time T_{i+1}, Party A receives \delta_i \mathrm{max}(r(T_i, T_{i+1}) - \zeta, 0).

Party B is looking to bet that interest rates will not exceed a particular value, also \zeta, over the same period. Thus at time t, Party B receives a payment of V_B from Party A in anticipation of possible future disbursement, should the interest rate at t=T_{i+1} exceed \zeta.

This means that V_B is the present value of the expected payoff of the cap, e.g. [I’m taking face-value to be 1 for bonds]

V_B = P(t, T_{i+1})\,\delta_i\, \mathbb{E}[\mathrm{max}(r(T_i, T_{i+1}) - \zeta, 0)]

We can rewrite the interest rate as above:

V_B = \delta_i P(t, T_{i+1})\,\mathbb{E}\left[\left(\frac{1}{P(t=T_i, T_{i+1})} - 1 - \zeta\right)^+\right]

V_B = \delta_i P(t, T_{i+1})\,\mathbb{E}\left[\left(\frac{P(t, T_i)}{P(t, T_{i+1})}-1 - \zeta\right)^+\right]

V_B = \delta_i P(t, T_{i+1})\,\mathbb{E}[(\delta_iL_i - \zeta)^+]

Admittedly, this looks very strange. The units just don’t work out; the only way this is consistent is if \delta_i is not measured in time, but is a dimensionless scalar. In fact, \delta_i here is measured as fraction of a compound time, so if T_{i+1}-T_i = 6 \mathrm{mo} for yearly compounds, then \delta_i = 0.5. In fact, if you look back over the derivation, you’ll notice that this has to be the case, otherwise the libor rate has units 1/t, which would give a forward rate units of time, which is nonsensical for a rate.

Usual cap agreements are contracts that reference a large number of the above caps; for instance a 5-year monthly cap agreement would reference 5 \times 12 = 60 caps. Because of this, the value of a cap for period T_i has to be deconvolved from publicly-traded cap contracts. In addition, the optimization problem, while convex, is nonlinear, making it somewhat more difficult to back out from caps than swaps. See here for more details (fair warning: poor notation rears its ugly head).
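As a rough pricing sketch (not the exact bookkeeping above: here the strike \zeta is an annualized rate and the caplet pays \delta(L(T_i) - \zeta)^+ at T_{i+1}, the usual convention), a single caplet under the lognormal LMM can be priced by Monte Carlo and checked against the Black-76 closed form; all inputs are made up:

set.seed(3)
caplet_mc <- function(L0, strike, xi, Ti, delta, P_Tip1, n_paths = 2e5) {
  z  <- rnorm(n_paths)
  LT <- L0 * exp(-0.5 * xi^2 * Ti + xi * sqrt(Ti) * z)   # lognormal L(T_i) under Q_{i+1}
  P_Tip1 * delta * mean(pmax(LT - strike, 0))            # discounted expected payoff
}
caplet_black <- function(L0, strike, xi, Ti, delta, P_Tip1) {
  d1 <- (log(L0 / strike) + 0.5 * xi^2 * Ti) / (xi * sqrt(Ti))
  d2 <- d1 - xi * sqrt(Ti)
  P_Tip1 * delta * (L0 * pnorm(d1) - strike * pnorm(d2))
}
caplet_mc(L0 = 0.03, strike = 0.025, xi = 0.2, Ti = 1, delta = 0.5, P_Tip1 = 0.96)
caplet_black(0.03, 0.025, 0.2, 1, 0.5, 0.96)   # the two should agree closely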

Originally I was planning on doing some model calibration myself as an example; perhaps in the future. The post, as it stands, is long enough as it is. There’s one obvious question remaining, though:

Why don’t we calibrate LMM to treasury rates?

Recall that the Libor rate was defined in terms of the spot and forward rate

L_i = \frac{P_i - P_{i+1}}{\delta P_{i+1}}

The answer is that there is a divergence between the ZCP rates calculated from (for instance) U.S. Treasuries, and the ZCP implied by calibrating the LMM to interest rate caps. The point is that government rates are different from interbank rates, and the risk-free zero-coupon bond is in any case a theoretical, unobservable quantity. While coupon theory can be applied to things like treasury bonds, and the resulting forward and spot rates analyzed, there is no guarantee that government bonds are, really and truly, a risk-free ZCP. (Indeed, we expect things like exchange rate risk to differentially affect interbank rates and government rates). So calibrating the interbank rates to government rates will make you a very sad panda.

October 6, 2014

Kernel Random Effects

Filed under: Uncategorized — heavytailed @ 10:57 pm

I ran into a couple of grad students studying for a big AI exam yesterday. I barged in wondering what they were doing with all the pretty inner products and norms on the board, precipitating a moment of intense awkwardness. One of them asked how much I knew about Mercer’s theorem, and if I could explain it. The best I could do was recall the gist of Mercer’s theorem: which is that if you have a Kernel K: \mathcal{X}\times\mathcal{X} \rightarrow \mathbb{R} that obeys certain conditions (positive-definiteness and boundedness), then you can think of it as an inner product in a larger feature space. In other words (letting \mu be a measure on \mathcal{X}):

(1) \;\;\; \sum_{i,j=1}^n a_i a_j K(x_i,x_j) \geq 0 \; \; \forall x_i \in \mathcal{X}, a_i \in \mathbb{R}

(2) \;\;\;\; \int_X \int_X K(x,z)^2 d\mu(x)d\mu(z) < \infty

(1),(2) \Rightarrow K(x,z) = \sum_{k=1}^\infty \lambda_k\phi_k(x)\phi_k(z)

Where the convergence is uniform and absolute. This is, basically, obvious, and is the extension of the linear algebraic notion that positive semidefinite matrices can be written as a matrix of inner products, and the point is that so long as you can construct a positive-definite K for which condition (2) holds, you can solve certain optimization problems over a Hilbert space without ever directly calculating the image of your data in that space. For support vector machines (the typical application of Kernels) this basically means if you magically pick a high-dimensional representation in which there is a good separating hyperplane, you can find a really good hypothesis.
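As a tiny sanity check of condition (1): a Gaussian (RBF) kernel matrix built on random one-dimensional points is positive semidefinite, so its eigenvalues are non-negative up to numerical noise (the bandwidth is arbitrary):

set.seed(4)
x   <- runif(50, -5, 5)
rbf <- function(a, b, bw = 1) exp(-(a - b)^2 / (2 * bw^2))
K   <- outer(x, x, rbf)
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)   # >= 0 up to numerical noise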

Okay well what does it mean for there to be a good separating hyperplane in a large (possibly infinite) dimensional space? It means that there’s a (w,b) such that

w^\dagger x_i + b < 0 \;\; \mathrm{if} \; \; y_i = 0 \;\;\; w^\dagger x_i + b > 0 \; \mathrm{otherwise}

Note that for Kernel methods, the inner product is replaced by the kernel, that is K(w,x) + b < 0 (etc). One way of interpreting this is as showing that there is a high-dimensional inner product space characterized by K on which the classification function is linear; another way is to interpret this as saying that in that high-dimensional space, the data cluster on either side of the dividing line. Perhaps there are many clusters, but on either side of that line: the explanatory data is somehow better aligned with the responses.

(Note that this is different from the Manifold hypotheses articulated in previous posts. The weak MH says that the data are distributed on a low-dimensional manifold; the Strong MH says that the hypothesis class has a simple representation in that very same manifold. The Kernel-MH says the hypothesis class has a simple representation on some manifold of dimensionality up to that of the data, in a potentially infinite ambient space. This is why Kernel methods don’t magically escape the curse of dimensionality.)

Obviously, a better alignment of features with response makes all manner of machine learning methods work more appropriately. In particular, in the regression setting, one would like to identify features better correlated with the response. The only issue is that while the predictors can fly into the infinite-dimensional Kernel space, the response can’t. And even if you could, all you get from K(x,z) is the Kernel matrix K_{ij} = K(x_i,x_j), with no access to \phi(x_i) and \phi(x_j).

But. For Random effects (see for instance here), K is all you need. This is one of those (excellent, but few) non-trivial cases where you’re forced to use random effects. This blog post will provide some background and demonstrate some results. There’s not all that much to develop — merely pointing out that K is a matrix of appropriate dimensions is enough for anyone to take this forward and run with it — except to point out relationships between the current literature on Kernels and assumptions about random effects models (and these mostly have to do with normalization).

From the Annals of Bad Examples: Why Use Kernels

“Doing a good thing for a bad reason”

The reason to use a Kernel is to project the data into a space in which it is linearly separable. What this is really doing is choosing a space in which the image of the separating criterion function is linear, enabling linear classification in the image of the data, which corresponds to nonlinear classification of the unprojected data. This works really well for many types of hypotheses, but one type on which it tends to fail miserably is disjoint intervals. Yet somehow the canonical example is as follows:

classify_intervals

Got some red points, some blue points (the y-axis is just 0 or 1 depending on the class, it’s a 1-dimensional classification) and they’re obviously not linearly separable since they’re disjoint intervals. Now there is a transformation C: \mathbb{R}\rightarrow\mathbb{R} that will wind up separating the red points from the blue points, it’s the cheating kernel, where you simply multiply the endpoints together to form a polynomial, i.e. for endpoints z_i \in Z:

C(x) = \prod_{i=1}^{|Z|}(x-z_i) = (x-3)(x-2)(x+4)(x+1)(x-6)(x-8)(x+11)(x-12)

and “magically” by constructing a polynomial with zeros on the edges of the intervals the two classes are linearly separable!

intervals_polynomial

Therefore the Kernel trick works! My beef is: this is really a kernel dirty trick: the Kernel function itself encodes the hypothesis, and so if we didn’t know it ahead of time, we’d’ve been screwed. Before we move on to realistic examples using “real” kernels, it’s worth asking what the Kernel Matrix looks like for this, and we can choose either to normalize within the kernel feature space or not (the normalization sets \tilde{K}(x,y) = \frac{K(x,y)}{\sqrt{K(x,x)K(y,y)}})

intervals_unnormalized intervals_normalized

For a reason to be determined, normalizing the Kernel (right) reveals the huge structure in the data (the fuzziness is due to a small constant I added to deal with 0-valued denominators), and very obviously the eigenspaces of these two kernel matrices will be different. The normalized one should be very easy to classify, the unnormalized one maybe less so. Looking at the projections, it's clear there's a much larger margin in the normalized case.

unnormalized_eigen_hist normalized_eigen_hist

The normalized case clearly indicates that a random effects model (or a fixed effect using PCs) could easily pull signal out of this. This immediately leads to means of extending the Kernel trick to general regression problems. In part this is already well-known; for the general regression problem of solving:

\mathrm{min}_w L(X^\dagger w)

where L is a loss function (such as L = (y-y_\mathrm{pred})^2) we can rewrite as

\mathrm{min}_v L(X^\dagger Xv)

simply by letting w = Xv. For L2-penalized regression the transformation works as well (the penalty term is + \lambda v^\dagger X^\dagger Xv). The story here is a little bit more restrictive, but at the same time a bit cooler: by being a bit more explicit about an underlying probability model (e.g. gaussian, probit, etc), the kernel matrix K is taken as a hypothesis about the covariance of observations. If you've found a "good" representation of your data (not just a space where it's separable, but one where distances mean something, in terms of "close points are correlated, far points are not"), the Kernel matrix should have the property of explaining a large proportion of the variance in your data.

Let’s not over-trivialize: this is a pretty radical re-interpretation of the Kernel matrix and what it’s trying to accomplish. Usually we only care about separability: the fact that the magnitude of the positive points (in the first figure) is far smaller than that of the negative points, and that overall there’s a vast range not associated at all with the response, is in general ignored. It’s separable: let’s move on. By contrast, considering the Kernel matrix as though it were part of a variance components model means caring about the structure of the data, beyond smearing it out enough that the class is more or less separable. It may even be worth making normative statements: good Kernels provide both improved separability and discover/retain covariance in the predictors that translates into covariance in the response. (This is, at its core, a statement about scale).
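Before the digression, here is a minimal sketch of my own of the interval example's kernel matrices (the evaluation grid and the small constant are assumptions mirroring the text):

z <- c(3, 2, -4, -1, 6, 8, -11, 12)           # roots of C(x) above
C <- function(x) sapply(x, function(u) prod(u - z))
x  <- seq(-12, 13, length.out = 300)
K  <- outer(C(x), C(x))                        # "cheating" kernel matrix K(x, y) = C(x) C(y)
Kn <- K / sqrt(outer(diag(K) + 1e-6, diag(K) + 1e-6))   # normalized, with a small constant
# image(K) and image(Kn) qualitatively reproduce the two kernel-matrix figures, and
# eigen(Kn, symmetric = TRUE)$vectors[, 1] is the projection a PC/random-effect model would use.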

Digression: Kernel Normalization

Before moving to the empirical section (which can be summarized by: blindly sticking K into generalized linear models sometimes works), it’s worth linking the way scale relates to linear models to the way the Kernel represents a map into a latent high-dimensional feature space.

One thing to notice is that Kernel normalization restricts the Kernel feature space to a unit sphere. This follows from Mercer’s theorem: we can just rewrite

K(x,y)/\sqrt{K(x,x)K(y,y)} = \left(\frac{\phi(x)}{||\phi(x)||}\right)^\dagger \frac{\phi(y)}{||\phi(y)||} = \cos \theta_{\phi_x\phi_y}

for suitably defined inner product (it may be an integral). Clearly you lose a dimension from this, and so if K corresponded to an N-dimensional space, then the normalized version is the (N-1)-dimensional sphere. This precisely explains the behavior observed above: restricting the one-dimensional polynomial transform C(x) above to the 0-dimensional sphere (the points {1,-1}) effectively makes K(x,y)/\sqrt{K(x,x)K(y,y)} = \mathrm{sgn}(C(x))\mathrm{sgn}(C(y)): ergo the constant strobes in the normalized Kernel matrix.

Conceptually at least, normalization is a statement that the original kernel maps the data into a space where the angles (suitably defined) are predictive, but the position of the data in absolute terms is irrelevant. In some sense, there’s less smearing, and this can be amenable to methods not just looking for a separating hyperplane, but looking for a space where distance between points has some kind of meaning.

One way to examine the impact of normalization is by its effect on the canonical feature map. That is, a kernel K(x,y): X \times X \rightarrow \mathbb{R} induces a number of objects. One such object is the operator L_K: L^2(X) \rightarrow L^2(X) defined as g(y) = (L_K f)(y) = \int_X f(x)K(x,y)dx. This can be thought of as a change of basis, much like a Laplace transform; and the canonical feature map is the application of this operator to the delta-representation of the datapoint, i.e. \phi_K(x) = K_x(y) = L_K(\delta_x) = \int_X K(z,y)\delta(z-x)dz.

There are many feature maps consistent with a Kernel, not all of which draw from Mercer's theorem like the canonical map does. The exact feature depends on how you define your inner product: for instance, if \langle f,g\rangle = \int\int f(x)K(x,y)g(y)\, dx\, dy then the above representation is good to go. By contrast you might want to factor this as \int\int f(x)\sqrt{K(x,y)}\sqrt{K(x,y)}g(y)\,dx\,dy, which is perfectly fine given that K is positive definite; in this case you might want the canonical feature to be \int_X \sqrt{K(z,y)}\delta(z-x)dz.

So what does the canonical feature look like for a polynomial kernel? On the left is a standard quadratic kernel, and on the right, the same kernel but normalized. Remember, these are canonical features, in this case \phi_z(x,y) = \int_x\int_y \sqrt{K(x,y)}\delta(x-z_x)\delta(y-z_y) is parametrized by the 2-vector z = (z_x,z_y) = (1,-1), the black dot in the pictures.

polynomial_unnormalized  polynomial_normalized

Increasing the polynomial to degree six:

polynomial6_unnormalized polynomial6_normalized

So Kernel normalization here appears to make the canonical features look more like radial functions. What are inner products of radial functions? Effectively similarities: radial functions whose centers are close by (in comparison to the function volume, say) have a large inner product, those with centers far apart have a small inner product. This is going to link us back to random effects, but it’s worth addressing (in our own poor fashion) what’s going on geometrically when one normalizes a Kernel.

Some Geometry of Kernel Normalization

So normalizing a kernel is basically rescaling so that the diagonal is one. Note that the standard RBF \exp(-||x-y||^2) is automatically normalized, which is of course related to the intuitive connection between inner products of radial functions and similarities. But what really does it mean if K(x,y) = (x^\dagger y + c)^d, when normalized, results in a better classification than K(x,y) = \exp(-||x-y||^2)? (These are .gifs, by the way; if they're not animating, click on them to view.)

exp_rotated poly2_rotated poly3_normalized_rotated

Since these kernels are just inner products, they should be rotation invariant; that is invariant under unitary operations on L2 (the space of the canonical representation). In the case of kernels that are functions of a norm or inner product (i.e. K(x,y) = K(x^\dagger y) or K(x,y) = K(||x-y||)), rotations of feature space induce a unitary operator in Kernel space – and therefore the Kernel matrix will be invariant under such rotations.

But one big difference between the exponential and polynomial kernel is that the former is translation invariant, and the latter is not.

exp_translated poly3_translated poly3_normalized_translated

The exponential kernel is invariant under translations, while the polynomial kernel has a "minimum" (0) where x^\dagger y = -c. In some sense, while pure radial kernels encode "locality", other kernels encode biases about the ambient feature space. The polynomial kernel treats points as similar if they are (1) far away from the minimum, and (2) collinear (or nearly so) with each other and with c. The ANOVA kernel (where rotations of feature space do not induce unitary transformations in Kernel space) encodes for axis-parallel similarities, and a wavelet kernel (which is admissible though not positive definite) encodes for a patterned similarity:

anova_normalized_rotated wavelet_rotated
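A quick numerical illustration of the translation point above (my own; the polynomial kernel's c = 1 and d = 2 are arbitrary choices for the example):

set.seed(2)
X  <- matrix(rnorm(20), ncol = 2)
Xt <- sweep(X, 2, c(3, -1), "+")               # the same points, translated
rbf  <- function(X) exp(-as.matrix(dist(X))^2)
poly <- function(X) (tcrossprod(X) + 1)^2
max(abs(rbf(X)  - rbf(Xt)))    # essentially 0: the RBF kernel matrix is translation invariant
max(abs(poly(X) - poly(Xt)))   # clearly nonzero: the polynomial kernel matrix is not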

So basically, one kernel performing “better” than another means that the canonical feature of the “superior” kernel function encodes something about the classification geometry in the ambient feature space that is closer to the truth. So for instance, if the normalized polynomial kernel outperforms the standard RBF, it indicates that similarities are not well modeled by gaussian-scaled distances, but instead by distances that are biased by overall distance from (and angle to) the reference point c.

But what does it mean if normalization improves performance? What's the difference between the normalized kernel and the unnormalized one? Well if a kernel is a generalized covariance, then a normalized kernel is some kind of a generalized correlation. That is, we go from some unknown space \mathcal{D} with an inner product onto the unit sphere in that space: \{x \in \mathcal{D}: \langle x, x \rangle = 1\}. This is a kind of poor man's projective space: we've made our destination manifold scale-invariant (with respect to rescaling the data). Indeed for SVMs, this kind of normalization is recommended as theoretically it improves the margin.

Would we want to do this for a random effect model? Should we do this for Kernel-PCA? There are really two answers here: for RBF Kernels, there is no actual question, as these are automatically normalized. However for things like the polynomial kernel, the question is one of whether we want to equalize scales across the features in our Hilbert space. From a physical point of view, data has units associated with it; so if x is impressions/day then x^2 is impressions^2/day^2. This automatically places features on different scales, so it seems to me from this perspective normalization is appropriate. On the other hand, if the data is normalized to be unitless, then features in the Kernel space may be “physical” in some sense.

Ultimately, there's nothing stopping a Kernel from being used in an old-school linear model. However, there are connections between Kernel machines and Linear Mixed models.

July 11, 2014

I’m glad this exists

Filed under: Uncategorized — heavytailed @ 12:06 pm

(leekspin, for reference)

July 10, 2014

On testing statistical libraries

Filed under: Uncategorized — Tags: , , , — heavytailed @ 7:54 pm

I have been frustrated for a very long time about the lack of good libraries for computing linear mixed models. The typical R packages (lme, lmer, etc) don't quite provide enough access to the underlying functionality: I've got covariance matrices for my random effects in hand, and there is no single grouping variable I could put into a formula like "growth ~ fertilizer + rain|field" that would generate those matrices. I've been repurposing statistical tools (GCTA) for this purpose, and providing fake GRMs which are just the covariance matrices I want to use for random effects. It works, but it's not anywhere near an actual library: it's a command-line tool.

So: I forked statsmodels and started to develop my own. They’ve got a very nice framework for MLE into which LMM/GLMM fits very well (REML/AI-REML will be more of a square peg/round hole situation), and the implementation took about an hour. But the problem is: how do you seriously test a new entry into a statistical package? There are lots of problems here — lme and lmer don’t always agree on effect estimates, and any regularization applied is not documented, so using “truth” for testing in this case is just not possible. We can test degenerate cases (no random effects) – but here, too, decisions about regularization can make testing tricky. At what point are you “close enough”? I’m talking here specifically about testing correctness, as what I want to test is part of a library of methods for statistical inference, and not statistical software per se. That is, concerns about dealing with nasty data or missing values(pdf) fall slightly upstream of this core.

The answer I settled on: test the pieces. If the likelihood is correct, the Jacobian is correct, the Hessian is correct, and the optimizer is doing the right things (and if the system isn’t sharing state in some weird way between these pieces), we can be reasonably sure that the MLE coming out the other end is right. So how to test these pieces?

Testing likelihood

Likelihood is usually pretty easy to test. The function itself typically has a closed form, even if it's in terms of "special" functions. In this case, being a multivariate normal, the test is even easier. We can directly evaluate the likelihood by a simple transformation of the data: remove the proposed mean, multiply by the inverse of a square root of the proposed covariance (one can use the Cholesky decomposition to do this), and then the resulting data can be evaluated as independent draws from a standard normal distribution (keeping track of the log-determinant from the change of variables). It's reassuring when this very orthogonal approach produces the same likelihood value.
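A minimal sketch of that check (my own toy example; mu, Sigma, and y are made up here, and the log-determinant term accounts for the change of variables):

set.seed(1)
n <- 5
mu <- rnorm(n)
A <- matrix(rnorm(n * n), n)
Sigma <- crossprod(A) + diag(n)                    # a valid covariance matrix
U <- chol(Sigma)                                   # Sigma = U' U, with U upper triangular
y <- as.numeric(mu + t(U) %*% rnorm(n))            # one draw from N(mu, Sigma)
z <- backsolve(U, y - mu, transpose = TRUE)        # whiten: solve U' z = y - mu
ll_direct <- -0.5 * (n * log(2 * pi) + 2 * sum(log(diag(U))) + sum(z^2))
ll_whiten <- sum(dnorm(z, log = TRUE)) - sum(log(diag(U)))   # iid normals + log-det
all.equal(ll_direct, ll_whiten)                    # TRUE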

Testing the Jacobian

This is really the point of the whole post. Say you’ve got a closed-form solution for your Jacobian, how do you test that it’s correct? You can re-implement it in the testing file, but then how can you be sure you’ve not been wrong twice? You can verify that it’s zero at the known maximum, but it’s not really hard to get that wrong (what if, for instance, you got the sign wrong?)

The central maxim of testing mathematical functions is test your invariants. For instance, a great way to test an implementation of a binomial coefficient approximation is by verifying that Newton’s identity still holds:

\sum_k\binom{m}{k}z^k = (1 + z)^m \;\;\; z \in [0, 1)

In the case of the Jacobian, we can test that the Taylor expansion is accurate (or: test that the implementation of the closed-form solution matches the numerically-derived Jacobian). In particular, we care about a system of Taylor expansions about a test point x, and we let f_k = f(x) while f_{k+r} = f(x + r\epsilon) for a small \epsilon. So for instance we might have

f_{k-2} = f_k - 2\epsilon f'_k+ \frac{(2\epsilon)^2}{2!}f''_k - \frac{(2\epsilon)^3}{3!}f'''_k + \frac{(2\epsilon)^4}{4!}f^{(4)}_k - \dots

f_{k-1} = f_k - \epsilon f'_k + \frac{\epsilon^2}{2!}f''_k - \frac{\epsilon^3}{3!}f'''_k + \frac{\epsilon^4}{4!}f^{(4)}_k - \dots

f_{k+1} = f_k + \epsilon f'_k + \frac{\epsilon^2}{2!}f''_k + \frac{\epsilon^3}{3!}f'''_k + \frac{\epsilon^4}{4!}f^{(4)}_k + \dots

f_{k+2} = f_k + 2\epsilon f'_k + \frac{(2\epsilon)^2}{2!}f''_k + \frac{(2\epsilon)^3}{3!}f'''_k + \frac{(2\epsilon)^4}{4!}f^{(4)}_k + \dots

A clever individual can remove the effect of the second-derivative (in fact, all even terms) by differencing, for instance

D_1 = f_{k+1} - f_{k-1} = 2\epsilon f'_k + 2\frac{\epsilon^3}{3!}f^{(3)}_k + 2\frac{\epsilon^5}{5!}f^{(5)}_k + O(\epsilon^7)

D_2 = f_{k+2} - f_{k-2} = 4\epsilon f'_k + 2\frac{(2\epsilon)^3}{3!}f^{(3)}_k + 2\frac{(2\epsilon)^5}{5!}f^{(5)}_k + O(\epsilon^7)

The pattern being D_j = \sum_{r=1}^\infty 2 \cdot \frac{(j\epsilon)^{(2r-1)}}{(2r-1)!}f^{(2r-1)}_k. This becomes powerful when arranged into a matrix; for instance, an approximation with error O(\epsilon^9) would be

\left(\begin{array}{c}D_1 \\ D_2 \\ D_3 \\ D_4\end{array}\right) = \left(\begin{array}{cccc} 2 & 2 & 2 & 2 \\ 4 & 16 & 64 & 256 \\ 6 & 54 & 486 & 4374 \\ 8 & 128 & 2048 & 32768\end{array}\right)\left(\begin{array}{c} \epsilon f'_k \\ \frac{\epsilon^3}{3!} f^{(3)}_k \\ \frac{\epsilon^5}{5!} f^{(5)}_k \\ \frac{\epsilon^7}{7!} f^{(7)}_k\end{array}\right) + O(\epsilon^9) = A_4 g_4(\epsilon, f) + O(\epsilon^9)

This can be extended indefinitely to arbitrary precision. One might argue that a low-order approximation with a very small \epsilon should work, but the problem is that if the likelihood function is not changing very rapidly, evaluating it at a point a very small \epsilon away may produce the same value, so we need wider spacing to ensure that the evaluation really does produce different values; but we still want very accurate approximations to the Jacobian. We'd like exactitude at around 15 decimal places (machine epsilon is typically \sim 2^{-52} \approx 2 \times 10^{-16}), and taking the approximation to D_7, which is O(\epsilon^{13}), meets this requirement for \epsilon = 0.01. The error rate depends also on the higher-order derivatives; in the case of the normal likelihood, these derivatives are very much bounded.

A direct formula for f'_k can be obtained by multiplying together the appropriate widgets; first, D can be written as

D_3 = \left(\begin{array}{ccccccc} 0 & 0 & -1 & 0 & +1 & 0 & 0 \\ 0 & -1 & 0 & 0 & 0 & +1 & 0 \\ -1 & 0 & 0 & 0 & 0 & 0 & +1\end{array}\right)\left(\begin{array}{c}f_{k-3} \\ f_{k-2} \\ f_{k-1} \\ f_k \\ f_{k+1} \\ f_{k+2} \\ f_{k+3}\end{array}\right)

(The pattern should be obvious). Given this, a direct formula for f'_k \approx f'(x) is:

O(\epsilon^{2n-1}) + f'_k = \frac{\mathbf{e}_1^\dagger(A_n^{-1} D_n )f_n}{\epsilon}

where D_n is the nth-order differencing matrix, A_n is the nth-order Taylor difference matrix (defined above), \mathbf{e}_1 is an n \times 1 vector with a 1 in the first position only and zeros everywhere else, and f_n is the vector of function evaluations f(x - n\epsilon), f(x - (n-1)\epsilon), \dots, f(x + (n-1)\epsilon), f(x + n\epsilon). The quadrature vector \mathbf{e}_1^\dagger A_n^{-1} D_n can be precomputed for very fast evaluation. In fact, for the 13th-order approximation, the quadrature vector is

q_{7} = \frac{1}{360360} \left(\begin{array}{c} -15 \\ 245 \\ -1911 \\ 9555 \\ -35035 \\ 105105 \\ -315315 \\ 0 \\ 315315 \\ -105105 \\ 35035 \\ -9555 \\ 1911 \\ -245 \\ 15\end{array}\right)

The approximate partial derivative in the direction of x + \epsilon would then be \frac{q_{7}^\dagger f_7}{||\epsilon||}.
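A sketch of the construction (mine, following the formulas above; quadVector is a hypothetical helper name). It recovers the classic low-order stencils exactly, and for n = 7 reproduces the q_7 vector up to floating-point error.

quadVector <- function(n) {
  A <- outer(1:n, 1:n, function(j, r) 2 * j^(2 * r - 1))   # Taylor difference matrix A_n
  D <- matrix(0, n, 2 * n + 1)                             # differencing matrix D_n
  for (j in 1:n) { D[j, n + 1 - j] <- -1; D[j, n + 1 + j] <- 1 }
  drop(solve(A)[1, ] %*% D)                                # first row of A_n^{-1}, times D_n
}
round(quadVector(2) * 12)   # the classic 5-point stencil: 1 -8 0 8 -1
# quadVector(7) recovers (up to floating-point error) the q_7 vector shown above.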

Update: Here’s a plot of the error of this for the function f(x) = \sin(\pi * x^2)


q7 <- 1/360360 * c(-15, 245, -1911, 9555, -35035, 105105, -315315, 0,
                   315315, -105105, 35035, -9555, 1911, -245, 15)
x  <- 1:10000/2000                                  # grid with spacing eps = 0.0005
f  <- function(x) { sin(pi * x^2) }
df <- function(x) { cos(pi * x^2) * 2 * pi * x }    # exact derivative, for comparison
# stencil estimate of f'(x[i + 8]) minus the exact value, at each interior point
err <- sapply(1:9000, function(i) { t(q7) %*% f(x)[i + 1:15] / (x[2] - x[1]) - df(x)[i + 8] })
plot(x[8 + 1:9000], err, type = 'l', ylab = 'absolute error', bty = 'n', xlab = 'x')

Rplot01_q7err

 

So pretty low error. It blows up towards the end, but then again, so does f^{(13)}(x).

June 19, 2013

Genetic Association and the Sunrise Problem

Filed under: Uncategorized — Tags: , , , , — heavytailed @ 3:42 am

As a preface, let me acknowledge that this blog has been perhaps a bit too preoccupied with statistical genetics in the past month. Looking at the several posts I have partially completed it is likely to remain so through the summer.

Sequencing and GWAS

Nature Genetics has in the past few months returned to its old position as a clearinghouse for GWAS studies, the author lists for which steadily increase as more and more groups contribute data. (Can we start putting the author list in the supplemental material? It’s getting hard to navigate the webpage…). GWAS studies are nice, we sort of know how to do them. I have a bit of a beef with the push-button approach, I think GWAS studies and in particular meta/mega analyses are failing to deliver as much information as they should, and I’ll come back to this at the end of the section. But GWAS always have the complexity that they are ascertained. We didn’t find a variant of effect T. Does one exist? Can’t say, we could only query these X variants.

Sequencing is fundamentally different: whatever effects are driving the phenotype in your sample are present in your sample. Yet the treatment of (particularly exome) sequencing studies is “like a GWAS just with more variants,” with no real use of the fact that variants are discovered rather than ascertained. It’s the same kind of “where is the signal I can find?” question. But really, the trait is heritable, the signal is there, somewhere, and it’s not just about the signals you can find, it’s also about the places you can rule out. This applies to GWAS as well: you genotyped 50,000 samples and have 12 hits. Surely you can say something else about the other 199,988 sites on the chip, right? Or at least about a large fraction of them?

And once you go from ascertainment to discovery, you can rule out whole portions of the variant space. Things like "No variant in SLC28A1 of frequency > 5% has an odds ratio >1.2". With that said, do I still care about SLC28A1 as a candidate gene for my trait? What if the best I can do for ADAM28 is to say no such variant exists with OR>2.6, but may exist for OR<2.6? Now how much do I care about SLC28A1? You can make such statements having performed sequencing: not only can you produce a list of things where signal definitely is, but you can segment the rest into a list of things where the signal definitely is not, and those where signal might be.

In some sense GWAS studies dropped the ball here too, but the analysis is far more subtle. One of the big uses of modern sequencing data is fine mapping: so, at a GWAS locus, trying to identify rare variants that might be generating a "synthetic association". For instance, if several rare variants of large effect happen to be partially tagged by the GWAS hit. However, given the observed variance explained, there's then a posterior estimate of frequency, effect, r^2, and number of variants that could be consistent with the observed signal. Carefully working out what could be there, and what's expected to be there, would provide not only intuition about what to expect when performing fine mapping, but a good and precise framework for thinking about synthetic associations. While the idea that most GWAS loci are so explained has been vigorously rebutted, it is done so at the level of bulk statistics, with little attention paid to the question: if there is a synthetic association here, what must such variants look like?

The point is ultimately there’s a lot more information than what things have p<5e-7, it just needs to be extracted.

Bounding Variant Effects

The 0th-order question is “What variants do I have with p<5e-7“. If the answer isn’t the empty set, the 1st-order question is “What genes are near these variants?” and the 2nd-order question is “What biological processes are these genes involved in?”. If the answer is the empty set, the 1st-order question is “What gene burden tests are at the top of the quantile-quantile plot?”, the 2nd-order question is “What biological processes are these genes involved in?”

But, as I mentioned in the previous section, you’ve seen these variants enough times (especially now with tens- to hundreds of thousands of samples) to know whether or not they’re interesting. The first-order question ought to be “What variants can I rule out as contributing substantially to disease heritability [under an additive model]?” Ideally, this would be effectively everything you looked at. In practice, you looked at 100,000-10,000,000 variants, and some of those will look nominally associated by chance, and these will be difficult to rule out.

Alright so what am I talking about? Here’s the model:

\mathrm{lgt}\left(\mathbb{P}[Y=1|X,C_1,C_2,\dots]\right) = \alpha + \beta X + \sum_i \tau_i C_i

where X are the genotypes at a site, and C_i is the ith covariate. It is a property of maximum likelihood that, given the data, the coefficient vector (\alpha^*,\beta^*,\tau_1^*,\dots) = V^* \sim N\left(V_{\mathrm{true}},-[\nabla^2\mathcal{L}|_{V_{\mathrm{true}}}]^{-1}\right) asymptotically. Denoting the variance-covariance matrix as \Sigma, then \beta^* \sim N(\beta_{\mathrm{true}},\Sigma_{22}). This posterior distribution on the estimate allows more than the basic test of \beta_{\mathrm{true}} = 0; one can also ask for what values (\beta_u,\beta_l) we have \mathbb{P}[\beta_{\mathrm{true}} > \beta_u | \beta^*] < p, \mathbb{P}[\beta_{\mathrm{true}} < \beta_l | \beta^*] < p. This is just a confidence interval at level p, but if p<1e-6 and \beta_u-\beta_l is small with \beta_l < 0 < \beta_u, then this particular variant is very unlikely to be causal, given these bounds, in that the probability of a large beta is very low.

Now there are two problems with "just using a confidence interval." First, because the likelihood is heavily dependent on the number of observations of nonzero X_i, the variance of the estimator is a decreasing function of the minor allele frequency. That is, rare variants are harder to bound. Secondly, the disease heritability conferred by a variant of given effect increases with allele frequency. That is, a 1.4 odds ratio variant at 0.5% frequency is, in and of itself, contributing very little to the prevalence of the disease. By contrast, a 1.4 odds ratio variant at 30% frequency is contributing a large fraction of disease heritability.

Both problems can be resolved by taking a heritability view, that is, asking instead the question for what proportion \rho(\beta,f) of phenotypic variance we have

\mathbb{P}[\rho(\beta_{\mathrm{true}},f_{\mathrm{true}}) > T | \beta^*,f^*]

In other words, say you observe a variant with \beta = 0.2 \pm 0.25, at a frequency of 0.01. You then ask: what is the probability that this variant explains 1% of the phenotypic variance (on the liability scale) of my disease, given this observation? Well, that's simple: given the prevalence and frequency, one can simply ask for which two values of beta, \beta_{\mathrm{R}},\beta_{\mathrm{P}}, the risk-effect and protective-effect values, the proportion of phenotypic variance is 1%. Then, given those values, use the posterior distribution to ask what is \mathbb{P}[\beta_{\mathrm{true}} > \beta_{\mathrm{R}} | \beta^*]?

Note that the frequency of the variant does not enter into the discussion, it’s only hidden in the calculations of the phenotypic variance \rho(\beta,f), and the standard deviation of the maximum likelihood statistic \beta^*, thus making variants of different frequencies commensurate by placing the hypotheses on the same scale – and one growing approximately with the variance of the estimator. That is, while the variance of the estimator increases as the variant becomes rare (this is what makes bounding using a constant odds ratio difficult), so too does the effect size required to confer a given level of phenotypic variance. These phenomena tend to cancel, so on the scale of variance explained, one can bound a rare variant as well as a common variant.

In order to do this, it’s just a question of converting a given variance explained into frequency-effect pairs, and evaluating the likelihood of such pairs given the data. That is:

\mathbb{P}[\rho(\beta_{\mathrm{true}},f_{\mathrm{true}}) > T \,|\, \beta^*,f^*] = \int_{f_{\mathrm{true}}} \mathbb{P}[\rho(\beta_{\mathrm{true}},f_{\mathrm{true}}) > T \,|\, \beta^*,f_{\mathrm{true}}]\,\mathbb{P}[f_{\mathrm{true}}\,|\,f^*]\,df_{\mathrm{true}}

= \int_{f_{\mathrm{true}}} \left(\mathbb{P}[\beta_{\mathrm{true}} > \rho_f^{-1}(T)_u\,|\,\beta^*] + \mathbb{P}[\beta_{\mathrm{true}} < \rho_f^{-1}(T)_l\,|\,\beta^*]\right)\mathbb{P}[f_{\mathrm{true}}\,|\,f^*]\,df_{\mathrm{true}}

In practice, for even rare variants on the order of 0.5% frequency, the posterior frequency estimate is peaked enough to be well-approximated by setting f=f^* (remembering, of course, that all frequencies are population-based, so oversampling cases requires re-adjusting the observed frequency based on the prevalence of the disease: f = (1-p)f_U + pf_A).

The last thing to work out is the function \rho(\beta,f,\mathrm{prevalence}) giving the phenotypic variance explained by a variant of frequency f, effect size \beta on a dichotomous trait of a given prevalence. It’s very well known how to compute the liability-scale variance, and the inverse can be found trivially through numerical quadrature.

Liability-scale Variance: A Derivation

Recent discussions have left me slightly cynical about the common intuition behind this calculation (an intuition which seems to be "cite So et al 2012"), so it may be worth stepping through the derivation. Let's start with a disease and a variant. The disease has 10% prevalence, and the variant has a 10% frequency and an odds ratio of 1.35 (so \beta = 0.30). We begin by assuming an underlying latent variable with some distribution, and for the sake of having a pretty function, I'm going to assume a logistic. There's also a choice between setting the mean of the latent variable (over the population) to 0 and letting the threshold T adjust, or setting T=0 and letting the mean adjust. I will do the latter. So basically we have a whole population that looks like

liability1

Where the area under the curve to the right of 0 is 0.1. The population mean is exactly \mu_P = -\log(1/r-1). So the question is: whereabouts in this liability scale do carriers fall? For that we need to know the probability of being affected by genotype state. Given the odds ratio, this is pretty easy:

\mathbb{P}[A|X=2] = \left[1 + \exp(-(\mu_R + 2\beta))\right]^{-1} = \left[1 + \exp(-(\mu_R + 0.6))\right]^{-1} = P_V

\mathbb{P}[A|X=1] = \left[1+\exp(-(\mu_R + \beta))\right]^{-1} = \left[1+\exp(-(\mu_R+0.3))\right]^{-1} = P_H

\mathbb{P}[A|X=0] = \left[1+\exp(-\mu_R)\right]^{-1} = P_R

\mathbb{P}[A] = \mathrm{Prevalence} = \left[(1-f)^2P_R + 2f(1-f)P_H + f^2P_V\right]

There are three equations and three unknowns (plug in \mu_R = -\log(1/P_R-1)), and so this system is soluble. This can be reduced easily to a single equation by plugging in, and solved by a simple root solver such as


# logistic link and its inverse (assumed helpers; not defined elsewhere in the post)
logistic    <- function(x) { 1/(1 + exp(-x)) }
invlogistic <- function(p) { log(p/(1 - p)) }

diseaseProbs <- function(frequency, odds, prevalence) {
  bt <- log(odds)
  cPrev <- function(muR) {
    PR <- (1 - frequency)^2 * logistic(muR)
    PH <- 2 * frequency * (1 - frequency) * logistic(muR + bt)
    PV <- frequency^2 * logistic(muR + 2 * bt)
    (PR + PH + PV - prevalence)
  }
  if ( odds > 1.05 ) {
    # prevalence among reference individuals will be lower
    res <- uniroot(cPrev, lower = invlogistic(0.001), upper = invlogistic(prevalence))
  } else if ( odds < 0.95 ) {
    # prevalence among reference individuals will be higher
    res <- uniroot(cPrev, lower = invlogistic(prevalence), upper = invlogistic(0.99))
  } else {
    upr <- invlogistic(min(1.2 * prevalence, 1 - 1e-5))
    res <- uniroot(cPrev, lower = invlogistic(0.8 * prevalence), upper = upr)
  }
  muR <- res$root
  return(c(logistic(muR), logistic(muR + bt), logistic(muR + 2 * bt)))
}

So what does this give us? Well it gives us the Logistic Mixture comprising the population distribution

liability2
Where I haven’t been entirely faithful to the densities (I’m scaling by the square root of the frequencies of the genotypes). Point is, you can see the shift in the distributions by genotype, and consequently the proportion of the distribution curves falling above the T=0 threshold. At this point, we can just plug into the formula for variance: \mathbb{E}[(X-\mathbb{E}[X])^2] to get

V = (1-f)^2 (\mu_P-\mu_R)^2 + 2f(1-f)(\mu_P-\mu_R-\beta)^2 + f^2(\mu_P-\mu_R-2\beta)^2

Of course, throughout we have assumed a standard Logistically distributed error term. Thus the variance of the genotype means has to be expressed as a proportion of the total variance, or in other words \rho(\beta,f) = \frac{\sigma_g^2}{\sigma_g^2+\sigma_e^2}. In this case, the variance of the logistic is \pi^2/3, and thus we set

\rho = \frac{V}{\pi^2/3 + V} = \frac{0.016}{3.28+0.016} \approx 0.0049

So roughly 0.5% of the liability-scale variance. Note that using the standard method of replacing the logistic distribution with a standard normal, and using \rho_{\mathrm{norm}} = \frac{V_{\mathrm{norm}}}{1+V_{\mathrm{norm}}} results in an estimate of 0.44%. These differences increase with the variance explained, so that for a 20% frequency variant with OR=2.35, such a variant explains roughly 6% in the logistic model, but 5% variance in the normal model.

Assuming the posterior distribution of f is sufficiently peaked, doing a GWAS-like bounding scan is fairly simple. For each variant, calculate

(\beta_u,\beta_l): \rho(\beta,f_i) = T

via \rho_{f_i}^{-1}(T). Then, using the posterior distribution of d\mathbb{P}[\beta_\mathrm{true}|X,C_1,\dots] calculate

p_i = \Phi\left(\frac{\beta_l - \beta^*}{\sigma_\beta}\right)+1-\Phi\left(\frac{\beta_u - \beta^*}{\sigma_\beta}\right).
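A hedged sketch of this bounding scan for a single variant (my own wrapper; rho and boundP are hypothetical names), using the diseaseProbs helper and logistic links from the derivation above:

rho <- function(beta, f, prev) {
  pr  <- diseaseProbs(f, exp(beta), prev)        # P[A | 0, 1, 2 copies]
  muR <- invlogistic(pr[1]); muP <- invlogistic(prev)
  V <- (1-f)^2*(muP-muR)^2 + 2*f*(1-f)*(muP-muR-beta)^2 + f^2*(muP-muR-2*beta)^2
  V / (pi^2/3 + V)                               # liability-scale variance explained
}
boundP <- function(beta_hat, se_beta, f, prev, target) {
  bu <- uniroot(function(b) rho(b, f, prev) - target, c( 1e-4, 10))$root   # risk side
  bl <- uniroot(function(b) rho(b, f, prev) - target, c(-10, -1e-4))$root  # protective side
  pnorm((bl - beta_hat)/se_beta) + 1 - pnorm((bu - beta_hat)/se_beta)
}
# e.g. boundP(0.2, 0.25, f = 0.01, prev = 0.1, target = 0.01), mirroring the example above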

Multiple Test Considerations

Where does the figure 1e-7 come from for GWAS studies? Or 5e-8 for sequencing studies? (And I mean conceptually, not here, here, or here). The basic logic is that of the look-elsewhere effect:

1) Most variants are null

2) There are lots and lots of variants

3) There are many chances to see spurious nominal associations

4) Must set the threshold fairly high to achieve a study-wide Type-I error of 5%.

The fundamental consideration is (1), and there’s a significant asymmetry between a GWAS scan to reject \beta = 0, and a bounding scan to reject \beta \not \in (\beta_l,\beta_u). If most variants are null under a GWAS scan and ought not to be rejected, then almost no variants are null under a bounding scan, so most ought to be rejected. That is: Type-I error is bad in a GWAS scan (you find something that’s not real), and you’re willing to pay a cost of Type-II error (you don’t find something that is real). By contrast, for a bounding scan, Type-II error is bad (you put bounds on a truly causal variant), while Type-I error is less of an issue (you fail to bound a noncausal variant).

Thus, previous conversations about multiple testing are not applicable, and one shouldn’t blindly apply p=5e-7. Instead, bounding requires a good deal more thought (or assumption) about the architecture of the trait. I would articulate the logic as

1) Causal variants are null

2) Few variants are null

3) There are lots and lots of variants

4) There are many chances to falsely accept the null

5) We can be rather aggressive on the rejection p-value

It's worth articulating why we don't care all that much about false rejects: rejecting the null falsely simply includes a non-causal variant in a set of variants that should be enriched for causal variants, i.e. variants for which the correct call is to reject \beta = 0. And it's absolutely possible for a variant to be such that (e.g. have a large enough standard deviation that) H_0: \beta = 0 is rejected, as well as H_0: \beta \not \in (\beta_l,\beta_u) is rejected. On the other hand, failing to reject the null on a causal variant means we can potentially exclude it from follow-up analyses, a situation we want to avoid.

So there’s a different balance between Type-I and Type-II error here, with Type-II being the one we’re concerned about. But of the two, Type-II error is the trickier, because it’s fundamentally a question of power. The idea is, well, we don’t want to accidentally exclude things. But you have to be a little bit more specific: what kind of things do you not want to exclude? You can be sensitive to all causal variants (even those with OR=1+1e-5) by rejecting everything. So being more reasonable, you might say “I want no false acceptances of any variant with >0.5% of phenotypic variance explained”. Given that statement, you can calculate the power w(\rho=0.05,p_{\mathrm{nom}}), and that’s the power at the nominal threshold p_{\mathrm{nom}}. (See previous posts regarding calculation of power). But of course, that’s not quite enough, because you also need to know the number of chances to falsely accept such a variant, which is, of course, begging the question (if you knew how many causal variants there were, why bother with the exercise?). One way to get around this is to simply assume an unreasonable number of causal variants of such variance explained, N_{\mathrm{UB}} and then use the same machinery as Bonferroni correction and set

p_{\mathrm{nom}} : 1-(1-w(0.05,p_\mathrm{nom}))^{N_\mathrm{UB}} = 0.01

Bounding Variance Explained at a Locus

Alright, so that all was about variants you know about. But what if you care about a specific locus, like a gene, and want to know if variants within that gene contribute substantially to heritability. There are several different questions to articulate, but for now let’s stick with the single-variant approach. We know, from the previous section, how to answer “what is the probability that one of the variants in this locus explains a T or greater portion of phenotypic variance?”

But now we’ll take a step into the unknown. The second-order question: what is the probability that no variant exists at this locus that explains a T or greater proportion of phenotypic variance? That is, given I’ve sequenced the locus and haven’t found anything, what’s the probability I could keep on sequencing and still wouldn’t find anything?

Well, what’s the probability the sun will rise tomorrow?

Following Laplace, 9592/9593.

Well Okay. In all seriousness, this particular question about genetic mutation has just enough structure in it to avoid the philosophical issues that arise with the Sunrise Problem (statistics can't prove a negative, etc). Specifically: we rely on the fact that we've done sequencing. This means that we have not failed to query sites for mutations (as in the case of genotype chips); rather, we have queried them, and there were none. That is, if a variant exists, we've got a pretty good idea about its frequency. Let's turn to Mister Bayes and ask: given I've sequenced N samples and seen \alpha counts of a given allele, what is the posterior density of the allele frequency? Using the Rule:

d\mathbb{P}[p|\alpha;N] = \frac{ [x^\alpha] (px + q)^{2N} d\mathbb{P}[p]}{\int_p [x^\alpha] (px + q)^{2N} d\mathbb{P}[p]} = \frac{{2N \choose \alpha} p^\alpha(1-p)^{2N-\alpha}d\mathbb{P}[p]}{\int_p{2N \choose \alpha} p^{\alpha}(1-p)^{2N-\alpha}d\mathbb{P}[p]}

This doesn’t give a good intuitive sense of the posterior distribution other than the basic shape (gamma-like, no?). The best thing to use as a prior is the allele frequency spectrum modeled on the human bottleneck-and-superexponential-expansion (see Evans and Shvets, for instance), but I find a good intuitive model (that surprisingly works well in practice) is a truncated inverse distribution

d\mathbb{P}_\epsilon[p] = \left[-p\log(\epsilon)\right]^{-1}

which is distributed on the domain (\epsilon,1]. This tends to overestimate allele frequency because (1) humans have grown faster than exponentially, and (2) no mutation is allowed to have a frequency lower than \epsilon. You can make a Gamma distribution peak at \epsilon and then go to 0, but this is a nice toy distribution. Plugging it in yields

{2N \choose \alpha} p^{\alpha-1}(1-p)^{2N-\alpha} \left[\int_\epsilon^1 {2N \choose \alpha} p^{\alpha-1}(1-p)^{2N-\alpha}dp\right]^{-1}

Digression: The Probability that a Singleton is more than a Singleton

This is actually a fun little proof, given this simplified prior. You go out and sequence N people, and find a singleton you care about. What's the probability that its frequency is >1/2N? I'll prove that it's about \frac{1}{e}. What's the denominator of the above formula when \alpha = 1? Plugging in

\int_\epsilon^1 \frac{2NZp(1-p)^{2N-1}}{p}dp = \int_\epsilon^1 2NZ(1-p)^{2N-1}dp = Z(1-\epsilon)^{2N}

And therefore the conditional posterior of f can be written

d\mathbb{P}[p \,|\, \mathrm{AC=1\; in \;} N] = \frac{2NZ(1-p)^{2N-1}}{Z(1-\epsilon)^{2N}} = \frac{2N(1-p)^{2N-1}}{(1-\epsilon)^{2N}}

Now we can inquire as to the probability that, given AC=1 in N people, the true frequency will be greater than some value \delta. This is:

\int_\delta^1 \frac{2N(1-p)^{2N-1}}{(1-\epsilon)^{2N}}dp = \left(\frac{1-\delta}{1-\epsilon}\right)^{2N}

Now let \delta = \epsilon + \delta':

= \left(\frac{1-\epsilon-\delta'}{1-\epsilon}\right)^{2N} = \left(1 - \frac{\delta'}{1-\epsilon}\right)^{2N}

Letting \delta' = \frac{1}{2N} yields:

\mathbb{P}[f \geq \frac{1}{2N}+\epsilon \,|\, \mathrm{AC = 1 \; in \;} N] = \left(1-\frac{1}{2N(1-\epsilon)}\right)^{2N} \longrightarrow \exp\left[-\frac{1}{1-\epsilon}\right] \approx e^{-1}
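A quick numerical check of the ~1/e figure (mine; N and eps are arbitrary illustrative values), using the closed-form tail mass from the derivation:

N <- 10000; eps <- 1e-6
tailMass <- function(a) (1 - a)^(2 * N)        # proportional to P[f > a | AC = 1]
tailMass(1/(2*N)) / tailMass(eps)              # roughly exp(-1), about 0.37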

Anyway, the point is you have some posterior estimate f_{\mathrm{post}}(p) of the allele frequency spectrum given that you haven’t observed it in your sequencing. Any variant you haven’t seen yet has that posterior spectrum, and thus given a fixed proportion of phenotypic variance explained T, the odds ratios that such sites must have are constrained to be very large. In particular, for a 10% prevalence disease, using the above model, a 0.01% frequency variant would need an odds ratio of 400,000 to explain 1% of the liability-scale variance.

However, the fact that for such a rare variant, 1% phenotypic variance is achievable is really a breakdown of the liability-scale model. In particular, there is a ceiling on the variance explained by a mutation: a 0.01% variant that is fully penetrant only explains 0.1% of the prevalence of the trait. That is, at some point you know that any undiscovered single variant left at a locus, even if fully penetrant, cannot explain a significant fraction of the phenotypic variance.

On the observed (0/1) scale, such a variant of frequency f for a disease of prevalence K explains at most an f(1-f)/[K(1-K)] proportion of the phenotypic variance. Transferring this to the liability scale by Dempster's formula

h_{0/1} = t^2h_x/(pq) = t^2 h_x / [K(1-K)]

Now the genetic variance on the observed scale, \sigma_{g_{0/1}}^2 = \beta f(1-f) \leq f(1-f) as the slope is at most 1. \sigma_{p_{0/1}}^2 = K(1-K) by definition, and therefore h_{0/1} \leq f(1-f)/K(1-K). Obviously during the conversion to the liability scale, the phenotypic variance, which appears in both denominators, will cancel. Thus we have (for prevalence K):

f(1-f) = t^2h_x \Rightarrow h_x = f(1-f)/t^2.

Here, t = \phi(\Phi^{-1}(1-K)) = \phi(1.41) \approx 0.15, so for f = 0.0001 and 8% prevalence, h_x = 0.0045 or 0.45% of the phenotypic variance. This is a bit of a tawdry estimate, as such a variant equipped with OR=15 explains (by the other model) 0.14% of the phenotypic variance. Nevertheless, this transformation provides a reasonable approximation to an upper bound for rarer variants (but we should still search for better corrections to it). Thus, denoting by

\rho(f) = f(1-f)/\phi(\Phi^{-1}(1-K))^2

the function mapping a frequency to the maximum liability variance explained (e.g. that when the variant is fully penetrant), the probability that an undiscovered locus explains \geq T fraction of phenotypic variance is

\mathbb{P}[V \geq T] = \int_f \mathbb{P}[\rho(f) > T \,|\, f]\,f_{\mathrm{post}}(f)\,df < \int_{f_\mathrm{min}}^1 f_{\mathrm{post}}(x)\,dx < (1-f_\mathrm{min})\,f_{\mathrm{post}}(f_{\mathrm{min}})

where f_{\mathrm{min}} is the minimum f such that \rho(f) > T. Note that this expression is also < f_\mathrm{post}(f_{\mathrm{min}}), that is, bound by the posterior allele frequency distribution function evaluated at f_\mathrm{min}.

Now f_\mathrm{min} is easy to calculate: set V = f(1-f)/\phi(\Phi^{-1}(1-\mathrm{prev}))^2 = zf(1-f) so that

f_\mathrm{min} = \frac{1}{2} \pm \frac{1}{2z} \sqrt{z^2 - 4Vz}
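A small sketch of that calculation (mine; fmin is a hypothetical helper name, taking the smaller root):

fmin <- function(V, K) {
  z <- 1 / dnorm(qnorm(1 - K))^2        # z = 1/t^2 in the notation above
  (1 - sqrt(1 - 4 * V / z)) / 2         # smaller root of z * f * (1 - f) = V
}
fmin(0.01, 0.10)                        # on the order of a few times 1e-4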

Thus given the prevalence of a disease, there is an effective upper bound on the liability-scale variance explained, given that the observed 0/1 variance cannot exceed f(1-f). Here, we plot f_{\mathrm{min}} as a function of the variance explained

fmin_by_vexp_fixed

 

So let's go to our example. Let's say you've got a 10% prevalence disease, and have sequenced 50,000 individuals at a locus. You want to know the probability that there's a 1%-variance-explained mutation hiding somewhere in your population that you didn't discover. Then V=1% and t = 0.17 (so z = 1/t^2); plugging in gives f_\mathrm{min} \approx 0.00034. Using the posterior from above, f_\mathrm{post}(f_\mathrm{min}) < 1.5\times 10^{-16}. Here's the kicker: even if you've sequenced only 10,000 individuals, so long as you haven't seen the variant at all, the chance that a variant that could explain 1% of the variance exists is still < 1e-11. That means that for a 1e-6 "chance" to have missed such a variant, about 100,000 such variants would have to exist in the population you sampled. At the locus you're looking at. Quite reasonable on a genome-wide scale. Quite unreasonable for a gene.

In other words, if even a moderate amount of sequencing has been done on your gene, there are no more single variants left to find.

At least, if you care about single variants explaining a large portion of heritability. Other caveats: this assumes a homogeneous population; it could be that there are populations where such variants do exist, even at reasonable frequencies.

But wait, there’s more. Let’s say such a variant does exist. It’s completely penetrant, but it has a frequency of f < 1/20,000. For the sake of example, let’s say you’ve sequenced people from Dublin, and this variant at frequency 1/50,000 didn’t get into your study. Or maybe you did see a copy or two, it actually doesn’t matter. Suppose it’s causal dominant, you get the disease if you’re a carrier.

But here’s the deal. Dublin’s population is 1.3 million. Only 26 copies of this allele exist. You could sequence every person in Dublin and you wouldn’t have the power to associate it at 5e-7. Furthermore, there are constraints beyond population size: you need to actually collect the samples, and you need to pay for the sequencing. Thus, in addition to there not being single variants of high variance explained left to find, many of those things we haven’t found yet cannot be demonstrably associated. They are un-followupable.

The good news is, the things we have found can. But at some point the posterior allele frequency becomes so scrunched towards 0 that (on a single-variant level at least) you just have to forget about ever following those up statistically. At that point you’ve got to break out the crispers and engineer some cells.

Another Comment on Studies

The preceding discussion is basically the entirety of the reason sequencing studies have really dropped the ball regarding the genetic architecture of the traits they examine. Sequencing studies are effectively mini-GWAS studies, characterized by a Q-Q plot and a multiple-testing burden higher even than that of GWAS, with no clear understanding or mention of why sequencing was undertaken in the first place. “We wanted to discover all the variants present in our gene.” Okay fine. Given you’ve done that, do we ever need to sequence that gene again? What’s the probability that we missed something there that’s explaining a large fraction of the variance of the phenotype? You don’t even need phenotyped samples to answer that question, you can just use estimates from the 1000 Genomes Project!

This is what’s frustrating: genetic studies cherry-pick one or two novel top hits, and leave everything else on the table. The process of exclusion has been entirely ignored. If we went back, and actually analyzed the data with a careful eye, we probably would have completely excluded hundreds, maybe thousands of loci from consideration. Certainly thousands of single variants.

The Multi-Variant View

My little (now reaching ~3500 words) invective, up til now, has been focused on a single-variant view of heritability (variance explained). This view, while helpful for talking about what can be true about the properties of single variants, is less helpful about talking about what can be true about sets of variants. Things like “variants in SOX9” (a very cool gene), or “variants in open chromatin regions near actively transcribed genes in renal cells.” This section is basically an extension of the previous discussion to groups of variants; but this alteration presents some novel challenges: the nonlinearity of heritability estimates, the large state space of multi-variant genotypes, and the statistical brutality of linkage disequilibrium.

Expanded Multivariant View

Now, rather than having a single variant \nu, one has multiple variants \nu_1,\dots,\nu_n, with frequencies f_i, and effects \beta_i. It’d be very nice (and indeed, some papers do) to write

\rho(\vec \nu) = \sum_{i=1}^n \rho(\beta_i,f_i)

But unfortunately \rho is linear neither in \beta nor in f. First, \rho = V/(\sigma_e^2 + V) is decidedly nonlinear in V. Second, V is quadratic in both f and \mu, and \mu itself depends on f and \beta through the (nonlinear) logistic link. So forget that sum. Instead, it is straightforward to extend from a genotype to a multigenotype.

Multiple Variant Variance Explained

This follows from replacing the index (Ref,Het,Var) from the previous calculation with the multi-index ([Ref,Ref,…],…,[Var,Var…]). For convenience, we'll wrap the multi-index to a univariate index i \in \{1,\dots,3^n\}, and denote by G_j \in i the genotype of the jth variant in the ith multigenotype. Then:

\mathbb{P}[A|i] = [1+\exp(-(\mu_0 + \sum_{G_j \in i} G_j\beta_j))]^{-1} = P_i(\mu_0)

The frequency of multigenotype i we denote by F_i, which, in the case of full HWE with no LD is

F_i = \prod_{G_j \in i} \binom{2}{G_j} f_j^{G_j}(1-f_j)^{2-G_j}

\mathrm{Prevalence} = \sum_{i=1}^{3^n} P_i(\mu_0)F_i

Then using \mu_T = -\log(1/\mathrm{Prev}-1)

V = \sum_{i=1}^{3^n} (\mu_i-\mu_T)^2F_i \;\;\;\; \mathrm{where} \;\; \mu_i = \mu_0 + \sum_{G_j \in i} G_j\beta_j

\rho = \frac{V}{\pi^2/3 + V}

The only issue is the sum. As in the case of power (which requires massive sums over genotype space), we can get around this more efficiently via simulation. We can exploit the fact that we can set \mu_0 = 0 and allow T to shift. It's easier also to use the standard normal rather than the logistic for this (as a sum of normals is still normal). A simulation would then be:


# uses the logistic/invlogistic helpers defined with diseaseProbs above
multiVarSim <- function(frequency, odds, prevalence, logit = TRUE, prec = 0.01) {
  bt <- log(odds)
  invlink <- qnorm
  link <- pnorm
  if ( logit ) {
    invlink <- invlogistic
    link <- logistic
  }
  # this part can be done with replicate, but I enjoy vector algebra
  nRuns <- round(10/prec, 1)
  drawGeno <- function(f) { rbinom(nRuns, size = 2, prob = f) }
  S <- as.matrix(sapply(frequency, drawGeno)) %*% as.matrix(bt)
  # now u_0 needs to be such that the total prevalence is prevalence
  # this means that sum(link(S + u0))/length(S) = prevalence
  cPrev <- function(muR) {
    sum(link(S + muR))/length(S) - prevalence
  }
  res <- uniroot(cPrev, lower = invlink(0.001), upper = invlink(0.999))
  muR <- res$root
  S <- S + muR
  # now remember we care about the variance around muP; this returns V
  # (rho is then V/(pi^2/3 + V) for the logistic model, V/(1 + V) for the probit)
  sum((S - invlink(prevalence))^2)/length(S)
}

The only trouble with using the multi-genotype approach is that the state space is huge: for n variants, there are 3^n genotype assignments, so forget about looking at 15 variants, let alone hundreds. As mentioned, simulation can overcome this setback, but simulations themselves are expensive.

But even beyond that, when one moves from calculating variance explained by several variants to bounding variance explained by several variants, the numerics become challenging. In particular, one moves from a one-parameter posterior distribution (two-parameter if taking uncertainty of the frequency into account) to an n-parameter distribution (2n if accounting for frequency uncertainties), with non-axial limits of integration. That is, while for a single variant

\rho_f^{-1}(V)_u produced a single point \beta_u : \rho(\beta_u,f) = V, for multiple variants

\rho_{\vec f}^{-1}(V)_u produces a curve: \{\vec\beta_u : \rho(\vec\beta_u, \vec f) = V\}

That is, you can reduce the effect size of one variant, while increasing the effect size of another variant to compensate. This is a situation where one might use Monte Carlo simulation to evaluate the integral: but note that for each point of the Monte Carlo, a simulation needs to be run to get the variance explained. This is terribly costly.

Another way to bound the probability is to solve the constrained optimization

\mathrm{min} \; -\log \mathbb{P}[\vec \beta_{\mathrm{true}} = \vec \beta_u \,|\, \beta^*,f]

s.t. \;\; \rho(\beta_u,f) = V

The negative log-likelihood is (luckily) convex, so this isn't excessively difficult to do. Let p^* be the p-value at that most-likely point. Then

\mathbb{P}[V_{\mathrm{true}} > V | \beta^*,f^*] \leq 1-\Phi(\Phi^{-1}(1-p^*))

Nevertheless, even this is very costly. For every evaluation of \rho by the optimizer, a simulation needs to be done to estimate its value; even this approximation is really expensive. In addition, LD generates additional difficulty: haplotypes need to be simulated rather than genotypes (in order to estimate \rho), and if the \beta^* were fit independently rather than jointly, both the mean and the variance-covariance matrix of the posterior distribution need to be adjusted to account for LD.

This, combined with the problem of estimating effect sizes from low-frequency variants, leads to methods of collapsing variants across a locus into scores.

Collapsed Multivariant View

Collapsing is a means of re-writing the multiple-variant joint model. Instead of

\mathrm{lgt}(Y) = \sum_i \beta_i X_i + \sum_i \gamma_i C_i + \alpha

we write X as a matrix and have

\mathrm{lgt}(Y) = \tau X\xi + \sum_i \gamma_iC_i + \alpha

Where \xi is a fixed weight vector; for the calculation of variance explained, \xi_i = \beta_i. The point is that \xi itself is not fit, only the coefficient \tau, which equals 1 when \xi = \vec\beta. Collapsing tests ("burden" tests) don't have access to \beta_i, but instead try to "guess" using some function \xi = \Xi(\vec \nu) of the variant properties (like frequency, estimated deleteriousness, and so on).

One thing that is immediately recognizable is that if \xi = \vec \beta and \tau = 1, then the formula for \mathbb{P}[A|i] becomes

\mathbb{P}[A|i] = [1+\exp(-(\mu_0 + \sum_{G_j \in i} \beta_j G_j))]^{-1} = [1 + \exp(-(\mu_0 + \vec G_i^{\dagger} \xi))]^{-1}

So in other words, the score vector \vec S = \tau X\xi is an estimate of the latent liability variable for all of the samples. This allows the following simplifying assumption:

S_i \sim N(\mu,\sigma^2)

for some \mu and \sigma. This is more-or-less true, as the X_{\cdot,j} are binom(2,f_j) variables, and so a weighted sum of many of them should be approximately normal. Therefore, for a given \tau, \xi:

Variance Explained by a Collapse Test

We have \mathbb{P}[A|S] = 1-\Phi(T-\tau S). Here we use a normal latent model because it plays nicely with the assumption that S is normal. Then

\mathrm{Prevalence} = \int_{-\infty}^\infty \left(1-\Phi(T - \tau S)\right)\frac{1}{\sigma_{\tau S}}\,\phi\!\left(\frac{\tau S - \mu_{\tau S}}{\sigma_{\tau S}}\right) d(\tau S)

As before, T can be found by numerical quadrature. Then

V = \mathrm{var}(T - \tau S)

and \rho(S,\tau) = V/(1+V).

Note that this can be simplified by talking about a variable R = \tau S, but it's nice to clarify what \tau does (which is scale the proposed score distribution so it aligns with the underlying liability explained).

To use this as an approximation to the variance explained by multiple variants, note that S is a linear function, and therefore (assuming no LD and no relatives):

\mathbb{E}[S] = \mathbb{E}[X\xi] = \mathbb{E}[X]\xi \rightarrow \mathbb{E}[S_i] = \sum_j 2f_j\beta_j

\mathrm{Var}[S_i] = \sum_j 2\beta_j^2 f_j(1-f_j)

Expressions which can be adapted for the presence of LD and relatedness.

This gives \mu_{\tau S} = \tau \sum 2f_j\beta_j and \sigma^2_{\tau S} = \tau^2 \sum 2\beta_j^2 f_j(1-f_j)

For an evaluation of “true” variance explained, set \tau = 1. To identify the variance explained by a (possibly wrong) guess \xi, replace all instances of \beta_j with \xi_j.
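A sketch of the collapsed calculation (mine; collapsedRho is a hypothetical helper, assuming HWE, no LD, and a normal score distribution):

collapsedRho <- function(xi, f, K, tau = 1) {
  muS <- tau * sum(2 * f * xi)                       # E[tau * S]
  sdS <- tau * sqrt(sum(2 * xi^2 * f * (1 - f)))     # SD of tau * S
  gap <- function(thr) {
    integrate(function(s) (1 - pnorm(thr - s)) * dnorm(s, muS, sdS),
              muS - 8 * sdS, muS + 8 * sdS)$value - K
  }
  thr <- uniroot(gap, c(muS - 10, muS + 10))$root    # threshold matching the prevalence
  V <- sdS^2                                         # var(thr - tau * S)
  V / (1 + V)                                        # unit-variance normal error term
}
# e.g. collapsedRho(xi = rep(0.2, 5), f = rep(0.05, 5), K = 0.1)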

Using a collapsed estimate of the variance explained has the benefit of being univariate. That is, for a given (either guess or true) \xi, then \rho = \rho_\xi(\tau,f), and thus it has a univariate inverse: \rho_{\xi,f}^{-1}(\rho) = (\tau_l,\tau_u).  This enables the bounding of a locus or burden test through the posterior distribution of the parameter \tau, just as we did with a univariate \beta in the first section. That is, given \xi (and assuming we don’t need to integrate over the posterior of f_i which is fairly painful in this case), to query the probability of a locus explaining Q or more of the phenotypic variance:

Calculate (\tau_l,\tau_u) = \rho_{\xi,f}^{-1}(Q)

Calculate p = \Phi( (\tau_l-\tau^*)/\mathrm{SE}_\tau ) + 1-\Phi( (\tau_u - \tau^*)/\mathrm{SE}_{\tau})
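As a sketch of those two steps (assuming the symmetric closed-form inverse that follows from the V = \tau^2\mathrm{var}(S) simplification above; \tau^* and its standard error would come from the fitted burden regression, and the function names are mine):

```python
import numpy as np
from scipy.stats import norm

def tau_bounds(Q, var_S):
    """Invert rho_xi(tau) = tau^2 var(S) / (1 + tau^2 var(S)) at Q; under this
    simplification the inverse interval (tau_l, tau_u) is symmetric about 0."""
    tau_u = np.sqrt(Q / (1.0 - Q) / var_S)
    return -tau_u, tau_u

def prob_explains_at_least(Q, var_S, tau_hat, se_tau):
    """P[locus explains >= Q of liability variance], using a normal
    approximation N(tau_hat, se_tau^2) to the posterior of tau."""
    tau_l, tau_u = tau_bounds(Q, var_S)
    return norm.cdf((tau_l - tau_hat) / se_tau) + 1.0 - norm.cdf((tau_u - tau_hat) / se_tau)
```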

However (!), the obvious approach (setting \xi_i = \beta^*_i, where \vec \beta was fit jointly by MLE on the same data) has the drawback that \tau is inflated by overfitting; in fact, in that case \tau = 1 by construction. One way around this is to use \beta_i estimated from a different source. Alternatively, let \xi be a hypothesized score (e.g. a burden weight vector). It’s important to note in the latter case that bounding the variance explained by a burden test does not bound the maximum possible variance explained at a locus; it bounds the variance explained by the hypothesized \xi. The variants at the locus could still contribute substantially to trait heritability, just in a direction orthogonal to \xi (of which there are n-1).

Surely the many things I haven’t seen explain lots of heritability in bulk, right? Right?

The last thing to check off in the locus view, now that we can bound the variance explained by (1) multiple variants and (2) burden scores, is the probability that at said locus there exist many variants that, for whatever reason, were not discovered, but that in aggregate explain a large proportion of heritability. As in the single-variant case, you can put bounds on these things; but more importantly, consider what it means to want the answer to this question (as anything more than an exercise in probability).

It’s been a long post. Let’s take a step back.

Human medical genetics is science with an ulterior motive, where the search for additional causal variants is often explicitly motivated by the statement that new loci provide new targets for pharmaceutical treatments. And, by and large, this is a positive force for the advancement of disease biology and the understanding of human disease. At the same time, I believe that this Quest For Treatment, combined with the increasing focus on impact, explains a wide variety of phenomena in the medical and statistical genetics community, from hasty and opportunistic study design to the ENCODE backlash (and the resurgence of Big vs Small science).

Sequencing is a good example. Initial GWAS provided a good number of variants and loci with strong disease associations. The bulk of these associations, however, were not immediately interpretable: falling between genes (unclear which protein might be affected), or in a noncoding part of a gene (so not obviously gain- or loss-of-function). In part because of the difficulty of interpreting biological effects, and in part because of the worry of a winner’s-curse effect, we started (rather uncritically) to refer to these variants as “tag-SNPs”, hypothesizing that the true causal variant was unlikely to have been ascertained, as the bulk of human variation is uncommon to rare and not present in the set of imputed loci, and furthermore is likely to have an interpretable biological impact. In retrospect, it seems rather clear that common variants which maintain their association across populations are unlikely to be tagging a “true causal variant” of lower frequency, which is far more likely to be absent in other populations. (That said, one proof of an elegant experiment is that the results reveal something that is obvious in retrospect.)

I think the shift from array-based GWAS towards sequencing-based studies reflects both the optimism of researchers, and the entrance of impact as a scientific consideration. Should loss-of-function mutations underlie even 10% of the GWAS hits, we’d then have generated a respectable number of druggable targets, opening up new paths towards treatment. Even having seen the (largely negative) results, and even as I agree with Kirschner’s characterization of impact, I still think these were the right designs to follow up on the results of GWAS.

But what it does mean is that these studies weren’t designed to add to the knowledge of genetic architecture. Why didn’t we sequence only cases that had none (or very few) of the known genetic risk factors? Because at the same time we wanted to see if there were undiscovered variants underlying those GWAS signals. Why do we sequence exomes instead of the regions near GWAS hits? Because not only do many GWAS loci fall near to or within genes, but protein-coding variants, in particular loss-of-function variants, provide a very clear target for both follow-up lab experiments and potential therapeutic developments. In addition, exome sequencing is cheaper, so one can sequence more samples to find more rare variants and have more power to associate them.

So, really, the question “What’s the likelihood that there are multiple variants that I have yet to ascertain that in aggregate explain a substantial fraction of phenotypic variance?” is a continuation of this rather incautious optimism about 1) the strictly additive model holding, and 2) the infinitesimal (“Quantitative”) model not holding, to say nothing of 3) unbiased estimates of heritability. For a common disease, there are reasons on all three fronts to cautiously articulate hypotheses before going any further in the Quest for Treatment.

As mentioned in the last section on bounding the variance explained by undiscovered variants, we’ve now gotten to territory where undiscovered variants are so rare that in order to explain even 0.1% of the phenotypic variance they have to be effectively 100% penetrant. The fact that common diseases are not known to present as Mendelian in a substantial portion of cases, as the assumption of extremely rare (or private) variants of strong effect would suggest, indicates that such variants are not likely to account for a large fraction of disease prevalence. At the same time, if these variants are not of high effect but together account for a large fraction of heritability, the model isn’t really distinguishable from the Quantitative model: every variant is causal, and the effect sizes are negligible.

At the same time, the Latent Variable Model is just an approximation, and while it alleviates some of the issues of treating a binary trait as a real-valued trait (with actual values of 0 and 1), it does not eliminate them. That is, phantom epistatic effects are mitigated, but not eliminated (because the latent variable need not be normal, though we assume so). This not only impacts estimates of heritability, but also indicates that the purely additive model is mis-aligned, and there might be a considerable gain (in terms of understanding heritability estimates) from carefully considering what happens in individuals with many genetic risk factors. At the same time, both association scans and association study experimental designs tend not to consider maternal, dominance, additive-additive, or additive-environment effects. Which seems like a more fruitful use of research money: sequencing many thousands of samples in the hope of finding a fully penetrant f=1/30,000 variant, or collecting a wide array of phenotypes and environmental data to better understand endogenous and exogenous disease covariates and gene-by-environment effects? Or buying a cluster (or access to Amazon’s) to scan for additive-additive effects?

In sum: none of this proves that a locus can’t harbor undiscovered variants that explain some given fraction of phenotypic variance. One can bound that probability just as in the single-variant section. But the point is: think about why such a bound is necessary. Is the working hypothesis to be tested really that there are high-penetrance near-private variants lurking in the genome? Is that really the hypothesis Occam’s Razor would suggest, given all of the observations in this post?

I, for one, don’t think so.

June 10, 2013

On meta-analysis: Rescuing the power of multiethnic studies

Filed under: Uncategorized — heavytailed @ 3:53 pm

Last year, I wrote about the effect that oversampling (in a case-control study) has on the power of a study (using the Score test). I recently updated that post with a bit more information regarding the difference between evaluating the test statistic under various choices of the input space, and calculating power. In this post, we move from the analysis of a single study, to the analysis of multiple studies: meta-analysis. This has particular applications to multiethnic genetic studies, as one typical approach to testing for genetic association is to test for association within each population, and meta-analyze genetic mutations across the populations in which they appear. By virtue of genetic drift, selection, and isolation-by-distance, many variants will be unique to a single population, and many variants will show large frequency differences between them. We will find that this structure significantly impacts the power of standard meta-analysis approaches, identify the cause, show why alternate methods work, and propose a novel method that performs slightly better in this particular context.

Meta-Analysis

The need for meta-analysis typically arises when several studies have studied the same phenomenon, and (more importantly) asked the same question of data sampled in potentially slightly different ways. “The same” here means, statistically, having tested the same null hypothesis against the same alternate hypothesis.[1] While p-values are often interpreted as probabilities, they are in fact random variables, and one can get into extreme danger by forgetting this; in particular, it is almost never the case that p_{meta} = 1-(1-p_1)(1-p_2)\dots(1-p_n). Instead, one can treat \vec p, the p-values for the studies, as a draw from a multivariate distribution. Along the same lines, one might treat the test statistics \vec T and their variances \vec V as draws from multivariate distributions. One can then ask the probability of these draws under a suitably defined null distribution (in particular, p_i \sim U(0,1) and T_i \sim N(0,V_i)).

Methods for Meta-Analysis in Genetics

Let’s say you have N studies, producing p-values p_i and (score) statistics T_i with variances V_i. Within each study, the variant of interest has frequency f_i, which can change between study populations. Then

Fisher’s Method:

X \leftarrow -2\sum\log(p_i)

X \sim \chi^2(2N)

Sum of Uniform:

U \leftarrow \sum p_i

U \sim N(N/2,N/12)

Inverse Variance Weighting:

W_i \leftarrow 1/V_i

Z \leftarrow \frac{\sum W_iT_i}{\sum W_i}

Z \sim N(0,1/\sum W_i)

Sample Size Weighting:

W_i \leftarrow \sqrt{N_i}

Z_i \leftarrow T_i/\sqrt{V_i}

Z \leftarrow \frac{\sum W_i Z_i}{\sqrt{\sum W_i^2}}

Z \sim N(0,1)

Inverse Covariance:

X \leftarrow T^\dagger \mathrm{diag}(V)^{-1} T

X \sim \chi^2(N)

Random Effects:

Model the between-study mean and variance as (\mu,\tau) then estimate

\mu_{n+1} = \frac{\sum \frac{T_i}{V_i + \tau_{n}}}{\sum \frac{1}{V_i + \tau_{n}}}

\tau_{n+1} = \frac{\sum \frac{(T_i-\mu_{n+1})^2 - V_i}{(V_i + \tau_{n})^2}}{\sum\frac{1}{(V_i + \tau_{n})^2}}

And formulate a likelihood ratio X as

X \leftarrow \sum \log \frac{V_i}{V_i+\tau} + \sum \frac{T_i^2}{V_i} - \sum \frac{(T_i - \mu)^2}{V_i + \tau}

X \sim 0.5 \chi^2(1) + 0.5 \chi^2(2)
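For reference, here is a minimal sketch of several of these statistics (matching the forms above; the random-effects iteration is omitted for brevity, and the function and key names are mine):

```python
import numpy as np
from scipy.stats import norm, chi2

def meta_stats(T, V, p, N):
    """Meta-analysis p-values for several of the approaches above.
    T: per-study score statistics, V: their variances, p: per-study
    (two-tailed) p-values, N: per-study sample sizes."""
    T, V, p, N = map(np.asarray, (T, V, p, N))
    k, Z = len(T), T / np.sqrt(V)
    out = {}
    # Fisher: -2 * sum(log p) ~ chi^2(2k) under the null
    out['fisher'] = chi2.sf(-2.0 * np.sum(np.log(p)), 2 * k)
    # Sum of uniforms: sum(p) ~ N(k/2, k/12); small p-values push the sum low
    out['sum_uniform'] = norm.cdf(np.sum(p), loc=k / 2.0, scale=np.sqrt(k / 12.0))
    # Inverse variance weighting of the un-normalized statistics
    W = 1.0 / V
    out['ivw'] = 2.0 * norm.sf(abs(np.sum(W * T)) / np.sqrt(np.sum(W)))
    # Sample-size (Stouffer-style) weighting of the standardized statistics
    w = np.sqrt(N)
    out['sample_size'] = 2.0 * norm.sf(abs(np.sum(w * Z)) / np.sqrt(np.sum(w**2)))
    # Inverse covariance: sum of squared standardized statistics ~ chi^2(k)
    out['inv_cov'] = chi2.sf(np.sum(T**2 / V), k)
    return out

# e.g. two studies, per-study p-values derived from the score statistics
T, V, N = [2.1, 0.4], [1.0, 1.3], [2000, 2000]
p = [2 * norm.sf(abs(t) / np.sqrt(v)) for t, v in zip(T, V)]
print(meta_stats(T, V, p, N))
```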

Comment:

One needs to be careful when applying Fisher, Sum Uniform, or Inverse Covariance as these can be two-tailed meta analyses. In particular, for Fisher and Sum Uniform, if all the p-values were generated by two-tailed tests, then the Fisher and SU meta-statistics are also (akin to) two-tailed, i.e. there’s no checking that the statistics from the individual studies are directionally consistent. The Inverse Covariance approach, being an inner product, also has this property: a large negative statistic will not “cancel” a large positive statistic. Nevertheless, even if you expect directional consistency, finding two large statistics of opposite signs does suggest something’s going on, if only artifactual.

Ideally the meta-statistics will be 1) Calibrated or conservative under the null, and 2) Sensitive to the alternate hypothesis. Not much more to it than that.

[1] Likelihood ratio tests are a common way to violate the second part of this condition: it’s easy to include different explanatory variables, which means the models generating the LR may differ between the two studies.

Evaluating the Power of Meta-analyses

Following the approach in the previous post, we can calculate score statistics within two studies (or two populations) where the variant of interest potentially has different frequencies. Each of these statistics can then be combined into a meta-statistic and meta-p-value, and this process can be repeated to find the power of the meta-analysis at a particular false-positive level (in this case, alpha is 5\times 10^{-6}). Doing this with Inverse Variance Weighting reveals a startling phenomenon:

\begin{array}{c|c|c|c} f_1 & f_2 & \mathrm{OR} & \mathrm{Power} \\ \hline 0.03 & 0 & 2.4 & 92.4\% \\ \hline 0.03 & 0.005 & 2.4 & 42.5\% \\ \hline 0.03 & 0.03 & 2.4 & 100\% \end{array}

That’s right. Seeing the variant more times (in another population) kills your power if the variant is of a lower frequency in the second population. What?!

Okay, so what is going on here? Let’s take a careful look at the realized distributions (since we’re generating simulated genotypes in order to estimate power anyway, these are just sitting in system memory). First we plot the test statistics (under the alternate hypothesis) within each of the populations, low-frequency in red, high-frequency in blue. Adjacent, we plot in green the meta-statistic that results from 1) multiplying the statistics in the first plot by sqrt(variance) [this recovers the un-normalized test statistic], 2) weighting the results by 1/var, and 3) normalizing by the root of the sum of 1/var. (In other words, the IVW meta-analysis statistic.)

[Figures: TwoPop_Assoc_OR2.4, TwoPop_Assoc_OR2.4_META – per-population score statistics under the alternate, and the resulting IVW meta-statistic]

The vertical dotted line is the Z-value associated with our alpha-level (basically \texttt{qnorm(1-5e-6)}). By averaging the two distributions on the left in this way, we will (obviously) get a distribution “between” the two, one that has more power than the red distribution, but less than the blue. Thus, weighting like this kills your power in multiethnic studies, or for meta-analyses where the variant frequency is different between the studies you’re analyzing.

This motivates a search for potentially “better” meta-analyses, ones that might outperform inverse-variance weighting. Ideally, these methods should perform about as well as or better than \min(\vec p), while remaining calibrated under the null hypothesis. Given the above, clearly inverse-variance weighting is going to perform worse than \min(\vec p). We might first identify well-calibrated statistics by simulating under the null. Let there be N populations (or studies). For each one (i), draw a V_i from a chi-squared distribution, and then draw a T_i from N(0,V_i). We consider the cases of 5 and 10 populations, under a high- and a low-variance setting. The distributions of variances under the “high” and “low” heterogeneity settings look like:

[Figures: highvariance_realized, lowvariance_realized – realized variance distributions under the high- and low-heterogeneity settings]

And the distributions of the meta-statistics under these scenarios look like (for 5 populations):

[Figures: MetaDist_NullSim_HighVar_05Pop, MetaDist_NullSim_LowVar_05Pop – null distributions of the meta-statistics, 5 populations]

And for 10 populations:

[Figures: MetaDist_NullSim_HighVar_10Pop, MetaDist_NullSim_LowVar_10Pop – null distributions of the meta-statistics, 10 populations]

Inverse-variance weighting appears conservative (deflated) relative to all the other statistics. Fisher’s method is overly aggressive when the variances are allowed to change drastically between populations, and the sum-of-uniform has this property as well, but becomes better calibrated as the number of populations increases. Sample-size weighting (assuming 2000 samples, with no relation to the variance) is uniformly over-aggressive. By contrast, the random effects model is only slightly conservative (it was made for heterogeneity, after all), and the Inverse Covariance method is well-calibrated throughout.
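For concreteness, here is a sketch of that null simulation, re-using the meta_stats() helper sketched earlier; the chi-squared degrees of freedom standing in for the high/low heterogeneity settings are my own choice:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def null_meta_pvalues(n_pop=5, n_sim=20000, df=3):
    """Simulate the null just described: V_i ~ chi^2(df), T_i ~ N(0, V_i),
    then collect the meta p-values (using meta_stats() from the sketch above)
    for QQ plotting against U(0,1).  df is a stand-in for the high/low
    heterogeneity settings."""
    keys = ['fisher', 'sum_uniform', 'ivw', 'sample_size', 'inv_cov']
    sims = {k: np.empty(n_sim) for k in keys}
    for s in range(n_sim):
        V = rng.chisquare(df, size=n_pop)
        T = rng.normal(0.0, np.sqrt(V))
        p = 2 * norm.sf(np.abs(T) / np.sqrt(V))
        res = meta_stats(T, V, p, N=np.full(n_pop, 2000))
        for k in keys:
            sims[k][s] = res[k]
    return {k: np.sort(v) for k, v in sims.items()}
```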

However: when running a meta-analysis, you don’t get provided with the actual variances of the distributions from which the test statistics were drawn; you’re instead provided with an estimate. In addition, this estimate will covary with the test statistic itself. In the case of the score test, as the (observed) frequency increases, both the computed value and the variance of the statistic increase. Performing a simulated meta-analysis with OR=1 (so no association) generates the following null calibrations:

[Figures: meta_approaches_OR1.0_highVar, meta_approaches_OR1.0_lowVar – null calibration of the meta-analysis approaches at OR=1, high- and low-variance settings]

Where here, the “high variance” environment (left) has frequencies of (20%, 0.5%) in the two populations, while the low variance environment (right) has frequencies of (5%, 0.5%). All of a sudden, the meta-analysis statistics (except for the random effects) all become reasonably calibrated under the null hypothesis, strikingly even the sample-size weighted meta-analysis! The black lines at the top and bottom are “min(P)” and “max(P)” respectively – falling outside those bounds suggests one may be overly conservative or overly aggressive, at least under the null. Of course, power is all about the alternate, so what happens if we raise the odds ratio:

[Figures: meta_approaches_OR1.3_highVar, meta_approaches_OR2.0_lowVar – meta-analysis p-values under the alternate, OR=1.3 (high variance) and OR=2.0 (low variance)]

On the left we’ve set OR=1.3, and OR=2 on the right. I’ve plotted on the x-axis “Min(P)” – the minimum p-value in the two populations, as a kind of baseline target for the sensitivity of the meta-analysis. So what do these plots tell us? First, we find a confirmation that inverse variance weighting is just not as sensitive to the alternate hypothesis as other approaches, and that this overconservativeness is compounded when the variance is high. Second, we find that sample-size weighting performs far better, and is comparable to the random-effects meta-analysis, though it underperforms at variants of large frequency difference. Finally, we find that both the Inverse Covariance and Fisher’s method perform comparably to min(P), while remaining calibrated under the null hypothesis.

Take-home messages: don’t inverse-variance weight in this circumstance. Even sample-size weighting is preferable, though random effects performs just as well. However, it’s even more advisable to be “old school” and use Fisher’s method or Inverse Covariance. The last observation is that frequency information carries a lot of gain. The red curve (“Fixed (SF)”) is sample-size weighting that takes frequency into account: W_i = \sqrt{2N f_i(1-f_i)}. Simply providing that information (granted, it’s the *actual* frequency, not the *observed* frequency) to the weighting drops the p-values by orders of magnitude, drastically increasing power under the alternate hypothesis.

Note: Inverse Inverse Variance Weighting

Most meta-analyses use the Wald statistic to calculate the final meta-statistic. This differs from the above, where we have used strictly the Score statistic. The Score statistic is such that, for a single-parameter logistic regression, it can be calculated in closed form; by contrast, the Wald statistic requires that both the intercept and the variable coefficients be fit under the alternate hypothesis, and that fit has to be done numerically. They also relate in the following way: we know from the derivation in my prior post

(\tau^*-\tau_t) \sim N(0,-[\nabla^2 \mathcal{L}]^{-1}) (Wald)

(\lambda - \tau_t) \sim N(0,-\nabla^2\mathcal{L}) (Score)

In other words, the variances of the Score and Wald tests are inverses of each other. This suggests that Inverse Inverse Variance Weighting may perform appropriately for the Score test as a meta-analytic statistic.

Inverse Inverse Variance

W_i \leftarrow V_i

Z \leftarrow \frac{\sum W_iZ_i}{\sqrt{\sum W_i^2}}

Z \sim N(0,1)

Replacing IVW with IIVW does indeed have the desired effect, and it performs comparably to Fisher and Inverse Covariance.

[Figures: meta_approaches_OR1.0_highVar_IIVW, meta_approaches_OR1.3_highVar_IIVW – null calibration and OR=1.3 performance, including IIVW]

I suppose this is a whole lot of hubbub over the fact that the Score and Wald tests have inverse variances. I guess, except that one can find regimes where IVW is terrible but IIVW works alright, and the reverse. For example, while IIVW works well in the above case (frequencies: 20%, 0.5%; 2000 samples; balanced case/control), it works miserably in the case below (frequencies: 5%, 3%, 1%, 0.5%; samples: 1500/500, 750/750, 600/200, 1000/1000):

[Figure: IIVW_stinker – a setting where IIVW performs poorly]

Here, IVW is the blue, while the orange-red is IIVW. Given the inverse relationship between Wald and Score, this implies that there will be regimes for the Wald statistic where inverse-variance weighting performs poorly as well. One thing to note is that Sample-Frequency weighting as well as Inverse Covariance (and Fisher’s method) perform well throughout.

June 5, 2013

Back to Biology: Confounding, Bias, and Dichotomous Traits

Filed under: Uncategorized — heavytailed @ 2:56 am

(Update 6/2013 – I’ve edited and extended this old post from 10/2012. I had begun writing a new related post, and decided the material was better placed within this one as an extension.)

Two recent observations set me down a dark and lonely road, and they are unsurprisingly related. They both have to do with sampling and the kinds of confounding biases resulting from uneven sampling with a dichotomous trait, or even in a quantitative trait. The first is the observation that genetic studies (typically) have differential power to find protective and risk variants, an observation which may explain the lack of druggable targets, and perhaps even the ghost of “missing heritability” (although I really doubt that this alone could explain the latter). The second is the observation that correcting for stratification wreaks havoc on rare variant tests.

I think neither of these things is surprising on its own, but I feel it’s worth fleshing out the underlying root of both problems. It all boils down to study design, and more specifically to sampling: covariation between the dichotomous trait and sample group, or inflation of group representation above its natural proportion (that is, inflating or deflating group proportions).

Digression: Deriving the Score Statistic

So it turns out I’m really bad at remembering formulae. I blame Wikipedia. And while I wouldn’t pull out limits of Riemann sums to derive integrals of the form

\int \frac{1}{\sqrt{a + bx^2}}dx

understanding the Score (and also: Wald and LRT) test is important enough that it’s worth knowing the derivation. This is taught in every introductory graduate statistics course, but is present in the literature in such a kaleidoscopic number of forms that it’s almost necessary to know the derivation not to be confused, for instance, when a paper writes “it can be shown[1] that the variance of this statistic is” (some formula that doesn’t look quite right). The derivation is a rather fun exercise. The key thing to remember is that “Score” is a terrible name for this test. I think the econometricians got it right: it’s a Lagrange Multiplier Test.

So you have some parametric distribution (typically GLMM, but it will work for any parametrized distribution) that has a likelihood function \mathcal{L}(\tau,S), which is a function of your parameter vector \tau and your sample S. “Fitting” your data is the process:

\tau^* = \mathrm{argmax}_\tau \mathcal{L}(\tau,S)

which, by definition, and so long as \mathcal{L} is smooth in \tau, is guaranteed to have the property

\left.\frac{\partial \mathcal{L}(\tau,S)}{\partial \tau}\right|_{\tau^*} = 0

noting that this is the zero vector. From the standpoint of optimization, what does the hypothesis \tau_4 < 0 mean? That is to say, if I asked for the maximum likelihood (and corresponding \tau^*) for which \tau_4 < 0, what would we do? We’d restrict the domain of \tau_4. We’d just do a constrained optimization! That is:

\tau^*_{\mathrm{constrained}} = \mathrm{argmax}_{\tau} \mathcal{L}(\tau,S) \;\; s.t. \;\;\tau_4 < 0

Indeed we can expand to arbitrary constraints (such as: \tau lies in the unit ball) quite easily, by taking an arbitrary vector-valued constraint function:

\tau^*_{\mathrm{constrained}} = \mathrm{argmax}_{\tau} \mathcal{L}(\tau,S) \;\; s.t. \; \; g(\tau) = 0

You might imagine breaking \tau into constrained components and unconstrained components, but this turns out not to be too helpful. We need to make use of the standard result involving maximum likelihood: namely that

(\tau^*-\tau_t) \sim N(0,-[\nabla^2 \mathcal{L} |_{\tau_t}]^{-1})

Which means that the fitted (unconstrained) parameters vary jointly, and thus the variance of \lambda will necessarily involve the unconstrained components. We constrain the likelihood by a function g(\tau), and thus our Lagrangian is

\mathcal{L}(\tau) + g(\tau)\lambda

which gives first order conditions

\mathcal{L}_\tau|_{\tau_\lambda} + \nabla g|_{\tau_\lambda}\lambda =0 \;\; \;\;\; g(\tau_\lambda) = 0

Where \tau_\lambda is the maximum of the constrained optimization. By contrast, we’ll call \tau^* the unconstrained maximum, and \tau_t the true value of \tau.

A very typical form for g(\tau) is g(\tau) = \tau, which constrains this set of parameters to 0. In this case the first order conditions have a particularly nice form: \mathcal{L}_{\tau_i} + \lambda_i = 0 for each constrained component; and in particular if \tau is a scalar, this gives the equation \lambda = -\mathcal{L}_\tau, which (up to sign) gives rise to the standard nomenclature: the partial derivative of the likelihood with respect to a given parameter is the “score” for that parameter.

At any rate, we want to get around to testing that score, which means somehow we need a value for \mathcal{L}_\tau|_{\tau_\lambda}. So we’ll Taylor expand around \tau_t:

\mathcal{L}_\tau|_{\tau_\lambda} = \mathcal{L}_\tau|_{\tau_t} + \nabla^2\mathcal{L}|_{\tau_t} (\tau_\lambda - \tau_t)

Now while we don’t know much about \tau_\lambda - \tau_t, we do know the distribution of \tau^*-\tau_t, so we examine another expansion:

\mathcal{L}_\tau|_{\tau^*} = \mathcal{L}_\tau|_{\tau_t} + \nabla^2\mathcal{L}|_{\tau_t}(\tau^* - \tau_t) \rightarrow_{H0} N(0,-\nabla^2 \mathcal{L}|_{\tau_t})

So if g(\tau) is such that \nabla g_i \in \{0,1\}, then we’re effectively done: under the null hypothesis the constraints do not bind, so the multipliers \lambda_i attached to the constrained parameters inherit the distribution of the corresponding scores at \tau^*, while the unconstrained parameters are simply fit at their MLEs. The one last step is to derive the conditional distribution of the constrained parameters given the unconstrained ones, so let’s write \nabla^2 \mathcal{L} as \left(\begin{array}{cc} \mathcal{L}_{\tau\tau} & \mathcal{L}_{\tau\alpha} \\ \mathcal{L}_{\alpha\tau} & \mathcal{L}_{\alpha\alpha}\end{array}\right) where now \tau are constrained to 0, and \alpha are not. Then, having fit \alpha:

\lambda|\alpha \sim N(0, -[\mathcal{L}_{\tau\tau}-\mathcal{L}_{\tau\alpha}\mathcal{L}_{\alpha\alpha}^{-1}\mathcal{L}_{\alpha\tau}])

Evaluated at \alpha^*. However if g(\tau) is not of such a nice form, we cannot directly appeal to the null distribution of \mathcal{L}_\tau|_{\tau_\lambda} as we did here, because we relied on the fact that \nabla g(\tau) \in \{0,1\}. In the general case we need to know these values, so we play the Taylor expansion trick again:

g(\tau^*) = g(\tau_t) + \nabla g|_{\tau_t}(\tau^*-\tau_t)

g(\tau_\lambda) = g(\tau_t) + \nabla g|_{\tau_t} (\tau_\lambda - \tau_t)

Now g(\tau_\lambda) = 0, as that is our constraint. This yields a relation for the Jacobian:

g(\tau^*) = \nabla g|_{\tau_t}(\tau^*-\tau_\lambda)

And from above we have a relation for the likelihood derivatives (because, by definition, \mathcal{L}_\tau|_{\tau^*}=0):

\mathcal{L}_\tau|_{\tau_\lambda} = \nabla^2\mathcal{L}|_{\tau_t}(\tau_\lambda - \tau^*)

We can combine these two equations, yielding

g(\tau^*) = -\nabla g|_{\tau_t}[\nabla^2 \mathcal{L}|_{\tau_t}]^{-1}\mathcal{L}_{\tau}|_{\tau_\lambda}

But we care about \lambda. Remember that we have -\nabla g |_{\tau_\lambda}\lambda = \mathcal{L}_{\tau}|_{\tau_\lambda}. Then:

g(\tau^*) = \nabla g|_{\tau_t}[\nabla^2\mathcal{L}|_{\tau_t}]^{-1}\nabla g|_{\tau_\lambda}\lambda

Which, because we have a symmetric inner product which will be invertible, gives

\lambda = \left[\nabla g|_{\tau_t}[\nabla^2\mathcal{L}|_{\tau_t}]^{-1}\nabla g|_{\tau_\lambda}\right]^{-1}g(\tau^*)

And we can use the Taylor expansion way above for g(\tau^*), combined with the fact that we know how \tau^* is distributed, to calculate that (under H0, where \tau^*=\tau_\lambda=\tau_t):

\lambda = \left[\nabla g|_{\tau_t}[\nabla^2\mathcal{L}|_{\tau_t}]^{-1}\nabla g|_{\tau_t}\right]^{-1} N\left(0,\,-\nabla g|_{\tau_t}[\nabla^2\mathcal{L}|_{\tau_t}]^{-1}\nabla g|_{\tau_t}\right)

and therefore

\lambda \sim N\left(0,\,-\left[\nabla g|_{\tau_t}[\nabla^2\mathcal{L}|_{\tau_t}]^{-1}\nabla g|_{\tau_t}\right]^{-1}\right)

And that’s the Lagrange Multiplier (alt. Rao’s Score) test.

[1] Obscure, archaic textbook from 1972
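As a sanity check on the algebra, here is a minimal sketch of the resulting one-degree-of-freedom score test for logistic regression with only an intercept fit under the null (essentially the closed form used in the next section; the function name is mine):

```python
import numpy as np
from scipy.stats import norm

def score_test_logistic(y, x):
    """One-degree-of-freedom score (Lagrange multiplier) test for adding a
    single covariate x to an intercept-only logistic regression.  Under the
    null the fitted intercept reproduces r = mean(y), so the score and its
    Schur-complement variance take the closed forms below."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    r = y.mean()                      # phi at the null MLE of the intercept
    U = np.sum((y - r) * x)           # the score L_tau at the constrained optimum
    # -(L_tt - L_ta L_aa^{-1} L_at), with phi' = r(1-r) for every sample
    var_U = r * (1 - r) * (np.sum(x**2) - np.sum(x)**2 / len(x))
    T = U / np.sqrt(var_U)
    return T, 2.0 * norm.sf(abs(T))
```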

Part 1: Oversampling

As a bit of motivation, and because the mathematics will be largely the same, we review the impact of oversampling on a linear model. For binary outcomes, this amounts to inflating or deflating the rate of outcome in the experiment as compared to the natural rate; for example, sampling 50% cases and 50% controls for a disease with only a 15% prevalence in the underlying population. The basic idea behind oversampling is to note that the tables

\begin{array}{c|c|c|} & A & B \\ \hline \mathrm{Case} & 100 & 8 \\ \hline \mathrm{Control} & 100 & 2 \end{array} \;\;\;\;\;\; \begin{array}{c|c|c|} & A & B \\ \hline \mathrm{Case} & 500 & 12 \\ \hline \mathrm{Control} & 100 & 2 \end{array}

are substantively different (in terms of say, Fisher’s Exact Test), despite the fact that B shows a large enrichment for cases in both situations. A philosophical question that’d be worthwhile for the biological world to answer (but that we won’t get into here): if table 2 is the result of a replication experiment, is the signal in table 1 replicated, or not? While it may seem strange that a frequency of 8% in cases should in an orthogonal experiment come back at a frequency of 2.4% in cases, it only takes a tiny bit of structure.

Anyway, it’s somewhat useful to quantify the degree to which oversampling can reduce power in this way. Following the standard latent-variable setup, we let y_i = \mathbb{1}[\tau X_i + \alpha + \epsilon_i > \theta], where \epsilon_i has some cumulative distribution \phi(x). Let r be the rate of case sampling, so the number of cases is given by Nr. Also set the frequency of X in cases and controls to f_A and f_U respectively. A score test for the effect of X_i will be

\mathcal{L}_\tau = \sum_A \phi^{-1}\phi'X_i - \sum_U(1-\phi)^{-1}\phi'X_i

(Recall that the log-likelihood is just the sum of y_i\log(\phi)+(1-y_i)\log(1-\phi), where the sums above are over the case and control sets.)

The variance of this statistic, as derived above, is

-(\mathcal{L}_{\tau\tau}-\mathcal{L}_{\tau\alpha}\mathcal{L}_{\alpha\alpha}^{-1}\mathcal{L}_{\alpha\tau})

and relies on

\log(1-\phi)'' = -(1-\phi)^{-2}\phi'^2- (1-\phi)^{-1}\phi''

All of these are evaluated under the null hypothesis (\tau=0,\alpha=\hat{\alpha}). For every distribution of \phi, we have that \phi(\alpha) = r, by definition. The values of \phi' and \phi'' vary with the particulars. For logistic regression,

\phi'|_{\hat\alpha} = \frac{e^{-\alpha}}{(1+e^{-\alpha})^2} = r(1-r)

(note: 1+e^{-\alpha} = 1/r so e^{-\alpha} = 1/r-1 = (1-r)/r). Finally \phi''|_{\hat\alpha} = -r(1-r)(2r-1). Therefore:

\mathcal{L}_\tau = \sum_A (1-r) X_i - \sum_U rX_i

\mathcal{L}_{\alpha\alpha} = -\sum r(1-r) = -Nr(1-r), \;\;\;\; \mathcal{L}_{\tau\tau} = -\sum X_i^2 r(1-r), \;\;\;\; \mathcal{L}_{\tau\alpha} = -\sum X_i r(1-r)

so that

-(\mathcal{L}_{\tau\tau}-\mathcal{L}_{\tau\alpha}\mathcal{L}_{\alpha\alpha}^{-1}\mathcal{L}_{\alpha\tau}) = r(1-r)\left[\sum X_i^2 - \frac{1}{N}\left(\sum X_i\right)^2\right]

So then the score test will be \mathcal{L}_\tau/\sqrt{\mathrm{var}}

T = \frac{U}{\sqrt{\mathrm{Var}[U]}} = \frac{(1-r)\mathrm{AC}_{\mathrm{case}} - r\mathrm{AC}_{\mathrm{control}}}{\sqrt{-r(1-r)\left[\frac{1}{N}(\sum X_i)^2 - \sum X_i^2\right]}}

Sums can be simplified using f_U and f_A:

 = \frac{(1-r)(2Nrf_A) - r(2N(1-r)f_U)}{\sqrt{-r(1-r)\left[\frac{1}{N}(2N\tilde{f})^2 - 2N\tilde{f}(1-\tilde{f}) - 4N\tilde{f}^2\right]}} = \sqrt{2Nr(1-r)}\, \frac{f_A-f_U}{\sqrt{\tilde{f}(1-\tilde{f})}}

and \tilde{f} = rf_{A}+(1-r)f_{U}. Note, however, that this is just to get a sense of T at the expected allele counts in cases and controls. This does not speak to the power of the test, nor even to the mean value of T, as we’ve evaluated T(\mathbb{E}[\mathrm{AC}_{\mathrm{case}},\mathrm{AC}_{\mathrm{control}}]), which is different from \mathbb{E}[T(\mathrm{AC}_{\mathrm{case}},\mathrm{AC}_{\mathrm{control}})], due to the nonlinearity of T.

This can be extended to weighted group tests (e.g. burden tests). Let the weighting vector for the variants in a gene be \xi, and let X_i now be a vector of genotypes for sample i. Define S_i = \xi^\dagger X_i and the score test becomes

T_{\xi} = \frac{\sum_A (1-r)S_i - \sum_U r S_i}{\sqrt{-r(1-r)[\frac{1}{N}(\sum S_i)^2 - \sum S_i^2]}}

In both cases, you find that the test statistic evaluated at the expected genotypes (given f_A and f_U) is maximized for evenly matched case-control sampling, or r=0.5. But, maximizing T(\mathbb{E}[AC]) is not necessarily the same as maximizing power. One easy way to demonstrate that is to compare the behavior of T(\mathbb{E}[AC]) to that of \mathrm{Pow}(T|\tilde{f},\mathrm{OR}).

Calculating Power

Examining the formula for the test statistic T, we notice that only four quantities are needed: the allele count in cases, allele count in controls,  total number of alleles, and total number of homozygotes (for X_i^2). This leads to the following obvious means of simulating a draw of genotypes from a population.

Given f_A and f_U, the frequency of the variant in cases and controls (which can be calculated from the odds ratio and the overall frequency in the population), draw

\mathrm{Het}_A \sim \mathrm{binom}(n=|A|,p=2f_A(1-f_A))

\mathrm{Het}_U \sim \mathrm{binom}(n=|U|,p=2f_U(1-f_U))

\mathrm{Hom}_A \sim \mathrm{binom}(n=|A|,p=f_A^2)

\mathrm{Hom}_U \sim \mathrm{binom}(n=|U|,p=f_U^2)

and if by chance \mathrm{Het}_S + \mathrm{Hom}_S > |S|, set \mathrm{Het}_S = |S|-\mathrm{Hom}_S. Then T can be calculated. A draw of 10,000 realizations for a site of MAF=1% with OR=1 and 5000 samples is very well calibrated:

[Figures: TDistribution_null_sim_Nqq_overall, TDistribution_null_sim_qq, TDistribution_null_sim_Nqq_quantization – QQ plots of the simulated null distribution of T]

So the quantiles of the normal and uniform distributions line up very nicely. There is (as you would expect) some quantization near 0, due to the fact that allele counts are discrete. The power can be estimated from these simulations by, for a given nominal p-value \rho, counting the fraction of simulations for which |T| > \Phi^{-1}(1-\rho) (where \Phi^{-1} is the inverse normal cdf).
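A minimal sketch of that simulation (function names mine; the clipping step and the |T| > \Phi^{-1}(1-\rho) counting follow the description above):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def simulate_T(fA, fU, nA, nU, n_sim=10000):
    """Draw het/hom counts in cases and controls as described above and
    return the score statistics T.  fA, fU: case/control allele frequencies."""
    hetA = rng.binomial(nA, 2 * fA * (1 - fA), n_sim)
    homA = rng.binomial(nA, fA**2, n_sim)
    hetU = rng.binomial(nU, 2 * fU * (1 - fU), n_sim)
    homU = rng.binomial(nU, fU**2, n_sim)
    hetA = np.minimum(hetA, nA - homA)            # the Het_S = |S| - Hom_S adjustment
    hetU = np.minimum(hetU, nU - homU)
    acA, acU = hetA + 2 * homA, hetU + 2 * homU   # allele counts in cases and controls
    sum_x, sum_x2 = acA + acU, (hetA + hetU) + 4 * (homA + homU)
    N, r = nA + nU, nA / (nA + nU)
    U = (1 - r) * acA - r * acU
    var_U = r * (1 - r) * (sum_x2 - sum_x**2 / N)
    with np.errstate(invalid='ignore', divide='ignore'):
        T = U / np.sqrt(var_U)
    return np.where(var_U > 0, T, 0.0)            # monomorphic draws contribute T = 0

def power(fA, fU, nA, nU, alpha=5e-6, n_sim=10000):
    """Estimate power: the fraction of draws with |T| > qnorm(1 - alpha)."""
    return np.mean(np.abs(simulate_T(fA, fU, nA, nU, n_sim)) > norm.ppf(1 - alpha))
```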

One notices that while T(\mathbb{E}[AC]) is nicely symmetric, the power seems to be asymmetrically distributed.

[Figures: statistic_freq_or_at_expected, statistic_power, statistic_power_log – T(E[AC]) and estimated power (linear and log scale) by frequency and odds ratio]

The figures here take a 10,000-sample balanced (5000 case, 5000 control) study, where the prevalence of the disease is 8%. Notice that the distribution of T(\mathbb{E}[AC]) (left plot) is perfectly symmetric, whereas the power is biased towards risk-increasing alleles (middle plot, right plot logscale). So what’s happening to break the symmetry? Consider what case/control frequencies correspond to an OR=1.5 and f=0.05 allele. Using Bayes’ rule:

f_A = (P[A|\mathrm{Het}]P[\mathrm{Het}] + 2P[A|\mathrm{HV}]P[\mathrm{HV}])/(2P[A])

= [2f(1-f)\phi(\alpha + \beta) + 2f^2\phi(\alpha+2\beta)]/[2\rho]

f_U = (P[U|\mathrm{Het}]P[\mathrm{Het}] + 2P[U|\mathrm{HV}]P[\mathrm{HV}])/(2P[U])

= [2f(1-f)(1-\phi(\alpha + \beta)) + 2f^2(1-\phi(\alpha + 2\beta))]/[2(1-\rho)]

where \rho is the prevalence, \beta is the log-OR, and \phi is the logistic (inverse logit) function. What this means is that for a 1% frequency variant

[Figure: freq_or_oversam_curves_maf01 – case and control allele frequencies versus odds ratio for a 1% variant]

The case allele frequency (red) increases drastically when the allele confers risk (shaded circles), but decreases drastically when the allele is protective (triangles). At the same time, the control allele frequency (blue) does not significantly change in either case. One potential explanation for the differences in power is that the variance of observed allele counts is higher when the allele is risky: 2p(1-p) is increasing for 0 < p < 0.5, and it could be that this variance of observation dominates any gain from oversampling cases. There’s actually quite a lot in the figure. For diseases even of moderate prevalence (8%, 10%), the control frequency does not change all that much with the odds ratio, either for protective or risk alleles, because the disproportionate number of controls in the population serves as a buffer. So effectively, only the case frequency is changing (and this is particularly true of rare diseases with a prevalence of 1% or less!). Risk alleles increase the case frequency, and the case allele count can increase with it. By contrast, protective alleles lower the case frequency, but the case count can’t decrease with the frequency in the same way, because of (solidifying yet another parallel between genetics and economics) the zero lower bound. You can’t get negative case counts, so in 2500 case samples, the difference between a 0.1% allele and a 0.05% allele is masked by the fact that the realized allele count distribution becomes zero-inflated. What this means is that while a risk allele can increase a 10:5 to 30:5 to 90:5 without any limit, a protective allele can decrease a 5:10 to 0:10 and no lower.
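To make the frequency bookkeeping concrete, here is a small sketch of the Bayes-rule computation above, with the intercept \alpha solved numerically so that the marginal probability of being a case matches the prevalence (function name and example values are mine):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

def case_control_freqs(f, odds_ratio, prevalence):
    """Compute f_A and f_U from the population frequency, OR, and prevalence,
    using the Bayes-rule expressions above.  The intercept alpha is chosen so
    that the marginal P[A] matches the prevalence."""
    beta = np.log(odds_ratio)
    pg = np.array([(1 - f)**2, 2 * f * (1 - f), f**2])   # genotype frequencies (HWE)
    def marginal(alpha):
        return np.dot(pg, expit(alpha + beta * np.arange(3))) - prevalence
    alpha = brentq(marginal, -30, 30)
    num_A = 2 * f * (1 - f) * expit(alpha + beta) + 2 * f**2 * expit(alpha + 2 * beta)
    num_U = 2 * f * (1 - f) * (1 - expit(alpha + beta)) + 2 * f**2 * (1 - expit(alpha + 2 * beta))
    return num_A / (2 * prevalence), num_U / (2 * (1 - prevalence))

# e.g. the risk and protective versions of a 1% variant with OR = 2 at 8% prevalence
fA_risk, fU_risk = case_control_freqs(0.01, 2.0, 0.08)
fA_prot, fU_prot = case_control_freqs(0.01, 0.5, 0.08)
```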

Nevertheless, oversampling does increase power, but it does so differentially for risk versus protective alleles – and it is typically the risk alleles that are the most empowered.

[Figures: power_sampling_curves_prev08, power_sampling_curves_prev30 – power versus sampling ratio at 8% and 30% prevalence]

Here are the sampling ratio/power profiles for 1% variants with a risk/protective odds ratio of 2: on the left, the disease prevalence is 8%, while it is set to 30% on the right. As the disease becomes more prevalent, the optimal ratio seems to shift away from balanced samples, and in the latter case it is possible to oversample cases to the point where you are more powered to detect protective alleles.

Coming up: Part 2 – Frequency Confounding, Part 3 – Population Stratification (PCA-confounds)

This too is an older post. See here, and a related post on fixed and random effect models here.

