# heavytailed

## August 18, 2015

### Estimating fraternity coefficients – Part All The Rest

Filed under: omics — Tags: , , , , — heavytailed @ 10:58 pm

It’s been an exciting summer! Building infrastructure at a startup here in SoCal whilst keeping up with my Ph.D. qualifier and research has taken most of my time away from blogging. In my last post, I promised an update on fraternity coefficient estimation “in the coming weeks.” Weeks having come and gone, this will be a brief summary of the qualifier. It’s precisely as you might expect.

The Problem

If there’s any inbreeding in the population (note: the idea of “having populations” implies inbreeding in a broad sense — if mating were purely random there could be no population differentiation) then fraternity coefficients cannot be estimated from bi-allelic markers, as the equations are underdetermined.

This goes back to the condensed coefficients of identity (the Jacquard coefficients [pdf]). The issue is that, if there is inbreeding, fraternity is both D1 and D7 – and all of a sudden you need to estimate D1 in an unbiased way — which means estimating all the Jacquard coefficients. With bi-allelic sites, the genotype observation probabilities, conditioned on each Jacquard state, are given by:

The shape of the matrix gives the entire problem: seven genotype states for nine coefficients. So to estimate the 9 coefficients we need at least three alleles, which has a conditional observation of:

What about four alleles? The only way you could see four alleles in two individuals is (wx, yz) which has probability pqrs. Combining this with the above states, ordering the alleles and reordering/combining rows, gives a final form for this matrix as

And of course there’s an obvious way to make multi-allelic markers from bi-allelic markers: phase them.

There are two estimators associated with this, the first, a maximum likelihood estimator, looks like

$\mathcal{L}(D) = \prod_{k=1}^m \sum_D \mathbb{P}[G^{(k)}|D]\mathbb{P}[D]$

$= \prod_{k=1}^m I[G_k]^TC^{(k)}\Delta$

where C is the matrix above, with p, q, r, s given by the observed alleles at the locus. G is the genotype state for the pair, so $I[G_k]$ is just an indicator variable that selects out the appropriate row, i.e. (aa, bc) would pull out row 3, so $I[(aa, bc)]$ is just $\mathbf{e}_4$.

For a single pair of samples, denote the concatenation of all such rows (over each multi-allelic locus) as the $N \times 9$ matrix Q. The update rule for ML (using logistic boundary functions) is

$\frac{d\mathcal{L}}{d\theta_i} = [1/(Q\Delta) \cdot Q\mathbf{e}_i - 1/(Q\Delta) \cdot Q\mathbf{e}_9]\Delta_i(1-\Delta_i)$

where $\Delta_i = 1/(1 + e^{-\theta_i})$. The term $1/(Q\Delta)\cdot Q$ should be interpreted as follows: $Q\Delta$ is a $N \times 1$ vector, while $Q$ is a $N \times 9$ matrix, therefore $1/(Q\Delta) \cdot Q$ is akin to $\mathrm{diag}(1/Q\Delta) Q$.

Method of Moments estimation can also be used here by taking expectations. In particular, given two individuals at a locus k, letting Ck be the above matrix adapted to k by plugging in the appropriate allele frequencies, and letting G be the indicator variable for the observed genotype state of two individuals, we have that

$\mathbb{E}[G_k^{(ij)}] = \mathbb{E}[C_k\Delta] = \mathbb{E}[C_k]\Delta = C_k\Delta$

as $C_k$ is a matrix of (conditional) Bernoulli random variables. By concatenating the $C_k$ over all loci k, the coefficients could be solved for by least squares:

$\hat \Delta^{(ij)} = (\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{G}^{(ij)}$

The major problem here is that $\mathbf{C}$ gets very large (1.6GB for 25,000 loci with 8 haplotypes). My solution here was to adapt the model via

$\mathbf{\Xi C \Delta}^{(ij)} = \mathbf{\Xi G}^{(ij)}$

and choosing $\mathbf{\Xi}$ in such a way that $\mathbf{\Xi C}$ could be calculated on the fly without ever materializing $\mathbf{\Xi}$ or $\mathbf{C}$.

Both the likelihood and moment estimators can be constrained through the use of Lagrange multipliers (in the latter case the closed-form least squares would have to be replaced by a constrained gradient descent). In particular, given a good estimate of coancestry, one can constrain with

$\Delta_1 + \frac{1}{2}(\Delta_3 + \Delta_5 + \Delta_7) + \frac{1}{4}\Delta_8 = \Phi_{ij}$

and the no-inbreeding assumption would be given by

$\Delta_1 + \Delta_2 + \frac{1}{2}\Delta_3 + \Delta_4 = 0$

$\Delta_1 + \Delta_2 + \frac{1}{2}\Delta_5 + \Delta_6 = 0$

these are just the definition of the coefficients of coancestry and inbreeding. One is free to drop the fractions on the inbreeding constraints, since, combined with $\Delta_i \geq 0$ these two constraints set $(\Delta_1, \dots, \Delta_6) = 0$.

These approaches work fairly well on simulated pedigrees. Here it is on simulated sib and half-sib pairings:

One of the initial claims was that in populations with high degrees of consanguinity, estimating faternity coefficients from bi-allelic sites (and assuming 0 inbreeding coefficients) will result in a biased estimate. Here are populations simulated with 10%, 20%, and 30% rates of consanguinity; the constrained estimator shows decreasing estimates of fraternity (which is incoherent), while the unconstrained estimator correctly increases with consanguinity rate.

Coancestry coefficients also show the same incoherent trend between constrained and unconstrained estimation.

How much do the inclusions of the constraints impact estimates on real data? This next plot shows the coefficient of fraternity as estimated on the 1000G PJL and STU populations, constrained against unconstrained.

It’s heartening to see a good identity trend; in addition to the large number of points which would be set to 0 by constrained estimation of Jacquard coefficients, but have a positive fraternity coefficient when the estimation is not constrained by a 0-inbreeding assumption.

Nevertheless the real-world effect of this is somewhat slight. Looking at the estimates of fraternity across 1000G populations, we see that there’s not an obvious shift in the populations where first-cousin marriages are comparatively common (PJL/STU) [note that the blue outliers here are due to early terminations of constrained Boyden-Fletcher-Goldfarb-Shanno and are eliminated from statistical tests. The points here are per-pair per-chromosome.

But despite these appearances, the PJL/STU populations do have a significantly different distribution from the others. Conditional on fraternity > 0; the PJL and STU population distributions dominate (e.g. occur before/are higher than) the others (Mann-Whitney U: p < 1e-16 vs GBR, FIN — MWU GBR vs Fin: p = 0.22)

So what does all this mean?

The line-item here is: if you’re estimating the coefficient of fraternity, you probably should allow inbreeding, unless you’re sure that the population is mating randomly. If you know there’s a good degree of consanguinity, you should probably allow for nonzero inbreeding coefficients if you’re estimating kinship.

However, if you’re calculating shared alleles for the purpose of a longitudinal model (as in Zhu et al. 2015), you don’t have to worry, really, about anything. In principle, IBD sharing is not even what you should use in this case. If you’ve directly observed genotypes at all (or the vast, vast majority) causal loci (e.g. sequencing, high-density genotype array), you actually care about realized IBS and not IBD at all.

Deconvolving (ab,ac) into its potential sharing states is really just smoothing. Imagine you have typed m sites in the genome for two individuals, but this set does not tag a particular allele (x) which may have additive or dominant effects. Given the genotypes at the m other sites, and the genotype of one sample at x, is it possible to predict the genotype of the other sample? The answer is: take the m sites and calculate the Jacquard coefficients, and, given the genotype for one of the samples, look up the appropriate rows in the C matrix above, and you have a conditional predictor of x on the other sample. In this way, Jacquard estimation is a kind of smoothing; the average IBD sharing.

For the purpose of phenotype prediction, average IBD sharing is only useful in that it predicts average IBS sharing. That is, IBD is expected IBS; however given IBS, calculating IBD does not provide any new information about any loci that went into the IBD calculation. For dominant effects, you care that all four, or two pairs of alleles are shared IBS, as the effect is allele-specific (as opposed to from-whom-did-you-inherit-these-alleles-specific). To put it another way: IBS in the genotype-phenotype map is not a proxy for some underlying quantity, it is what you directly care about.

This particular realization makes be very happy this was just a qualifier and not some actual paper I was planning to submit. For the study of phenotypes, coancestry ideas are effectively rendered obsolete by the ability to get IBS over the whole genome.

Various notes

Path Counting

In some sense it’s very sad that pedigree-based genetics is dying off, if only because of a cool connection between path-counting and graphical models. Sewall Wright, when he defined coancestry as the correlation between two gametes, was in effect constructing a graphical model.

This is just a DAG (except for the edge labeled m which is just a pre-correlation between founders). If you look very closely at the path-counting recurrences

$\Phi_{ab} = \frac{1}{2}(\Phi_{fb} + \Phi_{mb})$

$\Phi_{aa} = \frac{1}{2}(1 + \Phi_{fm})$

you see that these are identical to the recurrences for the covariance structure given by a DAG:

$\Sigma_{ij} = \sum_{k \in \mathrm{pa}(i)} w_{ki}\Sigma_{kj}$

which, given the graph $W$ of weights, corresponds to

$\Sigma = I + W + W^2 + \dots = (I - W)^{-1}$

(see Lior Pachter’s wonderful number deconvolution for an explanation). There are other recurrences for generalized kinship coefficients (Cheng & Ozsoyoglu, 2014) but none that I could find correspond to any sequence of elementary matrix operations; even over fields other than $\mathbb{R}$. It would be nice to spend some more time exploring this connection between gametic correlation, path counting, and related fields (such as routing theory), but seeing as how the correlation between gametes is not so relevant when you can sequence their progeny, these explorations may have little application in the genotype-phenotype world.

Fast reference-free phasing by compressive sensing

As mentioned, the way to go from bi-allelic loci to multi-allelic loci is to phase the SNVs. For this reason I wrote my own code on top of the Beagle source code. However, phasing generally relies on a reference, and is by far the slowest part; so it would make-sense to pre-phase and subsequently calculate the Jacquard coefficients. Reference-free phasing is even harder, as the underlying haplotypes need to be inferred; this is generally done by using a greedy algorithm and restricting segment size (in terms of number of markers). Modern strategies basically derive from $\texttt{HAPLOTYPER}$ (Niu 2002) which takes this approach.

However it’s possible to phase a window of bi-allelic loci quickly and without a reference using compressive sensing; in particular nonnegative matrix factorization. The idea is, given the genotype matrix G we can write it as

$\Sigma A = G$

where $\Sigma$ is the $m \times 2^m$ matrix of haplotypes, and $A$ the $2^m \times 2n$ haplotype assignments. There are some cool things even here: if you consider the least squares estimator:

$A = (\Sigma^T\Sigma)^{-1}\Sigma^TG$

the matrix $\Sigma^T\Sigma$ is particularly cool because the ith column of $\Sigma$ is the binary representation of i, and so neither matrix need be materialized. But $\Sigma^TG$ is an impossibility. The compressive sensing approach is to define some cutoff $K$ with $K \ll 2^m$ and then solve:

which can be solved with SGD. Alternatively $\min(a, 1-a)$ can be replaced with simply $a(1-a)$ which also induces “bidirectional” shrinkage. This would then be a smooth objective, and so BLGFS would work.

I didn’t pursue this direction much: I implemented a version of this using LAML, but it was a standard NMF with one penalty on all parameters, no domain constraints, and no simplex constraint. However the (very, extremely preliminary) results were promising even with all of these drawbacks. Had Jacquard estimation proved more interesting than it did, I would certainly keep driving down this direction; but it seems the applications for fast haplotype inference and biallelic phasing are limited. Perhaps I’ll dust it off if I wind up taking an official course in compressive sensing.

## June 19, 2013

### Genetic Association and the Sunrise Problem

Filed under: Uncategorized — Tags: , , , , — heavytailed @ 3:42 am

As a preface, let me acknowledge that this blog has been perhaps a bit too preoccupied with statistical genetics in the past month. Looking at the several posts I have partially completed it is likely to remain so through the summer.

Sequencing and GWAS

Nature Genetics has in the past few months returned to its old position as a clearinghouse for GWAS studies, the author lists for which steadily increase as more and more groups contribute data. (Can we start putting the author list in the supplemental material? It’s getting hard to navigate the webpage…). GWAS studies are nice, we sort of know how to do them. I have a bit of a beef with the push-button approach, I think GWAS studies and in particular meta/mega analyses are failing to deliver as much information as they should, and I’ll come back to this at the end of the section. But GWAS always have the complexity that they are ascertained. We didn’t find a variant of effect T. Does one exist? Can’t say, we could only query these X variants.

Sequencing is fundamentally different: whatever effects are driving the phenotype in your sample are present in your sample. Yet the treatment of (particularly exome) sequencing studies is “like a GWAS just with more variants,” with no real use of the fact that variants are discovered rather than ascertained. It’s the same kind of “where is the signal I can find?” question. But really, the trait is heritable, the signal is there, somewhere, and it’s not just about the signals you can find, it’s also about the places you can rule out. This applies to GWAS as well: you genotyped 50,000 samples and have 12 hits. Surely you can say something else about the other 199,988 sites on the chip, right? Or at least about a large fraction of them?

And once you go from ascertainment to discovery, you can rule out whole portions of the variant space. Things like “No variant in SLC28A1 of frequency > 5% has an odds ratio >1.2”. With that said do I still care about SLC28A1 as a candidate gene for my trait? What if the best I can do for ADAM28 is to say no such variant exits with OR>2.6, but may exist for OR<2.6? Now how much do I care about SLC28A1? You can make such statements having performed sequencing, not only can you produce a list of things where signal definitely is, but you can segment the rest into a list of things where the signal definitely is not, and those where signal might be.

In some sense GWAS studies dropped the ball here too, but the analysis is far more subtle. One of the big uses of modern sequencing data is fine mapping: so, at a GWAS locus, trying to identify rare variants that might be generating a “synthetic association”. For instance, if several rare variants of large effect happen to be partially tagged by the GWAS hit. However, given the observed variance explained, there’s then a posterior estimate of frequency, effect, r^2, and number of variants that could be consistent with the observed signal. Carefully working out what could be there, and what’s expected to be there would provide not only intuition about what to expect when performing fine mapping, but a good and precise framework for thinking about synthetic associations. While the idea that most GWAS loci are so explained has been vigorously rebutted, it is done so at the level of bulk statistics, with little attention paid to if there is a synthetic association here, what must such variants look like?

The point is ultimately there’s a lot more information than what things have p<5e-7, it just needs to be extracted.

Bounding Variant Effects

The 0th-order question is “What variants do I have with p<5e-7“. If the answer isn’t the empty set, the 1st-order question is “What genes are near these variants?” and the 2nd-order question is “What biological processes are these genes involved in?”. If the answer is the empty set, the 1st-order question is “What gene burden tests are at the top of the quantile-quantile plot?”, the 2nd-order question is “What biological processes are these genes involved in?”

But, as I mentioned in the previous section, you’ve seen these variants enough times (especially now with tens- to hundreds of thousands of samples) to know whether or not they’re interesting. The first-order question ought to be “What variants can I rule out as contributing substantially to disease heritability [under an additive model]?” Ideally, this would be effectively everything you looked at. In practice, you looked at 100,000-10,000,000 variants, and some of those will look nominally associated by chance, and these will be difficult to rule out.

Alright so what am I talking about? Here’s the model:

$\mathbb{P}[Y=1|X,C_1,C_2,\dots] = \alpha + \beta X + \sum_i \tau_i C_i + \epsilon$

where X are the genotypes at a site, and $C_i$ is the ith covariate. It is the property of maximum likelihood that, given the data, the coefficient vector $(\alpha^*,\beta^*,\tau_1^*,\dots) = V ^*\sim N(V_{\mathrm{true}},-[\nabla^2\mathcal{L}|_{V_{\mathrm{true}}}]^{-1}$). Denoting the variance-covariance matrix as $\Sigma$ then $\beta \sim (\beta_{\mathrm{true}},\Sigma_{22})$. This posterior distribution on the estimate allows more than the basic test of $\beta_{\mathrm{true}} = 0$, one can also ask for what values $(\beta_u,\beta_l)$ we have $\mathbb{P}[\beta_{\mathrm{true}} > \beta_u | \beta^*] < p$, $\mathbb{P}[\beta_{\mathrm{true}} < \beta_l | \beta^*] < p$. This is just a p-ile confidence interval, but if p<1e-6 and $\beta_u-\beta_l$ is small with $\beta_l < 0 < \beta_u$, then this particular variant is very unlikely to be causal, given these bounds, in that the probability of a large beta is very low.

Now there are two problems with “just using a confidence interval.” First, because the likelihood is heavily dependent on the number of observations of nonzero $X_i$, the variance of the estimator is an increasing function of the minor allele frequency. That is, rare variants are harder to bound. Secondly, the disease heritability conferred by a variant of given effect increases with allele frequency. That is, a 1.4 odds ratio variant at 0.5% frequency is, in and of itself, contributing very little to the prevalence of the disease. By contrast, a 1.4 odds ratio variant at 30% frequency is contributing a large fraction of disease heritability.

Both problems can be resolved by taking a heritability view, that is, asking instead the question for what proportion $\rho(\beta,f)$ of phenotypic variance we have

$\mathbb{P}[\rho(\beta_{\mathrm{true}},f_{\mathrm{true}}) > T | \beta^*,f^*]$

In other words, say you observe a variant with $\beta = 0.2 \pm 0.25$, at a frequency of $0.01$. You then ask, what is the probability that this variant explains 1% of the total prevalence (on a liability scale) of my disease, given this observation? Well, that’s simple, given the prevalence and frequency, one can simply ask for which two values of beta, $\beta_{\mathrm{R}},\beta_{\mathrm{P}}$, the risk-effect and protective-effect values, is the proportion of phenotypic variance 1%. Then, given those values, use the posterior distribution to ask what is $\mathbb{P}[\beta_{\mathrm{true}} > \beta_{\mathrm{R}} | \beta^*]$?

Note that the frequency of the variant does not enter into the discussion, it’s only hidden in the calculations of the phenotypic variance $\rho(\beta,f)$, and the standard deviation of the maximum likelihood statistic $\beta^*$, thus making variants of different frequencies commensurate by placing the hypotheses on the same scale – and one growing approximately with the variance of the estimator. That is, while the variance of the estimator increases as the variant becomes rare (this is what makes bounding using a constant odds ratio difficult), so too does the effect size required to confer a given level of phenotypic variance. These phenomena tend to cancel, so on the scale of variance explained, one can bound a rare variant as well as a common variant.

In order to do this, it’s just a question of converting a given variance explained into frequency-effect pairs, and evaluating the likelihood of such pairs given the data. That is:

$= \mathbb{P}[\rho(\beta_{\mathrm{true}},f_{\mathrm{true}}) > T | \beta^*,f^*,f_{\mathrm{true}}]\mathbb{P}[f_{\mathrm{true}}|f^*]$

$= (\mathbb{P}[\beta_{\mathrm{true}} > \rho_f^{-1}(T)_u|\beta^*] + \mathbb{P}[\beta_{\mathrm{true}} < \rho_f^{-1}(T)_l]|\beta^*)\mathbb{P}[f_{\mathrm{true}}|f^*]$

These terms need to be integrated over $f_{\mathrm{true}}$, but for whatever reason the wordpress parser doesn’t like it when I put $\int_{f_{\mathrm{true}}$ into the above equations. And in practice, for even rare variants on the order of 0.5% frequency, the posterior frequency estimate is peaked enough to be well-approximated by setting $f=f^*$ (remembering, of course, that all frequencies are population-based, so oversampling cases requires re-adjusting the observed frequency based on the prevalence of the disease: $f = (1-p)f_U + pf_A$).

The last thing to work out is the function $\rho(\beta,f,\mathrm{prevalence})$ giving the phenotypic variance explained by a variant of frequency $f$, effect size $\beta$ on a dichotomous trait of a given prevalence. It’s very well known how to compute the liability-scale variance, and the inverse can be found trivially through numerical quadrature.

Liability-scale Variance: A Derivation

Recent discussions have left me slightly cynical on the common intuition behind this calculation (an intuition which seems to be “cite So et al 2012”), so it may be worth stepping through the derivation. Let’s start with a disease, and a variant. The diesease is 10% prevalence, and the variant has a 10% frequency, and an odds ratio of 1.35 (so $\beta = 0.30$). We begin with assuming an underlying latent variable with some distribution, and for the sake of having a pretty function, I’m going to assume a logistic. There’s also a choice between setting the mean of latent variable (over the population) to 0, and letting the threshold T adjust, or setting T=0 and letting the mean adjust. I will do the latter. So basically we have a whole population that looks like

Where the area under the curve to the right of 0 is 0.1. The population mean is exactly $\mu_P = -\log(1/r-1)$. So the question is: whereabouts in this liability scale do carriers fall? For that we need to know the probability of being affected by genotype state. Given the odds ratio, this is pretty easy:

$\mathbb{P}[A|X=2] = \left[1 + \exp(\mu_R + 2\beta)\right]^{-1} = \left[1 + \exp(\mu_R + 0.6)\right]^{-1} = P_V$

$\mathbb{P}[A|X=1] = \left[1+\exp(\mu_R + \beta)\right]^{-1} = \left[1+\exp(\mu_R+0.3)\right]^{-1} = P_H$

$\mathbb{P}[A|X=0] = \left[1+\exp(\mu_R)\right]^{-1} = P_R$

$\mathbb{P}[A] = \mathrm{Prevalence} = \left[(1-f)^2P_R + 2f(1-f)P_H + f^2P_V\right]$

There are three equations and three unknowns (plug in $\mu_R = -\log(1/P_R-1)$), and so this system is soluble. This can be reduced easily to a single equation by plugging in, and solved by a simple root solver such as


diseaseProbs <- function(frequency,odds,prevalence) {
bt <- log(odds)
cPrev <- function(muR) {
PR <- (1-frequency)^2*logistic(muR)
PH <- 2*frequency*(1-frequency)*logistic(muR + bt)
PV <- frequency^2*logistic(muR+2*bt)
(PR + PH + PV - prevalence)
}
if ( odds > 1.05 ) {
# prevalence almong reference individuals will be lower
res <- uniroot(cPrev,lower=invlogistic(0.001),upper=invlogistic(prevalence))
} else if ( odds < 0.95 ){
# prevalence among reference individuals will be higher
res <- uniroot(cPrev,lower=invlogistic(prevalence),upper=invlogistic(0.99))
} else {
upr <- invlogistic(min(1.2*prevalence,1-1e-5))
res <- uniroot(cprev,lower=invlogistic(0.8*prevalence),upper=upr)
}
muR <- res$root return(c(logistic(muR),logistic(muR+bt),logistic(muR+2*bt))) }  So what does this give us? Well it gives us the Logistic Mixture comprising the population distribution Where I haven’t been entirely faithful to the densities (I’m scaling by the square root of the frequencies of the genotypes). Point is, you can see the shift in the distributions by genotype, and consequently the proportion of the distribution curves falling above the T=0 threshold. At this point, we can just plug into the formula for variance: $\mathbb{E}[(X-\mathbb{E}[X])^2]$ to get $V = (1-f)^2 (\mu_P-\mu_R)^2 + 2f(1-f)(\mu_P-\mu_R-\beta)^2 + f^2(\mu_P-\mu_R-2\beta)^2$ Of course throughout we have assumed a standard Logistically distributed error term. Thus the variance of the means have to be given to us as a proportion of the total variance, or in other words $\rho(\beta,f) = \frac{\sigma_g^2}{\sigma_g^2+\sigma_e^2}$. In this case, the variance of the logistic is $\pi^2/3$, and thus we set $\rho = \frac{V}{\pi^2/3 + V} = \frac{0.016}{3.28+0.016} \approx 0.0049$ So roughly 0.5% of the liability-scale variance. Note that using the standard method of replacing the logistic distribution with a standard normal, and using $\rho_{\mathrm{norm}} = \frac{V_{\mathrm{norm}}}{1+V_{\mathrm{norm}}}$ results in an estimate of 0.44%. These differences increase with the variance explained, so that for a 20% frequency variant with OR=2.35, such a variant explains roughly 6% in the logistic model, but 5% variance in the normal model. Assuming the posterior distribution of f is sufficiently peaked, doing a GWAS-like bounding scan is fairly simple. For each variant, calculate $(\beta_u,\beta_l): \rho(\beta,f_i) = T$ via $\rho_{f_i}^{-1}(T)$. Then, using the posterior distribution of $d\mathbb{P}[\beta_\mathrm{true}|X,C_1,\dots]$ calculate $p_i = \Phi(\beta_l/\sigma_\beta)+(1-\Phi(\beta_u/\sigma_\beta))$. Multiple Test Considerations Where does the figure 1e-7 come from for GWAS studies? Or 5e-8 for sequencing studies? (And I mean conceptually, not here here or here). The basic logic is that of the look elsewhere effect: 1) Most variants are null 2) There are lots and lots of variants 3) There are many chances to see spurious nominal associations 4) Must set the threshold fairly high to achieve a study-wide Type-I error of 5%. The fundamental consideration is (1), and there’s a significant asymmetry between a GWAS scan to reject $\beta = 0$, and a bounding scan to reject $\beta \not \in (\beta_l,\beta_u)$. If most variants are null under a GWAS scan and ought not to be rejected, then almost no variants are null under a bounding scan, so most ought to be rejected. That is: Type-I error is bad in a GWAS scan (you find something that’s not real), and you’re willing to pay a cost of Type-II error (you don’t find something that is real). By contrast, for a bounding scan, Type-II error is bad (you put bounds on a truly causal variant), while Type-I error is less of an issue (you fail to bound a noncausal variant). Thus, previous conversations about multiple testing are not applicable, and one shouldn’t blindly apply p=5e-7. Instead, bounding requires a good deal more thought (or assumption) about the architecture of the trait. I would articulate the logic as 1) Causal variants are null 2) Few variants are null 3) There are lots and lots of variants 4) There are many chances to falsely accept the null 5) We can be rather aggressive on the rejection p-value It’s worth articulating why we don’t care all that much about false rejects: rejecting the null falsely simply includes a non-causal variant in a set of variants that should be enriched for causal variants. Variants for which the correct test is to reject $\beta = 0$. And it’s absolutely possible for a variant to be such that (e.g. have a large enough standard deviation that) $H_0: \beta = 0$ is rejected, as well as $H_0: \beta \not \in (\beta_l,\beta_u)$ is rejected. On the other hand, failing to reject the null on a causal variant means we can potentially exclude it from follow-up analyses, a situation we want to avoid. So there’s a different balance between Type-I and Type-II error here, with Type-II being the one we’re concerned about. But of the two, Type-II error is the trickier, because it’s fundamentally a question of power. The idea is, well, we don’t want to accidentally exclude things. But you have to be a little bit more specific: what kind of things do you not want to exclude? You can be sensitive to all causal variants (even those with OR=1+1e-5) by rejecting everything. So being more reasonable, you might say “I want no false acceptances of any variant with >0.5% of phenotypic variance explained”. Given that statement, you can calculate the power $w(\rho=0.05,p_{\mathrm{nom}})$, and that’s the power at the nominal threshold $p_{\mathrm{nom}}$. (See previous posts regarding calculation of power). But of course, that’s not quite enough, because you also need to know the number of chances to falsely accept such a variant, which is, of course, begging the question (if you knew how many causal variants there were, why bother with the exercise?). One way to get around this is to simply assume an unreasonable number of causal variants of such variance explained, $N_{\mathrm{UB}}$ and then use the same machinery as Bonferroni correction and set $p_{\mathrm{nom}} : 1-(1-w(0.05,p_\mathrm{nom}))^{N_\mathrm{UB}} = 0.01$ Bounding Variance Explained at a Locus Alright, so that all was about variants you know about. But what if you care about a specific locus, like a gene, and want to know if variants within that gene contribute substantially to heritability. There are several different questions to articulate, but for now let’s stick with the single-variant approach. We know, from the previous section, how to answer “what is the probability that one of the variants in this locus explains a T or greater portion of phenotypic variance?” But now we’ll take a step into the unknown. The second-order question: what is the probability that no variant exists at this locus that explains a T or greater proportion of phenotypic variance? That is, given I’ve sequenced the locus and haven’t found anything, what’s the probability I could keep on sequencing and still wouldn’t find anything? Well, what’s the probability the sun will rise tomorrow? Following Laplace, 9592/9593. Well Okay. In all seriousness, this particular question about genetic mutation has just enough structure in it to avoid the philosophical issues that arise with the Sunrise Problem (Statistics can’t prove a negative, etc). Specifically: we rely on the fact that we’ve done sequencing. This means that we have not failed to query sites for mutations (as in the case of genotype chips), but we have, and there were none. That is, if a variant exists, we’ve got a pretty good idea about it’s frequency. Let’s turn to Mister Bayes, and ask, given I’ve sequenced $N$ samples and seen $\alpha$ counts of a given allele, what is the posterior density of the allele frequency? Using the Rule: $d\mathbb{P}[p|\alpha;N] = \frac{ [x^\alpha] (px + q)^{2N} d\mathbb{P}[p]}{\int_p [x^\alpha] (px + q)^{2N} d\mathbb{P}[p]} = \frac{{2N \choose \alpha} p^\alpha(1-p)^{2N-\alpha}d\mathbb{P}[p]}{\int_p{2N \choose \alpha} p^{\alpha}(1-p)^{2N-\alpha}d\mathbb{P}[p]}$ This doesn’t give a good intuitive sense of the posterior distribution other than the basic shape (gamma-like, no?). The best thing to use as a prior is the allele frequency spectrum modeled on the human bottleneck-and-superexponential-expansion (see Evans and Shvets, for instance), but I find a good intuitive model (that surprisingly works well in practice) is a truncated inverse distribution $d\mathbb{P}_\epsilon[p] = \left[-p\log(\epsilon)\right]^{-1}$ which is distributed on the domain $(\epsilon,1]$. This tends to overestimate allele frequency because (1) humans have grown faster than exponentially, and (2) no mutation is allowed to have a frequency lower than $\epsilon$. You can make a Gamma distribution peak at $\epsilon$ and then go to 0, but this is a nice toy distribution. Plugging it in yields ${2N \choose \alpha} p^{\alpha-1}(1-p)^{2N-\alpha} \left[\int_\epsilon^1 {2N \choose \alpha} p^{\alpha-1}(1-p)^{2N-\alpha}dp\right]^{-1}$ Digression: The Probability that a Singleton is more than a Singleton This is actually a fun little proof I have given this simplified prior. You go out and sequence N people, and find a singleton you care about. What’s the probability that it’s frequency is >1/2N? I’ll prove that it’s about $\frac{1}{e}$. What’s the denominator of the above formula when $\alpha = 1$? Plugging in $\int_\epsilon^1 \frac{2NZp(1-p)^{2N-1}}{p}dp = \int_\epsilon^1 2NZ(1-p)^{2N-1}dp = Z(1-\epsilon)^{2N}$ And therefore the conditional posterior of f can be written $d\mathbb{P}[f \geq p | \mathrm{AC=1\; in \;} N] = \frac{2NZp(1-p)^{2N-1}}{Zp(1-\epsilon)^{2N}} = \frac{2N(1-p)^{2N-1}}{(1-\epsilon)^{2N}}$ Now we can inquire as to the probability that, given AC=1 in N people, the true frequency will be greater than some value $\delta$. This is: $\int_\delta^1 \frac{2N(1-p)^{2N-1}}{(1-\epsilon)^{2N}}dp = \left(\frac{1-\delta}{1-\epsilon}\right)^{2N}$ Now let $\delta = \epsilon + \delta'$: $= \left(\frac{1-\epsilon-\delta'}{1-\epsilon}\right)^{2N} = \left(1 - \frac{\delta'}{1-\epsilon}\right)^{2N}$ Letting $\delta' = \frac{1}{2N} - \epsilon$ yields: $\mathbb{P}[f \geq \frac{1}{2N}-\epsilon | \mathrm{AC = 1 \; in \;} N] = \left(1-\frac{1}{2N(1-\epsilon)}\right)^{2N} \longrightarrow \exp\left[-\frac{1}{1-\epsilon}\right] \approx e^{-1}$ Anyway, the point is you have some posterior estimate $f_{\mathrm{post}}(p)$ of the allele frequency spectrum given that you haven’t observed it in your sequencing. Any variant you haven’t seen yet has that posterior spectrum, and thus given a fixed proportion of phenotypic variance explained T, the odds ratios that such sites must have are constrained to be very large. In particular, for a 10% prevalence disease, using the above model, a 0.01% frequency variant would need an odds ratio of 400,000 to explain 1% of the liability-scale variance. However, the fact that for such a rare variant, 1% phenotypic variance is achievable is really a breakdown of the liability-scale model. In particular, there is a ceiling on the variance explained by a mutation: a 0.01% variant that is fully penetrant only explains 0.1% of the prevalence of the trait. That is, at some point you know that any undiscovered single variant left at a locus, even if fully penetrant, cannot explain a significant fraction of the phenotypic variance. On the observed (0/1) scale, such a variant of frequency f for a disease of prevalence p explains a $f(1-f)/K(1-K)$ proportion of the phenotypic variance. Transferring this to the liability scale by Dempster’s formula $h_{0/1} = t^2h_x/(pq) = t^2 h_x / [K(1-K)]$ Now the genetic variance on the observed scale, $\sigma_{g_{0/1}}^2 = \beta f(1-f) \leq f(1-f)$ as the slope is at most 1. $\sigma_{p_{0/1}}^2 = K(1-K)$ by definition, and therefore $h_{0/1} \leq f(1-f)/K(1-K)$. Obviously during the conversion to the liability scale, the phenotypic variance, which appears in both denominators, will cancel. Thus we have (for prevalence $K$): $f(1-f) = t^2h_x \Rightarrow h_x = f(1-f)/t^2$. Here, $t = \phi(\Phi^{-1}(1-K)) = \phi(1.28) = 0.17$ so for $f = 0.0001$ and 8% prevalence, $h_x = 0.0045$ or 0.45% of the phenotypic variance. This is a bit of a tawdry estimate, as such a variant equipped with OR=15 explains (by the other model) 0.14% of the phenotypic variance. Nevertheless, this transformation provides a reasonable approximation to an upper bound for rarer variants (but we should still search for better corrections to it). Thus, denoting by $\rho(f) = f(1-f)/\phi(\Phi^{-1}(1-K))^2$ the function mapping a frequency to the maximum liability variance explained (e.g. that when the variant is fully penetrant), the probability that an undiscovered locus explains $\geq T$ fraction of phenotypic variance is $\mathbb{P}[V \geq T] = \int_f \mathbb{P}[\rho(f) > T | f]f_{\mathrm{post}}(f) < \int_{f_\mathrm{min}}^1 f_{\mathrm{post}}(x)dx < (1-f_\mathrm{min})f_{\mathrm{post}}(f_{\mathrm{min}})$ where $f_{\mathrm{min}}$ is the minimum f such that $\rho(f) > T$. Note that this expression is also $< f_\mathrm{post}(f_{\mathrm{min}})$, that is, bound by the posterior allele frequency distribution function evaluated at $f_\mathrm{min}$. Now $f_\mathrm{min}$ is easy to calculate, set $V =f(1-f)/\phi(\Phi^{-1}(1-\mathrm{prev}))^2 = zf(1-f)$ so that $f_\mathrm{min} = \frac{1}{2} \pm \frac{1}{2z} \sqrt{z^2 - 4Vz}$ Thus given the prevalence of a disease, there is an effective upper bound on the liability-scale variance explained, given that the observed 0/1 variance cannot exceed $f(1-f)$. Here, we plot $f_{\mathrm{min}}$ as a function of the variance explained So let’s go to our example. Lets say you’ve got a 10% prevalence disease, and have sequenced 50,000 individuals at a locus. You want to know the probability that there’s a 1% variance explained mutation hiding somewhere in your population that you didn’t discover. Then V=1% and z = 0.17, plugging in gives $f_\mathrm{min} = 0.00034$. Using the posterior from above, $f_\mathrm{post}(f_\mathrm{min}) < 1.5\times 10^{-16}$. Here’s the kicker: Even if you’ve sequenced only 10,000 individuals. So long as you haven’t seen a variant at all, the chance that such variant that could explain 1% of the variance is still < 1e-11. That means that for a 1e-6 “chance” to have missed such a variant, about 100,000 such variants would have to exist in the population you sampled. At the locus you’re looking at. Quite reasonable on a genome-wide scale. Quite unreasonable for a gene. In other words, if even a moderate amount of sequencing has been done on your gene, there are no more single variants left to find. At least, if you care about single variants explaining a large portion of heritability. Other caveats: this assumes a homogenous population, it could be that there are populations where such variants do exist even at reasonable frequencies. But wait, there’s more. Let’s say such a variant does exist. It’s completely penetrant, but it has a frequency of $f < 1/20,000$. For the sake of example, let’s say you’ve sequenced people from Dublin, and this variant at frequency 1/50,000 didn’t get into your study. Or maybe you did see a copy or two, it actually doesn’t matter. Suppose it’s causal dominant, you get the disease if you’re a carrier. But here’s the deal. Dublin’s population is 1.3 million. Only 26 copies of this allele exist. You could sequence every person in Dublin and you wouldn’t have the power to associate it at 5e-7. Furthermore, there are constraints beyond population size: you need to actually collect the samples, and you need to pay for the sequencing. Thus, in addition to there not being single variants of high variance explained left to find, many of those things we haven’t found yet cannot be demonstrably associated. They are un-followupable. The good news is, the things we have found can. But at some point the posterior allele frequency becomes so scrunched towards 0 that (on a single-variant level at least) you just have to forget about ever following those up statistically. At that point you’ve got to break out the crispers and engineer some cells. Another Comment on Studies The preceding discussion is basically the entirety of the reason sequencing studies have really dropped the ball regarding the genetic architecture of the traits they examine. Sequencing studies are effectively mini-GWAS studies, characterized by a Q-Q plot and a multiple-testing burden higher even than that of GWAS, with no clear understanding or mention of why sequencing was undertaken in the first place. “We wanted to discover all the variants present in our gene.” Okay fine. Given you’ve done that, do we ever need to sequence that gene again? What’s the probability that we missed something there that’s explaining a large fraction of the variance of the phenotype? You don’t even need phenotyped samples to answer that question, you can just use estimates from the 1000 Genomes Project! This is what’s frustrating: genetic studies cherry-pick one or two novel top hits, and leave everything else on the table. The process of exclusion has been entirely ignored. If we went back, and actually analyzed the data with a careful eye, we probably would have completely excluded hundreds, maybe thousands of loci from consideration. Certainly thousands of single variants. The Multi-Variant View My little (now reaching ~3500 words) invective, up til now, has been focused on a single-variant view of heritability (variance explained). This view, while helpful for talking about what can be true about the properties of single variants, is less helpful about talking about what can be true about sets of variants. Things like “variants in SOX9” (a very cool gene), or “variants in open chromatin regions near actively transcribed genes in renal cells.” This section is basically an extension of the previous discussion to groups of variants; but this alteration presents some novel challenges: the nonlinearity of heritability estimates, the large state space of multi-variant genotypes, and the statistical brutality of linkage disequilibrium. Expanded Multivariant View Now, rather than having a single variant $\nu$, one has multiple variants $\nu_1,\dots,\nu_n$, with frequencies $f_i$, and effects $\beta_i$. It’d be very nice (and indeed, some papers do) to write $\rho(\vec \nu) = \sum_{i=1}^n \rho(\beta_i,f_i)$ But unfortunately $\rho$ is linear neither in $\beta$ nor in $f$. First, $\rho = V/(\sigma_2^2 + V)$ is decidedly nonlinear in V. Second, $V$ is quadratic in both $f$ and $\mu$, and $\mu$ is quadratic in $f$ and inverse logistic in $\beta$. So forget that sum. Instead, it is straightforward to extend from a genotype to a multigenotype. Multiple Variant Variance Explained This follows from replacing the index (Ref,Het,Var) from the previous calculation with the multi-index ([Ref,Ref,…],…,[Var,Var…]). For convenience, we’ll wrap the multi-index to a univariate index $i \in (0,\dots,3^n)$, and denote by $G_j \in i$ the genotype of the $j$th variant in the $i$th multigenotype. Then: $\mathbb{P}[A|i] = [1+\exp(\mu_0 + \sum_{G_j \in i} G_j\beta_j)]^{-1} = P_i(\mu_0)$ The frequency of multigenotype $i$ we denote by $F_i$, which, in the case of full HWE with no LD is $F_i = \prod_{G_j \in i} \binom{2}{G_j} f_j^{G_j}(1-f_j)^{2-G_j}$ $\mathrm{Prevalence} = \sum_{i=0}^{3^n} P_i(\mu_0)F_i$ Then using $\mu_T = -\log(1/\mathrm{Prev}-1)$ $V = \sum_{i=0}^{3^n} (\mu_i-\mu_t)^2F_i$ $\rho = \frac{V}{\pi^2/3 + V}$ The only issue is the sum. As in the case of power (which requires massive sums over genotype space), we can get around this more efficiently via simulation. We can exploit the fact that we can set $\mu_0 = 0$ and allow T to shift. It’s easier also to use the standard normal rather than the logistic for this (as a sum of normals is still normal) A simulation would then be:  multiVarSim <- function(frequency,odds,prevalence,logit=T,prec=0.01) { bt <- log(odds) invlink <- qnorm link <- pnorm if ( logit ) { invlink <- invlogistic link <- logistic } # this part can be done with replicate, but I enjoy vector algebra nRuns = round(10/prec,1) drawGeno <- function(f) { rbinom(nRuns,size=2,prob=f) } S <- as.matrix(sapply(frequency,drawGeno)) %*% as.matrix(bt) # now u_0 needs to be such that the total prevalence is prevalence # this means that sum(link(S+u0))/length(S) = prevalence cPrev <- function(muR) { sum(link(S+muR))/length(S) - prevalence } res <- uniroot(cPrev,lower=invlink(0.001),upper=invlink(0.999)) muR <- res$root
S <- S + muR
# now remember we care about the variance around muP
}



The only trouble with using the multi-genotype approach is that the state space is huge: for n variants, there are $3^n$ genotype assignments, so forget about looking at 15 variants, let alone hundreds. As mentioned, simulation can overcome this setback, but simulations themselves are expensive.

But even beyond that, when one moves from calculating variance explained by several variants to bounding variance explained by several variants, the numerics becomes challenging. In particular, one moves from a one-parameter posterior distribution (two-parameter if taking uncertainty of the frequency into account) to an n-parameter distribution (2n if accounting for frequency uncertainties), with non-axial limits of integration. That is, while for a single variant

$\rho_f^{-1}(V)_u$ produced a single point $\beta_u : \rho(\beta_u,f) = V$, for multiple variants

$\rho_f^{-1}(V)_u$ produces a curve: $\beta_{u_i} : \rho(\beta_u, f) = V$

That is, you can reduce the effect size of one variant, while increasing the effect size of another variant to compensate. This is a situation where one might use Monte Carlo simulation to evaluate the integral: but note that for each point of the Monte Carlo, a simulation needs to be run to get the variance explained. This is terribly costly.

Another way to bound the probability is to solve the constrained optimization

$\mathrm{max} -\log \mathbb{P}[\vec \beta_{\mathrm{true}} = \vec \beta_u | \beta^*,f]$

$s.t. \;\; \rho(\beta_u,f) = V$

The likelihood is (luckily) convex, so this isn’t excessively difficult to do. Let $p^*$ be that maximal p-value. Then

$\mathbb{P}[V_{\mathrm{true}} > V | \beta^*,f^*] \leq 1-\Phi(\phi^{-1}(1-p^*))$

Nevertheless, even this is very costly. For every evaluation of $\rho$ by the optimizer, a simulation needs to be done to estimate its value; even this approximation is really expensive. In addition, LD generates additional difficulty: Haplotypes need to be simulated rather than genotypes (in order to estimate $\rho$), and if the $\beta^*$ were fit independently rather than jointly, both the mean and the the variance-covariance matrix of the posterior distribution need to be adjusted to account for LD.

This, combined with the problem of estimating effect sizes from low-frequency variants, leads to methods of collapsing variants across a locus into scores.

Collapsed Multivariant View

Collapsing is a means of re-writing the multiple-variant joint model. Instead of

$\mathrm{lgt}(Y) = \sum_i \beta_i X_i + \sum_i \gamma_i C_i + \alpha$

we write $X$ as a matrix and have

$\mathrm{lgt}(Y) = \tau X\xi + \sum_i \gamma_iC_i + \alpha$

Where $\xi_i$ is constant and for the calculation of variance explained, $\xi_i = \beta_i$. The point is $\xi$ itself is not fit, but only the coefficient $\tau$ which, when $\xi = \vec\beta$, $\tau=1$. Collapsing tests (“burden” tests) don’t have access to $\beta_i$, but instead try to “guess” using some function $\xi = \Xi(\vec \nu)$ of the variant properties (like frequency, estimated deleteriousness, and so on).

One thing immediately recognizable is that if $\xi = \vec \beta$ and $\tau = 1$ then the formula for

$\mathbb{P}[A|i] = [1+\exp(\mu_0 + \sum_{G_j \in i} \beta_i G_j)]^{-1} = [1 + \exp(\mu_0 + \vec G_i^{\dagger} \xi)]^{-1}$

So in other words, the score vector $\vec S = \tau X\xi$ is an estimate of the latent liability variable for all of the samples. This allows the following simplifying assumption:

$S_i \sim N(\mu,\sigma^2)$

for some $\mu$ and $\sigma$. This is sort of more-or-less true as $X_{\cdot,j}$ are a sequence of binom(2,f_i) variables, and so a weighted sum should be approximately binomial, ergo approximately normal. Therefore, for a given $\tau, \xi$:

Variance Explained by a Collapse Test

We have $\mathbb{P}[A|S] = 1-\Phi(\mu_0-\tau S)$. Here we use a normal latent model because it plays nicely with the assumption that $S$ is normal. Then

$\mathrm{Prevalence} = \int_{T}^\infty (1-\Phi(\tau S))\phi((\tau S - \mu_{\tau S})/\sigma_{\tau S}) dS$

As before, $T$ can be found by numerical quadrature. Then

$V = \mathrm{var}(T - \tau S)$

and $\rho(S,\tau) = V/(1+V)$.

Note that this can be simplifed by talking about a variable $R = \tau S$, but it’s nice to clarify what $\tau$ does (which is scale the proposed score distribution so it aligns with the underlying liability explained).

To use this as an approximation to the variance explained by multiple variants, note that $S$ is a linear function, and therefore (assuming no LD and no relatives):

$\mathbb{E}[S] = \mathbb{E}[X\xi] = \mathbb{E}[X]\xi \rightarrow \mathbb{E}[S_i] = \sum_j 2f_j\beta_j$

$\mathrm{Var}[S_i] = \sum_j \beta_j^2 f_j(1-f_j)$

Expressions which can be adapted for the presence of LD and relatedness.

This gives $\mu_{\tau S} = \tau \sum 2f_j\beta_j$ and $\sigma^2_{\tau S} = \tau^2 \sum \beta_j^2 f_j(1-f_j)$

For an evaluation of “true” variance explained, set $\tau = 1$. To identify the variance explained by a (possibly wrong) guess $\xi$, replace all instances of $\beta_j$ with $\xi_j$.

Using a collapsed estimate of the variance explained has the benefit of being univariate. That is, for a given (either guess or true) $\xi$, then $\rho = \rho_\xi(\tau,f)$, and thus it has a univariate inverse: $\rho_{\xi,f}^{-1}(\rho) = (\tau_l,\tau_u)$.  This enables the bounding of a locus or burden test through the posterior distribution of the parameter $\tau$, just as we did with a univariate $\beta$ in the first section. That is, given $\xi$ (and assuming we don’t need to integrate over the posterior of $f_i$ which is fairly painful in this case), to query the probability of a locus explaining Q or more of the phenotypic variance:

Calculate $(\tau_l,\tau_u) = \rho_{\xi,f}^{-1}(Q)$

Calculate $p = \Phi( (\tau_l-\tau^*)/\mathrm{SE}_\tau ) + 1-\Phi( (\tau_u - \tau^*)/\mathrm{SE}_{\tau})$

However (!), the obvious approach (set $\xi_i = \beta^*_i$) where $\vec \beta$ was fit jointly by MLE has the drawback that $\tau$ is inflated due to overfitting. In fact, in such cases $\tau=1$ always. One way around this is to use $\beta_i$ estimated from a different source. Alternatively let $\xi$ be a hypothesized score (e.g. a burden weight vector). It’s important to note in the latter case that bounding variance explained by a burden test does not bound the maximum possible variance explained at a locus, it bounds the variance explained by the hypothesized $\xi$. The variants at the locus could still contribute substantially to trait heritability, just simply in a direction orthogonal to $\xi$ (of which there are $n-1$).

Surely the many things I haven’t seen explain lots of heritability in bulk, right? Right?

The last thing to check off with the locus view now that we can bound variance explained by (1) multiple variants and (2) burden scores, is the probability that at said locus, there may exist many variants that for whatever reason were not discovered, but that in aggregate explain a large proportion of heritability. As in the single-variant case, you can put bounds on these things, but more importantly, consider what it means to want the answer to this question (as anything more than an exercise in probability).

It’s been a long post. Let’s take a step back.

Human medical genetics is science with an ulterior motive, where the search for additional causal variants is often explicitly motivated by the statement that new loci provide new targets for pharmaceutical treatments. And, by and large, this is a positive force for the advancement of disease biology and the understanding of human disease. At the same time, I believe that this Quest For Treatment, combined with the increasing focus on impact explains a wide variety of phenomena in the medical and statistical genetics community, from hasty and opportunistic study design to the ENCODE backlash (and the resurgence of Big vs Small science).

Sequencing is a good example. Initial GWAS studies provided a good number of variants and loci with strong disease associations. The bulk of these associations, however, were not immediately interpretable: falling between genes (unclear which protein might be affected), or in a noncoding part of a gene (so not obviously gain- or loss-of-function). In part because of the difficulty of interpreting biological effects, and in part because of the worry of a winner’s-curse effect, we started (rather uncritically) to refer to these variants as “tag-SNPs”, hypothesizing that the true causal variant was unlikely to have been ascertained, as the bulk of human variation is uncommon to rare and not present in the set of imputed loci, and furthermore is likely to have an interpretable biological impact. In retrospect, it seems rather clear that common variants which maintain their association across populations are unlikely to be tagging a “true causal variant” of lower frequency which is far more likely to be absent in other populations. (That said, one  proof of an elegant experiment that the results reveal something that is obvious in retrospect)

I think the shift from array-based GWAS towards sequencing-based studies reflects both the optimism of researchers, and the entrance of impact as a scientific consideration. Should loss-of-function mutations underlie even 10% of the GWAS hits, we’d then have generated a respectable number of druggable targets, opening up new paths towards treatment. Even having seen the (largely negative) results, even as I agree with Kirschner’s characteriztion of impact, I still think these were the right designs to follow up on the results of GWAS.

But what it does mean is that these studies weren’t designed to add to the knowledge of genetic architecture. Why didn’t we only sequence cases that had none (or very few) of the known genetic risk factors? Because at the same time we wanted to see if there were un-discovered variants underlying those GWAS signals. Why do we sequence exomes instead of the regions near GWAS hits? Because not only do many GWAS loci fall near to or within genes, but also protein-coding variants, in particular loss-of-function variants, provide a very clear target for both follow-up lab experiments, and potential therapeutic developments. In addition, exome sequencing is cheaper so one can sequence more samples to find more rare variants and have more power to associate them.

So, really, the question “What’s the likelihood that there are multiple variants that I have yet to ascertain that in aggregate explain a substantial fraction of phenotypic variance?” is a continuation of this rather incautious optimism about 1) the strictly additive model holding, and 2) the infinitesmal (“Quantiative”) model not holding, to say nothing of 3) unbiased estimates of heritability. For a common disease, there are reasons on all three fronts to cautiously articulate hypotheses before going any further in the Quest for Treatment.

As mentioned in the last section on bounding the variance explained by undiscovered variants, we’ve now gotten to territory where undiscovered variants are so rare that in order to explain even 0.1% of the phenotypic variance they have to be effectively 100% penetrant. The fact that common diseases are not known to present as Mendelian in a substantial portion of cases, as the assumption that extremely rare (or private) variants of strong effect would suggest, indicates that such variants are not likely to account for a large fraction of disease prevalence. At the same time if these variants are not of high effect but together account for a large fraction of heritability, the model isn’t really distinguishable from the Quantitative model: every variant is causal, the effect size is negligible.

At the same time, the Latent Variable Model is just an approximation, and while it alleviates some of the issues of treating a binary trait as a real-valued trait (with actual values of 0 and 1), it does not eliminate them. That is, phantom epistatic effects are mitigated, but not not eliminated (because the latent variable need not be normal, though we assume so). This not only impacts estimates of heritability, but also indicates that the purely additive model is mis-aligned, and there might be a considerable gain (in terms of understanding heritability estimates) to carefully considering what happens in individuals with many genetic risk factors. At the same time, both association scans and association study experimental designs tend not to consider maternal, dominance, additive-additive, or additive-environment effects. Which seems like a more fruitful use of research money: sequencing many thousands of samples in the hope of finding a fully penetrant f=1/30,000 variant, or collecting a wide array of phenotypes and environmental data to better understand endogenous and exogenous disease and gene x disease covariates? Or buying a cluster (or access to Amazon’s) to scan for additive-additive effects?

In sum: none of this proves that a locus can’t harbor undiscovered variants that explain some given fraction of phenotypic variants. One can bound that probability just as in the single-variant section. But the point is: think about why such a bound is necessary. Is the working hypothesis to be tested really that there are high-penetrance near-private variants lurking in the genome? Is that really the hypothesis Occam’s Razor would suggest, given all of the observations in this post?

I, for one, don’t think so.