June 5, 2013

Back to Biology: Confounding, Bias, and Dichotomous Traits

Filed under: Uncategorized — heavytailed @ 2:56 am

(Update 6/2013 – I’ve edited and extended this old post from 10/2012. I had begun writing a new related post, and decided the material was better placed within this one as an extension.)

Two recent observations set me down a dark and lonely road, and they are unsurprisingly related. They both have to do with sampling and the kinds of confounding biases resulting from uneven sampling with a dichotomous trait, or even in a quantitative trait. The first is the observation that genetic studies (typically) have differential power to find protective and risk variants, an observation which may explain the lack of druggable targets, and perhaps even the ghost of “missing heritability” (although I really doubt that this alone could explain the latter). The second is the observation that correcting for stratification wreaks havoc on rare variant tests.

I think neither of these things are surprising on its own, but I feel it’s worth fleshing out the underlying root of both problems. It all boils down to study design, and more specifically sampling: specifically covariation between dichotomous trait and sample group, or and inflation of group representation above the natural proportion (that is, inflating or deflating group proportions).

Digression: Deriving the Score Statistic

So it turns out I’m really bad at remembering formulae. I blame Wikipedia. And while I wouldn’t pull out limits of Riemann sums to derive integrals of the form

\int \frac{1}{\sqrt{a + bx^2}}dx

understanding the Score (and also: Wald and LRT) test is important enough that it’s worth knowing the derivation. This is taught in every introductory graduate statistics course, but is present in the literature in such a kaleidoscopic number of forms that it’s almost necessary to know the derivation not to be confused, for instance, when a paper writes “it can be shown[1] that the variance of this statistic is” (some formula that doesn’t look quite right). The derivation is a rather fun exercise. The key thing to remember is that “Score” is a terrible name for this test. I think the econometricians got it right: it’s a Lagrange Multiplier Test.

So you have some parametric distribution (typically GLMM, but it will work for any parametrized distribution) that has a likelihood function \mathcal{L}(\tau,S), which is a function of your parameter vector \tau and your sample S. “Fitting” your data is the process:

\tau^* = \mathrm{argmax}_\eta L(\tau,S)

which, by definition, and so long as L is smooth in \tau, is guaranteed to have the property

\left.\frac{\partial L(\tau,S)}{\partial \tau}\right|_{\tau^*} = 0

noting that this is the zero matrix. From the standpoint of optimization, what is the hypothesis that \tau_4 < 0 ? That is to say, if I asked for the maximum likelihood (and corresponding \tau^*) for which \tau_4 < 0, what would we do? We’d try to restrict the domain of \tau_4. We’d just do a constrained optimization! That is:

\tau^*_{\mathrm{constrained}} = \mathrm{argmax}_{\tau} L(\tau,S) \;\; s.t. \;\;\tau_4 < 0

Indeed we can expand to arbitrary constraints (such as \eta lies in the unit ball) quite easily, by taking an arbitrary vector-valued constraint function:

\tau^*_{\mathrm{constrained}} = \mathrm{argmax}_{\tau} L(\tau,S) \;\; s.t. \; \; g(\tau) = 0

You might imagine breaking \tau into constrained components and unconstrained components, but this turns out not to be too helpful. We need to make use of the standard result involving maximum likelihood: namely that

(\tau^*-\tau_t) \sim N(0,-[\nabla^2 \mathcal{L} |_{\tau_t}]^{-1})

Which means that the distribution of the fitted (unconstrained) parameters vary jointly, and thus the variance of \lambda will necessarily involve unconstrained components. We constrain the likelihood by a function g(\tau), and thus our Lagrangian is

\mathcal{L}(\tau) + g(\tau)\lambda

which gives first order conditions

\mathcal{L}_\tau|_{\tau_\lambda} + \nabla g|_{\tau_\lambda}\lambda =0 \;\; \;\;\; g(\tau_\lambda) = 0

Where \tau_\lambda is the maximum of the constrained optimization. By contrast, we’ll call \tau^* the unconstrained maximum, and \tau_t the true value of \tau.

A very typical form for g(\tau) is \tau, which constrains this set of parameters to 0. In this case the first order conditions have a particularly nice form: \mathcal{L}_\tau + \sum \lambda_i; and in particular if \tau is a scalar, this gives the equation \lambda = \mathcal{L}_\tau, which gives rise to the standard nomenclature: the partial derivative of the likelihood with respect to a given parameter is the “score” for that parameter.

At any rate, we want to get around to testing that score, which means somehow we need a value for \mathcal{L}_\tau|_{\tau_\lambda}. So we’ll Taylor expand around \tau_t:

\mathcal{L}_\tau|_{\tau_\lambda} = \mathcal{L}_\tau|_{\tau_t} + \nabla^2\mathcal{L}|_{\tau_t} (\tau_\lambda - \tau_t)

Now while we don’t know much about \tau_\lambda - \tau_t, we do know the distribution of \tau^*-\tau_t, so we examine another expansion:

\mathcal{L}_\tau|_{\tau^*} = \mathcal{L}_\tau|_{\tau^t} + \nabla^2\mathcal{L}|_{\tau_t}(\tau^* - \tau_t) \rightarrow_{H0} N(0,-\nabla^2 \mathcal{L}|_{\tau_t})

So if g(\tau) is such that \nabla g_i \in \{0,1\}, then we’re effectively done, because under the null hypothesis the constraints do not bind, and so the binding \lambda_i will be distributed as \tau^*, and the non-binding \tau will be distributed as \tau^*. The one last step is to derive the conditional distribution of the constrained parameters given the unconstrained ones, so let’s write \nabla^2 \mathcal{L} as \left(\begin{array}{cc} \mathcal{L}_{\tau\tau} & \mathcal{L}_{\tau\alpha} \\ \mathcal{L}_{\alpha\tau} & \mathcal{L}_{\alpha\alpha}\end{array}\right) where now \tau are constrained to 0, and \alpha are not. Then, having fit \alpha:

\lambda|\alpha \sim N(0, -[\mathcal{L}_{\tau\tau}-\mathcal{L}_{\tau\alpha}\mathcal{L}_{\alpha\alpha}^{-1}\mathcal{L}_{\alpha\tau}])

Evaluated at \alpha^*. However if g(\tau) is not of such a nice form, we cannot directly appeal to the null distribution of \mathcal{L}_\tau|_{\tau_\lambda} as we did here, because we relied on the fact that \nabla g(\tau) \in \{0,1\}. In the general case we need to know these values, so we play the Taylor expansion trick again:

g(\tau^*) = g(\tau_t) + \nabla g|_{\tau_t}(\tau^*-\tau_t)

g(\tau_\lambda) = g(\tau_t) + \nabla g|_{\tau_t} (\tau_\lambda - \tau_t)

Now g(\tau_\lambda) = 0, as that is our constraint. This yields a relation for the Jacobian:

g(\tau^*) = \nabla g|_{\tau_t}(\tau^*-\tau_\lambda)

And from above we have a relation for the likelihood derivatives (because by definition \mathcal{L}_\tau|_{\tau^*}=0:

\mathcal{L}_\tau|_{\tau_\lambda} = \nabla^2\mathcal{L}|_{\tau_t}(\tau_\lambda - \tau^*)

We can combine these two equations, yielding

g(\tau^*) = -\nabla g|_{\tau_t}[\nabla^2 \mathcal{L}|_{\tau_t}]^{-1}\mathcal{L}_{\tau}|_{\tau_\lambda}

But we care about \lambda. Remember that we have -\nabla g |_{\tau_\lambda}\lambda = \mathcal{L}_{\tau}|_{\tau_\lambda}. Then:

g(\tau^*) = \nabla g_{\tau t}[\nabla^2\mathcal{L}_{\tau_t}]^{-1}\nabla g|_{\tau_\lambda}\lambda

Which, because we have a symmetric inner product which will be invertible, gives

\lambda = \left[\nabla g_{\tau t}[\nabla^2\mathcal{L}_{\tau_t}]^{-1}\nabla g|_{\tau_\lambda}\right]^{-1}g(\tau^*)

And we can use the Taylor expansion way above for g(\tau^*), combined with the fact that we know how \tau^* is distributed, to calculate that (under H0 where \tau^*=\tau_\lambda=\tau_t):

\lambda = \left[\nabla g_{\tau t}[\nabla^2\mathcal{L}_{\tau_t}]^{-1}\nabla g|_{\tau_\lambda}\right]^{-1} N(0,\left[\nabla g_{\tau t}[\nabla^2\mathcal{L}_{\tau_t}]^{-1}\nabla g|_{\tau_\lambda}\right])

and therefore

\lambda \sim N(0,\left[\nabla g_{\tau t}[\nabla^2\mathcal{L}_{\tau_t}]^{-1}\nabla g|_{\tau_\lambda}\right]^{-1})

And that’s the Lagrange Multiplier (alt. Rao’s Score) test.

[1] Obscure, archaic textbook from 1972

Part 1: Oversampling

As a bit of motivation, and because the mathematics will be largely the same, we review the impact of oversampling on a linear model. For binary outcomes, this amounts to inflating or deflating the rate of outcome in the experiment as compared to the natural rate; for example, sampling 50% cases and 50% controls for a disease with only a 15% prevalence in the underlying population. The basic idea behind oversampling is to note that the tables

\begin{array}{c|c|c|} & A & B \\ \hline \mathrm{Case} & 100 & 8 \\ \hline \mathrm{Control} & 100 & 2 \end{array} \;\;\;\;\;\; \begin{array}{c|c|c|} & A & B \\ \hline \mathrm{Case} & 500 & 12 \\ \hline \mathrm{Control} & 100 & 2 \end{array}

are substantively different (in terms of say, Fisher’s Exact Test), despite the fact that B shows a large enrichment for cases in both situations. A philosophical question that’d be worthwhile for the biological world to answer (but that we won’t get into here): if table 2 is the result of a replication experiment, is the signal in table 1 replicated, or not? While it may seem strange that a frequency of 8% in cases should in an orthogonal experiment come back at a frequency of 2.4% in cases, it only takes a tiny bit of structure.

Anyway, it’s somewhat useful to quantify the degree to which oversampling can reduce power in this way. Following the standard latent-variable setup, we let y_i = \tau X_i + \alpha + \epsilon_i > \theta where \epsilon_i has some cumulative distribution \phi(x). Let r be the rate of sampling, so the number of cases is given by Nr. Also set the frequency of X in cases and controls to f_A and f_U respectively. A score test for the effect of X_i will be

\mathcal{L}_\tau = \sum_A \phi^{-1}\phi'X_i - \sum_U(1-\phi)^{-1}\phi'X_i

(Recall that the likelihood is just the sum of y_i\log(\phi)-(1-y_i)\log(1-\phi)), where sums are over the case and control sets.

the variance of this statistic, as derived above, is


and relies on

\log(1-\phi)'' = -(1-\phi)^{-2}\phi'^2- (1-\phi)^{-1}\phi''

All of these are evaluated under the null hypothesis (\tau=0,\alpha=\hat{\alpha}). For every distribution of \phi, we have that \phi(\alpha) = r, by definition. The values of \phi' and \phi'' vary with the particulars. For logistic regression,

\phi'|_{\hat\alpha} = \frac{e^{-\alpha}}{(1+e^{-\alpha})^2} = r(1-r)

(note: 1+e^{-\alpha} = 1/r so e^{-\alpha} = 1/r-1 = (1-r)/r). Finally \phi''|_{\hat\alpha} = -r(1-r)(2r-1). Therefore:

\mathcal{L}_\tau = \sum_A (1-r) X_i - \sum_U rX_i \mathcal{L}_{\alpha\alpha} = -\sum r(1-r) = -Nr(1-r) \mathcal{L}_{\tau\tau}-\mathcal{L}_{\tau\alpha}\mathcal{L}_{\alpha\alpha}\mathcal{L}_{\alpha\tau} = -\sum X_i^2 r(1-r) \mathcal{L}_{\tau\alpha} = -\sum X_ir(1-r)

So then the score test will be \mathcal{L}_\tau/\sqrt{\mathrm{var}}

T = \frac{U}{\sqrt{\mathrm{Var}[U]}} = \frac{(1-r)\mathrm{AC}_{\mathrm{case}} - r\mathrm{AC}_{\mathrm{control}}}{\sqrt{-r(1-r)\left[\frac{1}{N}(\sum X_i)^2 - \sum X_i^2\right]}}

Sums can be simplified using f_U and f_A:

 = \frac{(1-r)Nrf_A - rN(1-r)f_U}{\sqrt{-r(1-r)[\frac{1}{N}(2N\tilde{f})^2 - 2N\tilde{f}(1-\tilde{f}) - 4N\tilde{f}^2)}} = \sqrt{Nr(1-r)} \frac{f_A-f_U}{\sqrt{2\tilde{f}(1-\tilde{f})}}

and \tilde{f} = rf_{A}+(1-r)f_{U}. Note, however, that this is just to get a sense of T at the expected allele counts in cases and controls. This does not speak to the power of the test, nor even to the mean value of T (as we’ve evaluated T(\mathbb{E}[\mathrm{AC}_{\mathrm{case}},\mathrm{AC}_{\mathrm{control}}], which is different from \mathbb{E}[T(\mathrm{AC}_{\mathrm{case}},\mathrm{AC}_{\mathrm{control}})], due to the nonlinearity of T.

This can be extended to weighted group tests (e.g. burden tests). Let the weighting vector for the variants in a gene be \xi, and let X_i now be a vector of genotypes for sample i. Define S_i = \xi^\dagger X_i and the score test becomes

T_{\xi} = \frac{\sum_A (1-r)S_i - \sum_U r S_i}{\sqrt{-r(1-r)[\frac{1}{n}(\sum S_i)^2 - \sum S_i^2]}}

In both cases, you find that the test statistic evaluated at the expected genotypes (given f_A and f_U) is maximized for evenly matched case-control sampling, or r=0.5. But, maximizing T(\mathbb{E}[AC]) is not necessarily the same as maximizing power. One easy way to demonstrate that is to compare the behavior of T(\mathbb{E}[AC]) to that of \mathrm{Pow}(T|\tilde{f},\mathrm{OR}).

Calculating Power

Examining the formula for the test statistic T, we notice that only four quantities are needed: the allele count in cases, allele count in controls,  total number of alleles, and total number of homozygotes (for X_i^2). This leads to the following obvious means of simulating a draw of genotypes from a population.

Given f_A and f_U, the frequency of the variant in cases and controls (which can be calculated from the odds ratio and the overall frequency in the population), draw

\mathrm{Het}_A \sim \mathrm{binom}(n=|A|,p=2f_A(1-f_A))

\mathrm{Het}_U \sim \mathrm{binom}(n=|U|,p=2f_U(1-f_U))

\mathrm{Hom}_A \sim \mathrm{binom}(n=|A|,p=f_A^2)

\mathrm{Hom}_U \sim \mathrm{binom}(n=|U|,p=f_U^2)

and if by chance \mathrm{Het}_S + \mathrm{Hom}_S > |S|, set \mathrm{Het}_S = |S|-\mathrm{Hom}_S. Then T can be calculated. A draw of 10,000 realizations for a site of MAF=1% with OR=1 and 5000 samples is very well calibrated:


So the quantiles of the normal and uniform distributions line up very nicely. There is (as you would expect) some quantization near 0, due to the fact that allele counts are discrete. The power can be estimated from these simulations by, for a given nominal p-value \rho, counting the number of simulations for which |T| > \Phi^{-1}(1-\rho) (where \Phi is the inverse normal cdf).

One notices that while T(\mathbb{E}[AC]) is nicely symmetric, the power seems to be asymmetrically distributed.


The figures here take a 10,000-sample balanced (5000 case, 5000 control) study, where the prevalence of the disease is 8%. Notice that the distribution of T(\mathbb{E}[AC]) (left plot) is perfectly symmetric, whereas the power is biased towards risk-increasing alleles (middle plot, right plot logscale). So what’s happening to break the symmetry? Consider what case/control frequencies correspond to an OR=1.5 and f=0.05 allele. Using Bayes’ rule:

f_A = (P[A|\mathrm{Het}]P[\mathrm{Het}] + 2P[A|\mathrm{HV}]P[\mathrm{HV}])/(2P[A])

= [2f(1-f)\phi(\alpha + \beta) + 2f^2\phi(\alpha+2\beta)]/[2\rho]

f_U = (P[U|\mathrm{Het}]P[\mathrm{Het}] + 2P[U|\mathrm{HV}]P[\mathrm{HV}])/(2P[U])

= [2f(1-f)(1-\phi(\alpha + \beta) + 2f^2(1-\phi(\alpha + 2\beta))]/[2(1-\rho)]

where \rho is the prevalence, and \beta is the log-OR, and \phi is just the logit. What this means is that for a 1% frequency variant


The case allele frequency (red) increases drastically when the allele confers risk (shaded circles), but decreases drastically when the allele is protective (triangles). At the same time, the control allele frequency (blue) does not significantly change in either case. One potential explanation for the differences in power is that the variance of observed allele counts is higher when the allele is risky: 2p(1-p) is increasing for 0 < p < 0.5, and it could be that this variance of observation dominates any gain from oversampling cases. There’s actually quite a lot in the figure. For diseases even of moderate prevalence (8%, 10%), the control frequency does not change all that much with the odds ratio, either for protective or risk alleles, because the disproportionate number of controls in the population serves as a buffer. So effectively, only the case frequency is changing (and this is particularly true of rare diseases with a prevalence of 1% or less!). Risk alleles increase the case frequency, and the case allele count can increase with it. By contrast, protective alleles lower the case frequency, but the case count can’t decrease with the frequency in the same way, because of (solidifying yet another parallel between genetics and economics) the zero lower bound. You can’t get negative case counts, so in 2500 case samples, the difference between a 0.1% allele and a 0.05% allele is masked by the fact that the realized allele count distribution becomes zero-inflated. What this means is that while a risk allele can increase a 10:5 to 30:5 to 90:5 without any limit, a protective allele can decrease a 5:10 to 0:10 and no lower.

Nevertheless, oversampling does increase power, but it does so differentially on risk versus protective alleles — but always the risk alleles seems to be the one most empowered.


Here are the sampling ratio/power profiles for 1% variants with a risk/protective odds ratio of 2: on the left, the disease prevalence is 8%, while it is set to 30% on the right. As the disease becomes more prevalent, the optimal ratio seems to shift away from balanced samples, and in the latter case it is possible to sample cases to the pont where you are more powered to detect protective alleles.

Coming up: Part 2 – Frequency Confounding, Part 3 – Population Stratification (PCA-confounds)

This too is an older post. See here, and a related post on fixed and random effect models here.


1 Comment »

  1. […] for error covariates, and in modeling the distributions of read counts or genotyping errors. I have written about specific cases before. The general conclusion is that likelihood-based regressions result in […]

    Pingback by Differential *omics in theory and in practice | heavytailed — April 15, 2015 @ 8:22 pm

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: