# heavytailed

## August 18, 2015

### Estimating fraternity coefficients – Part All The Rest

Filed under: omics — Tags: , , , , — heavytailed @ 10:58 pm

It’s been an exciting summer! Building infrastructure at a startup here in SoCal whilst keeping up with my Ph.D. qualifier and research has taken most of my time away from blogging. In my last post, I promised an update on fraternity coefficient estimation “in the coming weeks.” Weeks having come and gone, this will be a brief summary of the qualifier. It’s precisely as you might expect.

The Problem

If there’s any inbreeding in the population (note: the idea of “having populations” implies inbreeding in a broad sense — if mating were purely random there could be no population differentiation) then fraternity coefficients cannot be estimated from bi-allelic markers, as the equations are underdetermined.

This goes back to the condensed coefficients of identity (the Jacquard coefficients [pdf]). The issue is that, if there is inbreeding, fraternity is both D1 and D7 – and all of a sudden you need to estimate D1 in an unbiased way — which means estimating all the Jacquard coefficients. With bi-allelic sites, the genotype observation probabilities, conditioned on each Jacquard state, are given by:

The shape of the matrix gives the entire problem: seven genotype states for nine coefficients. So to estimate the 9 coefficients we need at least three alleles, which has a conditional observation of:

What about four alleles? The only way you could see four alleles in two individuals is (wx, yz) which has probability pqrs. Combining this with the above states, ordering the alleles and reordering/combining rows, gives a final form for this matrix as

And of course there’s an obvious way to make multi-allelic markers from bi-allelic markers: phase them.

There are two estimators associated with this, the first, a maximum likelihood estimator, looks like

$\mathcal{L}(D) = \prod_{k=1}^m \sum_D \mathbb{P}[G^{(k)}|D]\mathbb{P}[D]$

$= \prod_{k=1}^m I[G_k]^TC^{(k)}\Delta$

where C is the matrix above, with p, q, r, s given by the observed alleles at the locus. G is the genotype state for the pair, so $I[G_k]$ is just an indicator variable that selects out the appropriate row, i.e. (aa, bc) would pull out row 3, so $I[(aa, bc)]$ is just $\mathbf{e}_4$.

For a single pair of samples, denote the concatenation of all such rows (over each multi-allelic locus) as the $N \times 9$ matrix Q. The update rule for ML (using logistic boundary functions) is

$\frac{d\mathcal{L}}{d\theta_i} = [1/(Q\Delta) \cdot Q\mathbf{e}_i - 1/(Q\Delta) \cdot Q\mathbf{e}_9]\Delta_i(1-\Delta_i)$

where $\Delta_i = 1/(1 + e^{-\theta_i})$. The term $1/(Q\Delta)\cdot Q$ should be interpreted as follows: $Q\Delta$ is a $N \times 1$ vector, while $Q$ is a $N \times 9$ matrix, therefore $1/(Q\Delta) \cdot Q$ is akin to $\mathrm{diag}(1/Q\Delta) Q$.

Method of Moments estimation can also be used here by taking expectations. In particular, given two individuals at a locus k, letting Ck be the above matrix adapted to k by plugging in the appropriate allele frequencies, and letting G be the indicator variable for the observed genotype state of two individuals, we have that

$\mathbb{E}[G_k^{(ij)}] = \mathbb{E}[C_k\Delta] = \mathbb{E}[C_k]\Delta = C_k\Delta$

as $C_k$ is a matrix of (conditional) Bernoulli random variables. By concatenating the $C_k$ over all loci k, the coefficients could be solved for by least squares:

$\hat \Delta^{(ij)} = (\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{G}^{(ij)}$

The major problem here is that $\mathbf{C}$ gets very large (1.6GB for 25,000 loci with 8 haplotypes). My solution here was to adapt the model via

$\mathbf{\Xi C \Delta}^{(ij)} = \mathbf{\Xi G}^{(ij)}$

and choosing $\mathbf{\Xi}$ in such a way that $\mathbf{\Xi C}$ could be calculated on the fly without ever materializing $\mathbf{\Xi}$ or $\mathbf{C}$.

Both the likelihood and moment estimators can be constrained through the use of Lagrange multipliers (in the latter case the closed-form least squares would have to be replaced by a constrained gradient descent). In particular, given a good estimate of coancestry, one can constrain with

$\Delta_1 + \frac{1}{2}(\Delta_3 + \Delta_5 + \Delta_7) + \frac{1}{4}\Delta_8 = \Phi_{ij}$

and the no-inbreeding assumption would be given by

$\Delta_1 + \Delta_2 + \frac{1}{2}\Delta_3 + \Delta_4 = 0$

$\Delta_1 + \Delta_2 + \frac{1}{2}\Delta_5 + \Delta_6 = 0$

these are just the definition of the coefficients of coancestry and inbreeding. One is free to drop the fractions on the inbreeding constraints, since, combined with $\Delta_i \geq 0$ these two constraints set $(\Delta_1, \dots, \Delta_6) = 0$.

These approaches work fairly well on simulated pedigrees. Here it is on simulated sib and half-sib pairings:

One of the initial claims was that in populations with high degrees of consanguinity, estimating faternity coefficients from bi-allelic sites (and assuming 0 inbreeding coefficients) will result in a biased estimate. Here are populations simulated with 10%, 20%, and 30% rates of consanguinity; the constrained estimator shows decreasing estimates of fraternity (which is incoherent), while the unconstrained estimator correctly increases with consanguinity rate.

Coancestry coefficients also show the same incoherent trend between constrained and unconstrained estimation.

How much do the inclusions of the constraints impact estimates on real data? This next plot shows the coefficient of fraternity as estimated on the 1000G PJL and STU populations, constrained against unconstrained.

It’s heartening to see a good identity trend; in addition to the large number of points which would be set to 0 by constrained estimation of Jacquard coefficients, but have a positive fraternity coefficient when the estimation is not constrained by a 0-inbreeding assumption.

Nevertheless the real-world effect of this is somewhat slight. Looking at the estimates of fraternity across 1000G populations, we see that there’s not an obvious shift in the populations where first-cousin marriages are comparatively common (PJL/STU) [note that the blue outliers here are due to early terminations of constrained Boyden-Fletcher-Goldfarb-Shanno and are eliminated from statistical tests. The points here are per-pair per-chromosome.

But despite these appearances, the PJL/STU populations do have a significantly different distribution from the others. Conditional on fraternity > 0; the PJL and STU population distributions dominate (e.g. occur before/are higher than) the others (Mann-Whitney U: p < 1e-16 vs GBR, FIN — MWU GBR vs Fin: p = 0.22)

So what does all this mean?

The line-item here is: if you’re estimating the coefficient of fraternity, you probably should allow inbreeding, unless you’re sure that the population is mating randomly. If you know there’s a good degree of consanguinity, you should probably allow for nonzero inbreeding coefficients if you’re estimating kinship.

However, if you’re calculating shared alleles for the purpose of a longitudinal model (as in Zhu et al. 2015), you don’t have to worry, really, about anything. In principle, IBD sharing is not even what you should use in this case. If you’ve directly observed genotypes at all (or the vast, vast majority) causal loci (e.g. sequencing, high-density genotype array), you actually care about realized IBS and not IBD at all.

Deconvolving (ab,ac) into its potential sharing states is really just smoothing. Imagine you have typed m sites in the genome for two individuals, but this set does not tag a particular allele (x) which may have additive or dominant effects. Given the genotypes at the m other sites, and the genotype of one sample at x, is it possible to predict the genotype of the other sample? The answer is: take the m sites and calculate the Jacquard coefficients, and, given the genotype for one of the samples, look up the appropriate rows in the C matrix above, and you have a conditional predictor of x on the other sample. In this way, Jacquard estimation is a kind of smoothing; the average IBD sharing.

For the purpose of phenotype prediction, average IBD sharing is only useful in that it predicts average IBS sharing. That is, IBD is expected IBS; however given IBS, calculating IBD does not provide any new information about any loci that went into the IBD calculation. For dominant effects, you care that all four, or two pairs of alleles are shared IBS, as the effect is allele-specific (as opposed to from-whom-did-you-inherit-these-alleles-specific). To put it another way: IBS in the genotype-phenotype map is not a proxy for some underlying quantity, it is what you directly care about.

This particular realization makes be very happy this was just a qualifier and not some actual paper I was planning to submit. For the study of phenotypes, coancestry ideas are effectively rendered obsolete by the ability to get IBS over the whole genome.

Various notes

Path Counting

In some sense it’s very sad that pedigree-based genetics is dying off, if only because of a cool connection between path-counting and graphical models. Sewall Wright, when he defined coancestry as the correlation between two gametes, was in effect constructing a graphical model.

This is just a DAG (except for the edge labeled m which is just a pre-correlation between founders). If you look very closely at the path-counting recurrences

$\Phi_{ab} = \frac{1}{2}(\Phi_{fb} + \Phi_{mb})$

$\Phi_{aa} = \frac{1}{2}(1 + \Phi_{fm})$

you see that these are identical to the recurrences for the covariance structure given by a DAG:

$\Sigma_{ij} = \sum_{k \in \mathrm{pa}(i)} w_{ki}\Sigma_{kj}$

which, given the graph $W$ of weights, corresponds to

$\Sigma = I + W + W^2 + \dots = (I - W)^{-1}$

(see Lior Pachter’s wonderful number deconvolution for an explanation). There are other recurrences for generalized kinship coefficients (Cheng & Ozsoyoglu, 2014) but none that I could find correspond to any sequence of elementary matrix operations; even over fields other than $\mathbb{R}$. It would be nice to spend some more time exploring this connection between gametic correlation, path counting, and related fields (such as routing theory), but seeing as how the correlation between gametes is not so relevant when you can sequence their progeny, these explorations may have little application in the genotype-phenotype world.

Fast reference-free phasing by compressive sensing

As mentioned, the way to go from bi-allelic loci to multi-allelic loci is to phase the SNVs. For this reason I wrote my own code on top of the Beagle source code. However, phasing generally relies on a reference, and is by far the slowest part; so it would make-sense to pre-phase and subsequently calculate the Jacquard coefficients. Reference-free phasing is even harder, as the underlying haplotypes need to be inferred; this is generally done by using a greedy algorithm and restricting segment size (in terms of number of markers). Modern strategies basically derive from $\texttt{HAPLOTYPER}$ (Niu 2002) which takes this approach.

However it’s possible to phase a window of bi-allelic loci quickly and without a reference using compressive sensing; in particular nonnegative matrix factorization. The idea is, given the genotype matrix G we can write it as

$\Sigma A = G$

where $\Sigma$ is the $m \times 2^m$ matrix of haplotypes, and $A$ the $2^m \times 2n$ haplotype assignments. There are some cool things even here: if you consider the least squares estimator:

$A = (\Sigma^T\Sigma)^{-1}\Sigma^TG$

the matrix $\Sigma^T\Sigma$ is particularly cool because the ith column of $\Sigma$ is the binary representation of i, and so neither matrix need be materialized. But $\Sigma^TG$ is an impossibility. The compressive sensing approach is to define some cutoff $K$ with $K \ll 2^m$ and then solve:

which can be solved with SGD. Alternatively $\min(a, 1-a)$ can be replaced with simply $a(1-a)$ which also induces “bidirectional” shrinkage. This would then be a smooth objective, and so BLGFS would work.

I didn’t pursue this direction much: I implemented a version of this using LAML, but it was a standard NMF with one penalty on all parameters, no domain constraints, and no simplex constraint. However the (very, extremely preliminary) results were promising even with all of these drawbacks. Had Jacquard estimation proved more interesting than it did, I would certainly keep driving down this direction; but it seems the applications for fast haplotype inference and biallelic phasing are limited. Perhaps I’ll dust it off if I wind up taking an official course in compressive sensing.