# heavytailed

## August 18, 2015

### Estimating fraternity coefficients – Part All The Rest

Filed under: omics — Tags: , , , , — heavytailed @ 10:58 pm

It’s been an exciting summer! Building infrastructure at a startup here in SoCal whilst keeping up with my Ph.D. qualifier and research has taken most of my time away from blogging. In my last post, I promised an update on fraternity coefficient estimation “in the coming weeks.” Weeks having come and gone, this will be a brief summary of the qualifier. It’s precisely as you might expect.

The Problem

If there’s any inbreeding in the population (note: the idea of “having populations” implies inbreeding in a broad sense — if mating were purely random there could be no population differentiation) then fraternity coefficients cannot be estimated from bi-allelic markers, as the equations are underdetermined.

This goes back to the condensed coefficients of identity (the Jacquard coefficients [pdf]). The issue is that, if there is inbreeding, fraternity is both D1 and D7 – and all of a sudden you need to estimate D1 in an unbiased way — which means estimating all the Jacquard coefficients. With bi-allelic sites, the genotype observation probabilities, conditioned on each Jacquard state, are given by:

The shape of the matrix gives the entire problem: seven genotype states for nine coefficients. So to estimate the 9 coefficients we need at least three alleles, which has a conditional observation of:

What about four alleles? The only way you could see four alleles in two individuals is (wx, yz) which has probability pqrs. Combining this with the above states, ordering the alleles and reordering/combining rows, gives a final form for this matrix as

And of course there’s an obvious way to make multi-allelic markers from bi-allelic markers: phase them.

There are two estimators associated with this, the first, a maximum likelihood estimator, looks like

$\mathcal{L}(D) = \prod_{k=1}^m \sum_D \mathbb{P}[G^{(k)}|D]\mathbb{P}[D]$

$= \prod_{k=1}^m I[G_k]^TC^{(k)}\Delta$

where C is the matrix above, with p, q, r, s given by the observed alleles at the locus. G is the genotype state for the pair, so $I[G_k]$ is just an indicator variable that selects out the appropriate row, i.e. (aa, bc) would pull out row 3, so $I[(aa, bc)]$ is just $\mathbf{e}_4$.

For a single pair of samples, denote the concatenation of all such rows (over each multi-allelic locus) as the $N \times 9$ matrix Q. The update rule for ML (using logistic boundary functions) is

$\frac{d\mathcal{L}}{d\theta_i} = [1/(Q\Delta) \cdot Q\mathbf{e}_i - 1/(Q\Delta) \cdot Q\mathbf{e}_9]\Delta_i(1-\Delta_i)$

where $\Delta_i = 1/(1 + e^{-\theta_i})$. The term $1/(Q\Delta)\cdot Q$ should be interpreted as follows: $Q\Delta$ is a $N \times 1$ vector, while $Q$ is a $N \times 9$ matrix, therefore $1/(Q\Delta) \cdot Q$ is akin to $\mathrm{diag}(1/Q\Delta) Q$.

Method of Moments estimation can also be used here by taking expectations. In particular, given two individuals at a locus k, letting Ck be the above matrix adapted to k by plugging in the appropriate allele frequencies, and letting G be the indicator variable for the observed genotype state of two individuals, we have that

$\mathbb{E}[G_k^{(ij)}] = \mathbb{E}[C_k\Delta] = \mathbb{E}[C_k]\Delta = C_k\Delta$

as $C_k$ is a matrix of (conditional) Bernoulli random variables. By concatenating the $C_k$ over all loci k, the coefficients could be solved for by least squares:

$\hat \Delta^{(ij)} = (\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{G}^{(ij)}$

The major problem here is that $\mathbf{C}$ gets very large (1.6GB for 25,000 loci with 8 haplotypes). My solution here was to adapt the model via

$\mathbf{\Xi C \Delta}^{(ij)} = \mathbf{\Xi G}^{(ij)}$

and choosing $\mathbf{\Xi}$ in such a way that $\mathbf{\Xi C}$ could be calculated on the fly without ever materializing $\mathbf{\Xi}$ or $\mathbf{C}$.

Both the likelihood and moment estimators can be constrained through the use of Lagrange multipliers (in the latter case the closed-form least squares would have to be replaced by a constrained gradient descent). In particular, given a good estimate of coancestry, one can constrain with

$\Delta_1 + \frac{1}{2}(\Delta_3 + \Delta_5 + \Delta_7) + \frac{1}{4}\Delta_8 = \Phi_{ij}$

and the no-inbreeding assumption would be given by

$\Delta_1 + \Delta_2 + \frac{1}{2}\Delta_3 + \Delta_4 = 0$

$\Delta_1 + \Delta_2 + \frac{1}{2}\Delta_5 + \Delta_6 = 0$

these are just the definition of the coefficients of coancestry and inbreeding. One is free to drop the fractions on the inbreeding constraints, since, combined with $\Delta_i \geq 0$ these two constraints set $(\Delta_1, \dots, \Delta_6) = 0$.

These approaches work fairly well on simulated pedigrees. Here it is on simulated sib and half-sib pairings:

One of the initial claims was that in populations with high degrees of consanguinity, estimating faternity coefficients from bi-allelic sites (and assuming 0 inbreeding coefficients) will result in a biased estimate. Here are populations simulated with 10%, 20%, and 30% rates of consanguinity; the constrained estimator shows decreasing estimates of fraternity (which is incoherent), while the unconstrained estimator correctly increases with consanguinity rate.

Coancestry coefficients also show the same incoherent trend between constrained and unconstrained estimation.

How much do the inclusions of the constraints impact estimates on real data? This next plot shows the coefficient of fraternity as estimated on the 1000G PJL and STU populations, constrained against unconstrained.

It’s heartening to see a good identity trend; in addition to the large number of points which would be set to 0 by constrained estimation of Jacquard coefficients, but have a positive fraternity coefficient when the estimation is not constrained by a 0-inbreeding assumption.

Nevertheless the real-world effect of this is somewhat slight. Looking at the estimates of fraternity across 1000G populations, we see that there’s not an obvious shift in the populations where first-cousin marriages are comparatively common (PJL/STU) [note that the blue outliers here are due to early terminations of constrained Boyden-Fletcher-Goldfarb-Shanno and are eliminated from statistical tests. The points here are per-pair per-chromosome.

But despite these appearances, the PJL/STU populations do have a significantly different distribution from the others. Conditional on fraternity > 0; the PJL and STU population distributions dominate (e.g. occur before/are higher than) the others (Mann-Whitney U: p < 1e-16 vs GBR, FIN — MWU GBR vs Fin: p = 0.22)

So what does all this mean?

The line-item here is: if you’re estimating the coefficient of fraternity, you probably should allow inbreeding, unless you’re sure that the population is mating randomly. If you know there’s a good degree of consanguinity, you should probably allow for nonzero inbreeding coefficients if you’re estimating kinship.

However, if you’re calculating shared alleles for the purpose of a longitudinal model (as in Zhu et al. 2015), you don’t have to worry, really, about anything. In principle, IBD sharing is not even what you should use in this case. If you’ve directly observed genotypes at all (or the vast, vast majority) causal loci (e.g. sequencing, high-density genotype array), you actually care about realized IBS and not IBD at all.

Deconvolving (ab,ac) into its potential sharing states is really just smoothing. Imagine you have typed m sites in the genome for two individuals, but this set does not tag a particular allele (x) which may have additive or dominant effects. Given the genotypes at the m other sites, and the genotype of one sample at x, is it possible to predict the genotype of the other sample? The answer is: take the m sites and calculate the Jacquard coefficients, and, given the genotype for one of the samples, look up the appropriate rows in the C matrix above, and you have a conditional predictor of x on the other sample. In this way, Jacquard estimation is a kind of smoothing; the average IBD sharing.

For the purpose of phenotype prediction, average IBD sharing is only useful in that it predicts average IBS sharing. That is, IBD is expected IBS; however given IBS, calculating IBD does not provide any new information about any loci that went into the IBD calculation. For dominant effects, you care that all four, or two pairs of alleles are shared IBS, as the effect is allele-specific (as opposed to from-whom-did-you-inherit-these-alleles-specific). To put it another way: IBS in the genotype-phenotype map is not a proxy for some underlying quantity, it is what you directly care about.

This particular realization makes be very happy this was just a qualifier and not some actual paper I was planning to submit. For the study of phenotypes, coancestry ideas are effectively rendered obsolete by the ability to get IBS over the whole genome.

Various notes

Path Counting

In some sense it’s very sad that pedigree-based genetics is dying off, if only because of a cool connection between path-counting and graphical models. Sewall Wright, when he defined coancestry as the correlation between two gametes, was in effect constructing a graphical model.

This is just a DAG (except for the edge labeled m which is just a pre-correlation between founders). If you look very closely at the path-counting recurrences

$\Phi_{ab} = \frac{1}{2}(\Phi_{fb} + \Phi_{mb})$

$\Phi_{aa} = \frac{1}{2}(1 + \Phi_{fm})$

you see that these are identical to the recurrences for the covariance structure given by a DAG:

$\Sigma_{ij} = \sum_{k \in \mathrm{pa}(i)} w_{ki}\Sigma_{kj}$

which, given the graph $W$ of weights, corresponds to

$\Sigma = I + W + W^2 + \dots = (I - W)^{-1}$

(see Lior Pachter’s wonderful number deconvolution for an explanation). There are other recurrences for generalized kinship coefficients (Cheng & Ozsoyoglu, 2014) but none that I could find correspond to any sequence of elementary matrix operations; even over fields other than $\mathbb{R}$. It would be nice to spend some more time exploring this connection between gametic correlation, path counting, and related fields (such as routing theory), but seeing as how the correlation between gametes is not so relevant when you can sequence their progeny, these explorations may have little application in the genotype-phenotype world.

Fast reference-free phasing by compressive sensing

As mentioned, the way to go from bi-allelic loci to multi-allelic loci is to phase the SNVs. For this reason I wrote my own code on top of the Beagle source code. However, phasing generally relies on a reference, and is by far the slowest part; so it would make-sense to pre-phase and subsequently calculate the Jacquard coefficients. Reference-free phasing is even harder, as the underlying haplotypes need to be inferred; this is generally done by using a greedy algorithm and restricting segment size (in terms of number of markers). Modern strategies basically derive from $\texttt{HAPLOTYPER}$ (Niu 2002) which takes this approach.

However it’s possible to phase a window of bi-allelic loci quickly and without a reference using compressive sensing; in particular nonnegative matrix factorization. The idea is, given the genotype matrix G we can write it as

$\Sigma A = G$

where $\Sigma$ is the $m \times 2^m$ matrix of haplotypes, and $A$ the $2^m \times 2n$ haplotype assignments. There are some cool things even here: if you consider the least squares estimator:

$A = (\Sigma^T\Sigma)^{-1}\Sigma^TG$

the matrix $\Sigma^T\Sigma$ is particularly cool because the ith column of $\Sigma$ is the binary representation of i, and so neither matrix need be materialized. But $\Sigma^TG$ is an impossibility. The compressive sensing approach is to define some cutoff $K$ with $K \ll 2^m$ and then solve:

which can be solved with SGD. Alternatively $\min(a, 1-a)$ can be replaced with simply $a(1-a)$ which also induces “bidirectional” shrinkage. This would then be a smooth objective, and so BLGFS would work.

I didn’t pursue this direction much: I implemented a version of this using LAML, but it was a standard NMF with one penalty on all parameters, no domain constraints, and no simplex constraint. However the (very, extremely preliminary) results were promising even with all of these drawbacks. Had Jacquard estimation proved more interesting than it did, I would certainly keep driving down this direction; but it seems the applications for fast haplotype inference and biallelic phasing are limited. Perhaps I’ll dust it off if I wind up taking an official course in compressive sensing.

## April 11, 2015

### Differential *omics in theory and in practice

This is a comment for a required course. The initial paper of interest is Dudoit’s paper Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Though the paper itself is over a decade old, it provides a necessary introduction to oligonucleotide array technology and some of the errors associated with it. Due to the awesomeness of inkjet printing, custom oligo arrays can be built for any kind of DNA reporter, but they can further be used for combinatorial mutagenesis. In the intervening years, RNA-seq has become the tool of choice for investigating gene expression data; and while the principles outlined in Dudoit 2002 have remained the same, the theory and practice of differential *omics are much evolved. Note that differential expression is but one of many A/B testing experiments that follow the same pattern; others include differential methylation (bisulfite sequencing), differential histone modification (ChIP-seq), differential chromatin state (ATAC-seq), and so on and so forth. Quite honestly, old-fashioned GWAS falls into this category too. While each specific application of A/B testing has its own particular concerns, these concerns fall into categories with near-perfect overlap.

Part 1: Units

Chips yield intensity values, sequencing yields read counts. That’s about it, right? Well, think about DNA-based studies (SNP/Indel/CNV) for a moment. These studies don’t “operate” on the level of intensity (chip) or read counts (seq), they operate on the level of the genotype. Hom-ref, hom-var, or het. The unit we want to test here is the allele. There’s an entire layer of inference between the measured units (intensity or read counts) and the allele: things like Birdseed or zCall will infer allelic state from raw oligo intensities; while programs like the GATK Haplotype Caller or Freebayes will perform this inference from aligned reads.

But genotypes turn out to be the only place you can really do this. In theory, methylation should follow the same process, but since typically cell populations (not single cells) are sequenced, there are very many CpG sites which are intermediately-methylated; this makes it more similar to pooled DNA sequencing experiments than it is to standard exome/genome-seq. The unit becomes the methylation frequency; and similar maximum-likelihood approaches such as MLML can be used to estimate these. For methylation arrays (such as the Illumina Infinium HM450 chip), this estimation is not performed; instead the raw log(R/G) values are used after suitable data normalization (see Part 2).This is a general trend for chip data.

What’s the ideal unit for mRNA? I think it should be the concentration of each mRNA in solution. That is, the ideal unit is mol/L. Note that this is an absolute measure: as stoichiometry is in general nonlinear, there are regimes where the same relative concentration ([transcript mRNA]/[all mRNA]) may result in vastly different translational dynamics, particularly in the presence of translational regulators such as FMRP. The estimation of absolute concentration from RNA-seq does not appear to have been attempted; and typically library preparation itself should alter absolute concentrations in a nontrivial way. So maybe we should forget about absolute concentration, and focus instead on relative mRNA concentration. One way to estimate this might be via

$\frac{[\mathrm{tx}]}{[\mathrm{all}]} \approx \frac{\mathrm{reads\;in\;gene}}{\mathrm{gene\;length} \times \mathrm{total\;reads}} = \frac{r_g}{\ell_g \times R} = \hat{[g]}_R$

This approximation makes the assumption that $R \propto \mathrm{total\; transcripts}$, and is a well-known quantity. Indeed, the quantity above is

$\hat{[g]}_R = 10^9\times \mathrm{RPKM}_g$

where RPKM is the well known Reads per Kilobase per Million Reads. However, Wagner, Kin, and Lynch point out that there are better estimates for the total abundance of transcripts, and introduce a related measure, TPM, in place of RPKM.

This all goes out the window with array data. Instead of counts, one has two-channel intensities for competitively-bound cDNA, and these intensities are some (hopefully monotonic) function of post-library-construction absolute concentration. While you’d like to have a measure of relative concentration (relative to the total mRNA concentration), you’re stuck with relative intensity (relative to whatever reference mRNA sample you used) per gene. While there is literature in fitting Langmuir or stoichiometric models to titration experiments, which enables a direct estimate of absolute mRNA concentration from probe intensity, these models cannot be directly applied to expression microarrays. First, the models in these papers were fit to different arrays; second, in order to train the parameters, tens of thousands of titration experiments would need to be performed to fit calibration curves for each appropriate (mRNA, probe) pair; and, even should these be fit to a single chip, synthesis efficiency during manufacturing is poor enough that the resulting estimates would be very noisy.

But even stuck with relative intensities, there’s still a question about the correct axes. Dudoit 2002 makes a great deal about the (M, A) = (log R/G, log R*G) versus (logR, logG) axes, even though these coordinate systems are linearly related. From a pure dimensional analysis perspective, log R/G is most appealing, as the quantity R/G is dimensionless, and so the log doesn’t screw up any units.

For splicing, again, ideally we would want isoform-level absolute concentrations. We could settle for isoform-level relative concentrations (TPM as for expression), but breaking down total mRNA abundance for a gene (viewed as the union of all of its isoforms) into abundance for each isoform separately turns out to be a fairly difficult inference problem. Instead, differential splicing (and spliceQTLs) tend to focus on the exon inclusion fraction, $\psi_i$: the proportion of mRNA, for a given gene, which contain the ith exon. There are many, many models for estimating $\psi$. DEXSeq uses a generalized linear model (negative binomial), the mean parameter of which is $\psi$, SpliceTrap takes a Bayesian approach (combinations of normals and betas) to estimate $\psi$, MATS uses a similar approach but with the binomial as opposed to normal distribution, and Xiong et al use a beta-binomial model with a positional bootstrap to account for mapping biases (see SM of that paper).

How about for differential histone modification? The important biological detail here is, for a given locus, what proportion of chromosomes have a histone with the given modification sitting at that locus. For non-histone ChIP-seq, such as for transcription factor binding, the story becomes more complicated, as the protein may directly bind DNA, or indirectly bind (i.e. as a part of a complex where some other TF binds). Here, the unit is the affinity-frequency distribution; that is, the frequency of direct and indirect binding, where “degree of directness” is itself a real-valued parameter to be inferred from the data. ChIPDiff uses a beta-binomial model to perform an initial, local estimate of chromatin modification frequency within groups ($p_A$ and $p_B$). It combines this with an HMM to aggregate information over multiple small bins, to estimate the indicator function

$I[p_A/p_B > \tau]$ and $I[p_B/p_A > \tau]$.

Many other methods (F-Seq, dCaP, and Wu’s Nonparametric) utilize the naive estimator

$\langle \mathrm{bin} \rangle = \frac{r_{\mathrm{bin}}}{\ell_{\mathrm{bin}}\cdot R}$

not as an estimate of $p_i$, but rather as one piece of the final test statistic (for instance $\langle \mathrm{bin} \rangle_A - \langle \mathrm{bin} \rangle_B$).

Part 2: Normalization

Surely, having as best as possible placed your data in the appropriate units, you are ready to proceed to differential expression! Let’s say you’ve done basic high-throughput exome sequencing on two flowcells of HiSeq, genotyped each separately, and have some 50-100 samples’ worth of data. You take your genotypes, and perform PCA. Much to your chagrin, one of the top PCs is the flowcell. You go online (say to seqanswers) and they suggest that you put the raw reads together and jointly genotype all the samples — and maybe add some 1000 Genomes samples for good measure. You do so, repeat the PCA, and find that none of the top PCs correlate with flowcell. You may not think this is normalization, but it really is. In particular you normalized out coverage and artefactual variation that are due to differences in library preparation and Illumina’s manufacturing process. These artifacts, whether due to small variations in reagent concentration, temperature, cycling time, the accuracy of manufacturing, the initial quantity of your sample, the generation number of cell lines, (&c &c &c) distort your measurements. They’re noise. Some effects are subtle, and can be controlled for as covariates (see part 3); others are the dominant source of variance, and somehow they need to be removed before you can even start. For oligonucleotide arrays, the major distortion is due to manufacturing differences; for *seq efforts, it’s total read depth. Removing these particular effects is referred to as normalization.

logR/G-logRG (M-A) plot of zebrafish expression microarray, with loess curves for each print head.
Obtained from here.

Dudoit 2002 references nonparametric regression (Loess) as a means of correcting for printhead-specific biases. This is still the standard approach for analyzing (or re-analyzing) microarray data — and the Loess is performed within printhead and within sample; this enables one to compare across multiple chips. An example is shown above, where the multiple lines show how different print heads lead to different smoothed M-A curves. The goal would be to shift each of these curves to some “reference” curve. One advantage of Loess is that the smoothing model can be semiparametric and include other kinds of error covariates (such as sequence GC%, gene length, and so on). Another approach introduced in Bolstad 2003 is quantile normalization. It should really be called “rank normalization”. Consider having performed four RNA-seq experiments and calculated RPKMs (or equivalent) for these samples. Something typical to see looks like: Here there are slight distortions making the cumulative (rank) distributions not equal. Something very common is zero-inflation: distortion about genes with zero or very few reads.This can be due to inefficient ribosomal RNA depletion — ribosomal RNA so dominates that even a small variance can render RPKM calculations not directly comparable across experiments.

Quantile normalization (“rank normalization”), in it’s most aggressive form, replaces each curve in the above plot by the average curve. Suppose, for instance, that in sample i, gene j is the 5000th gene (when sorted by RPKM). Then the expression value $g_{ij}$ is replaced by the average expression of the 5000th gene across all samples. Note that which gene is gene 5000 will differ between samples. That is, we directly force the cumulative distributions of all samples to match, while constraining the transformation to be rank-preserving within each sample, so that there is no variance within ranks, but there is still variance within genes. While this approach was developed particularly for competitive hybridization (where the R and G channels would be separately transformed, thereby indirectly normalizing the M and A plots), it is also regularly applied to RNA-seq — particularly in data sets with widely varying library complexities or RNA integrity numbers.

This degree of tampering is entirely unphysical; so there are less heavy-handed approaches which, rather than forcing every rank to match exactly, choose a set of “plausibly invariant” measurements (things whose expression should match — e.g. housekeeping genes or spike-ins), and fit a smooth function $f(rank, expr) = expr_{\mathrm{new}}$ which forces (as much as possible) the expression of the invariant genes to match. A very nice explication of these approaches can be found here.

Quantile normalization has been extended to regress away technical covariates as well, resulting in conditional quantile normalization, which regresses out covariates using cubic splines, and then applies quantile normalization to the residuals to attempt to match a target distribution; this normalization is directly applied to highly-expressed genes, and extended (via an approximation) to genes with lower expression. In practice, full quantile normalization does not appear to add additional power to detect differentially-expressed genes over simply dividing the entire expression distribution by a single summary such as the 1st quartile (Bullard, 2010). Nevertheless, for some work I have done in the past, we felt the warping of gene expression distributions between samples was mainly due to experimental artifacts, and so full rank normalized (with CQN) to a carefully-constructed and replicated reference distribution (similar to the use of Brain atlases in MRIs).

The ERCC proposed a set of RNA spike-in standards to aid in normalization (the idea is the absolute concentration of these unique RNA sequences is known up-front, and can be used to calibrate across experiments). However, the resulting reads are still too unstable to be sufficient for normalization purposes – I remember hearing frustration about this from a number of labs.

Empirically, DESeq’s approach (each gene is separately normalized by the geometric mean) appears to perform the best, but not much better than other approaches. Notably, this paper does not include CQN in the comparison. The modern standards are dChip and IRON.

The current standard in microarray expression analysis presents an odd paradox: while for a given experiment, the Loess or full quantile normalization methods are used; normalizing across experiments (for instance, analyzing multiple published microarray datasets) utilizes an array of novel methods not typically applied within a single experiment. This seems a bit strange, as cross-batch normalization should reduce to within-batch normalization if the number of batches is 1 (and let’s completely ignore the issue of what constitutes a “batch”). Lazar et al. make a distinction between three sources of error: expression heterogeneity, batch effects, and other sources of error. The origination of batch effects, they claim, is the simple fact that only a fixed set (typically 96) of samples can be processed together. In fact, the “batch” unit is just a proxy for a number of other sources of variation: library prep, experimental conditions at time of processing, and so forth. In general, within a batch there is not necessarily a good proxy for these sources of intensity (or expression) variation, but across multiple batches they can be controlled, by using the batch as a proxy. While nonparametrics like the above (LOESS or CQN against a reference distribution) still apply in this case, they are typically not used. Instead, parametric batch normalization typically recenter and rescale RPKM (or logR/G) values to have the same mean and variance across batches. While gene i and gene j may have different means, gene i in batch 1 is forced to have the same mean and variance as gene i in batch 2 (and so on and so forth). These normalizations can be performed in the linear or logarithmic scale. This basic approach can be extended to linear models which allow the incorporation of other covariates, but the idea is the same: estimate per-batch means and variances, and adjust the data so that these parameters are homogenous across batches for each gene. Other methods are unsupervised or latent in that they assume the primary sources of variation are nonbiological (we will see this again in statistical testing). Under this assumption, latent variables can be extracted from the expression matrix via PCA; and these are assumed to be technical sources (indeed, batch usually correlates highly with one of the top principal components). All of these methods are reviewed in Lazar et al. linked above.

The practical approach here is pretty straightforward: use the QQ-plot as a readout of how well experimental or batch differences have been controlled. A QQ-plot that is for the most part well-calibrated (we expect differential expression for a small fraction of genes only) indicates that normalization has worked effectively. Thus there’s a straightforward practical procedure: start with basic mean-variance normalization between batches, run your statistical test, and check calibration. If it’s not calibrated, pick a normalization method more or less at random and apply it; recalculate your statistics and check if the QQ-plot looks good. Lather. Rinse. Repeat.

I hope that doesn’t shatter any ideas about Very Serious Statisticians poring over a dataset and considering which model most applies to the situation, choosing appropriately based on visualization of the data and theoretical justifications, and nodding approvingly at the results. That’s not how it works for machine learning, and it’s really not how it should work for the practice of statistics. When you’ve got lots of null hypotheses, it’s really hard to argue with well-calibrated test statistics, just as it’s hard to argue with a high CVAUC. Theoretical considerations happen at the margins, but the main thing is: if you see a crappy QQ plot, you went off the rails somewhere, so it’s time to try something else.

Effectively the same approaches apply to ChIP-seq. Bailey, et al. quickly review several standard normalization procedures: depth normalization, linear normalization, reference-based normalization, LOESS, and quantile. See the paper for specific references. This can be extended to spike-ins or control samples, which is the approach taken by NCIS. At the same time, ChIP-seq (and Hi-C and clip-seq) provide additional challenges which are not entirely addressed by the above methods. These technologies have nonspecific backgrounds which are heavily dependent on DNA sequence and small variations in antibody concentration or binding efficiency during the experiment. While backgrounds are present in RNA-seq, it is generally a small enough proportion that it can be ignored. By contrast, since ChIP methods tend to rely on peak-calling (identifying bound regions by virtue of normalized/smoothed read depth exceeding a threshold), the variation of the background signal needs to be accounted for. Consider for instance two technical replicates with the same number of aligned reads; the replicate with a higher background will, because of the constraint of having the same total reads, have smaller peaks on average. A normalization approach to deal with this (as opposed to direct statistical modeling, see below) is to combine background subtraction with one of the above methods. That is, all read depths are adjusted by subtracting the background estimate (either a constant, or the result of a fitted model that maps genomic features to expected reads due to background). The resulting counts can then be normalized to a reference distribution following any of the methods described above.

More sophisticated methods do not treat the background during data normalization, but instead model it directly during statistical testing. This is the approach taken by dCaP and ChIPComp. It’s worth asking: in all these cases, seeing as the transformations are generally rank-preserving within sample, why perform normalization at all? Statistics could be calculated on the ranks within samples as opposed to on the direct expression/intensity/frequency estimates. The reason for this is that using ranks results in a p-value without a biologically-related effect estimate, and estimates of fold increase/decrease are important not only for interpretation, but for understanding results in context. Normalization considerations are more about retaining effect size than they are about finding some way of getting a calibrated p value.

Part 3: Testing for differences (or: the triumph of the LMM)

Dudoit 2002 uses a simple T statistic for testing differential expression. How have we progressed in the past decade? We’re still using (pretty much) a T statistic, although in an asymptotic guise from the LMM (or GLMM). Where we have gotten much better are in providing proxies for error covariates, and in modeling the distributions of read counts or genotyping errors. I have written about specific cases before. The general conclusion is that likelihood-based regressions result in powerful tests with well-controlled false-discovery rates. The use of a linear mixed model can even obviate the need for location-scale normalization: by including mean and variance components, batch effects can be adjusted during model fitting, without need for a separate normalization step.

Differential binding from ChIP/Clip/Hi-C is the new kid on the block. Even so, there are lots of models for statistical prediction of differentially enriched regions. These have all been developed for ChIP, but almost immediately extend to other link-digest-sequence experiments. What’s surprising is the variety of models in this area. DIME uses a mixture model after Loess normalization. MAnorm fits a linear model and tests the residuals, jMOSAiCS uses a graphical model with nodes of the form

$Z \sim \mathrm{NegBin}(a, a \mathrm{exp}[-(\beta_0 + \beta_1 X^c)])$

PePr normalizes ChIP peaks within sample and performs an asymptotic (Wald) test after fitting a negative binomial distribution, after extensive preprocessing and normalization. dCaP uses a LMM with a normal approximation, while dPCA compares global patterns between two conditions by decomposing the difference between multiple ChIP signals. ChIPComp uses a hierarchical model of a Poisson sampling distribution atop a linear model, and demonstrate a well-calibrated null distribution.

By contrast, differential expression, splicing, methylation, variant association and QTL studies, are almost all linear mixed models of various flavors. Wockner, et al. use LIMMA for differential methylation analysis. However, even though methylation occurs at specific CpG loci, differential methylation is typically observed in regions; and many methods have been developed to associate not just individual loci, but to test entire regions for differential methylation, either by aggregation or direct testing. Robinson, et al. list 14, of which at least 8 heavily involve linear models. For splicing, ARH-seq uses an entropy-based method, while rMATS is a GLM (logit) model. In practice, the only way to appropriately include covariates is through some kind of linear model. For instance, a very standard expression model is:

$\mathrm{log} \mathrm{TPM}_k = d_k + \mathrm{RIN}_k + \mathrm{case}_k + \sum_{i=1}^p s_{ik} + \sum_{i=1}^r b_{ik} + \sum_{i=1}^\ell c_{ik} + \epsilon_k$

$\vec \epsilon \sim N(0, \sigma_g^2\mathrm{G} + \sigma_e^2I)$

Where d is the depth, RIN is the RNA integrity, case is case status (it may also be the tissue type indicator), the s are individual genotypes one may want to control for, the b are “latent factors” (typically the top r principal components of the expression matrix, assumed to be nonbiological factors), c are clinical covariates (such as age, gender, BMI, etc), and G is the genetic relationship matrix, ideally calculated excluding the cis-region of the gene of interest, and excluding variants related to case/control status. The statistical test here is typically one of: a Wald test on the case coefficient; a likelihood ratio test (case included vs case excluded); or a score test (fit everything else, test the covariance between the case indicator and the residuals — care should be taken when random effects are included in this setting).

For other methods (take ARH-seq, for instance), in order to adjust for covariates, the above model would need to be fit (without the case variable) and the effects of covariates removed. NOISeq is the above model, equipped with a fixed coefficient on d, without any other covariates, and using the empirical distribution to estimate $\sigma_e^2$ (or even replacing the normal with the empirical noise distribution). Soneson et al find that, empirically, LIMMA (a direct implementation of the above model) is quite robust, performing well in most situations; similarly Seyednasrollah et al find that for small sample sizes, LIMMA identifies the most number of genes at low FDR (high precision – see fig 2). The best part about the linear model here is: if you test the coefficients on s instead of “case”, you’ve just performed an eQTL analysis; and the models for eQTLs (ICE-EMMA, PANAMA, svaseq, Matrix eQTL, PEER, and HEFT) are all linear models, their only differences being implementation detail, and how certain covariates or latent factors are calculated.

This review is far from comprehensive, but I think it captures the practice of differential *omics as it is today. As we move forward, there’s going to be additional work particularly in crosslink-and-seq (ChIP/clip/Hi-C) and methylation statistics to better incorporate technical covariates into the detection of differentially modified or differentially bound regions. And, of course, there’s the big open question: how do we put it all together? (But that’s a topic for another post).

Update Aug 18 2015

The Pachter Lab has recently released announced Sleuth for testing differential expression. Professor Pachter makes an extremely good point about the necessity of estimating transcript-level abundances, and that these estimates induce technical variation due to read ambiguity. The solution is another LMM (well, what did you expect?), with boostrap-derived estimates of technical variance included in the variance component. I will be checking Kallisto/Sleuth (Доверяй, но проверяй), particularly on microRNA and mini-exons, and switch to using it in place of Bowtie+CL+Limma-VOOM. I would speculate that the Kallisto/Sleuth pipeline probably functions well (with some paramater tuning) on other feature quantifications (ChIP/Clip/Hi-C).