# heavytailed

## October 7, 2012

### B2B Part 2: On Stratification

Filed under: Uncategorized — heavytailed @ 7:35 pm

Part 2: Structure and Stratification

This may be obvious to some, but for most arguments about power (especially for rare variation), a very strong, and incorrect, assumption is made about sampling: that it is done randomly. It’s always the first thing that comes to mind when you think about the possible sources of error in a study, but rarely does it spring to mind in discussions of power in genetic studies. I’ve seen too many papers assert that sampling more people will inevitably lead to more power (for a certain class of variants, typically rare), when it’s clear that if you were to sample in the way that the study suggests, your power would not change at all.

Here’s the quintessential example. You’re studying the genetic causes of Disease X, and you have 10 participating academic hospitals collecting samples. You realize you’ve got some geographic confounding, so you’re very careful to collect paired case-control samples from each hospital, so that there is no sampling imbalance that could be geographically correlated. 5 of the centers get their data in, and 5 are delayed. You go ahead and analyze the data you do have, and find some tantalizing evidence. Of the 10,000 samples you find a bunch of sites which look like this:

$\begin{array}{ccc} & \mathrm{Case} & \mathrm{Control} \\ \mathrm{A} & 12 & 3 \\ \mathrm{B} & 4988 & 4998\end{array}$

So you’ve got a variant that’s potentially a 4-fold increase in risk, but it’s rare. It’s nearly significant, too; the statistic is about 5.408 (so p~0.02). So you write up a short paper titled “Tantilizing Evidence for Rare Genetic Architecture of Disease X,” in which you say that, since only half of your centers finished collection, there are another 10,000 samples coming, and that will really give you power to demonstrate that these variants are strongly associated with Disease X. Well, Charlie Murphy has something to say about that…

So everyone’s excited to see what comes out of this study. The other 5 centers get in their matched samples. And lo and behold…none of your statistics have changed! Instead you have a lot of sites that look like this:

$\begin{array}{ccc} & \mathrm{Case} & \mathrm{Control} \\ \mathrm{A} & 12 & 3 \\ \mathrm{B} & 9988 & 9998\end{array}$

So what happened? You sequenced 5000 more samples! You expected to see A at least 8 more times! Well…in your calculation you pretended to sample randomly from the same population. In your sampling, you didn’t. Here’s what happened:

By definition your sampling (and indeed, nearly all sampling) is geographically stratified. In cities. With research hospitals. That have funding. That have enough funding to do sample collection, preparation, and storage. That’s a pretty tight constraint.

Similarly, rare variants, by virtue of being (mostly) recent variants, will be very locally distributed. These variants arose a few generations ago, in towns and villages around the big city center. The chances that those few members of the family who did travel a long distance just happened to travel to one of the 9 other centers that are doing your sampling is basically zero. So: you sample 10,000 more people from everywhere but where the variant is. You don’t see it again.

This, my friend, is structure. And you just got burned. Your second paper is the same as the first. It doesn’t get published. You don’t ever leave your postdoc. Your PI is forever giving you tasks to implement boring imputation methods. Then you quit genetics and start a successful career as a professional League of Legends competitor.

So with that motivating example, let’s talk about population structure, and contrast it with its ever-present cousin, stratification. Population structure is, by definition, the deviation (or lack thereof) of a natural population from strictly random mating. A population whose genome is only under the effects of mutation and drift has no structure. Selection imposes a structure. Assortive mating imposes structure. Local mating (e.g. within your own geography) imposes structure. Migration imposes structure. Population history, like bottlenecks, imposes structure.

All of these things have happened to us over the history of our species. We’re so structured that just by looking at our mitochondria, we can identify different waves of African migration into europe: the Slave Trade, the Arab conquering of the Iberian peninsula, the Roman conquest of Egypt, and some prior events dating back to as early as 11,000 years ago. Heck, if you get your 23andMe, you can calculate the percentage of your genome that is Neandertal. This tremendous amount of structure is the focus of human population geneticists, who among other things, are interested in the impact that human demographic history has had on the structure of the human genome (and also the reverse: what the structure of the human genome can tell us about our history).

To be a little bit more technical: population structure is the coalescence tree: on what haplotype background mutations arose, their drift or selection to current allele frequencies, their recombination with other haplotypes, and so forth. Structure is just a property of a population, and under the right design, it doesn’t impact a phenotype/genotype association study either way.

On the other hand, if there’s a correlation between geography and biased sampling (or with any source, really, of non-genetic causes of the disease), then that sample will show stratification, by virtue of our geographically-related population structure. I think this is a really under-appreciated point: in genetic association we correct by population structure because they are correlated with what we really think is confounding: geography. The local environment. Cultural and social norms (high-carb vs high-protein diet) which are geographically correlated. And of course the most obvious (and my favorite): differential case/control sampling from different geographic cohorts.

Gil McVean and Iain Mathieson wrote a paper on this subject back in March: Differential Confounding of rare and common variants in spatially structured populations. To rephrase the paper result in language common to this post: different types of geographic confounding (e.g. locally biased sampling versus a dispersed environmental effect) stratify at different levels of population structure. And by that I really mean at different depths of the coalescence tree. In particular, large, dispersed geographic confounds tend to confound common variation (towards the root); conversely concentrated, local confounds fall disproportionately on the rare variants (towards the leaves).

The take-home figure is fig. 1:

The insert on the upper left is the (geographic) distribution of non-genetic risk. On the left: a dispersed confound, while on the right is a highly-concentrated geographic non-genetic risk. Because the risk density on the left has such a particular gaussian shape, it’s easier to think of it as an environmental effect, such as distance to a source, rather than stratified sampling. On the other hand, the density on the RHS looks very much like “I have sampled lots of cases from this area.”

Here, the effects of the risk/geography correlation on genetic association tests are profiled in the form of quantile-quantile plots. The dispersed risk has a massive effect on the common variants, those that by the very nature of genetic drift and local random mating and migration are likely to correlate with geography on a large scale: these are exactly the variants that are likely to show a gradient towards or away from the center of the source. They’re old enough to have had time to migrate to the whole area of the grid, and drift in frequency. By contrast, the effect of a dispersed non-genetic risk geography on rare variant association is far more muted: rare variants tend to be new, and localized. In this case, they function almost as indicator functions: am I very close to the source, or not.

For that very reason, rare variants are wildly inflated by the locally-concentrated risk geography. On the other hand, since the risk is so geographically localized, it only strongly correlates with variants with very locally-distributed allele frequencies. For common variants, even those whose highest allele frequency is directly on the center of the geographic non-genetic risk, because everywhere else there’s no real association between frequency increase and risk increase, the test statistics are not shifted tremendously from the axis.

Flipping this issue on its head, this makes a statement about the relationship between population structure and geography (granted, a statement that has been known since at least Crow and Kimura): because variation (to first order, absent long-range migration) follows both local diffusion and drift, the frequencies of common variants tell us about global (literally in this case) geography. Marching down the frequency spectrum pushes us to more and more recent population histories, and therefore more local geographies.

One slight bit of care is needed, and unfortunately I have to get a little bit ahead of myself to articulate it. Though rare variants contain a lot information regarding local geography, the nature of that information is conditional. That is, it is conditioned on the structure of the population in the region; thus, computing principal components on rare variants across all samples is not going to correct for local geographies, because in a sense one is averaging over all local histories. There is a correct way to do this, see below, but it’s important to note at this point that it is not expected for rare-variant-based PCA to just work in correcting geographically-localized confounds.

So after establishing that nongenetic risk profiles on different geographic scales impact different slices of the allele frequency spectrum, we have to consider methods of correcting for this effect.

The obviously best mode of correction is to directly include the geography: hands down. The problem really is how to. If you know how the nongenetic risk is geographically distributed, it’s easy. For instance, above you could calculate the distance to the center of the risk gaussian, and it’s solved. In the second case, just an indicator function. Unfortunately, absent that distribution, it’s not entirely clear how best to parametrize. For instance, as McVean and Mathieson mention, the appropriate covariate, given (x,y) coordinates, is a nonlinear function. You could use a number of kernels and try to learn the appropriate function, but in blindly checking the ability to predict risk, you will likely conflate geographic risk with underlying genetic risk.

But what if geography is not available? Since structure covaries with geography, it follows that if there is population [risk] stratification imposed by a geographical risk confound, correcting for population structure should correlate with geography in such a way to correct for this stratification. The standard method of correcting for stratification is to calculate the principal components (i.e. the right-eigenvectors of the [Sample] x [Variant] matrix), and include the projection of the sample genotypes onto those components.

McVean and Mathieson show that this works really well for a dispersed gaussian risk, but that it has very little effect for highly local geographic risk profiles. This is to be expected. In the former case, the geographic variance of the risk profile (e.g. the area) is of the same scale as the variance of the sampling. Thus, the principal components of the full sample, which aim to maximize the genetic variance explained, will covary with a large geographic scale. In other words, the PCA worked because the scales lined up. As you march down the principal components, you’re explaining a smaller and smaller proportion of the variance, and so the geographic scale is decreasing. However, it suggests that geographic risk profiles with geographic variance of a large fraction (say 20%+) of the geographic extent of sampling are well corrected-for by principal components.

(A rather large parenthetical here: the reason PCA works so well in McVean/Mathieson is that the geographic risk profile is serendipitously adapted to the population structure. In other words, because the diffusion of local mating and migration generates radial allele frequency gradients, and moreover because the geographic risk profile is also radial, PCA is perfectly set up for success. I doubt it would work nearly as well for, say, a crescent-shaped risk profile. Luckily, examples of such odd geographic non-genetic risk distributions seem particularly contrived.)

By contrast, the geographic extent of the risk in the figure on the right is a much much smaller fraction of the total geographic extent. It should come as no surprise that principal components that explain large fractions of the total extent fail to account for very local risk profiles. But if you go far enough down (20, 50, 100 components) you start to correct for it. The authors also show that other methods of correcting (Genomic Control, kinship correction, and PCA based only on rare variants). Genomic control doesn’t work for the same reason as normal PCA – it’s looking at global variance. Kinship correction doesn’t work because while it’s local, it’s conditionally local. There are plenty of relatives who don’t show any increased risk, but relatives with a specific genetic background (e.g. from a specific region) will show increased risk. And rare variant PCA doesn’t work for reasons previously articulated.

So what’s the right thing to do? McVean and Mathieson suggest that, absent fine-grained information regarding the geographic origin of samples, perhaps nonlinear combinations of components, or novel approaches of ancestry. In truth, the answer is much simpler, you just have to remember how physical stratification is dealt with in the agricultural world: through multilevel models.

Here’s the archetypal example: a farmer is interested in the effect on total crop yield of the density of planting, and the method of crop thinning. That is, the number of seeds you put in each hole, and the method by which you prune back parts of the plant to achieve fruit/seed growth rather than stem growth.

So say you have 6 blocks, each block has 3 farm plots, and of those 3 farm plots, you choose 2 and subdivide them. Across the plots you randomize the seed density, and across the subdivisions you randomize the method of thinning. In this case, the method of thinning is a mean correction from the baseline of seed density: an interaction term. Similarly, you need to correct for the geography (which is related to e.g. sunlight, soil aeration, nitrogen content, etc). So your model looks like:

$\mathrm{yield} \sim (1 + \mathrm{density} + \mathrm{thinning} + \mathrm{density}\times\mathrm{thinning} ) + [\mathrm{block} + \mathrm{block}\times\mathrm{main} + \mathrm{block}\times\mathrm{main}\times\mathrm{split}]$

Where the parentheses are the fixed effects, and the brackets the random effects. So the “conditional local” effects of the subplots are interaction effects, that is, unique to the main plots they belong to. And similarly the main plots to the blocks.

Now consider an analogous situation in a genetic study: I’m interested in investigating the genetic architecture of Disease X. I’ve just read a paper about tantalizing evidence that there’s a contribution of rare variants, but apparently the investigator gave up and decided to play League of Legends for a living. I think maybe a more diverse sample could really unravel the architecture (by the way, there’s as yet no obvious evidence suggesting that multiple-population analysis is a better bet than single-population analysis for association studies, so this is not an endorsement, just an example), but I have to be careful. So I talk to collaborators at six centers and samples kind of like this:

$\begin{array}{c|c|c} \mathrm{UK} & \mathrm{India} & \mathrm{Japan} \\ \hline \mathrm{London},\mathrm{Cardiff} & \mathrm{New Delhi},\mathrm{Calcutta} & \mathrm{Sapporo},\mathrm{Tokyo} \end{array}$

But darn if they didn’t all have different case/control proportions! Not only that, but my centers partnered with several nearby hospitals, and there’s limited information on which samples came from where. So: I’m going to have to take advantage of the correlation between population structure and geography to do the hard work for me. First, calculating my global components should cluster my data into bulk geographic regions (east asia, central asia, and europe). By including the values of my samples when projected onto the top, say, 5 principal components, I can approximately correct for non-genetic risk stratifying across these regions.

As mentioned, these PCs say very little about the local history of each region, and even looking at rare variants in bulk will not help to measure these local histories. By contrast, if I re-calculate principal components within each cohort, the PCs that explain the most of this (much reduced) variance will almost certainly point in different directions, ones that stratify between the centers I’m sampling. So my final model (for a variant association test) will be:

$y_i \sim x_i + \sum_{j=1}^5 G_{ji} + \sum_{j=1}^5\sum_{c=1}^3L_{cji}\delta_{c,C(i)}$

where $G$ is the global principal components matrix, and $L_c$ is the local principal components matrix for cohort $c$. That is, the principal components calculated just on Indian samples, just on UK samples, or just on Japanese samples (for $c=1,2,3$ respectively). Finally $C(i)$ is the cohort of sample $i$, so the delta function sets to 0 all the projections of sample $i$ onto principal components calculated on cohorts other than its own. The final term in the model can be recognized as an interaction term, making this a multilevel model. By taking a multi-level view of principal components, we can correct both for global and local stratification. We can keep going, for instance, and see if we can’t use genetic structure to identify geographic clades even within cities. What’s really needed is a stopping condition.

The truth is, however, this process will only work up to a point. Real data does not behave this way, and an explicit example is informative of just where and how a multi-level structural correction for locally-peaked geographic risk breaks down. I’m going to use some data from the 1000 Genomes Project as illustration here (this is all on chr20 of the phase-1 release), and just go through the process described above.

In order to capture differences related to global geography, well, I already know that PCA is going to work pretty well. So I run it on my samples and I find this:

So I’ve got 3 geographically distinct clusters. Checking these against known origin of the samples reveals that these clusters correspond to continent of origin: Africa, Asia, and Europe (which does include ethnically-european Americans). For non-genetic risk which is geographically varying with these clusters, inclusion of the top few (say 5) PCs will give enough degrees of freedom to correct for such geographic risk stratification. But for risks that stratify geographically within each region, these PCs aren’t going to work. Here’s what it looks like using the “global” PCs for the Asian populations (which further stratify to Chinese and Japanese). It looks just like a bivariate Gaussian:

The whole idea of multilevel modeling is that “zooming in” on a cluster will reveal more of the underlying structure. Focusing on European samples, for instance, and re-calculating the principal components (removing outliers as necessary) shows much more structure, varying geographically:

There are two obvious clusters revealed by this re-calculation that we could separate. Using the reported country of origin from the 1000 Genomes Project reveals the clusters as obviously corresponding to “Finland” and “Other European”. So inclusion of these PCs would enable correction for Finland-vs-Other non-genetic risk stratification. None of this is particularly novel or surprising, the whole point is just to emphasize that the inclusion of multilevel PCs can correct for multiple levels of geographically-correlated risk. But, we need to go deeper and remove the Finnish, to see if a third level is viable. Doing so reveals slightly unclear, but still present clusters:

Here we see that the kinship matrix calculated on rare (<10%) variants only (!) can separate these European cohorts. As one would expect, the more common variants, which do explain more of the total variance, do not stratify the two populations as well as looking only at rare variation. One potential observation is that this kind of multilevel starts to break down with smaller numbers of samples, where the geographically-correlated structure may cease to become apparent. To put it another way, with small numbers of samples, mutation could dominate drift for explanation of the variance within the sample.

(This is only ~100 samples, and only on chr20, so probably the inclusion of additional samples and the whole genome will pull these clusters more distinctly apart. For this plot I removed the CEU, which had some cryptically-related individuals that were distorting the PCs. One thing to note is that looking at rare variants alone, the PCs start to get dominated by (presumably) cryptic relations; PC2 above for instance, was like this.)

Regardless, as an illustration of our example, this suggests that multilevel correction will indeed work. But note, I chose my example in a very specific way: the cities involved are all fairly geographically distinct. If this is your situation, multilevel correction for rare variants association will probably work well!

Unfortunately for most situations, the correction is going to break down. By virtue of the recent structure of the human population in those places from which studies are likely to draw samples, it is difficult to cluster locale of origin through principal components. You have to take a step back a minute and think about the effects of urbanization and globalization on the genetic structure of humans.

For one, there is a process of aggregation/migration. People in the areas of a city start aggregating in search of work. This means that alleles from a wide range of the local geography are brought together and start to mix. At the same time, cities become connected through railways, highways, and airports. The point is, this generates migration to cities, and between cities. In other words, M, the migration rate, has massively increased for populations of major urban centers over the past six generations, at the same time as the rate of gene inflow has gone up. To complicate matters, M almost certainly didn’t experience a discontinuity to a high value, likely the dynamics of M have been closer to sigmoidal.

The method of multilevel correction relies on a very specific property of structural dynamics: that the diffusion process is slow enough to accumulate rare (but not so rare as to be de novo) variants in a specific region. This is a very specific kind of deviation from random mating (and the “Wright-Fisher Model” in general). A very large migration rate will really wreck this whole idea, accelerating the diffusion process. McVean and Mathieson show this through simulation: a high migration rate significantly reduces the trend between distance and deviation from “average” allele sharing under a random mating model:

So in my mind the modern human urban population looks like a mix of these scenarios. The related simulation would use a larger grid (so there can be some isolation by distance) and hold M fixed at a low value (say 0.1) for many generations, and then increase it sigmoidally and quickly for a few generations, say 10, to a high value. The number of generations of high mixing would be few enough (I predict) not to disturb very much the correlation between distance and variants of moderate frequency, but by virtue of rare variants having arose over that time period, those variants would look close to a random-mating model. That is, common variants would still look structured, but rare variants would look unstructured. That is, Rare variants in bulk fail to correctly predict geography.

So obviously one place I would expect this to break down is China, which has got massively accelerated urbanization and migration, plus events like the Down to the Countryside Movement which caused crazy mixing. Following the multilevel approach we recalculate the PCs on just the ASN kinship matrix, we find the Japanese (in green) samples stratifying out. The blue and black are two cities from which the Chinese samples are collected: Beijing and Shanghai. On this axis there’s a slight shift in the mean, but they’re not separable.

So, following the procedure from before: remove the Japanese samples, and re-calculate just on the Chinese samples. The contention is that not even the rare variants on chr20 will separate these populations because of mixing. Looking just at variants under a frequency of 10% shows clusters just about on top of each other. Instead, the PCs are being dominated by individual samples

There’s obviously some cryptic relatedness going on. Cross-referencing the outliers on PC2 and PC1 with the kinship matrix, the top pair of different individuals is (HG00542,HG00537). However, all the outliers fall in the top 15 when you sort the samples by inbreeding coefficient (NA18602 is #2, NA18561 is #15, HG00625 is #7, HG00537 is #3), so it’s possible that past consanguinity may be driving some of the PCs, or maybe these samples are all cryptically related. Whatever it is, it’s not local geography. And this holds if you don’t restrict only to low-frequency variants as well:

So it’s clear that in this example, while multilevel PCs will separate continents and countries, it can’t separate Beijing from Shanghai. One possible explanation for this is the incredibly high effective migration rate in China, particularly between it’s largest urban and cultural centers. However, in all honesty, there just aren’t enough data to be conclusive. There are only 151 samples here, I wasn’t completely aggressive about removing outliers, and this is only chromosome 20. At the very least I hope this is provocative, albeit cursory and inconclusive, evidence about the role recent high migration has in reducing the ability to use genetic stratification as a proxy for geography on the most local of levels. For less mobile regions, such effects should be diminished.

Finally, what does this say about the ability to use rare variants to correct for geographically localized pockets of non-genetic risk? Basically, if your migration rate is high, rare variants can’t separate local clusters from one another. However, there’s some good news hidden here, too. If your migration rate is high, then local increases in non-genetic risk are less likely to confound associations of rare variants, simply because those variants are more dispersed due to high migration. Thus, there’s less of an effect to correct for. However, that’s not to say you shouldn’t correct for geography, and since you can’t do this with population structure, you have to record that data directly during collection.

Summary

Okay. The goal of this post was to discuss genetic structure as it relates to geographic confounding through the concept of population stratification. There are a couple of items I think are important in summary:

• Rare variants are local. The chances of finding a rare variant outside of where you first found it are incredibly low. Replication or follow up strategies really should take this into account.
• By virtue of being local, rare variants can be monstrously confounded by geographically local distributions of non-genetic risk. However if the migration rate isn’t too high, a multilevel PCA could correct for such effects.
• In places where we tend to collect samples (large cities), migration rate is really high, suggesting that care should be taken in correcting for geographic confounds.
• If you’re designing a study which assays or discovers rare variants, make local geographical information one of the metrics you collect for each sample.

I didn’t really talk at all about burden tests here. The conclusions will pretty much extend wholesale, as typically a target for a burden test (i.e. a gene) is miniscule compared to the rest of the genome. Rather than having a variant coming from a particular locale, you have a small set of rare variants coming from a small set of geographically distinct locales, and and a locally-concentrated increase in non-genetic risk for any one of them will inflate your burden test. You would correct for this in the same way as described above, as, remember, lumping all samples together and calculating kinship on rare variants does not produce PCs correlated with local geography, but some kind of weird average over all the histories.

Tools and Data:

1000 Genomes Project; GATK (sample selection, bed conversion), Plink (LD pruning, subsetting), GCTA (GRM and PCA), McVean and Mathieson, Nature Genetics