Last year, I wrote about the effect that oversampling (in a case-control study) has on the power of a study (using the Score test). I recently updated that post with a bit more information regarding the difference between evaluating the test statistic under various choices of the input space, and calculating power. In this post, we move from the analysis of a single study, to the analysis of multiple studies: *meta*-analysis. This has particular applications to multiethnic genetic studies, as one typical approach to testing for genetic association is to test for association within each population, and meta-analyze genetic mutations across the populations in which they appear. By virtue of genetic drift, selection, and isolation-by-distance, many variants will be unique to a single population, and many variants will show large frequency differences between them. We will find that this structure significantly impacts the power of standard meta-analysis approaches, identify the cause, show why alternate methods work, and propose a novel method that performs slightly better in this particular context.

**Meta-Analysis**

The need for meta-analysis typically arises when several studies have examined the same phenomenon, and (more importantly) asked the same question of data sampled in potentially slightly different ways. “The same” here means, statistically, having tested the same null hypothesis against the same alternate hypothesis.[1] While *p*-values can typically be interpreted as probabilities, they are in fact random variables, and one can get into extreme danger by treating them as fixed quantities. Instead, one can treat $p_1, \dots, p_N$, the *p*-values for the $N$ studies, as a draw from a multivariate distribution. Along the same lines, one might treat the test statistics and their variances as draws from multivariate distributions. One can then ask for the probability of these draws under a suitably defined null distribution (in particular, $E[z_i] = 0$ and $\mathrm{Var}(z_i) = v_i$).
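To make the “*p*-values are random variables” point concrete, here is a quick stdlib-only simulation (my own illustration, not from the post): under the null, a two-sided *p*-value computed from a standard normal test statistic is itself a Uniform(0, 1) random variable.

```python
import math
import random

random.seed(1)

def norm_sf(z):
    """Survival function of the standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Draw null test statistics and convert each to a two-sided p-value.
ps = [2 * norm_sf(abs(random.gauss(0, 1))) for _ in range(10000)]

# Under the null the p-values are Uniform(0,1): mean near 0.5,
# and roughly 10% of them fall below 0.10.
mean_p = sum(ps) / len(ps)
frac_below_10pct = sum(p < 0.10 for p in ps) / len(ps)
```

Any meta-analytic combiner is just a function of draws like these, which is why its null distribution can be studied by simulation.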

**Methods for Meta-Analysis in Genetics**

Let’s say you have $N$ studies, producing *p*-values $p_1, \dots, p_N$ and z-values $z_1, \dots, z_N$ with variances $v_1, \dots, v_N$. Within each study, the variant of interest has frequency $f_i$, which can change between study populations. Then:

- **Fisher’s Method:** $X = -2 \sum_i \log p_i \sim \chi^2_{2N}$
- **Sum of Uniform:** $S = \sum_i p_i$, compared against the distribution of a sum of $N$ uniforms (Irwin–Hall)
- **Inverse Variance Weighting:** $Z = \dfrac{\sum_i z_i / \sqrt{v_i}}{\sqrt{\sum_i 1/v_i}}$
- **Sample Size Weighting:** $Z = \dfrac{\sum_i \sqrt{n_i}\, z_i}{\sqrt{\sum_i n_i}}$
- **Inverse Covariance:** $X = \mathbf{z}^\top \Sigma^{-1} \mathbf{z}$, with $\Sigma$ the covariance of the statistics
- **Random Effects:** model the between-study mean and variance as $z_i \sim N(\mu, v_i + \tau^2)$, estimate $\hat\mu$ and $\hat\tau^2$, and formulate a likelihood ratio $X = -2\left[\ell(0, 0) - \ell(\hat\mu, \hat\tau^2)\right]$

**Comment:** One needs to be careful when applying Fisher, Sum of Uniform, or Inverse Covariance, as these can be two-tailed meta-analyses. In particular, for Fisher and Sum of Uniform, if all the p-values were generated by two-tailed tests, then the meta-statistics are also (akin to) two-tailed; that is, there’s no check that the statistics from the individual studies are directionally consistent. The Inverse Covariance approach, being an inner product, also has this property: a large negative statistic will not “cancel” a large positive one. Nevertheless, even if you expect directional consistency, finding two large statistics of opposite signs does suggest something’s going on, if only artifactual.

Ideally the meta-statistics will be 1) Calibrated or conservative under the null, and 2) Sensitive to the alternate hypothesis. Not much more to it than that.
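For reference, here is a minimal stdlib-only sketch of a few of the combiners above (names are mine). It assumes each $z_i$ is already normalized to unit variance under the null, and uses the closed-form chi-squared survival function for even degrees of freedom to avoid external dependencies.

```python
import math

def chi2_sf_even(x, df):
    """P(X > x) for chi-squared with even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!"""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def fisher_meta(ps):
    """Fisher's method: X = -2 * sum(log p_i) ~ chi-squared with 2N df; returns meta p."""
    x = -2.0 * sum(math.log(p) for p in ps)
    return chi2_sf_even(x, 2 * len(ps))

def ivw_meta(zs, vs):
    """Inverse variance weighting: sum(z_i / sqrt(v_i)) / sqrt(sum(1 / v_i))."""
    num = sum(z / math.sqrt(v) for z, v in zip(zs, vs))
    return num / math.sqrt(sum(1.0 / v for v in vs))

def sample_size_meta(zs, ns):
    """Sample-size weighting: weight each z_i by sqrt(n_i)."""
    num = sum(math.sqrt(n) * z for z, n in zip(zs, ns))
    return num / math.sqrt(sum(ns))
```

Each of the $Z$-valued combiners is normalized so that, given correctly specified inputs, the meta-statistic is standard normal under the null.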

[1] Likelihood ratio tests are a common way to violate the second part of this condition: it’s easy to include different explanatory variables in each study, which means the models generating the LR may differ between the two studies.

**Evaluating the Power of Meta-analyses**

Following the approach in the previous post, we can calculate score statistics within two studies (or two populations) where the variant of interest potentially has different frequencies. These statistics can then be combined into a meta-statistic and meta-p-value, and this process can be repeated to find the power of the meta-analysis at a given false-positive level α. Doing this with Inverse Variance Weighting reveals a startling phenomenon:

That’s right. Seeing the variant *more times* (in another population) *kills your power* if the variant is of a lower frequency in the second population.

*What?!*

Okay, so what is going on here? Let’s take a careful look at the realized distributions (since we’re generating simulated genotypes in order to estimate power *anyway*, this data is just sitting in system memory). First we plot the test statistics (under the alternate hypothesis) within each of the populations, low-frequency in red, high-frequency in blue. Adjacent, we plot in green the meta-statistic that results from 1) multiplying each statistic in the first plot by $\sqrt{v_i}$ (this recovers the un-normalized test statistic), 2) weighting the results by $1/v_i$, and 3) normalizing by the square root of $\sum_i 1/v_i$. (In other words, the IVW meta-analysis statistic.)

The vertical dotted line is the Z-value associated with our alpha level. By averaging the two distributions on the left in this way, we will (obviously) get a distribution “between” the two: one that has more power than the red distribution, but less than the blue. Thus, weighting like this kills your power in multiethnic studies, or for meta-analyses where the variant frequency differs between the studies you’re analyzing.
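The three-step recipe can be written down directly; a small sketch (names mine) showing that it collapses to the usual one-line IVW formula:

```python
import math

def ivw_three_step(zs, vs):
    """IVW built step by step from normalized score statistics."""
    # 1) un-normalize each statistic: s_i = z_i * sqrt(v_i)
    s = [z * math.sqrt(v) for z, v in zip(zs, vs)]
    # 2) weight by the inverse variance: s_i / v_i
    weighted = [si / v for si, v in zip(s, vs)]
    # 3) normalize by the root of the sum of inverse variances
    return sum(weighted) / math.sqrt(sum(1.0 / v for v in vs))

def ivw_direct(zs, vs):
    """Equivalent one-liner: sum(z_i / sqrt(v_i)) / sqrt(sum(1 / v_i))."""
    num = sum(z / math.sqrt(v) for z, v in zip(zs, vs))
    return num / math.sqrt(sum(1.0 / v for v in vs))
```

Either form makes the “averaging” visible: the meta-statistic is a convex-style combination of the per-study statistics, so its distribution sits between the per-population ones.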

This motivates a search for potentially “better” meta-analyses, ones that might outperform inverse-variance weighting. Ideally, these methods should perform about as well as or better than the best single-population analysis (min(P)), while remaining calibrated under the null hypothesis; given the above, inverse-variance weighting is clearly going to fall short of that baseline. We might first identify well-calibrated statistics by simulating under the null. Let there be $N$ populations (or studies). For each one ($i$), draw a $V_i$ from a chi-squared distribution, and then draw a $Z_i$ from $N(0, V_i)$. We consider the cases of 5 and 10 populations, under a high- and a low-variance setting. The distributions of variances under the “high” and “low” heterogeneity settings look like:

And the distributions of the meta-statistics under these scenarios look like (for 5 populations):

And for 10 populations:

Inverse-variance weighting seems undercalibrated compared to all the other statistics. Fisher’s method is overly aggressive when the variances are allowed to change drastically between populations; the sum-of-uniform has this property as well, but becomes better calibrated as the number of populations increases. Sample-size weighting (assuming 2000 samples, with no relation to the variance) is uniformly over-aggressive. By contrast, the random-effects model is only slightly conservative (it was made for heterogeneity, after all), and the Inverse Covariance method is well-calibrated throughout.
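That null simulation is easy to reproduce; a stdlib-only sketch (parameters and names mine, with 1-df chi-squared variances), using the true variances as the IVW weights:

```python
import math
import random

random.seed(0)

def simulate_ivw_null(n_pop=5, n_sims=20000, df=1):
    """Per simulation: draw V_i ~ chi-squared(df) and un-normalized Z_i ~ N(0, V_i)
    for each population, then form the IVW meta-statistic."""
    metas = []
    for _ in range(n_sims):
        vs = [sum(random.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(n_pop)]
        zs = [random.gauss(0, math.sqrt(v)) for v in vs]
        meta = sum(z / v for z, v in zip(zs, vs)) / math.sqrt(sum(1.0 / v for v in vs))
        metas.append(meta)
    return metas

metas = simulate_ivw_null()
# Empirical two-sided tail fraction at the nominal 5% level.
tail = sum(abs(m) > 1.959964 for m in metas) / len(metas)
```

Swapping in the other combiners over the same `(zs, vs)` draws reproduces the calibration comparison in the figures.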

*However*: when running a meta-analysis, you aren’t provided with the actual variances of the distributions from which the test statistics were drawn; you’re instead provided with an *estimate*. In addition, this estimate will covary with the test statistic itself. In the case of the score test, as the (observed) frequency increases, both the computed value of the statistic and its variance increase. Performing a simulated meta-analysis with OR=1 (so no association) generates the following null calibrations:

Here, the “high variance” environment (left) has frequencies of (20%, 0.5%) in the two populations, while the “low variance” environment (right) has frequencies of (5%, 0.5%). All of a sudden, the meta-analysis statistics (except for random effects) all become reasonably calibrated under the null hypothesis; strikingly, even the sample-size-weighted meta-analysis! The black lines at the top and bottom are “min(P)” and “max(P)” respectively – falling outside these bounds suggests one may be overly conservative or overly aggressive, at least under the null. Of course, power is all about the alternate, so what happens if we raise the odds ratio?

On the left we’ve set OR=1.3, and OR=2 on the right. I’ve plotted “min(P)” – the minimum p-value in the two populations – on the x-axis, as a kind of baseline target for the sensitivity of the meta-analysis. So what do these plots tell us? First, we find confirmation that inverse variance weighting is just not as sensitive to the alternate hypothesis as other approaches, and that this overconservativeness is compounded when the variance is high. Second, we find that sample-size weighting performs far better, and is comparable to the random-effects meta-analysis, though it underperforms at variants of large frequency difference. Finally, we find that both the Inverse Covariance and Fisher’s method perform comparably to max(P), while remaining calibrated under the null hypothesis.

Take-home messages: don’t inverse-variance weight in this circumstance. Even sample-size weighting is preferable, though random effects performs just as well. However, it’s even more advisable to be “old school” and use Fisher’s method or Inverse Covariance. The last observation is that frequency information provides a *lot* of gain. The red curve (“Fixed (SF)”) is sample-size weighting that takes frequency into account, with weights $w_i \propto \sqrt{n_i f_i (1 - f_i)}$. Simply providing that information (granted, it’s the *actual* frequency, not the *observed* frequency) to the weighting drops the *p*-values by orders of magnitude, drastically increasing power under the alternate hypothesis.
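The post doesn’t spell out the “Fixed (SF)” weights, but a natural guess – and the one sketched below, which should be treated as an assumption – is to weight each study by the square root of its information, $\sqrt{n_i f_i (1 - f_i)}$ (sample size times the binomial variance of the allele):

```python
import math

def sf_weighted_meta(zs, ns, fs):
    """Sample-frequency weighting (assumed form): w_i = sqrt(n_i * f_i * (1 - f_i)).
    With equal frequencies across studies, this reduces to plain
    sample-size weighting of the z-values."""
    ws = [math.sqrt(n * f * (1.0 - f)) for n, f in zip(ns, fs)]
    return sum(w * z for w, z in zip(ws, zs)) / math.sqrt(sum(w * w for w in ws))
```

Under this form, a 0.5%-frequency study gets only about a sixth of the weight of an equally sized 20%-frequency study, which is exactly the down-weighting of the low-information population that plain sample-size weighting fails to do.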

**Note: Inverse Inverse Variance Weighting**

Most meta-analyses use the *Wald* statistic to calculate the final meta-statistic, whereas above we have used strictly the *Score* statistic. The Score statistic is such that, for a single-parameter logistic regression, the statistic can be calculated in closed form. By contrast, the Wald statistic requires that both the intercept and the variable coefficients be computed under the alternate hypothesis, and this fit must be done numerically. The two also relate in the following way: we know from the derivation in my prior post that

$\mathrm{Var}(\hat{\beta}) \approx I(\hat{\beta})^{-1}$ (Wald)

$\mathrm{Var}(U(0)) = I(0)$ (Score)

In other words, the variances of the Score and Wald tests are inverses of each other. This suggests that Inverse *Inverse* Variance Weighting may perform appropriately for the Score test as a meta-analytic statistic.

$Z_{\mathrm{IIVW}} = \dfrac{\sum_i v_i^{3/2}\, z_i}{\sqrt{\sum_i v_i^{3}}}$ (the IVW recipe with weights $v_i$ in place of $1/v_i$)

Replacing IVW with IIVW does indeed have the desired effect, and it performs comparably to Fisher and Inverse Covariance.
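A generic weighted-score combiner makes the swap explicit. The sketch below is my own (names mine, and the exact IIVW form – apply weights $v_i$ where IVW applies $1/v_i$ – is my reading of the post); any choice of weights is normalized so the meta-statistic has unit variance under the null:

```python
import math

def weighted_score_meta(zs, vs, weights):
    """Combine score z-values with arbitrary weights w_i applied to the
    un-normalized statistics s_i = z_i * sqrt(v_i); since Var(s_i) = v_i,
    dividing by sqrt(sum w_i^2 v_i) gives unit variance under the null."""
    s = [z * math.sqrt(v) for z, v in zip(zs, vs)]
    num = sum(w * si for w, si in zip(weights, s))
    den = math.sqrt(sum(w * w * v for w, v in zip(weights, vs)))
    return num / den

def ivw(zs, vs):
    """Standard inverse variance weighting: w_i = 1 / v_i."""
    return weighted_score_meta(zs, vs, [1.0 / v for v in vs])

def iivw(zs, vs):
    """'Inverse inverse variance' weighting: w_i = v_i (assumed form)."""
    return weighted_score_meta(zs, vs, list(vs))
```

When all variances are equal, the two weightings are proportional and yield identical meta-statistics; they only diverge – in the regime-dependent way shown in the figures – once the variances differ across studies.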

I suppose this is a whole lot of hubbub over the fact that the Score and Wald tests have inverse variances. Except that one can find regimes where IVW is terrible but IIVW works alright, *and* the reverse. For example, while IIVW works well in the above case (frequencies: 20%, 0.5%; 2000 samples; balanced case/control), it works miserably in the case below (frequencies: 5%, 3%, 1%, 0.5%; case/control samples: 1500/500, 750/750, 600/200, 1000/1000):

Here, IVW is the blue, while the orange-red is IIVW. Given the inverse relationship between Wald and Score, this implies that there will be regimes for the Wald statistic where inverse-variance weighting performs poorly as well. One thing to note is that Sample-Frequency weighting as well as Inverse Covariance (and Fisher’s method) perform well throughout.
