### Watterson estimator calculation (theta or $\theta_w$) under infinite-sites assumption

Continuing on from the previous post, I'm still working on making sense of population genetics metrics. Today's is the Watterson estimator, frequently depicted as $\theta_w$. Wikipedia defines this one as

$\hat{\theta}_w = \frac{K}{a_n}$

Where $K$ is the number of segregating sites (e.g., a SNP) and $a_n$ is the $(n-1)$th harmonic number.

To check my calculations this time around, I'm using PopGenome (whose calculations of $\pi$ also agree with the vcftools calculations from the previous post).

So, at a single SNP and a sample of 52 diploid individuals (104 chromosomes), the calculation should be

$\hat{\theta}_w = \frac{1}{a_{104}}$

To get the harmonic number, you can use a calculator or just implement it. Note that the number is calculated for $(n - 1)$. A quick example of calculating the harmonic number that I wrote in R:

For 104 chromosomes, the harmonic number used for the denominator is 5.216791. So for three different windows where the first window contains 0 SNPs, the second window contains 1 SNP, and the third window contains 2 SNPs, PopGenome reports the following values for $\theta_w$

And we can calculate those same values from $\frac{1}{5.216791} = 0.1916887$ and $\frac{2}{5.216791} = 0.3833774$.

Dividing that by the size of the window (i.e. to get a per bp estimate) seems to be what folks do when calculating sliding window averages of $\theta_w$. Wikipedia indicates that $\theta_w$ is a measure of the "population mutation rate" which seems appropriate enough; when calculated on the basis of sliding windows, this appears to represent the density of genetic variants in a region scaled slightly for the number of individuals in the sample. This seems to indeed be corroborated by a response from the vcftools developer to a user looking for $\theta_w$, indicating that using the --SNPdensity option from vcftools provides all of the necessary information for calculating the Watterson estimator.

As a final note, this appears to be the calculation under the infinite-sites assumption. The finite-sites calculation is evidently different as indicated in the LDhat documentation, though I'm not sure what that difference is yet. It looks like McVean et al. (2002) has additional information regarding that difference.

$\hat{\theta}_w = \frac{K}{a_n}$

Where $K$ is the number of segregating sites (e.g., a SNP) and $a_n$ is the $(n-1)$th harmonic number.

To check my calculations this time around, I'm using PopGenome (whose calculations of $\pi$ also agree with the vcftools calculations from the previous post).

So, at a single SNP and a sample of 52 diploid individuals (104 chromosomes), the calculation should be

$\hat{\theta}_w = \frac{1}{a_{104}}$

To get the harmonic number, you can use a calculator or just implement it. Note that the number is calculated for $(n - 1)$. A quick example of calculating the harmonic number that I wrote in R:

harmonicNumber = 0 numChromosomes = 104 for (i in 1:(numChromosomes - 1)) { harmonicNumber = harmonicNumber + 1.0/i } print(harmonicNumber)

For 104 chromosomes, the harmonic number used for the denominator is 5.216791. So for three different windows where the first window contains 0 SNPs, the second window contains 1 SNP, and the third window contains 2 SNPs, PopGenome reports the following values for $\theta_w$

> slide@theta_Watterson pop 1 [1,] NA [2,] 0.1916887 [3,] 0.3833774

And we can calculate those same values from $\frac{1}{5.216791} = 0.1916887$ and $\frac{2}{5.216791} = 0.3833774$.

Dividing that by the size of the window (i.e. to get a per bp estimate) seems to be what folks do when calculating sliding window averages of $\theta_w$. Wikipedia indicates that $\theta_w$ is a measure of the "population mutation rate" which seems appropriate enough; when calculated on the basis of sliding windows, this appears to represent the density of genetic variants in a region scaled slightly for the number of individuals in the sample. This seems to indeed be corroborated by a response from the vcftools developer to a user looking for $\theta_w$, indicating that using the --SNPdensity option from vcftools provides all of the necessary information for calculating the Watterson estimator.

As a final note, this appears to be the calculation under the infinite-sites assumption. The finite-sites calculation is evidently different as indicated in the LDhat documentation, though I'm not sure what that difference is yet. It looks like McVean et al. (2002) has additional information regarding that difference.

## Comments

## Post a Comment