Monday, November 14, 2016

Watterson estimator calculation (theta or $\theta_w$) under infinite-sites assumption

Continuing on from the previous post, I'm still working on making sense of population genetics metrics. Today's is the Watterson estimator, frequently depicted as $\theta_w$. Wikipedia defines this one as

$\hat{\theta}_w = \frac{K}{a_n}$

Where $K$ is the number of segregating sites (e.g., a SNP) and $a_n$ is the $(n-1)$th harmonic number.

To check my calculations this time around, I'm using PopGenome (whose calculations of $\pi$ also agree with the vcftools calculations from the previous post).

So, at a single SNP and a sample of 52 diploid individuals (104 chromosomes), the calculation should be

$\hat{\theta}_w = \frac{1}{a_{104}}$

To get the harmonic number, you can use a calculator or just implement it. Note that the number is calculated for $(n - 1)$. A quick example of calculating the harmonic number that I wrote in R:

harmonicNumber = 0
numChromosomes = 104
for (i in 1:(numChromosomes - 1)) {
  harmonicNumber = harmonicNumber + 1.0/i

For 104 chromosomes, the harmonic number used for the denominator is 5.216791. So for three different windows where the first window contains 0 SNPs, the second window contains 1 SNP, and the third window contains 2 SNPs, PopGenome reports the following values for $\theta_w$

> slide@theta_Watterson
         pop 1
[1,]        NA
[2,] 0.1916887
[3,] 0.3833774

And we can calculate those same values from $\frac{1}{5.216791} = 0.1916887$ and $\frac{2}{5.216791} = 0.3833774$.

Dividing that by the size of the window (i.e. to get a per bp estimate) seems to be what folks do when calculating sliding window averages of $\theta_w$. Wikipedia indicates that $\theta_w$ is a measure of the "population mutation rate" which seems appropriate enough; when calculated on the basis of sliding windows, this appears to represent the density of genetic variants in a region scaled slightly for the number of individuals in the sample. This seems to indeed be corroborated by a response from the vcftools developer to a user looking for $\theta_w$, indicating that using the --SNPdensity option from vcftools provides all of the necessary information for calculating the Watterson estimator.

As a final note, this appears to be the calculation under the infinite-sites assumption. The finite-sites calculation is evidently different as indicated in the LDhat documentation, though I'm not sure what that difference is yet. It looks like McVean et al. (2002) has additional information regarding that difference.

Monday, November 7, 2016

Nucleotide diversity (pi or $\pi$) calculation (per site and per window)

Part of a project I'm working on is looking at how genetic variation accumulates in genomes. Population genetics has developed a number of metrics for evaluating diversity, and while I see them used quite often in papers, it's not always clear to me what they're actually measuring.

Today I had a look at a measurement of nucleotide diversity called pi ($\pi$). Trying to find a good definition of it, I repeatedly came across the same definition provided by Wikipedia: "the average number of nucleotide differences per site between any two DNA sequences chosen randomly from the sample population".

$\pi = \sum\limits_{ij}^{} x_i x_j \pi_{ij} = 2 \times \sum\limits_{i=2}^{n} \sum\limits_{j=1}^{i-1} x_i x_j \pi_{ij}$

where $x_i$ and $x_j$ are the respective frequencies of the $i$th and $j$th sequences, $\pi_{ij}$ is the number of nucleotide differences per nucleotide site between the $i$th and $j$th sequences, and $n$ is the number of sequences in the sample.

I sort of conceptually understand that in the context of looking at specific genes across a population, but I didn't really grok the actual values reported for $\pi$ values in a genome-wide scan. So, to figure out what those plots are actually showing, I had a look at the calculations made by vcftools. Specifically, I was looking for how the results of --site-pi and --window-pi are calculated.

For --site-pi, the function of interest is output_per_site_nucleotide_diversity() in variant_file_output.cpp, and the code of interest is

for(unsigned int allele = 0; allele < N_alleles; allele++)
   int other_alleles_count = (total_alleles - allele_counts[allele]);
   mismatches += (allele_counts[allele] * other_alleles_count);

// pairs is the number of pairwise combinations.
int pairs = (total_alleles * (total_alleles - 1));
double pi = (mismatches/static_cast<double>(pairs));

So, given biallelic loci, this seems to be proportional to allele frequency, and we can calculate $\pi$ for a single position simply from the VCF annotations AC (alternate allele count) and AN (allele number). This seems to use number of possible pairwise allelic combinations (which would be AN * (AN-1)) as the denominator and the number of comparisons that would not be equivalent (i.e. mismatches) when comparing all pairwise combinations (which would be AC * (AN - AC) + (AN - AC) * AC).

So, for this variant:
chromosome_1    186     .       G       T       546.80  PASS    AC=4;AF=0.040;AN=100;DP=750;FS=0.000;GQ_MEAN=39.08;GQ_STDDEV=36.12;InbreedingCoeff=0.6092;MQ=60.00;MQ0=0;NCC=4;QD=27.34;SOR=2.531;VQSLOD=12.99;culprit=FS

$\pi = ( 4 * (100 - 4) + (100 - 4) * 4 ) / (100 * (100 - 1)) = 0.0775758 $

As far as I can tell, the single-site $\pi$ shouldn't go beyond 0.5 for practical applications with meaningfully large population samples (though you'll note that one heterozygous individual will give a $\pi$ of 1).

Okay, so what about the --window-pi option? The function of interest in vcftools is output_windowed_nucleotide_diversity(), and it's basically the same thing as the single site calculation except it considers the monomorphic sites in the window as well:

N_pairs = bins[CHROM][s][N_variant_site_pairs] + (N_monomorphic_sites * N_comparisons);
pi = bins[CHROM][s][N_mismatches] / double(N_pairs);

So, suppose we have the window 201 - 300 and the variants:
chromosome_1    275     .       C       T       7509.57 PASS    AC=20;AF=0.196;AN=102;DP=968;FS=0.000;GQ_MEAN=50.00;GQ_STDDEV=44.84;InbreedingCoeff=0.9368;MQ=60.00;MQ0=0;NCC=3;QD=31.29;SOR=1.456;VQSLOD=12.40;culprit=FS
chromosome_1    291     .       T       A       1247.11 PASS    AC=4;AF=0.040;AN=100;DP=945;FS=0.000;GQ_MEAN=45.96;GQ_STDDEV=32.66;InbreedingCoeff=0.8539;MQ=60.00;MQ0=0;NCC=4;QD=29.00;SOR=1.455;VQSLOD=12.26;culprit=FS

This time we also consider the monomorphic sites in the calculation (though this calculation doesn't seem to gracefully consider an absence of read data), which increases the number of pairs that are considered. This time around, our number of pairwise comparisons is the number of pairwise comparisons for the variant sites (for each variant, sum (AN * (AN - 1))) plus the number of monomorphic sites (2 variants, so 98 sites were non-variant in the 100 bp window) times the number of pairwise chromosome comparisons (in this example, there were 52 samples, so it was 104 * 103). So we have:

$nPairs = ( 102 * (102 - 1) + 100 * (100 - 1) ) + (100 - 2) * (104 * (104 - 1)) = 1069978 $

And the number of mismatches is the same as before except summed for each variant (i.e. AC * (AN - AC) + (AN-AC) * AC):

$nMismatches = (20 * (102 - 20) + (102 -20) * 20) + (4 * (100 - 4) + (100 - 4) * 4) = 4048 $

So $\pi$ for this 100 bp window is:

$\pi = 4048 / 1069978 = 0.0037832554 $

Okay, so as far as I can tell, $\pi$ is proportional to how many variants are present (for the window-based calculation) and increases as minor allele frequency grows (for both). That seems like an appropriate metric to describe how much nucleotide diversity is present at a given locus in a given population.

Saturday, October 1, 2016

Postmortem: Chillennium 2016 48-Hour Game Jam - Tardigrade's Dangerous Day

The Llama and I participated in the Chillennium 2016 Game Jam a couple of weekends ago and made our first video game: Tardigrade's Dangerous Day. The game jam was a 48-hour competition to make a game with the theme "foofaraw" (which means a big fuss about a minor matter). If you're interested in the technical details, read on; otherwise, you can check it out on here and play it on HyperDev here - please note that the audio only seems to be working in Google Chrome.

Since tardigrades had been in the science news recently and tardigrades are all around pretty bonkers, we decided to make the game about a (not to scale) tardigrade.

Our stalwart tardigrade protagonist
We wanted the game to be browser-based so that it would be portable and allow users to quickly try the game. JavaScript or using something like Emscripten to port C++ to JavaScript are the only good options I know of for writing browser-based games. While we aren't that familiar with JavaScript, we chose to write the game in it because we didn't want to deal with the additional complexity of Emscripten.

Our experience with JavaScript is limited, but it does seem to be a powerful language. Projects like node.js let you write both your server-side and client-side code in JavaScript, so one language can be used for the full stack. While we didn't have much (or anything, really) in the way of server-side code, node.js was used server-side to serve up the game page.

We used HyperDev as a collaborative IDE for web application development. It updates your application in real time as you write code (and also uses node.js for serving), and it functions mostly like Google Docs in terms of multiple people being able to concurrently write in a document. HyperDev is still in beta, but it worked well for the most part (though it didn't play nicely with an intermittent internet connection).
With respect to game engines, there are a handful of good JavaScript engines out there, and we chose to use Phaser. It handles asset loading and display, user input, has a basic physics engine, good tutorials, an active community; overall, it worked very well. It doesn't seem to have any dedicated functionality for multi-player games that would run the physics simulation server-side and update the clients, but everything was single-player and client-side in our game so that wasn't a problem. The only issue we ran into was that the audio only seemed to work on Google Chrome browsers; Mozilla Firefox and Internet Explorer didn't seem to want to play the .ogg audio files, but I'm not sure if that's a Phaser issue or something else.

UV rays are no joke, safety crab
The artwork was done in Inkscape and entirely by the Llama. Using SVG assets was great for size-scaling purposes, and I much preferred them to pixel-based sprites both from an artistic and technical perspective. The Llama's artistic ability is top-notch, and the artwork is definitely the best thing about the game.

The music was recorded using Audacity by setting up a microphone in front of a speaker hooked up to my electric guitar. This wasn't ideal, and the sound quality suffered (it definitely did not capture the full awesome of my sick guitar skillzzzz). Buying a cable that connects the guitar straight to the computer for recording would probably have been much better. 

Get ready for 60 seconds of fun and laughter
The game only lasts about 60 seconds, but overall we're happy with how it turned out. We went into the jam with a simple goal since we had never made a game before: make a program that draws assets on the screen and responds to user input. We succeeded in meeting that goal, and we made something that gives us a laugh when we play it.

Now for what worked and what didn't.

What worked
  • Free software and hosting platforms - HyperDev as a web app development platform, Phaser for the game engine, Inkscape for art assets, Audacity for music assets, and for hosting; they were all great. I saw a lot of folks using Unity and Unreal Engine, but I'm quite happy with how our toolchain worked.
  • The Llama's artwork - If we fail to make it as scientists, I'm pretty sure the Llama could make it as a graphic designer or artist.
  • Chillennium, its organizers, and its volunteers - They put together a great game jam with good food and attentive staff, and they stayed on schedule.
  • Not gorging on candy and Red Bull, and getting some sleep - I'm pretty sure there were enough chips, candies, Slim Jims, sodas, and Red Bulls around such that we could have subsisted exclusively on that, and I almost did in my initial excitement at their availability. Fortunately, better judgement won out, and we didn't destroy ourselves. We probably only spent a total of 18 hours working (so about 36 people-hours); we stuck with a regular sleep schedule and ran regular weekend errands like grocery shopping. Maybe a lack of putting in the hours contributed to not winning any awards, but I'm happy that we didn't have to spend any time "recovering" from the jam.

What didn't:
  • My limited understanding of JavaScript - I think this hurt production the most. I spent valuable time trying to figure out the scope of variables in callbacks, and I still don't really understand it. I also didn't really grok how to instantiate member variables in a class, or even define a class; it seemed like you could just write this.memberVariable and poof, now your object has memberVariable. I need to spend some more time learning the basics since JavaScript is definitely not like the C family of languages.
  • Cross-Origin Resource Sharing - I spent a silly amount of time trying to figure out how to deal with a Cross Origin error: "Cross-origin image load denied by Cross-Origin Resource Sharing policy". Since the assets were loaded from HyperDev and I didn't control the server with the assets, I couldn't enable CORS on it to serve up the images. Setting the Phaser loader to anonymous by this.load.crossOrigin = "anonymous"; worked, but I still need to learn more about what CORS is.
  • Audio works only in Google Chrome - Still haven't gotten to the bottom of this one. Other browers show the audio icon as though something should be playing, and I'm using .ogg audio files that should be compatible, but don't seem to get output. 
  • Too much text - We watched a few of our labmates play our game, and it was evident that we did not consider our audience. As casual gamers ourselves, we now realize that we don't often read text in games. However, we lacked the time (or maybe the will) to create more graphical content to tell our story; instead we just wrote it out in text. Our players clicked right through it without reading, and then were confused later. We needed to make the whole experience more intuitive with less text.  

Monday, August 22, 2016

Postmortem: 3D sorghum reconstructions from depth images identify QTL regulating shoot architecture

One of the approaches that researchers are using to improve plant productivity is high-throughput plant phenotyping. The objective is to use computing and robotics to increase the number of plants that can be evaluated for traits of interest. As the number of plants that can be screened increases, the rate at which the genetic basis of traits can be identified increases, and we can make better selection decisions for breeding. The way I see it, the future of plant phenotyping is going to be large numbers of cheap, autonomous robots that can crawl or fly a field or greenhouse, gather data, and send that data for genetic analysis and breeding decisions (as well as precision agriculture); I think the major hurdles are a lack of software, both for robot autonomy and data analysis, and the energy density of batteries.

The project that ultimately led to our recent manuscript "3D sorghum reconstructions from depth images identify QTL regulating shoot architecture" originated sometime around October 2014. Given our results with the RIG manuscript, I agreed with the general sentiment in the field that genotyping was not the bottleneck for crop improvement, and that phenotyping was the limitation. While industry platforms like Lemnatec and Traitmill and public platforms like PlantScan had been been out for a little while, we were looking for something that required less static infrastructure, especially since one of our major interests is bioenergy sorghum; 3+ meter tall bioenergy sorghum plants require more flexible acquisition platforms.

Sometime around Fall every year, I tend to get on an annual game development kick (notice that last Fall in November 2015 there are posts on procedural content generation, and this year the Llama and I are registered for the September Chillenium game jam at Texas A&M). Similarly, in Fall 2014, I was on my annual game development kick, and I was interested in the idea of the Microsoft Kinect as an input device (version 2 to be specific). Since the Kinect (and related cameras that use structured light or time-of-flight) directly samples information about the geometry of scenes at high rates and was relatively cheap, it seemed like an good fit for building a plant morphology phenotyping platform. Even if it didn't work, we only put the boss out ~$150, and we still learned a little about hardware interfaces and image processing.

3D sorghum - we have more than a thousand of these things now.
As it turns out, it worked pretty well, and after about 6 months of tooling around with some prototypes, we had a workflow in place and had enough working test cases to merit a larger scale experiment. This was around the time of the ARPA-E TERRA project proposals, so plant phenotyping was starting to get more attention in the United States, and our boss was onboard.

We grew the plants, imaged them loads of times, processed the data, and wrote the manuscript. The review process was productive; the editor and reviewers provided very helpful and critical feedback that improved the manuscript substantially.

The publication of this manuscript marks the completion of much of what the Llama and I had hoped to accomplish in our graduate studies. When we entered the TAMU Genetics program 4 years ago, we wanted to become proficient in interdisciplinary research and use mathematics and computing to answer questions in genetics. We now have publications that use modeling, high-performance computing, and image processing to understand the genetic basis of quantitative traits and plant physiology, we can program in multiple languages, and we can comfortably talk shop with a range of disciplines. We're happy with how things turned out in graduate school, and we're grateful for the training opportunities and interactions that we've had while we've been here. Over the next year we'll try to polish off some additional manuscripts in the queue and try to get a job in research.

Now for the what worked and what didn't.

What worked:
  • Open source libraries - Particularly RapidXML, OpenCV, and PCL. Open source is the best. Being able to browse the source code to step through the code was priceless for learning and debugging. In this spirit, we put all of our code on GitHub.
  • GitHub for code hosting - Free, simple, publicly accessible, and has a good user interface. If you write non-trival code for an academic publication, please just put it online somewhere so that folks like me can look at it. Ideas are cheap; implementations are valuable.
  • Dryad for data hosting - Acting as a data repository for research data seems to be an impossible and thankless task, particularly as data sets are exploding in size. Dryad was professional and efficient.
What didn't:
  • Getting bogged down in the details - The first draft of the manuscript that we sent for review was a bit heavy handed with extraneous detail. We didn't develop any elegant algorithms, but that didn't stop me from talking about our methods excessively and using far too much jargon; these got in the way of the overall message of our manuscript. Pruning this out and moving some of it to the supplemental benefited the manuscript.
  • Not merging RGB and depth and not using templating properly - Originally, I was only interested in the depth images since I figured the RGB data wouldn't provide much useful information (the plant is rather uniformly green after all, and variation in environmental lighting was a complexity I didn't want to handle). This ended up as most of the code base being designed specifically for clouds with point types of pcl::PointXYZ rather than using a template. If I were to do it again, I would have gone ahead and mapped the RGB to the depth data for figure generation and used templates to make the code base more generic (the same way PCL uses templates).

Friday, July 15, 2016

Our first bioRxiv submission

Our recent manuscript on image-based phenotyping in sorghum has been in the review process for a while now, so we decided to post it on the bioRxiv preprint server (after we received permission from the journal peer review manager and editor) to get it out in the wild prior to publication. The premise of the bioRxiv (and its inspiration in the physical sciences, the arXiv) is to make scientific findings immediately available, and Needhi Bhalla wrote a useful primer on the pros and cons of submitting your work to a preprint server prior to publication.

I'm not sure if we'll always choose to post a preprint, but I think it makes sense this time since (1) we know the manuscript has already been shared with others outside of our intended circle of collaborators, and (2) this contribution is more timely than our past findings given the goals of the ARPA-E TERRA initiative. Even though pre-prints don't hold the same prestige as peer-reviewed journal articles, there will at least be timestamped documentation and source code that can be used by others and can be shown to potential future employers.

Without further ado, the following text links to a preprint documenting an image-based phenotyping platform we developed and quantitative trait loci that we identified using it: "3D sorghum reconstructions from depth images enable identification of quantitative trait loci regulating shoot architecture" by McCormick, Truong, and Mullet.

The software we developed can be found on GitHub by following this link. Many previous posts on this blog were written in the context of developing that code.

And most importantly, here are some 3D sorghum plants.

Thursday, June 9, 2016

Robot Hexapod (RHex)

Lately we've been musing about what the right strategy will be for agricultural phenotyping solutions, and we're currently thinking about the advantages of large numbers of cheap, small, legged, autonomous robots that can crawl the fields and image plants. These are a few ideas that we want to come back to later if we ever get around to prototyping something.

Boston Dynamics built a rugged version of the robot hexapod archetype (RHex).

The Koditschek lab works extensively on the RHex (I believe RHex was born of a DARPA project that Koditschek was on):
Koditschek lab site link

And a couple of relevant manuscripts from Koditschek:
Saranh, Buehler, Koditschek, (2001) RHex: A simple and highly mobile hexapod robot
Weingarten, Buehler, Groff, Koditschek (2003) Gait generation and optimization for legged robots
Haynes, Pusey, Knopf, Johnson, Koditschek (2012) Laboratory on legs: an architecture for adjustable morphology with legged robots

This is a student project for building an RHex that uses a Kinect for navigation (RHex Box); it has a bit of content for a DIY RHex:
Project website link

Wednesday, April 20, 2016

Getting started with Intel's Threading Building Blocks: installing and compiling an example that uses a std::vector and the Code::Blocks IDE

I hit a problem where it looks like I'll have to parallelize execution. After reading a bit about it, I opted to use Intel's Threading Building Blocks over OpenMP. The consensus seemed to be that Intel TBB is better suited for slotting into C++ projects. TBB also has a bunch of birds on its webpage, which is a plus in my book.

The most excellent TBB bird logo.

To download and install, I got the .tgz corresponding to the source code at the download page:

After unarchiving and decompressing, I invoked "make" in the directory (this was on an Ubuntu 14.04 OS). This built the debug and release targets in the "build" directory.

You can also invoke "make" in the "examples" directory; it builds and runs a number of fun examples that use TBB.

To set that up to compile with the Code::Blocks IDE, I did the following:

Go to the Project's Build Options -> Search Directories -> Linker and add the debug and release build directories (e.g. tbb44_20160128oss/build/linux_intel64_gcc_cc4.8_libc2.19_kernel3.19.0_debug).

Go to the Project's Build Options -> Search Directories -> Compiler and add the include directory (e.g. tbb44_20160128oss/include).

Go to the Project's Build Options -> Linker settings, and, under "Other linker options" add "-ltbb" (don't include the quotes).

 To test the compilation with a simple use case, I modified one of the examples in the TBB parallel_for documentation. Here's the premise:

#include <iostream>
#include <vector>

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Example adapted from:
class AverageWithVectors {
    std::vector<float> *input;
    std::vector<float> *output;
    void operator()( const tbb::blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end(); ++i )
            (*output)[i] = ( (*input)[i-1] + (*input)[i] + (*input)[i+1]) * (1/3.f);

int main() {
    std::vector<float> inputVector = {30.0, 70.0, 90.0, 120.0, 133.0, 199.0, 245.0, 266.0, 289.0};
    std::vector<float> outputVector = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    AverageWithVectors avg;
    avg.input = &inputVector;
    avg.output = &outputVector;
    tbb::parallel_for( tbb::blocked_range<int>( 1, 7 ), avg );
    for (uint32_t i = 0; i < (*avg.output).size(); i++) {
        std::cout << "output of element " << i << ": " << (*avg.output)[i];


The output should be something like:
output of element 0: 0
output of element 1: 63.3333
output of element 2: 93.3333
output of element 3: 114.333
output of element 4: 150.667
output of element 5: 192.333
output of element 6: 236.667
output of element 7: 266.667
output of element 8: 0

Note that this doesn't actually prove anything is being executed in parallel, only that it compiles and successfully calls tbb::parallel_for().

Finally, I came across a helpful post on the Intel forums from a user named Jim Dempsey. Most of it is reproduced below.

Generally speaking: 
Use for (or for_each) when the amount of computation is small (not worth the overhead to parallelize). 
Use parallel_for when you have a large number of objects and each object has equal work of small work. 
Use parallel_for with appropriate grain size when you have a large number of objects and each object has un-equal work of small work. 
Use parallel_for_each when each objects work is relatively large and number of objects is few and each object has un-equal work. 
When number of objects is very small and known, consider using switch(nObjects) and cases using parallel_invoke in each case that you wish to impliment and parallel_for_each for the default case