Friday, October 31, 2014

OpenCV Mat types

I'm currently working on reading a depth image with OpenCV and converting it to a point cloud. After wrestling with data types for a while, I came upon a tremendously useful listing of the OpenCV Mat types and their corresponding numerical values, along with the data types of their underlying channel data. For example, a CV_16U OpenCV Mat returns 2 from the type() member function and stores its channel data as unsigned shorts. The mapping is posted on this blog here.

    cv::Mat depthImage;
    depthImage = cv::imread("banana_1_1_1_depth.png", CV_LOAD_IMAGE_ANYDEPTH | CV_LOAD_IMAGE_ANYCOLOR ); // Read the file

    if(! depthImage.data )                              // Check for invalid input
    {
        std::cout <<  "Could not open or find the image" << std::endl ;
        return -1;
    }

    std::cout << "Size of depthImage:\t" << depthImage.size() << std::endl;
    std::cout << "Type of depthImage:\t" << depthImage.type() << std::endl;
    std::cout << "Depth image at coord 0,0:\t" << depthImage.at<ushort>(0,0) << std::endl;

Thursday, October 30, 2014

Installing Point Cloud Library and configuring Code::Blocks to compile (Ubuntu 14.04 LTS)

Most of the credit for this post goes to this blog post and this documentation at the Point Cloud Library website. However, those didn't quite get me to the finish line, so here's what got me there:

In anticipation of getting a Microsoft Kinect for Windows v2 to image some plants, I wanted to get the Point Cloud Library (PCL) up and running so that we could define 3D entities from the 2D RGB-D images we'll get from the Kinect. I'll be developing in C++ with Code::Blocks so my first goal was to compile some code using the PCL.

Mercifully, prebuilt binaries for the PCL are available through the apt package manager. As documented at the PCL website:
sudo add-apt-repository ppa:v-launchpad-jochen-sprickerhof-de/pcl
sudo apt-get update
sudo apt-get install libpcl-all
As for configuring Code::Blocks, I followed a blog post at this site, but it seemed that some of the shared object library requirements had changed since that blog post was written. Ultimately, it required adding the following:
Go to Project->Build Options->Search Directories Tab->Compiler Tab and add the following locations:
Then in Project->Build Options->Linker Settings Tab add the following link libraries:
/usr/lib/libvtk* (all of the libvtk .so files)
/usr/lib/libpcl* (all of the libpcl .so files)
These files may not be in these locations on all systems, so use find to track them down, e.g.:
find /usr/lib* -name "*boost*.so"
The code I used to test if I could compile was obtained here from the Point Cloud Library, and it's also reproduced here:
#include <iostream>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>

int main (int argc, char** argv)
{
  pcl::PointCloud<pcl::PointXYZ> cloud;

  // Fill in the cloud data
  cloud.width    = 5;
  cloud.height   = 1;
  cloud.is_dense = false;
  cloud.points.resize (cloud.width * cloud.height);

  for (size_t i = 0; i < cloud.points.size (); ++i)
  {
    cloud.points[i].x = 1024 * rand () / (RAND_MAX + 1.0f);
    cloud.points[i].y = 1024 * rand () / (RAND_MAX + 1.0f);
    cloud.points[i].z = 1024 * rand () / (RAND_MAX + 1.0f);
  }

  pcl::io::savePCDFileASCII ("test_pcd.pcd", cloud);
  std::cerr << "Saved " << cloud.points.size () << " data points to test_pcd.pcd." << std::endl;

  for (size_t i = 0; i < cloud.points.size (); ++i)
  {
    std::cerr << "    " << cloud.points[i].x << " " << cloud.points[i].y << " " << cloud.points[i].z << std::endl;
  }

  return (0);
}
And hopefully you get the expected output: the five random points written to test_pcd.pcd and echoed to the terminal.

Code::Blocks Son of Obsidian theme

Edit: 09/10/2015 -
I had some trouble getting back to the Code::Blocks forum link for the Son of Obsidian .conf file. Here is a link where you can get the .conf file from what I think is the original author.

Working in Code::Blocks? Might as well make it easy on the eyes.

In this post at the Code Blocks forums, a user provides a port of the Visual Studio color scheme Son of Obsidian as a .conf file. The .conf file can be imported using cb_share_config as detailed in this blog post. Importing the .conf file to the default.conf file gives the colors in the image above.

Thursday, October 16, 2014

BibTeX .bst file generation

Not all of the life science journals we submit to support $\LaTeX$ and friends, so getting the right bibliography formatting can be a challenge sometimes. I just discovered a web application that allows you to answer a series of questions about the citation formatting, after which it spits out a .dbj file that can be converted to a .bst bibliography style file for BibTeX. It's appropriately called the Bst generator, and can be found at .

Monday, August 25, 2014

Recurrent artificial neural network evolution using a genetic algorithm for reverse engineering of gene regulatory networks - Postmortem

During the SIAM Conference on the Life Sciences poster session, one of my poster neighbors was a couple (who also worked in the same lab!); their poster was on a method to evolve an artificial neural network using a genetic algorithm to reverse engineer a gene regulatory network given expression data. This was the first time I had heard of such a concept, and it sounded like a great idea. Once the Llama and I returned from the conference, I started work on implementing such a method.

I started with the poster authors' original paper, Kordmahalleh and Sefidmazgi et al. (2014), but I was unable to determine how to apply the methods in that paper to reverse engineering of gene regulatory networks. I eventually ended up on a paper by Noman, Palafox, and Iba (2013) using a recurrent neural network model that was sufficiently detailed for me to get a handle on how such a model and genetic algorithm might be implemented. While this got me about 80% of the way to a working artificial neural network and genetic algorithm, the method by which they evaluated the fitness function for the genetic algorithm wasn't described in enough detail for me to implement. My interpretation of the paper was that their method evaluated the fitness of each node independently of the other nodes, but I was unable to determine how they actually evaluated a node's output independently of the output of the other nodes at a given time point. The implementations I tried failed to reconstitute the synthetic network that generated the input data.

After that, I abandoned the Noman, Palafox, and Iba (2013) subproblem method and tried solving the entire system of equations for the network. Coming into this, I didn't know a thing about solving differential equations; one of the reasons I went for the Noman et al. (2013) method was that I perceived it wouldn't require the solution of a system of differential equations. Fortunately for me, the Llama gave me plenty of guidance. Once I had gotten my bearings in Matlab, I tried to find a C++ library that could numerically solve a system of ordinary differential equations. Thankfully, there is a nice library called odeint, and odeint has been incorporated into the Boost libraries. After grabbing the Boost libraries, I could solve for the entire system of ordinary differential equations given network parameters, and use the mean squared error to determine the fitness of the network. With the network fitness in hand, I was in business. As it turns out, this method is very similar to the one described in Wahde and Hertz (2000).

The code base can be found at the ANN_GA_prototype repo on my Github. To date, if a realistic amount of input data is provided, it can vaguely reproduce a synthetic network's output after about 100,000 generations, but the topology is very different from the target synthetic network. It may be the case that more input data is required than is realistically available from most biological experiments to accurately reverse engineer network topology.

And now for the postmortem:

What worked:
Code Complete: the Llama and I have been reading Code Complete as part of a weekly meeting we attend with a research group from the Mathematics department. When I first started writing the code, I tried to pull in some of the ideas from Code Complete, such as using abstraction to manage complexity and hiding implementation details. This worked great in the beginning since I had spent some time on design prior to writing any code. However, I ultimately failed to maintain Code Complete's tenets since I changed methodology in the middle (see more in What didn't work).

Object Oriented Programming: I usually don't work on code that is big enough to merit much OOP, but the abstraction helped a lot here.

odeint: While odeint wasn't as easy to use as Matlab, it was considerably faster at numerical solution of the system of ordinary differential equations than Matlab. However, that doesn't mean it wasn't easy to use; I don't know a thing about solving ordinary differential equations, so I think that's a testament to its usability.

What didn't work:
Changing design midstream: the code base degraded quite a bit when I switched to implementing a different method to evolve the network. Basically, the original design was to operate on a population of nodes (as I had interpreted from Noman et al. (2013)). However, when this didn't work and I had to switch to operating on a population of networks, I was pretty frustrated and just wanted to get to a working solution. Now the code base is littered with remnants and re-tooled duplicates of the original method that need to be cleaned out. However, I think this was the best path to take since I could have potentially wasted time designing for a second method that also may not have worked. Now that it's working, I can go back and clean things up.

Defining the system of equations for odeint at runtime: There must be a better way to do this, but as it stands, I define the network size (i.e. the number of equations in the system that odeint solves) as a macro. I have been unable to work out how to define the network size at runtime in a way that odeint will take. The result is that I have to change the NETWORK_SIZE macro and recompile if I want to modify the number of equations in the system that odeint solves for. I'd much rather be able to do this at runtime based on the number of genes in the input file.

Friday, August 15, 2014

Split string on delimiter in C++

Some suggest using the Boost libraries for this, but there's an answer to this question on StackOverflow by user Evan Teran (albeit not the accepted answer) that I much prefer because it uses the standard libraries.

As posted on StackOverflow:
The first puts the results in a pre-constructed vector, the second returns a new vector.
std::vector<std::string> &split(const std::string &s, char delim, std::vector<std::string> &elems) {
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, elems);
    return elems;
}
As noted on the StackOverflow post, it can be used like:
std::vector<std::string> x = split("one:two::three", ':');
This overloads the split function, and the second function requires the first, so you need both. As a secondary example, here is a version that I slightly expanded for readability and used as a member function of a class called TimeSeriesSetDataFileParser:
std::vector<std::string> & TimeSeriesSetDataFileParser::SplitString(const std::string &inputString, char delimiter, std::vector<std::string> &elements) {
    std::stringstream sstream(inputString);  //Taken from
    std::string element;
    while (std::getline(sstream, element, delimiter)) {
        elements.push_back(element);
    }
    return elements;
}

std::vector<std::string> TimeSeriesSetDataFileParser::SplitString(const std::string &inputString, char delimiter) {
    std::vector<std::string> elements;   //Taken from
    this->SplitString(inputString, delimiter, elements);
    return elements;
}
And I used it to parse a data file to a 2D vector of strings (the member functions Good() and GetLine() are just methods encapsulating the good() and getline() member functions inherited by ifstream).
    std::cout << "Parsing input data file" << std::endl;
    std::string line;
    std::vector<std::vector<std::string> > vvstr_data;
    std::vector<std::string> vstr_splitLine;

    while (inputTimeSeriesSetDataFile->Good()) {
        line = inputTimeSeriesSetDataFile->GetLine();
        if(inputTimeSeriesSetDataFile->Good()) {
            vstr_splitLine = this->SplitString(line, '\t');
            vvstr_data.push_back(vstr_splitLine);   //For now we just read it into a 2 dimensional vector of strings.
        }
    }

Monday, August 11, 2014

SIAM Conference on the Life Sciences 2014 - Joint Recap and Post Mortem

This post was written by both Precocious Llama and Frogee:

We've just returned from Charlotte, North Carolina where we attended the 2014 SIAM Conference on the Life Sciences. Both of us were fortunate to have received student travel awards to present our respective posters at the conference, and between the plenary talks, the minisymposiums, the panels, and the poster session, we took away many new (at least to us) ideas.

There were two talks that we found particularly motivational. The first was during a minisymposium on Genetic and Biochemical Networks by Eduardo Sontag; he told us of work on resolving scale-invariance, the phenomenon of fold-change detection, in biological networks by considering multiple time scales. He told us stories of proposed biological models that didn't hold up to mathematical inspection with respect to scale-invariance, and how, by tweaking the models, they uncovered an unknown component of the biochemical network (at least this was our understanding). The second talk was a plenary talk by James Collins on synthetic biology. Collins took us through the timeline of his work in synthetic biology, from modelling toggle switches to implementing them in bacteria, and all the applications they have been involved in since.

When Frogee and I discussed these two talks after the conference, we enumerated some qualities that we appreciated about these talks:

  1. These talks struck a nice balance between the application and the mathematics used to solve the problems at hand. These were well-motivated stories.
  2. Through their collaborations and research it was clear that they were open to learning new fields, and that by doing so they had a mastery of both the mathematics and life sciences. They didn't self-identify as mathematicians or physicists. They were scientists solving problems.
  3. Both these speakers had phrases that began like "About 40 years ago, I was interested in ____". It's interesting to hear the insights of somebody who has been working on related problems for 40 years.
  4. Neither of these talks are directly related to the research that we are currently pursuing, but we really enjoyed their accessibility; it reminded us why it's beneficial to make your work accessible across disciplines.
Although the conference advertised that it would provide a cross-disciplinary forum to catalyze applied mathematical research in the life sciences, it seemed that the majority of the minisymposia presenters did not clearly bridge their theorems and algorithms to applications in the life sciences, nor did they care to provide a life science motivation for their research. Multiple speakers even proclaimed that there was no real-life application to their work, but rather that they were just exploring the properties of different mathematical models. Unfortunately, this caused many of the minisymposia talks to have far less impact than the plenary talks. Regardless, we were still able to take away a few ideas that are translatable to our work.

Throughout the conference, there seemed to be a strong focus on neuronal signaling and biochemical reaction networks, with cellular behavior/movement/biophysics following close behind. Cancer modeling also made a strong showing, though not nearly as dominant as we expected it to be. Gene regulatory networks had a brief gasp of a showing. We found it very surprising that genomics and genetics in general were almost entirely absent from the conference. Of note, there was one plenary talk by Oliver Jensen on plant root modeling that was relevant to the modeling that we have started pursuing in our work.

Unfortunately, there was an undercurrent of condescension towards both females and life scientists at the conference. Specifically, comments like "You should have more equations; it's what draws people in at this conference", and "Women tend to take research more personally" were particularly disappointing. We applaud one of the panelists who, during the Lee Segal forum, expressed his displeasure with a male conference attendee (who remained anonymous) with respect to a misogynistic attitude. The panelist and the attendee had been on the hotel elevator along with a female non-attendee; in response to the female non-attendee describing her position as a manager at a bank, the male conference attendee replied "Oh, so you have an easy job." 

I think we took away many useful and innovative ideas from the conference. Overall, the conference was productive, and we'd like to attend again if we're given the opportunity to do so.

And now for the post-mortem.

What worked:

  1. We extended a poster mailing tube we had scavenged from around the department using duct tape and cardboard. We had no problems carrying this on the plane as a carry-on (American Airlines).
  2. Going to the grocery store to get some snacks for the hotel room. We actually ended up getting full-blown meal materials from the grocery store so that we could eat meals in the hotel room given the relatively short duration of the meal times.
  3. Using the coffee pot to cook oatmeal and macaroni and cheese.
  4. Getting some sleep. At the beginning we tried to attend every session, but the 8am to 10pm duration of the conference every day eventually wore us down. We then started prioritizing sessions so that we could get a decent night's rest.
What didn't work:

  1. Having only a 2 hour layover between connecting flights. A delay on the first flight caused us to nearly miss the next one. 3 hours seems a reasonable buffer.
That's it. Here we are the morning after the poster session!
Ryan McCormick at poster session for SIAM LS 2014
Sandra Truong at poster session for SIAM LS 2014

Thursday, June 26, 2014

sra to gzipped fastq

One way to take NCBI's SRA format to a gzipped fastq file (since BWA can take a gzipped fastq as input) is to pipe the SRA Toolkit's fastq-dump to gzip and direct that to an output file. Here's an example that uses the -X flag to only take 10 spots from the SRA file to test it out:
#Get an sra file:

#Designate where the SRAToolkit executables are

#fastq-dump converts the sra format to fastq format. -X designates how many spots to convert, and -Z designates to write to stdout.
${SRATOOLKITBIN}fastq-dump -X 10 -Z SRR771638.sra | gzip > SRR771638.fastq.gz

echo "Finished compression"

#Take a peek at the gzipped file.
gunzip -c SRR771638.fastq.gz | head -n 10

Monday, June 9, 2014

Generate Q-Q plot using Python (Install StatsModels Ubuntu 12.10)

While I prefer C or C++ as a programming language, I frequently turn to Python for prototyping and data plotting. R is an absolute last resort if the tools I'm looking for aren't implemented elsewhere.

That said, I was interested in generating a Q-Q plot for some phenotype data, and wanted to do so in a way that did not involve R. Fortunately, it is already implemented in the StatsModels package for Python.

Regarding the theory behind Q-Q plots, I found this documentation very helpful.

First I grabbed setuptools to make the installation easier:
sudo ls    #I just ran this so that the next command didn't prompt for a password on the sudo python command
wget -O - | sudo python
Unfortunately, the setuptools command from the StatsModels instructions page didn't work:
sudo easy_install -U statsmodels   #This didn't work for me
The command failed, stating that there was a SandboxViolation because the setup script attempted to modify files on the system that were not within the EasyInstall build area (it looked like it was matplotlib, which I had installed long ago).
So I went to the StatsModels page on pypi and downloaded the source there (statsmodels-0.5.0.tar.gz). After decompression and moving to the directory in the terminal:
python setup.py build
I was lacking a few dependencies; fortunately, getting them with setuptools worked and was as easy as:
sudo easy_install -U pandas
sudo easy_install -U patsy
sudo easy_install -U openpyxl
Then, once again for StatsModels:
python setup.py build
sudo python setup.py install
You can test the installation by going into the Python interpreter and trying to import it.
>>> import statsmodels.api as sm
This threw some warnings stating my lxml was too old to be used with openpyxl. To resolve that I tried:
sudo apt-get build-dep python-lxml
This didn't fix this issue for me, but it sounds like it's not necessary, at least according to this StackOverflow post. At any rate, at this point, you should be able to open the Python interpreter and run (this code is from the StatsModels documentation):
>>> import statsmodels.api as sm
>>> from matplotlib import pyplot as plt
>>> data = sm.datasets.longley.load()
>>> data.exog = sm.add_constant(data.exog)
>>> mod_fit = sm.OLS(data.endog, data.exog).fit()
>>> res = mod_fit.resid # residuals
>>> fig = sm.qqplot(res)
And get a plot. Alright, so StatsModels is working for us now. Onto making a Q-Q plot with our data. I had a .tsv file named data.tsv of the format (single column of floats):
And I wrote the following code (some of it is extraneous for this example); note that this is the Q-Q plot against the normal distribution, the default distribution for qqplot():
# -*- coding: utf-8 -*-

import sys
import statsmodels.api as sm
import numpy as np
from matplotlib import pyplot as plt

def usage():
    sys.stderr.write("\ input.mds\n")
    sys.stderr.write("\tExample usage: data.tsv\n")

try:
    li_tsv = [line.strip() for line in open(sys.argv[1])]
    li_tsv = [element.split("\t") for element in li_tsv]
except IndexError:  # Check if arguments were given
    sys.stderr.write("No arguments received. Please check your input:\n")
    usage()
    sys.exit(1)
except IOError:  # Check if file is unable to be opened.
    sys.stderr.write("Cannot open target file. Please check your input:\n")
    usage()
    sys.exit(1)

def plotQQPlot():
    li_x = []

    for i in range(len(li_tsv)):
        li_x.append(float(li_tsv[i][0]))  #Assuming your data are in the first column

    print li_x
    na_x = np.array(li_x)
    fig = sm.qqplot(na_x)
    plt.show()

plotQQPlot()

And ran it with:
python data.tsv
Which gave:

To test that against R, from the R interpreter you can do:
> data <- readLines("data.tsv")
> dataNum <- as.numeric(data)
> qqnorm(dataNum)
Which gave:

Tuesday, May 27, 2014

RNA extraction - insoluble gel-like pellet

The Llama and I were working on some RNA extractions recently from a series of developmental stages in sorghum. Our lab's typical RNA extraction protocol (a GITC-phenol-chloroform based protocol; TRIzol and friends) worked fine for the developmentally younger tissue samples, but all of the more mature tissue samples ended up with a fairly large, insoluble, gel-like pellet at the end of the extraction. We initially believed the pellet to be RNA that we hadn't properly de-salted or that we had over-dried, leading to decreased solubility. However, additional de-salting, heating, physical beating, etc. never managed to resolubilize the pellet. After quite a bit of troubleshooting we determined that the insoluble pellet was starch (or at least some kind of carbohydrate) that, due to its high abundance in the tissues, wasn't removed during the extraction (of note, I had also occasionally run into a similar problem in Neurospora).

Once we had an idea that it might be a starch problem, the Llama found some specific extraction protocols to deal with high-starch tissues. The protocol from Wang et al. (2011) seems to work well enough in our hands, and we no longer have the insoluble pellet problem (although many of our samples are, unfortunately, fairly transcriptionally quiescent, so the yields aren't that high).

In summary, an insoluble, gel-like pellet at the end of an RNA extraction may be starch carry-over. A somewhat cloudy aqueous phase during the extraction (or, more extreme, a completely gelatinized aqueous phase) is a good indicator that you might run into this problem. The SDS and multiple extractions in the Wang et al. (2011) protocol seem to solve the problem.

Saturday, May 10, 2014

Modify PLINK to accept a new organism

06/10/14 Edit: It looks like PLINK is undergoing renewed development; a new version of PLINK (1.90) is in beta and has new flags that render this post unnecessary (see the --chr-set and --autosome-num flags)
By default PLINK has a few different organisms basically hardcoded, but it isn't too much work to add your organism provided you can compile the code from source (note the compiler version issue resolved in this previous post).

Credit goes to this post from the biostars forum.

To modify PLINK to take a different organism, you'll need to modify the files listed below (switching the name of your organism of interest for Sorghum/sorghum). Add the relevant line near the other similar lines (e.g. the ones for dog, human, horse, etc.):

In helper.h, around line 195, add:
void defineSorghumChromosomes();
In helper.cpp, around line 1674, add:
void defineSorghumChromosomes() {
//copy the function for a relevant species to here (e.g. rice) and change the chromosome number accordingly for your organism.
//in my case I also changed haploid to false.
In options.h, around line 578, add:
static bool species_sorghum;
In options.cpp, around line 467, add:
bool par::species_sorghum = false;
In plink.cpp, around line 167, add:
else if (par::species_sorghum) defineSorghumChromosomes();
In parse.cpp, around line 3536, add:
if (a.find("--sorghum")) par::species_sorghum = true;
After compilation, you can designate your added species when invoking PLINK (e.g. --sorghum) at the command line.

Friday, May 9, 2014

Downgrade g++ compiler on Ubuntu (compilation of PLINK and redeclaration errors)

I wanted to make a few modifications to the PLINK source code so that we can have it work with our organism for some GWAS. I was unable to compile with g++ version 4.7.2 since the PLINK code base redeclares variables after a previous declaration, and 4.7 does not permit variables to be redeclared. The easiest way around this was to downgrade compilers, since g++ 4.6 permits redeclaration. On Ubuntu this was as easy as:
sudo apt-get install g++-4.6
and specifying g++-4.6 as the compiler in the Makefile. This doesn't change the default version invoked with g++ (i.e. g++ -v at the CL still returns 4.7).

In summary:
g++ v4.7 stops with errors:
sets.cpp: In member function ‘vector_t Set::profileTestScore()’:
sets.cpp:771:37: error: redeclaration of ‘std::vector<individual>::iterator i’
sets.cpp:703:12: error: ‘int i’ previously declared here
g++ v4.6 proceeds with a few warnings

Tuesday, April 15, 2014

Inkscape svg to pdf and $\LaTeX$ fonts in Inkscape (Ubuntu 12.10)

The Llama and I are presenting posters the day after tomorrow. I made mine in Inkscape using the $\LaTeX$ Computer Modern font, and had some troubles getting it to export properly to a .pdf file for printing using Inkscape's export functions.

To get Computer Modern font in Inkscape, I followed this blog post and copied the appropriate files to ~/.fonts

To get a .pdf that was suitable for printing the poster, instead of using Inkscape to export to a .pdf (which did not properly save the fonts), I exported a bitmap, and converted the exported .png file with convert, which was already in my Ubuntu distribution (Ubuntu 12.10).
convert input.png output.pdf
Conversion in this fashion maintained everything as expected.

Saturday, March 29, 2014

Keystone Symposia on Big Data in Biology

The Llama and I just got back from the 2014 Keystone Symposia on Big Data in Biology. The meeting was enjoyable, and it was nice to see what the cancer biology folks are doing to manage their data deluge. The organizers were Lincoln Stein, Doreen Ware, and Michael Schatz.

The major points of interest that I took away were, in no particular order:

  1. General need/interest in the community in the adoption of standardized file formats or (more importantly) standardized file interface application programming interfaces (APIs).
  2. The scale of genomic analyses is causing a shift to colocated computing (e.g. cloud computing).
  3. General need/interest in the community for "standardized" analysis pipelines or "recipes" since methodological differences in analyses are causing reproducibility problems.
  4. The community acknowledges that we can rapidly generate large amounts of data, but we're barely keeping our heads above water for storage and analysis, and we're still pretty bad at translating the data into actionable information.
  5. Different variant callers give different results. The GATK is generally considered one of the best performing programs for calling SNPs, but the jury is still out for indel calling. A DREAM competition is coming soon for variant callers that may help with benchmarking (!Synapse:syn312572/wiki/60874).
  6. General interest in the community for new models of reference genomes. Instead of monolithic strings representing the "one true reference", reference genomes would be modeled as graphs to represent pangenomes.
  7. At the poster session, we learned of a method for prioritizing candidate genes that are found under a QTL interval, and we hope it will be published on soon so we can use it.
  8. At the poster session, we learned of a method for mathematically modeling sequence strings and scoring them based on a training set (i.e. scoring them for transcription factor binding sites); we hope to use this one soon too once it's published.
  9. iPlant may be a useful cloud computing resource that we want to look into further.
  10. We learned some of the general "good" practices for analysis of data at scale which may be applicable to our pipelines.
  11. At this scale of analysis, implementation matters a great deal; implementations in languages like R, Perl, and Python suffer in performance relative to implementations in C.
  12. BWA (specifically BWA mem) is generally accepted in the community as the de facto read mapping algorithm.
  13. There is a disconnect between what biologists think it takes and what it really takes to manage and analyze the data; biologists frequently underestimate the resources (time, effort, and hardware) required for the analyses they propose.
  14. The IBM Computational Biology group could make a tremendous splash once their text based learning method using Watson is released (it basically reads the available literature and provides researchers with potential leads).

Friday, March 7, 2014

GATK Runtime Error: Somehow the requested coordinate is not covered by the read

This error sometimes crops up for me and others in the Unified Genotyper from GATK v 2.8. As this post on the GATK forums indicates, none of us can reliably reproduce it so nothing can really be done at this point. The reads at loci causing the crash all look normal, and subsetting the loci that causes the crash in one instance doesn't reproduce the crash when run alone. With hopes it will automagically go away in the imminent GATK v 3 release.

In the meantime, here's a hacked together method to get variants called for the entire genome. It's just a bash loop that loops through all of the genomic intervals in a file until they successfully complete. The output .vcf files can then be merged back to one file.

First you need an interval file. I just generated mine using awk and the .vcf file produced by a Unified Genotyper run that crashed:
awk '{
        if (substr($0, 1, 8) == "##contig") {
                split($0, stringArray, "=")
                split(stringArray[3], subStringArray, ",")
                print subStringArray[1]
        }
        if (substr($0, 1, 2) != "##") {
                exit 0
        }
}' crashedUGoutput.vcf > intervalFile.intervals
This should produce an interval file with one contig name per line (exactly what the names look like depends on how your contigs are named).
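As a sanity check, the same header extraction can be sketched in Python; the ##contig lines below are hypothetical examples of what a VCF header contains:

```python
# Extract contig names from the ## header lines of a VCF, stopping at
# the first non-## line (the #CHROM column header), mirroring the awk
# script above.
def extract_contigs(header_lines):
    contigs = []
    for line in header_lines:
        if line.startswith("##contig="):
            # grab the text between "ID=" and the next comma
            inner = line.split("ID=")[1]
            contigs.append(inner.split(",")[0])
        elif not line.startswith("##"):
            break
    return contigs

# Hypothetical header lines for illustration:
sample_header = [
    "##fileformat=VCFv4.1",
    "##contig=<ID=Chr1,length=30427671>",
    "##contig=<ID=Chr2,length=19698289>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print("\n".join(extract_contigs(sample_header)))
```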
Then you can do something like the following loop to process each interval until it succeeds:

# MEMORY, GATKPATH, REFERENCE, NUMTHREADS, INITIALVARS, and INPUTFILES
# should be set to match your setup before running
for file in $INPUTFILES
do
        inputString="${inputString} -I ${file}"
done

mkdir -p tmpVCF
rm -f ./tmpVCF/*
while read interval
do
        fail=1
        while [ $fail != 0 ]
        do
                echo "Attempting UG on $interval"
                echo "Attempting UG on $interval" >> ./tmpVCF/log.txt
                java -Xmx${MEMORY} -jar ${GATKPATH} \
                -T UnifiedGenotyper \
                -R ${REFERENCE} \
                ${inputString} \
                --genotyping_mode DISCOVERY \
                --genotype_likelihoods_model BOTH \
                -stand_emit_conf 30 \
                -stand_call_conf 30 \
                -nct ${NUMTHREADS} \
                -L $interval \
                -o ./tmpVCF/${INITIALVARS}_$interval.vcf
                fail=$?
        done
done < intervalFile.intervals
Following this, you'll have a .vcf file for each interval stored in the ./tmpVCF directory, and they can be merged back into one .vcf.
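A minimal sketch of that merge step in Python, assuming the per-interval files share the same samples and header (file names below are hypothetical): keep the header lines (those starting with '#') from the first file and append the record lines from all of them.

```python
# Merge per-interval VCFs: header from the first file only, records
# from every file. Assumes all inputs came from the same UG run and
# so share an identical header and sample set.
def merge_vcfs(vcf_paths, out_path):
    with open(out_path, "w") as out:
        for i, path in enumerate(vcf_paths):
            with open(path) as vcf:
                for line in vcf:
                    if line.startswith("#"):
                        if i == 0:          # header from the first file only
                            out.write(line)
                    else:
                        out.write(line)     # records from every file

# e.g. merge_vcfs(sorted(glob.glob("./tmpVCF/*.vcf")), "allIntervals.vcf")
```

GATK also ships a CombineVariants walker that can do this merge; the sketch above is just a plain concatenation.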

Saturday, February 1, 2014

Write .ods (for LibreOffice Calc) from Python (2.7) (Ubuntu 12.10)


Edit 10-02-2015

The Simple ods py module doesn't seem to work with updated versions of odfpy. Testing with simpleodspy 1.1 and odfpy 1.3.1 throws a failed assertion, while simpleodspy 1.1 with odfpy 0.9.6 works.

It doesn't look like simpleodspy is being maintained any longer. The best bet may be to use odfpy directly if you can't get your hands on an older version of odfpy.


I was looking for an easy way to write .ods files (for LibreOffice Calc) from Python (2.7). I really only needed the functionality of designating the contents of a cell and changing the background color of a cell based on its contents.

Simple ods py, written by Yaacov Zamir, worked very well for this purpose. It required the odfpy module for the functionality I needed. Here's what worked for me:

Download the odfpy module:

Download the Simple ods py module:

Both can be installed after downloading by entering their respective directories and:
python setup.py build
sudo python setup.py install
This delightful post on Stack Overflow has a very nice way of getting the string corresponding to a column index in C#:

After adapting that method to Python and working from Yaacov's example code, I ended up with the following test code:
from simpleodspy.sodsspreadsheet import SodsSpreadSheet
from simpleodspy.sodsods import SodsOds

#Adapted from
def GetColumnName(int_columnNumber):
    int_dividend = int_columnNumber
    str_columnName = ""
    int_modulo = 0
    while (int_dividend > 0):
        int_modulo = (int_dividend - 1) % 26
        str_columnName = chr(65 + int_modulo) + str_columnName
        int_dividend = (int_dividend - int_modulo) / 26
    return str_columnName

Sods_table = SodsSpreadSheet()   #Default size is fairly small, may have to increase size; e.g. SodsSpreadSheet(100,100)
li_row1 = [ "a", "b", "a", "b", "a", "b" ]
li_row2 = [ 1, 2, 3, 4 ]
lili_table = [ li_row1, li_row2 ]

for rowIndex in range(len(lili_table)):
    for colIndex in range(len(lili_table[rowIndex])):
        str_coordinate = GetColumnName(colIndex + 1) + str(rowIndex + 1)   #Spreadsheet is not 0 based.
        print str_coordinate
        Sods_table.setValue(str_coordinate, str(lili_table[rowIndex][colIndex]))
        if lili_table[rowIndex][colIndex] == "a":
            Sods_table.setStyle(str_coordinate, background_color= "#00ff00")

Sods_ods = SodsOds(Sods_table)
Sods_ods.save("test.ods")
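As a quick sanity check of the column-name conversion above, here it is rewritten with explicit integer division (//) so it behaves the same under Python 2 and 3:

```python
# Convert a 1-based column index to its spreadsheet letter name
# (1 -> A, 26 -> Z, 27 -> AA, ...), same logic as GetColumnName above.
def get_column_name(column_number):
    dividend = column_number
    name = ""
    while dividend > 0:
        modulo = (dividend - 1) % 26
        name = chr(65 + modulo) + name
        dividend = (dividend - modulo) // 26   # // keeps this an int in Python 3
    return name

for n in (1, 26, 27, 28, 703):
    print("%d -> %s" % (n, get_column_name(n)))
```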


Friday, January 24, 2014

Journal Club: Reporting new genome sequences

Journal club is back in session.

This week's paper: An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions by Haudry et al. (2013). 

I guess papers on straight genome sequences can't get published anymore, perhaps rightfully so. However, in this case, I think the availability of the genome sequences the authors obtained will be more useful than the analyses they performed.

My comment:
I think the important contribution of the paper is the new plant genome sequences, but it seemed that the authors tried to sell the collection of conserved noncoding sequences as the important contribution. I didn't find the analyses or the findings particularly novel, but I suspect the new genome sequences will be useful for the field.

JC: Bud's buds

An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions by Haudry et al. (2013). 

My comment:
Conserved noncoding sequences are interesting because they reveal something about selection. In this comparative case, though, I'm more interested in what distinguishes these genomes, in particular the vague connection the authors allude to: that transposable elements are the candidate for differential genome size. And, I guess on a teleological level, it's like, what is their purpose!?

Also, can these crucifer species be crossed with one another? Can A. thaliana and A. lyrata (it says they are congeners)?

Thursday, January 16, 2014

NTFS permissions and mounting (Ubuntu 12.04)

We have a workstation with multiple hard drives formatted with ntfs file systems to allow them to play nicely with Windows. Those file systems gave me issues when trying to allow multiple users to access them from Ubuntu.

Mounting from Nautilus only allowed the current user to access the drive, and I don't think an ntfs file system can have its permissions changed via chmod or chgrp after it has been mounted. What ended up working (for the most part) was suggested here. In this solution, the file system is mounted with the designated permissions using mount's fmask and dmask options for an ntfs file system.

First I needed to add the necessary users to a group that could access the drives; here I added them to the existing group plugdev
id username

sudo usermod -a -G plugdev username

id username
I made a directory in /media to hold the file system:
sudo mkdir /media/DATAPART1
And then I mounted the drive, which was at /dev/sdb1 (located using GParted). In this case, plugdev is group id 46, and this gives full permissions to the user and group (which become root and plugdev, respectively).
sudo mount -t ntfs -o gid=46,fmask=0003,dmask=0002 /dev/sdb1 /media/DATAPART1
Unfortunately, I couldn't work out how to execute a file on the mounted file system without giving execution permissions on ALL files; thus every non-directory file ends up the color of an executable with this method. I also couldn't work out how to enable the sticky bit for files to protect one user's files from being deleted by another user.
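To make the mount persist across reboots, the equivalent entry can go in /etc/fstab; a sketch using the device and mount point from above (a UUID from blkid would be more robust than /dev/sdb1, since device names can change between boots):

```
/dev/sdb1    /media/DATAPART1    ntfs    gid=46,fmask=0003,dmask=0002    0    0
```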


Thursday, January 9, 2014

Compile svmvia (g++ 4.7.2-1ubuntu1)

I'm looking into some work with support vector machines (SVMs), and I ran into some errors when trying to compile svmvia from source. I think the error occurs because some compilers implicitly include certain files at compile time, but I'm not sure.

g++ 4.7.2 complained that "'strcmp' was not declared in this scope", and "'rand' was not declared in this scope". This was resolved by adding the following includes to their respective files; cstring contains strcmp and cstdlib contains rand.

Add to mainVia.cpp:
#include <cstring>
Add to basic.cpp:
#include <cstdlib>
Add to smo.cpp:
#include <cstdlib>

Monday, January 6, 2014

Bash: for loop; loop through files; remove path and suffix from file names

Credit for the bash for loop goes here:
Credit for the suffix and prefix modifications goes here:

Loop through a series of numbers:

for int in 1 2 3 4
do
   echo $int
done
Loop through a series of files and strip the path and suffix from each file name; see the Stack Overflow link above for a few more options:

files=./*.fastq    # e.g. all .fastq files in the current directory
for file in $files
do
   echo $file
   SUFFIX=${file##*.}                    # the text after the last '.'
   echo $SUFFIX
   NEWNAME=$(basename $file .$SUFFIX)    # file name without path or suffix
   echo $NEWNAME
done