Thursday, August 29, 2013

Update Stacks installation

Julian Catchen has written a very useful piece of software called Stacks for analysis of restriction site associated DNA (RAD) data. The stacksToVCF code that I recently wrote may have exposed a bug in how Stacks processes the CIGAR string from read alignments with indels, and I'm updating my Stacks installation to test a fix pushed out by Julian.

The following is only for updating, not for a fresh installation.

Note that you need to have Google's sparsehash and samtools' bam library and header available for these instructions. For this case, I compiled both sparsehash and samtools from source, and installed sparsehash via make install. Since I didn't do that for the bam header and library, I have to specify them below.
#Get the most recent version (1.06 in this case):

#Unarchive and decompress the .tar.gz:
tar -zxvf stacks-1.06.tar.gz

#Move to the directory and configure:
cd stacks-1.06
./configure --enable-sparsehash --enable-bam \
--with-bam-include-path=/path/to/directory/with/samtools_bam.h \


sudo make install
Then you need to reconfigure the file to be able to send email:
sudo vim /usr/local/bin/
And modify the configurations to suit your system. For my case, it's just modifying the url variable to point to my server. All other settings can be left as they are. You can test that this was successful by invoking the version option for any of the component programs:
$ -v 1.06

Monday, August 26, 2013

Bioconductor is out-of-date: update R to version 3.0 in ubuntu

I wanted to install cummeRbund into a new workstation, per the protocol of TopHat and Cufflinks (Trapnell et al 2012), however I ran into a problem with R version and bioconductor. This may not be relevant soon as we believe that R 3.0 will soon be put into the repository for apt-get, but who knows it may be useful: Here is the error that I got: After going into R:
$ R
$ source("")
I got the error:
Your Bioconductor is out-of-date, upgrade to version %s by following instructions at
However for me this was not the case, and following instructions on Bioconductor's website did not help. I found that updating my version of R from 2.14 to 3.0 solved my problems. The instructions to update R to 3.0 can be found on the R website . Here is what I did: 1. removed older version of R
$ sudo apt-get remove r-base-core
2. Add deb to sources.list using vim: 2a. open sources.list using vim
$ sudo vim /etc/apt/sources.list
2b. scroll down to the bottom holding down Ctrl + F
2c. type "I" to insert text, and type in the following
## yourname modification on date    
deb percise/
(my ubuntu version is precise) 2d. push "Esc" button to exit insert mode
2e. type ":w" and "Enter" to save
2f. type ":q" and "Enter" to exit vim editor
3. Add key to sign CRAN packages
$ sudo apt-key adv --keyserver --recv-keys E084DAB9
4. Add specific PPA to the system
$ sudo add-apt-repository ppa:marutter/rdev
$ sudo apt-get update
$ sudo apt-get upgrade
5. Install latest version of R:
$ sudo apt-get install r-base
Now I had the right version of R, so I could install bioconductor and then cummeRbund: 1. open R
$ R
2. Install CummeRbund package:
> source("")
> biocLite('cummeRbund')
3. load CummeRbund library:
> library(cummeRbund)
4. Now you should be able to use the cummeRbund package as described in the protocol!

Friday, August 23, 2013

mysql++ compile example Linux Ubuntu 32-bit (a.k.a. example of linking libraries, makefile, headers for C++)

This is more of an exercise in linking libraries and Makefiles than something specific to mysql++, but since I didn't know how to do it before, here goes:

To compile one of the mysql++ examples on my system, I had to specify where the header files were for both the mysqlclient, which I had installed with the apt package manager (see this post), and for mysql++, which I had compiled from source (again, see this post). I also had to specify where the shared object libraries were. The -I option contains the include paths, the -L option contains the shared object paths, and the -l option contains the names of the .so files (excluding the lib and .so component; e.g. -lmysqlpp looks for On my system, this worked for compilation:
g++ simple1.cpp -I/usr/include/mysql -I/usr/local/include/mysql++ -L/usr/local/lib -L/usr/lib/i386-linux-gnu -lmysqlpp -lmysqlclient -o simple1
And at runtime, the .so files are searched for, so their location needs to be specified. The was installed by the package manager into /usr/lib/i386-linux-gnu and the was installed to /usr/local/lib by default. I ran this and also appended this to my .bashrc file.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/i386-linux-gnu:/usr/local/lib
Once that was working, I moved over to my code. My goal is to learn how to interact with a MySQL database using C++. First I needed to get some basics down, which was simply compiling multiple files since I've evidently forgotten the small amount I knew about C++.

I have 4 to start with:

The main file, stacksToVCF.cpp containing:
#include <stdlib.h>
#include <iostream>
#include "stacksDB.h"
int main ( int argc, char *argv[] )
 testcon(argc, argv);
 std::cout << "Hello from main" << std::endl;
stacksDB.h containing

int testcon(int argc, char *argv[]);

stacksDB.cpp containing
#include <stdlib.h>
#include <iostream>
#include <mysql++.h>
#include "stacksDB.h"

int testcon (int argc, char *argv[])
 mysqlpp::String greeting("Hello from mysqlpp");
 std::cout << greeting << std::endl; 
and the Makefile, containing:
CXXFLAGS := -I/usr/include/mysql -I/usr/local/include/mysql++
LDFLAGS := -L/usr/local/lib -L/usr/lib/i386-linux-gnu
LDLIBS := -lmysqlpp -lmysqlclient 


stacksToVCF: stacksToVCF.o stacksDB.o
        g++ stacksToVCF.o stacksDB.o $(CXXFLAGS) $(LDFLAGS) $(LDLIBS) -o $(EXECUTABLE)

stacksToVCF.o: stacksToVCF.cpp
        g++ -c stacksToVCF.cpp

stacksDB.o: stacksDB.cpp
        g++ -c stacksDB.cpp $(CXXFLAGS) $(LDFLAGS) $(LDLIBS)

        rm -rf *o $(EXECUTABLE)
This compiles successfully with make and will be the starting point.

compile mysql++ from source on Linux Ubuntu 32-bit

Trying to compile mysql++ from source on Ubuntu 12.10 32-bit: Downloaded most recent version of mysql++ from here. Made sure I had the C API on which mysql++ depends:
sudo apt-get install libmysqlclient-dev
Moved to the mysql++ directory and configured for the mysql installation (note that I found where these were located with this, and got tips from the mysql++ README):
./configure --with-mysql-lib=/usr/lib/i386-linux-gnu --with-mysql-include=/usr/include/mysql
sudo make install
And the final output after the install noting install directories:
mkdir -p /usr/local/lib
/usr/bin/install -c -m 644 /usr/local/lib
/usr/bin/install -c /usr/local/lib
(cd /usr/local/lib ; rm -f; ln -s; ln -s
mkdir -p /usr/local/include/mysql++
(cd . ; /usr/bin/install -c -m 644  lib/*.h /usr/local/include/mysql++)

Thursday, August 22, 2013

Queue command to run after existing command finishes

Credit goes to this post at superuser. Probably due to poor planning/pipeline development, I sometimes want to queue up some other commands execute after an existing, long-running process has already been running for a while. I use Ctrl-z (pause) and fg (foreground) for this.
#Execute a long running program
#Realize you want to have another program execute after the first concludes, and hit Control-z to pause
#Set up a bash command to resume the paused program with fg, and run the next in queue.
fg; ./

Wednesday, August 21, 2013

awk print every other line (or every Nth line) in fasta file

This specific line of awk doesn't have much general utility, but it was intended to pull out every other sequence record in a .fasta file. It can be applied to every Nth record in the fasta file as well by changing the modulo operator statement.  It only applies to .fasta files in which the sequence string isn't wrapped into multiple lines.

Here it is in its one-liner form:
awk 'BEGIN{i=0} (substr($0,1,1) == ">") { if (i%2 == 0) {print $0; getline; print $0} i++}' test.fa
And it makes a bit more sense when formatted:
awk 'BEGIN{i=0} (substr($0,1,1) == ">") {
 if (i%2 == 0) {
  print $0
  print $0
}' test.fa
This assumes the .fasta file is of the format:
and the sequence string is contained entirely on one line.

Monday, August 19, 2013

Create directory from R

Credit goes to this post at StackOverflow. As our genotyping and QTL mapping pipeline continue to scale up, it is helpful to automatically isolate R/qtl output for each trait into their own directories. As stated in the stackoverflow post by user robbrit, the dir.create() function creates a directory in the working directory, and throws a warning if it already exists.
I believe robbrit's use of file.path() is a more flexible, platform independent method to do it. "testDir" could be a traditional file path (e.g. "/home/user_name/Documents/testDir"), otherwise it creates it in the working directory.
And finally set the working directory to the new path.

Tuesday, August 13, 2013

MySQL Purge Binary Logs (aka MySQL is taking up too much hard drive space)

Credit goes to this MySQL documentation.

My MySQL installation by default keeps a set of binary log files recording data changes. With many database modifications, these grow to be very large (terabytes). It isn't mission critical that we be able to recover our database at any given moment since our raw data is archived, and the database can be repopulated if need be, so I typically clean these out once hard drive space starts becoming an issue.
$ mysql -u root
This gets rid of all binary logs made before the date you run the command. You could also set the system variable expire_logs_days.

Location of MySQL database files on Linux Ubuntu

I installed MySQL from the apt package manager on Linux Ubuntu (12.10), and my database files are located at:

Wednesday, August 7, 2013

Compile Genome Analysis Toolkit (GATK) from source

07/28/14 Edit: I don't think this post is up to date any more. Word on the street is that they're using something else as the build tool.

I'm planning on using the Broad Institute's Genome Analysis Toolkit (GATK) for some variant calling. The Java source code is available on GitHub, so I forked it over so I could have a look at the code (and play with it one day?). I've never compiled any Java code before, so I figured it was a good time to learn. I was doing this on a brand new workstation in our lab that I'm administrating, so I had to get a few dependencies first:
sudo apt-get install git ant openjdk-7-jre openjdk-7-jdk
Git is a version control software, ant is (from my understanding) a build tool for Java, conceptually similar to GNU make, and, based on the name, I assume the others are Java version 7, the Java Runtime Environment and the Java Development Kit, respectively. Then I cloned my forked repository:
git clone
My best guess is that a build.xml file for ant are conceptually similar to a Makefile for make, although I'm not entirely sure. Changing to the top-level of the cloned directory, I simply ran:
And everything compiled successfully. From then on, I could move to the dist directory and test out the GATK:
java -jar GenomeAnalysisTK.jar --version
Good times ahead.

Tuesday, August 6, 2013

VirtualLeaf tutorial: Fatal error message stepwise underflow in rkqs, with h=0.000000 and htry = 0.100000

TL;DR: In order to successfully run a model you have to have a suitable geometry for the model to initiate.

When trying to run parts of the VirtualLeaf tutorial, I have been running into the following error when running the simulation.
Fatal error message stepwise underflow in rkqs, with h=0.000000 and htry = 0.100000

Per usual, this turns out to be my lack of understanding on how VirtualLeaf works (baby steps). From my current understanding, one can build models, i.e. mymodel.cpp, and one can run those models on different initial cellular geometries called "leaves", i.e. *.xml files.

It turns out that I was lacking the correct leaf geometry, *.xml file which is defaulted in my header file for the model, i.e. mymodel.h .In order to successfully run a model you have to have a suitable geometry for the model to initiate. I know of how to modify this in two ways:

1) Through your model's header file, yourfavoritemodel.h :
 virtual QString DefaultLeafML(void) {
  return QString("yourfavoriteleafgeometry.xml");

You can change yourfavoriteleafgeometry.xml to another *.xml which I found in VirtualLeaf-v1-0.1-src/data/leaves/* . Save file, make, reopen VirtualLeaf and simulate.

2) Or, you can create your leaf geometry, stop simulation, save File -> Save Leaf -> yourfavoriteleafgeometry.xml.

Just make sure that the DefaultLeafML Qstring is the the correct leaf geometry you need for your model!

Gamasutra post by Timo Heinapurola: "What documentation?"

This post written originally by Timo Heinapurola here and reposted by Gamasutra discusses the topic of code documentation. I'm guilty of writing code thinking that it's going to be "throwaway code" and then I find myself needing to use it again months later.

Monday, August 5, 2013

Gamasutra blog post by Lee Perry: "One reason we see so many clones? Communication."

This article by Lee Perry posted on the video game business news website Gamasutra is strangely applicable to research as well.

"I want to make a figure like they did in this paper."

I could also use a Mindlink 2000.

Saturday, August 3, 2013

Make bootable USB Linux Ubuntu

I find myself doing frequently making bootable USBs,  so I wanted to document it here.

I can't speak for other distributions, but the versions of Ubuntu I use have an application called Startup Disk Creator. Simply launch that application, choose the .iso file you'd like to use, and choose the USB you'd like to use. I usually reformat the USB (with the Erase Disk button) before each try.

I've had troubles with Startup Disk Creator crashing, but I just rinse and repeat until it successfully completes.