## Wednesday, October 30, 2013

### Communicating science without undermining its complexities: different sizes of infinity

I just came across this YouTube video that I believe explains different sizes of infinity in a very understandable way. The video does so without the introduction of diagonalization (when Professor Harold Boas first introduced this in my first "real math" class it blew my mind), which I felt was instrumental in my understanding of the different sizes of infinity.

However, this video reminds me that a critical aspect of practising science is the ability to communicate its importance without requiring extensive background knowledge. In my opinion, this video does this very well without undermining the complexity of the problem.

## Monday, October 28, 2013

### Comment and annotate PDF in Ubuntu (12.10)

I finally found a somewhat satisfactory way to comment and annotate PDF files in Ubuntu after trying a few of the solutions suggested by the interwebs. A piece of software called foxit can be run through wine.

I installed wine a while back, but I don't remember if I did anything special. If I recall correctly, I installed wine using these directions at the Ubuntu site.

After downloading the foxit exe from the foxit site, I just ran it with wine and installed using all the defaults.

wine FoxitReader606.0722_enu_Setup.exe


Oh glorious day.

## Thursday, October 24, 2013

### Open terminal to current working directory from nautilus (Ubuntu 12.10)

Credit goes to this Ask Ubuntu forum post.

There is a package that enables right clicking nautilus' directory window and choosing to open a terminal with the working directory set to the directory in the nautilus window.

Install the package via the package manager and then restart nautilus.
sudo apt-get install nautilus-open-terminal
nautilus -q


Then right click and choose open in terminal.

## Saturday, October 19, 2013

### awk and GNU parallel: problems with quotes

Embarrassing as it is to admit, I spent about two hours trying to work out how to parallelize an awk command with GNU parallel. I think the conclusion is that I don't understand quotes in bash as well as I should.

My goal was to run the awk code from my previous post in parallel. Credit goes to this post on Stack Overflow for getting me to the solution.

One of the answers in the Stack Overflow post suggested storing the awk command in a string, and that worked for me:
#!/bin/bash

awk_body='(substr($0, 1, 1) != "@") && ($0 != "+") {print substr($0, 6)} (substr($0, 1, 1) == "@") || ($0 == "+") {print $0}'
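
As a rough sketch of how a stored awk program like this can be handed to GNU parallel without another round of quoting: the version below wraps awk in an exported bash function and lets parallel call that function once per file. This is one possible approach, not necessarily the exact command from the Stack Overflow answer. The `trim_one` helper, the demo directory, and the demo file names are all made up for illustration, and the loop is only a serial fallback for machines without GNU parallel.

```shell
#!/bin/bash
# Sketch only: trim_one, demo_dir and the s1/s2 file names are hypothetical.

# Single quotes keep bash from touching $0 and the inner double quotes.
awk_body='(substr($0, 1, 1) != "@") && ($0 != "+") {print substr($0, 6)}
(substr($0, 1, 1) == "@") || ($0 == "+") {print $0}'

# Wrapping awk in a function means the program is quoted exactly once, here.
trim_one() {
    awk "$awk_body" "$1" > "${1%.fastq}.trimmed.fastq"
}
export awk_body
export -f trim_one

# Two tiny demo FASTQ files with one read each.
demo_dir=$(mktemp -d)
printf '@read1\nAAAAACCCGG\n+\nIIIIIHHHHH\n' > "$demo_dir/s1.fastq"
printf '@read2\nTTTTTGGGAA\n+\nJJJJJGGGGG\n' > "$demo_dir/s2.fastq"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # Exported functions are only visible to bash, so ask for bash explicitly.
    SHELL=$(command -v bash) parallel trim_one ::: "$demo_dir/s1.fastq" "$demo_dir/s2.fastq"
else
    # Serial fallback so the sketch still runs without GNU parallel.
    for f in "$demo_dir/s1.fastq" "$demo_dir/s2.fastq"; do
        trim_one "$f"
    done
fi

trimmed_s1=$(cat "$demo_dir/s1.trimmed.fastq")
trimmed_s2=$(cat "$demo_dir/s2.trimmed.fastq")
printf '%s\n' "$trimmed_s1"
```

Each read loses its first five bases and quality characters, while the `@` and `+` lines pass through unchanged.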

(substr($0, 1, 1) == "@") || ($0 == "+") {print $0}' input.fastq > output.fastq

Output sans RE site:
@1_1101_11092_1965_1
ACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT
+
IHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC
@1_1101_11396_1963_1
GACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA
+
JJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD

## Monday, October 7, 2013

### Using awk: bitwise operations and string manipulation

I learned a few new things trying to reformat a SAM file with awk. I also learned that what I was attempting to do wouldn't ultimately solve my problem, but I wanted to keep what I learned about bitwise operations and other string manipulations in awk here.

To test whether a certain bit was set, I used the and() function. This required GNU awk (gawk), since the awk on my system didn't have the bitwise functions (see here for bitwise operation documentation; see here and here for additional SAM bit flag information). I used substr() for string manipulation and length() to access elements from the end of the string (credit goes to here and here). I used printf when I wanted to avoid generating a newline character (credit goes to here).

The original goal was to trim restriction sites from RAD reads and quality scores, but this doesn't account for the CIGAR string or the alignment coordinate, so it ultimately doesn't work. Here is the gawk script in its entirety (with some comments):
#!/bin/bash
#Skip the header lines
gawk '(substr($0, 1, 1) != "@") {
    #Test whether the 0x10 bit of the flag is set (indicating a read on the reverse strand) using the and() function.
    if (and($2, 0x10)) {
        #Do not adjust the first 9 columns.
        for (i=1; i<=9; i++) {printf "%s\t", $i}
        #Trim 5 bases from the end of the string and the quality score
        printf "%s\t%s", substr($10, 1, length($10)-5), substr($11, 1, length($11)-5)
        #Print any remaining columns (there may be none); NF is a special variable containing the number of fields.
        for (i=12; i<=NF; i++) {printf "\t%s", $i}
        printf "\n"
    }
    #Else it is not a reverse strand read; trim 5 bases from the start.
    else {
        for (i=1; i<=9; i++) {printf "%s\t", $i}
        printf "%s\t%s", substr($10, 6), substr($11, 6)
        for (i=12; i<=NF; i++) {printf "\t%s", $i}
        printf "\n"
    }
}
#Include the header lines.
(substr($0, 1, 1) == "@") {print $0}
' input.sam

## Saturday, October 5, 2013

### Journal Club: short ORFs and gene prediction

This week's journal club paper: Small open reading frames associated with morphogenesis are hidden in plant genomes by Hanada et al. (2013).

The implications of the idea that many functional, small open reading frames (sORFs) exist in eukaryotic genomes are pretty exciting. The paper didn't entirely convince me that this was the case, but I was sufficiently persuaded that it merits further investigation.

I didn't see any major red flags with the paper. I would have preferred biological replicates for their sORF arrays over technical replicates. Additionally, overexpression mutants can cause all kinds of wacky effects that aren't necessarily a consequence of the functional properties of the overexpressed genes; I'm not sure the conclusion that overexpression mutants of sORFs cause phenotypic consequences more frequently than randomly chosen genes is going to hold up to further scrutiny (maybe sORFs accumulate more effectively than regular protein products to cause strange side effects). However, again, I think it's sufficient to merit further investment in the topic, such as the knock-down/knockouts they propose.

My comment: "I think it's worth continuing to pursue study of these small ORFs as the authors propose. They may have already shown this in one of their previous publications, but I'm interested in the properties of these small ORFs in the context of gene prediction. They said they used hexamer composition bias for prediction; are there other properties that might be useful for prediction, like CpG island promoters?"

## Friday, October 4, 2013

### Journal club: uORFs, eukaryotic "operon"?

Paper: Small open reading frames associated with morphogenesis are hidden in plant genomes. (Hanada et al 2013).

Overall, I liked the paper.
It was the first time I had been made aware of the abundance of un-annotated small open reading frames, and I think it brings the field awareness of something new, a method of detection, and an example of its applications.

Comment 1: Their investigation of expression of both RNA and protein reminded me that we often make assumptions about the relationships between transcript abundance and protein abundance; even if we detect translation, it doesn't guarantee that we can detect transcript accumulation, and vice versa.

Comment 2: I had never heard of upstream ORFs; they sound conceptually similar to operons/polycistronic DNA found in prokaryotes, so I thought it was interesting that they exist in a eukaryotic system.

Comment 3: The paper described some techniques (controls) that I had never heard of before that I thought were interesting and useful. For example, determining the null distribution of the Ka/Ks ratio with random sequences to obtain a neutrality baseline to identify sORFs under selection was a new idea to me.

Comment/Question 4: What tissue types were used for the 17 different environmental conditions? They refer to seedling, but what tissue on the seedling?

Comment 5: They mentioned that other organisms such as humans contain sORFs, and specifically they reference the ENCODE project. I was wondering what ENCODE used to identify these sORFs; was it a similar algorithm, and if not, maybe it could be used to identify more un-annotated genes?

Comment 6: I'm a bit skeptical of their claim that their prediction algorithm has the best performance at identifying small ORFs. I agree that it performed well and was able to identify a significant number of previously un-annotated genes; however, the training set used to obtain their a priori hexamer composition bias came from the same organism, Arabidopsis, that they go on to test in. I question whether it would perform as well in other organisms.

## Wednesday, October 2, 2013

### awk to transpose (super fast)

Credit goes to Stack Overflow - Transpose a file in bash.

I was looking for a fast way to transpose a tab-separated values file, file.tsv, and so per instructions, I created the following shell script, transpose.sh:
#!/bin/bash
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str"\t"a[i,j];
# the delimiter can be replaced (e.g. "," or " " or "\t")
}
print str
}
}' file.tsv


Then, I ran transpose.sh from the command line and wrote the transposed file.tsv to tr_file.tsv:
 transpose.sh > tr_file.tsv
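
To make the behaviour concrete, here is a small self-contained sketch that runs the same store-then-print logic on a made-up 2x3 table. The demo input and the `demo_in` and `transposed` variable names are hypothetical. It also passes `-F'\t'` so that only tabs split fields; this is an assumption that differs from the script above, which relies on awk's default whitespace splitting.

```shell
#!/bin/bash
# Hypothetical demo input: a 2x3 tab-separated table written to a temp file.
demo_in=$(mktemp)
printf 'a\tb\tc\n1\t2\t3\n' > "$demo_in"

# Same idea as transpose.sh: store every cell, then print column by column.
transposed=$(awk -F'\t' '
{
    for (i = 1; i <= NF; i++) a[NR, i] = $i
}
NF > p { p = NF }
END {
    for (j = 1; j <= p; j++) {
        str = a[1, j]
        for (i = 2; i <= NR; i++) str = str "\t" a[i, j]
        print str
    }
}' "$demo_in")

printf '%s\n' "$transposed"
rm -f "$demo_in"
```

The three input columns come out as three rows (`a 1`, `b 2`, `c 3`, tab-separated). Because every cell is held in memory until the END block, this approach suits files that fit comfortably in RAM.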