Thursday, November 19, 2009

Using Vmatch to combine assemblies

In praise of Vmatch

If I could take only one bioinformatics tool with me to a desert island it would be Vmatch.

In addition to being a versatile alignment tool, Vmatch has many lesser-known features which also leverage suffix trees. The dbcluster option allows Vmatch to cluster similar sequences using Vmatch index files. This method is about 1000x faster than attempting to align all sequences against each other, no matter how clever the algorithm.

Originally presented as means to cluster EST sequences, this option is useful for processing the output from de novo short read assemblers like Velvet and ABYSS. Vmatch -dbcluster allows us to easily create a non-redundant set of contig sequences originating from several assemblies.

Why would we want to combine assemblies?

Every kmer is sacred

Why not just take the one with the highest N50? At the recent Genome Informatics meeting at CSHL, Inanc Birol (ABYSS) said "every kmer is sacred". Our experience with the de novo transcriptome assembly of plants has been that the best set of contigs is spread out all over the parameter space. Longer kmer and cvCut settings can produce a longer set of elite contigs at the expense of omitting lowly expressed (or spurious?) contigs that appear at less stringent settings.

Reads held hostage

To the point above, I would add "every cvCut is also sacred". That doesn't exactly roll off the tongue, but we have seen some instances in which an assembly will have a higher read usage at a cvCut of 10 than at 5. This paradox suggests there is a "contig read hostage" situation, in which reads critical to the formation of longer contigs can be held captive in low coverage short contigs. Raising the cvCut threshold frees up these critical reads to extend more plausible contigs and allowing additional reads from the pool to be recruited into the assembly.

What is a non-redundant set?

The following sequences have some redundancies. NODE2 and NODE3 do not add any information.
The non-redundant set will have only NODE1 and NODE4.
>NODE2 a substring of NODE1
>NODE3 a reverse complement substring of NODE1
>NODE4 a new sequence

How do we create a non-redundant set?

Do not use -nonredundant

No, really. There is an option called -nonredundant which should presumably do what we want, but unfortunately that writes a "representative" member of each cluster to a file, which may or may not be the longest contig. I'm not sure what makes a sequence representative of a cluster, but for this application we want the longest member of each cluster.

On April 29th 2010, Vmatch 2.1.3 was released. The most important change is that the option -nonredundant now delivers the longest sequence from the corresponding cluster (instead
of an unspecified representative). This should make the approach unnecessary.

To create a non-redundant set we will produce cluster files and then extract the longest sequence from each file. Use the following commands to produce your cluster files from an index:
mkvtree -allout -pl -db contigs1.fa contigs2.fa -dna -indexname myIndex
#mkvtree will accept multiple fasta or gzipped fasta files

#if using Vmatch 2.1.3 or later:
vmatch -d -p -l 25 -dbcluster 100 0 -v -nonredundant nonredundantset.fa myIndex > mySeqs.rpt
# thx to Hamid Ashrafi for debugging this syntax

#if using older Vmatch
mkdir sequenceMatches 
#this is where vmatch will put each cluster

vmatch -d -p -l 25 -dbcluster 100 0 mySeqs "(1,0)" -v -s myIndex > mySeqs.rpt
mv mySeqs*.match mySeqs*.fna sequenceMatches

What do these options do?

  • -d direct matches (forward strand)
  • -p palindromic matches (reverse strand)
  • -l search length (set this below your shortest sequence)
  • -dbcluster queryPerc targetPerc - runs an internal alignment of the index created by mkvtree.
    • The two numeric arguments specify what percentage of your query sequence (the smaller) is involved in an alignment to the cluster sentinel/target superstring. For our purposes we require 100% of our query sequence substring to match the target. We don't care what percentage of the target is aligned, so set the second parameter to 0.
    • The third argument to dbcluster is the index prefix name that vmatch will give to the THOUSANDS of cluster .fna and .match files it will create.
    • The fourth argument "(1,0)" specifies that we want to keep singletons (in a file called mySeqs.single.fna) and that there is no limit to the number of sequences in an acceptable cluster.

  • -v verbose report (redirected to a the .rpt)
  • -s create the individual cluster fasta files and match reports

Consult the vmatch manual for fuzzy matches and more examples:

The standard output details the clusters and singlets.
0:761437: NODE_83_length_31_cov_22.451612857642: 
NODE_106_length_31_cov_22.451612621409: NODE_152_length_29_cov_27.758621749981: 
NODE_185_length_29_cov_27.7586211:761861: NODE_531_length_424_cov_26.851416805590: 
NODE_1413_length_320_cov_28.187500837480: NODE_1236_length_320_cov_28.187500858407: 
NODE_937_length_320_cov_28.187500765510: NODE_1542_length_108_cov_34.870369621979: 
NODE_786_length_425_cov_32.602352750915: NODE_1290_length_321_cov_34.392525

Extract the longest sequence from your cluster files

If using Vmatch 2.1.3 or later this is unnecessary
Here is a perl script, which we will call, to do that
#print the longest sequence in a fasta file
use strict;my %seqs;my $defline;
while (<>) {
  if ( /^#/ || /^\n/ ) {
    #comment line or empty line do nothing
  }elsif (/>/) {
    s/>//g;$defline = $_;
    $seqs{$defline} = "";
  }elsif ($defline) {
    $seqs{$defline} .= $_;
    die('invalid FASTA file');
my $max = $defline;
foreach my $def ( keys %seqs ) {
  $max = ( length( $seqs{$def} ) > length( $seqs{$max} ) ) ? $def : $max;
print ">" . $max . "\n" . $seqs{$max} . "\n";

We want the longest member of each cluster and all sequences in the singletons file:
for f in sequenceMatches/*fna;
if [ "$f" = "sequenceMatches/mySeqs.single.fna" ];
cat $f >> mySeqs_longest_seqs.fa;
else perl $f >> mySeqs_longest_seqs.fa;

That's it - now you have a comprehensive and non-redundant set of the longest contigs from a number of assemblies.

Wednesday, November 4, 2009

R's xtabs for total weighted read coverage

Samtools and its BioPerl wrapper Bio::DB:Sam prefer to give read coverage on a depth per base pair basis. This is typically an array of depths, one for every position that has at least one read aligned. OK, works for me. But how can we quickly see which targets (in my case transcripts) have the greatest total weighted read coverage (i.e. sum every base pair of every read that aligned)?

My solution is to take that target-pos-depth information and import a table into R with at least the following columns:

I added the pos column here to emphasize the base-pair granularity
         tx pos depth
1 tx500090 227 1
2 tx500090 228 1
3 tx500090 229 1
4 tx500090 230 1
5 tx500090 231 1
66 tx500123 184 1
67 tx500123 185 1
68 tx500123 186 1
69 tx500123 187 2
70 tx500123 188 2
71 tx500123 189 2

In R
myxTab<-xtabs(depth ~ tx,data=myCoverage)

xtabs will sum up depth-weighted positions by default (i suppose this is what tabulated contigency really means) and return an unsorted list of transcripts and their weighted coverage (total base pair read coverage)
> myxTab[1:100]
tx500090 tx500123 tx500134 tx500155 tx500170 tx500178 tx500203 tx500207
38 92 610 46 176 46 92 130
tx500273 tx500441 tx500481 tx500482 tx500501 tx500507 tx500667 tx500684
76 2390 114 71228 762 222 542 442
tx500945 tx500955 tx501016 tx501120 tx501127 tx501169 tx501190 tx501192
1378 3604 46 46 420 854 130 352
tx501206 tx501226 tx501229 tx501245 tx501270 tx501297 tx501390 tx501405
244 1204 206 15926 214 46 168 46
tx501406 tx501438 tx501504 tx501694 tx501702 tx501877 tx501902 tx502238
38 2572 7768 3274 314 298 84 198
tx502320 tx502364 tx502403 tx502414 tx502462 tx502515 tx502517 tx502519
122 38 588 46 46 38 38 466
tx502610 tx502624 tx502680 tx502841 tx502882 tx503090 tx503192 tx503204
206 38 168 3750 38 122 76 92
tx503416 tx503468 tx503523 tx503536 tx503571 tx503578 tx503623 tx503700
260 38 168 38 46 46 84 38
tx503720 tx503721 tx503722 tx503788 tx503872 tx503892 tx503930 tx503970
97112 38 38 4708 38 38 1290 84
tx503995 tx504107 tx504115 tx504346 tx504353 tx504355 tx504357 tx504398
46 152 206 46 3416 1402 122 290
tx504434 tx504483 tx504523 tx504589 tx504612 tx504711 tx504751 tx504827
290 8728 176 46 46 76 5644 1308
tx504828 tx504834 tx504882 tx504931 tx504952 tx505017 tx505029 tx505078
2336 328 46 34138 1000 1838 46 474
tx505123 tx505146 tx505159 tx505184
38 123344 160 588

this is approximately 10000x faster than using a formula like:

Friday, October 23, 2009

Installing Bio::DB::Sam from CPAN

Bio::DB::Sam is Lincoln Stein's BioPerl API to the SamTools package.

Installing via CPAN might skip a necessary question that will cause it to fail.

cpan> install Bio::DB::Sam
DIED. FAILED tests 1-93
Failed 93/93 tests, 0.00% okay
Failed Test Stat Wstat Total Fail Failed List of Failed
t/01sam.t 2 512 93 186 200.00% 1-93
Failed 1/1 test scripts, 0.00% okay. 93/93 subtests failed, 0.00% okay.
make: *** [test_dynamic] Error 2
/usr/bin/make test -- NOT OK
Running make install
make test had returned bad status, won't install without force

Navigate to where CPAN has downloaded the gz ffile
A closer examination reveals that Build.PL file wants to know where the SamTools header files are located.

Please enter the location of the bam.h and compiled libbam.a files:

I have no idea how to pass these arguments using CPAN. I would just avoid this method of installation. Do the local build instead.

Monday, October 12, 2009

Disable file locking in Eclipse for OS X

Eclipse will refuse to use a workspace on an automounted OS X Server home directory.
Workspace in use or cannot be created

To remedy this problem do the following:
  • Right click the Eclipse application and select "Show Package Contents"
  • Contents->MacOS
  • Edit the eclipse.ini file in a text editor
  • Add -Dosgi.locking=none to the line below -vmargs

Wednesday, September 23, 2009

My 2009 Bridge-to-Bridge Experience

Bridge-to-Bridge is a 105-mile ride up to the top of Grandfather Mountain, one of the highest peaks in the Blue Ridge mountains.

Although B2B is not an officially sanctioned race, the organizers conduct it just as professionally (with the exception of neutral support). Cops manage the major crossings, volunteers provide hand-offs at the dozen feed stations, and the event is officially timed with the aid of some magnetic shoe things.

I last did this ride in 1999. Now 10 years older but about 10 pounds lighter I had somehow forgotten how much suffering was involved and figured this was a good time to tackle the challenge, despite falling ill a couple of weeks beforehand.

Due to constant rain and very heavy fog, this year was utter torture for the 299 finishers and 371 non-finishers who braved the elements. I believe there may have been another 130 non-starters who stayed in their hotel rooms enjoying the Golden Girls marathon on tv. While I'm sure I would have felt some sense of accomplishment doing that, I was obligated to finish this ride as we had already driven down from Philadelphia (en-route to a wedding in Nashville the following weekend).

The Grandfather Mountain staff said riders could not enter the park before 3pm. To accommodate both the riders and this odd rule two start times were offered - 10 and 11 am, with slower riders encouraged to start first. To me this was a welcome change from the ungodly pre-dawn start times of most big rides. Riders were advised to start at the later time if they estimated they would be pushing the 3pm threshold.

I was on the fence about which group to join. I saw several very fit looking riders and expensive bikes in the 10am pack. I was still riding the same Litespeed I used in 1999. The word going around was that more rain was on the way (this turned out to be true for everyone). In the end I felt the risk of having an inexperienced rider fall in front of me to be the deciding factor to go with the second group. I knew I would have to pass several of that first group anyway but it would be on the later climbs instead of the early rollers. Ironically I almost got clipped by some idiot drifting carelessly in our pack. At the end there was considerable overlap in finishing times between the two groups.

The two times I did this ride ('98 and '99) I got dropped by the leaders on the 13 mile climb up NC181, then spent the rest of the ride either riding alone or with a couple other guys. Despite my best efforts this year turned out roughly the same except I stuck with that front group for about half the climb instead of just the first couple miles. Remind me to buy a compact crank.

This climb is very difficult psychologically - a relentless slog up a roughly-paved 4 lane highway. I was not very familiar with the profile and prematurely thought I had crested three times - each time putting in a kick over the "top". The fog would clear and I would see yet another rise.

Feeling dispirited and exhausted, I nearly froze to death on the descents of 181 and the Blue Ridge Parkway and was eventually caught by a small group that stuck together until the ascent of Grandfather. I was amazed by how few words were exchanged in that group during the hour or so we traded pulls through the fog and rain, which only got worse as we neared the finish. One guy did say something that stuck with me, "It's like we're riding through a horror movie."

After crawling to the finish in a 39x27, I was very fortunate that Mary Ellen had the foresight to drive up to the summit well in advance to meet me with a warm car. I thanked her by singing the Golden Girls theme song all the way to the hotel.

Wednesday, September 16, 2009

Printing a specific line from bash_history

Often I want to archive specific commands out of my immediate bash history to a batch script that I can run later.
Unfortunately I could find no way of redirecting !# (where # is the bash history line I wish to save, e.g. !58 executes line 58) to a file. There is the "colon p" option - where !#:p will print the command instead executing it, but I could not redirect or pipe that output either.

So I added this one-liner to a bin directory in my path:

cat $HISTFILE | sed -n "$1p"

I call it getHistLine. So now to save line 58 to a batch script I can just type:
getHistLine 58 >>

To save a range of lines I can type
getHistLine 50,58 >>

Tuesday, August 25, 2009

Standardized Velvet Assembly Report

I finally got my Velvet Assembler report script up on google code. This "program" consists of some short scripts and a Sweave report designed to help Velvet users identify the optimal kmer and cvCut parameters.

Thursday, August 13, 2009

GregorianCalendar foolishness

I wanted to add queries to my Grails application to find tasks that needed to be completed today, or were delinquent (due date < last midnight).
My application kept thinking things were delinquent the afternoon of the due date.

The problem was that I neglected to read how HOUR_OF_DAY differed from HOUR and absentmindedly mixed the two.

You can try to set HOUR to 0 but it will default to 12. If it is the afternoon when you request the Calendar object then it will assume you mean 12 noon.

Calendar lastMidnight = Calendar.getInstance();
lastMidnight.set( Calendar.HOUR, lastMidnight.getMinimum(Calendar.HOUR_OF_DAY ));

//this is ok
lastMidnight.set( Calendar.HOUR_OF_DAY, lastMidnight.getMinimum(Calendar.HOUR_OF_DAY ));

//this is also ok
lastMidnight.set( Calendar.AM_PM, Calendar.AM );
lastMidnight.set( Calendar.HOUR, lastMidnight.getMinimum(Calendar.HOUR ));

The other settings are pretty self-explanatory

lastMidnight.set( Calendar.MINUTE, lastMidnight.getMinimum(Calendar.MINUTE));
lastMidnight.set( Calendar.SECOND, lastMidnight.getMinimum(Calendar.SECOND));
lastMidnight.set( Calendar.MILLISECOND,lastMidnight.getMinimum(Calendar.MILLISECOND));

Tuesday, July 28, 2009

Beware! Groovy split and tokenize don't treat empty elements the same

Groovy's tokenize, which returns a List, will ignore empty elements (when a delimiter appears twice in succession). Split keeps such elements and returns an Array. If you want to use List functions but you don't want to lose your empty elements, then just use split and convert your Array into a List in a separate step.

This might be important if you are parsing CSV files with empty cells.

import groovy.util.GroovyTestCase

class StringTests extends GroovyTestCase {

protected void setUp() {

protected void tearDown() {

void testSplitAndTokenize() {
assertEquals("This, ,should,have,six,items".tokenize(',').size(),6)

assertEquals("This, ,should,have,six,items".split(',').size(),6)

//convert array to List and re-evaluate
def fieldArray = "This,,should,have,six,items".split(',')
def fields=fieldArray.collect{it}
assert fields instanceof java.util.List

Tuesday, May 5, 2009

Loading files in grails bootstrap

I've had to do this for two separate projects.

A web application always has weird ideas of where it is in the path. I tend to look for examples that work without assumptions. This works for the bootstrap scripts I use to start up grails apps. Locating stuff in /web-app in the deployment context is a different story.

I think the best way is as follows:
create a directory: grails-app/conf/resources

For a tab delimited-file I used the following routine to load and create domain objects

class BootStrap {

def init = {servletContext ->
//AC204211.3 c0189C22 ACTIVEFIN UNKNOWN 153071100 153252400 ctg708 9800 191100

def filePath = "resources/fpc_report.txt"

def text = ApplicationHolder.application.parentContext.getResource("classpath:$filePath").inputStream.text
text.eachLine {
println it
def BacFields = it.split("\t")
new Bacs(accession: BacFields[0],
cloneName: BacFields[1],
status: BacFields[2],
chr: BacFields[3],
chrStart: BacFields[4],
chrEnd: BacFields[5],
contig: BacFields[6],
contigStart: BacFields[7],
contigEnd: BacFields[8]).save()

For a CSV file I used the opencsv library
I added to my maven pom the following dependency


and used something similar to get the grails appHolder object to give up a Java File

import org.codehaus.groovy.grails.commons.ApplicationHolder

class BootStrap {

def init = { servletContext ->

def filePath = "resources/E357Lims.CSV"
def appHolder=ApplicationHolder.application.parentContext.getResource("classpath:$filePath")

def reader = new CSVReader(new FileReader(appHolder.getFile()))
def header = true
reader.readAll().each{ csvrow ->
new FlatReport(name:csvrow[0].trim()).save()
header = false

Thursday, April 2, 2009

How to set your grails app to different environments in IntelliJ

This is something everyone needs to do but my google search did not come up with anything (too easy I suppose). Use the -Dgrails.env as a VM parameter.

Wednesday, April 1, 2009

I pick on the WSJ again: A true cost analysis of home prices

Today's Wall Street Journal article Home Prices: Low, But Still No Bargain by Brett Arends featured a chart comparing the Case-Shiller home price index to average earnings - a metric used to measure true housing cost from Jan 1987 to the present.

Compared to the Case-Shiller index alone this is a useful analysis, but it doesn't take into account that most homes are bought using mortgages, not cash, so the cost of money must be factored in.

It is easy enough to access historical mortgage rates to see the rates people paid to buy homes over this period:

The raw WSJ index data behind their snazzy chart can be gleaned from

Using a Groovy script I parsed their index data and used a multiplier based on the 30-year mortgage rate. As was done in the article, the resulting data was normalized to the first point in Jan '87 so the graphs would be comparable.

WSJ index

My index

Using my "True Cost" index, the summer of 2006 was still very elevated (my index of 150 compared to 100 in Jan '87) but not quite as crazy as it would appear by not taking into account the prevailing mortgage rates (WSJ index 200 compared to 100 in Jan '87).

March of 1989, with its 30-year rate at 11.03%, was also a crappy time to buy a house. The true cost of housing actually did not return to the '89 level until the height of the boom in '06. Refinancing would only offer some comfort to the '89ers- you would have had to wait until August of 1992 to refinance below 8% and until Jan of 2003 to get below 6%.

With today's insanely low rates we see my index is at a manageable 84, similar to May of 1999, instead of the alarming 128 Arends has used as a metric for his article. A dip much below 5% would mean the cheapest mortgaged houses relative to earnings of in over 20 years.

Thursday, February 26, 2009

The WSJ made an accounting error

Wall Street Journal 2/26:
The 2% Illusion

I was curious to see how they arrived at the numbers for this article, given there are no citations. I found the table they were using:

Unfortunately, this table is only for taxpayers who filed itemized returns. From the opinion piece:
Consider the IRS data for 2006, the most recent year that such tax data are available and a good year for the economy and "the wealthiest 2%." Roughly 3.8 million filers had adjusted gross incomes above $200,000 in 2006. (That's about 7% of all returns; the data aren't broken down at the $250,000 point.) These people paid about $522 billion in income taxes, or roughly 62% of all federal individual income receipts.
To arrive at $522B they took rows 26:32

To arrive at 62% they divided this number by $837B, once again this is only for itemized returns. One clue that might have tipped them off (they were looking at a subset) would be that there are only 42 million returns on this table for a country of over 300 million people.

Perhaps most bizarre is their claim that 7% of returns report more than $200k/yr. I don't understand how this number got past them.

The total for all returns (standard and itemized) can be found on

Using the correct table:
  • The total income tax collected is $1.023T, not $837B. ($1,023,920,139)
  • The total income tax collected by households making more than $200k is $544B, not $522B ($544,318,726)
  • The percentage of receipts collected by rich households is roughly 53%, not 62%.
  • Finally, the percentage of returns with income >$200k is 2.9%, not 7% (wow)
This does not fundamentally change the WSJ's arguments, which I feel are simply irrelevant. Rich people know to hide their wealth in corporations. Their assets are all comingled with corporate assets and their expenses become business costs. In 2002, US corporations report revenues of almost $50T, income of $563B, and paid just $154B in corporate income tax. I find that very strange.

Even stranger, I've learned that some rich people actually take the standard deduction! In fact, 244 of the 15956 households making more than $10M in income decided it was not worth itemizing in TY2006.

Monday, January 19, 2009

Getting java to work with firefox 3 on solaris 10

Q: Why is it all java-related things never work right on Solaris?
A: The people at Sun use Macs

Create a symbolic link to your java's
(do not copy this file, it will not work properly)

sudo ln -s /usr/jdk/jdk/jre/plugin/i386/ns7/ \

(this is where I put my java 6 installation - your location is probably different

Friday, January 16, 2009

XMLTree.c:1016: error: void value not ignored as it ought to be

Normally I don't recommend trying to alter code when an installation throws an error - the problem is usually user-related (i.e. me) or totally unfixable. This is one exception:

If you get this error trying to install the R XML library from Omegahat

XMLTree.c: In function `xmlBufferWrite':
XMLTree.c:1016: error: void value not ignored as it ought to be
*** Error code 1
make: Fatal error: Command failed for target `XMLTree.o'

Don't give up. It is actually worth fixing this one.

Expand the tarball and alter the file called 'XMLTree.c'

gunzip XML_1.99-0.tar.gz
tar -xvf XML_1.99-0.tar
rm XML_1.99-0.tar
emacs XML/src/XMLTree.c

The offending code:


/* These two taken from libxml2-2.6.27
They are needed if xmlOutputBufferCreateBuffer()
is not in the installed libxml2.
It appeared in libxml2-2.6.23, released on Jan 5 2006
static int
xmlBufferWrite (void * context, const char * buffer, int len) {
int ret;

ret = xmlBufferAdd((xmlBufferPtr) context, (const xmlChar *) buffer, len);
if (ret != 0)

xmlOutputBufferCreateBuffer(xmlBufferPtr buffer,
xmlCharEncodingHandlerPtr encoder) {
xmlOutputBufferPtr ret;

if (buffer == NULL) return(NULL);

ret = xmlOutputBufferCreateIO((xmlOutputWriteCallback)
NULL, (void *) buffer, encoder);



Notice the disclaimer - if your libxml2 is up to date these methods are not necessary anyway.
So just delete those two methods and the directive they rode in on, from #ifdef to #endif

tar -cvf XML_1.99-0.tar XML
gzip XML_1.99-0.tar

and install
R CMD INSTALL XML_1.99-0.tar.gz