Comparable WU BLAST and NCBI BLAST Parameters

Introduction

WU BLAST and NCBI BLAST are distinct software packages with different default behaviors and different command line options. To ease the transition for existing users of NCBI BLAST, a Perl script named wu-blastall is bundled with WU BLAST. It converts NCBI blastall command line arguments into their (sometimes rough) WU BLAST equivalents and then invokes the appropriate WU BLAST program. The output remains in one of WU BLAST's native formats, depending on the format requested on the wu-blastall command line.
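As a rough illustration of the kind of translation involved, the sketch below maps only the three required blastall options (-p, -d, -i) onto the positional form that WU BLAST command lines use elsewhere on this page. The function name ncbi_to_wu is hypothetical. This is not the bundled wu-blastall script, which handles many more options.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the bundled wu-blastall script.
# Translates "blastall -p PROGRAM -d DATABASE -i QUERY" into the
# positional WU BLAST form "PROGRAM DATABASE QUERY" and prints it.
ncbi_to_wu() {
    prog= db= query=
    OPTIND=1                      # reset so the function can be called repeatedly
    while getopts "p:d:i:" opt "$@"; do
        case "$opt" in
            p) prog=$OPTARG ;;    # search program, e.g. blastp
            d) db=$OPTARG ;;      # database name, e.g. nr
            i) query=$OPTARG ;;   # query file, e.g. query.aa
            *) return 2 ;;        # any other option is not handled here
        esac
    done
    if [ -z "$prog" ] || [ -z "$db" ] || [ -z "$query" ]; then
        echo "usage: ncbi_to_wu -p program -d database -i query" >&2
        return 2
    fi
    echo "$prog $db $query"
}

ncbi_to_wu -p blastp -d nr -i query.aa    # prints: blastp nr query.aa
```

The printed command line carries none of the optional WU BLAST parameters, so it runs WU BLAST with its own defaults rather than at NCBI-equivalent sensitivity.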

The remainder of this page is primarily devoted to highlighting the differences between NCBI BLAST and WU BLAST and illustrating some of the ways that these differences can be smoothed out or eliminated.

Performance Comparisons

For fair performance comparisons between any two approaches, one must be cognizant of the factors affecting speed, sensitivity and selectivity. Through parameter settings, as well as one's choice of test data, the performance characteristics of software tools can often be dramatically altered to achieve any desired goal, whether the goal is to improve the performance of an existing tool or to showcase the performance of a new one. When different algorithms and statistical approaches are employed, “apples-to-oranges” comparisons may be entirely unavoidable. That said, rough observations of performance may be informative and useful, if sufficient care is taken in the preparations.

At the bottom of this page, parameter sets (or command line options) are provided that may be useful for comparing the relative performance (sensitivity, selectivity and speed) of WU BLAST 2.0 (blasta) and NCBI Gapped BLAST 2.0 (blastall) in the various search modes these programs offer. The command line arguments shown for NCBI BLAST are merely those that are required for any search, thus yielding the “default” behavior of this software. The optional parameter settings indicated for WU BLAST reduce its sensitivity to approximately that of the NCBI BLAST defaults. As the speed of the BLAST algorithm is inversely related to its sensitivity, any speed comparisons should be made at equal sensitivity levels.

Outlined specifically with respect to NCBI and WU BLAST, the speed, sensitivity and selectivity factors include:

By normalizing for such factors as those described above, a reasonably fair evaluation of relative performance can often be obtained, but is certainly not guaranteed. Differences may exist between the NCBI's built-in low-complexity filters and the external filters employed by WU BLAST. With WU BLAST, the filters are external plug-in programs provided with the software distribution (or users can plug in filters of their own design), so the user can generate filtered sequences independently of performing an actual search. In addition, WU BLAST's -echofilter option allows the user to capture in the output the precise filtered sequence used internally by the search programs. All of this is just to say that with WU BLAST, one has more complete control and can more easily verify correct behavior of the software, while differences with the NCBI software can be difficult to eliminate with complete confidence.

Other differences in alignment procedures and statistics remain, as well, some of which can impact speed, sensitivity and selectivity. For example, NCBI BLASTP does not use "Sum" statistics to identify multiple regions of similarity; and NCBI BLASTN curiously uses the same lambda, K and H values to evaluate the significance of gapped alignments as it uses for ungapped alignments, regardless of how relaxed the gap penalties are.

Last, but certainly not least, NCBI BLASTN often (always?) reports incorrect values for the score thresholds used, which can seriously confound even the most careful of comparisons. While the inaccuracies may seem small and benign, their effect on speed is exponential and they make NCBI BLASTN appear faster than it really is. They also convey incorrect information about the sensitivity of the search.


In the examples below, the hitdist option invokes the WU implementation of the 2-hit BLAST algorithm (not available in version 2.0a19). Alternate WU BLAST command lines are shown that increase the value of the T parameter for the 1-hit BLAST algorithm, to yield roughly the same level of sensitivity (and speed) as the default parameterization of the NCBI 2-hit algorithm. The more-efficient 2-hit BLAST implementation in WU BLAST 2.0 may be used to obtain still more speed if desired — running significantly faster than the NCBI 2-hit BLAST — albeit with the reduced sensitivity associated with the 2-hit algorithm.

Benchmarking should be performed on computer systems over which one has full control. For example, avoid benchmarking via a web server whose configuration and operational state are unknown. As an example of how surprisingly important this can be, users of SGI IRIX 6.x may have noted that versions of this operating system released from 1997, until about 1999-2000, reported extremely inaccurate (i.e., low) execution times for programs like BLAST that use POSIX threads. Typically, the CPU time reported was actually 1/N of its actual value, where N was the number of CPUs or threads employed. Only for about 1 in N searches would the correct CPU time be reported. NCBI computers at the time were typically configured with 8-16 CPUs, so the CPU time reported was typically 8- to 16-fold lower than its actual value. This explains why the NCBI BLAST servers usually reported execution times of just a few seconds for lengthy database searches. It is also curious that the BLAST binaries and source code posted by the NCBI for users to download for local database searching did not report execution times at all, whereas supposedly the same software running on their servers did report CPU times. In any case, this particular bug seems to have been fixed in IRIX 6.5, the release of which correlates well with when the NCBI stopped reporting CPU times on their BLAST servers. ;-)

Database I/O can be a significant contributor, even the major contributor, to the overall search time. To minimize the overhead and impact of database I/O on search speed, timings should be performed on cached database files. Working with cached files is generally recommended, not just when benchmarking, to avoid contention for slow physical devices such as disk drives. Contemporary operating systems generally do a good job of automatically caching files in what would otherwise be unused memory; hence, BLAST software moved away from using System V shared memory segments for storing database files and instead began using memory-mapped I/O and file caching, starting with BLAST version 1.4 (W. Gish, unpublished). Pre-caching of files can be accomplished by first performing an untimed search to prime the cache with the desired database files before the actual benchmark run(s) are executed. Of course, the host computer must have sufficient free memory available that the relevant database files can indeed be cached.
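The priming procedure described above might be scripted along the following lines. This is a minimal sketch: the function name bench is hypothetical, and it measures elapsed wall-clock time in whole seconds using date, not CPU time.

```shell
#!/bin/sh
# Hypothetical helper: bench COMMAND [ARGS...]
# Runs the command once untimed to prime the file cache, then runs it
# again and prints the elapsed wall-clock time of the second run in
# whole seconds. All output of the command itself is discarded.
bench() {
    "$@" > /dev/null 2>&1 || return 1    # untimed priming run
    start=$(date +%s)
    "$@" > /dev/null 2>&1 || return 1    # timed run on (hopefully) cached files
    end=$(date +%s)
    echo "$((end - start))"
}

# Example (assumes blastall and the nr database are installed locally):
#   bench blastall -p blastp -d nr -i query.aa
```

One-second resolution is coarse, so this is only meaningful for searches that run for many seconds. If the database does not fit in free memory, the priming run will not help.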

Even when copious amounts of physical memory are present, operating systems sometimes seem to limit the amount of file system data that can be cached. Sometimes these limits are configurable, as in Solaris, but other times there may be no apparent way to increase the amount of unused memory that can be utilized for file caching. Personal experience with Linux 2.4 falls in the latter category. Your “mileage” may vary.

When file caching cannot be exploited, the overhead of database I/O may be reduced by using longer (less trivial) query sequences, such that the search programs spend relatively more time actually comparing sequences than they do reading and parsing the database.
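To check how long a candidate query actually is before using it in a benchmark, something like the following sketch may be useful. It assumes the query file is in FASTA format, and the function name query_length is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: query_length FILE
# Prints the total number of residues in a FASTA file. Header lines
# (those starting with ">") are skipped, and spaces, tabs and carriage
# returns are not counted. If the file holds several sequences, the
# lengths are summed.
query_length() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Example: query_length query.aa
```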

General Tips for Benchmarking BLAST

Comparable Commands

The command lines below are presented in triples that are for:

  1. NCBI blastall;
  2. licensed WU blasta 2.0 with its optional 2-hit algorithm;
  3. WU blasta with the default 1-hit BLAST algorithm (the only mode available in the obsolete version 2.0a19).

BLASTP

1. blastall -p blastp -d nr -i query.aa

2. blastp nr query.aa cpus=1 hitdist=40 kap T=11 s2=41 gaps2=62 x=16 gapx=38 \
				q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

3. blastp nr query.aa p=1 T=13 s2=41 gaps2=62 x=16 gapx=38 \
				q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

BLASTX

1. blastall -p blastx -d nr -i query.nt

2. blastx nr query.nt hitdist=40 T=12 cpus=1 s2=41 gaps2=68 x=16 gapx=38 \
                q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

3. blastx nr query.nt T=15 p=1 s2=41 gaps2=68 x=16 gapx=38 \
                q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

TBLASTN

1. blastall -p tblastn -d nr -i query.aa

2. tblastn nr query.aa hitdist=40 T=13 cpus=1 s2=41 gaps2=62 x=16 gapx=38 \
                q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

3. tblastn nr query.aa T=100 p=1 s2=41 gaps2=62 x=16 gapx=38 \
                q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

TBLASTX

1. blastall -p tblastx -d nr -i query.nt

2. tblastx nr query.nt nogaps hitdist=40 T=13 cpus=1 s2=41 x=16 filter=seg

3. tblastx nr query.nt nogaps T=100 p=1 s2=41 x=16 filter=seg

BLASTN

1. blastall -p blastn -d nr -i query.nt

2. blastn nr query.nt kap cpus=1 m=1 n=-3 x=6 gapx=25 \
				q=7 r=2 gapL=1.37 gapK=.711 gapH=1.31 filter=dust

3. blastn nr query.nt p=1 m=1 n=-3 x=6 gapx=25 \
				q=7 r=2 gapL=1.37 gapK=.711 gapH=1.31 filter=dust

Last modified: 2005-05-24



Copyright © 2004-2005 by Warren R. Gish, Saint Louis, Missouri 63108 USA. All rights reserved.