Want to see BLAST "fly"?

Copyright (C) 2004 Warren R. Gish, Saint Louis, Missouri 63108 USA.
All Rights Reserved.


With a properly crafted choice of parameters, and a rationale organization of the data, WU BLAST can operate extremely fast -- much faster than any other BLAST or BLAST-like program -- while using much less memory. Such speed may not always be desired, however, if sensitivity is impacted too greatly for the problem at hand. Through appropriate use of its flexibility, WU BLAST can effectively substitute for many existing programs; and yet because of its efficient design, WU BLAST often executes faster. (As a case in point, see MaskerAid or the wu-blastall script included in licensed WU BLAST distributions). The flexibility and optimizations present in WU BLAST can obviate the need to re-invent or re-write the similarity search backend for more specialized programs, allowing the developer instead to focus on the specific needs of their application.

In the discussion that follows, several command line options or parameters are mentioned. For more detailed information about them, see parameters.html, WU-BLAST-ControlParms.pdf, and blast1.pdf.


The Trouble with Low-Complexity Regions

Low-complexity regions, such as proline- or glycine-rich regions or acidic or basic regions, can yield tremendous numbers of spurious matches between sequences that have no other similarity between them. The statistics break down when such decidedly non-random sequences appear; furthermore, search times may be needlessly increased. To avoid spurious matching and make the statistics more robust, low-complexity regions can be filtered from the query sequence using the -filter=filtername option.

NCBI blastall contains some hard-coded filters that are used by default. In contrast, in WU-BLAST (and NCBI-BLAST 1.4 and earlier), filtering is optional and is performed by external programs of the user's choosing. More information about standard filters is available in the Installation section.

Licensed WU BLAST 2.0 supports several command line options relevant to sequence filtering and masking of low-complexity and repetitive regions: filter, wordmask, lcfilter, and lcmask.


When Less Is More

When studying sequences substantially longer than the longest expected gene in the specie(s) of interest, restrictions can optionally and productively be imposed on the maximum allowed distance between HSPs that will be grouped together for joint probability calculations with Sum statistics. The parameters for imposing these restrictions are named hspsepQmax and hspsepSmax. Particularly when searching with or against genomic-scale sequences (whole genomes, whole chromosomes, large contigs, or really any sequence that is much longer than the longest expected gene), use of the "hspsep*max" parameters can not only speed up a search, but can also have a dramatic effect on the sensitivity at the level of the statistics. Using these options, consistent groups of HSPs may satisfy the significance threshold for being reported that otherwise would not, even though found by the search algorithm.


Threshold Scores Too Low

The number of high-scoring segment pairs found of both the ungapped (HSP) and gapped (GSP) varieties may be determined in large part by the score thresholds S2 and gapS2. The numbers of HSPs and GSPs tend to increase exponentially as the score thresholds decrease (and vice versa). As a further consideration, some post-processing steps performed by the BLAST software have supra-linear computational complexities of N logN (e.g., sorting) and N2 (e.g., joint probability calculations), where N is the number of HSPs or GSPs. Clearly then, the run time of a BLAST search can be very sensitive to the score thresholds, particularly once they are in the range of scores expected to yield many spurious hits or if the thresholds are high but non-random features have not been masked from the sequences (e.g., see the discussion of low-complexity sequences above).

To avoid unnecessarily long execution times, the software by default limits the number of ungapped HSPs to 1000 -- and by implication the number of gapped HSPs to 1000 -- via the hspmax parameter. For the software to adhere to this limit, when too many ungapped HSPs are found for a given database sequence, the S2 threshold is transiently raised by just enough that it brings the number of HSPs back under control. The gspmax parameter can be optionally and independently used to limit the number of gapped HSPs to some number less than hspmax, although the default limit imposed by hspmax often suffices.

For consistent treatment across all database sequences, it may be desirable to set explicitly higher score thresholds on the BLAST command line, such that hspmax and gspmax are never exceeded and never cause the score thresholds to be increased transiently by the software. This can of course be accomplished via the S2 and gapS2 command line options, but it can also be performed indirectly via the E2 and gapE2 parameters. (The values for S2, E2, gapS2, and gapE2 are displayed in the Parameters section at the end of each BLAST report).

An alternative but not exclusive approach to assuring consistency of the score thresholds across all database sequences searched is to set hspmax=0 and gspmax=0, thus signifying no limit on the number of HSPs and GSPs. However, if non-random features of the sequences, such as low-complexity regions (see above) and repetitive elements, have not been masked, then the numbers of satisfying HSPs and GSPs may be huge and lead to unacceptably long run times.


Adjust the neighborhood word score threshold

Casting aside entirely the issue of sensitivity, for high speed in protein-level searches with BLASTP, BLASTX, TBLASTN or TBLASTX, try these parameters:

     W=3 T=1000
or
     W=4 T=1000

Sensitivity is greatly reduced, but sensitivity isn't always what one needs. Modifying W and T to trade sensitivity for speed has been a key feature of the BLAST algorithm since its inception (Altschul et al. 1990). With settings of W=3 T=1000, BLASTX's memory use may be decreased by an order of magnitude, while increasing the program's speed by as much. This may be important when performing fast queries with whole genome or whole chromosome sequences.

When one is looking for exact matches, a different and more efficient scoring matrix should be used, as well. For this purpose, try the IDENTITY matrix ("matrix=identity" on the command line), and searches will run even faster. Or the BLOSUM80 or PAM40 matrices can be used without modifying the W or T parameters, to emphasize identity a little more diplomatically, while increasing speed.


Be careful when adjusting the neighborhood word score threshold!

Word lengths far greater than 4 can be used with WU BLAST and may yield increased speed, depending on the search conditions, but beware of the software's lack of intelligence in setting the neighborhood word score threshold, T, when W is greater than 4. For larger words, no neighborhood words are generated by default. That is, only exactly matching words between query and subject sequences are used to seed alignments. If increased sensitivity is desired with longer words, then a value for T must be set explicitly on the command line. If the value for T is not chosen carefully, though -- i.e., if T is set just a little bit too low -- a combinatorial explosion in neighborhood words will soon lead to the depletion of all available memory. Even if the neighborhood word list does fit in memory, however, its sheer size may produce an adverse effect on speed, due to the consequent loss of processor cache efficiency.


For a modest gain in speed, with a minimal sacrifice of sensitivity...

set the parameter gapE=2000. Depending on conditions of the search, this change may or may not yield a noticeable improvement in speed/sensitivity.

For a modest gain in speed, with a moderate sacrifice in sensitivity...

just increase the value of T a little from its default value. The default value for T in BLASTP searches is computed to emphasize sensitivity, and often turns out to be about 11 when using the BLOSUM62 scoring matrix; on average this value for T yields several neighborhood words per W-mer in the query sequence.

Relative to BLASTP, sensitivity is usually sacrificed by default with BLASTX/TBLASTN/TBLASTX in order to gain speed, because most of the noncoding sequence or noncoding reading frames being compared by these programs are not expected to yield significant matches; a higher default value for T of about 12 or 13 is used by these programs (with the default BLOSUM62 matrix).

To increase the speed of BLASTP, try bumping up T just a couple of points to T=13, which will reduce -- rather than entirely eliminate -- the number of neighborhood words. At T=13, BLASTP will run up to 3X faster as compared to T=11, with much of its sensitivity intact. Similarly, BLASTX/TBLASTN/TBLASTX may run up to 3X faster with T=14 than with the default value of T=12; but you may only want to raise T to 13 to avoid losing too much sensitivity.


2-Hit BLAST

For licensed users of WU BLAST 2.0, the
hitdist option invokes 2-hit BLAST. Using hitdist=40 sets conditions that are comparable to the NCBI 2-hit implementation for BLASTP, BLASTX, TBLASTN and TBLASTX. Maintaining T at its default value and switching to the 2-hit algorithm is one way to obtain significantly higher speed (roughly 3-fold); This is in fact what the NCBI has done in its blastall program (Altschul et al., 1997). To maintain the same theoretical sensitivity, however, the 2-hit algorithm requires a lower T, and a lower T slows down the 2-hit algorithm by increasing the computational requirements surrounding all of the additional neighborhood words that are produced. Memory use is a factor in execution speed, as well, due to effects on processor cache efficiency, with storage requirements for the additional neighborhood words further dissolving the 2-hit algorithm's speed advantage.

At the same level of sensitivity, and without departing from the classical 1-hit BLAST algorithm, WU BLAST provides speed comparable to the NCBI implementation of 2-hit BLAST. In this case, the implementation of an algorithm is seen to be an important determinant of its speed. WU BLAST still tends to execute faster when using its own 2-hit algorithm than when using its 1-hit algorithm (by roughly 30%) and faster still than NCBI BLAST in its default 2-hit mode (by 15-200% or more, depending on the search mode, specifics of the problem, and computing platform). This illustrates that the choice of the 1-hit algorithm versus the 2-hit algorithm remains an important determinant of speed with WU BLAST. For a given level of sensitivity and longer query lengths, the 2-hit algorithm may be prohibited due to the additional memory required; under some circumstances the 2-hit algorithm may even run slower than the 1-hit algorithm.

Speed and memory issues aside, if the sequences being compared are not very similar, the current 2-hit algorithm is theoretically disadvantaged at finding short regions of similarity, such as individual protein domains and short coding exons in vertebrate genome sequence, particularly if T is not lowered to compensate for the use of the 2-hit algorithm. For this reason, the 1-hit algorithm is the default for WU BLAST and the 2-hit algorithm is recommended against when the highest sensitivity is desired.

For a further comparison of WU- and NCBI-BLAST parameters for speed and sensitivity, see.


Statistical Effects

Statistical computations can also take their toll on the overall search speed, particularly when combinatorial methods are used to assess joint probabilities of multiple alignments found between long sequences containing unmasked repetitive or low-complexity regions. The combinatorics can generally be avoided by specifying the kap option and can be reduced by setting smaller values for hspmax and gspmax.

Wink-ing

The
wink=n (W increment) option increases speed by instructing the search program to generate neighborhood words at every nth position along the query, with successive word positions overlapping by max(W-n,0) letters The default is wink=1, such that neighborhood words are generated at every position, with successive overlap of W-1 letters. This option is most useful when searching for identical or nearly identical sequences rapidly, and can be used in conjunction with the hitdist option for maximum search speed. Care should be taken that desirable, less-than-perfect matches are not precluded by the use of these parameters, however.

Even perfect matches of any length can be missed when searching compressed nucleotide sequence databases in their compressed form, if an odd (as opposed to even) value is used for wink. Newer releases of WU BLAST (since 16-Oct-2004) avoid this pitfall and actually provide behavior more consistent with user expectations regardless of the setting for wink. Users of older versions of the software are strongly advised to use even values.

Some other search programs use non-overlapping words exclusively or by default; be careful and considerate of this characteristic if sensitivity matters.


What about BLASTN?

For BLASTN, speeding up searches is rather simple but more limited in effect, unless one is ready to take a major hit in sensitivity and/or memory consumption. Speed can usually be increased by:

Due to current implementation details and typical computer hardware limitations, increasing the wordlength, W, from its default value of 11 usually has limited effect on BLASTN's speed -- and can even reduce its speed if taken too far. While BLASTN permits W to be increased to as high as 32 for experimental purposes, the practicality of increasing W above 12 or 13 may be limited. Setting hitdist=<w> and wink=<w> together, where <w> is the word length being used, effectively doubles the word length, causing index words to be generated on every W-mer boundary in the query. This can make for much faster searches and significantly reduce the memory required, but at a high cost in sensitivity.

The combination of parameter settings and options that achieves maximum speed without unduly sacrificing sensitivity may only be found through trial-and-error and is likely to vary greatly depending on the computer hardware and class of problems at hand. For searches that will be repeated many times, some effort expended up front to identify improved combinations of parameter settings and options may pay tremendous dividends in the overall time saved.

For increased sensitivity with BLASTN at the expense of speed, W may be decreased. While WU BLASTN is the only BLASTN that supports neighborhood words, the T option should probably be left alone or only modified with great care. By default, BLASTN uses no neighborhood words, recognizing instead only exact matching words of length W. Effectively, the default value for T is obtained from the product of the word length, W=11, and the match score, M=5. By default, this product is then 11 x 5 = 55. Reducing W can increase sensitivity at the expected expense in speed; but for practical purposes, using values of W less than 8 should probably not be considered because of a tremendous reduction in speed that accompanies shorter word lengths.

Lowering T even a small amount to incorporate non-identical neighborhood words and, thus, increase sensitivity can have disastrous effects on BLASTN's memory use -- and your computer system's health -- the number of neighborhood words increases so dramatically with the long word lengths BLASTN typically uses. At very short word lengths, "neighborhooding" with BLASTN becomes practical memory-wise, but then the short words typically make the computational times impractical. Still, it's an option.

Rapid search conditions comparable to those used to cluster ESTs for Unigene can be obtained with W=13 hitdist=28 M=1 N=-2 Q=1 R=1 X=6 gapw=20.


The Problem with Repeats

Another source of computational overhead is the presence of repetitive elements in query and database sequences. Prime examples of this are Alus and LINEs in human genomic sequence. The presence of repeats can cause the gapped alignment portion of WU BLAST to dominate the execution time, resulting in an order of magnitude decrease in speed.

When repetitive elements may be present in the query, a good way to avoid wasting computer time on them is to filter them from the query prior to the database search. A useful method for this is RepeatMasker. Run your sequence through RepeatMasker, then search it against the comprehensive database of interest.

If the slow speed of RepeatMasker is getting you down, check out MaskerAid. Built around the fast and flexible WU BLAST 2.0, MaskerAid is a drop-in enhancement that accelerates RepeatMasker 30-fold while maintaining sensitivity.


Multiplex Short Queries

When querying with short sequences, the vast majority of the CPU time expended by BLAST may actually be spent reading the database and parsing its contents. This is particularly true for BLASTN searches with typical EST- or single sequencing read-length queries (e.g., sequences of 500 bp or less). The cost of database I/O can be substantially reduced by searching with more than one query sequence at a time -- multiplexing queries. For 500 bp queries, the speed-up is about 10-fold and increases as the queries become shorter.

For a brief discussion of multiplexing with BLAST and code to do it, see Korf and Gish (2000) and MPBLAST.


Organize A Rational Workflow

Even fast software will run like a tortoise if misused. It's therefore necessary not only to use computer science in this business but computer sense. If you have, say, tens of millions of mouse whole genome shotgun sequencing reads to compare against the human genome, imagine how costly it would be to search the human genome database once with each sequencing read or tens of million times! Multiplexing might yield only about a 10-fold improvement, which is still problematic. Now reverse the problem and search the human genome against a database comprised of the mouse sequencing reads. The human genome is organized into just a few hundred sequences, so the mouse reads database need be read only a few hundred times. With a little forethought, the cost of database I/O is greatly reduced -- by an estimated 3 or 4 orders of magnitude in this case, such that it's contribution to the overall run time is insignificant.

Skip the Alignments

If you need the end-points of alignments but not the alignments themselves, computer time and disk storage can be conserved by specifying the noseqs option. An example of noseqs output is shown below, where the alignment text has simply been replaced by an hyphen. The result of using noseqs is an abbreviated output that is faster both to produce and to parse.
>pir|S28916|S28916 dystrophin - mouse
        Length = 3678

 Score = 44 (20.5 bits), Expect = 7.3, P = 0.9994
 Identities = 22/106 (20%), Positives = 42/106 (39%)

Query: 96 - 200
Sbjct: 910 - 1013

Cache As Cache Can

Most contemporary operating systems do good-to-excellent job of caching recently accessed data from disk in otherwise unused areas of memory. This makes subsequent accesses of these same data as fast as memory accesses, rather than several orders of magnitude slower by having to go to disk. Working with cached files can also avoid contention by other concurrent processes for the same physical disks. If your data analysis pipeline is currently designed to search the same set of databases repeatedly and in succession, check the sizes of the database files -- most importantly the sizes of the *.x[np][st] files, whose contents are read in their entirety for every search. If insufficient free memory is available on your system with which to cache all of these data, try to reorganize the pipeline so that all searches against a given database are performed before moving on to search the next database.

Some Notes in Closing

While increasing the word length W for any of the search programs will tend to increase speed, note that this also increases the search start-up time and memory use, because additional neighborhood words must be generated and remembered. (The
wink option can help alleviate this problem.) In contrast, raising T reduces all of:  the number of neighborhood words, memory used, and startup time (except for BLASTN, as noted above). How does one know what the memory use for the neighborhood words is? Check the size reported for the DFA (deterministic finite-state automaton) at the end of the search output.
Last modified 2004-04-27
Return to the WU BLAST 2.0 Page