Copyright (C) 2004 Warren R. Gish, Saint Louis, Missouri 63108 USA.
All Rights Reserved.
In the discussion that follows, several command line options or parameters are mentioned. For more detailed information about them, see parameters.html, WU-BLAST-ControlParms.pdf, and blast1.pdf.
NCBI blastall contains some hard-coded filters that are used by default. In contrast, in WU-BLAST (and NCBI-BLAST 1.4 and earlier), filtering is optional and is performed by external programs of the user's choosing. More information about standard filters is available in the Installation section.
Licensed WU BLAST 2.0 supports several command line options relevant to sequence filtering and masking of low-complexity and repetitive regions: filter, wordmask, lcfilter, and lcmask.
When studying sequences substantially longer than the longest expected gene in the specie(s) of interest, restrictions can optionally and productively be imposed on the maximum allowed distance between HSPs that will be grouped together for joint probability calculations with Sum statistics. The parameters for imposing these restrictions are named hspsepQmax and hspsepSmax. Particularly when searching with or against genomic-scale sequences (whole genomes, whole chromosomes, large contigs, or really any sequence that is much longer than the longest expected gene), use of the "hspsep*max" parameters can not only speed up a search, but can also have a dramatic effect on the sensitivity at the level of the statistics. Using these options, consistent groups of HSPs may satisfy the significance threshold for being reported that otherwise would not, even though found by the search algorithm.
The number of high-scoring segment pairs found of both the ungapped (HSP) and gapped (GSP) varieties may be determined in large part by the score thresholds S2 and gapS2. The numbers of HSPs and GSPs tend to increase exponentially as the score thresholds decrease (and vice versa). As a further consideration, some post-processing steps performed by the BLAST software have supra-linear computational complexities of N logN (e.g., sorting) and N2 (e.g., joint probability calculations), where N is the number of HSPs or GSPs. Clearly then, the run time of a BLAST search can be very sensitive to the score thresholds, particularly once they are in the range of scores expected to yield many spurious hits or if the thresholds are high but non-random features have not been masked from the sequences (e.g., see the discussion of low-complexity sequences above).
To avoid unnecessarily long execution times, the software by default limits the number of ungapped HSPs to 1000 -- and by implication the number of gapped HSPs to 1000 -- via the hspmax parameter. For the software to adhere to this limit, when too many ungapped HSPs are found for a given database sequence, the S2 threshold is transiently raised by just enough that it brings the number of HSPs back under control. The gspmax parameter can be optionally and independently used to limit the number of gapped HSPs to some number less than hspmax, although the default limit imposed by hspmax often suffices.
For consistent treatment across all database sequences, it may be desirable to set explicitly higher score thresholds on the BLAST command line, such that hspmax and gspmax are never exceeded and never cause the score thresholds to be increased transiently by the software. This can of course be accomplished via the S2 and gapS2 command line options, but it can also be performed indirectly via the E2 and gapE2 parameters. (The values for S2, E2, gapS2, and gapE2 are displayed in the Parameters section at the end of each BLAST report).
An alternative but not exclusive approach to assuring consistency
of the score thresholds across all database sequences searched
is to set hspmax=0 and gspmax=0,
thus signifying no limit on the number of HSPs and GSPs.
However, if non-random features of the sequences,
such as low-complexity regions (see above) and repetitive elements,
have not been masked,
then the numbers of satisfying HSPs and GSPs
may be huge and lead to unacceptably long run times.
Casting aside entirely the issue of sensitivity,
for high speed in protein-level searches with
BLASTP, BLASTX, TBLASTN or TBLASTX,
try these parameters:
W=3 T=1000
or
W=4 T=1000
Sensitivity is greatly reduced, but sensitivity isn't always what one needs. Modifying W and T to trade sensitivity for speed has been a key feature of the BLAST algorithm since its inception (Altschul et al. 1990). With settings of W=3 T=1000, BLASTX's memory use may be decreased by an order of magnitude, while increasing the program's speed by as much. This may be important when performing fast queries with whole genome or whole chromosome sequences.
When one is looking for exact matches, a different and more efficient scoring matrix should be used, as well. For this purpose, try the IDENTITY matrix ("matrix=identity" on the command line), and searches will run even faster. Or the BLOSUM80 or PAM40 matrices can be used without modifying the W or T parameters, to emphasize identity a little more diplomatically, while increasing speed.
Word lengths far greater than 4 can be used with WU BLAST and may yield increased speed, depending on the search conditions, but beware of the software's lack of intelligence in setting the neighborhood word score threshold, T, when W is greater than 4. For larger words, no neighborhood words are generated by default. That is, only exactly matching words between query and subject sequences are used to seed alignments. If increased sensitivity is desired with longer words, then a value for T must be set explicitly on the command line. If the value for T is not chosen carefully, though -- i.e., if T is set just a little bit too low -- a combinatorial explosion in neighborhood words will soon lead to the depletion of all available memory. Even if the neighborhood word list does fit in memory, however, its sheer size may produce an adverse effect on speed, due to the consequent loss of processor cache efficiency.
Relative to BLASTP, sensitivity is usually sacrificed by default with BLASTX/TBLASTN/TBLASTX in order to gain speed, because most of the noncoding sequence or noncoding reading frames being compared by these programs are not expected to yield significant matches; a higher default value for T of about 12 or 13 is used by these programs (with the default BLOSUM62 matrix).
To increase the speed of BLASTP,
try bumping up T just a couple of points to T=13,
which will reduce -- rather than entirely eliminate --
the number of neighborhood words.
At T=13, BLASTP will run up to 3X faster as compared to T=11,
with much of its sensitivity intact.
Similarly, BLASTX/TBLASTN/TBLASTX may run up to 3X faster with T=14
than with the default value of T=12;
but you may only want to raise T to 13 to avoid losing too much
sensitivity.
hitdist=40 sets conditions
that are comparable to the NCBI 2-hit implementation
for BLASTP, BLASTX, TBLASTN and TBLASTX.
Maintaining T at its default value and switching to the 2-hit algorithm
is one way to obtain significantly higher speed (roughly 3-fold);
This is in fact what the NCBI has done in its
blastall program (Altschul et al., 1997).
To maintain the same theoretical sensitivity, however,
the 2-hit algorithm requires a lower T,
and a lower T slows down the 2-hit algorithm
by increasing the computational requirements surrounding
all of the additional neighborhood words that are produced.
Memory use is a factor in execution speed, as well,
due to effects on processor cache efficiency,
with storage requirements for the
additional neighborhood words further dissolving
the 2-hit algorithm's speed advantage.
At the same level of sensitivity, and without departing from the classical 1-hit BLAST algorithm, WU BLAST provides speed comparable to the NCBI implementation of 2-hit BLAST. In this case, the implementation of an algorithm is seen to be an important determinant of its speed. WU BLAST still tends to execute faster when using its own 2-hit algorithm than when using its 1-hit algorithm (by roughly 30%) and faster still than NCBI BLAST in its default 2-hit mode (by 15-200% or more, depending on the search mode, specifics of the problem, and computing platform). This illustrates that the choice of the 1-hit algorithm versus the 2-hit algorithm remains an important determinant of speed with WU BLAST. For a given level of sensitivity and longer query lengths, the 2-hit algorithm may be prohibited due to the additional memory required; under some circumstances the 2-hit algorithm may even run slower than the 1-hit algorithm.
Speed and memory issues aside, if the sequences being compared are not very similar, the current 2-hit algorithm is theoretically disadvantaged at finding short regions of similarity, such as individual protein domains and short coding exons in vertebrate genome sequence, particularly if T is not lowered to compensate for the use of the 2-hit algorithm. For this reason, the 1-hit algorithm is the default for WU BLAST and the 2-hit algorithm is recommended against when the highest sensitivity is desired.
For a further comparison of WU- and NCBI-BLAST parameters for speed
and sensitivity,
see.
Even perfect matches of any length can be missed when searching compressed nucleotide sequence databases in their compressed form, if an odd (as opposed to even) value is used for wink. Newer releases of WU BLAST (since 16-Oct-2004) avoid this pitfall and actually provide behavior more consistent with user expectations regardless of the setting for wink. Users of older versions of the software are strongly advised to use even values.
Some other search programs use non-overlapping words exclusively or by default;
be careful and considerate of this characteristic
if sensitivity matters.
Due to current implementation details and typical computer hardware limitations,
increasing the wordlength, W,
from its default value of 11 usually has limited effect on BLASTN's speed --
and can even reduce its speed if taken too far.
While BLASTN permits W to be increased to as high as 32
for experimental purposes,
the practicality of increasing W above 12 or 13 may be limited.
Setting hitdist=<w> and
wink=<w> together,
where <w> is the word length being used,
effectively doubles the word length,
causing index words to be generated on every W-mer boundary in the query.
This can make for much faster searches and significantly reduce the
memory required,
but at a high cost in sensitivity.
The combination of parameter settings and options that achieves maximum speed without unduly sacrificing sensitivity may only be found through trial-and-error and is likely to vary greatly depending on the computer hardware and class of problems at hand. For searches that will be repeated many times, some effort expended up front to identify improved combinations of parameter settings and options may pay tremendous dividends in the overall time saved.
For increased sensitivity with BLASTN at the expense of speed, W may be decreased. While WU BLASTN is the only BLASTN that supports neighborhood words, the T option should probably be left alone or only modified with great care. By default, BLASTN uses no neighborhood words, recognizing instead only exact matching words of length W. Effectively, the default value for T is obtained from the product of the word length, W=11, and the match score, M=5. By default, this product is then 11 x 5 = 55. Reducing W can increase sensitivity at the expected expense in speed; but for practical purposes, using values of W less than 8 should probably not be considered because of a tremendous reduction in speed that accompanies shorter word lengths.
Lowering T even a small amount to incorporate non-identical neighborhood words and, thus, increase sensitivity can have disastrous effects on BLASTN's memory use -- and your computer system's health -- the number of neighborhood words increases so dramatically with the long word lengths BLASTN typically uses. At very short word lengths, "neighborhooding" with BLASTN becomes practical memory-wise, but then the short words typically make the computational times impractical. Still, it's an option.
Rapid search conditions comparable to those used to cluster ESTs
for Unigene can be obtained with
W=13 hitdist=28 M=1 N=-2 Q=1 R=1 X=6 gapw=20.
When repetitive elements may be present in the query, a good way to avoid wasting computer time on them is to filter them from the query prior to the database search. A useful method for this is RepeatMasker. Run your sequence through RepeatMasker, then search it against the comprehensive database of interest.
If the slow speed of RepeatMasker is getting you down,
check out
MaskerAid.
Built around the fast and flexible WU BLAST 2.0,
MaskerAid is a drop-in enhancement that accelerates
RepeatMasker 30-fold while maintaining sensitivity.
For a brief discussion of multiplexing with BLAST and code to do it,
see Korf and Gish (2000)
and
MPBLAST.
>pir|S28916|S28916 dystrophin - mouse
Length = 3678
Score = 44 (20.5 bits), Expect = 7.3, P = 0.9994
Identities = 22/106 (20%), Positives = 42/106 (39%)
Query: 96 - 200
Sbjct: 910 - 1013