.\" Command to print: psroff -man .\" Command to view: nroff -man | more .TH BLAST 1L "1 February 1995" .SH NAME blastp, blastn, blastx, tblastn, tblastx - rapid sequence database search programs utilizing the \s-1BLAST\s0 algorithm .SH SYNOPSIS .LP .nf .ft B blastp aadb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [H=#] [V=#] [B=#] [-sort_by...] .ft R .sp .fi .nf .ft B blastn ntdb ntquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [ [[M=matchscore][N=mismatchpenalty]] [-matrix scorefile] ] [Y=#] [Z=#] [H=#] [V=#] [B=#] [[-top][-bottom]] [-sort_by...] .ft R .sp .fi .nf .ft B blastx aadb ntquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [C=#] [H=#] [V=#] [B=#] [[-top][-bottom]] [-sort_by...] .ft R .sp .fi .nf .ft B tblastn ntdb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [-dbgcode #] [H=#] [V=#] [B=#] [[-dbtop][-dbbottom]] [-sort_by...] .ft R .sp .fi .nf .ft B tblastx ntdb ntquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#] [-matrix scorefile] [Y=#] [Z=#] [C=#] [-dbgcode #] [H=#] [V=#] [B=#] [[-top][-bottom]] [[-dbtop][-dbbottom]] [-sort_by...] .ft R .fi .SH DESCRIPTION .LP This document describes the \s-1BLAST\s0 version 1.4 programs. .LP .SM BLAST (\fBB\fRasic \fBL\fRocal \fBA\fRlignment \fBS\fRearch \fBT\fRool) is the heuristic search algorithm employed by the programs .BR blastp , .BR blastn , .BR blastx , .BR tblastn , and .BR tblastx ; these programs ascribe significance to their findings using the statistical methods of Karlin and Altschul (1990, 1993) with a few enhancements. The .SM BLAST programs were tailored for sequence similarity searching -- for example to identify homologs to a query sequence. The programs are not generally useful for motif-style searching. For a discussion of basic issues in similarity searching of sequence databases, see Altschul \fIet al.\fR (1994). 
.LP The five \s-1BLAST\s0 programs described here perform the following tasks: .LP .TP 10 .B blastp compares an amino acid query sequence against a protein sequence database; .TP .B blastn compares a nucleotide query sequence against a nucleotide sequence database; .TP .B blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database; .TP .B tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands). .TP .B tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. .LP The fundamental unit of .SM BLAST algorithm output is the .B High-scoring Segment Pair (\s-1HSP\s0). An \s-1HSP\s0 consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or \fIcutoff\fR score. A set of \s-1HSP\s0s is thus defined by two sequences, a scoring system, and a cutoff score; this set may be empty if the cutoff score is sufficiently high. In the programmatic implementations of the .SM BLAST algorithm described here, each .SM HSP consists of a segment from the query sequence and one from a database sequence. The sensitivity and speed of the programs can be adjusted via the standard \s-1BLAST\s0 algorithm parameters .BR W , .BR T , and .BR X (Altschul \fIet al.\fR, 1990); selectivity of the programs can be adjusted via the cutoff score. .LP A .B Maximal-scoring Segment Pair (\s-1MSP\s0) is defined by two sequences and a scoring system and is the highest-scoring of all possible segment pairs that can be produced from the two sequences. 
The statistical methods of Karlin and Altschul (1990, 1993) are applicable to determining the significance of .SM MSP scores in the limit of long sequences, under a random sequence model that assumes independent and identically distributed choices for the residues at each position in the sequences. In the programs described here, Karlin-Altschul statistics have been extrapolated to the task of assessing the significance of .SM HSP scores obtained from comparisons of potentially short, biological sequences. .SH "SEARCH STRATEGY" .LP The approach to similarity searching taken by the .SM BLAST programs is first to look for similar segments (\s-1HSP\s0s) between the query sequence and a database sequence, then to evaluate the statistical significance of any matches that were found, and finally to report only those matches that satisfy a user-selectable threshold of significance. Findings of multiple \s-1HSP\s0s involving the query sequence and a single database sequence may be treated statistically in a variety of ways. By default the programs use \*(lqSum\*(rq statistics (Karlin and Altschul, 1993). As such, the statistical significance ascribed to a set of \s-1HSP\s0s may be higher than that ascribed to any individual member of the set. Only when the ascribed significance satisfies the user-selectable threshold (\fBE\fR parameter) will the match be reported to the user. .LP The task of finding \s-1HSP\s0s begins with identifying short words of length .BR W in the query sequence that either match or satisfy some positive-valued threshold score .BR T when aligned with a word of the same length in a database sequence. .BR T is referred to as the .I neighborhood word score threshold (Altschul \fIet al.\fR, 1990). These initial neighborhood .I word hits act as seeds for initiating searches to find longer \s-1HSP\s0s containing them. The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. 
Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity .BR X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. .SH "SETTING PARAMETERS" .LP Many of the .SM BLAST program parameters have one- or two-letter names and default values that can be modified using a .I name=value syntax on the command line, .I e.g., E=0.05 or S2=35. Other command line options are flags that appear alone on the command line (\fIe.g., \-span\fR). Still other options take a value, which follows the option name and is separated from it by white space, as in \fI\-filter\ seg\fR or \fI\-dbrecmax\ 10500\fR. An alternative parameter-value syntax supported by the programs is illustrated in these examples: \fIfilter=seg\fR and \fIdbrecmax=10500\fR. .SH "SELECTIVITY IN REPORTING MATCHES" .LP The parameter .BR E establishes a statistical significance threshold for reporting database sequence matches. .BR E is interpreted as the upper bound on the expected frequency of chance occurrence of an .SM HSP (or set of \s-1HSP\s0s) within the context of the entire database search. Any database sequence whose matching satisfies .BR E is subject to being reported in the program output. If the query sequence and database sequences follow the random sequence model of Karlin and Altschul (1990), and if sufficiently sensitive .SM BLAST algorithm parameters are used, then .BR E may be thought of as the number of matches one expects to observe by chance alone during the database search. The default value for .BR E is 10, while the permitted range for this Real valued parameter is 0 < .BR E <= 1000. .LP The parameter .BR S represents the score at which a single \s-1HSP\s0 would by itself satisfy the significance threshold .BR E. 
Higher scores -- higher values for .BR S -- correspond to increasing statistical significance (lower probability of chance occurrence). Unless .BR S is explicitly set on the command line, its default value is calculated from the value of .BR E. If both .BR S and .BR E are set on the command line, whichever is the more restrictive is used. When neither parameter is specified on the command line, the default value for .BR E is used to calculate .BR S. .LP The values for .BR E and .BR S are interconvertible, given the context of the search, which includes: the length and residue composition of the query sequence; the length of the database; a fixed, hypothetical residue composition for the database; and the scoring system employed. The scoring system used by the \s-1BLAST\s0 programs consists of a scoring matrix, wherein a score is ascribed to the alignment of each letter (residue) in the alphabet with every other letter in the alphabet as well as to itself. .LP The significance of an alignment score depends intimately upon the specific scoring matrix employed and the length and residue composition of the query sequence and database, all of which may vary with each search performed. Instead of having the user guess at an appropriate value for the cutoff score .BR S for each search, an intuitive, general way to set thresholds for reporting matches is via the .BR E parameter, which has the direct statistical interpretation mentioned above. .SH KARLIN-ALTSCHUL STATISTICS .LP From Karlin and Altschul (1990), the principal equation relating the score of an .SM HSP to its expected frequency of chance occurrence is: .sp .ce .I E = K N exp(-Lambda S) .sp where .I E is the expected frequency of chance occurrence of an .SM HSP having score .I S (or one scoring higher); .I K and .I Lambda are Karlin-Altschul parameters; .I N is the product of the query and database sequence lengths, or the size of the search space; and .I exp is the exponential function. 
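.LP
The interconversion of E and S through the equation above can be sketched in a few lines of Python. This is an illustration only, not code from the BLAST programs: the function names are invented for this example, the values chosen for K, Lambda, and N are arbitrary placeholders, Lambda is assumed to be expressed in natural-log units so that it can be used directly with exp, and the rounding of S up to an integer is an assumption made because scoring matrices hold integer scores.

```python
import math


def expect_from_score(score, k, lam, n):
    """E = K * N * exp(-Lambda * S), with Lambda in natural-log units."""
    return k * n * math.exp(-lam * score)


def score_from_expect(expect, k, lam, n):
    """Smallest integer S whose expectation does not exceed `expect`.

    Obtained by solving E = K N exp(-Lambda S) for S and rounding up.
    """
    return math.ceil(math.log(k * n / expect) / lam)


if __name__ == "__main__":
    # Placeholder values, chosen only to exercise the functions.
    K, LAMBDA = 0.1, 0.3
    N = 250 * 10_000_000      # query length times database length
    s = score_from_expect(10.0, K, LAMBDA, N)
    print("S for E=10:", s)
    print("E at that S:", expect_from_score(s, K, LAMBDA, N))
```

.LP
Because the score is rounded up, the expectation recomputed from the returned S is at or below the requested E, which mirrors the description of S as the score at which a single HSP would by itself satisfy E.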
.LP .I Lambda may be thought of as the expected increase in reliability of an alignment associated with a unit increase in alignment score. Reliability in this case is expressed in units of information, such as .I bits or .I nats, with one nat being equivalent to 1/log(2) (roughly 1.44) bits. .LP The expectation .I E (range 0 to infinity) calculated for an alignment between the query sequence and a database sequence can be extrapolated to an expectation over the entire database search, by converting the pairwise expectation to a probability (range 0-1) and multiplying the result by the ratio of the entire database size (expressed in residues) to the length of the matching database sequence. In detail: .sp .ce \fIE_database = (1 - exp(-E)) D / d\fR .sp where \fID\fR is the size of the database; \fId\fR is the length of the matching database sequence; and the quantity .I (1 - exp(-E)) is the probability, .I P, corresponding to the expectation .I E for the pairwise sequence comparison. Note that in the limit of infinite \fIE\fR, \fIP\fR approaches 1; and in the limit as \fIE\fR approaches 0, \fIE\fR and \fIP\fR approach equality. Due to inaccuracy in the statistical methods as they are applied in the .SM BLAST programs, whenever \fIE\fR and \fIP\fR are less than about 0.05, the two values can be practically treated as being equal. .LP In contrast to the random sequence model used by Karlin-Altschul statistics, biological sequences are often short in length -- an .SM HSP may involve a relatively large fraction of the query or database sequence, which reduces the effective size of the 2-dimensional search space defined by the two sequences. To obtain more accurate significance estimates, the .SM BLAST programs compute .I effective lengths for the query and database sequences that are their real lengths minus the expected length of the .SM HSP, where the expected length for an .SM HSP is computed from its score. 
In no event is an effective length for the query or database sequence permitted to go below 1. Thus, the effective length of either the query or the database sequence is computed according to the following: .sp .ce \fILength_eff = \fRMAX\fI( Length_real - Lambda S / H , 1)\fR .sp where .I H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990), one of the statistics reported by the .SM BLAST programs. .I H may be thought of as the information expected to be obtained from each pair of aligned residues in a real alignment that distinguishes the alignment from a random one. .SH "HSP SCORE THRESHOLDS" .LP Using the default parameters, many more aligned segment pairs are typically found by the .SM BLAST programs than are ultimately reported. First, only those segment pairs scoring at or above a selectable cutoff score are saved as .I bona fide \s-1HSP\s0s for further consideration of their statistical significance. And second, any \s-1HSP\s0s that are found may not satisfy the significance threshold for reporting. .LP The cutoff score which defines \s-1HSP\s0s is parameterized as .BR S2. A value for .BR S2 can be set on the command line, or its value can be set indirectly via the command line parameter .BR E2. .BR E2 is interpreted as the .I expected number of \s-1HSP\s0s that will be found when comparing two sequences that each have the same length -- either 300 amino acids or 1000 nucleotides, whichever is appropriate for the particular program being used. .BR S2 may be thought of as the score expected for the .SM MSP between two such sequences. The default value for .BR E2 is typically about 0.15 but may vary from version to version of each program. The default value for .BR S2 will be calculated from .BR E2 and, like the relationship between .BR E and .BR S , is dependent on the residue composition of the query sequence and the scoring system employed, as conveyed by the Karlin-Altschul .I K and .I Lambda statistics. 
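.LP
The database extrapolation and effective-length formulas given under KARLIN-ALTSCHUL STATISTICS, and the derivation of S2 from E2 described in this section, can be sketched in Python as follows. All function names are invented for this illustration. Lambda and H are assumed to be in the same (natural-log) units. The last function is the most speculative: it assumes that S2 is obtained by inverting E = K N exp(-Lambda S) with N taken as the square of the raw reference length (300 amino acids or 1000 nucleotides) and rounding up. The text does not say whether the programs use raw or effective lengths at this step.

```python
import math


def pairwise_probability(expect):
    """P = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-expect)


def database_expectation(expect, db_size, subject_len):
    """E_database = (1 - exp(-E)) * D / d."""
    return pairwise_probability(expect) * db_size / subject_len


def effective_length(real_len, score, lam, h):
    """Length_eff = MAX(Length_real - Lambda * S / H, 1)."""
    return max(real_len - lam * score / h, 1.0)


def cutoff_from_e2(e2, k, lam, seq_len):
    """Hypothetical S2: smallest integer score expected at most `e2` times
    when two sequences of length `seq_len` are compared (raw lengths assumed).
    """
    return math.ceil(math.log(k * seq_len * seq_len / e2) / lam)


if __name__ == "__main__":
    # Placeholder Karlin-Altschul values, for illustration only.
    K, LAMBDA, H = 0.1, 0.3, 0.6
    print(effective_length(300, 50, LAMBDA, H))
    print(database_expectation(0.01, 1e7, 500))
    print(cutoff_from_e2(0.15, K, LAMBDA, 300))
```

.LP
For small E the pairwise probability returned is nearly equal to E itself, in line with the remark that the two can be treated as equal below about 0.05.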
.SH "SEARCH SENSITIVITY" Sensitivity of the .SM BLAST programs should be considered in two areas. First, there is the question of how well ungapped alignments (\s-1HSP\s0s) can capture or represent the similarity between two biological sequences that may have evolved independently and/or contain sequencing errors. Particularly in the presence of insertions/deletions or frameshifts, it may be necessary to increase .BR E2 (or lower .BR S2 ), in order to detect the remnants of extended similarity. The amount of evidence or information to support the hypothesis that a given alignment is real and not random decreases with each mutation or sequencing error (States \fIet al.\fR, 1991; Gish and States, 1993). As a corollary of this, the expected length of a statistically significant .SM HSP increases with each mutation or sequencing error. At some point, accumulated mutations and errors completely obscure the presence of a relationship between two sequences; the .SM BLAST programs' focus on ungapped alignments may cause this point to be reached sooner than for other alignment methods. .LP The second area where sensitivity may be of concern is in the heuristic nature of the .SM BLAST algorithm for finding \s-1HSP\s0 alignments. Using this algorithm, along with a properly composed scoring scheme for Karlin-Altschul statistics to be applied, the lower the score is of an .SM HSP, the higher is the probability that the .SM HSP will go undetected. At the user's discretion, the speed of the .SM BLAST algorithm and the programs can be sacrificed in exchange for increased sensitivity of detecting these lower significance \s-1HSP\s0s, and vice versa; however, the default parameters for all of the programs except .B blastn have already been chosen to generally obtain moderate (\fBblastx\fR, \fBtblastn\fR, and \fBtblastx\fR) or high (\fBblastp\fR) sensitivity. 
If sensitivity is not an issue but speed is, then one should consider adjusting the .SM BLAST algorithm parameters to achieve higher speed (\fIe.g.,\fR increase \fBW\fR by one and \fBT\fR by 10-50%). .LP Raising .BR E2 or lowering .BR S2 can improve the .I apparent sensitivity of the .SM BLAST programs by permitting them to assess larger sets of \s-1HSP\s0s for statistical significance; but lower-scoring \s-1HSP\s0s are more difficult to detect, due to the heuristic nature of the .SM BLAST algorithm. Therefore, merely adjusting .BR E2 or .BR S2 may not significantly increase sensitivity -- it may also be necessary to adjust the .SM BLAST algorithm's .BR W , .BR T , and .BR X parameters to increase the .I true sensitivity of the programs. .LP If .BR E2 and .BR S2 are adjusted much from their default values to observe even lower-scoring \s-1HSP\s0s, search speed may suffer significantly because the computational complexity of the statistical methods is nonlinear in the number of \s-1HSP\s0s that are found. For Sum statistics, the complexity is a quadratic function of the number of \s-1HSP\s0s; for Poisson statistics, the complexity is even worse, a cubic function. Furthermore, as more \s-1HSP\s0s are considered, fuzziness in the .SM HSP consistency rules yields more reports of false positives. .LP Without varying the scoring scheme employed, the probability that the .SM BLAST algorithm can detect an .SM HSP having any particular score can be increased by: lowering the neighborhood word score threshold, .BR T , while keeping the word size, .BR W , constant; lowering both .BR W and .BR T appropriately (see Altschul \fIet al.\fR, 1990); and/or raising the word hit extension drop-off score .BR X (described earlier). .LP The default value for .BR W is 3 amino acids for .BR blastp , .BR blastx , .BR tblastn , and .BR tblastx , and 11 nucleotides for .BR blastn. 
For the first 4 .SM BLAST programs, which perform comparisons of amino acid sequences, .BR W should usually be restricted to values less than 5, unless the value for .BR T is specified disproportionately larger, to avoid consuming too much memory for the neighborhood word list (see below and Altschul \fIet al.\fR, 1990). .LP .BR X is a positive integer representing the maximum permissible decay of the cumulative segment score during word hit extension. Raising .BR X may decrease the chance that the .SM BLAST algorithm overlooks an .SM HSP, but it may significantly increase the search time, as well. If computation time is of little concern, .BR X might be increased a few points from its default value, but often little or no increase in sensitivity is observed by increasing this parameter from its default value. .LP For .BR blastp , .BR blastx , .BR tblastn , and .BR tblastx , the default value for .BR X is calculated to be the minimum integral score representing 10 bits of information, or a decay in the statistical significance of the alignment by a factor of 2 to the tenth power (or about 1,000). Since the .BR X parameter is used to terminate extensions independently in both directions, about 1 in 500 alignments are expected to be terminated prematurely that would have attained a higher score had termination not come so soon. .LP For .BR blastn , the default value of .BR X is the minimum integral score that represents at least 20 bits of information, or a reduction in the statistical significance of the alignment by a factor of 2 to the twentieth power (or about one million). .SH "THE NEIGHBORHOOD" .BR T is the neighborhood word score threshold for generating all words of length .BR W that yield a score of at least .BR T when aligned with some word of length .BR W from the query sequence. The list of words so generated is called the .I neighborhood (Altschul \fIet al.\fR, 1990). 
The size of the neighborhood can be increased, thus improving sensitivity, by lowering .BR T. Conversely, raising the value of .BR T decreases the size of the neighborhood and decreases the likelihood of detecting \s-1HSP\s0s. Generally, the larger the neighborhood (the lower \fBT\fR is), the slower the programs run, as well. .LP The default value for the neighborhood word score threshold is calculated at run-time from the residue composition and length of the query sequence and the scoring matrix employed, using an .I ad hoc equation that is a function of .I Lambda and .I H. Occasionally it may be necessary to manually set the neighborhood word score threshold via the command line, for which 13 may be a good value to try, but a good choice is .I highly dependent on the particular scoring matrix and word length used. .LP The .SM PAM120 amino acid scoring matrix supplied with the \s-1BLAST\s0 programs, produced to a scale of natural log(2)/2, yields values for .I Lambda that are expected to be close to 0.5 bits per unit score for query sequences of typical residue compositions. Under these conditions, an increase in an alignment score by 2 units is expected to increase the reliability or informativeness of the alignment by 2 times 0.5 = 1 bit, corresponding to an increase in its statistical significance by a factor of 2. The supplied .SM PAM250 matrix was produced to a scale of natural log(2)/3, suggesting that an increase in alignment score by 3 units will be required to increase statistical significance by a factor of 2. These are rules of thumb for the matrices mentioned. Generally, the significance of an alignment score is indeterminate without specific knowledge of the scoring matrix employed. If one communicates scores in a report, it may be useful to attach the values for the Karlin-Altschul parameters .I Lambda and .I K, so that statistical significance can be properly ascribed to the scores. 
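.LP
The rules of thumb above, and the way the default X is described in terms of bits, come down to simple arithmetic, sketched here in Python. The function names are invented for this illustration, and the per-unit Lambda values (0.5 bits for PAM120, one third of a bit for PAM250) are the approximate figures quoted in the text, not values computed by the programs for any particular query.

```python
import math

LN2 = math.log(2.0)


def nats_to_bits(nats):
    """One nat equals 1/log(2), roughly 1.44, bits."""
    return nats / LN2


def significance_factor(delta_score, lambda_bits):
    """Factor by which significance changes for a score change of
    `delta_score` units, with Lambda given in bits per unit score."""
    return 2.0 ** (delta_score * lambda_bits)


def min_score_for_bits(bits, lambda_bits):
    """Smallest integer score representing at least `bits` bits."""
    return math.ceil(bits / lambda_bits)


if __name__ == "__main__":
    print(significance_factor(2, 0.5))     # PAM120 rule of thumb -> 2.0
    print(min_score_for_bits(10, 0.5))     # 10 bits at 0.5 bits/unit -> 20
    print(nats_to_bits(1.0))
```

.LP
Under the stated assumption of 0.5 bits per unit score, a drop-off of 10 bits corresponds to 20 score units; the value actually chosen by a program depends on the Lambda computed for the search at hand.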
.SH "MORE OPTIONS" Except where noted, all of the .SM BLAST programs accept the following command line options: .TP 8 .B \-altscore \fIscore_specification\fR This option can be used to alter entire rows, columns, or just individual scores in a scoring matrix. .I score_specification is a (quoted) character string consisting of three components each separated by at least one space: a letter in the query sequence alphabet (amino acid or nucleotide); a letter in the database sequence alphabet (amino acid or nucleotide); the new pairwise score (integer) to be assigned to the alignment of these two letters. If either character is specified as .I any, then the altered score will be assigned to the entire row or column in the scoring matrix. If the new score is given as .I min (\fImax\fR) then the new score assigned will be the minimum (maximum) observed score overall in the matrix; if the new score is given as .I na, then the alignment of the two characters will not be allowed (effectively an infinite negative score is assigned to the alignment of the two letters). Multiple .B \-altscore options can be specified on the command line, with each one applying to the scoring matrix last specified in a .B \-matrix option, or to the default scoring matrix if no .B \-matrix option has been used. As an example of this option's use, to assign an alignment score of zero (0) to the presence of a stop codon in either the query sequence or database sequence, these two specifications can be used together: \fI\-altscore \*(lq*\ any\ 0\*(rq \-altscore \*(lqany\ *\ 0\*(rq\fR. .TP .B \-asn1 This option causes the programs to produce printable, structured output (not for human consumption, but for accurate automated parsing) in conformance with specifications written in the ISO 8824 standard \s-1ASN.1\s0 language. 
.TP .B \-asn1bin This option causes the programs to produce binary-encoded, structured output (not for human consumption, but for accurate automated parsing) in conformance with specifications written in the ISO 8824 standard \s-1ASN.1\s0 language and encoded according to the rules established by ISO 8825. .TP .B \-bottom See the .B \-top option. .TP .B \-codoninfo \fIcodoninfofile\fR This (\fBblastx\fR version 1.3 only) option is used to specify a file containing codon usage or codon bias information to be used in concert with a traditional scoring matrix to score alignments. The file containing codon usage information must have a .I .cdi extension on its name, but this extension should be omitted from the .I codoninfofile argument specified on the command line. Codon usage information should be expressed in units that coincide with the scale of the scoring matrix employed, and the scoring matrix employed must also have a .I .cdi extension to its name. A few such pairs of scoring matrix and codon usage files are provided in the .SM BLAST software distribution. .BR blastx expects to find the codon usage files in the /usr/ncbi/blast/cdi directory, or the program can be directed to look in another directory by setting the .SM BLASTCDI environment variable. \fINOTE: this option is presently supported only by the previous version 1.3 of \fBblastx\fR\fR. .TP .B \-compat1.3 This option is used to invoke behavior from the .SM BLAST version 1.4 programs that is very similar to that of the previous version 1.3 programs. This option affects the \fB\-poissonp\fR, \fB\-span1\fR, \fB\-olfraction 0.5\fR, \fB\-ctxfactor\fR, \fBE\fR and \fBE2\fR settings. .TP .B \-consistency This option turns off both the determination of the number of \s-1HSP\s0s that are .I consistent with each other in a gapped alignment and an adjustment that is made to the Sum and Poisson statistics to account for the consistency. .TP .B \-dbbottom See \fB\-dbtop\fR. 
.TP .B \-dbgcode \fIgenetic_code_ID\fR For the .BR tblastx program, which translates both the query sequence and the database, this option permits the genetic code used to translate the database to be set separately from the genetic code used to translate the query sequence. This option may also be used to set the genetic code used by \fBtblastn\fR to translate the database. See the list of genetic code identifiers later in this document. See also the .B \-gcode option. .TP .B \-dbrecmax \fIlast_record_number\fR By default the .SM BLAST programs search the entire database. Using the .B \-dbrecmax option, the record number of the last database sequence to search can be specified. See also the .B \-dbrecmin option. .TP .B \-dbrecmin \fIfirst_record_number\fR By default the .SM BLAST programs search the entire database. Using the .B \-dbrecmin option, the record number of the first database sequence to search can be specified. Searching will continue from that point on, until the end of the database is reached or until the sequence is reached whose record number corresponds to that specified in a .B \-dbrecmax option. Record numbers are one-based (\fIi.e.,\fR 1 is the first record, 2 is the second record, and so on). Statistics are computed using the complete database length, not the length of the subset selected. See also the .B \-dbrecmax option. .TP .B \-dbtop For those programs that translate a nucleotide sequence database (\fBtblastn\fR and \fBtblastx\fR), the .B \-dbtop and .B \-dbbottom options can be specified to restrict the search to a particular strand of each database sequence. The top strand consists of the database sequence as stored in the database; the bottom strand refers to the reverse complement of the database sequence. .TP .B \-echofilter This option causes the filtered query sequence to be displayed in the output. Any masked letters are typically indicated with X's (protein) or N's (nucleic acid). 
.TP .B "\-filter \fIfiltermethod\fR" This option activates filtering or masking of segments of the query sequence based on a potentially wide variety of criteria. The usual intent of filtering is to mask regions that are non-specific for protein identification using sequence similarity. For instance, it may be desired to mask acidic or basic segments that would otherwise yield overwhelming amounts of uninteresting, non-specific matches against a wide array of protein families from a comprehensive database search. The .SM BLAST programs have internally-coded knowledge of the specific command line options needed to invoke the .SM SEG and .SM XNU programs as query sequence filters, but these two filter programs are not included in the .SM BLAST software distribution and must be independently installed. All filter programs must reside in the /usr/ncbi/blast/filter directory, or the .SM BLASTFILTER environment variable must be set to point to the directory containing the desired filter programs. The .SM SEG program (Wootton and Federhen, 1993) masks low compositional complexity regions, while .SM XNU (Claverie and States, 1993) masks regions containing short-periodicity internal repeats. The .SM BLAST programs can pipe the filtered output from one program into another. For instance, .SM XNU+SEG or .SM SEG+XNU can be specified as the .I filtermethod to have each program filter the query sequence in succession. Note that neither .SM SEG nor .SM XNU is suitable for filtering untranslated nucleotide sequences for use by .BR blastn . .TP .B \-gapdecayrate \fIrate\fR This parameter defines the common ratio of the terms in a geometric progression used in normalizing probabilities across all numbers of Poisson events (typically the number of \*(lqconsistent\*(rq \s-1HSP\s0s). 
A Poisson probability for .I N segments is weighted by the reciprocal of the \fIN\fRth term in the progression, where the first term has a value of .I (1-rate), the second term is .I (1-rate)*rate, the third term is .I (1-rate)*rate*rate, and so on. The default .I rate is 0.5, such that the probability assigned to a single .SM HSP is discounted by a factor of 2, the Poisson probability of 2 \s-1HSP\s0s is discounted by a factor of 4, for 3 \s-1HSP\s0s the discount factor is 8, and so on. The rate essentially defines a penalty imposed on the gap between each .SM HSP, where the default penalty is equivalent to 1 bit of information. The suggestion to normalize Poisson probabilities was made by Phil Green (University of Washington, Seattle, WA). .TP .B \-gcode \fIgenetic_code_ID\fR This parameter permits the genetic code used in translating nucleotide query sequences to be changed from its default value of the Standard genetic code (sometimes erroneously called the \*(lqUniversal\*(rq genetic code). See the available list of genetic code identifiers below. \fINote: the \fBC\fR parameter is a synonym for the -gcode parameter\fR. .TP .B \-gi When GenInfo .I gi identifiers are available for the database sequences (in their deflines), this option can be specified to have these identifiers reported in the program output. .TP .B \-hspmax \fImax_hsps_per_dbseq\fR This option can be used to limit the number of \s-1HSP\s0s reported per database sequence. The default limit is 1000, which is ample leeway for most searches. Notable exceptions are when long query sequences are used (\fIe.g.\fR, an entire cosmid) and numerous repetitive or low-complexity (low-entropy) regions exist in the query and database sequences. .TP .B \-matrix \fImatrixfile\fR This option is used to specify the name of a file containing an alternate or user-defined scoring matrix. 
Most of the programs will accept only one .B \-matrix option at a time, but .BR blastp currently accepts as many as eight (8) on a single command line, all of which are used simultaneously during the database search for increased sensitivity. .TP .B \-nwlen \fIlength\fR See .B \-nwstart. .TP .B \-nwstart \fIstart_coord\fR .BR blastp and .BR blastx support this option and the .B \-nwlen option, for restricting .SM BLAST neighborhood word generation to a specific segment of the query sequence that begins at .I start_coord and continues for .I length residues or until the end of the query sequence is reached. .SM HSP alignments may extend outside the region of neighborhood word generation but the alignments can only be initiated by word hits occurring within the region. Through the use of these options, a very long query sequence can be searched piecemeal, using short, overlapping segments each time. The amount of overlap from one neighborhood region to the next need only be the .SM BLAST wordlength .BR W minus 1, in order to be assured of detecting all \s-1HSP\s0s; however, to provide greater freedom for statistical interpretation of multiple \s-1HSP\s0 findings -- \fIe.g.\fR, matches against exons -- more extensive overlapping is recommended, with the extent to be chosen based on the expected gene density and length of introns. .TP .B \-olfraction \fIoverlap_fraction\fR This parameter (with default value of 0.125) allows the user to define the maximum fractional length of an \s-1HSP\s0 that can overlap another \s-1HSP\s0 and still have the two \s-1HSP\s0s be considered to be consistent with one another, for the purpose of evaluation with Karlin-Altschul Sum statistics or Poisson statistics. .TP .B \-outblk This option causes .SM ASN.1 output to be encapsulated in a .SM BLAST0\-Outblk structure. For a description of this structure, see the .SM ASN.1 message specifications accompanying the .SM BLAST program source code. 
.TP .B \-poissonp This option causes Poisson statistics, instead of the default Sum statistics, to be used in assessing the statistical significance of multiple \s-1HSP\s0s. .TP .B \-progress \fIperiod\fR Some network client installations of the \s-1BLAST\s0 programs require a response from the server at least every 90 seconds or so, in order to be assured that the network connection to the server is still alive and that the search is progressing. The default reporting period from the programs is therefore set to the slightly conservative period of 60 seconds, but can be altered using the \fB\-progress\fR option. Setting a period of length 0 will entirely disable the time-based reporting of search progress. Time-based reporting of search progress is indicated in the text form of program output merely by one or more asterisks (*). In the \s-1ASN.1\s0 form of output, a complete job-progress message is sent. .TP .B \-prune This option causes \s-1HSP\s0s that are not involved in achieving statistical significance to be eliminated from the program output. When Sum statistics are used, the pruning is robust; when Poisson statistics are used, some \s-1HSP\s0s may be reported that were not involved in achieving statistical significance. .TP .B \-qoffset \fIoffset\fR This option permits query sequence coordinate numbers to be adjusted by the value of .I offset, through simple addition. This may be useful when a query sequence must be split into short, overlapping segments in order to complete individual searches within a restrictive time period. .TP .B \-qres This option causes the .SM BLAST programs to exit non-zero if the query sequence contains an invalid letter code for the type of query sequence expected (amino acid or nucleic acid). .TP .B \-qtype This option causes the .SM BLAST programs to exit non-zero if the query sequence appears to be of the wrong type (either amino acid or nucleic acid) for the particular program invoked. 
.TP .B \-span This option turns off entirely the feature of detecting and discarding spanned \s-1HSP\s0s. Voluminous output often results from its use. \fINote: this option was previously called\fR \-overlap \fIin the .SM BLAST version 1.3 programs\fR. .TP .B \-span1 This option relaxes the criteria for judging whether an \s-1HSP\s0 spans another, prior to discarding one of them if spanning is detected. With this option, it suffices that either the query segment or the database segment (or both) spans the corresponding segment in the other \s-1HSP\s0, whereas the .BR \-span2 option requires that \fIboth\fR segments be spanned. The \fB\-span1\fR option may be useful in suppressing reports of \s-1HSP\s0s when the query or a database sequence contains internal repeats. \fINote: this option was previously called\fR \-overlap1 \fIin the .SM BLAST version 1.3 programs\fR. .TP .B \-span2 While examining each database sequence, the programs use a greedy algorithm to discard any .SM HSP they find which is spanned from start to end by a previously found .SM HSP. When this option is invoked (the default), an \s-1HSP\s0 is deemed to be \fIspanning\fR another when both the query and database segments from the first \s-1HSP\s0 completely cover the corresponding segments in the other \s-1HSP\s0. When an .SM HSP spans another, the higher scoring one is retained and the lower scoring one is discarded; if their scores are equal, the longer, less information-dense .SM HSP is discarded. \fINote: this option was previously called\fR \-overlap2 \fIin the .SM BLAST version 1.3 programs\fR. .TP .B \-stats Invoking this option causes a slightly trimmer version of the underlying .SM BLAST search engine to be employed that does not spend computer time collecting statistics about neighborhood word hits, word hit extensions, etc. The amount of computer time saved is relatively small, but it may add up to a significant savings during batch processing. 
.TP .B \-sump This option (the default) causes Karlin and Altschul (1993) \*(lqSum\*(rq statistics to be used in assessing the statistical significance of multiple \s-1HSP\s0s. See also \fB\-poissonp\fR. .TP .B \-top Whenever a nucleotide query sequence is used (\fBblastn\fR, \fBblastx\fR and \fBtblastx\fR), both strands or all 6 reading frames are searched by default. The .B \-top and .B \-bottom options may be used to restrict a search to the specified strand or set of 3 reading frames. If both .B \-top and .B \-bottom are specified, both strands will be searched. In the case of the .BR tblastx program, which translates both the query and the database, the .B \-top and .B \-bottom options refer to strands in the query sequence only. See .B \-dbtop and \fB\-dbbottom\fR. .TP .B \-warnings This option turns off the reporting of all .SM WARNING messages. .SH "SORT OPTIONS" .LP The default sort order for reporting database sequences is by increasing probability (P-value). The following sort options are available and may be combined together in the same search: .TP 20 .B \-sort_by_pvalue Sort from most statistically significant (lowest P-value) to least statistically significant (highest P-value), the default sort order. .TP .B \-sort_by_count Sort from highest to lowest by the number of HSPs found for each database sequence. .TP .B \-sort_by_highscore Sort from highest to lowest by the score of the highest scoring HSP for each database sequence. .TP .B \-sort_by_totalscore Sort from highest to lowest by the sum total score of all HSPs for each database sequence. .SH "SCORING SCHEMES" .LP The default scoring matrix used by .BR blastp , .BR blastx , .BR tblastn , and .BR tblastx is the .SM BLOSUM62 matrix (Henikoff and Henikoff, 1992). The \fB\-matrix\fR option can be used to select an alternate scoring matrix file (\fIe.g.,\fR one of the .SM PAM matrices described below). 
In version 1.4, the .B \-matrix option can also be used with .BR blastn to define a scoring matrix, in addition to supporting the traditional .B M and .B N parameters of this program. .LP Several .SM PAM (point accepted mutations per 100 residues) amino acid scoring matrices are provided in the .SM BLAST software distribution, including the .SM PAM40, .SM PAM120, and .SM PAM250. While the .SM BLOSUM62 matrix is a good general purpose scoring matrix and is the default matrix used by the .SM BLAST programs, if one is restricted to using only .SM PAM scoring matrices, then the .SM PAM120 is recommended for general protein similarity searches (Altschul, 1991). The .BR pam(1) program can be used to produce .SM PAM matrices of any desired iteration from 2 to 511. Each matrix is most sensitive at finding similarities at its particular PAM distance. For more thorough searches, particularly when the mutational distance between potential homologs is unknown and the significance of their similarity may be only marginal, Altschul (1991, 1992) recommends performing at least three searches, one each with the .SM PAM40, .SM PAM120 and .SM PAM250 matrices. .LP When multiple scoring matrices are used in searches with the same query sequence, additional degrees of freedom for optimizing alignment scores are available, which reduces each score's statistical significance. The reduction may be by a factor that is as large as the number of matrices employed; however, the potential loss of sensitivity from using a suboptimal matrix is typically much greater, suggesting that the use of multiple matrices remains advantageous (Altschul, 1992). Altschul (1992) has shown that, because .SM PAM matrices are related to one another through a common mutational model and set of initial conditions, statistical significance is reduced by a factor of no more than 4.6 (just over 2 bits of information) regardless of how many .SM PAM matrices are employed. 
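.LP
The worst-case bounds quoted above can be illustrated with a short Python sketch. It is not part of the \s-1BLAST\s0 distribution: the helper names are hypothetical, and multiplying a P-value by the bound is an assumed, deliberately conservative reading of \*(lqreduced by a factor\*(rq, not the method used by the programs.
.nf
```python
import math

# Upper bound quoted in the text for any number of PAM matrices.
PAM_FAMILY_CAP = 4.6


def significance_factor(n_matrices, all_pam=False):
    """Worst-case factor by which significance is reduced (hypothetical helper)."""
    if n_matrices < 1:
        raise ValueError("at least one matrix is required")
    factor = float(n_matrices)
    if all_pam:
        # PAM matrices share one mutational model, so the bound saturates.
        factor = min(factor, PAM_FAMILY_CAP)
    return factor


def adjusted_pvalue(p, n_matrices, all_pam=False):
    """Conservatively inflate a P-value; an assumption, not the BLAST method."""
    return min(1.0, p * significance_factor(n_matrices, all_pam))


if __name__ == "__main__":
    print(significance_factor(3, all_pam=True))   # 3.0
    print(significance_factor(8, all_pam=True))   # 4.6
    print(round(math.log2(PAM_FAMILY_CAP), 2))    # 2.2 bits
```
.fi
.LP
With three \s-1PAM\s0 matrices the bound is simply 3; beyond four matrices it stays at 4.6, which is the \*(lqjust over 2 bits\*(rq mentioned above.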
.LP In .BR blastn , the .BR M parameter sets the reward score for a pair of matching residues; the .BR N parameter sets the penalty score for \fImis\fRmatching residues. .BR M and .BR N must be positive and negative integers, respectively. The relative magnitudes of .BR M and .BR N determine the number of nucleic acid PAMs (point accepted mutations per 100 residues) at which the scoring scheme is most sensitive for finding homologs. Higher ratios of .BR M:N correspond to increasing nucleic acid PAMs (increased divergence). The default values for .BR M and .BR N , respectively 5 and -4, having a ratio of 1.25, correspond to about 47 nucleic acid PAMs, or about 58 amino acid PAMs; an .BR M:N ratio of 1 corresponds to 30 nucleic acid PAMs or 38 amino acid PAMs. At higher than about 40 nucleic acid PAMs, or 50 amino acid PAMs, better sensitivity at detecting similarities between coding regions is expected by performing comparisons at the amino acid level (States \fIet al.\fR, 1991), using conceptually translated nucleotide sequences (re: .BR blastx , .BR tblastn , and .BR tblastx ). .LP Independent of the values chosen for .BR M and .BR N , the default wordlength \fBW\fR=11 used by .BR blastn restricts the program to finding sequences that share at least an 11-mer stretch of 100% identity with the query. Under the random sequence model, stretches of 11 consecutive matching residues are unlikely to occur merely by chance even between only moderately diverged homologs. Thus, .BR blastn with its .I default parameter settings is poorly suited to finding anything but very similar sequences. If better sensitivity is needed, one should use a smaller value for .BR W . .LP For the .BR blastn program, it may be easy to see how multiplying both .BR M and .BR N by some large number will yield proportionally larger alignment scores with their statistical significance remaining unchanged. 
This scale-independence of the statistical significance estimates from .BR blastn has its analog in the scoring matrices used by the other .SM BLAST programs: multiplying all elements in a scoring matrix by an arbitrary factor will proportionally alter the alignment scores but will not alter their statistical significance (assuming numerical precision is maintained). From this it should be clear that raw alignment scores are meaningless without specific knowledge of the scoring matrix that was used. .SH SCORING REQUIREMENTS .LP Regardless of the scoring scheme employed, two stringent criteria must be met in order to be able to calculate the Karlin-Altschul parameters .I Lambda and .I K. First, given the residue composition for the query sequence and the residue composition assumed for the database (Robinson and Robinson, 1991), the alignment score expected for any randomly selected pair of residues (one from the query sequence and one from the database) must be negative. Second, given the sequence residue compositions and the scoring scheme, a positive score must be possible to achieve. For instance, the match reward score of .BR blastn must have a positive value; and given the assumption made by .BR blastn that the 4 nucleotides .SM A, .SM C, .SM G and .SM T are represented at equal 25% frequencies in the database, a wide range of value combinations for .BR M and .BR N are precluded from use -- namely those combinations where the magnitude of the ratio .BR M:N is greater than or equal to 3. .SH "SEQUENCE LENGTH AND STATISTICAL SIGNIFICANCE" .LP For the purpose of calculating significance levels, .BR Y is the effective length of the query sequence and .BR Z is the effective length of the database, both measured in residues. The default values for these parameters are the actual lengths of the query sequence and database, respectively. 
Larger values signify more degrees of freedom for aligning the sequences and reduced statistical significance for an alignment of any given score. To normalize the statistics reported when databases of different lengths are searched, the parameter .BR Z may be set to a constant value for all database searches. Similarly, when querying with sequences of different lengths, the parameter .BR Y can be used to normalize over all searches. .SH "GENETIC CODES" .LP The parameter .BR C can be set to a positive integer to select the genetic code that will be used by .BR blastx and .BR tblastx to translate the query sequence. The .BR \-dbgcode parameter can be used to select an alternate genetic code for translation of the database by the programs .BR tblastn and .BR tblastx . In each case, the default genetic code is the so-called \*(lqStandard\*(rq or \*(lqUniversal\*(rq genetic code. To obtain a listing of the genetic codes available and their associated numerical identifiers, invoke .BR blastx or .BR tblastx with the command line parameter .I C=list. Note: the numerical identifiers used here for genetic codes parallel those defined in the NCBI software Toolbox; hence some numerical values will be skipped as genetic codes are updated. 
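.LP
To illustrate why the choice of genetic code matters, the following Python sketch translates one nucleotide string under identifiers 1 and 2. It is an illustration only: the compact 64-letter tables are the widely circulated forms of the Standard and Vertebrate Mitochondrial codes and are assumed, not confirmed, to agree with the tables compiled into the \s-1BLAST\s0 programs; the function names are hypothetical.
.nf
```python
BASES = "TCAG"

# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Pair each codon, enumerated in TCAG order, with its amino acid."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))


# Keys follow the numerical identifiers used with C and -dbgcode.
TABLES = {1: codon_table(STANDARD), 2: codon_table(VERT_MITO)}


def translate(seq, code_id=1):
    """Translate frame +1 only; codons with ambiguity letters become X."""
    table = TABLES[code_id]
    seq = seq.upper().replace("U", "T")
    return "".join(table.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


if __name__ == "__main__":
    print(translate("ATGTGAATAAGA", 1))  # M*IR
    print(translate("ATGTGAATAAGA", 2))  # MWM*
```
.fi
.LP
The same twelve nucleotides yield a stop after the first residue under code 1 but an open reading frame of three residues under code 2, so searching mitochondrial sequences with the default code can truncate translated alignments.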
.LP The list of genetic codes available and their associated values for the parameters .BR C and .BR \-dbgcode are: .LP .B 1 Standard or Universal .LP .B 2 Vertebrate Mitochondrial .LP .B 3 Yeast Mitochondrial .LP .B 4 Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma .LP .B 5 Invertebrate Mitochondrial .LP .B 6 Ciliate Macronuclear .LP .B 9 Echinodermate Mitochondrial .LP .B 10 Alternative Ciliate Macronuclear .LP .B 11 Eubacterial .LP .B 12 Alternative Yeast .LP .B 13 Ascidian Mitochondrial .LP .B 14 Flatworm Mitochondrial .SH "SUM STATISTICS" .LP Whereas the version 1.3 .SM BLAST programs use Poisson statistics to ascribe significance to multiple \s-1HSP\s0s, the version 1.4 programs retain Poisson statistics as an option, but use Karlin and Altschul (1993) \*(lqSum\*(rq statistics by default instead. Sum statistics tends to rank database matches in a more intuitive order than Poisson statistics and, in many cases, yields markedly increased sensitivity. The Sum P-value for a set of \s-1HSP\s0s is a function of the sum of the information scores of the \s-1HSP\s0s (expressed in bits) and the number of \s-1HSP\s0s in the set. .SH "POISSON STATISTICS" .LP The occurrence of two or more \s-1HSP\s0s involving the query sequence and the same database sequence can be modeled as a Poisson process by specifying the .B \-poissonp option. An important result of applying Poisson statistics is that an .SM HSP having a low score and high Expect value (low statistical significance) may be ascribed a statistically significant Poisson P-value when the .SM HSP appears in the context of additional match(es) of equal or greater score with the same database sequence. .LP The Poisson P-value for any given .SM HSP is a function of its expected frequency of occurrence and the number of \s-1HSP\s0s observed against the same database sequence with scores at least as high. 
The Poisson P-value for a group of \s-1HSP\s0 events is the probability that at least as many \s-1HSP\s0s would occur by chance alone, each with a score at least as high as the lowest-scoring member of the group. \s-1HSP\s0s which appear on opposite strands of a nucleotide query or database sequence are considered to be independent, distinguishable events, and are counted separately. .SH "P-VALUES, ALIGNMENT SCORES, AND INFORMATION" .LP The Expect and P-values reported for \s-1HSP\s0s are dependent on several factors including: the scoring system employed, the residue composition of the query sequence, an assumed residue composition for a typical database sequence (Robinson and Robinson, 1991), the length of the query sequence, and the total length of the database. .SM HSP scores from different program invocations are appropriate for comparison even if the databases searched are of different lengths, as long as the other factors mentioned here do not vary. For example, alignment scores from searches with the default .SM BLOSUM62 matrix should not be directly compared with scores obtained with the .SM PAM120 matrix; and scores produced using two versions of the same .SM PAM matrix, each created to different scales (see above), can not be meaningfully compared without conversion to the same scale. .LP Some isolation from the many factors involved in assessing the statistical significance of \s-1HSP\s0s can be attained by observing the information content reported (in bits) for the alignments. While the information content of an .SM HSP may change when different scoring systems are used (e.g., with different .SM PAM matrices), the number of bits reported for an .SM HSP will at least be independent of the scale to which the scoring matrix was generated. (In practice, this statement is not quite true, because the alignment scores used by the .SM BLAST programs are integers that lack much precision). 
In other words, when conveying the statistical significance of an alignment, the alignment score itself is not useful unless the specific scoring matrix that was employed is also provided, but the .I informativeness of an alignment is a meaningful statistic that can be used to ascribe statistical significance (a P-value) to the match independently of specific knowledge about the scoring matrix. .SH "GOVERNING OUTPUT" .LP .SM BLAST program output is organized into three independently governed sections: a histogram of the statistical significance of the matches found; one-line descriptions of the database sequences that satisfied the statistical significance threshold (\fBE\fR parameter); and the high-scoring segment pairs themselves. Each section of the output can be selectively suppressed by setting the parameters .BR H, .BR V, and .BR B to 0 (zero). .LP The .BR H parameter regulates the display of a histogram of the expected frequency of chance occurrence of the database matches found. If .BR H is assigned a non-zero value, a histogram will be displayed. The default value for .BR H is 0 (no histogram displayed). .LP Parameter .BR V is the maximum number of database sequences for which one-line descriptions will be reported. The default value for .BR V is 500. A bold warning message is displayed at the end of the one-line descriptions section when more than .BR V sequences yield \s-1HSP\s0s satisfying the significance threshold. When .BR V is zero, no one-line descriptions are reported and no warning is given. Negative values for .BR V are undefined and disallowed. .LP As an example of how .BR V can be used advantageously, if a high value for .BR E is desired to virtually assure in all cases that at least one \s-1HSP\s0 will be found, selecting a small value for .BR V will ensure that the output will not be overly voluminous; only the most statistically significant matches will be reported. 
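.LP
The effect of .B V on the one-line descriptions can be sketched in Python as a sort followed by a truncation. This is an illustration under assumptions, not the programs' own code: the function name and the (identifier, P-value) tuples are hypothetical.
.nf
```python
def limit_descriptions(matches, v):
    """Keep the v most significant matches.

    matches is a list of (identifier, p_value) tuples; the return value is
    (reported, n_omitted), where n_omitted is the count a warning would cite.
    """
    if v < 0:
        raise ValueError("negative values for V are disallowed")
    ranked = sorted(matches, key=lambda m: m[1])  # lowest P-value first
    reported = ranked[:v]
    return reported, len(ranked) - len(reported)


if __name__ == "__main__":
    # 95 hypothetical matches satisfying E, reported with V=15.
    matches = [("seq%d" % i, i * 1e-3) for i in range(95)]
    reported, omitted = limit_descriptions(matches, 15)
    print(len(reported), omitted)  # 15 80
```
.fi
.LP
The figures mirror the sample output later in this document, where 95 database sequences satisfied E, V was 15, and the warning reports that descriptions of 80 sequences were omitted.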
.LP Parameter .BR B regulates the display of the high-scoring segment pairs (alignments). For positive values, .BR B is the maximum number of .I "database sequences" for which high-scoring segment pairs will be reported. This may be much smaller than the actual number of high-scoring segment pairs reported, since any given database sequence may yield several HSPs. The default value for .BR B is 250. Negative values for .BR B are undefined and disallowed. .SH "ENVIRONMENT VARIABLES" .LP The environment variables .SM BLASTDB, .SM BLASTMAT, .SM BLASTFILTER, and .SM BLASTCDI may be set by the user to override the default directories in which the programs look to find database files, scoring matrix files, filtering programs, and codon usage information files, respectively. The default directories are /usr/ncbi/blast/db, /usr/ncbi/blast/matrix, /usr/ncbi/blast/filter, and /usr/ncbi/blast/cdi. .SH "SUPPORT UTILITIES" .LP Databases to be searched by the .SM BLAST programs must first be formatted by the .BR setdb program for protein sequence databases (re: \fBblastp\fR and \fBblastx\fR) or the .BR pressdb program for nucleotide sequence databases (re: \fBblastn\fR, \fBtblastn\fR and \fBtblastx\fR). The input database files read by .BR setdb and .BR pressdb must be in \s-1FASTA\s0/Pearson format. For each input file, three output files are created for searching by the .SM BLAST programs. .LP Point accepted mutation (\s-1PAM\s0) matrices of various generations can be produced automatically with the .BR pam program. The output can be saved in a file whose name can then be specified with the \fB\-matrix\fR \fIfilename\fR option of a .BR blastp, .BR blastx, .BR tblastn, or .BR tblastx query. .SH SAMPLE OUTPUT .LP The BLAST programs all provide information in roughly the same format. 
First comes (A) an introduction to the program; (B) a histogram of expectations (see above) if one was requested; (C) a series of one-line descriptions of matching database sequences; (D) the actual sequence alignments; and finally the parameters and other statistics gathered during the search. .LP Sample .BR blastp output from comparing .I pir|A01243|DXCH against the .SM SWISS-PROT database is presented below. .SS "A. Program Introduction" The introductory output provides the program name (\fB\s-1BLASTP\s0\fR in this case), the version number (1.4.6MP in this case), the date the program source code last changed substantially (June 13, 1994), the date the program was built (Sept. 22, 1994), and a description of the query sequence and database to be searched. These may all be important pieces of information if a bug is suspected or if reproducibility of results is important. .LP The "Searching..." indicator indicates progress that the program made in searching the database. A complete database search will yield 50 periods (.), or one period per database sequence, whichever number is smaller. When searching a database consisting of 50 sequences or more, if fewer than 50 periods are displayed and the program aborted for some reason, dividing the number of periods by 0.5 will yield the approximate percentage (0-100%) of the database that was searched before the program died. If the program had difficulty making progress through the database, one or more asterisks (*) may be interspersed between the periods at one-minute intervals. .SS "B. Histogram of Expectations" Shown in the output below is a histogram of the lowest (most significant) Expect values obtained with each database sequence. This information is useful in determining the numbers of database sequences that achieved a particular level of statistical significance. It indicates the number of database matches that would be reportable at various settings for the expectation threshold (\fBE\fR parameter). .SS "C. 
One-line Summaries" The one-line sequence descriptions and summaries of results are useful for identifying biologically interesting database matches and correlating this interest with the statistical significance estimates. Unless otherwise requested, the database sequences are sorted by increasing P-value (probability). Identifiers for the database sequences appear in the first column; then come brief descriptions of each sequence, which may need to be truncated in order to fit in the available space. The \*(lqHigh Score\*(rq column contains the score of the highest-scoring .SM HSP found with each database sequence. The \*(lqP(N)\*(rq column contains the lowest P-value ascribed to any set of \s-1HSP\s0s for each database sequence; and the \*(lqN\*(rq column displays the number of \s-1HSP\s0s in the set which was ascribed the lowest P-value. The P-values are a function of N, as used in Karlin-Altschul \*(lqSum\*(rq statistics or Poisson statistics, to treat situations where multiple \s-1HSP\s0s are found. It should be noted that the highest-scoring .SM HSP whose score is reported in the \*(lqHigh Score\*(rq column is not necessarily a member of the set of \s-1HSP\s0s which yields the lowest P-value; the highest-scoring .SM HSP may be excluded from this set on the basis of consistency rules governing the grouping of \s-1HSP\s0s (see the .B \-consistency option). Numbers of the form \*(lq7.7e-160\*(rq are in scientific notation. In this particular example, the number being represented is 7.7 times 10 to the minus 160th power, which is astronomically close to zero. .SS "D. Alignments" Alignments found with the .SM BLAST algorithm are ungapped. 
Several statistics are used to describe each \s-1HSP\s0: the raw alignment Score; the raw score converted to bits of information by multiplying by \fILambda\fR and dividing by the natural logarithm of 2 (see the Statistics output); the number of times one might Expect to see such a match (or a better one) merely by chance; the P-value (probability in the range 0-1) of observing such a match; the number and fraction of total residues in the .SM HSP which are identical; the number and fraction of residues for which the alignment scores have positive values. When Sum statistics have been used to calculate the Expect and P-values, the P-value is qualified with the word \*(lqSum\*(rq and the N parameter used in the Sum statistics is provided in parentheses to indicate the number of \s-1HSP\s0s in the set; when Poisson statistics have been used to calculate the Expect and P-values, the P-value is qualified with the word \*(lqPoisson\*(rq. Between the two lines of Query and Subject (database) sequence is a line indicating the specific residues which are identical, as well as those which are non-identical but nevertheless have positive alignment scores defined in the scoring matrix that was used (the .SM BLOSUM62 matrix in this case). Identical letters or residues, when paired with each other, are not highlighted if their alignment score is negative or zero. Examples of this would be an .SM X juxtaposed with an .SM X in two amino acid sequences, or an .SM N juxtaposed with another .SM N in two nucleotide sequences. Such ambiguous residue-residue pairings may be uninformative and thus lend no support to the overall alignment being either real or random; however, the informativeness of these pairings is left up to the user of the .SM BLAST programs to decide, because any values desired can be specified in a scoring matrix of the user's own making. .de FS .IN 0 .nf .ls 1 .ll 7.5i .sz 9 .cs R 22 .ss 22 .lg 0 .(b .. .de FE .)b .fi .cs R .ss 11 .sz 12 .lg 1 .IN 0.25i .. 
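.LP
The bit scores printed in the sample output are consistent with the relation bits = \fILambda\fR x Score / ln 2, where \fILambda\fR is expressed in natural-logarithm units. The Python sketch below checks this against four \s-1HSP\s0s from the sample; the function name is hypothetical, and because the reported \fILambda\fR (0.316) is rounded to three digits, very large scores may differ from the printed bits in the last digit.
.nf
```python
import math


def bits(raw_score, lam):
    """Convert a raw alignment score to bits, given Lambda in nats."""
    return lam * raw_score / math.log(2)


if __name__ == "__main__":
    lam = 0.316  # Lambda reported for BLOSUM62 in the sample Parameters
    for score in (176, 165, 144, 61):
        print(score, round(bits(score, lam), 1))
    # 176 80.2
    # 165 75.2
    # 144 65.6
    # 61 27.8
```
.fi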
.LP .FS BLASTP 1.4.6MP [13-Jun-94] [Build 13:58:36 Sep 22 1994] Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10. Query= pir|A01243|DXCH 232 Gene X protein - Chicken (fragment) (232 letters) Database: SWISS-PROT Release 29.0 38,303 sequences; 13,464,008 total letters. Searching..................................................done Observed Numbers of Database Sequences Satisfying Various EXPECTation Thresholds (E parameter values) Histogram units: = 31 Sequences : less than 31 sequences EXPECTation Threshold (E parameter) | V Observed Counts--> 10000 4863 1861 |============================================================ 6310 3002 782 |========================= 3980 2220 812 |========================== 2510 1408 303 |========= 1580 1105 393 |============ 1000 712 179 |===== 631 533 161 |===== 398 372 80 |== 251 292 73 |== 158 219 50 |= 100 169 32 |= 63.1 137 18 |: 39.8 119 9 |: 25.1 110 6 |: 15.8 104 9 |: >>>>>>>>>>>>>>>>>>>>> Expect = 10.0, Observed = 95 <<<<<<<<<<<<<<<<< 10.0 95 4 |: 6.31 91 3 |: 3.98 88 1 |: 2.51 87 3 |: 1.58 84 0 | 1.00 84 2 |: Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED) (... 1191 7.7e-160 1 sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED). 949 7.0e-127 1 sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN). 645 3.4e-100 2 sp|P19104|OVAL_COTJA OVALBUMIN. 626 1.2e-96 2 sp|P05619|ILEU_HORSE LEUKOCYTE ELASTASE INHIBITOR (LEI). 216 3.7e-71 3 sp|P80229|ILEU_PIG LEUKOCYTE ELASTASE INHIBITOR (LEI) (... 325 4.0e-71 2 sp|P29508|SCCA_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN (SCC... 439 3.5e-70 2 sp|P30740|ILEU_HUMAN LEUKOCYTE ELASTASE INHIBITOR (LEI) (... 211 1.3e-66 3 sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, P... 176 1.8e-65 4 sp|P35237|PTI_HUMAN PLACENTAL THROMBIN INHIBITOR. 
473 1.3e-61 1 sp|P29524|PAI2_RAT PLASMINOGEN ACTIVATOR INHIBITOR-2, T... 183 9.4e-61 4 sp|P12388|PAI2_MOUSE PLASMINOGEN ACTIVATOR INHIBITOR-2, M... 179 1.8e-60 4 sp|P36952|MASP_HUMAN MASPIN PRECURSOR. 198 2.6e-58 4 sp|P32261|ANT3_MOUSE ANTITHROMBIN-III PRECURSOR (ATIII). 142 4.0e-48 5 sp|P01008|ANT3_HUMAN ANTITHROMBIN-III PRECURSOR (ATIII). 122 7.5e-48 5 WARNING: Descriptions of 80 database sequences were not reported due to the limiting value of parameter V = 15. ... alignments with the top 8 database sequences deleted ... >sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, PLACENTAL (PAI-2) (MONOCYTE ARG- SERPIN). Length = 415 Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65 Identities = 38/89 (42%), Positives = 50/89 (56%) Query: 1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60 +I +LL S D DT +VLVNA+YFKG WKT F + PF V + PVQMM + Sbjct: 180 KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSAQRTPVQMMYLRE 239 Query: 61 SFNVATLPAEKMKILELPFASGDLSMLVL 89 N+ + K +ILELP+A L+L Sbjct: 240 KLNIGYIEDLKAQILELPYAGDVSMFLLL 268 Score = 165 (75.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65 Identities = 33/78 (42%), Positives = 47/78 (60%) Query: 155 ANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFL 214 AN +G+S L +S+ H A ++++E+G E A TG + + QF ADHPFLFL Sbjct: 338 ANFSGMSERNDLFLSEVFHQAMVDVNEEGTEAAAGTGGVMTGRTGHGGPQFVADHPFLFL 397 Query: 215 IKHNPTNTIVYFGRYWSP 232 I H T I++FGR+ SP Sbjct: 398 IMHKITKCILFFGRFCSP 415 Score = 144 (65.6 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65 Identities = 26/62 (41%), Positives = 41/62 (66%) Query: 90 LPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTD 149 + D + LE +E I ++KL +WT+ + M + V+VY+PQ K+EE Y L S+L ++GM D Sbjct: 272 IADVSTGLELLESEITYDKLNKWTSKDKMAEDEVEVYIPQFKLEEHYELRSILRSMGMED 331 Query: 150 LF 151 F Sbjct: 332 AF 333 Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65 Identities = 10/17 (58%), Positives = 16/17 (94%) Query: 81 SGDLSMLVLLPDEVSDL 97 +GD+SM +LLPDE++D+ Sbjct: 259 
AGDVSMFLLLPDEIADV 275 WARNING: HSPs involving 86 database sequences were not reported due to the limiting value of parameter B = 9. Parameters: V=15 B=9 H=1 -ctxfactor=1.00 E=10 Query ----- As Used ----- ----- Computed ---- Frame MatID Matrix name Lambda K H Lambda K H +0 0 BLOSUM62 0.316 0.132 0.370 same same same Query Frame MatID Length Eff.Length E S W T X E2 S2 +0 0 232 232 10. 57 3 11 22 0.22 33 Statistics: Query Expected Observed HSPs HSPs Frame MatID High Score High Score Reportable Reported +0 0 62 (28.2 bits) 1191 (542.5 bits) 330 24 Query Neighborhd Word Excluded Failed Successful Overlaps Frame MatID Words Hits Hits Extensions Extensions Excluded +0 0 4988 5661199 1146395 4504598 10187 13 Database: SWISS-PROT Release 29.0 Release date: June 1994 Posted date: 1:29 PM EDT Jul 28, 1994 # of letters in database: 13,464,008 # of sequences in database: 38,303 # of database sequences satisfying E: 95 No. of states in DFA: 561 (55 KB) Total size of DFA: 110 KB (128 KB) Time to generate neighborhood: 0.03u 0.01s 0.04t Real: 00:00:00 No. of processors used: 8 Time to search database: 32.27u 0.78s 33.05t Real: 00:00:04 Total cpu time: 32.33u 0.91s 33.24t Real: 00:00:05 WARNINGS ISSUED: 2 .FE .SH BUGS .LP The statistics are not fully worked out yet for .BR blastp when multiple .B \-matrix options are specified in a single command. .LP .BR blastn by default uses a large value of 11 for the wordlength, .BR W , which severely reduces the program's sensitivity but provides for high speed searches. Consequently, the program with its default parameter values is well suited to finding nearly identical sequences rapidly, but poorly suited to finding moderately- or distantly-related sequences. 
The value for .BR W may be reduced to increase the sensitivity (at the expense of speed), but to identify weak similarity between coding regions, greater sensitivity is obtained by comparing translation products (States \fIet al.\fR, 1991); one should use .BR blastx , .BR tblastn , or .BR tblastx . .BR blastn is poorly suited to characterizing PCR primers. .LP In the protein-comparing programs .BR blastp , .BR blastx , .BR tblastn , and .BR tblastx , .I ad hoc equations are used to calculate a default value for the neighborhood word score threshold .BR T when the word length .BR W has a value of 3 (the default) or 4. Equations have not been implemented for calculating a default value of .BR T when .BR W has any value other than 3 or 4. .LP When nucleotide sequences are compressed into searchable form by the .BR pressdb program, any .SM IUPAC ambiguity letters are replaced by an appropriate random selection from the list .SM A, .SM C, .SM G and .SM T. For example, an .SM R (purine) would be replaced on the average half of the time by an .SM A (adenosine) and the remainder of the time by a \s-1G\s0 (guanosine). If the original database in \s-1FASTA\s0 format is not accessible to the .BR blastn , .BR tblastn or .BR tblastx programs at the time of a search, the original locations and identities of the ambiguity codes can not be determined from the compressed sequences and the alignments and alignment scores may be in error with respect to the original sequences. .LP .BR tblastn and .BR tblastx use only one genetic code to translate the entire nucleotide sequence database, although the code that is used is selectable via the .BR \-dbgcode option. .LP .BR blastn , .BR blastx , .BR tblastn , and .BR tblastx treat .SM U and .SM T residues in nucleotide sequences as being the same residue (\fIi.e.\fR, they match perfectly or translate in exactly the same manner). 
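.LP
Two of the conventions described above, the random replacement of ambiguity letters during database compression and the equivalence of .SM U and .SM T, can be sketched in Python as follows. This is an illustration only, not the code of .BR pressdb : the function name is hypothetical, and choosing uniformly among the bases an ambiguity letter stands for is an assumption that matches the stated 50/50 behaviour for .SM R but is not confirmed for the other letters.
.nf
```python
import random

# IUPAC nucleotide ambiguity letters and the bases each one stands for.
IUPAC = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve(seq, rng=None):
    """Return seq with U read as T and each ambiguity letter replaced
    by one of its member bases, chosen uniformly at random (assumed)."""
    rng = rng or random.Random()
    out = []
    for ch in seq.upper():
        if ch == "U":
            ch = "T"
        out.append(rng.choice(IUPAC[ch]) if ch in IUPAC else ch)
    return "".join(out)


if __name__ == "__main__":
    # A fixed seed makes the illustration repeatable.
    print(resolve("ACGURYN", random.Random(0)))
```
.fi
.LP
Because the substitution is random and irreversible, scores computed against the compressed form can differ from those for the original sequence, which is the discrepancy the text warns about.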
.LP The amino acid alphabet used by the .SM BLAST programs consists of the .SM IUB and .SM IUPAC amino acid codes (\s-1ABCDEFGHIKLMNPQRSTVWXYZ\s0), plus asterisk (*) and hyphen (-). An asterisk signifies a stop codon, and a hyphen signifies a gap of indeterminate length through which \s-1BLAST\s0 alignments are never permitted to extend. Any letter that is not a member of this alphabet will be stripped from an amino acid query sequence on input and will not contribute to the query sequence coordinate numbers displayed in program output. In protein sequence databases that are processed into searchable form by the .BR setdb program, any non-alphabetic characters are also stripped. .LP The nucleotide alphabet used by the .SM BLAST programs consists of the .SM IUB and .SM IUPAC nucleotide codes (\s-1ACGTRYMKWSBDHVNU\s0), plus hyphen (-) to signify a gap of indeterminate length. \s-1U\s0 (uracil) is treated like a \s-1T\s0 (thymine). When non-alphabetical codes appear in the \s-1FASTA\s0-format input database to the .BR pressdb program, the program complains about their appearance and then halts with a non-zero exit status. .LP Unlike its version 1.3 predecessor, .BR blastn version 1.4 can employ a concept of partial matching, such as might be used when two \fIR\fRs (purines) are aligned with each other. When the .BR blastn scoring system is defined using the .BR M and .BR N parameters, the scoring matrix constructed by the program accounts for partial matching of nucleotide ambiguity codes. If the .BR \-matrix option is used instead, the user has complete freedom to decide how to score alignments involving ambiguity codes. .LP When the Sum and Poisson statistics are calculated, some \s-1HSP\s0s may be inconsistent or incompatible with one another in the same gapped alignment, and yet the programs will count them as independent, consistent events, leading to false positives being reported in the output. See the .BR \-olfraction option.
(However, \s-1HSP\s0s appearing on opposite strands of the query or database sequence, or in reading frames on opposite strands, are considered separately in all cases.) .LP The nucleotide composition of a .BR blastn query sequence is irrelevant to the values reported for the Karlin-Altschul .I Lambda and .I K parameters. This is due to the equi-probable 0.25/0.25/0.25/0.25 A/C/G/T residue distribution assumed by .BR blastn for the database sequences. The values of the Karlin-Altschul parameters are still affected by the scoring system employed (defined by the parameters .BR M and .BR N , or the \fB\-matrix\fR option). .LP On multiprocessor computing platforms, .BR blastn restricts itself by default to using a maximum of 4 processors, due to the long start-up time per processor relative to the brief processor time required for the searches themselves when the default wordlength of 11 is used. If desired, more than 4 processors can be recruited for the search using the .BR P command line option. .SH "SEE ALSO" .BR blast3 (1). .br .SH COPYRIGHT This work is in the public domain. .SH AUTHOR Warren Gish, gish *AT* watson.wustl.edu .SH REFERENCES .LP Altschul, Stephen F. (1991). .I Amino acid substitution .I matrices from an .I information theoretic .I perspective. J. Mol. Biol. \fB219\fR:555\-65. .LP Altschul, S. F. (1993). .I A protein alignment .I scoring system sensitive .I at all evolutionary distances. J. Mol. Evol. \fB36\fR:290\-300. .LP Altschul, S. F., M. S. Boguski, W. Gish and J. C. Wootton (1994). .I Issues in searching .I molecular sequence databases. Nature Genetics \fB6\fR:119\-129. .LP Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). .I Basic local alignment search tool. J. Mol. Biol. \fB215\fR:403\-10. .LP Claverie, J.-M. and D. J. States (1993). .I Information enhancement methods .I for large scale sequence analysis. Computers in Chemistry \fB17\fR:191\-201. .LP Gish, W. and D. J. States (1993).
.I Identification of .I protein coding .I regions by database .I similarity search. Nature Genetics \fB3\fR:266\-72. .LP Henikoff, Steven and Jorja G. Henikoff (1992). .I Amino acid substitution .I matrices from protein blocks. Proc. Natl. Acad. Sci. USA \fB89\fR:10915\-19. .LP Karlin, Samuel and Stephen F. Altschul (1990). .I Methods for assessing the statistical .I significance of molecular .I sequence features by using .I general scoring schemes. Proc. Natl. Acad. Sci. USA \fB87\fR:2264\-68. .LP Karlin, Samuel and Stephen F. Altschul (1993). .I Applications and statistics .I for multiple high-scoring segments .I in molecular sequences. Proc. Natl. Acad. Sci. USA \fB90\fR:5873\-7. .LP Robinson, Arthur B. and Laurelee R. Robinson (1991). .I Distribution of glutamine .I and asparagine residues and .I their near neighbors .I in peptides and proteins. Proc. Natl. Acad. Sci. USA \fB88\fR:8880\-4. .LP States, D. J. and W. Gish (1994). .I Combined use of .I sequence similarity .I and codon bias for .I coding region identification. J. Comput. Biol. \fB1\fR:39\-50. .LP States, D. J., W. Gish and S. F. Altschul (1991). .I Improved sensitivity .I of nucleic acid database .I similarity searches using .I application specific scoring .I matrices. Methods: A companion to Methods in Enzymology \fB3\fR:66\-70. .LP Wootton, J. C. and S. Federhen (1993). .I Statistics of local complexity .I in amino acid sequences .I and sequence databases. Computers in Chemistry \fB17\fR:149\-163.