Format of BLAST binary files for NUCLEOTIDE SEQUENCES
To avoid error messages when converting the original FASTA file into
binary database files, make sure that all lines of a sequence have
the same length, except for the last line of each sequence, which
may be shorter. Alternatively, each sequence can be stored on a single
line with no breaks at all.
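The line-length rule above can be checked before running PRESSDB. The following is a minimal sketch, not part of PRESSDB itself; the function name `check_fasta_layout` is hypothetical, and it assumes that "same length" applies across the whole file, since a single line length is recorded for the database.

```python
def check_fasta_layout(lines):
    """Return True if the FASTA lines follow the layout rule described above.

    Every sequence line must have one common length, except that the last
    line of each sequence may be shorter. A file where every sequence sits
    on a single line is also accepted.
    """
    records = []  # one list of sequence-line lengths per FASTA record
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            records.append([])            # a header opens a new record
        elif line and records:
            records[-1].append(len(line))
    # Lengths of all lines that are NOT the last line of their record.
    full = {n for rec in records for n in rec[:-1]}
    if len(full) > 1:
        return False                      # inner lines disagree on length
    if not full:
        return True                       # every sequence is on one line
    width = full.pop()
    # The last line of each record may be shorter, but never longer.
    return all(rec[-1] <= width for rec in records if rec)
```

For a file on disk, the function can be fed an open text file directly, for example `check_fasta_layout(open("NAME"))`, because it only iterates over lines.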
If the original FASTA text file with nucleotide sequences is named NAME,
the program PRESSDB produces the following three binary files:
-
NAME.nhd -- all sequence headers, written one after another without any
special encoding or compression. Each header starts with the symbol '>'
and contains no other '>', so that, if necessary, the sequences can be
counted by the number of '>' symbols in the header file.
-
NAME.csq -- nucleotide sequences written one after another in compressed
form. Sequences are separated by a "magic byte" which has value 0x78,
that is '-'. Four nucleotides are stored in one byte, that is, two bits
are used per nucleotide. The correspondence between nucleotides and
their binary codes is as follows.
Nucleotide      Encoded as      Nucleotide              Encoded as
'A'             0 (binary 00)   'W' ('A'|'T')           8
'C'             1 (binary 01)   'S' ('C'|'G')           9
'G'             2 (binary 10)   'B' ('T'|'G'|'C')       10
'T'|'U'         3 (binary 11)   'D' ('A'|'T'|'G')       11
'R' ('A'|'G')   4               'H' ('A'|'T'|'C')       12
'Y' ('C'|'T')   5               'V' ('A'|'G'|'C')       13
'M' ('A'|'C')   6               'N' ('A'|'T'|'G'|'C')   14
'K' ('G'|'T')   7               '-'                     15
Note that, unlike the similar amino acid file, NAME.csq contains sequences
that are on average longer than the true sequences in the original FASTA
file. This is because, when the sequence length is not divisible by 4,
a few bits in the last byte of the compressed sequence are inevitably
unoccupied, and there is no way to distinguish these bits from encoded 'A'
nucleotides.
Important: degenerate symbols (such as 'N' -- any nucleotide,
'R' -- a purine, or 'W' -- a weak-bonding nucleotide) in the compressed
nucleotide sequences are randomly substituted with non-degenerate symbols
from the corresponding set of nucleotides; the only way to get the
original nucleotide sequence is to go back to the original FASTA file.
Finally,
-
NAME.ntb -- a table with the absolute file addresses of the compressed
sequences in file NAME.csq and of the sequence headers in file NAME.nhd.
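The two-bit packing used in NAME.csq, the random substitution of degenerate symbols, and the trailing-bits ambiguity can be illustrated with a short sketch. It is not the PRESSDB code: the names `pack` and `unpack` are hypothetical, the order of nucleotides within a byte (first nucleotide in the highest two bits) is an assumption that the description does not state, and the '-' symbol and the separating magic byte are left out.

```python
import random

# Two-bit codes of the non-degenerate nucleotides (from the table above).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}
# Degenerate symbols and the nucleotides each one stands for.
DEGENERATE = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "CG",
    "B": "TGC", "D": "ATG", "H": "ATC", "V": "AGC", "N": "ATGC",
}
LETTERS = "ACGT"


def pack(seq, rng=random):
    """Pack a nucleotide string into bytes, four nucleotides per byte.

    ASSUMPTION: the first nucleotide of each group goes into the two
    highest bits of the byte.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for j in range(4):
            code = 0                      # unused bit pairs stay 00, like 'A'
            if j < len(chunk):
                base = chunk[j]
                if base in DEGENERATE:    # random non-degenerate replacement
                    base = rng.choice(DEGENERATE[base])
                code = CODES[base]
            byte = (byte << 2) | code
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Decode packed bytes; always yields a multiple of four letters."""
    return "".join(LETTERS[(b >> shift) & 3]
                   for b in data for shift in (6, 4, 2, 0))
```

Under these assumptions a 6-nucleotide sequence occupies two bytes and decodes to 8 letters, the last two being indistinguishable from real 'A' nucleotides. This is why the true sequence length has to come from elsewhere, and why a degenerate symbol cannot be recovered from the compressed file.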
Format of a NAME.ntb file
-
32-bit int -- database type. For a nucleotide database this number should be set to 0x788325f8.
-
32-bit int -- database format. For a nucleotide database this should be set to 6.
-
32-bit int -- the number of bytes in the title of the database (N).
-
N * (8-bit byte) -- database title.
-
if (N % 4) is not equal to 0, skip (4 - N % 4) bytes.
-
32-bit int -- the line length in the original FASTA file (Important:
all sequences in the original FASTA file MUST be formatted in the same
way, otherwise PRESSDB will not work).
-
32-bit int -- the number of sequences in database (M).
-
32-bit int -- the length of the longest sequence in the database.
-
32-bit int -- the number of characters in the database BEFORE sequences are COMPRESSED.
-
32-bit int -- the number of characters in the COMPRESSED database.
-
32-bit int -- the number of "overrepresented" nucleotide 8-mers (N8).
If this number is non-zero, corresponding 8-mers will not be used by BLAST
for searches in this database.
-
skip (N8 * 4) bytes (or read overrepresented 8-mers if required).
-
for (i = 1; i <= M; i++) {
    read 32-bit int (= the absolute position of the beginning
    of the ith compressed sequence in file NAME.csq)
};
-
for (i = 1; i <= M; i++) {
    read 32-bit int (= the absolute position of the beginning
    of the ith sequence in the original FASTA file)
};
-
for (i = 1; i <= M; i++) {
    read 32-bit int (= the absolute position of the beginning
    of the ith sequence header in file NAME.nhd)
};
-
Byte array of size (M/8 + 1), where the ith bit is 1 if the ith
sequence has a degenerate symbol, and 0 otherwise.
-
end of file.
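The field list above translates almost directly into a reader. The sketch below is only an illustration under stated assumptions: the function name `read_ntb` and the dictionary keys are hypothetical, the integers are read as unsigned, and the byte order is not given in this description, so it is a parameter (big-endian by default, which is a guess). The bit order inside the degenerate-symbol array is not specified either, so that array is returned as raw bytes.

```python
import struct


def read_ntb(data, byteorder=">"):
    """Parse the raw bytes of a NAME.ntb file into a dict (sketch).

    byteorder is a struct prefix: ">" big-endian (ASSUMED default),
    "<" little-endian.
    """
    pos = 0

    def ints(count):
        # Read `count` consecutive unsigned 32-bit integers.
        nonlocal pos
        values = struct.unpack_from(f"{byteorder}{count}I", data, pos)
        pos += 4 * count
        return values

    db_type, db_format, title_len = ints(3)
    title = data[pos:pos + title_len]
    pos += title_len
    if title_len % 4:                     # pad the title to a 4-byte boundary
        pos += 4 - title_len % 4
    line_len, num_seqs, longest, raw_chars, packed_chars, n8 = ints(6)
    overrepresented = ints(n8)            # N8 * 4 bytes of 8-mers
    csq_offsets = ints(num_seqs)          # positions in NAME.csq
    fasta_offsets = ints(num_seqs)        # positions in the original FASTA
    nhd_offsets = ints(num_seqs)          # positions in NAME.nhd
    degenerate_flags = data[pos:pos + num_seqs // 8 + 1]
    return {
        "db_type": db_type, "db_format": db_format, "title": title,
        "line_len": line_len, "num_seqs": num_seqs, "longest": longest,
        "raw_chars": raw_chars, "packed_chars": packed_chars,
        "overrepresented": overrepresented, "csq_offsets": csq_offsets,
        "fasta_offsets": fasta_offsets, "nhd_offsets": nhd_offsets,
        "degenerate_flags": degenerate_flags,
    }
```

A typical call would be `read_ntb(open("NAME.ntb", "rb").read())`. Checking that `db_type` equals 0x788325f8 is a quick way to test whether the assumed byte order is right; if it is not, trying `byteorder="<"` is the obvious next step.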
Recorded by Andrey Rzhetsky 9/25/97