Format of BLAST binary files for AMINO ACID SEQUENCES
To avoid error messages when converting the original FASTA file into
database binary files, make sure that all sequence lines have the same
length, except for the last line of each sequence, which may be shorter.
Alternatively, each sequence can be stored on a single line with no breaks.
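The line-length rule above can be checked before running SETDB. The following is a minimal sketch, not part of SETDB itself; the function name `check_fasta_line_widths` is hypothetical, and it assumes the rule applies per sequence (each record may use its own line width).

```python
def check_fasta_line_widths(lines):
    """Return (1-based line number, header) pairs for sequence lines that
    break the rule: within one record, every line except the last must have
    the same length as the record's first sequence line.

    Assumption: the width is checked per record, not across the whole file.
    """
    problems = []
    header = None
    width = None        # length of the first sequence line in this record
    short_seen = False  # a shorter line appeared; it must be the last one
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: reset the per-record state.
            header, width, short_seen = line, None, False
            continue
        if not line:
            continue  # ignore blank lines
        if width is None:
            width = len(line)
        elif short_seen or len(line) > width:
            problems.append((number, header))
        if len(line) < width:
            short_seen = True
    return problems
```

Pass it any iterable of text lines, for example an open file object. An empty result means every record is either evenly wrapped with a possibly shorter last line, or stored on a single line.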
If the original FASTA text file with amino acid sequences is named NAME,
the program SETDB produces 3 additional binary files:
- NAME.ahd -- all sequence headers written one after another without any
  special encoding or compression. Each header starts with the symbol '>',
  and there is NO other '>' within a header, so that, if necessary, the
  sequences can be counted by the number of '>' symbols in the header file.
- NAME.bsq -- protein sequences written one after another in encoded form,
  one byte per amino acid. Sequences are separated by a "magic byte" that
  encodes '-', which is code 0 in the table below. The correspondence
  between amino acids and their numeric codes is as follows.
Numerical codes for amino acids in a *.bsq file

| Amino acid | Binary code | Amino acid      | Binary code |
|------------|-------------|-----------------|-------------|
| '-'        | 0           | 'K'             | 12          |
| 'A'        | 1           | 'M'             | 13          |
| 'R'        | 2           | 'F'             | 14          |
| 'N'        | 3           | 'P'             | 15          |
| 'D'        | 4           | 'S'             | 16          |
| 'C'        | 5           | 'T'             | 17          |
| 'Q'        | 6           | 'W'             | 18          |
| 'E'        | 7           | 'Y'             | 19          |
| 'G'        | 8           | 'V'             | 20          |
| 'H'        | 9           | 'B' (D or N)    | 21          |
| 'I'        | 10          | 'Z' (E or Q)    | 22          |
| 'L'        | 11          | 'X' (any a.a.)  | 23          |
|            |             | '*'             | 24          |
- NAME.atb -- table with the absolute file addresses of the sequences in
  file NAME.bsq and of the sequence headers in file NAME.ahd. The
  difference between two adjacent header offsets is used to compute the
  length of each header line.
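As an illustration of the encoding used in NAME.bsq, here is a minimal sketch that converts between residue letters and the numeric codes from the table. The function names are hypothetical, and the sketch assumes the separator byte is the table's code for '-' (0); pass a different `separator` if your files use another value.

```python
# Alphabet in code order, taken from the table: string index = numeric code.
BSQ_ALPHABET = "-ARNDCQEGHILKMFPSTWYVBZX*"
SEPARATOR_CODE = 0  # assumption: the separator is the code for '-'


def encode_protein(seq):
    """Encode one amino acid sequence, one byte per residue.

    Raises ValueError for letters that are not in the table.
    """
    return bytes(BSQ_ALPHABET.index(ch) for ch in seq.upper())


def decode_bsq(data, separator=SEPARATOR_CODE):
    """Split the raw contents of a .bsq file on the separator byte and
    decode each piece back to letters. Empty pieces are skipped."""
    sequences = []
    current = []
    for code in data:
        if code == separator:
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(BSQ_ALPHABET[code])
    if current:
        sequences.append("".join(current))
    return sequences


def count_headers(ahd_bytes):
    """Count sequences by counting '>' symbols in the raw .ahd contents."""
    return ahd_bytes.count(b">")
```

In practice the offsets stored in NAME.atb are the more direct way to locate individual sequences; splitting on the separator is only a convenient way to inspect a small file.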
Format of a NAME.atb file
- 32-bit int -- database type. For a protein database this number should
  be set to 0x78857a4f.
- 32-bit int -- database format. For a protein database this should be
  set to 3.
- 32-bit int -- the number of bytes in the title of the database (N).
- N * (8-bit byte) -- database title.
- if (N % 4) is not equal to 0, skip (4 - N % 4) bytes.
- 32-bit int -- the number of sequences in the database (M).
- 32-bit int -- the length of the longest sequence in the database.
- for (i = 1; i <= M; i++) {
      read 32-bit int (= the absolute position of the beginning of the
      i-th sequence in file NAME.bsq)
  };
- for (i = 1; i <= M; i++) {
      read 32-bit int (= the absolute position of the beginning of the
      i-th sequence header in file NAME.ahd)
  };
- end of file.
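The layout above can be read with a few fixed-size reads. The following is a minimal sketch rather than a reference implementation: the function name `parse_atb` and the returned dictionary keys are hypothetical. The byte order is not stated in this description, so big-endian is only an assumed default, and the integers are read as unsigned.

```python
import struct

PROTEIN_DB_TYPE = 0x78857A4F   # database type for a protein database
PROTEIN_DB_FORMAT = 3          # database format for a protein database


def parse_atb(data, byte_order=">"):
    """Parse the raw bytes of a NAME.atb file.

    byte_order is a struct prefix: ">" (big-endian, assumed default)
    or "<" (little-endian).
    """
    def read_u32(pos):
        # Read one 32-bit integer and return it with the next position.
        return struct.unpack_from(byte_order + "I", data, pos)[0], pos + 4

    pos = 0
    db_type, pos = read_u32(pos)
    db_format, pos = read_u32(pos)
    if db_type != PROTEIN_DB_TYPE or db_format != PROTEIN_DB_FORMAT:
        raise ValueError("not a protein database table (or wrong byte order)")

    title_len, pos = read_u32(pos)
    title = data[pos:pos + title_len]
    pos += title_len
    if title_len % 4 != 0:
        pos += 4 - title_len % 4   # skip padding to a 4-byte boundary

    count, pos = read_u32(pos)     # M, the number of sequences
    longest, pos = read_u32(pos)   # length of the longest sequence

    table_format = f"{byte_order}{count}I"
    sequence_offsets = list(struct.unpack_from(table_format, data, pos))
    pos += 4 * count
    header_offsets = list(struct.unpack_from(table_format, data, pos))

    return {
        "title": title,
        "count": count,
        "longest": longest,
        "sequence_offsets": sequence_offsets,  # positions in NAME.bsq
        "header_offsets": header_offsets,      # positions in NAME.ahd
    }
```

With the result in hand, the length of header i is `header_offsets[i + 1] - header_offsets[i]`. Since only M offsets are listed, the last header would presumably extend to the end of NAME.ahd.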
Recorded by Andrey Rzhetsky 9/25/97