NCBI BLAST 2.0: Amino Acid Sequence Database Files
-
NAME.phr --
file with sequence headers written one after another. Unlike the
corresponding file in the older database format, this file does not use '>'
characters to mark the beginnings of sequence headers, and no "magic byte"
is appended at the end of each header.
-
NAME.psq --
sequences in binary format written one after another; consecutive sequences
are separated by a "sentinel byte", '\0'.
Numerical codes for amino acids in a *.psq file (note that the code table
differs from those used in previous versions of BLAST databases)
Amino acid | Binary code | Amino acid | Binary code
'-'        | 0           | 'M'        | 12
'A'        | 1           | 'N'        | 13
'B'        | 2           | 'P'        | 14
'C'        | 3           | 'Q'        | 15
'D'        | 4           | 'R'        | 16
'E'        | 5           | 'S'        | 17
'F'        | 6           | 'T'        | 18
'G'        | 7           | 'V'        | 19
'H'        | 8           | 'W'        | 20
'I'        | 9           | 'X'        | 21
'K'        | 10          | 'Y'        | 22
'L'        | 11          | 'Z'        | 23
           |             | '*'        | 24
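The code table can be turned into a small lookup for decoding *.psq bytes. The following is a minimal sketch, not the BLAST implementation: the names CODE_TO_RESIDUE, decode_residues and encode_residues are hypothetical, and the mapping is copied from the table above rather than from any library.

```python
# Residue letters indexed by their binary code, copied from the table above
# (code 0 is '-', code 24 is '*'). The names here are illustrative only.
CODE_TO_RESIDUE = "-ABCDEFGHIKLMNPQRSTVWXYZ*"
RESIDUE_TO_CODE = {letter: code for code, letter in enumerate(CODE_TO_RESIDUE)}


def decode_residues(raw):
    """Translate the bytes of one sequence (sentinel excluded) into letters."""
    try:
        return "".join(CODE_TO_RESIDUE[code] for code in raw)
    except IndexError:
        raise ValueError("byte value outside the 0..24 code table")


def encode_residues(sequence):
    """Translate a string of residue letters into *.psq byte codes."""
    return bytes(RESIDUE_TO_CODE[letter] for letter in sequence.upper())


if __name__ == "__main__":
    print(decode_residues(bytes([12, 1, 10])))
```

Because code 0 is shared by '-' and the sentinel byte, this sketch decodes one already-isolated sequence and leaves locating sequence boundaries to the offsets stored in the index file.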
-
NAME.pin --
index file containing references ("offsets") to the beginnings of headers
in the *.phr file and to the beginnings of sequences in the *.psq file.
The fields appear in this file in the following order.
-
(1) 32-bit int -- formatdb_version;
-
(2) 32-bit int -- PROTEIN_DUMP (0 for nucleotide sequences, 1 for protein sequences);
-
(3) 32-bit int -- the length of database title (T);
-
(4) T bytes -- the database title itself;
-
(5) 32-bit int -- the length of byte array with time/date (D);
-
(6) D bytes -- date and time of the database creation;
-
(7) 32-bit int -- the number of sequences in the database (N);
-
(8) 32-bit int -- the total number of characters in the database;
-
(9) 32-bit int -- length of the longest sequence in the database (Lmax);
-
(10) for(i=0; i<=N; i++){
         read 32-bit int (the ith header offset);
     }
-
(11) for(i=0; i<=N; i++){
         read 32-bit int (the ith sequence offset);
     }
-
e.o.f.
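The field order above can be read sequentially. The following is a minimal sketch under stated assumptions: the text does not give the byte order or signedness of the 32-bit ints, so signed big-endian is assumed here, and the title and date are assumed to be ASCII text. The function names read_int32 and read_pin are hypothetical.

```python
import struct
from io import BytesIO


def read_int32(stream):
    """Read one 32-bit int. ASSUMPTION: signed, big-endian byte order."""
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("unexpected end of index file")
    return struct.unpack(">i", data)[0]


def read_pin(stream):
    """Read fields (1)-(11) of a *.pin file in the order listed above."""
    index = {}
    index["formatdb_version"] = read_int32(stream)            # field (1)
    index["protein_dump"] = read_int32(stream)                # field (2)
    title_length = read_int32(stream)                         # field (3)
    # ASSUMPTION: title and date bytes are ASCII text.
    index["title"] = stream.read(title_length).decode("ascii", "replace")
    date_length = read_int32(stream)                          # field (5)
    index["date"] = stream.read(date_length).decode("ascii", "replace")
    n = read_int32(stream)                                    # field (7)
    index["num_sequences"] = n
    index["total_length"] = read_int32(stream)                # field (8)
    index["max_length"] = read_int32(stream)                  # field (9)
    # Fields (10) and (11) each hold N + 1 offsets (the loops run i <= N).
    index["header_offsets"] = [read_int32(stream) for _ in range(n + 1)]
    index["sequence_offsets"] = [read_int32(stream) for _ in range(n + 1)]
    return index
```

With such an index in hand, the bytes between sequence_offsets[i] and sequence_offsets[i+1] in the *.psq file would presumably hold sequence i together with its sentinel byte, and likewise for header_offsets and the *.phr file.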
-
NAME.psd --
string "directory" that helps find the unique order number of a sequence in
the database from a known GenBank, SwissProt, PIR or other accession number.
The format of this file is as follows. For each string containing an
accession number (such as "gi|111111", "pir||a64226", or "gnl|pid|e41337")
the directory has the following record.
-
(1) the string with the accession number followed by a '\2' character;
-
(2) the string representation of the order number of the sequence in the
FASTA file followed by '\n';
-
The number of lines in the file is thus equal to the number of different
accession numbers in the original FASTA file; one sequence can have multiple
non-identical accession numbers.
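The record layout described above (accession string, '\2', order number, '\n') can be parsed as in the following sketch. The function name parse_psd is hypothetical, the accession strings are assumed to be ASCII, and the order numbers in the sample data are made up for illustration.

```python
def parse_psd(data):
    """Map each accession string in a *.psd file to its sequence order number."""
    directory = {}
    for record in data.split(b"\n"):
        if not record:
            continue  # skip the empty piece left after the final '\n'
        accession, separator, number = record.partition(b"\x02")
        if not separator:
            raise ValueError("record lacks the '\\2' separator: %r" % record)
        # ASSUMPTION: accession strings and order numbers are ASCII text.
        directory[accession.decode("ascii")] = int(number.decode("ascii"))
    return directory


if __name__ == "__main__":
    # Illustrative data only: two accessions for one sequence, one for another.
    sample = b"gi|111111\x020\npir||a64226\x020\ngnl|pid|e41337\x021\n"
    print(parse_psd(sample))
```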
-
NAME.psi --
index for quick search in the NAME.psd file. This file has the following
format.
-
(1) 32-bit int -- ISAM version;
-
(2) 32-bit int -- type of data (protein -- 1, nucleotide -- 2, unknown -- 3);
-
(3) 32-bit int -- length of the directory (NAME.psd) file;
-
(4) 32-bit int -- the number of entries in the directory file;
-
(5) 32-bit int -- the number of "reper" entries, chosen in the directory so
that they are separated by an equal number of "non-reper" entries;
-
(6) 32-bit int -- "page size" -- the number of non-reper entries between two
repers + 1;
-
(7) 32-bit int -- set to 0 for future use;
-
(8) 32-bit int -- set to 0 for future use;
-
(9) if "page size" > 1,
    for(i=0; i<the number of repers+1; i++){
        32-bit int -- the absolute address of the ith reper item in the
        directory file;
    }
-
(10) for(i=0; i<the number of repers+1; i++){
         32-bit int -- the absolute address of the beginning of the ith
         reper item in the index file (this file);
     }
-
(11) for(i=0; i<the number of repers+1; i++){
         the string corresponding to the ith reper item followed by '\0';
         the last string is simply '\0';
     }
-
e.o.f.
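The *.psi layout above can be read in one pass. This is a minimal sketch under assumptions the text does not settle: 32-bit ints are taken to be signed big-endian, the "page size" condition is applied only to field (9) as written, and the reper strings are assumed to be ASCII. The names read_int32 and read_psi are hypothetical.

```python
import struct
from io import BytesIO

HEADER_FIELDS = ("isam_version", "data_type", "directory_length",
                 "num_entries", "num_repers", "page_size",
                 "reserved1", "reserved2")


def read_int32(stream):
    """Read one 32-bit int. ASSUMPTION: signed, big-endian byte order."""
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("unexpected end of index file")
    return struct.unpack(">i", data)[0]


def read_psi(stream):
    """Read fields (1)-(11) of a *.psi file in the order listed above."""
    index = {name: read_int32(stream) for name in HEADER_FIELDS}
    count = index["num_repers"] + 1
    # Field (9) is present only when "page size" > 1.
    if index["page_size"] > 1:
        index["directory_offsets"] = [read_int32(stream) for _ in range(count)]
    else:
        index["directory_offsets"] = []
    # Field (10): where each reper string starts inside this index file.
    index["key_offsets"] = [read_int32(stream) for _ in range(count)]
    # Field (11): count strings, each terminated by '\0'; the last is empty.
    pieces = stream.read().split(b"\0")[:count]
    index["keys"] = [piece.decode("ascii", "replace") for piece in pieces]
    return index
```

The sketch splits the trailing strings on '\0' instead of seeking to each key offset; both approaches should give the same strings if the file follows the layout above.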
-
NAME.pnd --
numeric "directory" file. Essentially the same as the *.psd file, but (i) only
GenBank accession numbers are used, and (ii) both the GenBank accession
numbers and the order numbers of sequences in the FASTA file are stored as
32-bit int numbers rather than as strings.
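A numeric directory of this kind could be read as a run of integer pairs. The sketch below assumes more than the text states: each record is taken to be exactly two consecutive 32-bit ints (accession number, then order number) with no separator bytes, stored as signed big-endian values. The function name parse_pnd is hypothetical.

```python
import struct


def parse_pnd(data):
    """Map each numeric accession in a *.pnd file to its order number.

    ASSUMPTIONS: records are packed pairs of signed big-endian 32-bit ints,
    (accession number, order number), with no separators between them.
    """
    if len(data) % 8 != 0:
        raise ValueError("file length is not a whole number of 8-byte records")
    return {accession: order
            for accession, order in struct.iter_unpack(">ii", data)}


if __name__ == "__main__":
    # Illustrative data only.
    sample = struct.pack(">iiii", 111111, 0, 222222, 1)
    print(parse_pnd(sample))
```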
-
NAME.pni --
numeric index file with the following format.
-
(1) 32-bit int -- version of the program;
-
(2) 32-bit int -- data type;
-
(3) 32-bit int -- length of the corresponding numeric directory file;
-
(4) 32-bit int -- the number of entries in the directory file;
-
(5) 32-bit int -- the number of repers in the index file;
-
(6) 32-bit int -- the "page size";
-
(7) 32-bit int -- set to 0;
-
(8) 32-bit int -- set to 0;
-
(9) 32-bit int -- set to 0;
-
(10) 32-bit int -- set to 0;
-
(11) for(i=0; i<number of repers+1; i++){
         32-bit int -- GenBank entry number of the ith reper entry;
         32-bit int -- absolute address in the directory file of the ith
         reper entry;
     }
-
e.o.f.
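The *.pni layout above is ten header ints followed by pairs of ints, which makes it easy to unpack from a byte buffer. This is a minimal sketch assuming signed big-endian 32-bit ints (the text does not state the byte order); the name read_pni and the dictionary keys are hypothetical.

```python
import struct

PNI_HEADER_FIELDS = ("version", "data_type", "directory_length",
                     "num_entries", "num_repers", "page_size",
                     "reserved1", "reserved2", "reserved3", "reserved4")


def read_pni(data):
    """Unpack fields (1)-(11) of a *.pni file held in a bytes object.

    ASSUMPTION: all 32-bit ints are signed and big-endian.
    """
    header = dict(zip(PNI_HEADER_FIELDS, struct.unpack_from(">10i", data, 0)))
    count = header["num_repers"] + 1
    # Field (11): count pairs of (entry number, address in the directory file),
    # starting right after the 10 header ints (10 * 4 = 40 bytes).
    flat = struct.unpack_from(">%di" % (2 * count), data, 40)
    header["repers"] = list(zip(flat[0::2], flat[1::2]))
    return header


if __name__ == "__main__":
    # Illustrative data only: one reper, so two (entry, address) pairs.
    sample = (struct.pack(">10i", 1, 1, 32, 4, 1, 2, 0, 0, 0, 0)
              + struct.pack(">4i", 100, 0, 300, 16))
    print(read_pni(sample)["repers"])
```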
Recorded by Andrey Rzhetsky 1997-10-22