NCBI BLAST 2.0: Nucleotide Sequence Database Files
Ambiguity codes for Y and R corrected 2004-05-22
Originally provided by Andrey Rzhetsky 1997-10-22
-
NAME.nhr -- file with sequence headers written one after another; unlike the
similar file in the older type of database, this file does not contain '>'
characters to delineate the beginnings of sequence headers, and no "magic
byte" is appended at the end of each header.
-
NAME.nsq -- sequences in binary format written one after another; sequences
are separated by a "magic byte" -- '\0'. The encoding, which uses two bits
per nucleotide, is as follows (no degenerate symbols).
Nucleotide    Encoded as
'A'           0 (binary 00)
'C'           1 (binary 01)
'G'           2 (binary 10)
'T'|'U'       3 (binary 11)
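The two-bit codes can be illustrated with a minimal sketch. The codes themselves come from the table above; the packing order (four nucleotides per byte, first nucleotide in the highest bits) is an assumption, since it is not stated here, and the helper names pack_2bit and unpack_2bit are hypothetical.

```python
# Two-bit codes from the table above; 'U' shares the code of 'T'.
TWO_BIT = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}
DECODE = "ACGT"


def pack_2bit(seq):
    """Pack nucleotides four per byte, first base in the highest bits (assumed order)."""
    out = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | TWO_BIT[base.upper()]
        # Left-align a short final chunk so earlier bases stay in the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length):
    """Recover `length` bases from bytes produced by pack_2bit."""
    bases = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        bases.append(DECODE[(data[i // 4] >> shift) & 0b11])
    return "".join(bases)
```

The sequence length has to be supplied to unpack_2bit separately, because a partly filled final byte cannot be told apart from trailing 'A' symbols in this sketch.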
The degenerate symbols are stored at the end of the sequence file
in an encoding that uses four bits per symbol.
Nucleotide               Encoded as    Nucleotide               Encoded as
'-'                      0             'T'                      8
'A'                      1             'W' ('A'|'T')            9
'C'                      2             'Y' ('C'|'T')            10
'M' ('A'|'C')            3             'H' ('A'|'T'|'C')        11
'G'                      4             'K' ('G'|'T')            12
'R' ('A'|'G')            5             'D' ('A'|'T'|'G')        13
'S' ('C'|'G')            6             'B' ('T'|'G'|'C')        14
'V' ('A'|'G'|'C')        7             'N' ('A'|'T'|'G'|'C')    15
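Each four-bit code in the table above is the bitwise OR of one bit per base, with A = 1, C = 2, G = 4 and T = 8. The following sketch rebuilds the codes from that observation; the dictionary and function names are hypothetical and exist only for illustration.

```python
# One bit per non-degenerate base, as read off the table above.
BASE_BITS = {"A": 1, "C": 2, "G": 4, "T": 8}

# Set of bases that each symbol stands for ('-' stands for none).
SYMBOL_BASES = {
    "-": "", "A": "A", "C": "C", "M": "AC", "G": "G", "R": "AG",
    "S": "CG", "V": "ACG", "T": "T", "W": "AT", "Y": "CT", "H": "ACT",
    "K": "GT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def four_bit_code(symbol):
    """Return the four-bit code of a symbol as the OR of its base bits."""
    code = 0
    for base in SYMBOL_BASES[symbol]:
        code |= BASE_BITS[base]
    return code


def symbols_by_code():
    """Return the 16 symbols ordered by their four-bit code (0 through 15)."""
    return "".join(sorted(SYMBOL_BASES, key=four_bit_code))
```

A symbol therefore matches a base exactly when four_bit_code(symbol) & BASE_BITS[base] is non-zero.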
Important: degenerate symbols (such as 'N' -- any nucleotide, 'R' -- a purine,
or 'W' -- a weak-bonding nucleotide, etc.) in the compressed nucleotide
sequences are randomly substituted with a non-degenerate symbol from the
corresponding set of nucleotides; to get the original nucleotide sequence one
needs to refer to the degenerate symbols stored at the end of the NAME.nsq file.
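The substitution and its reversal can be sketched as follows. This is an illustration of the idea only: the (offset, symbol) records are a hypothetical in-memory stand-in for the degenerate symbols stored at the end of NAME.nsq, not the on-disk layout, and the function names are assumptions.

```python
import random

# Bases each degenerate symbol stands for, per the four-bit table.
DEGENERATE = {
    "M": "AC", "R": "AG", "S": "CG", "V": "ACG", "W": "AT", "Y": "CT",
    "H": "ACT", "K": "GT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def mask_degenerate(seq, rng=random):
    """Replace each degenerate symbol with a random member of its set.

    Returns the masked sequence and a list of (offset, symbol) records
    that remember what was replaced.
    """
    masked, records = [], []
    for offset, symbol in enumerate(seq):
        if symbol in DEGENERATE:
            records.append((offset, symbol))
            symbol = rng.choice(DEGENERATE[symbol])
        masked.append(symbol)
    return "".join(masked), records


def restore_degenerate(masked, records):
    """Put the recorded degenerate symbols back into a masked sequence."""
    bases = list(masked)
    for offset, symbol in records:
        bases[offset] = symbol
    return "".join(bases)
```

Because the substitute is chosen at random, the masked sequence alone does not reveal which positions were degenerate; the records are required to restore them.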
-
NAME.nin -- index file containing references ("offsets") to the beginnings of
headers in file *.nhr and to the beginnings of sequences in file *.nsq.
In more detail, the order of fields in this file is as follows.
-
(1) 32-bit int -- formatdb_version;
-
(2) 32-bit int -- PROTEIN_DUMP (0 for nucleotide sequences, 1 for protein sequences);
-
(3) 32-bit int -- the length of the database title (T);
-
(4) T bytes -- the database title itself;
-
(5) 32-bit int -- the length of the byte array with the time/date (D);
-
(6) D bytes -- date and time of the database creation;
-
(7) 32-bit int -- the number of sequences in the database (N);
-
(8) 32-bit int -- the total number of characters in the database;
-
(9) 32-bit int -- the length of the longest sequence in the database (Lmax);
-
(10) for(i=0; i<=N; i++){
       read 32-bit int (the ith header offset);
     }
-
(11) for(i=0; i<=N; i++){
       read 32-bit int (the (i-1)th sequence offset
                        -- don't ask me why not the ith!);
     }
-
(12) for(i=0; i<=N; i++){
       read 32-bit int (the ith "ambiguous character" array offset:
         each array is one-dimensional with a cell size of 32 bits;
         the first cell contains the number of elements in the array;
         each 32-bit integer starting with the second one in the array
         contains the degenerate character (the first 4 bits), the length
         of the run of that character (4 bits), and the offset in the
         sequence (the remaining 24 bits));
       if there is no room between the ith and (i+1)th offsets, the ith
       sequence doesn't contain ambiguous characters.
     }
-
e.o.f.
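The field order above can be read with a short parser. This is a sketch that follows the list literally, not the NCBI implementation. It makes several assumptions: the integers are big-endian unless byte_order says otherwise, "the first 4 bits" of an ambiguity entry means the most significant bits, and the dictionary keys and function names are hypothetical. Real files may contain details that this description does not cover.

```python
import struct


def parse_index(data, byte_order=">"):
    """Parse fields (1)-(12) of NAME.nin from a bytes object.

    byte_order is an assumption: ">" (big-endian) by default,
    "<" for little-endian.
    """
    pos = 0

    def read_int():
        nonlocal pos
        (value,) = struct.unpack_from(byte_order + "i", data, pos)
        pos += 4
        return value

    def read_bytes(count):
        nonlocal pos
        chunk = data[pos:pos + count]
        pos += count
        return chunk

    def read_offsets(count):
        nonlocal pos
        values = struct.unpack_from(f"{byte_order}{count}I", data, pos)
        pos += 4 * count
        return list(values)

    index = {}
    index["formatdb_version"] = read_int()        # field (1)
    index["protein_dump"] = read_int()            # field (2)
    index["title"] = read_bytes(read_int())       # fields (3) and (4)
    index["date"] = read_bytes(read_int())        # fields (5) and (6)
    n = read_int()                                # field (7)
    index["num_sequences"] = n
    index["total_chars"] = read_int()             # field (8)
    index["max_length"] = read_int()              # field (9)
    # Each offset table has N + 1 entries (the loops run for i <= N).
    index["header_offsets"] = read_offsets(n + 1)     # field (10)
    index["sequence_offsets"] = read_offsets(n + 1)   # field (11)
    index["ambiguity_offsets"] = read_offsets(n + 1)  # field (12)
    return index


def unpack_ambiguity_word(word):
    """Split a 32-bit ambiguity entry into (symbol code, length field, offset).

    Assumes the symbol code occupies the four most significant bits.
    """
    return (word >> 28) & 0xF, (word >> 24) & 0xF, word & 0xFFFFFF
```

With the parsed index, sequence i has ambiguous characters only when ambiguity_offsets[i] differs from ambiguity_offsets[i + 1], which mirrors the "no room between offsets" rule in field (12).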
-
NAME.nni -- numeric ISAM index;
-
NAME.nnd -- lookup hash table file.