A new "improved" Alu/Alu(aa) dataset for Alu detection and masking

July 26, 1994 - J-M. Claverie

A new and more complete subset of representative Alu repeat segments is
now available for searching, query masking (using xblast) and error
detection.

This subset contains 325 selected sequences with the following statistics:

    Average size:   276.23
    St. dev size:   62.723
    Max. size:      397
    Min. size:      180

Composition analysis

    char    count       %
    A       27668   30.82
    C       19918   22.19
    G       23994   26.73
    T       18194   20.27
    all     89774

The 325-alu subset contains multiple representatives of each of the Alu
classes and sub-classes previously defined in "repbase" (Human and other
primate Alu repeats, Dr. Jerzy Jurka, June 1994).

The Palu database consists of the corresponding 6-frame conceptual
translations. Problematic low-entropy segments (mainly generated by the
poly-A tails) have been removed from those sequences (using xnu
filtering). This ensures a very low level of false positive matches,
provided that adequately significant score thresholds are used and that
the query is itself filtered with the -filter "XNU" option of blastp and
blastx.

Usage
-----

1) with a nucleotide sequence query:

    blastn Alu.325.db query S=150
    (see below for statistics)

   or, for masking:

    blastn Alu.325.db query S=150 | xblast + query > query.Alu.msk
    (see xblast manual for more)

   or even:

    blastx Palu.325.db query S=65 S2=65 M=blosum62.msk -filter "XNU"
    (see below for statistics)

2) with a peptide sequence query:

    blastp Palu.325.db query S=65 S2=65 M=blosum62.msk -filter "XNU"

   or:

    tblastn Alu.325.db query S=65 S2=65 M=blosum62.msk -filter "XNU"

Scoring matrix
--------------

blosum62.msk is simply the blosum62 scoring matrix in which the strong
negative penalties for matching a stop ("*") with any amino acid have
been replaced by "0".
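The substitution described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used to build blosum62.msk. It assumes the matrix is stored in the usual whitespace-separated text layout ("#" comment lines, one header row of column symbols, then one row per symbol). The function name mask_stop_penalties and the toy matrix are hypothetical, and leaving the non-negative "*" versus "*" score unchanged is an assumption.

```python
def mask_stop_penalties(matrix_text: str, stop: str = "*") -> str:
    """Return matrix_text with negative scores in the stop row/column set to 0."""
    out = []
    columns = None  # column symbols, taken from the first non-comment line
    for line in matrix_text.splitlines():
        fields = line.split()
        if not fields or line.startswith("#"):
            out.append(line)  # keep comments and blank lines unchanged
            continue
        if columns is None:
            columns = fields  # header row naming the columns
            out.append(line)
            continue
        row = fields[0]
        scores = [int(s) for s in fields[1:]]
        masked = [
            0 if (row == stop or col == stop) and score < 0 else score
            for col, score in zip(columns, scores)
        ]
        # Re-emit the row with each score right-aligned in a 3-character field.
        out.append(row + "".join(f"{s:3d}" for s in masked))
    return "\n".join(out) + "\n"


# Toy 3-symbol matrix for illustration only (not the full blosum62 table).
TOY = """\
# toy matrix, assumed layout
   A  R  *
A  4 -1 -4
R -1  5 -4
* -4 -4  1
"""

if __name__ == "__main__":
    print(mask_stop_penalties(TOY), end="")
```

Applied to a real matrix file, the same function would be called on the file's contents and the result written to a new file, which could then be passed with M=.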
Alternatively, one can use the -altscore option in the blast command line:

    blastx alu query S=65 S2=65 M=blosum62 -altscore "* any 0" -altscore "any * 0"

Statistics on Alu detection (using blosum62.msk)
---------------------------

blastn: S>=150 corresponds to p<0.01 when scanning genbank (400-nt query),
        and to p<10^-05 when scanning Alu.325.db.

blastx: S>=65 corresponds to p<0.002 when scanning Palu.325.db (400-nt
        query).

Sensitivity:

blastn on Alu.325.db

    Threshold score     % known Alu detected
    150                 99.3
    178                 97.3
    207                 94.3
    231                 89.3

blastp or blastx or tblastn

    Threshold score     % known Alu detected
    65                  98.7
    70                  96.7
    75                  93.7
    80                  88.8

Although using the conceptual translation appears slightly less sensitive,
it has the major advantage of allowing low-entropy-induced false positive
matches to be filtered out.

Note: remarkably, the previous Alu/Palu db, made of only 6 selected
sequences, achieved 95% detection with blastn (S>=150) and 94.4% detection
with blastx (S>=65). This minimal subset is therefore still useful when
storage space is an issue.

Methods
-------

4887 Alu segments contained in repbase were progressively clustered using
a succession of self-matching / uniquing steps with blastn and then
tblastx. The final set is (on purpose) still very redundant, to ensure a
desirably low probability of false negatives (i.e. missing an Alu match).
No sequence in the 325-alu subset matches another one with a score S>=170
(using blosum62.msk).

"Imperfect" sequences (e.g. containing non-ATGC letters, less than 100 nt
in size, or containing more than one Alu segment) are not present in the
final 325-member set. In addition, 22 "atypical" Alu (e.g. containing
coding regions or pseudogene segments) have been carefully eliminated.

Bibliography
------------

Blast:
-----

Gish, Warren and David J. States (1993). Identification of protein coding
regions by database similarity search. Nature Genetics 3:266-72.

Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J.
Lipman (1990). Basic local alignment search tool. J. Mol. Biol.
215:403-410.

xblast/xnu:
----------

Claverie, Jean-Michel and David J. States (1993). Information enhancement
methods for large scale sequence analysis. Computers and Chemistry
17:191-201.

Claverie, Jean-Michel (1994). Large scale sequence analysis, chap. 36 in
Automated DNA Sequencing and Analysis Techniques (M. Adams, C. Fields, &
J.C. Venter, eds.), Academic Press, pp. 267-279.

(also see /pub/jmc/xblast and /pub/jmc/xnu)

Alu contamination and artifact filtering:
----------------------------------------

Claverie, J-M & Makalowski, W. (1994). Alu Alert. Nature 371:752.

Claverie, J-M (1994). A streamlined random sequencing strategy for finding
coding exons. Genomics 23:575-581.

Claverie, J-M (1992). Identifying coding exons by similarity search:
Alu-derived and other potentially misleading protein sequences. Genomics
12:838-841.