EXALIN README EXALIN is a package which can be used to align spliced sequence with genome sequence. ------------------------------------------------------------------------------- CONTENT I. Usage II. Multiple methods 1. Full length Dynamic programming using consensus splice site model 2. Use the BLAST result and part of sequences a. result from blastn parser b. blastn parameter used in program 3. Full length Dynamic Programming using simplified splice site model III. Consensus Splice site model file IV. Others 1. Sequence Input file 2. result -------------------------------------------------------------------------------- I. Usage Usage: exalin [-d] [-a outmode] [-s runmode] [-o ngroups] [-p splicemodelfile] [-L lambda] [-A start][-B end] [-w] [-b result-from-blast-parser] [-o outputgroup] [-t misdist] [-l len] [-f len] [-g len] [-m match] [-n mismatch] [-k cutoff] [-i strand] [-q gap1] [-r gapext] [-S scoreing matrix] [-x intron-begin] [-y intron-end] [-e] [-u intron-with-splice] [-v intron-without-splice] [-c len] [-I small-intron-len][-P trim-poly-A-tail NOTE: The default alignment method uses full dynamic programming (DP) between EST and genome over their entire lengths, which may demand large memory proportional to the product of their lengths. See the -A and -B options to restrict analysis to a segment of the genome sequence. -p: input filename for a consensus splice site model default: use the splice site model file for human -b: input filename for a parsed BLAST result (output from blast_deluxe.pl). This information directs a constrained (not full) dynamic programming method: group number, HSP locations and lengths of query and subjct sequences -o report results for only the first groups in the BLAST input (default is 1). The maximum supported value is 4. If the number of groups in the input is less than , it is not an error. -e: perform full dynamic programming with a simplified splice site model -d do not accept nucleotide ambiguity codes (non-ACGT) in input sequences -a: 0: report coordinates only (default) 1: report alignments and coordinates -s: 0: GT...AG only (default) 1: CT...AC only 2: Try both GT...AG and CT...AC; only report the best result 3: Try both GT...AG and CT...AC; report both results -L: lambda value. When scores are changed, the lambda value should be adjusted accordingly. -A: the start position in the genome to use when doing full DP (useful to reduce the amount of memory required for full dynamic programming) -B: the end position in the genome to use when doing full DP (useful to reduce the amount of memory required for full dynamic programming) -w: use blast guided full dynamic programming -b is needed when -w is set -t: the maximum distance along the genome to search by constrained DP for unaligned or missing ends of ESTs from the BLAST report. default: min(50000, 0.01 * 4**x), where x is the length of the missing or unaligned part. (See the -l option). -l: minimum segment length at start or end of EST that, if unaligned in the BLAST report, will be searched for by constrained DP. default: 10 -f: length of flanking sequence to include in the constrained DP when adjacent HSPs overlap. default: 60 -g: length of flanking sequence to include in the constrained DP when adjacent HSPs do not overlap. default: 40 -S: using match and mismatch scoring matrix from scoring matrix file -m: match score. default: +5 (+1 when using simplified splice site model full DP) -n: mismatch score. default: -11 (-1 when using simplified splice site model full DP) -k: cut-off score when doing full dynamic programming. If score is lower than cut-off, the alignment is not reported. Default: 50 -i: the strand that EXALIN examines. default: 0: both strands and ouput the best one 1: positive strand 2: negative strand -q: penalty for first residue in a gap. default: -11 (2 for simplfied splice site model) -r: gap extension penalty. default: -11 (2 for simplified site model full DP): -x: penalty for intron-begin. default: -15 -y: penalty for intron-end. default: -5 -u: the penalty for an intron with splice site when using simplfied splice site model full DP. default: 20 -v: the penalty for an intron when using simplified splice site model full DP. default: 40 default: 100,000,000 -I: report the intron when intron length is smaller than input default: 20 -P: the poly-A tail will be trimmed when set and poly-A length is greater than the input value. EXIT 1 -------------------------------------------------------------------------------- II. Multiple methods The package is designed to align spliced sequence with genome sequences. Three different methods can be used. The default method is Full Dynamic Programming. Currently, EXALIN can only accept one query once. 1. Full Dynamic Programming with consensus splice site model In this method, instead of using part of EST sequence and genomic sequence, full length EST sequence and genomic sequence are used. Splice site information is incorporated into the recurrence. The required input files are EST sequence and genomic sequence. The result from BLASTN is not necessary, therefore, options -o, -t, -l, -f, -g won't work when use this method. When use this method, the memory can be a problem if the sequences are too long. User can set the start point and end point for the genomic sequence if the location of EST in genomic sequence is roughly know at advance. (-A for start point, -B for end point). For example, for sequences provided in the package, using command exalin -a 1 -A 70000 -B 80000 ctg est when -w is set, same as using -A and -B while -A and -B are from BLAST result. At this time -b should also be set to provide BLAST result file. For example, for sequence provided in the package using command exalin -a 1 -w -b best ctg est Exalin can also only look one strand when user use -i option to set up the strand. By doing these, exalin can work more time and memory efficiently. All these information can be obtained from WU-BLASTN result. For example, exalin -a 1 -A 70000 -B 80000 -i 1 ctg est For those three commands, they will generate same result. When aligning est sequences with genomic sequence, -s 2 is suggested to be used since splice direction can be reversed. 2. WU-BLAST result guided dynamic programming with consensus splice site model To use this method, result from BLAST parser, a EST sequence and a genomic sequence are required as input. -b option is set followed by the result from BLAST parser. Basically, WU-BLASTN is used to find the location of est sequence in the genomic sequence which it is derived. Then dynamic programming is used to locate the position of most reasonable splice site. In this method, to find the splice site between two adjacent high scoring pairs (hsp), only part of these two adjacent HSPs in EST sequence and correspondent genomic sequence are used. To get the most reasonable splice site, splice site information is also incorporated into program. When use this method, option -b should be set and followed by blast report file name. This blast result is the output of blast_deluxe.pl. For example, exalin -a 1 -b best ctg est a. The result from WU-BLASTN The result from blast_deluxe.pl should be like this: QUERY: gi|9805890|gb|BE562170.1|BE562170 601346146F1 NIH_ SBJCT: NIHOOP HSP: 115 231 86 201 1 HSP: 230 334 841 945 1 HSP: 335 521 1380 1567 1 HSP: 516 680 6857 7024 1 HSP: 115 231 86 201 2 HSP: 230 334 841 945 2 HSP: 335 521 1380 1567 2 | | | | | | | | | group number | | | | | | | end point in genomic sequenc | | | | | start point in genomic sequence | | | end point in EST sequence | start point in EST sequence The result should be ordered by start point in genomic sequence at first and then ordered by group number. The result is obtained by using blast_deluxe.pl to parse the BLASTN result. The parser uses the Perl package BPdeluxe.pm developed by Ian Korf, Jarret Glasscock and Raymond Yeh. The usage for the parser is blast_deluxe.pl < blast_out or just run the blastn and the blast_deluxe.pl together using pipe. like blastn ..... |./blast_deluxe.pl Both BPdeluxe.pm and blast_deluxe.pl are included in this package. b. BLASTN parameters used in the program. Since the result from BLASTN is used, to get the best result, following BLASTN parameters are recommended: genome_sequence est_sequence S=200 S2=100 gapS2=200 X=26 gapX=55 W=12 gapW=18 gapall Q=11 R=11 M=5 N=-11 Z=3e9 Y=3e9 V=1e6 B=1e6 hspmax=1000 hspsepqmax=2e5 topcomboN=4 wordmask=seg lcmask maskextra=10 gi novalidctxok nonnegok ********************************************************************** * IMPORTANT: Please be advised that the database file should be * * genomic sequence. This set of parameter can get good result * * when used in in-species comparison. * * * ********************************************************************** 3. Full Dynamic Programming with simple splice site model In this method, modified Smith-Waterman algorithm is used and intron is allowed in the recurrence. Full length EST sequence and genomic sequence are used. However, the splice site information is not used anymore. A small penalty for intron is used when splice site is canonical type (use option -u to change this parameter) and big penalty for an intron without canonical splice site (use option -v to change this parameter). To use this method, flag -e should be set. The options for -m -n -q still work here. But -r -x -y won't work anymore. 4. Possible problem: Exalin will try to find splice-site model file under working directory. User should use -p option to provide the correct directory and file name of the splice-site model if don't run EXALIN under current directory. -------------------------------------------------------------------------------- III. splice site model file 1. The splice site model file is also in the package. The name is HumanSPM.par. 2. The splice site model format should be like this if user want to use a new splice site model into program: # <------------------ the beginning of comment line. dlength=X #the length of the donor site model. alength=X #the length of the acceptor site model dosplicept=X #the intron beginning position in donor site model acsplicept=X #the last nucleotide of the intron in acceptor #site model charlist=A C T G #the nucleotides used in sequence 2 -1 -1 -1 #the score for A C T G separately in first position #of donor site. . . . . 0 -1 -2 3 #the score for A C T G separately in last position #of acceptor site. Otherwise, the model for human will be used. This model is the default in the program. The parameter file is included in the package. ------------------------------------------------------------------------------- IV. Other 1. Sequence input file FASTA file is highly recommended even though sequence not in FASTA format can be used. 2. An example is also included. Run the example exalin -s 2 -a 1 -i 1 -A 70000 -B 80000 ctg est EXALIN [06-May-2005] ARGS: -a 1 -A 70000 -B 80000 ctg est T-SEQUENCE: gi|1766096|gb|AA182927.1|AA182927 zp36c07.r1 Stratagene muscle 937209 Homo sapiens cDNA clone IMAGE:611532 5' similar to gb:J04760 TROPONIN I, SLOW SKELETAL MUSCLE (HUMAN);, mRNA sequence. T-LENGTH: 596 G-SEQUENCE: H_KvLQT1 : renamed from Hs chr11.(669054-1948475) length=1279422 G-LENGTH: 1279422 SCORE: 516.3 nats (744.8 bits) SPLICE-DIR: + T-STRAND: + EXON: 1 - 21 (76027 - 76047) 26.2 N/A 4.64 100.00% EXON: 22 - 28 (76192 - 76198) 8.8 5.05 8.21 100.00% EXON: 29 - 70 (76746 - 76787) 39.0 7.92 7.03 90.91% EXON: 71 - 196 (76871 - 76999) 123.8 5.96 3.37 92.37% EXON: 197 - 285 (77162 - 77251) 108.5 4.74 8.02 98.89% EXON: 286 - 461 (77374 - 77550) 187.8 6.21 6.38 94.97% EXON: 462 - 596 (77799 - 77941) 65.2 9.50 N/A 80.00% T: TCAGGACCTCAGGATGGGAGA...............TGAGGAG...............AA 30 |||||||||||||||||||||>>>>>>...>>>>>>|||||||>>>>>>...>>>>>>|| G: TCAGGACCTCAGGATGGGAGAGTAAGT...GGACAGTGAGGAGGTAAGT...CTGCAGAA 76747 T: GCGGAACAGGGCCATCACGG--CGCNANGGCAGCACCTGAAG...............AGT 73 |||||||||||||||||||| ||| | ||||||||||||||>>>>>>...>>>>>>||| G: GCGGAACAGGGCCATCACGGCCCGC-A-GGCAGCACCTGAAGGTAGGT...CCGCAGAGT 76873 T: GTGATGCTGCAGATAGCGGCCACGGAGCTGGAGAAGGAGGAGAGCCGCCGTGANG-A-NG 131 ||||||||||||||||||||||||||||||||||||||||||||||||||||| | | | G: GTGATGCTGCAGATAGCGGCCACGGAGCTGGAGAAGGAGGAGAGCCGCCGTGAGGCAGAG 76933 T: AAGCAGAACTACCTGGCGGNA-C-NTGCCCG-CGCNTGCATATCCCGGGCTCCATGTCTG 188 ||||||||||||||||||| | | |||||| ||| |||||||||||||||||||||||| G: AAGCAGAACTACCTGGCGG-AGCACTGCCCGCCGC-TGCATATCCCGGGCTCCATGTCTG 76991 T: AAGTGCAG...............GAGCTCT-CAAACAGCTGCACGCCAAGATCGATGCGG 232 ||||||||>>>>>>...>>>>>>||||||| ||||||||||||||||||||||||||||| G: AAGTGCAGGTACCA...CCCTAGGAGCTCTGCAAACAGCTGCACGCCAAGATCGATGCGG 77198 T: CTGAAGAGGAGAAGTACGACATGGAGGTGAGGGTGCAGAAGACCAGCAAGGAG....... 285 |||||||||||||||||||||||||||||||||||||||||||||||||||||>>>>>>. G: CTGAAGAGGAGAAGTACGACATGGAGGTGAGGGTGCAGAAGACCAGCAAGGAGGTGAGT. 77251 T: ........CTGGAGGACATGAACCAGAAGCTATTTGATCTGCGGGGCAAGTTCAAGCGGC 337 ..>>>>>>|||||||||||||||||||||||||||||||||||||||||||||||||||| G: ..CCACAGCTGGAGGACATGAACCAGAAGCTATTTGATCTGCGGGGCAAGTTCAAGCGGC 77425 T: CCCCACTGCGGAGGGTGCGCATGTCGACCGATGCCATGCTCAAGGCCCTGCT--GCTTCG 395 |||||||||||||||||||||||||| ||||||||||||||||||||||||| || ||| G: CCCCACTGCGGAGGGTGCGCATGTCGGCCGATGCCATGCTCAAGGCCCTGCTGGGC-TCG 77484 T: AAGCACAAGGTGTTCATGGACCTGANGG-CAACCTGAAGCAGGTTCAAGAAGGANGACAC 454 ||||||||||||| ||||||||||| || |||||||||||||| |||||||||| ||||| G: AAGCACAAGGTGTGCATGGACCTGAGGGCCAACCTGAAGCAGG-TCAAGAAGGAGGACAC 77543 T: AGAGAAG...............GAGCCGGGACCTNCGAGACGTTGGTNACTTGAAGAAGA 499 |||||||>>>>>>...>>>>>>||| |||||||| |||||||| ||| ||| || ||||| G: AGAGAAGGTGCGT...CCACAGGAG-CGGGACCTGCGAGACGTGGGTGACTGGAGGAAGA 77835 T: ACATCGAGGGAGAAGTNTGGCAT-GAAGGCCNGAAGAAGATNTTTTANTTCGAGT-CTA- 556 ||||||| |||||||| |||||| || |||| ||||||||| ||| | | ||||| ||| G: ACATCGA-GGAGAAGTCTGGCATGGAGGGCCGGAAGAAGATGTTTGAGTCCGAGTCCTAG 77894 T: GCCA-TNGCTG-CCCTA-ANCTG-CCC-GT-TCCGG-TTCCAGCAGA 596 |||| | |||| ||||| ||| ||| || |||| | |||||||| G: GCCACTCGCTGCCCCTACGCCTGCCCCGGTGCCCGGCTCCCAGCAGA 77941 EXIT 0 when run command exalin -a 1 -b best ctg est EXALIN [06-May-2005] ARGS: -a 1 -b best ctg est T-SEQUENCE: gi|1766096|gb|AA182927.1|AA182927 zp36c07.r1 Stratagene muscle 937209 Homo sapiens cDNA clone IMAGE:611532 5' similar to gb:J04760 TROPONIN I, SLOW SKELETAL MUSCLE (HUMAN);, mRNA sequence. T-LENGTH: 596 G-SEQUENCE: H_KvLQT1 : renamed from Hs chr11.(669054-1948475) length=1279422 G-LENGTH: 1279422 GROUP: 1 SCORE: 577.0 nats (832.5 bits) SPLICE-DIR: + T-STRAND: + EXON: 1 - 21 (76027 - 76047) EXON: 22 - 28 (76192 - 76198) EXON: 29 - 70 (76746 - 76787) EXON: 71 - 196 (76871 - 76999) EXON: 197 - 285 (77162 - 77251) EXON: 286 - 461 (77374 - 77550) EXON: 462 - 596 (77799 - 77941) ALIGNMENT T: CCTCAGGATGGGAGA...............TGAGGAG...............AAGCGGAA 36 |||||||||||||||>>>>>>...>>>>>>|||||||>>>>>>...>>>>>>|||||||| G: CCTCAGGATGGGAGAGTAAGT...GGACAGTGAGGAGGTAAGT...CTGCAGAAGCGGAA 76753 T: CAGGGCCATCACGG--CGCNANGGCAGCACCTGAAG...............AGTGTGATG 79 |||||||||||||| ||| | ||||||||||||||>>>>>>...>>>>>>||||||||| G: CAGGGCCATCACGGCCCGC-A-GGCAGCACCTGAAGGTAGGT...CCGCAGAGTGTGATG 76879 T: CTGCAGATAGCGGCCACGGAGCTGGAGAAGGAGGAGAGCCGCCGTGANG-A-NGAAGCAG 137 ||||||||||||||||||||||||||||||||||||||||||||||| | | ||||||| G: CTGCAGATAGCGGCCACGGAGCTGGAGAAGGAGGAGAGCCGCCGTGAGGCAGAGAAGCAG 76939 T: AACTACCTGGCGGNA-C-NTGCCCG-CGCNTGCATATCCCGGGCTCCATGTCTGAAG 191 ||||||||||||| | | |||||| ||| ||||||||||||||||||||||||||| G: AACTACCTGGCGG-AGCACTGCCCGCCGC-TGCATATCCCGGGCTCCATGTCTGAAG 76994 ALIGNMENT T: CNTGCATATCCCGGGCTCCATGTCTGAAGTGCAG...............GAGCTCT-CAA 206 | ||||||||||||||||||||||||||||||||>>>>>>...>>>>>>||||||| ||| G: C-TGCATATCCCGGGCTCCATGTCTGAAGTGCAGGTACCA...CCCTAGGAGCTCTGCAA 77172 T: ACAGCTGCACGCCAAGATCGAT 228 |||||||||||||||||||||| G: ACAGCTGCACGCCAAGATCGAT 77194 ALIGNMENT T: ACATGGAGGTGAGGGTGCAGAAGACCAGCAAGGAG...............CTGGAGGACA 295 |||||||||||||||||||||||||||||||||||>>>>>>...>>>>>>|||||||||| G: ACATGGAGGTGAGGGTGCAGAAGACCAGCAAGGAGGTGAGT...CCACAGCTGGAGGACA 77383 T: TGAACCAGAAGCTATTTGATCTG 318 ||||||||||||||||||||||| G: TGAACCAGAAGCTATTTGATCTG 77406 ALIGNMENT T: TATTTGATCTGCGGGGCAAGTTCAAGCGGCCCCCACTGCGGAGGGTGCGCATGTCGACCG 367 |||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||| G: TATTTGATCTGCGGGGCAAGTTCAAGCGGCCCCCACTGCGGAGGGTGCGCATGTCGGCCG 77455 T: ATGCCATGCTCAAGGCCCTGCT--GCTTCGAAGCACAAGGTGTTCATGGACCTGANGG-C 424 |||||||||||||||||||||| || |||||||||||||||| ||||||||||| || | G: ATGCCATGCTCAAGGCCCTGCTGGGC-TCGAAGCACAAGGTGTGCATGGACCTGAGGGCC 77514 T: AACCTGAAGCAGGTTCAAGAAGGANGACACAGAGAAG...............GAGCCGGG 469 ||||||||||||| |||||||||| ||||||||||||>>>>>>...>>>>>>||| |||| G: AACCTGAAGCAGG-TCAAGAAGGAGGACACAGAGAAGGTGCGT...CCACAGGAG-CGGG 77805 T: ACCTNCGAGACGTTGGTNACTTGAAGAAGAACATCGAGGGAGAAGTNTGGCAT-GAAGGC 528 |||| |||||||| ||| ||| || |||||||||||| |||||||| |||||| || ||| G: ACCTGCGAGACGTGGGTGACTGGAGGAAGAACATCGA-GGAGAAGTCTGGCATGGAGGGC 77864 T: CNGAAGAAGATNTTTTANTTCGAGT-CTA-GCCA-TNGCTG-CCCTA-ANCTG-CCC-GT 581 | ||||||||| ||| | | ||||| ||| |||| | |||| ||||| ||| ||| || G: CGGAAGAAGATGTTTGAGTCCGAGTCCTAGGCCACTCGCTGCCCCTACGCCTGCCCCGGT 77924 T: -TCCGG-TTCC 590 |||| | || G: GCCCGGCTCCC 77935 EXIT 0 The result only displays the alignment around the splice site. The "ALIGNMENT" showed in result is to make sure user can distinguish between two alignment. **************************** *.Explanation for Output: * **************************** EXALIN [06-May-2005] <- version information ARGS: -a 1 -A 70000 -B 80000 ctg est #command line parameters used T-SEQUENCE: gi|1766096|gb|AA182927.1|AA182927 zp36c07.r1 Stratagene muscle 937209 Homo sapiens cDNA clone IMAGE:611532 5' similar to gb:J04760 TROPONIN I, SLOW SKELETAL MUSCLE (HUMAN);, mRNA sequence. # mRNA or EST sequence name T-LENGTH: 596 # mRNA or EST length G-SEQUENCE: H_KvLQT1 : renamed from Hs chr11.(669054-1948475) length=1279422 # Genomic sequence name G-LENGTH: 1279422 # Genomic sequence length SCORE: 516.3 nats (744.8 bits) # score for whole alignment. repersented by nats or bits SPLICE-DIR: + # GT..AG (+) or CT...AC (-) T-STRAND: + # The sequence strand when "-" is used, complementary est strand is used. Otherwise, both sequences used are positive strand. EXON: 1 - 21 (76027 - 76047) 26.2 N/A 4.64 100.00% EXON: 22 - 28 (76192 - 76198) 8.8 5.05 8.21 100.00% | | | | | | | | | | | | | | | percentage identity | | | | | | | | | | | | | score from acceptor site (nats) | | | | | | | | | | | score from donor site (nats) | | | | | | | | | score (nats) | | | | | | | end point in genomic sequence | | | | | start point in genomic sequenc | | | end point in EST | start point in EST # blast-guided local alignment won't provide score and identity information # since the information is not complete. # When use complementary strand EST sequence, start point and end point # in EST will be reversed. (start point is greater than end point).