EXALIN — Improved Spliced Alignment from an Information Theoretic Approach

See the open-access publication abstract, article and supplemental materials, by Miao Zhang and Warren Gish.

EXALIN is a program for aligning spliced mRNA to a genomic sequence template. While many such programs exist, spliced alignment is not a completely solved problem. Hence, the programs often provide different answers from one another using the very same input data. The use of different alignment heuristics contributes to this variability, as do the different alignment scoring schemes employed by the programs. Just obtaining an alignment from a program may therefore say little about its accuracy or completeness.

Rather than using ad hoc scoring systems, EXALIN makes explicit, rational use of information theory. Unlike other programs, EXALIN uses information-rich donor and acceptor splice site models that can discriminate even between multiple instances of canonical GT..AG splice junctions, based on differences at other nucleotide positions. Another key to the program's accuracy is its ability to use similarity scoring systems targeted to the specific problem of interest: for example, error-prone ESTs aligned to an accurate genomic template from the same species, or error-prone mouse ESTs aligned to an accurate human genome template. EXALIN combines and maximizes the information obtained from the splice site models with that obtained from sequence similarity. To be combined meaningfully, however, the information from these two sources must first be scaled equivalently (see the λ parameter discussed in the manuscript), an issue that no previous program has addressed. When properly parameterized, EXALIN has been observed to produce results of superior accuracy and sensitivity, even in cross-species comparisons between human and mouse.

The default sequence similarity scoring system used by EXALIN is +5/-11 match/mismatch scoring with gap initiation and extension penalties of -11 and -11. This scoring system should be generally suitable for aligning transcript sequences of “EST” quality — e.g., with roughly a 5% combined sequencing error and polymorphism rate — to an accurate genomic template from the same species. For high quality mRNA sequences, a more stringent scoring system should be used, along with an appropriately adjusted value for λ. For cross-species comparisons where the expected similarity is less than 95%, a less stringent scoring system, with an appropriately adjusted λ, should be used instead. For best accuracy, EXALIN can optionally use fully-specified scoring matrices instead of the simple match/mismatch scoring.
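To illustrate why a +5/-11 scheme suits sequences that are roughly 95% identical, the sketch below solves the standard Karlin-Altschul relation for an ungapped match/mismatch scoring system and reports the identity level the scores implicitly target. It assumes uniform base frequencies (0.25 each) and ignores gaps; the function names are hypothetical, and the λ computed here is the textbook ungapped value, which is not necessarily the value EXALIN or the manuscript uses.

```python
import math

def ungapped_lambda(match, mismatch, p_match=0.25, tol=1e-12):
    """Solve p*exp(lam*match) + (1-p)*exp(lam*mismatch) = 1 for lam > 0.

    p_match is the chance that two random bases are identical
    (0.25 under the assumed uniform base composition).
    """
    if match <= 0 or mismatch >= 0:
        raise ValueError("need a positive match and a negative mismatch score")
    if p_match * match + (1.0 - p_match) * mismatch >= 0:
        raise ValueError("expected score must be negative")

    def f(lam):
        return (p_match * math.exp(lam * match)
                + (1.0 - p_match) * math.exp(lam * mismatch) - 1.0)

    # f is convex, f(0) = 0 and f'(0) < 0, so f is negative just above
    # zero and crosses zero exactly once more; bracket that root.
    lo, hi = 1e-9, 1.0
    while f(hi) <= 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def target_identity(match, mismatch, p_match=0.25):
    """Fraction of identical positions the scoring system is tuned for."""
    lam = ungapped_lambda(match, mismatch, p_match)
    return p_match * math.exp(lam * match)

lam = ungapped_lambda(5, -11)
identity = target_identity(5, -11)
bits_per_match = lam * 5 / math.log(2)  # score of one match, in bits
print(f"lambda = {lam:.4f}, target identity = {identity:.3f}, "
      f"bits per match = {bits_per_match:.2f}")
```

Under these assumptions the target identity for +5/-11 comes out in the mid-90% range, consistent with the "EST quality" use case described above. Re-running the functions with other match/mismatch pairs gives a rough feel for what a "more stringent" or "less stringent" scheme means.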

The splice site models used by EXALIN are implemented as position-specific scoring matrices or PSSMs. The default splice site models were trained to recognize human donor and acceptor splice sites. Models for mouse, C. elegans and rice species are also included with the software, but only the human and mouse models were produced by the iterative training process described in the manuscript.
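As a minimal sketch of how a PSSM can separate two candidate sites that both contain the canonical GT dinucleotide, the example below builds a log-odds matrix (in bits) from per-position base frequencies and scores two five-base windows. The frequencies, the window length, and all names are invented for illustration; they are not the models distributed with EXALIN, and a uniform background is assumed.

```python
import math

# Assumed uniform background base composition.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical donor-site frequencies for a 5-base window:
# last exon base, the G and T of the canonical GT, then two intron bases.
DONOR_FREQS = [
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.001, "C": 0.001, "G": 0.997, "T": 0.001},
    {"A": 0.001, "C": 0.001, "G": 0.001, "T": 0.997},
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
]

def build_pssm(freqs, background):
    """Convert per-position frequencies into log-odds scores in bits."""
    return [
        {base: math.log2(column[base] / background[base]) for base in column}
        for column in freqs
    ]

def score_site(pssm, site):
    """Sum the position-specific scores for one candidate site."""
    if len(site) != len(pssm):
        raise ValueError("site length must equal the number of PSSM columns")
    return sum(column[base] for column, base in zip(pssm, site))

DONOR_PSSM = build_pssm(DONOR_FREQS, BACKGROUND)

# Both candidates contain GT at positions 2-3; only the flanks differ.
strong = score_site(DONOR_PSSM, "GGTAA")
weak = score_site(DONOR_PSSM, "CGTCT")
print(f"GGTAA: {strong:.2f} bits, CGTCT: {weak:.2f} bits")
```

With these toy numbers the first site scores positive and the second negative, even though both are GT sites: the flanking positions carry the discriminating information.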

EXALIN employs full dynamic programming (DP) by default and reports the maximum log-likelihood score and alignment of the transcript to the genomic sequence. While full DP is slower to execute than heuristic methods, our philosophy is one of emphasizing accuracy over speed, as would a scientist who is interested in the structure of a specific gene. For reference quality analysis, a little extra computer time expended to obtain higher accuracy results could later save laboratory researchers an extraordinary amount of time and money. (One might even argue that lives could be saved.)

For roughly a 100-fold increase in speed, EXALIN can optionally utilize the results of a prior BLASTN search to restrict DP to relatively small regions spanning the ends of the HSPs reported by BLASTN. The topcomboN option of WU BLAST 2.0 is useful to include in this context. Reasons to avoid the fast, “EXALIN-BLAST” approach include: suboptimal alignments can result from the restricted DP; the method does not yield an overall alignment, although it does predict the locations of splice junctions almost as reliably as the default DP method; and the reported alignment scores are inaccurate approximations to the scores obtained from full DP, which can confound efforts to distinguish native from paralogous alignments.

When very long genomic contigs are involved, an intermediate approach to accuracy and speed is to request that EXALIN perform full DP within a defined subsequence of the input genomic sequence. The relevant subsequence might be determined from a prior BLASTN search and encompass the entire genomic region spanned by the HSPs, plus buffer zones of several kilobases at each end that hopefully include any 5'- or 3'-exons that BLASTN might have missed.
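The subsequence selection described above reduces to simple coordinate arithmetic. The sketch below assumes the HSPs are available as 1-based, inclusive genomic coordinate pairs and uses a hypothetical helper name and an illustrative default buffer of 5 kb. How the resulting range is then passed to EXALIN is not shown here; see the README for the actual options.

```python
def dp_window(hsps, contig_length, buffer=5000):
    """Return the (start, end) genomic range to hand to full DP.

    hsps: list of (genomic_start, genomic_end) pairs, 1-based inclusive.
          Pairs with start > end (minus-strand hits) are accepted.
    contig_length: length of the genomic contig, used to clamp the range.
    buffer: extra bases added on each side to catch terminal exons
            that BLASTN may have missed.
    """
    if not hsps:
        raise ValueError("at least one HSP is required")
    lowest = min(min(start, end) for start, end in hsps)
    highest = max(max(start, end) for start, end in hsps)
    return max(1, lowest - buffer), min(contig_length, highest + buffer)

# Three HSPs on a 1 Mb contig, padded by the default 5 kb on each side.
window = dp_window([(12000, 12150), (15400, 15620), (20100, 20310)], 1_000_000)
print(window)
```

Clamping to the contig boundaries matters for hits near either end of the sequence, where the requested buffer would otherwise run off the contig.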

If BLASTN will be used in spite of these caveats, keep in mind that when a BLASTN job stream contains many transcripts to be compared against relatively few genomic contigs, the speed of execution can often be increased dramatically by reversing the problem and comparing the genomic contigs as queries against a database composed of the shorter transcripts. The reason this accelerates searches is that typical BLASTN searches are “I/O-bound” and, by reversing the problem, I/O costs are virtually eliminated.

Getting the Software

See the README and copyright/license. The splice model data bundled with the binaries are in the public domain.

Pre-compiled binary archives:

Source code for the 2005-05-06 release: exalin.2005-05-06.tar.gz

Acknowledgements

In addition to the several people acknowledged in the manuscript for their contributions to the project, we wish to thank commercial licensees of WU BLAST 2.0, whose support made this project possible.



Last updated: 2006-04-17

Copyright © 2005,2006 Warren R. Gish, Saint Louis, Missouri 63110 USA. All rights reserved.