See the open-access publication abstract, article and supplemental materials, by Miao Zhang and Warren Gish.
EXALIN is a program for aligning spliced mRNA
to a genomic sequence template.
While many such programs exist,
spliced alignment is not a completely solved problem.
Hence, the programs often
provide different answers from one another using the very same input data.
The use of different alignment heuristics contributes
to this variability,
as do the different alignment scoring schemes employed by the programs.
Just obtaining an alignment from a program may therefore say little
about its accuracy or completeness.
Rather than using ad hoc scoring systems,
EXALIN makes explicit, rational use of information theory.
Unlike other programs,
EXALIN uses information-rich donor
and acceptor splice site models that are capable of discriminating
even between multiple instances of canonical GT..AG splice junctions
due to differences present at other nucleotide positions.
Another key to the program's accuracy is its ability to use similarity
scoring systems targeted to the specific problem of interest —
e.g., error-prone ESTs aligned to an accurate genomic template
from the same species or error-prone mouse ESTs aligned to an accurate
human genome template.
EXALIN combines and maximizes the information obtained
from the splice site models
with that obtained from sequence similarity.
However, in order to be meaningfully combined,
the information from these two sources
must first be scaled equivalently
(see the λ parameter discussed in the manuscript),
which is another issue that no previous program has addressed.
When properly parameterized,
EXALIN has been observed
to produce results of superior accuracy and sensitivity,
even in cross-species comparisons between human and mouse.
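The scaling step described above can be illustrated with a minimal sketch. It assumes that λ converts a raw similarity score into natural-log units (as the scale parameter does in Karlin-Altschul statistics) and that the splice site scores are log-odds values in bits; the manuscript's exact formulation may differ. The function name and all numbers are hypothetical.

```python
import math


def combined_score_bits(similarity_score, lam, donor_bits, acceptor_bits):
    """Put a raw similarity score and splice-site log-odds on one scale (bits).

    Assumption: lam * similarity_score is a log-likelihood ratio in nats,
    so dividing by ln(2) expresses it in bits, the same unit assumed for
    the donor and acceptor scores.
    """
    similarity_bits = lam * similarity_score / math.log(2)  # nats -> bits
    return similarity_bits + donor_bits + acceptor_bits


# Hypothetical values, for illustration only.
total = combined_score_bits(similarity_score=400, lam=0.27,
                            donor_bits=8.5, acceptor_bits=7.0)
print(f"combined score: {total:.1f} bits")
```

Without the rescaling, a change of similarity scoring system (say, doubling every score) would silently change the relative weight given to the splice site models.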
The default sequence similarity scoring system used by EXALIN is
+5/-11 match/mismatch scoring with gap initiation and extension penalties
of -11 and -11.
This scoring system should be generally suitable for aligning transcript
sequences of “EST” quality
(e.g., with roughly a 5% combined sequencing error and polymorphism rate)
to an accurate genomic template from the same species.
For high quality mRNA sequences,
a more stringent scoring system should be used,
along with an appropriately adjusted value for λ.
For cross-species comparisons where the expected similarity
is less than 95%,
a less stringent scoring system,
with an appropriately adjusted λ,
should be used instead.
For best accuracy,
EXALIN can optionally use fully-specified scoring matrices
instead of the simple match/mismatch scoring.
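To see why +5/-11 suits sequences of roughly 95% identity, one can compute the scale parameter and the implied target identity for an ungapped match/mismatch system. The sketch below assumes uniform base composition (so two random bases match with probability 0.25) and uses the standard Karlin-Altschul relation; it is not taken from the EXALIN source, and the function names are hypothetical.

```python
import math


def solve_lambda(match, mismatch, p_match=0.25):
    """Positive root of p*exp(lam*match) + (1-p)*exp(lam*mismatch) = 1.

    `mismatch` is passed as a negative number (e.g. -11). Solved by
    bisection; requires a negative expected score per aligned pair.
    """
    if p_match * match + (1 - p_match) * mismatch >= 0:
        raise ValueError("expected score must be negative")

    def f(lam):
        return (p_match * math.exp(lam * match)
                + (1 - p_match) * math.exp(lam * mismatch) - 1.0)

    lo, hi = 1e-6, 1.0          # f(lo) < 0 just above the trivial root at 0
    while f(hi) <= 0:           # expand until the root is bracketed
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def target_identity(match, mismatch, p_match=0.25):
    """Fraction of identical positions the scoring system is tuned for."""
    lam = solve_lambda(match, mismatch, p_match)
    return p_match * math.exp(lam * match)


lam = solve_lambda(5, -11)
print(f"lambda = {lam:.4f}, target identity = {target_identity(5, -11):.3f}")
```

Under these assumptions the target identity for +5/-11 comes out in the mid-90% range, consistent with the EST-quality use case described above. A more stringent system (a larger mismatch penalty relative to the match reward) raises the target identity, and a less stringent one lowers it; λ changes in each case.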
The splice site models used by EXALIN are
implemented as position-specific scoring matrices or PSSMs.
The default splice site models
were trained to recognize human donor and acceptor splice sites.
Models for mouse, C. elegans and rice species
are also included with the software,
but only the human and mouse models were produced by the iterative
training process described in the manuscript.
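As a rough illustration of how a PSSM can separate two candidate sites that both contain the canonical GT, the sketch below builds a log-odds matrix from a handful of aligned donor sites and scores candidates against it. The training sites, pseudocount, and uniform background are all hypothetical; the models shipped with EXALIN were trained on real data by the procedure in the manuscript, which is not reproduced here.

```python
import math

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PSSM from equal-length aligned sites."""
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background)
            for base in BASES
        })
    return pssm


def score_site(pssm, candidate):
    """Sum the per-position log-odds scores (in bits) for a candidate."""
    return sum(column[base] for column, base in zip(pssm, candidate))


# Hypothetical donor sites: 3 exonic bases followed by 6 intronic bases.
toy_donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTGAGT", "CAGGTAAGG"]
pssm = build_pssm(toy_donors)

# Both candidates contain GT at the junction; only the flanking bases differ.
for candidate in ("CAGGTAAGT", "CAGGTCCCT"):
    print(candidate, f"{score_site(pssm, candidate):.2f} bits")
```

The first candidate scores higher because its bases outside the GT dinucleotide also agree with the model, which is the kind of discrimination described earlier for multiple GT..AG candidates.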
EXALIN employs full DP by default
and reports the maximum log-likelihood score and alignment
of the transcript to the genomic sequence.
While full DP is slower to execute than heuristic methods,
our philosophy is one of emphasizing accuracy over speed,
as would a scientist who is interested in the structure
of a specific gene.
For reference quality analysis,
a little extra computer time expended to obtain higher accuracy results
could later save laboratory researchers
an extraordinary amount of time and money.
(One might even argue that lives could be saved.)
For roughly a 100-fold increase in speed,
EXALIN can optionally utilize the results
of a prior BLASTN search
to restrict DP to relatively small regions spanning the ends
of the HSPs reported by BLASTN.
WU BLAST 2.0
is useful to include in this context.
Reasons to avoid the fast, restricted-DP approach include:
suboptimal alignments can result from the restricted DP;
the method does not yield an overall alignment,
although it does predict the locations of splice junctions almost as reliably
as the default DP method;
and the reported alignment scores are inaccurate approximations
to the scores obtained from full DP,
which can confound efforts to distinguish native from paralogous alignments.
When very long genomic contigs are involved,
an intermediate approach to accuracy and speed is to request that
EXALIN perform full DP within a defined subsequence of the input genomic sequence.
The relevant subsequence might be determined from a prior BLASTN search
and encompass the entire genomic region spanned by the HSPs,
plus buffer zones of several kilobases at each end
that hopefully include any 5'- or 3'-exons that
BLASTN might have missed.
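The subsequence selection described above amounts to taking the span of the HSPs and padding it at both ends. A minimal sketch follows, assuming HSP coordinates are 1-based, inclusive positions on the genomic contig; the function name and the 5 kb default buffer are hypothetical, and how the resulting range is passed to EXALIN is left to the program's own documentation.

```python
def dp_window(hsps, contig_length, buffer=5000):
    """Return (start, end) of the genomic region to hand to full DP.

    hsps          -- list of (start, end) genomic coordinates, 1-based
                     inclusive; either order is accepted so minus-strand
                     HSPs work too
    contig_length -- length of the genomic contig, used for clamping
    buffer        -- bases added at each end to catch exons the prior
                     search may have missed
    """
    if not hsps:
        raise ValueError("at least one HSP is required")
    lowest = min(min(start, end) for start, end in hsps)
    highest = max(max(start, end) for start, end in hsps)
    return max(1, lowest - buffer), min(contig_length, highest + buffer)


# Hypothetical HSP coordinates on a 100 kb contig.
print(dp_window([(12000, 12150), (15800, 15990)], contig_length=100_000))
```

A larger buffer reduces the chance of cutting off a short terminal exon at the cost of more DP time, since the work grows with the length of the genomic window.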
In spite of the caveats,
if BLASTN will be used,
keep in mind that,
when the BLASTN job stream contains many transcripts
to be compared against relatively few genomic contigs,
the speed of execution can often be increased dramatically
by reversing the problem and comparing the genomic contigs
as queries against a database composed of the shorter transcripts.
The reason this accelerates searches is that typical
searches are “I/O-bound” and, by reversing the problem,
I/O costs are virtually eliminated.
See the README and copyright/license. The splice model data bundled with the binaries are in the public domain.
Pre-compiled binary archives:
Source code for the 2005-05-06 release: exalin.2005-05-06.tar.gz
In addition to the several people acknowledged in the manuscript for their contributions to the project, we wish to thank commercial licensees of WU BLAST 2.0, whose support made this project possible.
Last updated: 2006-04-17
Copyright © 2005,2006 Warren R. Gish, Saint Louis, Missouri 63110 USA. All rights reserved.