ALERT 2004-11-08: RepeatMasker has a new home (http://www.repeatmasker.org) and current versions now utilize WU BLAST natively, without the need for MaskerAid. Users are advised to update their copy of RepeatMasker and enjoy a MaskerAid-free life — powered by WU BLAST!
ALERT 2004-05-27: All versions of MaskerAid are incompatible with RepeatMasker versions 2003-09-21 through 2004-03-06 (version 2002-05-05 is not affected). As of 2004-05-27, RepeatMasker was patched for this incompatibility.
What is MaskerAid?
RepeatMasker (AFA Smit & P Green) is a standard software tool used in computational genomics to identify repetitive elements and low-complexity sequences. RepeatMasker is effective, but when run natively it can also be slow. At the program's most sensitive setting, one of today's fastest computers would require about 2 years working around the clock to analyze the entire human genome. MaskerAid is a drop-in accelerator that increases the speed of RepeatMasker about 30-fold while maintaining sensitivity.
The result? RepeatMasker with MaskerAid can mask the entire human genome:
The importance of this speed improvement is magnified when one considers that the human genome sequence available today is largely unfinished and will need to be masked at least once more as it is finished. Furthermore, the highly repetitive 3 Gb mouse genome is expected to be available soon, in the form of millions of highly redundant shotgun sequencing reads.
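As a rough illustration of what the quoted figures imply, the sketch below converts the "about 2 years" native run time and the "about 30-fold" speed-up into an accelerated run time. Both inputs are the approximate values stated above, the function name is hypothetical, and the result is an estimate rather than a measured timing.

```python
DAYS_PER_YEAR = 365.25


def accelerated_days(native_years, speedup):
    """Convert a native run time in years into an accelerated run time in days."""
    return native_years * DAYS_PER_YEAR / speedup


native_years = 2.0  # "about 2 years", as quoted above (approximate)
speedup = 30.0      # "about 30-fold", as quoted above (approximate)

# Rounded to whole days because the inputs are only approximate.
print(f"{accelerated_days(native_years, speedup):.0f} days")
```

Under these assumptions the estimate comes out at roughly three and a half weeks on a single comparable machine.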
How does MaskerAid work?
Execution profiling of native RepeatMasker showed that the vast majority of its time was spent running the heuristic database search engine known as Cross_Match (P. Green, unpublished). MaskerAid allows the fast and flexible WU BLAST (W Gish, 1994-2000) search engine to substitute transparently for Cross_Match, yielding the described speed improvement while effectively maintaining sensitivity.
MaskerAid is fundamentally a software “wrapper” around WU BLAST that makes it appear and function very much like Cross_Match -- hence a masquerade.
MaskerAid itself is a Perl 5.0 script that runs on the same flavors of UNIX that RepeatMasker and WU BLAST do.
Note: No changes to RepeatMasker are needed to enjoy the combined speed and sensitivity of MaskerAid. In fact, newer versions of RepeatMasker support a -w option that conveniently invokes MaskerAid instead of Cross_Match. Use of -w is recommended if your version of RepeatMasker supports it.
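The following is a minimal sketch of how a RepeatMasker run with -w might be scripted. It assumes the executable is named RepeatMasker and is on the PATH, that -s selects the most sensitive ("slow") setting, and that the installed version supports -w. The function names and the input file name clone.fa are hypothetical.

```python
import shutil
import subprocess


def build_repeatmasker_command(fasta_path, use_maskeraid=True, slow=True):
    """Return the argument list for one RepeatMasker run (nothing is executed)."""
    command = ["RepeatMasker"]
    if use_maskeraid:
        # -w: have MaskerAid invoked instead of Cross_Match (see the note above).
        command.append("-w")
    if slow:
        # -s: assumed here to select RepeatMasker's most sensitive ("slow") setting.
        command.append("-s")
    command.append(fasta_path)
    return command


def run_repeatmasker(fasta_path, **options):
    """Run RepeatMasker if it is installed; return None when it is not found."""
    command = build_repeatmasker_command(fasta_path, **options)
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=True)


# Hypothetical input file; prints the command without running it.
print(" ".join(build_repeatmasker_command("clone.fa")))
```

Building the argument list separately from running it makes it easy to inspect the command first, or to drop -w on installations whose RepeatMasker version does not support it.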
What are the specific improvements?
With MaskerAid installed, RepeatMasker runs about 30-fold faster -- sometimes more, sometimes less -- at its most sensitive (“slow”) setting, while effectively maintaining sensitivity. With its multithreaded WU BLAST underpinnings, MaskerAid can even take advantage of multi-processor computer architectures to obtain a speed-up of over 40-fold when allowed to use 2 processors.
We tested MaskerAid on a set of 20 randomly selected human genomic clones at the three speed settings of RepeatMasker (“slow”, “standard” and “quick”), and compiled a graphical representation of the repeats identified relative to those found with native RepeatMasker. Even using RepeatMasker's “quick” setting, MaskerAid provided a 3-fold speed improvement while finding more repeats.
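To put the two-processor figure in context, the sketch below compares the quoted speed-ups for the “slow” setting: about 30-fold on one processor and over 40-fold on two. Both values are approximate, so the result is only a rough indication, and the function names are hypothetical.

```python
def extra_gain(speedup_one_cpu, speedup_two_cpus):
    """Additional speed-up obtained by going from one processor to two."""
    return speedup_two_cpus / speedup_one_cpu


def parallel_efficiency(speedup_one_cpu, speedup_two_cpus, cpus=2):
    """Fraction of ideal linear scaling achieved across the given processors."""
    return extra_gain(speedup_one_cpu, speedup_two_cpus) / cpus


one_cpu = 30.0   # "about 30-fold" (approximate)
two_cpus = 40.0  # "over 40-fold", taken here as a lower bound

print(f"extra gain from second processor: {extra_gain(one_cpu, two_cpus):.2f}x")
print(f"parallel efficiency: {parallel_efficiency(one_cpu, two_cpus):.0%}")
```

Because 40-fold is quoted as a lower bound, the gain from the second processor is at least about a third on these figures.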
Results
MaskerAid is described in Bioinformatics 16:1040-1 (2000). Additional supporting results are provided here.
Limitations
MaskerAid was developed to accelerate masking of repetitive elements in high-throughput genomic sequence of human origin. Some specialized RepeatMasker functions or auxiliary files unrelated to this task are currently not supported. Furthermore, MaskerAid is not an adequate substitute for Cross_Match for all of RepeatMasker's activities.
The known limitations, or caveats, to using MaskerAid are:
Recommendations
In view of the above limitations, an effective strategy for taking advantage of the distinct strengths of native RepeatMasker and RepeatMasker with MaskerAid is to:
The SuperMasker shell script may be useful, as it automates the above steps and includes the -s option.
Even when RepeatMasker is used natively to identify low-complexity regions, one should consider supplementing it with one or more additional low-complexity filters, such as dust or nseg. If the sequence is to be conceptually translated into protein, then a program geared toward identifying low-complexity regions in amino acid sequences should be used, such as seg or xnu. (Again, precompiled binaries for these applications are included in licensed WU BLAST 2.0 distributions.)
How to get it?
MaskerAid is free for anyone to use, modify and redistribute, in accordance with Washington University's open source license. To download the latest version, click here.
MaskerAid relies on WU BLAST 2.0. A freely available version 2.0a19 is posted here; however, MaskerAid has only been thoroughly tested with the full-featured, licensed WU BLAST version 2.0, which is expected to perform better (faster and more reliably) than version 2.0a19. The licensing procedure for WU BLAST 2.0 is described here [http://blast.wustl.edu/licensing]. Note: NCBI BLAST does not provide adequate flexibility for use with RepeatMasker.
The best configuration of RepeatMasker therefore requires the acquisition of four licensed components:
Technical questions about MaskerAid may be addressed to:
This page was last updated 2004-06-07.