MaskerAid: a 30-fold* speed enhancement to RepeatMasker


*Results obtained on human genome test data running on a single CPU, compared to using RepeatMasker natively in its slow, sensitive mode. Results may vary. Appropriateness of using MaskerAid to analyze data from other species is unknown. Be sure to read Recommendations below.

ALERT 2004-11-08:   RepeatMasker has a new home (http://www.repeatmasker.org) and current versions now utilize WU BLAST natively, without the need for MaskerAid. Users are advised to update their copy of RepeatMasker and enjoy a MaskerAid-free life — powered by WU BLAST!

ALERT 2004-05-27:   All versions of MaskerAid are incompatible with RepeatMasker 2004-03-06 and earlier, dating as far back as 2003-09-21, but not as far back as 2002-05-05. As of 2004-05-27, RepeatMasker was patched for this incompatibility.

Note: the previously posted “fixes” to MaskerAid dated 2004-05-25 and 2004-05-26 were in fact not fixes and should not be used.

What is MaskerAid?
RepeatMasker (AFA Smit & P Green) is a standard software tool used in computational genomics to identify repetitive elements and low-complexity sequences. Effective as RepeatMasker is, it can also be slow when run natively. At the program's most sensitive setting, one of today's fastest computers would require about 2 years working around the clock to analyze the entire human genome. MaskerAid is a drop-in accelerator that increases the speed of RepeatMasker about 30-fold while maintaining sensitivity.

The result? RepeatMasker with MaskerAid can mask the entire human genome:

The importance of this speed improvement is magnified when one considers that the human genome sequence available today is largely unfinished and will need to be masked at least once more as it is finished. Furthermore, the highly repetitive 3 Gb mouse genome is expected to be available soon, in the form of millions of highly redundant shotgun sequencing reads.

How does MaskerAid work?
Execution profiling of native RepeatMasker showed that the vast majority of its time was spent running the heuristic database search engine known as Cross_Match (P. Green, unpublished). MaskerAid allows the fast and flexible WU BLAST (W Gish, 1994-2000) search engine to substitute transparently for Cross_Match, yielding the described speed improvement while effectively maintaining sensitivity. MaskerAid is fundamentally a software “wrapper” around WU BLAST that makes it appear and function very much like Cross_Match -- hence a masquerade. MaskerAid itself is a PERL 5.0 script that runs on the same flavors of UNIX that RepeatMasker and WU BLAST do.

Note: No changes whatsoever need to be made to RepeatMasker in order to enjoy the combined speed and sensitivity of MaskerAid. In fact, newer versions of RepeatMasker support a -w option that conveniently invokes MaskerAid instead of Cross_Match. Use of -w is recommended if your version of RepeatMasker supports it.

What are the specific improvements?
With MaskerAid installed, RepeatMasker runs about 30-fold faster -- sometimes more, sometimes less -- at its most sensitive (“slow”) setting, while effectively maintaining sensitivity. With its multithreaded WU BLAST underpinnings, MaskerAid can even take advantage of multi-processor computer architectures to obtain a speed-up of more than 40-fold when allowed to use 2 processors.
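As a rough back-of-the-envelope illustration (not a measured result), the speed-up factors quoted above can be applied to the "about 2 years" whole-genome estimate given earlier on this page. The function name and the 365-day year are assumptions made for this sketch only; actual run times depend on hardware and data, as the footnote at the top of the page cautions.

```python
def estimated_days(native_days, speedup):
    """Scale a native run time (in days) by a speed-up factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return native_days / speedup


# "About 2 years" of native, most-sensitive masking, taken as 2 x 365 days.
NATIVE_DAYS = 2 * 365

for label, factor in (("1 CPU, ~30-fold", 30), ("2 CPUs, ~40-fold", 40)):
    print(f"{label}: about {estimated_days(NATIVE_DAYS, factor):.0f} days")
```

Under these assumptions the whole-genome job shrinks from about two years to a few weeks.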

We tested MaskerAid on a set of 20 randomly selected human genomic clones at the three speed settings of RepeatMasker (“slow”, “standard” and “quick”), and compiled a graphical representation of the repeats identified relative to those found with native RepeatMasker. Even using RepeatMasker's “quick” setting, MaskerAid provided a 3-fold speed improvement while finding more repeats.

Results
MaskerAid is described in Bioinformatics 16:1040-1 (2000). Additional supporting results are provided here.

Limitations
MaskerAid was developed to accelerate masking of repetitive elements in high-throughput genomic sequence of human origin. Some specialized RepeatMasker functions or auxiliary files unrelated to this task are currently not supported. Furthermore, MaskerAid is not an adequate substitute for Cross_Match for all of RepeatMasker's activities. The known limitations, or caveats, to using MaskerAid are:

Recommendations
In view of the above limitations, an effective strategy for taking advantage of the distinct strengths of native RepeatMasker and RepeatMasker with MaskerAid is to:

  1. include the -s option on the RepeatMasker command line whenever the available computational resources can provide adequate throughput with this most-sensitive setting, whether running natively with Cross_Match or with MaskerAid;
  2. run RepeatMasker natively with the -noint (no interspersed repeats) option;
  3. run RepeatMasker with its -w (MaskerAid) option, as well as the -nolow (no low-complexity or satellite sequences) and -norna options;
  4. combine the results from (2) and (3) by logically OR-ing the separately masked sequences. A program such as nmerge, or a simple PERL script, will suffice for this. (Note: pre-compiled nmerge binaries are included in licensed WU BLAST 2.0 distributions.)
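The logical OR in the final step can be sketched as follows. This is a minimal illustration, not nmerge and not a script distributed with MaskerAid. It assumes both runs produced hard-masked copies of the same sequence, with masked positions written as N; the function name `or_merge_masked` and the `mask_char` parameter are hypothetical.

```python
def or_merge_masked(seq_a, seq_b, mask_char="N"):
    """Return a sequence masked wherever either input is masked.

    Both inputs are assumed to be masked copies of the same original
    sequence, so they must have the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("masked sequences must have the same length")
    # Keep the residue from seq_a unless either copy is masked there.
    return "".join(
        mask_char if mask_char in (a, b) else a
        for a, b in zip(seq_a, seq_b)
    )


# One copy masked by the native -noint run, one by the -w run (toy data).
low_complexity_masked = "ACGTNNNNAC"
interspersed_masked = "NNGTACGTAC"
print(or_merge_masked(low_complexity_masked, interspersed_masked))
```

A position is masked in the merged sequence if it is masked in either input, so no repeat found by only one of the two runs is lost.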

The SuperMasker shell script may be useful, as it automates the above steps and includes the -s option.
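For orientation, the two RepeatMasker invocations described in the recommendations might be assembled as below. This is a hypothetical sketch and not the SuperMasker script itself. Only the options named on this page (-s, -noint, -w, -nolow, -norna) are used; the helper name `build_commands` and the file name `clone.fa` are assumptions, and the commands are only constructed here, not executed.

```python
def build_commands(fasta_path, sensitive=True):
    """Build argument lists for the native and MaskerAid-assisted runs."""
    base = ["RepeatMasker"]
    if sensitive:
        base.append("-s")  # most sensitive ("slow") setting
    # Native run: low-complexity and simple repeats only.
    native = base + ["-noint", fasta_path]
    # MaskerAid run: interspersed repeats only.
    aided = base + ["-w", "-nolow", "-norna", fasta_path]
    return native, aided


native_cmd, aided_cmd = build_commands("clone.fa")
print(" ".join(native_cmd))
print(" ".join(aided_cmd))
```

In practice the two runs would need separate working copies or output locations, on the assumption that each run writes its masked output alongside its input file, before the results are merged.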

Even when RepeatMasker is used natively to identify low-complexity regions, one should consider supplementing it with one or more additional low-complexity filters, such as dust or nseg. If the sequence is to be conceptually translated into protein, then a program geared toward identifying low-complexity regions in amino acid sequences, such as seg or xnu, should be used. (Again, pre-compiled binaries for these applications are included in licensed WU BLAST 2.0 distributions.)

How to get it?
MaskerAid is free for anyone to use, modify and redistribute, in accordance with Washington University's open source license. To download the latest version, click here.

MaskerAid relies on WU BLAST 2.0. A freely available version 2.0a19 is posted here; however, MaskerAid has only been thoroughly tested with the full-featured, licensed WU BLAST version 2.0, which is expected to perform better (faster and more reliably) than version 2.0a19. The licensing procedure for WU BLAST 2.0 is described here [http://blast.wustl.edu/licensing]. Note: NCBI BLAST does not provide adequate flexibility for use with RepeatMasker.

The best configuration of RepeatMasker therefore requires the acquisition of four licensed components:

Technical questions about MaskerAid may be addressed to: 
 
  This page was last updated 2004-06-07.