Miropeats discovers regions of sequence similarity amongst any set of DNA sequences and then presents this similarity information graphically. Sequence similarity searching is a very general tool that forms the basis of many different biological sequence analyses, but it is limited by the verbosity of traditional alignment presentation styles. Miropeats enhances the utility of conventional DNA sequence comparisons over long stretches of similar sequence by summarizing extensive large-scale sequence similarities on a single page of graphics. The latest version of Miropeats can be used as a general pairwise alignment program or in its traditional role of sorting out a big mess of overlapping or similar regions.
The descriptive abilities of Miropeats open research opportunities that would otherwise be impossible, tedious, or difficult. Examples include comparing the repeat structures of entire chromosomes, visualising overlapping sequence fragments in a contig assembly project and comparing the products of different contig assembly programs. Miropeats was originally written to help contig assembly projects at the Genome Sequencing Center in St. Louis, Missouri, USA, where it was found to be useful in many different roles. The intrinsic inscrutability of a string of 40,000 characters picked from an alphabet of only 4 letters (a typical cosmid assembly project) is made worse because the shotgun sequencing strategy starts with the original contiguous 40 Kb DNA sequence split into an 800 piece puzzle. Miropeats helps shotgun assembly, not by solving the puzzle itself, but by helping researchers gain an overall understanding of the task presented to them. Miropeats can do this because it draws a simple graphic that shows potential joins and cosmid overlaps, and also distinguishes tandem repeats, inverted repeats, oligo repeats and palindromes from each other.
Miropeats has options to look at all repeated DNA sequence segments (the default), or one can choose to see only those repeated sequences with both copies on a single sequence, or only those with the two copies on different sequences. The program also has an adjustable threshold that lets the user choose what length of DNA sequence similarity should be considered significant and worth displaying. This facility allows Miropeats to be used for analysing different features in sequences varying from less than a Kbase to more than a Mbase. If the picture is too complex then the threshold should be raised, but if the picture is not displaying repeats of interest then the threshold should be lowered.
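The selection described above can be pictured as a simple filter over a list of matches. The following is only an illustrative sketch in Python, not the code Miropeats uses: the Match record, the mode names "all", "intra" and "inter", and the choice to keep a match whose score exactly equals the threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical record for one pair of similar segments."""
    seq_a: str   # name of the sequence holding the first copy
    seq_b: str   # name of the sequence holding the second copy
    score: int   # similarity score of the match


def select_matches(matches, threshold=80, mode="all"):
    """Keep matches that reach the threshold and fit the chosen mode.

    mode "all"   keeps every match (the default behaviour described above)
    mode "intra" keeps only matches with both copies on one sequence
    mode "inter" keeps only matches with copies on different sequences
    """
    if mode not in ("all", "intra", "inter"):
        raise ValueError("mode must be 'all', 'intra' or 'inter'")
    kept = []
    for m in matches:
        if m.score < threshold:
            continue  # too weak to be worth drawing
        same_sequence = m.seq_a == m.seq_b
        if mode == "intra" and not same_sequence:
            continue
        if mode == "inter" and same_sequence:
            continue
        kept.append(m)
    return kept
```

Raising the threshold argument shrinks the list that would be drawn, and lowering it lets shorter or weaker repeats back in, which is the trade-off between a cluttered picture and a missing repeat.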
Miropeats itself is just a Perl script. All the DNA comparisons are done by calls to another program called ICAass, which is written in ANSI C and needs to be compiled. Miropeats has only to parse out the position and quality of any matching DNA segments and convert those above the threshold value into PostScript graphics. The output from Miropeats is always a PostScript graphic file unless no repeats were found. Miropeats was written and tested on Solaris 2.3 and Ubuntu GNU/Linux, so it may need altering slightly to run on other UNIX versions. NB: icaass needs to be in a directory listed in your $PATH environment variable.
If you need to make a new icaass executable then download the icatools package and type 'make miropeats' or just 'make'. If you type 'make' with no options then you create the lightly optimised variants of the entire ICAtools package, which could be useful anyway. Once 'icaass' and the Miropeats script are available on the user's path, the installation is complete. The program is very flexible about its DNA sequence format requirements and is not demanding of memory or computer cycles. As currently configured, ICAass will work well with any number of sequences that are shorter than 4 Mbases. Try changing the order of the sequences if one is very long.
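A quick way to confirm that the installation is complete is to ask whether both programs can be found on the path. The sketch below uses Python's standard shutil.which for this. The executable name "miropeats" is an assumption (use whatever name the script was installed under); "icaass" is the name given above.

```python
import shutil


def missing_programs(names=("icaass", "miropeats")):
    """Return the program names that cannot be found on the current PATH.

    "miropeats" is an assumed script name; adjust it to match the
    name used when the script was installed.
    """
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_programs()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("icaass and the Miropeats script were both found on PATH")
```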
Miropeats needs to be presented with DNA sequences in one of its recognized formats: EMBL, GenBank, FASTA, Staden, or plain format. The first four listed are complex formats and it is possible to have any number of sequences of the same format in any one file. All the sequence data in a plain (unformatted) file is assumed to come from a single sequence. Files suitable for analysis include consensus files from databases (e.g. Xbap - use FASTA format), from local databases (e.g. ACEdb - use dump sequence) and from public databases (e.g. GenBank - use NCBI's Web server).
The threshold score is simply defined: the number of matching bases minus the number of mismatching bases. The default threshold score is 80. Changing the threshold score ("-s integer") can be very useful when the default parameters produce a diagram that is too complicated to understand. A new assembly project with many potential joins is always going to be more complicated than a single finished cosmid, so don't be alarmed if your new sequence assembly project looks difficult; it's going to be simpler tomorrow.
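The score definition can be made concrete with a small Python sketch. It assumes the two segments are already aligned, have equal length and contain no gaps; how ICAass itself treats gaps or ambiguity codes is not described here, so this is an illustration of the arithmetic only. The function name threshold_score is made up for the example.

```python
def threshold_score(segment_a, segment_b):
    """Matching bases minus mismatching bases for two aligned segments.

    Assumes gap-free segments of equal length; the comparison is
    case-insensitive.
    """
    if len(segment_a) != len(segment_b):
        raise ValueError("segments must have the same length")
    pairs = zip(segment_a.upper(), segment_b.upper())
    matches = sum(1 for a, b in pairs if a == b)
    mismatches = len(segment_a) - matches
    return matches - mismatches
```

Under this definition a 100 base segment with 10 mismatches scores 90 - 10 = 80, which is the default threshold value, while a perfect 80 base repeat also scores 80.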
If you are only interested in a certain portion of your sequences and want a graphic marking repeated sequence in just those regions, you have to create new subsequence files containing just those selected regions and then run the program again. Miropeats can cope with hundreds of files listed after the command and threshold options.
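Making a subsequence file is easy to script. The sketch below, in Python, copies one region of a named sequence from a FASTA file into a new FASTA file. The function names, the 1-based inclusive coordinates and the "name_start_end" header of the new record are choices made for this example, not conventions required by Miropeats.

```python
def read_fasta(path):
    """Return a list of (name, sequence) pairs from a FASTA file."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                fields = line[1:].split()
                name = fields[0] if fields else ""
                chunks = []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def write_subsequence(in_path, out_path, seq_name, start, end, width=60):
    """Write bases start..end (1-based, inclusive) of seq_name to out_path.

    Returns the number of bases written.
    """
    for name, seq in read_fasta(in_path):
        if name != seq_name:
            continue
        if not 1 <= start <= end <= len(seq):
            raise ValueError("region lies outside the sequence")
        region = seq[start - 1:end]
        with open(out_path, "w") as out:
            out.write(f">{name}_{start}_{end}\n")
            for i in range(0, len(region), width):
                out.write(region[i:i + width] + "\n")
        return len(region)
    raise KeyError(seq_name)
```

The new file can then be listed after the command and threshold options like any other sequence file. Note that positions in the resulting graphic will be relative to the start of the extracted region, not to the original sequence.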
In the following examples, both the original PostScript version of the Miropeats graphic and a rather coarse GIF equivalent are normally available. If you have a fast computer then use the PostScript versions because they are easier to understand.
To see a Miropeats comparison between two instances of a homologous gene, G6PD, one from Fugu rubripes and the other from human, click on PostScript version or GIF version. The large number of repeats drawn above the human sequence are mostly Alu repeats, which are obviously absent from the Fugu fish homologue. The 11 pairs of lines that join the two G6PD sequences correspond to the 11 exons that are found in both examples of this gene.
To see an example of what a relatively complex C. elegans sequence assembly project can look like when some initial "finishing" work has been done, click on PostScript version or GIF version. Many features are represented, including examples of each of the different types of DNA repeats distinguishable using Miropeats: tandem, inverted, tandem oligo, and palindrome. The graphic also shows that contig number "00241" is probably just a subsequence of "00328" that has not yet been integrated into a single contig. Please note that the programs have no divine knowledge and cannot distinguish between biological duplication events and assembly mistakes - that is for you to decide.
Most DNA analysis programs could not cope with either reading or sensibly displaying entire yeast chromosomes of over 600 Kbases in length, but Miropeats does it quite easily. It takes about 4 minutes on a 40 MHz SPARCstation to examine yeast chromosome 8 (570 Kbases) looking for all its internal repeats. To see an example showing three separate chromosomes on a single diagram, click on PostScript version or GIF version. By switching off repeats between sequences it is possible to see the repeats within the three chromosomes more clearly. The telomeres are obvious as inverted repeats in all the chromosomes, the mating locus is the main feature in the centre of Chromosome 3 and two duplications are shown next to each other in Chromosome 8. Also of interest is the striking overall similarity of the chromosomes' repeat patterns.
Miropeats is not very sophisticated about the positioning of sequence fragments on the page and this can lead to interesting information being hidden behind a jumble of crossing lines. One day I would like to write an interactive version of Miropeats that would display graphics and allow the user to choose which fragments should be drawn and where they should be placed or use highlighting as the mouse moves over the page. Consed assembly view now has some of these features.
When working with human DNA sequences, or any DNA containing many repeats, there is a chance that interesting matches will be lost amongst the many Alus etc. that are not usually of much consequence. It would be nice if there were an option to screen out the interconnecting lines from certain classes of well-known repeats.
Sequences should not be longer than 4Mbases.
Jeremy Parsons makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty. It's academic software, OK! See the GNU copyright notice.
The latest version of the code should always be available from the My littlest website.
Please send me an email at the address below to ensure that I keep you informed of bug fixes and to let me know what improvements you would like. Please include the word Miropeats somewhere on the subject line.
jparsons [at goes here] littlest.co.uk