“One code to find them all”: a perl tool to conveniently parse RepeatMasker output files
© Bailly-Bechet et al.; licensee BioMed Central Ltd. 2014
Received: 28 January 2014
Accepted: 7 April 2014
Published: 1 May 2014
Of the different bioinformatic methods used to recover transposable elements (TEs) in genome sequences, one of the most commonly used procedures is the homology-based method proposed by the RepeatMasker program. RepeatMasker generates several output files, including the .out file, which provides annotations for all detected repeats in a query sequence. However, a remaining challenge consists of identifying the different copies of TEs that correspond to the identified hits. This step is essential for any evolutionary/comparative analysis of the different copies within a family. Different possibilities can lead to multiple hits corresponding to a unique copy of an element, such as the presence of large deletions/insertions or undetermined bases, and distinct consensus corresponding to a single full-length sequence (like for long terminal repeat (LTR)-retrotransposons). These possibilities must be taken into account to determine the exact number of TE copies.
We have developed a perl tool that parses the RepeatMasker .out file to better determine the number and positions of TE copies in the query sequence, in addition to computing quantitative information for the different families. To determine the accuracy of the program, we tested it on several RepeatMasker .out files corresponding to two organisms (Drosophila melanogaster and Homo sapiens) for which the TE content has already been largely described and which present great differences in genome size, TE content, and TE families.
Our tool provides access to detailed information concerning the TE content in a genome at the family level from the .out file of RepeatMasker. This information includes the exact position and orientation of each copy, its proportion in the query sequence, and its quality compared to the reference element. In addition, our tool allows a user to directly retrieve the sequence of each copy and obtain the same detailed information at the family level when a local library with incomplete TE class/subclass information was used with RepeatMasker. We hope that this tool will be helpful for people working on the distribution and evolution of TEs within genomes.
KeywordsTransposable elements RepeatMasker Annotation
Large proportions of eukaryotic genomes are essentially composed of repeated sequences, including the human (approximately 45 to 78% [1, 2]), maize (approximately 80% ), and salamander (approximately 50% ) genomes. Among these repeated sequences, transposable elements (TEs) represent the most significant contributors in terms of sequence coverage and therefore have a major influence on genome evolution, especially on genome size . In contrast to other repeated sequences, TEs consist of a wide diversity of sequences; in addition to the separation in classes based on the transposition intermediate (RNA versus DNA), many subfamilies are described inside each class, corresponding to elements with particular sequence features, and many efforts were made to unify the classification system for all of these elements [6, 7].
With the ever-growing number of whole genome sequencing projects, the identification of TEs becomes necessary to fully characterize the evolutionary dynamics of genomes. Different methods of TE identification have been developed during the past 15 years, with the majority designed to determine TE content in assembled genome sequences produced by the classic Sanger sequencing method (for reviews, see Bergman and Quesneville , Saha et al., and Lerat ). These methods group three main types of approaches to recover TE sequences: homology-based approaches that search for a reference sequence in a query genome; structure-based approaches that search for particular structural features of certain TE classes, such as the presence of two long terminal repeats (LTRs) at the extremities of LTR-retrotransposons; and de novo approaches that principally use the repetitive nature of TEs to discover them.
More recently, with the emergence of next generation sequencing (NGS) technologies, new efforts were made to develop novel tools to detect TEs because previous methods are not directly applicable to reads produced by NGS data [11, 12]. However, one of the most commonly used procedures to find occurrences of known TEs remains the homology-based method proposed by the RepeatMasker program  because it is easy to use, rapid, and efficient [14, 15]. The main drawback of this program is its dependence on reference sequences and consequent inability to discover new TEs. This method however remains a must for identifying TE sequences in an assembly or after the identification of new consensus TE sequences using de novo methods. For example, this last approach (de novo TE libraries used with RepeatMasker) was applied for the identification of TEs in the 12 Drosophila genomes .
The principle of RepeatMasker is to search for the occurrence of any reference sequence contained in a library (currently Dfam  and RepBase , or user-built in) in a query sequence using a sequence comparison approach based on popular search engines including nhmmer, cross_match, ABBlast/WUBlast, RMBlast, and Decypher . RepeatMasker generates several output files, including the .out file, which provides a detailed annotation of all detected repeats in the query sequence, specifically including their position, orientation, and divergence from the reference sequence . This .out file is particularly useful because it identifies the part of the query sequence that matches a given TE family of a library (a ‘hit’) and provides its position in the query sequence for each one. However, a remaining challenge consists of identifying the different copies of elements corresponding to those ‘hits’ , which is a prerequisite for any evolutionary or comparative analysis of different copies of a family.
Some programs proposing the use of RepeatMasker output files were developed [21–23], but none allow access to the location of all of the TE copies or an accurate quantification of the family content at the genomic level. These programs usually have very specific aims. TSDFinder was developed to refine the coordinates of long interspersed nuclear element (LINE) L1 insertions by identifying flanking target site duplication (TSD) sequences and the poly(A) tails of 3′ intact L1 insertions in the human genome . The LTR-miner program was designed to specifically retrieve information concerning the age and distribution of LTR-retrotransposons . This program was then implemented in the Reannotate program for use on all categories of TEs to estimate the temporal order of insertions in the case of nested elements and to estimate the age of LTR-retrotransposon copies .
In this manuscript, we propose a perl tool (available at http://doua.prabi.fr/software/one-code-to-find-them-all) that parses the RepeatMasker .out files to accurately determine the number of TE copies found, obtain their positions, and retrieve their sequences. This tool should be helpful for any non-bioinformatics scientist interested in genome annotation and/or evolution. To our knowledge, this program is the first multi-purpose tool that correctly identifies TE copies using RepeatMasker and provides complementary quantitative information for individual families in a query sequence.
The proposed tool consists of two perl scripts that must be run successively to take into account the different characteristics of the consensus sequences.
Script 1: build_dictionary.pl
This script builds a list of all of the LTR-retrotransposons found in the query sequence at least once by RepeatMasker to associate hits corresponding to the internal portion and those corresponding to LTR sequences. This module uses the RepeatMasker .out file or a directory path containing several RepeatMasker .out files as input. RepeatMasker files are recognized based on their .out extension, allowing the program to run recursively on large file structures without prior file sorting (for example, working on one organism by running the program on an entire directory downloaded from a genome database). Then, the program matches together internal and LTR portions, based on name similarity. The main issue with this step relies on heterogeneity in the annotation of LTR-retrotransposons in the library. For example, the majority of LTR-retrotransposons in Drosophila melanogaster appear under the name ‘TE_LTR’ and ‘TE_I’ for the LTR sequence and the internal sequence, respectively. However, the members ‘LTR’ and ‘internal portion’ may sometimes have different names. This scenario is, for example, the case for the LTR-retrotransposon HMS-beagle, for which the corresponding LTR sequence is labelled DMLTR5, while the internal portion is labelled HMSBEAGLE_I. The same problem occurs more frequently for retrotransposons in Homo sapiens, making it difficult to derive a completely generalized algorithm to determine the concordance between the LTR and internal portions. These issues imply that the output file of build_dictionary.pl must be manually inspected to correct for any mis-association.
In the standard version (see --fuzzy option for the alternative version), the program only recognizes similar names in addition to the ‘LTR’ or ‘int’ suffix or prefix, taking into account small discrepancies such as a ‘-’ symbol replaced by a ‘_’ symbol, for example, recognizing the association between HERV-Fc2-int and HERV-Fc2_LTR.
--rm infile (corresponds to a RepeatMasker .out file or the name of a directory containing several RepeatMasker .out files).
This option prints a summary of the different usages of the script.
This option allows the script to associate more LTR names with internal counterparts to account for the possibility of LTR variants. In three successive passes, the program associates similar names differing by a single letter, a single number, or two characters. For example, in the human genome, the --fuzzy option allows for the association of MER66-int with its various counterparts MER66A, MER66B, MER66C, and MER66D or HERV1_I-int with HERV1_LTRa, HERV1_LTRb, HERV1_LTRc, HERV1_LTRd, and HERV1_LTRe.
To be used in particular cases where the RepeatMasker program was run using a local TE library without the class/subclass specification (see below).
Finally, the name and path of the output file should be specified using a redirection (>dictionary_output.txt). Examples of command lines are detailed in the tutorial available on the program website.
Script 2: one_code_to_find_them_all.pl
The second script uses the output file produced by build_dictionary.pl and a RepeatMasker .out file (or a directory containing several RepeatMasker .out files). The principle of this program is to compare the positions and orientation of each hit corresponding to the same TE family to determine whether the hits correspond to the same copy and can be merged or correspond to different copies. Two hits located on the same scaffold or chromosome are considered to be fragments of the same copy if they abide by the three following conditions: 1) they have the same orientation; 2) the extremities of the fragments respect a distance criterion: by default the furthest extremities should be separated by less than twice the length of the reference TE element (see the --insert option for non-default behavior); and 3) the second fragment starts and ends after the first one respectively starts and ends (that is, the two fragments can overlap but cannot be included in one another). These constraint filters were motivated by a conservative choice, meaning not to merge copies that do not belong to the same insertion. However, one shortcoming of this methodology is that it may be impossible to re-assemble old copies in which many insertions of other elements had taken place after this copy was first inserted in the genome. Moreover, we may over-estimate the copy number if a portion of a given copy is inverted, leading to several fragments in different orientations.
The identification of unique copies of LTR-retrotransposons depends on the different fragments and different portions of the element (LTR and internal portions), as follows. First, we identify different fragments of the same portion that could be later assembled as a copy. For that purpose, two LTR fragments must not be separated by a compatible internal fragment, and two internal fragments must not be separated by a compatible LTR fragment. These steps are necessary for the merging of fragments into a copy. Once all copies are reconstructed from the RepeatMasker hits, the program assembles full-length LTR-retrotransposons by associating LTR copies and their corresponding internal copy located closely to one another. Conditions for associating a LTR sequence with an internal sequence include the following: the LTR sequence must be in the same orientation as the internal sequence, and it must be separated from the internal sequence by less than half the LTR length. The reconstruction of full-length ‘LTR-I-LTR’ elements is performed as a priority, and with the remaining copies, incomplete ‘LTR-I’ or ‘I-LTR’ elements are then built. All copies, assembled or solo, are reported. As solo-LTRs are of special evolutionary interest, they are reported separately from the full-length and partial LTR-retrotransposon copies in the summary file .copynumber.csv (see below).
--rm infile (corresponds to a RepeatMasker .out file or the name of a directory containing several RepeatMasker .out files).
--ltr output file from build_dictionary.pl (Script 1).
This option prints the possible usages of the script.
This option makes the program use a rule based on the 80-80-80 rule  to select hits. In this case, the program provides copies with sizes greater than 80 bp long and which have greater than 80% identity to the reference element. By default, the program gives all hits found, regardless of the size or percentage of identity compared with the reference.
This option allows users to work with their own file for the length of the reference elements, which will be used to determine the ratio of the length of a given copy compared with its reference. If not provided, the code computes the length of all elements (LTR and internal portions separated for the LTR-retrotransposons) present in all .out files under study, by selecting for each element the most common consensus length (as in some cases multiple RepeatMasker consensus sequences can correspond to the same element). This option is valuable when working with elements whose annotation is ambiguous to ensure that the correct reference length is used. It can also be used with another purpose, when only a subset of TEs is considered, since only the elements mentioned in the .length file will be taken into account.
This option allows users to manually resolve ambiguous situations by choosing their favorite solution for merging hits. For example, Figure 2A shows a case in which two choices are possible, that is, two different hits can be assembled with the one under study (DM297_I at position 21,407,284 on the chromosome X). In this case, the first choice (solution 0) is the most parsimonious. Solution 0 is always the one corresponding to assembling closest hits together. However, this solution may come to a fault in the case of multiple nested or duplicated TEs corresponding to the same reference element. For example, in Figure 2B, solution 1 is the most parsimonious, that is, the one that minimizes the reorganization of the copy compared to the reference element structure. If this option is not specified, the default choice consists of choosing solution 0.
As many ambiguous cases can arise, the RepeatMasker block ID (column 14 of RepeatMasker .out file) is used when this option is activated. These IDs come from the ProcessRepeats script implemented in RepeatMasker, which makes educated guesses if any pair of fragments is derived from the same element or not. Therefore, if an ambiguous situation can be solved unequivocally using these Block IDs, no choice is left to the user, and the elements sharing the same Block ID are merged.
Another way of quickening the choice process is to only ask the user about ambiguous cases, and sometimes a single choice can disambiguate multiple situations. For example, consider the situation for which three fragments A, B, and C are considered for merging, and for which the choice is between A-B and A-C (choices are always pairwise). If the user considers the right choice to be A-B-C, he/she will choose A-B. Then, if adding C to the merged A-B is not ambiguous (if there is no D fragment of the same element nearby to get confused with, for example), the code will directly merge C with A-B, getting the right result A-B-C without asking the user about this final merging.
This option performs all operations, but reports no results except the log file with all operations performed. It is designed to be used in tests, particularly those determining the number of ambiguous situations to be resolved. Running the program with this option before the actual analysis allows estimation of the time required to complete an analysis with the --choice option because the number of ambiguous situations can be high, and manual choice is time-consuming if applied to all elements in a genome.
In the particular case in which the RepeatMasker program was run using a local library that did not use the naming system required to differentiate the class and the subclass (required format is described in RepeatMasker help file), the .out file is slightly different because column 11 (repeat class/family) is usually filled with ‘Unknown’ or ‘Unspecified’ , which means that the type of individual TE is not specified. To account for this possibility, the user can use the --unknown option, which will produce results for elements annotated as ‘Unknown’ or ‘Unspecified’ and deriving from the local, unannotated bank.
--fasta and --flanking ‘size_in_bp’
The --fasta option allows for the retrieving of sequences of copies reported by the program from the local fasta sequence files used in the RepeatMasker program. To study flanking sequences of the determined copies, the --flanking option can be specified to allow the program to report the flanking regions of the specified size surrounding each copy in addition to the TE sequence.
This option changes the code behavior for merging fragments into copies. By default, the furthest extremities of the considered fragment to be merged are compared, and merging takes place if they are less than twice the reference element length apart. Using --insert, the size of the genomic sequence between the two closest extremities of the considered fragments (that is, the size of the insertion between them) will be considered: if it is less or equal to the size given in the option, the fragments are merged. For example, using --insert 0 means only fragments detected right next to each other in the query sequence will be considered as parts of the same copy.
By default, five output files are generated, which are located in the same directory as the RepeatMasker .out file(s), plus one output file located in the working directory (.length file) that is produced only if the --length option was not specified.
The .log.txt file contains the screen output of the program. For each element, this file summarizes the number of hits and copies obtained after merging the hits. When the --dry-run option is chosen, it displays the possible choices that would be asked using the --choice option.
For each TE category (DNA transposons, LINEs, short interspersed nuclear elements (SINEs), and LTR-retrotransposons), the global information concerning the number of fragments, number of copies, number of base pairs, and percent coverage are given and correspond to lines beginning with ‘######Type:DNA’, ‘######Type:LINE’, ‘######Type:SINE’ , and ‘######Type:LTR’. The column ‘length’ in this case contains a NA. For example, in Figure 3A, the DNA/hAT transposon hobo (reference length of 3,016 bp) has 40 fragments on chromosome 2L corresponding to 21 copies. These copies span 20,529 bp on chromosome 2L, which represents 0.09% of this chromosome. The end of the file gives global information concerning all TEs (and thus the coverage of all TEs on the analyzed sequence), satellites, low complexity regions, simple repeats, and unknown repeat elements (see Figure 3B).
Individual copies of TEs correspond to lines beginning with the # character followed by the identification number of the merged hits with each one separated by a slash (/). For those that have been reconstructed using several hits, the fragments used to build the considered copies are shown below. For each copy, the ‘%_Div’ (percentage divergence to the reference), ‘%_Del’ (percentage of deletion compared to the reference), and ‘%_Ins’ (percentage of insertion compared to the reference) are the means of the values of each fragment normalized by size.
In the example in Figure 4, the first copy corresponds to a single fragment of a Doc element, which is nearly complete, whereas the copy below has been reconstructed using three fragments that also correspond to a Doc element. The third example corresponds to a complete copy of the copia LTR-retrotransposon, which has been reconstructed with respect to the separation of the ‘internal portion’ and ‘LTR’ in the consensus library. The last example corresponds to a full-length solo-LTR of copia2. The position of each copy is provided in columns 5, 6, and 7, which correspond to the name of the query sequence, position of the start of the copy in the query sequence, and position of the end of the copy in the query sequence, respectively. The orientation on the strand (+ or complementary) is specified in column 9. In the example in Figure 4, the reconstructed copy of the copia element is located in the long arm of chromosome 2 (chr2L), starts at position 3,073,087, ends at the position 3,078,231 (is 5,145 bp long), and is on the positive strand (+).
The last output file, .elem_sorted.csv, contains the same information as the two previous ones, but sorted per genomic position and not per element, in order to be easily used by people interested in the genomic context and distribution of TEs.
Results and discussion
To determine the accuracy of the program, we tested it with several RepeatMasker .out files corresponding to two organisms, D. melanogaster and H. sapiens, for which the TE content has already been largely described and which present great differences in genome size, TE content, and TE families.
Test of the D. melanogaster genome RepeatMasker output files
We retrieved the RepeatMasker .out files (thereafter mentioned as UCSC files) from the UCSC Genome Bioinformatics website (http://genome.ucsc.edu/), which were produced using version dm3 (April 2006) of the genome sequence with the 17 May 2007 (open-3.1.8) version of RepeatMasker and library release 20061006. Each file corresponds to a different chromosome (2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, 4, U, Uextra, X, XHet, and YHet). We did not retrieve the file corresponding to the mitochondrial genome. We also retrieved the unique RepeatMasker .out file (thereafter mentioned as RM file) provided for the same genome version on the RepeatMasker website (http://www.repeatmasker.org) using the library release 20080611 and open-3.2.5 version of RepeatMasker. This file contains the results for all chromosomes.
Determining the number of ambiguous cases that may require manual inspection (option --dry-run/--choice)
The option --dry-run was used with the UCSC files to determine the number of ambiguous cases that could be manually expertized. For all chromosomes, 862 cases appeared (see Additional file 1: Table S1 for individual chromosome detail). We investigated the cases corresponding to chromosome 3R for which eight ambiguous cases were identified. For all but two cases, the default solution 0 was the best choice from a biological point of view (minimizing the reorganization of the copy compared to the reference element structure). For the two remaining cases, the best choices were solution 1 and the last solution (not assemble the fragments). For chromosome X for which 14 ambiguous cases were indicated, solution 0 was the best choice in ten cases and the last solution (to let the first fragment alone) was the best choice for four cases. This result indicates that the default choice made by the program is the best choice (the most biologically sound) in the majority of cases.
Running the program with and without the --strict option
We did not initially specify use of the --strict option and successively ran the program with the UCSC and RM files. When the --strict option is not specified, the program considers every hit without filtering using our 80-80 rule. We observed the same amount of TEs globally (both in terms of copy number and chromosome coverage, see Additional file 2: Table S2 and Additional file 3: Table S3) for the two versions of the Repeat Library used with slightly more copies detected in the RM file (208 more copies, see Additional file 2: Table S2). This observation can be explained by the fact that the library used in this case was more recent and thus capable of containing new reference elements. In the results from the UCSC files we observed that the DNAREP1 element was associated with the repeat class family LINE/Penelope, as proposed when it was first described , whereas it is now known to correspond to the repeat class family DNA/Helitron. In the annotation from the RM file, the association is correct, indicating that the Repeat Library used by UCSC incorrectly assigned this element to the LINE category, which was later corrected in a new version. We therefore chose to consider only the output file from the RepeatMasker website (RM file) for the rest of the test. This underlines the importance of a correct TE classification to obtain an accurate amount of particular elements.
Copy number per chromosome for each category of TEs in D. melanogaster
Without --strict option
With --strict option
Total (without unplaced)
Using the output files *.transposons.csv and *.ltr.csv, which contain details for the copies for each heterochromatin chromosome, we retrieved all of the potentially full-length elements by selecting copies whose ratios compared with the reference were over 95% (%_of_Ref, column 17). We obtained 474 copies corresponding to this criterion, which is more than the 202 full-length elements previously described  but that includes 130 full-length solo-LTRs. We did the same to determine the number of potentially full-length elements in euchromatin regions and found a total of 655 elements (1,039 elements when counting the highly represented DNAREP1, which is no more active and full-length solo-LTRs (170 copies)). This number is higher than the 478 full-length elements described with an older version of the D. melanogaster genome, which annotated only 1,572 TE copies . This result demonstrates that our program can quickly identify potentially full-length elements.
In terms of proportion, the global TE content on chromosomes is congruent with what was previously shown [26, 27] with an average of 6.69% (6.04% with the --strict option) of TEs in euchromatin regions (without taking into account chromosome 4) and 61.63% (52.53% with the --strict option) of TEs for heterochromatin regions (see Additional file 4: Table S4).
Test of the tool using the H. sapiens genome RepeatMasker output files
We retrieved the RepeatMasker .out file from the RepeatMasker website (http://www.repeatmasker.org), which was produced using the hg19 version (February 2009) of the genome sequence with the open-3.3.8 version of RepeatMasker and Repeat Library 20120124. This file contains results for all chromosomes, that is, 22 autosomal chromosomes and the two sex chromosomes (X and Y) that we considered in the test. We did not take into account results corresponding to randomly placed sequences, unplaced sequences (chrUn), and particular regions of chromosome 6 (corresponding to different haplotypes of the major histocompatibility complex region), chromosome 4, and chromosome 17.
Determining the number of ambiguous cases that may require manual inspection (option --dry-run/--choice)
We determined the number of ambiguous cases that could be manually expertized for our file. For all of the considered chromosomes, a total of 12,133 possible choices appeared, which could potentially be investigated (see Additional file 8: Table S5 for the number by chromosomes). This large number indicates that complete manual annotation would be impossible to manage; however, by reducing the analysis to some TE families of interest, it would still be possible.
Running the program with and without the --strict option
Percent coverage of each chromosome and for each class of TE in H. sapiens with and without the -- strict option
Without --strict option
With --strict option
One original feature of our program is the ability to compute detailed quantitative information chromosome by chromosome, which differs from the output table produced by RepeatMasker. This feature allows us to show that the representation of each TE class differs according to the chromosome. For DNA transposons, chromosomes 3 and 20 displayed the highest proportion of these elements (4.05% and 4.17%, respectively), whereas the Y chromosome is particularly poor in elements of this class with only 0.79%. The X chromosome contains the highest proportion of LINEs and LTR-retrotransposons (33.71% and 11.38%, respectively) with chromosome 22 harboring the lowest proportion of the same elements (10.95% LINEs and 4.64% LTR-retrotransposons). Finally, SINEs are particularly abundant on chromosome 19 (26.98%) and rare on the Y chromosome (4.38%). Globally, the X chromosome has the highest proportion of TEs (58.77%), whereas the Y chromosome has the lowest proportion of TEs (23.84%). This observation is congruent with the discrepancy observed for particular families between the autosomal and sex chromosomes .
Percent coverage of each chromosome of the most represented TE families using the -- strict option
We have developed a tool to conveniently parse the classic RepeatMasker .out file to improve the original annotation provided, by including reconstruction of full-length copies. This information includes in particular a measure of the quality of the copies compared to a reference element, as well as the exact position and orientation of each copy and some quantification concerning their proportion in the genome/chromosome sequence, allowing for a fast and accurate assessment of the exact TE content. Additionally, the sequence of each copy with or without flanking sequences can be retrieved directly, allowing further analyses of the TEs. We hope that this tool will assist non-bioinformatics scientists in the more accurate identification of TE copies.
Availability and requirements
Project name: One code to find them all.
Project home: http://doua.prabi.fr/software/one-code-to-find-them-all.
Operating system(s): Linux/Unix, Mac OS X, Windows (with Perl installed).
Programming language: Perl.
License: GNU General Public License.
Long interspersed nuclear element
Long terminal repeat
Next generation sequencing
Short interspersed nuclear element
Target site duplication.
We are particularly thankful to the two reviewers for their very positive feedback and their great help to improve our program. The English of the manuscript has been edited by the American Journal Experts company.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860-921. 10.1038/35057062View ArticlePubMedGoogle Scholar
- de Koning APJ, Gu W, Castoe TA, Batzer MA, Pollock DD: Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet 2011, 7: e1002384. 10.1371/journal.pgen.1002384PubMed CentralView ArticlePubMedGoogle Scholar
- Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, et al.: The B73 maize genome: complexity, diversity, and dynamics. Science 2009, 326: 1112-1115. 10.1126/science.1178534View ArticlePubMedGoogle Scholar
- Sun C, Shepard DB, Chong RA, Arriaza JL, Hall K, Castoe TA, Feschotte C, Pollock DD, Mueller RL: LTR retrotransposons contribute to genomic gigantism in plethodontid salamanders. Genome Biol Evol 2012, 4: 168-183. 10.1093/gbe/evr139PubMed CentralView ArticlePubMedGoogle Scholar
- Biémont C, Vieira C: Genetics: junk DNA as an evolutionary force. Nature 2006, 443: 521-524. 10.1038/443521aView ArticlePubMedGoogle Scholar
- Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, Paux E, SanMiguel P, Schulman AH: A unified classification system for eukaryotic transposable elements. Nat Rev Genet 2007, 8: 973-982. 10.1038/nrg2165View ArticlePubMedGoogle Scholar
- Kapitonov VV, Jurka J: A universal classification of eukaryotic transposable elements implemented in Repbase. Nat Rev Genet 2008, 9: 411-412. author reply 414 10.1038/nrg2165-c1View ArticlePubMedGoogle Scholar
- Bergman CM, Quesneville H: Discovering and detecting transposable elements in genome sequences. Brief Bioinform 2007, 8: 382-392. 10.1093/bib/bbm048View ArticlePubMedGoogle Scholar
- Saha S, Bridges S, Magbanua Z, Peterson D: Computational approaches and tools used in identification of dispersed repetitive DNA sequences. Trop Plant Biol 2008, 1: 85-96. 10.1007/s12042-007-9007-5View ArticleGoogle Scholar
- Lerat E: Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity (Edinb) 2010, 104: 520-533. 10.1038/hdy.2009.165View ArticleGoogle Scholar
- Chaparro C, Sabot F: Methods and software in NGS for TE analysis. Methods Mol Biol 2012, 859: 105-114. 10.1007/978-1-61779-603-6_6View ArticlePubMedGoogle Scholar
- Modolo L, Lerat E: Identification and analysis of transposable elements in genomic sequences. In Genome analysis: Current Procedures and Applications. Edited by: Poptsova M. Norwich: Caister Academic Press; 2013:165-181.Google Scholar
- Smit AF, Hubley R, Green P: RepeatMasker Open-3.0. ( ) 1996–2004 http://www.repeatmasker.org () 1996–2004
- Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, Gelpke MD, Roach J, Oh T, Ho IY, Wong M, Detter C, Verhoef F, Predki P, Tay A, Lucas S, Richardson P, Smith SF, Clark MS, Edwards YJ, Doggett N, Zharkikh A, Tavtigian SV, Pruss D, Barnstead M, Evans C, et al.: Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 2002, 297: 1301-1310. 10.1126/science.1072104View ArticlePubMedGoogle Scholar
- Juretic N, Bureau TE, Bruskiewich RM: Transposable element annotation of the rice genome. Bioinformatics 2004, 20: 155-160. 10.1093/bioinformatics/bth019View ArticlePubMedGoogle Scholar
- Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer VN, Pollard DA, Sackton TB, Larracuente AM, Singh ND, Abad JP, Abt DN, Adryan B, Aguade M, Akashi H, Anderson WW, Aquadro CF, Ardell DH, Arguello R, Artieri CG, Barbash DA, Barker D, Barsanti P, Batterham P, Batzoglou S, Drosophila 12 Genomes Consortium, et al.: Evolution of genes and genomes on the Drosophila phylogeny. Nature 2007, 450: 203-218. 10.1038/nature06341View ArticlePubMedGoogle Scholar
- Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AFA, Finn RD: Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res 2013, 41: D70-D82. 10.1093/nar/gks1265PubMed CentralView ArticlePubMedGoogle Scholar
- Jurka J: Repbase update: a database and an electronic journal of repetitive elements. Trends Genet 2000, 16: 418-420. 10.1016/S0168-9525(00)02093-XView ArticlePubMedGoogle Scholar
- Tempel S: Using and understanding RepeatMasker. Methods Mol Biol 2012, 859: 29-51. 10.1007/978-1-61779-603-6_2View ArticlePubMedGoogle Scholar
- Ma J, Devos KM, Bennetzen JL: Analyses of LTR-retrotransposon structures reveal recent and rapid genomic DNA loss in rice. Genome Res 2004, 14: 860-869. 10.1101/gr.1466204PubMed CentralView ArticlePubMedGoogle Scholar
- Szak ST, Pickeral OK, Makalowski W, Boguski MS, Landsman D, Boeke JD: Molecular archeology of L1 insertions in the human genome. Genome Biol 2002, 3: research0052.1-research0052.18.View ArticleGoogle Scholar
- Pereira V: Insertion bias and purifying selection of retrotransposons in the Arabidopsis thaliana genome. Genome Biol 2004, 5: R79. 10.1186/gb-2004-5-10-r79PubMed CentralView ArticlePubMedGoogle Scholar
- Pereira V: Automated paleontology of repetitive DNA with REANNOTATE. BMC Genomics 2008, 9: 614. 10.1186/1471-2164-9-614PubMed CentralView ArticlePubMedGoogle Scholar
- Kapitonov VV, Jurka J: Molecular paleontology of transposable elements in the Drosophila melanogaster genome. Proc Natl Acad Sci U S A 2003, 100: 6569-6574. 10.1073/pnas.0732024100PubMed CentralView ArticlePubMedGoogle Scholar
- Yang HP, Hung TL, You TL, Yang TH: Genomewide comparative analysis of the highly abundant transposable element DINE-1 suggests a recent transpositional burst in Drosophila yakuba. Genetics 2006, 173: 189-196. 10.1534/genetics.105.051714PubMed CentralView ArticlePubMedGoogle Scholar
- Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D: Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol 2005, 1: 166-175.PubMedGoogle Scholar
- Smith CD, Shu S, Mungall CJ, Karpen GH: The release 5.1 annotation of Drosophila melanogaster heterochromatin. Science 2007, 316: 1586-1591. 10.1126/science.1139815PubMed CentralView ArticlePubMedGoogle Scholar
- Kaminker JS, Bergman CM, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DA, Lewis SE, Rubin GM, Ashburner M, Celniker SE: The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol 2002, 3: research0084-0084.20.PubMed CentralView ArticlePubMedGoogle Scholar
- Kvikstad EM, Makova KD: The (r)evolution of SINE versus LINE distributions in primate genomes: sex chromosomes are important. Genome Res 2010, 20: 600-613. 10.1101/gr.099044.109PubMed CentralView ArticlePubMedGoogle Scholar
- Mills RE, Bennett EA, Iskow RC, Devine SE: Which transposable elements are active in the human genome? Trends Genet 2007, 23: 183-191. 10.1016/j.tig.2007.02.006View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.