“One code to find them all”: a perl tool to conveniently parse RepeatMasker output files
Mobile DNA volume 5, Article number: 13 (2014)
Of the different bioinformatic methods used to recover transposable elements (TEs) in genome sequences, one of the most commonly used procedures is the homology-based method proposed by the RepeatMasker program. RepeatMasker generates several output files, including the .out file, which provides annotations for all detected repeats in a query sequence. However, a remaining challenge consists of identifying the different copies of TEs that correspond to the identified hits. This step is essential for any evolutionary/comparative analysis of the different copies within a family. Different possibilities can lead to multiple hits corresponding to a unique copy of an element, such as the presence of large deletions/insertions or undetermined bases, and distinct consensus corresponding to a single full-length sequence (like for long terminal repeat (LTR)-retrotransposons). These possibilities must be taken into account to determine the exact number of TE copies.
We have developed a perl tool that parses the RepeatMasker .out file to better determine the number and positions of TE copies in the query sequence, in addition to computing quantitative information for the different families. To determine the accuracy of the program, we tested it on several RepeatMasker .out files corresponding to two organisms (Drosophila melanogaster and Homo sapiens) for which the TE content has already been largely described and which present great differences in genome size, TE content, and TE families.
Our tool provides access to detailed information concerning the TE content in a genome at the family level from the .out file of RepeatMasker. This information includes the exact position and orientation of each copy, its proportion in the query sequence, and its quality compared to the reference element. In addition, our tool allows a user to directly retrieve the sequence of each copy and obtain the same detailed information at the family level when a local library with incomplete TE class/subclass information was used with RepeatMasker. We hope that this tool will be helpful for people working on the distribution and evolution of TEs within genomes.
Large proportions of eukaryotic genomes are essentially composed of repeated sequences, including the human (approximately 45 to 78% [1, 2]), maize (approximately 80% ), and salamander (approximately 50% ) genomes. Among these repeated sequences, transposable elements (TEs) represent the most significant contributors in terms of sequence coverage and therefore have a major influence on genome evolution, especially on genome size . In contrast to other repeated sequences, TEs consist of a wide diversity of sequences; in addition to the separation in classes based on the transposition intermediate (RNA versus DNA), many subfamilies are described inside each class, corresponding to elements with particular sequence features, and many efforts were made to unify the classification system for all of these elements [6, 7].
With the ever-growing number of whole genome sequencing projects, the identification of TEs becomes necessary to fully characterize the evolutionary dynamics of genomes. Different methods of TE identification have been developed during the past 15 years, with the majority designed to determine TE content in assembled genome sequences produced by the classic Sanger sequencing method (for reviews, see Bergman and Quesneville , Saha et al., and Lerat ). These methods group three main types of approaches to recover TE sequences: homology-based approaches that search for a reference sequence in a query genome; structure-based approaches that search for particular structural features of certain TE classes, such as the presence of two long terminal repeats (LTRs) at the extremities of LTR-retrotransposons; and de novo approaches that principally use the repetitive nature of TEs to discover them.
More recently, with the emergence of next generation sequencing (NGS) technologies, new efforts were made to develop novel tools to detect TEs because previous methods are not directly applicable to reads produced by NGS data [11, 12]. However, one of the most commonly used procedures to find occurrences of known TEs remains the homology-based method proposed by the RepeatMasker program  because it is easy to use, rapid, and efficient [14, 15]. The main drawback of this program is its dependence on reference sequences and consequent inability to discover new TEs. This method however remains a must for identifying TE sequences in an assembly or after the identification of new consensus TE sequences using de novo methods. For example, this last approach (de novo TE libraries used with RepeatMasker) was applied for the identification of TEs in the 12 Drosophila genomes .
The principle of RepeatMasker is to search for the occurrence of any reference sequence contained in a library (currently Dfam  and RepBase , or user-built in) in a query sequence using a sequence comparison approach based on popular search engines including nhmmer, cross_match, ABBlast/WUBlast, RMBlast, and Decypher . RepeatMasker generates several output files, including the .out file, which provides a detailed annotation of all detected repeats in the query sequence, specifically including their position, orientation, and divergence from the reference sequence . This .out file is particularly useful because it identifies the part of the query sequence that matches a given TE family of a library (a ‘hit’) and provides its position in the query sequence for each one. However, a remaining challenge consists of identifying the different copies of elements corresponding to those ‘hits’ , which is a prerequisite for any evolutionary or comparative analysis of different copies of a family.
Some scenarios in particular can lead to multiple hits corresponding to a unique copy of an element. The first scenario, in the case of a LTR-retrotransposon, comes from the split of its consensus into a sequence corresponding to the LTR and a sequence corresponding to the internal portion of the element (Figure 1A). This separate annotation for LTR-retrotransposons is supported to facilitate the identification of solo-LTRs, which may be numerous in some genomes . Multiple hits corresponding to only one copy of a given element can also result from large deletions (Figure 1B) or insertions that occur in sequences and disrupt the entire copy, leading to nested TEs (Figure 1C). Moreover, the presence of undetermined bases, which may occur due to low sequencing quality, could also disrupt unique sequences corresponding to a copy and give multiple hits. Taken together, these characteristics induce multiple hits corresponding to a unique copy for a given TE in the RepeatMasker .out file. Finally, non-significant hits can be present in the output file, in addition to sequences that do not fit the 80-80-80 rule , that is, sequences that would align with the reference on less than 80 bp, on less than 80% of their respective length, and with less than 80% of identity.
Some programs proposing the use of RepeatMasker output files were developed [21–23], but none allow access to the location of all of the TE copies or an accurate quantification of the family content at the genomic level. These programs usually have very specific aims. TSDFinder was developed to refine the coordinates of long interspersed nuclear element (LINE) L1 insertions by identifying flanking target site duplication (TSD) sequences and the poly(A) tails of 3′ intact L1 insertions in the human genome . The LTR-miner program was designed to specifically retrieve information concerning the age and distribution of LTR-retrotransposons . This program was then implemented in the Reannotate program for use on all categories of TEs to estimate the temporal order of insertions in the case of nested elements and to estimate the age of LTR-retrotransposon copies .
In this manuscript, we propose a perl tool (available at http://doua.prabi.fr/software/one-code-to-find-them-all) that parses the RepeatMasker .out files to accurately determine the number of TE copies found, obtain their positions, and retrieve their sequences. This tool should be helpful for any non-bioinformatics scientist interested in genome annotation and/or evolution. To our knowledge, this program is the first multi-purpose tool that correctly identifies TE copies using RepeatMasker and provides complementary quantitative information for individual families in a query sequence.
The proposed tool consists of two perl scripts that must be run successively to take into account the different characteristics of the consensus sequences.
Script 1: build_dictionary.pl
This script builds a list of all of the LTR-retrotransposons found in the query sequence at least once by RepeatMasker to associate hits corresponding to the internal portion and those corresponding to LTR sequences. This module uses the RepeatMasker .out file or a directory path containing several RepeatMasker .out files as input. RepeatMasker files are recognized based on their .out extension, allowing the program to run recursively on large file structures without prior file sorting (for example, working on one organism by running the program on an entire directory downloaded from a genome database). Then, the program matches together internal and LTR portions, based on name similarity. The main issue with this step relies on heterogeneity in the annotation of LTR-retrotransposons in the library. For example, the majority of LTR-retrotransposons in Drosophila melanogaster appear under the name ‘TE_LTR’ and ‘TE_I’ for the LTR sequence and the internal sequence, respectively. However, the members ‘LTR’ and ‘internal portion’ may sometimes have different names. This scenario is, for example, the case for the LTR-retrotransposon HMS-beagle, for which the corresponding LTR sequence is labelled DMLTR5, while the internal portion is labelled HMSBEAGLE_I. The same problem occurs more frequently for retrotransposons in Homo sapiens, making it difficult to derive a completely generalized algorithm to determine the concordance between the LTR and internal portions. These issues imply that the output file of build_dictionary.pl must be manually inspected to correct for any mis-association.
In the standard version (see --fuzzy option for the alternative version), the program only recognizes similar names in addition to the ‘LTR’ or ‘int’ suffix or prefix, taking into account small discrepancies such as a ‘-’ symbol replaced by a ‘_’ symbol, for example, recognizing the association between HERV-Fc2-int and HERV-Fc2_LTR.
The following parameter must be provided in the program:
--rm infile (corresponds to a RepeatMasker .out file or the name of a directory containing several RepeatMasker .out files).
Three options can be specified by users:
This option prints a summary of the different usages of the script.
This option allows the script to associate more LTR names with internal counterparts to account for the possibility of LTR variants. In three successive passes, the program associates similar names differing by a single letter, a single number, or two characters. For example, in the human genome, the --fuzzy option allows for the association of MER66-int with its various counterparts MER66A, MER66B, MER66C, and MER66D or HERV1_I-int with HERV1_LTRa, HERV1_LTRb, HERV1_LTRc, HERV1_LTRd, and HERV1_LTRe.
To be used in particular cases where the RepeatMasker program was run using a local TE library without the class/subclass specification (see below).
Finally, the name and path of the output file should be specified using a redirection (>dictionary_output.txt). Examples of command lines are detailed in the tutorial available on the program website.
Script 2: one_code_to_find_them_all.pl
The second script uses the output file produced by build_dictionary.pl and a RepeatMasker .out file (or a directory containing several RepeatMasker .out files). The principle of this program is to compare the positions and orientation of each hit corresponding to the same TE family to determine whether the hits correspond to the same copy and can be merged or correspond to different copies. Two hits located on the same scaffold or chromosome are considered to be fragments of the same copy if they abide by the three following conditions: 1) they have the same orientation; 2) the extremities of the fragments respect a distance criterion: by default the furthest extremities should be separated by less than twice the length of the reference TE element (see the --insert option for non-default behavior); and 3) the second fragment starts and ends after the first one respectively starts and ends (that is, the two fragments can overlap but cannot be included in one another). These constraint filters were motivated by a conservative choice, meaning not to merge copies that do not belong to the same insertion. However, one shortcoming of this methodology is that it may be impossible to re-assemble old copies in which many insertions of other elements had taken place after this copy was first inserted in the genome. Moreover, we may over-estimate the copy number if a portion of a given copy is inverted, leading to several fragments in different orientations.
The identification of unique copies of LTR-retrotransposons depends on the different fragments and different portions of the element (LTR and internal portions), as follows. First, we identify different fragments of the same portion that could be later assembled as a copy. For that purpose, two LTR fragments must not be separated by a compatible internal fragment, and two internal fragments must not be separated by a compatible LTR fragment. These steps are necessary for the merging of fragments into a copy. Once all copies are reconstructed from the RepeatMasker hits, the program assembles full-length LTR-retrotransposons by associating LTR copies and their corresponding internal copy located closely to one another. Conditions for associating a LTR sequence with an internal sequence include the following: the LTR sequence must be in the same orientation as the internal sequence, and it must be separated from the internal sequence by less than half the LTR length. The reconstruction of full-length ‘LTR-I-LTR’ elements is performed as a priority, and with the remaining copies, incomplete ‘LTR-I’ or ‘I-LTR’ elements are then built. All copies, assembled or solo, are reported. As solo-LTRs are of special evolutionary interest, they are reported separately from the full-length and partial LTR-retrotransposon copies in the summary file .copynumber.csv (see below).
The parameters required by the program include the following:
--rm infile (corresponds to a RepeatMasker .out file or the name of a directory containing several RepeatMasker .out files).
--ltr output file from build_dictionary.pl (Script 1).
Several options can be specified by users:
This option prints the possible usages of the script.
This option makes the program use a rule based on the 80-80-80 rule  to select hits. In this case, the program provides copies with sizes greater than 80 bp long and which have greater than 80% identity to the reference element. By default, the program gives all hits found, regardless of the size or percentage of identity compared with the reference.
This option allows users to work with their own file for the length of the reference elements, which will be used to determine the ratio of the length of a given copy compared with its reference. If not provided, the code computes the length of all elements (LTR and internal portions separated for the LTR-retrotransposons) present in all .out files under study, by selecting for each element the most common consensus length (as in some cases multiple RepeatMasker consensus sequences can correspond to the same element). This option is valuable when working with elements whose annotation is ambiguous to ensure that the correct reference length is used. It can also be used with another purpose, when only a subset of TEs is considered, since only the elements mentioned in the .length file will be taken into account.
This option allows users to manually resolve ambiguous situations by choosing their favorite solution for merging hits. For example, Figure 2A shows a case in which two choices are possible, that is, two different hits can be assembled with the one under study (DM297_I at position 21,407,284 on the chromosome X). In this case, the first choice (solution 0) is the most parsimonious. Solution 0 is always the one corresponding to assembling closest hits together. However, this solution may come to a fault in the case of multiple nested or duplicated TEs corresponding to the same reference element. For example, in Figure 2B, solution 1 is the most parsimonious, that is, the one that minimizes the reorganization of the copy compared to the reference element structure. If this option is not specified, the default choice consists of choosing solution 0.
As many ambiguous cases can arise, the RepeatMasker block ID (column 14 of RepeatMasker .out file) is used when this option is activated. These IDs come from the ProcessRepeats script implemented in RepeatMasker, which makes educated guesses if any pair of fragments is derived from the same element or not. Therefore, if an ambiguous situation can be solved unequivocally using these Block IDs, no choice is left to the user, and the elements sharing the same Block ID are merged.
Another way of quickening the choice process is to only ask the user about ambiguous cases, and sometimes a single choice can disambiguate multiple situations. For example, consider the situation for which three fragments A, B, and C are considered for merging, and for which the choice is between A-B and A-C (choices are always pairwise). If the user considers the right choice to be A-B-C, he/she will choose A-B. Then, if adding C to the merged A-B is not ambiguous (if there is no D fragment of the same element nearby to get confused with, for example), the code will directly merge C with A-B, getting the right result A-B-C without asking the user about this final merging.
This option performs all operations, but reports no results except the log file with all operations performed. It is designed to be used in tests, particularly those determining the number of ambiguous situations to be resolved. Running the program with this option before the actual analysis allows estimation of the time required to complete an analysis with the --choice option because the number of ambiguous situations can be high, and manual choice is time-consuming if applied to all elements in a genome.
In the particular case in which the RepeatMasker program was run using a local library that did not use the naming system required to differentiate the class and the subclass (required format is described in RepeatMasker help file), the .out file is slightly different because column 11 (repeat class/family) is usually filled with ‘Unknown’ or ‘Unspecified’ , which means that the type of individual TE is not specified. To account for this possibility, the user can use the --unknown option, which will produce results for elements annotated as ‘Unknown’ or ‘Unspecified’ and deriving from the local, unannotated bank.
--fasta and --flanking ‘size_in_bp’
The --fasta option allows for the retrieving of sequences of copies reported by the program from the local fasta sequence files used in the RepeatMasker program. To study flanking sequences of the determined copies, the --flanking option can be specified to allow the program to report the flanking regions of the specified size surrounding each copy in addition to the TE sequence.
This option changes the code behavior for merging fragments into copies. By default, the furthest extremities of the considered fragment to be merged are compared, and merging takes place if they are less than twice the reference element length apart. Using --insert, the size of the genomic sequence between the two closest extremities of the considered fragments (that is, the size of the insertion between them) will be considered: if it is less or equal to the size given in the option, the fragments are merged. For example, using --insert 0 means only fragments detected right next to each other in the query sequence will be considered as parts of the same copy.
By default, five output files are generated, which are located in the same directory as the RepeatMasker .out file(s), plus one output file located in the working directory (.length file) that is produced only if the --length option was not specified.
The .log.txt file contains the screen output of the program. For each element, this file summarizes the number of hits and copies obtained after merging the hits. When the --dry-run option is chosen, it displays the possible choices that would be asked using the --choice option.
The .copynumber.csv file contains quantitative information about each of the identified TE families in the query sequence. This file displays eight columns (see Figure 3A as an example corresponding to some DNA transposons and LTR-retrotransposons detected on the long arm of the chromosome 2 (2L) of D. melanogaster): column 1, Family, category of the given TE (as specified in the column 11 ‘repeat class/family’ of the RepeatMasker output file); column 2, Element, name of the given TE (as specified in the column 10 ‘matching repeat’ of the RepeatMasker output file); column 3, Length, length of the reference TE in bp (information from the consensus sequences, as found in the .length file). In the absence of either the internal or LTR portion of a LTR-retrotransposon in the query files, the column will specify ‘No_ref_available’; column 4, Fragments, number of hits found by RepeatMasker corresponding to a given TE; column 5, Copies, total number of copies reconstructed from the hits (if the --strict option was selected, this number can be null, meaning that none of the fragments passed our 80-80 rule); column 6, Solo_LTR, number of solo-LTRs reconstructed from the hits. The column will specify ‘NA’ for non-LTR elements; column 7, Total_Bp, total number of base pairs corresponding to a given TE for the analyzed query sequence; and column 8, Cover, percent coverage of a given TE in the analyzed query sequence.
For each TE category (DNA transposons, LINEs, short interspersed nuclear elements (SINEs), and LTR-retrotransposons), the global information concerning the number of fragments, number of copies, number of base pairs, and percent coverage are given and correspond to lines beginning with ‘######Type:DNA’, ‘######Type:LINE’, ‘######Type:SINE’ , and ‘######Type:LTR’. The column ‘length’ in this case contains a NA. For example, in Figure 3A, the DNA/hAT transposon hobo (reference length of 3,016 bp) has 40 fragments on chromosome 2L corresponding to 21 copies. These copies span 20,529 bp on chromosome 2L, which represents 0.09% of this chromosome. The end of the file gives global information concerning all TEs (and thus the coverage of all TEs on the analyzed sequence), satellites, low complexity regions, simple repeats, and unknown repeat elements (see Figure 3B).
The *.ltr.csv and *.transposons.csv files (see Figure 4 as an example) contain the list of all occurrences of LTR-retrotransposons, and of non-LTRretrotransposons and DNA transposons, respectively, which were identified by the program. In these files, the columns globally correspond to those proposed in the RepeatMasker .out file, with the exception of the sixth and the last two columns. The (left) column of the RepeatMasker file, the sixth one, is replaced with the length of the reconstructed copy, from the consensus point of view (that is, it can be different from the span on the query sequence). The ‘Num_Assembled’ column corresponds to the number of hits assembled into the different copies. The ‘%_of_Ref’ column represents the proportion of length of the reconstructed copy compared to the reference element. This ratio is expected to be 1 if the reconstructed copy is the same length as the reference element. These numbers thus provide information about the integrity and quality of the copies inserted in the genome; that is, for a given family or superfamily, copies which are mostly full-length (ratio close to 1) and with low divergence from the reference, could result from recent insertion events. In the case of solo-LTRs, that is, copies that only correspond to the LTR section of a consensus, the ratio is computed in reference to the length of the LTR sequence. This implies that full-length solo-LTRs will have a ratio of 1.
Individual copies of TEs correspond to lines beginning with the # character followed by the identification number of the merged hits with each one separated by a slash (/). For those that have been reconstructed using several hits, the fragments used to build the considered copies are shown below. For each copy, the ‘%_Div’ (percentage divergence to the reference), ‘%_Del’ (percentage of deletion compared to the reference), and ‘%_Ins’ (percentage of insertion compared to the reference) are the means of the values of each fragment normalized by size.
In the example in Figure 4, the first copy corresponds to a single fragment of a Doc element, which is nearly complete, whereas the copy below has been reconstructed using three fragments that also correspond to a Doc element. The third example corresponds to a complete copy of the copia LTR-retrotransposon, which has been reconstructed with respect to the separation of the ‘internal portion’ and ‘LTR’ in the consensus library. The last example corresponds to a full-length solo-LTR of copia2. The position of each copy is provided in columns 5, 6, and 7, which correspond to the name of the query sequence, position of the start of the copy in the query sequence, and position of the end of the copy in the query sequence, respectively. The orientation on the strand (+ or complementary) is specified in column 9. In the example in Figure 4, the reconstructed copy of the copia element is located in the long arm of chromosome 2 (chr2L), starts at position 3,073,087, ends at the position 3,078,231 (is 5,145 bp long), and is on the positive strand (+).
The last output file, .elem_sorted.csv, contains the same information as the two previous ones, but sorted per genomic position and not per element, in order to be easily used by people interested in the genomic context and distribution of TEs.
Results and discussion
To determine the accuracy of the program, we tested it with several RepeatMasker .out files corresponding to two organisms, D. melanogaster and H. sapiens, for which the TE content has already been largely described and which present great differences in genome size, TE content, and TE families.
Test of the D. melanogaster genome RepeatMasker output files
We retrieved the RepeatMasker .out files (thereafter mentioned as UCSC files) from the UCSC Genome Bioinformatics website (http://genome.ucsc.edu/), which were produced using version dm3 (April 2006) of the genome sequence with the 17 May 2007 (open-3.1.8) version of RepeatMasker and library release 20061006. Each file corresponds to a different chromosome (2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, 4, U, Uextra, X, XHet, and YHet). We did not retrieve the file corresponding to the mitochondrial genome. We also retrieved the unique RepeatMasker .out file (thereafter mentioned as RM file) provided for the same genome version on the RepeatMasker website (http://www.repeatmasker.org) using the library release 20080611 and open-3.2.5 version of RepeatMasker. This file contains the results for all chromosomes.
Determining the number of ambiguous cases that may require manual inspection (option --dry-run/--choice)
The option --dry-run was used with the UCSC files to determine the number of ambiguous cases that could be manually expertized. For all chromosomes, 862 cases appeared (see Additional file 1: Table S1 for individual chromosome detail). We investigated the cases corresponding to chromosome 3R for which eight ambiguous cases were identified. For all but two cases, the default solution 0 was the best choice from a biological point of view (minimizing the reorganization of the copy compared to the reference element structure). For the two remaining cases, the best choices were solution 1 and the last solution (not assemble the fragments). For chromosome X for which 14 ambiguous cases were indicated, solution 0 was the best choice in ten cases and the last solution (to let the first fragment alone) was the best choice for four cases. This result indicates that the default choice made by the program is the best choice (the most biologically sound) in the majority of cases.
Running the program with and without the --strict option
We did not initially specify use of the --strict option and successively ran the program with the UCSC and RM files. When the --strict option is not specified, the program considers every hit without filtering using our 80-80 rule. We observed the same amount of TEs globally (both in terms of copy number and chromosome coverage, see Additional file 2: Table S2 and Additional file 3: Table S3) for the two versions of the Repeat Library used with slightly more copies detected in the RM file (208 more copies, see Additional file 2: Table S2). This observation can be explained by the fact that the library used in this case was more recent and thus capable of containing new reference elements. In the results from the UCSC files we observed that the DNAREP1 element was associated with the repeat class family LINE/Penelope, as proposed when it was first described , whereas it is now known to correspond to the repeat class family DNA/Helitron. In the annotation from the RM file, the association is correct, indicating that the Repeat Library used by UCSC incorrectly assigned this element to the LINE category, which was later corrected in a new version. We therefore chose to consider only the output file from the RepeatMasker website (RM file) for the rest of the test. This underlines the importance of a correct TE classification to obtain an accurate amount of particular elements.
Table 1 displays the number of copies per chromosome with and without the use of the --strict option. As expected, the global number of copies decreased from 9,134 to 5,656 copies in the euchromatin portion of the genome when the 80-80 rule was applied. This last number is congruent with the 5,409 annotated copies in the D. melanogaster euchromatin in the FlyBase annotation version r5.49 (http://flybase.org) . The results also showed that the copy number in unplaced chromosomes is particularly high, indicating that the euchromatin is far from a complete reflection of the entire genome in terms of TE content. While heterochromatin regions display less TE copies (5,066 copies without the --strict option and 3,451 copies with the --strict option), TEs represent a large coverage of these regions (approximately 60% on average, see Additional file 4: Table S4).
Using the output files *.transposons.csv and *.ltr.csv, which contain details for the copies for each heterochromatin chromosome, we retrieved all of the potentially full-length elements by selecting copies whose ratios compared with the reference were over 95% (%_of_Ref, column 17). We obtained 474 copies corresponding to this criterion, which is more than the 202 full-length elements previously described  but that includes 130 full-length solo-LTRs. We did the same to determine the number of potentially full-length elements in euchromatin regions and found a total of 655 elements (1,039 elements when counting the highly represented DNAREP1, which is no more active and full-length solo-LTRs (170 copies)). This number is higher than the 478 full-length elements described with an older version of the D. melanogaster genome, which annotated only 1,572 TE copies . This result demonstrates that our program can quickly identify potentially full-length elements.
In terms of proportion, the global TE content on chromosomes is congruent with what was previously shown [26, 27] with an average of 6.69% (6.04% with the --strict option) of TEs in euchromatin regions (without taking into account chromosome 4) and 61.63% (52.53% with the --strict option) of TEs for heterochromatin regions (see Additional file 4: Table S4).
Another example of what can be directly performed using the outfiles *.transposons.csv and *.ltr.csv is displayed in Figure 5. The divergence of sequences (%_Div, column 2) was plotted against the size ratio for each copy compared with the reference element (%_of_Ref, column 17) for each superfamily in the euchromatin portion of the genome (chromosomes 2L, 2R, 3L, 3R, 4, and X). This procedure can allow the quality of the copies inserted into the genome to be determined quickly; that is, for a given family or superfamily, if the copies are mostly full-length (ratio close to 1) and not divergent from the reference, this could indicate recent insertion events. For example, in Figure 5, the elements from the LTR/Copia superfamily (including the families copia, copia2, FROGGER, and 1731) mainly correspond to highly conserved copies (with a small divergence compared to their reference) with two populations of copies: one corresponding to almost full-length copies (potentially recent insertions) and the other corresponding to short copies. When looking in more detail, the populations of conserved copies of small sizes correspond mainly to copia2 copies but do not represent solo-LTRs (see Additional file 5: Figure S1 for individual representation of copia, copia2, FROGGER, and 1731 families). The same information can be produced for the other LTR-retrotransposon classes (Additional file 6: Figure S2 and Additional file 7: Figure S3 for individual family representations of Gypsy and BEL/Pao elements, respectively). Elements from the LINE/LOA superfamily, which in this case correspond to only one family (the Baggins family), had copies with low divergence compared to the reference but with different sizes, and a few of them were full-length, which could illustrate the same date of activity for the different copies and the transposition mechanism for LINE-like elements, which can be truncated at their 5′ end upon insertion. Thus, globally, we can easily obtain information concerning the population of copies of a given family and their positions in the genome.
Test of the tool using the H. sapiens genome RepeatMasker output files
We retrieved the RepeatMasker .out file from the RepeatMasker website (http://www.repeatmasker.org), which was produced using the hg19 version (February 2009) of the genome sequence with the open-3.3.8 version of RepeatMasker and Repeat Library 20120124. This file contains results for all chromosomes, that is, 22 autosomal chromosomes and the two sex chromosomes (X and Y) that we considered in the test. We did not take into account results corresponding to randomly placed sequences, unplaced sequences (chrUn), and particular regions of chromosome 6 (corresponding to different haplotypes of the major histocompatibility complex region), chromosome 4, and chromosome 17.
Determining the number of ambiguous cases that may require manual inspection (option --dry-run/--choice)
We determined the number of ambiguous cases that could be manually expertized for our file. For all of the considered chromosomes, a total of 12,133 possible choices appeared, which could potentially be investigated (see Additional file 8: Table S5 for the number by chromosomes). This large number indicates that complete manual annotation would be impossible to manage; however, by reducing the analysis to some TE families of interest, it would still be possible.
Running the program with and without the --strict option
We ran our program with and without the --strict option. Table 2 displays the percent coverage for each TE class in each chromosome and the two cases. The average coverage for each TE class without the --strict option was congruent with the admitted TE content in the human genome with 3.23% DNA transposons, 19.85% LINEs, 13.16% SINEs, and 8.73% LTR-retrotransposons, representing a total of 44.98% TEs in the genome .
One original feature of our program is the ability to compute detailed quantitative information chromosome by chromosome, which differs from the output table produced by RepeatMasker. This feature allows us to show that the representation of each TE class differs according to the chromosome. For DNA transposons, chromosomes 3 and 20 displayed the highest proportion of these elements (4.05% and 4.17%, respectively), whereas the Y chromosome is particularly poor in elements of this class with only 0.79%. The X chromosome contains the highest proportion of LINEs and LTR-retrotransposons (33.71% and 11.38%, respectively) with chromosome 22 harboring the lowest proportion of the same elements (10.95% LINEs and 4.64% LTR-retrotransposons). Finally, SINEs are particularly abundant on chromosome 19 (26.98%) and rare on the Y chromosome (4.38%). Globally, the X chromosome has the highest proportion of TEs (58.77%), whereas the Y chromosome has the lowest proportion of TEs (23.84%). This observation is congruent with the discrepancy observed for particular families between the autosomal and sex chromosomes .
We examined the base coverage proportion for the most represented TE families in each chromosome (Figure 6). For each chromosome, the most represented LINEs mainly correspond to L1 and then L2 (Figure 6A). The two most represented SINE families include Alu and MIR (Figure 6B). Among the LTR-retrotransposons, the most represented elements correspond to the MaLR families in all chromosomes except chromosomes 19 and Y in which they correspond to the ERV1 families. The ERVL families correspond to the third most represented LTR-retrotransposons in all chromosomes (Figure 6C). Among the DNA transposons, the TcMar_Tigger families are the most represented in all chromosomes with the exception of chromosomes 1 and 2 in which the hAT_Charlie families are the most abundant.
The same global distributions are observed when using the --strict option, which takes into account elements that follow our 80-80 rule. However, the global amount of each class decreases with an average of 25.48% of the genome (Table 2). The elements following this rule are expected to be well-conserved, suggesting that these elements were potentially active until recently. Indeed, the most represented families correspond to those known to have had a recent activity (Table 3) such as LINE L1 and SINE Alu. Among Alu elements, the most represented families correspond to AluJb, AluSz, AluY, AluSx1, and AluSx, which usually represent more than the half of the total Alu s. However, the most represented LTR-retrotransposons correspond to the ERV1 and MaLR families, and only ERVK elements are supposed to remain active .
We have developed a tool to conveniently parse the classic RepeatMasker .out file to improve the original annotation provided, by including reconstruction of full-length copies. This information includes in particular a measure of the quality of the copies compared to a reference element, as well as the exact position and orientation of each copy and some quantification concerning their proportion in the genome/chromosome sequence, allowing for a fast and accurate assessment of the exact TE content. Additionally, the sequence of each copy with or without flanking sequences can be retrieved directly, allowing further analyses of the TEs. We hope that this tool will assist non-bioinformatics scientists in the more accurate identification of TE copies.
Availability and requirements
Project name: One code to find them all.
Project home: http://doua.prabi.fr/software/one-code-to-find-them-all.
Operating system(s): Linux/Unix, Mac OS X, Windows (with Perl installed).
Programming language: Perl.
License: GNU General Public License.
Long interspersed nuclear element
Long terminal repeat
Next generation sequencing
Short interspersed nuclear element
Target site duplication.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860-921. 10.1038/35057062
de Koning APJ, Gu W, Castoe TA, Batzer MA, Pollock DD: Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet 2011, 7: e1002384. 10.1371/journal.pgen.1002384
Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, et al.: The B73 maize genome: complexity, diversity, and dynamics. Science 2009, 326: 1112-1115. 10.1126/science.1178534
Sun C, Shepard DB, Chong RA, Arriaza JL, Hall K, Castoe TA, Feschotte C, Pollock DD, Mueller RL: LTR retrotransposons contribute to genomic gigantism in plethodontid salamanders. Genome Biol Evol 2012, 4: 168-183. 10.1093/gbe/evr139
Biémont C, Vieira C: Genetics: junk DNA as an evolutionary force. Nature 2006, 443: 521-524. 10.1038/443521a
Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, Paux E, SanMiguel P, Schulman AH: A unified classification system for eukaryotic transposable elements. Nat Rev Genet 2007, 8: 973-982. 10.1038/nrg2165
Kapitonov VV, Jurka J: A universal classification of eukaryotic transposable elements implemented in Repbase. Nat Rev Genet 2008, 9: 411-412. author reply 414 10.1038/nrg2165-c1
Bergman CM, Quesneville H: Discovering and detecting transposable elements in genome sequences. Brief Bioinform 2007, 8: 382-392. 10.1093/bib/bbm048
Saha S, Bridges S, Magbanua Z, Peterson D: Computational approaches and tools used in identification of dispersed repetitive DNA sequences. Trop Plant Biol 2008, 1: 85-96. 10.1007/s12042-007-9007-5
Lerat E: Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity (Edinb) 2010, 104: 520-533. 10.1038/hdy.2009.165
Chaparro C, Sabot F: Methods and software in NGS for TE analysis. Methods Mol Biol 2012, 859: 105-114. 10.1007/978-1-61779-603-6_6
Modolo L, Lerat E: Identification and analysis of transposable elements in genomic sequences. In Genome analysis: Current Procedures and Applications. Edited by: Poptsova M. Norwich: Caister Academic Press; 2013:165-181.
Smit AF, Hubley R, Green P: RepeatMasker Open-3.0. ( ) 1996–2004 http://www.repeatmasker.org () 1996–2004
Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, Gelpke MD, Roach J, Oh T, Ho IY, Wong M, Detter C, Verhoef F, Predki P, Tay A, Lucas S, Richardson P, Smith SF, Clark MS, Edwards YJ, Doggett N, Zharkikh A, Tavtigian SV, Pruss D, Barnstead M, Evans C, et al.: Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 2002, 297: 1301-1310. 10.1126/science.1072104
Juretic N, Bureau TE, Bruskiewich RM: Transposable element annotation of the rice genome. Bioinformatics 2004, 20: 155-160. 10.1093/bioinformatics/bth019
Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer VN, Pollard DA, Sackton TB, Larracuente AM, Singh ND, Abad JP, Abt DN, Adryan B, Aguade M, Akashi H, Anderson WW, Aquadro CF, Ardell DH, Arguello R, Artieri CG, Barbash DA, Barker D, Barsanti P, Batterham P, Batzoglou S, Drosophila 12 Genomes Consortium, et al.: Evolution of genes and genomes on the Drosophila phylogeny. Nature 2007, 450: 203-218. 10.1038/nature06341
Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AFA, Finn RD: Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res 2013, 41: D70-D82. 10.1093/nar/gks1265
Jurka J: Repbase update: a database and an electronic journal of repetitive elements. Trends Genet 2000, 16: 418-420. 10.1016/S0168-9525(00)02093-X
Tempel S: Using and understanding RepeatMasker. Methods Mol Biol 2012, 859: 29-51. 10.1007/978-1-61779-603-6_2
Ma J, Devos KM, Bennetzen JL: Analyses of LTR-retrotransposon structures reveal recent and rapid genomic DNA loss in rice. Genome Res 2004, 14: 860-869. 10.1101/gr.1466204
Szak ST, Pickeral OK, Makalowski W, Boguski MS, Landsman D, Boeke JD: Molecular archeology of L1 insertions in the human genome. Genome Biol 2002, 3: research0052.1-research0052.18.
Pereira V: Insertion bias and purifying selection of retrotransposons in the Arabidopsis thaliana genome. Genome Biol 2004, 5: R79. 10.1186/gb-2004-5-10-r79
Pereira V: Automated paleontology of repetitive DNA with REANNOTATE. BMC Genomics 2008, 9: 614. 10.1186/1471-2164-9-614
Kapitonov VV, Jurka J: Molecular paleontology of transposable elements in the Drosophila melanogaster genome. Proc Natl Acad Sci U S A 2003, 100: 6569-6574. 10.1073/pnas.0732024100
Yang HP, Hung TL, You TL, Yang TH: Genomewide comparative analysis of the highly abundant transposable element DINE-1 suggests a recent transpositional burst in Drosophila yakuba. Genetics 2006, 173: 189-196. 10.1534/genetics.105.051714
Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D: Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol 2005, 1: 166-175.
Smith CD, Shu S, Mungall CJ, Karpen GH: The release 5.1 annotation of Drosophila melanogaster heterochromatin. Science 2007, 316: 1586-1591. 10.1126/science.1139815
Kaminker JS, Bergman CM, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DA, Lewis SE, Rubin GM, Ashburner M, Celniker SE: The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol 2002, 3: research0084-0084.20.
Kvikstad EM, Makova KD: The (r)evolution of SINE versus LINE distributions in primate genomes: sex chromosomes are important. Genome Res 2010, 20: 600-613. 10.1101/gr.099044.109
Mills RE, Bennett EA, Iskow RC, Devine SE: Which transposable elements are active in the human genome? Trends Genet 2007, 23: 183-191. 10.1016/j.tig.2007.02.006
We are particularly thankful to the two reviewers for their very positive feedback and their great help to improve our program. The English of the manuscript has been edited by the American Journal Experts company.
The authors declare that they have no competing interests.
EL conceived the project; MBB implemented the programs; and EL, MBB, and AH tested the programs and wrote the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Table S1: Number of ambiguous cases by chromosome for D. melanogaster (UCSC file). Table containing the number of ambiguous cases obtained for each chromosome for D. melanogaster using the RepeatMasker output file provided by the UCSC website. (PDF 380 KB)
Additional file 2: Table S2: Copy number per chromosome for each category of TEs in D. melanogaster without the --strict option for D. melanogaster (UCSC file). Table containing the copy number per chromosome for each category of TEs in D. melanogaster without the --strict option for D. melanogaster using the RepeatMasker output file provided by the UCSC website. TE, transposable element. (PDF 297 KB)
Additional file 3: Table S3: Percent coverage for all TEs on each chromosome of D. melanogaster without the --strict option. Table containing the percent coverage for all TEs on each chromosome of D. melanogaster without the --strict option for the two RepeatMasker output files (from UCSC and RepeatMasker websites). TE, transposable element. (PDF 382 KB)
Additional file 4: Table S4: Percent coverage for all TEs on each chromosome of D. melanogaster (RepeatMasker version). Table containing the percent coverage for all TE classes on each chromosome of D. melanogaster using the RepeatMasker output file provided by the RepeatMasker website, with and without the --strict option. TE, transposable element. (PDF 384 KB)
Additional file 5: Figure S1: Plot of the divergences according to the size ratio of elements from the Copia subfamily in D. melanogaster. Figure representing the divergence (column %_Div in file *.ltr.csv) of sequences plotted against the size ratio of the copy compared to the reference element (column %_of_Ref in file *.ltr.csv). Each point corresponds to a copy. Copies with a divergence close to 0 and ratio close to 1 correspond to potentially active and full-length copies. As the divergence increases and ratio decreases, corresponding copies are more degraded. (PDF 310 KB)
Additional file 6: Figure S2: Plot of the divergences according to the size ratio of elements from the Gypsy subfamily in D. melanogaster. Figure representing the divergence (column %_Div in file *.ltr.csv) of sequences plotted against the size ratio of the copy compared to the reference element (column %_of_Ref in file *.ltr.csv). Each point corresponds to a copy. Copies with a divergence close to 0 and ratio close to 1 correspond to potentially active and full-length copies. As the divergence increases and ratio decreases, corresponding copies are more degraded. (PDF 376 KB)
Additional file 7: Figure S3: Plot of the divergences according to the size ratio of elements from the BEL/Pao subfamily in D. melanogaster. Figure representing the divergence (column %_Div in file *.ltr.csv) of sequences has been plotted against the size ratio of the copy compared to the reference element (column %_of_Ref in file *.ltr.csv). Each point corresponds to a copy. Copies with a divergence close to 0 and ratio close to 1 correspond to potentially active and full-length copies. As the divergence increases and ratio decreases, corresponding copies are more degraded. (PDF 335 KB)
Additional file 8: Table S5: Number of ambiguous cases by chromosome for H. sapiens. Table containing the number of ambiguous cases by chromosome for H. sapiens.(PDF 294 KB)
Authors’ original submitted files for images
About this article
Cite this article
Bailly-Bechet, M., Haudry, A. & Lerat, E. “One code to find them all”: a perl tool to conveniently parse RepeatMasker output files. Mobile DNA 5, 13 (2014). https://doi.org/10.1186/1759-8753-5-13