Conserved structure and inferred evolutionary history of long terminal repeats (LTRs)
© Benachenhou et al.; licensee BioMed Central Ltd. 2013
Received: 17 July 2012
Accepted: 14 December 2012
Published: 1 February 2013
Long terminal repeats (LTRs, consisting of U3-R-U5 portions) are important elements of retroviruses and related retrotransposons. They are difficult to analyse due to their variability.
The aim was to obtain a more comprehensive view of structure, diversity and phylogeny of LTRs than hitherto possible.
Hidden Markov models (HMM) were created for 11 clades of LTRs belonging to Retroviridae (class III retroviruses), animal Metaviridae (Gypsy/Ty3) elements and plant Pseudoviridae (Copia/Ty1) elements, complementing our work with Orthoretrovirus HMMs. The great variation in LTR length of plant Metaviridae and the few divergent animal Pseudoviridae prevented building HMMs from both of these groups.
Animal Metaviridae LTRs had the same conserved motifs as retroviral LTRs, confirming that the two groups are closely related. The conserved motifs were the short inverted repeats (SIRs), which serve as integrase recognition signals (5′ TGTTRNR…YNYAACA 3′); the polyadenylation signal or AATAAA motif; a GT-rich stretch downstream of the polyadenylation signal; and a less conserved AT-rich stretch corresponding to the core promoter element, the TATA box. Plant Pseudoviridae LTRs differed slightly in having a conserved TATA-box, TATATA, but no conserved polyadenylation signal, plus a much shorter R region.
The sensitivity of the HMMs for detection in genomic sequences was around 50% for most models, at a relatively high specificity, suitable for genome screening.
The HMMs yielded consensus sequences, which were aligned by creating an HMM model (a ‘Superviterbi’ alignment). This yielded a phylogenetic tree that was compared with a Pol-based tree. Both LTR and Pol trees supported monophyly of retroviruses. In both, Pseudoviridae was ancestral to all other LTR retrotransposons. However, the LTR trees showed the chromovirus portion of Metaviridae clustering together with Pseudoviridae, dividing Metaviridae into two portions with distinct phylogeny.
The HMMs clearly demonstrated a unitary conserved structure of LTRs, supporting the view that they arose once during evolution. We attempted to follow the evolution of LTRs by tracing their functional foundations, that is, acquisition of RNAse H, a combined promoter/polyadenylation site, integrase, hairpin priming and the primer binding site (PBS). Available information did not support a simple evolutionary chain of events.
Retroviruses are positive-strand RNA viruses which infect vertebrates [1, 2]. After reverse transcription to a DNA form (a provirus) they can integrate in a host cell chromosome. If this cell belongs to the germ line, integrated proviruses can thereafter be inherited in a Mendelian fashion and thereby become endogenous retroviruses (ERVs). Retroviruses contain at least four protein-coding genes: the gag, pro, pol and env genes. These genes are flanked by two identical direct repeats, the long terminal repeats (LTRs), which contain regulatory elements for proviral integration and transcription as well as retroviral mRNA processing. Retroviruses are here divided into three main groups: class I including Gammaretroviruses and Epsilonretroviruses, class II including Betaretroviruses and Lentiviruses, and class III including Spumaretroviruses [3, 4]. This classification, originally based on human endogenous retrovirus (HERV) studies, can be extended to include all retroviruses (ERVs and exogenous retroviruses (XRVs)). As more genomes are sequenced, it becomes obvious that much of retroviral diversity is not yet covered by existing classifications. However, in the classification of the International Committee on the Taxonomy of Viruses (ICTV), the retroviruses belong to the family Retroviridae with class I and II in the subfamily Orthoretrovirinae and class III mainly in Spumaretrovirinae. Here, we use the ICTV nomenclature together with the older retrotransposon nomenclature.
The genomes of non-vertebrate eukaryotic phyla also harbour retrovirus-like LTR-containing elements called LTR retrotransposons. They fall into three distinct groups: the Pseudoviridae (Copia/Ty1) group, present in plants, fungi and metazoans [8, 9]; the Metaviridae (Gypsy/Ty3), found also in plants, fungi and metazoans [10, 11]; and the Semotivirus (Bel/Pao) group, found exclusively in metazoans. The most diverse group is Metaviridae, which consists of around 10 subgroups. One of them, the chromoviruses, has a wider host range, being found in plants, fungi and vertebrates. Chromoviruses got their name because their pol gene encodes an integrase with a chromodomain (‘chromatin organization modifier domain’), a nucleosome-binding integrase portion which can mediate sequence-specific integration [10, 13–15]. Ty3 of yeast is part of the chromovirus clade even though some members of this clade, including Ty3, do not have a chromodomain in their integrase. Pseudoviridae can be divided into at least six main groups. According to the ICTV classification, Metaviridae contains three genera: the Semotivirus corresponding to Bel/Pao, the Metavirus (represented by Ty3) and Errantivirus (Gypsy). Pseudoviridae is also divided into three genera: the Sirevirus, Hemivirus (Copia) and Pseudovirus (Ty1). The ICTV classification is in need of revision to account for the diversity of LTR retrotransposons. The LTR retrotransposons are important elements of plant genomes. In both maize (Zea mays) and broad bean (Vicia faba), for example, LTR retrotransposons account for more than 50% of the respective genomes.
The relationships of LTR retrotransposons have primarily been studied by constructing phylogenetic trees based on the reverse transcriptase (RT) domain of Pol, the most conserved retroelement domain [16, 17]. According to the RT phylogeny, Pseudoviridae is the ancestral group, and Metaviridae and vertebrate retroviruses are sister groups. Semotivirus, Metaviridae and retroviruses may have arisen from the same ancestor because most of them share the same domain arrangement in Pol, with the integrase (IN) domain coming after RT and RNAse H. In Copia/Ty1 and the rGmr1 member of Metaviridae, IN comes before RT and RNAse H. In spite of Pseudoviridae being ancestral, it has apparently diversified less than Metaviridae. In recent years, however, more Pseudoviridae have been discovered in basal organisms such as diatoms.
In addition, phylogenies of the RNAse H and IN domains of Pol were previously reported . No major disagreement was found among them, indicating that these domains were not exchanged between groups, even though the retroviral RNAse H seems to have been independently acquired .
The evolutionary relationships among different subgroups of Metaviridae remain to be resolved. Even for retroviruses, the relative tree positions of class I and class III retroviruses are uncertain, but they seem to have branched off earlier during evolution than class II retroviruses. This is consistent with the wider distribution of gamma- and epsilonretroviruses, which are highly represented in fish. Epsilon- and gammaretroviruses share several taxonomic traits, and are on the same major branch in a general retroviral tree.
The common structure of retroviral LTRs was recently investigated using Hidden Markov Models (HMMs) . LTRs can be divided into two unique portions (U3 and U5), and a repeated (R) region in between them. R and U5 are generally more conserved than U3. The higher variability of U3 may be due to adaptation to varying tissue environments. In the HMMs, the conservation was highest for the Short Inverted Repeat (SIR) motifs TG… and …CA at both ends of the LTR, plus one to three AT-rich regions providing the LTRs with one or two TATA-boxes and a polyadenylation signal (AATAAA motif). The precise delineation of U3/R/U5 borders depends on sequencing of retrotransposon RNA, critical information that is often missing. Moreover, none, one or several TATA boxes may exist. Initiator (INR) motifs (TCAKTY) may or may not be present. Alternative transcriptional start sites (TSSes) and antisense transcription are also common . Thus, LTR structure and function are complex and often cannot be encapsulated by simple schemes.
Three groups of retroviral LTRs were earlier modeled by means of HMMs in [21, 22]; alignments and phylogenetic trees were generated for the human betaretroviral mouse mammary tumor virus (MMTV)-like (HML), the lentiviral and the gammaretroviral genera. The aim of this study was to extend the analysis to groups of LTRs belonging to Pseudoviridae and Metaviridae making it possible to uncover the putative conserved structure of all major groups of LTRs and to study their phylogeny.
In Benachenhou et al.  and Blikstad et al. , HMMs were used to align and construct phylogenies of LTRs for the HML, the lentiviral and the gammaretroviral genera. The LTR phylogenies were largely congruent with the phylogenies of their RT domains. The HMMs were created by using a set of sequences, which was a representative sample of the family of interest, the so-called training set. A well-known problem in HMM-modelling is that the HMMs become too specialised to the training set. To alleviate this problem one has to regularise the HMMs, which amounts to adding or removing random noise from the data. It turned out that removing random noise produced worse HMMs. It is a common experience in pattern recognition algorithms that adding noise to the training set may diminish the tendency to over-learning and the tendency to lock on to local maxima.
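One way to picture the regularisation step described above is to inject random substitutions into the training sequences before model fitting. The sketch below is only an illustration: how the noise was actually introduced is not specified here, and the function name add_noise, the substitution rate and the two toy sequences are assumptions made for the example.

```python
import random

BASES = "ACGT"

def add_noise(seq, rate, rng):
    """Replace each A/C/G/T with a different base with probability `rate`.

    Hypothetical helper: random substitution is one simple noise model,
    not necessarily the one used for the published HMMs.
    """
    out = []
    for base in seq:
        if base in BASES and rng.random() < rate:
            # Pick one of the three other nucleotides.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)  # fixed seed so the toy run is repeatable
training = ["TGTTAGGAATAAAGTGTCA", "TGTAACAATAAATTGTACA"]  # toy sequences
noisy = [add_noise(s, 0.05, rng) for s in training]
```

A rate of 0 leaves the training set unchanged, so the amount of noise can be tuned against performance on a separate test set.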
A test set containing sequences not present in the training set was then used to evaluate the regularised HMMs. The method was subsequently improved to systematically search for the best phylogenetic tree, that is, the one with the highest mean bootstrap value .
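The selection criterion just described, choosing the tree with the highest mean bootstrap value, can be sketched as follows. The tree names and bootstrap values are hypothetical placeholders rather than results from this study, and the function best_tree is introduced only for illustration.

```python
from statistics import mean

def best_tree(candidates):
    """Return the (name, supports) pair with the highest mean bootstrap support."""
    return max(candidates.items(), key=lambda item: mean(item[1]))

# Hypothetical bootstrap values (percent) for the internal nodes of three trees.
candidates = {
    "tree_a": [55, 72, 90],
    "tree_b": [80, 85, 60],
    "tree_c": [40, 99, 70],
}

name, supports = best_tree(candidates)
print(name, round(mean(supports), 1))
```

Averaging over all internal nodes favours trees that are moderately well supported throughout over trees with a single very strong node.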
Table 1 Description of models (table body not recovered; surviving column heading: Number in training set; surviving row label: Class III endogenous retroviruses)
These clades cover some of the diversity of animal Metaviridae. The alignments generated by the corresponding models were also visually inspected. The six models all had conserved SIRs (TG…CA), except for most LTRs in the Zam clade (which had 5′AGTTA … 3′TAATT or … the imperfect inverted repeat 3′TAACT), and an AATAAA motif.
In the same way, the internal coding sequences from Pseudoviridae fell into two main groups which could be subdivided into five clusters in total (Additional file 1: Table S1). Two clusters generated convergent HMMs: Sire (a Sirevirus) and Retrofit (a Pseudovirus), both in plants . Most of the Sire cluster was used for the Sire HMM whereas a subgroup comprising half of the sequences in the Retrofit cluster was used for the corresponding HMM. Both training sets contained many sequences from Sorghum bicolor (about 60%). The better known Copia sensu stricto, which is a Hemivirus of insects and Ty1, a Pseudovirus in yeast, did not yield convergent models because the sequence sets were highly diverse and/or contained too few LTRs. The two plant LTR models both displayed SIRs and a TATATA motif.
Finally, two retroviral LTR models (HML and gammaretroviruses) were taken from [21, 22] to which a class III retroviral model was added (Table 1). In comparison to Metaviridae it was relatively easy to build HMMs for those retroviral LTRs. Like for Metaviridae, the retroviral LTRs had an AATAAA motif in addition to SIRs.
Table 2 Detection performance of HMMs (table body not recovered; surviving column headings: Detected (n), Missed (n), Additional positive (n))
Previous studies [21, 23] have shown that the HMMs can be used to detect solo LTRs and even detect new groups if they are not too distantly related; for example an HMM trained on HML2-10 can detect 52% of HML1. However, the more general the HMM the less sensitive and specific it becomes. For efficient detection one needs sufficiently specialised HMMs which also implies more of them. The focus of this paper was however to show that it is possible to build HMMs for Metaviridae and Pseudoviridae LTRs. The detection aspect was considered mainly as a way of validating the HMMs. In particular many Metaviridae HMMs in Table 2 had quite poor detection capabilities.
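As a reading aid for the detection counts, sensitivity can be computed from the numbers of detected and missed LTRs. The following is a minimal sketch, assuming sensitivity is defined as detected / (detected + missed). The counts are hypothetical and chosen only to give a value of 52%; how "additional positives" should be classified (novel elements or false positives) is deliberately left out.

```python
def sensitivity(detected, missed):
    """Fraction of known LTRs that the model detected."""
    total = detected + missed
    if total == 0:
        raise ValueError("no known LTRs to evaluate")
    return detected / total

# Hypothetical counts, not taken from Table 2.
value = sensitivity(detected=52, missed=48)
print(f"{value:.0%}")
```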
A major challenge in determining the evolutionary trajectory of LTRs relates to the definition of the three segments U3, R and U5. This is a trivial matter for those elements for which the 5′ terminus and site(s) of polyadenylation of the RNA have been experimentally determined. Regrettably, although such data are available for most retroviruses, for which RNA can readily be extracted in pure form from virions, equivalent data do not exist for the majority of retrotransposons. While it may be possible in some cases to extract such information from high-throughput RNASeq datasets, preliminary studies indicate that the precision of mapping by this method ranges from moderately high (the highly expressed Ty1 in Saccharomyces cerevisiae) to non-existent (the very poorly expressed Ty4 in S. cerevisiae) (Yizhi Cai and JD Boeke, unpublished data). Therefore, the ability to accurately predict such boundaries from primary sequence data combined with sophisticated alignment algorithms is potentially very valuable in understanding LTR structure and as an adjunct to RNASeq analyses.
The conserved elements common to most groups are the TATA box and in some clades TGTAA upstream of the TATA box, the AATAAA motif, the GT-rich area downstream of the polyadenylation site, and the SIRs at both ends of the LTR. The TATA motif is more conserved for the plant retrotransposons than for the metazoan retrotransposons whereas the opposite is true for the AATAAA motif. Although ‘TG’ and ‘CA’ are the most conserved portions of the SIRs, the conservation of the SIRs extends approximately seven bp into the LTR. The SIRs are somewhat longer in Pseudoviridae. The general consensus is TGTTRNR at the 5′end and YNYAACA at the 3′ end, in perfect complementarity. The SIRs bind to the integrase enzyme; therefore their conservation is presumed to reflect the specificity of the bound protein. From previous studies it is known that the integrase binding specificity resides in the terminal eight to fifteen bp , in agreement with the HMM models. The reason for the variation in SIR length is unknown.
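The statement that the two SIR consensus sequences are perfectly complementary can be checked mechanically by taking the reverse complement of the 5′ consensus under the IUPAC ambiguity code. This is a small sketch rather than part of the published method; the dictionary and function names are chosen here for illustration.

```python
# IUPAC nucleotide codes and their complements (R = A/G, Y = C/T, N = any, etc.).
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "K": "M", "M": "K",
    "S": "S", "W": "W", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}

def reverse_complement(motif):
    """Reverse complement of a motif written in IUPAC code."""
    return "".join(IUPAC_COMPLEMENT[c] for c in reversed(motif))

five_prime = "TGTTRNR"
three_prime = "YNYAACA"
print(reverse_complement(five_prime) == three_prime)
```

The same function can be applied to clade-specific motifs to see how far a given pair of LTR ends departs from a perfect inverted repeat.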
The U3 region in the weblogos is proportionally smaller than the true length of U3; this is because its sequence is much less well conserved with few recognizable motifs (excepting the TATA box). The latter is also true for the R region whenever it is long such as in gammaretroviruses, class III endogenous retroviruses/spumaviruses and lentiviruses. This ‘residual’ conservation in the longer R-regions can be linked to stem-loop structures . Stem-loop structures favour conservation in both complementary parts of the stem. The HMMs have proven to be apt for finding conservation in LTRs despite their immense variability in length and conserved elements. As explained in Benachenhou et al. , the X axes in the HMMs are ‘match states’, a conserved subset of the nucleotides in the training LTRs. Less conserved nucleotides (‘insert states’) are not shown in the HMM, but are displayed in a Viterbi alignment of LTRs analysed with the HMMs. Depending on the training parameters, the HMM length is somewhat arbitrary but the conserved motifs in the shorter HMMs are always found in the longer ones. Beyond a certain length, the HMMs merely expand the length of the quasi-random regions in the LTR and thus provide limited additional information. If the HMMs are too short, some conserved motifs can be missed as was observed for class III retroviruses. In contrast, longer HMMs may display all conserved motifs but at the expense of unnecessarily long stretches of quasi-randomness, that is, variable nucleotides artificially elevated to the status of ‘match states’. This is an especially severe problem when modelling long LTRs (>1,000 bp). The subject of building LTR HMMs is further described in Benachenhou et al. . The match and insert states are shown for six HMMs in Additional file 2.
The approximate locations of U3, R and U5 of these Errantivirus elements, belonging to Metaviridae, in Figure 1A were determined using experimental results for the TED element  which is part of the training set. The AATAAA signal is not very clear but a relatively long AT-rich stretch is apparent in R (pos. 92–111).
The U5 region begins with a GT-rich stretch, a probable polyadenylation downstream element. Another conserved AT-rich stretch is found immediately upstream of the Transcriptional Start Site (TSS) and is therefore probably an analogue of a TATA box. The TSS may possibly be part of an INR at pos. 67–72. Its short sequence (TCAT(C or T)T) closely resembles the INR consensus of Drosophila (TCA(G or T)T(T or C)) . The INR element is a core promoter element overlapping the TSS and commonly found in LTRs, which can initiate transcription in the absence of a TATA box [26–28].
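To make the comparison with the Drosophila INR consensus concrete, the observed motif TCAT(C or T)T can be written in IUPAC code as TCATYT and the consensus TCA(G or T)T(T or C) as TCAKTY. The sketch below expands the observed motif into its concrete sequences and tests each against the consensus. The helper names are illustrative assumptions, and only the ambiguity codes needed here are listed.

```python
from itertools import product

# Subset of the IUPAC code sufficient for these motifs.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "N": "ACGT",
}

def expand(motif):
    """All concrete sequences encoded by an IUPAC motif."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in motif))]

def matches(seq, consensus):
    """True if a concrete sequence is compatible with an IUPAC consensus."""
    return len(seq) == len(consensus) and all(
        base in IUPAC[code] for base, code in zip(seq, consensus)
    )

observed = "TCATYT"   # TCAT(C or T)T
consensus = "TCAKTY"  # TCA(G or T)T(T or C)
hits = [s for s in expand(observed) if matches(s, consensus)]
print(hits)
```

Under this strict reading, only the variant with T at the fifth position satisfies the consensus, which is in line with describing the motif as closely resembling, rather than matching, the Drosophila INR.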
Table 3 Integrase recognition motifs (table body not recovered; surviving column headings: 5′ INT motif, 3′ INT motif; surviving row label: Class III endogenous retroviruses)
Integrase recognition motifs (also called att sites) at the 5′ and 3′ ends of LTRs are shown in Table 3. The IUPAC code for nucleic acids is used. The number of inserts is shown between parentheses.
Compared to the other weblogos below, Zam has a less clear AATAAA motif but is otherwise similar to the other weblogos.
This Metaviridae clade (belonging to genus Metavirus) has a clear AATAAA signal (Figure 1B) but no conserved TATA-box. Because of lack of experimental evidence, the division into U3, R and U5 cannot be clearly defined for this clade. The beginning of U5 was chosen to coincide with a G/T-rich stretch, a probable polyadenylation downstream element . The border between U3 and R cannot be located with precision but it should be upstream of the AATAAA signal.
The weblogo of this chromoviral clade (Figure 1C) has a clear AATAAA motif and a conserved AT-rich stretch at pos. 51–57 which could serve as a TATA-containing promoter. Two differences from other retroviruses and most Metaviridae LTR retrotransposons are noticeable. Firstly, the AATAAA motif is significantly closer to the 3′ end of the LTR and secondly, U3 is more T-rich. This last feature is shared by the non-chromoviral rGmr1 LTRs (not shown).
LTRs of Retrofit and Sire, two of the main groups (Pseudovirus and Sirevirus, respectively) of Pseudoviridae, have similar structures and are clearly different from retroviral and Metaviridae LTRs. Retrofit and Sire are shown in Figure 1D and E. The most striking feature is a highly conserved TATATA motif. This motif has previously been found in Bare-1 , Tnt1 , both related to Sire; and another clade of Sireviruses , phylogenetically distinct from the ones used in the present study. The TATATA motif is known to function as a TATA box .
The CAACAAA motif at pos. 120–126 in Sire (Figure 1E) is shared by Tnt1 where it serves as a polyadenylation site [33, 34]. Retrofit has a similar CAA motif at pos. 127–129 (Figure 1D). In Sire, the polyadenylation site is surrounded by T-rich stretches as is typical of plant genomes .
Retrofit (Figure 1D) and Tnt1 completely lack an AATAAA motif, suggesting that the TATATA motif has a dual role both as promoter and poly(A) signal, as has been established previously for the particular case of HML retroviruses (but not for other retroviruses). Plant genomes generally have fewer constraints on the polyadenylation signal than animal genomes; any A-rich motif may do. The same applies to yeast genomes. Sire, however, has an additional A-rich motif immediately following the TATATA motif (Figure 1E). The endpoints of the R region in Sire in Figure 1E were estimated by comparing it with the related Tnt1 [31, 36], whereas the beginning of R in Retrofit could not be located. It is, however, clear that R in both Sire and Retrofit is very short (for Sire 10 bp long) because of the proximity of the TATA box to the polyadenylation signal. This is in contrast to retroviruses, where the size of R varies widely: MMTV (mouse mammary tumour virus) 11 bp; RSV (Rous sarcoma virus) 21 bp; ERV gammaretroviruses 70 bp and lentiviruses 150 bp (calculated from the average length of the corresponding training sets in Benachenhou et al.).
Retrofit has two well-conserved TGTAAC(C)A sequences upstream of the TATATA motif (Figure 1D). Tandem repeats of various sizes are often found in the U3 region of retroviruses [38, 39], where they can play a role in transcription regulation. Such tandem repeats were discovered almost 20 years ago in tobacco Tnt1. A TGTAA motif is also found in a weblogo of Sire with more match states (see discussion of longer HMMs below under Class III retroviruses, and Additional file 2: Figure S1) and in gammaretroviruses (Additional file 2: Figure S2), where it also lies upstream of the TATA box.
Most of the U3 region in Retrofit and Sire consists of a seemingly random region depleted of Cs (Figure 1D and E). This contrasts with the frequent occurrence of conserved cytosines in U3s of class III ERVs, spumaviruses and gammaretroviruses, especially close to the U3/R border (Figure 1F, and Benachenhou et al. ). Finally, the 5′ integrase recognition motifs are very similar in Retrofit, Sire and also in Ty1 from yeast: TGTTARAMNAT(1)AT, TGTTRRN(3)TAA and TGTTGGAATA, respectively, where (1) and (3) are the average lengths of non-conserved insertions (cf. Table 3).
As for animal Metaviridae and other retroviral elements the best conserved motif is the AATAAA motif (Figure 1F). Not apparent in Figure 1F but visible in HMMs with more match states (Additional file 2: Figure S3) is a less-conserved TATA box. The nucleotide composition of the 180 bp region between the probable TATA box and the AATAAA motif is depleted of As; this is also a feature of other retroviruses such as lentiviruses and gammaretroviruses (see Additional file 2: Figure S2 for gammaretroviruses). There are also strong similarities with the Metaviridae element Mag A downstream of the polyadenylation signal (compare Figure 1B and F).
The LTR tree can be compared to a neighbour-joining tree obtained from an alignment, which is a concatenation of the three Pol domains RT, RNAse H and INT (see Figure 2). The alignments are from  and are available at the EMBL online database (accession numbers DS36733, DS36732 and DS36734).
Four LTR groups were apparent: (1) The two Pseudoviridae LTRs Retrofit and Sire; (2) The retroviruses; (3) The Metaviridae LTRs, Zam, Mag C, Mag A and CsRN1; and (4) a more heterogeneous second group of Metaviridae, Sushi and rGmr1. Inspection of the Weblogos gives further support for these groups: Retrofit/Sire, and to a lesser degree Sushi and rGmr1, are different from the other LTRs with respect to conserved motifs and/or nucleotide composition. Note that the retroviruses cluster with the first Metaviridae group although at low support in the larger LTR tree. Most high bootstrap trees tended to give the same topology as the tree shown in Figure 2.
Our LTR structure analysis did not cover all LTR-retrotransposons, either because of LTR length, profound variation or scarcity of sequences in some clades. However, the commonality of structure of those from which we succeeded in building HMMs was striking. It was possible to construct models of LTRs from some groups of LTR retrotransposons and retroviruses, fathoming much of the LTR diversity. This allowed scrutiny of their phylogeny in a rather comprehensive way, and comparison with phylogenies of other retrotransposon genes. The HMMs should be useful for detection of both complete LTR retrotransposons and single LTRs. However, the focus of this study was not on detection per se but rather on assessing conservation. We assessed the possible conservation of structural features of LTRs of LTR retrotransposons from non-vertebrates and vertebrates (mainly retroviruses), in an effort to trace LTR evolution in a broad context of LTR retrotransposon evolution.
In a previous paper we noted a common LTR structure among the orthoretroviruses. The present work shows a unity of LTR structure among a wide variety of LTR retrotransposons. LTRs are complex structures, and have a complex ontogeny. In spite of this they have a unitary structure. This indicates that the basic LTR structure was created once in a prototypic retrotransposon precursor, an argument for LTR monophyly, contrasting with the polyphyletic model of LTR retrotransposon evolution. When LTRs are SuperViterbi aligned, they tend to cluster similarly to other retroviral sequences (RT, gag, PRO and IN). There are, however, notable exceptions, which will be discussed below.
LTR evolution must be seen in the context of evolution of host promoters. For example, the gradual development of epigenetic transcriptional regulation by cytosine methylation may have led to a selection for or against cytosines, involving negative or positive regulatory elements in the expression controlling U3 region. As shown here, class I and III retroviruses are especially rich in conserved cytosines in U3. The evolution of epigenetics will also have influenced the use of retrotransposon integrase chromodomains which bind to posttranslationally modified histones. In Ty3 it recognizes H3 methylated heterochromatin [10, 13–15]. Furthermore, evolution of CpG methylation to silence LTR-driven transcription may have influenced U3 sequence diversity.
A feature of Sire LTRs is that part of the 5′ end of U3 contains inverted repeats, different from SIRs, which together with complementary repeats outside of the LTR, upstream of the polypurine tract (PPT), form a probable stem loop with the PPT exposed in the loop. Such a structure was also found in HIV. A systematic search for such PPT-containing hairpins in other LTR retroelements is warranted. Such a 3′ terminal stem-loop is analogous to the U5-IR loop in the 5′ end of the retroviral genome. Stem loops involving base-pairing between LTR and LTR-adjacent sequences are of interest with respect to both LTR sequence conservation and the origin of LTRs. It was shown that several chromoviruses use a 5′ hairpin structure for priming, instead of a tRNA [44, 45]. Moreover, DIRS RNA was postulated to use stem-loop structures for the same purpose. It is uncertain if the terminal direct and indirect repeats found in Penelope elements, which seem to use target priming [47–49], may have been embryos of present-day LTRs. Neither Penelope nor DIRS elements have a DDE integrase. The presence of this integrase is thus not a prerequisite for their terminal repeats.
When only LTR retrotransposons are compared, LTR and Pol trees are in broad agreement (Figure 2) except that retroviruses cluster with a subset of Metaviridae in the LTR tree. If the LTR tree were an accurate representation of reality, this would imply that Metaviridae is not a homogeneous clade. The occurrence of elements with an inverted order of RT and IN, and the mode of reverse transcriptase priming, support the view that Metaviridae has had a complex evolution. Another aspect is that the number of informative sites of the SuperViterbi alignment is limited, often fewer than 100. It is based on the match states of the constituent HMMs, of which some are almost invariable. Therefore, although the bootstrap support of the LTR-based trees indicated that they were robust, the fidelity of phylogenetic reconstruction from the HMMs must have limitations. Other arguments are:
First, according to the LTR tree, the rGmr1 clade is, together with the Sushi clade, basal to the other Metaviridae clades and retroviruses. The rGmr1 clade is unique among Metaviridae in having the same order between the RT and IN domains as Pseudoviridae. This is consistent with rGmr1 branching off after Pseudoviridae but before the other Metaviridae and retrovirus clades as in the LTR tree (except for Sushi). rGmr1 is most similar to Osvaldo and Ulysses in the Pol trees.
Second, Llorens and colleagues noted a close similarity between class III retroviruses and Errantiviruses (which consist of Zam and Gypsy sensu stricto, see Figure 2) by comparing the gag and pro genes of both groups. Furthermore, Mag and other non-chromoviral clades such as Micropia and Mdg3 of insects, and class II retroviruses (which include HMLs and Lentiviruses) have features in common in their gag and pro genes. Altogether this is consistent with the sister relationship between retroviruses and some non-chromoviral Metaviridae clades.
Third, the weblogos of retroviral LTRs have more in common with some non-chromoviral Metaviridae clades than with Sushi and rGmr1, as noted above for class III retroviruses and Mag A. This is evident in the Gammaretroviral, the Zam and the Mdg1 weblogos with 300 match states (data not shown): They all contain long stretches based on CA or CAA in U3.
Why does the Pol tree of Figure 2 show a monophyletic Metaviridae? It could result from a summative effect of independently evolving RT, RH and IN modules. Alternatively, it could be the result of (artefactual) long-branch attraction between Pseudoviridae and retroviruses, since both have long branches compared to Gypsy/Ty3 in Pol trees (see Figure 2). Long-branch attraction is well known to lead to inaccurate trees (see for example [51, 52] in the context of bird phylogenetics); it occurs when the mutation rate varies extensively between different clades.
The Pol and RT trees (Figures 2 and 3, and Additional file 2: Figure S4) indicate different phylogenies of retrotranscribing elements and viruses. The DNA viruses that do not use LTRs, hepadnaviruses and caulimoviruses, are interspersed among the retrotransposons. This, and the existence of an R-U5 like structure in hepatitis B virus, create difficulties for a simplistic LTR and retrovirus phylogeny. It is not possible to claim monophyly of all retrotranscribing viruses and elements.
In Llorens et al. , the authors proposed ‘the three kings hypothesis’ according to which the three classes of retroviruses originated from three Metaviridae ancestors. Their conclusions were based on Gag phylogenies and sequence elements in other proteins such as the flap motif embedded in the Pro coding region. The divergent results shown in Figures 2, 3 and 4, and Additional file 2: Figure S4, illustrate that when a retroelement is reconstructed results can differ, indicating that polymerase evolution was complex, with instances of rather drastic cross-element and host-element modular transfers. In a similar vein, a network hypothesis of LTR retrotransposon evolution was proposed . However, all previously published Pol phylogenies , as well as phylogenies based on three independent trees of distinct Pol domains, support the monophyly of retroviruses. Our incomplete evidence from the LTR tree also indicates that retroviruses are monophyletic. On the other hand, the tree of Figure 3 indicates that the gamma, epsilon and spumaretroviruses are more related to Metaviridae than the other retroviruses are. More information is needed.
In the broader context of LTR retrotransposons, it is to be expected that different genes yield somewhat different tree topologies and as a consequence there is no single retroelement tree. Indications for a mosaic origin of LTR retroelements are the independent acquisitions of retroviral RNase H  and possibly also of the Pseudoviridae and rGmr1 IN, as suggested by their unique genomic position. The Pseudoviridae IN shares the HHCC and DDE motifs with retroviral and Metaviridae retroelements but also has a unique C terminal motif, the GKGY motif . On the other hand, gammaretroviral and some Metaviridae INs (including chromoviruses) have the GPY/F motif in the IN C terminus . The newly discovered Ginger 1 DNA transposon has a DDE integrase which seems more closely related to certain Metaviridae integrases  than to integrases from other Metaviridae, retroviruses or Pseudoviridae. It also has a GPY/F domain. This can be interpreted as supporting multiple origins for IN in LTR retrotransposons but it could also be due to an exchange in the other direction, that is, from Metaviridae to Ginger 1. It is interesting that Ginger 1 has terminal inverted repeats (TIRs), but not LTRs. Its TIRs begin with the sequence TGTNR which is close to the SIR TGTTRNR found in LTRs. Maybe LTRs arose from such TIRs. As mentioned above, the retroviral Gag is not monophyletic according to Llorens’ Gag phylogeny . Another sign of Gag ancestry is the presence of CCHC zinc fingers in both Errantivirus Gag and capsid proteins of caulimoviruses .
A third explanation for the limited discrepancy between the RT- and LTR-based trees is the occurrence of a recombination event between a retrovirus and a non-chromoviral Metaviridae retrotransposon so that the retroviral LTRs are derived from the latter but the retroviral RT is not.
Based on RT similarity and a gradual acquisition of functionally important structures, we suggest a complex series of events during the evolution of LTR retrotransposons (Figure 3), highlighting the intertwined relation between LTR and non-LTR retrotransposons. A similar tree was earlier presented by . A somewhat different branching order was seen in Additional file 2: Figure S4. These trees contain relatively few branches, and are not intended as ‘final’ phylogenetic reconstructions.
Although the exact sequence of events during retroviral evolution is difficult to reconstruct unambiguously at this stage, several lines of evidence can be drawn from sequence and structural similarities. The starting point of LTR retrotransposon evolution (Figure 4) may have been non-LTR transposons related to LINE and Penelope elements. The latter have terminal repeats, which may have been precursors of LTRs. RH was acquired at least twice. Because of the varying position of integrase relative to reverse transcriptase, several horizontal transfers of integrase, perhaps involving a DNA transposon, are postulated. A hypothetical LTR retrotransposon precursor may have been self-priming, via a 5′ hairpin. A similar mechanism has been proposed for DIRS retrotransposons. Some chromoviruses still use hairpin priming. tRNA priming via the PBS seems to be a rather late event. Judging from the RT-based trees, Pseudoviridae seems to be the oldest LTR retrotransposon group, but the relation between their reverse transcriptases and those of non-LTR retrotransposons like DIRS, and of hepadnaviruses and caulimoviruses, is uncertain. Other events during LTR retrotransposon genesis were acquisition of a capsid and nucleic acid binding protein (‘Gag’), a pepsin-related aspartic protease and a membrane glycoprotein. It is likely that further searches in the rapidly expanding base of host genomic sequences will reveal other retroelement intermediates, which will clarify the complex sequence of events.
The selective pressures acting on the host species set the stage for the evolutionary scenario of retrotransposons. Both Pseudoviridae and Metaviridae are widespread in eukaryotes, while retroviruses are confined to vertebrates. It is likely that retroviral evolution started from a Metaviridae precursor, in an early vertebrate [12, 45].
The existence of an RNase H coding region in the element along with its site of action, the PPT. RNase H was apparently acquired twice during evolution, and from distinct sources, first in LINE elements, and later in retroviruses.
An RNA polymerase II (RNA Pol II)-dependent promoter (which often involves a hairpin structure) in close proximity to a polyadenylation signal.
Presence of an integrase. Perhaps a selection for a new type of integration guidance favoured the acquisition of a DDE integrase, in at least three separate events. Alternatively, since IN has a fold similar to that of RH, it is conceivable that it originally arose as a gene duplication of RH. The DDE integrase of the Ginger DNA transposon is highly similar to that of some gypsy elements. The integrase was taken up in pol, just after the RT-RH sequence. However, a similar but separate acquisition must also have occurred in a precursor of copia and rGmr1 retroelements. In this case, the integrase may have been positioned before RT-RH. The order and direction of these sequence exchanges are uncertain.
The use of tRNA priming through a PBS is probably a relatively late evolutionary event. It is likely that the progenitors of LTR retrotransposons used hairpin priming instead.
LTRs may have arisen from a complex sequence of contributions from several types of retrotranscribing elements and viruses. In addition, specific regulatory motifs probably accumulated in the U3 region in response to adaptive selection to allow tissue-tropic transcription and in response to CpG methylation. The close relationship between packaged (viral) and unpackaged ‘selfish nucleic acid’ based on RNA and DNA during retrotransposon evolution is remarkable. Although difficult to trace, both could have co-existed and exchanged structures during evolution of multicellular organisms.
We have demonstrated that retroviruses and Metaviridae elements share the same conserved motifs but that Pseudoviridae elements differ slightly. Nearly all LTR retrotransposons, including plant Metaviridae and Semotivirus (Bel/Pao), which were not modelled in this study, have conserved SIRs. Some Metaviridae of Drosophila were, however, an exception. All investigated Metaviridae and retroviruses have a well-conserved AATAAA but a less conserved TATA box, whereas the opposite is true for Pseudoviridae (Copia/Ty1) elements of plants, reflecting that the polyadenylation signal is less conserved in plants and demonstrating how well LTRs can mimic the promoters and regulatory elements of their hosts.
Surprisingly, conserved features other than promoter elements and the 5′ SIR are present in U3: closely related LTRs such as Retrofit/Sire or Zam/Mdg1 have the same kind of low complexity regions in U3. The LTR alignments seem to favour paraphyly of Metaviridae and monophyly of retroviruses, agreeing partly with Llorens et al.
As for retroviruses, the HMMs constructed here can also be used for detection of many groups of LTR retrotransposons if they are combined with detection of other motifs as is done by the RetroTector© program [57, 58]. Implementation of large-scale parallel execution of HMM detection is required, because of speed limitations of HMM algorithms.
Reference sequences from Metaviridae (Gypsy/Ty3) and Pseudoviridae (Copia/Ty1) were collected from GenBank, following Llorens et al. In addition, all available Gypsy/Ty3 and Copia/Ty1 sequences were retrieved from RepBase. All class III retroviral sequences were obtained from RepBase.
The internal coding parts of all reference and all RepBase sequences were clustered by means of BLASTP and the CLANS software. A stringent threshold, E values <1E-200, was chosen in order to produce as many groups as possible. This resulted in 14 well-separated clusters for Gypsy/Ty3. The coding sequences of Copia/Ty1 fell into two main groups that could be further subdivided into a total of five groups. For each group the corresponding LTRs were selected. This assumes that LTRs and coding retrotransposon genes have co-evolved, which may often be the case as suggested by Benachenhou et al.
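The effect of a strict E value threshold on grouping can be illustrated with a small sketch. This is not the CLANS procedure, which arranges sequences by a force-directed layout of pairwise BLAST similarities; as a simplifying assumption, the sketch swaps in plain single-linkage grouping with a union-find structure, so that two sequences fall into the same group whenever they are connected by a chain of hits below the threshold. The function name, identifiers and E values are hypothetical, and parsing of BLASTP output is not shown.

```python
def cluster_by_evalue(names, hits, max_evalue=1e-200):
    """Group sequences linked by at least one hit with E value below max_evalue.

    names: iterable of sequence identifiers.
    hits:  iterable of (query, subject, evalue) tuples; every query and
           subject must also appear in names.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue < max_evalue:
            parent[find(query)] = find(subject)

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers and E values, for illustration only.
names = ["el1", "el2", "el3", "el4"]
hits = [("el1", "el2", 0.0), ("el2", "el3", 1e-250), ("el3", "el4", 1e-50)]
print(cluster_by_evalue(names, hits))  # [['el1', 'el2', 'el3'], ['el4']]
```

Lowering `max_evalue` removes weaker links and therefore tends to split the data into more, smaller groups, which is the behaviour the strict threshold was chosen for.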
HMMs were constructed for each LTR group, which was divided into a training set and a test set containing approximately 80% and 20% of the LTRs, respectively. The HMMs were selected based on score with the test set and/or presence of conserved motifs in the corresponding alignments. In some cases it was necessary to subdivide the coding sequence clusters to fulfil our HMM selection criteria. For example, our Zam HMM describes only a subclade of Errantiviruses. The HMMs were used for detection in chromosomes from four different organisms: Drosophila melanogaster, Anopheles gambiae, Danio rerio and Oryza sativa. For comparison, RepeatMasker was run on each chromosome using the RepBase library version 090604.
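The split and the comparison with reference annotations can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the random seed, the function names and the rule that a reference LTR counts as detected when any HMM hit overlaps it are all hypothetical, as are the coordinates in the example.

```python
import random


def split_train_test(ltrs, train_fraction=0.8, seed=0):
    """Shuffle the LTRs reproducibly and split them into training and test sets."""
    shuffled = list(ltrs)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def sensitivity(hmm_hits, reference_ltrs):
    """Fraction of reference LTR intervals overlapped by at least one HMM hit."""
    if not reference_ltrs:
        return 0.0
    found = sum(
        any(overlaps(ref, hit) for hit in hmm_hits) for ref in reference_ltrs
    )
    return found / len(reference_ltrs)


# Ten placeholder LTR names give an 8/2 split.
train, test = split_train_test([f"ltr{i}" for i in range(10)])
print(len(train), len(test))  # 8 2

# Made-up coordinates: two of the four reference LTRs are hit.
reference = [(100, 400), (1000, 1300), (5000, 5300), (9000, 9400)]
hits = [(120, 380), (5100, 5350)]
print(sensitivity(hits, reference))  # 0.5
```

A stricter criterion, for example requiring a minimum overlap length, would lower the measured sensitivity, so the chosen rule should be stated alongside any reported figure.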
The HMM algorithms were implemented in C by Panu Somervuo and FB. The software for detection was parallelised using Message Passing Interface (MPI), and run on a cluster of computers with 22 nodes. By parallelisation the execution times could be reduced to a few hours for a genome size of 70 Mbp, instead of 2 to 3 days. Other software used was ClustalW, Mega version 4.1 for phylogenetic trees, and Bioedit and Weblogo for visualisation of alignments. Phylogenetic trees were either neighbour-joining, maximum likelihood or minimum evolution, with bootstrap values from 1,000, 500 and 1,000 replications, respectively.
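How the detection work might be divided among nodes can be sketched in a few lines. The original C/MPI code is not reproduced here; this is a plain Python illustration of one possible scheme, in which a chromosome is cut into overlapping windows that are dealt out round-robin to workers. The window size, the overlap and the function names are assumptions made for the example.

```python
def make_windows(chrom_length, window, overlap):
    """Return half-open (start, end) windows covering [0, chrom_length).

    Consecutive windows share `overlap` positions, so an LTR lying across a
    window boundary is still fully contained in at least one window,
    provided it is no longer than `overlap`.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    windows = []
    start = 0
    while True:
        end = min(start + window, chrom_length)
        windows.append((start, end))
        if end == chrom_length:
            break
        start = end - overlap
    return windows


def assign_to_workers(windows, n_workers):
    """Deal windows out round-robin, as each rank might take every n-th window."""
    return [windows[rank::n_workers] for rank in range(n_workers)]


# Small made-up example: a 1,000 bp sequence, 400 bp windows, 100 bp overlap.
windows = make_windows(1000, 400, 100)
print(windows)  # [(0, 400), (300, 700), (600, 1000)]
print(assign_to_workers(windows, 2))  # [[(0, 400), (600, 1000)], [(300, 700)]]
```

Because the windows overlap, the same LTR can be reported by two workers, so hits from neighbouring windows would need to be merged after all workers have finished.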
As described under ‘model building’ above, the profile HMM system cannot accommodate large variations in LTR length. It presupposes a certain number of match states. However, as described, we systematically tested many different numbers of match states before settling for an optimal HMM, and therefore this source of bias was minimised.
gag: Group antigen gene, encoding structural proteins
Gag: Group antigen protein
GPY/F: A portion of the integrase C-terminal domain
HIV: Human immunodeficiency virus
HERV: Human endogenous retrovirus
HML: Human MMTV-like sequence
HMM: Hidden Markov model
ICTV: International Committee on Taxonomy of Viruses
INR: Initiator of transcription
LTR: Long terminal repeat
MMTV: Mouse mammary tumour virus
PBS: Primer binding site
R: Repeat portion of LTR
SIR: Short inverted repeat
TIR: Terminal inverted repeat
TSD: Target site duplication
TSS: Transcriptional start site
U3: Unique 3′ LTR portion
U5: Unique 5′ LTR portion
We thank Oscar Eriksson for his invaluable help with computers and software, Hans-Henrik Fuxelius for his useful advice and Aris Katzourakis for valuable initial comments. This work was financially supported by funds given to JB and GA for ERV work from the Swedish Medical Research Council, to JB for bioinformatic development from the Uppsala Academic Hospital, and to EBR and GA from the Swedish University of Agricultural Sciences.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.