Human transposable elements in Repbase: genomic footprints from fish to humans
Mobile DNA volume 9, Article number: 2 (2018)
Repbase is a comprehensive database of eukaryotic transposable elements (TEs) and repeat sequences, containing over 1300 human repeat sequences. Recent analyses of these repeat sequences have accumulated evidence for their contribution to human evolution through becoming functional elements, such as protein-coding regions or binding sites of transcriptional regulators. However, resolving the origins of repeat sequences is a challenge, due to their age, divergence, and degradation. Ancient repeats have been continuously classified as TEs by finding similar TEs from other organisms. Here, the most comprehensive picture of human repeat sequences is presented. The human genome contains traces of 10 clades (L1, CR1, L2, Crack, RTE, RTEX, R4, Vingi, Tx1 and Penelope) of non-long terminal repeat (non-LTR) retrotransposons (long interspersed elements, LINEs), 3 types (SINE1/7SL, SINE2/tRNA, and SINE3/5S) of short interspersed elements (SINEs), 1 composite retrotransposon (SVA) family, 5 classes (ERV1, ERV2, ERV3, Gypsy and DIRS) of LTR retrotransposons, and 12 superfamilies (Crypton, Ginger1, Harbinger, hAT, Helitron, Kolobok, Mariner, Merlin, MuDR, P, piggyBac and Transib) of DNA transposons. These TE footprints demonstrate an evolutionary continuum of the human genome.
Repbase and conserved noncoding elements
Repbase is now one of the most comprehensive databases of eukaryotic transposable elements and repeats. Repbase started with a set of just 53 reference sequences of repeats found in the human genome. As of July 1, 2017, Repbase contains 1355 human repeat sequences. Excluding 68 microsatellite representatives and 83 representative sequences of multicopy genes (72 for RNA genes and 11 for protein genes), over 1200 human repeat sequences are available.
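The "over 1200" figure follows from simple subtraction. A minimal check in Python, using only the counts quoted above (the variable names are illustrative and not part of Repbase):

```python
# Counts quoted in the text for Repbase as of July 1, 2017.
total_human_entries = 1355
microsatellite_entries = 68
rna_gene_entries = 72
protein_gene_entries = 11

# Multicopy gene representatives are the RNA and protein gene entries combined.
multicopy_gene_entries = rna_gene_entries + protein_gene_entries

# Entries left after excluding microsatellites and multicopy genes.
remaining_repeat_entries = (
    total_human_entries - microsatellite_entries - multicopy_gene_entries
)

print(multicopy_gene_entries)    # 83
print(remaining_repeat_entries)  # 1204
```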
The long history of research on human repeat sequences resulted in a complicated nomenclature. Jurka reported the first 6 “medium reiterated frequency repeats” (MER) families (MER1 to MER6). MER1, MER3 and MER5 are currently classified as the hAT superfamily of DNA transposons, and MER2 and MER6 are classified as the Mariner superfamily of DNA transposons. In contrast, MER4 was revealed to be comprised of LTRs of endogenous retroviruses (ERVs). At present, Repbase keeps MER1 to MER136, some of which are further divided into several subfamilies. Based on sequence and structural similarities to transposable elements (TEs) reported from other organisms, other MER families have also been classified as solo-LTRs of ERVs, non-autonomous DNA transposons, short interspersed elements (SINEs), and even fragments of long interspersed elements (LINEs). Problems in classification also appear with recently reported ancient repeat sequences designated as “Eutr” (eutherian transposon), “EUTREP” (eutherian repeat), “UCON” (ultraconserved element), and “Eulor” (euteleostomi conserved low frequency repeat) [4, 5]. In general, the older the repeat is, the harder it is to classify. One reason for this pattern is the inevitable uncertainty of some ancient, highly fragmented repeats at the time of discovery and characterization.
Recent analyses of repeat sequences have accumulated evidence that repeat sequences contributed to human evolution by becoming functional elements, such as protein-coding regions and binding sites for transcriptional regulators [6, 7]. Due to the rapid amplification of nearly identical copies with the potential to be bound by transcriptional regulators, TEs are proposed to rewire regulatory networks [8,9,10].
Another line of evidence for the contribution of TEs comes from conserved noncoding elements (CNEs), which were characterized via the comparison of orthologous loci from diverse vertebrate genomes. CNEs at different loci sometimes show substantial similarity to one another and to some TEs, indicating that at least some of these CNE “families” correspond to ancient families of TEs. Xie et al. reported 96 such CNE families, including those related to MER121, LF-SINE, and AmnSINE1. It was revealed that ancient repeats have been concentrated in regions whose sequences are well conserved. However, resolving the origins of these repeat sequences is a challenge because of their age, divergence and degradation.
This article summarizes our current knowledge about the human repeat sequences that are available in Repbase. The map, showing the positions of repeats in the reference genome, the human genome sequence masked with the human repeat sequences in Repbase, and the copy number and the coverage length of each repeat family are available at http://www.girinst.org/downloads/repeatmaskedgenomes/. It is noteworthy that despite our continuous efforts, most ancient repeat sequences remain unclassified into any group of TEs (Table 1).
Repbase and RepeatMasker
RepeatMasker (http://www.repeatmasker.org/) and Censor are the two most widely used tools for detecting repeat sequences in genomes of interest. These tools use sequence similarity to identify repeat sequences with the use of a prepared repeat library. The repeat library used by RepeatMasker is basically a repacked Repbase that is available at the Genetic Information Research Institute (GIRI) website (http://www.girinst.org/repbase). Censor is provided by GIRI itself and can use the original Repbase. The RepeatMasker edition of Repbase is released irregularly (once a year in the last 5 years), while the original Repbase is updated monthly. However, there are some minor discrepancies between Repbase and the RepeatMasker edition. These differences are caused by independent updates of repeat sequences and their annotations in both databases. These updates are seen especially for human repeats. These discrepancies include different names for the same repeats. For example, MER97B in Repbase is listed as MER97b in the RepeatMasker edition, MER45 in Repbase is found as MER45A in the RepeatMasker edition, and MER61I in Repbase is found as MER61-int in the RepeatMasker edition. In some cases, the corresponding sequences may have less than 90% sequence identity due to independent sequence updates. The MER96B sequences in the two databases are only 89% identical. The consensus sequences of the L1 subfamilies are divided into several pieces (“_5end,” which includes the 5′ UTR and ORF1, “_orf2,” which corresponds to ORF2, and “_3end,” which corresponds to the 3′ UTR) in the RepeatMasker edition to improve the sensitivity of detection.
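When annotations from the two libraries are compared, the naming differences described above can be handled with a small lookup table. The following Python sketch is illustrative only: the three name pairs and the three suffixes come from the examples in the text, while the function names and the test name "L1PA2_3end" are assumptions made for this example and are not part of either tool.

```python
# Name pairs quoted in the text (Repbase name -> RepeatMasker edition name).
# A real reconciliation table would need many more entries.
REPBASE_TO_REPEATMASKER = {
    "MER97B": "MER97b",
    "MER45": "MER45A",
    "MER61I": "MER61-int",
}

# Suffixes the RepeatMasker edition uses for split L1 consensus sequences.
L1_PIECE_SUFFIXES = ("_5end", "_orf2", "_3end")


def repeatmasker_name(repbase_name: str) -> str:
    """Return the RepeatMasker edition name, or the input if no alias is listed."""
    return REPBASE_TO_REPEATMASKER.get(repbase_name, repbase_name)


def strip_l1_piece_suffix(name: str) -> str:
    """Drop a trailing _5end/_orf2/_3end so split pieces group under one name."""
    for suffix in L1_PIECE_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


print(repeatmasker_name("MER61I"))          # MER61-int
print(strip_l1_piece_suffix("L1PA2_3end"))  # L1PA2 (illustrative name)
```

A lookup table is used rather than a case or suffix rule because the quoted examples do not follow a single pattern (a case change, an added letter, and a changed suffix).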
This article does not aim to eliminate such discrepancies. Instead, some consensus sequences that were found only in the RepeatMasker edition previously were added to Repbase. In this article, all sequence entries are based on Repbase, but if those entries have different names in the RepeatMasker edition, these names are also shown in parentheses in the included Tables.
TE classification in Repbase
Eukaryotic transposable elements are classified into two classes: Class I and Class II. Class I is comprised of retrotransposons, which transpose through an RNA intermediate. Class II is comprised of DNA transposons, which do not use RNA as a transposition intermediate. In other words, Class I includes all transposons that encode reverse transcriptase and their non-autonomous derivatives, while Class II includes all other autonomous transposons that lack reverse transcriptase and their non-autonomous derivatives. Another important piece of information is that the genomes of prokaryotes (bacteria and archaea) do not contain any retrotransposons.
Repbase currently classifies eukaryotic TEs into three groups: Non-LTR retrotransposons, LTR retrotransposons and DNA transposons (Table 2). Non-LTR retrotransposons and LTR retrotransposons are the members of Class I TEs. To simplify the classification, some newly described groups are placed in these three groups. The “Non-LTR retrotransposons” include canonical non-LTR retrotransposons that encode apurinic-like endonuclease (APE) and/or restriction-like endonuclease (RLE), as well as Penelope-like elements (PLE) that encode or do not encode the GIY-YIG nuclease. These non-LTR retrotransposons share a transposition mechanism called “target-primed reverse transcription (TPRT),” in which the 3′ DNA end cleaved by the nuclease is used as a primer for reverse transcription catalyzed by the retrotransposon-encoding reverse transcriptase (RT). Non-LTR retrotransposons are classified into 32 clades. Short interspersed elements (SINEs) are classified as a group of non-LTR retrotransposons in Repbase. SINEs are composite non-autonomous retrotransposons that depend on autonomous non-LTR retrotransposons for mobilization [15, 16]. SINEs are classified into four groups based on the origins of their 5′ regions.
LTR retrotransposons are classified into five superfamilies (Copia, Gypsy, BEL, DIRS and endogenous retrovirus (ERV)), and the ERV superfamily is further subdivided into five groups (ERV1, ERV2, ERV3, ERV4 and endogenous lentivirus). Except for the DIRS retrotransposons, these LTR retrotransposons encode DDE-transposase/integrase for the integration of cDNA, which is synthesized in the cytoplasm by the retrotransposon-encoding RT. The RT encoded by LTR retrotransposons uses tRNA as a primer for reverse transcription. The DDE-transposase/integrase of LTR retrotransposons resembles the DDE-transposase seen in DNA transposons, especially IS3, IS481, Ginger1, Ginger2, and Polinton. DIRS retrotransposons, on the other hand, encode a tyrosine recombinase (YR), which is related to the YRs encoded by Crypton DNA transposons.
DNA transposons include very diverse groups of TEs. Repbase currently uses 23 superfamilies for the classification of DNA transposons. Most TE superfamilies encode DDE transposase/integrase, but Crypton and Helitron encode the YR and HUH nucleases, respectively [21, 22]. Polinton encodes a DDE transposase that is very closely related to those of the LTR retrotransposons, Ginger1, and Ginger2, but Polinton is an extremely long TE encoding DNA polymerase B and some structural proteins [18, 23]. Polinton was recently reported as an integrated virus designated Polintovirus, based on the identification of the coding regions for the minor and the major capsid proteins.
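The grouping described in this section can be summarized as a small lookup structure. The Python sketch below encodes only what is stated above (the three Repbase groups, the five LTR retrotransposon superfamilies, the five ERV groups, and the integration enzyme of each LTR superfamily). The identifier names are assumptions made for illustration and do not correspond to any Repbase file format or API.

```python
# Repbase groups mapped to the two-class scheme described in the text.
REPBASE_GROUP_TO_CLASS = {
    "Non-LTR retrotransposon": "Class I",
    "LTR retrotransposon": "Class I",
    "DNA transposon": "Class II",
}

# LTR retrotransposon superfamilies and ERV groups as listed in the text.
LTR_SUPERFAMILIES = ("Copia", "Gypsy", "BEL", "DIRS", "ERV")
ERV_GROUPS = ("ERV1", "ERV2", "ERV3", "ERV4", "endogenous lentivirus")


def ltr_integration_enzyme(superfamily: str) -> str:
    """Return the integration enzyme the text attributes to an LTR superfamily."""
    if superfamily not in LTR_SUPERFAMILIES:
        raise ValueError(f"not an LTR retrotransposon superfamily: {superfamily}")
    # DIRS is the stated exception: it encodes a tyrosine recombinase.
    if superfamily == "DIRS":
        return "tyrosine recombinase (YR)"
    return "DDE-transposase/integrase"


print(ltr_integration_enzyme("Gypsy"))  # DDE-transposase/integrase
print(ltr_integration_enzyme("DIRS"))   # tyrosine recombinase (YR)
```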
Only three groups of non-LTR retrotransposons are active in the human genome: L1 (long interspersed element-1 (LINE-1)), Alu and SVA (SINE-R/VNTR/Alu). Thanks to their recent activity, these retrotransposons can be classified into many subfamilies based on sequence differences (Table 3). The classification and evolution of these groups are well described in several articles [25,26,27,28]; thus, these three groups are introduced briefly here.
L1 is the only active autonomous non-LTR retrotransposon in the human genome. L1 encodes two proteins called ORF1p and ORF2p. ORF1p is the structural protein, corresponding to Gag proteins in LTR retrotransposons and retroviruses. ORF2p includes domains for endonuclease and reverse transcriptase, as well as a DNA-binding CCHC zinc-finger motif. L1 mobilizes not only its own RNA but also other RNAs that contain 3′ polyA tails. Thus, the presence of L1 corresponds to an abundance of processed pseudogenes, which are also called retrocopies or retropseudogenes. Alu and SVA transpose in a manner dependent on the L1 transposition machinery [15, 30, 31]. L1 is present in most mammals, but some mammals, such as megabats, have lost L1 activity.
Based on their age and distribution, L1 lineages are classified as L1P (primate-specific) and L1M (mammalian-wide). These groups are further sub-classified into various subfamilies (Table 3). L1PA1 (L1 and L1HS in Repbase correspond to this subfamily) is the only active L1 subfamily in the human genome. During the evolution of L1, the 5′ and 3′ untranslated regions (UTRs) were replaced by unrelated sequences. These replacements sometimes saved L1 from restriction by KRAB-zinc finger proteins.
The majority of Alu is composed of a dimer of 7SL RNA-derived sequences. Dimeric Alu copies in the human genome are classified into three lineages: AluJ, AluS and AluY, among which AluY is the youngest lineage. Older than AluJ are monomeric Alu families, which can be classified into 4 subfamilies: FAM, FLAM-A, FLAM-C and FRAM. FLAM-A is very similar to PB1 from rodents; thus, Repbase does not include FLAM-A. FLAM in Repbase corresponds to FLAM-C. 7SL RNA-derived SINEs are called SINE1. SINE1 has been found only in euarchontoglires (also called supraprimates), which is a mammalian clade that includes primates, tree shrews, flying lemurs, rodents, and lagomorphs. The close similarity between FLAM-A and PB1 indicates their activity in the common ancestor of euarchontoglires, and the lack of SINE1 outside of euarchontoglires indicates that SINE1 evolved in the common ancestor of euarchontoglires after their divergence from laurasiatherians. In rodents, no dimeric Alu has evolved. Instead, B1, which is another type of derivative of PB1, has accumulated. The genomes of tree shrews contain composite SINEs that originated from the fusion of tRNA and 7SL RNA-derived sequences.
Several Alu subfamilies are transposition-competent. The two dominant Alu subfamilies that show polymorphic distributions in the human population are AluYa5 and AluYb8. AluYa5 and AluYb8 correspond to approximately one-half and one-quarter of human Alu polymorphic insertions, respectively. AluYa5 and AluYb8 have accumulated 5 and 8 nucleotide substitutions, respectively, from their ancestral AluY, which remains active and occupies ~15% of the polymorphic insertions. Until recently, all active Alu elements were believed to be AluY or its descendants. However, a recent study revealed that some AluS insertions are polymorphic in the human population, indicating that some AluS copies are or were transposition-competent. Monomeric Alu families are older than dimeric Alu families, but monomeric Alu families also show species-specific distributions in the great apes. Monomeric Alu insertions have been generated via two mechanisms. One mechanism is recombination between two polyA tracts to remove the right monomer of dimeric Alu, and the other mechanism is the transposition of a monomeric Alu copy. BC200, which is a domesticated Alu copy, is the main contributor to the latter mechanism, but at least one other monomeric Alu copy also contributed to the generation of new monomeric Alu insertions.
SVA is a composite retrotransposon family, whose mobilization depends on L1 protein activity [30, 31]. Two parts of SVA originated from Alu and HERVK10, which is consistent with the younger age of SVA than Alu and HERVK10. The other parts of SVA are tandem repeat sequences: (CCCTCT) hexamer repeats at the 5′ terminus and a variable number of tandem repeats (VNTR) composed of copies of a 35–50 bp sequence between the Alu-derived region and the HERVK10-derived region. SVA is found only in humans and apes. Gibbons have three sister lineages of SVA, which are called LAVA (L1-Alu-VNTR-Alu), PVA (PTGR2-VNTR-Alu) and FVA (FRAM-VNTR-Alu) [44, 45]. These three families share the VNTR region and the Alu-derived region but exhibit different compositions.
SVA in hominids (humans and great apes) is classified into 6 lineages (SVA_A to SVA_F), and SVA_F is the youngest lineage. The three youngest subfamilies, SVA_F, SVA_E and SVA_D, contribute to all known polymorphic SVA insertions in the human genome. Recently, another human-specific SVA subfamily was found, and this subfamily has recruited the first exon of the microtubule-associated serine/threonine kinase 2 (MAST2) gene [46,47,48]. The master copy of this human-specific subfamily is presumed to be inserted in an intron of the MAST2 gene and is transcribed in a manner dependent on MAST2 expression in some human individuals, although it is not present in the human reference genome. An SVA_A-related subfamily was recently found in the Northern white-cheeked gibbon (Nomascus leucogenys) and was designated as SVA NLE.
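The (CCCTCT) hexamer repeat at the 5′ terminus of SVA is simple enough to detect with a regular expression. The Python sketch below counts exact, uninterrupted copies of the hexamer at the start of a sequence. It is a toy illustration under strong assumptions: the function name and test sequences are invented, and real SVA copies may be 5′ truncated, reverse-complemented, or carry imperfect repeats, none of which this sketch handles.

```python
import re

# Hexamer unit quoted in the text for the 5' terminus of SVA.
HEXAMER = "CCCTCT"

# One or more exact copies of the hexamer anchored at the sequence start.
_LEADING_HEXAMERS = re.compile(f"(?:{HEXAMER})+")


def count_leading_hexamers(sequence: str) -> int:
    """Count exact, consecutive CCCTCT units at the 5' end of a sequence."""
    match = _LEADING_HEXAMERS.match(sequence.upper())
    if match is None:
        return 0
    return len(match.group(0)) // len(HEXAMER)


# Invented toy sequences, not taken from any SVA consensus.
print(count_leading_hexamers("CCCTCTCCCTCTCCCTCTGGAGC"))  # 3
print(count_leading_hexamers("GGACCCTCT"))                # 0
```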
In addition to the sequences described above, the human genome contains many signs of the ancient activity of non-LTR retrotransposons belonging to L2, CR1, Crack, RTE, RTEX, R4, Vingi, Tx1 and Penelope (Table 3). With the rapid increase of information about repeats in other vertebrate genomes, TEs from other vertebrates occasionally provide clues about the origin of human repeat sequences. One recently classified example is UCON82, which exhibits similarity to the 3′ tails of vertebrate RTE elements from coelacanth (RTE-2_LCh), crocodilians (RTE-2_Croc) and turtle (RTE-30_CPB) (Fig. 1a). The characterization of L2-3_AMi from the American alligator Alligator mississippiensis revealed the L2 non-LTR retrotransposon-like sequence signatures in UCON49 and UCON86.
These groups of non-LTR retrotransposons are also found in several mammals or amniotes, supporting their past activity. L2 is the dominant family of non-LTR retrotransposons in the platypus genome. The diversification of CR1 is a trademark of bird genomes. Active RTE was found in various mammals and reptiles and is represented by Bov-B from bovines [51, 52]. L4 and L5 were originally classified as RTE, but the reanalysis revealed that these sequences are more closely related to RTEX. Non-LTR retrotransposons belonging to the R4 clade were reported in the anolis lizard. Vingi was reported in hedgehogs and reptiles. Some sequence-specific non-LTR retrotransposons belonging to Tx1 are reported in crocodilians. Crack and Penelope have not been reported in any amniotes. On the other hand, R2, which is a non-LTR retrotransposon lineage that is distributed widely among animals, is not found in any mammalian genomes.
The human genome also contains many ancient SINE insertions, such as MIRs or DeuSINEs [56,57,58]. It is known that MIRs exhibit sequence similarity to L2 in their 3′ regions, indicating that MIRs were transposed in a manner dependent on the transposition machinery of L2. MER131 is considered to be a SINE because it ends with a polyA tail. As shown in many reports [6, 59], some of these insertions have been exapted to function as promoters, enhancers or other non-coding functional DNA elements.
The group of LTR retrotransposons in the human genome is primarily endogenous retroviruses (ERVs) (Table 4). ERV1, ERV2 and ERV3 are all found in the human genome, but the recently recognized ERV4 has not been detected. Neither the endogenous lentivirus nor the endogenous foamy virus (Spumavirus) was found. Some traces of Gypsy LTR retrotransposons have also been found, and this finding is consistent with the domesticated Gypsy (Sushi) sequences in peg10 and related genes. There are no traces of the Copia, BEL or DIRS retrotransposons in the human genome, except for the two genes encoding DIRS-derived protein domains: Lamin-associated protein 2 alpha isoform (LAP2alpha) and Zinc finger protein 451 (ZNF451). BEL and DIRS are found in the anolis lizard genome but have not been detected in bird genomes. Mammalian genomes contain only a small fraction of Gypsy LTR retrotransposons, and it is speculated that during the early stage of mammalian evolution, LTR retrotransposons lost their competition with retroviruses.
Historically, human ERVs have been designated with “HERV” plus one capital letter, such as K, L or S. Difficulty in classifying ERV sequences is caused by (1) the loss of internal sequences via the recombination of two LTRs and (2) the high level of recombination between different families. Different levels of sequence conservation between LTRs and the internal portions between LTRs increase this complexity. Recently, Vargiu et al. systematically analyzed and classified HERVs into 39 groups. Here, the relationship between the classification reported by Vargiu et al. and the consensus sequences in Repbase is shown (Table 4). Unfortunately, it is impossible to determine all LTRs or internal sequences in Repbase using the classification system reported by Vargiu et al. Thus, in this review, 22 higher classification ranks in Vargiu et al. are used, and many solo-LTRs are classified as the ERV1, ERV2, ERV3 and Gypsy superfamilies. The numbers of copies for each ERV family in the human genome are available elsewhere, such as dbHERV-REs (http://herv-tfbs.com/), and thus, the abundance or the phylogenetic distribution of each family is not discussed in this review.
ERV1 corresponds to Gammaretroviruses and Epsilonretroviruses. In the classification scheme outlined by Vargiu et al., only HEPSI belongs to Epsilonretrovirus. In addition, one subgroup of HEPSI, HEPSI2, may represent an independent branch from other HEPSIs and may be related to the retrovirus-derived bird gene Ovex1. Endogenous retroviruses related to Ovex1 were found in crocodilians. Several MER families and LTR families (MER31A, MER31B, MER49, MER65, MER66 (MER66A, MER66B, MER66C, MER66D and MER66_I linked with MER66C), MER87, MER87B, HERV23, LTR23, LTR37A, LTR37B, and LTR39) are reported to be related to MER4 (MER4 group).
ERV2 was classified into 10 subgroups by Vargiu et al. All of these subgroups belong to the lineage Betaretrovirus. No ERV2 elements closely related to Alpharetrovirus were detected. HERVK is the only lineage of ERVs that has continued to replicate within humans in the past few million years, and this lineage exhibits polymorphic insertions in the human population.
ERV3 was historically considered to be the endogenous version of Spumavirus (foamy virus); however, the recent identification of true endogenous foamy viruses (SloEFV from sloth, CoeEFV from coelacanth and ERV1-2_DR from zebrafish) revealed that ERV3 and Spumavirus are independent lineages [1, 68, 69]. The ERVL lineage of the ERV3 families encodes a dUTPase domain, while the ERVS lineage lacks dUTPase. The distribution of ERVL- and ERVS-like ERVs in amniotes indicates that at least two lineages of ERV3 have evolved in mammalian genomes.
There are many recombinants between different ERV families. HARLEQUIN is a complex recombinant whose structure can be expressed as LTR2-HERVE-MER57I-LTR8-MER4I-HERVI-HERVE-LTR2. HERVE, HERVIP10F, and HERV9 are the closest in sequence to HARLEQUIN, indicating that these three ERV1 families are the components that construct HARLEQUIN-type recombinant ERVs. HERVE, HERVIP10 and HERV9 are classified as HERVERI, HERVIPADP and HERVW9, respectively, in Vargiu et al. Recombinants between different families or lineages make the classification very difficult. The extremes of recombination are the recombinants between two ERVs belonging to ERV1 and ERV3. Such recombination generates ERV1-like envelope protein-encoding ERV3 families, although most mammalian ERV3 families lack envelope protein genes. HERV18 (HERVS) and the related HERVL32 and HERVL66 are such recombinants.
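The hyphen-separated notation used above for HARLEQUIN can be read as an ordered list of segments. The short Python sketch below splits the quoted string and reports which segment names occur more than once. It only parses the notation as written in this section (the variable names are illustrative) and says nothing about segment lengths or boundaries in actual genomic copies.

```python
from collections import Counter

# Structure string quoted in the text.
HARLEQUIN_STRUCTURE = "LTR2-HERVE-MER57I-LTR8-MER4I-HERVI-HERVE-LTR2"

# Ordered segments, 5' to 3' as written.
segments = HARLEQUIN_STRUCTURE.split("-")

# The first and last segments flank the internal portion.
flanking_segments = (segments[0], segments[-1])
internal_segments = segments[1:-1]

# Segment names that appear more than once in the structure.
segment_counts = Counter(segments)
repeated_segments = sorted(
    name for name, count in segment_counts.items() if count > 1
)

print(len(segments))       # 8
print(flanking_segments)   # ('LTR2', 'LTR2')
print(repeated_segments)   # ['HERVE', 'LTR2']
```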
As shown by Pace and Feschotte, no families of DNA transposons are currently active in the human genome. During the history of human evolution, two superfamilies of DNA transposons, hAT and Mariner, have constituted a large fraction of the human genome (Table 5). Autonomous hAT families are designated as Blackjack, Charlie, Cheshire, MER69C (Arthur) and Zaphod. Many MER families are now classified as non-autonomous hAT transposons. The Mariner DNA transposons that contain at least a portion of a protein coding region are Golem (Tigger3), HsMar, HSTC2, Kanga, Tigger, and Zombi (Tigger4). Some recently characterized repeat sequence families designated with UCON or X_DNA have also been revealed to be non-autonomous members of hAT or Mariner. For example, the alignment with Mariner-N12_Crp from the crocodile Crocodylus porosus revealed that UCON39 is a non-autonomous Mariner family and the first two nucleotides (TA) in the original consensus of UCON39 are actually a TSD (Fig. 1b). The characterization of hAT-15_CPB from the western painted turtle Chrysemys picta bellii led to the classification of Eutr7 and Eutr8 as hAT DNA transposons because those sequences exhibit similarity in the termini of hAT-15_CPB. Sequence similarity and age distribution reveal that autonomous DNA transposon families have counterparts: non-autonomous derivative families. MER30, MER30B and MER107 are the derivatives of Charlie12. MER1A and MER1B originated from CHARLIE3. TIGGER7 is responsible for the mobilization of its non-autonomous derivatives, MER44A, MER44B, MER44C and MER44D.
In addition to these two dominant superfamilies, small fractions of human repeats are classified into other DNA transposon superfamilies (Table 5). These repeats are Crypton (Eulor5A, Eulor5B, Eulor6A, Eulor6B, Eulor6C, Eulor6D and Eulor6E), Helitron (Helitron1Nb_Mam and Helitron3Na_Mam), Kolobok (UCON29), Merlin (Merlin1-HS), MuDR (Ricksha), and piggyBac (Looper, MER75 and MER85). A striking sequence similarity was found between Crypton elements from salmon (Crypton-N1_SSa and CryptonA-N2_SSa) and Eulor5A/B and Eulor6A/B/C/D/E, especially at the termini (Fig. 1c). They are the first Eulor families classified into a specific family of TEs and also the first finding of traces of Cryptons in the human genome, except for the 6 genes derived from Cryptons.
Like Crypton-derived genes, some human genes exhibit sequence similarity to DNA transposons, which have not been characterized in the human genome. The identification of these “domesticated” genes reveals that some DNA transposons inhabited the human genome in the past. Ancient Transib was likely the origin of the rag1 and rag2 genes that are responsible for V(D)J recombination [72,73,74]. THAP9 has a transposase signature from a P element and retains transposase activity. harbi1 is a domesticated Harbinger gene. rag1, rag2 and harbi1 are conserved in all jawed vertebrates. Gin-1 and gin-2 show similarity to Gypsy LTR retrotransposons, as well as Ginger2 DNA transposons, but are the most similar to some Ginger1 DNA transposons from Hydra magnipapillata. Therefore, although the traces of 4 superfamilies of DNA transposons (Transib, P, Harbinger, and Ginger1) have not been found as repetitive sequences in the human genome, they have contributed to human genome evolution by serving as protein-coding sequences.
Genomic traces of human evolution
Several families of TEs are still active in the human population. L1PA1, SVA and several AluY subfamilies show polymorphism in the human population, indicating their recent activity [40, 77]. Another type of evidence for the current activity of these TEs is the somatic insertions seen in brains and cancer cells [78, 79]. HERVK is the only lineage of ERVs exhibiting polymorphic insertions in the human population.
On the other hand, human repeats have accumulated during the whole history of human evolution. These repeats are certainly not restricted to the human genome but are shared with the genomes of many other mammals, amniotes, and vertebrates. Almost all TE families are shared between humans and chimpanzees. An exception is the endogenous retrovirus family PtERV1, which is present in the genomes of chimpanzees and gorillas but not humans. The human TRIM5alpha can prevent infection by PtERV1, and this may be the reason why PtERV1 is absent from the human genome. Sometimes, TE families that ceased transposition long ago in the human lineage have remained active in another lineage. The Crypton superfamily of DNA transposons was active in the common ancestor of jawed vertebrates, judging from the distribution of orthologous Crypton-derived genes. Eulor5A/B and Eulor6A/B/C/D/E are shared among euteleostomi, from mammals to teleost fishes, and show similarity to two non-autonomous Crypton DNA transposons from salmon (Fig. 1c). Copies of Crypton-N1_SSa are over 94% identical to their consensus sequence, and copies of CryptonA-N2_SSa are around 90% identical to their consensus sequence. The autonomous counterparts of these two salmon Crypton DNA transposons may be the direct descendants of the ancient Crypton DNA transposon that gave birth to Eulor5A/B and Eulor6A/B/C/D/E. UCON39 is conserved among mammals and shows similarity to the crocodilian DNA transposon family Mariner-N12_Crp (Fig. 1b). The distribution of these two families indicates that they are sister lineages sharing a common ancestor. Copies of Mariner-N12_Crp are only around 82% identical to their consensus. Considering the low substitution rate in the crocodilian lineage, Mariner-N12_Crp also ceased to transpose a very long time ago. These examples clarify the contribution of TEs to the human genome components. They also highlight the importance of characterizing TE sequences from non-human animals in understanding human genome evolution.
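The age argument above rests on the percent identity between individual copies and their family consensus: the lower the identity, the longer ago the copies were presumably inserted. The Python sketch below shows one simple way to compute such a value from two already-aligned sequences of equal length. It is an illustration, not the procedure used to obtain the figures quoted above: the function name and toy sequences are invented, and the choice to count a gap opposite a base as a mismatch is an assumption, since identity definitions differ between tools.

```python
def percent_identity(aligned_copy: str, aligned_consensus: str) -> float:
    """Percent of compared alignment columns where copy and consensus agree.

    Both inputs must come from the same pairwise alignment ('-' marks a gap).
    Columns that are gaps in both sequences are skipped; a gap opposite a
    base counts as a mismatch.
    """
    if len(aligned_copy) != len(aligned_consensus):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_copy.upper(), aligned_consensus.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * matches / compared


# Invented toy alignment: 9 of 10 columns agree.
print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))  # 90.0
```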
As represented by names such as EUTREP (eutherian repeat) or Eulor (euteleostomi conserved low frequency repeat), different repeat families are shared at different levels of vertebrate groups. Jurka et al. reported 136 human repeat families that are not present in the chicken genome and 130 human repeat sequences that are also present in the chicken genome. These two sets of families likely represent ancient TE families that expanded in the common ancestor of mammals and ancient TE families that expanded in the common ancestor of amniotes, respectively. Based on the carrier subpopulation (CASP) hypothesis we proposed, these TE insertions were fixed by genetic drift after population subdivision. These insertions may have resulted in reduced fitness of the host organism, but they can allow the organism to escape from evolutionary stasis. Once TE insertions were fixed, mutations should have accumulated to increase fitness. Fitness usually increases through the elimination of TE activity and the removal of TE insertions. However, some TE insertions have acquired functions beneficial to the host. Indeed, ancient repeats have been concentrated in regions whose sequences are well conserved. They are expected to have been exapted to have biological functions as enhancers, promoters, or insulators.
More direct evidence for the ancient transposition of TEs is seen in domesticated genes. rag1, rag2, harbi1, and pgbd5 (piggyBac-derived gene 5) are conserved in jawed vertebrates. The most ancient genes known to have originated from a TE superfamily are the Crypton-derived woc/zmym genes. Four genes, zmym2, zmym3, zmym4 and qrich1, were duplicated by two rounds of whole genome duplication in the common ancestor of vertebrates and represent the orthologs of woc distributed in bilaterian animals. Unfortunately, this level of conservation is unlikely to be present in non-coding sequences derived from TEs; however, over 6500 sequences are reported to be conserved among chordates, hemichordates and echinoderms. Researchers are more likely to find traces of ancient TEs when analyzing slowly evolving genomes, such as crocodilians.
Nearly all repeat sequences in the human genome have likely been detected. The current challenge is the characterization of these repeat sequences and their evolutionary history. This characterization is one objective of the continuous expansion of Repbase. Repbase will continue to collect repeat sequences from various eukaryotic genomes, which will help to uncover the evolutionary history of the human genome.
CNE: Conserved noncoding element
Eulor: Euteleostomi conserved low frequency repeat
LINE: Long interspersed element
LTR: Long terminal repeat
MAST2: Microtubule-associated serine/threonine kinase 2
MER: Medium reiterated frequency repeats
ORF: Open reading frame
SINE: Short interspersed element
TPRT: Target-primed reverse transcription
VNTR: Variable number of tandem repeats
Bao W, Kojima KK, Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6:11.
Jurka J, Walichiewicz J, Milosavljevic A. Prototypic sequences for human repetitive DNA. J Mol Evol. 1992;35(4):286–91.
Jurka J. Novel families of interspersed repetitive elements from the human genome. Nucleic Acids Res. 1990;18(1):137–41.
Gentles AJ, Wakefield MJ, Kohany O, Gu W, Batzer MA, Pollock DD, Jurka J. Evolutionary dynamics of transposable elements in the short-tailed opossum Monodelphis domestica. Genome Res. 2007;17(7):992–1004.
Jurka J, Bao W, Kojima KK, Kohany O, Yurka MG. Distinct groups of repetitive families preserved in mammals correspond to different periods of regulatory innovations in vertebrates. Biol Direct. 2012;7:36.
Bejerano G, Lowe CB, Ahituv N, King B, Siepel A, Salama SR, Rubin EM, Kent WJ, Haussler D. A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature. 2006;441(7089):87–90.
Sorek R, Ast G, Graur D. Alu-containing exons are alternatively spliced. Genome Res. 2002;12(7):1060–7.
Chuong EB, Elde NC, Feschotte C. Regulatory evolution of innate immunity through co-option of endogenous retroviruses. Science. 2016;351(6277):1083–7.
Lynch VJ, Leclerc RD, May G, Wagner GP. Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals. Nat Genet. 2011;43(11):1154–9.
Kunarso G, Chia NY, Jeyakani J, Hwang C, Lu X, Chan YS, Ng HH, Bourque G. Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nat Genet. 2010;42(7):631–4.
Xie X, Kamal M, Lander ES. A family of conserved noncoding elements derived from an ancient transposable element. Proc Natl Acad Sci U S A. 2006;103(31):11659–64.
Kohany O, Gentles AJ, Hankus L, Jurka J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and censor. BMC Bioinformatics. 2006;7:474.
Kapitonov VV, Jurka J. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat Rev Genet. 2008;9(5):411–2. author reply 414.
Luan DD, Korman MH, Jakubczak JL, Eickbush TH. Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for non-LTR retrotransposition. Cell. 1993;72(4):595–605.
Dewannieux M, Esnault C, Heidmann T. LINE-mediated retrotransposition of marked Alu sequences. Nat Genet. 2003;35(1):41–8.
Kajikawa M, Okada N. LINEs mobilize SINEs in the eel through a shared 3′ sequence. Cell. 2002;111(3):433–44.
Kojima KK. A new class of SINEs with snRNA gene-derived heads. Genome Biol Evol. 2015;7(6):1702–12.
Bao W, Kapitonov VV, Jurka J. Ginger DNA transposons in eukaryotes and their evolutionary relationships with long terminal repeat retrotransposons. Mob DNA. 2010;1(1):3.
Poulter RT, Goodwin TJ. DIRS-1 and the other tyrosine recombinase retrotransposons. Cytogenet Genome Res. 2005;110(1–4):575–88.
Yuan YW, Wessler SR. The catalytic domain of all eukaryotic cut-and-paste transposase superfamilies. Proc Natl Acad Sci U S A. 2011;108(19):7884–9.
Goodwin TJ, Butler MI, Poulter RT. Cryptons: a group of tyrosine-recombinase-encoding DNA transposons from pathogenic fungi. Microbiology. 2003;149(Pt 11):3099–109.
Kapitonov VV, Jurka J. Rolling-circle transposons in eukaryotes. Proc Natl Acad Sci U S A. 2001;98(15):8714–9.
Kapitonov VV, Jurka J. Self-synthesizing DNA transposons in eukaryotes. Proc Natl Acad Sci U S A. 2006;103(12):4540–5.
Krupovic M, Bamford DH, Koonin EV. Conservation of major and minor jelly-roll capsid proteins in Polinton (maverick) transposons suggests that they are bona fide viruses. Biol Direct. 2014;9:6.
Kapitonov V, Jurka J. The age of Alu subfamilies. J Mol Evol. 1996;42(1):59–65.
Price AL, Eskin E, Pevzner PA. Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res. 2004;14(11):2245–52.
Khan H, Smit A, Boissinot S. Molecular evolution and tempo of amplification of human LINE-1 retrotransposons since the origin of primates. Genome Res. 2006;16(1):78–87.
Giordano J, Ge Y, Gelfand Y, Abrusan G, Benson G, Warburton PE. Evolutionary history of mammalian transposons determined by genome-wide defragmentation. PLoS Comput Biol. 2007;3(7):e137.
Ohshima K, Hattori M, Yada T, Gojobori T, Sakaki Y, Okada N. Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol. 2003;4(11):R74.
Hancks DC, Goodier JL, Mandal PK, Cheung LE, Kazazian HH Jr. Retrotransposition of marked SVA elements by human L1s in cultured cells. Hum Mol Genet. 2011;20(17):3386–400.
Raiz J, Damert A, Chira S, Held U, Klawitter S, Hamdorf M, Lower J, Stratling WH, Lower R, Schumann GG. The non-autonomous retrotransposon SVA is trans-mobilized by the human LINE-1 protein machinery. Nucleic Acids Res. 2012;40(4):1666–83.
Cantrell MA, Scott L, Brown CJ, Martinez AR, Wichman HA. Loss of LINE-1 activity in the megabats. Genetics. 2008;178(1):393–404.
Jacobs FM, Greenberg D, Nguyen N, Haeussler M, Ewing AD, Katzman S, Paten B, Salama SR, Haussler D. An evolutionary arms race between KRAB zinc-finger genes ZNF91/93 and SVA/L1 retrotransposons. Nature. 2014;516(7530):242–5.
Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999;9(6):657–63.
Bao W, Jurka J. Origin and evolution of LINE-1 derived “half-L1” retrotransposons (HAL1). Gene. 2010;465(1–2):9–16.
Batzer MA, Deininger PL, Hellmann-Blumberg U, Jurka J, Labuda D, Rubin CM, Schmid CW, Zietkiewicz E, Zuckerkandl E. Standardized nomenclature for Alu repeats. J Mol Evol. 1996;42(1):3–6.
Kojima KK. Alu monomer revisited: recent generation of Alu monomers. Mol Biol Evol. 2011;28(1):13–5.
Kriegs JO, Churakov G, Jurka J, Brosius J, Schmitz J. Evolutionary history of 7SL RNA-derived SINEs in Supraprimates. Trends Genet. 2007;23(4):158–61.
Nishihara H, Terai Y, Okada N. Characterization of novel Alu- and tRNA-related SINEs from the tree shrew and evolutionary implications of their origins. Mol Biol Evol. 2002;19(11):1964–72.
Konkel MK, Walker JA, Hotard AB, Ranck MC, Fontenot CC, Storer J, Stewart C, Marth GT, 1000 Genomes Consortium, Batzer MA. Sequence analysis and characterization of active human Alu subfamilies based on the 1000 genomes pilot project. Genome Biol Evol. 2015;7(9):2608–22.
Kryatova MS, Steranka JP, Burns KH, Payer LM. Insertion and deletion polymorphisms of the ancient AluS family in the human genome. Mob DNA. 2017;8:6.
Kuryshev VY, Skryabin BV, Kremerskothen J, Jurka J, Brosius J. Birth of a gene: locus of neuronal BC200 snmRNA in three prosimians and human BC200 pseudogenes as archives of change in the Anthropoidea lineage. J Mol Biol. 2001;309(5):1049–66.
Wang H, Xing J, Grover D, Hedges DJ, Han K, Walker JA, Batzer MA. SVA elements: a hominid-specific retroposon family. J Mol Biol. 2005;354(4):994–1007.
Carbone L, Harris RA, Gnerre S, Veeramah KR, Lorente-Galdos B, Huddleston J, Meyer TJ, Herrero J, Roos C, Aken B, et al. Gibbon genome and the fast karyotype evolution of small apes. Nature. 2014;513(7517):195–201.
Ianc B, Ochis C, Persch R, Popescu O, Damert A. Hominoid composite non-LTR retrotransposons-variety, assembly, evolution, and structural determinants of mobilization. Mol Biol Evol. 2014;31(11):2847–64.
Bantysh OB, Buzdin AA. Novel family of human transposable elements formed due to fusion of the first exon of gene MAST2 with retrotransposon SVA. Biochemistry (Mosc). 2009;74(12):1393–9.
Hancks DC, Ewing AD, Chen JE, Tokunaga K, Kazazian HH Jr. Exon-trapping mediated by the human retrotransposon SVA. Genome Res. 2009;19(11):1983–91.
Damert A, Raiz J, Horn AV, Lower J, Wang H, Xing J, Batzer MA, Lower R, Schumann GG. 5′-Transducing SVA retrotransposon groups spread efficiently throughout the human genome. Genome Res. 2009;19(11):1992–2008.
Warren WC, Hillier LW, Marshall Graves JA, Birney E, Ponting CP, Grutzner F, Belov K, Miller W, Clarke L, Chinwalla AT, et al. Genome analysis of the platypus reveals unique signatures of evolution. Nature. 2008;453(7192):175–83.
Suh A, Churakov G, Ramakodi MP, Platt RN 2nd, Jurka J, Kojima KK, Caballero J, Smit AF, Vliet KA, Hoffmann FG, et al. Multiple lineages of ancient CR1 retroposons shaped the early genome evolution of amniotes. Genome Biol Evol. 2014;7(1):205–17.
Kordis D, Gubensek F. Horizontal transfer of non-LTR retrotransposons in vertebrates. Genetica. 1999;107(1–3):121–8.
Walsh AM, Kortschak RD, Gardner MG, Bertozzi T, Adelson DL. Widespread horizontal transfer of retrotransposons. Proc Natl Acad Sci U S A. 2013;110(3):1012–6.
Novick PA, Basta H, Floumanhaft M, McClure MA, Boissinot S. The evolutionary dynamics of autonomous non-LTR retrotransposons in the lizard Anolis carolinensis shows more similarity to fish than mammals. Mol Biol Evol. 2009;26(8):1811–22.
Kojima KK, Kapitonov VV, Jurka J. Recent expansion of a new Ingi-related clade of Vingi non-LTR retrotransposons in hedgehogs. Mol Biol Evol. 2011;28(1):17–20.
Kojima KK, Seto Y, Fujiwara H. The wide distribution and change of target specificity of R2 non-LTR retrotransposons in animals. PLoS One. 2016;11(9):e0163496.
Smit AF, Riggs AD. MIRs are classic, tRNA-derived SINEs that amplified before the mammalian radiation. Nucleic Acids Res. 1995;23(1):98–102.
Jurka J, Zietkiewicz E, Labuda D. Ubiquitous mammalian-wide interspersed repeats (MIRs) are molecular fossils from the mesozoic era. Nucleic Acids Res. 1995;23(1):170–5.
Nishihara H, Smit AF, Okada N. Functional noncoding sequences derived from SINEs in the mammalian genome. Genome Res. 2006;16(7):864–74.
Sasaki T, Nishihara H, Hirakawa M, Fujimura K, Tanaka M, Kokubo N, Kimura-Yoshida C, Matsuo I, Sumiyama K, Saitou N, et al. Possible involvement of SINEs in mammalian-specific brain formation. Proc Natl Acad Sci U S A. 2008;105(11):4220–5.
Chong AY, Kojima KK, Jurka J, Ray DA, Smit AF, Isberg SR, Gongora J. Evolution and gene capture in ancient endogenous retroviruses - insights from the crocodilian genomes. Retrovirology. 2014;11:71.
Ono R, Kobayashi S, Wagatsuma H, Aisaka K, Kohda T, Kaneko-Ishino T, Ishino F. A retrotransposon-derived gene, PEG10, is a novel imprinted gene located on human chromosome 7q21. Genomics. 2001;73(2):232–7.
Chalopin D, Naville M, Plard F, Galiana D, Volff JN. Comparative analysis of transposable elements highlights mobilome diversity and evolution in vertebrates. Genome Biol Evol. 2015;7(2):567–80.
Abascal F, Tress ML, Valencia A. Alternative splicing and co-option of transposable elements: the case of TMPO/LAP2alpha and ZNF451 in mammals. Bioinformatics. 2015;31(14):2257–61.
Vargiu L, Rodriguez-Tome P, Sperber GO, Cadeddu M, Grandi N, Blikstad V, Tramontano E, Blomberg J. Classification and characterization of human endogenous retroviruses; mosaic forms are common. Retrovirology. 2016;13:7.
Carre-Eusebe D, Coudouel N, Magre S. OVEX1, a novel chicken endogenous retrovirus with sex-specific and left-right asymmetrical expression in gonads. Retrovirology. 2009;6:59.
Subramanian RP, Wildschutte JH, Russo C, Coffin JM. Identification, characterization, and comparative genomic distribution of the HERV-K (HML-2) group of human endogenous retroviruses. Retrovirology. 2011;8:90.
Wildschutte JH, Williams ZH, Montesion M, Subramanian RP, Kidd JM, Coffin JM. Discovery of unfixed endogenous retrovirus insertions in diverse human populations. Proc Natl Acad Sci U S A. 2016;113(16):E2326–34.
Katzourakis A, Gifford RJ, Tristem M, Gilbert MT, Pybus OG. Macroevolution of complex retroviruses. Science. 2009;325(5947):1512.
Han GZ, Worobey M. An endogenous foamy-like viral element in the coelacanth genome. PLoS Pathog. 2012;8(6):e1002790.
Pace JK 2nd, Feschotte C. The evolutionary history of human DNA transposons: evidence for intense activity in the primate lineage. Genome Res. 2007;17(4):422–32.
Kojima KK, Jurka J. Crypton transposons: identification of new diverse families and ancient domestication events. Mob DNA. 2011;2(1):12.
Kapitonov VV, Jurka J. RAG1 core and V(D)J recombination signal sequences were derived from Transib transposons. PLoS Biol. 2005;3(6):e181.
Kapitonov VV, Koonin EV. Evolution of the RAG1-RAG2 locus: both proteins came from the same transposon. Biol Direct. 2015;10:20.
Huang S, Tao X, Yuan S, Zhang Y, Li P, Beilinson HA, Zhang Y, Yu W, Pontarotti P, Escriva H, et al. Discovery of an active RAG transposon illuminates the origins of V(D)J recombination. Cell. 2016;166(1):102–14.
Majumdar S, Singh A, Rio DC. The human THAP9 gene encodes an active P-element DNA transposase. Science. 2013;339(6118):446–8.
Kapitonov VV, Jurka J. Harbinger transposons and an ancient HARBI1 gene derived from a transposase. DNA Cell Biol. 2004;23(5):311–24.
Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Fritz MH, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526(7571):75–81.
Muotri AR, Chu VT, Marchetto MC, Deng W, Moran JV, Gage FH. Somatic mosaicism in neuronal precursor cells mediated by L1 retrotransposition. Nature. 2005;435(7044):903–10.
Goodier JL. Retrotransposition in tumors and brains. Mob DNA. 2014;5:11.
Yohn CT, Jiang Z, McGrath SD, Hayden KE, Khaitovich P, Johnson ME, Eichler MY, McPherson JD, Zhao S, Paabo S, et al. Lineage-specific expansions of retroviral insertions within the genomes of African great apes but not humans and orangutans. PLoS Biol. 2005;3(4):e110.
Kaiser SM, Malik HS, Emerman M. Restriction of an extinct retrovirus by the human TRIM5alpha antiviral protein. Science. 2007;316(5832):1756–8.
Jurka J, Bao W, Kojima KK. Families of transposable elements, population structure and the origin of species. Biol Direct. 2011;6:44.
McFadden J, Knowles G. Escape from evolutionary stasis by transposon-mediated deleterious mutations. J Theor Biol. 1997;186(4):441–7.
Simakov O, Kawashima T, Marletaz F, Jenkins J, Koyanagi R, Mitros T, Hisata K, Bredeson J, Shoguchi E, Gyoja F, et al. Hemichordate genomes and deuterostome origins. Nature. 2015;527(7579):459–65.
Green RE, Braun EL, Armstrong J, Earl D, Nguyen N, Hickey G, Vandewege MW, St John JA, Capella-Gutierrez S, Castoe TA, et al. Three crocodilian genomes reveal ancestral patterns of evolution among archosaurs. Science. 2014;346(6215):1254449.
The author thanks Weidong Bao for critical reading of the manuscript.
This work was supported by the Ministry of Science and Technology, Taiwan. The funding agency had no involvement in the design of the study, the collection, analysis, and interpretation of data, or writing the manuscript.
Availability of data and materials
The datasets generated and/or analysed during the current study are available in Repbase (http://www.girinst.org/repbase/).
The author declares that he has no competing interests.
Cite this article
Kojima, K.K. Human transposable elements in Repbase: genomic footprints from fish to humans. Mobile DNA 9, 2 (2018). https://doi.org/10.1186/s13100-017-0107-y