Domestication and co-domestication events of both PIF transposase and MADF protein in insect genomes
To retrieve sequences related to PIF transposases, we performed reiterative tBLASTn and BLASTp searches of the NCBI insect protein and translated nucleotide databases. The seven previously identified domesticated PIF transposases from Drosophila, i.e., the DPLG proteins, served as initial queries, and all hits to PIF TE-related sequences collected from the searched insect genomes were then used as queries in turn. In total, more than 250 PIF transposase hits were used as queries (see Methods).
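The reiterative search strategy can be pictured as a query-expansion loop in which every new hit becomes a query until no new sequences are recovered. The following Python sketch illustrates that logic only: `run_search` is a hypothetical stand-in for a tBLASTn/BLASTp call, the hit table is invented for illustration, and the filtering and stopping criteria actually used in this study are those described in Methods.

```python
from collections import deque
from typing import Callable, Iterable, Set


def iterative_search(seeds: Iterable[str],
                     run_search: Callable[[str], Iterable[str]],
                     max_rounds: int = 10) -> Set[str]:
    """Expand a seed set by re-using every new hit as a query.

    `run_search` is a hypothetical placeholder for a BLAST wrapper that
    returns the identifiers of PIF-related hits for a single query.
    """
    seen: Set[str] = set(seeds)
    queue = deque((s, 0) for s in seen)
    while queue:
        query, depth = queue.popleft()
        if depth >= max_rounds:
            # Hits found in the last allowed round are kept but not re-queried.
            continue
        for hit in run_search(query):
            if hit not in seen:
                seen.add(hit)
                queue.append((hit, depth + 1))
    return seen


# Toy hit table (invented identifiers, not real search results).
TOY_HITS = {
    "DPLG1": ["hit_a", "hit_b"],
    "hit_a": ["hit_c"],
    "hit_b": ["hit_a"],
    "hit_c": [],
}

found = iterative_search(["DPLG1"], lambda q: TOY_HITS.get(q, []))
```

Tracking the set of sequences already seen keeps the loop from re-querying the same hit and guarantees that it terminates once no new sequences appear.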
We screened all species with sequenced and annotated genomes in NCBI within the following orders (in alphabetic order): Archaeognatha, Blattodea, Coleoptera, Collembola, Dermaptera, Diplura, Diptera, Ephemeroptera, Hymenoptera, Lepidoptera, Megaloptera, Odonata, Orthoptera, Phasmatodea, Plecoptera, Psocodea, Siphonaptera, and Thysanoptera. We could not screen the following orders (in alphabetic order) because they lack sequenced, annotated genomes: Embioptera, Grylloblattodea, Mantodea, Mantophasmatodea, Mecoptera, Neuroptera, Protura, Raphidioptera, Zoraptera and Zygentoma (Thermobia). See Methods for more details, and Fig. 1 and Supplementary Table 1 for a summary.
We confirmed the domestications previously found in Drosophila (DPLG1–7) and dated them with higher accuracy (see Methods). We also obtained good support for domestication events within Anopheles (Diptera), Lepidoptera and Blattodea. Below we present, in separate sections, the evidence for six additional domestication events of PIF TE proteins in insects: first, two co-domestications of both transposase and MADF proteins in Anopheles (Diptera); then one transposase-only domestication event and one co-domestication in butterflies and moths (Lepidoptera); and lastly two transposase-only domestication events in cockroaches (Blattodea). Our inferences are based on the genes being located in syntenic genomic regions in distantly related species, being under purifying selection and having lost the hallmarks of active TEs, i.e., they lack TIRs and related copies in the genomes, following the approach of our previous work [13]. See details below and in Methods.
Two co-domestication events of both PIF transposase and MADF protein in Anopheles
The searches and the exploration of the genomic regions containing PIF transposase BLAST hits in Anopheles identified two unrelated, potentially domesticated transposases (Fig. 2 and Supplementary Table 1). Upon further examination, both cases appear to be analogous to the PIF co-domestication previously observed in Drosophila [13], i.e., DPLG7 and DPM7, where both open reading frames of the same PIF TE insertion were domesticated. We named the first gene of the two Anopheles co-domestications Anopheles PIF-Like Gene 1 (APLG1) and APLG2, respectively, and the second gene Anopheles PIF MADF-like protein-encoding gene 1 (APM1) and APM2 (Fig. 2). In the Anopheles gambiae genome, APLG1 and APM1 are 97 bp apart, whereas APLG2 and APM2 are 179 bp apart. These structures are well conserved in other species, although the genes are much further apart in some lineages (Fig. 2E).
Multiple lines of evidence support these cases as independent co-domestications in mosquitoes. In both cases, the transposase and the MADF proteins possess intact coding regions in all Anopheles species in which we found hits (Supplementary Table 1 and Supplementary file 1). Each appears as a single-copy gene in all species examined, supporting the hypothesis that they represent conserved orthologs. Additionally, APMs are in a region immediately adjacent to APLGs in all species analyzed (Fig. 2), suggesting that both proteins derive from the domestication of the genes of the same transposon insertion. Moreover, we did not find TIRs or target site duplications (TSDs), further supporting that these are not PIF TEs (see Methods). After manually annotating all retrieved sequences, we performed DNA multiple alignments and found that the exon-intron structure of each gene is also well conserved in all Anopheles species where we found the genes. Analyses of approximately 10 kb upstream and downstream of each gene revealed that the microsynteny of each region is well conserved (Supplementary Fig. 1). To examine the evolutionary relationships of both transposases and their respective MADF proteins, we performed phylogenetic analyses using the maximum likelihood (ML) approach (Fig. 2A-D) and found that in both cases the evolutionary history of the domesticated proteins follows the evolutionary history of the Anopheles genus (Fig. 2; [31,32,33]), providing further support that these proteins have been domesticated and are an integral part of the functionality of their host genomes.
In addition, we estimated dN/dS ratios for all genes. Likelihood ratio tests (LRTs) indicated that purifying selection is the major evolutionary force acting on the transposase genes, i.e., dN/dS is significantly smaller than 1 (APLG1 dN/dS = 0.282; APLG2 dN/dS = 0.330; LRT P-value < 0.05 for both genes). The MADF sequences can sometimes evolve much faster, albeit still under purifying selection, with dN/dS ratios significantly smaller than 1 (APM1 dN/dS = 0.517; APM2 dN/dS = 0.074; LRT P-value < 0.05 for both genes; see Supplementary Table 2 and Fig. 1), consistent with the mode of evolution of these genes described in other species [28, 29]. The dN/dS values of APLG1 and APLG2 are similar, suggesting that different PIF transposases might experience similar degrees of selective constraint in the same species.
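As an illustration of how a likelihood ratio test of this kind yields a P-value, the sketch below compares two nested models, assuming (this is not stated in this section) that the null model fixes dN/dS at 1 and the alternative estimates it freely, so that the test has one degree of freedom. The function name and the log-likelihood values are placeholders, not results from this study.

```python
import math


def lrt_pvalue_df1(lnl_null: float, lnl_alt: float) -> float:
    """P-value of a likelihood ratio test with one degree of freedom.

    lnl_null: log-likelihood with dN/dS fixed at 1 (assumed null model).
    lnl_alt:  log-likelihood with dN/dS estimated freely.
    """
    # Test statistic: twice the log-likelihood difference, floored at zero.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Chi-square survival function for df = 1: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Placeholder log-likelihoods, for illustration only.
p = lrt_pvalue_df1(lnl_null=-5120.4, lnl_alt=-5080.9)
significant = p < 0.05
```

A significant test combined with an estimated dN/dS below 1 is what is interpreted in the text as evidence of purifying selection.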
Interestingly, APLG1 and APM1 in A. gambiae are annotated as a single transcript, and the five annotated introns have good RNA-Seq support (Supplementary Fig. 2). Tissue expression data for APLG1 and APM1 support their functionality and co-expression in multiple tissues (Supplementary Table 3). Transcription of APLG2 and APM2 was also confirmed in multiple A. gambiae tissues (Supplementary Table 3). These data support the functionality of these domesticated PIF genes. The dicistronic transcription of APLG1 and APM1 provides a potential explanation for why the transposase and MADF protein from the same insertion were co-domesticated.
Based on the previously established phylogeny of Anopheles [31] and the presence or absence of the domesticated genes in different genomes (Fig. 2), we estimated that the APLG1 and APM1 co-domestication is relatively young, having occurred ~ 28 Mya. In contrast, the APLG2 and APM2 co-domestication is present in all Anopheles species, indicating that this domestication took place approximately 100 Mya (http://www.timetree.org).
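The dating logic used here, placing a domestication at the most recent common ancestor (MRCA) of all species that carry the gene, can be sketched as follows. The tree, species labels and node ages are hypothetical placeholders rather than the Anopheles phylogeny, and the sketch assumes a single gain with no subsequent loss.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical toy tree (child -> parent) with placeholder node ages in Mya.
PARENT: Dict[str, Optional[str]] = {
    "sp_A": "n1", "sp_B": "n1", "n1": "root", "sp_C": "root", "root": None,
}
AGE_MYA: Dict[str, float] = {
    "sp_A": 0.0, "sp_B": 0.0, "sp_C": 0.0, "n1": 30.0, "root": 100.0,
}


def lineage(node: str) -> List[str]:
    """Return the path from a node up to the root, starting with the node."""
    path = [node]
    parent = PARENT[node]
    while parent is not None:
        path.append(parent)
        parent = PARENT[parent]
    return path


def mrca(species: Iterable[str]) -> str:
    """Most recent common ancestor of the species carrying the gene."""
    species = list(species)
    shared = lineage(species[0])
    for sp in species[1:]:
        other = set(lineage(sp))
        # Keep only nodes present in both lineages, preserving tip-to-root order.
        shared = [n for n in shared if n in other]
    return shared[0]


def minimum_age(species_with_gene: Iterable[str]) -> float:
    """Age of the MRCA, taken as a lower bound on the time of domestication."""
    return AGE_MYA[mrca(species_with_gene)]
```

With this toy tree, a gene found only in `sp_A` and `sp_B` would be dated to node `n1`, whereas a gene found in all three species would be dated to the root.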
In the closely related species A. darlingi, A. albimanus and A. aquasalis, a further pair of genes, one encoding a transposase and the other a protein containing a MADF domain, was found adjacent to APLG2. Given the genomic location, it is likely that these two genes derive from a tandem duplication of the APLG2-APM2 pair; therefore, we named them APLG2b and APM2b. However, both pairs, APLG2-APLG2b and APM2-APM2b, show only 57% protein sequence identity, suggesting that APLG2b and APM2b may have evolved rapidly after arising via duplication. The alternative scenario, in which APLG2b and APM2b originated from an independent domestication event, seems implausible because of the aforementioned proximity of the two pairs of genes.
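For readers who wish to reproduce a percent identity figure of this kind from a pairwise protein alignment, a minimal sketch follows. It assumes that identity is computed over columns in which neither sequence has a gap; the source does not state which denominator was used, and the example sequences are invented.

```python
def percent_identity(aln_a: str, aln_b: str, gap: str = "-") -> float:
    """Percent identity over ungapped columns of a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aln_a, aln_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Invented aligned fragments, for illustration only.
identity = percent_identity("MKT-A", "MRTLA")
```

Other conventions, such as dividing by the full alignment length or by the length of the shorter sequence, give different values, so the convention should be reported alongside the percentage.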
Domestication and co-domestications of PIF TE genes in Lepidoptera
We found two additional cases of PIF transposase domestication in several lineages of Lepidoptera (butterflies and moths). First, we identified a transposase domestication event, which we named LPLG1 (Lepidoptera PIF-like Gene 1), in 31 Lepidoptera species representing 21 genera whose genomes are sequenced, annotated and publicly available. Second, we found an additional case of an independently domesticated transposase, named hereafter LPLG2 (Lepidoptera PIF-like Gene 2), in species representing 18 Lepidoptera genera (for this case, data are provided only for representatives of the different genera). Both domestications seem to have occurred at least 140 Mya (Akito et al. 2019; Figs. 1 and 3). BLASTp searches showed highly conserved orthologous sequences for each gene across genera (E-value = 0) and the lack of the structural hallmarks of active TEs (TIRs and TSDs). LPLG1 and LPLG2 are present as distinct single-copy genes within each genome examined, supporting the independent origin of these domesticated transposases from different PIF TE families. LPLG1 and LPLG2 both possess intact coding regions, encoding proteins of 385–428 aa and 356–402 aa, respectively (Supplementary file 1), and are under strong purifying selection (LPLG1 dN/dS = 0.037; LPLG2 dN/dS = 0.038; LRT P-values < 0.05; Supplementary Table 2). Detailed examination of the genomic regions upstream and downstream of LPLG1 and LPLG2 showed a high degree of conservation across species and genera (Supplementary Fig. 1). Immediately adjacent to LPLG2, we found a gene encoding a MADF protein, named hereafter LPM2 (Lepidoptera MADF-like Gene 2; Fig. 3 and Supplementary file 1), indicating a likely co-domestication of both PIF genes from the same PIF TE insertion in this case. LPM2 orthologs have evolved under purifying selection (dN/dS = 0.1758; LRT P-value < 0.05; Supplementary Table 2). LPM2 is distant from LPLG2 in a few lineages (Fig. 3D).
We did not find TIRs or target site duplications (TSDs) flanking these genes, further supporting that they are not PIF transposons (see Methods). To examine the evolutionary relationships of both transposases and the MADF protein, we performed ML phylogenetic analyses (Fig. 3A-C) and observed that the relationships follow the known phylogeny quite closely, but not completely (Fig. 3; [34, 35]). Such discordance between gene trees and the species phylogeny is expected under incomplete lineage sorting [36, 37].
Transcription data from Bombyx mori reveal transcription of LPLG2 and LPM2 in all adult tissues analyzed, as well as expression of LPLG1 in most adult tissues analyzed. While LPLG2 and LPM2 are transcribed in the same tissues, LPLG2 shows lower levels than LPM2 (Supplementary Table 3). Transcription of these genes supports their functionality. For the co-domesticated pair, transcription of both genes in the same tissues, albeit at different levels, supports at least partial coregulation of the two genes, likely derived from the co-domestication of transposase and MADF protein from the same PIF TE insertion.
Domestication of PIF TE genes in cockroaches
In Polyneoptera, order Blattodea (cockroaches), we discovered two cases of PIF transposase-only domestication events. We named these genes BPLG1 and BPLG2, for Blattodea PIF-like Gene 1 and 2. BPLG1 and BPLG2 were found in five and four species, respectively (Supplementary Table 1). BPLG1 shows an intact coding region encoding a transposase-like protein of 378–401 aa (Supplementary Table 1 and Supplementary file 1) and maintains a conserved two-exon structure across all species, although the length of the single intron has changed considerably (Fig. 4C). The regions flanking BPLG1 show no evidence of the TIRs or TSDs associated with PIF TE insertions. Analysis of the coding region suggests that this transposase-derived gene has been evolving under strong purifying selection (BPLG1 dN/dS = 0.095; LRT P-value = 0, see Supplementary Table 2). Furthermore, examination of the genomic regions upstream and downstream of BPLG1 shows a high degree of conservation across species (Supplementary Fig. 1). The gene phylogenies (Fig. 4A-B) are consistent with the known species phylogeny [38,39,40]. Overall, these findings support the scenario of a domestication event from a PIF transposase that is conserved across several species of cockroaches. Given its distribution, we estimate that this domestication event occurred ~ 228 Mya (http://www.timetree.org), the oldest event among the five cases we identified.
BPLG2 orthologs encode a highly conserved transposase-like protein of 394–395 aa, sharing 77–90% sequence identity. The gene contains two exons separated by an intron spanning ~ 900–1500 bp in all species except Cryptotermes secundus, where the intron is ~ 5400 bp long (Fig. 4). No TIRs or TSDs flanking BPLG2 were identified, and microsynteny data validated the orthology across species (Supplementary Fig. 1). BPLG2 is under purifying selection (BPLG2 dN/dS = 0.2486; LRT P-value < 0.001, see Supplementary Table 2). Given the species distribution, we estimate that this domestication occurred 132 Mya (http://www.timetree.org). Although we did not find genome-wide expression data for these species, the two domestications in cockroaches are well supported at the sequence level.
Interestingly, multiple PIF TEs and MADF proteins are present in the annotated cockroach genomes (Fig. 1), but we did not observe the domestication of genes encoding MADF proteins next to the domesticated transposases.
Evolutionary relationships of transposase sequences from PIF-like genes and transposons
As we retrieved the domestication cases above, we collected PIF transposase sequences from those genomes (Supplementary file 2). The phylogenetic reconstruction of transposase evolution from 90 insect PIF-like sequences revealed that domesticated elements are scattered throughout the tree of PIF transposon transposases (Fig. 5; Supplementary Fig. 3; Supplementary file 3). This phylogeny suggests that the majority of the fourteen domesticated PIF genes found in insects evolved independently from distantly related lineages of transposons. Nodes ancestral to each PIF-like gene (PLG) are statistically well supported, with the sole exception of APLG2. This phylogeny of PIF-like transposases shows that APLG2 and APLG2b form a clade with low statistical support together with DPLG3 genes (Fig. 5; Supplementary Table 4), but APLG2 and APLG2b are not monophyletic within this clade. No synteny conservation was found between APLG2 in Anopheles and DPLG3 in Drosophila, further supporting that these two genes represent independent domestication events. Thus, while some PLGs appear to form clades, these clades are not well supported, in agreement with the scenario of multiple independent domestication events. This is also the case for the pair of tandemly arranged genes APLG2 and APLG2b. Although APLG2b is likely to have evolved by duplication of its neighboring gene, it might have experienced a high level of substitution, obscuring the evolutionary relationship between the two. This is supported by the length of the branch leading to APLG2b proteins (Fig. 5).
The genes APLG1, LPLG1, BPLG1, DPLG1 and LPLG2 each form a monophyletic clade with one or multiple PIF transposon transposase lineages, further reinforcing the hypothesis that insect PLGs largely evolved independently. APLG1 is sister to a group of TEs found in dipterans, beetles and hymenopterans (Fig. 5). LPLG1 forms a clade with TEs from ants and dragonflies, whereas BPLG1 is sister to a stick insect element. DPLG1 and LPLG2 group with different TE lineages from lepidopterans.
Overall, no PLGs can be traced to PIF elements currently present in the same genomes, as already suggested for the seven Drosophila PLGs [13]. This lends further support to the view that insect PLGs represent ancient domestication events from TE lineages that have gone extinct in the species harboring those genes.
A step-by-step model of PIF TE genes co-domestication
Building upon our previous work in Drosophila [13], we have performed exhaustive analyses of PIF-like sequences in the most diverse animal taxon, the insects, to determine the frequency of transposase and MADF gene co-domestication (Fig. 1). We established that co-domestication events from the same TE insertion are relatively common in insects, with a minimum of four cases found collectively: Drosophila DPLG7-DPM7 (65 Mya), Anopheles (APLG1-APM1 ~ 28 Mya, and APLG2-APM2 ~ 100 Mya), and Lepidoptera (LPLG2-LPM2 ~ 140 Mya). Although six cases of TPase gene-only domestication were documented in Drosophila in our previous work (Casola, et al. 2007), we found only three additional such events in other insects (Fig. 1). While it is possible that some domestication events could not be traced because of incomplete gene annotation outside Drosophila, these findings indicate that co-domestication occurs at least as often as the recruitment of TPase genes alone in non-Drosophila lineages. It is also possible that genes encoding MADF-like proteins tend to be co-domesticated in most cases but are domesticated from independent insertions or subsequently relocated to other genomic regions in some lineages (see MADF genomic distribution in Fig. 1). Indeed, we observed that in the mosquito APLG2-APM2 and the lepidopteran LPLG2-LPM2 pairs the two genes are proximal to each other in many species, while they are farther apart in the genomes of other species (Figs. 2 and 3). The growth of the intergenic region between the PLG and the MADF-encoding gene is the most parsimonious explanation for this pattern. Since the two genes do not need to be chromosomally linked for co-domestication to occur after a TE invasion, our observations speak to the specificity of the interaction and co-domestication of the transposase gene and the MADF/Myb-like gene from the same PIF TE insertion.
An observation further supporting the frequent co-domestication scenario is that in most PIF TE co-domestications only one of the two proteins shows a predicted nuclear localization signal (NLS; Supplementary Table 5), suggesting that the two PIF proteins need to interact to relocate to the nucleus, as shown in rice (Hancock et al., 2010) and zebrafish (Sinzelle et al., 2008). Specifically, among co-domestication cases, only one protein, either the transposase or the MADF protein but not both, contained a predicted NLS in 33/42 species (Supplementary Table 5).
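The tally reported above is an exclusive-or count over per-species NLS predictions. A minimal sketch of that bookkeeping is given below; the species labels and predictions are invented placeholders, and the NLS prediction step itself (performed with an external predictor, see Supplementary Table 5) is not reproduced here.

```python
from typing import Dict, Tuple


def count_single_nls(predictions: Dict[str, Tuple[bool, bool]]) -> Tuple[int, int]:
    """Count species in which exactly one of the two proteins has a predicted NLS.

    predictions maps a species label to (transposase_has_nls, madf_has_nls).
    Returns (number of species with exactly one NLS, total number of species).
    """
    single = sum(1 for tpase, madf in predictions.values() if tpase != madf)
    return single, len(predictions)


# Invented predictions, for illustration only.
TOY_PREDICTIONS = {
    "species_1": (True, False),
    "species_2": (False, True),
    "species_3": (True, True),
    "species_4": (False, False),
}

single, total = count_single_nls(TOY_PREDICTIONS)
```

Applied to the real predictions, the same count would correspond to the 33 of 42 species reported in the text.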
Such interaction can occur regardless of the proximity of the transposase and MADF genes. However, additional support for the requirement of co-domestication from the same TE insertion comes from the fact that APLG1 and APM1 are annotated as dicistronic in A. gambiae. Thus, we propose a step-by-step model wherein co-domestication of both PIF genes from the same insertion is common and is occasionally followed by the separation or movement of one of the two genes to different genomic locations, from which they might continue to interact (Fig. 6).