Migrators within migrators: exploring transposable element dynamics in the monarch butterfly, Danaus plexippus

Background Lepidoptera (butterflies and moths) are an important model system in ecology and evolution. A high-quality chromosomal genome assembly is available for the monarch butterfly (Danaus plexippus), but it lacks an in-depth transposable element (TE) annotation, presenting an opportunity to explore monarch TE dynamics and the impact of TEs on shaping the monarch genome. Results We find 6.21% of the monarch genome is comprised of TEs, a reduction of 6.85% compared to the original TE annotation performed on the draft genome assembly. Monarch TE content is low compared to two closely related species with available genomes, Danaus chrysippus (33.97% TE) and Danaus melanippus (11.87% TE). The biggest TE contributions to genome size in the monarch are LINEs and Penelope-like elements, and three newly identified families, r2-hero_dPle (LINE), penelope-1_dPle (Penelope-like), and hase2-1_dPle (SINE), collectively contribute 34.92% of total TE content. We find evidence of recent TE activity, with two novel Tc1 families rapidly expanding over recent timescales (tc1-1_dPle, tc1-2_dPle). LINE fragments show signatures of genomic deletions indicating a high rate of TE turnover. We investigate associations between TEs and wing colouration and immune genes and identify a three-fold increase in TE content around immune genes compared to other host genes. Conclusions We provide a detailed TE annotation and analysis for the monarch genome, revealing a considerably smaller TE contribution to genome content compared to two closely related Danaus species with available genome assemblies. We identify highly successful novel DNA TE families rapidly expanding over recent timescales, and ongoing signatures of both TE expansion and removal highlight the dynamic nature of repeat content in the monarch genome. Our findings also suggest that insect immune genes are promising candidates for future interrogation of TE-mediated host adaptation. Supplementary Information The online version contains supplementary material available at 10.1186/s13100-022-00263-5.

diversity and abundance of TEs within a genome, as well as their integration histories, proliferation frequencies, and rate of removal, we can begin to unravel the factors that govern TE evolutionary dynamics. Curating TEs in a wide range of species is also an important step more generally, to expand understanding of variation in the interplay between TEs and host genomes, and the consequences of this for host evolution [28]. Meanwhile, from a practical perspective, accurate and reliable repeat annotation is essential to provide high-quality genome annotations, and to help avoid repeat sequences being incorrectly annotated as host genes and vice versa, especially given the vast numbers of genomes currently being sequenced across the tree of life.
Lepidoptera (butterflies and moths) are an important model system in ecology and evolution. Lepidopterans have captivated researchers for decades, proving a fascinating and practical system for research into topics such as development, physiology, host-plant interactions, coevolution, phylogenetics, mimicry, and speciation (e.g. [29][30][31][32][33]). Over recent decades, Lepidoptera has become an important model system in genomics especially. For example, Lepidoptera genomes have been pivotal in investigations into the genomic bases of migration [34][35][36], warning colouration [35], adaptive radiations [33], and evolutionary studies into TEs, including TE horizontal transfer [37][38][39], the influence of TEs on genome size [40], and the role of a TE in the evolution of industrial melanism in the peppered moth [41].
The monarch butterfly, Danaus plexippus, is famous for its incredible North American migration, global dispersal, and striking orange warning colouration [35], and has been the focus of much research and broad public interest for decades. A draft monarch genome was released in 2011, and work during the last decade has focussed on revealing the genetic bases of several key traits [35,36]. Most recently, a high-quality chromosomal assembly of the monarch genome was generated [42]. Additionally, genome assemblies for the black veined tiger (D. melanippus) and African Queen (D. chrysippus) provide a comparative context to examine monarch genomics. Yet, despite being an important research focus, we lack a detailed TE annotation for the monarch, and an accompanying understanding of monarch TE dynamics and host-TE interactions.
Here, we provide a thorough annotation of TEs within the monarch genome, alongside analyses of TE diversity and abundance, interactions with host genes, genomic distribution including hot and cold spots, and TE evolution. We find evidence of considerable recent TE activity in the monarch, with two novel Tc1 element families rapidly expanding over recent timescales (tc1-1_dPle and tc1-2_dPle). In contrast, we find that historically a pattern of high proliferation rates of LINE and Penelope-like elements (PLEs) was dominant, with these elements still dominating the monarch TE landscape. We also reveal the considerable role of genomic deletions in removing LINE TEs from the genome. Overall, we provide an overview of the broadscale dynamics that have shaped the TE landscape of the monarch genome, and provide a high-quality, curated, and species-specific TE annotation library for use by the scientific community.

Transposable Element Landscape
Repeats are a major determinant of genome size [40,[43][44][45][46][47]. However, in the monarch, we find that repeats comprise just 6.21% of the genome (total assembly size = 248. 68 Mb, assembly scaffold N50 = 9210 kb, 97.5% Complete Busco orthologs) (Fig. 1, Table 1). This differs considerably in comparison to the TE annotation performed on the original monarch genome assembly, where repeats were estimated to cover 13.06% of the genome (Table S1; Additional File 1) [36]. When examining the source of this variation, the major difference in TE content can be attributed to unclassified elements, for which we find coverage is reduced by 16.31 Mb (6.6% of total genome size) (Table S1; Additional File 1). This difference is likely due to two issues: (i) improvements in TE annotation tools over the last decade, where improved knowledge of TE structure has led to exclusion of erroneous sequences previously annotated as putatively TE, particularly given that the original annotation used similar tools and databases to those applied in this study [36]; (ii) We apply a conservative approach to TE annotation that excludes very short fragments which cannot be confidently identified as TE sequence, and we remove overlapping TE annotations. Improvements in genome assembly quality will also have improved repeat detection. Consistent with this, a standard 'no frills' RepeatMasker run using the Arthropoda RepBase database [48] and the Dfam database of repetitive DNA families [49] identified a repeat content of 5.60%, which is much closer to our finding of 6.21% repeat content using the earlGrey pipeline (https:// github. com/ TobyB aril/ EarlG rey) [50], and much lower than that identified in the original TE annotation (Table S2; Additional File 1). Thus, many of the sequences annotated as unclassified in the original annotation were most likely not TE sequences and are no longer annotated as such by current annotation tools. We are, however, unable to confirm this for certain since the original de novo repeat library generated for the draft genome was not provided alongside the genome. We also acknowledge the considerable difference in TE content annotated in the monarch genome in this study compared to a recent study focussing on D. chrysippus which also annotated TEs in the monarch genome [51]. When examining the source of this variation, we find this can be attributed to the more conservative and detailed approach taken here, where for example, 95,956 extremely short annotations identified in [51] did not reach the length and score thresholds for annotation in this study, whilst we also used a monarch-specific repeat library for annotation (See Additional File 2 for full discussion).
We annotated the two additional genomes available in the genus Danaus using the same automated TE annotation method that we applied for the monarch. In comparison, the monarch exhibits a much lower TE content than the other two species, with repetitive elements covering 33 Table 1). Relatively recent TE activity is indicated in all three Danaus species, with several TE classifications exhibiting low genetic distance to their respective consensus sequences, but there is evidence of a much greater burst of recent TE activity in D. chrysippus (Fig. 1) Table 1). The lower TE content observed in the monarch genome presumably arose by either: (i) loss of repeat content in the monarch; (ii) independent gains of repeat content in both D. chrysippus and D. melanippus (or in their ancestral lineages); or (iii) a gain of repeat content in the ancestral lineage leading to D. chrysippus and D. melanippus ( Fig. 2A). The ancestor of the monarch and D. chrysippus is estimated to have diverged ~ 8.3Mya [52]. Thus, using this estimate we were able to calculate DNA loss rates for the two species, to examine if the smaller genome of the monarch is a result of an increased rate of DNA loss (  Fig. 2B & C). Therefore, we suggest that hypothesis (ii) is the most likely explanation for observed differences in TE load among Danaus species, whereby D. chrysippus and D. melanippus have both experienced independent expansions in TE content (Fig. 2). Factors that may have been responsible for these gains, particularly the large gain in TE content in the genome of D. chrysippus, are unclear.

Transposable Element Landscape
TEs are unevenly distributed across the monarch genome, as indicated by a high standard deviation in TE base pairs per 100 kb genomic window (mean = 6.1 kb/100 kb, SD = 4.1 kb/100 kb), whereby a standard deviation of 0 indicates TEs are evenly distributed across the genome (Fig. 3, Fig. 4, Table S4; Additional File 1). TEs were also unevenly distributed  TEs are known to be unevenly distributed between different genomic compartments, with TE densities varying between genic and intergenic space [1]. In the monarch genome, we find that the greatest amount of TE sequence is found in gene flanking regions (6.59 Mb), followed by introns (5.56 MB), intergenic regions (2.82 Mb) and, finally, exons (0.47 Mb) (Fig. 3A). While when considered as a proportion of genomic compartment size, intergenic regions have the greatest proportion of TE content (9.67%), followed by gene flanking regions (8.22%), introns (5.00%), and exons (1.64%) (Fig. 3B). The relative lack of TEs in exonic regions is expected due to the potential for deleterious effects arising from transposition into coding sequence [1,53]. LINEs and PLEs are the most populous TEs present in the monarch genome. LINEs comprise 1.94% of the monarch genome, accounting for 3.36% of intergenic sequences, 2.67% of gene flanking regions, 1.41% of intronic sequences, and 0.76% of exonic sequences (Fig. 3B). Meanwhile, PLEs make up 1.03% of the monarch genome, Overall, we observed a weak negative correlation between gene density and TE density in the monarch genome (Spearman's rank, rho = − 0.373, p < 0.01), suggesting that regions that contain a high density of host genes are less likely to contain TEs. We identified 25 hotspots of TE accumulation in the monarch genome, defined as 100 kb windows with a TE content above a 99% area under the curve (AUC) cut-off (see Methods). This equates to windows where TE density is at least 3.03 times higher than expected (i.e., expected TE density assuming an even spread of TEs across 100 kb windows = 6.15 kb, lowest hotspot density = 18.64 kb, highest hotspot density = 28.02 kb) (Fig. 5). Hotspot windows represent regions of the genome able to tolerate high densities of TE insertions. Conversely, the presence of 36 cold spots, defined as 100 kb windows with TE content below a 1% AUC cut-off, where TE density is at least 12.42 times lower than expected (lowest cold spot density = 0.11 kb, highest cold spot density = 0.5 kb), highlights regions of the genome where very few TE insertions are tolerated. Whilst numerous TE coldspots are found on the Z chromosome, no TE hotspots are found there (Fig. 5). Of note, we detected 6 separate 100 kb windows labelled as TE coldspots and gene hotspots (Fig. 5). Gene ontology analyses suggest that genes present in gene hotspots that are TE coldspots are involved in RNA binding, splicing, polyadenylation, mRNA stability, and translation initiation [54]; cell polarisation [55]; neuronal development; antenna and eye development [56][57][58][59][60]; and olfactory processes, which are hypothesised to be involved in social behaviour [36] (Table S5; Additional File 1). Given the importance of these functions, it is likely that TE insertions in these regions would be severely detrimental to host fitness, thus individuals where TEs insert into these regions may be non-viable and purged early in development.

Transposable Element Activity
The genome of D. plexippus is dominated by three highly successful novel TE families, which collectively comprise 34.9% of total TE content ( Table 2, Table S6; Additional File 1). Of these families, the consensus sequence of the PLE, penelope-1_dple, is intact and putatively transposition competent, containing both a GIY-YIG endonuclease domain and a reverse transcriptase-like (RT-like) domain, whilst the LINE element, r2-hero-1_dple, is putatively transposition-competent, with an intact RT-like domain (Table S7, Additional File 1). SINEs are nonautonomous elements containing an internal polymerase III promoter to facilitate their transposition [61]. As such, there are no internal coding regions to determine the transposition competency of this element, however hase2-1_dple is putatively full-length at 306 bp.
Only three TE families are estimated to have inserted <1Mya in the monarch genome (Table S8, Fig. S1; Additional Files 1 and 3). The low estimated age of these insertions is a sign of recent host colonisation, potentially via horizontal transposon transfer (HTT) from another host genome [1,[62][63][64][65]. Peaks of low genetic distance to TE consensus sequence indicate an abundance of closely related elements and thus recent TE proliferation (Fig. 1, Table S9 and Fig. S2; Additional Files 1 and 4). There has been a recent burst of activity in two Tc1 elements: tc1-1_dple (418 copies, from ~ 0.43Mya), and tc1-2_dple (216 copies, from ~ 0.35Mya) (Table S8; Additional File 1). The rapid expansion of these Tc1 elements over such a short time period highlights their great success within the monarch genome when compared to other relatively young TEs, which are found at much lower copy numbers ( Fig. 6 and Fig. 7, Table S8 and Table S9; Additional File 1). This recent expansion of DNA TEs is interesting given the historic dominance of LINEs in the monarch genome (1.94% of the genome, 31% of total TE content). Whilst there are more DNA TE families than LINE families in the monarch genome (319 vs 213), LINEs have been more successful in proliferating and colonising the genome than their DNA TE counterparts to date, although the reasons for this are unknown, it appears that these new DNA TE families may challenge the historic trend ( Fig. 6 and Fig. 7).
The oldest TE insertions present in the monarch genome were estimated to have occurred ~ 43 Mya, predating the emergence of the monarch, which diverged ~ 32 Mya [66], with the first invaders being DNA and LTR elements (Fig. 6, Fig. 7, Table S9, Fig. S1; Additional Files 1 and 3). Thus, these TEs appear to have proliferated in an ancestral lineage after the Nymphalidae family emerged, ~91Mya (CI: 71-112Mya) [66]. However, most TE families (79%) are estimated to have inserted after the emergence of the monarch (mean TE family age = 25.41Mya) (Fig. 6, Fig. 7).

Turnover of LINEs and Penelope-like Elements
LINEs and PLEs are reverse transcribed from their 3′ end during insertion of a new copy and are often incomplete at the 5′ end due to the premature dissociation of reverse transcriptase (RT), or the activity of cellular RNases [67,68]. In contrast, the absence of a complete 3′ region is unlikely to arise due to either process and is evidence of a genomic deletion, presumably as a consequence of ectopic recombination between homologous LINE or PLE sequences [67,68].
Consistent with observations in the postman butterfly Heliconius melpomene [68] also from the family Nymphalidae, we find LINE and PLE fragments across all LINE and PLE families exhibit patterns suggestive of genomic deletions acting to remove these elements from the genome ( Table 3, Fig. S3; Additional File 5). In total, just under half of all LINE/PLE insertions (49.37%) are truncated at the 3′ end. For the CR1, I-Jockey, RTE-BovB, and Penelope families, most fragments display patterns suggestive of removal via genomic deletions (i.e. truncated at the 3′ end), whilst the majority of fragments of the L2, R1, R2-Hero, and RTE-RTE families display patterns indicative of processes associated with degradation of LINEs during mobilisation, such as premature disassociation of RT or breakdown via cellular RNAse activity (i.e. truncated at the 5′ end) ( Table 3). The range of insertion times between LINEs exhibiting an abundance of 3′ fragments and those exhibiting an abundance of 5′ fragments overlap, suggesting that insertion time does not explain the inter-family differences in fragmentation (Table S9, Additional File 1). Swathes of small fragments across all LINE and PLE families (mean length = 356 bp, n = 20,706) suggest a high rate of turnover of these non-LTR elements within the monarch genome, in contrast  [66] to patterns observed in mammals whereby a low rate of turnover leads to an accumulation of ageing TEs [68,69]. Despite evidence of loss, LINE and PLE sequences still dominate the TE landscape of the monarch. To be autonomous, an intact LINE must contain start and stop codons and an RT domain [70], whilst an intact PLE must contain an RT domain and a GIY-YIG endonuclease domain [71]. In total, we identified five intact LINEs and two intact PLEs in the monarch genome (Table S7, Additional File 1), including one copy of r2-Hero-1_dPle and 2 copies of penelope-1_dPle, which are the two most populous TE families in the monarch genome, suggesting continued activity (Table 2). We identified 4 intact LINEs in D. melanippus and 9 in D. chrysippus, however in both species intact PLEs were absent. We also observed accumulations of certain LINE families in the monarch, however only 11.38% of LINEs and 3.51% of PLEs have maintained a length above 1 kb, falling short of the expected length given an intact LINE ORF encoding RT is ~ 3 kb [72] (Table S10; Additional File 1). Short elements are more likely to persist in the genome than their longer counterparts, as they are less prone to recombination [19]. This, combined with the high activity level of LINEs and PLEs before and throughout the evolution of the monarch (Fig. 3), explains the dominance of LINE and PLE fragments across the genomic landscape, despite their high rates of turnover [19,68].

Association of TEs with Monarch Colouration and Immune Genes
Given a previous role for TEs in wing colouration in the peppered moth (Biston betularia) [41] and the clouded yellow butterfly (Colias croceus) [73], we investigated the possibility of TE associations with wing colouration in the monarch. The myosin gene is associated with wing colouration in the monarch [35], and we identified a single unclassified element 7 kb upstream of this gene ( Fig. S4; Additional File 6). This location contrasts to that of TEs implicated in colouration in the peppered moth and clouded yellow, where the TE insertions were found within an intron, and 6 kb downstream of the gene body, respectively [41,73]. When compared to all monarch genes, the association of a single TE with the myosin gene is within the expected range to find within 20 kb of a genic locus (See Methods), and therefore is not evidence of TE enrichment (Fig. S5, Additional File 7).
Immune genes are among the fastest evolving metazoan genes given strong selection arising from exposure to harmful pathogens and parasites [74][75][76][77], and so are key candidates for exploring the role of TEs in host evolution. Across the monarch genome, TEs of all major types were found within immune gene flanks (20 kb either side of gene bodies). Regions surrounding immune genes contain a three-fold increase in TE content when compared to genes involved in other functions, although this level of enrichment is just above a 5% level of significance (Welch's T-Test, T 9.0915 = 2.1351, p = 0.061). However, a significant difference in TE abundance between different types of immune genes was apparent (Kruskal-Wallis χ 2 38 = 104.88, p < 0.01), suggesting an enrichment for TEs around Class II-associated invariant chain peptide (CLIP)-like modulation genes when compared to immune genes involved in IKK gamma, Tab2, SOCS, and IMD signalling. (Fig. S6; Additional File 8).
When comparing associations between major TE types around immune genes, we find that LINEs and PLEs are significantly more abundant than rolling circles (Kruskal-Wallis, χ 2 7 = 29.88, p < 0.01), but there were no significant pairwise differences between the remaining major TE types. However, LINEs and PLEs are not enriched around immune genes compared to genome-wide levels (LINEs: genome-wide = 1.94% of genome size, immune gene regions = 2.04% of immune gene compartment size; PLEs: genome-wide = 1.03% of genome size, immune gene regions = 0.96% of immune gene compartment size).

Discussion
We provide a detailed annotation of TEs in the monarch genome, an analysis of their diversity and evolution, an investigation of their influence in shaping the monarch genome, and a comparative overview of TE load for the monarch and two congeneric species. We find a large reduction in TE sequence compared to the original annotation performed on the draft monarch genome. We suggest that this is due to many sequences in the original annotation no longer being recognised as TEs when using updated TE databases and current annotation tools. However, we are unable to confirm this, due to the original de novo TE library and annotation results being unavailable. This raises an important issue in genomics currently. Given TE annotation is a key step in genome assembly and annotation, we suggest that it should be required to include TE libraries alongside the publication of genome projects, ideally in a specific repository, to: (i) reduce duplication of efforts required for reannotation; (ii) improve reproducibility in annotation; and (iii) improve TE consensus libraries to reduce cases of a single TE family being given different names in multiple species. As the number of sequenced genomes increases, the wider availability of TE libraries will benefit a wide range of studies, where gene annotations can be improved by accurate TE annotation. The availability of large numbers of well-curated TE libraries will also facilitate large-scale studies of TE evolutionary dynamics, and the impacts of TEs on specific traits of interest, and host evolution more generally.
We identify an uneven distribution of TEs across the genome. TE insertions in non-coding regions are less likely to have a detrimental impact on host genome function than TE insertions into coding regions [1,21,53,[78][79][80], and so are less stringently removed by selection, enabling TE accumulation in non-coding regions. Indeed, some TEs actively avoid coding regions by targeting regions of the genome unlikely to impact host fitness [53,81], with this process potentially occurring in the monarch, where we find overlapping gene hotspots and TE cold spots. TE abundance is highest in intergenic regions. Of note, the highest TE coverage is found in host gene flanking regions rather than intergenic regions distal from host genes, where the risk of detrimental impacts of TE insertion are thought to be reduced [53,81]. The finding is interesting given the potential involvement of TEs in the evolution of gene regulatory networks [10][11][12][13]82] and the uneven distribution of TEs across the monarch genome. Although this could also be due to the relatively compact nature of the monarch genome, where genic regions account for 56% of the genome and host gene flanks account for a further 32%, leaving only 12% of the genome distal from host genes (i.e intergenic). TEs are implicated in rapid increases in intron size in eukaryotic genomes, where short non-autonomous DNA TEs independently generate hundreds to thousands of introns [1,53,83]. It is also hypothesised that TE-derived introns are responsible for the prevalence of nucleosome-sized exons (DNA sequences ~ 150 bp that can wrap around a single histone core) observed in eukaryotes [83]. Thus, the presence of TEs in intronic sequences in the monarch genome could reflect the status of introns as refugia where TE insertions are not as likely to be purged, or alternatively it could indicate ongoing intron generation, with TE insertions leading to the generation of new, or longer, introns over evolutionary time.
The presence of several young high copy number TE families in the monarch genome demonstrates the success of TEs in naïve host genomes [81,[84][85][86][87]. The putative recent arrival of two novel Tc1 families (tc1-1_dple and tc1-2_dple), has led to a wave of TE proliferation in the monarch genome. These elements are closely related and located in Clade 130 of Tc1/mariner phylogeny, which contains elements from diverse insect species, including butterflies, aphids and ants, in a recent phylogenetic study of DDE TE diversity [88] (Fig. S7; Additional File 9). Ants and aphids are known to share environments with monarch larvae on milkweed plants [89], raising the potential for HTT events between these groups [64,90]. Elucidating the origin of recent Tc1 insertions in the monarch genome will require a fuller comparison across related butterfly species and other host plants and insects that are known to share environments with the monarch, as more genomes become available. Over time, it is hypothesised that the activity of the novel families identified will decrease as the host develops mechanisms to suppress activity, either through the capture of TE sequences in piRNA regions, or through other defensive mechanisms such as methylation [87,91,92]. A reduction in transposition of DDE transposasecontaining DNA transposons can also be attributed to increases in non-autonomous copies [88]. In the meantime, however, these TEs are proliferating and dynamically altering the genomic landscape of D. plexippus, with potential consequences for ongoing genome structure and functional evolution [1][2][3]19].
Factors leading to the relatively low TE content observed in the monarch compared to other Danaus species are currently unknown. The maintenance of a small genome in the monarch may be partly attributed to the high turnover of TEs, evidenced by a lack of ancient peaks in repeat landscapes and the presence of few intact LINEs and PLEs, many of which show hallmarks of genomic deletions. We did not observe large differences in the number of intact LINEs between the three Danaus species considered here, however both D. chrysippus and D. melanippus lack transposition competent PLEs. In D. melanippus, there are only small numbers of intact LINEs and a relative lack of recent LINE activity, despite historical LINE activity having made the largest TE contribution to genome size ( Fig. 1 and Fig. 2). In contrast, there is evidence of very recent LINE activity in D. chrysippus, although much very recent TE activity is attributed to DNA and rolling circle elements. (Fig. 1 and Fig. 2).
Ectopic recombination between similar elements is likely to play a significant role in the removal of LINEs and PLEs. The only LINE or PLE families with an abundance of 3′ fragments are: L2, R1, R2-Hero, and RTE-RTE. While it is tempting to suggest that L2, R1, R2-Hero, and RTE-RTE insertions may have a lower impact on host fitness than other LINE insertions, and so are "allowed" to degrade over time, rather than being removed en masse through genomic mechanisms, it should be noted that ectopic recombination is a passive process. However, removal via ectopic recombination does not account for inter-family differences in the ratio of fragments undergoing degradation through ectopic recombination versus degradation via the activity of RT or cellular RNAses. In the case of the most abundant family, R2-Hero, these elements may have so far avoided ectopic recombination, given these elements are relatively young compared to others in the same classification (insertion ~ 8.58Mya, mean LINE insertion time 28.51Mya) ( Fig. 6 and Fig. 7, Table S9; Additional File 1)). Our results are consistent with those of previous findings in Heliconius species [93], in which a high TE turnover and a relative lack of more ancient TEs were observed, suggesting these characteristics might be widespread among nymphalids or lepidopterans more generally, although further comparative studies will be required to confirm this. In addition to TE turnover, TE-driven increases in genome size are likely also limited through inhibition of TE activity via epigenetic mechanisms and piRNAs [1,91,94,95]. Further research to characterise piRNA clusters in the monarch, and interrogation of its transcriptome, will further improve our understanding of how these processes have shaped its TE landscape.
The lack of TE hotspots on the Z chromosome is perhaps unexpected given sex chromosomes are thought to accumulate TEs faster than autosomes due to suppressed homologous recombination [96][97][98]. If this was the case in the monarch, one might expect to observe TEs accumulating due to the inability of the host genome to purge insertions. The Z chromosome of the monarch is the product of a fusion between an ancestral-Z and neo-Z genome segment, with distinct modes of dosage compensation acting on either section [42]. The unique chromatin remodelling dynamics on the Z chromosome in the monarch might moderate TE access to regions of the Z chromosome, although further research is required to understand the interplay between host processes and TE dynamics on this unique sex chromosome. Meanwhile, the three-fold increase in TE abundance around immune genes is intriguing and warrants future research to further understand the impact of TEs in these regions. The association with CLIP modulation genes is particularly interesting, as CLIP peptides block class II MHC binding grooves to modulate the immune response by influencing antigen presentation to CD4 T cells by class II MHC molecules [99]. Class II MHC genes are diverse and evolve rapidly, maintained by strong positive selection and frequent gene conversion [100,101]. Given the strong selection these loci are under, these are ideal candidates for further study to investigate a potential role for TEs in the adaptive immune system. The use of TE knockout studies in combination with transcriptomic analyses may elucidate the potential impacts TEs have on immune gene evolution in the monarch.

Conclusions
In-depth studies into TE landscapes and dynamics within host species of broad scientific interest remain relatively few, even with widespread acknowledgement of the important contributions that TEs have made to eukaryotic evolution [1,43,102]. We provide a detailed annotation of TEs in the monarch genome and analysis of TE diversity and evolution, including the identification of highly successful young DNA TE families, which mirror the activity profile of the most successful LINEs present in the genome. We provide a comparative context with two other Danaus species, demonstrating the considerable differences in TE content that can occur even among congeneric species. We present evidence of ongoing TE expansion and removal, highlighting the dynamic nature of repeat content within genomes over time. We also identify an increase in TE abundance around genes of immune function, providing potential candidates for future studies to elucidate the impact of TEs on immune function in the monarch. Further largescale comparative studies will be greatly beneficial to further our understanding of the evolutionary dynamics of TEs, and the drivers leading to diverse TE landscapes across lineages.

Transposable Element Annotation
The monarch chromosomal assembly, Dplex_v4, was downloaded from NCBI (Refseq GCF_009731565.1), along with corresponding genome annotations in GFF format [42,103]. The monarch genome assembly is 248.676 Mb, with a GC content of 32.1%, and comprises 4115 scaffolds with a scaffold N50 of 9.21 Mb.
The D. plexippus genome was annotated using Earl Grey (github. com/ TobyB aril/ EarlG rey/) [104]. Known repeats were first identified and masked using Repeat-Masker (version 4.1.0) [105] using the Arthropoda library from RepBase (version 23.08) and Dfam (release 3.3) [48,49]. Low complexity repeats and RNA were not masked (−nolow and -norna) and a sensitive search was performed (−s). Following this, a de novo repeat library was constructed using RepeatModeler (version 1.0.11), with RECON (version 1.08) and RepeatScout (version 1.0.5) [106][107][108]. To generate maximum-length consensus sequences for the de novo repeat library, novel repeats identified by RepeatModeler were curated using an automated version of the 'BLAST, Extract, Extend' process [28]. A BLASTn search was performed on the D. plexippus genome to obtain up to the top 40 hits for each TE family identified by RepeatModeler [109]. Sequences were extracted with 1000 base pairs of flanking sequences at the 5′ and 3′ ends. Each set of family sequences were aligned using MAFFT (version 7.453) [110]. Alignments were trimmed with trimAl (version 1.4.rev22) to retain high-quality positions in the alignment (−gt 0.6 -cons 60) [111]. Updated consensus sequences were computed with EMBOSS cons (−plurality 3) to generate a new TE library featuring consensus sequences with extended flanks [112]. This process was repeated through 5 iterations to obtain maximum-length consensus sequences.
Following automated curation, de novo family sequences were subject to a manual curation following protocols described by Goubert et al. (2022) [113]. Here, the final alignments generated by Earl Grey were visualised using AliView [114], and poorly-aligned positions and ends were trimmed where single-copy DNA was present. Consensus sequences were generated using EMBOSS cons [112] to generate a library of manually curated elements, which was clustered using cd-hit-est [115] with the options recommended by Goubert et al. (2022) [113] to reduce redundancy. To classify TEs, the TE library was analysed with the diagnostic tool TE-Aid (https:// github. com/ clemg oub/ TE-Aid) to identify TEassociated protein domains in sequence ORFs and to look for the presence of TIRs and LTRs [105,109,112]. Following this, blastn and nhmmer searches were performed against the Dfam database (version 3.3) to identify homology to previously described elements [49,109,116].
The resulting de novo repeat library was combined with the RepBase Arthropoda library and utilised to annotate repetitive elements using RepeatMasker with a score threshold at the relatively conservative level of 400 (−cutoff 400), to exclude poor matches unlikely to be true TE sequences. Following this, the D. plexippus genome was analysed with LTR_Finder (version 1.07), using the LTR_Finder parallel wrapper [117,118] to identify full-length LTR elements. Final TE annotations were defragmented and refined using the loose merge (−loose) command in RepeatCraft [119]. Overlapping annotations were removed using MGKit (version 0.4.1) filter-gff, with the longest element of overlapping annotations retained (−c length -a length) [120]. Finally, all repeats less than 100 bp in length were removed before the final TE quantification to decrease spurious hits. Final repeat annotation is provided in datasets S1 and S2 (Additional Files 10 and 11), and de novo TE library is provided in dataset S3 (Additional File 12).

Quantifying DNA Loss Rates Between the Monarch and D. chrysippus
To estimate DNA loss rates between the monarch and D. chrysippus, the DelGet.pl script (https:// github. com/ 4urel iek/ DelGet) was used to estimate DNA loss from small (< 30 nt) and mid-size (30 < nt < 10,000) deletions, as described previously [45]. Briefly, the script identifies orthologous regions of 10,000 bp between three input species (an outgroup and two species with known branch length in Mya) using BLAT (https:// genome. ucsc. edu/ FAQ/ FAQbl at. html). These regions are extracted and aligned with MUSCLE [121] before deletions are quantified. Species-specific gaps are identified and quantified. Deletion rates are then calculated between the two species of interest by dividing the normalised deleted nucleotide count (del_nt/10 kb) by the corresponding Mya since divergence.

Identifying Transposable Element Association with Genomic Compartments
The D. plexippus gene annotation file was processed to generate a GFF file containing coordinates of exons, introns, and 5′ and 3′ flanking regions. Gene flanking regions are defined here as 20 kb directly upstream or downstream of the gene body. These flanking regions are determined to identify TEs at a distance that could be described as the proximate promoter region, rather than just accounting for the core promoter region, to include both promoter and more distal enhancer regions for genes [122]. Bedtools intersect (version 2.27.1) was used to identify the number of bases of overlap (−wao) between TEs and different genomic compartments [123]. Following this, quantification was performed in R.

TE Hotspots and Coldspots
To identify TE and gene hotspots and coldspots, chromosomal scaffolds were split into 100 kb windows, and the TE density in each window was calculated and used to produce the frequency distribution of observations. A polynomial model was fitted to the frequency distribution, to generate a smooth curve through the data points using the lm function in R. Using these models, the total area under the curve (AUC) was calculated using the area_under_curve function of the bayestestR package [124]. From this, the 1 and 99% AUC cutoff values were determined. Starting with the first data point and iteratively adding subsequent datapoints with each round, the AUC for each collection of points was determined. The maximum AUC value not exceeding 99% of the total AUC was taken as the cutoff where windows containing higher TE or gene coverage were considered to be hotspots. The minimum AUC value not exceeding 1% of the total AUC was taken as the cutoff where windows containing lower TE or gene coverage were considered to be coldspots ( Fig. S8; Additional File 13). Karyoplots of the D. plexippus genome, illustrating gene and TE locations, highlighting TE hotspots and coldspots were generated using the Karyop-loteR package in R [125].

Estimating TE Age and Activity
TE activity dates were estimated using Kimura 2-parameter genetic distances (including CpG sites) calculated between individual insertions and the consensus sequence of each TE family [126]. A neutral mutation rate of 0.0116 substitutions per site per million years was used for D. plexippus, as previously described for the monarch [127].

Turnover of LINEs and PLEs
To interrogate LINE and PLE decay patterns, each LINE and PLE sequence was extracted from the genome using bedtools getfasta [123]. Extracted sequences were used to query the repeat library using BLASTn with tabular output specified (−outfmt 6). In the event of multiple matches, only the top match was retained (defined by percent identity). Coordinates of each match were scaled to normalise the length of each consensus sequence. Mapped fragments were plotted using ggplot2 of the tidyverse package in R [128,129].

Identification of Intact LINEs
To identify full-length LINEs, all elements longer than 2700 bp were extracted from the monarch genome using bedtools getFasta [123]. Open reading frames (ORFs) were detected using EMBOSS getORF [112]. ORF sequences were filtered to retain all sequences >600aa. RPSBLAST+ (v2.9.0+) [109] was used to query the ORF sequences against the CDD database [130] to identify conserved protein domains. Sequences were classified as intact LINEs based on the presence of relevant protein domains as previously described [72] (Table S7, Additional File 1).

Identification of Intact PLEs
To identify intact PLEs, all elements longer than 2400 bp were extracted from the monarch genome using bedtools getfasta [123] as full-length PLEs are ~ 2500 bp long [131]. Open reading frames (ORFs) were detected using EMBOSS getORF [112]. ORF sequences were filtered to retain all sequences >200aa. RPSBLAST+ (v2.9.0+) [109] was used to query the ORF sequences against the CDD database [130] to identify conserved protein domains. Sequences were classified as intact PLEs based on the presence of reverse transcriptase (RT) and a GIY-YIG endonuclease.

Classification of Novel DNA TEs
To further classify novel DNA elements in the monarch beyond homology to known elements, as used by RepeatModeler, DDE transposases were compared with curated alignments for all DDE TE superfamilies [88]. Specifically, dashes in curated alignments of all known DDE TE sequences for each of the 19 families were removed, and the resultant fasta file was used as the subject in a blastx search against the de novo monarch TE library, with an e-value threshold of 1 × 10 − 20 and tabular output specified (−evalue 1e-20 -outfmt 6) [109]. The best match for each de novo query sequence was retained, with a minimum 50% identity over at least 50% of the total length of the curated DDE subject sequence. Following this, novel DNA elements were only detected for the Tc1-Mariner family ( Fig. S7; Additional File 9). Transposase sequences for the de novo monarch sequences were extracted and aligned to the original DDE alignments for their respective families using the profile alignment command in MAFFT [110]. Sequences from related transposon superfamilies were used as outgroups, as described [88]. To estimate phylogenies for the two DDE superfamilies, we performed 1000 ultra-fast bootstrap replicates in IQ-TREE (v1.6.12) [132] specifying the best-fit amino acid model for each family as indicated by ModelFinder [133]. Phylogenetic trees were visualised in figtree (v1.4.4) (http:// tree. bio. ed. ac. uk/ softw are/ figtr ee/).

Association of TEs with Monarch Colouration and Immune Genes
We obtained the transcript sequence in fasta format for the myosin gene responsible for wing colouration from MonarchBase, under accession DPOGS206617 [34]. To identify the matching region in the v4 monarch assembly, a blastn search was performed to query the monarch assembly [109]. The coordinates of the myosin gene were obtained and used in a bedtools window search to identify TEs within 20 kb (−w 20,000) [123]. TE copy number within 20 kb of gene bodies was quantified for each monarch gene and a polynomial model was fitted to the frequency distribution to generate a smooth curve through the data points using the lm function in R. Using these models, the total area under the curve (AUC) was calculated using the area_under_ curve function of the bayestestR package [124]. From this, the 2.5 and 97.5% AUC cutoff values were determined, equivalent to selecting the central 95% of the data distribution. Starting with the first data point and iteratively adding subsequent datapoints with each round, the AUC for each collection of points was determined. The maximum AUC value not exceeding 97.5% of the total AUC was taken as the cutoff where genes containing higher TE copy number were considered significantly enriched for TEs. The minimum AUC value not exceeding 2.5% of the total AUC was taken as the cutoff where genes containing lower TE copy number were considered significantly depleted for TEs ( Fig.  S5; Additional File 7). A karyoplot of the myosin gene region including 20 kb flanks was generated in R using the karyoploteR package [125].
To investigate associations between monarch immunity genes and TEs, nucleotide and protein sequences of monarch immune genes were obtained in fasta format from MonarchBase [34]. To identify corresponding regions in the v4 monarch assembly, blastn and tblastn searches were used to query the monarch genome assembly with the monarch immune gene nucleotide and protein sequences [109]. Matches were filtered by percentage identity and query match coordinates to identify the locus of each immune gene. A bed file was generated, containing the name and location of each immune gene, and a bedtools window search was performed to identify TEs within 20 kb of each immune gene(−w 20,000) [123]. The same method was applied to identify TEs within 20 kb of all other genes. TE bases surrounding each immune gene were quantified in R [134]. As TE count was not normally distributed, a Kruskal-Wallis test was used to compare TE abundance between immune genes and other genes, between different TE classifications around immune genes, and between different immune gene types. Unless otherwise stated, all plots were generated using the ggplot2 package of the tidyverse suite in R [128,129].