Differential retention of transposable element-derived sequences in outcrossing Arabidopsis genomes

Transposable elements (TEs) have initially been viewed as pure genomic parasites but are now recognized as central genome architects and powerful agents of rapid adaptation. A proper evaluation of their evolutionary significance has been hampered by the paucity of short scale phylogenetic comparisons between closely related species. Here, we characterized the dynamics of TE accumulation at the micro-evolutionary scale by comparing two closely related plant species, Arabidopsis lyrata and A. halleri. Joint genome annotation in these two outcrossing species confirmed that both contain two distinct populations of TEs with either ‘recent’ or ‘old’ insertion histories. Identification of rare segregating insertions suggests that diverse TE families contribute to the ongoing dynamics of TE accumulation in the two species. TE fragments that have been maintained in both species, i.e. those that are orthologous, tend to be located on average closer to genes than those that are retained in one species only. Moreover, compared to non-orthologous TE insertions, those that are orthologous tend to produce fewer short interfering RNAs, are less heavily methylated when found within or adjacent to genes and these tend to have lower expression levels. These findings suggest that long-term retention of TE insertions reflects their frequent acquisition of adaptive roles and/or the deleterious effects of removing TE insertions when they are close to genes. Overall, our results indicate a rapid evolutionary dynamics of the TE landscape in these two outcrossing species, with an important input of a diverse set of new insertions with variable propensity to resist deletion.


INTRODUCTION
1 2 Transposable elements (TEs) are repeated elements found almost universally in eukaryotic 3 genomes that can proliferate by high-jacking a variety of cellular processes. They are believed to 4 be the substrate over which the non-coding fraction of the genome is formed in the long term 5 (Britten and Kohne 1968) and contribute a large fraction of genome size variation across taxa, 6 representing as much as 85% of the maize and barley genome and around 20% in A. thaliana

13
In spite of their quantitative importance, the evolutionary significance of TEs has been the 14 subject of constant debate in the field. Their discovery was immediately followed by the 15 interpretation that they must represent important "controlling elements" (McClintock 1950) 16 that confer selective advantages to the organism and are a major "fuel" for evolution (Slotkin pattern that contributed to maize domestication (Studer et al. 2011). Thanks to the regulatory 26 elements they carry, TEs have also the capacity to confer environmental responsiveness to 27 neighboring genes (e.g. Makarevitch et al. 2015;Horváth et al. 2017;Dubin et al. 2018). Hence, 28 the duality of TEs, seen either as purely deleterious or as powerful drivers of rapid adaptive 29 evolution has not been resolved today and the way natural selection is acting on TEs and how 30 they accumulate in host genomes remain important questions in evolutionary genomics (Hua-31 Van et al. 2011;Lisch 2013;Casacuberta and González 2013).

33
To achieve a more balanced view of TE evolution, one must therefore consider their 34 accumulation as resulting from a complex balance between the rate and genomic locations at 35 which they insert, the variety of their deleterious or beneficial effects and the rate at which they take place, but striking differences have been observed even within species, with e.g. as much as 23 22% of genome size variation between two lines of maize mainly caused by TE differences 24 (Vielle-Calzada et al. 2009) or a 30% increase in genome size in the Australian rice Oryza 25 australiensis being caused by the recent activity of just three TE families (Piegu et al. 2006).

26
However, because of their repetitive nature, it is generally challenging to follow the evolutionary 27 fate of individual TE copies as soon as divergence increases. Hence, the limitation of this "global" 28 approach is that it has limited power to pinpoint factors that prevent or promote the invasion of 29 TEs within a given genome (Ungerer et al. 2006 A. thaliana has been interpreted as an effect of the shift to selfing, which occurred roughly within 6 the same timeframe (de la Chaux et al. 2012). While theoretical models do indeed predict a 7 major effect of the mating system on the dynamics of TEs (Charlesworth and Charlesworth 8 1995, Wright et al. 2008, Boutin et al. 2012, the net effect depends on the balance between 9 sharply different processes. Therefore, it is also possible that the contrast between A. lyrata and 10 A. thaliana is due to factors independent from the mating system. Specifically, the presence of 11 TEs is associated with reduced levels of gene expression for TEs up to 2.5kb away in A. thaliana, To obtain a more general picture of how TEs evolve in the model Arabidopsis genus, it is 21 essential to compare species without the confounding effect of the mating system. To follow the evolutionary fate of individual TE copies, we studied the divergence of the TE repertoires of two 23 closely related species, A. lyrata and A. halleri, which diverged less than 1 million years ago 24 (Roux et al. 2011) and thus remain phylogenetically close enough that TE insertions can be 25 tracked individually. We find that both genomes host an abundant population of recently 26 inserted TEs with almost identical insertion ages, although only a very small fraction are found 27 at orthologous positions, indicating a very rapid turnover of these sequences. The small fraction 28 of TE-derived sequences that is retained over the long run displays distinctive features, with 29 gene proximity an important factor favoring TE retention. We argue that while TE accumulation 30 in genomes has typically been studied in light of the dynamics of new insertions, their 31 propensity for long-term retention by resisting deletion is also an important factor. 32 33 34 35

2
Comparing and improving genome assemblies in outcrossing Arabidopsis species. To compare TE 3 repertoires in outcrossing Arabidopsis species, we used the high quality Sanger-based A. lyrata 4 genome assembly (Hu et al. 2011),and the recently published genome assembly of the Asian 5 subspecies A. halleri gemmifera (Briskine et al. 2017). To improve contiguity of the recent 6 genome assembly of A. halleri halleri (Karam et al. submitted), we produced additional Illumina 7 paired-end and mate pairs as well as PacBio sequencing reads (Table S1). We sequenced a total 8 of (i) 12,560,731,806 base pairs using Illumina sequencing (~48x coverage of the genome) and 9 (ii) 4,713,108,471 base pairs (~18x coverage) using PacBio sequencing with an average subread 10 size of 3,332bp. A new Illumina-based assembly was produced combining the new reads and the 11 reads of Karam et al. (submitted), and the PacBio long reads were used for scaffolding leading to 12 a substantial 3-fold decrease of the number of scaffolds (9,891 to 3,152) and a 5-fold increase of 13 the N50 (52kb to 279kb, Table1). Although the resulting assembly remains more fragmented 14 than the A. halleri gemmifera, A. lyrata and A. thaliana assemblies we used in this study both in 15 terms of the number of scaffolds and a lower N50 (Table 1), the fraction of coding sequences 16 was roughly comparable, with only slight variations in the proportion of genic non-CDS 17 sequences and shorter genic non-CDS sequences in A. thaliana (Fig. 1A), as noted previously (Hu 18 et al. 2011). Furthermore, the higher fragmentation of the assembly affected only slightly the 19 representation of the coding genes, since quantitative measures for the assessment of the 20 different assemblies using Busco (Simão et al. 2015) showed similar numbers with only 46 of 21 the 1,440 universally conserved plant genes missing from the A. halleri halleri assembly (3.2%) 22 vs. 17 in A. lyrata (1.2%, Table1).

24
Orthology map of genes in the assemblies. The orthology relationships between genes were 25 defined using inParanoid (Remm et al. 2001

25
Within the TE fraction, the relative proportion of the major superfamilies were roughly 26 comparable, with Gypsy, Copia, LINE, MuDR and Helitron as the five most abundant 27 superfamilies in all genomes, although their relative ranking varies (Fig. 1B). Hence, the higher 28 abundance of TEs in A. lyrata, A. halleri halleri and A. halleri gemmifera as compared to A. 29 thaliana is not due to just one TE family having expanded but rather to a more general process lyrata (Maumus and Quesneville 2014) are also observed in A. halleri halleri and A. halleri 1 gemmifera. Specifically, the distribution of percentage of identify was clearly bimodal with as 2 many as 21,160 (30.9% of the total number of TEs) and 35,010 (40.8%) TEs with over 90% 3 identity in A. halleri halleri and A. halleri gemmifera, respectively (Fig. 2). In the following, we 4 define "recent" vs "old" TEs in relation to this 90% threshold. The distribution profile in A. 5 halleri gemmifera is very similar to the one observed in A. lyrata (n=32,318 i.e. 36.9%) but the 6 peak of very similar TEs is less pronounced in A. halleri halleri, possibly due to differences in the 7 quality of the assembly for the most recent copies. In contrast, the number of TEs with over 90% 8 identity above was lower in A. thaliana (n=7,938 i.e. only 20.2% of the total number of TEs, Fig.   9 2), confirming the sharply different age distribution of TEs in this species (Maumus and 10 Quesneville 2014). Hence, the peak of putatively recent TEs observed in the A. lyrata genome is 11 also observed in A. halleri halleri and A. halleri gemmifera.

12
The different TE superfamilies differed in their contribution to the peaks of recent and ancient 13 TEs. The LINE superfamily, for instance, had very low contribution to the recent peak, while the 14 other four (Gypsy, Copia, MuDR and Helitron) had sometimes very sharp peaks of recent TEs 15 ( Fig. S1). Moreover, the peak of recent TEs is not caused by any single TE superfamily, but rather 16 corresponds to the recent activity of several TE superfamilies, and hence corresponds to a 17 general TE mobilization phenomenon. In order to evaluate the current dynamics of mobilization, very similar to that of the peak of recent TEs present in the assembly ( Fig. 3 and S1). Hence, TE 24 mobilization appears to be ongoing and the relative contribution of the different superfamilies 25 seems to have remained relatively stable in the recent past.

27
Low proportion of orthologous TEs in spite of recent divergence. Next, we sought to follow the fate 28 of individual TE copies between pairs of lineages in order to distinguish TEs that have been 29 either specifically inserted or deleted in one of the two lineages from TEs that have been 30 maintained at orthologous positions since the divergence between pairs. To define TE orthology 31 in a context where the compared assemblies exhibit different levels of contiguity, we used 32 stringent positional information determined by the identity of the pair of flanking genes with a 33 strict one-to-one orthology relationship between the two genomes compared. The presence of a 34 TE in the orthologous intergenic interval was then determined based on a relaxed Blast search 35 procedure. For TEs within genic sequences, we searched for the presence of a TE in the unambiguous ortholog, when it existed. In turn, to avoid spurious results due to multiple hits 1 that may arise because of the relaxed Blast criteria, we restricted the analysis to orthologous 2 intergenic segments shorter than 70kb and discarded TEs that are either on contigs with no 3 orthologous gene or on the extremity of contigs. Using this set of conditions, the intergenic 4 segments considered contained an average of 2.7 distinct TE sequences, thus enabling us to 5 cross-check their presence (orthology) and absence (non-orthology) in the two genomes with 6 substantial accuracy. In spite of the use of relaxed Blast parameters, our analysis identified only 7 8,394 orthologous TEs between A. halleri halleri and A. lyrata, representing a minority of the 8 TEs. Specifically, this number of orthologous TEs represents 28.8% of the 29,111 TEs in A. 9 halleri and 21.6% of the 38,919 interrogated TEs from A. lyrata (Fig. 4A). As expected, a higher 10 proportion of orthologous TEs was detected when comparing the very closely related A. halleri 11 halleri and A. halleri gemmifera (i.e. at the sub-species level, 51.6% and 38.3%; Fig. 4B).

12
The age distribution of orthologous TEs (as defined by the divergence to their consensus) was 13 strikingly different from that of non-orthologous TEs. As expected, orthologous TEs were almost 14 exclusively old with a very small fraction belonging to the population of recently inserted TEs, 15 whereas non-orthologous TEs were either anciently or recently inserted, with a relative 16 proportion of these two categories closely matching that of the overall genome (Fig. 4, Table 2).

17
Similar results were observed in the comparison between A. halleri gemmifera and A. lyrata 18 (Table S2). Given the time scales considered, recently inserted TEs may either have inserted 19 after the species became isolated, or have been present in the ancestor some time before the 20 split and have been removed in one of the two species. It is therefore impossible to 21 unambiguously reconstruct their evolutionary history.

23
Factors associated with long-term maintenance of ancient TEs. In contrast, reconstructing the 24 evolutionary history of ancient TEs (<90% sequence identity to their respective consensus, Fig   25   4) is relatively straightforward. We note that, by definition, this population of TE-derived 26 sequences has accumulated mutations since their insertion, and so corresponds to largely 27 degraded and likely inactive copies rather than full-length elements. Assuming identical rate of 28 divergence, "old" TEs had to be present in the ancestor species, so that their absence in one 29 genome can be readily interpreted as resulting from a deletion process. Based on this 30 assumption, we sought to identify the factors associated with long-term maintenance of 31 individual TEs. First, we compared the different superfamilies and found that old members from 32 the Helitron superfamily were preferentially maintained in the long-term relative to others, 33 since this class of TEs was more represented among orthologous (25.6%) than among non-34 orthologous TEs (11.6%, Fig. 5A). Conversely, old members of the LINE and Copia superfamilies 35 were enriched in the non-orthologous fraction (26.4% and 18.9% for LINE and Copia, 1 therefore more rapidly deleted. Second, we found that TEs that had been maintained at 2 orthologous positions tend to be on average 19.8% shorter than those that had been deleted 3 from one of the two genomes (322.8 vs. 401.6 bp, p<2.2e -16 ), but comparing the medians of the 4 two distribution showed the opposite pattern, suggesting that the difference is largely driven by 5 a limited set of large TEs that are only found in the non-orthologous fraction (Fig. 5B). Third, we 6 compared the location of orthologous vs. non-orthologous old TEs and observed that 7 orthologous TEs tended to be found more often within genes than non-orthologous TEs ( TEs within CDS and non-CDS sequences. We observed that old TEs located within CDS are more 14 likely to be retained at the orthologous state than TEs located in non-CDS sequences (Fig. 5D). In 15 fact, around 35.5% of orthologous TEs were found within CDS sequences, while they were only 16 11.8% for non-orthologous TEs (p<2.2e -16 ). In line with this observation, we also observed that 17 TEs outside genic regions tended to be retained more readily when located close to genes ( 19 halleri halleri and A. lyrata were located on average 1,829 bp away from their closest gene, while 20 those that have been retained either in A. halleri halleri or in A. lyrata only (and thus have been 21 deleted from the other lineage) were located on average at a distance of 2,303 bp, i.e. they were 22 located 26% farther (p<2.2e -16 ). Overall, these results suggest that TEs in gene-rich regions tend 23 to be protected from deletion, possibly because of the deleterious effects associatted with the 24 imprecise nature of the deletion process, which tend to remove flanking sequences as well.

25
We then sequenced small RNAs from the A. lyrata MN47 accession and compared the proportion 26 of old TEs with substantial siRNA production (>5 uniquely mapped reads per million reads and 27 covering at least 10% of the length of the TE) as a proxy for efficient targeting by the RdDM 28 pathway. As explained above, recent TEs were discarded from this analysis because of their 29 ambiguous evolutionary history. Old TEs located within genes were less often targeted by the 30 RdDM pathway than those outside of genes (p<2.2e -16 ) (Fig. 5F). For TEs located within genes, 31 we found that old TEs that have remained orthologous were less likely to be RdDM targets than 32 those that have been deleted since divergence, with 10.3% of non-orthologous TEs showing 33 active siRNA production vs. only 2.3% for orthologous TEs (p<2.2e -16 ). However, for TEs located 34 outside of genes we found no difference in siRNA production by orthologous vs. non-35 orthologous TEs (17.6% and 17.5% respectively, p=0.9315). Since the mechanism of TE silencing operates through DNA methylation, we further compared the level of methylation of 1 orthologous and non-orthologous TEs in A. lyrata, taking advantage of the bisulfite DNA 2 methylation data available for A. lyrata (Seymour et al. 2014). In accordance with the siRNA 3 mapping analysis, we found that orthologous TEs within genes were less methylated than any 4 other populations of TEs (p<2.2e -16 , Fig. 5G). Indeed, they presented a mean percentage of 5 methylation of only 4.0% compared to 22.8, 20.3, 32.4% for non-orthologous TEs within genes, 6 orthologous TEs outside of genes, and non-orthologous TEs outside of genes, respectively. These 7 results suggest that a low siRNA production and low DNA methylation levels are associated with 8 the long-term maintenance of old TEs within genes. In contrast, these two factors may not be 9 related to the long-term maintenance of TEs outside of genes. 10 Finally, we used RNA-seq data from the same A. halleri halleri accession to compare the 11 expression of genes containing old TE sequences that have been either retained or removed 12 since the separation of the two species. We reasoned that if TEs are deleterious on average, 13 removing them should be advantageous even in the face of the deleterious effect of local 14 deletions. If so, TE removal should occur more readily close to or within genes with high 15 expression than genes with low expression. Accordingly, we found that on average genes with 16 an orthologously-maintained TE in their DNA sequence were expressed at slightly lower levels 17 than the genes with a TE that has been removed from A. lyrata (p=0.01085, Fig. 5H). The same Overall, the population of TEs in the two A. halleri assemblies that we studied is very similar to 26 the one described in A. lyrata, both in terms of TE families present and in their age distribution.

27
As noted previously, these profiles are sharply distinct from that seen in A. thaliana (Hu et al.  (Richardson et al. 2015). In spite of the recent divergence between the two species 8 we find very few orthologous TEs between A. halleri and A. lyrata, even for the population of 9 "older" TE-related sequences that must have been present before speciation. Overall, even 10 though the dynamics of TE accumulation seems to be shared, the resulting TE fractions of the 11 two genomes are very different, indicating a rapid turnover of TE-related sequences. It will now 12 be essential to compare quantitatively the rate at which TEs transpose and get removed 13 between different species and how these rates are affected by various biological features. TEs

28
First, our analyses suggest that Helitron elements are more likely to maintained over the long-29 term in A. halleri and A. lyrata. As noted by Maumus and Quesneville (2014), helitrons tend to 30 have lower GC content than the other TE superfamilies, which may be associated with reduced 31 targeting by RDdM because less cytosines are available for methylation, hence leading to less 32 disruption of neighbouring gene expression. It is tempting to speculate that the lower GC 33 content of helitrons may make them less deleterious, allowing for their preferential long-term 34 maintenance. In contrast, LINE and Copia families are those that have been the most strongly 35 eliminated since the divergence of the two species. Mao and Wang (2017) recently observed that in grass, SINE families were retained over the long term. Like in grass, SINEs have low 1 abundance in the A. halleri and A. lyrata genomes, since they cover less than 3% of the repeat 2 sequences. However, in these species they do not seem to be associated with a long-term 3 maintenance, as they are equally represented amongst old TEs that have been maintained at 4 orthologous positions and amongst those that have been deleted from one of the two genomes 5 (Fig. 4). Hence, the long-term maintenance of particular TE families seems to be lineage-specific 6 and cannot easily be generalized.

8
Sheltering of TEs by proximity to genes.

9
A striking observation is the long-term retention of TE sequences in gene-rich regions. As we 10 focused on the population of "old" TEs that had to be present in the most recent common 11 ancestor, this pattern is unlikely to be caused by an insertion bias of recent specific insertions 12 towards genic regions and rather reflects a process of differential retention. Alternatively, this 13 pattern may also be caused by gene-rich regions being better assembled, resulting in TEs in 14 those regions being more readily found in the different assemblies. This effect is likely minor 15 because our analysis focuses on old TEs, which should be relatively less problematic in terms of 16 assembly because they tend to be less identical across copies. Also, in this case the most poorly 17 assembled genomes should show less non-orthologous TEs, while here the reciprocal analyses

21
The relative enrichment of orthologous TEs in genic sequences as compared to non-orthologous 22 TEs is consistent with the interpretation that TEs within gene-rich regions benefit from a 23 "sheltering" effect, whereby a deletion of the TE sequence involves the risk of also deleting part 24 of the gene sequence, which would be highly deleterious in particular when they have become 25 integrated within coding sequences. Hence, the effective rate of deletion might be higher for 26 non-genic TEs than for genic TEs, resulting in a long-term enrichment of the sheltered genic TEs.

27
This process of differential retention was less pronounced when comparing the more closely 28 related A. halleri halleri vs. A. halleri gemmifera species, indicating that such differential 29 enrichment is a relatively slow process. In grass, Mao and Wang (2017) showed that members of 30 the SINE TE family are often shared across species and are also enriched in and near protein 31 coding genes, possibly as a result of differential removal of SINE copies in gene-poor regions. 32 33 TE deletion: a cure worse than the disease?

34
Our results suggest that several factors can affect the long-term retention of transposable 35 element sequences, and in particular the proximity to highly expressed genes. We propose that the process of differential TE retention results from the balance between the deleterious effects 1 of the TE itself and that of the deletion removing it. While the presence of TE sequences was 2 shown to equally frequently increase or decrease gene expression (Stuart et al. 2016)  Earlier studies have shown that the rate of DNA loss can be highly heterogeneous across genomes (Petrov et al. 2000). It is possible that the level of sequence identity among repeated 11 sequences may contribute to this variation, as more identical sequences are more likely to be 12 involved in the heterologous recombination that is believed to be responsible for DNA deletions.

13
If so, the most recently inserted TEs would be expected to show an even faster elimination, as 14 proposed by Maumus and Quesneville (2016). This might also contribute to decrease the 15 proportion of young orthologous TEs. Beside the fact that they might have been inserted after 16 the species divergence (but as we explained, precisely dating these events is challenging), they 17 might be eliminated even more rapidly than old TEs that recombine less easily. Hu et al.   TEs, in genomic sequences (Fig. S3) we selected only TEs located between two genes of this orthology map (called "framed" TEs, or 32 TEs located within a genic sequence ("inserted" TEs). For each "framed" TE in one species, a 33 blast search was performed between the TE sequence and the genomic sequence between the 34 same pair of orthologous genes in the other species. We restricted this analysis to chromosomal 35 segments of at most 70 kb (from either the subject or the query genome). Similarly, for "inserted" TEs, the TE sequence was compared with the orthologous gene sequence. Both 1 "framed" and "inserted" TEs presenting a blast hit with an E-value ≤1E -10 , an identity ≥ 80% and 2 at least some overlap with a TE annotation were defined as orthologous. The other TEs, which 3 did not fulfill these three criteria, were defined as non-orthologous. These criteria correspond to 4 a relatively relaxed search and should result in a strong power to detect orthologous TEs, 5 resulting in a conservative analysis. 6 7 TE analyses 8 TEs were separated into "young" and "old" classes according to whether they reached the cut-off 9 of 90% identities with their cluster consensus sequence, or not. Differences in the size and 10 distance to nearest gene were tested using a non-parametric Mann-Withney test. Differences in 11 the proportion of genic vs. non-genic TEs and in the proportions of CDS vs non-CDS TEs were 12 tested using a χ 2 test with 1 and 2 degrees of freedom, respectively.

14
Identification of segregating non-reference (neo)  The first step consists in defining the orthology of genes (blue squares) between genomes X and 22 Y using Inparanoid. In our example, A/A', C/C' and E/E' are considered as orthologous pair of 23 genes (represented by blue arrows). The orthology of TEs is defined sequentially for genome X 24 and Y but the process are similar: only TEs between two orthologous genes spaced for at most 25 70kb (black squares named a and b in our example) (named "Framed") and TEs located within 26 genes (black square c) (named "Inserted") are analysed. TEs which are located at an extremity of 27 a scaffold (d and d') and TEs located on scaffold without orthologous genes are discarded. The 28 sequence of the TE a and b, which are located between A and C genes of the orthology map are 29 compared using Blastn (thresholds: Evalue ≤1E -10 , an identity ≥ 80%) to the sequence between 30 the orthologous genes of A and C, i.e. A' and C'. The TE a presents a blast hit, and a TE annotation 31 overlaps the Blast hit in genome Y. Hence a and a' are defined as orthologous. No-significant 32 blast hit is retrieved for b, which is defined as non-orthologous. The sequence of the TE c located 33 within the E gene is compared to the sequence of the E' gene. In our example, we considered that 34 the Blast hit is significant and overlaps a TE annotation in Genome Y. The TE c is defined as 35 orthologous.       Ahallerigemmifera