The structure, organization and radiation of Sadhu non-long terminal repeat retroelements in Arabidopsis species

Background Sadhu elements are non-autonomous retroposons first recognized in Arabidopsis thaliana. There is a wide degree of divergence among different elements, suggesting that these sequences are ancient in origin. Here we report the results of several lines of investigation into the genomic organization and evolutionary history of this element family. Results We present a classification scheme for Sadhu elements in A. thaliana, describing derivative elements related to the full-length elements we reported previously. We characterized Sadhu5 elements in a set of A. thaliana strains in order to trace the history of radiation in this subfamily. Sequences surrounding the target sites of different Sadhu insertions are consistent with mobilization by LINE retroelements. Finally, we identified Sadhu elements grouping into distinct subfamilies in two related species, Arabidopsis arenosa and Arabidopsis lyrata. Conclusions Our analyses suggest that the Sadhu retroelement family has undergone target primed reverse transcription-driven retrotransposition during the divergence of different A. thaliana strains. In addition, Sadhu elements can be found at moderate copy number in three distinct Arabidopsis species, indicating that the evolutionary history of these sequences can be traced back at least several millions of years.


Background
We previously reported a novel family of Arabidopsis retroposons, Sadhu [1]. The typical Sadhu element contains a poly(A) tract and is flanked by a direct 7 to 16 base pair (bp) target site duplication (TSD). Similar to small interspersed nuclear elements (SINEs), Sadhu elements are non-protein coding and do not contain long terminal repeats (LTRs); they are therefore expected to be non-autonomous. Although plant SINEs are thought to be mobilized by autonomous long interspersed nuclear elements (LINEs), the source of the transposase for Sadhu is not clear.
Structurally, Sadhu elements resemble SINEs (noncoding, poly(A) tract), but unlike known SINEs, they share no sequence similarity with known non-coding RNAs (for example, 5S rRNA, tRNA) [2]. Nor do Sadhu elements carry conserved sequences similar to RNA polymerase II TATA boxes or RNA polymerase III promoter motifs (for example, A and B boxes). However, Sadhu elements share a motif near the 5' end (consensus 5' CAATCGTTSC 3') and an approximately 20 bp polypyrimidine region that we hypothesize might attract GAGA-repeat binding transcription factors [3][4][5]. Sadhu elements in different Arabidopsis thaliana accessions are expressed, often at high levels. Sense transcription begins at or near the start of the element [6], consistent with the hypothesis that these elements carry their own internal promoter sequences. Expression can also occur in the antisense direction, presumably from promoters in the flanking DNA sequence. Whether sense or antisense, transcription of Sadhu elements is epigenetically regulated; silenced elements are associated with cytosine methylation and packaged in chromatin containing the dimethylated isoform of lysine 9 of histone H3 [1,6]. The modes of silencing vary among Sadhu family members, as highlighted by differential susceptibility to epigenetic modifier mutations and distinct cytosine methylation profiles. These findings suggest that Sadhu elements are silenced independently and individually, not coordinately [6]. For these diverse reasons, Sadhu represents a unique family of non-LTR retroelements.
Related families of the same transposable element class can often be detected by sequence similarity in widely divergent species (see for example, [7,8]). Sadhu elements within A. thaliana are highly divergent in terms of nucleotide sequence, with an average pairwise identity of less than 75%, suggestive of an ancient origin. However, these sequences cannot be identified in any of the current public genome databases outside of the Brassicaceae. There are only 39 Sadhu-related sequences in the A. thaliana genome, showing a dispersed distribution pattern across all five chromosomes. This moderate copy number is typical of Arabidopsis non-LTR retroelements: there are approximately 130 SINE elements in the A. thaliana reference genome and less than 1,500 LINEs [9]. The relatively low copy number of non-LTR retroelements in A. thaliana suggests that the transposition rate of these elements is low and/or that new insertions have been effectively removed during the evolutionary history of the species.
Here, we describe a classification scheme for this retroelement family. In addition, we investigate the organization and radiation of Sadhu sequences both in different A. thaliana accessions and related Arabidopsis species.

Classification of Sadhu elements
We designed a classification scheme for Sadhu elements reflecting the phylogenetic grouping of these elements into 10 distinct subfamilies in the A. thaliana genome (Table 1, Figure 1, Additional file 1) [1]. Table 1 lists the new nomenclature side by side with locus ID numbers (for full-length elements) or locus position (for partial elements). Sadhu elements that extend from the 5' conserved motif 5' CAATCGTTSC 3' to a 3' poly(A) tract approximately 900 bp downstream have been designated 'full length'. Full-length elements on the same branch of the phylogeny share a family name (Sadhu#), but have different element names (SadhuX-#). Elements that closely align (>75% identity) to a unique full-length element are designated 'd' indicating derived; for example, Sadhu5-1d1 is likely to be derived from Sadhu5-1. Sadhu-related sequences that are not similar to a unique full-length element are assigned to the nearest full-length element on a pairwise BLAST search with the designation 'L' for 'like' (for example, Sadhu3L). See Additional file 1 for divergence matrices among elements within different subfamilies and among subfamilies.
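The naming rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `designate`, the `serial` argument and the percent-identity values are hypothetical, and "unique" is interpreted here as "a single best hit with no tie", which may not match exactly how the assignments in Table 1 were made.

```python
def designate(hits, serial, threshold=75.0):
    """Suggest a name for a Sadhu-related sequence.

    hits: dict mapping full-length element name (e.g. "Sadhu5-1")
          to percent identity with the query (illustrative values).
    serial: running number appended to the label (assumption).
    threshold: identity cutoff for a 'd' (derived) designation.
    """
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_pid = ranked[0]
    # Assumption: "unique" means the top hit is not tied with another element.
    tied = len(ranked) > 1 and ranked[1][1] == best_pid
    if best_pid > threshold and not tied:
        return f"{best_name}d{serial}"      # derived from one full-length element
    family = best_name.split("-")[0]        # "Sadhu3-1" -> "Sadhu3"
    return f"{family}L{serial}"             # 'like' the nearest subfamily


print(designate({"Sadhu5-1": 96.0, "Sadhu5-2": 83.0}, 1))  # Sadhu5-1d1
print(designate({"Sadhu3-1": 70.0, "Sadhu3-2": 68.0}, 1))  # Sadhu3L1
```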

Partial Sadhu elements
The Sadhu2, Sadhu3, Sadhu4, Sadhu5, and Sadhu6 subfamilies feature derivative sequences that are greater than 80% identical to a particular full-length element (Figure 2, Table 1, Additional file 1). Many of the partial element sequences are 5' truncated: that is, the region of similarity shared with the most closely related full-length element does not extend to the 5' end, but contains remnants of 3' poly(A) tracts (recognizably A-rich regions) and, in some cases, flanking direct repeats that represent TSDs. This pattern is consistent with abortive retrotransposition. Other partial sequences align to internal sections of full-length elements. In the case of Sadhu2-1d, a 3' poly(A) tract is detectable, but is preceded by a stretch of DNA sequence (19 bp) that does not align to the prospective progenitor Sadhu element (Figure 2c; Sadhu7L1 and Sadhu10L3 also have this structure). This type of chimeric retrotransposon structure can result from template switching during retrotransposition [10,11]. In contrast, the Sadhu8L3 derivative terminates in a poly(A) tract at a position earlier than its closest full-length element (Figure 2e). This structure might arise from abortive transcription and early polyadenylation of the precursor sequence or through subsequent internal deletion of the element. If partial elements arose by segmental duplication, we would expect to see DNA sequence similarity extending beyond the Sadhu-related sequence. However, none of the Sadhu elements in the Columbia (Col) reference genome shares significant sequence similarity in flanking genomic regions with their derivative elements. Therefore, it is more likely that the partial elements are remnants of ancestral retrotransposition followed by template switching, deletion and/or divergence.

Radiation of the Sadhu5 subfamily in A. thaliana
A comparison of the genome sequences of two Arabidopsis strains, Col and Ler, revealed over 150 indels caused by differential activity of transposable elements between the strains [12]. We previously reported that several Sadhu elements from different subfamilies are also polymorphic in terms of presence/absence among different Arabidopsis strains [1,6]. Below, we examine closely related elements from a single subfamily in a set of 24 A. thaliana strains in order to trace the retrotranspositional history of these elements. The Sadhu5 subfamily contains four elements that are all greater than 80% identical to one another in the Col reference genome and close to full-length or full-length (>600 bp) (Figure 2a). Sadhu5-1 and Sadhu5-2 are 83% identical to one another, while the two derivative elements, Sadhu5-1d1 and Sadhu5-1d2, are greater than 95% identical to Sadhu5-1. This family therefore represents a closely related group of sequences that might have expanded during the recent evolutionary history of the species. We began by examining the Sadhu5-2 element. A polymerase chain reaction (PCR) product corresponding to an internal region of this element was present in every strain examined (Table 2). We investigated whether Sadhu5-2 elements in different strains were present in the same genomic location: using an outward facing forward primer in the element and reverse primers designed based on the Col reference genome 5' and 3' adjacent sequence, we attempted to amplify PCR products spanning the flanks of the elements. In every case, we were successful in amplifying products of the expected size (Table 2). Therefore, it is likely that Sadhu5-2 represents a single insertion event in the ancestor of the A. thaliana lineage.
In contrast to our finding for Sadhu5-2, we were unable to amplify PCR products from several strains using primers specific to the Sadhu5-1, Sadhu5-1d1 or Sadhu5-1d2 insertion sites in the Col strain (Table 2). To investigate the structure of putative deletions or 'empty' sites for these elements, we amplified PCR products from these strains using primers located 5' and 3' of the element in the Col reference genome. We identified 2 strains for Sadhu5-1 and 17 strains for Sadhu5-1d1 that amplified a specific PCR product shorter than would be predicted from the reference genome. We obtained DNA sequence for these PCR products: in every case, there was a clean retrotransposition 'empty site', with a single, identical copy of the target site duplication of the element in strain Col (Figure 3). The structure of the 'empty' versus the 'filled' sites is typical of retroelements that undergo target primed reverse transcription (TPRT) [13]. The Col strain carries the most common haplotype for the region surrounding the Sadhu5-1d1 insertion (Figure 3). Therefore, the most parsimonious explanation is that the element inserted relatively recently in the history of these strains, after the divergence of different haplotypes in this region.
Figure legend fragment: See Table 1 for gene ID numbers corresponding to Sadhu numbers. Bootstrap values (percentages) were calculated from 500 bootstrap replicates.
The identification of clean presence/absence polymorphisms among Arabidopsis strains also lends support to the model that Sadhu5-1 and Sadhu5-1d1 are relatively recent retrotransposition events. In contrast, we could not find polymorphic insertion sites for Sadhu5-1d2 and Sadhu5-2, suggesting that these elements represent older, ancestral insertion events.
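The relationship between a 'filled' site and its 'empty' counterpart can be illustrated with a short Python sketch. The sequences below are invented for illustration and are not taken from Figure 3; the function names and the exact-match requirement for the two TSD copies are assumptions. Real TSDs may have diverged and would need a mismatch-tolerant comparison.

```python
def find_tsd(seq, start, end, min_len=7, max_len=16):
    """Return the longest exact direct repeat immediately flanking
    the element at seq[start:end], or None if none is found."""
    for k in range(max_len, min_len - 1, -1):
        if start - k < 0 or end + k > len(seq):
            continue
        left = seq[start - k:start]      # copy upstream of the element
        right = seq[end:end + k]         # copy downstream of the element
        if left == right:
            return left
    return None


def predicted_empty_site(seq, start, end, **kwargs):
    """Remove the element and one TSD copy, leaving a single TSD copy."""
    tsd = find_tsd(seq, start, end, **kwargs)
    if tsd is None:
        return None
    return seq[:start] + seq[end + len(tsd):]


# Toy example (hypothetical sequences).
flank_5 = "GGCCTTAG"
flank_3 = "CCGGATTA"
tsd = "AAAATGCATC"
element = "CAATCGTTCC" + "CTCTCTCTCT" + "GATTACA" + "A" * 12
filled = flank_5 + tsd + element + tsd + flank_3
start = len(flank_5) + len(tsd)
end = start + len(element)

print(find_tsd(filled, start, end))
print(predicted_empty_site(filled, start, end))
```

Comparing the predicted empty site with the sequence amplified from a strain lacking the element is the kind of check that distinguishes a clean retrotransposition polymorphism from a deletion.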
Sadhu5-2 appears to be a truncated retrotransposition product relative to Sadhu5-1, as it is missing sequence that would align with the 5' portion of Sadhu5-1 ( Figure 2a). Therefore, while the Sadhu5-2 sequence itself appears more prevalent than Sadhu5-1, the latter element could not be derived by retrotransposition or gene duplication from the former without invoking a subsequent deletion of the 5' region of the element, which is unlikely given that the same structure appears to exist in all strains based on PCR of the flanking regions (Table 2). An alternate hypothesis is that the full-length ancestor to this subfamily has been deleted or lost from the A. thaliana Col reference strain.

Target site consensus
Table 2 Distribution of Sadhu5 subfamily members in natural strains. Column headings: Accession number; Stock number. Empty cells signify no PCR product amplified with the corresponding primers. 3' = PCR product with one primer located in the 3' flank and the other in the element; 5' = PCR product with one primer located in the 5' flank and the other in the element; ES = negative for int PCR, but empty site amplified with 5' and 3' flanking primers; int = internal PCR product, both primers located within the element; PCR = polymerase chain reaction; X* = PCR product with more distal but not with more proximal primers.

TSDs are typical of most transposable elements. Non-LTR retroelements mobilized by the LINE enzymatic machinery feature TSDs of 7 to 20 bp in length. These TSDs result from the target primed reverse transcription mechanism, where two staggered cuts are made on the target strand [13]. In mammals, the consensus for the LINE 5' endonuclease cleavage site contains two thymines, whereas the duplicated target site often starts with a string of four adenines [14][15][16]. This string of adenines (thymines on the opposite strand) within the target site is hypothesized to act in priming reverse transcription from the poly(A) tail of the LINE transcript. SINEs, which are mobilized by hijacking of the LINE machinery [17], have a target site preference similar to that of LINEs. While plant LINEs are predicted to move in a similar manner to mammalian LINEs, the consensus site has not yet been studied in a comprehensive manner. However, a study of Arabidopsis SINEs indicated a consensus sequence similar to that of mammalian LINEs: a string of adenines within the target site duplication, as well as a thymine at the 3' nicking site [18]. A total of 14 Sadhu sequences containing target site duplications of between 7 and 16 bp were identified in the A. thaliana genome (Table 3). We examined the region around these target sites to determine whether 5' and 3' nicking site consensus patterns could be identified and, if so, whether they resembled patterns previously reported for LINEs and SINEs. As shown in Figure 4, the 5' nicking site does appear to favor a thymine (preceded by adenines), while the target site duplication also begins with a stretch of adenines. There is no strong consensus at the 3' nicking site. These data are consistent with a model in which Sadhu elements, similar to SINEs, are mobilized by the LINE-encoded target primed reverse transcription machinery.
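A nicking-site consensus of this kind is derived from per-position base frequencies across aligned target-site windows. The Python sketch below shows that calculation on invented windows; the sequences are not those in Table 3, the 50% cutoff is an arbitrary assumption, and a sequence logo additionally scales each column by information content, which is not reproduced here.

```python
from collections import Counter


def position_frequencies(windows):
    """Per-position base frequencies for equal-length aligned windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(w[i].upper() for w in windows)
        total = sum(counts.values())
        freqs.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return freqs


def consensus(freqs, min_fraction=0.5):
    """Most frequent base per column, or 'N' below the cutoff (assumption)."""
    out = []
    for column in freqs:
        base, fraction = max(column.items(), key=lambda kv: kv[1])
        out.append(base if fraction >= min_fraction else "N")
    return "".join(out)


# Toy windows centred on a hypothetical 5' nicking site.
toy_windows = ["AATAAAAG", "AATAAAAC", "TATAAATA", "AATAAGAT"]
toy_freqs = position_frequencies(toy_windows)
print(consensus(toy_freqs))  # AATAAAAN
```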
An examination of the A. thaliana Col reference genome [9] reveals fewer than 1,500 LINE superfamily-related elements spanning 12 different lineages, including the LINE1, LINE2, TA11 and TA12 families [19][20][21]. However, fewer than 50 LINEs in the A. thaliana reference genome are greater than 5,000 bp in length, and almost none contain intact open reading frames. Therefore, while it is evident that Sadhu elements have been mobile during the divergence of different Arabidopsis strains, their low copy number might be a consequence of the sheer rarity of active autonomous LINE driver elements.

Sadhu elements can be identified in taxa outside of A. thaliana
In order to explore the evolutionary distribution of the Sadhu sequence family, we sought to identify Sadhu homologs in two related species of the Brassicaceae family, A. arenosa and A. lyrata. These species are estimated to have diverged from A. thaliana approximately 5 million years ago. The genomes of the three species have changed significantly in that interval: Arabidopsis arenosa and Arabidopsis lyrata maintain the ancestral complement of eight chromosomes, while A. thaliana has condensed its chromosome number to five [22,23]. Molecular evolutionary studies have determined that the average sequence divergence at silent sites between A. thaliana and A. arenosa or A. lyrata is 12% to 15% [22]. We attempted to isolate Sadhu elements from A. arenosa. DNA sequence was obtained from specific PCR products that were generated using A. arenosa genomic templates and primers corresponding to the A. thaliana elements Sadhu5-1, Sadhu1-3, Sadhu3-1, and Sadhu8-1 (Table 4; Additional file 2). In a phylogenetic analysis, the A. arenosa Sadhu sequences that we obtained cluster within the previously defined subfamilies (Figure 5a).
We conducted TAIL PCR using A. arenosa genomic templates to identify more complete sequences for the Sadhu elements identified by PCR. Three 5' and four 3' flanking sequences homologous to Sadhu1 were amplified and cloned from A. arenosa genomic DNA template (Table 4, Additional file 3). Several of the 3' Sadhu1 portions were >95% identical to one another, indicative of recent retrotransposition in this subfamily. Two 5' flanking clones (AaSadhu1FP3 and AlSadhu1FP1) shared a stretch of 150 bp of sequence that does not correspond to known Sadhu1 sequence in A. thaliana. This extra sequence may have been transduced by the Sadhu element resulting in a chimeric retroposon. Both 3' and 5' flanking sequences were obtained by TAIL PCR corresponding to A. arenosa Sadhu3 (Table 4 and Additional file 3). Because these sequences could not be joined by PCR, there are likely to be at least two members of this subfamily in A. arenosa. Sadhu5 TAIL PCR sequences isolated from A. arenosa were 85% to 88% identical to A. thaliana Sadhu5 subfamily members (5' and 3' portions) (Table 4 and Additional file 3). 5' and 3' sequences were also obtained corresponding to Sadhu8 subfamily members from A. arenosa (Table 4 and Additional file 3). These sequences were greater than 90% identical to one another and 75% to 79% identical to A. thaliana Sadhu8-1, indicating that retrotransposition occurred more recently than the divergence of the two species. In summary, A. arenosa contains several members of at least four Sadhu subfamilies. Examination of sequences flanking the Sadhu elements suggests that these elements are located in non-orthologous positions in A. arenosa relative to A. thaliana (Additional file 3).
A. lyrata Sadhu elements were identified from iterative BLAST searches of the recent A. lyrata genome sequence assembly (JGI V. 1.0; Joint Genome Institute, Walnut Creek, CA, USA). We used A. thaliana full-length Sadhu sequences as queries in a primary search to identify a set of A. lyrata sequences, which were subsequently used as queries in secondary searches. This method is expected to identify all full-length or near full-length sequences, although shorter Sadhu-related partial elements might have been overlooked. In total, we found 21 full-length and 4 partial Sadhu elements greater than 350 bp in length (Table 5, Additional file 4). The number of full-length elements (21) is similar to that in A. thaliana (16), indicating that the element family is relatively small in both species. Full-length A. lyrata elements are structurally similar to Sadhu elements in A. thaliana: they begin with a conserved motif (5' CAATCGTTSC 3' followed by a polypyrimidine patch) and terminate approximately 900 bp downstream in a poly(A) tract. Of the 21 full-length elements, 15 feature direct target site duplications of between 8 and 18 bp in length, suggesting that they originated via retrotransposition. There are no discernible conserved open reading frames. None of the elements appear in orthologous locations to A. thaliana elements, indicating that Sadhu elements have mobilized considerably since the divergence of the two species, and that related elements are similar through retrotransposition and not through direct inheritance of the genomic region. A. lyrata elements are between 71% and 86% identical to the most similar A. thaliana element (Table 5). Figure 5b shows a phylogenetic tree showing the relationships among the 25 A. lyrata and 16 full-length A. thaliana elements. All A. lyrata elements clustered within previously defined subfamilies, indicating that the divergence of the different subfamilies predated the split of these two species.
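The primary-then-secondary search strategy amounts to repeating the search with each new hit as a query until no new sequences turn up. A minimal Python sketch of that loop follows. It does not call BLAST: the `search` argument stands in for whatever similarity search is used, and the sequence identifiers in the toy lookup table are hypothetical.

```python
def iterative_search(seeds, search):
    """Collect hits by re-querying with every new hit until the set
    of hits stops growing (a self-referencing set).

    seeds: initial query identifiers (e.g. A. thaliana elements).
    search: callable taking a query identifier and returning an
            iterable of hit identifiers (placeholder for a real search).
    """
    found = set()
    queried = set()
    queue = list(seeds)
    while queue:
        query = queue.pop()
        if query in queried:
            continue
        queried.add(query)
        for hit in search(query):
            if hit not in found:
                found.add(hit)
                queue.append(hit)   # new hits become secondary queries
    return found


# Toy lookup table standing in for search results (hypothetical names).
toy_hits = {
    "At_query1": ["Al_a", "Al_b"],
    "Al_a": ["Al_a", "Al_c"],
    "Al_b": ["Al_a", "Al_b"],
    "Al_c": ["Al_c"],
}
result = iterative_search(["At_query1"], lambda q: toy_hits.get(q, []))
print(sorted(result))  # ['Al_a', 'Al_b', 'Al_c']
```

Tracking which identifiers have already been used as queries guarantees termination, since each sequence is searched at most once.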
Most of the Sadhu subfamilies previously identified in A. thaliana have representatives in A. lyrata; however, there is a dramatic expansion of elements within certain subfamilies relative to others (Figure 5b, Table 5). For instance, the Sadhu1 subfamily contains three members in A. thaliana but has expanded to seven full-length members in A. lyrata. The Sadhu8 and Sadhu6 subfamilies are represented by only a single member in A. thaliana, but feature six and three full-length elements, respectively, in A. lyrata. These genome comparisons suggest that, while multiple distinct Sadhu subfamilies have been active since the divergence of these two taxa, different subfamilies have proliferated more in certain species than in others. Alternatively, certain subfamilies may have been pared down by deletion and elimination in one species relative to the other.

Perspective
We have identified Sadhu sequences corresponding to multiple subfamilies in the related species A. lyrata and A. arenosa. The presence of target site duplications and poly(A) tracts, along with the absence of orthologous sites, strongly suggests that Sadhu elements in these other taxa arose via retrotransposition. In a few cases, elements within a given species are greater than 95% identical to one another, indicating that these sequences have mobilized more recently than the divergence of the different species. The partial sequence available for the Brassica genome [24] does not contain Sadhu-related sequences. While these sequences may have been lost from some taxa, the high degree of divergence amongst elements in the Arabidopsis genus strongly suggests an ancient origin for these elements. Therefore, we predict that some sequences related to Sadhu elements might be present in other plants, perhaps even those quite distantly related to Arabidopsis. These presumably more divergent Sadhu relatives might share little overall primary nucleotide sequence with the A. thaliana elements, but might have maintained other recognizable diagnostic features, such as length, conserved 5' motif(s), a 3' poly(A) tract, and target site duplications. Low copy number and high divergence among element subfamilies are not unique to Sadhu elements. Indeed, because only 10% of the Arabidopsis genome is composed of transposable elements [25], lower than other sequenced plant genomes, there may be a general tendency for genome size reduction in this species through progressive loss of repetitive DNA. A comparison of the A. thaliana genome with the five times larger Brassica oleracea genome revealed that while most element families were present in both species, some (for example, CACTA elements) had contributed more than others to the relative expansion of the Brassica genome [21].
As with the different Sadhu subfamilies, different SINE non-LTR subfamilies appear to be more active in each of the two species [26]. The lack of orthologous Sadhu insertion sites among different Arabidopsis species is also reminiscent of the case with SINEs, which similarly featured no shared sites in B. oleracea [26]. Both types of non-LTR elements are therefore subject to frequent loss over evolutionary time. This susceptibility may be a consequence of the dispersed pattern of localization of Sadhus and SINEs: elements that target heterochromatic regions, such as Athila LTR elements, appear to be relatively protected from this winnowing process [27].
Although retroelement superfamilies can typically be found in widely differing plant taxa [8], certain families show longer phylogenetic branch lengths and low copy numbers more similar to the case with Sadhu. In particular, copia/Ty1 families in Arabidopsis are highly divergent from one another [19,28,29,30]. Non-LTR TA elements are also present in few copies per genome from distinct, evolutionarily ancient lineages [20]. This high divergence among element subfamilies and lack of orthologous sites in related species stands in stark contrast to primate non-LTR elements: L1s and Alus crowd mammalian genomes, with both currently active lineages as well as many defunct ancestral sites shared among humans and their most recent relatives (for example, [31][32][33]). Therefore, while the evolutionary trajectory of Sadhu elements is not dramatically different from that exhibited by some plant retroelements, it is unlike many more well-studied elements.

Conclusions
Sadhu elements represent a previously little characterized retrotransposon family. We have generated a comprehensive classification scheme for these sequences based on phylogenetic analysis. Partial elements often contain 3' poly(A) tracts and target site duplications, consistent with an origin by target primed reverse transcription-driven retrotransposition. An examination of the Sadhu5 subfamily among different A. thaliana strains indicates that subfamily members arose through retrotransposition; the presence of polymorphic insertion sites provides evidence for retrotransposition in the recent history of the species. In addition, sequences at the target site are similar to the Arabidopsis SINE consensus, consistent with the hypothesis that the LINE machinery is responsible for the mobilization of both of these types of elements. Sadhu-related sequences identified in A. lyrata and A. arenosa cluster within specific A. thaliana subfamilies, indicating that the radiation of this element family preceded the divergence of the Arabidopsis genus. These A. lyrata and A. arenosa elements often contain poly(A) tracts and target site duplications, consistent with the model that these sequences also arose via retrotransposition. Taken together, these studies indicate that Sadhu elements have been active since the divergence of different Arabidopsis species, and through the differentiation of different A. thaliana strains. Further research is warranted to resolve the molecular origin and potential impact of this unique class of DNA sequence on genome structure and organization.

Plant materials
A. thaliana strains were obtained from the Arabidopsis Biological Resource Center (ABRC, Columbus, OH, USA). Stock numbers are listed in Table 2. A. arenosa seeds were obtained from Craig Pikaard (Department of Biology, Indiana University, Bloomington, IN, USA). Plants were grown on soil or on 1 × MS media with 1% sucrose. DNA was isolated using previously described methods [34].

Molecular biology
PCR was performed using standard conditions with Taq DNA polymerase (QIAGEN, Valencia, CA, USA) or KT1 polymerase (Clontech, Mountain View, CA, USA). Two rounds of TAIL PCR were performed on A. arenosa template using protocols and degenerate AD primers described previously [35]. Products from the second round of TAIL PCR were isolated from agarose gel and TA cloned into pGEM-T Easy (Promega, Madison, WI, USA) before sequencing. All other PCR products were directly sequenced without an additional cloning step following purification through Performa DTR gel filtration cartridges (Edge BioSystems, Gaithersburg, MD, USA). DNA sequencing was performed using Big Dye Terminator Cycle Sequencing (PerkinElmer, Waltham, MA, USA) protocols/reagents; sequences were processed at the Washington University Department of Biology sequencing facility. PCR primers used to generate the data in Tables 2 and 4 are described in Additional file 2. 'Internal' PCR primers were used to amplify sequence from different A. thaliana strains and to amplify homologs from A. arenosa. All sequences in this study have been deposited in the National Center for Biotechnology Information (NCBI) database. Genbank accession numbers are listed in Table 3 (for A. arenosa sequences) and in the legend to Figure 3 (for A. thaliana strain specific sequences).

Computational analysis
Full-length and partial Sadhu elements were identified based on sequence similarity to At2g01410 as previously described [1]. The maximum parsimony and neighbor joining trees in Figures 1 and 5 were generated using the software PAUP* V. 4.0 (Sinauer Associates, Sunderland, MA, USA) based on a ClustalX alignment [36]. Divergence matrices in Additional file 1 were generated based on a ClustalX alignment using the European Molecular Biology Open Software Suite (EMBOSS) program 'distmat' [37] run without corrections. Consensus sequences of different subfamilies were generated from full-length and derivative sequences using the EMBOSS program 'cons' [37]. Alignments in Figure 3 were visualized by ClustalX [36]. WebLogo [38] was used to create the logo images in Figure 4 that describe the retrotransposition target consensus sites. Annotations of features within TAIL PCR products in Additional file 3 were aided by the repeat masker feature on the Censor server [39] and the TAIR WU-BLAST server [40]. A. lyrata sequence information was obtained using the database, browser, and BLAST tools at the Joint Genome Institute (JGI) [41]. A. lyrata Sadhu elements were identified by iterative BLAST searches of the JGI assembly using, initially, A. thaliana and then A. lyrata Sadhu sequences as queries until a self-referencing set of sequences was identified. The classification scheme in Table 1 and locus ID and nucleotide positions for full-length elements have been submitted to both The Arabidopsis Information Resource (TAIR) [9] as well as the repeat database at the Genetic Information Research Institute (GIRI) [42].
Figure legend fragment: The scale is indicated. Internal polymerase chain reaction (PCR) sequences used specific primers based on the Arabidopsis thaliana sequence, while 5' and 3' sequences were obtained by thermal asymmetric interlaced (TAIL) PCR (see Table 4 for details).
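An uncorrected divergence value of the kind reported in the divergence matrices is, in essence, the percentage of compared alignment columns at which two sequences differ. The Python sketch below computes such a value from aligned sequences. It is a simplified illustration, not a reimplementation of 'distmat': here gapped columns are simply skipped, which is an assumption and may differ from how the EMBOSS program weights gaps. The toy sequences are invented.

```python
def uncorrected_distance(a, b):
    """Percent mismatches over ungapped columns of two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue                # assumption: gapped columns are ignored
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


def distance_matrix(aligned):
    """All-against-all distances for a dict of name -> aligned sequence."""
    names = list(aligned)
    return {(p, q): uncorrected_distance(aligned[p], aligned[q])
            for p in names for q in names}


toy_alignment = {
    "elemA": "ACGTACGTAC",
    "elemB": "ACGTACGTTT",
    "elemC": "ACGT-CGTAC",
}
toy_matrix = distance_matrix(toy_alignment)
print(toy_matrix[("elemA", "elemB")])  # 20.0
```

Percent identity, as quoted for element comparisons in the text, is simply 100 minus this value under the same gap treatment.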