The dynamic intein landscape of eukaryotes

Background Inteins are mobile, self-splicing sequences that interrupt proteins and occur across all three domains of life. Scrutiny of the intein landscape in prokaryotes led to the hypothesis that some inteins are functionally important. Our focus shifts to eukaryotic inteins to assess their diversity, distribution, and dissemination, with the aim to comprehensively evaluate the eukaryotic intein landscape, understand intein maintenance, and dissect evolutionary relationships. Results This bioinformatics study reveals that eukaryotic inteins are scarce, but present in nuclear genomes of fungi, chloroplast genomes of algae, and within some eukaryotic viruses. There is a preponderance of inteins in several fungal pathogens of humans and plants. Inteins are pervasive in certain proteins, including the nuclear RNA splicing factor, Prp8, and the chloroplast DNA helicase, DnaB. We find that eukaryotic inteins frequently localize to unstructured loops of the host protein, often at highly conserved sites. More broadly, a sequence similarity network analysis of all eukaryotic inteins uncovered several routes of intein mobility. Some eukaryotic inteins appear to have been acquired through horizontal transfer with dsDNA viruses, yet other inteins are spread through intragenomic transfer. Remarkably, endosymbiosis can explain patterns of DnaB intein inheritance across several algal phyla, a novel mechanism for intein acquisition and distribution. Conclusions Overall, an intriguing picture emerges for how the eukaryotic intein landscape arose, with many evolutionary forces having contributed to its current state. Our collective results provide a framework for exploring inteins as novel regulatory elements and innovative drug targets. Electronic supplementary material The online version of this article (10.1186/s13100-018-0111-x) contains supplementary material, which is available to authorized users.


Background
Intervening protein elements, called inteins, are capable of self-excision from precursor polypeptides and ligation of the flanking sequences, termed exteins, to produce the functional host protein [1,2]. As mobile elements, some inteins also contain a homing endonuclease domain (HEN), allowing movement at the DNA level [3,4]. Inteins remove themselves from host polypeptides through a process known as protein splicing (Fig. 1a, Intein). This excision occurs in four steps using conserved splice junction residues, often highly reactive nucleophiles, like cysteine, serine, or threonine (Additional file 1: Figure S1).
In addition to inteins, eukaryotes encode other selfsplicing proteins that utilize similar chemistry. Hedgehog proteins are structural and functional analogs of inteins, but are involved in cell signaling in higher eukaryotes exclusively [5] (Fig. 1a, Hedgehog; Additional file 1: Figure S2). These essential proteins are synthesized as inactive precursors and undergo nucleophilic attack similar to protein splicing in order to form a functional molecule with cholesterol [6]. Eukaryotes also encode proteins with Hedgehog-intein (Hint) domains (Fig. 1a, Hint; Additional file 1: Figure S2). Hints employ similar autoproteolytic reactions, either mirroring activity at the N-terminus or the C-terminus of an intein, depending on where peptide cleavage occurs [7]. All three of these eukaryotic self-splicing elements share a canonical Hint fold to promote peptide cleavage reactions [8].
Inteins were originally discovered while studying the vacuolar ATPase (VMA1) of Saccharomyces cerevisiae [9,10]. A central region of the ATPase that had no homology to other known proton pumps was observed. This 'spacer' portion, present in the mRNA but not the final protein, turned out to be a 50-kDa self-splicing intein. Since this discovery, thousands of inteins have been documented mainly through sequence-based approaches. A striking outcome has been uncovering inteins across all three domains of life.
A large-scale survey was recently performed to analyze inteins across bacteria and archaea [31]. The results revealed that about half of sequenced archaeal genomes contain at least one intein, whereas a quarter of bacteria are intein-positive among those sequenced. Within this broad distribution, over 60% of proteins containing inteins are involved in replication, recombination, and repair, with 70% localizing to ATP-binding proteins. A similar study investigated where inteins are distributed across bacteria and their phages, providing the first evidence that mycobacteriophages function as facilitators of intein dissemination across mycobacteria [27].
With the recent surge in available sequenced genomes, we next focused on inteins in eukaryotic genes to learn how these inteins might have evolved and spread. Eukaryotic inteins are not only present in nuclear genomes, but in genomes of the chloroplast, and in viral genomes. These inteins frequently interrupt proteins involved in replication, recombination, and repair, as well as RNA processing and energy metabolism, differing somewhat from the tendencies of bacterial and archaeal inteins. Curiously, inteins seem to be enriched in specific lineages of pathogenic fungi, such as Cryptococcus spp. More generally, the patterns of eukaryotic inteins suggest important roles in the regulation of the host protein to the potential  Fig. 1 Types of self-splicing protein sequences and their distribution in eukaryotes. a Inteins, Hedgehog, and Hint proteins. Inteins are mobile, self-splicing protein elements present across eukarya, bacteria, and archaea. Conserved residues coordinate self-splicing, indicated by red arrows, to ligate the N-extein (blue) and C-extein (green). Hedgehog proteins are found in higher eukaryotes only and are involved in complex developmental processes. They are composed of two domains, HhN and HhC. The HhC domain is analogous to inteins, utilizing a similar mechanism to link cholesterol to HhN (red arrows). Hedgehog-intein (Hint) domains have cleavage properties similar to either the N-terminus (HintN-like) or C-terminus (HintC-like) of an intein, and are found in both metazoans and lower eukaryotes. b A modified phylogenetic tree of eukaryotes was constructed. Scaled circles indicate intein-containing phyla. In fungi, inteins are found in nuclear DNA (nDNA; red circles), in algae in chloroplast DNA (cpDNA; green circles), and in eukaryotic viruses (vDNA; blue circles). Specific intein-containing species are mentioned in the text. Total inteins in each tree are listed benefit of the organism in which it resides. Eukaryotic inteins can be classified into three types based on size and presence of a HEN. Apparently, some of these HENs were or are active, based on connections between diverse inteins on a sequence similarity network that is indicative of movement. We discovered that a specific chloroplast intein is a relic of endosymbiosis several billion years ago, while other chloroplast inteins appear to be horizontally transferred by viral vectors. This work suggests a complex, evolving picture of inteins across the eukaryotic mobilome.

Eukaryotic inteins are scarce
To assess and characterize the diversity of eukaryotic inteins, we surveyed genomic sequences using BLASTp and previously developed pipelines [31,32]. At the time of analysis (February 20, 2017), there were 11,001 eukaryotic genomes available through the National Center for Biotechnology Information (NCBI), including 2377 entries for nuclear genomic sequences and 8624 entries for organelle-and plasmid-only sequences. We also analyzed~1500 viral reference proteomes available at ViralZone [33]. The full list of intein-containing genomes and other relevant information is provided in Additional file 2: Table S1 and Additional file 3: Table S2. As an additional point of reference for understanding intein evolution in eukaryotes, we mined genomic sequences for other self-splicing protein elements, including Hedgehog, Hedgehog-like proteins, and Hedgehog-intein (Hint) domains ( Fig. 1a; Additional file 1: Figure S2 and Table S3). In total, 257 inteins were identified in 231 eukaryotic species, either in their nuclear (nDNA) or chloroplast (cpDNA) genomes, and 136 inteins were found in proteomes from 98 viruses (vDNA) (Fig. 1b).
Remarkably, despite intein rarity among eukaryotes, there are species with more than one intein per genome, which might indicate intragenomic spread (discussed below). For example, among Fungi, two inteins are identified in nDNA of an arbuscular mycorrhizal fungus Rhizophagus irregularis (Mucoromycota), and a phytopathogenic fungus Fusarium fujikuroi (Ascomycota) carries three inteins in its genome.

Intein enrichment in pathogens is genus-specific
While compiling data on intein distribution, we noticed that inteins are seemingly prevalent in fungal pathogens (Additional file 1: Table S4). To investigate further, we analyzed a distribution of pathogen and non-pathogen representatives among sequenced Ascomycota and Basidiomycota genomes to reveal a potential bias introduced by sequencing pathogenic species ( Fig. 2a; Additional file 1: Table S5). It is estimated that roughly 33% of all fungi are pathogens, and a majority of them are ascomycetes, with some basidiomycetes [38]. Out of 689 sequenced genomes from Ascomytoca and Basidiomycota, only 30.2% (208 genomes) are from pathogenic fungi, suggesting that there is no bias in the dataset (Fig. 2a, left, gray versus black circles; Additional file 1: Table S5). There also appears to be no bias towards sequencing genomes from pathogenic species in the separated Ascomycota and Basidiomycota subsets (Fig. 2a, right, gray versus black circles; Additional file 1: Table S5). The overall percentage of inteincontaining pathogens from a combined analysis of Ascomycota and Basidiomycota (37.0%) is higher than the percentage of intein-containing non-pathogens (16.4%) (Fig. 2a, left, red circles). The trend remains when looking at the two phyla separately. Ascomycota have 41.8% pathogenic species with inteins versus 27.4% non-pathogenic species with inteins ( Fig. 2a, right, red circle). Dramatically, Basidiomycota alone have 6-fold more intein-containing pathogens than non-pathogens, with 18.6% compared to 3.2% (Fig. 2a, right, red circle).
To further investigate this phenomenon, we focused on two intein-containing fungal groups: Aspergillus/ Neosartorya (Ascomycota) and tremellomycetous yeasts (Basidiomycota), including C. neoformans, C. gattii, and their close relatives (Fig. 2b) [39][40][41][42]. We found that within a subset of selected Aspergillus, there was no correlation between having an intein and a pathogenic lifestyle, despite several important pathogens having inteins (Fig. 2b, left; Additional file 1: Table S6). On the other hand, within the tremellomycetous yeasts, there was a positive correlation between pathogenic species and intein-containing species. Indeed, C. neoformans and C. gattii are the only two pathogenic species within the tremellomycetous yeast group, and are the only two intein-containing species in the analysis (Fig. 2b, right; Additional file 1: Table S7). It seems that pathogenic fungi do have a propensity for inteins, but the pattern is variable among specific genera.
Inteins are found in genes specific to genome type (nDNA, cpDNA or vDNA) Next, we analyzed the distribution of inteins relative to the protein into which they are inserted (extein) (Fig. 3a-3c; Table 1). Interestingly, there is little or no overlap in exteins between different types of genomes that harbor inteins ( Fig. 3a-3c; Table 1). While in nDNA, a majority of inteins are found in Prp8 and VMA1 ( Fig. 3a; Additional file 1: Table S8), the replicative helicase DnaB and DNAdependent RNA polymerase (DdRP) are the most common intein-containing proteins in cpDNA ( Fig. 3b; Additional file 1: Table S9). Only red and brown algae contain DnaB inteins, as green algae lack a dnaB gene, and the reverse is observed for the ATP-dependent Clp protease, ClpP, and its inteins (Additional file 1: Figure S3). In viral genomes, DNA-dependent DNA polymerases (DdDPs) often harbor  Table S10). The only overlapping intein-containing proteins among all three groups of genomes are DdRPs. However, it should be noted that DdRPs are diverse and genome-specific. Also, the Prp8 intein exists in both nDNA and cpDNA genomes, although its presence in algae might be the result of fungal contamination.

Inteins are located in conserved structure boundaries
Inteins are known to occupy sites with specific characteristics that allow splicing besides preceding a nucleophile [46]. Ideally, intein insertion sites must allow  Fig. 2 Intein preponderance in pathogenic fungi. a Analysis of inteins in pathogens. Two phyla (Ascomycota + Basidiomycota) were analyzed for a propensity of inteins in pathogens (left). Sequenced genomes of non-pathogenic fungi (gray circles) and pathogenic fungi (black circles) were separated and overlaid with the number of intein-positive genomes from each group (red circles). The overall percentage of intein-containing pathogens is 37.0%, higher than the 16.4% of intein-containing non-pathogens. Ascomycota and Basidiomycota were analyzed separately (right), and also show higher number of intein-containing pathogens (41.8% and 18.6% compared to 27.4% and 3.2%, respectively). Out of available sequenced Ascomycota and Basidiomycota, there are more non-pathogenic genomes sequenced than pathogenic, indicating no sequencing bias. Total genomes analyzed are listed. b Certain fungal lineages have intein-pathogen correlation. Species within an individual phylum (Aspergillus/ascomycete and Cryptococcocus/basidiomycete) were analyzed for a correlation of inteins in pathogens. A condensed phylogenetic tree for Aspergillus species was constructed and annotated by lifestyle (colored circles). Presence of an intein is indicated by bold and red text. While Aspergillus contains many inteins, these do not have a preference for pathogenic species, with a negative correlation coefficient (r = − 0.2). The phylogenetic tree for Cryptococcus shows an absolute correlation (r = 1.0), with the only two known pathogens both having inteins proper folding and splicing, and permit the host protein to correctly fold post-splicing [47]. Additionally, inteins are often found in protein active sites and in other conserved, functionally important domains within host proteins [31,48,49]. Thus, we examined native insertion sites of the eukaryotic inteins for the properties such as conservation and secondary structure ( Fig. 4; Additional file 1: Figure S4). We note that it is a common practice to distinguish inteins based on their extein identity and insertion site [50]. For example, inteins in the Prp8 protein are identified as Prp8i. There are six intein insertion points found in the Prp8 protein, which were designated as a-f based on the conventional intein insertion site classification scheme [50,51]. Thus, Prp8i from insertion point a is indicated as Prp8i-a (Fig. 4).
To determine the local secondary structure of the intein insertion site, we used homology modeling (Additional file 1: Table S13). As demonstrated previously, intein insertion sites are more likely to occur in loopstructure boundary (54% of all sites) than in the middle of a β-sheet or α-helix (31% of sites) [46]. Results of our secondary structure modeling are in agreement with these observations. As demonstrated in Fig. 4 above the 3D structures, the most common eukaryotic inteins Prp8i-a and VMA1i-a, as well as less widely distributed inteins in threonyl-tRNA synthetase (ThrRSi-a) and glutamate synthase (GLTi-a), are inserted either in flexible loops, or in close proximity to the loop/α-helix boundary (Fig. 4). Intein insertion in a flexible loop likely allows the intein to fold and splice properly, creating less strain on the host protein.
In order to assess insertion site conservation, the PDB structures of these exteins were uploaded to ConSurf [52], a bioinformatics tool for estimating the evolutionary conservation of amino acid positions in a protein based on the phylogenetic relations between homologous sequences. The ConSurf alignments of nDNA exteins Prp8, VMA1, GLT, and ThrRS reveal they are all in highly conserved sites, as previously noted by several groups [14,34,49] (Fig. 4, yellow highlighting; Additional file 1: Figure S4). The structure models also demonstrate the inteins insert at flexible boundaries, often between structured regions. VMA1i-a is a clear example, being inserted at a loop between a β-turn and an α-helix.
The presence of homing endonucleases (HEN) (see Fig. 5) likely facilitates mobility to conserved regions. Inteins likely home to such positions to decrease the chance of elimination and delay removal, as deletion would require absolute precision to maintain host protein function. This conservation assessment is in line with what has been reported for bacteria and archaea [27,31,49]. Two possibilities exist: limitations on mutating Mauve indicates highly conserved, whereas cyan is more variable as shown in the key. The first residue of the C-extein, shown as spheres, is highlighted in yellow and indicates intein insertion site. Prp8i-a, VMA1i-a, and GLTi-a are in structurally flexible, yet highly conserved sites. The ThrRS intein is inserted in a structured α-helix. Linear cartoons were also generated using Pro-Origami and are shown above the ConSurf maps. Residue numbering indicates the region of protein used in the Pro-Origami model. The black arrow shows intein insertion site and the number corresponds to the highlighted residue in the ConSurf structure. Structure representations are as follows: α-helixgray rectangle, β-strandgray arrow, flexible boundary -black line the homing site make it difficult for hosts to evolve immunity to the HEN, allowing inteins to propagate as selfish elements, or inteins are functionally important at these conserved sites and are retained to perform yet unknown regulatory roles.

Eukaryotic inteins vary in size
Many inteins carry a site-specific HEN in addition to the protein splicing domain [4] (Fig. 5a). A HEN renders an intein mobile by introducing a double-strand break into an inteinless allele and initiating a gene conversion    Fig. 3 legend event, which results in the copying of the intein coding sequence [4]. This process, known as homing, is thought to be responsible for the horizontal transfer and spread of inteins [53]. Thus, to elucidate the evolutionary dynamics of eukaryotic inteins, it is important to trace the presence/absence of intein-associated HENs and their key features (Fig. 5).
The length of eukaryotic inteins, a simple indicator of the presence of a HEN domain, varies greatly from 123 amino acid residues (DnaBi from red algae Dichotomaria marginata) up to 1065 amino acid residues (Prp8i-b from grass-infecting smut fungus Anthracocystis flocculosa) ( Fig. 5a and b). When all identified inteins are plotted by residue length, the resulting plots show a step-wise pattern with varying intein lengths ( Fig. 5b; Additional file 1: Figure S5). The lower step, up to~200 amino acid residues, must correspond to inteins that lack HEN domains, referred to as mini-inteins or HEN(−). The middle step between~200 to 700 amino acids, is likely represented by inteins carrying a HEN or inteins with splicing domains interrupted by a 'linker' sequence. These inteins were analyzed for endonuclease motifs and were indeed found to contain LAGLIDADG HENs. The most puzzling aspect of the plot is a group of inteins that are more than 800 amino acid residues in length ( Fig. 5b  and c). Although there are only a few extra-large inteins and all of them are in Prp8 (Fig. 5c, HEN(+) extra ), they exist in diverse fungal species, including both ascomycetes (e.g. F. fujikuroi and A. fumigatus) and basidiomycetes (An. flocculosa). In addition to protein splicing domain and HEN, these extra-large inteins carry linker sequences of unknown origin and function, with some of these linkers having repeated motifs (Fig. 5a). Repetitive sequences are common in proteins for structural reasons or may participate in ligand binding [54][55][56]. The HEN(+) extra inteins were also analyzed for endonuclease type, and many were found to carry LAGLIDADG repetitive motifs (Additional file 2: Table S1 and Additional file 3: Table S2). In general, when inteins in specific exteins are plotted as a function of length, they tend to cluster together in the same HEN category (Fig. 5c).

Sequence similarity network and phylonetwork of eukaryotic inteins
Typically, phylogenetic reconstruction, which focuses on relationships resulting from vertical descent, is used for analysis of homologous genes (or proteins). However, inteins do not follow the rule of strict vertical descent [4,14,27,30,31], and classic phylogenetic analysis would pose significant limitations to the analysis of intein evolutionary dynamics. Thus, to further elucidate evolutionary dynamics of eukaryal inteins, we built a sequence similarity network (SSN), which does not assume input sequences are homologous (Fig. 6a) [57][58][59][60].
The resulting SSN consists of 410 nodes and 5782 edges (links) representing significant relationships between analyzed sequences (Fig. 6a). A majority of the SSN components consists of the sequences originating from the same genomic pool (nDNA, cpDNA or vDNA), and inteins cluster according to their exteins, with a few notable exceptions.
There are several SSN components that are formed by inteins located in different proteins. For example, cpDNA inteins, ClpPi and DdRPi from green algae (Chlorophyta) fell within a single connected SSN component (Fig. 6a, cluster 6, green), and also, surprisingly, clustered together with viral DdRPi (Fig. 6a, cluster 6, blue), indicating potential intragenomic intein mobility as well as putative horizontal transfer between cpDNA and vDNA.
Finally, we were intrigued by the large number of DnaBi in cpDNA from taxonomically diverse and evolutionary distant species of eukaryotes (Fig. 1b, Table 1). In bacteria, DnaB is an intein hot-spot [30,31]. Therefore, we hypothesized that eukaryal DnaBi was likely inherited from the cyanobacterial progenitor of chloroplasts by endosymbiosis [62][63][64]. Although cpDNA DnaBi show a relatively high degree of protein sequence similarity in pairwise comparisons (data not shown), an overwhelming  (2), and many viral inteins also cluster by exteins (7,8,9,10). Cases where this pattern is broken represent possible horizontal transfer events (1b, 1c, 3, 6). Some inteins cluster phylogenetically, such as DnaBi from Rhodophyta (5a) or Heterokonta (5b). Hedgehog proteins (black; HhC) do not cluster with any eukaryotic inteins (11), indicating no phylogenetic relationship between Hedgehog and inteins based on sequence, although they are structurally and functionally similar. Hint-containing mating type switching proteins (yellow, HO Hop) cluster with VMA1i (2). Some inteins do not form connections to anything at all (12). b Nuclear intragenomic intein transfer. Selected intein pairs were further examined by calculating pairwise similarity percentages and are shown in the box plot. The GLTi and CHSi pair shows an average similarity above 50%, indicative of intragenomic transfer. c Endosymbiotic intein transfer. A phylonetwork tree was built in SplitsTree after alignment of cpDNA DnaBi (green) and bacterial DnaBi (pink). A branch of clustering of cpDNA DnaBi and bacterial DnaBi (shaded) suggests that DnaBi in chloroplasts might have been inherited from a cyanobacterial progenitor. Since bacteria also have inteins in ClpP, cpDNA ClpPi were included as a control and they cluster separately (6) majority is represented as singletons in the SSN (Fig. 6a,  cluster 12, green), which is likely due to difficulties with obtaining meaningful local alignments during the allagainst-all BLASTp stage of SSN reconstruction. Thus, to identify possible evolutionary relationships among DnaBi from cpDNA and bacteria, we utilized an alignment-based approach for phylogenetic network inference ( Fig. 6c; Additional file 4) [65]. As evident from the resulting phylonetwork ( Fig. 6c; Additional file 1: Figure S6), some bacterial and cpDNA DnaBi are distantly related as they cluster together on the same branches, suggesting intein transfer by endosymbiosis.

Discussion
Inteins can execute an autocatalytic protein splicing reaction and perform post-translational modification of a precursor (Fig. 1a). These features make intein-based tools indispensable in modern protein chemistry and bioengineering [66][67][68]. However, the nature of inteins, their origin, and possible biological roles are still poorly understood, especially in their native context. This work expands our understanding, by providing a comprehensive analysis of intein diversity, distribution, and dissemination in the eukaryotic world. Inteins are found in three genome types: nuclear (nDNA), chloroplast (cpDNA), and viral (vDNA) (Fig. 1b, Fig. 3a-c). Not surprisingly, eukaryotic inteins are inserted in several important, often functionally critical host proteins, matching a global intein trend [14,27,31,49]. The occurrence of inteins in notable human and plant pathogens is a highlight, and will pave the way for further research into intein drug discovery (Fig. 2). Lastly, by applying sequence similarity network (SSN) and phylonetwork analyses, we find evidence for intragenomic mobility, endosymbiotic acquisition, and horizontal transfer of eukaryotic inteins, thus providing an updated picture for how the eukaryotic intein landscape might have evolved (Fig. 7).

Eukaryotic intein landscape and potential domestication
To gain greater insight into inteins, we previously utilized data mining to investigate the distribution of inteins in prokaryotes, and demonstrated that both bacteria and archaea are relatively intein-rich, with 25% and 50% intein-positive genomes, respectively [31]. In contrast, this study found that fewer than 2% of eukaryotes harbor inteins (Fig. 1b, Fig. 3a-c). The paucity of inteins in eukaryotes suggests that their more complex genomes might have expunged these otherwise invasive elements.
Another possibility for the rarity is that inteins became 'domesticated' in eukaryotic genomes and lost some of their features. Indeed, Hedgehog-intein (Hint) domains are widely distributed among eukaryal phyla ( Fig. 1a and Additional file 1: Figure S2) [6,7]. The function of a majority of these Hint-containing proteins remains unknown, and the essentiality of protein splicing to the host proteins has yet to be established [7]. An evolutionary link of inteins to Hedgehog proteins is also a tantalizing possibility [5,8]. Although there is only limited sequence Examples of intragenomic transfer of inteins were also found in both nuclei and chloroplasts (2). The DnaBi within chloroplasts appear to be from reticulate evolution via endosymbiosis (3). Other inteins in fungal cell nuclei and within algae are spread by horizontal gene transfer through eukaryotic viruses that replicate in the cytoplasm (4) similarity between the two self-splicing sequences [5], Hedgehog proteins are structurally analogous to inteins, sharing a characteristic β-strand core known as the Hint module [8]. This common architecture allows both Hedgehog and inteins to undergo post-translational cleavage and ligation by similar trans-esterification reactions to form active molecules. These obvious structural and functional similarities argue in favor of their ancestral relatedness.

Intein distribution suggests possible functional roles
It was observed that prokaryotic inteins localize to highly conserved sites of vital proteins [14,31,34]. In general, this trend holds true for inteins in eukaryotes and their viruses (Fig. 3a-c; Fig. 4). cpDNA and vDNA inteins parallel patterns of extein preference for prokaryotic inteins, with insertions overwhelmingly in proteins involved in replication, recombination, and repair. nDNA inteins are often located in proteins associated with RNA processing, polymerases, and energy production, again critical proteins ( Fig. 3d; Table 1). The presence of inteins in essential genes, coupled with their occurrence at different sites within the same protein, argues that inteins may be important to the host protein. On the other hand, these data may also suggest that inteins are successful molecular parasites, by inserting at essential sites where they are less likely to be deleted. Indeed, recent experimental work has shed light on inteins modulating the host protein through conditional splicing. Inteins in bacteria and archaea have been shown to respond to salt, redox state, temperature, and ssDNA [69][70][71][72][73][74]. Additionally, evidence from data mining led to the observation that functionally related, but evolutionarily distinct proteins carry inteins, indicating a bias that suggests maintenance due to a regulatory role [31]. There are also often many independent intein insertions into different sites within the same protein or domain. For example, there are 104 varied inteins at different sites in the critical spliceosomal protein Prp8, suggesting several invasion events, and that the inteins are retained because they provide some function. A fascinating hypothesis is that protein splicing acts to regulate RNA splicing, to the selective advantage of the organism.
The finding that inteins are prevalent in human and plant pathogens is also noteworthy (Fig. 2). Given the absence of inteins from metazoan genomes, a pursuit for intein inhibitors as novel antimicrobials is underway [75,76]. The discovery here of inteins in several more fungal human pathogens strengthens the rationale behind such drug screening efforts. Furthermore, the presence of inteins in agricultural pathogens makes them ripe for exploitation in drug discovery, at a time when fungal diseases are of increasing concern [77].

Multiple pathways for eukaryotic intein dissemination
To further elucidate eukaryotic intein dynamics, we performed comparative analyses of all known eukaryotic inteins. Phylogenetic analysis of large sets of inteins is often hindered by an inability to produce quality multiple sequence alignments due to a high level of intein sequence diversity. Moreover, there is mounting evidence that horizontal transfer plays a significant role in intein evolution [14,27,[29][30][31]58], but classic phylogenetic analyses focus primarily on vertical descent and traditional phylogenetic trees become inefficient. Here we chose to utilize a SSNbased approach (Fig. 6a), which is less constrained and represents simultaneous interrelationships of all sequences based on their pairwise alignments [46].
Based on the results of our SSN and additional phylonetwork analyses (Fig. 6), we propose a model for intein dissemination in eukaryotes involving vertical inheritance, intragenomic mobility, endosymbiotic and horizontal (extragenomic) acquisitions (Fig. 7). The wide distribution of nDNA inteins and Hint domains among eukaryotes suggests vertical descent from an ancestral sequence as far back as the last universal common ancestor (LUCA) (Fig. 7, pathway 1) [14,48]. Once in the chromosome, the intein may become fixed in a population and vertically transmit from generation to generation, as is the case for VMA1i (Fig. 6a). Some nDNA inteins may also be mobilized intragenomically (Fig. 7, pathway 2), as is the case for CHSi and GLTi, an event that has been postulated before due to their high sequence similarity ( Fig. 6a and b) [48]. Surprisingly, a putative case of intragenomic transfer is also found within the chloroplast genome, suggested by strong relatedness of cpDNA DdRPi and ClpPi (Fig. 6a). To our knowledge, this is the first example of intragenomic intein transfer within an organellar genome.
The largest group of cpDNA inteins, DnaBi, was likely inherited via endosymbiosis from a cyanobacterial progenitor (Fig. 7, pathway 3). Previously, a cyanobacterial DnaBi was shown to have weak sequence similarity with cpDNA DnaBi from a red alga [62]. In the present study, a phylonetwork was reconstructed based on a joint dataset of both cpDNA DnaBi and bacterial DnaBi, and indicates that some of these inteins are distantly related (Fig. 6c). Endosymbiotic acquisition of inteins from bacteria seems reasonable and represents an unprecedented example of transfer of inteins in eukaryotes.
Finally, extragenomic horizontal gene transfer seems to be pervasive and one of the key forces driving eukaryotic intein distribution and diversity (Fig. 7, pathway 4). This is not surprising, given that viruses have been shown to transfer inteins among prokaryotes [23,24,27], and research has shown that a family of giant viruses has an ongoing process of exchanging inteins among each other [78]. In algal-like cells (Fig. 7), various instances of intein transmission seem to play a role in the evolution of the current landscape. Here, we again see vDNA inteins clustering with cpDNA inteins (Fig. 6a), suggesting that spread involved viral transfer at some point. Algae prey on bacteria and archaea, and are infected by viruses, providing a place for inteins to transfer directly from bacteria to eukaryotic viruses [25]. The fact that vDNA inteins are found in a class of viruses that replicate in the cytoplasm may allow the opportunity for spread.

Conclusions
Eukaryotic inteins are scarce compared to their prokaryotic counterparts, although they mimic patterns of distribution that suggest functional importance. These eukaryotic inteins are present in three genome types, nuclear, chloroplast and viral, and these inteins appear to have distinct routes of acquisition. Vertical transmission was primarily observed for fungal nuclear inteins, indicating maintenance since the last common ancestor. HEN-based intragenomic transfer allows these inteins to move to new sites within nuclei, as well as within chloroplasts. Ancient acquisition is also suggested by intein inheritance through endosymbiosis in chloroplast genomes. Finally, horizontal gene transfer spreads inteins across all three genome types, likely mediated by unique eukaryotic viruses that replicate in the cytoplasm. This expansive analysis garnered insight into intein dissemination, and will guide future experiments to investigate intein function and their use as novel drug targets.

Data mining
The first set of full-length precursors and intein protein sequences for eukaryotes was collected from the intein database, InBase [28], and the National Center for Biotechnology Information (NCBI) Protein Database (www.ncbi.nlm.nih.gov/protein/) using previously described pipeline [27,31]. Next, NCBI BLASTp [32] (blast.ncbi.nlm.nih.gov/Blast.cgi) was used to source additional, mostly unannotated inteins: known intein sequences were used as queries in a series of BLASTp searches against a non-redundant protein database limited to Eukaryota (taxid:2759). The results of the intein search in eukaryotic genomes and other metadata are provided in Additional file 2: Table S1. A similar approach was used to search for viral inteins. However, in addition to mining from NCBI Protein Database, viral inteins were investigated among entries in viral reference proteomes available at ViralZone [33] (https://viralzone.expasy.org/). The results of the intein search in genomes of viruses of eukaryotes are available in Additional file 3: Table S2.
To assess if genomes of pathogenic fungal species were overrepresented in the analysis, fungal species listed on the MycoCosm web portal (http://jgi.doe.gov/data-andtools/mycocosm/) were examined (February 20, 2017). For each species, a 'pathogen' was defined as any species causing plant or human disease. Anything listed as a 'parasite' , whether of nematodes, plants, or other fungi, were also considered pathogenic for the purpose of this analysis. Any organisms listed as 'rarely causing disease' were classified as pathogens. Fungi with unknown medical relevance were not ranked as pathogens.

Sequence analysis
The primary sequence data obtained for precursor as a whole and intein(s) separately were analyzed further using the following resources. The presence of conserved protein domains and motifs was verified using the Conserved Domain Database (CDD; www.ncbi.nlm. nih.gov/cdd) and Conserved Domain Search Service (CD Search; www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) as well as InterPro protein analysis tool (www.ebi.ac.uk/ interpro/). KOG annotation for nuclear (nDNA) and viral (vDNA) inteins was performed using protein function annotation by the KOG database tool available at WebMGA server [44,45] (weizhong-lab.ucsd.edu/metagenomic-analysis/); COG annotation for inteins from chloroplast genomes (cpDNA) was performed using the COG database [43] (www.ncbi.nlm.nih.gov/COG/). The endonuclease size sorting and plotting was done using customized Python scripts, which are available upon request. The CD search tool was utilized to identify which type of endonuclease the HEN(+) and HEN(+) extra inteins encoded. Splicing blocks A, B, F, and G were annotated based on classical classification based on conserved residues [50,79].
Pairwise and multiple sequence alignments of inteins were performed using Clustal Omega (www.ebi.ac.uk/ Tools/msa/clustalo/) unless indicated otherwise. The sequence similarity network (SSN) was generated based on full-length protein sequences of all identified inteins using EFI-EST tool [60] (Additional file 5) (http://efi.igb.illinois.edu/efi-est/) and visualized using Cytoscape [80]. The phylonetwork was built using an alignment of cpDNA DnaBi, ClpPi, and bacterial DnaBi and uploaded into Split-sTree using default settings.

Secondary structure modeling and insertion site conservation
Intein-containing exteins with known structures were accessed from the Protein Data Bank (www.rcsb.org/pdb/ home/home.do). They had the following PDB numbers: Prp8 -5GMK, VMA1 -3J9T, ThrRS -3UGQ, and GLT -1EA0. The structures have a conserved intein insertion site, despite being from inteinless hosts. Each structure file was truncated to include the domain, or portion of the protein, with intein insertion. All shortened structure PDB files were uploaded to Pro-Origami software (http://munk.csse.unimelb.edu.au/pro-origami/) [81]. For conservation, a single full-length extein sequence was chosen as a query and uploaded to ConSurf [52].