Characteristics of ERV-ORFs in 19 mammalian genomes
To characterize ERVs that are expressed as proteins in mammalian species, we first compared 176,401 possible protein-coding ERV-ORFs from 19 mammalian species genomes obtained from the gEVE database [26]. ERV-ORFs include ORFs of ≥80 amino acid (aa) residues that encode domains in retroviral genes such as gag, pro, pol, and env genes. Distribution of ERV-ORF length among mammals are shown in Fig. S1. The number of ERV-ORFs vary among mammalian genomes (Fig. 1a), as we have reported previously [26]. In each genome, the proportion of possible protein-coding ERVs relative to the total ERVs, were found to range from 0.05–0.15% (Fig. 1a). Specifically, within mice and cows, the proportion of detectable ERV-ORFs was higher compared to other species. Alternatively, we identified very few (43) ERV-ORFs in the platypus genome (Fig. 1a); however, this was to be expected as the number of ERVs is very small in this species [33]. We then calculated the proportion of ERV-ORF lengths to those of the total ERVs in the genome of each species, and found that the proportion was correlated (r = 0.71, p -value < 5.51 × 10− 4), which means that the amount of ERV-ORFs in each genome reflected that of the total ERVs in the genome.
We further compared mammalian ERV-ORFs at the level of gene domains. To reduce the influence of genome qualities on the number of domains detected, we calculated the relative fraction change (∆F) from the mammalian average for each domain fraction (Fig. 1b). The species-specific characteristics observed in the ∆F values were found in the gag and env gene, which exhibited a ∆F of more than ±0.25 in seven and nine species, respectively (Fig. 1b). The largest difference from the mammalian average in the env gene was observed in opossum (∆F = 0.3). By contrast, the ∆F values for pro and pol were similar among mammalian species, many of which had ∆F within ±0.25. Exceptions were found in mice in which the pro gene showed a dramatic increase (∆F = 1.8). Platypuses showed quite unique compositions in their gag and pro genes with very small ∆F values (0.14 and 0.12, respectively).
We next examined which ERV groups from the Repbase database [34] were enriched in human and mice ERV-ORFs (herein we employed the term “group” rather than the Repbase classification of “families/sub-families” for ERVs to avoid confusion with species classification). Top 20 enriched ERV groups were shown in Fig. 1c. A large proportion of the detected ERVs in humans were found to be HERV-H and HERV-K, and IAPE (Intracisternal A-type Particles elements with an Envelope) in mice (Fig. 1c). HERV-H represents both young and ancient HERV groups, which was inserted into the common ancestral genome of simians and prosimians [35], while HERV-K is described as containing younger HERVs [36, 37]. In addition, the Intracisternal A-type Particles elements (IAP) in mice is one of the ERV groups that continue to be active in mice [38]. This high proportion of ERV-ORFs identified in the mouse genome is likely due to the presence of many active ERVs.
We identified divergent fragments of ERV-ORFs using the Repbase consensus sequences and compared them with ERVs excluding ERV-ORF regions (hereafter referred to as non-ERV-ORFs). We hypothesized that newer elements would be closer to the consensus, and older ones would be further. Thus, we were able to estimate the age of ERV-ORFs by comparing the divergence spectra to their consensus sequences. We first determined the medians of Kimura 2-parameter divergence values [39] between ERV-ORFs and Repbase consensus sequences, and found that the medians of divergence were quite different from those of non-ERV-ORFs, which were approximately 22–28% in all eutherians (Fig. 2 and Fig. S2). The ERV-ORFs showed clear bimodal distributions (supported by dip test, p-value < 0.01, false discovery rate (FDR) corrected, see Methods) in many mammalian species save for dogs, sheep, goats, and opossum, with two modes of approximately 2–7% (younger) and 39–45% (older). It should be noted that in specific species the distribution patterns revealed that bimodality was not statistically supported in dogs, sheep, goats and opossum (dip test, p-value > 0.01, FDR corrected, Fig. 2). Although these distribution patterns were relatively similar among closely related species, such as non-human primates and rodents, overall, the distributions varied significantly between species (Fig. 2).
We next determined how many ERV-ORFs serve as raw materials for the assembly of multi-exon genes. Comparison of our observed ERV-ORFs with those in Ensembl gene and protein annotations revealed that 15 and 22 ERV-ORFs were constituents of ordinary multi-exon genes in humans and mice, respectively (Table S1, non-pink shaded cells for human and mouse genes). It is also intriguing that large numbers of these multi-exon genes were found in sheep (302 out of 324 genes) and opossum (389 out of 396 genes) (Fig. S3). The number of genes containing ERV-ORFs, as predicted by HMM not RepeatMasker [3], were quite large in sheep (18 out of 324 genes) and horse (25 out of 32 genes).
Thus far we observed differences in the presence, type, and divergence of ERV-ORFs among mammals. The high proportion of ERV-ORFs identified in the mouse genome (Fig. 1a) is likely due to the presence of many active ERVs, such as IAP (Figs. 1c and 2). Although some IAP groups contain env genes, many do not [40, 41], which may support the observation in mice of relatively higher ∆F in gag and pro, and lower ∆F in env. These results indicate that many ERV-ORFs in mice may be derived from the young ERVs that possess ORFs by chance. Similar phenomena may have occurred for other species, such as primates, as the frequencies of the divergence with modes of approximately 2–7% were larger than those with 39–45%, which is likely to be boosted by the younger ERVs, such as ERVK (Figs. 1c and 2). However, we also confirmed that the divergence of most known ERV-derived genes in the Ensembl database were also separated into two groups, young and old (Fig. S4), suggesting that not all young ERV-ORFs remained by chance. To identify candidates for ERV-derived genes, it is important to reduce the number of such ERV-ORFs that possess ORFs by chance, which may not function as genes but rather remain ORFs simply because they are too young to have undergone sequence decay. We thus investigated selection pressure as well as transcriptional potential of the ERV-ORFs to better understand the characteristics of ERV-ORFs which are likely to be candidates for ERV-derived genes in mammals.
Transcriptional potential and selective pressure of ERV-ORF
To investigate whether ERV-ORFs are expressed as mRNAs, we examined the presence of transcription start sites (TSS) of ERV-ORFs using the CAGE data in the FANTOM5 database [25]. CAGE data are derived from 1816 human samples from various tissues, primary cells, and cell lines. Using this data, we can detect ERV-ORFs with transcriptional potential. We assessed the presence of TSS near all of the ERV data sets, and further compared ERVs with and without ORFs that were located downstream of the TSS. We calculated the number of ERV-ORFs and non-ERV-ORFs located within each bin of 1000 bp from the TSS, and found that significantly fewer ERV-ORFs were located within 20,000 bp of TSS compared to non-ERV-ORFs in humans and mice (Fig. 3a and Fig. S5). We next examined which cell types contain active TSSs located upstream of ERV-ORFs in human expression profiles within CAGE data in the FANTOM5 database. We performed principal component analysis (PCA) analyses using profiles from 38 different cell lines or tissues types, all of which had a minimum of ten available samples for analysis and found that these active TSSs located near ERV-ORFs were segregated into clearly defined groups for different cell types. Principally, in the expression profile of ERV-ORFs, embryonic stem (ES) cells, H9 embryoid body (H9EB) cells, and induced pluripotent stem (iPS) cells were separated into three groups (Fig. 3b, top). Specifically, osteoclasts established a noticeably clear group. Moreover, human body tissue samples, including those from ovaries, muscles, and retinal pigment epithelium (RPE) formed individual groups, part of which overlapped with each other, and merged to form one large group in the profile of ERV-ORFs (Fig. 3b, top). Interestingly, the distinct groups of ERV-ORF differed from those of non-ERV-ORF profiles (Fig. 3b, bottom). Monocytes and macrophages were also clearly separated from iPS, ES, and H9EB/ES cells, suggesting that ERV-ORFs exhibited unique characteristics of transcription that differed from that of non-ERV-ORFs.
However, the ERV-ORFs with transcriptional potential may have been present by chance after recently becoming inserted into the host genome as was observed with an endogenous bornavirus-like nucleoprotein element (EBLNs) expressed in simians [42]. To determine which ERV-ORFs were not simply present by chance, we investigated the type and strength of selective pressure on ERV-ORFs by estimating ratios of non-synonymous to synonymous substitution (dN/dS) of ERV-ORFs. To this end, we first identified syntenic sequences to human ERV-ORFs in non-human mammalian genomes, and extracted the sequences when the length of non-human ERV-ORFs was > 90% of that of humans (as shown in Fig. 3c). In this step, we obtained sequence alignments for approximately 58.4% (7524/12,879) of human ERV-ORFs and calculated pairwise dN/dS ratios using a maximum likelihood-based approach [43]. The pairwise dN/dS ratios of each ERV-ORF were plotted in Fig. 3d and those between human and non-primate mammalian species were under 1, suggesting that, in such diverse species pairs, the ERV-ORFs that conserved the synteny may be under purifying selection (Fig. 3d). Between human and non-human primates, 5414 ERV-ORF pairs showed dN/dS < 1; however, the dN/dS ratios varied greatly depending on ERV-ORF pairs.
Given that LTR, in particular a 5′ LTR, could potentially function as a promoter for ERV-derived genes, the distance between the gene and their respective TSS should be similar to that of the original ERV structure. Therefore, the distance depends on which ERV domain the ERV-ORFs has originated from. We thus assessed the relationship between TSS and domain types of ERV-ORFs. We first used ERV-ORFs located within ten known ERV-derived single-exon genes (Table S1, pink shaded cells for human and mouse genes) and the closest TSSs in humans and mice; these ERV-ORFs were located approximately 7000 bp from the TSS (Table S1, Dist_TSS column). Four ERV-ORFs in ERV-derived genes were located within 1000 bp of the TSS; however, the remainder, specifically those within the pol and env domains, were found 1821 bp to 7086 bp downstream of the TSS (Table S1). Within mice, the ERV-ORFs within known ERV-derived genes demonstrated a similar pattern (0 bp to 5402 bp) as that observed in humans (Table S1). The distance range to TSS was relatively long compared to that of ordinary genes, which may have caused by ERV-derived genes using their own 5′ LTR as a promoter. We next applied this result to all our ERV-ORF data sets with dN/dS ratios less than 1 in primate pairs, located within 10,000 bp downstream of TSS. We plotted the range of distance between TSS and each domain category of ERV-ORFs with transcriptional potential and found that the median distance increased along the ERV structure of the domain in the order: gag-pro-pol-env (Fig. S6A). These results demonstrated that the tendency for TSSs to be located near ERV-ORFs were derived from their own 5′ LTR. However, the distance range was relatively large in all domain categories so that we set a cut-off of 10,000 bp for the distance between TSS and ERV-ORFs to detect ERV-ORFs with transcriptional potential, which can be candidates of ERV-derived genes in this study.
To further understand the relationship between selection pressure and Kimura 2-parameter divergence of ERV-ORFs, we used two groups of ERV-ORFs in the human-chimpanzee pair: the first was located within 10,000 bp downstream of TSS with dN/dS ratios less than 1 (ERV-ORFs with transcriptional potential), and the second had no limitation for dN/dS ratio or distance to TSS applied (all ERV-ORFs). We extracted 705 ERV-ORFs with transcriptional potential from human-chimpanzee pairs and compared their divergence distributions with those of all ERV-ORFs (Fig. 3e) and found an increase in the median value of the Kimura 2-parameter divergence of ERV-ORFs with transcriptional potential from that of all ERV-ORFs (from 10 to 12%, Wilcoxon rank sum test, p-value < 3.1 × 10− 6), demonstrating that the divergence of ERV-ORFs with transcriptional potential was significantly higher than that of all ERV-ORFs. Furthermore, the statistical differences in divergence between ERV-ORFs with transcriptional potential and all ERV-ORFs were also observed between human-gorilla and human-orangutan pairs (549 and 428 pairs, Wilcoxon rank sum test, p-value < 1.3 × 10− 6 and < 8.6 × 10− 4, FDR corrected, respectively). This confirmed that ERV-ORFs with relatively small divergence (≤ 7%), which were likely recently inserted into the genomes, may include ERV-ORFs that remain by chance alone. Of note, ERV groups showing a significant decrease in the proportion compared to those expected from all ERV-ORFs in humans were HERV-H and HERV-K groups (Fig. S6B).
Omics analyses of ERV-ORFs
To predict which ERV-ORFs are expressed in a variety of tissues, we performed comparative analyses using a wide range of transcriptome and histone mark data generated by next-generation sequencing. We first compared observed ERV-ORFs from our study with quantified transcriptome data derived from 31 human tissues generated in the GTEx study [23] from the CHESS database [24]. We found that a total of 279 ERV-ORFs overlapped with transcripts in the CHESS database. Of these, 7 ERV-ORFs were found to correspond to genes containing exons derived from ERV sequences predicted only by HMM, not by RepeatMasker (Table S1). The proportion of ERV-ORFs identified in CHESS transcripts was 3.5%. For each domain, the fraction was 3.0, 2.9, 3.5, and 4.7% in gag, pro, pol, and env, respectively. We performed the chi-squared test to examine whether a preference exists for each domain identified as CHESS transcripts. Comparisons of the observed numbers of ERV-ORF domains that were identified as CHESS transcripts with the expected numbers calculated from the fraction of all ERV-ORF domains within the human genome showed that the env segment was relatively large compared to the other domains (observed 89, expected 66); however, this result was not statistically significant (chi-square test, p-value < 0.16).
Since myoblast cells were reported to express syncytins during myogenesis [44, 45], we also analyzed RNA-Seq data from human and mouse differentiating myoblasts. The relevant expression data from human primary myoblasts (12 runs) and mouse C2C12 cells (16 runs) were collected from the sequence read archive (SRA) database (see Supplementary Table 3 for details). Following gene mapping and quantification, the expression data of ERV-ORFs having a minimum of ten normalized read counts in the human and mouse myoblasts were extracted. The RNA-Seq data were analyzed from four (Days 0–3) and three (Days 0, 3, 6) time points after differentiation had begun in humans and mice, respectively. PCA plots successfully captured the difference in ERV-ORF expression at each time point (Fig. 4a and Fig. S7). From these plots, 51 and 586 ERV-ORFs were identified as having a minimum of ten normalized read counts in all samples at the same differentiation stage in humans and mice, respectively. In both species, a number of ERV-ORFs exhibited high expression levels throughout the entire differentiation process, including ERV3–1 in humans, and AH000833 and AI506816 in mice (Fig. 4b and Fig. S8). Other known genes/transcripts such as Peg10, AC016577 and PPHLN1 in humans, and syncytin-A and Asprv in mice were also detected, however, their expressions were low. Many ERV-ORFs detected in the myoblast differentiation were unreported transcripts, derived from all four ERV domains of ERV1, ERVK, and ERVL, some of which exhibited stage-specific expression patterns (Fig. 4b and Fig. S8). It is noteworthy to mention that in ERV-ORFs expressed during human muscle cell differentiation, only eight were also identified in GTEx transcripts.
One of the features associated with transcribed gene body is the chromatin mark of trimethylated histone H3 at lysine 36 (H3K36me3) [46, 47]. We, therefore, applied this feature to identify ERV-ORFs with transcriptional potential as proteins. We obtained the histone mark data of H3K36me3 (9 and 6 cell/tissue types in humans and mice, respectively) from ChIP-Atlas [48], and identified the ERV-ORFs overlapping with this mark in a minimum of four samples from each cell/tissue type. From these results we identified 51 and 250 ERV-ORFs overlapping with the mark, in humans and mice, respectively. Further, a higher proportion of samples containing ERV-ORFs with H3K36me3 in each cell/tissue type showed cell type specific patterning (Fig. 4c and Fig. S9). We also identified several ERV-ORFs marked by multiple cell types; some of which were the fourth exon of CNBP (cellular nucleic acid-binding) genes in humans and mice, yet others were unreported ERV-ORFs. All ERV-ORF domains were determined contain the H3K36me3 mark in humans and mice. However, a portion of the domain expression exhibited tissue specificity; for example, in humans, the pro domain was found exclusively in liver cells. We also analyzed the relationship between TSSs and ERV-ORFs with transcriptional potential, which were detected by comparing RNA-Seq and histone mark data (Fig. 3a). A significantly larger proportion of ERV-ORFs with transcriptional potential were located downstream of TSS compared to non-ERV-ORFs, especially within 1000 bp in humans [FDR corrected p-value < 0.05]. When comparing numbers of the two types of ERV-ORFs within each bin of 2000 bp, we found a significantly larger proportion of ERV-ORFs with transcriptional potential within 4000 bp from TSS compared to non-ERV-ORFs in humans and mice (Fig. 3a and Fig. S5, p-value < 0.01, FDR corrected).
We then examined enriched ERV groups based on Repbase classification in ERV-ORFs with transcriptional potential, which were identified using three functional datasets (CHESS, myoblast RNA-Seq, and H3K36me), and compared them with those of all ERV-ORFs detected in the human genome. The top 10 enriched ERV groups in ERV-ORFs with transcriptional potential are shown in Fig. 4d (p -value < 0.05, chi-squared test, FDR corrected). Some enriched ERV groups were more prominent in ERV-ORFs with transcriptional potential compared to all ERV-ORFs detected in humans. For example, when comparing top 10 enriched ERV groups, the proportion of the HERVK3 group significantly increased in all three functional datasets compared to that in all ERV-ORFs in humans (Fig. 1c and Fig. S10). By contrast, the proportion of the HERV-H group in ERV-ORFs with transcriptional potential was significantly decreased from that in all ERV-ORFs in humans (Figs. 1c and 4d, and Fig. S10). Similarly, differences in the enriched ERV groups were observed among the three datasets of ERV-ORFs with transcriptional potential. Specifically, high proportions of the HERV-E and HERVS71 groups were observed in ERV-ORFs from the CHESS database, yet were not observed in ERV-ORFs identified using the other functional datasets. Conversely, the fractions of HERVK9 and HERVH48 groups were found to be highly enriched in ERV-ORF from H3K36me3 histone dataset compared to those detected in CHESS datasets. The histone datasets contain a larger variety of samples, including iPS, ES, and cancer cell lines, while CHESS datasets are obtained exclusively from human tissues. In ERV-ORFs from myoblast RNA-Seq, the proportion of the HERV-K group was significantly increased compared to that of all ERV-ORFs detected in humans (Figs. 1c, 4d, chi-squared test, p -value < 0.001, FDR corrected), which accounts for nearly 60% of ERV groups in the ERV-ORFs from myoblast RNA-Seq. However, enrichment of the HERV-K group was observed only in ERV-ORFs from myoblast RNA-Seq. Hence, the differences in the ERV-ORF groups may result from differences in sample types used in each project. Indeed, the ERV-ORFs detected via three different functional datasets were quite different (Fig. 4e). Although we identified 408 ERV-ORFs with transcriptional potential, only 15 were identified in both the CHESS and histone datasets. Moreover, since myoblast cells were not included in two of the datasets, only 11 ERV-ORFs were found to be overlapped with those in CHESS and histone datasets. These observations confirmed that ERV-ORFs were tissue-specific, and approximately 3.2% (408/12,879) and 2.3% (752/32,062) of ERV-ORFs may be transcribed as proteins in humans and mice, respectively.
We also compared dN/dS ratios between all ERV-ORFs and ERV-ORFs with transcriptional potential in primates, and found that the dN/dS ratios with transcriptional potential only for baboons, macaques and marmosets were significantly smaller than those of all ERV-ORFs (p-value < 0.05, Wilcoxon rank sum test, FDR corrected). Moreover, the dN/dS ratios of ERV-ORFs between apes were significantly small when they were shared by at least four non-human primates (Fig. 4f, Wilcoxon rank sum test, p-value < 0.05, FDR corrected). In the sets for chimpanzee, gorilla, and orangutan, approximately 81, 74, and 72% of ERV-ORFs were below dN/dS 1, respectively. This showed that ERV-ORFs detected by functional data still contain sequences that are not under purifying selection. Out of 408 ERV-ORFs, which were detected using three different functional datasets, 93 had dN/dS < 1, demonstrating that approximately 22.3% of ERV-ORFs considered to exhibit high transcriptional potential, were under purifying selection.
Genome-wide comparison of ERV-ORFs
The current ERV groups classified by Repbase are based on nucleotide identity; we, therefore, postulated that several unique ERV-derived genes exist that originated from different ERV groups. However, we do not know whether there is certain ERV-ORF domains or amino acid sequences are preferentially retained in different lineages to generate novel ERV-derived genes, including ERV-ORFs, even if they belong to different ERV groups in mammals. We also do not know whether amino acid sequences of ERV-derived genes, including ERV-ORFs, tend to be unique as they exhibit high transcriptional potential. To better understand such relationships between sequence types of ERV-derived genes in mammals, we conducted a genome-wide cluster analysis on ERV-ORFs using CD-HIT [49]. Cross-species ERV clustering provided important predictions for the number of sequence types (different amino acid sequences) in the 19 examined mammalian species. After clustering ERV-ORF sequences, we obtained 96,938 to 8749 clusters at the levels of ≥90% to ≥50% sequence identity, respectively (Fig. 5a). Furthermore, among ERV genes, the number of clusters identified for pol genes (Fig. 5a) were more variable, while that for pro genes was less variable. We assigned cluster IDs for each cluster by corresponding the number of sequences in the cluster with the length of the reference sequences.
Cluster analysis identified numerous clusters shared between multiple species, which had not been detected by clustering of nucleotide ERV sequences. Generally, genes shared among mammals have ≥60% of shared amino acid identity; however, many ERV-ORFs were shown to be lineage-specific, and made lineage-specific clusters at this identity level; many clusters were found to be shared between primates, rodents, or bovids, whereas less to no, shared clustering occurred in opossum and platypuses (Fig. 5b). This might because opossum and platypus are genetically diverged compared to other mammalian lineages. Using clustering patterns and human RNA-Seq data, we determined the degree of shared ERV-ORFs with transcriptional potential by examining whether the ERV-ORFs were derived from human, primate, or multi-species clusters. Many of the human ERV-ORFs identified in transcripts via RNA-Seq that were detected in transcripts in the CHESS database, and in myoblast differentiation transcriptome data, were also found in clusters of primates (apes and old-world monkeys) at the level of 80% (primate homologs share roughly ≥80% identity). Furthermore, more than half of the clusters contain only small numbers (< 10) of primate ERVs, indicating they are unique sequences (Fig. S11). These human ERV-ORFs with transcriptional potential were also in primate clusters even at the level of ≥60% identity (Fig. 5c). Overall, the clustering of gag and env domains were shared between similar species; many of which were shared among closely related species such as primates, while other gag clusters were shared among distantly related species. Additionally, the pro domain cluster, containing transcripts found in the CHESS database, was relatively small, and appeared to be more specific to apes compared to other species gene categories. The clusters containing human ERV-ORFs that overlapped with RNA-Seq transcripts with known ERV gene annotations, are highlighted in Fig. 5c. Many of these genes, save for syncytin-1, were found in clusters comprised of a smaller number of sequences (≤10). Further, of the total 12,879 human ERV-ORFs examined in this study, only 223 sequences were human-specific (un-clustered unique sequences or human-specific clusters). Moreover, from the ERV-ORFs that overlapped with RNA-Seq data transcripts, only two sequences were found to be un-clustered. Mouse ERV-ORFs showed a similar tendency, although the species specificity was stronger than humans. For example, 18,730 sequences out of the total 32,063 ERV-ORFs were mouse-specific, and 349 out of 586 ERV-ORFs detected in the RNA-Seq data were determined to be in mouse-specific clusters. Only 15 ERV-ORFs were unique sequences in the mouse genome.
Many human ERV-ORFs identified via RNA-Seq analysis were found to be shared among primates, while some ERV-ORFs were found in multiple shared clusters. Indeed, of the clusters at the level of 60% identity, we identified 60 clusters that were shared among ≥8 mammalian species (Fig. 5d). The largest shared ERV-ORF domain category was for the pol gene, which had 27 clusters shared between species. The remaining ERV gene domains contained a similar number of cross-species clusters (12, 11, and 10 for gag, pro, and env, respectively). Although nine clusters contained known annotated genes or reported transcripts (Fig. 5d and Table S1), many of the ERV-ORFs identified in the clusters were previously unreported; seven clusters contain human ERV-ORFs with transcriptional potential but not known ERV-derived genes. Although no clusters containing mouse ERV-ORFs with transcriptional potential were found, this might be explained by the different number of species for primates and rodents used in the analysis. Interestingly, 15 of these multi-species clusters did not contain human ERV-ORFs.
Nine of these clusters remained following the extraction of ≥8 shared species clusters at the level of ≥80% identity, and 21 clusters remained when extracting clusters shared among ≥10 mammalian species (Fig. S12). Further, there were fewer multi-species ERV-ORF clusters identified in rodents and opossum. Instead, these species exhibited larger numbers of species-specific clusters and unique ERV sequences (Fig. 5d, right panels). The species-specific clusters in these species were abundant in pol genes, while there were no multi-species shared clusters observed in analysis of platypuses.