SCIFER workflow
SCIFER (Single Cell Implementation to Find Expressed Retrotransposons) is a method to quantify L1 mRNA expression at the locus-specific level in scRNA-Seq datasets. SCIFER aligns data from a standard 10X Chromium Single Cell 3′ RNA-Seq dataset to the reference genome using procedures optimized to deal with the very high genomic copy number of L1 elements (Fig. 1A, see Methods). 10X Chromium Single Cell 3′ RNA-Seq read alignments are enriched at the 3′ end of L1 loci and genes due to the selection of polyadenylated transcripts during library preparation. Therefore, scRNA-Seq data lack equal distribution of read alignments across the expressed L1 locus, preventing the analysis of 5′ aligned reads that allow discernment between authentically and passively transcribed L1 loci [12, 20, 30]. For this reason, SCIFER requires that bulk RNA-Seq analysis of L1 expression using a list of full-length L1 loci has been performed on a matching sample, allowing for validation of L1 loci that were exclusively expressed from the L1 promoter. Additionally, SCIFER performs an alignment to the reference genome, instead of the transcriptome, to allow detection of L1 elements in introns and intergenic regions (Fig. 1, step 2).
To validate SCIFER efficacy we performed scRNA-Seq of MCF7 and HEK293-FRT-LacZeo cells in a combined sample using the 10X Genomics Chromium™ Single Cell 3′ Library & Gel Bead Kit v3’. To assess bias in the representation of scRNA-Seq reads, bulk RNA-Seq reads and genomic DNA-Seq reads from MCF7 cells and scRNA-Seq reads from our sample were separately aligned to the L1 consensus sequence (Additional file 1 A). DNA-seq and bulk RNA-Seq reads were distributed throughout the L1 sequence with some enrichment at the 3′ end due to the abundance of 5′ truncated L1 loci in the human genome. Alignment of scRNA-Seq reads to the 3′ end of the L1 consensus were 2X more abundant than alignments to other regions of the L1 sequence (Additional file 1 A), reflective of the 3′ targeted sequencing procedure. Similar analysis of actin (ACTB) demonstrated the expected enrichment of reads at the 3′ end of the gene locus in scRNA-Seq, with bulk RNA-Seq reads evenly distributed throughout gene exons (Additional file 1 B, top). Bulk RNA-Seq reads are equally distributed throughout the length of an L1 locus previously identified and authenticated as expressed (Additional file 1 B, bottom) [29]. Alignment of scRNA-Seq reads to the same L1 locus are enriched at the 3′ end (Additional file 1 B, bottom). Based on these comparisons, the use of the traditional 3′ targeted scRNA-Seq alone is not an adequate method to detect L1 expression (Additional file 1 B). Therefore, we used expression data from bulk RNA-seq to accurately detect expressed L1 loci to guide SCIFER analysis.
Analysis of L1 mRNA expression in MCF7 cells
To validate the utility of SCIFER for detection of L1 expression in scRNA-Seq datasets and to identify technical strengths and limitations of this approach, we analyzed High and Low coverage MCF7 scRNA-Seq datasets (Fig. 1B). We performed High coverage scRNA-Seq on a pool of MCF7 and HEK293-FRT-LacZeo cells in a 9:1 ratio, respectively, using a total of 853 cells in this experiment (Fig. 1B). Clustering by 10X Genomics cellranger count identified that MCF7 cells formed 5 clusters (Additional file 2 A) with Cluster 1 cells having on average significantly higher numbers of mapped reads per cell compared to Clusters 3 and 4 (P < 0.0001, Fig. 2A). Clusters 2 and 5 contained low numbers of RNA molecules and Cluster 2 contained a high percentage of mitochondrial read;, therefore, both were discarded from further analysis (Fig. 2A, Additional file 2 B). HEK293-FRT-LacZeo cells were used as an internal control within our cell mixture for accurate cell clustering. After sequencing and clustering cells using the cellranger count tool, the HEK293-FRT-LacZeo cells clustered separately from the MCF7 cells (Additional file 2 A, Clusters 6 and 7). HEK293-FRT-LacZeo cell identity was confirmed by aligning reads to the FRT-LacZeo transgene sequence.
Following alignment, clustering of cells, and de-duplication of barcode-UMIs, the next step in SCIFER analysis is to parse expressed L1 loci from passively transcribed L1 sequences (also referred to as background) by cross referencing the list of L1 loci that were assigned same-sense RNA-Seq alignments in the scRNA-Seq dataset with a list of full-length L1 loci validated to be expressed in MCF7 cells from a previous study (Additional file 17, Fig. 1, step 2) [29]. By summing the RPM of all L1 loci identified as expressed by this approach in each MCF7 cell, we determined the RPM levels per cell for each cluster (Fig. 2B). The three MCF7 clusters with considerable L1 expression (Clusters 1, 3, and 4) had similar average L1 expression levels ranging from 0.48 RPM in Cluster 1 to 0.80 RPM in Cluster 4 (Fig. 2B). Differences in L1 expression levels between clusters 1, 3, and 4 were also compared using Seurat v4.0.5 (Additional file 2 E). MCF7 clusters 1, 3, and 4 also expressed similar numbers of L1 loci with the average number of expressed L1 loci per cell ranging from 7.3 in Cluster 3 to 8.6 in Cluster 4 (Fig. 2C). Comparison of the L1 loci expressed in Clusters 1, 3, and 4, determined that the clusters share 82% (123 out of 150) of expressed L1 loci (Fig. 2D). The L1 RPM and number of expressed L1 loci for Clusters 2 and 5 were excluded from these figures due to not meeting technical standards, but comparisons of L1 RPM and the number of expressed L1 loci that include these clusters are shown in Additional file 2 C and D.
To find how closely SCIFER analysis of scRNA-Seq mirrors bulk RNA-Seq detection of L1 expression using our validation method [20, 30], we compared the list of L1 loci manually validated to be expressed in bulk RNA-Seq with the list of L1 loci that received ≥1 sequence alignment in scRNA-Seq. 150 of the 161 L1 loci (93%) identified as expressed in bulk RNA-Seq were detected to be expressed in scRNA-Seq (Fig. 2E, Venn diagram). We observed that the RPM per L1 locus detected using scRNA-Seq positively correlates with L1 FPKM determined using bulk RNA-Seq (r = 0.51, P < 0.0001, Fig. 2E). Comparing the bulk RNA-Seq FPKM level of the 11 L1 loci that were not detected to be expressed by scRNA-Seq revealed that the L1 loci unique to the bulk RNA-Seq dataset were expressed at a significantly lower FPKM compared to the L1 loci detected by both bulk and Single Cell RNA-Seq (P < 0.0001, Fig. 2E, Violin plot). The positive correlation of L1 FPKM detected in bulk RNA-seq with L1 RPM observed in scRNA-Seq dataset demonstrates that SCIFER performs accurate detection of L1 expression at the locus-specific level in single cells. The ability of SCIFER to detect 93% of the L1 loci determined to be expressed in the bulk RNA-Seq analysis also demonstrates that this approach is sensitive enough to identify almost all expressed L1 loci, missing L1 loci that have, on average, significantly lower expression levels than the loci detected to be expressed (Fig. 2E, Violin plot).
Several published studies employ analysis of scRNA-Seq reads to draw conclusions regarding TE expression in single cells without considering limitations of this technology such as the presence of passive L1 expression that requires careful analysis of potentially expressed L1 loci to address [32, 33]. To determine the outcome of unsupervised analysis of L1 expression using 10X Chromium Single Cell 3′ scRNA-Seq methodology, all L1 loci with unique alignments identified in scRNA-Seq were considered as potentially expressed. Comparison of unsupervised with bulk-validated L1 loci determined that the number of expressed L1 loci per cluster was inflated 8-9X and the total L1 RPM per cluster was inflated 5-6X per cluster in the non-validated dataset (Fig. 2F and G). This discrepancy highlights the importance of including a list of expressed, full-length L1 loci validated in bulk RNA-Seq analysis from a matching sample to guide findings made using scRNA-Seq datasets. Without such guidance, scRNA-Seq analysis of TE expression produces results that lack scientific rigor and biological meaning by disproportionally exaggerating both the levels of L1 expression and the number of expressed L1 loci.
Shallow sequencing reduces sensitivity of L1 mRNA expression detection in single cells
Sequencing depth is an important consideration when preparing samples for analysis of mobile element expression due to their low expression level in normal tissues, especially at the locus-specific level [10, 20, 21, 45,46,47]. Because most publicly available scRNA-Seq datasets are not sequenced to as high of a depth as our High coverage MCF7 dataset (Fig. 1B), we next tested the effect of reduced sequencing coverage on SCIFER’s ability to detect expressed L1 loci and their expression levels in single cells. To accomplish this, we used two complementary approaches: downsampling of high coverage cells from the MCF7 High coverage dataset and analysis of an independent MCF7 Low coverage scRNA-Seq dataset.
Three MCF7 cells with similar sequencing depth (0.17–0.18 million mapped reads per cell) were down sampled in 10% intervals and the average number of expressed L1 loci detected in each resulting interval sample was compared (Fig. 3A). This approach demonstrated that the average number of L1 loci expressed in these three cells drops off significantly at 70% of the original sample size (ANOVA, P = 0.026, Fig. 3A). This established that optimum detection of expressed L1 loci by SCIFER occurs at ≥80% of our starting reads (≥0.14 million mapped reads per cell) and at least half of expressed L1 loci remain detectable in files containing 50% of the original number of reads (Fig. 3A).
To determine limitations imposed by lower sequencing coverage, we performed SCIFER analysis on a low coverage MCF7 scRNA-Seq dataset with 10-fold less sequencing coverage and 5-fold more sequenced cells compared to the High coverage MCF7 scRNA-Seq dataset (Fig. 1B, Additional file 2 B). The cluster with the highest average number of mapped reads in this dataset (Cluster 7, 0.0032 million mapped reads per cell) had on average ~ 30X fewer mapped reads than Cluster 1 from the High coverage dataset, the cluster with the highest average number (0.095) of million mapped reads in the High coverage dataset (P < 0.0001, comparing Fig. 2A Cluster 1 and Additional file 3 A Cluster 7). The Low coverage dataset also had ~29X fewer expressed L1 loci per cell compared to the High coverage dataset (0.3 vs. Cluster 4 High: 8.6 expressed L1 loci) and the L1 RPM was ~ 1.9X lower in the Low than the High coverage dataset (0.35 vs. 0.66 L1 RPM, Fig. 2B and C and Additional file 3 B and C). Bulk RNA-Seq of MCF7 cells followed by manual validation identified three L1Hs loci as expressed [12, 20, 29, 30]. Because L1Hs loci are the evolutionarily youngest L1s in the human genome and, therefore the hardest to map uniquely by all approaches, including ours, but the most capable of contributing to L1-related genome instability, we considered whether their detection would change with sequencing depth. We quantified the number of cells expressing each of the three L1Hs loci in a High and Low coverage scRNA-Seq dataset of MCF7 cells (Fig. 1B, Fig. 3B). We found that while all three L1Hs loci were detected in both High and Low coverage scRNA-Seq, the L1Hs loci expression was detected in 6.2X fewer cells in the Low coverage dataset compared to the High-coverage, on average (Fig. 3B). These findings demonstrate that accurate detection of L1Hs expressing cells is reduced in lower coverage scRNA-Seq datasets leading to an underestimation of L1 expression and its potential biological impact.
We next compared the total L1 expression levels detected by SCIFER in the Low coverage MCF7 dataset with L1 expression levels detected in the bulk RNA-seq dataset. Despite reduced sensitivity of detection of L1 expression per cell in the Low coverage dataset, RPM per expressed L1 locus in the Low coverage dataset was positively correlated with bulk RNA-Seq L1 FPKM (r = 0.55, P < 0.0001 Additional file 3 E). Additionally, 88% (142 of 161) of L1 loci shown to be expressed in bulk are detected as expressed in the Low coverage MCF7 dataset (Additional file 3 E, Venn diagram). The L1 loci unique to bulk RNA-Seq had significantly lower FPKM levels compared to the FPKM levels of the L1 loci shared between bulk and scRNA-Seq (Additional file 3 E, Violin Plot, P < 0.0001). The positive L1 FPKM-RPM correlation and high percent of shared expressed loci detected between Low coverage scRNA-Seq and bulk RNA-Seq establishes that L1 expression detection, averaged across a population of cells, is not dramatically impacted by reduced sequencing depth. However, the number of expressed L1 loci per cell and their levels of expression are underestimated when a low coverage dataset is used. Additionally, similar to the High coverage scRNA-Seq dataset, L1 loci expressed in bulk RNA-Seq but missed by SCIFER in lower sequencing coverage scRNA-Seq datasets are those L1 loci that have low expression levels.
Guided by these findings we analyzed L1 loci expression detected by SCIFER in 10 single MCF7 cells with similar and highest sequencing depth (0.17–0.28 million mapped reads per cell) to determine the extent of variation in L1 expression between individual cells (Fig. 3C). This analysis demonstrated that some L1 loci were expressed by 6–7 cells, such as loci L1–3027, L1–4594 and L1–4591 (loci 3, 8,and 12, respectively), while others were expressed only in one cell, such as L1-1644, L1-3511, L1-5118, L1-0862, L1-1741, L1-4311, L1-0320, L1-2843, L1-5594, L1-2879, L1-4279, L1-0424, L1-2543, L1-2693, L1-3746, L1-4326, L1-3864, L1-4757, L1-1889, L1-4296, L1-4093, L1-4644, L1-4405, L1-4024, L1-5096, L1-4935, L1-3602 (loci 14, 21, 23, 38, 49-53, 56-58, 60-74, respectively) (Fig. 3C). The number of expressed L1 loci per cell varied from 10 in cell 7 to 33 in cell 10 (Fig. 3C). This cell-to-cell variation can be due to the drop-out of the detected L1 locus expression, an inherent feature of scRNA-Seq experiments.
We also detect expression of housekeeping genes (HKGs) in cells with read numbers downsampled in 10% increments, the same three cells used in Fig. 3A (Additional file 2 F and G). We observed that while some HKGs were more robustly expressed than others (Additional file 2 F), the average number of detected HKGs did not significantly decrease until 40% of the starting number of reads (Additional file 2 G). The difference in the dropout threshold between HKG and L1 (40% vs. 70%, respectively, Fig. 3A and Additional file 2 G) is likely due to individual L1 loci having lower levels of expression than genes tested in this experiment. We also generated an expression matrix of HKGs in the 10 high-sequencing depth cells from Fig. 3C. We observe that HKG expression detection is more uniform across single cells compared to expression of L1 loci (Additional file 2 H and Fig. 3C). Notably, HKGs with lower expression levels (Additional file 2 F) demonstrate drop-out amongst the 10 high-sequencing depth cells (Additional file 2 H). These findings show that the depth of sequencing is an important consideration when investigating L1 expression patterns or expression patterns of genes with low expression levels. This analysis also supports that biologically informative observations about single L1 locus expression or genes with low expression levels should be made at the cell cluster or cell type level, rather than between individual cells. With this in mind, given the high sequence depth of the 10 cells analyzed in this experiment, our findings of cell-to-cell variation in locus-specific L1 expression could partially reflect biologically relevant patterns of L1 expression in individual cells, which would be consistent with previously observed differences in L1 retrotransposition events between single cells [48].
L1 mRNA expression in single cells of mouse testes
Previously, our comprehensive analysis of locus-specific L1 mRNA expression in mouse organs determined that testes express the highest levels of L1 mRNA compared to other mouse organs, including male and female brains, male and female lungs, ovaries, and uteri [10]. To determine the cellular source(s) of L1 expression in mouse testes, we used SCIFER to analyze scRNA-Seq data from testes collected from two 2 mo mice. First, expression of spermatogenesis-specific cell markers was used to confirm correct cell clustering in the mouse testis samples. TBPL1 is a marker of Spermatocyte and early Spermatid stages and was found to have significantly higher expression in Spermatocyte and Round Spermatid clusters compared to Elongating and Condensing Spermatids (P < 0.0001, Additional file 17, Additional file 4 A1, B1, and C) [49]. PRM1, a protamine that is exchanged for histones during the haploid phase of spermatogenesis, was expressed significantly higher in Round, Elongating, and Condensing Spermatids compared to Spermatocytes and Sertoli cells (P < 0.0001, Additional file 17, Additional file 4 A2, B2, and C) [49]. TNP1, a protein involved in the histone-protamine exchange, also had significantly higher expression in Round, Elongating, and Condensing Spermatids compared to Spermatocytes and Sertoli cells (P < 0.0001, Additional file 17, Additional file 4 A3, B3, and C) [49]. These gene expression profiles are consistent with accurate clustering of mouse testis cells prior to SCIFER analysis.
SCIFER analysis was performed on the scRNA-Seq mouse testis samples to discover the levels and patterns of L1 mRNA expression in different cell types using a list of L1 loci validated to be expressed in mouse testes from a previous publication [10]. SCIFER analysis of the scRNA-Seq mouse testis datasets found that Round Spermatids express on average the highest levels of L1 per cell compared to other cell types (Fig. 4B and C). They have higher average L1 RPM per cell, compared to clusters of Spermatocytes, Elongating Spermatids, and Condensing Spermatids (ANOVA, P < 0.0001, P < 0.0001, P < 0.0001, respectively, Fig. 4B). Analysis of L1 RPM with Seurat v4.0.5 analysis confirmed that Round Spermatids had on average significantly higher L1 expression levels compared to Spermatocytes, Elongating Spermatids, and Condensing Spermatids (Wilcoxon rank sum, P = 4.39E-99, P = 1.20E-28, P = 3.89E-94, respectively, Additional file 4 D). Mouse Round Spermatids also on average express more L1 loci per cell (2.9 loci) than other cell types with some cells expressing 8–12 L1 loci (compared to Spermatocytes, Elongating Spermatids, and Condensing Spermatids which, on average, express less than one locus per cell (ANOVA, P < 0.0001, P < 0.0001, P < 0.0001, respectively, Fig. 4C). Too few Spermatogonia, Leydig, and Sertoli cells were identified in these datasets to make meaningful comparisons (Additional file 5 C and E). SCIFER was used to analyze a second mouse dataset of a lower sequencing coverage (Fig. 1B, Additional file 5 B) and confirmed that on average Round Spermatids in Mouse 2 also expressed L1 at a significantly higher RPM (Additional file 5 D) and on average expressed a significantly higher number of L1 loci per cell compared to the other cell types (Additional file 5 F). These findings establish that Round Spermatids reproducibly express the highest levels of L1 mRNA compared to other cell types in mouse testes.
To understand potential changes in L1 expression during spermatogenesis, we compared the identity of L1 loci expressed in different sperm cell types identified in our single cell pool. Comparison of L1 loci expressed in different sperm cell types determined that in the Mouse 1 dataset, Spermatocytes, Round Spermatids, Elongating Spermatids, and Condensing Spermatids share only 2% (5 of 212) of L1 loci expressed in testes, with 157 L1 loci being detected to be expressed only in Round Spermatids alone (Fig. 4D). In the Mouse 2 dataset, 0% (0 of 217) of expressed L1 loci are shared by Spermatocytes, Round Spermatids, Elongating Spermatids, and Condensing Spermatids with Round Spermatids expressing 160 unique L1 loci (Additional file 5 G). The high number of expressed L1 loci unique to Round Spermatids in both mouse replicates (157 and 160) demonstrates that Round Spermatids reproducibly support the majority of L1 loci detected to be expressed in bulk RNA-Seq analysis even though their genome is haploid. We also observe a high level of similarity in the expressed L1 loci shared between Mouse 1 and Mouse 2 Round Spermatids (65%, 162 of 251, Additional file 5 H2). In comparison, the number of shared loci between Mouse 1 and 2 is 31% in Spermatocytes, 44% in Elongating Spermatocytes, and 0% in Condensing Spermatids (Additional file 5 H1, H3, and H4). Additionally, Round Spermatids are the most abundant cell type in both Mouse 1 and Mouse 2 samples representing 40% (499 of 1237, including Leydig and Sertoli cells) and 49% (641 of 1315, including Leydig and Sertoli cells) of the cell populations, respectively, which may lead to their L1 expression levels dominating the SCIFER analysis. Of note, Spermatocytes are the second most abundant cell type in both Mouse 1 and Mouse 2 representing 30% (363 of 1237) and 25% (330 of 1315) of the cell populations, respectively, yet Spermatocytes have significantly lower L1 RPM levels and fewer expressed L1 loci (Fig. 4B and C, Additional file 5 D and F). This suggests that the high levels of L1 expression observed in Round Spermatids are biologically relevant and not a reflection of oversampling of the cell type.
Previously we found that in MCF7 cells SCIFER detects 88–93% of expressed L1 loci detected in bulk RNA-Seq and the L1 FPKM-RPM levels of expressed L1s are positively correlated between SCIFER-detected L1 loci expression in single cells and bulk RNA-Seq (r = 0.51, P < 0.0001, Fig. 2E and Additional file 3 E). To determine whether SCIFER is similarly consistent with bulk RNA-Seq L1 mRNA detection in an organ-derived sample with multiple cell types, we assessed the expression levels of L1 loci and compared the identity of expressed L1 loci in SCIFER analyzed scRNA-Seq and bulk RNA-Seq generated using 2mo mouse testes. With this analysis we found a strong positive correlation between L1 loci expression in scRNA-Seq and bulk RNA-Seq of mouse testes (r = 0.72, P < 0.0001, Fig. 4E). We also found that 212 of the 305 L1 loci identified as expressed using bulk RNA-Seq of mouse testes are identified by SCIFER analysis of scRNA-Seq of mouse testes (Fig. 4E, Venn Diagram). The FPKM levels of the 93 L1 loci detected as expressed in bulk RNA-Seq but not in scRNA-Seq were significantly lower than the 212 expressed L1 loci detected in both bulk RNA-Seq and scRNA-Seq with SCIFER (Welch’s t-test, P < 0.0001, Fig. 4F). These loci could also be expressed in other cell types such as Spermatogonia, Leydig, and Sertoli cell types that are underrepresented in these scRNA-Seq datasets and most likely in the bulk RNA-Seq datasets as well. Furthermore, based on our findings in MCF7 cells, an increase in sequencing depth could have resulted in a greater number of L1 loci shared between bulk and scRNA-Seq datasets.
Variable L1 mRNA expression detected by SCIFER in different cell types represented in scRNA-Seq of mouse testes led us to consider the expression of genes previously identified to contribute to restriction of L1 expression and translation during Spermatogenesis [50]. We quantified expression levels of Pld6 and Hsp90aa1, two piRNA pathway genes involved in the restriction of L1 activity in mouse testis [50], in mouse testis cell types. PLD6 is a piRNA pathway protein involved in the processing of piRNAs during Spermatogenesis [50, 51]. We observed that Pld6 is expressed at the highest level in Spermatocytes (Additional file 6 A1, B1, and C, Mouse 1: 0.91, Mouse 2: 1.04 Normalized Expression) and second highest level in Round Spermatids (Additional file 6 C, Mouse 1: 0.25, Mouse 2: 0.22 Normalized Expression), compared to the other sperm cell types (Additional file 6 A1, B1, and C). Hsp90aa1 is similarly expressed at the highest level in Spermatocytes (Additional file 6 A2, B2, C, Mouse 1: 3.01, Mouse 2: 3.12 Normalized Expression) and Round Spermatids (Additional file 6 A2, B2, C, Mouse 1: 2.24, Mouse 2: 2.21 Normalized Expression). We also considered the expression level of UHRF1, a protein that recruits DNMT1 and promotes DNA methylation at hemimethylated CpGs [50, 52, 53]. We observe decreasing levels of Uhrf1 in Round Spermatids (Additional file 6 A3, B3, C, Mouse 1: 0.29, Mouse 2:0.30) compared to Spermatocytes (Additional file 6 A3, B3, C, Mouse 1: 0.66, Mouse 2:0.67). The overall patterns of Pld6, Hsp90aa1, and Uhrf1 expression, genes related to inhibiting L1 expression and translation [50, 51, 54,55,56,57], in mouse testes are consistent with the observed peak in L1 expression that SCIFER detects in Round Spermatids (Additional file 6). This initial analysis demonstrates that the increase in L1 expression observed in Round Spermatids coincides with a peak and subsequent downregulation of Uhrf1 expression, a component of the DNA methylation pathway, as well as a peak and subsequent downregulation of Pld6 and Hsp90aa1, components of the piRNA pathway.
Individual mouse cells support expression of multiple types of transposable elements
Our results show that individual MCF7 cells support expression of multiple L1 loci (Fig. 2C and 3C). Mouse genomes contain multiple currently active L1 subfamilies. Thus, we considered whether individual mouse cells support expression of multiple L1 subfamilies. Analysis of mouse L1 A, F, Gf, and Tf subfamilies determined that all sperm cell types support expression of these subfamilies with Round Spermatids having the highest number of expressed L1 loci from the Gf and Tf subfamilies, the youngest and most active of the mouse L1 subfamilies (Additional file 7 A). Additionally, we identified that in Round Spermatids, 155 cells (46%) expressed at least two different L1 subfamilies (Additional file 7 B) and 7 (2.1%) Round Spermatids express at least one L1 locus from each active L1 subfamily (Additional file 7 B). These data show that similar to individual human breast cancer cells, cells in mouse testis also support expression of multiple L1 loci from the same or different subfamilies.
Mouse genomes contain different families of transposable elements that are active. To determine whether individual mouse testis cells express multiple families of mobile elements, we measured LTR expression in scRNA-Seq data of mouse testes. Bulk RNA-Seq analysis was performed on two 2 mo mouse datasets and expression from 143 LTRs was manually validated. LTR elements were included in our analysis if they were greater than 2 kb in length and received at least 10 aligned reads. Expression from 11 LTR elements (5 MMERVK, 3 IAP, and 3 MURVY) that were manually validated to be expressed in bulk RNA-Seq were analyzed in scRNA-Seq datasets (Additional file 17 and Additional file 7). Bulk RNA-Seq and scRNA-Seq of Mouse 1 and 2 testes shared the expression of 4 out of the 11 LTRs (Additional file 7 C). L1 and LTR element co-expression was detected in 1 Spermatogonia cell, 2 Spermatocytes, 14 Round Spermatids, and 2 Elongating Spermatids in Mouse 1 (Additional file 7 D1). L1 and LTR element co-expression was also detected in 17 Spermatocytes, 1 Round Spermatid, and 1 Elongating Spermatid in Mouse 2 (Additional file 7 D2). This analysis shows L1 and LTR elements are co-expressed in a subset of mouse Spermatocytes and Round Spermatids, the cell types with the highest L1 expression levels (Fig. 4 and Additional file 5).
Mouse and human testes support similar L1 expression patterns
The increased L1 mRNA expression in mouse testes and round spermatids as well as the high levels of similarity in L1 loci expressed between testes taken from different mice led us to investigate whether similar patterns of L1 cell-type and locus specificity are conserved in human testes [10]. To determine whether there is a similar agreement as to which L1 loci are expressed in human testes from unrelated individuals, we performed bulk RNA-Seq using RNA extracted from testes samples obtained from two 20 yo donors followed by our previously reported L1 RNA-Seq analysis [12, 20, 30]. This approach identified 114 L1 loci that were expressed in testes samples collected from the two donors (Additional file 9 A). Of the 114 expressed L1 loci, 83% (95 of 114 L1 loci) were shared between the two unrelated 20 yo donors, demonstrating that human testes exhibit reproducible L1 expression patterns between biological replicates, similar to our previous study that showed testes collected from different mice shared 85% of expressed L1 loci (Additional file 5 A) [10]. L1 loci identified to be expressed in bulk RNA-Seq were then used to guide SCIFER analysis of scRNA-Seq datasets generated using testis samples from 24 yo and 25 yo donors, which share 77% (79 of 102) of expressed L1 loci (Additional file 9 B).
First, we confirmed proper cell-type clustering in the testis datasets from 24 yo and 25 yo donors by quantifying expression of testis cell-type-specific markers Prm1, Spag6, Tnp1, and TNP2. Human Prm1 expression, like mouse Prm1 expression, was detected at progressively increasing levels in Round, Elongating, and Condensing Spermatids (Additional file 17 and Additional file 8 A1 and C1). Spermatocytes, Round Spermatids, and Elongating Spermatids were observed to have high expression levels of Spag6 which contributes to sperm motility and maintenance of sperm structure in mature sperm (Additional file 17 and Additional file 8 A2 and C2) [58]. Elongating and Condensing Spermatids exhibited high expression levels of Tnp1 and Tnp2, genes encoding proteins involved in the exchange of histones for protamines during spermatid maturation (Additional file 17 and Additional file 8 A3, A4, C3, C4) [59].
Cell type-specific analysis of two technical replicates of scRNA-Seq from the testis of a 24 yo donor found that the number of mapped reads per cell was the highest in Round Spermatids (0.035) and Spermatocytes (0.029) compared to Spermatogonia (0.025), Elongated Spermatids (0.28), and Condensing Spermatids (0.007) (Fig. 5A). SCIFER analysis determined that on average Spermatogonia, Spermatocytes, and Round Spermatids supported the highest levels of L1 expression per cell (average L1 RPM per cell = 0.62, 0.43, and 0.43, respectively Fig. 5B) and Round Spermatids express, on average 1.2 L1 loci per cell, the highest number compared to the other cell types (Fig. 5C). Graphs that include Macrophages, Endothelial, Myoid, Sertoli, and Leydig cells from the 24 yo donor are presented in Additional file 9 C-E. Round Spermatids were confirmed to have significantly higher L1 expression levels compared to Condensing Spermatids using Seurat v4.0.5 (Wilcoxon rank sum, P = 3.68E-16, Additional file 8 D). To further confirm our results, we also SCIFERed scRNA-Seq from the testis of a 25 yo human donor. In this dataset, Round Spermatids had on average the highest number of reads per cell (0.022 million mapped reads per cell, Additional file 9 F), Spermatogonia had the highest levels of L1 expression per cell out of the sperm cell types (0.88 L1 RPM, Additional file 9 G), and Round Spermatids had the highest number of expressed L1 loci per cell (0.59 L1 loci expressed per cell, Additional file 9 H). These findings show that L1 expression patterns are somewhat conserved in mice and humans with Round Spermatids and Spermatogonia supporting high levels of L1 expression compared to the other cell types in Spermatogenesis. Similarly, while the number of L1 loci expressed is significantly greater in mouse Round Spermatids compared to human Round Spermatids, the average number of L1 loci per cell is 2.4 and 1.2, respectively (Welch’s t-test, P < 0.0001, Fig. 4C and 5C). Although this comparison is likely reflective of the biological differences and similarities between the two species, the accurate quantification of the trends detected in our studies can only be determined using datasets of the same sequencing depth. Based on our findings with MCF7 cells (Fig. 3), datasets from mouse and human testes with higher sequencing depth than the datasets analyzed here would be even more informative (Fig. 1B).
To determine the extent of heterogeneity of L1 expression among different sperm cell types, we compared the levels of individual L1 locus expression (Fig. 5D, graph) and the distribution of expressed L1 (Fig. 5D, Venn diagram) in different human testis sperm cell types. Different testis cell types in the 24 yo donor shared 14% (12 out of 88) of expressed L1 loci (Fig. 5D, Venn diagram). 27% (22 out of 81) of expressed L1 loci were shared between different cell types in the testis of the 25 yo donor (Additional file 9 I). By comparing the expressed L1 loci detected in bulk RNA-Seq of the testes from 20 yo donors to the expressed L1 loci detected in scRNA-Seq of the testis from a 24 yo donor, 85% (97 of 114) L1 loci expressed in bulk were expressed in the scRNA-Seq dataset (Fig. 5E). The 17 L1 loci unique to the bulk RNA-Seq dataset had a significantly lower FPKM compared to the L1 loci shared between the two datasets (P = 0.0032, Fig. 5F). We also performed a comparison between the bulk RNA-Seq from two 20 yo donors and scRNA-Seq from the 25 yo donor and found that 74% (84 of 114) of expressed L1 loci were shared between the bulk and single cell RNA-Seq datasets. The 30 L1 loci unique to bulk RNA-Seq had a significantly lower FPKM level than the L1 loci shared between bulk and single cell RNA-Seq (Welch’s t-test, P = 0.0035, Additional file 9 K). The number of shared L1 loci between different cell types in human testes is higher than in mouse testes (27–14% vs. 2–0%) potentially due to the higher number of Round Spermatids (499 cells) in mouse vs. Round Spermatids in human (214 cells). This is likely a reflection of the biological differences in the relative cell composition between mouse and human testes. Despite this difference, SCIFER detected expression from 85% of the L1 loci expressed in bulk RNA-Seq of samples from human testes in scRNA-Seq datasets.
Expression patterns of genes restricting L1 expression and translation in human testis
To understand expression patterns of genes relevant to the L1 replication cycle, Seurat analysis [60] was used to determine the expression levels of genes previously identified to be involved in transcriptional and post-transcriptional regulation of L1 [50]. We examined expression of Dnmt1, Mecp2, Kdm1a, Trim28, and Ercc4, nuclear factors involved in the epigenetic regulation of L1 expression as well as Rnaseh2b, a gene involved in post-transcriptional regulation of L1 [50, 61]. In general, the same patterns of expression were observed for these genes in testes samples from the 24 yo and 25 yo donors. Spermatocytes expressed significantly higher levels of Dnmt1, Mecp2, and Kdm1a compared to Round, Elongating, and Condensing Spermatids in testes samples collected from 24 yo and 25 yo donors (Additional file 17, Additional file 10 A1–3 and B1–3). Spermatocytes and Round Spermatids express the highest levels of Trim28 in 24 yo and 25 yo testes compared to the other cell types (Additional file 17, Additional file 10 A5 and B5). Ercc4 expression was higher in Spermatocytes compared to Spermatids in both donors (P < 0.0001 (P = 0.017) (P < 0.0001) (Additional file 17, Additional file 10 A6 and B6). Rnaseh2b expression was significantly higher in Spermatocytes compared to Round, Elongating and Condensing Spermatids (Additional file 17, Additional file 10 A8 and B8). These gene expression patterns provide preliminary evidence that, in general, L1 inhibitory genes analyzed in this study peak in expression in Spermatocytes and their expression declines in Round, Elongating, and Condensing Spermatids. This peak in expression of genes that inhibit L1 expression corresponds to the peak in L1 expression observe in human Spermatocytes and Round Spermatids.