Ambiguous mappings can lead to biases in the enrichment estimation at young TE families/subfamilies
To investigate the TE families/subfamilies associated with ambiguously mapping reads, we analysed the genomic distribution of the read mappings from 37 ChIP-seq experiments (Methods, Fig. 1A). Specifically, we considered 1159 different TE families/subfamilies covering 45.82% of the human genome (Fig. 1B) and 1133 distinct TE families/subfamilies covering 40.94% of the mouse genome (Supplementary Fig. S1). In general, libraries with longer read lengths had smaller proportions of multimappers, i.e., reads mapping to multiple genomic locations. Nevertheless, within the analysed range (26 bp–100 bp), we did not observe any substantial differences (Supplementary Fig. S2). We found that 16.34–32.41% of the reads were multimappers. Interestingly, 11.36–39.44% of multimappers mapped to regions of the genome that are not annotated as TEs. Hence, although most of the multimappers mapped to annotated TEs, a considerable fraction of them mapped to other repetitive sequences or non-annotated TEs. Additionally, 47.95–68.44% of reads mapping to TEs were unimappers, i.e., they only mapped once in the genome. This finding indicates that despite the repetitive nature of TEs, individual copies of the same TE family/subfamily are often not identical.
To examine the similarity among the TE families/subfamilies individual copies in more detail, we compared the TE families/subfamilies associated with multimappers and unimappers (Methods). Approximately 9% (107 out of 1159) of the TE families/subfamilies were predominantly covered by multimappers (i.e., more than 90% of the reads mapping to these families were multimappers). Remarkably, all these TE families/subfamilies were exclusive to primates and their derived clades (Fig. 1C), and 57% (61 out of 107) of them belonged to some of the youngest TE families/subfamilies, including LINE-1, SVA and many members of the Alu family. On the other hand, 2.5% (29 out of 1159) TE families/subfamilies were predominantly covered by unimappers (i.e., more than 90% of the reads mapping to these families were unimappers), and all of these TE families/subfamilies expanded in the vertebrate genome before the evolution of mammals (Fig. 1C). Notably, these TE families/subfamilies were present especially in Amniota (75.86%), with UCONs (ultra-conserved element) representing almost 70% of these TE families/subfamilies. These observations suggest that distinct strategies dealing with ambiguously mapped reads may lead to different conclusions for the functional analysis of TEs, depending on the age of the TE families/subfamilies. More specifically, using the strategy of discarding multimappers is likely to miss functions linked to young TE families/subfamilies.
Uniform background distribution may lead to false positive and false negative enrichments at TEs
Next, we setup to study enrichment biases arising from the approach that calculates the background by randomly permuting the locations of the read mappings in the genome. Basically, this strategy assumes a uniform distribution of the read mappings.
Consistent with our expectations, we found that although the relationship between the number of read mappings simulated assuming a uniform distribution and the number of read mappings observed for the input control was linear (Fig. 2A), there were clear deviations for specific TE families/subfamilies. Notably, compared to the input control, assuming a uniform distribution led to 30% less read mappings for LTR13, LTR18B, LTR18C and LTR43-int subfamilies, and to 50% more read mappings for LTR12B, MamRep1151, SVA E and SVA F subfamilies in at least four of the seven ENCODE ChIP-seq input controls (Dataset 1; see Methods and Supplementary Table S1). In general, we observed that the TE families/subfamilies for which the number of read mappings was underestimated by assuming a uniform distribution comprise shorter TEs than those of families/subfamilies for which the number of read mappings was overestimated; also, the former are smaller (i.e., have fewer individual copies in the genome) than the latter. Notwithstanding, the TE families/subfamilies that are underestimated/overestimated do not differ in any obvious way from others in terms of their size and length of their individual TE copies (Fig. 2B). Thus, while the TEs in the underestimated long-terminal repeat (LTR) subfamilies vary from 301 to 871 bp per copy, TEs in the overestimated LTR12B subfamily are, on average, 1967 bp long. Additionally, while there are only 76 and 92 copies of TEs of the LTR18C and LTR43-int subfamilies, respectively, the LTR subfamily MamRep1151 has with 1251 copies more than twice the median number of copies of all TE families/subfamilies in the human genome. Therefore, using the uniform distribution as background for enrichment analysis may result in both false positive and negative enrichments that cannot be easily predicted.
T3E is a novel approach to estimate enrichment at TEs from ChIP-seq data
To address the issues presented above, we developed T3E (Methods). Acknowledging technical limitations when it comes to characterising specific TE instances in the human genome, T3E computes enrichments for entire TE families/subfamilies, rather than for their individual TE copies. The algorithm uses the genomic distribution of the read mappings of the ChIP-seq input control experiment (Fig. 3A) to estimate a background probability distribution (“input-based background probability distribution”, Fig. 3B). Specifically, the probability of observing a mapping at a given genomic position is assumed to reflect the coverage of the input control. T3E then samples from this probability distribution to construct an appropriate background for enrichment calculations (“simulated input library”, Fig. 3C). Consequently, the expected number of read mappings that T3E estimates for each TE family/subfamily is proportional to that observed in the input control (Supplementary Fig. S3). To account for the ambiguity of read mappings, T3E weights each read mapping to a TE copy by the number of genomic loci to which the read maps, so that the contribution of a particular read mapping to the total number of read mappings associated with that TE copy is inversely proportional to the overall number of loci to which the corresponding read maps (Fig. 3D). Moreover, the calculation is done at the single nucleotide level, i.e., only the number of nucleotides from a read mapping to a TE copy are considered. The process is repeated multiple times in order to obtain an empirical P-value. Finally, a Fold-Change is calculated as the ratio between the number of read mappings associated with the TE family/subfamily of interest in the ChIP-seq sample library and the average number of read mappings associated with the same TE family/subfamily across all simulated input libraries (Fig. 3E). T3E is a novel, open-source framework for the epigenetic analysis of TE families/subfamilies from which the entire TE community could benefit.
The run-time and memory complexities of T3E depend on the library size, number of ambiguously mapping reads, and number of genomic loci to which ambiguously mapping reads map. In the worst-case scenario, the library would contain only multimappers, each mapping to every single genomic locus. In this case, the theoretical complexity is proportional to genome size × input library size × sample library size. However, on an average ENCODE ChIP-seq sample (HepG2, see Methods), the algorithm showed approximately linear run-time and memory complexities with respect to the input library size (Supplementary Fig. S4).
T3E detects biologically-relevant TE family/subfamily enrichments
To test how T3E can be used to characterise the epigenetic profile of TE families/subfamilies we analysed three ChIP-seq libraries and their seven input control libraries generated by the ENCODE consortium (Dataset 1). The experiments involved ChIP-seq against H3K4me3 –a histone mark associated with euchromatin– in B-cells –responsible for humoral responses of the immune system–, H3K79me1 –also associated with euchromatin– in a cell line isolated from normal lung fibroblast tissue (IMR-90), and the transcription factor FOXP1 in a cell line isolated from hepatocellular carcinoma (HepG2).
Few TE families/subfamilies show enrichment for all three types of samples, perhaps indicating some sort of basic level of enrichment. Specifically, we found that the LTR12E (FCs between 1.14 and 9.19), LTR13 (FCs between 1.32 and 2.01) and MER57E3 (FCs between 1.26 and 7.58) subfamilies of the long terminal repeats (LTRs) are enriched in all three cell types, even though at different levels. This observation is consistent with reports about the LTRs of endogenous retroviruses (ERVs) frequently having been exapted as gene promoters [24], which are often not cell-specific.
Supporting a functional association, we observed specific enrichment profiles depending on the cell type and ChIP target (Fig. 4A). For instance, six subfamilies of the same TE family, the endogenous retrovirus-K (ERVK), featured enrichment for FOXP1 in HepG2 cells, with Fold-Changes varying from 2.98 (LTR22C2) to 10.24 (LTR22) (Methods). These six subfamilies did not present enrichment in any of the other ChIP-seq samples. Similarly, most Alu retrotransposon subfamilies were only enriched for H3K79me1 in IMR-90 cells (FCs between 1.24 and 2.12). Moreover, the LTR21B and LTR10B subfamilies (FCs between 2.05 and 3.59) only showed enrichment in two out of three different cell types. Whereas the LTR21B subfamily, enriched both for H3K4me3 in B-cells (FC = 3.59) and for FOXP1 in the HepG2 cancer cell line (FC = 2.70) but not for H3K79me1 in the IMR-90 normal cell line (FC < 1, P-value = 0.85), has been associated with immune checkpoint activity and CD8+ T-cell expression in tumours [25], the LTR10B subfamily, also enriched for H3K4me3 in B-cells (FC = 2.05) and FOXP1 in HepG2 (FC = 2.53) cells but not in IMR-90 cells (FC = 1.01, P-value = 0.45), has been associated with tumour suppressor p53 protein binding along with the MER61 family [26]. Curiously, the MER61F subfamily was enriched for FOXP1 in HepG2 (FC = 4.68) and moderately enriched for H3K4me3 in B-cells (FC = 1.47) too. These enrichments indicate that TEs may act as cis-regulatory elements, contributing to gene regulation by functioning as enhancers, promoters, silencers or insulators.
The fact that the enrichments observed are cell-type specific and restricted to certain transcription factors and histone marks depending on the TE family/subfamily strongly suggests that T3E is able to capture functional properties of the TEs and, hence, provides a useful framework for the systematic investigation of TEs’ cis-regulatory potential in a context-specific manner.
TE enrichment estimations by T3E are robust to variation in the quality of the ChIP-seq input control
To evaluate the impact of the quality of the input control on the enrichment of TE families/subfamilies estimated by T3E, we applied T3E to the analysis of the libraries of a H3K79me1 ChIP-seq experiment performed by the ENCODE Project in IMR-90 cells its two input controls (Methods). One of the input control libraries had been audited by ENCODE in the “insufficient read depth” category (10,800,072 processed reads, file accession: ENCFF421HCG); the other has sufficient read depth (47,554,621 processed reads, file accession: ENCFF679UAT). For both input controls, Alu families featured the highest enrichments, with AluJ and AluS subfamilies featuring higher Fold-Changes for the input control with sufficient read depth (“good control”) and AluY subfamilies presented higher Fold-Changes for the input control with insufficient read depth (“bad control”). Specifically, we found that among the 1052 TE families/subfamilies with 50 or more individual TE copies in the human genome, 16 Alu retrotransposon subfamilies and one fossil Alu monomer (FAM) were considered enriched (P-value < 0.01) for both the good and the bad controls and had a FC of 2 or higher in at least one of them; in 15 cases, the Fold-Changes were higher for the good control (FCs between 2.01 and 2.12 for the good control vs FCs between 1.76 to 1.93 for the bad control), and in two cases the Fold-Changes were higher for the bad control than the good control (FCs between 2.02 and 2.11, Fig. 4B). Besides Alu retrotransposons, a few other TE families/subfamilies showed moderate enrichments (e.g., SVA retrotransposon subfamilies A-F, FLAM and FRAM, with FCs between 1.59 and 1.85), but their enrichment was always lower than 2-fold, independently of the read depth of the input control. Hence, within reasonable limits, the quality of the input control ChIP-seq data used by T3E to construct the background probability distribution is not likely to impact the biological interpretation of the enrichment analysis, but rather the Fold-Change thresholds chosen to define a TE family/subfamily as enriched.
Mouse neocortex enhancers present modest enrichment for ancient repeats
Next, we applied T3E to ChIP-seq data of the dorsal cerebral wall (DCW) for the transcriptional co-activator p300 from mouse embryos (Dataset 2; see Methods and Supplementary Table S1), for which Wenger, Notwell and their colleagues [27, 28] had reported a striking enrichment (FC = 73) for the MER130 subfamily. The authors used a non-TE aware approach. In summary, they mapped the ChIP-seq reads to the mouse genome using ELAND [29] and retained only reads mapping uniquely with two or fewer mismatches. Peaks were called with MACS [30] based on uniquely mapping ChIP-seq reads and using input DNA reads as control. The expected enrichment for each repeat family/subfamily was calculated by randomly shuffling the locations of the resulting peak set 10,000 times across the entire genome and calculating the average. Only TE families/subfamilies with at least 50 individual copies were considered for this analysis, leading to 997 TE families/subfamilies. Consistent with the published observations, we found that MER130 was the TE subfamily exhibiting greatest enrichment, albeit with a much lower Fold-Change (FC = 4.03; Fig. 4C). Similarly, T3E detected enrichment for the UCON31 subfamily (FC = 2.79), for which the aforementioned studies had also indicated a much higher enrichment (FC ≈ 35). Despite the general agreement between our findings, we noticed some interesting differences. For example, two other TE families/subfamilies reported by Wenger et al. as highly enriched, MER124 (FC ≈ 15) and AmnSINE1 (FC ≈ 5) had only moderate Fold-Changes according to our analysis (MER124 with FC = 1.34 and AmnSINE1 with FC = 1.33). Further, we saw high Fold-Changes for six subfamilies, namely, Eulor9C, MER125, MER126, MER129, UCON5 and UCON7 (FCs between 1.51 to 1.91; Fig. 4C), which had not been acknowledged before. Eulors (euteleostomi conserved low frequency repeats), MERs (medium reiterated frequency repeats) and UCONs are ancient transposable elements [31]. Eulor families present self-complementary regions which suggest they might have been accumulated in the genome by DNA transposition just as some MER families [32]. Since most DNA transposable elements are predicted to be ancient immobilized elements in the genome [33] that have been transposed in an imprecise manner in the vicinity of genes, they might have been co-opted as transcriptional regulators of genes involved in developmental processes [34]. Similarly, UCON families are often located nearby or overlapping genes involved in regulation of transcription and development [35]. This suggests that a broader range of ancient TE families/subfamilies may have been exapted as enhancers and other cis-regulatory elements of genes involved in the development and evolution of the mammalian neocortex.
RNA polymerases bind differently to LTR retrotransposons depending on the cell type
Finally, we applied T3E to an RNA polymerase II and III (RNA Pol II and Pol III) ChIP-seq dataset that had already been analysed with RepEnrich [20]. In agreement with the results obtained by Criscione et al. [20], T3E detected more TE families/subfamilies enriched for RNA Pol II in transformed cell lines (FCs between 2.05 and 17.89) than in normal cell lines (FCs between 2.03 and 3.37). In particular, we found the lowest number of enriched TE families/subfamilies (7 out of 1159) in the PBDE (peripheral blood-derived erythroblast) normal cell line and the highest number (62 out of 1159) in the K562 chronic myelogenous leukemia (CML) cell line. Criscione et al. observed that many LTR retrotransposons and some SINEs featured enrichment for Pol II (FCs > 2.83 and false discovery rates (FDR) < 0.05); on the contrary, only a few TE families/subfamilies of DNA and LINE retrotransposons did. This is especially true for transformed cell lines. Overall, Criscione et identified more than 100 retrotransposon TE families/subfamilies displaying enrichment for Pol II in transformed cells (GM12878 lymphoblastoid, Hela adenocarcinoma and K562) and at least 20 retrotransposon TE families/subfamilies in normal cells (human umbilical vein endothelial cell (HUVEC) and PBDE). Using T3E, we only observed a total of 24 TE families/subfamilies featuring enrichment for Pol II (FCs between 2.83 and 17.89) in transformed (22 out of 24) and normal cell lines (2 out of 24), whereby approximately 80% were LTR retrotransposons. Additionally, we did not find any LINE retrotransposon family/subfamily enriched for Pol II. Criscione et al. suggested that RNA polymerases bind to TE families/subfamilies in a cell line-specific manner. Indeed, T3E detected 13 TE families/subfamilies that were enriched in only one cell line. Nevertheless, we also observed TE families/subfamilies enriched in multiple cell lines, but with lower Fold-Changes in normal cell lines than cancer cell lines, suggesting systematic changes in their epigenetic profiles in cancer. For instance, we found that LTR12E was enriched for Pol II in both transformed and normal cell lines, although transformed cells tended to feature higher levels of enrichment (FCs between 4.55 and 9.57) compared to normal cell lines (FCs 2.07 and 2.19 for PBDE and HUVEC, respectively; Supplementary Fig. S5). This is consistent with the ability of some LTR12 subfamilies to recruit RNA polymerase II and act as cell-specific promoters in certain cell types in mammalian genomes [24, 36].
Criscione et al. highlighted that the human endogenous retrovirus HERV-Fc1 and its subfamilies were enriched for RNA Pol II and Pol II phosphorylated on serine 2 (Pol II S2). In particular, they noted that the internal region of HERV-Fc1 (int) was highly enriched for Pol II and Pol II S2 in the K562 CML cell line. Comparing the results of T3E with those of RepEnrich, we observed similar trends for HERV-Fc1, but differences for individual subfamilies. To illustrate, RepEnrich found enrichment for Pol II S2 at HERV-Fc1-int, LTR1 and LTR3 in the Hela cell line (FC > 2), while T3E detected moderate enrichment at HERV-Fc1-int (FC = 1.45) and no enrichment at any of the other HERV-Fc1 LTRs (Fig. 4D). Like RepEnrich, T3E identified enrichment for Pol II at HERV-Fc1-int in the K562 cell line (FC = 15.77); albeit, RepEnrich reported lower levels of enrichment (FC = 7). Additionally, the results of RepEnrich suggest that HERV-Fc1 is –with some exceptions– associated with histone modifications for active chromatin. The results obtained with T3E are similar, including the enrichment at HERV-Fc1 for the repressive mark H3K9me3 that was reported by the developers of RepEnrich (Fig. 4E), and the enrichment for Pol II and Pol II S2 at HERV-Fc1 in transformed and cancer cell lines (K562 and GM12878), which calls attention to the potential expression of HERVs in cancerous tissues [37]. Nonetheless, it is worth remarking that those enrichments are only supported by a few annotated individual TE copies (3 for LTR2, 5 for LTR3, 7 for the internal region and 17 for LTR1). In conclusion, our data add to the growing body of evidence supporting the implication of TEs in important cellular processes.