Cell type-specific termination of transcription by transposable element sequences
© Conley and Jordan; licensee BioMed Central Ltd. 2012
Received: 25 May 2012
Accepted: 8 August 2012
Published: 30 September 2012
Transposable elements (TEs) encode sequences necessary for their own transposition, including signals required for the termination of transcription. TE sequences within the introns of human genes show an antisense orientation bias, which has been proposed to reflect selection against TE sequences in the sense orientation owing to their ability to terminate the transcription of host gene transcripts. While there is evidence in support of this model for some elements, the extent to which TE sequences actually terminate transcription of human genes across the genome remains an open question.
Using high-throughput sequencing data, we have characterized over 9,000 distinct TE-derived sequences that provide transcription termination sites for 5,747 human genes across eight different cell types. Rarefaction curve analysis suggests that there may be twice as many TE-derived termination sites (TE-TTS) genome-wide among all human cell types. The local chromatin environment for these TE-TTS is similar to that seen for 3′ UTR canonical TTS and distinct from the chromatin environment of other intragenic TE sequences. However, those TE-TTS located within the introns of human genes were found to be far more cell type-specific than the canonical TTS. TE-TTS were much more likely to be found in the sense orientation than other intragenic TE sequences of the same TE family and TE-TTS in the sense orientation terminate transcription more efficiently than those found in the antisense orientation. Alu sequences were found to provide a large number of relatively weak TTS, whereas LTR elements provided a smaller number of much stronger TTS.
TE sequences provide numerous termination sites to human genes, and TE-derived TTS are particularly cell type-specific. Thus, TE sequences provide a powerful mechanism for the diversification of transcriptional profiles between cell types and among evolutionary lineages, since most TE-TTS are evolutionarily young. The extent of transcription termination by TEs seen here, along with the preference for sense-oriented TE insertions to provide TTS, is consistent with the observed antisense orientation bias of human TEs.
For any individual human, different kinds of somatic cells contain the same genome sequence, but are obviously functionally distinct. Thus, cell type-specific regulation of the genome, rather than the sequence itself, defines the characteristics of a cell type. The importance of cell type-specific activity of promoters in the functional differentiation of cell types has long been appreciated; however, the role of cell type-specific termination of transcription in this process has not been as well studied. Nevertheless, recent studies have begun to show that variation in transcription termination is important for cell-type specification [1–3] and have piqued an interest in this largely unexplored phenomenon.
There are numerous transposable element (TE)-derived sequences in the human genome, comprising more than two-thirds of the total sequence, and many of these TEs are located within the introns of human genes. TEs contain their own regulatory sequences, including specific signals that lead to the termination of transcripts initiated from element promoters. Human endogenous retroviral elements (HERVs), for example, have polyadenylation signals in their long terminal repeat (LTR) regions that terminate transcription. Thus, numerous TE sequences located within, or nearby, human gene sequences may contribute substantially to the termination of gene transcription via the provisioning of termination signals.
There are several known examples whereby TE sequences located within, or nearby, human genes have been shown to terminate transcription of genic mRNAs. An early study of HERVs provided the first direct evidence that TE-derived sequences can terminate the transcription of non-TE human mRNAs and further suggested that different subfamilies of these elements may serve to terminate transcription in a cell type-specific manner. Later, the same family of ERVs was demonstrated to terminate transcription of a novel alternatively spliced version of the human NAAA gene. There is also experimental evidence showing that L1 (LINE) retrotransposon sequences can terminate the transcription of human genes, and in this same study the intronic content of L1 sequences in human genes was found to be negatively correlated with their expression levels. A later study showed a similar trend whereby the presence of polymorphic L1 insertions in human genes was correlated with a decrease in their expression in a tissue-specific manner.
Despite the evidence cited above indicating that TE sequences can terminate transcription of human genes in a cell type-specific manner in some cases, the extent of this phenomenon and its overall effect on cell type-specific expression have not been fully explored. A pair of recent genome-scale surveys of transcription termination by TEs revealed ~3,000 cases of human transcripts that terminate with TEs [11, 12], suggesting that the phenomenon may be widespread. These studies, while intriguing, relied on relatively low-throughput transcriptomic technologies and did not address the cell type specificity of TE transcription termination. Thus, the full extent of TE transcription termination within the human genome and, equally important, the cell type specificity of this phenomenon remain unknown.
Here, we deeply interrogated the contribution of TE sequences to human gene transcription termination via the integrated analysis of high-throughput transcriptomic data and TE gene annotations. Since TE sequences have been shown to contribute disproportionately to cell type-specific regulation, we also evaluated the extent to which transcription termination of human genes by TEs is cell type-specific. To do this, we characterized the space of transcription termination sites (TTS) derived from TE insertions in eight different ENCODE cell types. For these TE-TTS, we characterized the contributions from different TE families, as well as their relative insertion orientations. We found 9,287 TE-derived sequences that terminate the transcription of 5,747 human genes. Our results also show that TEs terminate transcription much more efficiently when inserted in the sense orientation relative to gene transcription and thus lend credence to the previously articulated notion that TE orientation biases result from selection against TE termination of gene transcription. We also show that TE termination of gene transcription is highly cell type-specific and thus may contribute to the specialization of cellular function through differential gene regulation.
We characterized TE sequences that provide transcription termination sites (TE-TTS) to human genes using Paired-End diTag (PET) data. PET is a technique for the high-throughput characterization of the 5′ and 3′ ends of mature full-length mRNAs, which allows for deep annotation of paired transcription start (TSS) and termination sites (TTS), including the discovery of many novel alternative sites. TE-TTS were characterized by co-locating TE sequences with 3′ PET tag clusters that are paired with 5′ PET tag clusters mapped to known human gene promoters (see Methods, Additional file 1: Table S1). Using PET data from eight different ENCODE cell types (GM12878, H1HESC, HeLaS3, HepG2, HUVEC, K562, NHEK and Prostate) [15, 16], we discovered 98,632 total TTS, 9,287 of which are derived from TE sequences. Thus, 9.4% of human gene TTS are provided by TE-derived sequences, and 28% of human gene loci have at least one TE-TTS.
Locations of human gene transcription termination sites (TTS) characterized using PET data
Though many alternative TE-derived TTS occur within an intron of a coding locus as seen for GALNT2 and EPHX2, some TE-TTS may leave the ORF intact or nearly so. For example, a TTS derived from a FLAM_C TE sequence in the BSDC1 gene is found at an alternative upstream position in the terminal intron (Figure 1c). Indeed, a human mRNA from GenBank contains this TTS and suggests an alternative C-terminal coding sequence. The canonical BSDC1 TTS is found several kb downstream of the TE-TTS, and the resulting 3′ UTR contains ten miRNA binding sites that could be used to degrade the mRNA or reduce its translation. Thus, utilization of the FLAM_C-derived TTS, which would generate a transcript with a nearly full-length ORF but a drastically shortened 3′ UTR lacking miRNA binding sites, could effectively increase the expression of BSDC1 by evading post-transcriptional regulation via miRNA binding. As is the case for the GALNT2 and EPHX2 genes, the utilization of this TE-TTS is cell type-specific, with the majority of transcripts in K562 utilizing the FLAM_C-derived TTS and the majority reading through the TE-TTS in NHEK cells (2 × 2 χ² = 3,907, P ≈ 0). The contribution of TE sequences to alternative transcription termination is further explored later in the manuscript.
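The 2 × 2 χ² comparison of TTS usage between two cell types can be illustrated with a minimal sketch. It computes the Pearson statistic without a continuity correction; whether a correction was applied in the reported test is not stated, so that choice is an assumption. The function name `chi_square_2x2` and the tag counts are hypothetical and are not taken from the PET data.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
    [[a, b], [c, d]], e.g. rows = two cell types and
    columns = (tags ending at the TE-TTS, tags reading through it)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: cell type 1 mostly terminates, cell type 2 mostly reads through.
stat = chi_square_2x2(80, 20, 30, 70)
print(round(stat, 2))
```

A large statistic relative to the χ² distribution with one degree of freedom indicates that the two cell types use the site at different rates.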
The vast majority of TE sequences within human genes are found in the antisense orientation relative to the direction of transcription of the gene. The genic orientation bias of human TEs is thought to reflect differential selective elimination of sense TE insertions over time rather than a preference in the introduction of antisense insertions at the moment of transposition. The ability of TEs to cause premature termination of gene transcripts, thereby reducing levels of transcription, has been proposed as a mechanism to explain the selective elimination of sense oriented L1 sequences from human gene loci. In order to investigate the role of TE-TTS in the selection against sense-oriented TE insertions genome-wide, we compared the insertion orientations of intragenic TEs that do not provide TTS versus the orientations of TE-TTS for the eight largest families of human TEs (Alu, ERV, hAT, L1, L2, MaLR, MIR and TcMar).
For those genic TE sequences that provide a TTS, the majority of TE families show a significant enrichment of insertions in the sense orientation versus the other insertions. Alus have one of the weaker antisense orientation biases for genic elements, but Alu-derived TTS show far and away the strongest sense bias; an Alu insertion providing a TTS is approximately 20× more likely to be in the sense orientation than the antisense orientation. While LTR element genic insertions show the strongest overall antisense bias, insertions providing a TTS are also much more likely to be in the sense orientation; an LTR element providing a TTS is 4× more likely to be found in the sense orientation than the average genic LTR element insertion. The strong sense orientation enrichment seen for TE-TTS indicates that genic TEs oriented in the same direction as transcription are much more effective transcription terminators, consistent with the notion that sense-oriented TE insertions are selected against owing to their disruptive effects on gene expression.
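One way to express the sense enrichment described above is as a ratio of sense fractions: the fraction of TTS-providing insertions that are in the sense orientation, divided by the same fraction for other genic insertions of that family. The sketch below assumes this definition, which may differ from the exact statistic used for Figure 3, and the counts are hypothetical.

```python
def sense_enrichment(tts_sense, tts_antisense, other_sense, other_antisense):
    """Ratio of the sense fraction among TTS-providing insertions to the
    sense fraction among other genic insertions of the same TE family."""
    tts_total = tts_sense + tts_antisense
    other_total = other_sense + other_antisense
    if tts_total == 0 or other_total == 0 or other_sense == 0:
        raise ValueError("counts must allow both sense fractions to be computed")
    return (tts_sense / tts_total) / (other_sense / other_total)


# Hypothetical family: 80% of TTS-providing insertions are sense,
# versus 20% of the remaining genic insertions.
ratio = sense_enrichment(80, 20, 200, 800)
print(ratio)
```

A ratio above 1 means that insertions providing a TTS are more often sense-oriented than the family's genic insertions in general.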
The only exception to this pattern is seen for the relatively ancient family of MIR TEs. MIRs have previously been implicated in providing gene regulatory sequences in a number of studies [20, 21], and the MIR sequences that remain intact and recognizable in the human genome are likely to have been conserved by purifying selection. Thus, the lack of orientation bias for MIRs, irrespective of their status as TTS, may reflect their general utility as gene regulators rather than an ephemeral presence as neutral sequences that will be eventually lost by mutational decay.
To further explore the contributions of the different Alu subfamilies, we evaluated the strength of utilization for TTS derived from the different subfamilies. The strength of utilization for any TTS is measured as the relative frequency with which it terminates transcription versus the frequency that it is read through (see Methods). Consistent with what is seen for the relative levels of TTS donation by the different Alu subfamilies, older families show higher levels of TTS utilization than do younger families (Figure 5b), suggesting the possibility that many of these Alu-TTS are preserved via selection by virtue of their functional utility for the host gene.
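The exact formula for strength of utilization is given in the Methods and is not reproduced in this section. As a rough illustration only, one plausible reading is the fraction of transcripts reaching a site that end there. The function below assumes that reading, and its name and inputs are hypothetical.

```python
def utilization_strength(terminating_tags, read_through_tags):
    """Fraction of transcripts reaching a TTS that terminate at it.

    Assumed definition: terminated / (terminated + read through)."""
    total = terminating_tags + read_through_tags
    if total == 0:
        raise ValueError("no transcripts observed at this site")
    return terminating_tags / total


# Hypothetical site: 30 tags end here and 70 continue past it.
strength = utilization_strength(30, 70)
print(strength)
```

Under this reading, a value near 1 marks a strong terminator and a value near 0 marks a site that is mostly read through.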
In light of the exceptional ability of Alus to provide TTS to human genes, we explored the specific sequence context by which these elements terminate transcription. To do this, we mapped the locations of Alu-derived TTS to their positions in the Alu subfamily consensus sequences. Previously, when a few hundred Alu-TTS were considered as an ensemble, they were found to terminate human gene transcription non-randomly at two specific locations along their sequence [11, 12]. For this study, by considering thousands of Alu-TTS among individual Alu subfamilies of different relative ages, we were able to tease apart this apparently bimodal pattern of termination and discern its origins. The modern Alu element is a dimeric sequence composed of two related precursor sequences: a Free Left Alu Monomer (FLAM) and Free Right Alu Monomer (FRAM) [24, 25]. These sequences themselves descended from the Fossil Alu Monomer (FAM), which in turn descended from a 7SL RNA. Elements from all three families of Alu precursors terminate transcription at a single site near their 3′-end (Figure 5c). However, when the FLAM and FRAM monomers are considered with respect to their homologous locations in the descendant Alu dimer sequences, these individual termination sites yield a pair of corresponding termination sites: one internal termination site corresponding to the FLAM 3′ site and a 3′ termination site corresponding to the FRAM 3′ site. In modern Alus, the 3′ termination site predominates over the internal site, and the use of the internal site markedly decreases among younger element sequences (Figure 5c). The attenuation in the strength of this TTS donating site from the internal region of the element may reflect the need of the elements themselves to produce full-length transcripts in order to be transposed. In this case, selection against the internal TTS site would be at the level of the element as opposed to at the level of the host.
Thus, the steady migration over time of the Alu-TTS donating site to the 3′ end of the element reflects a complex dynamic between inter-element selection and the effects that the elements can in turn exert on their host genome.
It should be noted that the TTS-enriched positions for Alu subfamilies seen in Figure 5 are upstream of oligo-A sequences found in the elements. Since the PET technique utilizes poly-dT primers for the generation of cDNAs, apparent TTS associated with such oligo-A sequences could represent artifacts of internal priming on mRNA sequences. While the PET technique does include a biotin enrichment step that is designed to eliminate such non-full-length cDNAs generated from internal priming, it is formally possible that some experimental artifacts remain after this step. We implemented a series of controls in order to ensure that the Alu-TTS observed here are not likely to represent experimental artifacts from internal priming. Methodological details of these controls and the results can be found in Additional file 1: Tables S3-S5, Figures S6-S9. Overall, Alu-TTS characteristics are not consistent with internal priming PET artifacts, and the chromatin environment of Alu-TTS closely resembles the chromatin environment of non-TE-TTS and differs markedly from the chromatin environment of genic Alus.
The L1 family is curious, being the only TE family to show a strong antisense bias for those insertions providing a TTS (Figure 3), yet at the same time showing no difference in TTS strength of utilization between sense and antisense insertions (Figure 6). Han et al. showed that L1 insertions are capable of terminating transcription in either the sense or antisense orientation, with several polyadenylation signals occurring in the antisense orientation. The same study also showed that L1 insertions can cause transcriptional disruption when in the sense orientation, independent of polyadenylation. As the PET technique requires that transcripts be polyadenylated, the data used here cannot take into account non-polyadenylated transcriptional disruption by L1s. Therefore, the anomalous L1 patterns observed here with respect to both TTS orientation bias and strength of utilization may reflect the relative usage of polyadenylation in L1-TTS from the different strands.
In light of the results on the orientation bias of TE-TTS (Figure 3), we also compared the strength of utilization for TE-TTS found in sense versus antisense orientations relative to the direction of transcription. Five out of eight of the TE families (Alu, ERV, L2, MaLR and MIR) showed a significant difference (P < 0.01, Wilcoxon rank-sum test) in TTS strength of utilization depending on the orientation of the insertion. In all five of these families, TTS derived from sense insertions are more likely to be utilized than those derived from antisense insertions (Figure 6). These results are consistent with the findings from the overall TE orientation bias in human genes, suggesting that selection acts to remove TE-derived terminators that disrupt gene expression.
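The sense versus antisense comparison can be sketched with a self-contained two-sided Wilcoxon rank-sum test. This version uses the normal approximation with a tie correction and no continuity correction, which is an assumption, since the software and settings used for the reported P values are not given here. The utilization values are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) test.

    Returns (U statistic for x, z score, approximate P value) using the
    normal approximation with a tie correction."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i                   # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        raise ValueError("all observations are identical")
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p


# Hypothetical utilization strengths for sense and antisense TE-TTS.
sense = [0.80, 0.90, 0.70, 0.95, 0.85]
antisense = [0.10, 0.20, 0.30, 0.15, 0.25]
u, z, p = rank_sum_test(sense, antisense)
print(u, round(z, 3), round(p, 4))
```

For small samples an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch short.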
The apparent cell type specificity of many TE-TTS suggests the possibility that the TE-TTS discovered in this study via the analysis of eight ENCODE cell types represent only a fraction of the total complement of TE-TTS that exist in the human genome. To address this possibility, we computed a rarefaction curve for TE-TTS by calculating the number of unique TE-TTS found using all possible combinations of 1–8 of the cell types analyzed here (Figure 7b). We then fit this rarefaction curve with a logarithmic trend line (y = 31.34 ln x + 33.61; r = 0.99) to evaluate the extent to which the percent of detected TE-TTS is expected to change with increasing numbers of cell types. Based on the observed trend, we estimated that doubling the number of cell types included in a study of this kind would result in only a 20% increase in the number of TE-TTS found, suggesting substantially diminishing returns with respect to the discovery of novel TE-TTS as more cell types are added. Similarly, the number of genes found to contain a TE-TTS leveled off as more cell types were included. Nevertheless, taking 210 as the total number of human cell types indicates that the TE-TTS discovered here represent half of the total number of human gene TE-TTS. Thus, TEs may provide close to 20,000 TE-TTS for ~11,000 human genes.
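The extrapolation from the fitted trend line can be checked numerically. The sketch below evaluates y = 31.34 ln x + 33.61 at 8, 16 and 210 cell types, assuming that y is expressed as a percentage of the TE-TTS detected with all eight cell types. The function name is hypothetical.

```python
import math


def percent_detected(n_cell_types, slope=31.34, intercept=33.61):
    """Evaluate the fitted logarithmic trend line y = slope * ln(x) + intercept."""
    return slope * math.log(n_cell_types) + intercept


y8 = percent_detected(8)      # the eight cell types analyzed
y16 = percent_detected(16)    # doubling the number of cell types
y210 = percent_detected(210)  # the assumed total number of human cell types

print(round(y8, 1), round(y16 - y8, 1), round(y210 / y8, 2))
```

Doubling x always adds 31.34 × ln 2, or about 22 percentage points, which matches the roughly 20% increase quoted above. The value at 210 cell types is about twice the value at eight, which matches the estimate that the sites found here are about half of the total.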
It has been appreciated for some time that TE sequences within the introns of human genes show a strong antisense orientation bias [19, 28]. It was proposed that this bias is due to the propensity of the TE sequences to terminate transcription of host genes when inserted in the sense orientation, resulting in selection against such sense-oriented insertions. Nevertheless, studies to date on the ability of TEs to terminate transcription have not revealed evidence in support of this hypothesis [11, 12]. Here, for the first time, we provide genome-scale evidence in support of the notion that the antisense orientation bias of TEs can be attributed to their ability to preferentially terminate host gene transcription when inserted in the sense orientation. We have shown that TE sequences that provide a TTS are significantly more likely to be found in the sense orientation than other intragenic TE sequences (Figure 3) and that TE-TTS in the sense orientation terminate transcription much more efficiently than those found in the antisense orientation (Figure 6). Nevertheless, there may be additional as yet unknown factors that also contribute to the observed antisense orientation bias of human TEs.
Among the eight TE families studied here, the Alu, ERV and MaLR families are distinct from the other five families. They all exert substantial effects on the expression of human genes via the termination of transcription, but they do so using distinct genome-wide metastrategies. TTS derived from Alu sequences are generally weakly utilized compared to other TE families, while at the same time having a weak antisense orientation bias. The weaker orientation bias of Alu sequences suggests that there is weaker selection against Alu sequences inserted in sense. We suggest that this weaker selection is due to the generally weak utilization of Alu-TTS. Conversely, LTR elements, the ERV and MaLR families, show a very strong antisense bias and a strong utilization; such strong utilization may account for the strong antisense orientation of LTR elements. The Alu family, by providing many relatively weak TTS, can affect a large number of genes, albeit in a subtle way on a gene-by-gene basis, whereas LTR elements have much larger effects on the expression levels of a smaller number of genes.
Evidence reported here points to the contribution of TE sequences to the cell type-specific termination of transcription; we have shown that internal TTS derived from TE sequences are significantly more cell type-specific compared to canonical TTS (Figure 7a). In this way, TE sequences have contributed substantially to the generation of cell type-specific patterns of human gene expression via the premature termination of transcription. In addition to providing for cell type-specific termination of transcription, data reported here indicate that TE sequences are also likely to have contributed substantially to evolutionary lineage-specific transcription termination. Numerous TE insertions can be generated in a short evolutionary time, and accordingly the majority of human TE subfamilies are lineage-specific. This means that the regulatory effects that these TEs exert on their host genomes, including termination of transcription as shown here, will also be lineage-specific and account for regulatory differences between evolutionary lineages.
The Alu family, for example, is a relatively young family of TEs, which is confined to the primate evolutionary lineage. The Alu family has been active throughout primate evolution and has likely been altering primate gene expression via TTS donation since the origin of the primate lineage, as can be seen from the results on the more ancient Alu antecedents from the FAM-related subfamilies (Figure 5). This process appears to have accelerated, leading to even more species-specific differences in transcription termination, with the amplification of the more modern Alu dimers (Figure 5).
The abundance of TE insertions across eukaryotic lineages suggests that the effect of TE insertions on gene expression via the termination of transcription is not limited to humans. In this study, we characterized the involvement of eight evolutionarily diverse families of TEs in the termination of transcription. TEs related to these eight families are present in the genomes of many other eukaryotes. For instance, while LTR elements are functionally dead in humans, multiple LTR element families are still highly active in other species, e.g., the intracisternal A particle (IAP) family of mouse. Indeed, it has been estimated that 10% of mutations in mouse are caused by the novel retrotransposition of an LTR element. As a consequence of this, mice presumably have to contend with a great deal of deleterious transcription termination via novel LTR element insertions. However, these novel insertions also provide the opportunity for innovation in the regulation of gene expression.
Mappings of ENCODE PET data from the GM12878, H1HESC, HepG2, HeLaS3, HUVEC, K562, NHEK and prostate cell types were downloaded from the ENCODE repository on the UCSC genome browser for the hg18 version of the human genome [14, 15]. PET data from nucleus (GM12878, HepG2, HeLaS3, HUVEC, K562 and NHEK) or whole-cell (H1HESC and prostate) were used to characterize TTS. PET 3′-ends from the same data set that overlapped or were separated by 20 or fewer bases were taken as putative TTS clusters. Only those TTS clusters that had a normalized PET tag count of at least 20 per 10 million tags mapped in at least one cell type were considered for further analysis. For these clusters, the specific locations of the TTS for each cluster were taken to be the base with the highest density of mapped PET 3′-ends. TTS clusters across different cell types that overlapped by at least 80% were taken to be the same TTS.
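The clustering and filtering steps can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study. It assumes that tags are half-open (start, end) intervals on a single chromosome and strand, and that "separated by 20 or fewer bases" means the gap between one tag's end and the next tag's start. The function names and example coordinates are hypothetical, and the 80% overlap rule for merging clusters across cell types is not covered.

```python
def cluster_three_prime_ends(tags, max_gap=20):
    """Group (start, end) tag intervals into putative TTS clusters.

    A tag joins the current cluster when it overlaps it or starts at most
    `max_gap` bases past the cluster's current end."""
    clusters = []
    for start, end in sorted(tags):
        if clusters and start - clusters[-1]["end"] <= max_gap:
            cluster = clusters[-1]
            cluster["end"] = max(cluster["end"], end)
            cluster["tags"].append((start, end))
        else:
            clusters.append({"start": start, "end": end, "tags": [(start, end)]})
    return clusters


def normalized_count(n_tags, library_size, per=10_000_000):
    """Tag count scaled to a library of `per` mapped tags."""
    return n_tags * per / library_size


# Hypothetical 3' end tags: the first three chain together, the fourth is isolated.
tags = [(100, 120), (115, 135), (150, 170), (300, 320)]
clusters = cluster_three_prime_ends(tags)
print([(c["start"], c["end"], len(c["tags"])) for c in clusters])

# A cluster with 50 tags in a library of 20 million mapped tags passes the
# threshold of 20 per 10 million.
print(normalized_count(50, 20_000_000) >= 20)
```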
UCSC gene model annotations were used to associate TTS defined in this way with known human genes. A TTS was considered to be associated with a gene if the linked 5′ ends of the PET tags were mapped to the annotated promoter of the gene and the linked 3′ end TTS cluster was found within the annotated transcriptional unit or up to 5-kb downstream of the canonical annotated TTS. Human gene TTS characterized in this way were then co-located with TE sequences using the RepeatMasker annotations. As it has been previously shown that transcription termination occurs within 50 bp of the polyadenylation signal, TE-TTS were defined as those TTS clusters for which the peak base was at least 50 bp downstream from the start of a TE insertion and less than 15 bp downstream of the end of the insertion.
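The positional rule for calling a TE-TTS can be written as a small predicate. The sketch assumes that "downstream" is measured along the direction of gene transcription, so the start and end of the insertion are swapped for genes on the minus strand. That interpretation, the function name and the coordinates are assumptions made for illustration.

```python
def is_te_tts(peak, te_start, te_end, gene_strand, min_into_te=50, max_past_te=15):
    """Return True if a TTS peak base satisfies the TE-TTS positional rule.

    te_start < te_end are genomic coordinates of the TE insertion. Offsets are
    measured in the direction of gene transcription (an assumption)."""
    if gene_strand == "+":
        offset_from_start = peak - te_start
        offset_past_end = peak - te_end
    elif gene_strand == "-":
        offset_from_start = te_end - peak
        offset_past_end = te_start - peak
    else:
        raise ValueError("gene_strand must be '+' or '-'")
    return offset_from_start >= min_into_te and offset_past_end < max_past_te


# Hypothetical insertion spanning 1000-1300 in a plus-strand gene.
print(is_te_tts(1060, 1000, 1300, "+"))  # 60 bp into the element
print(is_te_tts(1030, 1000, 1300, "+"))  # only 30 bp into the element
print(is_te_tts(1310, 1000, 1300, "+"))  # 10 bp past the end of the element
```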
The chromatin environment of PET-characterized TTS was assessed using ENCODE ChIP-seq data. Where available for the same cell types as the PET data, ChIP-seq reads for the H3K9Ac, H3K27Me3 and H3K36Me3 modifications were downloaded from the ENCODE repository on the UCSC genome browser [15, 16] and mapped to the human genome reference sequence (UCSC hg18; NCBI build 36.1) using the Bowtie short read alignment utility. Tags that mapped to multiple locations were resolved using the GibbsAM utility. The average numbers of ChIP-seq tags were computed in 5-bp windows within ±5 kb of (1) TE-derived TTS, (2) intragenic TE insertions that do not provide a TTS and (3) non-TE-derived TTS.
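The windowed averaging can be illustrated with a deliberately naive sketch. It assumes that each ChIP-seq tag is reduced to a single position on the same chromosome as the sites, and that profiles are oriented by gene strand so that positive offsets are downstream. Both are assumptions, as are the function name and the example positions.

```python
def average_tag_profile(sites, tag_positions, flank=5000, window=5):
    """Average number of tags per site in fixed windows around a set of sites.

    sites: list of (position, strand) anchors, e.g. TTS peak bases.
    tag_positions: list of tag positions on the same chromosome.
    Returns one average per window, ordered from -flank to +flank."""
    if not sites:
        raise ValueError("at least one site is required")
    totals = [0] * (2 * flank // window)
    for site, strand in sites:
        for tag in tag_positions:
            # Orient the offset so that positive values are downstream.
            offset = tag - site if strand == "+" else site - tag
            if -flank <= offset < flank:
                totals[(offset + flank) // window] += 1
    return [count / len(sites) for count in totals]


# Hypothetical anchors on opposite strands and three tag positions.
sites = [(10000, "+"), (20000, "-")]
tags = [10002, 19998, 9990]
profile = average_tag_profile(sites, tags)
print(len(profile), profile[1000], profile[998])
```

The nested loop is quadratic and is meant only to make the calculation explicit. A real analysis would index the tags, for example with sorted positions and binary search.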
To estimate the upper bound for the number of TE-derived TTS in the human genome, we found, for all possible combinations of the eight cell types used here, the number of TE-derived TTS found with each combination. A logarithmic trend line was used to estimate the number of TE-derived TTS that would be found with increasing numbers of cell types. The same analysis was applied for the total number of human genes that bear at least one TE-TTS.
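The combination step of the rarefaction analysis can be sketched with `itertools.combinations`. The example counts the unique TE-TTS in the union of every combination of k cell types and averages over combinations of the same size. The averaging is an assumption about how the points on the curve were summarized, and the cell type names and site identifiers are toy data.

```python
from itertools import combinations


def rarefaction(tts_by_cell_type):
    """Mean number of unique TTS found for every subset size k.

    tts_by_cell_type maps a cell type name to the set of TTS identifiers
    detected in it. Returns {k: mean size of the union over all k-subsets}."""
    names = sorted(tts_by_cell_type)
    curve = {}
    for k in range(1, len(names) + 1):
        counts = [
            len(set().union(*(tts_by_cell_type[name] for name in combo)))
            for combo in combinations(names, k)
        ]
        curve[k] = sum(counts) / len(counts)
    return curve


# Toy data: three hypothetical cell types with partially shared sites.
toy = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 5}}
curve = rarefaction(toy)
print(curve)
```

With eight cell types there are 255 non-empty combinations, so exhaustive enumeration is inexpensive. The resulting points can then be fitted with a logarithmic trend line.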
TTS: Transcription termination site
TE-TTS: Transposable element-derived transcription termination site
LTR: Long terminal repeat
MIR: Mammalian interspersed repeat
MaLR: Mammalian apparent LTR
ChIP-seq: Chromatin immunoprecipitation followed by high-throughput sequencing.
ABC was funded by the School of Biology, Georgia Institute of Technology. IKJ was funded by the School of Biology, Georgia Institute of Technology, and the Alfred P. Sloan Research Fellowship in Computational and Evolutionary Molecular Biology (BR-4839).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.