Human transposons are an abundant supply of transcription factor binding sites and promoter activities in breast cancer cell lines

Background Transposable elements (TE) are commonly regarded as “junk DNA” with no apparent regulatory roles in the human genome. However, a growing body of evidence demonstrates that some TEs exhibit regulatory activities in a range of biological pathways and diseases, with notable examples in bile metabolism and innate immunity. TEs are typically suppressed by epigenetic modifications in healthy somatic tissues, which prevents both undesirable effects of insertional mutagenesis, and also unwanted gene activation. Interestingly, TEs are widely reported to be dysregulated in epithelial cancers, and while much attention has been paid to their effects on genome instability, relatively little has been reported on their effects on gene regulation. Here, we investigated the contribution of TEs to the transcriptional regulation in breast cancer cell lines. Results We found that a subset of TE subfamilies were enriched in oncogenic transcription factor binding sites and also harboured histone marks associated with active transcription, raising the possibility of these subfamilies playing a broad role in breast cancer transcriptional regulation. To directly assess promoter activity in triple negative breast cancer cell lines, we identified four breast cancer-associated genes with putative TE-derived promoters. TE deletion confirmed a contribution to promoter activity in all cases, and for two examples the promoter activity was almost completely contained within the TE. Conclusions Our findings demonstrate that TEs provide abundant oncogenic transcription factor binding sites in breast cancer and that individual TEs contain substantial promoter activity. Our findings provide further evidence for transcriptional regulation of human genes through TE exaptation by demonstrating the regulatory potential of TEs in multiple breast cancer cell lines. Electronic supplementary material The online version of this article (10.1186/s13100-019-0158-3) contains supplementary material, which is available to authorized users.


Background
Transposable elements (TE), or transposons, are repetitive genetic elements that are ubiquitous in eukaryotic genomes [1,2]. When first discovered in maize by Barbara McClintock in the mid-1940s, TEs were proposed to be "controlling elements" capable of regulating gene activity [3,4]. McClintock's theory was initially dismissed, and the prevailing view was that TEs were "junk" or "selfish" DNA sequences with no apparent regulatory roles [5][6][7].
However, in more recent years, McClintock's theory of TEs as gene expression regulators has been revised and refined by emerging evidence showing that they do indeed play a role in modulating and reshaping host transcriptional regulatory networks [8][9][10][11]. In fact, the regulatory roles of TEs are not rare events exclusive to plants, but are common to almost all eukaryotic evolutionary lineages, including humans [12][13][14].
The regulatory activity of TEs is derived from the cis-regulatory elements within their sequences, which include internal promoters and binding sites that can be recognised by host transcription factors (TF) and RNA polymerases (RNA Pol) [2,8,10,[15][16][17]. For example, LTR retrotransposons, which originated from retroviral infections that successfully integrated into the host germline, originally contained an RNA Pol II promoter in each of their long terminal repeats (LTR) [2]. This compatibility is essential for the transcription of TEs, and allows them to exploit host machinery to aid their proliferation in the host genome [2]. While TEs directly benefit from host compatibility, the host also stands to benefit through exaptation of TE sequences to modify or create transcriptional networks for the regulation of host genes. During the course of evolution, many TEs remaining in the human genome have been exapted to contribute promoters or enhancers of human genes [9,15,18,19]. In this study, we focus specifically on the promoter activity of TEs in the human genome.
Human TEs harbour a substantial number of TF binding sites (TFBS). On average, TEs are estimated to contain1 9% of the total TFBSs within the human genome [8]. These binding sites are recognised by a diverse group of TFs, differing greatly in their biological functions. Amongst these are components of the basic transcription machinery, such as the TATA box binding protein (TBP), as well as factors involved in the remodelling of chromatin states, such as the chromodomain helicase DNA-binding protein 1 (CHD1) and CHD2 [8]. In addition, some pathway-specific TFs have also been demonstrated to rely on TEs for their binding sites. A recent example is the interferon regulatory factor (IRF), and signal transducer and activator of transcription (STAT) TFs involved in the innate immunity [10].
Furthermore, an early study by Jordan et al. shows that 25% of the > 2000 human promoters documented in The Human Promoter Database contain TE-derived DNA sequences [15]. This finding is supported by recent advances in the Cap Analysis of Gene Expression (CAGE) technology, which maps genome-wide transcription start sites (TSS) by identifying mRNA 5′ end clusters [20]. A defining feature of promoters is the presence of transcription initiation sites, and CAGE data demonstrate that approximately 24% of TSSs (defined by the presence of two or more CAGE tags) are located within TEs [18]. Several studies have identified individual TEs with promoter functions in the genome. One example is MER11A, an ancient LTR acting as the primary promoter to BAAT, which encodes an enzyme involved in bile metabolism [21,22]. Further examples are antisense L1 and Alu elements that together contribute the sole apparent promoters to HYAL-4 and FUT5, which are involved in hyaluronan catabolism and cell adhesion respectively [9]. These examples demonstrate the presence of TE-derived promoter activity in a diverse range of biological pathways.
While these cis-regulatory sequences have endowed TEs with the potential to regulate human gene expression, the transcriptional activity of the majority of TEs are suppressed heavily in somatic tissues [23,24]. This is hypothesised to be a defence mechanism by the host genome, which protects the host from large mutations resulting from active TE mobilisation, as well as abnormal gene expression driven by TE-derived promoters and enhancers [23][24][25]. The human genome employs an array of mechanisms to defend against TE activation, and the most relevant to transcriptional repression are DNA methylation and histone tail modifications [25][26][27].
Increased DNA methylation is often associated with gene silencing, and most TEs are heavily suppressed by DNA methylation and thus rendered inactive in somatic tissues [27][28][29]. Unlike DNA methylation, the relationship between human TEs and histone modifications is highly complex and less well-understood. Kondo and Issa first showed the recruitment of repressive H3K9 methylation to Alu elements in the human genome [26]. However, a study by Huda et al. on CD4+ T cells shows that Alu and L2 elements are enriched in active marks, while L1 and LTR TEs are depleted for active marks, and/or enriched for repressive marks [25]. On the other hand, TEs which are exapted to perform regulatory functions are often associated with active histone marks, resembling those typically associated with active regulatory regions [10,30].
However, the suppressive regulations over TEs, such as DNA methylation, are often alleviated in some disease states, and the transcriptional activation of TEs is commonly observed in tumours [28,31,32]. The oncogenic capacity of TE-derived transcriptional regulation is summarised with specific examples in recent reviews [32,33]. For example, TE-derived promoters have been documented to lead to oncogene activation, followed by neoplastic transformation and disease progression in some types of cancer, such as Hodgkin's lymphoma, bladder tumours and diffuse large B cell lymphoma [34][35][36]. However, the involvement of TE-derived promoters in other cancer types remains largely unknown. In this study, we aimed to address this gap in knowledge by investigating the activity of TE-derived promoters in breast cancer cell lines. More specifically, we focussed on the highly malignant triple negative breast cancer (TNBC) subtype, which lacks the expression or amplification of hormone receptors, and is thus refractory to targeted therapies [37][38][39][40].
In this study, we aimed to investigate the transcriptional activity of TEs in breast cancer, particularly their interactions between TFs and contribution to promoter activity, using breast cancer cell lines as a model. We analysed genome-wide binding sites for three breast cancer-associated TFs (C/EBPβ, E2F1 and MYC) in MCF7 cells, and confirmed that a substantial fraction of these TFBSs resided in TEs. We showed that these TFBSs were distributed across a diverse range of TEs, and identified enriched TE subfamilies with a potentially important role in modulating breast cancer transcription. We found that the binding of breast cancer-associated TFs to TEs was correlated with active histone modifications, further supporting the transcriptional activity of these TEs. We subsequently identified individual TEs within the promoters of breast cancer-associated genes and confirmed that TEs located upstream of SYT1, UCA1, AK4 and PSAT1 contributed promoter activity in TNBC cell lines.

Results
TEs were an abundant source of breast cancer-associated TFBSs To investigate the prevalence of TEs in breast cancer-associated TFBSs, we mapped the genome-wide binding sites for C/EBPβ, E2F1 and MYC in MCF7 breast cancer cells by re-analysing publicly available chromatin immunoprecipitation sequencing (ChIP-seq) datasets. These TFs were selected for analysis based on literature evidence on their oncogenic capacity in cancer [37,[41][42][43][44][45][46][47]. In each TF dataset, high confidence ChIP-seq peaks were called using HOMER, with a false discovery rate threshold of 0.001 [48]. We mapped 26,934 to 101,382 binding sites for each TF, with up tõ 55% of these TFBSs identified to overlap TEs (Table 1). In total, 93,901 TE genomic locations, containing TEs from 1069 subfamilies, were found to harbour at least one breast cancer-associated TFBS.
We then investigated whether individual TE subfamilies were enriched for breast cancer-associated TFBSs compared to their expected coincidence. Binomial tests revealed that 268 individual TE subfamilies, with representatives from all major classes, were found to be significantly enriched in at least one ChIP-seq dataset (p < 4.3E-5 indicated statistical significance). LTRs were observed to contribute more enriched subfamilies than any other class of TEs for all TF datasets investigated, and represented approximately 30% of enriched TE subfamilies (Fig. 1a). DNA transposons contributed approximately 26% of the enriched subfamilies (Fig. 1a), second only to LTRs. The extent of enrichment for highly enriched subfamilies are shown in Fig. 1b as representative results, and all enriched subfamilies are shown in Additional file 1: Figure  S1-S5.

TF binding in TEs was correlated with active epigenetic signatures
We next sought to determine whether these TE subfamilies were likely to be transcriptionally active in breast cancer by re-analysing publicly available ChIP-seq datasets for active (H3K27ac, H3K4me1, H3K4me3 and H3K36me3) and repressive (H3K9me3 and H3K27me3) histone tail modifications in MCF7 cells [49][50][51][52][53]. For each of four highly enriched TE subfamilies (MIRb, L2a, AluJb and L2b), active epigenetic signatures were consistently observed for TE elements bound by TFs, but were not observed for elements of the same subfamily that lacked binding ( Fig. 1c and Additional file 1: Figure S6). Repressive signatures were not observed in either bound or unbound subgroups for any TE subfamily ( Fig. 1c and Additional file 1: Figure S6).
A subset of breast cancer-associated mRNAs and lncRNAs were identified to have putative TE-derived promoters Next, we sought to identify breast cancer-associated genes with putative TE-derived promoters. The Cancer Genome Atlas Network (TCGA) had previously identified 3662 genetic elements as being differentially expressed in breast cancer [54,55]. From the microarray probes for these elements, 3585 could be aligned to the human genome (hg38) with BLASTn [54,55]. The probe targets were annotated with GENCODE mRNAs, as well as FANTOM CAT and NONCODE long non-coding RNAs (lncRNA) using a hierarchical process [56][57][58] (Fig. 2a). In total, 3101 probes were annotated with mRNAs from the GENCODE database [58] (Fig. 2a). An additional 79 and 74 probes were annotated with lncRNAs from the FANTOM CAT and NONCODE databases respectively [56,57] (Fig. 2a).
We then identified genes and lncRNAs with putative TE-derived promoters by intersection with TSSs and TFBSs. Of the~380,000 FANTOM CAT CAGE clusters,~76,800 clusters were located nearby a breast cancer-associated TFBS (window = 300 bp). Amongst these CAGE clusters, 6727 overlapped with TEs. Finally, 56 TE-harboured CAGE clusters were located within the putative promoters for 41 breast cancer-associated genes (identified by 79 GENCODE mRNA transcripts) and 8 FANTOM CAT lncRNAs (for list of CAGE clusters and genes see Additional file 2). TEs with putative promoter activity are listed in Additional file 3.
A literature search was performed for genes and lncRNAs with putative TE-derived promoters to identify those with existing experimental evidence for oncogenic activity. SYT1, UCA1, AK4 and PSAT1 were selected for downstream analysis (Table 2) (literature evidence summarised in Additional file 4: Table S1). In particular, UCA1 encodes an oncogenic lncRNA shown to disrupt multiple tumour suppressive mechanisms. For example, it is found to inhibit the tumour suppressive miR-143 by direct binding, and inhibit translation of the p27 tumour suppressor by competitive inhibition [59,60]. The tumour-specific expression of UCA1 is a potential biomarker for bladder and pancreatic cancer [61][62][63].
TEs contributed significant promoter activity to breast cancer-associated genes For SYT1, UCA1, AK4 and PSAT1, the promoter activity was estimated via luciferase assays in three TNBC cell lines (MDA-MB-468, MDA-MB-231 and BT549). To estimate the contribution of TEs to the promoter activity, TEs hypothesised to harbour promoter activity were removed from the wild-type promoters, and the remaining DNA sequences within the promoters were tested for promoter activity by luciferase assays. TE deletion resulted in reduced activity of all four promoters in at least one cell line. Strikingly, deletion of L1PA2-SYT1 resulted in near complete ablation of expression in MDA-MB-468 and MDA-MB-231 cells, where only 2.6 and 11.2% of promoter activity remained after deletion respectively (Fig. 3). A significant decrease in promoter activity was also observed in BT549 cells, with 53.7% of activity remaining after L1PA2-SYT1 deletion. Similar results were observed for the deletion of LTR7C-UCA1. Relative luciferase activity was reduced to 12.6 and 19.4% in MDA-MB-468 and MDA-MB-231 cells respectively, and a 52% remaining promoter activity was observed in BT549 cells (Fig. 3). For AK4, TE deletion resulted in a trend of reduced promoter activity; however, statistical significance was only achieved in BT549 cells (Fig. 3). In addition, the promoter activity Fig. 1 Enriched TE subfamilies in breast cancer-associated TFBSs. a 268 out of 1156 TE subfamilies, with representatives from all major classes, were found to be significantly enriched in breast cancer-associated TFBSs, as identified by a binomial test, where p < 4.3E-5 indicated statistical significance. The abundance of TE subfamilies in the human genome, categorised into classes, are shown for reference. b Heatmap demonstrates the level of enrichment of the most highly enriched TE subfamilies (> 7.5 enrichment ratio in all TF datasets) (ER = Enrichment Ratio). c Subfamilies enriched in the binding sites of C/EBPβ, E2F1 and MYC were investigated for their association with histone marks (H3K27ac, H3K4me1, H3K4me3, H3K36me3, H3K9me3 and H3K27me3) in MCF7 cells using published ChIP-seq datasets. The left panels display the histone modification profile of TE elements containing TFBSs, while the right panels display the profile of TE elements lacking TFBSs. The average RPKM values at 50 bp resolution over a 10 kb region centred on the TEs (filled ractangles) are plotted for i) MIRb and ii) L2a of PSAT1 showed a significant decrease in BT549 and MDA-MB-468 cells following the removal of TE, where only 55.5-61.2% of promoter activity remained (Fig. 3). Although less than 70% of promoter activity was observed in MDA-MB-231 cells following MIR3-PSAT1 removal, the significance of the effect was likely masked by the variability present (Fig. 3).

TE-derived promoter activity was correlated with epigenetic derepression
To further characterise these TE-derived promoters, we investigated whether they displayed differential methylation in breast cancer tumours and whether they resided in open chromatin in breast cancer cell lines, by analysing published whole-genome methylation capture sequencing and DNase-seq datasets [64,65] (for DNase-seq data source see Additional file 4: Table S2). L1PA2-SYT1 was found to be hypomethylated in TNBC tumours (t 5 = − 2.71, p = 0.021).

Discussion
TEs have been demonstrated to be an abundant reservoir of cis-regulatory sequences compatible with human transcription machinery [2, 8-10, 18, 22]. They supply a substantial number of TFBSs and promoters to human genes, and thereby modulate gene expression in normal biological pathways as well as disease progression [8,9,32]. While most TEs are under tight, suppressive regulation in somatic tissues, they often escape epigenetic repression in cancer and contribute promoters which cause abnormal oncogene expression [25-28, 31, 32]. TE-derived promoters have been found to drive tumorigenic gene expression in Hodgkin's lymphoma, bladder tumours and diffuse large B cell lymphoma [34][35][36]; however, the contribution of TE-derived promoters to breast cancer transcriptional regulation is poorly studied. Here, we addressed this gap in knowledge by investigating the landscape of TF binding of TEs, as well as identifying and Fig. 2 Identification of breast cancer-associated mRNAs and lncRNAs using published microarray data. a Microarray probes identified by TCGA as identifying differentially expressed genes in breast cancer were re-annotated with human mRNAs and lncRNAs using a hierarchical process [56][57][58]. 3585 out of 3662 probes were aligned to the human genome (hg38) by BLASTn [55]. The targets of the aligned probes were annotated by intersecting probe locations with GENCODE mRNAs, FANTOM CAT lncRNAs and NONCODE lncRNAs in order of descending priority [56][57][58]. 3101 probes were annotated with mRNAs [58]. 79 probes were annotated with lncRNAs from FANTOM CAT [56]. 74 probes were annotated with lncRNAs from NONCODE [57]. Probes that could not be aligned or annotated were discarded from downstream analysis. b Breast cancer-associated genes with putative TE-derived promoters were identified by a hierarchical process. TSSs, indicated by CAGE clusters defined by FANTOM CAT [56] were intersected with breast cancer-associated TFBSs using a 300 bp window, followed by intersecting with TEs and promoters of breast cancer-associated mRNAs and lncRNAs. 41 genes (79 mRNAs) and 8 FANTOM CAT lncRNAs were identified to have putative TE-derived promoters validating oncogene-associated TE-derived promoters in breast cancer cell lines. Human TEs have been reported to contain binding sites for TFs, with activities ranging from general transcription initiation (e.g. TBP) to chromatin remodelling (e.g. CHD1 and CHD2), as well as those with specialised roles in cellular functions (e.g. IRF and STAT involved in innate immunity) [8,10]. Here, we investigated the extent to which TEs contributed binding sites for three breast cancer-associated TFs (C/EBPβ, E2F1 and MYC) in breast cancer. On average, TEs contributed~38% of the binding sites in MCF7 cells, and up to 54.6% of binding sites for C/EBPβ (Table 1). This demonstrates that TEs represent an abundant source of breast cancer-associated TFBSs.
TEs can be categorised hierarchically into classes, families and subfamilies based on their sequence features [1]. Although many TEs originally contained cis-regulatory sequences that were compatible with the host transcriptional machinery, their sequence can be truncated during initial insertion, and acquired mutations can degrade these cis-regulatory sequences [2]. In particular, LTR retrotransposons are an abundant source of cis-regulatory sequences and are often exapted for the regulation of human genes [8,10]. The pervasive regulatory activity of LTRs can be explained by the fact that 85% of the LTR retrotransposons in the human genome consist of long terminal repeats only, which contain the original RNA Pol II promoter sequences [1,2]. On the contrary, the majority of human DNA transposons exist in the form of miniature inverted-repeat transposable elements (MITE), many of which lack cis-regulatory elements such as internal promoters [2,66]. Of the total 1156 TE subfamilies investigated, we identified 268 subfamilies with significant enrichment for breast cancer-associated TF binding. Notably, LTRs represent 30% of all significantly enriched subfamilies (Fig. 1a). While we observe enrichment for TF binding within some LTR subfamilies, we do not observe a general enrichment of LTRs as a class of TEs (50.3% of all TE subfamilies). Surprisingly, DNA transposons con-tributed~26% of the enriched TE subfamilies for all TFs investigated (compared to 19% of all subfamilies in the human genome) (Fig. 1a) [1]. In particular, the MER136 subfamily was amongst the top enriched subfamilies, suggesting a potential role in breast cancer transcriptional regulation (Fig. 1b). While DNA transposons are less likely to retain cis-regulatory activity, it has been demonstrated that the insertion of mPing, a MITE in plants, could Fig. 3 The effect of TE deletion on relative luciferase activity in MDA-MB-468, MDA-MB-231 and BT549 cells. Luciferase assays were performed for SYT1, UCA1, AK4 and PSAT1 in 3 independent experiments (n = 3) with triplicates. Bar graphs represent the mean relative luciferase activities, with white bars for wild-type and black bars for TE deletion. Error bars represent the standard deviations among independent experiments. The relative luciferase activity of TE-deleted constructs was compared against the corresponding wild-type using a one-tailed t-test, assuming equal variances. *p < 0.05, **p < 0.01, ***p < 0.001 render neighbouring genes stress-inducible in rice, and this TE-derived regulation is possibly due to the TFBSs present in the MITE [67].
The activity of TEs often occurs at the cost of host genomic stability, as active TEs can generate large mutations or drive unwanted gene expression [24,32]. The human genome has thus evolved several defence mechanisms against TE activity, one of which is epigenetic suppression via histone tail modifications [25,26]. As a result, human TEs in somatic tissues are often associated with repressive histone modifications, such as H3K9 methylation [25,26]. On the other hand, TEs exapted to perform regulatory roles exhibit an active histone modification profile, such as H3K27ac, H3K4me1 and H3K4me3, which is similar to that observed in enhancer or promoter regions [10,30]. In this study, we examined the histone modification status of the TF-bound TEs and confirmed that TEs contributing TFBSs showed an active epigenetic signature characterised by increased H3K27ac, H3K4me1 and H3K4me3 ( Fig. 1c and Additional file 1: Figure S6). H3K4 methylation is often found in both active enhancers and promoters; however, active promoters are also likely to bear other histone modifications, such as increased H3K36me3 [68]. Furthermore, H3K27ac has been established as an epigenetic signature indicating active enhancer elements [49]. We did not find any notable difference in the extent of truncation between bound and unbound elements that may help explain this difference in TF binding (Additional file 1: Figure S7). The co-occurrence of TF binding and active histone modifications indicate a possible co-option of the TF-bound TEs for regulatory roles.
Next, we sought to validate the promoter activity of TF-bound TEs in breast cancer cell lines. The intersection of breast cancer-associated genetic elements with CAGE clusters, TFBSs in MCF7 cells and TEs provided a subset of putative TE-derived promoters. We selected the putative TE-derived promoters of four candidate genes (SYT1, UCA1, AK4 and PSAT1) for in vitro validation. These genes were prioritised based on existing literature describing oncogenic characteristics in various cancers [59][60][61][62][63][69][70][71][72][73][74][75][76][77] (summarised in Additional file 4: Table S1). Our filtering process was conservative and required multiple levels of evidence for promoter identification. Final promoter selection was also performed according to existing literature evidence, rather than a prediction of promoter activity. This approach would underestimate the overall contribution of TEs to promoter activity in breast cancer, but would provide disease-relevant candidates for validation.
To evaluate the promoter activity of candidate TEs, the wild-type and TE-deleted promoters were cloned for each gene (Fig. 4). Luciferase assay results indicated that all analysed TEs contributed promoter activity in TNBC cell lines (Fig. 3). In the case of L1PA2-SYT1, TE deletion almost completely abolished the promoter activities in two of the three cell lines examined (Fig. 3). Similar results were observed for LTR7C-UCA1, while TEs contributed significant promoter activity to AK4 and PSAT1 in some, but not all of the cell lines investigated (Fig. 3). Interestingly, we found that the L1PA2-SYT1 and LTR7C-UCA1 elements were hypomethylated, and/or more accessible to transcriptional machinery in breast cancer, as indicated by whole-genome methylation capture sequencing and DNase-seq data. The loosening of epigenetic control over these TEs coincided with their strong promoter activities, supporting a link between epigenetic derepression and the regulatory function of individual TEs. The transcriptional activation of the TE-derived promoters potentially led to the aberrant expression of SYT1 and UCA1 reported by TCGA [54], as well as activation of their oncogenicity.
The L1PA2-SYT1 activity was likely driven by the L1 antisense promoter (L1-ASP) located in the first 500 bp of the TE. Similar oncogene activation by L1-ASP has been reported in other cancer types, such as MET activation in bladder cancer [35,78,79]. SYT1 is a protein-coding gene found to facilitate the export of the oncogenic growth factor FGF-1 [71]. The promoter activity of L1PA2-SYT1 in breast cancer cell lines coincided with its increased DNAse sensitivity and decreased DNA methylation, suggesting the L1PA2-SYT1 element can promote expression of the proto-oncogene SYT1. It remains unknown whether this promoter activity is tumour-specific, and whether it contributes to SYT1 expression under normal physiological conditions. Nonetheless, SYT1 expression is likely mediated by the L1PA2-SYT1 transposon in the context of breast cancer cell lines.
The LTR7C-UCA1 element also contributed essential promoter sequences to the UCA1 promoter (Fig. 3). This region contains the majority of UCA1 CAGE tags and the transcription initiation sites of all annotated UCA1 transcripts, and major peaks of CAGE tags are harboured within the LTR7C element, as shown by the ZENBU genome browser [80] (Additional file 1: Figure S8). Taken together, this suggests that the identified LTR7C element acts as the primary promoter for the UCA1 oncogene, where the promoter activity is likely driven by the internal promoter within the long terminal repeat. LTR-driven oncogene activation has been reported in multiple cancers, with a notable example in Hodgkin's lymphoma where an LTR element contributes an alternate promoter causing ectopic activation of the CSF1R oncogene [32][33][34]. Interestingly, UCA1 encodes an oncogenic lncRNA that inhibits the tumour suppressive miR-143 by direct binding, and also disrupts the translation of tumour suppressor protein p27 through competitive inhibition [59,60]. UCA1 expression has been proposed to be a potential biomarker for bladder and pancreatic cancer [61][62][63]. Confirmation of LTR7C as a critical promoter element for oncogenic lncRNA activation demonstrates the contribution of TE-derived regulatory elements to breast cancer transcriptional regulation.
It is worth noting that many TF-bound TE subfamilies exhibited epigenetic profiles resembling active enhancer regions ( Fig. 1c and Additional file 1: Figure S6). Thus, the contribution of TEs to breast cancer transcriptional regulation is likely to be much larger than the few examples highlighted in this study. Further investigation of the regulatory activity and subsequent biological effects of TE-derived enhancers will likely demonstrate extensive exaptation of TEs for oncogene regulation in cancer.

Conclusions
The exaptation of TEs in the transcriptional regulation of human genes has been demonstrated in an extensive range of biological pathways and cellular functions. TEs serve as a supply of binding sites for RNA polymerases and transcription factor enzymes, and can influence host gene expression by providing promoter or enhancer activities. TEs have also been found to be released from transcriptional repression in several types of epithelial cancer, with specific examples having a direct tumorigenic effect. Here, we have provided a focussed analysis of TE-mediated TF binding and promoter activity in breast cancer cell lines.
We demonstrate that TEs are an abundant source of binding sites for TFs known to show oncogenic activity in breast cancer transcriptional networks. In particular, a number of TE subfamilies were significantly enriched in these TFBSs relative to their genomic occupancy. Within these subfamilies, those bound by TFs were associated with active epigenetic signatures, raising the possibility of these TEs playing a widespread role in breast cancer transcriptional regulation. We subsequently analysed the contribution of TEs to the promoter activities of human genes in three TNBC cell lines. We limited our search to a list of genes identified by the TCGA as being differentially expressed in breast cancer, then focussed on those reported to have oncogenic properties.
While this study is not exhaustive, we have integrated multiple data sources to provide the first evidence for TE-derived transcriptional regulation in breast cancer cell lines. It is likely that more examples will be demonstrated through further study.

Identification of breast cancer-associated TFBSs
To identify TFBSs for C/EBPβ, E2F1 and MYC across the genome, ChIP-seq datasets from MCF7 breast cancer cells were retrieved from GEO and ENCODE in the format of Fig. 4 Assembly of the reporter constructs. The wild-type promoters, as well as promoters with the targeted TEs removed (TE-Del), were ligated into the pGL3 Basic vectors, upstream of the luciferase genes. The genomic locations (hg38) of the promoter regions and the sizes of deletions are shown for each of the candidate genes: a SYT1, b UCA1, c AK4, d) PSAT1. Arrows represent TE orientation relative to the candidate genes. Triangles represent the deletions Sequence Read Archive (SRA) and FASTQ respectively [64,81] (for sources of datasets see Additional file 4: Table  S3). Datasets in the SRA format were converted to the FASTQ format with FASTQ-DUMP [82]. The FASTQ sequences (reads) were aligned to the human genome (hg38) via the Burrows-Wheeler Aligner (BWA) program, keeping all possible alignments (−a option) [83]. The mapping of multi-mapped reads were improved using the Multi-Mapper Resolution (MMR) program with 3 iterations [84]. Validation of MMR remapping is provided in Additional file 1: Figure S9. Reads mapping to poorly represented genomic regions (simple repeats, satellite DNA, RNA repeats and low-complexity regions) were removed using BEDTools Intersect (−v option) to avoid skewing of the peak-calling threshold [85].
TF-binding peaks were called using HOMER with a false discovery rate threshold of 0.001 [48]. To increase confidence in peak calling, MYC peaks identified using the ENCODE datasets were filtered using BEDTools Intersect to only retain peaks common to at least two replicates [85]. For MYC, the ChIP-seq datasets from GEO and ENCODE were kept separate for downstream analyses. The genomic locations of TEs (hg38) were obtained from the UCSC RepeatMasker table (including truncated and interrupted elements) [86]. Although the categorisation of TEs is often inconsistent in the literature, here we employ the classification system on the Dfam database [87,88]. TFBSs within TEs were identified using BEDTools intersect [85].

Identification of enriched TE subfamilies in TFBSs
For each set of ChIP-seq peaks, the expected representation of each TE subfamily was estimated by the random rotation of the genome and the peak locations (10,000 permutations), then counting ChIP-seq peaks that intersected with a TE using a custom python script. The average count of peaks intersecting with each TE subfamily was divided by the total number of ChIP-seq peaks for each TF to produce the expected possibility of TFBSs being located in TEs by random chance. The observed number of TF peaks overlapping each TE subfamily was counted with no rotation applied. A binomial test was used to identify significantly enriched TE subfamilies. P-values were corrected for multiple testing using Bonferroni correction, where p < 4.3E-5 indicated statistical significance. Enrichment ratios were calculated for each TE subfamily as the ratios between observed and averaged expected overlaps.

Epigenetic analyses of TF-bound TEs
To investigate the association between oncogenic TF binding and the epigenetic status of TEs, we analysed the epigenetic profiles of top enriched TEs bound by oncogenic TFs in MCF7 cells using published ChIP-seq datasets (for data sources see Additional file 4: Table S4). The datasets for 6 histone modifications (H3K27ac, H3K4me1, H3K4me3, H3K36me3, H3K9me3 and H3K27me3) were downloaded from ENCODE in FASTQ format, then corrected for irregular sequencing coverage using BBnorm (target = 40, mindepth =1) [89]. Reads were subsequently aligned to the human genome (hg38) with BWA mem, keeping only the primary alignments [83]. MIRb, L2a, AluJb and L2b subfamilies were selected as they were identified to be significantly enriched subfamilies in the oncogenic TFBSs analysed. For each subfamily, TE copies containing a TFBS defined above were identified using BEDTools Intersect, and those without a TFBS were identified using the -v option [85]. A 10 kb region centred on each TE was divided into 50 bp bins. For each histone modification, BEDTools Coverage was used to count reads in each bin, then values were converted to RPKM [85]. RPKM values were subsequently averaged across replicates, normalised to the control by subtraction, and finally averaged across TE copies within each TE subfamily. Plots were smoothed using a Kernel Density Estimation with an Epanechnikov kernel. Histone modification profiles were also plotted for high mapping quality reads (mapQ > = 1 and mapQ > = 10) to confirm observed results were not an artefact of including multi-mapped reads in our analysis (Additional file 1: Figure S10).
We subsequently investigated the degree of truncation for TEs with or without breast cancer-associated TFBSs. For each TE subfamily, the location of each TE, relative to the consensus sequence, was mapped using the "RepStart" and "RepEnd" from RepeatMasker [86]. The coverage of the consensus sequence at 1 bp resolution was calculated for TEs with or without the presence of breast cancer-associated TFBSs, and normalised to the total number of TEs in each group by division.

Identification of breast cancer-associated genetic elements
A list of~3600 probes identifying differentially expressed genetic elements in breast cancer, previously identified by TCGA [54], was obtained from Professor K. Hoadley (personal communications). To integrate this dataset with hg38, the microarray probe sequences were retrieved from the TCGA website (TCGA Platform Code: AgilentG4502A_07_3), and aligned to the human genome (hg38) using BLASTn (BLASTn parameters: reward = 1; penalty = 3; gap-open = 5; gap-extend = 2; evalue = 3) [54,55]. Unaligned probes were excluded from further analysis. Probe locations were converted to the BED format using AWK [90].
Genetic elements targeted by the probes were identified by intersecting probe locations with human mRNAs from GENCODE, and lncRNAs from FANTOM CAT and NONCODE, in order of descending priority with BEDTools Intersect (−s and -split options) [85] (for sources of the annotation data see Additional file 4: Table  S5). Probes that could not be annotated were discarded.

Identification of candidate genes with putative TF-bound TE-derived promoters
To increase the likelihood of identifying TE-derived promoters with true transcriptional activity, several filtering steps were used to ensure that the identified TEs contained previously defined TSSs and were located nearby a TFBS defined above. Firstly, the genomic locations of FANTOM CAT CAGE clusters (lv2, permissive) were converted to hg38 by LiftOver [56,91]. TSSs regulated by breast cancer-associated TFs were identified by intersecting CAGE clusters with TFBSs defined above using BEDTools Window (−w 300) [56,85]. A 300 bp window was selected to include as many biologically meaningful TSSs as possible for subsequent analysis (Additional file 1: Figure S11). These TSSs were subsequently intersected with TEs using BEDTools Intersect [85]. TSSs located within TEs were then intersected with the promoters of breast cancer-associated genetic elements with BEDTools Intersect (−s option), where the promoters were defined as the 2 kb region around the 5′ ends of the genetic elements [85]. This intersection defined the list of genetic elements for manual curation through literature review of known associations with cancer biology. SYT1, UCA1, AK4 and PSAT1 were selected for further analyses.

Molecular cloning and assembly of reporter constructs
For all molecular cloning, plasmids were purified using QIAprep Spin Miniprep kit (QIAGEN) following the manufacturer's protocol. Gel extractions were performed using 1% agarose gels, and MinElute Gel Extraction Kit (QIAGEN) following the manufacturer's protocol, unless otherwise specified.

Amplification of wild-type promoters
Wild-type promoters were amplified from HeLa genomic DNA by polymerase chain reaction (PCR) using the Kapa HiFi PCR kit (Kapa Biosystem) (for primer sequences see Additional file 4: Table S6). Nested PCR was used for SYT1. Amplicons were analysed by agarose gel electrophoresis and extracted.
Amplicons were ligated into the pCR Blunt vector (Thermo Fisher Scientific) using T4 DNA ligase (New England Biolabs) (37°C for 2-16 h), and transformed into E.coli DH5α cells, followed by kanamycin selection and blue-white colony screening. Plasmid sequences were confirmed by capillary sequencing using M13 forward and reverse primers, and customised primers (for primer sequences see Additional file 4: Table S6). Mutation-free clones were identified for all promoters and used for the following steps.

TE deletion by PCR-driven overlap extension
TEs were deleted from mutation-free wild-type promoters via PCR-driven overlap extension [92] (Fig. 4), using the Kapa HiFi PCR kit (Kapa Biosystem) or Phusion High-Fidelity DNA Polymerase (New England Biolabs). For detail of PCR-driven overlap extension and primer sequences see Additional file 1: Figure S12 and Additional file 4: Table S6 respectively. For SYT1, only the first~500 bp of L1PA2 was deleted, as this region was previously shown to be critical for the L1 promoter activities [93] (Fig. 4a).
Final amplicons were ligated into the pCR Blunt vector (Thermo Fisher Scientific) and transformed into E.coli DH5α cells. Colonies were screened by colony PCR using MyTaq HS DNA polymerase (Bioline) (for primer sequences see Additional file 4: Table S6). Positive colonies were validated by capillary sequencing, and error-free clones were identified for further use.

Assembly of reporter constructs
Wild-type and TE-deleted promoters were excised from the pCR Blunt vectors by restriction enzymes digestion (See Additional file 4: Table S6 for insert-enzyme combinations), then ligated into the pGL3 Basic vector (Promega) with T4 DNA ligase (New England Biolabs) (16°C for 2-16 h), upstream of the firefly luciferase-encoding gene. The ligated plasmids were transformed into E.coli DH5α cells, followed by ampicillin selection. Colonies were screened for inserts in the correct orientation via colony PCR using RVprimer3.
Purification of reporter constructs E.coli DH5α cells were transformed with the reporter constructs, followed by ampicillin selection. For each gene, 3 colonies (clones) per construct were subjected to plasmid extraction. Plasmid DNA of each set of wild-type and TE-deleted constructs, which would be directly compared against each other in luciferase assays, was extracted simultaneously. Plasmid DNA was confirmed for identity and supercoiling by restriction enzyme digestion and gel electrophoresis. Only predominantly supercoiled plasmid DNA was used for transfection.

Cell culture
TNBC cells lines used in this study were MDA-MB-468, MDA-MB-231 and BT549. All cell lines were cultured in Dulbecco's Modified Eagle Medium (Thermo Fisher Scientific) supplemented with 10% foetal bovine serum (Thermo Fisher Scientific). Cells were incubated at 37°C and 5% CO 2 , and passaged with Trypsin-EDTA (Gibco) at 80-90% confluency. Cell passaging at a 1:6 ratio was performed twice a week for MDA-MB-231 cells, and weekly for the other cell lines.

Transfection of TNBC cells lines
Cells were plated in 24-well plates to give a density of 80-90% at 24 h (plating densities are shown in Additional file 4: Table S7). Triplicate wells were plated for each construct and the positive control. Transfections were performed 24 h post-plating using the Lipofectamine 3000 Reagent (Thermo Fisher Scientific), with equimolar amount of reporter construct (~500 ng) and 20 ng of pRL-TK plasmid per well. 500 ng of the pGL3 Promoter vector (Promega) was used as a positive control. Medium was changed at 24 h post-transfection.

Luciferase assays and statistical analyses
Luciferase activities were measured using the Dual-Glo Luciferase Assay System (Promega). At 48 h post-transfection, medium was removed and cells were lysed using 50 μL of luciferase reagent diluted with 50 μL of PBS. After a 10-min shaking incubation, 90 μL of cell lysate from each well was transferred to a White Opaque 96-well microplate (PerkinElmer). Firefly luminescence was measured at 25°C on a DTX880 Multimode Detector (Beckman Coulter) (MDA-MB-468 and MDA-MB-231 cells), or CLARIOstar (BMG LABTECH) (BT549 cells). Renilla luminescence was measured for each well 10 min after adding 45 μL of Stop&Glo reagents.
The relative luciferase activity of each construct was calculated as the ratio of firefly: Renilla luminescence, averaged amongst the triplicates and normalised to the positive control.
For each gene, luciferase assays were replicated in three independent experiments. Relative luciferase activities were normalised to the mean wild-type activity across the replicates for each gene. Using R, the relative luciferase activities of TE-deleted constructs were compared against the wild-type with a one-tailed t-test assuming equal variances, to test our hypothesis that the deletions would reduce promoter activities. P < 0.05 indicated statistical significance.

Bioinformatic analysis of DNA methylation
To investigate the epigenetic status of the candidate TEs in TNBC, their methylation states were evaluated using published whole-genome methylation capture sequencing data [65]. This technique sequences methylated genomic regions by capturing DNA sequences containing methyl-CpGs [65]. This dataset included 6 paired samples. Within each pair, one sample was collected from the TNBC tumour, with another sample collected from the neighbouring normal tissues. The paired sample data were processed as described for TF ChIP-seq data shown above.
As TFBSs were associated with transcriptional activities and thus likely to be epigenetically regulated [94], reads mapping to the previously identified TFBS in each candidate TE were counted using SAMtools, along with reads mapping to the same regions in all TEs from the same subfamily [94]. To normalise for variations in sequencing depth, the methylation level of each candidate TE was calculated as read counts in the TFBS divided by the read count in the same regions of all TEs from the same subfamily. Using R, a one-sample, one-tailed t-test was used to determine whether the difference in the methylation levels between the tumours and the paired normal samples was less than zero. P < 0.05 indicated statistical significance.

Bioinformatic analysis of chromatin accessibility
TE epigenetic states were also evaluated by analysing ENCODE DNase-seq datasets from MCF7 and HMEC (normal breast tissues) cell lines [64] (for DNase-seq data sources see Additional file 4: Table S2). DNase-seq maps accessible chromatin by sequencing DNase I hypersensitive regions, and indirectly reflects epigenetic regulations [95,96]. Data were analysed as described for TF ChIP-seq data shown above.
Similar to the analysis of DNA methylation, the DNase sensitivity of each candidate TE was calculated as read counts in the TFBS divided by the read count in the same regions in all TEs from the same subfamily, and subsequently averaged within each cell line. Using R, a one-tailed proportion test was employed to determine whether the TEs were more sensitive to DNase cleavage in MCF7 cells relative to HMEC cells. P < 0.05 indicated statistical significance.