Transcriptome analysis reveals negative correlation between transcripts of mitochondrial genes and L1HS, and positive correlation between KRAB-ZFPs and older LINE elements in normal somatic tissue

Despite the long-held assumption that transposons are normally only expressed in the germ-line, recent evidence shows that transcripts of LINE sequences are frequently found in the somatic cells. However, the extent of variation in LINE transcript levels across different tissues and different individuals, and the genes and pathways that are co-expressed with LINEs are unknown. Here we report the variation in LINE transcript levels across tissues and between individuals observed in the normal tissues collected for The Cancer Genome Atlas. Mitochondrial genes and ribosomal protein genes were enriched among the genes that showed negative correlation with L1HS in transcript level. We hypothesize that oxidative stress is the factor that leads to both repressed mitochondrial transcription and LINE over-expression. KRAB zinc finger proteins (KZFPs) were enriched among the transcripts positively correlated with older LINE families. The correlation between transcripts of individual LINE loci and individual KZFPs showed highly tissue-specific patterns. There was also a significant enrichment of the corresponding KZFP’s binding motif in the sequences of the correlated LINE loci, among KZFP-LINE locus pairs that showed co-expression. These results support the KZFP-LINE interactions previously identified through ChIP-seq, and provide information on the in vivo tissue context of the interaction.


INTRODUCTION
LINE transposable elements (TEs) comprise more than 17% of the human genome [Lander et al. 2001]. Most LINE elements are incapable of retrotransposition, except for the few elements among the youngest families including L1HS. In addition to generating insertional mutations and causing disease through its retrotransposition activities (reviewed in (1)), L1HS also causes damage to the cell through aberrant expression of its sequence (Reviewed in (2)). Aberrant expression of full-length or even partial L1HS sequence that contains the ORF2 is known to cause double-strand DNA breaks (DSBs) (3,4). This excess damage in DNA can lead to cell cycle arrest and apoptosis (3,5), or a senescence-like state (4). LINE elements have long been thought to be expressed only in the germline cells (6)(7)(8), but recently we have learned that both full-length and partial transcripts of LINEs are frequently found in the somatic cells (8)(9)(10) with large variation in expression levels across tissue types. A large number of different LINE element sites are expressed in human somatic tissues, and this expression varies among different individuals (11). The level of LINE expression is especially pronounced in cancer cells. Endogenous expression of LINEs have been observed early on in germ cell tumors (GCTs) (12), and GCT-derived embryonal carcinoma cell (ECC) lines (13), and in numerous other types of cancer (14) more recently. There are hypotheses considering LINE activity as a driver of tumorigenesis or LINE expression as one of the hallmarks of cancer (15).
Although there are many reports of LINE expression in the somatic cells, how LINE expression is repressed and de-repressed in human somatic cells is still largely unknown. Based on what we have learned so far, LINE expression is regulated through multiple layers, consisting of transcription factors, DNA methylation, PIWI-interacting RNAs (piRNAs), RNA interference (RNAi), and posttranscriptional host factors. Full-length human LINE elements contain a 5' Untranslated Region (UTR) that includes an internal RNA polymerase II promoter (16), as well as binding sites for RUNX3 (17), SRY (18) and YY1 (19). LINE expression is also regulated through cell-specific and stagespecific epigenetic mechanisms (2). Expression of LINE full-length transcripts is correlated with differential methylation of the promoter region in various cell lines (45), and in fetal germ cell Recently, CRISPR-Cas9 screen was used to identify proteins that restrict LINE activity (28). The protein MORC2 and the human silencing hub (HUSH) complex was shown to selectively bind evolutionarily young, full-length LINEs located within euchromatic environments, and promote deposition of histone H3 Lys9 trimethylation (H3K9me3) for transcriptional silencing (28).
Among the proteins participating in regulation of TEs, Krüppel associated box-Zinc Finger Proteins We first summarize the survey of LINE expression variation found in the RNA-seq data from 660 samples of normal tissue. We confirm the earlier findings that LINE expression varies across tissue types, with the tissues esophagus, stomach and head and neck showing the highest level of L1HS transcripts. Transcript levels of individual LINE loci are highly tissue specific and within each family only a few individual loci are highly expressed, contributing to the bulk of the transposon transcripts at the family level. We also find large variation in LINE expression across individual samples within each tissue type. The expression of the L1HS family varies by as low as 2-fold in tissues like kidney, to greater than 8-fold in breast, thyroid, and many other tissues.
By analyzing the correlation between individual genes and L1HS transcripts, controlling for the older LINE transcript levels, we found a significant excess of mitochondrial genes and ribosomal protein genes negatively correlated with L1HS in transcript level. Several of the genes that recurrently show strong negative correlation across tissues, have functions related to oxidative stress. We also found an enrichment of KRAB-ZFPs that were positively correlated with older LINE expression, recurrently in multiple tissue types. Binding motifs of the KRAB-ZFP proteins are enriched in the sequences of LINE loci for co-expressed KZFP-LINE pairs, when compared to random KZFP-LINE loci combinations that are not co-expressed.

RNA-Seq and gene expression quantification in the normal tissues.
We used the gene level quantification provided by The Cancer Genome Atlas (TCGA) for the gene expressions. We collected gene level quantifications for 660 samples from The Cancer Genome Atlas (TCGA). We focused on cancer types that had at least 6 normal samples of RNA-seq data, collected from normal tissue adjacent to the cancer tissue. As a result, 17 Table 1. Although we will use the acronym for the cancer type to describe these tissues, we emphasize again that all our samples come from the normal tissues collected from the same organ of the same patient with the tumour. The tumour tissue samples were not included in our analysis.
Methods for sequencing and data processing of RNA using the RNA-seq protocol have been previously described for The Cancer Genome Atlas (TCGA) (35-37). Briefly, RNA was extracted, prepared into poly(A) enriched Illumina TruSeq mRNA libraries, sequenced by Illumina HiSeq2000 (resulting in paired 48-nt reads), and subjected to quality control. RNA reads were aligned to the hg19 genome assembly using Mapsplice (38). Gene expression was quantified for the transcript models corresponding to the TCGA GAF2.1 using RSEM (39). We used the raw_count values in the .rsem.genes.results files, rounded to an integer, as the gene level quantification.

Quantifying TE derived transcripts at the locus and family level
We collected RNA-seq level 1 binary alignment files (.bam files) for 660 samples ( Table 1) from The Cancer Genome Atlas. These bam files are results of the RNA-seq and MapSplice protocol described in the previous section. We used a modified version of the software TEtranscripts (40) for quantifying the reads mapping to annotated transposons. TEtranscripts is a software that can quantify both gene and TE transcript levels from RNAseq experiments. It takes into account the ambiguously mapped TE-associated reads by proportionally assigning read counts to the corresponding TE families using an Expectation-Maximization algorithm. We implemented two modification to the original TEtranscripts software. 1) We modified it to report read counts for each individual TE locus in the reference genome in addition to the family level counts. 2) We developed a function to discount the read counts by removing read counts that correspond to transcripts containing TE sequences that originate from pre-mrna or retained introns in the mature RNA (41). If the reads mapping to TEs are part of the pre-mrna or retained introns, we should see continuous mapping of reads that span the introns flanking the TE of interest. We utilized the read depths in the flanking introns to proportionally reduce the number of total reads mapped to the TE using the following formula: In addition to the discounted quantification, we also used the option in the TEtranscripts software to quantify the TE transcripts using only the uniquely mapped reads. Downstream analyses were done using both the discounted quantification based on multi-mapped reads and the uniquely mapped quantification, to assess the impact of uncertainty in multi-mapped reads.
The retrotransposon annotations used were generated from the RepeatMasker tables, obtained from the UCSC genome database and provided by TEtranscripts. For quantifying reads mapping to the TE flanking introns we generated gtf files containing 1) the TE flanking intron positions, 2) the intergenic TE positions, 3) the exonic TE positions (TEs that fall within an exon, including non-coding RNA genes). In case of intronic TEs, we use the algorithm described above to discount the transcripts from pre-mrna or retained introns. In case of intergenic TEs, we count all EM estimated reads mapped to TEs without any discount. In case of exonic TEs, we ignore those counts altogether, and the exonic TEs do not contribute to the locus count nor the family level count. The modified version of the TEtranscripts software and the required gtf files can be found at https://github.com/HanLabUNLV.

Normalization and transformation of read counts
After quantifying the reads mapping to annotated genes and TEs, both the gene level counts, and the TE counts were normalized between samples across all tissue types with DEseq2. We used the default "median ratio method" for normalization in DESeq2 (42). Briefly, the scaling factor for each sample is calculated as median of the ratio, for each gene, of its read count to its geometric mean across all samples. The assumption of the median ratio method is that most genes are not consistently differentially expressed between tissues. If there is systematic difference in ratio between samples, the median ratio will capture the size relationship. But, this assumption may be violated when we are comparing large number of tissues types at the same time, since a large proportion of the genes may be differentially expressed in at least one tissue type, or one of the tissues may be extremely biased in their number of differentially expressed genes. In order to achieve more robust normalization, we used a two-step normalization method called the differentially expressed genes elimination strategy (DEGES) (43). We performed preliminary normalization using the "median ratio method", filtered out potential differentially expressed genes in the data, found a subset of robust nondifferentially expressed genes, and used the subset to perform the second round of "median ratio type of each sample was used as the ground truth. Each resulting cluster was then assigned a group label based on the majority tissue type. Normalized mutual information was calculated by comparing the labels from the clustering to the true class labels.

Correlated expression between genes and LINEs
Before examining genes that are associated with L1HS expression, we first examined the clinical variables to check any confounding variables associated with the L1HS levels in the normal tissue.
We tested the variables age, days to death, pathological stage, T staging, N staging, M staging, gender, radiation and race for each tissue type. Radiation therapy was the only clinical variable associated with L1HS expression in the normal tissue of thyroid, and thus was the only clinical variable included as a covariate in our linear models and only for the thyroid tissue.
Correlation between gene and L1HS transcripts were tested in each tissue groups separately, in bladder, breast, liver/bile duct, colon/rectum, stomach/esophagus, head and neck, kidney, lung, prostate and thyroid. We tested 20532 genes for each tissue group using a linear model with log2 L1HS expression as the dependent variable, and log2 gene expression as the independent variable.
For a gene to be included in our test, it had to be present in at least eight individual patients. We also required that the gene be expressed with a minimum RPM of 2 in 75% of the samples to be included in the dataset. In addition to the radiation therapy for thyroid tissue, we considered effective library size (sum of all normalized counts) and the batch ID provided by the TCGA project as additional covariates. The linear model we used is described below. log 2 L1HS ~ log 2 gene + log 2 effective_library_size + batch + radiation We tested all combination of linear models that can be created by including or excluding these variables. Second-order Akaike Information Criterion (AIC c ) was used to select the best linear model.
The coefficient and p-value used are the ones from the best model.
After the preliminary analysis, we found there was significant correlation between the level of L1HS and the level of the older LINE elements. To tease apart the effects, we separated the total LINE quantification into L1HS and older LINEs. L1HS was the sum of read counts mapping to all the individual L1HS loci. An additional variable that we called "oldLINEs" was quantified as the sum of read counts mapping to all the LINE elements except for L1HS, L1PA2 and L1PA3. We estimated the linear models above for L1HS, this time including the covariate oldLINEs. To identify genes correlated with older LINE elements, we also estimated the linear models with oldLINEs as the dependent variable, using L1HS as the covariate. log 2 L1HS ~ log 2 gene + log 2 effective_library_size + batch + radiation + log 2 oldLINEs log 2 oldLINEs ~ log 2 gene + log 2 effective_library_size + batch + radiation + log 2 L1HS

Genes showing recurrent correlation with L1HS and older LINEs in multiple tissues
For the gene-L1HS correlation analysis, we decided to focus on global correlation signals that appear in multiple tissue types. In order to identify these global recurrent correlations, we used two different methods. First, we collected genes that showed significant gene coefficients from the linear model (q-value < 10e-4) in more than one tissue type. But, because this method relies on the p-value of the coefficient, it is biased towards picking up genes in tissues with large sample sizes. So, we also utilized a second measure, called REC score (47), that utilizes the ranks of correlations instead of the absolute p-values to find recurrent correlation across tissues. This rank-based approach ensures that individual tissue types are weighted equally, and limits bias from tissues with large sample sizes or from strong associations measured in only a single tissue type. We modified the REC score slightly to test all gene relationship against a single dependent variable (L1HS or oldLINEs).
Let > be the ordered vector of negative and positive associations involving mRNA j in cancer type k. The strength of the association between mRNA j and L1HS in cancer type k is defined by the relative rank: where, A,> is the rank of the association between mRNA j in the vector > . The rest of the formulas for calculating RECj , the REC score for mRNA j follow the methods described in (47).
Once genes with recurrent correlations were identified, we used the gene enrichment analysis software DAVID (48) with the positively and negatively correlated gene sets to identify functional clusters that are enriched among the correlated genes. We also derived a ranked list of these genes based on their sign of the gene coefficient and the maximum partial eta-squared values for the gene coefficient from the linear model. We used this pre-ranked list of genes with recurrent correlations to run the Gene Set Enrichment Analysis (GSEA) software (49) against the curated pathways gene set and the GO terms gene set of the MSigDB collections (50). To test the significance of our results, we generated two different null models for comparison.

Correlation between individual LINE loci and KRAB-ZFPs
For the first null model, we randomly sampled 10,782 pairs of KZFP motifs and L1 loci out of all available KZFPs and L1 loci in our gene expression dataset, regardless of whether they are coexpressed or not. For the second null model, we aimed to construct a null model that reflects the distribution of the L1 families in our significantly co-expressed pairs. We thus started with the 10,782 pairs of zinc finger motifs and L1 sequences that were identified as significantly co-expressed, but we created new pairwise relationships by randomly permuting the LINEs against the KZFPs so that the co-expression relationships are disrupted. We generated 1000 sets of each null model, and tested for the existence of the KZFP motif in the paired L1 sequence. We then summed up the total number of zinc finger motifs identified in the L1 sequences as well as the total number of unique motifs in each of the L1 sequences. We fit a normal distribution to the total number of motifs identified for the 1000 set of null models, and calculated the p-value of our observed count of motifs.

Validation data set from the GTEx project
In addition to the normal samples from TCGA, we also obtained normal lung tissue RNA-seq files from the Genotype-Tissue Expression (GTEx) project (54) as an independent set of samples for validation. GTEx donors are identified through low-post-mortem-interval (PMI) autopsy or organ and tissue transplant settings, in which biospecimens collection can start within 24 hours of death or surgery. It is expected that the majority of the tissue samples in this dataset is free of major disease.
The RNA-seq protocol for the GTEx project has been previously described in (54). Briefly, RNA was extracted to a non-strand specific poly(A) selected Illumina TruSeq library, sequenced by HiSeq 2000 into 76bp paired-end reads, with median coverage of 82M total reads. The reads were mapped to the GENCODE 19 annotation using STAR (55). We downloaded 210 aligned bam files from dbGaP for the lung tissue, chosen because of its large sample size compared to other tissues. The analyses described above were repeated identically for these 210 samples.

TE transcripts are quantified with correction to discount TE reads originating from pre-mRNAs or retained introns
In general, we find that there is good correlation between the total raw counts, and the reduced counts of the TE family after our correction ( Figure 1). We also visually confirmed several examples of clear L1HS expression that are not part of the surrounding intron (supplementary Figure 2). But there were a few samples with large differences before and after correction, such that the overall distribution of the corrected difference was positively skewed (Figure 1, Supplementary Figure 3). On average, the total read count for L1HS was reduced by around 426 reads for each sample. There were 10 patients with large reduction in L1HS read counts, greater than twice the standard deviation above the mean, and these outliers with large corrections came from two tissue types, the stomach and the esophagus ( Table 2). When looked in detail these extremely large corrections came from very few specific L1HS loci. The most extreme case was due to a single L1HS locus, L1HS_dup39 on chromosome 1, 91853148-91853486 (Supplementary Figure 4a). It lies next to an rDNA that is highly expressed in stomach and esophagus. There was transcription read-through of the rDNA that extended beyond the rDNA boundary and continued to transcribe part of the L1HS sequence, beyond the 5-base threshold that we required for being counted as L1HS transcription. Because of this, L1HS_dup39 had a read count of 11473 reads in this sample (patient ID 6852). But after applying our correction method based on the flanking read depths, all the reads mapping to this particular L1HS locus was removed, and the quantification for the particular locus went down to zero as it should.
Another example is L1HS_dup898, which was the single locus most frequently identified as the largest correction within each sample. Among the 674 samples that we quantified, L1HS_dup898 showed the largest correction in 310, almost half of the samples (Table 3a). L1HS_dup898 is found on chr 9, 85664455-85670486 in the middle of the first intron of the gene RASEF. When we looked into the read alignment in detail in one of the samples with the highest expression (patient ID A4OM), we found that there were large numbers of reads uniquely mapping to L1HS_dup898, but we also found equally large numbers of reads mapping to the surrounding intron on both the left and right side of the TE, indicating intron retention (Supplementary Figure 4b). The estimated read counts for L1HS_dup898 were reduced to zero after the correction.
Although, we focused on L1HS elements as example cases, L1HS is actually not the TE family that showed the largest corrections compared to other TE families. AluSx, AluSx1, AluJb, AluY, MIR3, MIRb, etc. showed the largest corrections, several orders of magnitude larger reduction in read counts compared to L1HS (Table 3b). As we have seen in the case of L1HS, a few specific loci were responsible for a major proportion of the correction for a TE family, depending on the tissue type. This was due to specific transcription read-throughs or intron retentions that happen more frequently in certain tissue types.
We also found cases where the method corrected for erroneous TE quantifications due to TEs embedded within long non-coding RNAs (lncRNAs). One TE locus that was most frequently identified with the largest correction in the sample was AluSx1_dup59209. AluSx1_dup59209 had very high transcript levels with an average read count of 18863 in thyroid and head and neck tissue. But when we looked into the alignment in detail, we found that the Alu element was embedded in a lncRNA gene called TTTY14. The reads mapping here were counted as AluSx1 transcripts in the original quantification. In the alignment, we see that AluSx1_dup59209 has accumulated enough mutations, such that almost all the reads mapped to it are uniquely mapped reads. It looks to be a novel case of an Alu domestication, where an Alu insertion or a secondary duplication of an original Alu insertion became part of a testis specific RNA gene (56). Since we added a step to remove all read counts mapping to TEs if the TE falls within exons or non-coding RNAs, the read counts for again correctly classified into their corresponding tissue groups, even without any information from the genes ( Figure 2). We used normalized mutual information between the different clustering results and the ground truth (the true tissue group) to evaluate the quality of clustering. Normalized mutual information was compared for clustering results based on gene expression, family level LINE expression, locus level LINE expression and random assignments. We found that the locus level LINE expression was as predictive of tissue groups as the genes (Table 4).
Compared to other older LINE elements, the tissue specificity was less pronounced for L1HS expression. Still we found large variation in the amount of L1HS expression in different somatic tissues across normal tissue samples (Figure 3), consistent with the previous observation in adult human tissues (10) and in human cell lines (57). There were at least a moderate level of L1HS expression across all the tissue types we examined. Tissue with the lowest L1HS level were the liver and the bile ducts, with the median value at about 300 reads per million. Head and neck tissue, esophageal tissue and stomach tissue showed the highest level of L1HS expression in the normal tissue samples, with a median of around 1500 reads per million. This was even after large amount of corrections in stomach and esophagus samples seen in Figure 1. This is consistent with the observation by Belancio  , and it is most frequently the most highly expressed L1HS locus in the genome. It is possible that the few L1HS loci that we identified to be highly expressed are also the handful of source L1s (hot L1's) that are responsible for the majority of retrotransposition events (58)(59)(60). We tested the overlap between the 23 L1HS loci in our top 100 highly expressed list, and the 28 L1HS elements in Table S3   list, again showing that head and neck, esophagus and stomach showed higher L1HS expression not only at the family level but also at the individual locus level (Table 5b).

Specific sub-families of SINEs, ERVs and MERs show recurrent positive correlation with LINE expression in multiple normal tissues
Since TEtranscripts quantifies the reads mapped to different transposon families (40), we were able to study the co-expression of different transposons with L1HS. Based on the estimates of TEtranscripts, we found multiple LINE subfamilies closely related to L1HS showing up with the strongest correlation in expression levels with L1HS (Supplementary Table 2a). In fact, the correlations between L1HS and many repeat families were much stronger than the relationship between L1HS and any of the genes. The correlation between closely related LINE subfamilies is expected because reads from transposons that map to sequences that are indistinguishable between subfamilies are assigned to multiple subfamilies with proportional weight by TEtranscripts using an Expectation-Maximization algorithm. What was not expected, was that we also found several DNA transposons and Endogenous retroviruses (ERVs) that are highly and recurrently correlated with L1HS expression in multiple tissues (Supplementary Table 2b). Since there was wide-spread correlation between different LINE subfamilies and multiple transposons outside LINEs, we decided to tease apart the correlation using a multivariate model including separate quantification for the L1HS element and the older LINEs. When we controlled for the RNA level of older LINE elements as a covariate, most of the correlation seen between non-LINE transposons and L1HS disappeared, and the elements correlated with L1HS was limited to the most closely related L1 elements (L1PAs), but no other transposon families (Supplementary Table 2c). On the other hand, when we looked at elements that are correlated to older LINEs, controlling for correlation with L1HS, we found several Alu elements recurrently correlated with older LINEs, (Supplementary Table 2d). Considering there is no sequence similarity between the SINEs, MERs, ERVs and the LINEs, the correlation among these diverse transposons is probably due to a common regulatory mechanism that is de-repressing these transposons and LINEs at the same time. There have been reports of such co-expression of ERVs and LINEs in cancerous tissues (61,62), possibly through concordant hypomethylation (63).

Expression of mitochondrial proteins are negatively correlated with L1HS expression
Based on our results showing widespread correlation between L1HS and older LINEs, we decided to examine genes co-expressed with L1HS and older LINEs separately, controlling for the other variable as a covariate in the linear model. We found 3650 genes that showed negative correlation in expression with L1HS (q-value < 10e-4) in at least one tissue (Supplementary Table 3). There were 1056 genes that were negatively correlated with L1HS in more than one tissue, i.e. the correlation was replicated in at least two tissues. Figure 4 shows the top 12 genes with the largest partial etasquared estimated across all tissues.
We did a gene set enrichment analysis (GSEA) using DAVID for the 1056 genes. The top three most enriched clusters were mitochondrial transit peptide / mitochondrial inner membrane, oxidative phosphorylation and ribosome and ribonucleoprotein (including mitochondrial ribosome proteins) ( Table 6, Supplementary Table 4). This result was robust to the threshold we used for significance.
When we used a more stringent cutoff of q-value < 10e-5, we found 326 genes negatively correlated with L1HS in more than one tissue, but the top three most enriched annotation clusters were still the same for the 326 genes. The result was also robust to the software we used for the gene set  Table   5, Supplementary Figure 6). When we look at the list of genes significant in these categories, we find that there is almost an across-the-board reduction in transcripts for mitochondria and the ribosome ( Figure 5) in the cells that have high level of L1HS transcript levels. Nuclear encoded mitochondrial genes are known to be co-regulated with mtDNA encoded genes, and ribosomal protein genes are co-regulated with rDNAs. We did not include mtDNA encoded genes, nor rDNAs in our quantification, so we could not test whether they are also negatively correlated with L1HS expression. The enrichment of mitochondrial genes among the genes negatively correlated with L1HS was significant in models both including and excluding expression of older LINE elements as a covariate. In contrast, there was no enrichment of mitochondrial genes when we looked at the genes correlated with the expression of older LINE elements adjusting for L1HS level as a covariate ( Figure 6). This shows that the negative correlation between the expression of mitochondrial genes and LINE expression is either a common relationship among all LINE families or stronger for L1HS compared to older LINE elements.
To understand how the uncertainty in reads mapping to TEs affect our results, we decided to examine the correlation only using the uniquely and unambiguously mapped reads. We used the option in the TEtranscripts software to quantify TEs only based on the uniquely mapped reads, and replicated the gene-L1HS co-expression analysis. Overall, the significance of the correlations dropped, due to the lower power coming from lower number of reads. We found 6 genes negatively correlated with L1HS with q-value < 10-e4 and 96 genes using a less stringent cutoff (q-value < 10-e3). The six genes identified are CA4, CCDC151, PARD6A, RASSF4, SLC29A2, UBTD1, and there are no common functions across these genes, and no obvious link to mitochondria was found. But with the 96 genes, we found that mitochondrial inner membrane and mitochondrial transit peptide were still the most enriched category among the genes negatively correlated with L1HS measured by uniquely mapped reads (Supplementary Table 6).
We also looked at the recurrent correlations using REC score (Jacobsen et al. NSMB 2013). A strong negative REC score reflects that the mRNA-L1HS relationship generally show correlation in the same direction across the studied tissue types. Compared to the list of correlated genes we examined above, that were identified by a significant coefficient in the linear model in at least two or more tissue groups, this list is based on the rank of correlations and doesn't rely on the significance of the p-value in the linear model. Table 7 shows the top 20 genes that have the best REC scores. None of the genes had a REC score that is less that 6.2, which is considered as the threshold for significance, showing that no gene was consistently co-expressed with L1HS across all tissues. But, when we used the top 100 genes with the most negative REC scores and examined the enriched functions, we again found that the most enriched cluster was mitochondrial respiratory chain and oxidative phosphorylation.
When examining the top 20 genes that show recurrent negative correlation with L1HS across tissues (Table 7), we found interesting interaction and overlap in function among the genes. DHPS encodes the Deoxyhypusine Synthase, the enzyme that cleaves and transfers spermidine that is necessary for the hypusination of eIF5A. Depletion of eIF5A leads to endoplasmic reticulum stress and unfolded protein response (64). The function of AIP is unknown but it is reported to interact with the aryl hydrocarbon receptor (AhR). A tight coupling of AhR and Nrf2-dependent regulation in the prevention of quinone-induced oxidative stress and ER stress has been reviewed in (65). CREB3 is a transcription factor that activates unfolded protein response (UPR) target genes during endoplasmic reticulum (ER) stress response (66). In addition to the genes DHPS, FLOT1, AIP and CREB3, the overall transcriptional repression of ribosomal proteins seen in Table 6 and Figure 6 also led us to consider the possibility that the cells that have high level of L1HS transcripts may be exhibiting activation of the unfolded protein response (UPR) due to ER stress. Although the UPR pathway is mainly regulated at the protein level through phosphorylation and cleavage, we examined whether any genes in the UPR pathway showed correlation with L1HS at the transcript level that was replicated in at least two tissue types. We found that L1HS transcript level was positively correlated with EIF2AK2(PKR), ERN1(IRE1a), and NFE2L2 (NRF2), but negatively correlated with ATF4(CREB2) and CHOP(DDIT3) (Figure 7). We did not observe correlation between L1HS and the other transducers of UPR, PERK or ATF6, nor did we see any correlation between KEAP1 and L1HS. We did see that the gene SQSTM1, encoding p62 that binds to KEAP1, was negatively correlated with L1HS. We also saw several chaperone proteins, ubiquitination factors and other regulators of ER stress correlated with L1HS. DNAJC5, HSPA4L, UBE4B, MBTPS2, ITPR1, ITPR2, and ITPR3 was positively correlated with L1HS and DNAJC4, DNAJC17, DNAJC19, UBE2G2, SSR2 (TRAPB), BNIP1 was negatively correlated with L1HS in at least two tissues. Excluding the annotations related to mitochondria, the functional annotations "proteasome", "autophagy", "endosome", and "antioxidant" were the top annotations enriched in the negatively correlated gene set (Table 6).
Since XBP1 is one of the genes in the UPR pathway regulated through splicing, we examined if the XBP1s isoform correlates with the L1HS level in our samples. We tested the log2 transformed raw counts of the XBP1s isoform against the log2 transformed L1HS level, and we also tested the ratio of the spliced isoform over all transcripts XBP1s/(XBP1u+XBP1s) against the log2 transformed L1HS level. But neither quantification of the spliced version of XBP1 correlated with L1HS consistently across tissues. We found that the ratio of the spliced form positively correlated with L1HS levels in the colorectal samples, but it was not replicated in any other tissue.
To find whether there are common transcription factors upstream of the genes correlated with L1HS, we took our list of 1056 genes negatively correlated with L1HS and looked for common regulatory elements using the web server DiRE (https://dire.dcode.org) (74). We found that the transcription factor binding sites (TFBS) of NRF2 and ELK1 are the most frequently identified motifs for our list of genes. 476 genes out of 1056 had at least one TFBS identified, and of those 99 (20.69%) included at least one NRF2 binding site and 105 (22.52%) included at least one ELK1 binding site. Unlike NRF2, we did not find any significant correlation between ELK1 and L1HS. But, as we describe in the next section, several genes in the MAPK signaling pathway were positively correlated with L1HS in multiple tissues.

Expression of chromatin modifiers are positively correlated with L1HS expression
Compared to the genes in negative correlation with L1HS, there was a larger number of genes in positive correlation with L1HS in expression. We found 4338 genes that showed positive correlation in expression with L1HS (q-value < 10e-4) in at least one tissue out of the ten tissue groups (Supplementary Table 2). There were 1318 genes that were positively correlated with L1HS in more than one tissue. We did a gene set enrichment analysis using DAVID for the 1318 genes that are positively correlated with L1HS in more than one tissue. After excluding the very general categories of "kinase" and "transcription" that consisted of 293 and 768 genes respectively, we saw SH3 domains, DNA damage and repair, bromodomains, PHD fingers, and chromodomain helicases as most enriched functional categories for genes positively correlated with L1HS expression (Table 8 TRAF6 and NF-κB were positively correlated with L1HS in transcript levels. Table 9 shows the genes with the best REC scores that have recurrent positive correlation across tissues. We removed genes RASEF, MAST4, and LPP from the list, because they have L1HS sequences in their introns, that can lead to spurious positive correlation. Although we correct for the TEs embedded in introns, the correction is not complete, as we discuss in the Discussion section below.

Expression of KRAB-ZFPs are positively correlated with expression of older LINE elements in the normal tissues
As can be seen in Figure 9, various zinc finger proteins are both positively and negatively correlated with L1HS and older LINEs. But, overwhelmingly more zinc fingers are positively correlated with LINE expression, rather than negatively. Also, if we focus on the Krüppel associated box (KRAB) domain, zinc fingers with KRAB domains were significantly enriched among genes positively correlated with older LINE elements but less among genes correlated with the young and active L1HS (Supplementary Table 4). There are three ZFPs that show recurrent negative correlation with L1HS in multiple tissues, ZNF511, ZNF32, ZNF174 (  Figure 10 shows an example of correlation between KZFPs and individual L1M2b loci in the breast tissue.
Again, confirming what we found in our initial analysis, the families that were found as significantly . We looked at whether any of these proteins showed correlation with L1HS in the expression level, but we did not find any KRAB-ZFPs that showed significant correlation with L1HS.

KZFP-LINE locus pairs that show significant co-expression
We Results are replicated in an independent dataset of normal tissue samples of the lung.
All our samples come from supposedly normal tissues that were selected by pathologists based on histology, although they are adjacent to the cancer cells. But, since suppression of mitochondria and oxidative phosphorylation, replaced by enhanced aerobic glycolysis is one of the hallmarks of cancer, referred to as the Warburg effect, there is a possibility that our results reflect the cancerous environment that these "normal tissue" are experiencing, including hypoxic conditions. To find out if we could replicate the results in a true normal tissue, we used an independent dataset from the GTEX project to test the correlation with the same analysis.
First, we confirmed that the lung tissues clustered well with the lung tissues from the TCGA dataset based on the gene expression patterns (Supplementary Figure 7a). The negative relationship between mitochondrial genes, ribosomal protein genes and L1HS expression was replicated in the lung transcriptome data from GTEX in the linear model without controlling for older LINEs as the covariate, but was not replicated when we controlled for the older LINEs. We are not sure why there is this discrepancy. But, the TCGA data is a more robust result because we only included genes that showed correlation in multiple tissues. With the GTEX data, we only had one tissue type, the lung, so we did not require the additional filtering of recurrent correlation.
When examining the linear model without controlling for older LINEs as a covariate, we confirmed that the main results were replicated. There was enrichment of terms mitochondrial genes, ribosomal protein genes, response to reactive oxygen species, autophagy, proteasome, among the genes negatively correlated with L1HS. There was enrichment of KRAB-ZFPs, among genes positively correlated with both L1HS and older LINE elements.

Limitations to the quantification and correction
Quantifying transposon transcripts is a difficult problem, due to their ambiguity in short read mapping because of repeated content in the reference genome. Current state of the art methods rely on Expectation-Maximization to account for the uncertainty in multi-mapped reads (40). Because the estimation of multi-mapped reads start from values initialized based on the uniquely mapped reads, the final estimates tend to correlate well with the uniquely mapped read counts. This works well with gene transcripts where more than 90% of the reads are estimated to map uniquely across all genes (77), and the variance in the mappability across genes is not large. But with transposons, not only are the mappability lower than the genes, the variance of the mappability across different loci are huge due to the large number of similar sequences. This can lead to biases in the quantification if one locus has high mappability due to more unique mutations, while another locus has lower mappability due to smaller number of unique mutations by chance. Focusing only on uniquely mapped reads doesn't really solve this problem, and will lead to equally biased quantification, as evidenced by the high correlation observed between uniquely mapped reads and the total reads including multi-mapped reads.
In addition, the reference genome that are used for mapping the transposon transcripts do not include all the polymorphic TE insertions in the human population. If the transcript originated from a polymorphic TE that is not present in the human reference genome, it will be redirected to the most similar locus that is present in the reference genome. This will not be a serious problem when quantification is done at the family level, but if one is interested in locus level quantification it becomes a serious problem, especially because the polymorphic loci could potentially be the more active loci that are expressed at higher levels compared to other loci of the same family.
Another complication in TE transcript quantification is that TEs are frequently embedded within introns that are transcribed before they are processed, or sometimes fail to be spliced out, or embedded within exons or non-coding RNAs that are expressed in different conditions (41). To account for this source of error, we introduced a method to correct for TE reads coming from retained introns or pre-mRNA. Although we observed large corrections for specific TE elements embedded within introns, the correction is not complete. The main limitation to our approach stems from the fact that the correction is done after the EM algorithm, after the multi-mapped counts are probabilistically assigned to multiple TE loci in the genome. The correction only removes the equivalent of read counts assigned to the problematic locus, and the uncertainty in the assignment through EM means there may be remaining reads assigned to alternative locations, that actually originate from retained introns as well. A more accurate approach would be to correct for the read counts from retained introns before the EM algorithm based on the uniquely mapped reads, and then run EM based on the corrected counts. But, estimating the depth of the repeat region using uniquely mapped reads is a difficult problem. The effective length of the uniquely mapped region is difficult to estimate, because again mappability varies from locus to locus for any TE, depending on the unique mutations it has accumulated. So, for this study, we decided to use the easier approach to run EM first, and probabilistically assign the TE reads, and then correct based on the expected read depth across the length of the TE locus.
Although we partially dealt with the third problem described above, we have ignored the first and second problems, and we recognize the limitations of the current methods. An important future study would be to study the mappability of individual TE loci carefully, including the known polymorphic sites, and to design a software for TE quantification that can take into account the mappability of each locus in its EM algorithm, as well as correct for the retained introns while considering the effective length of the uniquely mappable region within the TE. Despite these limits, we believe the main results of gene correlations found with L1HS and older LINEs are not affected by the quantification.
The uncertainty lies in the assignment of reads to each LINE families, but the total reads coming from LINEs that are found in the RNAseq are valid. Our correlation analysis was done at the family level for each tissue separately. We may have under-or over-estimated the L1HS read counts, but as long as the errors are similar in distribution within each tissue type, and the errors are not systematically biased to be associated with a certain set of genes, it should not be a serious problem. We tried to make the results robust by dropping the reads mapping to LINEs most closely related to L1HS, i.e.
L1PA2 and L1PA3, and requiring the correlation be replicated in at least two tissue types. We also confirmed that the results were replicated using uniquely mapped reads only.

Stress and TE expression
Initially, when we started the project, our goal was to identify candidate genes involved in transposon control, based on the co-expression analysis. But, once the analysis was done, the results were pointing to what induces LINE expression, rather than what suppresses LINE expression.
Among the genes known to function in transposon control, RNaseH2C (Figure 4), and RNaseH2A Our study is the first that reports observation on L1 expression in vivo in normal human tissue samples, and shows widespread association between L1HS expression and genes modulating stress response. In this light, it is interesting that we were not able to observe correlation between age and L1 transcript levels in our data. It may be because, although our samples are all normal tissue, they come from cancer patients and thus the age distribution is skewed towards older age. The range is from 15 to 90, but the median age across all samples is 62. We did find association between past radiation therapy and L1HS expression level but only in the thyroid tissue. The exposure and damage to normal tissues resulting from radiation therapy has been long recognized (96), and the sensitivity to radiation is known to vary depending on the tissue or organ, as well as the genetics of the patient (97). Our results show that one of the responses to radiation damage in the adjacent thyroid tissue may be a higher level of L1HS expression (Supplementary Figure 8) and it may have implications for future rate of recurrence.

TE expression in disease
The genes and pathways that we find to be correlated with L1HS expression are interesting when we think about the accumulated evidence of transposon activity in cancer, and more recent reports of transposon dysregulation in neurodegenerative diseases. Suppressed mitochondrial function and altered metabolism, are one of the hallmarks of cancer, and abnormal MAPK signaling is implicated in a wide range of cancers. Oxidative stress has also been linked to cancer through increased DNA damage, and genome instability. In our preliminary analysis, we observed that the same correlations are found in the cancer tissue samples as well, but we focused on the normal tissues for the scope of this study, to avoid the confounding factors of malignant transformation. The fact that we observe these correlations in normal tissue, albeit obtained adjacent to the cancer tissue, could mean that LINE dysregulation is one of the earlier steps in the initiation of cancer and could potentially contribute to the progression of cancer through the damage that they cause to the DNA. Similarly, mitochondrial dysfunction, oxidative stress and proteotoxic stress, and more recently transposon expression are some of the common features that are observed in the tissue samples or animal models of neurodegenerative diseases, such as amyotrophic lateral sclerosis (ALS) (98-100) and Alzheimer (101). Our observation that the link between mitochondria, ER stress, and LINE expression is observed across multiple tissues other than brain that are considered normal and healthy, indicate that cellular stress, transcriptional response and the induction of L1 expression may be a fundamental process that happens quite frequently across different organs. Whether such expression becomes a burden to the cell and contributes to disease, and if so, why the process leads to more serious consequences in certain tissues such as brain are questions that will need to be explored.

TEs and KRAB-ZFPs
Although LINE elements have been studied for a long time, their ubiquitous and highly tissuespecific expression patterns are starting to be appreciated only recently.    TCGA patient id, read counts before and after correction, read counts removed, and their tissue type.  samples with the largest L1HS read counts. Sample ID, frequency of the sample in the top 100, tissue type. Table 6. Enriched annotations among genes negatively correlated with L1HS transcripts in at least two or more tissues.
Enriched annotation clusters identified with DAVID. Rank, annotation, annotation terms, enrichment score, number of genes in the annotation cluster, gene names. Enrichment score is the geometric mean of the enrichment scores (modified fisher's exact test) for all terms in the cluster. Rank 4 to rank 8 and rank 10 are additional annotation clusters related with mitochondria that we omitted for brevity.
The full table is included in Supplementary Table 4. 20 genes with the highest REC score reflecting recurrent negative correlation across multiple tissues. Table 8. Enriched annotations among genes positively correlated with L1HS transcripts in at least two or more tissues.