Whole-genome analysis reveals the contribution of non-coding de novo transposon insertions to autism spectrum disorder

Background Retrotransposons have been implicated as causes of Mendelian disease, but their role in autism spectrum disorder (ASD) has not been systematically defined, because they are only called with adequate sensitivity from whole genome sequencing (WGS) data and a large enough cohort for this analysis has only recently become available. Results We analyzed WGS data from a cohort of 2288 ASD families from the Simons Simplex Collection by establishing a scalable computational pipeline for retrotransposon insertion detection. We report 86,154 polymorphic retrotransposon insertions—including > 60% not previously reported—and 158 de novo retrotransposition events. The overall burden of de novo events was similar between ASD individuals and unaffected siblings, with 1 de novo insertion per 29, 117, and 206 births for Alu, L1, and SVA respectively, and 1 de novo insertion per 21 births total. However, ASD cases showed more de novo L1 insertions than expected in ASD genes. Additionally, we observed exonic insertions in loss-of-function intolerant genes, including a likely pathogenic exonic insertion in CSDE1, only in ASD individuals. Conclusions These findings suggest a modest, but important, impact of intronic and exonic retrotransposon insertions in ASD, show the importance of WGS for their analysis, and highlight the utility of specific bioinformatic tools for high-throughput detection of retrotransposon insertions. Supplementary Information The online version contains supplementary material available at 10.1186/s13100-021-00256-w.


TE filtering with xTea
TE type-specific filters were implemented within xTea to remove false positives. In specific, while we used the default values for most of the parameters, there are three major parameters (the number of clipped reads, the number of discordant pairs, and the number of clip and discordant reads) which can affect the sensitivity and specificity. We thoroughly evaluated these three parameters and required >= 3 clipped reads, >=5 discordant pairs, and >=1 clip and discordant pairs as the optimal ones to maintain high sensitivity and high specificity. Because of target-primed reverse transcription, a polyA tail and target site duplication should be observed along with enough supporting clipped and discordant reads at both sides of the breakpoint.
However, in many cases, not all of these features could be detected. xTea incorporates a confidence rating system that evaluates whether all these features are found and whether they are on one or both sides of the breakpoint. We selected only insertions classified as "high confidence". Additional filters within xTea include examining the patterns of insertion-supporting clipped sequences and discordant reads mapped to the TE consensus sequences: the supporting reads should not be scattered across the consensus but instead form one cluster (c1) for 5'-clip reads and another cluster (c2) for 3'-clip reads; the mates of 3' and 5' discordant reads should form two distinct clusters (d1 and d2). The distance between c1 and d2 and between c2 and d1 must be less than the average insert size ± 3*(standard deviation of the insert size). The supplementary xTea filter module with TE type-specific filters implemented in our analysis is now part of the main xTea code in the latest version.

Processing for comparison with known non-reference TEIs
To obtain Venn diagrams for overlap with other cohorts, TEIs from unrelated parental individuals were given a 40 base pair margin from the midpoint of the breakpoints and were merged with bedtools (2) "merge" if they overlapped to obtain a unique set of non-redundant TEIs in unrelated individuals in the SSC cohort. Breakpoints from gnomAD (3) and 1000 genomes (4) were also given a 40 base pair margin. The different datasets were overlapped using bedtools (2) "intersect" and counts were plotted in R with the VennDiagram library (5).

Allele frequencies and comparison with previous studies
The population allele frequency (PAF) within the cohort was calculated as the number of alleles carrying the TEI in the population divided by the total number of chromosomes in the population.
As an additional approach, we estimated the population allele frequency (PAF) within the SSC parental cohort using the Hardy-Weinberg principle. Here, p+q=1, where p is the frequency of the insertion allele in the population and q is the frequency of the non-insertion allele in the population. Assuming that q 2 is the fraction of individuals without an insertion, we calculated the PAF as 1 -(sqrt((total parental individuals in the cohort -individuals with insertion allele) / total individuals in the cohort)). Merged breakpoints were overlapped with gnomAD TEIs (3) using a window of 40 base pairs to define overlap. We compared the PAF within the SSC cohort to both the overall PAF and the European PAF in gnomAD since 83% of fathers and 85% of mothers were classified as white (Additional File 1: Fig. S6).

Detection of de novo TEIs
We used the high confidence post-filtered insertions from xTea (6) for this analysis for Alu, L1, and SVA (https://github.com/parklab/xTea). TEIs were given a 40 base pair margin from the midpoint of the breakpoints and were excluded if they overlapped with known non-reference (KNR) insertions obtained from previous studies (4,(7)(8)(9)(10)(11)(12)(13)(14) as well as reference SVA, reference young L1 (L1HS, L1PA2, L1PA3) or reference young Alu (AluY) (15). To exclude inherited insertions that may have been missed in parents, we excluded insertions that had clipped or discordant reads in the raw parental files (clip_reads_tmp0 and discordant_reads_tmp0) in the xTea output. We imaged de novo candidates on IGV 2.4.19 (16) for manual inspection. We visually confirmed the absence of supporting parental reads, as well as the presence of a target site duplication, a polyA tail, and clipped and discordant supporting reads that support a retrotransposition event (17). Insertions were scored as "high confidence de novo" if visual inspection of calls passed these criteria for de novo insertions, as "de novo" if there were some discordant reads in the parents but no clipped reads supporting the breakpoint in parents, as "somatic candidate" if there is strong read support with a polyA tail, target site duplication, clipped reads and discordant reads, but the supporting reads are a small fraction of the overall coverage at the breakpoint, as "parental mosaic candidate" if there were <=2 clipped reads in one of the parents and discordant reads at a low allele frequency, suggesting it might be mosaic in parent's blood yet not called by xTea due to the low allele frequency, and "false negative parental" if it was not clear whether there was a false negative insertion or a mosaic blood insertion in the parental sample due to having few clipped reads but many discordant reads near the insertion site. Only insertions scored as "high confidence de novo", "de novo", "somatic candidate", and "parental mosaic candidate" were included. TEIs that were detected in both the affected and unaffected siblings were excluded as these are either parental false-negative insertions or parental mosaic and inherited events.

Adjusting de novo retrotransposition rates
With long-read technologies, the sensitivity for detection of TEIs is higher (18,19), suggesting that our raw rates are an underestimate. To account for genomic regions in which xTea is unable to detect TEIs given the lower sensitivity with Illumina short-read data, as well as for the reference filters we used for de novo insertions, we calculated our sensitivity for detecting germline TEIs in the Genome in a Bottle sample NA24385/HG002 (20) which has been sequenced with both long and short-read technologies. A curated set of 9,970 (>50bp) insertions was obtained from Genome in a Bottle V0.6 (21) and integrated with 15,268 (>50bp) insertions from a haplotype assembly of the samples (6,22). RepeatMasker (15)  Chromatin states from fetal brain tissue (E081 and E082) were downloaded from https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/imput ed12marks/jointModel/final/ (27). States were classified as: 13_EnhA1 and 12_EnhA2 = active enhancers; 15_EnhAF, 16_EnhW1, 17_EnhW2, 18_EnhAc = other enhancers (weak, flank, acetylation only); 2_PromU, 3_PromD1, 4_PromD2 = promoters. TEIs were overlapped with the two fetal brain regions using bedtools (2) and the number of unique calls in each category was obtained.

Mobile Element Insertion Size
To determine the size of polymorphic and de novo insertions, we only included TEIs which had supporting clipped reads on both breakpoints and which did not overlap with reference. Since xTea maps reads to several subfamilies of retrotransposons, we excluded calls where the clipped and discordant reads mapped above the consensus size xTea uses for mapping for AluY, L1HS, and SVA (282, 6,120, and 1,400 base pairs respectively) since, particularly for Alu calls, these tended to be poly-A expansion artifacts. The consensus sequences within xTea were obtained from RepBase23.02 (28) and the L1HS consensus sequence was manually constructed by multiple sequence alignment of full-length reference sequences. The position of clipped and discordant reads mapping to reference retrotransposon sequences was obtained for each insertion. The minimum position was subtracted from the maximum position to obtain the predicted size. If the maximum position was larger than the consensus length, this was set to the consensus length. The resulting estimated insertion size is an approximation since we are unable to account for repeat expansions and different poly-A tail lengths (29,30). For all insertions including polymorphic and de novo TEIs, calls were given a 40 base pair margin from the midpoint of the two insertion breakpoints and were merged based on the overlap of these coordinates. The median size for all samples with each insertion is reported. Loess regressions were performed in R (31) with a 25% smoothing span.

De novo insertions in brain expressed genes
For overlap of insertions with brain expressed genes, we selected the neocortex regions from  Table S4). We then overlapped genes with insertions in Autism Spectrum Disorder (ASD) and controls with the top 10% of gene expression observed.

Primer design and validation strategy
We excluded events overlapping duplicated regions or reference insertions of the same class from validations to reduce amplification artifacts. A custom pipeline, based on a previously developed pipeline (35), was used to obtain primer sequences for full-length validation.
PCRs were performed using Phusion Hot Start II High-Fidelity DNA Polymerase (F549L, Thermo Fisher Scientific) (see Additional File 3: Table S7 for detailed PCR protocols). Primers were tested and optimized using the Genome in a Bottle sample NA24385/HG002 (20), where we had sequencing data and high confidence insertions from the gold standard available. DNA was quantified using a Quant-iT™ dsDNA Assay Kit (Q33120, Thermo Fisher Scientific) before running at least 70 ng of PCR product, when possible, on a 2% agarose gel (for Alu) or a 1% agarose gel and with a Genomic DNA ScreenTape Analysis (5067-5366, Agilent) on an Agilent TapeStation (for L1) for a higher resolution at determining the insertion amplicon size. A 1kb Plus DNA ladder (10787-026, Invitrogen) was used.
Some primer pairs produced additional amplification bands or artifact bands and were further optimized by increasing the annealing temperature and/or decreasing the number of amplification cycles, and some primer pairs produced lower concentrations of DNA and were optimized by increasing the number of amplification cycles and/or decreasing the annealing temperature. If primer pairs did not amplify a unique non-insertion allele and had artifact bands, we did not proceed with validation of those insertions using those primers. Out of 12 L1 primer pairs designed for validations of de novo insertions, we were able to optimize 9 primer pairs, and we optimized 23 Alu primer pairs out of 25. 2 of the L1 primers were selected for mosaic candidates in 1 case and 1 control and were considered separately for validation rates. These 2 cases did not validate in lymphoblastoid cell line DNA. Fig. S1. The version of xTea used in this study has a high performance and is comparable to MELT in short-read Illumina WGS data. A xTea and MELT show higher F1 scores for Alu, L1, and SVA TEIs compared to Mobster. B At 40X coverage, the sensitivity of xTea is higher for Alu and is comparable to MELT for SVA and L1. The sensitivity of Mobster is higher for L1, at the cost of lower specificity. C The specificity of MELT and xTea is comparable for Alu and SVA at 40X and is higher in xTea for L1. Mobster shows a much lower specificity for Alu, L1, and SVA.        (26) (pLI >=0.90) compared to expected TEIs based on 10,000 random simulations using a position probability matrix to consider the L1 endonuclease cleavage site. A Similar results to the simulations performed with fully random insertions (Fig. 3) were observed, including a trend for more L1 insertions in SFARI genes in cases than expected by chance (p=0.003, q-value = 0.07). B We also identified a non-significant trend for more high pLI genes than expected in cases with this method (p=0.04, q-value = 0.52).