Integrating transposable elements in the 3D genome

Chromosome organisation is increasingly recognised as an essential component of genome regulation, cell fate and cell health. Within the realm of transposable elements (TEs) however, the spatial information of how genomes are folded is still only rarely integrated in experimental studies or accounted for in modelling. Whilst polymer physics is recognised as an important tool to understand the mechanisms of genome folding, in this commentary we discuss its potential applicability to aspects of TE biology. Based on recent works on the relationship between genome organisation and TE integration, we argue that existing polymer models may be extended to create a predictive framework for the study of TE integration patterns. We suggest that these models may offer orthogonal and generic insights into the integration profiles (or “topography”) of TEs across organisms. In addition, we provide simple polymer physics arguments and preliminary molecular dynamics simulations of TEs inserting into heterogeneously flexible polymers. By considering this simple model, we show how polymer folding and local flexibility may generically affect TE integration patterns. The preliminary discussion reported in this commentary is aimed to lay the foundations for a large-scale analysis of TE integration dynamics and topography as a function of the three-dimensional host genome.


Background
Transposable elements (TEs) are DNA sequences that can move from one location of the genome to another. By being able to spread their own DNA across the genome independent of the cell's replication cycle [1], TEs represent the majority of genomic content in most eukaryotes. For example, they comprise 85% of the maize genome [2] and up to 50% of primate genomes [3]. As such, TE activity is a major driver of phenotypic and genotypic evolution [4,5] and affects key biological processes from meiosis and transcription to immunological responses [6]. At the same time, TEs have been associated with various diseases and cancer in humans [7].
Most TEs transpose via cut-and-paste or copy-andpaste mechanisms that can both result in a net increase of the TE copy number [8]. Amplification phases, or bursts, of TEs can occur multiple times in the evolutionary history of the host and may produce hundreds if not thousands of new copies within short time windows [5,9,10].
Most TEs exhibit some level of integration site selection, from very specific target sites [11] to non-random but more dispersed genomic biases [12][13][14]. Short DNA motifs, epigenetic marks and nuclear proteins have been associated with such integration site preferences. For example, yeast Ty1 retrotransposons integrate upstream of Pol III-transcribed genes through a direct interaction between the integrase complex and the AC40 subunit of Pol III [15,16]. In contrast, in plants and fungi, the integrase of certain Gypsy retrotransposons contains a chromodomain that can bind to repressive histone marks and aid insertion into heterochromatin [17].
While the role of protein tethering and DNA motifs in TE integration is well established by now, it remains elusive how the three-dimensional (3D) structure of chromosomes and the nuclear environment is affecting TE spreading in host genomes. Chromosome folding and nuclear organisation have been shown to play key roles in all major DNA related processes [18][19][20][21][22][23][24][25], from transcription and replication to DNA repair, and it is thus natural to expect that transposition will also be affected by the 3D organisation of the genome.

Roles of TEs in 3D genome organisation
A number of recent reports have highlighted that TE activity is involved in shaping 3D chromosome structure. For example, TEs of diverse families have been implicated in the establishment and maintenance of insulator boundaries between so-called "topologically associated domains" (TADs) [26][27][28][29][30][31]. Furthermore, TE amplification is suggested to account for a significant amount of binding motifs for the CTCF [32][33][34] protein, a key regulator of 3D chromosome organisation [35]. The involvement of TEs in the establishment of evolutionarily conserved long-range chromosomal interactions has been shown in different organisms [36,37] and some of these TE-mediated interactions appear to be of functional importance in gene regulation. In the plant Arabidopsis thaliana, TEs are enriched at genomic hubs of long-range chromosomal interactions with anticipated functional roles in silencing of foreign DNA elements [38].
Arguably, how TEs contribute to genome folding will certainly receive more attention in the future, but it is an equally fundamental question for both genome and TE biology to understand how genome folding affects TE integration preferences. For example, depending on the 3D organisation of chromosomes, a new TE copy that enters the nucleus from the cytoplasm will come across distinct parts of the genome in terms of their accessibility and organisation compared to a preexisting TE copy that relocates to a new genomic locus without exiting the nucleus. Intriguingly, retroelements (including retrotransposons and retroviruses) and DNA transposons have different replication and transposition pathways [39], which implies that genome architecture may have a different impact in each TE type.
There is a clear gap in the experimental and theoretical work on understanding the impact of genome architecture on integration of TEs. Models of TE amplification dynamics have traditionally been based on populationbased approaches [40], which typically set up systems of (stochastic) ordinary differential equations accounting for generic competing elements during TE expansion [41][42][43]. Few works, instead, have considered the 1D distribution of nucleosomes along the genome in order to predict preferential sites of HIV integration [44]. Both these classes of models necessarily neglect the multi-scale 3D organisation of the genome, i.e. from nucleosomes to TADs and from compartments to chromosome territories [45,46]. Because of this, they are not suited to predict the "topography" of TEs, i.e. the pattern of genomic sites in which TEs will preferentially integrate.
In this commentary, we introduce and discuss a computational model based on principles of polymer physics, which aims to dissect the interplay between genome organisation and biases in TE integration. We first briefly review the existing framework of polymer models -which have been proved to be very successful tools to rationalise 3D genome folding [47][48][49][50][51][52][53] -we then discuss a recent development of such models to understand the physical principles of HIV integration [54], and finally present preliminary data obtained by extending these models to the case of TEs (Fig. 3). We conclude this commentary by discussing potential future directions in this unexplored line of research.

Biophysical principles of genome folding
While genomes are, biologically speaking, the carrier of genetic information they also are, physically speaking, long polymers [55]. Polymers are well-known objects that have been studied for several decades in particular in relation to industrial applications, such as rubbers [56]. Pioneers in polymer physics realised a long time ago that they obey "universal" laws that are independent of their chemical composition [57]. For instance, the way a long polystyrene molecule folds in space must be identical, statistically speaking, to that of a long DNA molecule in the same solvent conditions. Because of this, polymer physicists typically employ "coarse-grained" approaches, which blur the chemical details and only retain the necessary ingredients that allow the formulation of simple and generic (universal) frameworks [58]. Universality then implies that these coarse-grained models have predictive power for a broad range of systems with different chemistry.
Coarse-graining several base-pairs and groups of atoms into mesoscopic beads (see Fig. 1a), while retaining the salient physical behaviour of DNA, allows the formulation of computational models that can reliably predict the spatial organisation of whole chromosomes from minimal input -such as epigenetic patterns and generic binding proteins [51,53,[59][60][61][62][63][64] -and disentangle the contribution of different classes of proteins to genome folding [61,65,66].
These computational models, coupled to Chromosome Conformation Capture (and its higher order variants, such as HiC) experiments [18,19], are providing new information on the spatio-temporal organisation of the genome in different conditions, such as healthy [62,67], senescent [68] and diseased [52] cells, or even during cell-fate decisions [25] and reprogramming [69]. For instance, polymer models can rationalise features such as TADs [64,70], compartments [51,59] and loops [53,64] seen in experimental HiC maps [71]. Importantly, these works are proving that traditionally physical phenomena such as liquidliquid and polymer-polymer phase separation [49,[72][73][74][75][76], gelation [77,78], emulsification [79] and viscoelasticity [80,81] may be found in ubiquitous and key biological processes such as transcription, replication, mitosis, RNA splicing and V(D)J recombination to name a few. Polymer highlighted are the features of DNA stiffness (set by penalising large angles θ between consecutive pairs of monomers) and connectivity (set by penalising large extensions x between consecutive beads). We also account for excluded volume interactions and pair-attraction represented by the wrapping of the orange segment around the histone octamer (here a blue spherical bead). c Schematics showing that integration events on DNA deform the substrate. d Snapshots from molecular dynamics simulations showing an integration event within a nucleosome. Color scheme: orange = wrapped host DNA, green= viral DNA, grey = non-wrapped host DNA. Adapted from Ref. [54] models are thus providing the community with a physical lens through which they may interpret complex data, and a quantitative framework to generate de novo predictions. In light of this, we here propose that the use of polymer models may shed new light into the relationship between TE transposition and 3D organisation. Earlier this year one of us made a first step in this direction by formalising a polymer-based model for understanding the site selection features displayed by HIV integration in the human genome [54]. Below, we briefly review this work, which will then be used as a stepping stone to formalise a polymer model for TE expansion.

A polymer model for HIV integration
One of the least understood features of HIV integration is that its integration patterns display markedly non-random distributions both along the genome [82] and within the 3D nuclear environment [83]. HIV displays a bias for nucleosomes [84,85], gene-rich regions [82] and superenhancer hotspots [86] that has defied comprehension for the past three decades. Clearly, from the perspective of a retrovirus such as HIV, integrating in frequently transcribed regions is evolutionary advantageous. But how is this precise targeting achieved?
For the past decades, the working hypothesis to address this important question was that there must exist specialised factors or protein chaperones that guide HIV integration site selection. Prompted by this hypothesis, much work has been devoted to discover and identify such proteins [14]. Some factors, such as the lens-epitheliumderived growth factor [87] (LEDGF/p75) have been proposed as potential candidates for this role but even knocking-down their expression could not completely remove the bias for gene-rich regions [88]. Additionally, the preference of HIV to integrate in nucleosomes -oppositely to naked DNA -was shown in vitro using minimal reaction mixtures [85,[89][90][91].
We recently put forward a different working hypothesis to address the bias of HIV integration site selection: could there be universal (non-system specific) physical principles that -at least partially -can contribute to biasing the site-selection of HIV integration? [54] Prompted by this hypothesis, we decided to propose a polymer-based model in which retroviral integration occurs via a stochastic and quasi-equilibrium topological reconnection between 3D proximal polymer segments [54]. In other words, whenever two polymer segments (one of the host and one of the invading DNA) are nearby in 3D we assign a certain probability for these two segments to reconnect, based on the difference in energy between the old and new configurations. This strategy is known as a Metropolis criterion and it satisfies detailed balance, thus ensuring that the system is sensitive to the underlying free energy [92]. [Note that in vitro HIV integrase works without the need of ATP [85], and we therefore assume that the integration process is in (or near) equilibrium.] Finally, we impose that the viral DNA is stuck once integrated in the selected location and cannot be excised within the simulation time.
Within this simple model, we discovered that geometry alone may be responsible for a bias in the integration of HIV in nucleosomes (Fig. 1b-d adapted from Ref. [54]). This is because the pre-bent conformation of DNA wrapped around the histone octamer lowers the energy barrier against DNA deformation required to integrate the viral DNA into the host (Fig. 1b, see also Refs. [89][90][91]). While the preference of HIV for pre-bent DNA conformations was suggested before [85,93,94], it had not been explained and formalised within a physical and mathematical framework that could generate quantitative predictions. Further, by considering a longer region of a human chromosome folded as predicted by the above mentioned polymer models [49,50,59], we discovered that at larger scales HIV integration sites obtained from experiments [82] are predominantly determined by chromatin accessibility. Thus, by accounting for DNA elasticity and chromatin accessibility -two universal and cell unspecific features of genome organisation -our model could predict HIV integration patterns remarkably similar to those observed in experiments in vitro [84,85] and in vivo [82].

Extension to DNA transposition
In light of the success of this polymer model, we now propose to extend it to understand the distribution of integration sites across the TE phylogeny. Importantly, different TEs have different integration strategies, which suggests that they are likely to interact differently with the 3D genome organization within the nucleus. TEs can be primarily distinguished based on the mechanism through which they proliferate, i.e. via "copy and paste" (class I or retrotransposons) or "cut and paste" (class II, DNA transposons). The former require an RNA intermediate to proliferate and thus exit the nuclear environment, whereas the latter simply relocate their DNA via endonuclease excision [8,39].
Retroviruses, like HIV, are very similar to retrotransposons with the addition that they can exit to the extracellular space and invade other cells. Thus, a new copy of a retrotransposon (class I or copy-and-paste) or a retrovirus must travel from the periphery to the nuclear interior while the genome is "scanned" from the outsidein for integration sites (Fig. 2a). This implies that the global, nuclear-scale genome architecture is expected to be important for this re-integration process. For instance, Lamin Aassociated Domains (LAD) [21] positioning with respect to nuclear pores, inverted versus conventional organisation [95], compartments [71] and enhancer hotspots [86] will likely play major roles for retrotransposons. On the contrary, a DNA transposon (class II or cut-andpaste) probes the genome in the immediate surrounding of its excision site and will diffuse from the inside-out (Fig. 2b-c). As a result of this, the mesoscale (∼1 Mbp) organisation of the genome may have a profound effect on the 1D genomic distribution of DNA transposons. For example, heterochromatin-rich chromatin is thought to be collapsed [49,96] with a typical overall size that depends on the genomic length as R ∼ L 1/3 ; on the other hand, euchromatin-rich compartments [97] are more open [98] and their size may be more similar to that of a random walk, i.e. scaling as R ∼ L 1/2 . The contact probability of two genomic loci at distance s can be estimated to scale as P c ∼ s −3ν [47] where ν is 1/3 for collapsed polymers (such as heterochromatin), 1/2 for ideal ones (such as euchromatin [96]) and 3/5 for self-avoiding walks [56]. Thus, a crude calculation would predict that a DNA transposon should re-integrate at distance s with a probability P c (s) ∼ s −3ν that depends (through the exponent ν) on the folding of the genome at these (TAD-size) length-scales. A similar effect is at play in the enhancement of long range contacts in oncogene-induced senescent cells [99].
It should be noted that the arguments above assume that the chromosomes can be seen as polymers in the melt [55,100]. In such a picture, chromosome folding at the level of TADs (100kbp-10Mbp) and territories (>10Mbp) takes place on heteromorphic chromatin fibres, which can assume a range of local packaging at the scale of 10-30 nm (1-100kbp) [62,101,102].
In addition to this contribution coming from large-and meso-scale folding, one may argue that there ought to be other complementary effects such as specific features of the integrase [90,103] or tethering [17] enzymes. These orthogonal elements are more local and are expected to equally affect both DNA transposons and retrotransposons. To investigate the role of local chromatin features on a generic integration event, we here perform some original, yet preliminary, simulations on a heterogeneously flexible polymer that crudely mimics heterogeneous chromatin in vivo (Fig. 3a). Specifically, we consider a stretch of 1.6kbp long DNA with persistence length l p = 150bp = 50nm and regularly interspersed with soft sites that display a lower bending rigidity l f . This lower local DNA rigidity may be due to, for instance, to denaturation bubbles [104], R-loops [105] or replication stress [106]. In these conditions -which may be reproduced in vitro by considering DNA with a sequence of bases that modulates its local flexibility [84] -we ask what is the integration pattern displayed by an invading DNA element by counting the number of integration events in each segment of the polymer over many (1000) independent simulations. We observe that, by varying the value of the rigidity parameter Fig. 2 a Copy-and-paste transposition explores the nuclear space by diffusing from the periphery towards the interior, i.e. outside-in. The large-scale nuclear architecture, i.e. inverted or conventional [95], Lamin Associated Domains (LADs) [21], compartments [71] and enhancers hot-spots [83,86], are expected to play the biggest roles in the integration site selection. b-c Cut-and-paste transposition explores the nuclear interior inside-out. In this case, TAD-scale (∼1 Mbp) genome folding is expected to dominate and in particular open conformations will yield short range de novo re-integration whereas collapsed ones will lead to longer range re-integration. Duplication of the transposon is also possible by homologous DNA repair of the broken strands from l f = l p = 150 bp to l f = 60 bp, the integration patterns become less uniform, more periodic and reflecting the distribution of soft sites (Fig. 3b).
From these patterns we can compute the enhancement of integration in susceptible sites due to their different flexibility. This is simply the sum of integration frequencies in all soft sites divided by the one expected for a random distribution of events, i.e. n/N where n is the number of soft sites and N the length of the polymer. This calculation is reported in Fig. 3c and shows that the enhancement increases with the flexibility of the susceptible sites. For small deviations from l p = 150 bp, this increase can be fitted as a single exponential as expected for a Metropolis Monte Carlo algorithm. For l p = 60 bp the increase is faster than exponential and may be indicative of non linear effects coming from the polymer folding in presence of soft sites.
The output of these simulations may be readily measured in experiments in vitro on designed DNA and chromatin templates as done for HIV [89,91], and may thus inform the mechanistic principles leading to DNA integration. Perhaps more importantly, however, these simulations suggest that the heterogeneity of the DNA (or chromatin) substrate in both mechanics and folding may affect TE expansion with potentially important and far-reaching consequences on the evolutionary paths and proliferative success of certain TEs in vivo.
It should be finally mentioned that other classes of transposases have been found to display enhanced efficiency on bent or geometrically deformed substrates. Most notably the DNA-bending class of proteins HMGB is found to enhance the efficiency of V(D)J recombination by RAG1-RAG2 [107,108] and Sleeping Beauty transposition [109]. This suggests that the model introduced in Ref. [54] and described here could have a broader relevance to other classes of transposition and recombination.

Conclusions
It is now becoming increasingly clear that cell function, health and fate are correlated to 3D genome folding [25,45]. TEs are intrinsically linked to 3D organisation as they are "living elements" within a complex multi-scale environment. In the last few years, there have been a handful of studies that started to interrogate how TEs shape genome organisation, from demarcating TAD boundaries [29][30][31] to harboring binding sites for architectural proteins [34]. It is thus now realized that TEs have profound implications in the fate and health of a cell -not only via the traditional pathway of genomic instability and epigenetic silencing -but also through the global regulation of genome folding. Now, while this crucial relationship will certainly receive more attention in the future, in this commentary we argue that the other direction of the relationship, i.e. how 3D structure affects de novo TE insertions, is also of utmost importance. For example, biases in insertion patterns due to tissue-specific genome organisation in the germline Fig. 3 a Sketch of the original simulations performed in this work where we consider a segment of DNA 1.6 kbp long (or N = 200 beads, each bead representing 8 bp) with rigidity l p = 150 bp. The DNA is interspersed with "soft" sites which display a different rigidity l f . The length of these weak sites is 8 bp, or 1 bead. b We compute the frequency of integration events per each segment of the substrate by counting the number of events occurring at a specific locus over the total integration events. We average over 1000 independent simulations. One can notice that the patterns, which are roughly uniform for l f = l p become more and more periodic and reflecting the positions of the soft sites (denoted by the black arrows) when we reduce l f . The dotted line shows the expected frequency for random events 1/N, with N = 200 the length of the substrate. For clarity we report only the segment 0.5-1 kbp. c Integration enhancement in soft sites over the expected random frequency. Each box represents a different value of the rigidity of the soft sites l f . Recall that l f = l p = 150 bp reflects a uniformly stiff substrate and indeed we recover the expected value (unity) for the enhancement (versus, for example, somatic cells) may create preferential pathways for genome evolution. In mammals, while the overall genome organisation is preserved in the germ line, the strength of specific features such as compartments and TADs varies [110][111][112]; far less is known in plants and significant differences between chromosome organization in germ line and somatic cells have been reported [113]. We suggest that the dissection of the interplay between 3D organisation and TE integration could be done by employing a "perturb-and-measure" strategy, i.e. by inducing TE expansion in a cell line whilst obtaining information on the 3D genome organisation and epigenetic states pre and post expansion. This approach may determine -also through the use of polymer physics models -which 3D and/or epigenetic features are associated with de novo TE insertions and thus detect insertion biases. Consequently, it will allow the generation of "topographical maps" of TE insertions in a given tissue-specific 3D genome organisation. Ultimately, understanding the preferential insertion of TEs may lead to a better understanding of genome and TE evolution or even inform better strategies to drive genomic variations in crops.