Constructs and transposition assays
Transposition assays were performed by co-transforming E. coli with two plasmids: an effector plasmid (pEffector) encoding the guide RNA and all protein components on a pCDFDuet-1 backbone, and a donor plasmid (pDonor) encoding a ~1-kb mini-transposon on a pBBR1 backbone. pEffector plasmids for the V. cholerae (Vch INTEGRATE, Tn6677) and S. hofmannii (ShCAST) systems were cloned from versions described previously [8]. TnsA-D90A pEffector plasmids for the V. cholerae system were generated by introducing a D90A mutation in TnsA. TnsAB fusion pEffector plasmids for the V. cholerae system were cloned by inserting a 5′-GC-3′ sequence in the region where the TnsA and TnsB open reading frames overlap. The D90A version of the TnsAB fusion protein was generated by introducing the D90A mutation into the TnsAB fusion construct. Components for the A. wodanis system were derived from the genome of Aliivibrio wodanis strain 06/09/160 (NCBI accession LR721750.1), synthesized as fragments (GenScript), and cloned into the pCDFDuet-1 backbone under control of an IPTG-inducible T7 promoter.
Transposition assays for the V. cholerae and S. hofmannii systems were performed by transforming chemically competent WT BW25113 E. coli. Cells were recovered at 37 °C for 1 h in LB media, plated on LB-agar with appropriate antibiotic selection, and incubated for 30 h at 37 °C. Transposition assays for the A. wodanis system were performed similarly by transforming BL21(DE3) E. coli, with final incubation on LB-agar also containing 0.1 mM IPTG for expression induction.
PCR and qPCR analyses
Colonies were scraped after incubation, resuspended, and subjected to heat lysis. PCR/qPCR analysis of the resulting lysates was performed as described previously [5]. Briefly, cells were resuspended in water, lysed at 95 °C for 10 min, and pelleted at 4000 × g for 2 min. For each sample, the resulting supernatant was diluted 20-fold and used as template either for a 12.5 μL PCR reaction with Q5 Polymerase (NEB), or for a 10 μL qPCR reaction with the SsoAdvanced Universal SYBR Green 2X Supermix (BioRad). Primer pairs were designed as described previously [5]. Briefly, each reaction included one primer specific to the genome and one primer specific to the mini-transposon, as shown in Fig. 2b.
For PCR, products were resolved by electrophoresis on 1.5% agarose gels stained with SYBR Safe (Thermo Scientific), and purified PCR products were confirmed by Sanger sequencing (Genewiz). For qPCR, each sample was analyzed using three primer pairs in three separate reactions: two pairs, as shown in Fig. 2b, probing for the T-RL and T-LR orientations, respectively, and a third pair amplifying a reference genomic region. Integration efficiency (%) for each orientation is defined as 100 × 2^ΔCq, where ΔCq = Cq(genomic reference pair) − Cq(T-RL pair or T-LR pair); the total integration efficiency is the sum of the two orientation efficiencies. This qPCR approach has been previously benchmarked using lysate samples simulating known integration efficiencies and orientation biases [5].
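The efficiency definition above can be written out as a short calculation. The following is a minimal sketch rather than the script used in this study; the function names are hypothetical, and the base of 2 follows the stated formula, which implicitly assumes that each primer pair doubles its product every cycle.

```python
def orientation_efficiency(cq_reference: float, cq_junction: float) -> float:
    """Integration efficiency (%) for one orientation.

    Implements 100 * 2^(dCq), with dCq = Cq(reference pair) - Cq(junction pair),
    where the junction pair is either the T-RL or the T-LR primer pair.
    """
    delta_cq = cq_reference - cq_junction
    return 100.0 * 2.0 ** delta_cq


def total_efficiency(cq_reference: float, cq_t_rl: float, cq_t_lr: float) -> float:
    """Total integration efficiency (%): sum of the two orientation efficiencies."""
    return (orientation_efficiency(cq_reference, cq_t_rl)
            + orientation_efficiency(cq_reference, cq_t_lr))
```

For example, a junction pair that crosses threshold one cycle later than the reference pair corresponds to 50% efficiency for that orientation, and two cycles later corresponds to 25%.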
SMRT-sequencing
Colonies were scraped from LB-agar plates after incubation and resuspended in LB media, and genomic DNA (gDNA) was extracted using the Wizard Genomic DNA Purification kit (Promega). Multiplexed, whole-genome SMRTbell libraries were prepared as recommended by the manufacturer (Pacific Biosciences). Briefly, 1 μg of high molecular weight gDNA from each sample (n = 20–22 per pool) was sheared to ~15 kb using a g-TUBE (Covaris). Sheared gDNA samples were then used as input for SMRTbell preparation using the Express Template Preparation Kit 2.0, where each sample was treated with a DNA Damage Repair and End Repair/A-tail mix in order to repair nicked DNA and create A-tailed ends. Barcoded overhang SMRTbell adapters were ligated onto each sample to complete SMRTbell library construction, and these libraries were then pooled equimolarly, with a final multiplex of 20–22 samples per pool. The pooled libraries were then cleaned up with 0.45X AMPure PB beads (Pacific Biosciences) and subjected to size selection on a Blue Pippin (SAGE Science) in order to remove DNA fragments <7 kb. The completed 20–22-plex pools were annealed to sequencing primer V4, bound to sequencing polymerase 2.0, and sequenced using one SMRTcell 8M on the Sequel 2 system, with a 30-h movie.
After data collection, the raw sequencing reads were demultiplexed according to their corresponding barcodes using PacBio’s Lima tool (version 1.11.0). The circular consensus sequencing algorithm (CCS version 4.2.0) was used to perform intramolecular error correction on demultiplexed subreads with at least 3 polymerase passes, generating highly accurate (>Q20) CCS reads. Each sample yielded between 7.1 and 12.6 Gb of total data (median = 8.7 Gb); CCS generation yielded between 36.3 k and 81.6 k CCS reads per sample (median = 48.4 k reads), with average CCS read lengths between 10.4 kb and 11.9 kb (median = 11.5 kb). For each sample, median CCS read quality ranged from Q32 to Q36. Full sequencing statistics are provided in Supplementary Table 1.
SMRT-seq data analysis
Analysis of CCS reads from SMRT-seq was performed using a custom Python script. For each sample, BLASTn was performed on the CCS read sequences to search for the mini-transposon (mini-Tn) sequence. BLASTn hits whose length was not within 5 bp of the length of the mini-Tn, or that had E-values greater than 0.000001, were removed. Reads without any valid hits for the mini-Tn sequence were removed from further analyses; if multiple mini-Tn hits were identified in a read, the hits were analyzed separately in order of 5′-to-3′ position within the read.
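The hit-filtering step above can be sketched as follows. This is an illustrative sketch, not the custom script itself: the `Hit` record and `filter_hits` function are hypothetical, hit coordinates are assumed to have already been converted from BLASTn output to 0-based, end-exclusive positions on the read, and "within 5 bp" is assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One mini-Tn BLASTn hit on a CCS read (hypothetical record layout)."""
    read_id: str
    start: int      # 0-based start of the hit on the read
    end: int        # end of the hit on the read (exclusive)
    evalue: float


def filter_hits(hits, mini_tn_length, max_length_diff=5, max_evalue=1e-6):
    """Keep hits of near-full mini-Tn length with E-value <= max_evalue.

    Returned hits are sorted by start position, i.e. in 5'-to-3' order
    along the read, so they can be analyzed one after another.
    """
    kept = [
        h for h in hits
        if abs((h.end - h.start) - mini_tn_length) <= max_length_diff
        and h.evalue <= max_evalue
    ]
    return sorted(kept, key=lambda h: h.start)
```

A read whose filtered list is empty would then be dropped from further analysis, as described above.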
For each mini-Tn hit within a read, 50-bp sequences flanking the mini-Tn were then extracted; flanking sequences shorter than 50 bp (i.e. mini-Tn hits positioned near either edge of the read) were discarded. This resulted in a list of 50-bp sequences flanking the mini-Tn for each read, with a pair of sequences corresponding to each mini-Tn hit. These flanking sequences (‘flanks’) were then classified as either donor-plasmid-mapping (‘plasmid flanks’) or genome-mapping (‘genomic flanks’), as follows: for each 50-bp flank, bowtie2 was used to align the flank to both the full donor plasmid sequence and the full target genome. The flank sequence was classified as genomic or plasmid based on which alignment had the lower Hamming distance, with a maximum allowed Hamming distance of 2. Flanks with Hamming distances greater than 2 from any genome or plasmid sequence were not classified, and the corresponding read was removed from further analyses. This process resulted in a list of classified flanks for each read, which was then used to classify the entire read. Reads containing only plasmid flanks were classified as plasmid reads; reads containing only genomic flanks were classified as simple insertion transposition products; and reads containing a mixture of plasmid and genomic flanks were classified as cointegrate transposition products.
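The flank extraction and classification rules above can be summarized in a short sketch. It is not the script used in this study: all function names are hypothetical, the Hamming distances are assumed to have been obtained beforehand from the bowtie2 alignments (with `None` meaning no alignment), and the handling of ties between the genome and plasmid distances, which the text does not specify, is an assumption (ties are left unclassified here).

```python
def extract_flanks(read_seq, hit_start, hit_end, flank_len=50):
    """Return (left, right) flanks of a mini-Tn hit; None if shorter than flank_len.

    Coordinates are assumed 0-based with an exclusive end.
    """
    left = read_seq[hit_start - flank_len:hit_start] if hit_start >= flank_len else None
    right = (read_seq[hit_end:hit_end + flank_len]
             if len(read_seq) - hit_end >= flank_len else None)
    return left, right


def classify_flank(genome_dist, plasmid_dist, max_dist=2):
    """Label a flank 'genomic' or 'plasmid' by the lower Hamming distance.

    Returns None when neither distance is <= max_dist, or on a tie
    (tie handling is an assumption, not taken from the text).
    """
    candidates = {}
    if genome_dist is not None and genome_dist <= max_dist:
        candidates["genomic"] = genome_dist
    if plasmid_dist is not None and plasmid_dist <= max_dist:
        candidates["plasmid"] = plasmid_dist
    if not candidates:
        return None
    best = min(candidates.values())
    winners = [label for label, dist in candidates.items() if dist == best]
    return winners[0] if len(winners) == 1 else None


def classify_read(flank_labels):
    """Classify a read from its flank labels; None means the read is removed."""
    if not flank_labels or any(label is None for label in flank_labels):
        return None
    kinds = set(flank_labels)
    if kinds == {"plasmid"}:
        return "plasmid"
    if kinds == {"genomic"}:
        return "simple_insertion"
    return "cointegrate"
```

Under these rules, a read with flank labels genomic, plasmid, plasmid, genomic would be called a cointegrate, while a single unclassifiable flank removes the whole read.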
The distances between multiple mini-Tn hits in reads categorized as cointegrates were manually inspected and confirmed to match the exact predicted length of the donor plasmid backbone. For all reads containing multiple mini-Tns (including ones not categorized as cointegrates or simple insertions), the same distance analysis was performed to look for mini-Tn hits within several bp of each other, which would be characteristic of tandem insertions; only one such read was found across all samples. We note that a number of reads contained multiple consecutive plasmid flanks, suggesting the presence of concatemer plasmids and resulting concatemer cointegrates [23]. These reads were classified using the same rules as above.
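The distance analysis above was done by manual inspection; the sketch below only illustrates how the same comparison could be expressed programmatically. The function names are hypothetical, and the `tandem_max_gap` threshold is a placeholder for "several bp", which the text does not quantify.

```python
def inter_hit_gaps(hit_spans):
    """Distances between consecutive mini-Tn hits on one read.

    hit_spans: iterable of (start, end) pairs, 0-based with exclusive end.
    """
    spans = sorted(hit_spans)
    return [next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(spans, spans[1:])]


def label_gaps(gaps, backbone_length, tandem_max_gap=5):
    """Label each gap as 'backbone', 'tandem', or 'other'.

    'backbone': gap equals the predicted donor backbone length (cointegrate).
    'tandem':   hits within tandem_max_gap bp (hypothetical threshold).
    """
    labels = []
    for gap in gaps:
        if gap == backbone_length:
            labels.append("backbone")
        elif 0 <= gap <= tandem_max_gap:
            labels.append("tandem")
        else:
            labels.append("other")
    return labels
```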
For each read that fell into either the simple insertion or cointegrate categories, the genomic coordinate where the genomic flank sequence mapped was recorded. For reads with more than one genomic-mapping flank, the first flank in the read (from 5′-to-3′) was used to determine the genomic insertion location. Extracted genomic coordinates were used to generate genome-wide histograms of integration locations; for visualization purposes, these locations were grouped into 912 5-kb bins. On-target reads were defined as reads with genomic insertion locations within a 100-bp window, centered at the site X-bp downstream of the 3′ end of the target site complementary to the guide RNA [5], with X = 49 for the V. cholerae system and X = 40 for the S. hofmannii system.
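The binning and on-target definitions above can be sketched as follows. The function names are hypothetical; the strand handling (adding the offset X for a target on the plus strand and subtracting it for the minus strand) and the inclusive treatment of the window edges are assumptions rather than details given in the text.

```python
from collections import Counter


def bin_counts(positions, bin_size=5000):
    """Count insertion positions per fixed-width genomic bin (bin index -> count)."""
    return Counter(pos // bin_size for pos in positions)


def expected_site(target_3prime_end, offset, strand="+"):
    """Expected insertion coordinate, offset bp downstream of the target 3' end.

    offset is X in the text (49 for V. cholerae, 40 for S. hofmannii).
    """
    if strand == "+":
        return target_3prime_end + offset
    return target_3prime_end - offset


def is_on_target(pos, center, window=100):
    """True if pos falls within a window of the given width centered on center."""
    return abs(pos - center) <= window // 2


def on_target_fraction(positions, center, window=100):
    """Fraction of insertion positions that are on-target."""
    if not positions:
        raise ValueError("no insertion positions supplied")
    on_target = sum(is_on_target(pos, center, window) for pos in positions)
    return on_target / len(positions)
```

With 5-kb bins, positions 0 and 4999 fall in the same bin and position 5000 opens the next one; the per-bin counts are what a genome-wide histogram would plot.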
Alignments of exemplary CCS reads showing either simple insertions or cointegrates were performed using Geneious Prime 2020.2 at medium sensitivity with no fine-tuning. Simple insertion reads were aligned to a synthetic reference genome with a simple insertion product added at the expected target site; cointegrate reads were aligned to a synthetic reference genome with a cointegrate product added at the expected target site. Alignments were visualized using IGV 2.8.2, with indels <10 bp not shown.