Genome ARTIST: a robust, high-accuracy aligner tool for mapping transposon insertions and self-insertions

Background A critical topic of insertional mutagenesis experiments performed on model organisms is mapping the hits of artificial transposons (ATs) at nucleotide level accuracy. Mapping errors may occur when sequencing artifacts or mutations as single nucleotide polymorphisms (SNPs) and small indels are present very close to the junction between a genomic sequence and a transposon inverted repeat (TIR). Another particular item of insertional mutagenesis is mapping of the transposon self-insertions and, to our best knowledge, there is no publicly available mapping tool designed to analyze such molecular events. Results We developed Genome ARTIST, a pairwise gapped aligner tool which works out both issues by means of an original, robust mapping strategy. Genome ARTIST is not designed to use next-generation sequencing (NGS) data but to analyze ATs insertions obtained in small to medium-scale mutagenesis experiments. Genome ARTIST employs a heuristic approach to find DNA sequence similarities and harnesses a multi-step implementation of a Smith-Waterman adapted algorithm to compute the mapping alignments. The experience is enhanced by easily customizable parameters and a user-friendly interface that describes the genomic landscape surrounding the insertion. Genome ARTIST is functional with many genomes of bacteria and eukaryotes available in Ensembl and GenBank repositories. Our tool specifically harnesses the sequence annotation data provided by FlyBase for Drosophila melanogaster (the fruit fly), which enables mapping of insertions relative to various genomic features such as natural transposons. Genome ARTIST was tested against other alignment tools using relevant query sequences derived from the D. melanogaster and Mus musculus (mouse) genomes. Real and simulated query sequences were also comparatively inquired, revealing that Genome ARTIST is a very robust solution for mapping transposon insertions. Conclusions Genome ARTIST is a stand-alone user-friendly application, designed for high-accuracy mapping of transposon insertions and self-insertions. The tool is also useful for routine aligning assessments like detection of SNPs or checking the specificity of primers and probes. Genome ARTIST is an open source software and is available for download at www.genomeartist.ro and at GitHub (https://github.com/genomeartist/genomeartist ). Electronic supplementary material The online version of this article (doi:10.1186/s13100-016-0061-0) contains supplementary material, which is available to authorized users.


Background
Consequent to the sequencing of model genomes, a massive effort was focused towards in vivo validation of putative genes, as an essential support for accurate biological annotations. D. melanogaster is arguably the most versatile eukaryotic model for genetics and genomics studies and insertional mutagenesis was of paramount importance for bridging genetics and molecular genetics of this organism [1]. Nevertheless, many other model genomes, such as those of Pseudomonas aeruginosa [2], Saccharomyces cerevisiae [3], Caenorhabditis elegans [4], Danio rerio [5] and Arabidopsis thaliana [6] are also currently interrogated with transposon mutagenesis. Although high-throughput procedures are predominant nowadays, small-scale experiments are still performed whenever particular mutant phenotypes are considered. Insertional mutagenesis is a very effective strategy used to construct mutant alleles and it relies on a plethora of specific ATs designed for this purpose [7,8]. Many ATs are defined at their ends by TIRs, as it is the case of P{lacW} [9] and P{EP} [10] molecular constructs, which were designed for mutagenesis of D. melanogaster genome. Almost all transposon insertions conduct to the duplication of a short target sequence (target site duplication or TSD), therefore each of the TIRs is flanked by a TSD [11]. The raw data used to map the insertional mutations is composed of query sequences containing transposongenome junctions (or transposon-genome reads). These reads are usually obtained by sequencing specific amplicons derived by inverse PCR (iPCR) performed on DNA template extracted from specific mutants [12]. Actually, mapping an insertion consists in computing the reference coordinate of the genomic nucleotide present at the juxtaposition between the genomic fragment and TIR in the transposon-genome read. We further refer to this critical nucleotide as terminal genomic nucleotide (TGN).
The mapping accuracy may be hindered when smallscale genomic mutations like SNPs or small indels are present very close to the TIR or when minor sequencing artifacts located near to the TIR affect query sequences. This issue is not manageable by available mapping tools as they rely on identification and removal of the transposon fragments from the transposon-genome read. This trimming of the transposon fragments results in a shorter query sequence, which is further aligned against the reference genome, in order to identify the site of insertion. It is important to notice that, consecutive to the trimming, the impeding small-scale mutations or sequencing artifacts become located very close to the end of the new query sequence. From our mapping experience, it is challenging to overpass such small-scale mutations or sequencing artifacts. Hence, the TGN is often not included in the final genomic alignment and therefore a nucleotide that precedes the mutation is erroneously reported as the insertion site instead. We developed Genome ARTIST, an application designed to map insertions of DNA entities into a reference sequence, but also the self-insertions of transposons, even when interrogated with poor-quality or mutations-bearing query sequences. The mapping strategy of Genome ART-IST is resilient to small-scale mutations and sequencing errors, providing a more accurate mapping performance as compared to similar mapping tools, such as iMapper [13].
Herein, we describe the performances of Genome ARTIST v1. 19, an offline, gapped heuristic aligner which was originally conceived to map insertions of ATs in D. melanogaster genome using the specific files archived in FlyBase database format [14]. In order to cope with various genomes archived in Ensembl [15] or NCBI [16] database formats, specific scripts were written in order to enable Genome ARTIST to map insertions in a wide range of prokaryote and eukaryote genomes.

Software requirements
Genome ARTIST was written in C++ and JAVA for Linux OS. The minimal computer requirements are an Intel Atom 1 GHz CPU or equivalent, 1 GB of disk memory, 1 GB of RAM for bacteria and invertebrate genomes and up to 4 GB of RAM for the small vertebrate genomes. Genome ARTIST was designed for 32-bit architectures but it may also be run on a 64-bit OS version by using the detailed instructions presented in Additional file 1 (available in docs folder and as an additional file). The user may either copy Genome ARTIST on the hard disk or can run it from an external device formatted as ext3 or ext4. Regardless of the choice, the Genome-ARTIST.sh file should be selected as an executable. We tested Genome ARTIST and obtained similar performances on Ubuntu (versions 10.04, 11.04, 12.04, 13.04, 14.04), Linux Mint 14.1, Open Suse 12.3, CentOS 6.4, Fedora 19 and on Bio-Linux 8 bioinformatics workstation platform [17]. Bio-Linux 8 is a straightforward alternative for using Genome ARTIST since it contains the pre-installed Java JDK environment and the appropriate 32-bit library required for running Genome ARTIST on the 64-bit OS version. As a feasible alternative for the Linux environment, we tested the open-source Oracle virtual machine VirtualBox for emulating Bio-Linux 8 on Mac X OS and Windows platforms. Consecutive to the installation of the ISO file format of Bio-Linux 8 as a virtual machine on both OS versions, we were able to run Genome ARTIST with full performances. After opening the Genome ARTIST folder in Bio-Linux 8 environment, the user should select: Edit > Preferences > Behavior > Ask each time in order to customize Ubuntu 14.04 to run appropriate files as executable. The Genome-ARTIST.sh file must be marked as an executable following the path: Properties > Permissions > Execute, then Genome ARTIST can be run for mapping work. The specific scripts required to convert genome data downloaded from either Ensembl or NCBI should also be marked as executable in order to work (see Additional file 1).
In order to compute the alignments results, different fragments of the reference sequences must be loaded in RAM, which is a time consuming step. To circumvent this aspect, the script cachePreloadGenomes.sh optimizes the writing of big chunks of data from the hash tables, .raw and .gene files in RAM, concomitant with launching Genome-ARTIST.sh.

The mapping strategy of genome ARTIST
The nucleotides are binary encoded by Genome ARTIST as A = 00 (0), C = 01 (1), G = 10 (2), T = 11 (3), where the decimal conversion of binary values is shown in parentheses. Overlapped intervals of 10 nucleotides referred as decamers or basic intervals (BIs) are used for indexing the reference sequences and for spanning the query sequence. The decamers are overlapped by 9 nucleotides. The length of BIs was arbitrarily chosen in order to offer an equilibrium between the accuracy and speed of the alignment steps. Longer BIs would affect the mapping accuracy and shorter ones would increase the aligning time. During the loading of a reference AT or genome sequence, Genome ARTIST builds a hash table with an index for each decamer. The hash tables for each reference sequence are computed and saved as .hash files. They are accessed when interrogated with the overlapped decamers of the query sequence and then the specific addresses relative to coordinates of the reference sequences are retrieved. Specific files are generated in the resources folder, namely distinct .raw files containing the standard nucleotide strand of each reference sequence and specific associated .gene files containing the gene annotations. By creating distinct files for each chromosome of a genome. Genome ARTIST is particularly able to work with single or many chromosomes. Genome ARTIST allows the user to customize each working session by adding or deleting chromosomes, genomes or transposons, depending on the queries or on the purposes of the research project. The time necessary for hashing depends on the size of the genome. Multiple tests revealed that less than a minute is required for hashing a bacterial genome, a few minutes are necessary for invertebrate genomes and around 20 min are required for small vertebrates as D. rerio if average computing power is used. Large mammalian genomes such as those of M. musculus and Homo sapiens are too big to be dealt with by Genome ARTIST, but either distinct chromosomes or groups of chromosomes may be loaded from any mammal reference genomes and used for mapping of insertions (about a half of the human genome is loadable in a single working package). On average, when starting a query search for a sequence of about 500 nucleotides, Genome ARTIST computes the list of the resulting alignments in a time interval ranging from seconds to tens of seconds, contingent upon the particular CPU performances and the size of the reference genome. As a rule of thumb, using a computer having a Core i7 processor and 4 GB of RAM memory, 100 bp from a query are mapped in 1 s for the genome of D. melanogaster and even faster for genomes of bacteria. Genome ARTIST supports mapping of multiple query sequences either in FASTA format (where care should be the taken to avoid empty spaces before the ">" symbol of the first FASTA descriptor in the list), or in text format, assuming that all query sequences in the list are separated by at least an empty row from each other.
The overlapped and/or adjacent BIs are merged into contiguous association intervals. Their margins are further extended by a combination of a Smith-Waterman (SW) algorithm [18] implementation (SW1 step) and an original scoring formula. The expansion strategy of Genome ART-IST relies on gradually computing an alignment score for a gliding window of four nucleotides, which was designed as a robust procedure able to surpass both mutations like SNPs or small indels and various sequencing artifacts (see Additional file 2). The resulting product of the expansion step is referred to as an extended interval (EI) and represents an association interval between two nucleotide stretches: a query fragment and a matching nucleotide window of the reference sequence. Whenever existent, the overlapped or adjacent EIs are joined together into nucleotide associations referred as MEIs (merged extended intervals). Each MEI is further converted into a proper alignment by a second SW implementation (SW2 step) and is graphically reported as a partial alignment (PA). Except for sequences which contain only genomic or transposon nucleotides, where the SW2 product is reported as the final result, a PA covers the query sequence just partially and it is considered an intermediate result. All of the PAs identified for the same query sequence, regardless if they are transposon partial alignments (TPAs) or genomic partial alignments (GPAs), are reported in a single customizable list, according to the criteria of score, location or nucleotide coordinates. Each PA contains a core region referred as a nucleus, defined by the outermost possible lateral stretches of at least 10 consecutive nucleotide matches (see Additional file 2). The nucleus is flanked by sub-alignments with lower matching density (alignment tails) and is of high importance during the assembly and scoring of the results. The structure and length of both the nucleus and the alignment tails of a PA are dependent on the settings applied for the specific parameters of Genome ARTIST (see Additional file 2).
The main innovation of Genome ARTIST is the dynamic procedure used to set the border between genomic and transposon fragments present in the composite query sequences. The most challenging step of the procedure is to merge the appropriate PAs into a final alignment, in order to cover the entire query sequence and to detect the insertion coordinate with very high accuracy. To solve this item, Genome ARTIST combines TPAs and GPAs in an interactive manner, using original joining rules that govern the edge trimming and merging of PAs. The first rule is that, when overlapping, the nucleus of a PA is privileged over the alignment tail of the partner PA, regardless of the origin of the two PAs. A second rule is that if the nucleus of a TPA happens to overlap the nucleus of a GPA (overlapping is allowed between two nuclei, but no more than 40 % over their individual length), the shared nucleus fragment is allotted to the transposon in the final mapping result. This feedback between TPA and GPA entities is designed to prioritize both the TIR integrity and the structure and length of the nuclei. If the transposon fragment is not affected by mutations or by sequencing artifacts, the TIR-containing TPA would have no alignment tail towards the border with the GPA since the TPA cannot exceed the margin of the transposon reference sequence beyond the TIR. On the contrary, even when perfectly aligning composite queries are interrogated with Genome ARTIST, an alignment tail is generated at the TIR-facing end of the GPA, due to the random extension of the genomic alignment into the transposon fragment. This acquisitive behavior is possible because Genome ARTIST does not employ the standard practice of ab initio identification and removal of the transposon fragments to obtain cleansed genomic fragments, which are further aligned against the reference sequence. If the composite query sequence is affected by mutations or by sequencing artifacts occurring around the genome-TIR border, the alignment tails would contain them as indels and mismatches located close to each nucleus. It is crucial to correctly include these gaps and mismatches in the final result in order to increase the mapping accuracy. Although an intermediary TPA-GPA intersection point is estimated by Genome ARTIST, the insertion coordinate is computed only consecutive to a final re-alignment of each component PA of the final result by means of a supplemental SW adaptation. This SW3 step is applied only for those PAs which are merged into a final alignment, because the joining process often involves edge trimming of alignment tails or/and of nuclei, thus changing the context for which the alignment was optimal consecutive to SW2 step. The rationale for SW3 is simple: when mutations or sequencing artifacts are present very close to the junction border, the adjustment of the overlapped sub-alignments may affect the best possible final alignment of each modified PA, a condition which affects the mapping accuracy.
The original, key aspect of the SW3 implementation of Genome ARTIST is that the query fragment is not realigned against the exact corresponding reference nucleotide window of the PA but against a longer one. Essentially, the initial reference window is elongated with two lateral nucleotide strings, each of them representing the next 10 consecutive nucleotides of the main reference sequence. When the reference sequence window of a PA is located close to the end of the main reference sequence, one of the lateral strings is either shorter than 10 nucleotides or even absent and SW3 is accordingly performed. As a result of this approach, the gaps and mismatches located close to the border may be included in the final result. The joining strategy of Genome ARTIST overcomes mapping problems encountered when a transposon is inserted very close to SNPs or small indels in a particular genotype. A flowchart of Genome ARTIST's mapping strategy is described in Fig. 1.
When poor quality query sequences are analyzed, false positive alignments with conjunctural better scores may obscure the actual unique insertional event. To circumvent this problem, we implemented an optional cumulative bonus score of 500, which is applicable only for alignments which contain a TIR-genome border. By selectively boosting the scores of alignments that contain a TIR-genome juxtaposition, the bonus score helps the user to distinguish among real insertional events and circumstantial false positives having close aligning scores. The utility of the bonus score is evident when dealing with poor-quality query sequences which require regular trimming. Genome ARTIST was devised to resolute insertions in unique genomic sequences and the bonus option is a feature supporting this purpose. On the other hand, mapping of self-insertions is a representative asset of Genome ARTIST tool and the bonus option should be avoided when mapping such molecular events. The reason is that short genomic sequences which may randomly be placed close to TIRs are highlighted if the conditions for bonus allocation are fulfilled. Since many ATs contain in their structure genetic markers derived from the target model genome, the bonus usage may gratuitously highlight alignments which stand for apparent insertions in the corresponding genomic locations. An example is represented by the self-insertion of P{lacW} construct in its own mini-white marker. If the bonus option is activated, the best scoring result reported by Genome ARTIST is a false positive genomic insertion in white locus, outscoring the real self-insertion event with the arbitrary score of 500. As a rule of thumb, whenever Genome ARTIST reports an insertion in a gene cloned in the respective AT, it is a good option to analyze the respective query sequence without the bonus option.
The mapping performances of Genome ARTIST may be fine-tuned by adjusting the values of a set of alignment parameters (see Additional file 2). Whenever illustrative for the examples described in this article, the values used to compute some particular alignments are mentioned. Technical details about the performances of Genome ARTIST are provided in the accompanying Additional file 1. Distinct packages of Genome ARTIST containing genomes of classical model organisms are also provided as archives at www.genomeartist.ro.

Results
The general performances of Genome ARTIST were tested with 39 original sequences derived by iPCR inquiry of D. melanogaster mutant strains obtained in our laboratory by mobilization of P{lacW} and P{EP} artificial transposons with a Δ2-3 transposase source [19]. A less complex variant of our tool was used in previous mapping work to map some of these insertions [20]. The trimmed sequences were deposited in GenBank database under accession numbers provided in Additional file 3. These sequences represent 35 hits of P{lacW} and P{EP} in unique genomic sites, a P{lacW} insertion located in an opus transposon copy and three self-insertions of P{lacW}. A few of these sequences (as it is the insertion affecting wech) contain minor sequencing errors, a condition that makes them suitable for testing the robustness and accuracy of Genome ARTIST.
We also used Genome ARTIST to map 18 splinkerettederived sequences from D. melanogaster and described in the paper of Potter and Luo [21]. Except for one sequence retrieved from a mutant strain having genomic features different from the reference genome, Genome ARTIST mapped these insertions in agreement with the nucleotide coordinates reported by the authors (the D. melanogaster genome release R5.57 is used throughout this article for reporting the mapping coordinates). Additionally, we evaluated the performances of Genome ARTIST with 96 mouse-derived splinkerette sequence data made available for testing by the web page of iMapper [22]. Because of the size of mouse genome, we used two packages of Genome ARTIST, each loaded with about a half of the genome. All mapping results offered by Genome ARTIST were in agreement with the results computed by iMapper for these sequences.

Visualization of mapping data
Genome ARTIST offers intuitive graphical annotations such as: nucleotide coordinates for both the query and the reference sequences, the gene or the overlapped genes affected by the insertion, the left and right neighboring genes flanking the hit and the relative orientations of the transposon and genomic sequences present in the query. If present in the query sequence, the intersections of the genomic and AT fragments are presented as perpendicular borders separating blue rectangles (the genomic sequences) from red rectangles (the AT sequences). TGN is the critical mapping marker and Genome ARTIST reports it as the site of the insertion using blue digits. For example, the terminal coordinates of the reference sequence of P{lacW} construct are 1 and 10691 [FlyBase:FBtp0000204]. Hence, the genomic reference coordinate of a TGN located consecutive either to coordinate 1 or 10961 is the one reported by Genome ARTIST as the insertion site. When any insertion occurs between two consecutive nucleotides but no TSDs are induced, two consecutive mapping coordinates may be computed, depending if the sequencing was performed at the 5′ or at the 3′ end of the insertion. On the other hand, when TSDs are generated, as it is the case for most of the described transposons [11], an absolute mapping is not possible, as the TSD occurs both at Overlapped decamers (or BIs) are used for hashing the genomic and transposon reference sequences, but also for the interrogation of the query sequence against the hash table, in order to detect BIs associated with the reference. The matching BIs are merged if they are adjacent or overlapped, then the resulting contiguous association intervals are extended to EIs (the SW1 step). The adjacent and overlapping EIs are merged to MEIs, which are rigorously aligned against the reference sequences during the SW2 step to map partial alignments as TPAs and GPAs. Each partial alignment contains a nucleus, a sub-alignment which is critical during the merging step. The specific joining algorithm of Genome ARTIST, which includes a SW3 step, prioritizes the nucleus of TPA but also searches for the best possible TGN whenever small-scale mutations or sequencing artifacts are present close to the joining border the 5′ and the 3′ end of the insertion. Genome ART-IST does not depend on TSDs for mapping, even if a specific TSD may be easily inferred if both junction ends are sequenced. Although some drosophilists consider that the insertion site is represented by the first nucleotide at the 5′ end of the TSD [23], any mapping convention is debatable, as correctly pointed out by Bergman [24]. Actually, such an insertion is physically located between the last nucleotide of a TSD copy and the first nucleotide of the second TSD copy. Both of these nucleotides represent distinct TGNs, as each of them is proximal to a TIR. The specific TGN reported by Genome ARTIST depends on which junction end was sequenced and fed as a query sequence for aligning and mapping. The same approach is used by iMapper, which also does not consider TSDs during mapping performance. Genome ARTIST and iMapper report two different mapping coordinates when alternatively fed with query sequences standing for 5′ end and for 3′ end of the insertion. If the TSD is an octet, as it is the case for P{lacW}, the two coordinates are not consecutive but are separated by 7 successive positions in the genomic reference sequence. RelocaTE, a tool which uses NGS data and relies on accurate detection of both TSD copies for transposon mapping, reports two coordinates for any insertion [25] as, by default, there is no option to use only one end sequence/read for mapping. The two coordinates reported by RelocaTE stand for the first and respectively for the last nucleotide of the TSD, just to deal with the mapping uncertainty described above.
As an example for data visualization, we present the mapping of a P{lacW} insertion in lama gene from D. melanogaster (Fig. 2). The blue area represents the genomic sub-sequence corresponding to lama while the encompassing red rectangles stand for fragments of P{lacW}, as in a canonical iPCRderived sequence. The border between the terminal nucleotide of TIR (coordinate 10691) and the genomic fragment reveals the site of insertion at nucleotide 5348435. The second border is at coordinate 5348475, just consecutive to GATC sequence, which represents the restriction site of Sau3AI restrictase used in our specific iPCR experiment, as recommended by Rehm [12]. Genome ARTIST assigns the overlapped sequences to the AT, therefore Sau3AI restriction site sequence, which exists both in the genomic fragment and in the P{lacW} subsequence, is incorporated in a red rectangle.
If the genomic reference sequence files are imported in FlyBase format for D. melanogaster, the cytological location is also shown when double-clicking on the green bar of the affected gene. Similar annotations are displayed for natural transposons or for other model genomes loaded in Genome ARTIST in Ensembl or NCBI format, excepting for the cytological coordinates.
When the coordinates of an alignment are decreasing from left to right, an arrow points to left, meaning that the graphics represent the reverse (or "-") genomic/ transposon strand and vice versa. There are two possible orientations of transposon insertions relative to the genomic reference strand [23] and they are accordingly reported by Genome ARTIST. Detailed instructions for interpreting the relative orientation of insertions when query sequences were derived by iPCR are described in Table 1.
When using iMapper, only one of the two possible TIRs sub-sequences may be defined as a tag, namely the one at the 3′ end of each strand of AT, as its end points toward the genomic border of insertion. Consequently, iMapper reports as genomic sequence only the nucleotides running next to the 3′ end of the tag. The aligned query sequence is presented by Genome ARTIST exactly as it was entered in the search window. If necessary, a virtual iPCR sequence may be simulated by Genome ARTIST by means of a built-in option of reversecomplementing the query sequence.
Genome ARTIST displays the results as double stranded alignments, which are score-ranked in a customizable list. For each of the results, the upper strand of nucleotides represents the query sequence and the lower one contains fragments of the genomic and AT reference sequence. Due to this graphical representation, the user may also detect small mutations or polymorphisms, which are visible as mismatches or indels, a feature not offered by iMapper.

Mapping of self-insertions
To our knowledge, Genome ARTIST is the only available mapping tool which allows mapping of self-insertions. While other mappers trim out the AT sequences because of their potential to blur the mapping, Genome ARTIST keeps them in the query sequence. In order to compute the insertion coordinate, Genome ARTIST may use either a TIR or the whole sequence of the AT which is loaded in the transposon database. We recommend the use of the complete sequence of the AT of interest, because it allows the detection of self-insertions, aside from unique genomic insertions. Such molecular events are frequently reported for some artificial transposons [26][27][28] and they should be accurately differentiated from genomic insertions affecting genetic markers cloned in ATs. A typical case is the one of white gene from D. melanogaster, where mini-white marker allele is cloned in many P element-derived constructs [23]. For ATs such as P{lacW} and P{EP}, the expression of miniwhite is essential for tracking insertional events. The graphics of Genome ARTIST enables a sharp visualization of the intersection coordinates of ATs inserted into each other. Any reference sequence, including those of ATs, may be Fig. 2 Screenshot of the result display. In the figure, we show the mapping of the insertion coordinate when using a query sequence derived by iPCR from a P{lacW} hit affecting lama gene from D. melanogaster. The red rectangles stand for the transposon fragments, the blue ones represent the genomic sequence and the green ones stand for annotations of lama gene and of 3′ TIR of P{lacW}. Herein, the TGN is the C nucleotide located just next to the terminal coordinate 10691 of P{lacW}, which is also a C nucleotide. Hence, the insertion coordinate explicitly reported by Genome ARTIST with blue digits is 5348435. The genomic coordinate 5348475 is the one bordering the GATC restriction site of Sau3A1 used in the iPCR procedure. Since the restriction site belongs both to the transposon and to the local genomic region, it is arbitrarily allocated to the transposon sequence. Herein, we used a query sequence which contains the two transposon fragments encompassing the genomic sub-sequence Table 1 The orientation of AT insertions identified by iPCR and sequencing as reported by Genome ARTIST Arrows' sense The relative orientation of the AT insertion ← r ← b or b → r → Type I orientation. When both arrows are pointing to the same sense and the red arrow (← r or r →) is the closest one relative to an imaginary target point, it means that the junction was sequenced at the 5′ end of the transposon. The variant on the left stands for sequencing with a primer facing towards the 5′ TIR (such as Sp1 for P{lacW}) and the variant on the right stands for sequencing with a primer facing towards the restriction site sequence (such as Plac4 for P{lacW}).
When both arrows are pointing to the same sense and the blue arrow (← b or b →) is the closest one relative to an imaginary target point, it means that the junction was sequenced at the 3′ end of the transposon. The variant on the left stands for sequencing with a primer facing towards the 3′ TIR (such as Spep1 for P{lacW}) and the variant on the right stands for sequencing with a primer facing towards the restriction site sequence (such as Sp6 for P{lacW}).
← r b → or ← b r → Type II orientation. When the blue and red arrows are pointing away from each other (a divergence of the senses), it means that the junction was sequenced at the 5′ end of the transposon. The variant on the left stands for sequencing with a primer facing towards the 5′ TIR (such as Sp1 for P{lacW}) and the variant on the right stands for sequencing with a primer facing towards the restriction site sequence (such as Plac4 for P{lacW}).
r →← b or b →← r Type II orientation. When the blue and red arrows are pointing to each other (a convergence of the senses), it means that the junction was sequenced at the 3′ end of the transposon. The variant on the left stands for sequencing with a primer facing towards the 3′ TIR (such as Spep1 for P{lacW}) and the variant on the right stands for sequencing with a primer facing towards the restriction site sequence (such as Sp6 for P{lacW}).
The arrows associated with the genomic and transposon rectangles from the result panel have several possible arrangements that are illustrative for two types of orientation: type I and type II, respectively. In the figure, the superscript letter b stands for the tail of the arrow inside the blue genomic rectangle and letter r marks the tail of the arrow inside the red transposon rectangle which contains the terminal nucleotide (or the TIR). For example, the alignment for the query sequence of P{lacW} insertion in lama (Fig. 2) fits the code ← b ← r . Hence, the insertion is in type I orientation and the AT-genomic junction was sequenced at the 3′ end of the transposon, with a primer facing towards the genomic sequence easily annotated by the user in the Genome ARTIST environment, as it is described for P{lacW} (see Additional file 4). Using annotations for TIRs and genes cloned in the specific transposon permits a quick identification of the functional components affected by the self-insertion. In Fig. 3, we present the case of the self-insertion event symbolized LR2.11A [GenBank:KM396322]. It may be noticed that the coordinate of this self-insertion is 8021 (as it is located just next to the terminal coordinate 1 of 5′ TIR). The selfinsertion affects mini-white allele, therefore care should be taken not to consider it as an insertion in white gene located in X chromosome. Genetic analysis data revealed that LR2.11A self-insertion event is actually located on chromosome 3. Genome ARTIST may report marker sequences cloned in ATs as genomic fragments even when the query sequences are derived from self-insertion events. To highlight the score of a self-insertion, the bonus option should not be activated, as previously described. Mapping ambiguities specific for self-insertion events emphasize on the fact that the bioinformatics mapping data should always be correlated with the supporting genetic data.

Mapping insertions in particular genomic locations
According to our tests, a particular insertion of P{EP} construct located very close to wech gene of D. melanogaster [GenBank:GU134145] is correctly mapped by Genome ARTIST but not by iMapper, regardless the settings of its parameters. The sequence derived by iPCR from the respective molecular event contains two insertions in the genomic fragment as comparative to the reference sequence. As described in Fig. 4, Genome ARTIST maps this insertion upstream to wech, at nucleotide 3377332, just next to the 3′ terminal nucleotide 7987 of P{EP} construct.
On the other hand, iMapper is not able to map this insertion associated with wech, even when the aligning parameters are set at very low stringency values. Actually, iMapper recognizes the TIR as a tag, but instead reports "No genome match found" for the genomic sequence. The genomic fragment contains 39 nucleotides, where two supplemental adenines (As) are present as insertions relative to the reference sequence. We trimmed the sequence in order to eliminate the insertions, but iMapper is still unable to recognize the genomic sequence of 37 consecutive matching nucleotides. When the genomic sub-sequence was artificially elongated from 37 to exactly 57 nucleotides of reference wech sequence (and the two inserted adenines are trimmed out), iMapper was able to report the correct coordinate of insertion upstream of wech. If the two adenines are kept, wech sequence has to be elongated from 39 to 83 nucleotides, regardless of the parameters' settings. It is interesting to interrogate why iMapper does not recognize the string of 37 consecutive matching nucleotides upstream of Fig. 3 Screenshot of the mapping of a P{lacW} self-insertion symbolized LR2.11A. The coordinate of self-insertion is 8921 and belongs to mini-white allele, which is cloned as a genetic marker in the P{lacW} construct wech. Most probably, this situation reflects a lower sensitivity of SSAHA aligner as comparative to the aligning heuristic of Genome ARTIST. As described by the authors [29], SSAHA constructs the hash table by searching only for non-overlapped k-tuples (equivalent to words or k-mers), whereas Genome ARTIST considers overlapped k-mers for the hash table. Additionally, SSAHA excludes from the hash table the words having a frequency above a cutoff threshold N, in order to filter out hits matching repetitive sequences. It may be noticed that the genomic sequence of wech query sequence contains a CT-rich fragment (Fig. 4), therefore SSAHA implementation used by iMapper may consider this sequence as containing a repetitive pattern. The example of wech insertions points to the fact that insertions in specific regions of the reference genome may be lost if a mapper is not designed to detect problematic insertions. The laboratory practice evidences that iPCR technology often generates such short genomic sequences depending on the position in the reference genome of a specific restriction site relative to the TIRs; the closer the restriction site, the shorter the genomic fragment in the iPCR amplicon.
Whenever a TIR terminal sub-sequence incidentally overlaps a genomic sub-sequence in a specific query, the superimposed fragment is reported as pertaining to the genome by either online BLAST [30] or BLAT [31], since the reference ATs sequences are not compiled in the reference genomes. Therefore, the user may erroneously infer that the insertion site is located next to the overlapped fragment if the result is not manually annotated. As an example, the critical sub-sequence TCATG present in query sequence derived from the wech mutant is an overlap between the terminal nucleotides of P{EP} and the genomic nucleotides interval 3377327-3377332. If P{EP} construct is present in the database of Genome ARTIST, our application interprets the overlapped sequence as belonging to the TIR of P{EP} and accurately reports 3377332 as the site of insertion. On the contrary, BLAST and BLAT algorithms erroneously report the coordinate 3377327 as the insertion point. Even more confusing, the best alignment scores reported by either online BLAST or BLAT for this query do not refer to wech but to paralogous heat shock protein genes (3R).

Mapping performances on queries with simulated small-scale mutations and sequencing artifacts
When small-scale mutations (polymorphisms) or sequencing artifacts reside close to TIR-genome junction, the robustness and accuracy of the mapping tool is Fig. 4 Screenshot of the mapping of a P{EP} insertion located upstream to wech gene. The border between the end of the P{EP} transposon and the genomic region points to coordinate 3377332 as the place of insertion. This coordinate is located just upstream of wech gene (2R) in R5.57, but in previous genome annotations it is internal to wech gene. The TCATG sequence present at the AT-genomic border is an overlapped sequence between the genomic fragment and the AT sub-sequence, but is assigned by Genome ARTIST to P{EP} and hence it is integrated in the red rectangle essential for the precise mapping of the insertion. Herein, we comparatively test Genome ARTIST versus iMapper when feeding both tools with the same query sequences. We used 23 sequences derived by iPCR from real insertions of P{lacW} in D. melanogaster genome (see Additional file 3). Genome ARTIST successfully mapped all the insertions with Short option and the bonus 500 assigned (the recommended parameters), while iMapper with default parameters is able to map 22/23 insertions to the same coordinates mapped by Genome ARTIST. The exception stands for CR43650 gene sequence [GenBank:HM210947.1], where the value of iMapper parameter SSAHA mapping score should be slightly lowered from >35 to >34 in order to obtain a correct coordinate of insertion.
To test the mapping robustness of both Genome ART-IST and iMapper tools to small-scale mutations or sequencing errors, we handled all of the 23 sequences in order to place SNPs (transversions), small deletions or insertions (Ns) inside a presumptive TSD of 8 nucleotides. The range of the mutated interval starts with the second nucleotide closest to the TIR and ends at the 6 th nucleotide outside of the TIR, as described in Fig. 5.
The simulations for each of the 23 sequences were generated in a step by step approach. As a result, we induced: SNPs affecting positions 2, 3, 4, 5 or 6 relative to TGN, one-nucleotide deletions/insertions affecting positions 2, 3, 4, 5 or 6 relative to TGN, substitutions of two consecutive nucleotides simultaneously affecting positions 3 and 4 relative to TGN, deletions/insertions of two consecutive nucleotides simultaneously affecting positions 3 and 4 relative to TGN, substitutions of three consecutive nucleotides simultaneously affecting positions 3, 4 and 5 relative to TGN, deletions/insertions of three nucleotides simultaneously affecting positions 3, 4 and 5 relative to TGN.
We always kept the TGN unmodified since it should be reported as the genomic coordinate of the insertion if the simulated small-scale mutations are properly overpassed.
We noticed that, when affected, the most sensitive positions of TSD are 2, 3 and 4, as they impede the mapping accuracy of both Genome ARTIST and iMapper. Nevertheless, Genome ARTIST still reports the real insertion coordinates for most of the sensitive simulations, reflecting the ability of our tool to surpass small-scale mutations occurring very close to the TIR. In our hands, iMapper fails to report the real coordinate of transposon insertions for many of the simulations, even when the mapping parameters were set for the most permissive values. The comparative results of mapping the simulated sequences are presented in Table 2 and in Fig. 6.
To reinforce these data, we tested virtual P{lacW} insertions adjacent to 5′ UTR of 102 randomly chosen genes of D. melanogaster (see Additional file 3). The respective sequences were processed to contain transversion SNPs involving either nucleotides 2, 3 or 4 or single-nucleotide deletions affecting nucleotides 2, 3 or 4 closer to the TGN. The comparative mapping results obtained with Genome ARTIST and iMapper (each of them set at the same parameters mentioned above) are presented in Table 3 and in Fig. 7. The results confirm that nucleotides 2, 3 and 4 located right next to the TGN are the most critical ones for the mapping accuracy (especially the nucleotide in position 2) and are consistent with those obtained on the simulations performed on the real insertions. Both mapping tools report the real genomic coordinate for any of the 102 insertions when they are not affected by the respective small-scale mutations.
Although it highlights the alignment details for the TIR fragment of the query, iMapper does not present the pair-wise alignment of the genomic fragment, which actually contains the TGN standing for the coordinate of insertion. In fact, iMapper graphically displays the genomic sub-sequence of the query in a rather mechanistic manner. As a result, whenever mutations occur close to TIR-genome junction, the insertion coordinate reported by iMapper may not be the one corresponding to the Fig. 5 Simulation of small-scale mutations affecting nucleotides located close to the TIR. The mutations were modeled in a region equivalent to TSD, which is represented herein by the arbitrary octet CCAAACTT (blue). With reddish are highlighted the partial sequences of the two TIRs specific for P{lacW} construct. TGN I (a T nucleotide) and TGN II (a C nucleotide) are capitalized inside the respective TSD boxes. The nucleotides affected by simulations in TSD are those located in the relative positions 2, 3, 4, 5 and 6 as sliding away from each TGN toward the other end of TSD. The drawing was realized with CLC Main Workbench software v.6.9 (CLC Bio-Qiagen, Aarhus, Denmark) nucleotide depicted as bordering the junction (see Additional file 5). In other words, the apparent TGN is not the same with the nucleotide standing for the site of insertion. On the contrary, Genome ARTIST offers explicit graphics of each sub-alignment and unambiguously displays the computed TGN, an approach which is useful when polymorphisms or sequencing artifacts are present in the query sequence. The coordinate of insertion reported by Genome ARTIST is always the same with the graphically visible TGN.
Our results reveal that Genome ARTIST is more tolerant than iMapper to small-scale mutations and sequencing artifacts residing near the transposon-genome junction. The analysis of our simulations pointed that the three nucleotides of the TSD located just next to the TGN (as described in Fig. 6) are critical positions for the mapping accuracy. When mutagenized, these positions are interpreted by Genome ARTIST rather as a buffer zone, favoring a robust detection of the TGN's coordinate. Genome ARTIST is able to accurately deal with both small-scale mutations and sequencing artifacts, mainly due to its expansion procedure and to the interactive strategy of joining TPAs and GPAs.
The complex procedure that enables the accurate joining of transposon and genomic fragments would not be possible if the transposon fragments are removed from the composite query. Actually, this commonly employed approach would reduce Genome ARTIST to a mere aligner tool. The attempt of Genome ARTIST to cover the entire composite query sequence by a best-scoring final alignment is a premise for the TPA-GPA merging step. This joining operation triggers the SW3 step, which reconsiders some nucleotides initially removed by edge trimming of TPAs and GPAs, but which are actually crucial for the mapping accuracy. As a result of SW3 step, some key nucleotides placed around the T-G border, including the TGN, are ultimately incorporated or rearranged in the final alignment even if the TSD or the TIR are affected by mutations or sequencing errors. Genome ARTIST also applies SW3 step for other less common, but possible junctions, such as TPA-TPA and GPA-GPA ones.
The alignment extension specific to Genome ARTIST allows the correct detection of the TGN in many of the simulated sequences even when the TIR was trimmed out. In our hands, such a performance was not attainable with either BLAST or BLAT aligners when  For the majority of the other simulations, Genome ARTIST is also superior to performances of iMapper considering the same simulations. It appears that SSAHA, BLAST and BLAT aligners fail to accurately map the genomic sequences containing terminal smallscale mutations if the transposon sequences are removed from the composite query. Therefore, we consider that Genome ARTIST is a particularly robust alternative as both an aligner and a mapper for problematic query sequences.

Discussions
To test the mapping performances of various tools, the simulations of transposon insertions in the target genome is a current practice [32]. We simulated genomic small-scale mutations very close to the TIRs of 23 real and of 102 virtual P{lacW} insertions located in D. melanogaster genome. This approach was intended to comparatively test the robustness of Genome ARTIST to map ATs insertions when affected by polymorphisms and/ or by sequencing artifacts as compared to the similar achievements of iMapper, BLAST and BLAT. According to our results, the accuracy of insertion mapping is affected when mutations or sequencing artifacts are present around the TIR-genome border or when repetitive patterns occur in the genome fragment of the query sequence. Genome ARTIST is able to surpass these problems, as revealed by the simulations of smallscale mutations data and by the wech example. Therefore, the robustness of Genome ARTIST represents a real advantage when such query sequences are inquired for mapping of insertions. Apart from a total of 1095 simulated sequences, we also comparatively mapped a number of 153 insertions, for which Genome ARTIST detected the right insertion coordinate. Self-insertions are molecular events reported for artificial transposons in classical studies [26]. To our knowledge, Genome ARTIST is the only tool able to map both self-insertions and genomic insertions of ATs, but mapping of natural transposons is also feasible. As the natural transposons represent a very consistent fraction of the eukaryotic genomes [33] an application able to annotate insertions relative to both targeted genes and to natural transposons is of practical interest for this research field. In Fig. 8, we present relative mapping data of a real P{lacW} insertion in a copy of opus, a natural transposon from D. melanogaster [GenBank:KM593302.2]. Which copy of  (Fig. 6a) and 79/102 insertions if a SNP is placed in the same position (Fig. 6b), while iMapper is unable to accurately map any of the respective simulated insertions. For the majority of the other simulations, Genome ARTIST is also superior to performances of iMapper. We noticed an exception when the SNP is placed in position 4 relative to TGN, for which Genome ARTIST correctly maps 99 virtual insertions, while iMapper successfully maps all of them opus is actually affected may eventually be revealed only consecutive to applying a PCR splinkerette procedure to the mutant line. Multimers of transposons may be generated by nested transpositions or by self-insertions when copies of a transposon hit the original insertion of the respective mobile element [34]. This insertional behavior is a driving force for genome evolution as described in maize [34] and D. melanogaster studies [35]. Therefore, mapping of self-insertions is of particular interest for experiments aiming to decipher the biological significance of nested transposition phenomena.
As an online application, iMapper works only with a few predefined animal genomes from Ensembl repository (an exception is the S. cerevisiae genome). Supplemental genomes may be added upon request, according to the authors [13], but only from Ensembl repository, which may be a limiting option. As a difference, Genome ART-IST deals with a broader spectrum of genomes, ranging from those of bacteria to the ones of vertebrates. The only prerequisite is the availability in the public databases of the annotated sequenced genomes in formats that may be converted with the accompanying scripts of Genome ARTIST (see Additional file 1). Additionally, Genome ARTIST allows the user to load and annotate genomic and/or transposon reference sequences, as described in Additional file 1 and in the Additional file 4. We successfully tested Genome ARTIST with the genomes of P. aeruginosa, S. cerevisiae, C. elegans, D. rerio and A. thaliana.
A supplementary advantage of Genome ARTIST is the fact that different releases of a genome may be co-loaded in the same package to test for inherent differences of annotations. The user of Genome ART-IST may work either with a whole genome of interest or with individual chromosomes, since the conversion scripts generate the output in such a way that individual chromosome files may be selected (see Additional file 1). If short orthologies are to be hunted, small and medium size genomes of different species may be simultaneously interrogated with the same query sequence. Similarly, if various ATs are employed in an insertional mutagenesis experiment, all of their reference sequences may be coloaded in the Genome ARTIST database.
RelocaTE [25], ngs_te_mapper [36], TIF [37], T-lex2 [38], and TE-Tracker [39] tools were designed to employ TSDs to map transposons when starting from split-reads (junction reads) obtained by NGS sequencing. A splitread or a junction read contains a fragment of the inquired transposon linked to a unique genomic fragment. The TSDs are detected and then used for merging unique genomic subsequences into small contigs which . The coordinate 19677229 stands for a possible site of insertion, as many copies of opus are present in the genome of D. melanogaster. When using a query sequence derived by splinkerette PCR, Genome ARTIST is expected to provide mapping coordinates for a unique, specific opus copy are further aligned with various implementations of BLAST (TIF), BLAT (RelocaTE, ngs_te_mapper and T-lex2), or BWA [40] (TE-Tracker) aligners to find the mapping coordinates. TIF and RelocaTE report both terminal coordinates of the detected TSD as the insertion site, as revealed in a comparative work of mapping insertions of Tos17 transposon in ttm2 and ttm5 lines of japonica rice cv. Nipponbare [37].
A recent improvement of BWA is BWA-MEM, an alignment algorithm that is able to align both single query sequences and pair-end reads [41]. In order to overcome poorly matching regions, BWA-MEM uses an extension strategy based on banded dynamic programming and an arbitrary Z-dropoff value. This approach successfully overcomes variations located toward the end of the query sequences or reads, a feature similar to the ability of Genome ARTIST to surpass small mutations found around TIR-genome junction. One key difference between the two approaches resides in the fact that BWA-MEM surpasses the problematic regions using an extension strategy, while Genome ARTIST performs a rigorous realignment (SW3) of the query subsequences with an extended reference window. This SW step may confer by default a higher mapping accuracy for particular small-scale mutations located next to the TGN without the need of refining the settings of the aligning parameters.
Mapping of transposon insertions consecutive to targeted PCR and Sanger sequencing versus mapping when starting from NGS data are different endeavors, a reality reflected in the algorithms developed to cope with this mapping strategies. The split reads obtained by NGS are short and more prone to sequencing artifacts, hence both high sequencing coverage and detection of perfectly overlapping TSDs are ideally required for mapping insertions at nucleotide level accuracy. On the contrary, the junction sequences obtained by the robust Sanger method starting from amplicons generated by inverse PCR or by vectorette PCR are more reliable. These sequences are, on average, an order of magnitude longer (hundreds of nucleotides instead of a few tens as in NGS). They contain unique genomic fragments embraced by two molecular markers, namely a TIR and the restriction site used for cutting the genomic DNA of the insertional mutant. In these cases, sequencing of genomic sequences flanking both ends of the inserted AT (which, indeed, would allow to confirm the TSD presence) is recommended, but not mandatory for an accurate mapping. In our experience, the detection of the two TSD copies is not a critical aspect per se when mapping insertions starting from PCR amplicons as it is when using short split-read sequences obtained in NGS projects. Moreover, it is known that sometimes sequencing at both ends of the insertion is quite difficult because of technical reasons [42,43]. Hence, sequencing a genomic region flanking only one end of the AT should be enough as long as either the derived sequence is of high quality or the bioinformatics mapping tool used to interpret it is very accurate. Genome ARTIST is not depending on TSDs detection for mapping and successfully deals with query sequences affected by sequencing artifacts or with small polymorphisms occurring very close to the TIRs.
Tangram uses split-reads obtained by NGS for precise mapping of insertions and implements SCISSORS program to find the breakpoint between the transposon sequence and the genomic one [44]. As a drawback, the authors mention that mapping errors may occur when transposon and genomic sequences are similar. According to the authors, Tangram's analysis may conduct to erroneous mapping results when short sequences from split-reads are common to both genomic and transposon sub-sequences. The algorithm used by Genome ARTIST for computing the precise border between transposon and genomic sub-sequences of a junction sequence circumvents this problem by always assigning the overlapped sequences to TPAs and, implicitly, to the TIR. This strategy is designed to cover the whole junction query sequence by a single, final alignment, an original approach which provides very accurate mapping performances.
According to our tests, Genome ARTIST may also be used to map insertion sites of integrative viruses, as herpes simplex virus. Such a task can be easily accomplished if the virus reference sequence is loaded into the transposon database of Genome ARTIST. Depending on the genes affected by the virus integration, accurate mapping could be of biological or medical relevance. Another application of Genome ARTIST is to map transposons carrying antibioresistance genes as the tool may be loaded simultaneously with many genomes of various bacteria strains and with a multitude of transposons of interest. Additionally, Genome ARTIST offers very reliable results when used for SNP detection or when checking the specificity of oligonucleotides (as primers and probes) against a reference genome. The field of transposon mapping software is heavily relying on Linux environment as revealed by the fact that some recent transposon mapping tools are actually developed for Unix/Linux. Relevant examples are represented by software/programs like TEMP [32], TIF [37] and ITIS [45]. Genome ARTIST is an open-source software which runs on many flavors of Linux OS and perfectly fits the popular BioLinux8 workbench.