Table 1 Summary of programs

From: A pipeline of programs for collecting and analyzing group II intron retroelement sequences from GenBank

Program

Steps

blast_and_parse

• A tblastn search is done against NCBI’s ‘nr’ database with a set of representative group II intron ORF sequences as queries

• A list of unique, non-overlapping candidate hits is assembled, along with accession number and coordinates
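The assembly of unique, non-overlapping hits can be illustrated with a short Python sketch. The table does not say how overlapping hits are resolved, so this example simply assumes that overlapping coordinates on the same accession are merged into one interval; the function name `merge_hits` and the tuple layout are hypothetical, and strand is ignored.

```python
from collections import defaultdict


def merge_hits(hits):
    """Collapse overlapping hits per accession into non-overlapping intervals.

    `hits` is an iterable of (accession, start, end) tuples. Coordinates are
    treated as 1-based and inclusive; start may exceed end (minus strand).
    Merging overlaps is an assumption, not the pipeline's documented rule.
    """
    by_accession = defaultdict(list)
    for accession, start, end in hits:
        lo, hi = sorted((start, end))  # normalize minus-strand coordinates
        by_accession[accession].append((lo, hi))

    merged = []
    for accession in sorted(by_accession):
        intervals = sorted(by_accession[accession])
        cur_lo, cur_hi = intervals[0]
        for lo, hi in intervals[1:]:
            if lo <= cur_hi:
                # Overlaps (or duplicates) the current interval: extend it.
                cur_hi = max(cur_hi, hi)
            else:
                merged.append((accession, cur_lo, cur_hi))
                cur_lo, cur_hi = lo, hi
        merged.append((accession, cur_lo, cur_hi))
    return merged
```

Duplicate hits from different queries collapse naturally here, because an identical interval always overlaps the one already being tracked.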

DNA_sequence_download

• The GenBank entry for each candidate DNA sequence is downloaded

• Candidates are separated by taxonomic classification, with bacterial and archaeal candidates proceeding to the next step by default

create_storage

• Creates a FASTA file for each candidate’s DNA sequence

• Creates storable files containing information about each candidate, to be used in later programs

filter_out_non_gpII_rts

• A blastx search of candidate sequences is done against a local database of known, categorized bacterial RT sequences; candidate RTs whose closest relatives are not group II introns are separated out

find_intron_class

• A blastx search of candidate sequences is done against a local database of known and classified group II intron ORF sequences; based on the top matches, the ORF class is assigned, and the closest relative in the curated set is identified

find_orf_domains

• A blastx alignment is done between each candidate sequence and a representative IEP of the same class, in which the domains characteristic of group II introns have been mapped

• The domains present for each IEP are tabulated, and the candidate is categorized as having complete domains or missing domains; candidate sequences with complete IEP domains continue to be analyzed

find_orf

• A blastx alignment is done between each candidate sequence and its closest relative among curated group II introns

• From the alignment, it is decided whether the candidate sequence contains frameshifts, premature stop codons or other problems within its IEP

• If the ORF appears intact, then a predicted amino acid sequence is assigned
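As a rough illustration of what "intact" means for an ORF, the sketch below translates a single reading frame and reports premature stop codons. It is a simplified stand-in, not the blastx-based comparison described above: it cannot detect frameshifts relative to a reference protein, and the name `check_orf` and its return format are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def check_orf(dna):
    """Translate `dna` in frame 1 and list obvious problems.

    Returns (protein, problems). `protein` is None when problems were found.
    Codons containing ambiguous bases are translated as 'X'.
    """
    dna = dna.upper()
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length not a multiple of 3")

    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    protein = "".join(CODON_TABLE.get(codon, "X") for codon in codons)

    # A single terminal stop is expected; any earlier stop is premature.
    body = protein[:-1] if protein.endswith("*") else protein
    if "*" in body:
        problems.append("premature stop at codon %d" % (body.index("*") + 1))

    return (None if problems else body), problems
```

A predicted amino acid sequence is returned only when no problem is reported, mirroring the "if the ORF appears intact" condition in the step above.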

find_intron_boundaries

• Information on possible boundary positions is acquired using class-specific HMM profiles of boundary sequences

generate_rna_sequence

• Boundary sequence data are evaluated and the most probable intron boundaries are predicted, along with the complete sequence of the intron

• Candidates with ambiguous boundaries are noted

group_candidates

• All ORF sequences assigned to a given class are aligned using ClustalW, and pairwise distances are calculated using PROTDIST from the PHYLIP package

• Sequences differing by less than 0.061 units are assigned to a group of 95% identity

• For each group of 95% identity, the complete intron DNA sequence of each member is aligned using ClustalW
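The thresholding step can be sketched as follows, assuming the pairwise distances have already been parsed into a dictionary. The table does not state how groups are formed when some pairs fall below the 0.061 cutoff and others do not, so this example assumes single-linkage grouping (any chain of sub-threshold pairs joins a group); `group_by_distance` and its input layout are hypothetical.

```python
def group_by_distance(names, distances, threshold=0.061):
    """Group sequences whose pairwise distance is below `threshold`.

    `distances` maps (name_a, name_b) pairs to a distance value.
    Single-linkage grouping via union-find is an assumption here.
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), distance in distances.items():
        if distance < threshold:
            parent[find(a)] = find(b)  # join the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())
```

Under single linkage, two sequences can end up in the same group even if their own distance exceeds the cutoff, as long as an intermediate sequence links them; a stricter complete-linkage rule would split such cases.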

select_prototypes

• For each group of 95% identity, one candidate sequence is selected as the prototype, or representative of the group