A pipeline of programs for collecting and analyzing group II intron retroelement sequences from GenBank

Abebe, Michael; Candales, Manuel A; Duong, Adrian; Hood, Keyar S; Li, Tony; Neufeld, Ryan A E; Shakenov, Abat; Sun, Runda; Wu, Li; Jarding, Ashley M; Semper, Cameron; Zimmerly, Steven

doi:10.1186/1759-8753-4-28

Mobile DNA

Table 1 Summary of programs


Program	Steps
blast_and_parse	• A tblastn search is done against NCBI’s ‘nr’ database with a set of representative group II intron ORF sequences as queries
blast_and_parse	• A list of unique, non-overlapping candidate hits is assembled, along with accession number and coordinates
DNA_sequence_download	• The GenBank entry for each candidate DNA sequence is downloaded
DNA_sequence_download	• Candidates are separated by taxonomic classification, with bacterial and archaeal candidates proceeding to the next step by default
create_storage	• Creates a FASTA file for each candidate’s DNA sequence
create_storage	• Creates storable files containing information about each candidate, to be used in later programs
filter_out_non_gpII_rts	• A blastx search of candidate sequences is done against a local database of known, categorized bacterial RT sequences; candidate RTs whose closest relatives are not group II introns are separated out
find_intron_class	• A blastx search of candidate sequences is done against a local database of known and classified group II intron ORF sequences; based on the top matches, the ORF class is assigned, and the closest relative in the curated set is identified
find_orf_domains	• A blastx alignment is done between each candidate sequence and a representative IEP of the same class, in which the domains characteristic of group II introns have been mapped
find_orf_domains	• The domains present for each IEP are tabulated, and the candidate is categorized as having complete domains or missing domains; candidate sequences with complete IEP domains continue to be analyzed
find_orf	• A blastx alignment is done between each candidate sequence and its closest relative among curated group II introns
find_orf	• From the alignment, it is decided whether the candidate sequence contains frame shifts, premature stops or other problems within its IEP
find_orf	• If the ORF appears intact, then a predicted amino acid sequence is assigned
find_intron_boundaries	• Information on possible boundary positions is acquired using class-specific HMM profiles of boundary sequences
generate_rna_sequence	• Boundary sequence data are evaluated and the most probable intron boundaries are predicted, along with the complete sequence of the intron
generate_rna_sequence	• Candidates with ambiguous boundaries are noted
group_candidates	• All ORF sequences assigned to a given class are aligned using ClustalW, and pair-wise distances are calculated using PROTDIST of the Phylip package
group_candidates	• Sequences differing by less than 0.061 units are assigned to a group of 95% identity
group_candidates	• For each group of 95% identity, the complete intron DNA sequence of each member is aligned using ClustalW
select_prototypes	• For each group of 95% identity, one candidate sequence is selected as the prototype, or representative of the group
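
The grouping rule in group_candidates (sequences differing by less than 0.061 PROTDIST units fall into one group of 95% identity) can be illustrated with a short Python sketch. This is not the pipeline's own code. The table does not say how groups are merged, so single-linkage grouping is assumed here, and the function name `group_by_distance` and the dictionary of pair-wise distances are hypothetical; parsing of real PROTDIST output is not shown.

```python
from itertools import combinations

# Distance cutoff quoted in the table for groups of 95% identity.
THRESHOLD = 0.061


def group_by_distance(names, distances, threshold=THRESHOLD):
    """Group sequences whose pair-wise distance is below the threshold.

    names:     list of sequence identifiers
    distances: dict mapping frozenset({a, b}) -> distance (hypothetical input
               format; real PROTDIST matrices would need to be parsed first)

    Assumption: single linkage, i.e. two sequences share a group if they are
    connected by a chain of pairs that each differ by less than the threshold.
    """
    # Union-find structure: each name starts as its own group root.
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        d = distances.get(frozenset((a, b)))
        if d is not None and d < threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    # Made-up distances, for illustration only.
    example_names = ["A", "B", "C", "D"]
    example_distances = {
        frozenset(("A", "B")): 0.02,
        frozenset(("B", "C")): 0.05,
        frozenset(("A", "C")): 0.08,
        frozenset(("A", "D")): 0.50,
        frozenset(("B", "D")): 0.60,
        frozenset(("C", "D")): 0.55,
    }
    print(group_by_distance(example_names, example_distances))
```

Under the single-linkage assumption, A and C end up in the same group even though they differ by more than the cutoff, because both are close to B. A stricter rule, such as requiring every pair within a group to fall below the cutoff, would split them.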


ISSN: 1759-8753
