Wednesday 17 June 2015

Sequence names and species codes for GOPHER

GOPHER (and any tool using orthologue alignments produced by GOPHER) needs sequence names to be formatted in a particular way so that the species information can be correctly parsed. This “SLiMSuite fasta” format is the only sequence format fully supported by SLiMSuite. If you are getting an unexpected error, sequence formatting and naming are among the first things to check. The format should not break any other programs that I know about.

This format is:

>Gene_SPCODE__AccNum [Description]
SEQUENCE

Where:

  • Gene is not used for anything and is purely for easy visual identification.
  • SPCODE is the species code. Where possible, UniProt species mnemonics should be used, but any short code will work as long as (a) it contains uppercase letters and numbers only (no symbols), and (b) it is used consistently within a species/database. (i.e. you can make it up as long as all sequences from the same species use the same code.)
  • AccNum is the accession number, which is what is used as the unique sequence identifier.
  • Description is optional and can contain any other text.
  • SEQUENCE can be on one or more lines and contain spaces. However, it is best to have a single SEQUENCE line with no whitespace. (Some programs may enforce this.)
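
As a quick way to check names before running GOPHER, the rules above can be sketched in Python. This is a minimal illustration and not SLiMSuite's own parser: the regular expression, the function names parse_header and clean_sequence, and the example record are assumptions made for this sketch. It assumes the description is separated from the name by a space, and that Gene and AccNum contain no whitespace.

```python
import re

# Assumed pattern for "Gene_SPCODE__AccNum" (illustrative, not SLiMSuite's code):
# SPCODE is uppercase letters and digits only, followed by a double underscore.
NAME_RE = re.compile(r"^(?P<gene>\S+)_(?P<spcode>[A-Z0-9]+)__(?P<accnum>\S+)$")


def parse_header(line):
    """Split a '>Gene_SPCODE__AccNum [Description]' line into its fields."""
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    # Everything up to the first space is the name; the rest is the description.
    name, _, description = line[1:].strip().partition(" ")
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not in Gene_SPCODE__AccNum format: {name!r}")
    fields = match.groupdict()
    fields["description"] = description.strip()
    return fields


def clean_sequence(lines):
    """Join sequence lines into a single string with all whitespace removed."""
    return "".join("".join(line.split()) for line in lines)


if __name__ == "__main__":
    # Example record made up for this sketch.
    print(parse_header(">TP53_HUMAN__P04637 Cellular tumor antigen p53"))
    print(clean_sequence(["MEEPQ SDPSV", "EPPLS"]))
```

A name that fails the check (for example a lowercase species code) raises ValueError, which makes badly formatted records easy to spot before they cause an unexpected error downstream.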

SeqSuite can be used to rename and reformat sequences, using the seq and seqlist programs.

UniProt downloads should be automatically recognised and converted where needed.
