Wednesday, 7 October 2015

File format: FASTA [SEQFILE, FASFILE]

One of the most common input and output formats for SLiMSuite is FASTA, a very simple, human-readable sequence format. Despite this simplicity, there are many sub-format variants in which the sequence name is formatted to carry specific information. Many of these will work and be recognised by SLiMSuite programs, but SLiMSuite also has its own favoured sub-format, which is preferentially used for input and output.

SLiMSuite FASTA format is:

>Gene_SPCODE__AccNum [Description]
SEQUENCE

Where:

  • Gene is not used for anything and is purely for easy visual identification.
  • SPCODE is the species code. Where possible, UniProt species mnemonics should be used, but any short code can be used as long as (a) it contains uppercase letters and numbers only (no symbols), and (b) it is used consistently within a species/database (i.e. you can make it up as long as all sequences from the same species use the same code).
  • AccNum is the accession number, which is what is used as the unique sequence identifier.
  • Description is optional and can contain any other text.
  • SEQUENCE can be on one or more lines and contain spaces. However, it is best to have a single SEQUENCE line with no whitespace. (Some programs may enforce this.)
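
The naming scheme above can be illustrated with a short Python sketch. This is not SLiMSuite code: the regular expression, the FastaRecord type and the parse_slimsuite_fasta function are hypothetical names written for this example, and the sketch assumes that Gene and AccNum contain no whitespace and that the accession is separated from the species code by a double underscore, as in the format line above. Sequence lines are joined and stripped of whitespace, giving the single-line form recommended above.

```python
import re
from typing import Iterator, List, NamedTuple

# Assumed pattern for "Gene_SPCODE__AccNum [Description]" (illustrative only).
# SPCODE is restricted to uppercase letters and digits, as described above.
HEADER_RE = re.compile(
    r"^(?P<gene>\S+?)_(?P<spcode>[A-Z0-9]+)__(?P<accnum>\S+)(?:\s+(?P<desc>.*))?$"
)


class FastaRecord(NamedTuple):
    gene: str
    spcode: str
    accnum: str
    description: str
    sequence: str


def _make_record(header: str, chunks: List[str]) -> FastaRecord:
    """Split one header line into its parts and join the sequence chunks."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"Not in Gene_SPCODE__AccNum format: {header!r}")
    return FastaRecord(
        gene=match["gene"],
        spcode=match["spcode"],
        accnum=match["accnum"],
        description=match["desc"] or "",  # the description is optional
        sequence="".join(chunks),
    )


def parse_slimsuite_fasta(text: str) -> Iterator[FastaRecord]:
    """Yield records from FASTA text, tolerating multi-line, spaced sequences."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Remove internal whitespace so the sequence becomes one clean string.
            chunks.append("".join(line.split()))
    if header is not None:
        yield _make_record(header, chunks)


# Made-up example record (gene name and accession are placeholders).
EXAMPLE = ">GENE1_HUMAN__ACC0001 Hypothetical example protein\nMKT LLV\nAAG\n"
records = list(parse_slimsuite_fasta(EXAMPLE))
```

Here records[0].accnum is the unique identifier, and records[0].sequence holds the sequence as a single string with no whitespace. Headers in other FASTA sub-formats raise a ValueError in this sketch, whereas SLiMSuite itself recognises many of them.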

Seqsuite can be used to rename and reformat sequences, using the seq and seqlist programs.

UniProt downloads should be automatically recognised and converted where needed. GenBank files can be converted using the genbank tool of Seqsuite. (NB. `V1.3.2` currently only supports the standard genetic code.)

Commands of the type cmd=FASFILE and cmd=SEQFILE will recognise FASTA format input. Some other commands (where documented) will also expect FASTA files.

Most SLiMSuite programs (unless otherwise stated) will assume protein sequences are being used. The dna=T flag should be used for DNA or RNA sequences where this will affect behaviour (e.g. the alphabet is important).
