Wednesday, 7 October 2015

File format: FASTA [SEQFILE, FASFILE]

One of the most common input and output formats for SLiMSuite is FASTA, a very simple, human-readable sequence format. Despite this simplicity, there are many sub-format variants in which the sequence name is formatted to carry specific information. Many of these will work and be recognised by SLiMSuite programs, but SLiMSuite also has its own favoured sub-format, which is preferentially used for input/output.

SLiMSuite FASTA format is:

>Gene_SPCODE__AccNum [Description]
SEQUENCE

Where:

  • Gene is not used for anything and is purely for easy visual identification.
  • SPCODE is the species code. Where possible, Uniprot species mnemonics should be used but any short code can be used as long as (a) it contains uppercase letters and numbers only (no symbols), and (b) it is consistently used within a species/database. (i.e. you can make it up as long as all sequences from the same species use the same code.)
  • AccNum is the accession number, which is what is used as the unique sequence identifier.
  • Description is optional and can contain any other text.
  • SEQUENCE can be on one or more lines and contain spaces. However, it is best to have a single SEQUENCE line with no whitespace. (Some programs may enforce this.)
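
To make the naming convention above a bit more concrete, here is a minimal Python sketch that reads FASTA records and splits a SLiMSuite-style name into its parts. This is not SLiMSuite's own parser: the function names (`read_fasta`, `parse_header`), the regular expression and the example record are all assumptions made for illustration, and the pattern only covers the Gene_SPCODE__AccNum layout described above.

```python
import re

# Assumed pattern for the layout described above:
#   >Gene_SPCODE__AccNum [Description]
# SPCODE is restricted to uppercase letters and numbers, as the format requires.
HEADER_RE = re.compile(
    r"^>?(?P<gene>\S+?)_(?P<spcode>[A-Z0-9]+)__(?P<accnum>\S+)"
    r"(?:\s+(?P<description>.*))?$"
)


def parse_header(name):
    """Split a sequence name into gene, spcode, accnum and description.

    Returns None if the name does not follow the Gene_SPCODE__AccNum layout.
    """
    match = HEADER_RE.match(name.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The description is optional, so normalise a missing one to "".
    fields["description"] = fields["description"] or ""
    return fields


def read_fasta(lines):
    """Return (name, sequence) tuples from an iterable of FASTA lines.

    Sequences split over several lines are joined, and any whitespace
    inside the sequence is removed.
    """
    records = []
    name, chunks = None, []
    for line in lines:
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append("".join(line.split()))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

With this sketch, a name such as `TP53_HUMAN__P04637 Cellular tumor antigen p53` would be split into the gene `TP53`, species code `HUMAN`, accession number `P04637` and the remaining text as the description, while a name in a different sub-format would simply return `None`.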

Seqsuite can be used to rename and reformat sequences, using the seq and seqlist programs.

Uniprot downloads should be automatically recognised and converted where needed. Genbank files can be converted using the genbank tool of Seqsuite. (NB. `V1.3.2` currently only supports the standard Genetic Code.)

Commands of the type cmd=FASFILE and cmd=SEQFILE will recognise FASTA format input. Some other commands (where documented) will also expect FASTA files.

Most SLiMSuite programs (unless otherwise stated) will assume protein sequences are being used. The dna=T flag should be used for DNA or RNA sequences where this will affect behaviour (e.g. the alphabet is important).

Thursday, 1 October 2015

SLiMSuite data types and file formats

SLiMSuite is designed to be a suite of programs that enable you to navigate your way through most of the main motif discovery tasks. Well, I say designed but it would probably be more accurate to say evolved. All the programs within SLiMSuite arose from research needs within the lab. As a result, they are heavily biased to the kind of data that we analyse and data sources that we use. However, it should be fairly easy to get data from other formats and sources into SLiMSuite.

From a computer science perspective, input and output for SLiMSuite is just plain ASCII text. This makes it easy to plug SLiMSuite into existing scripts and pipelines - and manually view/edit any input or output files if required. However, “plain text” is not very informative, and SLiMSuite actually deals with a lot of different formats of plain text (from a “human formatting” rather than “file type” point of view). The documentation is currently in the process of being updated to better reflect these formats but some commandline options will still simply list FILE, FILES or FILELIST as input parameters: see the accompanying descriptions to see what format these should be. Ask if it’s not clear! (File format documentation will also be added to the SLiMSuite blog, so check there.)

Within SLiMSuite, each file type has a distinct “file extension” that denotes the file type. Note that these are not enforced for input, although some programs may not always recognise the right format if a different extension is used. If you get odd input behaviour/errors that you do not understand, see if changing the file extensions helps. If you want a common file extension to be auto-recognised, let me know and I might be able to add it. SLiMSuite file extensions will not necessarily be recognised by other programs. NOTE: Operating systems will sometimes hide file extensions by default. If you are getting very confused, or have problems of extra *.txt extensions on everything, try changing the system settings. (And/or becoming familiar with command-line file manipulation.)

The main file types used by SLiMSuite are:

  • MOTIFS = A list of SLiM motif patterns. SLiMSuite has its own motif format but a number of other formats will also work when given as input. This includes a plain list of regex patterns, and results tables from other SLiMSuite programs. [*.motifs]
  • ACCLIST = A list of Uniprot accession numbers. [*.acc]
  • SEQFILE = A file containing biological sequences - usually protein sequences. (Some of the non-SLiM programs will use nucleotide sequences.) These can either be in fasta format (see FASFILE) or Uniprot plain text format (see DATFILE). [*.fas, *.dat]
  • FASFILE = A fasta file of (unaligned) protein sequences. [*.fas]
  • DATFILE = Uniprot plain text format [*.dat]
  • ALNFILE = Aligned fasta file [*.aln.fas]
  • DSVFILE = Delimiter separated value text file. The delimiter will be auto-recognised if possible as a tab [*.tdt, *.tsv], comma [*.csv] or whitespace [*.txt], or can be set with delimit=X if not recognised. Note: delimit=X input may not work with every program, so it is safest to name files consistently, using one of the recognised extensions. The delimit=X parameter is more commonly used to control output format.
  • TDTFILE = Tab delimited text file [*.tdt, *.tsv]
  • CSVFILE = Comma separated text file [*.csv]
  • PPIFILE = Delimited text file with Hub and Spoke (gene symbol) fields and preferably also HubUni (uniprot), SpokeUni (uniprot) and Evidence fields.
  • GENELIST = Plain text list of gene symbols.
  • XREFDATA = Delimited text file that links gene symbols to identifiers from other databases.
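
As a rough illustration of the extension-based delimiter recognition described for DSVFILE, the sketch below guesses the delimiter from the file extension and reads the rows into dictionaries. It is only an approximation of the idea, not SLiMSuite's actual code: `guess_delimiter` and `read_rows` are hypothetical names, the `delimit` argument simply mirrors the delimit=X option, and the first row is assumed to hold the field names.

```python
import csv
import os

# Extension-to-delimiter mapping taken from the DSVFILE description above.
EXT_DELIMITERS = {".tdt": "\t", ".tsv": "\t", ".csv": ","}


def guess_delimiter(filename, delimit=None):
    """Return the delimiter for filename, or None to mean 'any whitespace'.

    An explicit delimit value (cf. delimit=X) always takes priority.
    """
    if delimit is not None:
        return delimit
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXT_DELIMITERS:
        return EXT_DELIMITERS[ext]
    if ext == ".txt":
        return None  # whitespace-delimited
    raise ValueError("Cannot guess delimiter for %r; set delimit" % filename)


def read_rows(lines, filename, delimit=None):
    """Read delimited lines into a list of dicts keyed by the header row."""
    delimiter = guess_delimiter(filename, delimit)
    if delimiter is None:
        rows = [line.split() for line in lines if line.strip()]
    else:
        rows = [row for row in csv.reader(lines, delimiter=delimiter) if row]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]
```

For example, a two-line PPIFILE-style table named `ppi.tdt` with Hub and Spoke fields would be read as tab-delimited, whereas a file with an unrecognised extension raises an error unless `delimit` is given.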

See also:

SLiM discovery

The main input formats for SLiM discovery are:

  • A source of protein sequence data. This could be a protein FASTA file, a Uniprot plain text file, or a list of Uniprot accession numbers to download. For some tools, a single Uniprot accession number will work.
  • A source of motif (regular expression) definitions. This is only required if looking for known (or other pre-defined) motifs and/or wanting to compare a set of de novo predictions with known motifs. A number of different formats are accepted for motif input, including SLiMFinder (summary) results and ELM downloads. The simplest/easiest is a plain text file of regular expressions. For more on motif regular expression formats, please see Edwards and Palopoli 2015.
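
To show what the simplest motif input (a plain text file of regular expressions) amounts to, here is a small Python sketch that loads one pattern per line and reports where each occurs in a protein sequence. It is an illustration only and not how the SLiMSuite tools search: `load_patterns` and `find_motifs` are hypothetical names, skipping `#` comment lines is an assumption, and it only handles patterns that are already valid for Python's `re` module, so any SLiMSuite-specific motif syntax is not covered.

```python
import re


def load_patterns(lines):
    """Read one regular expression per line.

    Blank lines and lines starting with '#' are skipped (an assumption
    made for this sketch, not a documented SLiMSuite rule).
    """
    patterns = []
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            patterns.append(text)
    return patterns


def find_motifs(patterns, sequence):
    """Return (pattern, position, match) tuples; positions count from 1.

    Each pattern is wrapped in a lookahead so that overlapping
    occurrences are all reported.
    """
    hits = []
    for pattern in patterns:
        for match in re.finditer("(?=(%s))" % pattern, sequence):
            hits.append((pattern, match.start() + 1, match.group(1)))
    return hits
```

Called with the patterns `P.L.P` and `[KR]R` on the made-up sequence `APALAPKRRG`, this reports `PALAP` at position 2 and the overlapping `KR` and `RR` matches at positions 7 and 8.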

The main output formats are delimited text files.