Thursday, 10 December 2015

BioInfoSummer2015 SLiMSuite Workshop

Dr Richard Edwards, University of New South Wales
Thursday 10th December 2015

Session outline

Click for slides.

Part I: Theory

  • Introduction to workshop
  • What are SLiMs?
  • What is SLiMSuite?

Part II: Practice

  • Installing/running SLiMSuite
  • Data types and main input formats
  • Motif discovery using the SLiMSuite REST Servers
  • Motif discovery using the SLiMScape app for Cytoscape

Additional help and documentation

General information about SLiMs and motif discovery can be found in the literature. Some good places to start are the recent ELM 2016 paper and our 2015 Methods in Molecular Biology review, as well as the SLiMScape app paper.

For information about SLiMSuite, please visit the EdwardsLab webpage and the SLiMSuite blog. Help and documentation for the REST servers can also be found at the REST homepage. If in doubt, please email: richard.edwards@unsw.edu.au.

Several EdwardsLab publications also cover motifs and SLiMSuite tools.

Installing/Running SLiMSuite

NOTE: For this workshop, you do not need to install SLiMSuite. You will need Cytoscape and the SLiMScape app for the later parts.


The current SLiMSuite release is 2015-11-30 and can be downloaded by clicking the button (left).

In addition to the tarball available via the links above, SLiMSuite is now available as a GitHub repository (right).

See also: Installation and Setup.

For this workshop, we will primarily be running the tools (and looking at pre-generated results) via the online servers:

Data types and main input formats

From a computer science perspective, input and output for SLiMSuite is just plain ASCII text. This makes it easy to plug SLiMSuite into existing scripts and pipelines - and manually view/edit any input or output files if required. However, “plain text” is not very informative, and SLiMSuite actually deals with a lot of different formats of plain text (from a “human formatting” rather than “file type” point of view). The documentation is currently in the process of being updated to better reflect these formats but some commandline options will still simply list FILE, FILES or FILELIST as input parameters: see the accompanying descriptions to see what format these should be. Ask if it’s not clear! (File format documentation will also be added to the SLiMSuite blog, so check there.)

Within SLiMSuite, each file type has a distinct “file extension” that denotes the file type. Note that these are not enforced for input, although some programs may not always recognise the right format if a different extension is used. If you get odd input behaviour/errors that you do not understand, see if changing the file extensions helps. If you want a common file extension to be auto-recognised, let me know and I might be able to add it. SLiMSuite file extensions will not necessarily be recognised by other programs. NOTE: Operating systems will sometimes hide file extensions by default. If you are getting very confused, or have problems of extra *.txt extensions on everything, try changing the system settings. (And/or becoming familiar with command-line file manipulation.)

The main input formats for SLiM discovery are:

  • A source of protein sequence data. This could be a protein FASTA file, a Uniprot plain text file, or a list of Uniprot accession numbers to download. For some tools, a single Uniprot accession number will work.
  • A source of motif (regular expression) definitions. This is only required if looking for known (or other pre-defined) motifs and/or wanting to compare a set of de novo predictions with known motifs. A number of different formats are accepted for motif input, including SLiMFinder (summary) results and ELM downloads. The simplest/easiest is a plain text file of regular expressions. For more on motif regular expression formats, please see Edwards and Palopoli 2015.

Common motif discovery tasks

Jobs can be run and retrieved at: http://www.slimsuite.unsw.edu.au/servers.php. (This is a bit easier than making the URL directly, although this is also an option as we will see.)

NOTE: Some of the jobs take a while to run and the SLiMSuite servers have limited resources. It would therefore be useful if you could click on the example JobID links rather than trying to run every example REST command yourself. The first output tab (and the log tab) will show you the run times for that job, so you can see which jobs are fast or slow before you experiment.


Task 1: Find known SLiMs in a protein (ELM/SLiMProb)

ELM. Visit http://www.elm.eu.org/ and enter your protein of choice as a Uniprot identifier or accession number in the box. (Identifiers will auto-complete and fill in some extra details.) For non-Uniprot protein sequences, you can also enter the sequence in fasta format.

Try this now with P03070 (LT_SV40) or P03254 (E1A_ADE02). Each of them should have a True Positive LIG_Rb_LxCxE_1 motif.

SLiMProb. We can do a similar search using the SLiMProb REST server (paste the contents of the grey box onto the end of the http://rest.slimsuite.unsw.edu.au/ URL):

slimprob&uniprotid=E1A_ADE02&motifs=elm

JobID: 15120800029
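
If you would rather build these URLs in a script than paste them by hand, a minimal Python sketch might look like the following. The function names (rest_url, fetch_text) are made up for this example and are not part of SLiMSuite; the only things taken from the text above are the server address and the program&option=value pattern. The download helper is defined but not called, so nothing is sent to the server unless you ask for it.

```python
from urllib.request import urlopen

REST_BASE = "http://rest.slimsuite.unsw.edu.au/"  # server address from the text above


def rest_url(program, **options):
    """Join a program name and option=value pairs with '&', as in the grey boxes."""
    parts = [program] + ["{}={}".format(key, value) for key, value in options.items()]
    return REST_BASE + "&".join(parts)


def fetch_text(url, timeout=60):
    """Hypothetical helper: download whatever the server returns as plain text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


url = rest_url("slimprob", uniprotid="E1A_ADE02", motifs="elm")
print(url)
```

Given the limited server resources, it is kinder to retrieve the existing JobID than to call fetch_text on a fresh job for every experiment.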

NOTE: The ELM alias currently searches the 2015 ELM classes.


Task 2: Find custom SLiMs in a protein (SLiMProb)

slimprob&uniprotid=E1A_ADE02&motifs=LxCxE,PxDLS

JobID: 15120800031
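
To get a feel for what a search like this is doing, here is a rough Python sketch that treats the lower case x as "any residue" and scans a sequence with a regular expression. It is not the SLiMProb code and ignores everything SLiMProb adds (masking, statistics, richer motif syntax); the helper names and the short peptide are invented for illustration.

```python
import re


def slim_to_regex(motif):
    """Turn a simple SLiM pattern into a Python regex: 'x' (any residue) becomes '.'."""
    return motif.replace("x", ".")


def find_motif(motif, sequence):
    """Return (0-based start, matched text) for every match, overlaps included."""
    # A lookahead with a capture group lets overlapping occurrences be reported.
    pattern = re.compile("(?=({}))".format(slim_to_regex(motif)))
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


toy_seq = "MDLTCHEAGPLDLSK"  # invented peptide, not a real protein
for motif in ["LxCxE", "PxDLS"]:
    print(motif, find_motif(motif, toy_seq))
```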


Task 3: Finding proteome-wide occurrence of a motif using Bioware (SLiMSearch)

The SLiMSearch server is accessible at: http://slim.ucd.ie/slimsearch/. This has been recently updated to Version 4 and now brings in a lot of information, so it is recommended that you read the Help pages for the server.

Example (LIG_CtBP_PxDLS_1): http://slim.ucd.ie/rest/#/slimsearch/annotations?jobId=7R8Tvssm9HEdjWW7jQsgEHUfP0VlHdR6

Human protein PRDM16 is particularly interesting: it does not have an annotated ELM but does match a region annotated to interact with CTBP1. (See the Region column - Expand the instance Feature annotations for a clearer look.) This kind of search can be a good way of identifying new instances of known motifs - some of which may be in the literature but may not have yet made it into database annotation.

The ELM definition for this motif, P[LVIPME][DENS][LM][VASTRG], is very degenerate with a lot of hits - over-prediction is a big problem in motif discovery. We can try to make the definition a little tighter at the expense of some instances, using another tool called SLiMMaker:

slimmaker&peptides=LIG_CtBP_PxDLS_1&iterate=T&align=F&minfreq=0.67&minseq=2

JobID: 15120600004

Repeating the SLiMSearch analysis with the redefined motif (P[EILMV][DN]L[ARST]) gives a greater density of known ELMs (see the Motif column) in the top ranked motifs: http://slim.ucd.ie/rest/#/slimsearch/annotations?jobId=L41BRpXQ1oTD6ByDuUSqjWbQZ22WBKbw.
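
One quick way to put a number on "degenerate" is to count how many distinct peptides each definition can match. The sketch below does this for fixed-length motifs made only of fixed residues and bracketed classes; variant_count is a name invented here and this is not how SLiMSearch or SLiMMaker score motifs.

```python
import re


def variant_count(motif):
    """Count the distinct peptides a fixed-length motif can match.

    Only handles fixed residues (e.g. P) and bracketed classes (e.g. [DN]).
    """
    count = 1
    for residue_class, _fixed in re.findall(r"\[([A-Z]+)\]|([A-Z])", motif):
        # A bracketed class contributes one option per residue; a fixed residue just one.
        count *= len(set(residue_class)) if residue_class else 1
    return count


elm_def = "P[LVIPME][DENS][LM][VASTRG]"
tight_def = "P[EILMV][DN]L[ARST]"
print(variant_count(elm_def), variant_count(tight_def))  # 288 40
```

Every peptide matched by the redefined motif is also matched by the ELM definition, so the tighter version covers 40 of the original 288 possibilities.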


Task 4: Predicting novel SLiMs de novo in a set of proteins (SLiMFinder)

SLiMFinder is designed to look for convergently evolved motifs that are shared between unrelated proteins. For example, we can look at the proteins known (in ELM) to contain the LIG_PCNA_PIPBox_1. As SLiMs are generally in disordered regions, we will switch disorder masking on with dismask=T, which uses IUPred to predict globular regions, which are masked out:

slimfinder&uniprotid=LIG_PCNA_PIPBox_1&dismask=T

JobID: 15120800001

(We will look at the UPC and motif cloud output among others.)


Task 5: Identifying known motifs from de novo predictions (CompariMotif)

When you have a lot of motif predictions, it can be tiresome and error-prone to manually scan them for things that look familiar. SLiMSuite has a tool called CompariMotif, which compares sets of motifs for similarity.

The comparimotif server can take motif files/lists (like SLiMProb or SLiMFinder output) directly. These are given to the &motifs and/or &searchdb options: if no &searchdb is given then the input motifs are searched against themselves. (This can be useful if clouding goes a bit wrong.)

To pass the output of one server to another, use the format: &cmd=jobid:XXXXXX:OUTFMT, where XXXXXX is the Job ID and OUTFMT is the desired output format. E.g.:

comparimotif&motifs=jobid:15120800001:main&searchdb=LIG_PCNA_PIPBox_1

JobID: 15120900004
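
If you are chaining jobs in a script, the jobid:XXXXXX:OUTFMT reference is easy to build and check. The two helpers below are hypothetical names for this sketch only; the reference format and the example values are the ones used above.

```python
def jobid_ref(jobid, outfmt):
    """Build a 'jobid:XXXXXX:OUTFMT' reference to an earlier server job."""
    return "jobid:{}:{}".format(jobid, outfmt)


def parse_jobid_ref(value):
    """Return (jobid, outfmt) if value looks like a job reference, else None."""
    parts = value.split(":")
    if len(parts) == 3 and parts[0] == "jobid" and parts[1].isdigit():
        return parts[1], parts[2]
    return None


cmd = "comparimotif&motifs={}&searchdb=LIG_PCNA_PIPBox_1".format(
    jobid_ref("15120800001", "main")
)
print(cmd)
```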

The server is currently in development so output is not sorted usefully yet. This is more of a problem if searching against many SLiMs:

comparimotif&motifs=jobid:15120800001:main&searchdb=elm

JobID: 15120900005

The best advice is to save the compare output table (retrieve&jobid=15120900005&outfmt=compare), open it up in Excel and sort on Score. Alternatively, use the CompariMotif server at http://bioware.ucd.ie.
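
If Excel is not to hand, the same sort can be done in a few lines of Python. This sketch assumes the saved table is tab-delimited and that Score is numeric; only the Score column name comes from the text above, and the other column names and all of the values are invented.

```python
import csv
import io


def sort_by_score(table_text, delimiter="\t", field="Score"):
    """Parse delimited text and return rows sorted by a numeric field, highest first."""
    rows = list(csv.DictReader(io.StringIO(table_text), delimiter=delimiter))
    return sorted(rows, key=lambda row: float(row[field]), reverse=True)


# Invented three-row table standing in for a saved compare output.
toy_table = "Motif\tHit\tScore\nm1\tk1\t2.5\nm2\tk2\t4.0\nm3\tk3\t3.1\n"
for row in sort_by_score(toy_table):
    print(row["Motif"], row["Hit"], row["Score"])
```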


Task 6: SLiM prediction with conservation masking (SLiMFinder)

Masking is important as it reduces the search space. It can also reduce the signal if it incorrectly masks some true positives but for larger datasets the reduction in "noise" can be more important. As well as dismask=T/F there are several other masking options in SLiMSuite:

  • low complexity masking (ON by default)
  • N-terminal methionines (ON by default)
  • conservation-based masking (OFF by default)
  • Uniprot feature masking (OFF by default)
  • Motif masking (OFF by default)

For custom sequence input, there is also the option for custom masking based on upper/lower case. For now, we will just look at conservation masking, as this has been shown to improve sensitivity in PPI data. For example, a 2013 compilation of CTBP1 interactors does not yield a significant motif:

slimfinder&uniprotid=CTBP1&dismask=T&runid=CtBP1-DisMask

JobID: 15120900002

But if consmask=T is also switched on:

slimfinder&uniprotid=CTBP1&dismask=T&consmask=T&runid=CtBP1-ConsMask

JobID: 15120900003

The importance of correcting for evolutionary relationships

The UPC correction can be switched off with efilter=F. Many motif prediction tools calculate estimated expectations without such correction. This can result in massive biases due to shared evolutionary history, which swamp any convergent SLiM evolution signal, for example with the LIG_CtBP_PxDLS_1 ELM proteins:

slimfinder&uniprotid=LIG_CtBP_PxDLS_1&dismask=T&runid=CtBP-NoEFilter&efilter=F

JobID: 15120800036


Task 7: Look for enrichment or depletion of motifs in a set of proteins (SLiMProb)

We can investigate why the PxDLS motif did not come back with just disorder masking by looking at its enrichment using SLiMProb. When given multiple proteins, SLiMProb will use the same UPC correction as SLiMFinder but will also return statistics calculated without UPC correction and by simply treating all the sequences as one giant sequence. It can, for example, be used to investigate different definitions of a motif:

slimprob&uniprotid=CTBP1&dismask=T&runid=CtBP1-DisMask&motifs=PxDLS,P[LVIPME][DENS][LM][VASTRG],Px[DE][LM][ST]

JobID: 15120900016

In this case, we can see that even though the "true" motif has the most support, it is also expected to occur more by chance. It is enriched, but not enough to survive the multiple testing correction of SLiMChance.
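
The trade-off between support and chance expectation can be illustrated with a plain binomial tail. This is not the SLiMChance calculation, just a sketch that assumes independent trials with a fixed per-trial probability; the numbers are invented to show how the same support becomes less surprising as the chance of a random match goes up.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial tail: probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Invented numbers for illustration only: support of 5 out of 20 trials,
# for a rarer motif (p=0.05) and a more degenerate one (p=0.15).
for p in (0.05, 0.15):
    print(p, prob_at_least(5, 20, p))
```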

Though not of interest here, the pUnd statistics can be used to look for depletion/avoidance of a particular motif in a dataset.


Task 8: Find novel motifs from a conservation pattern (SLiMPrints)

Patterns of evolutionary conservation can also be used to directly identify regions of proteins that look like motifs. The tool we have developed for this is called SLiMPrints, which can be run at the Bioware SLiMPrints server. For example, we can look for motif-like regions in one of the CtBP PPI partners, FOG1_HUMAN (Q8IX07): http://bioware.ucd.ie/~compass/biowareweb/cgi-bin/PHP_helper_files/slimprintsInfo.php?jobId=e7GZLf

This protein has a bunch of significant motif-like regions, including the PxDLS motif region at rank 7: http://bioware.ucd.ie/~proviz/ProViz/alignmentViewer/drawer.php?uniprotid=Q8IX07&slim=GPIDL&slimpos=793&column=794.5&width=80&collapse=false

(Note how the precise motif is rarely returned by de novo predictors.)


Task 9: Using the SLiMScape app to visualise a server job

We're now going to fire up Cytoscape and have a quick look at the SLiMScape app. This is fairly well described in the paper, so we will just look at the main ways to run the server. If you've not used Cytoscape before, you'll want to visit the Cytoscape website and watch the introduction video, before installing it.

The simplest is to retrieve an existing run:

  1. In the SLiMFinder tab, enter 15120900003 in the Run ID box and hit Retrieve.
  2. Apply the default layout.
  3. Explore the results. Connections are UPC relationships in the data.

Task 10: Running QSLiMFinder through SLiMScape

Now let's imagine we had seen the SLiMPrints results from above for FOG1_HUMAN and knew that it interacted with CtBP1. We could ask the specific question if any motifs in FOG1_HUMAN were enriched in the rest of the PPI dataset. We do this by using QSLiMFinder and giving Q8IX07 as the query. (&query=Q8IX07 on the server.)

First, add a node to the network and change its name to Q8IX07. Enter this in the Query Sequence box then highlight all of the nodes before hitting Run QSLiMFinder:

JobID: 15120900007

This is the essence of molecular mimicry and we could use the same approach to see if E1A_ADE02 shares any motifs by adding P03254 and using it as a query:

JobID: 15120900008


Task 11: Building PPI networks for analysis

The most useful thing about having access to SLiMSuite through Cytoscape is being able to use it to explore PPI networks and select nodes for analysis. There are in-built tools to get PPI data into Cytoscape. For SLiMSuite, the node ID must be a Uniprot ID or accession number, or the node must have a "Uniprot" attribute.

The SLiMSuite REST server also provides some methods for getting PPI data into Cytoscape (and/or for use on the server), using the PINGU server. This is still under development and so the documentation of the available PPI data is currently limited, but just get in touch if you want to use it. (Currently human only.)

PPI data is retrieved by entering one or more gene symbols as a &hublist, optionally along with a &ppisource (see the ppisource alias):

pingu&hublist=CTBP1,CTBP2&ppisource=intact

JobID: 15120900009

This can be used directly for &uniprotid input using the &rest=uniprot output:

slimfinder&uniprotid=jobid:15120900009:uniprot&dismask=T&consmask=T&runid=CtBP1and2

JobID: 15120900011

Alternatively, the PPI data can be imported into Cytoscape using the pairwise table:

  1. Start a new session. (Later you can work out how to import and merge networks.)
  2. Import network from URL: http://rest.slimsuite.unsw.edu.au/retrieve&jobid=15120900009&rest=pairwise
  3. Rename the HubUni and SpokeUni fields to name and attribute them to Source Node and Target Node attributes. Make Hub the Source, Spoke the Target and Evidence the Interaction Type then import.
  4. Select the nodes that are shared interactors of both CtBP proteins.
  5. Modify the masking settings to include disorder, conservation and feature masking.
  6. Hit Run:

JobID: 15120900013
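
Outside Cytoscape, the "shared interactors" selection in step 4 is just a set intersection on the pairwise table. The sketch below assumes the table is tab-delimited with Hub and Spoke columns; shared_spokes is a made-up helper and the spoke names are placeholders rather than real CtBP interactors.

```python
import csv
import io


def shared_spokes(pairwise_text, hubs, delimiter="\t"):
    """Return the spokes that interact with every hub in `hubs`."""
    spokes_by_hub = {hub: set() for hub in hubs}
    for row in csv.DictReader(io.StringIO(pairwise_text), delimiter=delimiter):
        if row["Hub"] in spokes_by_hub:
            spokes_by_hub[row["Hub"]].add(row["Spoke"])
    return set.intersection(*spokes_by_hub.values())


# Invented rows: GENE_A, GENE_B and GENE_C are placeholders, not real interactors.
toy_ppi = (
    "Hub\tSpoke\tEvidence\n"
    "CTBP1\tGENE_A\ttoy\n"
    "CTBP1\tGENE_B\ttoy\n"
    "CTBP2\tGENE_B\ttoy\n"
    "CTBP2\tGENE_C\ttoy\n"
)
print(sorted(shared_spokes(toy_ppi, ["CTBP1", "CTBP2"])))
```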

Monday, 7 December 2015

New SLiMSuite REST Servers

Since the move to UNSW in 2013, the Bioware SLiMSuite servers and REST servers have been undergoing some much needed TLC. As part of this process, a new set of UNSW REST servers was introduced and went online with the 2015-06-01 SLiMSuite release.

An overview of how the REST servers work is given on the REST Homepage. The available tools are listed at the REST Tools page. The main ones - accessible through the SLiMScape app for Cytoscape are (or support):

The primary focus has been setting up new servers to be accessed via a RESTful-style interface whereby a URL can be directly given to the server and used to either download results directly (if accessing programmatically) or view in a web browser. As with the main programs, these servers use plain text inputs and outputs wherever possible. Whilst this probably makes proper computer scientists very unhappy, it should make it very easy to incorporate SLiMSuite REST functions into your own scripts - you only need to learn how to parse text. (It also makes it easy for me to swap input sources.) If you don’t want to write your own, SLiMParser is provided in the SLiMSuite download to do this for you.

The other design consideration that has gone into the REST servers is to make them run as much like the commandline versions as possible: (1) they use the same code; (2) they use the same commandline options, parsed from the URL. This means that (a) you should easily be able to reproduce server results on your own system, and (b) new functions (and bug fixes) should become quickly available via the REST servers.
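
Because the options are shared, a REST call can be pulled apart into the option=value pairs you would type for a local run. The following is a minimal sketch of that idea with invented helper names; it only reshapes the text and does not check whether every server option (or alias) is accepted by a local install.

```python
def parse_rest_command(command):
    """Split 'program&key=value&...' into (program, {key: value})."""
    program, *pairs = command.split("&")
    options = {}
    for pair in pairs:
        # partition keeps any later '=' characters inside the value.
        key, _, value = pair.partition("=")
        options[key] = value
    return program, options


def as_commandline_args(options):
    """Render options in the option=value style used for SLiMSuite commands."""
    return ["{}={}".format(key, value) for key, value in options.items()]


program, options = parse_rest_command(
    "slimfinder&uniprotid=CTBP1&dismask=T&consmask=T&runid=CtBP1-ConsMask"
)
print(program, " ".join(as_commandline_args(options)))
```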

To save the need for constructing complex URLs, there is a simple one-size-fits-all form at the EdwardsLab server page. Over time, tool-specific forms will be established. Currently, this only exists for SLiMMaker.

As ever, if something about the new servers misbehaves or does not make sense - or you really want some new functions - please get in touch.

Monday, 30 November 2015

SLiMSuite release v1.1.0 (2015-11-30) online

The November 2015 release of SLiMSuite v1.1.0 (2015-11-30) is now on GitHub. This is an intermediate release in preparation for the BioInfoSummer 2015 SLiMSuite workshop and contains a few minor modifications to SLiMSuite programs. The main updates are preliminary versions of some tools for PacBio genomics, notably PAGSAT and SMRTSCAPE. These are still in development and need further documentation and testing before use is advised.

The SeqSuite Genbank parser has some bug fixes for reverse complemented protein sequences with introns, and initial capacity for different codon tables. (This has been implemented for yeast, so only NCBI transl_tables 1-3 are currently implemented: please get in touch if you want to use this program with other codon tables.)

SLiMSuite updates in this release

Updates in libraries/:

• rje: Updated from Version 4.14.0.
→ Version 4.14.1: Fixed matchExp method to be able to handle multilines. (Shame re.DOTALL doesn’t work!)
→ Version 4.14.2: Modified integer commands to read/convert floats.
→ Version 4.15.0: Added intList() and numList() functions.

• rje_db: Updated from Version 1.7.5.
→ Version 1.7.6: Added table.opt[‘Formatted’] = Whether table data has been successfully formatted using self.dataFormat()
→ Version 1.7.7: Added option to constrain table splitting to certain field values.
→ Version 1.8.0: Added option to store keys as tuples for correct sorting. (Make default at some point.)

• rje_genbank: Updated from Version 1.3.1.
→ Version 1.3.2: Fixed bug in reverse complement sequences with introns.

• rje_iridis: Updated from Version 1.10.
→ Version 1.10.1: Attempted to fix SLiMFarmer batch run problem. (Should not be setting irun=batch!)
→ Version 1.10.2: Trying to clean up unknown 30s pause. Might be freemem issue?

• rje_obj: Updated from Version 2.1.2.
→ Version 2.1.3: Modified integer commands to read/convert floats.

• rje_qsub: Updated from Version 1.6.2.
→ Version 1.6.3: Tweaked the showstart command for katana.

• rje_samtools: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.
→ Version 0.1.0: Modified version to handle multiple loci per file. (Original was for single bacterial chromosomes.)

• rje_seqlist: Updated from Version 1.11.0.
→ Version 1.12.0: Added peptides/qregion reformatting and region=X,Y.
→ Version 1.13.0: Added summarise=T option for generating some summary statistics for sequence data. Added minlen & maxlen.
→ Version 1.14.0: Added splitseq=X split output sequence file according to X (gene/species) [None]
→ Version 1.15.0: Added names() method.
→ Version 1.15.1: Fixed bug with storage and return of summary stats.
→ Version 1.15.2: Fixed dna2prot reformatting.
→ Version 1.15.3: Fixed summarise bug (n=1).

• rje_sequence: Updated from Version 2.4.1.
→ Version 2.5.0: Added yeast genome renaming.
→ Version 2.5.1: Modified reverse complement code.
→ Version 2.5.2: Tried to speed up dna2prot code.

• rje_slimcalc: Updated from Version 0.9.
→ Version 0.9.1: Modified combining of motif stats to handle expectString format for individual values.
→ Version 0.9.2: Changed default conscore in docstring to RLC.

• rje_slimcore: Updated from Version 2.7.3.
→ Version 2.7.4: Fixed walltime server bug.
→ Version 2.7.5: Fixed feature masking.

• rje_slimlist: Updated from Version 1.7.2.
→ Version 1.7.3: Fixed bug that could not accept variable length motifs from commandline. Improved error message.

• rje_taxonomy: Updated from Version 1.0.
→ Version 1.1.0: Added parsing of yeast strains.

• rje_tree: Updated from Version 2.11.2.
→ Version 2.12.0: Added treeLen() method.
→ Version 2.13.0: Updated PNG saving with R to use newer code.

• rje_uniprot: Updated from Version 3.21.3.
→ Version 3.21.4: Fixed Feature masking. Should this be switched off by default?

• rje_xref: Updated from Version 1.6.0.
→ Version 1.7.0: Added comments=LIST : List of comment line prefixes marking lines to ignore (throughout file) [‘//’,’%’]
→ Version 1.7.1: Added xreformat=T/F : Whether to apply field reformatting to input xrefdata (True) or just xrefs to map (False) [False]
→ Version 1.8.0: Added recognition and parsing of yeast.txt XRef file from Uniprot (http://www.uniprot.org/docs/yeast.txt)

• snp_mapper: Created/Renamed/moved.
→ Version 0.0: Initial Compilation. Batch mode for mapping SNPs needs updating.
→ Version 0.1: SNP mapping against a GenBank file.
→ Version 0.2: Fixed complement strand bug.
→ Version 0.3.0: Updated to work with RATT(/Mummer?) snp output file. Improved docs.
→ Version 0.4.0: Major reworking for easier updates and added functionality. (Convert to 1.0.0 when complete.)

Updates in tools/:

• gablam: Updated from Version 2.19.2.
→ Version 2.20.0: Added SNP Table output.

• gopher: Updated from Version 3.4.1.
→ Version 3.4.2: Removed GOPHER System Exit on IOError to prevent breaking of REST server.

• pagsat: Created/Renamed/moved.
→ Version 1.0.0: Initial working version based on rje_pacbio assessment=T.
→ Version 1.1.0: Fixed bug with gene and protein summary data. Removed gene/protein reciprocal searches. Added compare mode.
→ Version 1.1.1: Added PAGSAT output directory for tidiness!
→ Version 1.1.2: Renamed the PacBio class PAGSAT.
→ Version 1.2.0: Tidied up output directories. Added QV filter and Top Gene/Protein hits output.
→ Version 1.2.1: Added casefilter=T/F : Whether to filter leading/trailing lower case (low QV) sequences [True]
→ Version 1.3.0: Added tophitbuffer=X and initial synteny analysis for keeping best reference hits.
→ Version 1.4.0: Added chrom-v-contig alignment files along with *.ordered.fas.
→ Version 1.4.1: Made default chromalign=T.
→ Version 1.4.2: Fixed casefilter=F.
→ Version 1.5.0: diploid=T/F : Whether to treat assembly as a diploid [False]
→ Version 1.6.0: mincontiglen=X : Minimum contig length to retain in assembly [1000]
→ Version 1.6.1: Added diploid=T/F to R PNG call.

• peptcluster: Updated from Version 1.5.1.
→ Version 1.5.2: Improved clarity of warning message.

• pingu_V4: Updated from Version 4.5.0.
→ Version 4.5.1: Debugging missing identifiers and indexing speed. Added good and bad DB.
→ Version 4.5.2: Fixed SIF output and changed names to sif-* for opening in browser.
→ Version 4.5.3: Updated REST output.

• seqsuite: Updated from Version 1.8.0.
→ Version 1.9.0: Added PAGSAT and SMRTSCAPE.
→ Version 1.9.1: Fixed HAQESAC setobjects=True error.
→ Version 1.10.0: Added batchrun=FILELIST batcharg=X batch running mode.
→ Version 1.11.0: Added SAMTools and Snapper/SNP_Mapper.

• slimbench: Updated from Version 2.10.0.
→ Version 2.10.1: Updated ELM Source URLs.

• slimfarmer: Updated from Version 1.4.2.
→ Version 1.4.3: Added recognition of missing slimsuite programs and switching to slimsuite=F.

• slimfinder: Updated from Version 5.2.0.
→ Version 5.2.1: Fixed ambocc<1 and minocc<1 issue. (Using integers rather than floats.) Fixed OccRes Sig output format.

• slimparser: Updated from Version 0.3.1.
→ Version 0.3.2: Fixed issue reading files for full output.
→ Version 0.3.3: Tidied output names when restbase=jobid.

• slimprob: Updated from Version 2.2.3.
→ Version 2.2.4: Improved slimcalc output (s.f.).

• slimsuite: Updated from Version 1.5.0.
→ Version 1.5.1: Changed disorder to iuscore to avoid module conflict.

• smrtscape: Created/Renamed/moved.
→ Version 0.0.0: Initial Compilation.
→ Version 1.0.0: Initial working version for server.
→ Version 1.1.0: Added xnlist=LIST : Additional columns giving % sites with coverage >= Xn [10,25,50,100].
→ Version 1.2.0: Added assessment -> now PAGSAT.
→ Version 1.3.0: Added seed and anchor read coverage generator (calculate=T).
→ Version 1.3.1: Deleted assessment function. (Now handled by PAGSAT.)
→ Version 1.4.0: Added new coverage=T function that incorporates seed and anchor subreads.
→ Version 1.5.0: Added parseparam=FILES with paramlist=LIST to parse restricted sets of parameters.
→ Version 1.6.0: New SMRTSCAPE program building on PacBio v1.5.0. Added predict=T/F option.
→ Version 1.6.1: Updated parameters=T to incorporate that the seed read counts as X=1.
→ Version 1.7.0: Added *.summary.tdt output from subread summary analysis. Added minreadlen.
→ Version 1.8.0: preassembly=FILE: Preassembly fasta file to assess/correct over-fragmentation (use seqin=FILE for subreads)

Wednesday, 7 October 2015

File format: FASTA [SEQFILE, FASFILE]

One of the most common input and output formats for SLiMSuite is FASTA format, which is a very simple, human-readable sequence format. Despite the simplicity of FASTA, there are many sub-format variants in which the sequence name is formatted with specific information. Many of these will work and be recognised by SLiMSuite programs, but it also has its own favoured subformat, which is preferentially used for input/output.

SLiMSuite FASTA format is:

>Gene_SPCODE__AccNum [Description]
SEQUENCE

Where:

  • Gene is not used for anything and is purely for easy visual identification.
  • SPCODE is the species code. Where possible, Uniprot species mnemonics should be used but any short code can be used as long as (a) it contains uppercase letters and numbers only (no symbols), and (b) it is consistently used within a species/database. (i.e. you can make it up as long as all sequences from the same species use the same code.)
  • AccNum is the accession number, which is what is used as the unique sequence identifier.
  • Description is optional and can contain any other text.
  • SEQUENCE can be on one or more lines and contain spaces. However, it is best to have a single SEQUENCE line with no whitespace. (Some programs may enforce this.)
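
If you need to read this format in your own scripts, the name line can be split with a regular expression. The sketch below is not the SLiMSuite parser: parse_name and read_fasta are names invented here, the regular expression is one reasonable reading of the Gene_SPCODE__AccNum layout described above, and the sequence is a made-up peptide.

```python
import re

# Gene, then '_', SPCODE (uppercase letters/numbers), then '__', AccNum, optional description.
NAME_RE = re.compile(
    r"^>?(?P<gene>\S+)_(?P<spcode>[A-Z0-9]+)__(?P<accnum>\S+)(?:\s+(?P<desc>.*))?$"
)


def parse_name(line):
    """Split a '>Gene_SPCODE__AccNum Description' line into its parts."""
    match = NAME_RE.match(line.strip())
    if match is None:
        raise ValueError("Not in Gene_SPCODE__AccNum format: {!r}".format(line))
    return match.groupdict()


def read_fasta(text):
    """Yield (parsed name, sequence), joining sequence lines and dropping whitespace."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = parse_name(line), []
        elif name is not None:
            chunks.append("".join(line.split()))
    if name is not None:
        yield name, "".join(chunks)


toy_fasta = ">E1A_ADE02__P03254 toy description\nMDLT CHE\nAGPLDLSK\n"  # invented sequence
for name, seq in read_fasta(toy_fasta):
    print(name["accnum"], name["spcode"], seq)
```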

Seqsuite can be used to rename and reformat sequences, using the seq and seqlist programs.

Uniprot downloads should be automatically recognised and converted where needed. Genbank files can be converted using the genbank tool of Seqsuite. (NB. `V1.3.2` currently only supports the standard Genetic Code.)

Commands of the type cmd=FASFILE and cmd=SEQFILE will recognise FASTA format input. Some other commands (where documented) will also expect FASTA files.

Most SLiMSuite programs (unless otherwise stated) will assume protein sequences are being used. The dna=T flag should be used for DNA or RNA sequences where this will affect behaviour (e.g. the alphabet is important).

Thursday, 1 October 2015

SLiMSuite data types and file formats

SLiMSuite is designed to be a suite of programs that enable you to navigate your way through most of the main motif discovery tasks. Well, I say designed but it would probably be more accurate to say evolved. All the programs within SLiMSuite arose from research needs within the lab. As a result, they are heavily biased to the kind of data that we analyse and data sources that we use. However, it should be fairly easy to get data from other formats and sources into SLiMSuite.

From a computer science perspective, input and output for SLiMSuite is just plain ASCII text. This makes it easy to plug SLiMSuite into existing scripts and pipelines - and manually view/edit any input or output files if required. However, “plain text” is not very informative, and SLiMSuite actually deals with a lot of different formats of plain text (from a “human formatting” rather than “file type” point of view). The documentation is currently in the process of being updated to better reflect these formats but some commandline options will still simply list FILE, FILES or FILELIST as input parameters: see the accompanying descriptions to see what format these should be. Ask if it’s not clear! (File format documentation will also be added to the SLiMSuite blog, so check there.)

Within SLiMSuite, each file type has a distinct “file extension” that denotes the file type. Note that these are not enforced for input, although some programs may not always recognise the right format if a different extension is used. If you get odd input behaviour/errors that you do not understand, see if changing the file extensions helps. If you want a common file extension to be auto-recognised, let me know and I might be able to add it. SLiMSuite file extensions will not necessarily be recognised by other programs. NOTE: Operating systems will sometimes hide file extensions by default. If you are getting very confused, or have problems of extra *.txt extensions on everything, try changing the system settings. (And/or becoming familiar with command-line file manipulation.)

The main file types used by SLiMSuite are:

  • MOTIFS = A list of SLiM motif patterns. SLiMSuite has its own motif format but a number of other formats will also work when given as input. This includes a plain list of regex patterns, and results tables from other SLiMSuite programs. [*.motifs]
  • ACCLIST = A list of Uniprot accession numbers. [*.acc]
  • SEQFILE = A file containing biological sequences - usually protein sequences. (Some of the non-SLiM programs will use nucleotide sequences.) These can either be in fasta format (see FASFILE) or Uniprot plain text format (see DATFILE). [*.fas, *.dat]
  • FASFILE = A fasta file of (unaligned) protein sequences. [*.fas]
  • DATFILE = Uniprot plain text format [*.dat]
  • ALNFILE = Aligned fasta file [*.aln.fas]
  • DSVFILE = Delimiter separated value text file. The delimiter will be auto-recognised if possible as a tab [*.tdt, *.tsv], comma [*.csv] or whitespace [*.txt], or can be set with delimit=X if not recognised. Note: delimit=X input may not work with every program, so it is safest to use consistent file names. The delimit=X parameter is more commonly used to control output format.
  • TDTFILE = Tab delimited text file [*.tdt, *.tsv]
  • CSVFILE = Comma separated text file [*.csv]
  • PPIFILE = Delimited text file with Hub and Spoke (gene symbol) fields and preferably also HubUni (uniprot), SpokeUni (uniprot) and Evidence fields.
  • GENELIST = Plain text list of gene symbols.
  • XREFDATA = Delimited text file that links gene symbols to identifiers from other databases.
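
The extension-based delimiter recognition described for DSVFILE can be mimicked in your own scripts along these lines. This is a sketch of the idea rather than the SLiMSuite code: guess_delimiter and split_line are invented names, and the simple split ignores quoting, which a proper CSV reader would handle.

```python
import os

# Extension-to-delimiter mapping as listed for DSVFILE above; None means "any whitespace".
DELIMITERS = {".tdt": "\t", ".tsv": "\t", ".csv": ",", ".txt": None}


def guess_delimiter(filename):
    """Pick a delimiter from the file extension, or complain if it is not recognised."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in DELIMITERS:
        raise ValueError("Unrecognised extension {!r}: set the delimiter by hand".format(ext))
    return DELIMITERS[ext]


def split_line(line, filename):
    """Split one line of a delimited file using the delimiter guessed from its name."""
    return line.rstrip("\n").split(guess_delimiter(filename))


print(split_line("Hub\tSpoke\tEvidence\n", "ppi.tdt"))
```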


SLiM discovery

The main input formats for SLiM discovery are:

  • A source of protein sequence data. This could be a protein FASTA file, a Uniprot plain text file, or a list of Uniprot accession numbers to download. For some tools, a single Uniprot accession number will work.
  • A source of motif (regular expression) definitions. This is only required if looking for known (or other pre-defined) motifs and/or wanting to compare a set of de novo predictions with known motifs. A number of different formats are accepted for motif input, including SLiMFinder (summary) results and ELM downloads. The simplest/easiest is a plain text file of regular expressions. For more on motif regular expression formats, please see Edwards and Palopoli 2015.

The main output formats are delimited text files.

Wednesday, 17 June 2015

Sequence names and species codes for GOPHER

GOPHER (and any tools using orthologue alignments produced by GOPHER) needs sequence names to be formatted in a particular way so that the species information can be correctly parsed. This “SLiMSuite fasta” format is the only sequence format fully supported by SLiMSuite. If you are getting an unexpected error, sequence formatting and naming is one of the first things to check. It should not break any other programs that I know about.

This format is:

>Gene_SPCODE__AccNum [Description]
SEQUENCE

Where:

  • Gene is not used for anything and is purely for easy visual identification.
  • SPCODE is the species code. Where possible, Uniprot species mnemonics should be used but any short code can be used as long as (a) it contains uppercase letters and numbers only (no symbols), and (b) it is consistently used within a species/database. (i.e. you can make it up as long as all sequences from the same species use the same code.)
  • AccNum is the accession number, which is what is used as the unique sequence identifier.
  • Description is optional and can contain any other text.
  • SEQUENCE can be on one or more lines and contain spaces. However, it is best to have a single SEQUENCE line with no whitespace. (Some programs may enforce this.)
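
If you want to check your own sequence names against this format, something like the following Python sketch may help. It is not how SLiMSuite itself parses names: the regular expression is simply my reading of the format described above, parse_name is a hypothetical helper, and the example name is made up.

```python
import re

# >Gene_SPCODE__AccNum [Description], as described above.
# SPCODE is restricted to uppercase letters and numbers.
NAME_RE = re.compile(
    r"^>?(?P<gene>\S+?)_(?P<spcode>[A-Z0-9]+)__(?P<accnum>\S+)"
    r"(?:\s+(?P<desc>.*))?$"
)


def parse_name(header):
    """Split a fasta name line into gene, spcode, accnum and description."""
    match = NAME_RE.match(header.strip())
    if match is None:
        raise ValueError("Not in Gene_SPCODE__AccNum format: %r" % header)
    fields = match.groupdict()
    fields["desc"] = fields["desc"] or ""  # Description is optional
    return fields


# Made-up example name, not a real database entry.
print(parse_name(">ABC1_HUMAN__Q00001 made-up example protein"))
```

Names that raise a ValueError here are the ones worth checking first if GOPHER is failing to pick up species information.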

Seqsuite can be used to rename and reformat sequences, using the seq and seqlist programs.

Uniprot downloads should be automatically recognised and converted where needed.

Monday, 1 June 2015

SLiMSuite release 2015-06-01 now available

A new download of SLiMSuite (release 2015-06-01) is now available. This is the first release in the new git repository at https://github.com/slimsuite/SLiMSuite. A tarball slimsuite.2015-06-01.tgz is also available, containing the same code. Once unpacked, it should be possible to pull down additional updates with git. (This release corresponds to the UCD svn repo r895.)

The major change since the last release is a general tidying of the repository in preparation for going on GitHub and tidying documentation for the new online help via the SLiMSuite REST Server:

To try out the new documentation for a given program, replace sitemap in the box and click View Documentation. Leaving sitemap in the box will list all modules, which can then be clicked on.

The old PDF Manuals are still included in the release and can be accessed from the EdwardsLab Software page. These will be updated eventually but the focus is currently on getting module docstrings and the online help up-to-date. As ever, please get in touch if you have any questions.

This release also sees the addition of a new tool, SLiMParser, for running/parsing the new REST servers. SLiMMaker has also undergone some improvements and now features: (1) basic peptide alignment prior to motif generation; (2) extension of degenerate sites using an “equivalence” list of similar amino acids.

A full list of updates is given below.

Updates since previous release

Updates in tools/:

• gablam: Updated from Version 2.16.1.
→ Version 2.17.0: Added localalnfas=T/F : Whether to output local alignments to *.local.fas fasta file (if local=T) [False]
→ Version 2.17.1: Fixed bug where query and hit lengths were not being output for fullblast.
→ Version 2.18.0: Added blaste filtering to be applied to existing BLAST results.
→ Version 2.19.0: Added maxall=X limits to all-by-all analyses. Added qassemble=T.
→ Version 2.19.1: Fixed handling of basefile and results generation for blastres=FILE.
→ Version 2.19.2: Modified output to be in rank order.

• gopher: Updated from Version 3.4.
→ Version 3.4.1: Fixed stripXGap issue. (Why was this being implemented anyway?). Added REST output.

• haqesac: Updated from Version 1.10.
→ Version 1.10.1: Tweaked QryVar interactivity.
→ Version 1.10.2: Corrected typos and disabled buggy post-HAQESAC data reduction.

• multihaq: Updated from Version 1.2.
→ Version 1.2.1: Updated documentation to include the HAQESAC reference.
→ Version 1.2.2: Switched default to keepblast=T. Added forking blasta=X command to BLAST.

• peptcluster: Updated from Version 1.4.
→ Version 1.5.0: Added peptalign=T/F/X function for aligning peptides using regex or minimal gap addition. Added REST.
→ Version 1.5.1: Updated REST output. Removed peptide redundancy.

• pingu_V4: Updated from Version 4.3.
→ Version 4.4.0: Converted ppicompile=T to ppicompile=LIST.
→ Version 4.5.0: Added hublist=LIST : List of hub genes to restrict pairwise PPI to, and pairwise parsing.

• qslimfinder: Updated from Version 2.0.
→ Version 2.1.0: Added PTMData and PTMList options.

• seqsuite: Updated from Version 1.4.0.
→ Version 1.5.0: Added extatic.ExTATIC and revert.REVERT. NOTE: Dev only.
→ Version 1.5.1: Added 'seq' as alias for 'rje_seq' - want to avoid rje_ prefix requirements.
→ Version 1.6.0: Added mitab and rje_mitab for MITAB parsing.
→ Version 1.6.1: Added extra error messages.
→ Version 1.7.0: Added pingu_V4.PINGU.
→ Version 1.8.0: Added rje_pacbio.PacBio.

• slimbench: Updated from Version 2.8.0.
→ Version 2.8.1: Removed use of Protein name for ELM Uniprot entries due to problems mapping old IDs.
→ Version 2.9.0: Added SLiMMaker ELM reduction table and output.
→ Version 2.9.1: Enabled download only with generate=F benchmark=F.
→ Version 2.10.0: Add generation of table mapping PPIBench dataset generation.

• slimfarmer: Updated from Version 1.4.1.
→ Version 1.4.2: Fixed log transfer issues due to new #VIO line. Better handling of crashed runs.

• slimfinder: Updated from Version 5.1.
→ Version 5.1.1: Modified alphabet handling and fixed musthave bug.
→ Version 5.2.0: Added PTMList and PTMData modes (dev only).

• slimmaker: Updated from Version 1.2.0.
→ Version 1.3.0: Added varlength option to identify gaps in aligned peptides and generate variable length motif.
→ Version 1.3.1: Fixed varlength option to work with end of peptide gaps. (Gaps ignored completely - should not be there!)
→ Version 1.4.0: Add iteration REST output.
→ Version 1.4.1: Add unmatched peptides REST output.
→ Version 1.4.2: Fixed bug with variable length wildcards at start of sequence.
→ Version 1.5.0: Added peptalign=X functionality, using PeptCluster peptide alignment.
→ Version 1.6.0: Added equiv=LIST : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
→ Version 1.6.1: Fixed peptide case bug.

• slimparser: Created/Renamed/moved.
→ Version 0.0.0: Initial Compilation.
→ Version 0.0.1: Fixed RestKeys bug.
→ Version 0.1.0: Added retrieval and parsing of existing server job. Added password.
→ Version 0.2.0: Added API access to REST server if restin is REST call (i.e. starts with http:)
→ Version 0.2.1: Added PureAPI output of API REST call returned text.
→ Version 0.3.0: Added parsing of input files to give to rest calls.
→ Version 0.3.1: Fixed issue that had broken REST server full output.

• slimprob: Updated from Version 2.2.0.
→ Version 2.2.1: Updated REST output.
→ Version 2.2.2: Modified input to allow motif=X in addition to motifs=X.
→ Version 2.2.3: Tweaked basefile setting and citation.

• slimsuite: Updated from Version 1.3.0.
→ Version 1.4.0: Added RLC and Disorder progs to call SLiMCore. Added CompariMotif.
→ Version 1.5.0: Added peptcluster and peptalign calls.

Updates in extras/:

• file_monster: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.
→ Version 1.0: Initial Working version
→ Version 1.1: Broadened away from strict extension-based scavenging to whole file names with wildcards
→ Version 1.2: Added DirSum function and updated FileMonster slightly.
→ Version 1.3: Added redundant file cleanup
→ Version 1.4: Added skiplist and purgelist
→ Version 1.5: Added rename function (to replace rename.pl Perl module)
→ Version 1.6: Minor bug fix.
→ Version 2.0: Major reworking with new object making use of rje_db tables etc. Old functions to be ported with time.
→ Version 2.1: Added dirsum function.
→ Version 2.2: Added fixendings=FILELIST to convert Mac \\r into UNIX \\n

• prodigis: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.
→ Version 0.1: Added probability calculations based on hydrophobicity, serine and cysteine.
→ Version 0.2: Added cysteine count and cysteine weighting.

• rje_glossary: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.
→ Version 1.0: Working version, including text setup for webserver.
→ Version 1.1: Added href=T option to add external hyperlinks for and [text] in descriptions [True]
→ Version 1.2: Added recognition of _italics_ markup.
→ Version 1.3: Fixed minor italicising bug.
→ Version 1.4: Added keeporder=T/F to maintain input order (e.g. for MapTime).

• rje_itunes: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.
→ Version 0.1: Added Plays/Track, default Album Artist and topHTML() method.

• rje_phos: Created/Renamed/moved.
→ Version 0.0: Initial Compilation. Basic pELM parsing done.
→ Version 0.1: Added phosBLAST method.

• rje_pydocs: Updated from Version 2.14.0.
→ Version 2.15.0: Added parsing and generation of "pages" for new rest server docs functions.
→ Version 2.15.1: Tweaked formatting of outfmt and docstring documentation.
→ Version 2.15.2: Tweaked formatting of docstring documentation.
→ Version 2.15.3: Fixed URL formatting of docstring documentation.
→ Version 2.16.0: Added Webserver tab to doc parsing from settings/*.form.
→ Version 2.16.1: Added parsing of imports within a try/except block. (Cannot be on same line as try: or except:)
→ Version 2.16.2: Tweaked makePages() output.

• rje_seqplot: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.

• rje_ssds: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.

• rje_yeast: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.

• wormpump: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.

Updates in libraries/:

• rje: Updated from Version 4.13.1.
→ Version 4.13.2: Removed excess REST HTML methods.
→ Version 4.13.3: Added uselower=False to dataDict() method.
→ Version 4.13.4: Added maxrep=X to listCombos() method.
→ Version 4.14.0: Added listToDict() method.
→ Version 4.15.1: Fixed matchExp method to be able to handle multilines. (Shame re.DOTALL doesn't work!)

• rje_blast_V2: Updated from Version 2.7.
→ Version 2.7.1: Added capacity to keep alignments following GABLAM calculations.
→ Version 2.7.2: Fixed bug with hitToSeq fasta output for rje_seqlist.SeqList objects.
→ Version 2.8.0: A more significant BLAST e-value setting will filter read results.
→ Version 2.9.0: Added qassemble=T/F : Whether to fully assemble query stats from all hits [False].
→ Version 2.9.1: Updated default BLAST and BLAST+ paths to '' for added modules.

• rje_db: Updated from Version 1.7.1.
→ Version 1.7.2: Fixed numerical join issue during Table.compress().
→ Version 1.7.3: Added lower case enforcement of headers for reading tables from file.
→ Version 1.7.4: Added optional restricted Field set for output.
→ Version 1.7.5: Added more error messages and tableNames() method.

• rje_ensembl: Updated from Version 2.14.
→ Version 2.15.0: Added capacity to download/process a section of Ensembl with speclist=LIST.
→ Version 2.15.1: Improved error handling for too many FTP connections: still need to fix problem!
→ Version 2.15.2: Trying to improve speed of Uniprot parsing for EnsLoci.

• rje_genbank: Updated from Version 1.2.2.
→ Version 1.3.0: Added split viral output.
→ Version 1.3.1: Fixed bug in split viral output.

• rje_html: Updated from Version 0.1.
→ Version 0.2.0: Added delimited text to HTML table conversion.
→ Version 0.2.1: Updated default CSS to http://www.slimsuite.unsw.edu.au/stylesheets/slimhtml.css.

• rje_mitab: Created/Renamed/moved.
→ Version 0.0.0: Initial Compilation.
→ Version 0.1.0: Added complex=LIST : Complex identifier prefixes to expand from mapped PPI [complex]
→ Version 0.1.1: Fixed Evidence/IType parsing bug for BioGrid/Intact.
→ Version 0.2.0: Added splicevar=T/F option.

• rje_obj: Updated from Version 2.1.0.
→ Version 2.1.1: Removed excess REST HTML methods.
→ Version 2.1.2: Tweaked glist cmdRead warnings.

• rje_qsub: Updated from Version 1.6.1.
→ Version 1.6.2: Updated module list: blast+/2.2.30,clustalw,clustalo,fsa,mafft,muscle,pagan,R/3.1.1

• rje_scoring: Updated from Version -.

• rje_seq: Updated from Version 3.21.0.
→ Version 3.22.0: Added loading sequences from provided sequence files contents directly, bypassing file reading.
→ Version 3.22.1: Fixed problem if seqin is blank, triggering odd Uniprot download.
→ Version 3.23.0: Add speclist to reformat options.

• rje_seqlist: Updated from Version 1.10.0.
→ Version 1.11.0: Added more dna2prot reformatting options.

• rje_slim: Updated from Version 1.9.
→ Version 1.10.0: Added varlength option to makeSlim() method.
→ Version 1.10.1: Fixed varlength and terminal position compatibility.
→ Version 1.10.2: Fixed issue of [] returns.
→ Version 1.10.3: Fixed makeSlim bug with variable length wildcards at start of sequence.
→ Version 1.11.0: Added splitMotif() function.
→ Version 1.12.0: Added equiv to makeSlim() function.

• rje_slimcore: Updated from Version 2.6.1.
→ Version 2.7.0: Updating MegaSLiM function to work with REST server. Allow megaslim=seqin. Added iuscoredir=PATH and protscores=T/F.
→ Version 2.7.1: Modified iuscoredir=PATH and protscores=T/F to work without megaslim. Fixed UPC/SLiMdb issue for GOPHER.
→ Version 2.7.2: Fixed iuscoredir=PATH to stop raising errors when file not previously made.
→ Version 2.7.3: Fixed serverend message error.

• rje_slimhtml: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.
→ Version 0.3: Added code for making Random Dataset pages
→ Version 0.4: Updated UPC pages and added additional front pages.
→ Version 0.5: Split front page into front and full. Added GO tabs/pages.
→ Version 0.6: Added XGMML output.
→ Version 0.7: Modified output for HumSF10 and HAPPI analysis.
→ Version 0.8: Added SVG output. Integrated better with HAPPI code.
→ Version 0.9: Added SLiM Descriptions.

• rje_slimlist: Updated from Version 1.6.
→ Version 1.7.0: Added direct feeding of motif file content for loading (for REST servers).
→ Version 1.7.1: Modified input to allow motif=X in addition to motifs=X.
→ Version 1.7.2: Fixed bug that could not accept variable length motifs from commandline. Improved error message.

• rje_specificity: Updated from Version -.

• rje_tree: Updated from Version 2.11.0.
→ Version 2.11.1: Tweaked QryVar interactivity.
→ Version 2.11.2: Updated tree paths.

• rje_tree_group: Updated from Version -.

• rje_uniprot: Updated from Version 3.20.3.
→ Version 3.20.4: Fixed bug introduced by REST access modifications.
→ Version 3.20.5: Improved handling of downloads for uniprot IDs that have been updated (i.e. no direct mapping).
→ Version 3.20.6: Improved handling of zero accession numbers for extraction.
→ Version 3.20.7: Fixed uniformat default error.
→ Version 3.21.0: Added uparse=LIST option to try and accelerate parsing of large datasets for limited information.
→ Version 3.21.1: FullText is no longer stored in Uniprot object. Will need special handling if required.
→ Version 3.21.2: Fixed single uniprot extraction bug.
→ Version 3.21.3: Added REST datout to proteomes extraction.

• rje_xref: Updated from Version 1.3.0.
→ Version 1.3.1: Fixed xref list bug.
→ Version 1.4.0: Added optional Mapping dictionary for speeding up recurring mapping (should avoid if memsaver=F).
→ Version 1.5.0: Added stripvar=CDICT removal of variants using Field:Char list, e.g. Uniprot:-,GenPept:. []
→ Version 1.6.0: Added mapxref=LIST List of identifiers to map to KeyIDs using mapfields []

• rje_zen: Updated from Version 1.3.0.
→ Version 1.3.1: Added some more words.

Thursday, 28 May 2015

How to Cite SLiMSuite programs

Most programs in the SLiMSuite package have citation instructions in their docstring. To bring up that information, run the program with the -help option (or, if running through SLiMSuite or SeqSuite, use help=T). Alternatively, the online documentation should show the papers to cite.

You can retrieve documentation for a given program by replacing slimsuite in the box below and clicking View Documentation.

If a program does not have its own citation, please cite SLiMSuite using: Edwards RJ & Palopoli N (2015): Computational Prediction of Short Linear Motifs from Protein Sequences. Methods Mol Biol. 1268:89-141.

For general use or reuse of code, please cite the Zenodo DOI for the GitHub release: DOI

External programs

See External Components of SLiMSuite for citing programs that SLiMSuite uses.

Wednesday, 13 May 2015

Including the date in a log file name

It is often useful for data management purposes to include the run date in the name of a log file. One of the minor tweaks with the latest release of SeqSuite is the ability to do that automatically from the commandline (or ini file) without having to change the command. Simply include #DATE in the log=FILE command. e.g.:

log=mylog.#DATE.log

If run today, for example, the log file generated would be mylog.2015-05-13.log. As well as helping record keeping, this is also a good way of having different log files for different runs whilst still using a single log=FILE command in an ini file.
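
For anyone wanting the same behaviour in their own wrapper scripts, the substitution amounts to something like the Python sketch below. This is only an illustration of the idea, not the SeqSuite code, and expand_log_name is a made-up function name.

```python
import datetime


def expand_log_name(logname, today=None):
    """Replace #DATE in a log file name with the date as YYYY-MM-DD."""
    if today is None:
        today = datetime.date.today()
    return logname.replace("#DATE", today.strftime("%Y-%m-%d"))


# Using the date of this post rather than the real current date:
print(expand_log_name("mylog.#DATE.log", datetime.date(2015, 5, 13)))
```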

Thursday, 8 January 2015

New MAJOR.MINOR.PATCH version numbers

One of the changes in the last release was the introduction of three-component X.Y.Z version numbers in place of the old X.Y numbers. These are slowly being rolled out across all the modules in an effort to approach proper semantic versioning for SLiMSuite. (Due to the somewhat organic nature of its development, it may never reach full semantic versioning.)

From release 2015-01-07 onwards, therefore, version number changes should indicate the nature of the change following MAJOR.MINOR.PATCH version numbering:

  1. MAJOR version increments when a backwards-incompatible change is made. Typically a major change to input/output or core module/class structure.

  2. MINOR version increments when functionality is added in a backwards-compatible manner.

  3. PATCH version increments when bugs are fixed or minor functionality added in a backwards-compatible manner.

Under the old MAJOR.MINOR version numbers, PATCH changes were treated as MINOR changes.
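
If you need to compare version numbers in a script (e.g. to check what kind of change a module has undergone between releases), a minimal Python sketch is given below. The function names are hypothetical and nothing like this is part of SLiMSuite. Treating an old-style X.Y number as X.Y.0 is my assumption for the purposes of comparison.

```python
def parse_version(text):
    """Convert 'X.Y.Z' (or old-style 'X.Y') into a tuple of three integers.

    Assumption: a missing component is treated as 0, so '4.3' == '4.3.0'.
    """
    parts = [int(part) for part in text.split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("Expected X.Y or X.Y.Z: %r" % text)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def change_type(old, new):
    """Return 'MAJOR', 'MINOR' or 'PATCH' for an increment from old to new."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError("%s is not newer than %s" % (new, old))
    for label, before, after in zip(("MAJOR", "MINOR", "PATCH"), old_v, new_v):
        if before != after:
            return label


print(change_type("2.16.1", "2.17.0"))
```

Because tuples compare element by element, parse_version() output can also be used directly for sorting or for "is this at least version X" checks.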

Regrettably, due to the modular structure of SLiMSuite, the main program modules will not always have MINOR and PATCH version increments when the underlying modules are changed. The plan is to make sure that the main SLiMSuite and SeqSuite modules do increment with a new release to reflect changes. In the meantime, please contact the author if you have any questions or unexpected behaviour.

Wednesday, 7 January 2015

SLiMSuite release 2015-01-07 now available

A new download of SLiMSuite (release 2015-01-07) is now available at both UK (U. Southampton) and Australia (UNSW) sites (svn r613).

Many of the changes are under the hood, in preparation for a new set of REST services, which will be coming soon. The new download also features two new programs in the tools/ folder, which will hopefully simplify running many of the programs. The core programs and several of the key accessory programs (e.g. rje_seq and rje_uniprot) can now be run using the main SLiMSuite program:

python tools/slimsuite.py -prog X

where X is one of the SLiMSuite or SeqSuite programs. To see which are currently supported, run with -help. Simply add additional commandline options for the chosen program (and/or use ini files) as normal. For program-specific help, run with help=T: this will give the help documentation for program X rather than SLiMSuite. (NB. SLiMSuite can be used to access both SLiMSuite and SeqSuite programs. There is also a seqsuite.py that can be used to access just the SeqSuite programs and accessories.)
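
If you are calling SLiMSuite from your own Python scripts, the command can be assembled along the lines of the sketch below. This is only an illustration: build_command is a made-up helper, the default script path assumes you are running from the top of the SLiMSuite download, and the program name and file name in the example are placeholders (check -help for the supported program names).

```python
def build_command(prog, script="tools/slimsuite.py", **options):
    """Return the argument list for running program `prog` via slimsuite.py.

    Options are written as key=value, with True/False converted to T/F.
    They are sorted by name purely to make the output predictable.
    """
    cmd = ["python", script, "-prog", prog]
    for key in sorted(options):
        value = options[key]
        if isinstance(value, bool):
            value = "T" if value else "F"
        cmd.append("%s=%s" % (key, value))
    return cmd


# Placeholder program and file names, for illustration only.
cmd = build_command("slimfinder", seqin="proteins.fas", help=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.call() from the Python standard library to actually run the program.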

The other major update is that SLiMSuite programs (SLiMProb, SLiMFinder, QSLiMFinder and SLiMCore) can now take lists of Uniprot accession numbers as alternative input, using uniprotid=LIST in place of seqin=FILE. Providing there is an open internet connection, the relevant proteins will be downloaded from the Uniprot server for analysis.

GABLAM has also benefited from the addition of a new fullblast=T mode, which will perform the full all versus all BLAST+ search prior to GABLAM processing. Depending on your machine setup, this can be faster than the current method that forks out a single sequence at a time and is more IO-intensive as a result. The GABLAM functions to use existing BLAST+ results have also been fixed and tidied a little. (If re-running might be required, keepblast=T can retain the full BLAST results file to accelerate subsequent runs.)

Updates since last release:

• fiesta: Updated from Version 1.8.
→ Version 1.8.1: Replaced type with stype throughout to try and avoid TypeError crashes.
→ Version 1.9.0: Altered HAQDB to be a list of files rather than just one.

• gablam: Updated from Version 2.14.
→ Version 2.15.0: Added seqnr function. Add run() method.
→ Version 2.16.0: Added fullblast=T/F : Whether to perform full BLAST followed by blastres analysis [False]
→ Version 2.16.1: Fixed a bug where the fullblast option was failing to return scores and evalues.

• multihaq: Updated from Version 1.1.
→ Version 1.2: Changed defaults to autoskip=F.

• pingu_V4: Updated from Version 4.2.
→ Version 4.3: Modified to use Pfam as hub field for DomPPI generation. Modified naming of PPI output after ppisource.

• seqsuite: Created/Renamed.
→ Version 0.0: Initial Compilation.
→ Version 0.1: Added rje_seq and FIESTA. Added Uniprot.
→ Version 1.0: Moved to tools/ for general release. Added HAQESAC and MultiHAQ. Moved mod to enable easy external access.
→ Version 1.1: Added XRef = rje_xref.XRef. Identifier cross-referencing module.
→ Version 1.2: Added taxonomy.
→ Version 1.3.0: Added rje_zen.Zen. Modified code to work with REST services.
→ Version 1.4.0: Added rje_tree.Tree, GABLAM and GOPHER.

• slimbench: Updated from Version 2.5.
→ Version 2.6: Added ELM domain interactions table: http://www.elm.eu.org/infos/browse_elm_interactiondomains.tsv.
→ Version 2.6: Fixed issues introduced with new SLiMCore V2.0 SLiMSuite code.
→ Version 2.7: Reinstate filtering. (Not sure why disabled.) Add genspec=LIST to filter by species. Added domlink=T/F.
→ Version 2.8.0: Implemented PPIBench benchmarking for datasets without Motifs in name.

• slimfarmer: Updated from Version 1.3.
→ Version 1.4: Added modules=LIST : List of modules to add in job file [clustalo,mafft]
→ Version 1.4.1: Fixed farm=batch mode for qsub=T.

• slimmaker: Updated from Version 1.1.
→ Version 1.2.0: Modified to work with REST servers

• slimmutant: Updated from Version 1.0.
→ Version 1.1: Minor tweaks to generate method to increase speed. (Make index in method.) Added splitfield=X.
→ Version 1.2: Added a batch mode for mutfiles - all other options will be kept fixed. Added maxmutant and minmutant.
→ Version 1.3: Added SLiMPPI analysis (will set analyse=T). Started basing on SLiMCore

• slimprob: Updated from Version 2.1.
→ Version 2.2.0: Added basic REST functionality.

• slimsuite: Created/Renamed.
→ Version 0.0: Initial Compilation based on SeqSuite.
→ Version 1.0: Moved to tools/ for general release. Added reading and using of SeqSuite programs.
→ Version 1.1: Added slimlist.
→ Version 1.2: Added SLiMBench.
→ Version 1.3.0: Added SLiMMaker and modified code to work with REST services.

• rje: Updated from Version 4.12.
→ Version 4.13.0: Added new built-in attributes/options for REST services.
→ Version 4.13.1: Fixed MemSaver typo in WarnLog output. Modified mkDir() to avoid clashes raising errors.

• rje_db: Updated from Version 1.5.
→ Version 1.6: Added option to save a subset of entries using saveToFile(savekeys=LIST).
→ Version 1.7.0: Added splitchar to table splitting.
→ Version 1.7.1: Reinstated raise error if expected table missing.

• rje_dismatrix_V3: Created/Renamed.
→ Version 3.0: Updated to new rje_obj.RJE_Object class.

• rje_ensembl: Updated from Version 2.13.
→ Version 2.14: Add enspep=T/F : Create full gnspacc EnsEMBL peptide datasets [False]

• rje_genbank: Added to download.
→ Version 0.0: Initial Compilation.
→ Version 0.1: Modified and Tidied output a little.
→ Version 0.2: Added details to skip and option to use different detail for protein accession number.
→ Version 0.3: Added reloading of features.
→ Version 1.0: Basic functioning version. Added fetchuid=LIST Genbank retrieval to generate seqin=FILE.
→ Version 1.1: Added use of rje_taxonomy for getting Species Code from TaxID.
→ Version 1.2: Modified to deal with genbank protein entries.
→ Version 1.2.1: Fixed feature bug that was breaking parser and removing trailing '*' from protein sequences.
→ Version 1.2.2: Fixed more features that were breaking parser.

• rje_obj: Updated from Version 2.0.
→ Version 2.1.0: Added new built-in attributes/options for REST services.

• rje_ppi: Updated from Version 2.8.
→ Version 2.8.1: Fixed bug with Spring Layout interruption message.

• rje_qsub: Updated from Version 1.5.
→ Version 1.6: Added modules=LIST : List of modules to add in job file [clustalo,mafft]
→ Version 1.6.1: Added R/3.1.1 to modules.

• rje_seq: Updated from Version 3.20.
→ Version 3.21.0: Added extraction of uniprot IDs for seqin.

• rje_seqlist: Updated from Version 1.7.
→ Version 1.8: Added sortseq=X : Whether to sort sequences prior to output (size/invsize/accnum/name/seq/species/desc) [None]
→ Version 1.9.0: Added extra functions for returning sequence AccNum, ID or Species code.
→ Version 1.10.0: Added extraction of uniprot IDs for seqin. Added more dna2prot reformatting options.

• rje_sequence: Updated from Version 2.3.
→ Version 2.4: Added recognition of modified IPI format. Added standalone low complexity masking.
→ Version 2.4.1: Moved the gnspacc fragment recognition to reduce issues. Should perhaps remove completely?

• rje_slim: Updated from Version 1.8.
→ Version 1.9: Reinstated ambcut for slimToPattern()

• rje_slimcalc: Updated from Version 0.8.
→ Version 0.9: Improvements to use of GOPHER.

• rje_slimcore: Updated from Version 2.2.
→ Version 2.3: Docstring edits. Minor tweak to walltime() to close open files.
→ Version 2.4: Added megaslimfix=T/F : Whether to run megaslim in "fix" mode to tidy/repair existing files [False]
→ Version 2.5: Added (hidden) slimmutant=T/F : Whether to ignore '.p.\D\d+\D' at end of accnum. Made default append=True.
→ Version 2.6.0: Added uniprotid=LIST : Extract IDs/AccNums in list from Uniprot into BASEFILE.dat and use as seqin=FILE. []
→ Version 2.6.1: Removed the maxseq default setting.

• rje_slimlist: Updated from Version 1.4.
→ Version 1.5: Added run() method for slimsuite.py compatibility. Improved split motif handling.
→ Version 1.6: Modified to read in new ELM class download file with extra header information. Added varlength=T/F filter.
→ Version 1.6: Modified so that filtering one element of a split motif removes all.

• rje_tree: Updated from Version 2.10.
→ Version 2.11.0: Modified for standalone running as part of SeqSuite.

• rje_uniprot: Updated from Version 3.19.
→ Version 3.20: Updated dbsplit=T output and checked function with Pfam. Probably needs work for other databases.
→ Version 3.20.1: Added uniprotid=LIST as an alias to acclist=LIST and extract=LIST.
→ Version 3.20.2: Added extra sequence return methods to UniprotEntry. Added fasta REST output.
→ Version 3.20.3: Fixed bug if new uniprot extraction method fails.

• rje_xml: Created/Renamed.
→ Version 0.0: Initial Compilation.
→ Version 0.1: Added xml.sax functions.
→ Version 0.2: Added parsing from URL.

• rje_xref: Updated from Version 1.1.
→ Version 1.2: Added join=LIST Run in join mode for list of FILE:key1|...|keyN:JoinField [] and naturaljoin=T/F.
→ Version 1.3.0: Added compress=LIST to handle 1:many input data. []

• rje_zen: Updated from Version 1.2.
→ Version 1.3.0: Modified output to work with new REST service calls.