SLiMSuite & SeqSuite sequence analysis tools: 2014

Monday 25 August 2014

SLiMSuite release 2014-08-25 now available

A new download of SLiMSuite (release 2014-08-25) is now available (svn r466).

The latest release sees a major revamp of the basic SLiMSuite chassis, which now sits on a newer underlying SLiMCore 2.x class to enable the megaslim=FILE upgrade (below). This should have no effect on end use, but please report any odd behaviour: some compatibility bugs may have crept in that will crash programs in rare scenarios.

The following tools are now in ./legacy/:

slimsuite/legacy/gopher_V2.py
slimsuite/legacy/qslimfinder_V1.9.py
slimsuite/legacy/slimbench_V1.py
slimsuite/legacy/slimdisc_V1.4.py
slimsuite/legacy/slimfinder_V4.9.py
slimsuite/legacy/slimprob_V1.4.py

SLiMSuite now features a new mode for multiple runs/datasets featuring the same input proteins. Running with megaslim=FILE, where FILE is a fasta file of all possible sequences in the dataset(s), will generate files of masking scores that can be reused instead of recalculating each time. In each case, each sequence will be on one line, with the sequence name followed by a score per residue. If dismask=T then *.iupred.txt or *.anchor.txt will be created/read, which will list the disorder score for each position. If consmask=T then *.rlc.txt will list RLC scores. By default, scores will be read where present, or calculated and appended where missing. Running rje_slimcore will calculate all sequences for megaslim=FILE.
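As a rough illustration of the file layout described above, the following Python sketch reads lines that hold a sequence name followed by one score per residue. The whitespace delimiter, the function name and the example sequence names are assumptions made for illustration; check a real *.iupred.txt or *.rlc.txt file for the actual separator before relying on this.

```python
def read_residue_scores(lines):
    """Parse 'name score score ...' lines into {name: [float, ...]}.

    Assumes whitespace-delimited fields (an assumption, not something
    confirmed by the release notes).
    """
    scores = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, values = fields[0], fields[1:]
        scores[name] = [float(v) for v in values]
    return scores


# Hypothetical example content: one sequence per line.
example = ["SEQ1_HUMAN 0.12 0.45 0.80", "SEQ2_HUMAN 0.05 0.07"]
parsed = read_residue_scores(example)
for name, values in parsed.items():
    print(name, len(values), values)
```

With a real file, the same function could be called on an open file handle, since iterating over a handle yields lines.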

In addition to disorder and conservation scores, running SLiMCore with megablam=T will also create an all-by-all GABLAM run, which can be used for subsequent UPC generation. Even without megaslim=FILE, this can still be used with the gablamdis=FILE option.

If megaslim=None (the default), creating alignments for conservation masking can also be sped up by giving usegopher=T gopherdir=PATH forks=X, where forks > 1. This will use forking to check/create GOPHER alignments for all input sequences before a regular masking run.

The final upgrade of note is that SLiMBench now features an occurrence benchmarking mode (occbench=T), which has not yet made its way into the manual. More on SLiMBench soon.

Bug fixes in this release:

Combined case masking and disorder masking bug fixed.
Minor bugs (introduced with DNA=T mode) that affect extended alphabets in SLiMFinder have been fixed.
A minor bug in SLiMProb that returned positions 0 to L-1 rather than 1 to L for N-terminal motif matches (^xxx) has been fixed.
A bug affecting GOPHER when run with rje_blast_V2 has been fixed. There might still be some issues with rje_blast_V2 when the number of alignments output is smaller than the number of one-line hits.

Other miscellaneous updates are listed below.

Updates since last release:

• fiesta: Updated from Version 1.7.
→ Version 1.8: Minor crash fixes. Updated more functions to work with BLAST+.

• multihaq: Updated from Version 0.1.
→ Version 1.0: Fully working version. Fixed minor basefile bug. Added blastcut filter.
→ Version 1.1: Improved pickup of aborted run.

• qslimfinder: Updated from Version 1.8.
→ Version 1.9: Preparation for QSLiMFinder V2.0 & SLiMCore V2.0 using newer RJE_Object.
→ Version 2.0: Converted to use rje_obj.RJE_Object as base. Version 1.9 moved to legacy/.

• slimbench: Updated from Version 2.4.
→ Version 2.5: Basic OccBench assessment benchmarking. Added ELM Uniprot acclist output. (Download issues?)

• slimfinder: Updated from Version 4.8.
→ Version 4.9: Preparation for SLiMFinder V5.0 & SLiMCore V2.0 using newer RJE_Object.
→ Version 5.0: Converted to use rje_obj.RJE_Object as base. Version 4.9 moved to legacy/.
→ Version 5.1: Modified SLiMChance slightly to catch missing aafreq.

• slimprob: Updated from Version 1.3.
→ Version 1.4: Preparation for SLiMProb V2.0 & SLiMCore V2.0 using newer RJE_Object.
→ Version 2.0: Converted to use rje_obj.RJE_Object as base. Version 1.4 moved to legacy/.
→ Version 2.1: Modified output of N-terminal motifs to correctly start at position 1.

• rje: Updated from Version 4.11.
→ Version 4.11: Added self.name() to basic object class.
→ Version 4.12: Added 'bool' and 'str' to _cmdRead() to ease switchover to new RJE_Objects.

• rje_blast_V2: Updated from Version 2.6.
→ Version 2.7: Fixed occasional oneline versus description mismatch error. Fixed some localhits bugs.

• rje_db: Updated from Version 1.4.
→ Version 1.5: Fixed occasional key error following addField. Added indexReport() method.

• rje_disorder: Updated from Version 0.7.
→ Version 0.8: Added makeRegions() method.

• rje_obj: Updated from Version 1.7.
→ Version 1.8: Cleaned up some erroneous opt, stat and info references.
→ Version 2.0: Added self.file dictionary and methods for handling file handles with matching self.str filenames.

• rje_seq: Updated from Version 3.19.
→ Version 3.20: Added run() method for SeqSuite.

• rje_seqlist: Updated from Version 1.6.
→ Version 1.6: Add sequence fragment extraction.
→ Version 1.7: Added code to create rje_sequence.Sequence objects.

• rje_slim: Updated from Version 1.7.
→ Version 1.8: Modified use of aa/dna defaults to (hopefully) not break when using extended alphabets.

• rje_slimcore: Updated from Version 1.15.
→ Version 1.16: Preparation for SLiMCore V2.0 using newer RJE_Object.
→ Version 2.0: Converted to use rje_obj.RJE_Object as base. Version 1.16 moved to legacy/.
→ Version 2.1: Added megaslim=FILE option to make/use precomputed results for a proteome. Upgraded MotifSeq method.
→ Version 2.2: Modified aa frequency calculations to use alphabet to generate 0.0 frequencies (rather than missing aa).

• rje_slimlist: Updated from Version 1.3.
→ Version 1.4: Modified code to be compatible with SLiMCore V2.x objects.

• rje_zen: Updated from Version 1.1.
→ Version 1.2: Added a webserver mode to return text directly.

Wednesday 6 August 2014

SLiMSuite bug with combined sequence case and disorder masking

A small flaw has been discovered in the current implementation of disorder masking when it is combined with masking upper or lower case residues (casemask=X dismask=T). Rather than predicting disorder on the unmasked sequence and then combining with any case masking, disorder predictions are currently made on the masked sequences.

Hopefully, this will have minimal impact for the majority of cases. (Although I am not certain, I suspect that it will produce a tendency to over-predict disorder and thus under-mask.) This bug has been fixed for the next release of SLiMSuite. Note that other masking combinations are not affected.
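The order-of-operations issue can be sketched with a toy example. Everything below is hypothetical: toy_disorder() is a crude window-based stand-in, not IUPred or ANCHOR, and the function names are invented for illustration. The only point is that predicting disorder before or after case masking can give different final masks.

```python
DISORDER_PRONE = set("PSEKQG")  # toy assumption, not a real propensity scale


def toy_disorder(seq, window=3, cutoff=0.5):
    """Hypothetical predictor: a residue is 'disordered' if at least
    `cutoff` of the residues in its window are in DISORDER_PRONE."""
    half = window // 2
    calls = []
    for i in range(len(seq)):
        region = seq[max(0, i - half): i + half + 1]
        frac = sum(r.upper() in DISORDER_PRONE for r in region) / len(region)
        calls.append(frac >= cutoff)
    return calls


def case_mask(seq):
    """Replace lower case residues with X."""
    return "".join("X" if r.islower() else r for r in seq)


def disorder_mask(seq, calls):
    """Replace residues predicted as ordered with X."""
    return "".join(r if d else "X" for r, d in zip(seq, calls))


def intended_mask(seq):
    # Predict on the unmasked sequence, then combine with case masking.
    return disorder_mask(case_mask(seq), toy_disorder(seq))


def buggy_mask(seq):
    # Predict on the already case-masked sequence (the behaviour described above).
    masked = case_mask(seq)
    return disorder_mask(masked, toy_disorder(masked))


seq = "LSpeSLAPSE"
print(intended_mask(seq))
print(buggy_mask(seq))
```

In this toy the two orders give different masks. Which direction a real predictor shifts depends on how it treats masked residues, so the sketch says nothing about whether the real bug over- or under-masks.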

Monday 7 July 2014

Minor SLiMSuite update released

An updated download of SLiMSuite (release 2014-07-06) is now available. (This release contains some minor modifications in line with a manuscript submission.)

Updates since last release:

• gablam: Updated from Version 2.13.
→ Version 2.14: Added checktype=T/F option to check sequence/BLAST type.

• slimbench: Updated from Version 2.3.
→ Version 2.4: Improved error messages.

• slimfarmer: Updated from Version 1.2.
→ Version 1.3: Modified default vmem request to 127GB from 64GB.

• slimfinder: Updated from Version 4.7.
→ Version 4.8: Modified cloud generation to avoid issues with flexible-length wildcards.

• rje_ensembl: Updated from Version 2.12.
→ Version 2.13: Added speedskip=T/F [True] that will skip when pep.all, cdna.all and dna.toplevel are found.

• rje_xgmml: Updated from Version 0.0.
→ Version 1.0: Basic functional version for use with other modules. Disabled attributes by default for Cytoscape.

Monday 23 June 2014

New SLiMSuite release now available

A new download of SLiMSuite (release 2014-06-22) is now available.

As well as fixing the minor GOPHER output bug, a new Taxonomy processing module (rje_taxonomy) has been added. Although primarily designed for use with other SLiMSuite programs, this module has some standalone functionality for generating lists of Taxa IDs and species codes. (Details to follow.)

Output for the main SLiMSuite programs (SLiMFinder, SLiMProb and QSLiMFinder) has also been consolidated and made more consistent for both re-running analyses and running analyses with multiple settings (consecutively) in the same directory. The use of GOPHER for generating alignments for conservation masking in these programs has also been improved to enable forking. (Details to follow.)

The final change of note is that SLiMMaker is now used to generate a consensus motif for each cloud returned by (Q)SLiMFinder. Note that, by default, these consensus motifs will not necessarily cover all occurrences in the cloud. (See SLiMMaker for more information.)

Other miscellaneous updates are listed below.

Updates since last release:

• gablam: Updated from Version 2.12.
→ Version 2.13: Fixed Protein vs DNA GABLAM. Modified sequence extraction to handle larger sequences. Add blastdir=PATH/.

• gopher: Updated from Version 3.3.
→ Version 3.4: Fixed FullRBH paralogue duplication issue.

• pingu_V4: Updated from Version 4.1.
→ Version 4.2: Bug fixes for use of PPISource to create PPI databases. Add HGNC to sourcedata (xrefdata=HGNC)

• qslimfinder: Updated from Version 1.7.
→ Version 1.8: Added cloudfix=T/F Restrict output to clouds with 1+ fixed motif (recommended) [False]. Consolidating output.

• slimbench: Updated from Version 2.2.
→ Version 2.2: Modified the FN/TN and ResNum calculations. No longer rate TP in random data as OT.
→ Version 2.3: Changed the default to queries=F. SearchINI bug fix. Added occbench generation.

• slimfarmer: Updated from Version 1.1.
→ Version 1.2: Implemented the slimsuite=T/F option and got SLiMFarmer qsub to work with GOPHER forking.

• slimfinder: Updated from Version 4.6.
→ Version 4.7: Added SLiMMaker generation to motif clouds. Added Q and Occ to Chance column.

• slimprob: Updated from Version 1.2.
→ Version 1.3: Consolidating output file naming for consistency across SLiMSuite. (SLiMBuild = Motif input)

• rje: Updated from Version 4.10.
→ Version 4.11: Enabled '\t#' comments in ini files. Modified getStrLC to return '' for 'none' by default. Added listMax().
→ Version 4.11: Added self.name() to basic object class.

• rje_blast: Reinstated for SLiMDisc legacy compatibility.

• rje_ensembl: Reinstated and updated.
→ Version 2.11: Added rje_taxonomy and makeuniprot=T/F. Removed metlist. Moved release and species data extraction.
→ Version 2.12: Changed chromspec to enable downloads of all species but also download toplevel files, not chromosomes.

• rje_hmm_V1: Reinstated.

• rje_hpc: Updated from Version 1.0.
→ Version 1.1: Disabled memory checking in Windows and OSX.

• rje_motif_V3: Updated from Version 3.0.
→ Version 3.1: Fixed minor code bugs.

• rje_obj: Updated from Version 1.6.
→ Version 1.7: Added self.name() to basic object class.

• rje_seq: Updated from Version 3.18.
→ Version 3.19: Fixed BLAST+ sequence extraction name truncation error.

• rje_seqlist: Updated from Version 1.4.
→ Version 1.5: Added sampler=N(,X) : Generate (X) file(s) sampling a random N sequences from input into seqout.N.X.fas [0]
→ Version 1.6: Modified currSeq() and nextSeq() slightly to fix index mode breakage. Look out for other programs breaking.
→ Version 1.6: Add sequence fragment extraction.

• rje_slim: Updated from Version 1.6.
→ Version 1.7: Fixed import slimFix(slim) error that was reporting slimProb()

• rje_slimcalc: Updated from Version 0.7.
→ Version 0.8: Made RLC the default.

• rje_slimcore: Updated from Version 1.14.
→ Version 1.15: Added pre-running GOPHER if no alndir and usegopher=T. Updated dataset() to use Input not Basefile.

• rje_taxonomy: Created/Renamed.
→ Version 0.0: Initial Compilation.
→ Version 0.1: Initial working version with rje_ensembl.
→ Version 1.0: Fully functional version with modified viral species code creation.

• rje_uniprot: Updated from Version 3.18.
→ Version 3.19: Updated and consolidated dbxref table generation (formerly linkout) using rje_db. Changed acc_num to accnum.

Wednesday 14 May 2014

Minor bug in GOPHER output with BLAST+

A bug has been identified with the current SLiMSuite release when using BLAST+ to generate orthologue alignments with GOPHER. Sequences extracted from the blast database have the first letter of their name truncated. In real terms, this should not make a lot of difference (if using the recommended naming format) but it could present mapping issues. A few other programs, such as HAQESAC, may also be affected if BLAST+ is being used to extract sequences.

A fix is available on request and will be part of the next release, which should be soon.

Friday 25 April 2014

Blog switchover

Posts and pages from the old SLiMSuite and SeqSuite blog have now been imported and this blog will take over as the main source of ongoing news, tips, documentation and updates.

Wednesday 23 April 2014

SLiMSuite 2014-04-22 now available

A new download of SLiMSuite (release 2014-04-22) is now available. As well as fixing the gopher.py error, the download page and readme have had a slight makeover, which should make them load quicker.

As part of ongoing consolidation and documentation, SeqSuite has now been incorporated into a single SLiMSuite download. (Previously, SLiMSuite was available as a reduced set of programs and SeqSuite had the full set.) The intention is to retire the SeqSuite moniker over the coming months, although the programs themselves will still be available.

The latest release also features a new program, SLiMFarmer, for running (Q)SLiMFinder and SLiMProb batch jobs on parallel processors. SLiMFarmer is still under development and should hopefully work with other SLiMSuite programs too, but this has not yet been tested.

Other miscellaneous updates are listed below.

Updates since last release:

• comparimotif_V3: Updated from Version 3.10.
→ Version 3.10: Added forking.
→ Version 3.11: Added additional overlap/matchfix checks during basic comparison to try and speed up.
→ Version 3.12: Replaced deprecated sets.Set() with set().

• gablam: Updated from Version 2.11.
→ Version 2.12: Consolidated use of BLAST V2.

• haqesac: Updated from Version 1.9.
→ Version 1.10: Added exceptions for BLAST failure.

• picsi: Updated from Version 1.1.
→ Version 1.2: Updated to BUDAPEST 2.3 and rje_mascot.

• pingu_V4: Created.
→ Version 4.0: Initial Compilation based on code from SLiMBench and PINGU 3.9 (inherited as pingu_V3).
→ Version 4.1: Adding compilation of PPI databases using new rje_xref V1.1 and older objects from PINGU V3.
→ Version 4.2: Bug fixes for use of PPISource to create PPI databases.

• qslimfinder: Updated from Version 1.6.
→ Version 1.7: Fixed "MustHave=LIST" correction of motif space.

• seqmapper: Updated from Version 2.0.
→ Version 2.1: Added catching of failure to read input sequences. Removed 'Run' from GABLAM table.

• slimbench: Updated from Version 2.0.
→ Version 2.1: Fixed memsaver=T unless in development mode (dev=T). Removed old Assessment. Tested with simbench analysis.
→ Version 2.2: Replaced searchini=LIST with searchini=FILE and moved to SimBench commands.
→ Version 2.2: Modified the FN/TN and ResNum calculations. No longer rate TP in random data as OT.

• slimfarmer: Created.
→ Version 0.0: Initial Compilation.
→ Version 1.0: Functional version using rje_qsub and rje_iridis to fork out SLiMSuite runs.
→ Version 1.1: Updated to use rje_hpc.JobFarmer and incorporate main SLiMSuite farming within SLiMFarmer class.

• slimfinder: Updated from Version 4.5.
→ Version 4.6: Minor modification to seqocc=T function. !Experimental! Added main occurrence output and modified savespace.

• slimmutant: Created.
→ Version 0.0: Initial Compilation.
→ Version 1.0: Working version with standalone functionality.

• slimprob: Updated from Version 1.0.
→ Version 1.1: Tidied import commands.
→ Version 1.2: Increased extras=X levels. Adjusted maxsize=X assessment to be post-masking.

• ned_rankbydistribution: Updated from Version 1.1.
→ Version 1.2: Replaced deprecated Set module.

• rje: Updated from Version 4.8.
→ Version 4.9: Added rje.slimsuite, which determines the slimsuite home directory from rje.py file path.
→ Version 4.10: Added osx=T/F option for Mac-specific running options.

• rje_blast_V2: Updated from Version 2.4.
→ Version 2.5: Minor modifications for SLiMCore UPC generation.
→ Version 2.6: Minor bug fixes.

• rje_db: Updated from Version 1.2.
→ Version 1.3: Minor modifications for SLiMCore FUPC development.
→ Version 1.4: Added list checking with addEmptyTable.

• rje_dismatrix_V2: Updated from Version 2.9.
→ Version 2.10: Minor modifications for SLiMCore UPC.

• rje_genemap: Updated from Version 1.4.
→ Version 1.5: Minor tweak of expected HGNC input following change to downloads.

• rje_hpc: Created.
→ Version 1.0: Initial Compilation based on rje_iridis V1.10.

• rje_iridis: Updated from Version 1.9.
→ Version 1.10: Modified freemem setting to run on Katana. Made rsh optional. Removed defunct IRIDIS3 option.

• rje_obj: Updated from Version 1.3.
→ Version 1.4: Added sourceDataFile() method from SLiMBench for wider use.
→ Version 1.5: Added 'basestr' and 'basefile' cmdlist types.
→ Version 1.6: Added osx=T/F option for Mac-specific running options.

• rje_qsub: Updated from Version 1.4.
→ Version 1.5: Added emailing of job stats after run. Added vmem limit.

• rje_seq: Updated from Version 3.17.
→ Version 3.18: Minor BLAST+ bug fixes. Added exceptions to readBLAST failure.

• rje_seqlist: Updated from Version 1.3.
→ Version 1.4: Added dna2prot reformat function.

• rje_slimcore: Updated from Version 1.12.
→ Version 1.13: Modified the savespace settings to reduce numbers of files. targz file now uses RunID not Build Info.
→ Version 1.14: Started adding code for Fragmented UPC (FUPC) clustering.

• rje_slimlist: Updated from Version 1.2.
→ Version 1.3: Added auto-download of ELM data.

• rje_uniprot: Updated from Version 3.14.
→ Version 3.14: Added dblist=LIST and dbsplit=T/F for additional DB link output control. Set unipath default to url.
→ Version 3.15: Added extraction of taxonomic groups. Add UniFormat to improve pure downloads.
→ Version 3.16: Added WBGene ID's from WormBase as one of the recognised DB XRef to parse.
→ Version 3.17: Efficiency tweak to URL-based extraction of acclist.
→ Version 3.18: Minor modification to database parsing.

• rje_xref: Updated from Version 1.0.
→ Version 1.1: Added output of ID lists to text files. Major reworking. Tested with HPRD and HGNC.

Monday 21 April 2014

Coming soon...

This blog will replace the current SeqSuite and SLiMSuite blog. For now, please see the existing blog.

Tuesday 8 April 2014

Missing gopher.py file

There is a bug with the current software download, with a file missing from the libraries/ directory. The download will hopefully be updated soon but in the meantime please email richard.edwards[at]unsw.edu.au and I will send you the file.

Tuesday 14 January 2014

Using SLiMFinder on Phage Display Data (or other peptides)

Although SLiMFinder is designed with whole protein sequences in mind, it can also be used to identify statistically over-represented motifs in peptide data, including phage display results. Indeed, it is the third example application in the original SLiMFinder paper.

Unfortunately, the SLiMFinder webserver is currently not set up for phage display analysis, so if you are interested in this kind of work then you will need to download SLiMSuite.

Suggested settings for phage display data are below. If anyone has a go and/or wants more advice, please get in touch. (If you try it, I’d be interested to hear how well it works!) Similarly, if you want some advice/ideas on how to combine the peptides with interaction data and full length protein sequences for a more sophisticated analysis, send me a bit more info and I’d be happy to make some suggestions.

Custom settings for phage display data

Here is an overview of the settings that should be tweaked for phage display analysis:

Amino acid frequencies. One thing you will want to try is changing the way that the amino acid frequencies are used. By default, SLiMFinder will use the amino acid frequencies of the input dataset but for phage display peptides this is not really right as the peptides are clearly biased in their composition due to the motifs they contain. Instead, you probably want to set the amino acid frequencies for the background model to those of the human proteome (for human peptides) or even a uniform amino acid distribution. (Select frequencies that model the pre-screening amino acid frequencies.) This is done using the aafreq=FILE option, where FILE can be a fasta file of protein sequences or a delimited file of aa frequencies with the headings “AA” and “FREQ”. (See the manual for details.) If in doubt, try a few runs with different amino acid frequencies.
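As a minimal sketch of the uniform-distribution option, the following Python builds a delimited frequency table with the "AA" and "FREQ" headings mentioned above. The tab delimiter, the helper name and the output file name are assumptions for illustration; see the manual for the formats that aafreq=FILE actually accepts.

```python
import csv
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def uniform_aafreq_table(alphabet=AMINO_ACIDS, delimiter="\t"):
    """Return a delimited AA/FREQ table with equal frequencies.

    The tab delimiter is an assumption; adjust to whatever the manual specifies.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["AA", "FREQ"])
    freq = 1.0 / len(alphabet)
    for aa in alphabet:
        writer.writerow([aa, "%.4f" % freq])
    return out.getvalue()


table = uniform_aafreq_table()
print(table)
# To save under a hypothetical file name:
# with open("uniform_aafreq.tdt", "w") as handle:
#     handle.write(table)
```

The saved file could then be tried as aafreq=FILE alongside a run that uses proteome frequencies, to see how sensitive the results are to the background model.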

Evolutionary Filtering. Evolutionary filtering should be switched off (efilter=F) but you will also want to make sure that there is no redundancy in your peptides. (rje_seq.py can be used for this.)

SLiMChance. If you are not so interested in the statistical significance and primarily want to use SLiMFinder to return a ranked list of interesting motifs in the data, set sigcut=1.0 and choose the number of motifs to return with topranks=X.

Ambiguity. Peptide data is usually pretty quick to run, and so it is probably worth exploring the full range of ambiguity with combamb=T (combined amino acid and variable-length wildcards). The basic equiv=LIST set for aa degeneracy should be OK for most jobs but you can easily tweak it to add or remove ambiguity combinations as appropriate.

Masking. You will probably want to switch off all masking (masking=F). Low complexity masking might be useful but metmask=F posmask="" should be used as the N-termini are not true protein N-termini.
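Pulling the settings above together, here is a hedged Python sketch that assembles a SLiMFinder command line. The option values are taken from the text, but the script path (slimfinder.py), the seqin=FILE input option, the file names and the topranks value of 50 are assumptions for illustration, so adjust them to your installation and data.

```python
import shlex


def phage_display_command(seqin, aafreq=None, topranks=50):
    """Build an argument list for a phage display style SLiMFinder run.

    The script path and seqin=FILE are assumptions; the remaining options
    are the ones suggested in the post.
    """
    args = [
        "python", "slimfinder.py",
        "seqin=%s" % seqin,
        "efilter=F",                # no evolutionary filtering
        "sigcut=1.0",               # return ranked motifs regardless of significance
        "topranks=%d" % topranks,   # number of motifs to return
        "combamb=T",                # combined aa and wildcard ambiguity
        "masking=F",                # switch off all masking
        "metmask=F",                # peptide N-termini are not true protein N-termini
        "posmask=",                 # empty position mask
    ]
    if aafreq:
        args.append("aafreq=%s" % aafreq)  # background amino acid frequencies
    return args


cmd = phage_display_command("peptides.fas", aafreq="human_proteome.fas")
print(" ".join(shlex.quote(a) for a in cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the arguments in a list makes it easy to loop over different aafreq files and compare the resulting motif rankings.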