Wednesday 28 November 2012

QSLiMFinder 1.4: quicker and more efficient - available on request

The on-going benchmarking of QSLiMFinder has thrown up a couple of discoveries to date. The first is that, reassuringly, it appears to work. (More on this another time.) The second is that it is slow. Or, at least, it was slow.

Thankfully, the cause of its surprisingly slow performance (compared to SLiMFinder) has been tracked down and fixed. At the same time, a (related) potential memory issue with large query sequences has also been sorted out.

The underlying problem is unlikely to have had a large effect on the SLiM prediction itself, although this is currently under investigation. The last release of SLiMSuite was only last week and, as QSLiMFinder is not officially published and released yet, I will not be compiling a new download immediately to take advantage of the improvements. The revised code is available on request if anyone is using QSLiMFinder.

Saturday 24 November 2012

New SLiMSuite, SeqSuite and RJESuite releases are now available

New releases of SLiMSuite, SeqSuite and RJESuite are now available from the Edwards Lab software page.

Please note that the documentation (particularly the manuals) is still lagging a bit behind, so do report anything that does not make sense. The default settings also need to be verified as there is a chance that some of these may have inadvertently changed over the years. (The same core code is now used for the webservers, which often have different defaults.) Checking these, along with updating and checking the servers themselves, is an ongoing priority.

A full list of updated modules is given below. As well as SLiMMaker now handling end of sequence characters, the biggest changes in this release are updates to CompariMotif to (3.7) output unmatched input motifs and (3.8) improve handling of partially overlapping ambiguous positions (e.g. [AGS] and [ST]). The motivation behind both these changes is the ongoing benchmarking (and preparation for publication) of QSLiMFinder and the creation of SLiMBench for benchmarking motif prediction methods. A QSLiMFinder section has been added to the SLiMFinder Manual (section 5.4). SLiMBench is still a work in progress and will be documented in a later release.
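To illustrate what "partially overlapping ambiguous positions" means, here is a minimal Python sketch that classifies a pair of motif positions by comparing their amino acid sets. This is not CompariMotif code: the function names and the four labels are hypothetical, wildcards are not handled, and the IC-based scoring added in Version 3.8 is not reproduced.

```python
def parse_position(position):
    """Turn a motif position such as 'S' or '[AGS]' into a set of residues."""
    return set(position.strip("[]"))


def compare_positions(pos1, pos2):
    """Classify two motif positions (labels are illustrative only)."""
    set1, set2 = parse_position(pos1), parse_position(pos2)
    if set1 == set2:
        return "exact"      # identical residue sets
    if set1 <= set2 or set2 <= set1:
        return "subset"     # one position is a degenerate form of the other
    if set1 & set2:
        return "overlap"    # shared residues, but neither contains the other
    return "mismatch"       # no residues in common


for pair in [("[AGS]", "[ST]"), ("[KR]", "[HK]"), ("[ST]", "S"), ("K", "R")]:
    print(pair, compare_positions(*pair))
```

The "overlap" case is the one that the overlaps=T/F option is described as controlling: [AGS] and [ST] share S, but neither is simply a more degenerate version of the other.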

Updates since last release:

• comparimotif_V3: Updated from Version 3.6.
→ Version 3.7: Added coreIC and output of unmatched motifs.
→ Version 3.8: Added overlaps=T/F : Whether to include overlapping ambiguities (e.g. [KR] vs [HK]) as match [True]
→ Version 3.8: Changed scoring of overlapping ambiguities - uses IC of all possible ambiguities. Added "Ugly" match type.

• slimbench: Created.
→ Version 0.0: Initial Compilation.
→ Version 0.1: Functional version with benchmarking dataset generation.
→ Version 1.0: Consolidation of "working" version with additional basic benchmarking analysis.
→ Version 1.1: Added simulated dataset construction and benchmarking.
→ Version 1.2: Added MinIC filtering to benchmark assessment. Sorted beginning/end of line for reduced ELMs.
→ Version 1.3: Made SimCount a list rather than Integer. Sorted CompariMotif assessment issue.
→ Version 1.4: Added ICCut and SLiMLenCut as lists and output columns.
→ Version 1.5: Added Summary Results output table. Removed PropRes.

• slimmaker: Updated from Version 1.0.
→ Version 1.1: Modified to work with end of line characters.

• slimsearch: Updated from Version 1.5.
→ Version 1.6: Minor tweaks to Log output. Add option for UPC number in occ output.

• rje: Updated from Version 4.1.
→ Version 4.2: Modified INI reading across the board to look in ../settings/ and look for defaults.ini as well as rje.ini.
→ Version 4.2: Enabled handling of -ini FILE in addition to ini=FILE.
→ Version 4.3: Added ilist and nlist types to cmdRead for objects. (Lists of integers and floats). Add ratio() function.

• rje_blast: Updated from Version 1.13.
→ Version 1.14: Added blast.checkProg(qtype,stype) to check whether blastp setting matches sequence formats.

• rje_db: Created.
→ Version 0.0: Initial Compilation.
→ Version 0.1: Added merge tables option.
→ Version 0.2: Miscellaneous updates to various methods.
→ Version 0.3: Minor doc tweaks and added keepFields().

• rje_seq: Updated from Version 3.12.
→ Version 3.13: Updated sequence type checking for use with GABLAM 2.10.

• rje_seqlist: Created.
→ Version 0.0: Initial Compilation. Based on rje_seq 3.10.
→ Version 0.1: Added basic species filtering and sequence output.
→ Version 0.2: Added upper case filtering.
→ Version 0.3: Added accnum filtering and sequence renaming.
→ Version 0.4: Added sequence redundancy filtering.
→ Version 0.5: Added newgene=X for sequence renaming (newgene_spcode__newaccXXX). NewAcc no longer fixed Upper Case.
→ Version 1.0: Upgraded to "ready" Version 1.0. Added concatenate=T and split=X options for sequence concatenation.
→ Version 1.0: Added reading of sequence type from rje_seq.py and mixed=T/F.
→ Version 1.1: Added shortName() and modified SeqDict.

• rje_sequence: Updated from Version 2.0.
→ Version 2.1: Added re_unirefprot = re.compile('^([A-Za-z0-9\-]+)\s+([A-Za-z0-9]+)_([A-Za-z0-9]+)\s+')

• rje_slim: Updated from Version 1.5.
→ Version 1.6: Fixed splitting bug introduced by lower case motifs.

• rje_slimcore: Updated from Version 1.8.
→ Version 1.9: Minor modifications to Log output. Updated motifSeq() function to output unmasked sequences.

• rje_slimlist: Updated from Version 0.6.
→ Version 1.0: Functional module with lower case motif splitting fixed and ? -> .{0,1} replacement.

• rje_zen: Updated from Version 1.0.
→ Version 1.1: Added a few more words here and there.
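
As a quick check of the re_unirefprot pattern added in rje_sequence Version 2.1 (listed above), the following Python snippet applies it to a made-up header line. The header is hypothetical, and reading the three groups as an accession, a gene identifier and a species code is my assumption from the shape of the pattern rather than something taken from the rje_sequence documentation.

```python
import re

# Pattern as given in the rje_sequence 2.1 changelog entry (written as a raw string).
re_unirefprot = re.compile(r'^([A-Za-z0-9\-]+)\s+([A-Za-z0-9]+)_([A-Za-z0-9]+)\s+')

# Hypothetical header line, invented for illustration only.
header = "Q9XYZ1 ABCD_HUMAN Hypothetical example protein"

match = re_unirefprot.match(header)
if match:
    first, second, third = match.groups()
    print(first, second, third)
else:
    print("No match")
```

Note that the pattern requires whitespace after the second field, so a header that ends immediately after the ABCD_HUMAN style identifier would not match.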

Saturday 17 November 2012

Using SLiMFinder to discover "local motifs" in protein sequences

The makers of the highly successful MEME Suite have another tool out:
DLocalMotif: A discriminative approach for discovering local motifs in protein sequences
I've not had a chance to go over it in detail but it looks like it could be pretty useful, especially for subcellular targeting motifs. There is one thing that rankles me slightly, though. They define a "local motif" as
"patterns in DNA or protein sequences that occur in a short sequence interval relative to a sequence anchor or landmark."
They then go on to say:
"We believe that DLocalMotif is the only tool for discovering local motifs in protein sequences."
This is just a quick post to point out that SLiMFinder will happily find "local motifs" in protein sequences using the start and end of the sequence as an anchor or landmark. I think it is more limited than DLocalMotif as it is restricted to SLiMs that are very proximal to the sequence termini but it features the usual SLiMChance probability calculations and corrections for evolutionary relationships. (Even without restricting to searches relative to anchor points, SLiMFinder is very successful at finding the KDEL motif and C-terminal PDZ ligand motifs.) The max distance from the termini can be set by maxwild=X up to a limit of 9aa.

If you want to restrict yourself to just N- or C-terminal motifs, use the musthave=LIST option:
  • musthave="^" for N-terminal motifs.
  • musthave="$" for C-terminal motifs.
  • musthave="^,$" for both.
  • If you want to anchor the motifs internally, this can be done too with a bit of imagination. Just insert a non-standard amino acid character (e.g. Z) at the anchor position, set the expanded alphabet using alphabet=LIST and then force the motif to have the new symbol using musthave=X, e.g.:
    alphabet="A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,Z" musthave=Z
    I must confess that I have never tried this but it should work and I am happy to help iron out any wrinkles.
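
For anyone wanting to try the internal anchor trick, the sequence editing step can be sketched in a few lines of Python. This is an untested-with-SLiMFinder illustration, not part of SLiMSuite: the function names and the example sequence are hypothetical, and only the alphabet=LIST and musthave=X options come from the text above.

```python
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"


def insert_anchor(sequence, position, marker="Z"):
    """Return the sequence with a marker inserted before 0-based index position."""
    if marker in STANDARD_AA:
        raise ValueError("Marker must not be a standard amino acid")
    if marker in sequence:
        raise ValueError("Marker already occurs in the sequence")
    if not 0 <= position <= len(sequence):
        raise ValueError("Anchor position is outside the sequence")
    return sequence[:position] + marker + sequence[position:]


def anchor_options(marker="Z"):
    """Build the alphabet and musthave options for the expanded alphabet."""
    alphabet = ",".join(sorted(STANDARD_AA + marker))
    return 'alphabet="%s" musthave=%s' % (alphabet, marker)


# Made-up sequence with the anchor placed after the fourth residue.
print(insert_anchor("MKTAYIAKQR", 4))
print(anchor_options())
```

The check for an existing marker matters because Z (like B and X) can turn up in real protein sequences as an ambiguity code, in which case a different unused character would be a safer choice.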

    You can also use position-specific or case masking to restrict motif analysis to certain regions of input proteins. This is probably even better than simply constraining the motif location, as it will reduce the sequence search space rather than the motif search space.

    (BTW, SLiMFinder also has an experimental feature for using a negative dataset (negatives=FILE) if anyone wants to try it out.)