Sunday, 5 March 2017

Problem with SLiMFinder bioware webserver

There is currently a problem with the SLiMFinder webserver hosted at UCD: masking is not being performed, regardless of settings. This severely impacts the quality of results. (Disorder, low-complexity and N-terminal methionine masking are generally recommended for SLiMFinder.)

I am in communication with the Shields lab to try to get the issue fixed but, until it has been rectified, the SLiMFinder webserver should not be used.

If you wish to run SLiMFinder online, you can do so via the SLiMFinder REST server (see BioInfoSummer 2016 workshop), which can also be run from within Cytoscape using the SLiMScape App.

Friday, 27 January 2017

SLiMSuite species codes

SLiMSuite species codes are designed to follow the UniprotKB organism (species) identification codes, using them wherever possible. They form part of the standard gene_SPECIES__AccNum naming convention for sequences within SLiMSuite. Species codes should be upper case, and unique for each species.
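As a quick illustration of the convention, a name can be unpicked by splitting on the double underscore and then the last single underscore. This is a minimal sketch only, not SLiMSuite's own parsing code, using a real UniProtKB example (YEAST is the organism code for Saccharomyces cerevisiae):

```python
def parse_slimsuite_name(name):
    """Split a gene_SPECIES__AccNum name into its parts.

    Minimal sketch of the naming convention only, not SLiMSuite's
    actual parser: assumes the species code sits between the last
    single underscore and the double underscore."""
    try:
        prefix, accnum = name.split('__', 1)
        gene, species = prefix.rsplit('_', 1)
    except ValueError:
        return None  # name does not follow the convention
    return {'Gene': gene, 'Species': species, 'AccNum': accnum}

# P00546 is the UniProtKB accession for yeast CDC28
parse_slimsuite_name('CDC28_YEAST__P00546')
```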

Tuesday, 11 October 2016

Odd blog behaviour

For some reason, the Download blog page is not working properly. Until rectified, please check the downloads tag if a link to SLiMSuite downloads breaks.

Update: this should be fixed now!

Monday, 12 September 2016

SLiMSuite Downloads

The current SLiMSuite release is 2016-09-12 and can be downloaded by clicking the button (left).

In addition to the tarball available via the links above, SLiMSuite is now available as a GitHub repository (right).

See also: Installation and Setup.

Previous Releases

SLiMSuite release v1.2.0 (2016-09-12) online

The long-overdue September 2016 release of SLiMSuite 2016-09-12 - v1.2.0 is now on GitHub. Apart from a few bug fixes, the main updates in this release are to the tools for PacBio genomics, notably PAGSAT, SMRTSCAPE and a new SNP Mapping tool, Snapper. These are still in development and need further documentation but are ready for use with a little help. Please get in touch if you are interested. Proper documentation and example use will hopefully follow soon, as the first PacBio yeast paper is written.

GABLAM has had some minor tweaks for improved function with Snapper, PAGSAT and another developmental tool that will be in the next release (REVERT - available via the REST servers). These have been focused on the fragfas=T output of fragmented BLAST hits based on local alignments. This includes addition of a new default to reverse complement reverse hits (fragrevcomp=T) and the separation of parameters for splitting up local hits into multiple fragments (gablamfrag=X) and merging close/overlapping fragments (fragmerge=X).
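The fragmerge=X idea can be pictured as interval merging with a gap tolerance: fragments that overlap, or whose gap is no bigger than fragmerge, collapse into one. This is a hypothetical sketch of the concept only, not GABLAM's actual implementation:

```python
def merge_fragments(frags, fragmerge=0):
    """Merge (start, end) fragments (inclusive coordinates) that overlap
    or lie within fragmerge bp of each other.

    Hypothetical illustration of the fragmerge=X concept, not
    GABLAM's code."""
    merged = []
    for start, end in sorted(frags):
        # gap to previous fragment <= fragmerge -> extend it
        if merged and start <= merged[-1][1] + fragmerge + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

merge_fragments([(1, 100), (150, 300), (320, 400)], fragmerge=25)
```

With fragmerge=25, the 19 bp gap between the second and third fragments is closed, while the first fragment stays separate.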

SLiMSuite updates in this release

Updates in extras/:

• rje_pydocs: Updated from Version 2.16.2.
→ Version 2.16.3: Fixed docstring REST parsing to work with _V* modules.

Updates in libraries/:

• rje: Updated from Version 4.15.1.
→ Version 4.16.0: Added list2dict(inlist,inkeys) and dict2list(indict,inkeys) functions.
→ Version 4.16.1: Improved handling of integer parameters when given bad commands.
→ Version 4.17.0: Added extra functions to randomList().

• rje_blast_V2: Updated from Version 2.9.1.
→ Version 2.10.0: Added nocoverage calculation based on local alignment table.
→ Version 2.11.0: Added localFragFas output method.
→ Version 2.11.1: Fixed snp local table revcomp bug. [Check this!]
→ Version 2.11.2: Fixed GABLAM calculation bug when '*' in protein sequences.

• rje_db: Updated from Version 1.8.0.
→ Version 1.8.1: Added sfdict to saveTable output.

• rje_genbank: Updated from Version 1.3.2.
→ Version 1.4.0: Added addtags=T/F : Add locus_tag identifiers if missing - needed for gene/cds/prot fasta output [False]
→ Version 1.4.1: Fixed genetic code warning.
→ Version 1.5.0: Added setupRefGenome() method based on PAGSAT code.
→ Version 1.5.1: Fixed logskip append locus sequence file bug.
→ Version 1.5.2: Fixed addtag(s) bug.

• rje_hprd: Updated from Version 1.2.
→ Version 1.2.1: Fixed "PROTEIN_ARCHITECTURE" bug.

• rje_menu: Updated from Version 0.3.
→ Version 0.4.0: Changed handling of default for exiting menu loop. May affect behaviour of some existing menus.

• rje_mitab: Updated from Version 0.2.0.
→ Version 0.2.1: Fixed redundant evidence/itype bug (primarily dip).

• rje_obj: Updated from Version 2.1.3.
→ Version 2.2.0: Added screenwrap=X.
→ Version 2.2.1: Improved handling of integer parameters when given bad commands.

• rje_samtools: Updated from Version 0.1.0.
→ Version 0.2.0: Added majmut=T/F : Whether to restrict output and stats to positions with non-reference Major Allele [False]
→ Version 1.0.0: Major reworking. Old version frozen as rje_samtools_V0.
→ Version 1.1.0: Added snptabmap=X,Y alternative SNPTable mapping and read_depth statistics []. Added majref=T/F.
→ Version 1.2.0: Added developmental combining of read mapping onto two different genomes.
→ Version 1.3.0: Major debugging and code clean up.
→ Version 1.4.0: Added parsing of read number (to link SNPs) and fixed deletion error at same time. Added rid=T/F and snponly=T/F.
→ Version 1.5.0: Added biallelic=T/F : Whether to restrict SNPs to pure biallelic SNPs (two alleles meeting mincut) [False]
→ Version 1.5.1: Fixed REF/Ref ALT/Alt bug.
→ Version 1.6.0: Added majfocus=T/F : Whether the focus is on Major Alleles (True) or Mutant/Reference Alleles (False) [True]
→ Version 1.7.0: Added parsing of *.sam files for generating RID table.
→ Version 1.8.0: Added read coverage summary/checks.
→ Version 1.8.1: Fixed issue when RID file not generated by pileup parsing. Set RID=True by default to avoid issues.

• rje_samtools_V0: Created/Renamed/moved.
→ Version 0.0: Initial Compilation.
→ Version 0.1.0: Modified version to handle multiple loci per file. (Original was for single bacterial chromosomes.)
→ Version 0.2.0: Added majmut=T/F : Whether to restrict output and stats to positions with non-reference Major Allele [False]

• rje_seq: Updated from Version 3.23.0.
→ Version 3.24.0: Added REST seqout output.

• rje_seqlist: Updated from Version 1.15.3.
→ Version 1.15.4: Fixed REST server output bug.
→ Version 1.15.5: Fixed reformat=fasta default issue introduced from fixing REST output bug.
→ Version 1.16.0: Added edit=T sequence edit mode upon loading (will switch seqmode=list).
→ Version 1.17.0: Added additional summarise=T output for seqmode=db.
→ Version 1.18.0: Added revcomp to reformat options.
→ Version 1.19.0: Added option log description for deleting sequence during edit.
→ Version 1.20.0: Added option to give a file of changes for edit mode.
→ Version 1.20.1: Fixed edit=FILE deletion bug.

• rje_sequence: Updated from Version 2.5.2.
→ Version 2.5.3: Fixed genetic code warning error.

• rje_slimcore: Updated from Version 2.7.5.
→ Version 2.7.6: Added feature masking log info or warning.
→ Version 2.7.7: Switched feature masking OFF by default to give consistent Uniprot versus FASTA behaviour.

• rje_synteny: Created/Renamed/moved.
→ Version 0.0.0: Initial Compilation.

• rje_taxonomy: Updated from Version 1.1.0.
→ Version 1.2.0: Added storage of Parents.

• rje_tree: Updated from Version 2.13.0.
→ Version 2.14.0: Added cladeSpec().

• rje_uniprot: Updated from Version 3.21.4.
→ Version 3.22.0: Tweaked REST table output.

• rje_xref: Updated from Version 1.8.0.
→ Version 1.8.1: Added rest run mode to avoid XRef table output if no gene ID list is given. Added `genes` and `genelist` as `idlist=LIST` synonym.
→ Version 1.8.2: Catching self.dict['Mapping'] error for REST server.

• snp_mapper: Updated from Version 0.4.0.
→ Version 0.5.0: Added CDS rating.
→ Version 0.6.0: Added AltFT mapping mode (map features to AltLocus and AltPos).
→ Version 0.7.0: Added additional fields for processing Snapper output. (Hopefully will still work for SAMTools etc.)
→ Version 0.8.0: Added parsing of GFF file from Prokka.
→ Version 0.8.1: Corrected "intron" classification for first position of features. Updated FTBest defaults.
→ Version 1.0.0: Version that works with Snapper V1.0.0. Not really designed for standalone running any more.

Updates in tools/:

• comparimotif_V3: Updated from Version 3.12.
→ Version 3.13.0: Added REST server function.

• gablam: Updated from Version 2.20.0.
→ Version 2.21.0: Added nocoverage Table output of regions missing from pairwise SNP Table.
→ Version 2.21.1: Added fragrevcomp=T/F : Whether to reverse-complement DNA fragments that are on reverse strand to query [True]
→ Version 2.22.0: Added description to HitSum table.
→ Version 2.22.1: Added localaln=T/F to keep local alignment sequences in the BLAST local Table.
→ Version 2.22.2: Fixed local output error. (Query/Qry issue - need to fix this and make consistent!)
→ Version 2.22.3: Fixed blastv and blastb error: limit also applies to individual pairwise hits!
→ Version 2.23.0: Divided GablamFrag and FragMerge.

• pagsat: Updated from Version 1.6.1.
→ Version 1.7.0: Added tidy=T/F option. (Development)
→ Version 1.7.1: Updated tidy=T/F to include initial assembly.
→ Version 1.7.2: Fixed some bugs introduced by changing gablam fragment output.
→ Version 1.7.3: Added circularise sequence generation.
→ Version 1.8.0: Added orphan processing and non-chr naming of Reference.
→ Version 1.9.0: Modified the join sorting and merging. Added better tracking of positions when trimming.
→ Version 1.9.1: Added joinmargin=X : Number of extra bases allowed to still be considered an end local BLAST hit [10]
→ Version 1.10.0: Added weighted tree output and removed report warning.
→ Version 1.10.1: Fixed issue related to having Description in GABLAM HitSum tables.
→ Version 1.10.2: Tweaked haploid core output.
→ Version 1.10.3: Fixed tidy bug for RevComp contigs and switched joinsort default to Identity. (Needs testing.)
→ Version 1.10.4: Added genetar option to tidy out genesummary and protsummary output. Incorporated rje_synteny.
→ Version 1.10.5: Set gablamfrag=1 for gene/protein hits.
→ Version 1.11.0: Consolidated automated tidy mode and cleaned up some excess code.
→ Version 1.11.1: Added option for running self-PAGSAT of ctidX contigs versus haploid set. Replaced ctid "X" with "N".
→ Version 1.11.2: Fixed Snapper run choice bug.

• pingu_V4: Updated from Version 4.5.3.
→ Version 4.6.0: Added hubonly=T/F : Whether to restrict pairwise PPI to those with both hub and spoke in hublist [False]
→ Version 4.6.1: Fixed some ppifas=T/F bugs and added combineppi=T/F : Whether to combine all spokes into a single fasta file [False]
→ Version 4.6.2: Added check/filter for multiple SpokeUni pointing to same sequence. (Compilation redundancy mapping failure!)
→ Version 4.6.3: Fixed issue with 1:many SpokeUni:Spoke mappings messing up XHub.
→ Version 4.7.0: Added ppidbreport=T/F : Summary output for PPI compilation of evidence/PPIType/DB overlaps [True]
→ Version 4.8.0: Fixed report duplication issue and added additional summary output.

• qslimfinder: Updated from Version 2.1.0.
→ Version 2.1.1: Switched feature masking OFF by default to give consistent Uniprot versus FASTA behaviour.

• seqsuite: Updated from Version 1.11.0.
→ Version 1.11.1: Redirected PacBio to call SMRTSCAPE.
→ Version 1.11.2: Fixed batchrun batchlog=False log error.
→ Version 1.12.0: Added Snapper.

• slimfarmer: Updated from Version 1.4.3.
→ Version 1.4.4: Modified default vmem request to 126GB from 127GB.
→ Version 1.4.5: Updated BLAST loading default to 2.2.31.

• slimfinder: Updated from Version 5.2.1.
→ Version 5.2.2: Added warnings for ambocc and minocc that exceed the absolute minima. Updated docstring.
→ Version 5.2.3: Switched feature masking OFF by default to give consistent Uniprot versus FASTA behaviour. Fixed FTMask=T/F bug.

• slimparser: Updated from Version 0.3.3.
→ Version 0.3.4: Tweaked error messages.
→ Version 0.4.0: Added simple json format output.

• slimprob: Updated from Version 2.2.4.
→ Version 2.2.5: Fixed FTMask=T/F bug.

• slimsearch: Updated from Version 1.7.
→ Version 1.7.1: Minor modification to docstring. Preparation for update to SLiMSearch 2.0 optimised for proteome searches.

• slimsuite: Updated from Version 1.5.1.
→ Version 1.5.2: Updated XRef REST call.
→ Version 1.6.0: Removed SLiMCore as default. Default will now show help.

• smrtscape: Updated from Version 1.8.0.
→ Version 1.9.0: Updated empirical preassembly mapefficiency calculation.
→ Version 1.10.0: Added batch processing of subread files.
→ Version 1.10.1: Fixed bug in batch processing.

• snapper: Created/Renamed/moved.
→ Version 0.0.0: Initial Compilation.
→ Version 0.1.0: Tidied up with improved run pickup.
→ Version 0.2.0: Added FASTQ and improved CNV output along with all features.
→ Version 0.2.1: Fixed local output error. (Query/Qry issue - need to fix this and make consistent!) Fixed snp local table revcomp bug.
→ Version 0.2.2: Corrected excess CNV table output (accnum AND shortname).
→ Version 0.2.3: Corrected "intron" classification for first position of features. Updated FTBest defaults.
→ Version 1.0.0: Working version with completed draft manual. Added to SeqSuite.
→ Version 1.0.1: Fixed issues when features missing.

Tuesday, 23 August 2016

SMRTSCAPE: SMRT Subread Coverage & Assembly Parameter Estimator

SMRTSCAPE (SMRT Subread Coverage & Assembly Parameter Estimator) is a tool in development as part of our PacBio sequencing projects for predicting and/or assessing the quantity and quality of useable data required/produced for HGAP3 de novo whole genome assembly. The current documentation is below. Some tutorials will be developed in the future - in the meantime, please get in touch if you want to use it and anything isn’t clear.

The main functions of SMRTSCAPE are:

  1. Estimate Genome Coverage and required numbers of SMRT cells given predicted read outputs. NOTE: Default settings for SMRT cell output are not reliable and you should speak to your sequencing provider for up-to-date figures in their hands.

  2. Summarise the amount of sequence data obtained from one or more SMRT cells, including unique coverage (one read per ZMW).

  3. Calculate predicted coverage from subread data for different length and quality cutoffs.

  4. Predict HGAP3 length and quality settings to achieve a given coverage and accuracy.

SMRTSCAPE will be available in the next SLiMSuite download. The coverage=T mode can be run from the EdwardsLab server. (This is currently running a slightly old implementation but should be updated shortly.)

SMRTSCAPE Documentation

Version:      1.10.1
Last Edit:    26/05/16


### ~ General Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
genomesize=X    : Genome size (bp) [0]
### ~ Genome Coverage Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
coverage=T/F    : Whether to generate coverage report [False]
avread=X        : Average read length (bp) [20000]
smrtreads=X     : Average assembled output of a SMRT cell [50000]
smrtunits=X     : Units for smrtreads=X (reads/Gb/Mb) [reads]
errperbase=X    : Error-rate per base [0.14]
maxcov=X        : Maximum X coverage to calculate [100]
bysmrt=T/F      : Whether to output estimated coverage by SMRT cell rather than X coverage [False]
xnlist=LIST     : Additional columns giving % sites with coverage >= Xn [1+`minanchorx`->`targetxcov`+`minanchorx`]
### ~ SubRead Summary Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
summarise=T/F   : Generate subread summary statistics including ZMW summary data [False]
seqin=FILE      : Subread sequence file for analysis [None]
batch=FILELIST  : Batch input of multiple subread fasta files (wildcards allowed) if seqin=None []
targetcov=X     : Target percentage coverage for final genome [99.999]
targeterr=X     : Target errors per base for preassembly [1/genome size]
calculate=T/F   : Calculate X coverage and target X coverage for given seed, anchor + RQ combinations [False]
minanchorx=X    : Minimum X coverage for anchor subreads [6]
minreadlen=X    : Absolute minimum read length for calculations (use minlen=X to affect summary also) [500]
rq=X,Y          : Minimum (X) and maximum (Y) values for read quality cutoffs [0.8,0.9]
rqstep=X        : Size of RQ jumps for calculation (min 0.001) [0.01]
### ~ Preassembly Fragmentation analysis Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
preassembly=FILE: Preassembly fasta file to assess/correct over-fragmentation (use seqin=FILE for subreads) [None]
### ~ Assembly Parameter Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
parameters=T/F  : Whether to output predicted "best" set of parameters [False]
targetxcov=X    : Target 100% X Coverage for pre-assembly [3]
xmargin=X       : "Safety margin" inflation of X coverage [1]
mapefficiency=X : [Adv.] Efficiency of mapping anchor subreads onto seed reads for correction [1.0]
xsteplen=X      : [Adv.] Size (bp) of increasing coverage steps for calculating required depths of coverage [1e6]
parseparam=FILES: Parse parameter settings from 1+ assembly runs []
paramlist=LIST  : List of parameters to retain for parseparam output (file or comma separated, blank=all) []
predict=T/F     : Whether to add XCoverage prediction and efficiency estimation from parameters and subreads [False]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###


SMRTSCAPE has a number of functions for PacBio sequencing projects, concerned with predicting and/or assessing the quantity and quality of useable data produced:

  • Summarise subreads (summarise=T). This function summarises subread data from a given seqin=FILE fasta file, or a set of subread fasta files given with batch=FILELIST.
  • Calculate length cutoffs (calculate=T). Calculates length cutoffs for different XCoverage combinations from subread summary data.
  • Parameters (parameters=T). This function attempts to generate predicted optimum assembly settings from the summarise=T and calculate=T table data.
  • Genome Coverage (coverage=T). This method tries to predict genome coverage and accuracy for different depths of PacBio sequencing.
  • Preassembly fragmentation analysis (preassembly=FILE).
  • Parse Parameters (parseparam=FILES). This method parses assembly parameters from the SmrtPipeSettings Summary file.
  • Prediction (predict=T). This method compares predicted coverage from seed reads with estimated coverage from preassembly data.

These are explored in more detail below.


The setup phase primarily sorts out the SMRTSCAPE objects. With the exception of pure ParseParam runs, it will also set up:

  1. Genome Size. Unless given with genomesize=X, this will be asked for, as it is required for X coverage/depth and accuracy calculations.
  2. Target Error Rate. Unless given with targeterr=X, a target error rate of 1/GenomeSize will be set.
  3. Basefile. Finally, if no basefile=X setting is given, the basename for output files will be set to the basename of seqin=FILE if given (stripping .subreads if present), else smrtscape.*.

Summarise subreads (summarise=T)

This function summarises subread data from a given seqin=FILE fasta file, or a set of subread fasta files given with batch=FILELIST. This uses the summarise=T function of rje_seqlist to first give text output to the log file of some summary statistics:

  • Total number of sequences.
  • Total length of sequences.
  • Min. length of sequences.
  • Max. length of sequences.
  • Mean length of sequences.
  • Median length of sequences.
  • N50 length of sequences.

Next, PacBio-specific subread header information will be parsed to generate three summary tables (described in more detail below):

  1. *.zmw.tdt summary of all subreads.
  2. *.unique.tdt summary of the longest subread per ZMW.
  3. *.rq.tdt summary of subread data for different Read Quality values.

These are generated by parsing both the sequence data and the sequence name data:

>m150625_001530_42272_c100792502550000001823157609091582_s1_p0/9/0_3967 RQ=0.784
>m150625_001530_42272_c100792502550000001823157609091582_s1_p0/11/0_20195 RQ=0.868

These are split into component parts:

  • m150625_001530_42272_c100792502550000001823157609091582_s1_p0 = SMRT cell (SMRT).
  • /9 = ZMW (ZMW).
  • /0_3967 = raw read positions used for subread (5’ adaptor removed) (Pos).
  • RQ=0.784 = read quality (mean per base accuracy) (RQ).

The numbers and “best” subreads for each ZMW are summarised in #RN log file entries.
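The header fields described above (SMRT, ZMW, Pos, RQ) could be pulled out with a simple regular expression. This is a hedged sketch of the parsing idea, not SMRTSCAPE's own code; field names follow the text:

```python
import re

# SMRTcell/ZMW/start_end RQ=X, as in the example headers above
HEADER = re.compile(r'^>?(\S+)/(\d+)/(\d+)_(\d+)\s+RQ=([\d.]+)$')

def parse_subread(header):
    """Parse one PacBio subread header into SMRT/ZMW/Pos/RQ fields.
    Illustrative sketch only, not SMRTSCAPE's parser."""
    m = HEADER.match(header)
    if not m:
        return None
    smrt, zmw, start, end, rq = m.groups()
    return {'SMRT': smrt, 'ZMW': int(zmw),
            'Pos': (int(start), int(end)),
            'Len': int(end) - int(start), 'RQ': float(rq)}

parse_subread('>m150625_001530_42272_c100792502550000001823157609091582'
              '_s1_p0/9/0_3967 RQ=0.784')
```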

Subread output (*.zmw.tdt)

The first output file is the *.zmw.tdt, which stores the subread information:

  • SMRT, ZMW, Pos, RQ = as above.
  • RN = subread number within ZMW (1 to N).
  • Len = length of subread.
  • Seq = sequence file position.

The unique key for this file is: SMRT,ZMW,RN.

Unique subread output (*.unique.tdt)

This table is made from *.zmw.tdt by reducing each ZMW’s output to a single read. Reads are kept in preference of (a) longest, (b) read quality (if tied), and finally (c) earliest read (if tied for both). It has the same fields as *.zmw.tdt, with keys determined by: SMRT,ZMW.
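That preference order can be expressed as a single sort key. A minimal sketch, assuming the *.zmw.tdt field names above (not SLiMSuite's own code):

```python
def best_subread(zmw_reads):
    """Pick one read per ZMW: longest, then highest RQ, then lowest RN.
    Sketch of the preference order described above; dict keys assumed
    to match the *.zmw.tdt columns."""
    return min(zmw_reads, key=lambda r: (-r['Len'], -r['RQ'], r['RN']))

reads = [{'RN': 1, 'Len': 3967, 'RQ': 0.784},
         {'RN': 2, 'Len': 3967, 'RQ': 0.868},
         {'RN': 3, 'Len': 2000, 'RQ': 0.900}]
best_subread(reads)
```

Here the first two reads tie on length, so the higher RQ read (RN=2) is kept.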

Read Quality output (*.rq.tdt)

This table outputs the number and percentage of subreads and longest reads at each read quality (RQ). Fields:

  • RQ = read quality (per base accuracy)
  • xerr = the Xdepth required @ that RQ to meet the targeterr=X error per base accuracy.
  • subread = number of subreads with that RQ.
  • unique = number of “best” unique subread per ZMW with that RQ.
  • f.subreads = proportion of subreads with that RQ.
  • f.unique = proportion of unique subreads with that RQ.
  • cum.subreads = proportion of subreads with quality >=RQ.
  • cum.unique = proportion of unique subreads with quality >=RQ.
  • x.subreads = XCoverage of subreads with quality >=RQ.
  • x.unique = XCoverage of unique subreads with quality >=RQ.
  • MeanRQ = mean read quality of bases in subreads with quality >=RQ.
  • Mean.XErr = the Xdepth required at that MeanRQ to meet the targeterr=X error per base accuracy.

Calculate length cutoffs (calculate=T)

This function calculates length cutoffs for different XCoverage combinations from subread summary data. It uses the *.zmw.tdt, *.unique.tdt and *.rq.tdt files from above, generating them if required (and regenerating if force=T).

First, XCovLimit data are calculated. These are the summed read lengths required to generate the desired genome coverage (targetcov=X) at different depths of X coverage. Note that the square root of the targetcov=X value is used, as the HGAP assembly process involves two layers of genome coverage: (1) coverage of seed reads by anchor reads to generate the pre-assembly; (2) coverage of the pre-assembly.

XCovLimit data are calculated by incrementing total summed read lengths in 1 Mb increments (adjusted with xsteplen=X). At each increment, genomesize=X is used to calculate the total X coverage. The probability of the target X coverage (starting at 1X) given the total X coverage is then calculated using a Poisson distribution. If this probability exceeds the target genome coverage, the current summed length is set as the XCovLimit for X and the target X increased by 1. The total summed read length is then incremented by xsteplen and the process repeated until the summed length reaches the total length of all subreads of at least the size set by minreadlen=X (default 500 bp).
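The incremental Poisson check can be sketched as follows. This is a simplified illustration of the process described above, not SMRTSCAPE's code: the square-root adjustment of targetcov and the minreadlen-based stopping total are omitted, and a simple maxx cap is used instead.

```python
from math import exp

def prob_depth_at_least(k, mean_x):
    """P(site depth >= k) under Poisson(mean_x), i.e. 1 - CDF(k-1)."""
    p, term = 0.0, exp(-mean_x)
    for i in range(k):              # sum P(depth = 0 .. k-1)
        p += term
        term *= mean_x / (i + 1)
    return 1.0 - p

def xcov_limits(genomesize, targetcov=99.999, maxx=5, xsteplen=1_000_000):
    """Summed read length needed for targetcov % of the genome to reach
    >= X depth, for X = 1..maxx. Simplified sketch of the incremental
    XCovLimit calculation described above."""
    limits, total, targetx = {}, 0, 1
    while targetx <= maxx:
        total += xsteplen       # grow summed read length by one step
        if prob_depth_at_least(targetx, total / genomesize) >= targetcov / 100.0:
            limits[targetx] = total
            targetx += 1        # move on to the next target X depth
    return limits
```

For a 1 Mb genome at targetcov=99.999, 1X coverage of 99.999% of sites needs a mean depth of about 11.5X, i.e. a 12 Mb summed read length with 1 Mb steps.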

Next, the read quality steps to calculate are established using rq=X,Y (minimum (X) and maximum (Y) read quality cutoffs, defaults=0.8-0.9) and rqstep=X (the size of RQ steps for calculations, default=0.01, min 0.001).

Finally, the seed and anchor lengths required to achieve certain minimum Xdepths of target genome coverage are calculated and output to *.cutoffs.tdt.

For each RQ cutoff, TargetErr is used to establish the min. required anchor read Xdepth (AnchorMinX). If this exceeds the specified minanchorx=X value, this will be used as the minimum target instead. Next, the seed read length (SeedLen) required to get a combined unique seed read length to give a minimum Xdepth (SeedMinX) of targetxcov=X is calculated. Where xmargin > 0, additional seed lengths will also be calculated to give deeper minimum seed Xdepths. e.g. targetxcov=3 xmargin=2 will calculate SeedLen for 3X, 4X and 5X. For each SeedLen, anchor read length cutoffs are also calculated such that the summed length of unique reads where AnchorLen <= length < SeedLen is sufficient to give AnchorMinX values from the minimum established above until XMargin is reached.

If there is insufficient data to meet the minimum seed and anchor read depths, the “optimal” seed XCoverage will be calculated from the unique seed reads, as described for coverage=T. In essence, SeedMinX will be reduced to meet AnchorMinX until it falls below zero and then the two values will be optimised to try and maximise genome coverage at the required target error. Users may prefer instead to relax the targetxcov=X and/or minanchorx=X values. This optimisation is based on unique Xcoverage and the precise values subsequently output in *.cutoffs.tdt may differ. (Relaxed variations could be run with higher xmargin=X values to output a range from which the chosen values can be selected for predict=T.)

NOTE: mapefficiency=X is used during this process to reduce the effective seed coverage at a given summed length and thus inflate the required SeedX needed to achieve a given SeedMinX.


This table contains the main output from the calculate=T function. Unique entries are determined by combinations of RQ, SeedMinX and AnchorMinX.

  • RQ = Read Quality Cutoff.
  • SeedMinX = Min seed XCoverage for TargetCov % of genome (unique reads). If this is <1, it indicates the proportion of the genome covered at 1X.
  • SeedLen = Seed length Cutoff.
  • SeedX = Total Seed read XCoverage (all reads).
  • AnchorMinX = Min anchor XCoverage for TargetCov % of genome (unique reads). If this is <minanchorx=X, it indicates the proportion of the genome covered at minanchorx X.
  • AnchorLen = Anchor read (i.e. overall subread) length cutoff.
  • AnchorX = Total Anchor read XCoverage (all reads).

Preassembly fragmentation analysis (preassembly=FILE)

[ ] : Add preassembly details here.

Optimal Assembly Parameters (parameters=T)

The parameters=T function attempts to generate predicted optimum assembly settings from the summarise=T and calculate=T table data.

  1. The *.cutoffs.tdt table is read in (or generated if required).
  2. The targetxcov=X and xmargin=X settings are used to determine the desired SeedMinX value, i.e. the minimum depth of coverage of the chosen seed reads. (NB. The mapefficiency=X setting is used for inflating the required XCoverage needed to meet this Min XDepth during the calculate process.)
  3. Each potential RQ cutoff is now assessed and the targeterr=X setting used to determine the required Xdepth per base for error correction. Note: This uses the minimum RQ value and should thus be conservative. Using the mean RQ was considered but it was deemed that conservative is best in this scenario.
  4. The desired AnchorMinX is calculated from the target Xdepth and xmargin=X. The final value is reduced by one to allow for the fact that the seed read counts as X=1 for error estimation.
  5. Parameter combinations have now been reduced to those with the desired Seed and Anchor minimum Xdepth. Two parameter combinations are then output:
    • MaxLen parameters using the mininum RQ value and thus the longest seed and anchor read length cutoffs.
    • MaxRQ parameters using the maximum RQ value.

Note that the predicted parameter settings are only output as #PARAM log entries: the full set (including these “optimal” ones) are part of the *.cutoffs.tdt file.

Genome Coverage (coverage=T)

This method tries to predict genome coverage and accuracy for different depths of PacBio sequencing based on predicted usable output statistics and the genome size. Statistics for existing runs can be generated using the summarise=T option and used to inform calculate=T if run together.

Setup. If the summarise=T option is used and/or there is an existing BASEFILE.unique.tdt file (and force=F) then the *.unique.tdt table will be used to generate SMRT cell statistics: mean read count (used to populate smrtreads=X); mean read length (used to populate avread=X); mean RQ (used to populate errperbase=X). Otherwise, the corresponding commandline options will be used. If smrtunits is Gb or Mb then smrtreads=X will be recalculated as smrtreads/avread.

XnList. The second stage of setup establishes the X depths at which %coverage will be calculated, along with overall depth of coverage etc. These numbers are based on the subread summarise and calculate settings: targetxcov and targetxcov+minanchorx. Any levels explicitly chosen by xnlist=LIST will also be calculated.

TargetXDepth. Next, target Xdepth values are calculated in the same fashion as XCovLimit data (above), except that these are now converted to Xcoverage (based on GenomeSize) rather than total subread lengths.

Accuracy. The % genome coverage and accuracy for different X coverage of a genome are then calculated assuming a non-biased error distribution. Calculations use binomial/Poisson distributions, assuming independence of sites. Accuracy is based on a majority of reads covering a particular base with the correct call, assuming random calls at the other positions (i.e. the correct bases have to exceed 33% of the incorrect positions). The errperbase=X parameter is used for this calculation.
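The per-base accuracy idea can be sketched with a binomial calculation. Note the simplifying assumption: this version requires the correct call in more than half the reads, which is stricter than the "exceed 33% of the incorrect calls" rule described above, so it is an illustration of the approach rather than SMRTSCAPE's exact formula.

```python
from math import comb

def base_accuracy(depth, errperbase=0.14):
    """P(correct call at a site covered by `depth` reads), under a
    simplified strict-majority model: the call is correct when more
    than half the reads carry the right base. Sketch only; SMRTSCAPE's
    own rule (correct > 33% of incorrect calls) differs in detail."""
    p = 1.0 - errperbase
    # P(correct reads > depth/2) under Binomial(depth, p)
    return sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
               for k in range(depth // 2 + 1, depth + 1))

base_accuracy(5)   # per-base accuracy at 5X with 14% error per base
```

At 1X this reduces to the single-read accuracy (0.86 by default) and climbs towards 1.0 as depth increases.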

Coverage. Coverage statistics are then calculated for each Xdepth of sequencing (or SMRT cell if bysmrt=T). First, the optimal seed read Xcoverage is calculated. The target seed Xdepth (targetxcov=X) and anchor depth (minanchorx=X) are used to identify the total target Xcoverage. If this is met (inflating required seed coverage to account for mapefficiency=X), the largest seedX value that meets the minanchorx=X anchor depth is selected. If this cannot be achieved with seedX >= 1, the optimal balance between seed length and anchor length is achieved by maximising the probability of 1X seed coverage and MinAnchorX+ anchor coverage. This seed read length is then used to generate the predicted coverage output.


Main output is the *.coverage.tdt file. All calculations are based on subreads, and therefore using the “raw” polymerase read data for the smrtreads=X value for SMRT cells will overestimate coverage. Note that smrtreads=X can be used to input sequence capacity in Gb (or Mb) rather than read counts by changing smrtunits=X.

  • XCoverage = Total mean depth of coverage.
  • SMRT = Total number of SMRT cells.
  • %Coverage = Estimated %coverage of assembly.
  • %Accuracy = Estimated %accuracy of assembled genome. This is established by working out the predicted proportion of the genome at each Xcoverage (given the total XCoverage) and the accuracy at that depth (as described above).
  • %Complete = Product of %coverage and %accuracy.
  • SeedX = Estimated optimal depth of coverage for seed reads.

Parsing assembly parameters (ParseParam=FILES)

This method will take a bunch of *.settings text files (wildcards allowed) and parse out the assembly parameter settings into a delimited text file. The contents of these files should be consistent with the Assembly_Metrics_*.xlsx file produced by the SMRT Portal.

Output for this method is a *.settings.tdt file, which has the following field headers:

  • Setting. The full setting name, e.g. p_preassemblerdagcon.minCorCov.
  • Prefix. The setting prefix, e.g. p_preassemblerdagcon.
  • Suffix. The setting suffix, e.g. minCorCov. (This is used for some of the *.predict.tdt fields.)
  • Variable. Whether the parameter is variable (“TRUE”) or fixed (“FALSE”) in the set of *.settings files being parsed.
  • Assemblies. Each assembly (the * in *.settings) will get its own field containing the actual value used for that parameter in that assembly.

NOTE: The files coming off SMRT Portal have some undesirable non-unicode characters in them. These are hopefully stripped by SMRTSCAPE but it is possible that some parameters may not be correctly parsed.


It is possible to parse a selected subset of parameters using paramlist=LIST. (This is easiest where LIST is a text file with one parameter per line.) This should be a list of the full parameter name, i.e. the content of the Setting field.
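As a rough illustration of the ParseParam idea, the sketch below tabulates parameters from a set of *.settings files and splits each full name into its Prefix and Suffix. The simple "name = value" line format assumed here is a guess — the real files come from SMRT Portal and SMRTSCAPE handles their quirks (including stripping stray characters):

```python
# Toy ParseParam sketch: collect parameter values across *.settings files
# and flag each parameter as variable or fixed across the assemblies.
import glob

def parse_settings(pattern):
    table = {}  # full setting name -> {assembly: value}
    for path in glob.glob(pattern):
        assembly = path[:-len(".settings")]
        with open(path) as fh:
            for line in fh:
                if "=" not in line:
                    continue  # skip anything that is not a name = value line
                name, value = (part.strip() for part in line.split("=", 1))
                table.setdefault(name, {})[assembly] = value

    rows = []
    for name, values in sorted(table.items()):
        prefix, _, suffix = name.rpartition(".")  # e.g. p_filter / readScore
        variable = len(set(values.values())) > 1
        rows.append({"Setting": name, "Prefix": prefix, "Suffix": suffix,
                     "Variable": str(variable).upper(), **values})
    return rows
```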

[ ] : Add the recommended list of parameters here.

Predicting assembly coverage (Predict=T)

This function predicts coverage from parsed assembly parameters and compares to pre-assembly subreads if possible. Its primary function is to check that the parameter settings from calculate=T are working as expected (at least in terms of preassembly generation) and to tweak the mapefficiency=X option if required. Where a reference is available, it can also be used to test the SMRTSCAPE calculations in terms of coverage etc.

Predict uses data from the summarise=T and parseparam=FILES functions. (These will be run if required.) As such, it requires the original subread data (seqin=FILE) and the list of *.settings files that identifies the assemblies. (See ParseParam=FILES.) Pre-assembly *.preassembly.fasta files should match the *.settings files. (The file looked for will be identified as a #PREX log entry.)

If it already exists, the *.predict.tdt will be loaded and updated. Otherwise a new *.predict.tdt file will be created. (See below.) Predict first loads in the relevant data and assembly parameters (see output) before calculating expected coverage from subread data and observed coverage from preassembly data.

Predict output

The output of Predict mode is a *.predict.tdt output file with the following fields:

  • Assembly = The assembly base filename for a given file in parseparam=FILES.
  • minSubReadLength = parsed p_filter.minSubReadLength setting.
  • readScore = parsed p_filter.readScore setting.
  • minLongReadLength = parsed p_preassemblerdagcon.minLongReadLength setting.
  • minCorCov = parsed p_preassemblerdagcon.minCorCov setting.
  • ovlErrorRate = parsed p_assembleunitig.ovlErrorRate setting.
  • SeedX = mean depth of coverage for seed reads given seed length and min RQ score.
  • AnchorX = mean depth of coverage for anchor reads given seed length and min RQ score.
  • SeedMinX = minimum depth of coverage of unique seed reads to achieve XCovLimit (see calculate=T).
  • AnchorMinX = minimum depth of coverage of unique anchor reads to achieve XCovLimit (see calculate=T).
  • PreCov = predicted base coverage of pre-assembly.
  • CorPreCov = corrected predicted base coverage of pre-assembly given mapefficiency=X.
  • PreX = average depth of coverage of *.preassembly.fasta sequences, given genomesize.
  • PreMinX = minimum depth of coverage of *.preassembly.fasta sequences to achieve XCovLimit (see calculate=T).
  • PreMapEfficiency = PreX / SeedX as an estimate of the loss of seed sequence during the preassembly mapping phase. Ideally, this should be close to the mapefficiency=X setting. (NOTE: SMRTSCAPE has not undergone extensive testing of this assumption.)
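The PreMapEfficiency calculation above is simply the ratio of observed to expected seed coverage; a minimal sketch with made-up numbers:

```python
def pre_map_efficiency(pre_x, seed_x):
    """PreMapEfficiency = PreX / SeedX: the fraction of the expected seed
    coverage actually observed in the pre-assembly output."""
    return pre_x / seed_x

# e.g. 4.2X observed pre-assembly coverage from 5.0X of seed reads:
print(round(pre_map_efficiency(4.2, 5.0), 2))  # 0.84
```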

Thursday, 10 December 2015

BioInfoSummer2015 SLiMSuite Workshop

Dr Richard Edwards, University of New South Wales
Thursday 10th December 2015

Session outline

Click for slides.

Part I: Theory

  • Introduction to workshop
  • What are SLiMs?
  • What is SLiMSuite?

Part II: Practice

  • Installing/running SLiMSuite
  • Data types and main input formats
  • Motif discovery using the SLiMSuite REST Servers
  • Motif discovery using the SLiMScape app for Cytoscape

Additional help and documentation

General information about SLiMs and motif discovery can be found in the literature. Some good places to start are the recent ELM 2016 paper and our 2015 Methods in Molecular Biology review as well as the SLiMScape app paper:

For information about SLiMSuite, please visit the EdwardsLab webpage and the SLiMSuite blog. Help and documentation for the REST servers can also be found at the REST homepage. If in doubt, please email:

Several EdwardsLab publications also cover motifs and SLiMSuite tools.

Installing/Running SLiMSuite

NOTE: For this workshop, you do not need to install SLiMSuite. You will need Cytoscape and the SLiMScape app for the later parts.

The current SLiMSuite release is 2015-11-30 and can be downloaded by clicking the button (left).

In addition to the tarball available via the links above, SLiMSuite is now available as a GitHub repository (right).

See also: Installation and Setup.

For this workshop, we will primarily be running the tools (and looking at pre-generated results) via the online servers:

Data types and main input formats

From a computer science perspective, input and output for SLiMSuite is just plain ASCII text. This makes it easy to plug SLiMSuite into existing scripts and pipelines - and to manually view/edit any input or output files if required. However, “plain text” is not very informative, and SLiMSuite actually deals with a lot of different formats of plain text (from a “human formatting” rather than “file type” point of view). The documentation is currently being updated to better reflect these formats, but some commandline options will still simply list FILE, FILES or FILELIST as input parameters: check the accompanying descriptions to see what format these should be. Ask if it’s not clear! (File format documentation will also be added to the SLiMSuite blog, so check there.)

Within SLiMSuite, each file type has a distinct “file extension” that denotes the file type. Note that these are not enforced for input, although some programs may not always recognise the right format if a different extension is used. If you get odd input behaviour/errors that you do not understand, see if changing the file extensions helps. If you want a common file extension to be auto-recognised, let me know and I might be able to add it. SLiMSuite file extensions will not necessarily be recognised by other programs. NOTE: Operating systems will sometimes hide file extensions by default. If you are getting very confused, or have problems of extra *.txt extensions on everything, try changing the system settings. (And/or becoming familiar with command-line file manipulation.)

The main input formats for SLiM discovery are:

  • A source of protein sequence data. This could be a protein FASTA file, a Uniprot plain text file, or a list of Uniprot accession numbers to download. For some tools, a single Uniprot accession number will work.
  • A source of motif (regular expression) definitions. This is only required if looking for known (or other pre-defined) motifs and/or wanting to compare a set of de novo predictions with known motifs. A number of different formats are accepted for motif input, including SLiMFinder (summary) results and ELM downloads. The simplest/easiest is a plain text file of regular expressions. For more on motif regular expression formats, please see Edwards and Palopoli 2015.
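At its simplest, a motif file is just one regular expression per line. Bracket-class motifs like the PxDLS definitions used later in this workshop can be tried out with standard regex tools; the toy sequence below is made up:

```python
# Search a toy protein sequence for a bracket-class motif definition.
import re

motifs = ["P[EILMV][DN]L[ARST]"]  # one regex per line in a motif file
sequence = "MAGTPIDLSKKQ"         # toy protein sequence (an assumption)

for motif in motifs:
    for match in re.finditer(motif, sequence):
        # report 1-based positions, as protein coordinates usually are
        print(motif, match.start() + 1, match.group())
```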

Common motif discovery tasks

Jobs can be run and retrieved at: (This is a bit easier than making the URL directly, although this is also an option as we will see.)

NOTE: Some of the jobs take a while to run and the SLiMSuite servers have limited resources. It would therefore be useful if you could click on the example JobID links rather than trying to run every example REST command yourself. The first output tab (and the log tab) will show you the run times for that job, so you can see which jobs are fast or slow before you experiment.

Task 1: Find known SLiMs in a protein (ELM/SLiMProb)

ELM. Visit http:// and enter your protein of choice as a Uniprot identifier or accession number in the box. (Identifiers will auto-complete and fill in some extra details.) For non-Uniprot protein sequences, you can also enter fasta format.

Try this now with P03070 (LT_SV40) or P03254 (E1A_ADE02). Each of them should have a True Positive LIG_Rb_LxCxE_1 motif.

SLiMProb. We can do a similar search using the SLiMProb REST server (paste the contents of the grey box onto the end of the URL):


JobID: 15120800029

NOTE: The ELM alias currently searches the 2015 ELM classes.

Task 2: Find custom SLiMs in a protein (SLiMProb)


JobID: 15120800031

Task 3: Finding proteome-wide occurrence of a motif using Bioware (SLiMSearch)

The SLiMSearch server is accessible at: This has been recently updated to Version 4 and now brings in a lot of information, so it is recommended that you read the Help pages for the server.

Example (LIG_CtBP_PxDLS_1):

Human protein PRDM16 is particularly interesting: it does not have an annotated ELM but does match a region annotated to interact with CTBP1. (See the Region column - Expand the instance Feature annotations for a clearer look.) This kind of search can be a good way of identifying new instances of known motifs - some of which may be in the literature but may not have yet made it into database annotation.

The ELM definition for this motif P[LVIPME][DENS][LM][VASTRG] is very degenerate with a lot of hits - over-prediction is a big problem in motif discovery. We can try to make the definition a little tighter at the expense of some instances, using another tool called SLiMMaker:


JobID: 15120600004
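The core idea of SLiMMaker can be sketched in a few lines: pool the residues seen at each position of a set of aligned peptides into a bracket class. (The real tool also applies frequency thresholds and amino acid grouping rules; the peptides below are made up.)

```python
# Toy SLiMMaker: collapse aligned peptide instances into a regular
# expression, position by position.

def make_motif(peptides):
    """Build a degenerate motif regex from equal-length aligned peptides."""
    motif = ""
    for column in zip(*peptides):
        residues = sorted(set(column))
        # single residue stays fixed; multiple residues become a bracket class
        motif += residues[0] if len(residues) == 1 else "[%s]" % "".join(residues)
    return motif

print(make_motif(["PIDLS", "PVDLA", "PLNLT"]))  # P[ILV][DN]L[AST]
```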

Repeating the SLiMSearch analysis with the redefined motif (P[EILMV][DN]L[ARST]) gives a greater density of known ELMs (see the Motif column) in the top ranked motifs:

Task 4: Predicting novel SLiMs de novo in a set of proteins (SLiMFinder)

SLiMFinder is designed to look for convergently evolved motifs that are shared between unrelated proteins. For example, we can look at the proteins known (in ELM) to contain the LIG_PCNA_PIPBox_1 motif. As SLiMs are generally found in disordered regions, we will switch on disorder masking with dismask=T, which uses IUPred to predict globular regions and mask them out:


JobID: 15120800001

(We will look at the UPC and motif cloud output among others.)

Task 5: Identifying known motifs from de novo predictions (CompariMotif)

When you have a lot of motif predictions, it can be tiresome and error-prone to manually scan them for things that look familiar. SLiMSuite has a tool called CompariMotif, which compares sets of motifs for similarity.

The CompariMotif server can take motif files/lists (like SLiMProb or SLiMFinder output) directly. These are given to the &motifs and/or &searchdb options: if no &searchdb is given then the input motifs are searched against themselves. (This can be useful if clouding goes a bit wrong.)

To pass the output of one server to another, use the format: &cmd=jobid:XXXXXX:OUTFMT, where XXXXXX is the Job ID and OUTFMT is the desired output format. E.g.:


JobID: 15120900004
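As a sketch, the chained command is just a formatted string appended to the next server's options. (The &motifs option comes from the text above; the "occ" output format name is a placeholder, not a confirmed endpoint.)

```python
# Build a &cmd=jobid:XXXXXX:OUTFMT-style chained option for the REST server.
jobid = "15120900004"  # a finished server job
outfmt = "occ"         # hypothetical output format name (placeholder)

cmd = "jobid:%s:%s" % (jobid, outfmt)
motifs_option = "&motifs=" + cmd  # e.g. appended to a CompariMotif call
print(motifs_option)
```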

The server is currently in development so output is not sorted usefully yet. This is more of a problem if searching against many SLiMs:


JobID: 15120900005

The best advice is to save the compare output table (retrieve&jobid=15120900005&outfmt=compare), open it up in Excel and sort on Score. Alternatively, use the CompariMotif server at

Task 6: SLiM prediction with conservation masking (SLiMFinder)

Masking is important as it reduces the search space. It can also reduce the signal if it incorrectly masks some true positives, but for larger datasets the reduction in "noise" can be more important. As well as dismask=T/F there are several other masking options in SLiMSuite:

  • low complexity masking (ON by default)
  • N-terminal methionines (ON by default)
  • conservation-based masking (OFF by default)
  • Uniprot feature masking (OFF by default)
  • Motif masking (OFF by default)

For custom sequence input, there is also the option for custom masking based on upper/lower case. For now, we will just look at conservation masking, as this has been shown to improve sensitivity in PPI data. For example, a 2013 compilation of CTBP1 interactors does not yield a significant motif:


JobID: 15120900002

But if consmask=T is also switched on:


JobID: 15120900003

The importance of correcting for evolutionary relationships

The UPC correction can be switched off with efilter=F. Many motif prediction tools calculate estimated expectations without such correction. This can result in massive biases due to shared evolutionary history, which swamp any convergent SLiM evolution signal, for example with the LIG_CtBP_PxDLS_1 ELM proteins:


JobID: 15120800036

Task 7: Look for enrichment or depletion of motifs in a set of proteins (SLiMProb)

We can investigate why the PxDLS motif did not come back with just disorder masking by looking at its enrichment using SLiMProb. When given multiple proteins, SLiMProb will use the same UPC correction as SLiMFinder but also return statistics without UPC correction, simply treating all the sequences as one giant sequence. It can, for example, be used to investigate different definitions of a motif:


JobID: 15120900016

In this case, we can see that even though the "true" motif has the most support, it is also expected to occur more by chance. It is enriched, but not enough to survive the multiple testing correction of SLiMChance.

Though not of interest here, the pUnd statistics can be used to look for depletion/avoidance of a particular motif in a dataset.

Task 8: Find novel motifs from a conservation pattern (SLiMPrints)

Patterns of evolutionary conservation can also be used to directly identify regions of proteins that look like motifs. The tool we have developed for this is called SLiMPrints, which can be run at the Bioware SLiMPrints server. For example, we can look for motif-like regions in one of the CtBP PPI partners, FOG1_HUMAN (Q8IX07):

This protein has a bunch of significant motif-like regions, including the PxDLS motif region at rank 7:

(Note how the precise motif is rarely returned by de novo predictors.)

Task 9: Using the SLiMScape app to visualise a server job

We're now going to fire up Cytoscape and have a quick look at the SLiMScape app. This is fairly well described in the paper, so we will just look at the main ways to run the server.

The simplest is to retrieve an existing run:

  1. In the SLiMFinder tab, enter 15120900003 in the Run ID box and hit Retrieve.
  2. Apply the default layout.
  3. Explore the results. Connections are UPC relationships in the data.

Task 10: Running QSLiMFinder through SLiMScape

Now let's imagine we had seen the SLiMPrints results from above for FOG1_HUMAN and knew that it interacted with CtBP1. We could then ask the specific question of whether any motifs in FOG1_HUMAN are enriched in the rest of the PPI dataset. We do this by using QSLiMFinder and giving Q8IX07 as the query. (&query=Q8IX07 on the server.)

First, add a node to the network and change its name to Q8IX07. Enter this in the Query Sequence box then highlight all of the nodes before hitting Run QSLiMFinder:

JobID: 15120900007

This is the essence of molecular mimicry and we could use the same approach to see if E1A_ADE02 shares any motifs by adding P03254 and using it as a query:

JobID: 15120900008

Task 11: Building PPI networks for analysis

The most useful aspect of having access to SLiMSuite through Cytoscape is being able to use it to explore PPI networks and select nodes for analysis. There are built-in tools to get PPI data into Cytoscape. For SLiMSuite, the ID must be a Uniprot ID or accession number, or a node must have a "Uniprot" attribute.

The SLiMSuite REST server also provides some methods for getting PPI data into Cytoscape (and/or for use on the server), using the PINGU server. This is still under development and so the documentation of the available PPI data is currently limited, but just get in touch if you want to use it. (Currently human only.)

PPI data is retrieved by entering one or more gene symbols as a &hublist, optionally along with a &ppisource (see the ppisource alias):


JobID: 15120900009

This can be used directly for &uniprotid input using the &rest=uniprot output:


JobID: 15120900011

Alternatively, the PPI data can be imported into Cytoscape using the pairwise table:

  1. Start a new session. (Later you can work out how to import and merge networks.)
  2. Import network from URL:
  3. Rename the HubUni and SpokeUni fields to name and assign them the Source Node and Target Node attributes. Make Hub the Source, Spoke the Target and Evidence the Interaction Type, then import.
  4. Select the nodes that are shared interactors of both CtBP proteins.
  5. Modify the masking settings to include disorder, conservation and feature masking.
  6. Hit Run:

JobID: 15120900013