Although SLiMFinder is designed with whole protein sequences in mind, it can also be used to identify statistically over-represented motifs in peptide data, including phage display results. Indeed, phage display analysis is the third example application in the original SLiMFinder paper.
Suggested settings for phage display data are below. If you have a go and/or want more advice, please get in touch. (If you try it, I’d be interested to hear how well it works!) Similarly, if you want some advice/ideas on how to combine the peptides with interaction data and full-length protein sequences for a more sophisticated analysis, send me a bit more info and I’d be happy to make some suggestions.
Custom settings for phage display data
Here is an overview of the settings that should be tweaked for phage display analysis:
Amino acid frequencies. One thing you will want to try is changing the way that the amino acid frequencies are used. By default, SLiMFinder uses the amino acid frequencies of the input dataset, but this is not really right for phage display peptides, whose composition is clearly biased by the motifs they contain. Instead, you probably want to set the amino acid frequencies for the background model to those of the human proteome (for human peptides) or even a uniform amino acid distribution. (Select frequencies that model the pre-screening amino acid frequencies.) This is done using the aafreq=FILE option, where FILE can be a fasta file of protein sequences or a delimited file of aa frequencies with the headings “AA” and “FREQ”. (See the manual for details.) If in doubt, try a few runs with different amino acid frequencies.
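As a rough illustration of the delimited-file route, here is a minimal Python sketch that writes a uniform background distribution with the “AA” and “FREQ” headings. The tab delimiter, the number format and the output file name are my assumptions rather than anything confirmed by the manual, so check the manual for the exact format SLiMFinder expects.

```python
import csv

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def uniform_aafreq_rows():
    """Return (AA, FREQ) rows for a uniform background distribution."""
    freq = 1.0 / len(AMINO_ACIDS)
    return [(aa, freq) for aa in AMINO_ACIDS]


def write_aafreq(path, rows, delimiter="\t"):
    """Write a delimited frequency table with the headings AA and FREQ.

    The tab delimiter is an assumption; check the manual for what is accepted.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(["AA", "FREQ"])
        for aa, freq in rows:
            writer.writerow([aa, f"{freq:.4f}"])


if __name__ == "__main__":
    # Hypothetical output file name, to be passed to SLiMFinder as aafreq=FILE.
    write_aafreq("uniform_aafreq.tdt", uniform_aafreq_rows())
```

The same writer can take proteome-derived frequencies instead: pass any list of (amino acid, frequency) pairs in place of the uniform rows.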
Evolutionary Filtering. Evolutionary filtering should be switched off (efilter=F) but you will also want to make sure that there is no redundancy in your peptides. (rje_seq.py can be used for this.)
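If you just want a quick sanity check before running anything, the simplest form of redundancy (exact duplicate peptides) can be removed with a few lines of Python. This is only a stand-in sketch and not what rje_seq.py does: it will not catch near-identical peptides, and the function names and the “pep” naming prefix are made up for illustration.

```python
def unique_peptides(peptides):
    """Return peptides with exact duplicates removed, keeping first-seen order.

    Peptides are stripped of whitespace and upper-cased before comparison.
    """
    seen = set()
    unique = []
    for pep in peptides:
        key = pep.strip().upper()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def to_fasta(peptides, prefix="pep"):
    """Format peptides as fasta text with numbered (hypothetical) names."""
    lines = []
    for i, pep in enumerate(peptides, start=1):
        lines.append(f">{prefix}{i}")
        lines.append(pep)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    raw = ["PPLPAR", "pplpar ", "RPLPPK", ""]
    print(to_fasta(unique_peptides(raw)), end="")
```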
SLiMChance. If you are not so interested in the statistical significance and primarily want to use SLiMFinder to return a ranked list of interesting motifs in the data, set sigcut=1.0 and choose the number of motifs to return with
Ambiguity. Peptide data is usually pretty quick to run, and so it is probably worth exploring the full range of ambiguity with combamb=T (combined amino acid and variable-length wildcards). The basic equiv=LIST set for aa degeneracy should be OK for most jobs but you can easily tweak it to add or remove ambiguity combinations as appropriate.
Masking. You will probably want to switch off all masking (masking=F). Low complexity masking might be useful, but if you leave masking on for that purpose, metmask=F posmask="" should still be used, as the peptide N-termini are not true protein N-termini.
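To pull the settings above together, here is a small Python sketch that assembles them into a single command line. It only builds and prints the command; it does not run SLiMFinder. The script name slimfinder.py, the seqin= input option and both file names are assumptions on my part, so adjust them to your own installation and check the manual. It uses the simple masking=F route rather than the metmask/posmask alternative.

```python
import shlex


def build_slimfinder_command(peptide_fasta, aafreq_file, ranked_list_only=False):
    """Assemble a SLiMFinder command line from the phage display settings.

    'slimfinder.py' and 'seqin=' are assumptions; the remaining options are
    the ones discussed in the text.
    """
    options = [
        f"seqin={peptide_fasta}",  # assumed name of the input file option
        f"aafreq={aafreq_file}",   # background amino acid frequencies
        "efilter=F",               # no evolutionary filtering
        "combamb=T",               # explore the full range of ambiguity
        "masking=F",               # switch off all masking
    ]
    if ranked_list_only:
        # Ignore significance and just return a ranked list of motifs.
        options.append("sigcut=1.0")
    return ["python", "slimfinder.py"] + options


if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    cmd = build_slimfinder_command("peptides.fas", "uniform_aafreq.tdt")
    print(shlex.join(cmd))
```

Keeping the options in a list makes it easy to do the "try a few runs" comparisons, for example by swapping the aafreq file between runs.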