SCLpred-MEM: quick help and references
The server: description
SCLpred-MEM is a server for the prediction of protein subcellular localization of endomembrane system and secretory pathway (EMS) proteins in 2 classes based on ensembles of Convolutional Neural Networks. The two classes are Membrane vs. non-Membrane.
SCLpred-MEM's features include:
- Large ensembles of finely tuned, deep models.
SCLpred-MEM is an ensemble of 25 Neural Networks trained on different partitions of the training set. The models are stacks of Convolutional Neural Networks followed by a normalized average pooling layer and two fully connected layers. Each model is fairly small (~500 tunable parameters), but deep (6 inner layers) and explicitly takes into account all motifs of 21 residues in the input sequence. We have observed significant gains when increasing the motif size up to the one used in SCLpred-MEM, and moderate but significant gains when deepening the architecture up to 6 inner layers.
- New, large training sets.
The server is trained on a recent redundancy-reduced set containing ~5,000 proteins.
- Expanded evolutionary information use
We run PSI-BLAST against a recent version of the UniProt database, which allows us to leverage larger sets of alignments.
- More efficient input encoding.
We have devised an encoding for the evolutionary information that gives an informative representation of the evolutionary history of a protein while keeping track of the identity of its residues. This gives a significant boost compared to using plain frequency profiles. Some information on this encoding can be found in the Porter 5 preprint listed in the References (though that paper is about a different problem and server).
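Three ideas from the feature list above (a fixed 21-residue motif window, pooling over sequence positions, and an ensemble of models) can be illustrated with a minimal pure-Python sketch. This is not the server's implementation: the padding character, the plain (un-normalized) average pooling, the arithmetic-mean ensemble rule and all function names are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence

MOTIF_SIZE = 21  # motif length stated in the description above


def motif_windows(seq: str, size: int = MOTIF_SIZE, pad: str = "-") -> List[str]:
    """Return one window of `size` residues centred on each position.

    Padding the sequence ends with `pad` is an assumption, not a documented detail.
    """
    half = size // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + size] for i in range(len(seq))]


def average_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Collapse per-position feature vectors into one fixed-length vector.

    Plain averaging stands in for the "normalized average pooling" layer.
    """
    if not features:
        raise ValueError("need at least one position")
    dim = len(features[0])
    return [sum(row[d] for row in features) / len(features) for d in range(dim)]


def ensemble_mean(seq: str, models: Sequence[Callable[[str], float]]) -> float:
    """Average the scores of all ensemble members (assumed combination rule)."""
    scores = [model(seq) for model in models]
    return sum(scores) / len(scores)


# Toy usage: every position of a 30-residue sequence gets a 21-residue window.
example_windows = motif_windows("A" * 30)
print(len(example_windows), len(example_windows[0]))
```

The point of the pooling step is that sequences of any length are reduced to a vector of the same size, which is what lets a fixed set of fully connected layers follow the convolutional stack.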
SCLpred-MEM, tested on an independent test set containing 240 proteins with low identity (<30%) against the training set, achieves approximately 62% MCC and over 74% 2-class correct classification, at least matching the other public servers we have tested. On a second independent test set containing 118 proteins and more stringent redundancy reduction against the training set (>0.001 e-value) SCLpred-MEM achieves 52% MCC, 58% 2-class correct classification and performs better than the other systems we have tested.
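For readers who want to compute the same kind of figures on their own data, both metrics follow directly from a 2x2 confusion matrix. The sketch below uses the standard definitions of accuracy and the Matthews correlation coefficient (MCC); the function names are ours, and the counts in the usage lines are made up for illustration and are not the server's results.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of proteins assigned to the correct class."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient on its native -1 to 1 scale."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, with Membrane as the positive class.
example_acc = accuracy(tp=40, tn=40, fp=10, fn=10)
example_mcc = mcc(tp=40, tn=40, fp=10, fn=10)
print(example_acc, example_mcc)
```

An MCC of 0.6 on this scale would be written as 60% in the style used above.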
A paper on SCLpred-MEM is currently undergoing review. Many of the details on the architectures can be found in older references on the previous versions of the server.
Input formats
Email
If you input an email address you will receive an email containing the results.
NOTE: Check that you typed your address correctly. Many queries never receive a reply because the address was mistyped.
Whether you input an email address or not, a link to a web page will be provided to you after you click submit. The link will point you to the results page, which is updated automatically every 60 seconds until the query is complete.
Notice that if we have many jobs in our queue it may take hours to serve a query containing many sequences.
Even if the queue is empty, a maximal query (64kbytes) would typically take in the region of two hours to be processed.
If you don't want to keep a browser window open for half a day, bookmark the response link or input an email address.
Input sequence(s)
The sequence of amino acids:
- You can submit sequences in FASTA format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column.
- You can send up to 64kbytes per submission, which is approximately 200 average-sized proteins.
- Larger queries can be broken down into 64kbyte chunks, or you can ask us to lift the limit for you on a one-off basis.
- Spaces, newlines and tabs will be ignored, so feel free to have them in your query.
- Characters not corresponding to any amino acid will be treated as X.
- Only the 1-letter amino acid code is understood. Please do not send nucleotide sequences: if you do, A will be treated as Alanine, C as Cysteine, etc.
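The input rules above can be applied locally before submitting, for example to check what the server will see and to split a large file into pieces under the size limit. The following is a minimal sketch, not the server's own parser. It assumes that "64kbytes" means 65,536 bytes, that the accepted alphabet is the 20 standard amino acids, and that lower-case letters are accepted as their upper-case equivalents; all function names are hypothetical.

```python
from typing import List, Tuple

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumed accepted alphabet
MAX_QUERY_BYTES = 64 * 1024                # assumed meaning of "64kbytes"


def clean_sequence(seq: str) -> str:
    """Drop whitespace and map every non-amino-acid character to X."""
    out = []
    for ch in seq:
        if ch.isspace():
            continue
        ch = ch.upper()
        out.append(ch if ch in AMINO_ACIDS else "X")
    return "".join(out)


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (description, cleaned sequence) pairs from FASTA text."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):  # description line: ">" in the first column
            if name is not None:
                records.append((name, clean_sequence("".join(chunks))))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, clean_sequence("".join(chunks))))
    return records


def chunk_records(records: List[Tuple[str, str]],
                  limit: int = MAX_QUERY_BYTES) -> List[str]:
    """Group records into FASTA strings of at most `limit` bytes each.

    A single record larger than `limit` is still emitted, alone in its batch.
    """
    batches: List[str] = []
    current: List[str] = []
    size = 0
    for name, seq in records:
        entry = f">{name}\n{seq}\n"
        n = len(entry.encode("utf-8"))
        if current and size + n > limit:
            batches.append("".join(current))
            current, size = [], 0
        current.append(entry)
        size += n
    if current:
        batches.append("".join(current))
    return batches


# Usage: spaces are dropped, and "1" and "u" are not amino acids, so both become X.
demo = parse_fasta(">p1 test\nAC D\nEF1\n>p2\nacgu\n")
print(demo)
```

Each string returned by chunk_records can then be pasted into the form as a separate submission.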
Output format
Replies are sent by email (if you give us one) and shown as a web page if you click the link we give you after you submit. The email response and main web page contain the same information in the same format.
SCLpred-MEM's replies come as text.
You might have to "view attachments inline" in your email client to see these replies.
If you submit multiple sequences you will receive one single email/web page with all the predictions.
The web page version of the results will be updated incrementally (every 60 seconds) until your query is complete.
Queries are first submitted to our predictor of EMS location, SCLpred-EMS. Proteins predicted to be in the EMS class are then further classified as Membrane or non-Membrane by SCLpred-MEM.
Here is an example prediction:
Subject: SCLpred-MEM response to 4 queries
Query_name: A1CBI6_1b
Query_length: 93
Query_sequence:
MPGKELTDPCVDCADAEAILTVRCRRLCQDCYARFVNFKVFKRMENYRLRRNMSRTGPCK
LLLPLSYGTSSSVLLHILNAQIQHERAKSHPSP
Prediction: non-EMS
Confidence: 6 (0=lowest; 9=highest)
Query_name: A1CBI6_2b
Query_length: 87
Query_sequence:
GFELHVLVIEPSTVSTSSPPHDEGFDLLQQTFPSHSFTRVSLHNVFELDPSIQDVLSQFS
SEGFTDDATMSDKDRLDAFRASITTAT
Prediction: EMS, non-Membrane
Confidence: 2 (0=lowest; 9=highest)
Query_name: A1CDAW_1b
Query_length: 89
Query_sequence:
SRVDVDYILITRLVVAFAKKIECRGVVWGDSDTRLAAKTLANVAKGRGSAITWQVCDGMS
PFGLEFSFPLRDLYKAEVQNYASFFPELA
Prediction: non-EMS
Confidence: 4 (0=lowest; 9=highest)
Query_name: A1CDAW_2b
Query_length: 105
Query_sequence:
KIIIPDEPPSENILTKNLSIDELMMRYVQTQGEKYPGIMANVTRTASKLQASLVPANVPR
CSFCGGSMLNQDGQIIMGGAAGNSEVRQGAELCYACTRSRPEVSY
Prediction: non-EMS
Confidence: 3 (0=lowest; 9=highest)
Query served in 781 seconds
Your query is split into individual proteins.
For each protein you have 5 records:
- Query_name: As read from the description line of the FASTA file.
- Query_length: Number of residues in the query.
- Query_sequence: The residues in the query, split into 60-residue lines (as in the example above).
- Prediction: one of
  - EMS, Membrane: endomembrane system and secretory pathway, Membrane
  - EMS, non-Membrane: endomembrane system and secretory pathway, non-Membrane
  - non-EMS: all other classes
- Confidence: a number between 0 and 9, with 9 signifying maximal confidence.
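Because the reply is plain text with one labelled record per line, it is straightforward to load into a script. The parser below is a minimal sketch based only on the example reply shown above; it assumes the field labels appear exactly as in that example, and the function name and dictionary keys are our own choices rather than part of the server.

```python
from typing import Dict, List


def parse_reply(text: str) -> List[Dict[str, object]]:
    """Turn the text of a reply into one dictionary per protein."""
    records: List[Dict[str, object]] = []
    current = None
    in_sequence = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Query_name:"):
            current = {"name": line.split(":", 1)[1].strip(), "sequence": ""}
            records.append(current)
            in_sequence = False
        elif current is None:
            continue  # header lines such as "Subject: ..."
        elif line.startswith("Query_length:"):
            current["length"] = int(line.split(":", 1)[1])
            in_sequence = False
        elif line.startswith("Query_sequence:"):
            current["sequence"] = line.split(":", 1)[1].strip()
            in_sequence = True
        elif line.startswith("Prediction:"):
            current["prediction"] = line.split(":", 1)[1].strip()
            in_sequence = False
        elif line.startswith("Confidence:"):
            # Keep only the leading number, e.g. "6 (0=lowest; 9=highest)" -> 6.
            current["confidence"] = int(line.split(":", 1)[1].split()[0])
            in_sequence = False
        elif in_sequence:
            current["sequence"] += line  # continuation of a wrapped sequence
    return records


# Usage with the first record of the example reply above.
SAMPLE = """Subject: SCLpred-MEM response to 4 queries
Query_name: A1CBI6_1b
Query_length: 93
Query_sequence:
MPGKELTDPCVDCADAEAILTVRCRRLCQDCYARFVNFKVFKRMENYRLRRNMSRTGPCK
LLLPLSYGTSSSVLLHILNAQIQHERAKSHPSP
Prediction: non-EMS
Confidence: 6 (0=lowest; 9=highest)
Query served in 781 seconds
"""
parsed = parse_reply(SAMPLE)
print(parsed[0]["name"], parsed[0]["prediction"], parsed[0]["confidence"])
```

A simple consistency check after parsing is to compare the length of the reassembled sequence with the reported Query_length.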
References
A nearly up-to-date ECCB 2018 poster describing some of the SCLpred-MEM methods
M.Kaleel, Y.Zheng, J.Chen, X.Feng, J.C.Simpson, G.Pollastri, C.Mooney, "SCLpred-EMS: subcellular localization prediction of endomembrane system and secretory pathway proteins by Deep N-to-1 Convolutional Neural Networks", Bioinformatics, (2020), https://doi.org/10.1093/bioinformatics/btaa156
Article on the Bioinformatics web site.
Journal articles on older versions of SCLpred
C.Mooney, A.Cessieux, D.Shields, G.Pollastri, "SCL-Epred: A Generalised De novo Eukaryotic Protein Subcellular Localisation Predictor", Amino Acids, 45(2), 291-9, 2013. DOI: 10.1007/s00726-013-1491-3
Abstract and PDF (Amino Acids web site)
A.Adelfio, V.Volpato, G.Pollastri, "SCLpredT: Ab initio and homology-based prediction of subcellular localization by N-to-1 Neural Networks", SpringerPlus, 2:502, 2013. DOI: 10.1186/2193-1801-2-502
Abstract and PDF (SpringerPlus web site)
C.Mooney, Y.Wang, G.Pollastri, "SCLpred: Protein Subcellular Localization Prediction by N-to-1 Neural Networks", Bioinformatics, 27(20), 2812-2819, 2011.
Abstract and PDF (Bioinformatics web site)
A preprint on a different problem, but with some information about the input encoding used in SCLpred 2.0
M.Torrisi, M.Kaleel, G.Pollastri, "Porter 5: fast, state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes", bioRxiv, 2018.
Abstract and PDF
More comprehensive lists of publications from our groups are available here and here.