SCLpred-EMS:
quick help and references





The Servers: description

SCLpred-EMS

SCLpred-EMS is a server that predicts the subcellular localization of eukaryotic proteins into two classes using ensembles of Convolutional Neural Networks. The two classes are the endomembrane system and secretory pathway (EMS) and all other localizations (non-EMS). SCLpred-EMS's features include:

  • Large ensembles of finely tuned, deep models. SCLpred-EMS is an ensemble of 100 Neural Networks trained on different partitions of the training set. The models are stacks of Convolutional Neural Networks followed by a normalized average pooling layer and two fully connected layers. Each model is fairly small (~2,000 tunable parameters), but deep (6 inner layers) and explicitly takes into account all motifs of 21 residues in the input sequence. We have observed significant gains when increasing the motif size up to the one used in SCLpred-EMS, and moderate but significant gains when deepening the architecture up to 6 inner layers.
  • New, large training sets. The server is trained on a recent redundancy reduced set containing over 7,000 proteins.
  • Expanded evolutionary information use. We run psiblast against a recent version of the UniProt database, which allows us to leverage larger sets of alignments.
  • More efficient input encoding. We have devised an encoding for the evolutionary information that gives an informative representation of the evolutionary history of a protein and at the same time keeps track of the identity of its residues. This gives us a significant boost compared to using plain frequency profiles. Some information on this encoding can be found in the Porter 5 preprint listed under References (though that paper is about a different problem and server).
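To make the architecture described above more concrete, here is a minimal NumPy sketch of one way such an ensemble could be laid out: a stack of 1D convolutions, an average over sequence positions, two fully connected layers, and an average over the models. It is not the server's code. The kernel width, channel counts, number of input features per residue, activation functions, the averaging of per-model probabilities and the 0.5 threshold are all assumptions chosen for illustration, and the function names are hypothetical. Five stacked layers of width 5 are used only because they give each output position a window of 1 + 5 × 4 = 21 residues.

```python
import numpy as np

def conv_relu(x, w, b):
    """1D 'same' convolution plus ReLU. x: (L, C_in), w: (k, C_in, C_out), k odd."""
    k, length = w.shape[0], x.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))  # zero-pad both ends of the sequence
    windows = np.stack([xp[i:i + length] for i in range(k)], axis=1)  # (L, k, C_in)
    return np.maximum(np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b, 0.0)

def init_model(rng, n_in=22, channels=8, kernel=5, n_conv=5, hidden=8):
    """Random, untrained weights; every size here is an illustrative assumption."""
    convs, c_in = [], n_in
    for _ in range(n_conv):
        convs.append((rng.normal(0, 0.1, (kernel, c_in, channels)), np.zeros(channels)))
        c_in = channels
    fc1 = (rng.normal(0, 0.1, (channels, hidden)), np.zeros(hidden))
    fc2 = (rng.normal(0, 0.1, (hidden, 1)), np.zeros(1))
    return convs, fc1, fc2

def predict_one(model, x):
    """Probability of the EMS class from a single model."""
    convs, (w1, b1), (w2, b2) = model
    for w, b in convs:
        x = conv_relu(x, w, b)
    pooled = x.mean(axis=0)  # average over positions, so any sequence length works
    h = np.tanh(pooled @ w1 + b1)
    return float(1.0 / (1.0 + np.exp(-(h @ w2 + b2)[0])))

def predict_ensemble(models, x):
    """Average the per-model probabilities (assumed) and threshold at 0.5 (assumed)."""
    p = float(np.mean([predict_one(m, x) for m in models]))
    return ("EMS" if p >= 0.5 else "non-EMS"), p
```

With `rng = np.random.default_rng(0)` and `models = [init_model(rng) for _ in range(100)]`, calling `predict_ensemble(models, x)` on an encoded protein `x` of shape `(length, 22)` returns a label and the averaged probability. The weights are random, so the output is meaningless; only the data flow is illustrated.
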
SCLpred-EMS, tested on an independent test set containing 611 proteins with low identity (<30%) against the training set, achieves approximately 78% MCC and over 90% 2-class correct classification, better than all the other public servers we have tested. On a second independent test set containing 227 proteins and more stringent redundancy reduction against the training set (>0.001 e-value) SCLpred-EMS achieves 83% MCC, well over 90% 2-class correct classification and again performs better than the other systems we have tested.
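For readers less familiar with the two measures quoted above, the sketch below shows how the Matthews correlation coefficient (MCC) and the 2-class correct classification rate are computed from a confusion matrix. MCC ranges from -1 to 1, so "78% MCC" corresponds to 0.78. The function names are ours, and the counts in the usage note are invented for illustration; they are not the server's results.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a 2-class confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Fraction of proteins assigned to the correct class."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, with 45 true positives, 45 true negatives, 5 false positives and 5 false negatives, `accuracy` gives 0.9 and `mcc` gives 0.8.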
A paper on SCLpred-EMS is currently undergoing review. Many of the details on the architectures can be found in older references on the previous versions of the server.

Input formats

Email

If you input an email address you will receive an email containing the results.
NOTE: Check that you typed your address correctly. Many queries never receive an answer because the address was mistyped.

Whether you input an email address or not, a link to a web page will be provided to you after you click submit. The link will point you to the results page, which is updated automatically every 60 seconds until the query is complete.

Note that if we have many jobs in our queue it may take hours to serve a query containing many sequences. Even if the queue is empty, a maximal query (64kbytes) typically takes around two hours to process. If you don't want to keep a browser window open that long, bookmark the response link or input an email address.

Input sequence(s)

The sequence of amino acids:

  • You can submit sequences in FASTA format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column.
  • You can send up to 64kbytes per submission, which is approximately 200 average-sized proteins.
  • Larger queries can be broken down into 64kbyte chunks, or you can ask us to lift the limit for you on a one-off basis.
  • Spaces, newlines and tabs will be ignored, so feel free to have them in your query.
  • Characters not corresponding to any amino acid will be treated as X.
  • Only the one-letter amino acid code is understood. Please do not send nucleotide sequences: if you do, A will be treated as Alanine, C as Cysteine, etc.
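The rules above can be checked locally before submitting. The following Python sketch parses FASTA text, cleans each sequence the way the list describes, and splits a large query into chunks that stay under the size limit. The helper names are hypothetical, and several details are assumptions rather than documented server behaviour: "64kbytes" is taken as 65,536 bytes, lower-case letters are upper-cased, and anything outside the 20 standard one-letter codes is mapped to X.

```python
import re

MAX_QUERY_BYTES = 64 * 1024  # assumption: "64kbytes" means 65,536 bytes
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes

def clean_sequence(raw):
    """Drop whitespace, upper-case, and map any other character to X."""
    seq = re.sub(r"\s+", "", raw).upper()
    return "".join(c if c in AMINO_ACIDS else "X" for c in seq)

def parse_fasta(text):
    """Return a list of (description, cleaned sequence) pairs."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):  # ">" must be in the first column
            records.append((line[1:].strip(), []))
        elif records:
            records[-1][1].append(line)
    return [(name, clean_sequence("".join(parts))) for name, parts in records]

def chunk_records(records, limit=MAX_QUERY_BYTES, width=60):
    """Group records into FASTA strings of at most `limit` bytes each.

    A single record larger than `limit` still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for name, seq in records:
        lines = [">" + name] + [seq[i:i + width] for i in range(0, len(seq), width)]
        entry = "\n".join(lines) + "\n"
        n = len(entry.encode("utf-8"))
        if current and size + n > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(entry)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Each string returned by `chunk_records(parse_fasta(text))` can then be pasted into the form as a separate submission.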

Output format

Replies are sent by email (if you give us an address) and shown as a web page at the link we give you after you submit. The email response and the main web page contain the same information in the same format. SCLpred-EMS's replies come as plain text; you might have to "view attachments inline" in your email client to see them. If you submit multiple sequences you will receive a single email/web page with all the predictions. The web page version of the results is updated incrementally (every 60 seconds) until your query is complete. Here is an example of a prediction:


Subject: SCLpred-EMS response to 4 queries

Query_name: A1CBI6_1
Query_length: 93
Query_sequence:
MPGKELTDPCVDCADAEAILTVRCRRLCQDCYARFVNFKVFKRMENYRLRRNMSRTGPCK
LLLPLSYGTSSSVLLHILNAQIQHERAKSHPSP
Prediction: non-EMS
Confidence: 8 (0=lowest; 9=highest)


Query_name: A1CBI6_2
Query_length: 87
Query_sequence:
GFELHVLVIEPSTVSTSSPPHDEGFDLLQQTFPSHSFTRVSLHNVFELDPSIQDVLSQFS
SEGFTDDATMSDKDRLDAFRASITTAT
Prediction: non-EMS
Confidence: 0 (0=lowest; 9=highest)


Query_name: A1CDAW_1
Query_length: 89
Query_sequence:
SRVDVDYILITRLVVAFAKKIECRGVVWGDSDTRLAAKTLANVAKGRGSAITWQVCDGMS
PFGLEFSFPLRDLYKAEVQNYASFFPELA
Prediction: non-EMS
Confidence: 2 (0=lowest; 9=highest)


Query_name: A1CDAW_2
Query_length: 105
Query_sequence:
KIIIPDEPPSENILTKNLSIDELMMRYVQTQGEKYPGIMANVTRTASKLQASLVPANVPR
CSFCGGSMLNQDGQIIMGGAAGNSEVRQGAELCYACTRSRPEVSY
Prediction: non-EMS
Confidence: 5 (0=lowest; 9=highest)



Query served in 360 seconds

Your query is split into individual proteins. For each protein you have 5 records:

  • Query_name: As taken from the description line of the FASTA file.
  • Query_length: Number of residues in the query.
  • Query_sequence: The residues in the query, split into 60-residue lines (as in the example above).
  • Prediction:
    • EMS: endomembrane system and secretory pathway
    • non-EMS: all other classes
  • Confidence: a number between 0 and 9, with 9 signifying maximal confidence.
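Because the reply is plain text with one field per line, it is easy to post-process. The sketch below is one possible parser, written against the example reply shown above; `parse_reply` and the dictionary keys are our own names, and the code assumes that real replies follow exactly the layout of that example.

```python
def parse_reply(text):
    """Split a SCLpred-EMS text reply into one dict per query protein."""
    records, current, in_sequence = [], None, False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Query_name:"):
            current = {"name": line.split(":", 1)[1].strip(), "sequence": ""}
            records.append(current)
            in_sequence = False
        elif current is None or not line:
            continue  # skip the header and blank lines
        elif line.startswith("Query_length:"):
            current["length"] = int(line.split(":", 1)[1])
        elif line.startswith("Query_sequence:"):
            in_sequence = True  # the following lines hold the residues
        elif line.startswith("Prediction:"):
            current["prediction"] = line.split(":", 1)[1].strip()
            in_sequence = False
        elif line.startswith("Confidence:"):
            # "8 (0=lowest; 9=highest)" -> 8
            current["confidence"] = int(line.split(":", 1)[1].split()[0])
        elif in_sequence:
            current["sequence"] += line
    return records
```

A typical use is to keep only the confident calls, for instance `[r for r in parse_reply(text) if r["confidence"] >= 5]`; the cut-off of 5 is arbitrary and only illustrates the filtering.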

References

A nearly up-to-date ECCB 2018 poster describing some of the SCLpred-EMS methods

M.Kaleel, A.Khalid, T.Kumar, Z.Yandan, C.Jialiang, F.Xuanming, G.Pollastri and C.Mooney, "DeepSCLpred: Protein subcellular localization prediction by Deep N-to-1 neural networks", ECCB 2018, Athens, Greece.
PDF

Journal articles on older versions of SCLpred

C.Mooney, A.Cessieux, D.Shields, G.Pollastri, "SCL-Epred: A Generalised De novo Eukaryotic Protein Subcellular Localisation Predictor", Amino Acids, 45(2), 291-9, 2013; doi: 10.1007/s00726-013-1491-3.
Abstract and PDF (Amino Acids web site)

A.Adelfio, V.Volpato, G.Pollastri, "SCLpredT: Ab initio and homology-based prediction of subcellular localization by N-to-1 Neural Networks", SpringerPlus, 2:502, 2013; doi: 10.1186/2193-1801-2-502.
Abstract and PDF (SpringerPlus web site)

C.Mooney, Y.Wang, G.Pollastri, "SCLpred: Protein Subcellular Localization Prediction by N-to-1 Neural Networks", Bioinformatics, 27 (20), 2812-2819, 2011.
Abstract and PDF (Bioinformatics web site)

A preprint on a different problem, but with some information about the input encoding used in SCLpred 2.0

M.Torrisi, M.Kaleel, G.Pollastri, "Porter 5: fast, state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes", bioRxiv, 2018.
Abstract and PDF

More comprehensive lists of publications from our groups are available here and here.





Gianluca Pollastri, gianluca.pollastri at ucd.ie,
Gianluca Pollastri's group
School of Computer Science and Informatics
University College Dublin