PaleAle 5.0: quick help and references
The Servers: description
PaleAle is a server for predicting protein relative solvent accessibility in 4 classes, based on ensembles of cascaded bidirectional recurrent neural networks (BRNNs) and convolutional neural networks. PaleAle's features include:
- New, large training sets.
The server is trained on a recent redundancy reduced subset of the Protein Data Bank, containing nearly 16,000 protein structures.
- More diverse evolutionary information.
We use both PSI-BLAST and HHblits on recent versions of the UniProt database. While systems based on PSI-BLAST or HHblits separately perform broadly similarly, systems using both at the same time, or ensembles of systems trained separately, give us significant improvements.
- More efficient input encoding.
We have come up with an encoding for the evolutionary information that gives us a more informative representation of the distribution of alignments and at the same time keeps track of the identity of the primary sequence of the query. This gives us a significant boost compared to older versions of the server.
PaleAle 5.0, tested on a large independent test set, achieves approximately 56.5% correct classification of 4 relative solvent accessibility classes, which climbs to 80.5% for two classes (exposed vs. buried, with a 25% exposure threshold between the two classes).
PaleAle 5.0 is described in the paper listed under References (Amino Acids, 2019). Many of the details on the architectures can be found in older references on the previous versions of the server.
Input formats
Email
If you enter an email address you will receive an email containing the results.
NOTE: Check that you typed your address correctly. Many queries never receive a reply because the address was mistyped.
Whether you enter an email address or not, a link to a web page will be provided after you click submit. The link points to the results page, which is updated automatically every 60 seconds until the query is complete.
Note that if we have many jobs in our queue it may take hours to serve a query containing many sequences.
Even if the queue is empty, a maximal query (64 kbytes) typically takes around two hours to process.
If you don't want to keep a browser window open for half a day, bookmark the response link or enter an email address.
Input sequence(s)
The sequence of amino acids:
- You can submit sequences in FASTA format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column.
- You can send up to 64 kbytes per submission, which is approximately 200 average-sized proteins.
- Larger queries can be broken down into 64 kbyte chunks, or you can ask us to lift the limit for you on a one-off basis.
- Spaces, newlines and tabs will be ignored, so feel free to have them in your query.
- Characters not corresponding to any amino acid will be treated as X.
- Only the 1-letter amino acid code is understood. Please do not send nucleotide sequences: if you do, they will be read as protein sequences, with A treated as Alanine, C as Cysteine, etc.
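The input rules above can be checked locally before submitting. The following is a minimal sketch, not part of the server: all function names are hypothetical, "64kbytes" is assumed to mean 64 * 1024 bytes, the valid alphabet is assumed to be the 20 standard one-letter codes, and the upper-casing step and the 90% nucleotide heuristic are assumptions of this sketch rather than documented server behaviour.

```python
# Hypothetical local pre-check for a PaleAle query; not the server's own code.
MAX_QUERY_BYTES = 64 * 1024  # assumption: "64kbytes" means 65536 bytes
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard codes


def parse_fasta(text):
    """Return (description, raw_sequence) pairs from FASTA text."""
    records, header, parts = [], None, []
    for raw in text.splitlines():
        if raw.startswith(">"):  # description line: ">" in the first column
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = raw[1:].strip(), []
        else:
            stripped = raw.strip()
            if stripped and header is not None:
                parts.append(stripped)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def clean_sequence(seq):
    """Drop whitespace and map anything that is not an amino acid code to X."""
    out = []
    for ch in seq:
        if ch.isspace():
            continue
        ch = ch.upper()  # assumption: lower-case input is meant as upper-case
        out.append(ch if ch in AMINO_ACIDS else "X")
    return "".join(out)


def looks_like_nucleotides(seq, threshold=0.9):
    """Heuristic: flag sequences made almost entirely of A, C, G, T, U, N."""
    letters = [c.upper() for c in seq if c.isalpha()]
    if not letters:
        return False
    hits = sum(c in "ACGTUN" for c in letters)
    return hits / len(letters) >= threshold


def to_fasta(header, seq, width=60):
    """Format one record as FASTA with fixed-width sequence lines."""
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{header}\n{body}\n"


def chunk_records(records, limit=MAX_QUERY_BYTES):
    """Greedily pack cleaned records into submissions of at most `limit` bytes."""
    chunks, current, size = [], [], 0
    for header, seq in records:
        entry = to_fasta(header, clean_sequence(seq))
        n = len(entry.encode("utf-8"))
        if n > limit:
            raise ValueError(f"record {header!r} alone exceeds {limit} bytes")
        if current and size + n > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(entry)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Each string returned by `chunk_records(parse_fasta(text))` can be pasted as one submission. Running `looks_like_nucleotides` on every raw sequence first gives an early warning before DNA or RNA is sent by mistake.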
Output format
Replies are sent by email (if you give us one) and shown as a web page if you click the link we give you after you submit. The email response and main web page contain the same information in the same format.
PaleAle 5.0's replies come as text.
You might have to "view attachments inline" in your email client to see these replies.
If you submit multiple sequences you will receive one single email/web page with all the predictions.
The web page version of the results will be updated incrementally (every 60 seconds) until your query is complete.
Here is an example prediction:
Query_name: provapaleale
Query_length: 75
SEQ EHVRDAAEREGVMLIKKTAENIDEAAVRAATASRGASQAGAPQGRVPEAR
SA EEEEEEeEEEEbEeBEEBBEeBeEBBEEbEEEEEeEEEEEEEEEEEEEEE
95435102510040102012000210321311320131221221430432
SEQ PNSMVVEHPEFLKAGKEPGLQIWRV
SA EEEEEEEbEEBEEEEEEEEEEEEEE
4321120022006234300011035
Query served in 260 seconds
Each sequence in your query is split into blocks of 50 residues.
For each block there are 3 lines:
- Line 1: The 1-letter code of your protein primary sequence, preceded by "SEQ ".
- Line 2: Relative Solvent Accessibility prediction by PaleAle 5.0, preceded by "SA ":
  - B: very buried (less than 4% exposed)
  - b: somewhat buried (between 4% and 25% exposed)
  - e: somewhat exposed (between 25% and 50% exposed)
  - E: very exposed (more than 50% exposed)
- Line 3: prediction confidence: a number between 0 and 9, with 9 signifying maximal confidence.
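The three-line block layout described above is straightforward to read back into per-residue data. The following is a minimal sketch under stated assumptions: the names `Prediction`, `parse_reply` and `two_class` are hypothetical, every query in a multi-sequence reply is assumed to start with a "Query_name:" line, and the two-class mapping simply merges B and b (below the 25% threshold) into buried and e and E into exposed.

```python
# Hypothetical parser for a PaleAle 5.0 text reply; not provided by the server.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Prediction:
    name: str
    declared_length: Optional[int] = None
    sequence: str = ""                      # concatenated "SEQ " lines
    classes: str = ""                       # concatenated "SA " lines (B, b, e, E)
    confidence: List[int] = field(default_factory=list)  # one digit per residue


def parse_reply(text):
    """Collect the SEQ / SA / confidence blocks of each query in a reply."""
    predictions, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Query_name:"):
            # assumption: each query in the reply opens with this line
            current = Prediction(name=line.split(":", 1)[1].strip())
            predictions.append(current)
        elif current is None:
            continue
        elif line.startswith("Query_length:"):
            current.declared_length = int(line.split(":", 1)[1])
        elif line.startswith("SEQ "):
            current.sequence += line[4:].strip()
        elif line.startswith("SA "):
            current.classes += line[3:].strip()
        elif line.isascii() and line.isdigit():
            # line 3 of a block: one confidence digit (0-9) per residue
            current.confidence.extend(int(c) for c in line)
    for p in predictions:
        n = len(p.sequence)
        if len(p.classes) != n or len(p.confidence) != n:
            raise ValueError(f"{p.name}: block lines have inconsistent lengths")
        if p.declared_length is not None and p.declared_length != n:
            raise ValueError(f"{p.name}: expected {p.declared_length} residues, got {n}")
    return predictions


def two_class(classes):
    """Collapse the 4 classes to buried (B) vs. exposed (E) at the 25% threshold."""
    return "".join("B" if c in "Bb" else "E" for c in classes)
```

After parsing, `zip(p.sequence, p.classes, p.confidence)` yields one (residue, class, confidence) triple per position, which is convenient for filtering out low-confidence predictions.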
References
PaleAle 5.0
M. Kaleel, M. Torrisi, C. Mooney, G. Pollastri,
"PaleAle 5.0: prediction of protein relative solvent accessibility by deep learning"
Amino Acids, 2019, doi: 10.1007/s00726-019-02767-6
AMAC web site (Toll-free Link)
Porter 4.0, PaleAle 4.0
C. Mirabello, G. Pollastri,
"Porter, PaleAle 4.0: high-accuracy prediction of protein secondary structure and relative solvent accessibility",
Bioinformatics, 29(16):2056-2058, 2013, doi: 10.1093/bioinformatics/btt344
Toll free PDF (Bioinformatics web site)
Other (and older) references on Porter and PaleAle
Brewery
M. Torrisi, G. Pollastri,
"Protein Structure Annotations", in Essentials of Bioinformatics, Volume I. Understanding Bioinformatics: Genes to Proteins,
Springer Nature, 2019; doi: 10.1007/978-3-030-02634-9_10
Repository UCD (Abstract)
Porter 5.0
M. Torrisi, M. Kaleel, G. Pollastri,
"Porter 5: state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes",
bioRxiv, 289033; doi: 10.1101/289033
bioRxiv web site (Abstract and PDF)
G. Pollastri, A. McLysaght,
"Porter: a new, accurate server for protein secondary structure prediction",
Bioinformatics, 21(8), 1719-20, 2005.
Toll-free link to the article
C. Mooney, G. Pollastri,
"Beyond the Twilight Zone: Automated prediction of structural properties of proteins by recursive neural networks and remote homology information"
Proteins, 77(1), 181-90, 2009.
Abstract and PDF (Proteins web site)
G. Pollastri, A. J. M. Martin, C. Mooney, A. Vullo,
"Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information"
BMC Bioinformatics, 8:201, 2007.
Open access abstract and PDF (BMC Bioinformatics web site).
Distill as a whole
D. Baú, A. J. M. Martin, C. Mooney, A. Vullo, I. Walsh, G. Pollastri.
"Distill: A suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins"
BMC Bioinformatics, 7:402, 2006.
Open access abstract and PDF (BMC Bioinformatics web site).
A more comprehensive list of publications from our group is available here.