Porter 4.0, PaleAle 4.0: quick help and references
The Servers: description
Porter is a server for protein secondary structure prediction based on an ensemble of 25 BRNNs (bidirectional recurrent neural networks). Porter's features include:
- Efficient input coding.
In Porter 4.0 the input at each residue is coded as a letter out of an alphabet of 25.
Besides the 20 standard amino acids, B (aspartic acid or asparagine), U (selenocysteine),
X (unknown), Z (glutamic acid or glutamine) and . (gap) are considered.
The input presented to the networks is the frequency of each of the 24 non-gap symbols,
plus the overall proportion of gaps in each column of the alignment.
- Output filtering and incorporation of predicted long-range information.
In Porter 4.0 the first-stage predictions are filtered by a second network.
The input to this network includes the predictions of the first stage network
averaged over multiple contiguous windows, covering 225 residues.
- New, large training sets.
Porter 4.0 is trained on a recent redundancy-reduced subset of the Protein Data Bank, containing over 7500 protein structures.
- Large ensembles of models.
Five two-stage BRNN models are trained independently to build Porter 4.0.
Differences among models are due to different architecture and number of free parameters of the models.
25 models in total are ensemble averaged in Porter 4.0.
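The input coding described in the first bullet above can be sketched as follows. This is a minimal illustration, not the server's code: the names `SYMBOLS`, `encode_column` and `encode_alignment` are hypothetical, and the sketch assumes that symbol frequencies are normalised by the number of non-gap characters in a column, which the text does not specify.

```python
from collections import Counter

# The 24 non-gap symbols named in the text: 20 standard amino acids plus B, U, X, Z.
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY" + "BUXZ"
GAP = "."


def encode_column(column):
    """Encode one alignment column as 24 symbol frequencies plus a gap proportion.

    Assumption: frequencies are divided by the number of non-gap characters
    in the column; the source does not state the denominator.
    """
    counts = Counter(column)
    gaps = counts.get(GAP, 0)
    non_gap = len(column) - gaps
    if non_gap > 0:
        freqs = [counts.get(s, 0) / non_gap for s in SYMBOLS]
    else:
        # A column made only of gaps carries no symbol information.
        freqs = [0.0] * len(SYMBOLS)
    gap_fraction = gaps / len(column) if column else 0.0
    return freqs + [gap_fraction]


def encode_alignment(rows):
    """Encode an alignment (equal-length strings, one per sequence) column by column."""
    return [encode_column("".join(col)) for col in zip(*rows)]
```

Each residue position then yields a vector of 25 numbers: 24 frequencies followed by the proportion of gaps in that column.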
Porter, tested by a rigorous 5-fold cross validation procedure,
achieves 82.2% correct classification on the "hard" CASP 3-class assignment.
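As an illustration of what the accuracy figure measures, the sketch below reduces DSSP's states to the three classes listed in the "Output format" section and computes the percentage of correctly classified residues. It assumes that "correct classification" means per-residue accuracy (often called Q3); the function names are hypothetical.

```python
# DSSP-to-3-class reduction as listed in the "Output format" section:
# H, G, I -> H (helix); E, B -> E (strand); everything else -> C.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Map a string of DSSP states to the 3-class alphabet H/E/C."""
    return "".join(DSSP_TO_3.get(state, "C") for state in dssp)


def q3(predicted, observed):
    """Percentage of residues whose predicted class equals the observed class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not predicted:
        raise ValueError("cannot score an empty prediction")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)
```

For example, `q3("HHEC", reduce_dssp("HGTS"))` compares a prediction against the observed states `HHCC` and returns 75.0.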
Porter 4.0 is described in the Porter 4.0, PaleAle 4.0 paper listed in the References below; further details on the architectures can be found in the older references on the previous versions of the server.
PaleAle 4.0 is a server for the prediction of protein relative solvent accessibility.
Each amino acid is classified as being in one of 4 (approximately equally frequent) classes:
- B=completely buried (0-4% exposed)
- b=partly buried (4-25% exposed)
- e=partly exposed (25-50% exposed)
- E=completely exposed (50+% exposed)
The architecture of PaleAle 4.0's classifier is an exact copy of Porter 4.0's (described above). PaleAle 4.0's accuracy, measured on the same large, non-redundant set of over 7500 proteins used to train Porter 4.0, exceeds 55% correct 4-class classification and 80% correct 2-class classification (Buried vs Exposed, with a 25% threshold).
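The class definitions above can be written as a small lookup. This is a sketch under one stated assumption: the listed ranges share their boundary values (4%, 25%, 50%), so the code treats each interval as half-open, assigning a boundary value to the more exposed class. The function names are hypothetical.

```python
def rsa_class(rsa_percent):
    """Map a relative solvent accessibility (in percent) to B, b, e or E.

    Assumption: intervals are half-open, so 4 -> 'b', 25 -> 'e', 50 -> 'E'.
    """
    if rsa_percent < 0:
        raise ValueError("relative solvent accessibility cannot be negative")
    if rsa_percent < 4:
        return "B"  # completely buried
    if rsa_percent < 25:
        return "b"  # partly buried
    if rsa_percent < 50:
        return "e"  # partly exposed
    return "E"      # completely exposed


def two_class(label):
    """Collapse the 4 classes into Buried vs Exposed at the 25% threshold."""
    if label not in "BbeE" or len(label) != 1:
        raise ValueError(f"unknown class label: {label!r}")
    return "Buried" if label in "Bb" else "Exposed"
```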
PaleAle 4.0 is described in the Porter 4.0, PaleAle 4.0 paper listed in the References below; further details on the architectures can be found in the older references on the previous versions of the server.
Input formats
Email
If you input an email address you will receive an email containing the results.
NOTE: Check that you typed your address correctly. Many queries never
receive an answer because the address was mistyped.
Whether you input an email address or not, a link to a web page will be provided to you
after you click submit. The link will point
you to the results page, which is updated automatically every 60 seconds until the query
is complete.
Notice that if we have many jobs in our queue it may take hours to serve a query containing many sequences.
Even if the queue is empty, a maximal query (64kbytes) would typically take in the region of two hours to be processed.
If you don't want to keep a browser window open for half a day, bookmark the response link or input an email address.
Input sequence(s)
The sequence of amino acids:
- You can submit sequences in FASTA format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column.
- You can send up to 64kbytes per submission, which is approximately 200 average-sized proteins.
- Larger queries can be broken down into 64kbyte chunks, or you can ask us to lift the limit for you on a one-off basis.
- Spaces, newlines and tabs will be ignored, so feel free to have them in your query.
- Characters not corresponding to any amino acid will be treated as X.
- Only the 1-letter amino acid code is understood. Please do not send nucleotide sequences: if you do, A will be treated as alanine, C as cysteine, etc.
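If you prepare queries with a script, the rules above can be applied before submission. The sketch below is an illustration only, not the server's own preprocessing: all names are hypothetical, "64kbytes" is assumed to mean 65,536 bytes, and lower-case letters are assumed to be acceptable once upper-cased.

```python
MAX_QUERY_BYTES = 64 * 1024  # assumption: "64kbytes" read as 65,536 bytes
# Accepted 1-letter codes: 20 standard amino acids plus B, U, X, Z.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBUXZ")


def parse_fasta(text):
    """Return (description, raw_sequence) pairs; lines before the first '>' are skipped."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line[1:].strip(), []])
        elif records:
            records[-1][1].append(line)
    return [(name, "".join(parts)) for name, parts in records]


def clean_sequence(seq):
    """Drop spaces, tabs and newlines; replace any other unexpected character with X."""
    residues = "".join(seq.split()).upper()  # assumption: case does not matter
    return "".join(c if c in ALLOWED else "X" for c in residues)


def chunk_records(records, limit=MAX_QUERY_BYTES):
    """Group records into FASTA strings that each fit within `limit` bytes."""
    chunks, current, size = [], [], 0
    for name, seq in records:
        entry = f">{name}\n{clean_sequence(seq)}\n"
        entry_size = len(entry.encode("utf-8"))
        if entry_size > limit:
            raise ValueError(f"record {name!r} alone exceeds {limit} bytes")
        if current and size + entry_size > limit:
            # Start a new submission once the current one would overflow.
            chunks.append("".join(current))
            current, size = [], 0
        current.append(entry)
        size += entry_size
    if current:
        chunks.append("".join(current))
    return chunks
```

Each string returned by `chunk_records` can then be pasted into the form as a separate submission.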
Output format
Replies are sent by email (if you give us one) and shown as a web page if you click the link we give you after you submit. The email response and main web page contain the same information in the same format.
Porter 4.0 and PaleAle 4.0's replies come as text.
You might have to "view attachments inline" in your email client to see these replies.
If you submit multiple sequences you will receive one single email/web page with all the predictions.
The web page version of the results will be updated incrementally (every 60 seconds) until your
query is complete.
Here is an example of a prediction:
Subject: Porter 4.0,PaleAle 4.0 response to 1 queries
Query_name: 1A1W_._@:83
Query_length: 83
MDPFLVLLHSVSSSLSSSELTELKYLCLGRVGKRKLERVQSGLDLFSMLLEQNDLEPGHT
CCHHHHHHHHHHHCCCHHHHHHHHHHHHHHCCHHHHHCCCCHHHHHHHHHHCCCCCCCCH
EEEbeEbBeeBbEEbeEEbbEeBbEbBeEebeEEebEEbEeBbeBBeeBeEeEebeEEeb
ELLRELLASLRRHDLLRRVDDFE
HHHHHHHHHCCCHHHHHHHHHCC
EbBeeBBEEbEeEeBbEeBEEbE
Your query is split into blocks of 60 residues.
For each block you have 3 lines:
- Line 1: The 1-letter code of your protein primary sequence. This line is always present.
- Line 2: Secondary structure prediction by Porter 4.0:
  - H = helix : DSSP's H (alpha helix) + G (3-10 helix) + I (pi-helix) classes.
  - E = strand : DSSP's E (extended strand) + B (beta-bridge) classes.
  - C = the rest : DSSP's T (turn) + S (bend) + . (the rest).
- Line 3: Relative Solvent Accessibility prediction by PaleAle 4.0:
  - B=completely buried (0-4% exposed)
  - b=partly buried (4-25% exposed)
  - e=partly exposed (25-50% exposed)
  - E=completely exposed (50+% exposed)
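To work with whole-protein strings rather than 60-residue blocks, the reply can be reassembled with a few lines of code. The sketch below is a hypothetical helper, not something provided by the servers. It assumes a single-query reply whose only non-block lines are the `Subject:`, `Query_name:` and `Query_length:` headers shown in the example above.

```python
HEADER_PREFIXES = ("Subject:", "Query_name:", "Query_length:")


def parse_reply(text):
    """Return (sequence, secondary_structure, solvent_accessibility) for one query.

    Assumption: after dropping blank lines and the header lines, the reply
    consists of consecutive groups of three lines, one group per block.
    """
    body = [line.strip() for line in text.splitlines()]
    body = [line for line in body if line and not line.startswith(HEADER_PREFIXES)]
    if len(body) % 3 != 0:
        raise ValueError("expected three lines per block")
    sequence = "".join(body[0::3])
    secondary = "".join(body[1::3])
    accessibility = "".join(body[2::3])
    if not (len(sequence) == len(secondary) == len(accessibility)):
        raise ValueError("the three lines of a block differ in length")
    return sequence, secondary, accessibility
```

The parser does not depend on the block width, so it keeps working on the final, shorter block of a query.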
How confident are the servers about a given residue?
There will be a link in your email response and on the results page to view the response "containing confidence values".
This will look similar to the standard response except that there will be two more lines per block, one after the Porter 4.0 prediction line,
one after the PaleAle 4.0 prediction line. These two lines indicate, respectively, how confident Porter 4.0 and PaleAle 4.0 are
about their predictions for each of the residues. The confidence is expressed by a number between 0 and 9, with 9 being the maximum
confidence and 0 the minimum. Here is an example:
Subject: Porter 4.0,PaleAle 4.0 response to 1 queries
Query_name: T0530
Query_length: 115
MKKAMAILAVLAAAAVICGLLFFHNDVTDRFNPFIHQQDVYVQIDRDGRHLSPGGTEYTL
CCHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCEEEEEECCCCCEECCCEEEEEE
904789999999999998887524644345466557504899990686062475279998
EEEEEeEeeeEeEEEEEeeeEEEEEEEbeEbbEbbEEEebbBbBeeEbEEEEEEebeBeB
732000100000000000000001042000110022331018252020410341121101
DGYNASGKKEEVTFFAGKELRKNAYLKVKAKGKYVETWEEVKFEDMPDSVQSKLK
EEECCCCCEEEEEEECCCCCCCCCEEEEEECCCEEEEEEEEEHHHCCHHHHHHCC
8782999879999951776677857999975832677788606565688998509
eBbbEEeEEeeBeBebEEEbeEEBbBbBbbEEEeBeebEeBeeEEbbEEBeEEbE
1420470412011110333103001838202431002012000611151327118
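The confidence lines can be used to hide predictions you do not want to rely on. The following sketch reads a reply in the five-lines-per-block layout shown above and masks residues whose confidence falls below a cut-off. The helper names, the default cut-off of 5 and the `-` mask character are illustrative choices, not recommendations from the servers.

```python
HEADER_PREFIXES = ("Subject:", "Query_name:", "Query_length:")


def parse_confidence_reply(text):
    """Return (sequence, ss, ss_confidence, rsa, rsa_confidence) for one query.

    Assumption: each block has five lines in this order: sequence, Porter
    prediction, Porter confidence, PaleAle prediction, PaleAle confidence.
    """
    rows = [line.strip() for line in text.splitlines()]
    rows = [line for line in rows if line and not line.startswith(HEADER_PREFIXES)]
    if len(rows) % 5 != 0:
        raise ValueError("expected five lines per block")
    columns = ["".join(rows[i::5]) for i in range(5)]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("the five lines of a block differ in length")
    sequence, ss, ss_conf, rsa, rsa_conf = columns
    return (sequence, ss, [int(d) for d in ss_conf],
            rsa, [int(d) for d in rsa_conf])


def mask_low_confidence(labels, confidence, threshold=5, mask="-"):
    """Replace labels whose confidence (0-9) is below `threshold` with `mask`."""
    return "".join(label if conf >= threshold else mask
                   for label, conf in zip(labels, confidence))
```

For instance, masking the labels `CCHHH` with confidences 9, 0, 4, 7, 8 at the default cut-off gives `C--HH`.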
Real valued relative solvent accessibility
The main results page also contains a link to PaleAle 4.0's predictions mapped into real valued relative solvent accessibility. These are simply the class probabilities output by PaleAle 4.0, remapped via a linear regressor into values between 0 and 100. Notice that PaleAle 4.0 is not optimised for this task. However, in our experiments the real values show a mean average error of approximately 14%, which is close to the state of the art. Here is an example of what you will see in the real valued solvent accessibility page:
Query_name: T0530
Query_length: 115
M K K A M A I L A V L A A A A
72 59 54 47 44 39 50 42 32 45 47 44 46 49 48
V I C G L L F F H N D V T D R
48 49 40 36 40 41 38 42 51 43 61 53 33 41 48
F N P F I H Q Q D V Y V Q I D
26 19 35 31 21 57 60 61 31 10 10 0 17 2 35
R D G R H L S P G G T E Y T L
45 58 32 63 53 48 59 63 51 38 13 34 5 46 6
D G Y N A S G K K E E V T F F
44 4 18 25 65 72 39 64 55 34 49 6 44 7 42
A G K E L R K N A Y L K V K A
15 59 59 61 15 44 62 49 10 12 0 17 0 21 12
K G K Y V E T W E E V K F E D
57 62 60 42 12 46 35 25 55 41 9 50 47 70 57
M P D S V Q S K L K
12 24 68 57 8 41 72 57 29 76
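The real-valued page alternates a line of residues with a line of values, so it can be read into (residue, value) pairs as sketched below. The function names are hypothetical. The sketch assumes that the values are integers and that header lines are the only lines whose first token ends with a colon. The Buried/Exposed split reuses the 25% threshold mentioned earlier, with exactly 25 counted as exposed by assumption.

```python
def parse_real_valued(text):
    """Return a list of (residue, value) pairs from the real-valued page."""
    rows = [line.split() for line in text.splitlines()]
    # Drop blank lines and headers such as "Query_name: T0530".
    rows = [row for row in rows if row and not row[0].endswith(":")]
    if len(rows) % 2 != 0:
        raise ValueError("expected alternating residue and value lines")
    pairs = []
    for residues, values in zip(rows[0::2], rows[1::2]):
        if len(residues) != len(values):
            raise ValueError("residue and value lines differ in length")
        pairs.extend((res, int(val)) for res, val in zip(residues, values))
    return pairs


def buried_or_exposed(value, threshold=25):
    """Two-class label for a real-valued accessibility (assumption: 25 counts as exposed)."""
    return "Exposed" if value >= threshold else "Buried"
```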
References
Porter 4.0, PaleAle 4.0
C. Mirabello, G. Pollastri.
"Porter, PaleAle 4.0: high-accuracy prediction of protein secondary structure and relative solvent accessibility".
Bioinformatics, 29(16):2056-2058, 2013, doi: 10.1093/bioinformatics/btt344
Toll-free PDF (Bioinformatics web site)
Older references on Porter and PaleAle
G. Pollastri, A. McLysaght.
"Porter: a new, accurate server for protein secondary structure prediction".
Bioinformatics, 21(8):1719-1720, 2005.
Toll-free link to the article
C. Mooney, G. Pollastri.
"Beyond the Twilight Zone: Automated prediction of structural properties of proteins by recursive neural networks and remote homology information".
Proteins, 77(1):181-190, 2009.
Abstract and PDF (Proteins web site)
G. Pollastri, A. J. M. Martin, C. Mooney, A. Vullo.
"Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information".
BMC Bioinformatics, 8:201, 2007.
Open access abstract and PDF (BMC Bioinformatics web site).
Distill as a whole
D. Baú, A. J. M. Martin, C. Mooney, A. Vullo, I. Walsh, G. Pollastri.
"Distill: A suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins"
BMC Bioinformatics, 7:402, 2006.
Open access abstract and PDF (BMC Bioinformatics web site).
A more comprehensive list of publications from our group is available here.