
Code for predicting protein subcellular localization into 8 classes, based on ensembles of Deep Convolutional Neural Networks.

Type 'make' to compile the binary (Predict_wholeset).

Usage:

./Predict_wholeset model dataset

The 'model' file should be a master file pointing to the actual model files:
the first line gives the number of models in the ensemble,
and each following line gives the location of one model.
The Ensemble subdirectory contains a 'model' file pointing to an ensemble of 24 models.
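
To illustrate the master file layout described above, here is a minimal Python sketch that reads such a file and returns the listed model locations. It is not part of this package: the function name read_master_file is hypothetical, and the sketch assumes only the layout stated above (a count on the first line, then one location per line).

```python
from pathlib import Path


def read_master_file(path):
    """Return the model locations listed in a master 'model' file.

    Assumed layout: the first line holds the number of models, and each
    following line holds the location of one model.
    """
    # Drop empty lines and surrounding whitespace before interpreting the file.
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    n_models = int(lines[0])
    locations = lines[1:1 + n_models]
    # Fail loudly if the header count and the number of listed paths disagree.
    if len(locations) != n_models:
        raise ValueError(f"expected {n_models} model paths, found {len(locations)}")
    return locations
```

A check like this can be useful before launching a prediction, since a mismatch between the count on the first line and the number of listed models is easy to introduce when editing the file by hand.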

The 'dataset' argument should be a file containing the input encoding (alignment-based amino acid frequencies) of the proteins to predict.
Refer to the README_sets file in the datasets directory for a description of how these files should be formatted.

Once the binary has compiled correctly, you can run a prediction on any correctly formatted dataset.
For example, to run the code on the sample.dataset file (a single protein), type the following from the root directory of this package:

code/Predict_wholeset Ensemble/model sample.dataset

This should produce a 'sample.dataset.predictions' file with contents similar to the following:

1
CCDC8_HUMAN
538
2
5
0.0568	0.0390	0.1927	0.0058	0.0088	0.6834	0.0049	0.0087	

meaning:

-line 1: number of proteins predicted (global header)

then, for each protein, 6 lines:
-name of the protein
-length in amino acids
-original class annotation (set to 0 in the source dataset file if the class was unknown)
-predicted class number
-probabilities of the 8 classes
-blank line
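
The record layout above can be read back with a short script. The following Python sketch is an illustration only and is not shipped with this package: the function name parse_predictions and the dictionary keys are hypothetical, and it assumes the probabilities are whitespace-separated, as in the sample output.

```python
def parse_predictions(text):
    """Parse the contents of a '.predictions' file into a list of dicts."""
    lines = text.splitlines()
    n_proteins = int(lines[0].strip())  # global header: number of proteins
    records = []
    pos = 1
    for _ in range(n_proteins):
        # Skip blank separator lines before the next record.
        while pos < len(lines) and not lines[pos].strip():
            pos += 1
        records.append({
            "name": lines[pos].strip(),
            "length": int(lines[pos + 1]),
            "annotated_class": int(lines[pos + 2]),
            "predicted_class": int(lines[pos + 3]),
            # split() with no argument handles tabs and a trailing tab.
            "probabilities": [float(x) for x in lines[pos + 4].split()],
        })
        pos += 5
    return records
```

For example, `parse_predictions(open("sample.dataset.predictions").read())` would return a one-element list for the sample file, with the eight probabilities available as a list of floats.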



The class codes are:

Other                                           0
Membrane                                        1
Cytoplasm                                       2
Golgi_apparatus                                 3
Mitochondrion                                   4
Nucleus                                         5
Plastid                                         6
Secreted                                        7
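
The class codes above can be turned into a simple lookup when post-processing predictions. The Python sketch below is a hypothetical helper, not part of this package. It assumes that the 8 probabilities are listed in class-code order (0 to 7) and that the predicted class is the one with the highest probability; both assumptions are consistent with the sample output, but neither is stated explicitly in this file.

```python
# Class names indexed by class code, copied from the table above.
CLASS_NAMES = [
    "Other",            # 0
    "Membrane",         # 1
    "Cytoplasm",        # 2
    "Golgi_apparatus",  # 3
    "Mitochondrion",    # 4
    "Nucleus",          # 5
    "Plastid",          # 6
    "Secreted",         # 7
]


def most_probable_class(probs):
    """Return (class code, class name) for the highest probability.

    Assumes probs is ordered by class code, 0 to 7.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, CLASS_NAMES[best]
```

Applied to the probabilities in the sample output, this returns class 5 (Nucleus), which matches the predicted class number reported there.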


