An enormous number of raw biological sequences are available in public databases today. The growth of these databases has generally not been matched by a proportional increase in the speed of methods for making sense of them. For example, while in 1992 some form of interpretation was available for about a quarter of all known proteins, today this fraction has dropped to just a few percent. Computers have become increasingly important in trying to bridge the gap. In particular, Machine Learning (ML) techniques have been proposed to mine recurrent features in biological sequences and to construct methods to interpret raw data automatically. There are several fundamental reasons to rely on ML: it is a sound statistical framework; it proves reliable, especially in the presence of large amounts of data; and it can be orders of magnitude faster than the alternatives. While ML methods have achieved some notable successes in computational biology, there is clearly vast room for broadening and deepening their role through a tighter integration of domain-specific information and algorithmic design. Most biological data are inherently structured (sequences, molecules). These structured data are often analysed by ML methods designed for much simpler data (typically fixed-size arrays). This process eliminates the structure itself, and with it a likely source of vital information. Paradigms for designing ML models to process structured data are becoming available. For instance, we have recently proposed a general approach for designing artificial neural networks to deal with transformations between structured objects that can be represented as Directed Acyclic Graphs [Baldi and Pollastri 2003]. These models have already been applied to molecular biology problems, leading to state-of-the-art results (find here a non-exhaustive list of publications on the topic).
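The loss of information caused by fixed-size encodings can be illustrated with a small sketch. This is not one of the models referenced above: the function `window_features`, the window size and the two example sequences are hypothetical and chosen only for illustration. The sketch one-hot encodes a fixed window of residues around a position, and shows that two different sequences can yield exactly the same input whenever they differ only outside the window.

```python
# Illustrative sketch only: names, window size and sequences are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def window_features(sequence, center, half_width=1):
    """One-hot encode the residues within `half_width` of `center`.

    Positions that fall outside the sequence are encoded as all zeros,
    so the output always has (2 * half_width + 1) * 20 entries.
    """
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
        features.extend(one_hot)
    return features


# Two sequences that share a prefix but differ from position 6 onwards.
seq_a = "MKTAYIAKQR"
seq_b = "MKTAYIWWWW"

# Around position 3 both windows cover "TAY", so the encodings coincide
# even though the sequences as a whole are different.
same_window = window_features(seq_a, 3) == window_features(seq_b, 3)
```

A model fed only `window_features(seq, 3)` cannot distinguish `seq_a` from `seq_b`, whatever it learns; anything that depends on the residues beyond the window is invisible to it.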
We endeavour to design novel models to process structured biological data of various natures, and to apply these models to an array of critical problems in computational biology. One of our main objectives is the creation of high-throughput ML systems for the prediction of protein structures from their amino acid sequences. Of the 1.7 million protein sequences currently known, only about 10% are human-annotated, and the three-dimensional structure has been experimentally determined for fewer than 2%. Being able to bridge the gap between sequence and structure would dramatically increase our knowledge of biological processes, since protein structures encode protein functions. A further objective of our group is the development of an extended paradigm for protein structure prediction, in which the focus is shifted from learning the final structure to learning about the protein folding process itself, i.e. the process whereby a protein reaches its stable three-dimensional structure. This paradigm may contribute to improved structure predictions but, even more importantly, may provide testing grounds for hypotheses on the folding process, and insights into misfolding-related diseases such as Alzheimer's disease and Creutzfeldt-Jakob disease. We are also active in developing an ML framework for processing molecular data in the form of undirected graphs, with two potential applications of great importance for drug design: the fast computational prediction of properties of chemical compounds based on their chemical structures, and the estimation of binding energies for protein-protein and protein-ligand docking.
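To make the notion of molecular data as undirected graphs concrete, here is a minimal sketch of one possible representation, with atoms as nodes and bonds as edges. It is an assumption made for illustration and not the framework under development: the helper `build_adjacency` and the choice of ethanol (heavy atoms only) are hypothetical.

```python
# Illustrative sketch only: helper name and example molecule are assumptions.

def build_adjacency(num_atoms, bonds):
    """Return an adjacency list for an undirected molecular graph.

    Each bond (i, j) is stored in both directions, which is what makes
    the graph undirected.
    """
    adjacency = {atom: [] for atom in range(num_atoms)}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    return {atom: sorted(neighbours) for atom, neighbours in adjacency.items()}


# Ethanol, heavy atoms only: C(0) - C(1) - O(2).
elements = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

adjacency = build_adjacency(len(elements), bonds)

# The chemical neighbourhood of each atom, read directly off the graph.
neighbour_elements = {
    atom: [elements[n] for n in neighbours]
    for atom, neighbours in adjacency.items()
}
```

Unlike a fixed-size array, this representation grows with the molecule and keeps the bonding pattern explicit, so a model operating on it can in principle use the connectivity itself as input.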