Advisors: Frédéric Vivien (60%) and Daniel Kahn (40%)
Contact: Frederic.Vivien@ens-lyon.fr, kahn@biomserv.univ-lyon1.fr

Automated identification and clustering of protein domains

The study of proteins and their properties is essential for medicine, agriculture, and biological research. Proteins are composed of one or more building blocks, known as domains. Decomposing a protein into a sequence of domains helps us understand its structure and function. Several research efforts have produced identification systems for protein domains. As the number of known protein sequences is already large (more than 2 million, less than 10% of them having been manually curated) and growing steadily, only automated methods can hope to be exhaustive.

ProDom [1] is a protein domain family databank built automatically from protein sequence databanks using the mkdom2 algorithm [2]. Currently, mkdom2 takes more than six months to process the UniProt protein sequence databank on a single conventional computer. We are currently developing a parallel version of mkdom2. The aim of the proposed thesis is to assess the quality of the version of ProDom produced by this parallel version, and then to change the core algorithm using techniques such as domain boundary delineation heuristics, multiple alignment algorithms, machine learning approaches, or graph theory. The student will have to adapt existing algorithms, create new ones, or both.

The size of protein databanks is growing exponentially, following Moore's law. Because mkdom2 has quadratic complexity, its processing time grows faster than the databanks themselves, and we need to ensure the long-term usability of this type of algorithm. Grid computing is a natural way to provide the necessary computing resources. Therefore, the proposed algorithms will have to be parallel algorithms designed to run on heterogeneous and distributed platforms such as Grids.
In particular, special care will have to be taken with synchronization, communication, and data management.

The proposed subject requires knowledge of both biology and computer science: on the one hand, all the designed algorithms must be biologically sound; on the other hand, they will be implemented and tested on large-scale distributed platforms such as Grid5000 [3]. Even if the subject description seems to deal more with the biological side of the project, the computer science part will be at least as demanding, as all algorithms will have to be carefully crafted to be frugal and scalable while preserving the quality of their output.

A detailed version of this subject can be found on the web page: http://graal.ens-lyon.fr/~fvivien/CORDI_GRAAL-HELIX.html

[1] F. Corpet, F. Servant, J. Gouzy, and D. Kahn. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Research, 28(1):267-269, 2000.
[2] J. Gouzy, F. Corpet, and D. Kahn. Whole genome protein domain analysis using a new method for domain clustering. Computers & Chemistry, 23(3-4):333-340, 1999.
[3] Grid5000: https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home