Advisors: Frédéric Vivien (60%) and Daniel Kahn (40%)
Contact:  Frederic.Vivien@ens-lyon.fr, kahn@biomserv.univ-lyon1.fr


Automated identification and clustering of protein domains


The study of proteins and their properties is essential for medicine,
agriculture, and biological research. Proteins are composed of one or
more building blocks, known as domains. Decomposing a protein into a
sequence of domains helps us understand its structure and function.
Several research efforts have provided
identification systems for protein domains. As the number of known
protein sequences is already large (more than 2 million, fewer than 10%
of which have been manually curated) and growing steadily, only
automated methods can hope to be exhaustive.

ProDom [1] is a protein domain family databank automatically built
from protein sequence databanks using the mkdom2 algorithm
[2]. Currently, mkdom2 takes more than six months to process the
UniProt protein sequence databank on a single conventional computer.
A parallel version of mkdom2 is under development. The aim of
the proposed thesis is to assess the quality of the version of ProDom
produced by this parallel version, and then to revise the core
algorithm using techniques such as domain boundary delineation
heuristics, multiple alignment algorithms, machine learning
approaches, or graph theory. The student will have to adapt
existing algorithms, design new ones, or both. The size of protein
databanks is growing exponentially, at a pace comparable to Moore's
law. Since mkdom2 has quadratic complexity, its running time grows even
faster, which threatens the long-term usability of this type of
algorithm. Grid computing is a natural way to provide the necessary
computing resources. Therefore, the proposed algorithms will have to be
parallel algorithms designed to run on heterogeneous and distributed
platforms such as Grids. In particular, special care must be taken with
synchronization, communication, and data management.
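
To make the scaling argument above concrete, here is a minimal sketch
in Python. Only the six-month baseline and the quadratic complexity
come from the text; the growth factors, the ideal linear speedup, and
the function names are illustrative assumptions, not measurements of
mkdom2.

```python
import math

# Baseline from the text: more than six months on a single computer.
BASELINE_MONTHS = 6.0


def projected_runtime_months(growth_factor, baseline_months=BASELINE_MONTHS):
    """Single-machine runtime after the databank grows by `growth_factor`,
    assuming running time scales with the square of the databank size."""
    return baseline_months * growth_factor ** 2


def processors_needed(runtime_months, target_months):
    """Processors required to finish within `target_months`.

    Assumes ideal linear speedup, which is optimistic, so the result
    should be read as a lower bound.
    """
    return math.ceil(runtime_months / target_months)


if __name__ == "__main__":
    # Hypothetical growth factors for the databank size.
    for growth in (1, 2, 4, 8):
        runtime = projected_runtime_months(growth)
        print(growth, runtime, processors_needed(runtime, target_months=1.0))
```

Under these assumptions, doubling the databank quadruples the
single-machine running time, from six months to two years, so the
number of processors needed to keep a fixed turnaround time also grows
quadratically with the databank size.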

The proposed subject requires knowledge of both biology and computer
science: on the one hand, all the designed algorithms must be
biologically sound; on the other hand, they will be implemented and
tested on large-scale distributed platforms such as Grid5000 [3].
Although the subject description may seem to focus more on the
biological side of the project, the computer science part will be at
least as demanding, as all algorithms will have to be carefully crafted
to be thrifty and scalable while preserving the quality of their output.

A detailed version of this subject can be found on the web page:
http://graal.ens-lyon.fr/~fvivien/CORDI_GRAAL-HELIX.html


[1] ProDom and ProDom-CG: tools for protein domain analysis and whole
genome comparisons, by Florence Corpet, Florence Servant, Jérôme Gouzy,
and Daniel Kahn, Nucleic Acids Research, 28(1): 267-269, 2000.

[2] Whole genome protein domain analysis using a new method for domain
clustering, by J. Gouzy, F. Corpet, and D. Kahn, Computers & Chemistry,
23(3-4): 333-340, 1999.

[3] https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home