Comparative genome and protein family classification

Category:Data Analysis, Bioinformatics
Supervisor:Balaji Rajashekar
Abstract:The main task of this project is to classify protein families across genomes. The genome of each species encodes thousands of proteins, many of which share sequence similarity with proteins in other organisms. We can use this similarity to cluster all proteins into families and then look for known and unknown functions. The classified protein families can be integrated with functional information from Gene Ontology and Pfam, which gives additional support to their annotation. They can also be linked to gene expression data.


  • For this project you will use some of the existing tools and also write your own programs.
  • The main challenges will be handling millions of protein sequences and clustering them with the Markov clustering algorithm (TRIBE-MCL).
  • You should also visualize the results using a popular graph visualization tool.
  • The results will be displayed as a web-browsable database.
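
To give a feel for the clustering step, here is a minimal sketch of Markov clustering on a tiny, made-up similarity matrix. It is a pure-Python toy and not the TRIBE-MCL tool itself: the function names, the inflation value of 2.0, the thresholds, and the similarity scores are all assumptions chosen for illustration. In a real run the scores would come from all-against-all sequence comparison, and millions of proteins would need sparse matrices and an optimized implementation.

```python
# Toy Markov clustering (MCL) in pure Python. Illustrative only:
# all names, parameters, and scores here are assumptions.

def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in m])

def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a symmetric similarity matrix; returns sorted lists of indices."""
    n = len(similarity)
    # Add self-loops, then convert the scores to transition probabilities.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:  # matrix stopped changing
            break
    # Each row that still holds probability mass defines one family:
    # the columns with non-negligible entries in that row are its members.
    families = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            families.add(members)
    return sorted(sorted(f) for f in families)

# Hypothetical scores: proteins 0-2 are similar to each other, as are 3-4.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
families = mcl(similarity)
print(families)  # [[0, 1, 2], [3, 4]]
```

A higher inflation value generally produces smaller, tighter families, so it is the main parameter to explore when tuning the classification.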


  • C. Frech and N. Chen, “Genome-Wide Comparative Gene Family Classification,” PLoS ONE 5, no. 10 (2010): e13409.
