EUChinaGRID

application repository

  • name: ROSETTA
  • domain: Biology/genomics
  • country: Italy
  • author: P. Luisi, F. Polticelli et al
  • institute: Università di Roma Tre - Dipartimento di Biologia
  • contacts: polticel@uniroma3.it
  • description: the application aims to create a large database of “never born proteins”, i.e. random polypeptide sequences with no significant homology to existing (natural) proteins, and to predict their three-dimensional structure using the ROSETTA software.
  • functionalities: The application includes two modules:
    • Generation of a database - the module generates random sequences of fixed length (70 amino acids) in batches of 10000 sequences per file in FASTA format. Each sequence is then compared to the non-redundant protein database (available at http://www.ncbi.nlm.nih.gov/) using the BLAST algorithm, which is embedded in the software application, and sequences that display significant homology with natural proteins of comparable length are deleted from the database. [This application is written in C++ and, given its relatively low computational power requirements, is currently run locally]
    • Ab Initio Protein folding - the program predicts the three-dimensional structure of the never born proteins collected in the DB using the ROSETTA software. Currently only the main and most demanding software application has been successfully adapted to the GRID environment. In addition, a parametric job submission procedure has been set up to run predictions on a large number (on the order of hundreds to thousands) of never born protein sequences at the same time.
    • middleware requirements:
      • Licensed software usage - ROSETTA software is distributed under a licence that allows only members of the licensed group to use it. Licensed software must therefore be treated as a grid resource with access limited to authorised users. [See also: EGEE PTF #100540. (Status: none)]
      • Shared jobs control - To distribute control of the computation among several persons when it covers a large number of jobs, the application needs to control the access rights to jobs (list, cancellation and get-output access rights) for different users of the same VO. [See also: EGEE PTF #100809. (Status: none)]
      • Job status retrieval - when dealing with more than 10^7 jobs in batches of 10^4, losing a job implies reorganizing the sequence batches and resubmitting the lost jobs. It is thus critical that a user can retrieve the status of the jobs he/she submitted with reference to the input files that generated those jobs. [See also: EGEE PTF #100536 (Status: satisfied)]
    • resource requirements: As the first module is currently run locally, the requirements refer to the second one only. The main software package is written in FORTRAN and requires several input files generated by additional programs. In detail, a file describing the predicted secondary structure of a given sequence is generated using the program PSIPRED. On the basis of the predicted secondary structure of the sequence of interest, a Perl script extracts two sets of fragments (of length 3 and 9 residues) from a large database of fragments generated by pre-processing the Protein Data Bank (PDB, www.rcsb.org).
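
The first step of the database-generation module described above (random sequences of fixed length written in FASTA batches) can be illustrated with a minimal Python sketch. This is not the project's C++ code: the function names, the header scheme and the file prefix are hypothetical, and drawing residues uniformly from the 20 standard amino acids is an assumption, since the entry does not state which residue frequencies are used. The BLAST comparison against the non-redundant database is not covered.

```python
import random

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequence(length=70, rng=random):
    """Return one random polypeptide sequence.

    Assumption: residues are drawn uniformly and independently.
    """
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def fasta_batch(n_sequences, length=70, batch_index=0, rng=random):
    """Build the text of one FASTA file holding n_sequences random sequences."""
    records = []
    for i in range(n_sequences):
        # Hypothetical header scheme: batch index plus position in the batch.
        header = f">nbp_{batch_index:04d}_{i:05d}"
        records.append(f"{header}\n{random_sequence(length, rng)}\n")
    return "".join(records)


def write_batches(n_batches, batch_size=10000, length=70,
                  prefix="nbp_batch", seed=None):
    """Write n_batches FASTA files and return their paths."""
    rng = random.Random(seed)  # a fixed seed makes the batches reproducible
    paths = []
    for b in range(n_batches):
        path = f"{prefix}_{b:04d}.fasta"
        with open(path, "w") as handle:
            handle.write(fasta_batch(batch_size, length, b, rng))
        paths.append(path)
    return paths
```

Calling `write_batches(2, seed=1)` would write two files of 10000 sequences each. Because every header encodes the batch and the position within it, a result can be traced back to its input file, which is the kind of bookkeeping the job status retrieval requirement asks for.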

