Center for Plant Cell Biology
Systomics Network
Genome Cluster Database (GCD)

About GCD

  1. General Scope of this Database
    The Genome Cluster Database (GCD) is an integrated mining tool for the genome-wide family and singlet proteins of Arabidopsis thaliana and Oryza sativa ssp. japonica. Their proteomes have been clustered into families by two independent approaches: the program BLASTCLUST was used for similarity-based clustering (BCL), and hmmpfam searches were used for domain-based clustering (HCL). Since the two methods have complementary advantages and disadvantages, both cluster sets have been integrated into GCD together with tools to mine them jointly. Additional well-annotated genomes may be included in this clustering pipeline in the future. The GCD interface provides query and visualization functions for intra- and inter-species protein family comparisons, and for the retrieval of sequences, multiple alignments, phylogenetic trees and information about putative orthologs from other kingdoms.

    Data Flow in GCD

  2. Limitations of Data Sets
    Cluster, Alignment and Tree Data
    Accurate clustering of entire proteomes is a complex task. Currently, GCD provides data of high quality for most but not all families. Due to this limitation, clusters and trees in GCD should only be used after quality inspection of the corresponding alignments using the available consensus and domain shading tools.

  3. Search Functions
    Basic Searches in Single or Batch Mode
    Database searches can be performed against the following five field categories:
    1. Functional descriptions (e.g. 'desaturase AND fatty acid'). Boolean query connectors can be included here: 'AND'   'OR'   'NOT'.
    2. Cluster names (e.g. 'oxidoreductase activity')
    3. One or many locus IDs from Arabidopsis or rice (e.g. 'At1g01190   At3g62720   9631.m01858')
    4. Cluster or Pfam ID numbers (e.g. '53' or 'PF00067')
    5. Gene Ontology keys (e.g. 'GO:0019825')
    Before submitting a query, the correct search category needs to be selected in the drop-down menu at the bottom of the search page. The maximum number of query hits can be specified in a separate field. In addition, any search can be restricted to one organism by de-selecting the other. A 'Loop Query' system on the resulting List Page allows quick retrieval of all members of a family of interest by clicking on the organism distribution links (e.g. '7 Ath   8 Osa'). Similarly, all proteins containing a Pfam domain of interest can be retrieved by clicking on its link under the domain cluster ID. This action loops through the Advanced Search Page.
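    The identifier-based search categories above can be told apart by their syntax. The following Python sketch guesses a category from a query token; the patterns are assumptions inferred only from the example identifiers given above (e.g. 'At1g01190', '9631.m01858', 'PF00067', 'GO:0019825', '53') and the function names are hypothetical, not part of GCD.

```python
import re

# Hypothetical patterns inferred from the example identifiers in the text;
# the validation rules actually used by GCD are not documented here.
PATTERNS = [
    ("go_key", re.compile(r"^GO:\d{7}$")),
    ("pfam_id", re.compile(r"^PF\d{5}$")),
    ("locus_id", re.compile(r"^(At[1-5CM]g\d{5}|\d+\.m\d+)$", re.IGNORECASE)),
    ("cluster_id", re.compile(r"^\d+$")),
]


def guess_category(token: str) -> str:
    """Return the first category whose pattern matches, else 'text'."""
    token = token.strip()
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return "text"  # functional descriptions and cluster names


def split_batch(query: str) -> list[str]:
    """Split a whitespace-separated batch of identifiers into tokens."""
    return query.split()


if __name__ == "__main__":
    for tok in split_batch("At1g01190   At3g62720   9631.m01858 PF00067 53"):
        print(tok, guess_category(tok))
```

    In GCD itself the category is chosen by the user in the drop-down menu, so a check of this kind would only serve to catch a mismatch between the selected category and the entered query.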

    Advanced Searches
    Combinatorial queries of expandable complexity can be constructed on the Advanced Search Page. Several predefined queries are available here to retrieve organism-restricted clusters within certain size intervals.

    Cluster Table Search
    A searchable and sortable cluster table enables family mining by cluster size, cluster name and family ID. The clustering method used to generate a cluster is indicated in the table by the type of its 'Family ID'. Clusters generated with BLASTCLUST (BCL) have plain numbers, while domain-based clusters (HCL) follow the Pfam ID syntax, e.g. 'PF00026'.
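    As a small illustration of this convention, the sketch below derives the clustering method from a Family ID. It assumes that BCL family IDs are plain integers (such as the cluster ID example '53') and that HCL family IDs are Pfam accessions of the form 'PF' plus five digits; the function name is hypothetical.

```python
import re


def cluster_method(family_id: str) -> str:
    """Return 'HCL' for Pfam-style IDs and 'BCL' for plain numeric IDs."""
    family_id = family_id.strip()
    if re.fullmatch(r"PF\d{5}", family_id):
        return "HCL"  # domain-based cluster, named after its Pfam accession
    if family_id.isdigit():
        return "BCL"  # similarity-based cluster with a plain number
    raise ValueError(f"unrecognized Family ID: {family_id!r}")


if __name__ == "__main__":
    for fid in ("PF00026", "53"):
        print(fid, cluster_method(fid))
```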

  4. Result List Pages
    General
    All of the above query types return a Result List page that lists the A. thaliana members at the top and the O. sativa members at the bottom. The cluster membership of each entry is given by its cluster identifier (ID), and the cluster sizes are given by the cluster size links, e.g. '7 Ath 15 Osa' stands for a cluster with 7 A. thaliana and 15 O. sativa members. The result statistics at the top of the page list the number of loci, gene models and clusters returned by a query. To restrict a query to a protein family of interest, users can click on the organism distribution links (e.g. '7 Ath   8 Osa'). This action sends the corresponding query syntax back to the main page, which returns the requested entries upon submission.
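    For users who copy result lists into their own scripts, the cluster size labels are easy to parse. This is a minimal sketch that assumes the label always consists of count/organism-code pairs as in the examples above; the function name is hypothetical.

```python
import re


def parse_size_link(label: str) -> dict[str, int]:
    """Turn a label such as '7 Ath 15 Osa' into {'Ath': 7, 'Osa': 15}."""
    pairs = re.findall(r"(\d+)\s+([A-Za-z]+)", label)
    return {organism: int(count) for count, organism in pairs}


if __name__ == "__main__":
    sizes = parse_size_link("7 Ath 15 Osa")
    print(sizes, "total members:", sum(sizes.values()))
```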

    Sequence Retrieval
    The following sequence types can be retrieved in batches for any query result by selecting their check-boxes: proteins, CDSs, transcriptional models, UTRs, intergenic and putative promoter regions. The default sequence view is HTML format. Text based retrieval in FASTA format can be activated by making the corresponding selection in the adjacent drop-down menu.

    Alignments and Trees
    Multiple alignments and trees are currently available for all BCL_35% clusters and the HCL clusters. If requested by users, they can also be provided for the less sensitive BCL clustering using cutoffs of 50% and 70% sequence identity. The consensus shading view highlights conserved residues and the domain shading view identifies the Pfam domains in all members of a cluster. The domain viewer can be extremely useful for evaluating the quality of an alignment and localizing the functional regions in the context of a multiple alignment. All alignments are generated with a local installation of the MultAlin program. The obtained alignments for each family are used to calculate phylogenetic trees with the PHYLIP package. A distance-based neighbor-joining method has been chosen for this step that uses PROTDIST with the 'categories model' setting for generating distance matrices, NEIGHBOR for tree construction and the midpoint method in RETREE for defining root positions.
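    The idea behind consensus shading can be sketched in a few lines of Python. The example scores each alignment column by the fraction of sequences that share the most common residue and marks columns above a threshold. The scoring rule, the threshold of 0.8 and the function names are assumptions made for illustration; they are not the consensus rules used by MultAlin or by the GCD shading tool.

```python
from collections import Counter


def column_conservation(alignment: list[str], gap: str = "-") -> list[float]:
    """Fraction of sequences sharing the most common residue, per column."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        if not residues:
            scores.append(0.0)  # column consists of gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))  # gaps lower the score
    return scores


def shade(alignment: list[str], threshold: float = 0.8) -> str:
    """Mark conserved columns with '*' and all others with '.'."""
    return "".join(
        "*" if score >= threshold else "."
        for score in column_conservation(alignment)
    )


if __name__ == "__main__":
    toy = ["MKV-A", "MKI-A", "MRVGA"]  # toy alignment, not real data
    print(shade(toy))
```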

    GO-Based Family Naming
    Functional names were assigned to clusters with an automated strategy that is based on the available Gene Ontology (GO) annotations from TIGR. The approach uses the deepest and most common Molecular Function GO term for assigning a single name to a family. Typically, this method provides useful names for many but not all families. The results will improve over time with updates in the available GO annotations of the two organisms.
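    One possible reading of this naming rule is sketched below in Python. It assumes that each member carries only its direct Molecular Function annotations, that the most frequent term wins, and that term depth in the ontology breaks ties. This ordering, the function name and the depth values in the example are assumptions for illustration; the exact rule used in GCD is not specified here.

```python
from collections import Counter


def name_family(member_terms, term_depth, term_name):
    """Pick one GO term as family name: most common first, deepest on ties.

    member_terms: dict mapping locus ID -> iterable of GO IDs
    term_depth:   dict mapping GO ID -> depth in the ontology
    term_name:    dict mapping GO ID -> human-readable term name
    """
    counts = Counter()
    for terms in member_terms.values():
        counts.update(set(terms))  # count each term once per member
    if not counts:
        return None  # no annotated member, so no name can be assigned
    best = max(counts, key=lambda t: (counts[t], term_depth.get(t, 0), t))
    return term_name.get(best, best)


if __name__ == "__main__":
    # Toy family; locus IDs and depth values are illustrative only.
    members = {
        "locus_a": ["GO:0016491", "GO:0003824"],
        "locus_b": ["GO:0016491", "GO:0003824"],
        "locus_c": ["GO:0019825"],
    }
    depth = {"GO:0003824": 1, "GO:0016491": 2, "GO:0019825": 2}
    names = {
        "GO:0003824": "catalytic activity",
        "GO:0016491": "oxidoreductase activity",
        "GO:0019825": "oxygen binding",
    }
    print(name_family(members, depth, names))
```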

    Online Batch Viewing and Analysis Tools
    The link bar at the top of the Result List Pages provides access to several online batch viewing and analysis tools. A Gene Structure viewer displays the UTR-exon-intron structure of all retrieved A. thaliana and O. sativa entries on one page. A chromosome mapping tool visualizes the positions of the retrieved entries on their chromosomes. Gene Ontology pie charts for the Molecular Function category can also be displayed for both organisms. An online hmmalign service allows users to generate multiple alignments for any sequence set of interest against a chosen Pfam domain.

  5. Sequences in GCD
    1. Protein sequences used for family clustering
      • A. thaliana: TAIR
      • O. sativa ssp. japonica: TIGR
    2. Chromosome annotations for feature download and viewing
      • A. thaliana: TAIR
      • O. sativa ssp. japonica: TIGR
    3. Ortholog identification in other kingdoms
      • UniProt (indirect hyperlink access through GCD)

  6. Proteome Clustering
    The protein sequences in GCD are clustered by two independent approaches:
    1. BLAST-based Similarity Clustering (BCL)
      The BLASTCLUST program from NCBI was used to cluster the proteins by sequence similarity. Threshold parameters of 50% overlap and 35% sequence identity were used for high-sensitivity clustering (BCL_35%). Most resources for comparing the two approaches (e.g. the Table and Stats pages) were generated with this BCL_35% cluster set. Two more stringent cluster sets with 50% and 70% identity were generated for identifying sub-clusters on the Result List pages (BCL_50%, BCL_70%). Prior to the similarity clustering, low-complexity regions of the proteins were masked with the CAST program.
    2. HMM Domain Arrangement Clustering (HCL)
      To cluster the proteins based on their domain signatures, their Pfam domains were identified with hmmpfam searches against the latest Pfam_ls HMM library. Subsequently, the proteins were clustered with a custom Perl script based on the order of their identified domains, using an HMM E-value cutoff of ≤0.1.
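    The two clustering ideas can be illustrated with a short Python sketch. The first function performs single-linkage clustering over precomputed pairwise hits, which is the general strategy BLASTCLUST follows; the second groups proteins by their ordered Pfam domain arrangement. Both functions, their names and their input tuple layouts are hypothetical simplifications; they are neither the BLASTCLUST implementation nor the custom Perl script used for GCD.

```python
from collections import defaultdict


def single_linkage(proteins, hits, min_identity=35.0, min_overlap=0.5):
    """Single-linkage clusters from (a, b, percent_identity, overlap) hits."""
    parent = {p: p for p in proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, overlap in hits:
        if identity >= min_identity and overlap >= min_overlap:
            parent[find(a)] = find(b)  # one passing hit links two clusters

    clusters = defaultdict(list)
    for p in proteins:
        clusters[find(p)].append(p)
    return sorted(sorted(members) for members in clusters.values())


def domain_arrangement_clusters(domain_hits, max_evalue=0.1):
    """Group proteins by ordered domains from (protein, pfam, start, evalue)."""
    per_protein = defaultdict(list)
    for protein, pfam_id, start, evalue in domain_hits:
        if evalue <= max_evalue:
            per_protein[protein].append((start, pfam_id))
    clusters = defaultdict(list)
    for protein, domains in per_protein.items():
        arrangement = tuple(pfam for _, pfam in sorted(domains))
        clusters[arrangement].append(protein)
    return {key: sorted(members) for key, members in clusters.items()}


if __name__ == "__main__":
    # Toy data, not taken from GCD.
    proteins = ["p1", "p2", "p3", "p4"]
    hits = [("p1", "p2", 72.0, 0.8), ("p2", "p3", 40.0, 0.6),
            ("p3", "p4", 30.0, 0.9)]
    print(single_linkage(proteins, hits))                     # 35% identity
    print(single_linkage(proteins, hits, min_identity=70.0))  # sub-clusters
    domains = [("p1", "PF00067", 10, 1e-20), ("p2", "PF00067", 5, 1e-10),
               ("p3", "PF00026", 3, 1e-5), ("p3", "PF00067", 100, 0.5),
               ("p4", "PF00026", 50, 1e-8), ("p4", "PF00067", 1, 1e-3)]
    print(domain_arrangement_clusters(domains))
```

    Raising the identity cutoff in the toy data splits the large cluster into smaller ones, which mirrors how the 50% and 70% sets expose sub-clusters of the 35% set. In this sketch, proteins without any domain hit passing the E-value cutoff are simply omitted from the domain-based result.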

  7. Cluster Statistics Page
    A Cluster Statistics Page has been implemented to summarize and track the cluster results of the two species. It provides the size and number of singlet and family proteins for the two clustering methods.
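    The kind of summary described here can be computed from a list of cluster sizes. The sketch below assumes that a singlet is a cluster with exactly one member and a family is a cluster with two or more; the function and key names are hypothetical.

```python
def cluster_stats(cluster_sizes):
    """Summarize singlets and families from a list of cluster sizes."""
    singlets = sum(1 for size in cluster_sizes if size == 1)
    families = [size for size in cluster_sizes if size > 1]
    return {
        "singlet_proteins": singlets,
        "families": len(families),
        "family_proteins": sum(families),
        "largest_family": max(families, default=0),
    }


if __name__ == "__main__":
    # Toy cluster sizes, not real GCD statistics.
    print(cluster_stats([1, 1, 3, 7, 1, 2]))
```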

  8. Automated Update Strategy
    All data upload and clustering steps have been automated in GCD with Perl scripts to allow rapid updates when new versions of the genome annotations or the Pfam domain database are released. Changes in the results will be tracked on the above Cluster Statistics Page.


   Thomas Girke, UC Riverside, Email: thomas.girke@ucr.edu