  1. General Functionality

    ChemMine is an integrated compound mining service and database. Its goal is to facilitate chemical genomics screens and to disseminate the generated knowledge. The database provides access to a wide variety of bioactive, natural and screening compounds from public and commercial providers. Their structures and functional annotations can be searched by chemical properties, substructure matches, structural similarities and biological activities. In addition to a comprehensive information retrieval system, ChemMine is also a cheminformatics service for analyzing the structural and chemical properties of lead compounds. This online service is available for compounds that are represented in the database and those provided by the user. The current set of online analysis tools includes structure-based clustering of compounds, generation of chemical descriptors, and various viewing and reformatting functionalities. To efficiently share the developed informatics resources with the community, the ChemMine project uses exclusively open access and open source technology.

    Figure (ChemMine flow): Data resources and web services offered by ChemMine

  2. Developer Team

  3. Compound Data Sets

    ChemMine currently contains over 5,800,000 compounds from public and commercial resources. A detailed list of all compound sets is available on the Compound Source Page. Upon request from users, we will upload additional compound sets into ChemMine. Commercial compounds can only be included after official approval by their providers. Users need to be aware that the annotation information available for these compound sets is extremely variable. For instance, chemical names, activities and literature are only available in isolated cases. Due to this limitation, structure similarity searches should be the standard approach for retrieving compounds of interest from the database. IUPAC names for all compound sets in ChemMine will be integrated once a software tool that can generate these names becomes publicly available.

  4. How to Use ChemMine

    1. Annotation Search Page (Demos: Surpin et al., 2005; drugs & targets)

      Annotation searches allow fast text-based searching of all annotation data associated with compounds. All fields support basic operations such as wildcards, ranges, grouping and boolean operations. Wildcards allow partial string matching: every "?" in the query matches any single letter and every "*" matches any number of letters, including zero. Ranges are written in the form "[A TO D]", which matches any string that sorts alphabetically between A and D, inclusive.
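
      The wildcard and range semantics described above can be illustrated with a short Python sketch. It is not ChemMine's search code; the function names `wildcard_match` and `in_range` are hypothetical, and case-sensitive comparison is an assumption.

```python
import re


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if text matches pattern, where '?' stands for exactly
    one character and '*' for any run of characters, including none.
    (The sketch accepts any character, not only letters.)"""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))  # treat everything else literally
    return re.fullmatch("".join(parts), text, flags=re.DOTALL) is not None


def in_range(low: str, high: str, text: str) -> bool:
    """Inclusive alphabetical range test, as in the query "[A TO D]"."""
    return low <= text <= high
```

      For example, `wildcard_match("amin*", "aminophenol")` and `in_range("A", "D", "D")` both hold, while `wildcard_match("chlor?", "chlor")` does not, because "?" must consume one character.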

      By default, all words and IDs are ORed together, so an ID search for "6000002 6000441" would return both compounds. To require both terms to match, use the keyword AND: "Br > 5 AND I > 2" would be a valid JOELib descriptor query.
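
      As a rough illustration of the default OR behaviour and the AND keyword, the sketch below evaluates a plain term query against the set of tokens attached to one compound. It is a simplification under stated assumptions: the helper `term_query_matches` is hypothetical, grouping and operator precedence are ignored, and terms are compared as exact strings.

```python
def term_query_matches(query: str, tokens: set) -> bool:
    """Evaluate a query such as "6000002 6000441" or "kinase AND inhibitor".

    Terms separated only by whitespace are ORed; groups separated by the
    keyword AND must each contribute at least one matching term.
    """
    clauses = [clause.split() for clause in query.split(" AND ")]
    return all(any(term in tokens for term in clause) for clause in clauses)
```

      With this sketch, the query "6000002 6000441" matches a compound whose only token is "6000441", whereas "6000002 AND 6000441" does not.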

      All the fields specified are ANDed together. The following fields are searchable:

      Compound ID
      Compound IDs can be queried singly or in batch mode by entering as many compound IDs as required on the 'Annotation Search Page'.
      Weight
      The molecular weight can be searched with simple operators: >, < and normal ranges. All range operations are inclusive, so "> 5" will match 5.
      Plate and Well
      The plate and well locations for the compounds stored at UCR are searchable. To search multiple plates, enter the plate numbers separated by spaces, for example "5 6 7". Well searches can be of the form "A5" to look for row "A" and column "5", "A" for anything in row "A" and "5" for anything in column 5.
      JOELib descriptors
      JOELib descriptors are available and can be searched with the >, < and = operators. Use the descriptor abbreviation in the query, for example: "LGP < 0 AND LGP > -0.3".
      All
      The All field searches every annotation, providing a catch-all search field.
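
      The well-query forms ("A5", "A", "5") and the inclusive weight operators described in the field list above can be sketched as follows. This is an illustration only; the function names are hypothetical, and the handling of lowercase row letters is an assumption.

```python
import re


def parse_well_query(query: str):
    """Split a well query into (row, column); either part may be None.

    "A5" -> ("A", 5), "A" -> ("A", None), "5" -> (None, 5)
    """
    text = query.strip()
    m = re.fullmatch(r"([A-Za-z]+)?(\d+)?", text)
    if not text or m is None:
        raise ValueError(f"not a well query: {query!r}")
    row, col = m.group(1), m.group(2)
    return (row.upper() if row else None, int(col) if col else None)


def well_matches(query: str, row: str, col: int) -> bool:
    """True if the well at (row, col) satisfies the query."""
    q_row, q_col = parse_well_query(query)
    row_ok = q_row is None or q_row == row.upper()
    col_ok = q_col is None or q_col == col
    return row_ok and col_ok


def weight_matches(weight: float, op: str, bound: float) -> bool:
    """Apply '>' or '<' inclusively, so "> 5" also matches a weight of 5."""
    if op == ">":
        return weight >= bound
    if op == "<":
        return weight <= bound
    raise ValueError(f"unsupported operator: {op!r}")
```

      A query of "A" therefore matches well A12, and `weight_matches(5, ">", 5)` is true because the comparison is inclusive.
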
    2. Similarity Searches (Demos: substructure search; similarity search)

      Structure similarity searches are the most important functionality for compound queries in databases. The following search functions are available for efficiently exploring the chemical space in ChemMine.

      1. Substructure Searches

        Substructure searches retrieve all molecules in a database that contain a user-defined query substructure, irrespective of the structural environment. On the result pages, the queried substructure is highlighted in color in the matching molecules.

      2. Similarity Searches

        Similarity searches are an alternative approach for finding similar molecules in databases. In contrast to substructure searches, this approach can retrieve molecules with similarities to a query structure without depending on perfect matches. The generated similarity scores allow ranking of the retrieved molecules by their degree of similarity to the query structure (nearest neighbor output). An improved 2D fragment-based algorithm from Chen & Reynolds (2002) is implemented in ChemMine. It can use either atom pairs or atom sequences as structural descriptors, and uses the Tanimoto coefficient as the similarity measure (Willett et al., 1998). The current search time for this tool is approximately 2 minutes for 1 million compounds. The C++ implementation of both search programs will be available for download soon.
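
        To make the scoring step concrete, here is a minimal Python sketch of the Tanimoto coefficient and a nearest-neighbor ranking. It is not the ChemMine C++ implementation: it assumes each compound has already been reduced to a set of fragment descriptors (for example atom pairs encoded as strings), and the names `tanimoto` and `nearest_neighbors` are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two descriptor sets: |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return len(a & b) / len(union)


def nearest_neighbors(query: set, library: dict, top_n: int = 10) -> list:
    """Rank library compounds by similarity to the query, best first.

    library maps a compound ID to its descriptor set. Ties are broken by
    compound ID so that the output order is deterministic.
    """
    scored = [(tanimoto(query, feats), cid) for cid, feats in library.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]
```

        Two compounds sharing two of four distinct descriptors score 0.5; identical descriptor sets score 1.0.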

      3. Molecule Drawing and Structure Formats

        To perform similarity searches, the query molecules need to be available in one of the standard structure formats. The SDF and SMILES formats are currently supported by the ChemMine interface. A structure string in SMILES format can also be generated by drawing a molecule with Peter Ertl's JME Molecular Editor and copying it into the ChemMine search page.

    3. Online Analysis Tools (Demo: Clustering and Descriptor Generation)

      Several online services are currently available for analyzing the structural and chemical properties of compounds. To utilize them, users can retrieve compounds from the database and send them interactively to the analysis page of the interface. Alternatively, users can provide their own compound structures to these analysis tools.

      1. Conversion of Structure Formats and Compound Viewing

        This very basic utility allows users to provide their own compounds in SDF or SMILES format, view the compound structure images in batches and pass them on to the other online services (see below). For reformatting purposes, the compounds can be saved in other

      2. Descriptor Generation

        Molecular descriptors provide textual and quantitative information about chemical properties of compounds. They can be very useful for prioritizing lead compounds, property clustering and basic QSAR analyses. Over forty different molecular descriptors are currently provided by the ChemMine interface either for user compounds or those contained in the database. The JOELib package is used for their calculation.

      3. Structure-Based Clustering

        Clustering of compounds by structural similarities is another powerful approach for correlating structural features of compounds with their activities. ChemMine provides facilities for hierarchical clustering and a simple binning approach by similarity cutoff values. The required distance matrices for hierarchical clustering are calculated by all-against-all comparisons of compounds using the fragment-based search tool (see above) and transforming the generated similarity scores into distance values. The resulting trees are presented on the web interface in interactive mode using an internally developed tree viewing program. The compound identifiers in the trees are hyperlinked to the corresponding structure images of the compounds. To simplify the analysis of this output, the compound structures are sorted in the same order as they appear in the tree. In addition, the descriptors of the clustered compounds can be displayed in the same order for generating simple structure-activity tables in local spreadsheet programs.
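
        The two ideas in this paragraph, turning similarity scores into distances and binning by a similarity cutoff, can be sketched in Python. The text does not specify the exact transformation or binning rule, so both are assumptions here: distance is taken as 1 minus similarity, and bins are formed by single linkage (compounds connected through any chain of pairs at or above the cutoff share a bin). All function names are hypothetical.

```python
def similarity_to_distance(similarity: float) -> float:
    """Assumed transformation: a similarity in [0, 1] becomes 1 - similarity."""
    return 1.0 - similarity


def bin_by_cutoff(ids, similarity, cutoff: float) -> list:
    """Group compound IDs into bins using a similarity cutoff.

    similarity is a function taking two IDs and returning their score.
    Any pair scoring >= cutoff is merged into the same bin (single linkage),
    implemented with a small union-find structure.
    """
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the bin representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All-against-all comparison of the compounds.
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if similarity(a, b) >= cutoff:
                parent[find(a)] = find(b)

    bins = {}
    for i in ids:
        bins.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in bins.values())
```

        With three compounds where only the pair (a, b) scores above the cutoff, the result is two bins: one holding a and b, and one holding c alone.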

    4. Result List Page

      The query results in ChemMine are structured into two different levels: an initial 'Result List Page' and a more detailed 'Result Annotation Page'. The initial Result List Page allows the selection of compounds via check boxes and sorting of the table content in various ways by clicking on the column titles. The descriptors of the retrieved compounds can be displayed on the same page after clicking the 'Generate JOELib Descriptor' link. The structures of the retrieved compounds can be downloaded from this level in batch format.

    5. Result Annotation Page

      The Result Annotation Page represents the deepest and most detailed level of information retrieval. It provides the following information:

      1. Color images of the compound structures.
      2. Substructure highlighting in compound images after a substructure search.
      3. Download of the compound structures in SDF, SMILES, MOL, PDB and many other formats.
      4. Query for compounds with similar structures by clicking on the link 'find similar'.
      5. External annotations that are provided by the compound supplier.
      6. The following fields are only available for registered users:
        1. Annotation data from internal screening results including links to additional data and image files.
        2. Links for associating screening data with compounds. This 'manual' upload function is an alternative to the batch upload function on the side menu.
        3. Edit and delete functions for screening data. This requires the password of the user who has provided the annotations.
    6. Precomputed Clusters

      To easily identify clusters of similar compounds in entire compound sets, ChemMine contains precomputed cluster tables. The compounds of identified clusters can be retrieved by clicking on the hyperlinks in the summary table. This data representation is particularly useful for identifying structural redundancy in screening libraries. Commercial libraries will only be included on this public page after official approval by the supplier.

    7. Upload of Screening Data

      Registered users can upload and search their screening data after clicking the 'Login' link. The field 'Batch Upload of Screening Data' allows users to manage their screening annotations in an Excel spreadsheet and to upload these annotations for many compounds in a single step. Data files from analytical instruments and image data can be included as well. To use this function, users add their annotations according to the instructions provided in the Excel Template file, save the content to a tab-delimited text file and upload this file to the database. The upload is only permitted for registered users. The Registration link on the upload page allows participants to register themselves. An alternative annotation function in a one-by-one mode is available on the Result Annotation Page (see above). Annotations can be edited and deleted by their owners at any time.
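
      For users who prepare annotations programmatically rather than in a spreadsheet, a tab-delimited text file can be written with Python's standard `csv` module, as sketched below. The column names `compound_id` and `activity` are placeholders only; the actual column layout is defined by the Excel Template file and is not reproduced here.

```python
import csv
import io


def write_tab_delimited(rows, fieldnames, handle) -> None:
    """Write a header line and one line per row, separated by tabs.

    rows is an iterable of dicts keyed by the names in fieldnames;
    handle is any writable text file object.
    """
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Usage with placeholder columns (not the real template layout):
buffer = io.StringIO()
write_tab_delimited(
    [{"compound_id": "6000002", "activity": "active"}],
    ["compound_id", "activity"],
    buffer,
)
```

      To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`.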

  5. JOELib Descriptors

    The following JOELib descriptors are generated and searchable: