Szilárd Dóránt May 2006 Building on JChem Base. Contents Introduction Structural overview The Property Table JChem structure tables The log table Standardization.

Szilárd Dóránt

May 2006

Building on JChem Base

Contents

Introduction

Structural overview

The Property Table

JChem structure tables

The log table

Standardization

Memory considerations

The search process

Performance tips

Duplicate filtering

Displaying hits

API examples

JSP example

Upgrading JChem

Future plans

Introduction

• JChem Base provides high-performance, Java-based tools for the storage, search and retrieval of chemical structures and associated data.

• These components can be integrated into web-based or standalone applications in association with other ChemAxon tools.

Structural overview

Layers, from top to bottom:

• Application, or web application accessed through a web browser

• JChem Base API: chemical logic, structure cache

• JDBC driver: standard interface to the RDBMS

• RDBMS (e.g. Oracle, MySQL, etc.): storage and security

The Property Table

The property table stores information about JChem structure tables, including:

• Fingerprint parameters

• Custom standardization rules

• Other table options and information

• Database-related license keys

More than one property table can be used; each property table represents a separate JChem environment.

The structure of JChem tables

Column name     Explanation

cd_id           unique numeric identifier in the table

cd_structure    the imported structure in the original format, without modifications (except for the removal of data fields)

cd_smiles       the standardized structure in ChemAxon Extended SMILES (cxsmiles) format, used by the search process

cd_formula      the formula of the standardized structure

cd_molweight    the molecular weight of the standardized structure

cd_hash         hash code used for duplicate filtering (PERFECT search)

cd_flags        can store row-specific options, e.g. overriding the chiral flag

cd_timestamp    the date and time of the insertion of the row

cd_fp…          fingerprint columns

[user fields]   custom data fields can be added by the user

The log table

• For efficient cache update it is essential to keep track of modifications to the table

• A log table (“<table_name>_UL”) keeps track of the modifications (insert / update / delete) performed through the JChem API.

• To limit the number of log entries, old rows are deleted right after the cache update (at the beginning of a database search).

• Because of this cleanup, the DELETE right is needed for searches.

• An option sets the number of log rows to preserve.

Standardization

• Essential for the graph search algorithm

• A basic standardization is provided as default

• A custom standardization can be specified for each table. The setting is saved in the Property Table.

• “Set once and forget”. Automatically utilized during:

– Import (chemaxon.jchem.db.Importer)

– Insert (chemaxon.jchem.db.UpdateHandler)

– Update (chemaxon.jchem.db.UpdateHandler)

– Search (chemaxon.jchem.db.JChemSearch)

– Regeneration (chemaxon.jchem.db.UpdateHandler)

Memory

Quick facts:

• The default JVM heap size is 64 MB

• On 32 bit systems no more than ~2 GB memory can be allocated to a single process

• 2 GB can hold approximately 20 million small structures

• JChem caches only whole structure tables (not rows)

[Figure: total memory need, made up of the Structure Cache (Table 1, Table 2, Table 3) and temporary allocations (JChem Base, application)]

Memory: Structure Cache

• The Structure Cache stores structures and fingerprints in a highly optimized, compact form

• Memory need is dependent on:

– The number of structures

– The average size of the structures

– The size of the fingerprint used

• Approximately 100 MB is needed for 1 million drug-like (small) structures (using 512 bit fingerprints)

• The exact cache size can be retrieved from the API via:

chemaxon.jchem.db.JChemSearch.getCachedTables()
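For planning before a table is loaded, the rule of thumb above can be turned into a rough heap-sizing check. The following is only a sketch in plain Java: the class and method names are hypothetical (not part of JChem), and it assumes that cache size scales linearly with the number of rows, which the figures above suggest but do not guarantee.

```java
// Rough heap-sizing helper based on the rule of thumb quoted above.
// CacheEstimate and its methods are hypothetical names, not part of the JChem API.
class CacheEstimate {
    // ~100 MB per 1 million small structures with 512 bit fingerprints (from the text).
    static final long MB_PER_MILLION = 100;

    /** Estimated structure cache size in MB, assuming linear scaling with row count. */
    static long estimateCacheMb(long structureCount) {
        return structureCount * MB_PER_MILLION / 1_000_000L;
    }

    /** True if the estimate plus a reserve for temporary allocations fits in the heap. */
    static boolean fitsInHeap(long structureCount, long reserveMb, long maxHeapMb) {
        return estimateCacheMb(structureCount) + reserveMb <= maxHeapMb;
    }

    public static void main(String[] args) {
        // Maximum heap the current JVM will try to use (controlled by -Xmx).
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        long rows = 5_000_000L;
        System.out.println("Estimated cache: " + estimateCacheMb(rows) + " MB");
        System.out.println("Fits with 256 MB reserve: " + fitsInHeap(rows, 256, maxHeapMb));
    }
}
```

The estimate is only a starting point; the real cache size should still be read back from the API once the tables are cached.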

Memory: temporary allocations

• Not directly related to the size of structure tables

• Increases with the number of parallel operations

• Also includes the memory needed for:

– your application routines

– the running environment (application server)

• Cannot be predicted exactly: performing a stress test is recommended

• Specify the amount of memory JChem should not use for the Structure Cache:

chemaxon.jchem.db.JChemSearch.setMinNonCachedMemory()

The search process

A two stage method provides optimal search performance:

1. Rapid pre-screening reduces the number of possible hit candidates

- Chemical Hashed Fingerprints are used for substructure and superstructure searches

- Hash code is used for duplicate filtering (usually during compound registration)

2. Graph search algorithm is used to determine the final hit list
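The two stages can be illustrated with a small, self-contained sketch. It is not JChem's implementation: fingerprints are modeled as java.util.BitSet objects, the class and method names are hypothetical, and the expensive graph search is represented by a caller-supplied predicate. For a substructure query, a target can only match if every bit set in the query fingerprint is also set in the target fingerprint.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative two-stage search; hypothetical names, not the JChem API.
class TwoStageSearch {
    /** Screening test: every bit set in the query must also be set in the target. */
    static boolean passesScreen(BitSet queryFp, BitSet targetFp) {
        BitSet missing = (BitSet) queryFp.clone();
        missing.andNot(targetFp); // bits the query has but the target lacks
        return missing.isEmpty();
    }

    /**
     * Returns the indexes of targets that pass both stages.
     * exactMatch stands in for the expensive graph search on one candidate
     * and is only called for targets that survive the screen.
     */
    static List<Integer> search(BitSet queryFp, List<BitSet> targetFps, IntPredicate exactMatch) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < targetFps.size(); i++) {
            if (passesScreen(queryFp, targetFps.get(i)) && exactMatch.test(i)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

The screen can produce false positives (candidates that the second stage rejects) but, by construction, never discards a target whose fingerprint contains all query bits.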

Performance: fingerprints

• The number of screened structures and the number of hits should be close

• Search statistics can be obtained via:

JChemSearch.setInfoToStdError(true)

• Fingerprints should not be too “dark”

• Statistics can be obtained by invoking:

jcman s <table_name>

See: http://www.chemaxon.com/jchem/doc/user/fingerprint.html
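As a rough illustration of what “dark” means here, the sketch below treats darkness as the fraction of bits set in a fingerprint. This is an assumption about the statistic, and the class and method names are hypothetical; the figures reported by jcman should be taken as the reference.

```java
import java.util.BitSet;
import java.util.List;

// Illustrative darkness calculation; hypothetical names, not the JChem API.
class FingerprintDarkness {
    /** Fraction of bits set in one fingerprint of the given length. */
    static double darkness(BitSet fp, int lengthInBits) {
        return (double) fp.cardinality() / lengthInBits;
    }

    /** Mean darkness over a set of fingerprints of equal length (0.0 for an empty list). */
    static double averageDarkness(List<BitSet> fps, int lengthInBits) {
        if (fps.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (BitSet fp : fps) {
            sum += darkness(fp, lengthInBits);
        }
        return sum / fps.size();
    }
}
```

A fingerprint with most bits set passes almost every screening test, so the pre-screening stage filters out few candidates.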


Performance: limiting the number of hits

A large final result may be unnecessary for the chemist: the number of hits can be limited

[Figure: time plotted against number of structures; curves labeled “Without limit”, “Total search time” and “Screening time”]
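One plausible reason a hit limit saves time is that the expensive per-candidate check can stop as soon as enough hits have been collected. The sketch below shows that idea in plain Java; the names are hypothetical, the predicate stands in for the graph search, and it is not a description of how JChem applies its limit internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative early-exit search; hypothetical names, not the JChem API.
class LimitedSearch {
    /**
     * Checks candidates 0..candidateCount-1 in order and stops as soon as
     * maxHits hits have been found, so later candidates are never checked.
     */
    static List<Integer> firstHits(int candidateCount, IntPredicate isHit, int maxHits) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < candidateCount && hits.size() < maxHits; i++) {
            if (isHit.test(i)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```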

Performance: multiple processors

• JChem automatically utilizes the number of processors available for the JVM

• Adding processors is a straightforward way to improve search performance

• The thread count per search can be limited to prevent the creation of too many threads on systems with a high number of parallel searches:

JChemSearch.setNumberOfProcessingThreads()
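One simple way to choose such a limit is to divide the available processors among the searches expected to run at the same time. The helper below is a hypothetical sketch in plain Java, not part of JChem; its result could then be passed to the setter named above, whose exact signature is not shown here.

```java
// Illustrative thread budgeting; ThreadBudget is a hypothetical helper class.
class ThreadBudget {
    /** Threads per search when the processors are shared by parallel searches; at least 1. */
    static int threadsPerSearch(int availableProcessors, int expectedParallelSearches) {
        if (expectedParallelSearches < 1) {
            throw new IllegalArgumentException("expectedParallelSearches must be >= 1");
        }
        return Math.max(1, availableProcessors / expectedParallelSearches);
    }

    public static void main(String[] args) {
        // Number of processors available to this JVM.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("Threads per search for 4 parallel searches: " + threadsPerSearch(cpus, 4));
    }
}
```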

Performance: server mode JVM

The server mode (“-server” JVM option) increases the efficiency of run-time optimization, with the following trade-offs:

• Slower start-up speed

• Needs more processing time to reach optimum performance

• Higher final performance

Duplicate filtering

• Designated search mode for duplicate filtering: chemaxon.sss.search.SearchConstants.PERFECT

• A 32 bit hash code (cd_hash column) is used to rapidly find possible duplicates; these are further checked by graph search

• The structure cache is not used in PERFECT search mode

• Available in the following API classes:

– chemaxon.jchem.db.Importer

– chemaxon.jchem.db.UpdateHandler

– chemaxon.jchem.db.JChemSearch

– chemaxon.sss.search.MolSearch
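The hash-then-verify pattern can be sketched in plain Java. This is an illustration under stated assumptions, not JChem's code: a canonical string stands in for the standardized structure, String.hashCode() stands in for the 32 bit cd_hash value, and string equality stands in for the graph-search confirmation. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative duplicate filter; hypothetical names, not the JChem API.
class HashDuplicateFilter {
    // Entries grouped by their 32 bit hash code.
    private final Map<Integer, List<String>> byHash = new HashMap<>();

    /** Returns true if the entry was added, false if an identical entry already exists. */
    boolean register(String canonicalForm) {
        int hash = canonicalForm.hashCode(); // stand-in for the cd_hash value
        List<String> bucket = byHash.computeIfAbsent(hash, k -> new ArrayList<>());
        for (String existing : bucket) {
            // Equal hashes only mark a possible duplicate; confirm with an exact check.
            if (existing.equals(canonicalForm)) {
                return false;
            }
        }
        bucket.add(canonicalForm);
        return true;
    }
}
```

The exact check matters because different entries can share a hash code: the strings "Aa" and "BB" have the same String.hashCode() value, yet both are accepted by the filter above.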

Displaying hits (1)

• The structures are stored in their original (non-standardized) format in the “cd_structure” column of the JChem table.

• Always use the cd_structure column to display structures

• If needed, the structure can be standardized for display on-the-fly via Standardizer

• Helper methods to retrieve structures (supports multiple column types of cd_structure):

DatabaseTools.readBytes(ResultSet rs, String columnName)

DatabaseTools.readBytes(ResultSet rs, int idx)

Displaying hits (2)

• Only molecule IDs are stored during DB search

• Search again with chemaxon.sss.search.MolSearch to obtain the hit atoms

• Helper method for hit alignment (rotation):

– MolHandler.align(Molecule mol, int[] indexes)

• Since Marvin automatically colors the connecting half of the bonds, some bonds have to be explicitly set to neutral color:

– MolHandler.getNonHitBondEndpoints()

– MolHandler.getNonHitBonds()

See: jchem\examples\java\HitAlignmentAndColoringExample.java

API example : connecting to a database

import java.sql.Connection;
import chemaxon.jchem.db.ConnectionHandler;

ConnectionHandler ch = new ConnectionHandler();
ch.setDriver("oracle.jdbc.driver.OracleDriver");
ch.setUrl("jdbc:oracle:thin:@localhost:1521:mydb");
ch.setPropertyTable("JChemProperties");
ch.setLoginName("scott");
ch.setPassword("tiger");
ch.connect();
// the java.sql.Connection object is available if needed:
Connection con = ch.getConnection();
…
// closing the connection:
ch.close();

API example : database import

import chemaxon.jchem.db.Importer;

Importer importer = new Importer();
// ch: a connected ConnectionHandler (see the connection example)
importer.setConnectionHandler(ch);
importer.setInput("sample.sdf");
// importer.setInput(is); // alternatively a stream can also be specified
importer.setTableName("SCOTT.STRUCTURES");
importer.setHaltOnError(false);
importer.setDuplicateImportAllowed(false); // can filter duplicates
// specifying SDFile field - table field pairs:
String fieldPairs = "DB_Field1=SDF_Field1; DB_Field2=SDF_Field2";
importer.setFieldConnections(fieldPairs);
int importedCount = importer.importMols();
System.out.println("Imported " + importedCount + " structures");

API example : database export

import java.io.FileOutputStream;
import java.io.OutputStream;
import chemaxon.jchem.db.Exporter;

Exporter exporter = new Exporter();
// ch: a connected ConnectionHandler (see the connection example)
exporter.setConnectionHandler(ch);
exporter.setTableName("structures");
// data fields to be exported with the structure:
exporter.setFieldList("cd_id cd_formula name comments");
String fileName = "output.sdf";
OutputStream os = new FileOutputStream(fileName);
exporter.setOutputStream(os);
exporter.setFormat("sdf");
int exportedCount = exporter.writeAll();
System.out.println("Exported " + exportedCount + " structures");

API example : database search

import chemaxon.jchem.db.JChemSearch;

JChemSearch searcher = new JChemSearch();
searcher.setConnectionHandler(ch);
searcher.setSearchType(JChemSearch.SUBSTRUCTURE);
searcher.setQueryStructure("c1ccccc1");
searcher.setStructureTable("SCOTT.STRUCTURES");
// a query that returns cd_id values can be used for prefiltering
// (the column is qualified because both tables contain cd_id):
searcher.setFilterQuery(
    "SELECT structures.cd_id FROM structures, biodata WHERE "
    + "structures.cd_id = biodata.cd_id AND biodata.toxicity < 0.3");
searcher.setWaitingForResult(true); // otherwise runs in a separate thread
searcher.run();
// getting the results as cd_id values:
int[] results = searcher.getResults();

API example : inserting a structure

import chemaxon.jchem.db.UpdateHandler;

// ConnectionHandler, mode, table name and data field names:
UpdateHandler uh = new UpdateHandler(
    ch, UpdateHandler.INSERT, "structures", "comment, stock");
uh.setValueForFixColumns("c1ccccc1"); // the structure
// specifying data field values:
uh.setStructureValueForAdditionalColumn(1, "some text");
uh.setStructureValueForAdditionalColumn(2, new Double(8.5));
uh.setDuplicateFiltering(true); // filtering duplicate structures
int id = uh.execute(true); // getting back the cd_id of the inserted structure
if (id > 0) {
    System.out.println("Inserted, cd_id value : " + id);
} else {
    System.out.println("Already exists with cd_id value : " + (-id));
}
// storing update information, the database connection remains open:
uh.close();

JSP example application

• Some of the functions implemented:

– Different search types

– Hit alignment and coloring

– Chemical Terms filtering

– Import / Export

– Insert / Modify / Delete

• Open source, customizable

• The source is available in the JChem package under: jchem\examples\jsp1_x

Upgrading JChem

• Tables of old versions have to be upgraded for two reasons:

– Calculated values (e.g. fingerprints) may be out of date (should be recalculated at every upgrade)

– In some versions there is a change in the table structure

• Normally done by the JChemManager (jcman) GUI at startup, or from the command line by invoking “jcman u”

• A general upgrade API will be available for integrators from version 3.2

Future plans

• Support for storing and searching Markush structures

• Database field access from Chemical Terms expressions

• Tautomer search

• Chemical Terms columns

• Tables storing query structures

Summary

ChemAxon’s JChem Base API provides sophisticated tools for the developer to deal with chemical structures and associated data.

Building on the JChem API is convenient, because:

• Our various tools integrate seamlessly

• Both high and low level API classes are available

• Responsive developer-to-developer support

Links

• JChem home page: http://www.jchem.com/

• Live demos: http://www.jchem.com/examples

• API documentation: http://www.jchem.com/doc/api/index.html

• Brochure: http://www.chemaxon.com/brochures/JChemBase.pdf


Máramaros köz 3/a, Budapest, 1037 Hungary

[email protected]

www.chemaxon.com

Thank you for your attention

