Szilárd Dóránt May 2006 Building on JChem Base
Mar 27, 2015
Contents
• Introduction
• Structural overview
• The Property Table
• JChem structure tables
• The log table
• Standardization
• Memory considerations
• The search process
• Performance tips
• Duplicate filtering
• Displaying hits
• API examples
• JSP example
• Upgrading JChem
• Future plans
Introduction
• JChem Base provides high-performance, Java-based tools for the storage, search and retrieval of chemical structures and associated data.
• These components can be integrated into web-based or standalone applications, together with other ChemAxon tools.
Structural overview
Layers, from top to bottom:
• Web browser with web application, or standalone application
• JChem Base API: chemical logic, structure cache
• JDBC driver: standard interface to the RDBMS
• RDBMS (e.g. Oracle, MySQL, etc.): storage and security
The Property Table
The property table stores information about JChem structure tables, including:
• Fingerprint parameters
• Custom standardization rules
• Other table options and information
• Database-related license keys
More than one property table can be used; each property table represents a particular JChem environment.
The structure of JChem tables
Column name     Explanation
cd_id           unique numeric identifier in the table
cd_structure    the imported structure in the original format, without modifications (except for the removal of data fields)
cd_smiles       the standardized structure in ChemAxon Extended SMILES (cxsmiles) format, used by the search process
cd_formula      the formula of the standardized structure
cd_molweight    the molecular weight of the standardized structure
cd_hash         hash code used for duplicate filtering (PERFECT search)
cd_flags        can store row-specific options, e.g. overriding the chiral flag
cd_timestamp    the date and time of the insertion of the row
cd_fp…          fingerprint columns
[user fields]   custom data fields can be added by the user
The log table
• For efficient cache updates it is essential to keep track of modifications to the table
• A log table (“<table_name>_UL”) keeps track of the modifications (insert / update / delete) performed through the JChem API.
• To limit the number of log entries, old rows are deleted right after the cache update (at the beginning of a database search)
• Consequently, the DELETE right is needed even for searches
• An option sets the number of log rows to preserve
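The bookkeeping described above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions, not JChem's implementation: the class `UpdateLogSketch`, its methods, and the `rowsToPreserve` parameter name are hypothetical, and a plain list and map stand in for the log table and the structure cache.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical): a modification log replayed into a cache, then pruned. */
final class UpdateLogSketch {
    enum Op { INSERT, UPDATE, DELETE }

    /** One log row: which structure row changed, how, and its new content. */
    static final class Entry {
        final int cdId;
        final Op op;
        final String structure; // null for DELETE
        Entry(int cdId, Op op, String structure) {
            this.cdId = cdId;
            this.op = op;
            this.structure = structure;
        }
    }

    private final List<Entry> log = new ArrayList<>();
    private final Map<Integer, String> cache = new HashMap<>();
    private int applied = 0; // log entries [0, applied) are already in the cache

    /** Every modification made through the API is appended to the log. */
    void record(int cdId, Op op, String structure) {
        log.add(new Entry(cdId, op, structure));
    }

    /** Replays unseen log entries into the cache, then prunes old log rows. */
    void refreshCache(int rowsToPreserve) {
        for (Entry e : log.subList(applied, log.size())) {
            if (e.op == Op.DELETE) {
                cache.remove(e.cdId);
            } else {
                cache.put(e.cdId, e.structure);
            }
        }
        int excess = Math.max(0, log.size() - rowsToPreserve);
        log.subList(0, excess).clear(); // the pruning step is why searches need DELETE rights
        applied = log.size();
    }

    int logSize() { return log.size(); }

    Map<Integer, String> cacheView() { return cache; }
}
```

Because only the unseen tail of the log is replayed, the cost of a refresh depends on the number of recent modifications rather than on the size of the table.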
Standardization
• Essential for the graph search algorithm
• A basic standardization is provided as default
• A custom standardization can be specified for each table. The setting is saved in the Property Table.
• “Set once and forget”. Automatically utilized during:
– Import (chemaxon.jchem.db.Importer)
– Insert (chemaxon.jchem.db.UpdateHandler)
– Update (chemaxon.jchem.db.UpdateHandler)
– Search (chemaxon.jchem.db.JChemSearch)
– Regeneration (chemaxon.jchem.db.UpdateHandler)
Memory
Quick facts:
• The default JVM heap size is 64 MB
• On 32-bit systems, no more than ~2 GB of memory can be allocated to a single process
• 2 GB can hold approximately 20 million small structures
• JChem caches only whole structure tables (not rows)
[Figure: total memory need = Structure Cache (Table 1, Table 2, Table 3) + temporary allocations (JChem Base, application)]
Memory: Structure Cache
• The Structure Cache stores structures and fingerprints in a highly optimized, compact form
• Memory need is dependent on:
– The number of structures
– The average size of the structures
– The size of the fingerprint used
• Approximately 100 MB is needed for 1 million drug-like (small) structures (using 512-bit fingerprints)
• The exact cache size can be retrieved from the API via:
chemaxon.jchem.db.JChemSearch.getCachedTables()
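Before loading tables, the rule of thumb above can be turned into a rough capacity check. The sketch below uses only the standard Java runtime; the class name `CacheEstimate` and the example table sizes are made up for illustration, and the 100 MB per million figure holds only for small structures with 512-bit fingerprints.

```java
/** Back-of-the-envelope cache sizing (hypothetical helper, not part of JChem). */
final class CacheEstimate {
    /** Rule of thumb: ~100 MB per 1 million small structures, 512-bit fingerprints. */
    static final double MB_PER_MILLION = 100.0;

    static double estimateCacheMb(long structureCount) {
        return structureCount / 1_000_000.0 * MB_PER_MILLION;
    }

    public static void main(String[] args) {
        long[] tableSizes = {250_000L, 1_000_000L, 5_000_000L}; // example row counts
        double totalMb = 0;
        for (long n : tableSizes) {
            totalMb += estimateCacheMb(n);
        }
        // maxMemory() reports the heap limit the JVM was started with (-Xmx).
        double heapMb = Runtime.getRuntime().maxMemory() / (1024.0 * 1024.0);
        System.out.printf("Estimated cache: %.0f MB, JVM max heap: %.0f MB%n", totalMb, heapMb);
    }
}
```

If the estimate approaches the heap limit, the heap should be raised with the `-Xmx` JVM option, leaving headroom for temporary allocations.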
Memory: temporary allocations
• Not directly related to the size of structure tables
• Increases with the number of parallel operations
• Also includes the memory needed for:
– your application routines
– the running environment (application server)
• Cannot be predicted exactly: performing a stress test is recommended
• Specify the amount of memory JChem should not use for the Structure Cache:
chemaxon.jchem.db.JChemSearch.setMinNonCachedMemory()
The search process
A two stage method provides optimal search performance:
1. Rapid pre-screening reduces the number of possible hit candidates
- Chemical Hashed Fingerprints are used for substructure and superstructure searches
- Hash code is used for duplicate filtering (usually during compound registration)
2. Graph search algorithm is used to determine the final hit list
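The two stages can be illustrated with a deliberately simplified model. In the sketch below, strings stand in for molecules, hashed character pairs stand in for Chemical Hashed Fingerprints, and a substring test stands in for the graph search; none of this is JChem's actual algorithm, and all names are hypothetical. What it does preserve is the key property of screening: it may let false candidates through, but it never rejects a true hit.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy two-stage search: bit screening first, exact check second. */
final class TwoStageSearchSketch {
    static final int FP_BITS = 512;

    /** Toy fingerprint: one bit per adjacent character pair (stand-in for hashed paths). */
    static BitSet fingerprint(String s) {
        BitSet fp = new BitSet(FP_BITS);
        for (int i = 0; i + 1 < s.length(); i++) {
            int h = s.charAt(i) * 31 + s.charAt(i + 1);
            fp.set(Math.floorMod(h, FP_BITS));
        }
        return fp;
    }

    /** Stage 1: a target can only match if it has every bit that the query has. */
    static boolean passesScreen(BitSet query, BitSet target) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target);
        return missing.isEmpty();
    }

    /** Stage 2 stand-in: a substring test in place of a real graph search. */
    static boolean verify(String query, String target) {
        return target.contains(query);
    }

    /** Returns the indexes of targets that pass both stages. */
    static List<Integer> search(String query, List<String> targets) {
        BitSet queryFp = fingerprint(query);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < targets.size(); i++) {
            String t = targets.get(i);
            // A real system would read a precomputed fingerprint here.
            if (passesScreen(queryFp, fingerprint(t)) && verify(query, t)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

The cheap bitwise test runs on every row, while the expensive check runs only on the survivors, which is where the performance gain comes from.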
Performance: fingerprints
• The number of structures that pass screening should be close to the number of final hits
• Search statistics can be obtained via:
JChemSearch.setInfoToStdError(true)
• Fingerprints should not be too “dark” (too high a proportion of bits set)
• Fingerprint statistics can be obtained by invoking:
jcman s <table_name>
See: http://www.chemaxon.com/jchem/doc/user/fingerprint.html
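To make the notion of a “dark” fingerprint concrete, the sketch below measures darkness as the fraction of bits set, which is the usual reading of the term and is assumed here. The class and method names are hypothetical; for real tables the statistics come from `jcman s`.

```java
import java.util.BitSet;

/** Measures how "dark" fingerprints are, i.e. what fraction of their bits is set. */
final class FingerprintDarkness {
    /** Fraction of bits set in one fingerprint of the given length. */
    static double darkness(BitSet fp, int lengthInBits) {
        return (double) fp.cardinality() / lengthInBits;
    }

    /** Average darkness over fingerprints that all share the same length. */
    static double averageDarkness(Iterable<BitSet> fps, int lengthInBits) {
        double sum = 0;
        int count = 0;
        for (BitSet fp : fps) {
            sum += darkness(fp, lengthInBits);
            count++;
        }
        return count == 0 ? 0.0 : sum / count;
    }
}
```

The darker the fingerprints, the more targets contain all bits of a query by chance, so more false candidates reach the graph search.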
Performance: limiting the number of hits
A large final result may be unnecessary for the chemist: the number of hits can be limited
[Figure: time plotted against number of structures; labels: without limit, total search time, screening time]
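Limiting the number of hits saves time because the expensive check stops as soon as enough hits are found. The following is a minimal sketch of that early exit in plain Java; `HitLimitSketch` and its parameters are hypothetical and do not reflect how JChem implements the limit internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Early termination once the requested number of hits has been collected. */
final class HitLimitSketch {
    /** Returns at most maxHits matching ids; maxHits <= 0 means no limit. */
    static List<Integer> search(int[] candidateIds, IntPredicate expensiveCheck, int maxHits) {
        List<Integer> hits = new ArrayList<>();
        for (int id : candidateIds) {
            if (expensiveCheck.test(id)) {
                hits.add(id);
                if (maxHits > 0 && hits.size() >= maxHits) {
                    break; // remaining candidates are never checked
                }
            }
        }
        return hits;
    }
}
```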
Performance: multiple processors
• JChem automatically uses all processors available to the JVM
• Adding processors is a straightforward way to improve search performance
• The thread count per search can be limited to prevent the creation of too many threads on systems with a high number of parallel searches:
JChemSearch.setNumberOfProcessingThreads()
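The effect of capping the thread count can be sketched with the standard `java.util.concurrent` classes. This is an illustration only, with hypothetical names; it splits a candidate array into chunks and checks them on a bounded thread pool, which is one common way to parallelize a scan and not necessarily what JChem does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

/** Counts matches in parallel, using no more threads than the caller allows. */
final class ParallelScanSketch {
    static int countMatches(int[] ids, IntPredicate check, int maxThreads) {
        int cpus = Runtime.getRuntime().availableProcessors();
        int threads = Math.max(1, Math.min(maxThreads, cpus));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (ids.length + threads - 1) / threads; // ceiling division
            List<Future<Integer>> parts = new ArrayList<>();
            for (int start = 0; start < ids.length; start += chunk) {
                final int from = start;
                final int to = Math.min(ids.length, start + chunk);
                parts.add(pool.submit(() -> {
                    int n = 0;
                    for (int i = from; i < to; i++) {
                        if (check.test(ids[i])) n++;
                    }
                    return n;
                }));
            }
            int total = 0;
            for (Future<Integer> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel scan failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With many concurrent searches, a lower per-search cap keeps the total number of threads near the number of processors instead of multiplying it by the number of searches.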
Performance: server mode JVM
The server mode (“-server” JVM option) increases the efficiency of run-time optimization:
• Slower start-up speed
• Needs more processing time to reach optimum performance
• Higher final performance
Duplicate filtering
• Designated search mode for duplicate filtering: chemaxon.sss.search.SearchConstants.PERFECT
• A 32-bit hash code (cd_hash column) is used to rapidly find possible duplicates; these are then checked by graph search
• The structure cache is not used in PERFECT search mode
• Available in the following API classes:
– chemaxon.jchem.db.Importer
– chemaxon.jchem.db.UpdateHandler
– chemaxon.jchem.db.JChemSearch
– chemaxon.sss.search.MolSearch
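The hash-then-verify idea can be sketched in a few lines of plain Java. In this toy model a string stands in for a standardized structure, `String.hashCode()` stands in for the cd_hash value, and `equals` stands in for the graph search; the class name and the sign convention of the return value (borrowed from the insert example later in this deck) are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy duplicate filter: a hash narrows the candidates, an exact check decides. */
final class DuplicateFilterSketch {
    private final Map<Integer, List<Integer>> idsByHash = new HashMap<>();
    private final Map<Integer, String> structureById = new HashMap<>();
    private int nextId = 1;

    /** Returns the new id if inserted, or the negated id of the existing duplicate. */
    int insert(String standardized) {
        int hash = standardized.hashCode(); // stand-in for the cd_hash value
        List<Integer> candidates = idsByHash.computeIfAbsent(hash, k -> new ArrayList<>());
        for (int id : candidates) {
            // Equal hashes only suggest a duplicate; the exact check confirms it.
            if (structureById.get(id).equals(standardized)) {
                return -id;
            }
        }
        int id = nextId++;
        candidates.add(id);
        structureById.put(id, standardized);
        return id;
    }
}
```

The exact check matters because different structures can share a hash value: in Java, for example, the strings "Aa" and "BB" have the same `hashCode()`.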
Displaying hits (1)
• The structures are stored in their original (non-standardized) format in the “cd_structure” column of the JChem table.
• Always use the cd_structure column to display structures
• If needed, the structure can be standardized for display on-the-fly via Standardizer
• Helper methods to retrieve structures (they support multiple column types of cd_structure):
DatabaseTools.readBytes(ResultSet rs, String columnName)
DatabaseTools.readBytes(ResultSet rs, int idx)
Displaying hits (2)
• Only molecule IDs are stored during DB search
• Search again with chemaxon.sss.search.MolSearch to obtain the hit atoms
• Helper method for hit alignment (rotation):
– MolHandler.align(Molecule mol, int[] indexes)
• Since Marvin automatically colors the connecting half of the bonds, some bonds have to be explicitly set to neutral color:
– MolHandler.getNonHitBondEndpoints()
– MolHandler.getNonHitBonds()
See: jchem\examples\java\HitAlignmentAndColoringExample.java
API example: connecting to a database
ConnectionHandler ch = new chemaxon.jchem.db.ConnectionHandler();
ch.setDriver("oracle.jdbc.driver.OracleDriver");
ch.setUrl("jdbc:oracle:thin:@localhost:1521:mydb");
ch.setPropertyTable("JChemProperties");
ch.setLoginName("scott");
ch.setPassword("tiger");
ch.connect();
// the java.sql.Connection object is available if needed:
Connection con = ch.getConnection();
…
// closing the connection:
ch.close();
API example: database import
Importer importer = new chemaxon.jchem.db.Importer();
importer.setConnectionHandler(conh); // conh: an already connected ConnectionHandler
importer.setInput("sample.sdf");
// importer.setInput(is); // alternatively, a stream can also be specified
importer.setTableName("SCOTT.STRUCTURES");
importer.setHaltOnError(false);
importer.setDuplicateImportAllowed(false); // can filter duplicates
// specifying SDFile field - table field pairs:
String fieldPairs = "DB_Field1=SDF_Field1; DB_Field2=SDF_Field2";
importer.setFieldConnections(fieldPairs);
int importedCount = importer.importMols();
System.out.println("Imported " + importedCount + " structures");
API example: database export
Exporter exporter = new chemaxon.jchem.db.Exporter();
exporter.setConnectionHandler(conh); // conh: an already connected ConnectionHandler
exporter.setTableName("structures");
// data fields to be exported with the structure:
exporter.setFieldList("cd_id cd_formula name comments");
String fileName = "output.sdf";
OutputStream os = new FileOutputStream(fileName);
exporter.setOutputStream(os);
exporter.setFormat("sdf");
int exportedCount = exporter.writeAll();
os.close(); // release the file handle once the export is done
System.out.println("Exported " + exportedCount + " structures");
API example: database search
JChemSearch searcher = new chemaxon.jchem.db.JChemSearch();
searcher.setConnectionHandler(ch);
searcher.setSearchType(JChemSearch.SUBSTRUCTURE);
searcher.setQueryStructure("c1ccccc1");
searcher.setStructureTable("SCOTT.STRUCTURES");
// a query that returns cd_id values can be used for prefiltering
// (cd_id is qualified because both tables have a column of that name):
searcher.setFilterQuery(
    "SELECT structures.cd_id FROM structures, biodata WHERE "
    + "structures.cd_id = biodata.cd_id AND biodata.toxicity < 0.3");
searcher.setWaitingForResult(true); // otherwise runs in a separate thread
searcher.run();
// getting the results as cd_id values:
int[] results = searcher.getResults();
API example: inserting a structure
// ConnectionHandler, mode, table name and data field names:
UpdateHandler uh = new chemaxon.jchem.db.UpdateHandler(
    ch, UpdateHandler.INSERT, "structures", "comment, stock");
uh.setValueForFixColumns("c1ccccc1"); // the structure
// specifying data field values:
uh.setStructureValueForAdditionalColumn(1, "some text");
uh.setStructureValueForAdditionalColumn(2, new Double(8.5));
uh.setDuplicateFiltering(true); // filtering duplicate structures
int id = uh.execute(true); // getting back the cd_id of the inserted structure
if (id > 0) {
    System.out.println("Inserted, cd_id value : " + id);
} else {
    System.out.println("Already exists with cd_id value : " + (-id));
}
// storing update information, the database connection remains open:
uh.close();
JSP example application
• Some of the functions implemented:
– Different search types
– Hit alignment and coloring
– Chemical Terms filtering
– Import / Export
– Insert / Modify / Delete
• Open source, customizable
• The source is available in the JChem package under: jchem\examples\jsp1_x
Upgrading JChem
• Tables of old versions have to be upgraded for two reasons:
– Calculated values (e.g. fingerprints) may be out of date (they should be recalculated at every upgrade)
– In some versions the table structure changes
• Normally done by the JChemManager (jcman) GUI at startup, or from the command line by invoking “jcman u”
• A general upgrade API will be available for integrators from version 3.2
Future plans
• Support for storing and searching Markush structures
• Database field access from Chemical Terms expressions
• Tautomer search
• Chemical Terms columns
• Tables storing query structures
Summary
ChemAxon’s JChem Base API provides sophisticated tools for the developer to deal with chemical structures and associated data.
Building on the JChem API is convenient, because:
• Our various tools integrate seamlessly
• Both high and low level API classes are available
• Responsive developer-to-developer support
Links
• JChem home page: www.jchem.com
• Live demos: www.jchem.com/examples
• API documentation: www.jchem.com/doc/api
• Brochure: www.chemaxon.com/brochures/JChemBase.pdf
Máramaros köz 3/a, Budapest, 1037 Hungary
www.chemaxon.com
Thank you for your attention