
Mar 09, 2018


Introduction

Depictions of 2D chemical structures published in the literature are stored as bitmap images in most electronic sources of chemical information such as patents, journals and reports. Although the original chemical structures are usually created using chemical drawing programs which generate complete structural information, this information is lost during the publication process and, if required, is normally regenerated by redrawing the structure with a computer program, which is time-consuming and prone to errors.

CLiDE Pro is a chemical OCR software tool aimed at automatic extraction of chemical information from either the printed chemistry literature, or from the equivalent electronic PDF version. CLiDE Pro is the latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3].

[1] P. Ibison, M. Jacquot, F. Kam, A. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3), 338-344.

[2] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information 92. 1992, London, England.

[3] A. Simon and A.P. Johnson. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1), 109-116.

[4] R.O. Duda and P.E. Hart. Use of the Hough Transform to Detect Lines and Curves in Pictures. Graphics Image Process. 1972, 1.

[5] J. Sklansky and V. Gonzalez. Fast Polygonal Approximation of Digitized Curves. Pattern Recognit. 1980, 12, 327-331.

Features

Converts 2D structure diagrams into connection tables

Interprets generic structures

Supports document-oriented processing as opposed to page-oriented processing: the whole document is loaded and processed at once rather than as individual pages.

Handles various difficult features

Loads PDF documents, as well as TIFF and BMP image files

Exports chemical information into MDL SDF and RG files. RG is the generic extension of SDF, defining the root molecule and its associated R-groups in one query. Generic structures can be exported in either RG or SDF file format. In the latter case, R-groups are automatically substituted with their substitution values.

Operates in interactive or batch mode

3 main problems involved in chemical OCR

a) Identification of chemical images within a document.
b) Compilation of chemical graphs of individual molecules from chemical images.
c) Interpretation of complex objects such as generic structures using the retrieved chemical graphs.

A chemical OCR tool

Difficult features correctly handled by CLiDE Pro

With a chemical OCR system, it is very likely that an incorrect connection table is built if there are no specific rules to detect that a structure diagram contains a feature which is unusual or conveys an ambiguous situation.

Crossing bonds are often used to preserve some sense of the shape of a 3D molecule in the drawing, particularly in bridged structures. CLiDE Pro uses a set of rules – which includes the proximity, length, collinearity, and ring membership of potential crossing bonds – to correctly detect and interpret various crossing bond situations.
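As a rough illustration of the collinearity criterion only, the following Python sketch pairs up line fragments that meet at a junction and lie on opposite sides of it. This is not CLiDE Pro's rule set: the function names, the 10-degree tolerance and the "exactly four fragments" condition are assumptions made for the example, and the proximity, length and ring-membership rules are not modelled.

```python
import math

def angle_deg(center, point):
    """Direction of `point` as seen from `center`, in degrees in [0, 360)."""
    return math.degrees(math.atan2(point[1] - center[1], point[0] - center[0])) % 360.0

def collinear_pairs(center, ends, tol_deg=10.0):
    """Pair fragment end points that lie on (nearly) opposite sides of `center`.

    `ends` holds the far end point of each line fragment meeting at `center`.
    The tolerance is a hypothetical value, not one taken from CLiDE Pro.
    """
    pairs, used = [], set()
    for i in range(len(ends)):
        if i in used:
            continue
        for j in range(i + 1, len(ends)):
            if j in used:
                continue
            diff = abs(angle_deg(center, ends[i]) - angle_deg(center, ends[j]))
            diff = min(diff, 360.0 - diff)  # smallest angle between the two directions
            if abs(diff - 180.0) <= tol_deg:
                pairs.append((i, j))
                used.update((i, j))
                break
    return pairs

def looks_like_crossing(center, ends, tol_deg=10.0):
    """Four fragments forming two straight-through lines suggest two crossing bonds."""
    return len(ends) == 4 and len(collinear_pairs(center, ends, tol_deg)) == 2
```

A junction that passes this test would be read as two bonds crossing each other rather than as an implicit carbon atom with four neighbours; a real system would combine it with the other criteria before deciding.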

Some bond formations can be easily misinterpreted, and post-processing of the interpreted connection table is needed to get correct results. If the interpretation relies merely on the end points of the vectors calculated for each bond line, a single bond and a triple bond joined together can be recognized as a long single bond plus a double bond running alongside half of it.

Some simple components, such as isolated single lines, can cause ambiguity in interpretation. For example, a vertical line can occur in several different kinds of chemical entities such as single and multiple bonds, dashed bonds, and character strings representing atom labels (e.g. I, Cl) and other information related to the structure. Such ambiguous situations can be resolved by analyzing the environment of the connected components and applying a set of rules with conditions on chemical and spatial context. For instance, if a ‘C’ letter is on the left side of a vertical line which is not part of a dashed bond, the vertical line represents the ‘l’ letter of a chlorine atom.
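The chlorine rule quoted above can be written down as a small context check. The sketch below is only an illustration under stated assumptions: the `Component` record, the `interpret_vertical_line` function and the gap threshold (half the line height) are hypothetical, and image coordinates are taken to grow to the right and downwards.

```python
from dataclasses import dataclass

@dataclass
class Component:
    """Bounding box of a connected component plus what earlier stages decided."""
    label: str                    # recognized character, or "|" for a bare vertical line
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    in_dashed_bond: bool = False  # set by the dashed-bond construction step

def interpret_vertical_line(line, neighbours, max_gap_ratio=0.5):
    """Return "l of Cl" when the chlorine rule fires, otherwise "undecided".

    The rule: a 'C' character directly to the left of a vertical line that is
    not part of a dashed bond makes the line the letter 'l' of a Cl label.
    `max_gap_ratio` (gap allowed relative to line height) is an assumed value.
    """
    if line.in_dashed_bond:
        return "undecided"
    height = line.y_max - line.y_min
    for other in neighbours:
        if other.label != "C":
            continue
        gap = line.x_min - other.x_max  # horizontal distance from 'C' to the line
        overlaps_vertically = other.y_min < line.y_max and other.y_max > line.y_min
        if 0 <= gap <= max_gap_ratio * height and overlaps_vertically:
            return "l of Cl"
    return "undecided"
```

Lines left "undecided" would be passed on to the other context rules (bond, iodine label, and so on), which are not sketched here.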

Circles are often used to represent aromaticity in benzene rings, especially in older publications.

A generic structure Bibliography

CLiDE Pro’s solutions

a) Document image segmentation, applied to the digitized pages of a document:

Connected on-pixel regions (connected components) are found.
Noise-like connected components are eliminated.
Layout analysis builds the tree structure of a page in a bottom-up manner, i.e. it starts by processing the connected components and results successively in a list of words, text lines, and text and graphic blocks.

b) The connection table of a chemical structure diagram is established as follows:

Initial classification of connected components into basic groups – characters, dashes, lines, graphics and noise – based on size, aspect ratio and on-pixel density. Connected components are reclassified at later stages if necessary.
Construction of dashed bonds based on the Hough transform method [4], by searching for sets of dashes equally spaced along a straight line.
Vectorization to identify straight and wedged line parts of connected components classified as lines and graphics, based on a polygon approximation method [5].
OCR based on topological and geometrical feature analysis.
Grouping characters into atom labels, also taking vertical atom labels into account.
Establishing connection information by connecting bond lines to appropriate atoms and joining bond lines to form implicit carbon atoms.
Further methods are included to deal with aromatic rings, crossing bonds and other potentially difficult situations.

c) The recognition of generic structures is done in two steps:

Generic text interpretation to determine R-groups, the number and type of substituents, whether any label is present for each substituent, etc.
Association of generic text blocks with structures. For a structure, this is done by searching for the generic text blocks which best match the structure in terms of the number of R-groups present in both the structure and the generic text block.
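The first operations listed under a) and b), finding connected components and giving them an initial class from size, aspect ratio and on-pixel density, can be sketched in plain Python as follows. The thresholds (`char_size`, `min_pixels`, the aspect and density cut-offs) and the function names are assumptions chosen for the example, not values used by CLiDE Pro, and the later reclassification stages are not modelled.

```python
from collections import deque

def connected_components(image):
    """Collect 8-connected regions of on-pixels (value 1) from a list-of-rows bitmap."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(x, y)]), []
            while queue:  # breadth-first flood fill from the seed pixel
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            components.append(pixels)
    return components

def classify(pixels, char_size=12, min_pixels=3):
    """Initial class from bounding-box size, aspect ratio and on-pixel density.

    All thresholds are hypothetical; a real system would tune them to the
    scan resolution and revisit the decision at later stages.
    """
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    longest = max(width, height)
    aspect = longest / min(width, height)
    density = len(pixels) / (width * height)
    if len(pixels) < min_pixels:
        return "noise"
    if aspect >= 3 and density >= 0.5:  # thin, well-filled box: a stroke
        return "dash" if longest <= char_size else "line"
    if longest <= char_size:
        return "character"
    return "graphics"
```

Components are returned in scan order (top to bottom, left to right by their first pixel), so `[classify(c) for c in connected_components(image)]` gives one initial label per component in that order.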

A.T. Valko & A.P. Johnson, Keymodule Ltd.

[Figures: example structure diagrams]

Figures show crossing bonds of different bond types (single and triple) and bond styles (solid straight, solid wedged and dashed wedged).

A worked example – Extraction of molecules from a digitized document page

[Figure: generic structure diagram with the variable atom labels X, Y1, Y2 and R (in CO2R)]

41a: X=N, Y1=H, Y2=Cl, R=Et

41b: X=CF, Y1=Y2=F, R=Et

The figure depicts a generic structure that CLiDE Pro can interpret by identifying the R-groups occurring both in the text and in the atom labels of the structure, recognizing the R-group substitution values CF, Cl, Et, F, H and N, and finding a match between the text and the structure based on the R-groups found.

Currently, the generic interpreter is limited to text in which an ‘=’ sign separates the R-groups from the substituents. However, combined assignments to R-groups are handled successfully (e.g. the assignment of Y1 and Y2 in row 41b).
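To make the ‘=’ convention and the combined assignment concrete, here is a minimal Python sketch that parses rows such as 41a and 41b and counts how many R-groups a text row shares with a structure. It is an illustration only: `parse_generic_row` and `shared_r_groups` are hypothetical names, the row format (label, colon, comma-separated assignments) is inferred from the two example rows, and CLiDE Pro's actual interpreter is not described in enough detail here to reproduce.

```python
def parse_generic_row(row):
    """Split a row like "41b: X=CF, Y1=Y2=F, R=Et" into (label, assignments).

    In each comma-separated clause the last '='-separated item is the
    substituent and every item before it is an R-group receiving that value,
    which covers combined assignments such as "Y1=Y2=F".
    """
    label, _, body = row.partition(":")
    assignments = {}
    for clause in body.split(","):
        parts = [part.strip() for part in clause.split("=")]
        if len(parts) < 2 or not all(parts):
            continue  # not an assignment; skipped in this sketch
        *names, value = parts
        for name in names:
            assignments[name] = value
    return label.strip(), assignments

def shared_r_groups(assignments, structure_labels):
    """Number of R-groups present in both the text row and the structure."""
    return len(set(assignments) & set(structure_labels))
```

For the example above, `parse_generic_row("41b: X=CF, Y1=Y2=F, R=Et")` assigns F to both Y1 and Y2, and a structure carrying the labels X, Y1, Y2 and R shares all four R-groups with either row.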

[Figure: three panels, a) to c)]

a) Digitized image of a document page of a patent

b) Segmented document highlighting recognized text blocks and graphic blocks

c) Extracted molecular structures