OC Miner

• HIGH PERFORMANCE PROCESSING OF DOCUMENT COLLECTIONS

• SERVER BACK-END FOR INTELLIGENT KNOWLEDGE MINING

OCMiner® is OntoChem’s high-performance text analysis and data mining toolbox. It is designed to meet the specific needs of our clients instead of providing a one-size-fits-all solution. High quality and performance are achieved by straightforward implementation of tailor-made, modular products for information retrieval and display of medium to very large scale data sources and document collections.

OCMiner® is used by small and large life science companies to automatically index, analyze and search internal or external data collections, extracting product-related knowledge and supporting the development of novel products by transitive knowledge discovery.

TECHNOLOGY

OCMiner® is a modular processing pipeline for unstructured information based on the Apache UIMA framework. Custom data mining is implemented by integrating any number of different toolbox modules into a pipeline that produces the desired output. The tunable modules to select from comprise a broad range of readers, analysis engines and consumers, which may perform their tasks in parallel on multiprocessor machines or even distributed over several computers.
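As a rough illustration of the reader, analysis engine and consumer roles described above, the following Python sketch wires three kinds of modules into one pipeline. It does not use Apache UIMA or any OCMiner® API: `Document`, `text_reader`, `token_annotator`, `dictionary_annotator`, `CollectingConsumer`, `run_pipeline` and the toy dictionary are hypothetical names invented for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Document:
    """Hypothetical container that is passed from module to module."""
    text: str
    annotations: List[dict] = field(default_factory=list)


def text_reader(texts: Iterable[str]) -> Iterator[Document]:
    """Reader: turns raw strings into standardized Document objects."""
    for text in texts:
        yield Document(text=text.strip())


def token_annotator(doc: Document) -> None:
    """Analysis engine: adds one annotation per word-like token."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(
            {"type": "token", "begin": match.start(), "end": match.end()}
        )


# Toy dictionary, invented for this example only.
TOY_DICTIONARY = {"phlorizin": "compound", "diabetes": "disease"}


def dictionary_annotator(doc: Document) -> None:
    """Analysis engine: labels tokens that occur in the toy dictionary.

    It relies on token annotations, so it must run after token_annotator.
    """
    tokens = [a for a in doc.annotations if a["type"] == "token"]
    for tok in tokens:
        surface = doc.text[tok["begin"]:tok["end"]]
        label = TOY_DICTIONARY.get(surface.lower())
        if label is not None:
            doc.annotations.append(
                {"type": label, "begin": tok["begin"], "end": tok["end"], "text": surface}
            )


class CollectingConsumer:
    """Consumer: collects all non-token annotations for later use."""

    def __init__(self) -> None:
        self.entities: List[dict] = []

    def __call__(self, doc: Document) -> None:
        self.entities.extend(a for a in doc.annotations if a["type"] != "token")


def run_pipeline(
    reader: Iterable[Document],
    engines: List[Callable[[Document], None]],
    consumers: List[Callable[[Document], None]],
) -> int:
    """Runs each document through all engines, then hands it to all consumers."""
    processed = 0
    for doc in reader:
        for engine in engines:
            engine(doc)
        for consumer in consumers:
            consumer(doc)
        processed += 1
    return processed
```

In this sketch, changing the pipeline only means changing the lists passed to `run_pipeline`, which is the property a modular design of this kind aims for. Parallel and distributed execution are left out for brevity.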

Readers read data from a variety of sources and standardize the input for further analysis:

● Document readers for office documents and many other file formats

● Extended support for XML and PDF documents

● Database readers that allow direct access to relational databases, ontologies or document management systems (DMS)

PRODUCT FEATURES

INPUT

● Fast and scalable processing of large content sources such as file collections or databases

● Office documents and many other file formats, with extended support for XML and PDF documents

MODULES

● Document structure, sentence and language recognition

● Annotation of named entities
  ◦ Small or very large controlled vocabularies, taxonomies, multi-faceted ontologies and meta-ontologies in any format, e.g. OBO, OWL, SKOS, CSV, …
  ◦ Specialized unique ontologies, for example for chemistry, proteins or genes
  ◦ Resolution of abbreviations, acronyms, homonyms and anaphora
  ◦ Intelligent treatment of word forms and special characters

● Relationship extraction using syntax-rule-based shallow or deep parsing

OUTPUT

● Annotated content, search results or extracted knowledge as files or databases

● Browser-based search and display interfaces

● Data analysis and graphical representation of complex relationships

● API (local or web-based)

OCMiner® can extract complex relationships, for example between compounds (here phlorizin) and species or diseases. The strength of a relationship is shown by the size of the concept found.

Analysis engines work on the standardized information and add further data:

● Recognition of document structure such as headlines, paragraphs and sentences, as well as specific document section types, for example title, abstract, authors, keywords, abbreviation lists and references section.

● Dictionary-based named entity (NE) recognition is a high-performance dictionary look-up technology with support for very large dictionaries (>100 million entries). It implements specific language- and dictionary-dependent treatment options such as:
  ◦ Adaptable recognition of spelling variations
    - Spaces/hyphens (e.g. “HIV-1” or “HIV1” or “HIV-I”)
    - Letters with umlauts or other diacritics (e.g. “Sjögrens disease” → “Sjoegrens disease”)
    - British/American English (e.g. “behaviour” → “behavior”)
    - Greek letters (e.g. “α-amino acid” → “alpha-amino acid”)
    - Plural forms
    - Apostrophe s (e.g. “Sjoegrens disease” or “Sjoegren’s disease” or “Sjoegren disease”)
    - Conditional blacklists and whitelists
  ◦ Homonym resolution is provided by context-sensitive ontological similarity. For example, whether “monitor” denotes a computer screen or a lizard species (e.g. the savanna monitor, Varanus exanthematicus) is decided by the related NEs used in the nearby context. This is especially useful for very short NEs, which are common among protein and chemistry names.
  ◦ Case-sensitive handling of homonyms, for example distinguishing “aids” from “AIDS”
  ◦ Resolution of document-specific abbreviations and acronyms, for example
    - Abbreviations: kb → kilobase(s)
    - Acronyms: TAT → Tyrosine aminotransferase
  ◦ Expansion of shortened word list forms such as
    - “vitamin A, B and C” → vitamin A + vitamin B + vitamin C
    - “white and gray matter” → white matter + gray matter

● Specific ontologies and tools are available to annotate chemistry in text documents:
  ◦ Validated chemistry dictionaries with chemical structures
  ◦ High-performance recognition of chemical names by name-to-structure conversion; identified compounds are stored in a chemistry database
  ◦ For recognized compounds, a connection table and the respective InChI are generated and checked for novelty
  ◦ Annotation of documents with our compound classes using our chemistry ontology, generated by OntoChem’s chemistry ontology editor SODIAC

● Anaphora resolution recognizes underdetermined NEs and searches the complete document for their more precise meaning
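Two of the dictionary look-up options listed above, spelling-variant handling and expansion of shortened word lists, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OntoChem's implementation: the mapping tables, the concept IDs used with it and the functions `normalize`, `build_index`, `lookup` and `expand_coordination` are all hypothetical. For simplicity the sketch lowercases everything, so case-sensitive homonyms such as “aids” and “AIDS” would need separate handling, and plural forms and Roman numerals are not covered.

```python
import re
import unicodedata
from typing import Dict, List, Optional

# Small illustrative mapping tables; a real system would need far larger ones.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
BRITISH_TO_AMERICAN = {"behaviour": "behavior", "tumour": "tumor"}


def normalize(term: str) -> str:
    """Maps spelling variants of a term onto one look-up key."""
    text = term.lower()
    for source, target in {**GREEK, **UMLAUTS}.items():
        text = text.replace(source, target)
    # Strip any remaining diacritics (e.g. "é" becomes "e").
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop possessive endings written with a straight or curly apostrophe.
    text = re.sub(r"['’]s\b", "", text)
    # Split on spaces and hyphens, map British spellings, and re-join.
    words = [BRITISH_TO_AMERICAN.get(w, w) for w in re.split(r"[\s\-]+", text) if w]
    return "".join(words)


def build_index(dictionary: Dict[str, str]) -> Dict[str, str]:
    """Builds a look-up table from normalized names to concept IDs."""
    return {normalize(name): concept for name, concept in dictionary.items()}


def lookup(index: Dict[str, str], mention: str) -> Optional[str]:
    """Returns the concept ID for a mention, or None if it is unknown."""
    return index.get(normalize(mention))


def expand_coordination(phrase: str) -> List[str]:
    """Expands phrases like "vitamin A, B and C" or "white and gray matter"."""
    parts = [p.strip() for p in re.split(r",|\band\b|\bor\b", phrase) if p.strip()]
    if len(parts) < 2:
        return [phrase.strip()]
    first, last = parts[0].split(), parts[-1].split()
    # Shared head at the front: "vitamin A, B and C".
    if len(first) > 1 and all(len(p.split()) == 1 for p in parts[1:]):
        head = " ".join(first[:-1])
        return [parts[0]] + [f"{head} {p}" for p in parts[1:]]
    # Shared head at the end: "white and gray matter".
    if len(last) > 1 and all(len(p.split()) == 1 for p in parts[:-1]):
        head = " ".join(last[1:])
        return [f"{p} {head}" for p in parts[:-1]] + [parts[-1]]
    return parts
```

Normalizing both the dictionary entries and the text mentions with the same function keeps the look-up itself a plain hash-table access, which is one plausible way to stay fast with very large dictionaries.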

Consumers may work independently and in parallel, utilizing the data provided by the analysis engines. They provide the final output to the search and display applications of our clients.

The modular OCMiner® processing pipeline uses custom-designed modules to process documents and to produce the desired output.

Tagging of text documents with different ontologies and visualization of chemistry

[Pipeline diagram: within a Collection Processing Engine, readers (File Reader, OCMDB Reader, Medline Reader) feed analysis engines (XMLAnnotatorDetagger, LanguageDetector, NormalizeText, Tokenizer, DocumentParts Annotator, PersonAnnotator, Abbrev.Annotator, ValidateShort Annotation, Coord.Entity Annotator, CleanupAnnotation, AncestorAnnotator, DomainAnnotator, …), which feed consumers (LuceneCoocIndex, Lucene StdIndex, OCMDBConsumer); the connections between stages are labeled CAS.]

● Text tagging and annotation, e.g. annotating scientific publications for printing houses or extracting compounds from patents into custom databases

● Web-based search engines (www.ocminer.com), for example together with our PDF-to-HTML converter

● Thematic searches or document ranking based on ontology terms, delivering instant knowledge-based results (from pre-calculated relations) from very large data collections

● Document similarity based on concepts rather than words allows finding more relevant, related documents. For example, we may search for documents that deal with similar compounds used to treat related diseases.

● Relationship extraction ranges from simple co-occurrence detection to sophisticated semantic relationship analysis based on specific syntax knowledge. This technology is based on OntoChem’s unique relationship ontology and syntax analysis software.

● Knowledge mining by further analyzing extracted data, for example answering complex questions such as “What is the distribution of different compound classes that are found in different plant families?” Implicit or transitive knowledge that is distributed over heterogeneous data sets can be searched, enabling the discovery of knowledge that is not mentioned explicitly in any single document. This feature allows for the generation of new intellectual property.

● Structure searching: we are the leader in integrating text searching with powerful chemistry searching, a unique feature not present in other search engines. Chemical identity and stereoisomer searches are implemented in a straightforward way. The whole range of chemical searches, such as substructure and similarity searching, is available via the integration of ChemAxon’s JChem libraries if needed.
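The simplest form of relationship extraction mentioned above, co-occurrence detection, and the idea of transitive knowledge discovery can be sketched as follows. This is an illustrative Python sketch only: it assumes each document has already been reduced to a list of concept names, and the functions `cooccurrence_counts` and `transitive_candidates` as well as the placeholder concept names in the usage example are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def cooccurrence_counts(docs: Iterable[List[str]]) -> Counter:
    """Counts in how many documents each pair of concepts occurs together."""
    counts: Counter = Counter()
    for concepts in docs:
        # Sorting gives every pair one canonical order, e.g. ("a", "b").
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts


def transitive_candidates(counts: Counter) -> Dict[Pair, Set[str]]:
    """Finds pairs that never co-occur directly but share a linking concept."""
    neighbors: Dict[str, Set[str]] = {}
    for a, b in counts:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    candidates: Dict[Pair, Set[str]] = {}
    for bridge, linked in neighbors.items():
        for a, c in combinations(sorted(linked), 2):
            if (a, c) not in counts:
                candidates.setdefault((a, c), set()).add(bridge)
    return candidates


# Usage with placeholder concept names (not real findings):
docs = [
    ["compound_A", "protein_X"],
    ["protein_X", "disease_D"],
    ["compound_A", "protein_X", "protein_Y"],
]
counts = cooccurrence_counts(docs)
hints = transitive_candidates(counts)
```

Here `counts[("compound_A", "protein_X")]` is 2 because the pair appears in two documents, and `hints` links `compound_A` to `disease_D` via `protein_X` although no single document mentions both. The count can serve as a simple measure of relationship strength; semantic relationship analysis based on syntax goes well beyond this sketch.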

QUALITY

With OCMiner® we can achieve better precision and recall than with competing technology, and we would be glad to demonstrate this to you! For example, for protein names annotated with our protein ontology we guarantee precision rates >95% and recall rates >85%. However, sometimes the meaning of specific sentences cannot be resolved even by human readers; therefore we have introduced confidence-value-based annotations. Each annotation is assigned a specific confidence value by a proprietary algorithm. This value may be used in custom applications to extract only highly certain facts, or a broader range of facts that includes those with only low confidence.
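How a custom application might use such confidence values can be sketched in Python as follows. The confidence algorithm itself is proprietary and not shown; the sketch only assumes that every annotation carries a score, here taken to lie between 0.0 and 1.0. The class `Annotation` and the functions `filter_by_confidence` and `precision_recall` are hypothetical names, and the data in the usage example is invented.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Annotation:
    doc_id: str
    concept: str
    confidence: float  # assumed scale: 0.0 (uncertain) to 1.0 (certain)


def filter_by_confidence(
    annotations: Iterable[Annotation], min_confidence: float
) -> List[Annotation]:
    """Keeps only annotations at or above the chosen confidence threshold."""
    return [a for a in annotations if a.confidence >= min_confidence]


def precision_recall(
    predicted: Set[Tuple[str, str]], gold: Set[Tuple[str, str]]
) -> Tuple[float, float]:
    """Precision = TP / predicted, recall = TP / gold (0.0 if undefined)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: four annotations and a manually curated gold standard.
annotations = [
    Annotation("d1", "P1", 0.9),
    Annotation("d1", "P2", 0.8),
    Annotation("d2", "P3", 0.4),
    Annotation("d2", "P9", 0.3),
]
gold = {("d1", "P1"), ("d1", "P2"), ("d2", "P3"), ("d2", "P4")}

strict = {(a.doc_id, a.concept) for a in filter_by_confidence(annotations, 0.7)}
broad = {(a.doc_id, a.concept) for a in filter_by_confidence(annotations, 0.0)}
```

On this toy data the strict threshold yields fewer but fully correct facts (precision 1.0, recall 0.5), while the broad setting recovers more of the gold standard at the cost of one wrong fact (precision 0.75, recall 0.75), which is the trade-off the confidence value is meant to expose.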

Co-occurrence of the compound phlorizin and proteins in different documents of Medline

OntoChem GmbH, Heinrich-Damerow-Str. 4, 06120 Halle, Germany

