David Baehrens: Large-Scale Patent Classification at the European Patent Office

Apr 14, 2017

Transcript
Page 1: David Baehrens: Large-Scale Patent Classification at the European Patent Office

David Baehrens

Large-Scale Patent Classification at the European Patent Office

Page 2

ABOUT AVERBIS

Founded: 2007

Location: Freiburg im Breisgau

Team: Domain & IT-Experts

Focus: Leverage structured & unstructured information

Current Sectors: Pharma, Health, Automotive, Publishers & Libraries

Page 3

PORTFOLIO

Solutions: Libraries, Pharma, Patents, Healthcare, Social Media, Automotive

Terminology Management, Text Mining

Search & Analytics, NoSQL

Categorization & Clustering

Page 4

TERMINOLOGY MANAGEMENT

Terminology management software

Provision of terminologies

Mappings between terminologies

Building terminology-based applications

Page 5

TEXT MINING

Term Detection: synonyms (dimethyl sulfoxide, dimethylsulfoxide, Domoso, Infiltrina); hierarchies (cancer, carcinoma, melanoma, lymphoma, glioblastoma…)

Regular Expressions: patterns (dates, citations, mail addresses…)

Rule Engine: rule-based extraction of many kinds of complex information

Named Entities: Persons, Locations, Genes…

Relations: co-occurrences, typed relations, e.g. Genes / Diseases / Modification Type

Syntax Detection: Sentences, Tokens, POS-Tags, Chunks, Paragraphs, Sections, Stemming, Decompounding…
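
The term detection and regular expression layers listed on this slide can be illustrated with a minimal sketch. It is not the Averbis implementation: the synonym table, the concept label `DMSO`, the ISO-style date pattern, and the function names `detect_terms` and `detect_dates` are all assumptions made for illustration.

```python
import re

# Hypothetical mini-terminology: every synonym maps to one concept label.
# The label "DMSO" is an illustrative choice, not taken from the slides.
SYNONYMS = {
    "dimethyl sulfoxide": "DMSO",
    "dimethylsulfoxide": "DMSO",
    "domoso": "DMSO",
    "infiltrina": "DMSO",
}

# Illustrative pattern for ISO-style dates such as 2017-04-14.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def detect_terms(text):
    """Return sorted (start, end, concept) spans for known synonyms."""
    lowered = text.lower()
    spans = []
    for synonym, concept in SYNONYMS.items():
        pattern = r"\b" + re.escape(synonym) + r"\b"
        for match in re.finditer(pattern, lowered):
            spans.append((match.start(), match.end(), concept))
    return sorted(spans)


def detect_dates(text):
    """Return all substrings that look like ISO-style dates."""
    return DATE_PATTERN.findall(text)
```

Mapping several surface forms to one concept is what makes a concept-based search possible: a query for one synonym can then also find documents that use another.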

Page 6

RULE ENGINE

1. NAME OF THE MEDICINAL PRODUCT

Desloratadine ratiopharm 5 mg film-coated tablets

Primary Field Name   Secondary Field Name         Field Value
MedicalProductName   coveredText                  Desloratadine ratiopharm 5 mg film-coated tablets
                     inventedPartName             DESLORATADINE
                     strengthPart                 5 mg
                     pharmaceuticalDoseFormPart   FILM-COATED TABLET

Text / Rule / Result
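
The rule itself is not part of the transcript, so the following is only a sketch of how a rule could map the product-name line to the fields shown in the result. The regular expression, the unit list, the naive singularization, and the function name `extract_product_name` are assumptions. The field names are taken from the slide.

```python
import re

# Hypothetical rule: <invented name> ... <strength> <dose form>.
# The unit list (mg, g, ml) is an illustrative assumption.
RULE = re.compile(
    r"^(?P<invented>\w+)\b.*?\s"
    r"(?P<strength>\d+(?:[.,]\d+)?\s?(?:mg|g|ml))\s+"
    r"(?P<form>.+)$",
    re.IGNORECASE,
)


def extract_product_name(line):
    """Split a product-name line into the fields shown on the slide."""
    line = line.strip()
    match = RULE.match(line)
    if match is None:
        return None
    form = match.group("form").upper()
    if form.endswith("S"):
        form = form[:-1]  # naive singularization, enough for this example
    return {
        "coveredText": line,
        "inventedPartName": match.group("invented").upper(),
        "strengthPart": match.group("strength"),
        "pharmaceuticalDoseFormPart": form,
    }
```

A production rule engine would typically combine such patterns with terminology lookups (for example a controlled list of dose forms) rather than rely on a single expression.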

Page 7

SEARCH & NOSQL

Free text + concept-based search

Text mining integration

Guided navigation / facets

NoSQL functionalities

Multi- & cross-lingual search

Related documents

Based on Apache Solr

• Extended Query Syntax

• JSON-API

• Scalability
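
Since the search component is described as being based on Apache Solr, a faceted query can be sketched with standard Solr request parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). The host, the collection name `patents`, the field names, and the helper `build_solr_query` are assumptions. The extended query syntax and JSON API of the Averbis product may differ.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, collection, text, facet_fields, rows=10):
    """Build a Solr /select URL with facets; no request is sent."""
    params = [
        ("q", text),          # free-text query
        ("rows", rows),       # number of hits to return
        ("wt", "json"),       # ask for a JSON response
        ("facet", "true"),    # enable faceting for guided navigation
    ]
    # One facet.field parameter per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical usage: host, collection, and fields are placeholders.
url = build_solr_query(
    "http://localhost:8983/solr",
    "patents",
    "film coated tablet",
    ["department", "language"],
)
```

The resulting URL can be fetched with any HTTP client. The facet counts in the response are what a guided-navigation sidebar would display.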

Page 8

DOCUMENT CLASSIFICATION

Hotel Reviews

Patents

Page 9

SEARCH & NOSQL

Page 10

INFORMATION DISCOVERY

Terminology Management Text Mining

Search & Analytics NoSQL

Categorization & Clustering

Delivery / Deployment / Runtime Environment

Integration Tests / Continuous Integration

Extensive Documentation

Common Architecture / Application Design

User & Role Management, Security

Communication Bus

Project Management

Page 11

PATENT CLASSIFICATION AT EPO

Tender No. 1585

1) Pre-classification of unpublished patents into departments

2) Re-classification of published patents when the category system changes

Page 12

ABOUT EPO

• The European Patent Office (EPO) grants European patents for the Contracting States to the European Patent Convention

• Second-largest intergovernmental institution in Europe

• Not an EU institution

• Self-financing, i.e. revenue from fees covers operating and capital expenditure

Page 13

NUMBER OF STAFF

Status: December 2008

Page 14

PATENT APPLICATIONS

Page 15

http://www.epo.org/about-us/annual-reports-statistics/annual-report/2014.html

Page 16

COOPERATIVE PATENT CLASSIFICATION

• Patent classification system based on ECLA / IPC

• Jointly developed by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO)

• Used by both the EPO and USPTO since 1 January 2013

• Currently contains about 250,000 classes

Page 17

EXAMPLE CPC CLASS

Page 18

GRANTED PATENT

Page 19

EARLY PATENT

Page 20

EARLY PATENT

Page 21

EARLY PATENT

Page 22

PATENT CLASSIFICATION AT EPO

Tender No. 1585

1) Pre-classification of unpublished patents into departments

Our Motivation:

• Great classification use case

– Big data (80 million patents available)

– Large-scale category system with more than 250,000 CPC codes

– Tough constraints on classification quality and response time

• Text Mining Success Story

Page 23

OLD CLASSIFICATION PROCESS

PATENTS CLASSIFICATION DEPARTMENTS

Page 24

CLASSIFICATION COMPLEXITY

~250,000 CPC codes

~1,500 ranges

250 departments

Page 25

CLASSIFICATION PROCESS

PATENTS CLASSIFICATION DEPARTMENTS

Page 26

NEW CLASSIFICATION PROCESS

PATENTS CLASSIFICATION DEPARTMENTS

Page 27

SOME FACTS

• About 650k training documents from 2005-2013

• Supervised learning: lightweight and fast linear support vector machine

• Training time (16 cores, 128 GB RAM)

– Feature extraction: ~1 hour

– Training of classifiers: ~1 hour

– 90/10 tests with a look-ahead of 3 levels and reporting the 3 best candidates: ~1 hour

• Prediction: 5 docs in 5 sec
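
The prediction step of a linear classifier that reports the 3 best candidates can be sketched as follows. Training is omitted, and the per-class weights, the whitespace bag-of-words features, the department labels, and the function names are assumptions for illustration rather than the system's actual models or features.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words term counts (a stand-in for real feature extraction)."""
    return Counter(text.lower().split())


def score(weights, bias, features):
    """Linear decision value w.x + b for one class."""
    return bias + sum(weights.get(term, 0.0) * n for term, n in features.items())


def top_candidates(models, text, k=3):
    """Rank all classes by decision value and return the k best labels."""
    features = featurize(text)
    ranked = sorted(
        models,
        key=lambda label: score(models[label][0], models[label][1], features),
        reverse=True,
    )
    return ranked[:k]


def top_k_hit_rate(models, labelled_docs, k=3):
    """Share of held-out documents whose true label is among the top k."""
    hits = sum(
        1 for text, label in labelled_docs
        if label in top_candidates(models, text, k)
    )
    return hits / len(labelled_docs)


# Hypothetical one-vs-rest models: label -> (term weights, bias).
MODELS = {
    "pharma": ({"tablet": 1.0, "dose": 0.8}, -0.1),
    "software": ({"query": 1.0, "index": 0.9}, -0.1),
    "automotive": ({"vehicle": 1.0, "airbag": 0.9}, -0.1),
    "telecom": ({"network": 1.0, "packet": 0.9}, -0.1),
}
```

In a 90/10 test, `top_k_hit_rate` would be computed on the held-out 10% after training on the remaining 90%. Because scoring is a sparse dot product per class, prediction stays cheap even with many classes.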

Page 28

HIERARCHICAL CLASSIFICATION

Page 29

STATUS & OUTLOOK

• Range-specific quality evaluation

• Going live with best ranges

• Continuous optimization

Page 30

PATENT CLASSIFICATION AT EPO

Tender No. 1585

2) Re-classification of published patents when the category system changes

Challenges and Facts:

– 250,000 CPC codes, regular changes/refinements

– Several re-classification projects at any one time, with great variation in size; a class is split into 5-20(?) subclasses

– No training material available

Page 31

NEW RE-CLASSIFICATION PROCESS

Training Data

• A human annotator starts by labeling about 20% of the documents with the new subclasses

Statistical Models

• Models are generated on the fly, and

• cross-validation tests are carried out

Threshold

• If cross-validation reaches a certain threshold (e.g. 90%), the remaining documents are classified fully automatically without further review

• Otherwise, more training data is generated
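
The threshold decision described on this slide can be sketched as a small cross-validation loop. The toy term-overlap classifier below merely stands in for the statistical models, and the fold assignment, the function names, and the 0.9 default are assumptions based on the "e.g. 90%" figure above.

```python
from collections import Counter, defaultdict


def train(docs):
    """Sum term counts per subclass (toy stand-in for the real model)."""
    profiles = defaultdict(Counter)
    for text, label in docs:
        profiles[label].update(text.lower().split())
    return profiles


def predict(profiles, text):
    """Pick the subclass whose profile overlaps most with the text."""
    terms = text.lower().split()
    # sorted() makes tie-breaking deterministic (alphabetically first label).
    return max(sorted(profiles), key=lambda lb: sum(profiles[lb][t] for t in terms))


def cross_validate(docs, folds=5):
    """Accuracy pooled over folds; document i is held out in fold i % folds."""
    correct = 0
    for fold in range(folds):
        train_docs = [d for i, d in enumerate(docs) if i % folds != fold]
        test_docs = [d for i, d in enumerate(docs) if i % folds == fold]
        profiles = train(train_docs)
        correct += sum(1 for text, lb in test_docs if predict(profiles, text) == lb)
    return correct / len(docs)


def ready_for_automatic_classification(labelled_docs, threshold=0.9, folds=5):
    """True if the remaining documents may be classified without review."""
    return cross_validate(labelled_docs, folds) >= threshold
```

If the function returns `False`, the annotator labels a further batch and the check is repeated, which matches the "more training data" branch of the process.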

Page 32

STATUS & OUTLOOK

• Currently in evaluation phase

• Going live in the coming weeks

Page 33

…NOT ONLY PATENTS

Solutions: Libraries, Pharma, Patents, Healthcare, Social Media, Automotive

Terminology Management, Text Mining

Search & Analytics, NoSQL

Categorization & Clustering

Page 34

For further questions, please contact:

David Baehrens

+49 (0)761 203 97690

[email protected]