Apache Tika 2007-11-15 Jukka Zitting [email protected] Apache Tika An extensible, configurable content analysis framework toolkit
Page 1: Apache Tika

Apache Tika 2007-11-15

Jukka Zitting [email protected]

Apache Tika

An extensible, configurable content analysis framework toolkit

Page 2: Apache Tika

Agenda

The Problem

The Solution

The Project

The Design

Page 3: Apache Tika

The Problem

PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.

Lucene index

Page 4: Apache Tika

It’s Worse Than That

• Licensing
• Dependencies
• Metadata extraction
• Structured content
• Encryption/Compression
• Package formats
• Streaming
• Processing of digital media

Page 5: Apache Tika

Agenda

The Problem

The Solution

The Project

The Design

Page 6: Apache Tika

The Solution: Technical

• Generic API for extracting metadata and structured text content from a document
  – Input: byte stream + optional metadata
  – Output: XHTML SAX events + metadata

• Automatic content type detection
  – Magic bytes
  – File name patterns
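The input/output contract on this slide can be illustrated with a minimal sketch that uses only the JDK's SAX classes. It is not the Tika API: `SketchParser`, `PlainTextSketchParser`, `ParagraphCollector`, and `ExtractionSketch` are hypothetical names, the UTF-8 decoding and the one-`<p>`-per-line layout are assumptions, and the `Content-Type` metadata key is chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical interface (not Tika's): bytes + metadata in, XHTML SAX events + metadata out. */
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

/** Illustrative plain-text parser: emits every input line as an XHTML <p> element. */
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Keep caller-supplied metadata; only fill in the type if it is missing.
        metadata.putIfAbsent("Content-Type", "text/plain");
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        // Assumption: the input is UTF-8 encoded text.
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }
}

/** Consumer side: collects the text of each <p>, as an indexer might. */
class ParagraphCollector extends DefaultHandler {
    final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            paragraphs.add(current.toString());
            current = null;
        }
    }
}

class ExtractionSketch {
    /** Runs the sketch parser over a string and returns the collected paragraphs. */
    static List<String> paragraphsOf(String text, Map<String, String> metadata) {
        ParagraphCollector collector = new ParagraphCollector();
        try {
            new PlainTextSketchParser().parse(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                    collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.paragraphs;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.println(paragraphsOf("Apache Tika\nsecond line", metadata));
        System.out.println(metadata);
    }
}
```

Because the output is a stream of SAX events rather than a string, a consumer can index or filter content as it arrives without holding the whole document in memory.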

Page 7: Apache Tika

The Solution: Legal / Social

• Apache License
  – (L)GPL projects can implement the Tika API

• Pooling of efforts
  – Active development and maintenance
  – Already beyond the functionality of most custom solutions
  – Cool future goals: OCR, speech recognition, …

Page 8: Apache Tika

Agenda

The Problem

The Solution

The Project

The Design

Page 9: Apache Tika

Project Status

• First planned in early 2006

• Incubating since March 2007

• Sponsoring PMC: Apache Lucene

• No releases yet
  – 0.1 release being planned

• Small development team
  – 6 committers, 3-4 currently active

Page 10: Apache Tika

Current Features

• Media type framework
  – Shared MIME-info spec (freedesktop.org)
  – Default media type registry (incl. glob and magic patterns)

• Parser components
  – PDF (PDFBox)
  – Plain text (ICU4J)
  – XML (SAX)
  – HTML (NekoHTML)
  – Word, PowerPoint, Excel (POI)
  – ODF (SAX)
  – RTF (Swing)

Page 11: Apache Tika

Project Statistics

Page 12: Apache Tika

Codebase History

Lius

Nutch

Lius Lite

Tika

textmining

Jackrabbit

Andy Clark

Jukka Zitting

Rida Benjelloun

Chris Mattmann

Jerome Charron

Sami Siren

Bertrand Delacretaz

Keith Bennett

Page 13: Apache Tika

Agenda

The Problem

The Solution

The Project

The Design

Page 14: Apache Tika

Content Extraction

PPT

Type: application/vnd.ms-powerpoint

Title: Apache Tika

Author: Jukka Zitting

new PowerPointParser().parse(…);

Page 15: Apache Tika

Media Type Detection

application/vnd.ms-powerpoint

MimeTypes types = …;
MimeType type = types.getMimeType(…);

tika-mimetypes.xml

/etc/magic

mime.types

?
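The two detection signals named in the deck, magic bytes and file name patterns, could be combined roughly as follows. This is a stand-alone sketch, not the `MimeTypes` class from the slide: `MediaTypeSketch` and its tiny hard-coded registry are hypothetical, only `*.ext` style globs are handled, and checking magic bytes before file names is an assumed ordering.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical detector: magic-byte prefixes first, then file name globs. */
class MediaTypeSketch {
    // Illustrative registry; a real one would be loaded from a file such as tika-mimetypes.xml.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("<?xml", "application/xml");
        GLOBS.put("*.ppt", "application/vnd.ms-powerpoint");
        GLOBS.put("*.pdf", "application/pdf");
        GLOBS.put("*.txt", "text/plain");
    }

    /** Returns true if data begins with the given prefix bytes. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Detects a media type from the first bytes of a document and an optional file name.
     * Falls back to application/octet-stream when nothing matches.
     */
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getKey().getBytes(StandardCharsets.US_ASCII))) {
                return e.getValue();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                // Only "*.ext" globs are supported: drop the '*' and compare the suffix.
                if (lower.endsWith(e.getKey().substring(1))) {
                    return e.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead, "report.bin"));
        System.out.println(detect(new byte[0], "slides.ppt"));
    }
}
```

In this sketch the content wins over the name, so a PDF saved as `report.bin` is still reported as `application/pdf`, while an empty or unrecognized stream falls back to the file name.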

Page 16: Apache Tika

Combined Detection and Extraction

PPT

Type: application/vnd.ms-powerpoint

Title: Apache Tika

Author: Jukka Zitting

TXT

PDF

XML

new AutoDetectParser().parse(…);

?
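The idea behind combining detection with extraction is a dispatcher: detect the type first, then hand the document to whichever parser is registered for that type. The sketch below shows only that pattern and is not how `AutoDetectParser` is implemented. `AutoDetectSketch`, its toy detector, and the regex-based XML text extractor are all hypothetical and chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dispatcher: detect the media type, then delegate to a registered parser. */
class AutoDetectSketch {
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();
    private final Function<byte[], String> detector;

    AutoDetectSketch(Function<byte[], String> detector) {
        this.detector = detector;
    }

    /** Registers a text extractor for one media type. */
    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    /** Detects the type of the document and runs the matching extractor. */
    String parse(byte[] document) {
        String type = detector.apply(document);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(document);
    }

    /** Builds a demo instance that knows about plain text and XML only. */
    static AutoDetectSketch demo() {
        // Toy detector: an XML declaration means XML, anything else is treated as plain text.
        AutoDetectSketch auto = new AutoDetectSketch(bytes -> {
            String head = new String(bytes, 0, Math.min(5, bytes.length),
                    StandardCharsets.US_ASCII);
            return head.equals("<?xml") ? "application/xml" : "text/plain";
        });
        auto.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Crude tag stripping for illustration; a real XML parser would use SAX.
        auto.register("application/xml", bytes ->
                new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim());
        return auto;
    }

    public static void main(String[] args) {
        AutoDetectSketch auto = demo();
        System.out.println(auto.parse("plain text".getBytes(StandardCharsets.UTF_8)));
        System.out.println(auto.parse(
                "<?xml version=\"1.0\"?><doc>Hello</doc>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The caller never names a format: adding support for another type means registering one more parser, and the calling code stays unchanged.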

Page 17: Apache Tika

Agenda

The Problem

The Solution

The Project

The Design

Thank You!