Apache Tika

2007-11-15

Jukka Zitting

An extensible, configurable content analysis framework/toolkit



Agenda

• The Problem
• The Solution
• The Project
• The Design



The Problem

[Diagram: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc. → Lucene index]



It’s Worse Than That

• Licensing
• Dependencies
• Metadata extraction
• Structured content
• Encryption/Compression
• Package formats
• Streaming
• Processing of digital media



Agenda: The Solution



The Solution: Technical

• Generic API for extracting metadata and structured text content from a document
  – Input: byte stream + optional metadata
  – Output: XHTML SAX events + metadata

• Automatic content type detection
  – Magic bytes
  – File name patterns
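To make the "byte stream + metadata in, XHTML SAX events + metadata out" contract concrete, here is a minimal sketch using only the JDK's SAX classes. The names `SketchParser`, `PlainTextParser`, `TextCollector`, and `extractText`, the `Content-Type` metadata key, and the UTF-8 assumption are all hypothetical illustrations, not Tika's actual interfaces.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of the API shape; these are not Tika's actual classes. */
class ParserApiSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Assumed contract: byte stream + metadata in, XHTML SAX events + metadata out. */
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    /** Toy plain-text parser: every input line becomes one XHTML <p> element. */
    static class PlainTextParser implements SketchParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            metadata.put("Content-Type", "text/plain"); // metadata key is an assumption
            AttributesImpl noAttributes = new AttributesImpl();
            handler.startDocument();
            handler.startElement(XHTML, "html", "html", noAttributes);
            handler.startElement(XHTML, "body", "body", noAttributes);
            // Assumes UTF-8 input; a real parser would have to detect the encoding.
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                handler.startElement(XHTML, "p", "p", noAttributes);
                handler.characters(line.toCharArray(), 0, line.length());
                handler.endElement(XHTML, "p", "p");
            }
            handler.endElement(XHTML, "body", "body");
            handler.endElement(XHTML, "html", "html");
            handler.endDocument();
        }
    }

    /** SAX consumer that keeps the text and ends each paragraph with a newline. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }
    }

    /** Runs the toy parser over a byte array and returns the collected text. */
    static String extractText(byte[] data, Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        try {
            new PlainTextParser().parse(new ByteArrayInputStream(data), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        byte[] data = "Apache Tika\nJukka Zitting".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(data, metadata));
        System.out.println(metadata);
    }
}
```

Emitting SAX events instead of returning a string lets the consumer decide what to keep: plain text for an index, structure for a converter, or nothing but the metadata.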



The Solution: Legal / Social

• Apache License
  – (L)GPL projects can implement the Tika API

• Pooling of efforts
  – Active development and maintenance
  – Already beyond the functionality of most custom solutions
  – Cool future goals: OCR, speech recognition, …



Agenda: The Project



Project Status

• First planned in early 2006

• Incubating since March 2007

• Sponsoring PMC: Apache Lucene

• No releases yet
  – 0.1 release being planned

• Small development team
  – 6 committers, 3-4 currently active



Current Features

• Media type framework
  – Shared MIME info spec (freedesktop.org)
  – Default media type registry (incl. glob and magic patterns)

• Parser components
  – PDF (PDFBox)
  – Plain text (ICU4J)
  – XML (SAX)
  – HTML (NekoHTML)
  – Word, PowerPoint, Excel (POI)
  – ODF (SAX)
  – RTF (Swing)



Project Statistics



Codebase History

Projects shown: Lius, Nutch, Lius Lite, textmining, Jackrabbit, Tika

People shown: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett



Agenda: The Design



Content Extraction

PPT

Type: application/vnd.ms-powerpoint

Title: Apache Tika

Author: Jukka Zitting

new PowerPointParser().parse(…);



Media Type Detection

application/vnd.ms-powerpoint

MimeTypes types = …;
MimeType type = types.getMimeType(…);

tika-mimetypes.xml

/etc/magic

mime.types

?
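As a rough illustration of detection from magic bytes and file name patterns, the sketch below checks a few byte prefixes first and falls back to the file name suffix. The class `MediaTypeSketch`, its `detect` method, and the tiny hard-coded tables are assumptions made for this example; they do not reproduce the `MimeTypes` class or the contents of tika-mimetypes.xml.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical detector sketch; it does not reproduce Tika's MimeTypes class. */
class MediaTypeSketch {
    // Magic byte prefixes (written as ASCII text) mapped to media types. Illustrative subset.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // File name suffixes mapped to media types. Illustrative subset.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("{\\rtf", "application/rtf");
        SUFFIXES.put(".ppt", "application/vnd.ms-powerpoint");
        SUFFIXES.put(".pdf", "application/pdf");
        SUFFIXES.put(".txt", "text/plain");
    }

    /** Returns true when data begins with the ASCII bytes of prefix. */
    static boolean startsWith(byte[] data, String prefix) {
        byte[] expected = prefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    /** Tries magic bytes first, then the file name, then a generic fallback type. */
    static String detect(byte[] header, String fileName) {
        if (header != null) {
            for (Map.Entry<String, String> entry : MAGIC.entrySet()) {
                if (startsWith(header, entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader, "report.bin")); // application/pdf
        System.out.println(detect(new byte[0], "slides.ppt")); // application/vnd.ms-powerpoint
    }
}
```

Checking content before the name is a design choice of this sketch: a misnamed file is still recognized, while formats without a distinctive prefix fall through to the name-based rule.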



Combined Detection and Extraction

PPT

Type: application/vnd.ms-powerpoint

Title: Apache Tika

Author: Jukka Zitting

TXT

PDF

XML

new AutoDetectParser().parse(…);

?
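One way to picture combined detection and extraction is a dispatcher that first determines the media type and then delegates to whichever parser is registered for it. The sketch below is an assumption-laden illustration: `AutoDetectSketch`, `register`, the two-rule `detect` method, and the use of `Function<byte[], String>` as a stand-in for a parser are all hypothetical and far simpler than a real `AutoDetectParser`.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dispatcher sketch; not the actual AutoDetectParser. */
class AutoDetectSketch {
    // Media type -> stand-in "parser" that turns raw bytes into extracted text.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    /** Deliberately tiny detector: one magic prefix, then one file name suffix. */
    static String detect(byte[] data, String fileName) {
        String head = new String(data, 0, Math.min(data.length, 5), StandardCharsets.ISO_8859_1);
        if (head.equals("%PDF-")) {
            return "application/pdf";
        }
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    /** Detects the type, then delegates to the parser registered for that type. */
    String parse(byte[] data, String fileName) {
        String type = detect(data, fileName);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return type + ": " + parser.apply(data);
    }

    public static void main(String[] args) {
        AutoDetectSketch auto = new AutoDetectSketch();
        auto.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(auto.parse(data, "note.txt")); // text/plain: hello
    }
}
```

The caller never names a format; adding support for a new type only means registering another entry.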



Thank You!
