Apache Tika: 1 point Oh!

Chris A. Mattmann
NASA JPL/Univ. Southern California/ASF

mattmann@apache.org November 9, 2011

• Apache Member involved in
  – OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor), Gora (Champion), MRUnit (Mentor), and Airavata (Mentor)

• Senior Computer Scientist at NASA JPL in Pasadena, CA USA

• Software Architecture/Engineering Prof at Univ. of Southern California

And you are?

Roadmap
• 1st part of the talk
  – Why Tika?
  – What is Tika?
  – What are the current versions of Tika?
  – What can it do?
• 2nd part of the talk
  – NASA Earth Science Data Systems
  – Data System Needs and Requirements
  – How does Tika help?

The Information Landscape

Proliferation of content types available

• By some accounts, 16K to 51K content types*

• What to do with content types?
  – Parse them
    • How?
    • Extract their text and structure
  – Index their metadata
    • In an indexing technology like Lucene, Solr, or in Google Appliance
  – Identify what language they belong to
    • Ngrams

*http://filext.com/

Importance of content types

Importance of content type detection

Search Engine Architecture

• Identify and classify file types
  – MIME detection
    • Glob pattern
      – *.txt
      – *.pdf
    • URL
      – http://…pdf
      – ftp://myfile.txt
    • Magic bytes
    • Combination of the above means
• Classification means reaction can be targeted
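The detection means listed above can be sketched in a few lines of plain Java. This is only an illustrative toy and not Tika's implementation: the class name `MimeSniffSketch`, the tiny rule tables, and the choice to let magic bytes win over the file-name glob are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy MIME detector: magic bytes first, then file-name glob, then a generic fallback. */
final class MimeSniffSketch {

    // Hypothetical, tiny rule tables; a real registry holds far more entries.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        MAGIC.put("application/gzip", new byte[] {0x1f, (byte) 0x8b});
        GLOBS.put(".txt", "text/plain");       // stands in for the glob *.txt
        GLOBS.put(".pdf", "application/pdf");  // stands in for the glob *.pdf
    }

    /**
     * @param name file name or URL (may be null)
     * @param head the first bytes of the content (may be null or empty)
     */
    static String detect(String name, byte[] head) {
        // 1. Magic bytes: the content itself is the strongest hint.
        if (head != null) {
            for (Map.Entry<String, byte[]> rule : MAGIC.entrySet()) {
                if (startsWith(head, rule.getValue())) {
                    return rule.getKey();
                }
            }
        }
        // 2. Glob on the file name or URL.
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> rule : GLOBS.entrySet()) {
                if (lower.endsWith(rule.getKey())) {
                    return rule.getValue();
                }
            }
        }
        // 3. Nothing matched: fall back to the generic binary type.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `MimeSniffSketch.detect("report.txt", pdfBytes)` with content that starts with `%PDF-` illustrates why combining the means matters: the misleading `.txt` extension is overruled by the magic bytes.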

What is Tika?
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, and integration of various parsing libraries

• A rich Metadata API for representing different Metadata models

• A command line interface to the underlying Java code

• A GUI interface to the Java code

Tika’s (Brief) History
• Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
• Proposed as Lucene sub-project
  – Others interested, didn’t gain much traction
• Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit
  – A Content Management System
• Graduated from the Incubator to Lucene sub-project in 2008
• Graduated to Apache TLP in April 2010
• 40, 88, and 29 issues resolved in versions 1.0, 0.10, and 0.9, respectively

Community
• Mailing lists
  – User: 125 peeps, ~70 msg/mo.
  – Dev: 210 peeps, ~250 msg/mo.
• Committers/PMC
  – 13 peeps
  – Large majority of them active
• Releases
  – 11 releases so far
  – Just pushed out 1 point OH
• http://s.apache.org/N0I

Credit: svnsearch.org

Use in the classroom
• Have used Apache Tika for the past 2 years in both my Search Engines/Information Retrieval class and my Software Architecture class
  – Several student final projects have turned into contributions for the project and merit for the students
• Define data management projects that involve the use of OODT, and other technologies like Solr, Tika, Nutch, Hadoop, etc.

Some recent 1 point oh press

Getting started rapidly…like now!

• Download Tika from:
  – http://tika.apache.org/download.html
• Grab tika-app-1.0.jar
• alias tika "java -jar tika-app-1.0.jar" (csh/tcsh syntax; in bash: alias tika="java -jar tika-app-1.0.jar")
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)

A quick NASA dataset
• Atmospheric Infrared Sounder Mission (AIRS)
  – Level 2 Cloud Clear Radiance Product
  – Grab it from here:
    • ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/
  – Just grab the first file
• java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf
  – Hopefully this worked for you, if not, blame..
    • Windows
      – And Bill Gates


Detecting MIME types from Java

• String type = new Tika().detect(…)
  – java.io.InputStream
  – java.io.File
  – java.net.URL
  – java.lang.String

Adding new MIME types

• Got XML?

• Based on freedesktop.org spec (loosely)

Many custom applications and tools

• You need this: to read this:

Third-party parsing libraries

• Most of the custom applications come with software libraries and tools to read/write these files
  – Rather than re-invent the wheel, figure out a way to take advantage of them
• Parsing text and structure is a difficult problem
  – Not all libraries parse text in equivalent manners
  – Some are faster than others
  – Some are more reliable than others

Parsing

• String content = new Tika().parseToString(…)
  – InputStream
  – File
  – URL

Streaming Parsing

• Reader reader = new Tika().parse(…)
  – InputStream
  – File
  – URL

Extraction of Metadata

• Important to follow common Metadata models
  – Dublin Core: any electronic resource
  – XMP: also general like Dublin Core
  – Word Metadata: specific to .doc, .ppt, etc.
  – EXIF: image related
• Lots of standards and models out there
  – The use and extraction of common models allows for content intercomparison
  – All standardize mechanisms for searching
  – You always know for X file type that field Y is there and of type String or Int or Date
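The last point, that a common model guarantees which fields exist and what type they have, can be illustrated with a small stand-alone sketch. Everything here is hypothetical (the class `TypedMetadataSketch`, its `ValueType` enum, and the field names); it is not Tika's Metadata API, just one way such a typed, multi-valued model could look.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy metadata store that only accepts fields declared in a model, with type checks. */
final class TypedMetadataSketch {

    enum ValueType { STRING, INT, DATE }

    // The "model": field name -> declared type (hypothetical fields, chosen by the caller).
    private final Map<String, ValueType> model = new LinkedHashMap<>();
    // The values: every field may hold several values.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    TypedMetadataSketch declare(String field, ValueType type) {
        model.put(field, type);
        return this;
    }

    /** Adds a value after checking it against the declared type; throws if it does not fit. */
    void add(String field, String raw) {
        ValueType type = model.get(field);
        if (type == null) {
            throw new IllegalArgumentException("Field not in model: " + field);
        }
        switch (type) {
            case INT:
                Integer.parseInt(raw);   // throws NumberFormatException on bad input
                break;
            case DATE:
                LocalDate.parse(raw);    // expects ISO-8601, e.g. 2011-11-09
                break;
            default:
                break;                   // STRING accepts anything
        }
        values.computeIfAbsent(field, k -> new ArrayList<>()).add(raw);
    }

    List<String> get(String field) {
        return values.getOrDefault(field, new ArrayList<>());
    }
}
```

Because the model rejects undeclared fields and ill-typed values at insertion time, a search layer built on top can rely on field Y being a date (or an int, or a string) without re-validating each document.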

Cancer Research Example

Attributes

Relationships

Credit: A. Hart

Tika Sponsoring the Any23 Project

• Tika PMC is sponsoring the Any23 project in the Incubator (entered: 10/1/2011)

• Any23 = “Anything to Triples”
• Semantic Toolkit for parsing, identification of all major semantic web content types (RDF, etc.)
• Related to Apache Jena
• Looking for synergies between 2 efforts

Metadata

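• Example:
    import java.util.Arrays;
    import org.apache.tika.metadata.Metadata;

    Metadata met = new Metadata();
    // Dublin Core
    met.set(Metadata.FORMAT, "text/html");
    // multi-valued: add() appends, whereas a second set() would overwrite
    met.add(Metadata.FORMAT, "text/plain");
    System.out.println(Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)
  – Run: tika --list-met-models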

Methods for language identification

• N-grams
  – Method of detecting next character or set of characters in a sequence
  – Useful in determining whether small snippets of text come from a particular language, or character set
• Non-computational approaches
  – Tagging
  – Looking for common words or characters
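To make the N-gram idea concrete, here is a minimal stand-alone sketch of language identification with character trigrams. It is not the algorithm Tika ships: the class `NgramLanguageSketch`, the use of cosine similarity to compare profiles, and the tiny training sentences in the usage note are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy language identifier based on character trigram counts. */
final class NgramLanguageSketch {

    /** Builds a trigram profile; runs of non-letters collapse to a single '_' word boundary. */
    static Map<String, Integer> trigrams(String text) {
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram profiles, in the range 0..1. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the most similar profile, or "unknown" if nothing overlaps. */
    static String identify(String snippet, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> grams = trigrams(snippet);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> profile : profiles.entrySet()) {
            double score = cosine(grams, profile.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = profile.getKey();
            }
        }
        return best;
    }
}
```

A caller would build one profile per language from sample text, for example `profiles.put("en", NgramLanguageSketch.trigrams(englishSample))`, and then pass short snippets to `identify`. Real profiles are trained on much larger corpora; with only a sentence or two of training text the result is fragile.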

Language Detection

• LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses Ngram analysis included with Tika
  – Originating from Nutch
  – Can be improved

Running Tika in GUI form

• tika --gui

Integrating Tika into your App

• Maven

• Ant

• Eclipse

• It’s just a set of jars
  – tika-core
  – tika-parsers
  – tika-app
  – tika-bundle
  – tika-server

tika-core

tika-parsers

tika-app

tika-bundle

tika-server

Some really great stuff in 1.0
• Super improved OSGi support
  – New tika-bundle module

• Improved RTF parsing support, OO support, and parsing of Outlook email attachments

• Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian

• Improved PDF parsing (extract annotation)

NICK ALREADY TALKED ABOUT THIS!!! Thunder stolen

Things to watch out for

• Deprecated APIs -> gone
  – Recompile code
• No more JDK 1.4 version of Tika
  – Upgrade

Improvements to Tika

• Adding more parsers for content types
• Improve the JAX-RS server support
• Expanding ability to handle random access file parsing
  – Scientific data file formats, some work on this
  – Leverage improvements in file representation
  – TIKA-701, TIKA-654, TIKA-645, TIKA-153
• Geospatial parsing support through GDAL
• Improving language and charset detection

Part 2

Science Data Systems at NASA

Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295

NASA Ground Data Systems

Credit: D. Woollard

Context
• NASA develops science data processing systems for multiple earth science missions
• These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research
• Typical characteristics
  – Remote sensing instruments that orbit the Earth multiple times daily
  – Data are acquired constantly
  – Complex algorithms convert instrument measurements to geophysical quantities

The Square Kilometer Array

• 1 sq. km of antennas
• Never-before-seen resolution looking into the sky
• 700 TB
  – Per second!

NASA DESDynI Mission

• 16 TB/day

• Geographically distributed

• 10s of 1000s of jobs per day

• Tier 1 Earth Science Decadal Mission

Some Considerations
• Scale
  – Data throughput rates
  – # of data types
  – # of metadata types
  – # of users to send the data to
• Federation
  – Must leave the data where it is
  – Socio/Economic/Political
• Heterogeneity
  – Technology, data formats, skills!

Apache OODT

• We’ve got some components to deal with these issues

How are we building these systems now?
- Allow for push/pull of data over arbitrary protocols
- Ingestion builds std catalog and
