Apache Tika End-to-End An introduction to Apache Tika, and integrating it to your application

Dec 17, 2015

Page 1: Apache Tika End-to-End An introduction to Apache Tika, and integrating it to your application.

Apache Tika End-to-End

An introduction to Apache Tika, and integrating it to your application

Nick Burch

Software Engineer, Alfresco

Apache Tika
http://tika.apache.org/

• Project which started in 2006
• Grew out of the Lucene community, now widely used
• Provides detection of files – e.g. this binary blob is really a Word file, that one is UTF-8 plain text
• Plain text, HTML and XHTML versions of a wide range of different file formats
• Consistent metadata from different files
• Tika hides the complexity of the different formats and their libraries, and instead presents a simple, powerful API
• Easy to use and extend
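
The detection bullet above comes down to looking at a file's leading bytes rather than trusting its name. The sketch below illustrates that idea with the JDK only. `MagicSniffer` and its tiny signature table are hypothetical names made up for this example; this is not Tika's detector, which knows far more formats and can also look inside containers.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of byte-based detection; not Tika's implementation.
class MagicSniffer {
    // OLE2 compound document header, shared by binary Word, Excel and PowerPoint files
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = { 'P', 'K', 3, 4 };

    /** Guesses a media type from the first bytes of a file. */
    static String detect(byte[] head) {
        // Generic label only: telling Word from Excel needs a look inside the container
        if (startsWith(head, OLE2)) return "application/x-ole-storage";
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, ZIP)) return "application/zip";
        if (isUtf8Text(head)) return "text/plain; charset=UTF-8";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Treats the bytes as text if they contain no NUL and decode strictly as UTF-8
    private static boolean isUtf8Text(byte[] data) {
        for (byte b : data) {
            if (b == 0) return false;
        }
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Calling `MagicSniffer.detect(...)` with the first few bytes of a file is enough for these simple cases. The OLE2 branch shows why a container-aware step is still needed afterwards.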

What's new?

• Lots of new parsers – text, office formats, publishing formats, images, audio, CAD, fonts etc
• Long standing parsers improved – better HTML from Word for example
• Embedded resources and containers
• Use expanding – used by many SOLR users, Alfresco, lots of people crunching masses of data on Hadoop

Supported Formats Page 1

• Audio – WAV, RIFF, MIDI
• DWG (CAD)
• Epub
• RSS and ATOM Feeds
• True Type Fonts
• HTML
• Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
• iWork (Keynote, Pages etc)
• RFC822 mbox Mail

Supported Formats Page 2

• Microsoft Outlook .msg Email
• Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
• Microsoft Office (OOXML) – Word, PowerPoint, Excel
• MP3 (id3 v1 and v2)
• CDF (Scientific Data)
• Open Document Format (Open Office)
• Old-style Open Office (.sxw etc)
• PDF

Supported Formats Page 3

• Zip and Tar archives
• RDF
• Plain Text
• FLV Video
• XML
• Java class files

And I probably forgot one...!

Metadata

• Tika provides consistent metadata across the range of parsers

• No need to know if it's “Last Author”, “Last Editor” or “Previous Author” in a file format, they all come back with the same metadata key

• Keys and values are strings, but strongly typed metadata entries provide converters to dates, ints etc
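
The two ideas on this slide, one key for many format-specific names and typed access on top of string values, can be sketched with plain JDK collections. `MetadataSketch`, the `last-author` key and the alias table are hypothetical names chosen for illustration; this is not Tika's `Metadata` class or its key names.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of consistent, typed metadata; not Tika's Metadata class.
class MetadataSketch {
    // Format-specific field names mapped onto one common key (assumed names)
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("Last Author", "last-author");
        ALIASES.put("Last Editor", "last-author");
        ALIASES.put("Previous Author", "last-author");
    }

    // Keys and values are stored as strings
    private final Map<String, String> values = new HashMap<>();

    /** Stores a value under the common key, whatever the format called it. */
    void setRaw(String formatSpecificName, String value) {
        values.put(ALIASES.getOrDefault(formatSpecificName, formatSpecificName), value);
    }

    String get(String key) {
        return values.get(key);
    }

    /** Typed view: converts the stored string to an int, or null if absent. */
    Integer getInt(String key) {
        String v = values.get(key);
        return v == null ? null : Integer.valueOf(v);
    }

    /** Typed view: converts an ISO-8601 string to an Instant, or null if absent. */
    Instant getDate(String key) {
        String v = values.get(key);
        return v == null ? null : Instant.parse(v);
    }
}
```

A caller reads `get("last-author")` without caring whether the source file called the field "Last Author" or "Last Editor".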

Text Content

• Tika generates HTML-like SAX events as it parses
• Uses Java SAX API
• Events can be captured or transformed
• Body Content Handler used for plain text
• HTML and XHTML available
• Can customise with your own handler, with XSLT or with E4X from JavaScript
• e.g. HTML Table → CSV
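
As an illustration of the "own handler" route for the HTML Table → CSV case, here is a minimal sketch of a SAX handler that turns table events into CSV. `TableToCsvHandler` is a hypothetical name. The `convert` helper feeds it from the JDK's own XML parser so the example runs without Tika; with Tika, the assumption is that the same handler would simply be passed to the parser in place of a body content handler. Nested tables are not handled.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical custom handler: collects table cells from SAX events as CSV.
class TableToCsvHandler extends DefaultHandler {
    private final StringBuilder csv = new StringBuilder();
    private final StringBuilder cell = new StringBuilder();
    private boolean inCell = false;
    private boolean firstCellInRow = true;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = elementName(localName, qName);
        if (name.equals("tr")) {
            firstCellInRow = true;
        } else if (name.equals("td") || name.equals("th")) {
            inCell = true;
            cell.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCell) cell.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = elementName(localName, qName);
        if (name.equals("td") || name.equals("th")) {
            if (!firstCellInRow) csv.append(',');
            csv.append(quote(cell.toString().trim()));
            firstCellInRow = false;
            inCell = false;
        } else if (name.equals("tr")) {
            csv.append('\n');
        }
    }

    // Parsers that are not namespace aware report the name in qName only
    private static String elementName(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    // Quotes a value if it contains a comma, a quote or a line break
    private static String quote(String v) {
        if (v.contains(",") || v.contains("\"") || v.contains("\n")) {
            return "\"" + v.replace("\"", "\"\"") + "\"";
        }
        return v;
    }

    @Override
    public String toString() {
        return csv.toString();
    }

    /** Stand-in event source: parses well-formed XHTML with the JDK parser. */
    static String convert(String xhtml) {
        try {
            TableToCsvHandler handler = new TableToCsvHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```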

Calling Tika

// Get a content detector, and an auto-selecting Parser
TikaConfig config = TikaConfig.getDefaultConfig();
ContainerAwareDetector detector = new ContainerAwareDetector(
        config.getMimeRepository());
Parser parser = new AutoDetectParser(detector);

// We’ll only want the plain text contents
ContentHandler handler = new BodyContentHandler();

// Tell the parser what we have
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

// Have it processed
parser.parse(input, handler, metadata, new ParseContext());

// Plain text only content handler
ContentHandler handler = new BodyContentHandler();
// ... run parser.parse(input, handler, metadata, context), then:
String text = handler.toString();

// XHTML content handler
// newInstance() is declared to return TransformerFactory, so a cast is needed
SAXTransformerFactory factory = (SAXTransformerFactory)
        SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter sw = new StringWriter();
handler.setResult(new StreamResult(sw));
// ... run parser.parse(input, handler, metadata, context), then:
String text = sw.toString();
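
To see the XHTML handler set-up above in isolation, the following sketch drives a `TransformerHandler` with a few hand-fired SAX events and reads the serialised result afterwards. It uses only JDK classes. `XhtmlSerialiserSketch` and `render` are hypothetical names, and the hand-written events stand in for what a parser would deliver. The output string is only complete once `endDocument()` has been called.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: serialise SAX events to an XML string with the JDK.
class XhtmlSerialiserSketch {
    static String render(String paragraph) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            handler.setResult(new StreamResult(sw));

            // Stand-in for parser.parse(...): fire the events by hand
            AttributesImpl noAttributes = new AttributesImpl();
            char[] chars = paragraph.toCharArray();
            handler.startDocument();
            handler.startElement("", "html", "html", noAttributes);
            handler.startElement("", "p", "p", noAttributes);
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endElement("", "html", "html");
            handler.endDocument();

            // Read the writer only after all events have been delivered
            return sw.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not serialise events", e);
        }
    }
}
```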

Tika Parsers

Parser Interface

• Two key methods – what mime types are supported, and do the parsing

public interface Parser {
    Set<MediaType> getSupportedTypes(ParseContext context);

    void parse(InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException;
}
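
Why the supported-types method matters can be shown with a small dispatch sketch: an auto-selecting parser can look up the detected type and hand the work to whichever parser declared it. The types below (`MiniParser`, `ShoutingParser`, `MiniDispatcher`) are hypothetical, heavily simplified stand-ins written for this example, not Tika classes, and they work on strings rather than streams.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for a parser interface; not a Tika type.
interface MiniParser {
    Set<String> supportedTypes();

    String parse(String input);
}

// Toy parser that claims a made-up media type
class ShoutingParser implements MiniParser {
    public Set<String> supportedTypes() {
        return Collections.singleton("hello/world");
    }

    public String parse(String input) {
        return input.toUpperCase();
    }
}

// Picks a parser by media type, using what each parser says it supports
class MiniDispatcher {
    private final Map<String, MiniParser> byType = new HashMap<>();

    MiniDispatcher(List<MiniParser> parsers) {
        for (MiniParser parser : parsers) {
            for (String type : parser.supportedTypes()) {
                byType.put(type, parser);
            }
        }
    }

    String parse(String mediaType, String input) {
        MiniParser parser = byType.get(mediaType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + mediaType);
        }
        return parser.parse(input);
    }
}
```

Adding support for a new format then means adding one more parser to the list; the calling code does not change.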

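import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class HelloWorldParser implements Parser {
    // Declare the single (made-up) media type this parser handles
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.parse("hello/world"));
        return types;
    }

    // Ignore the input and emit a fixed XHTML document plus metadata
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.startElement("h1");
        xhtml.characters("Hello, World!");
        xhtml.endElement("h1");
        xhtml.endDocument();

        metadata.set("hello", "world");
        metadata.set("title", "Hello World!");
    }
}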

Demo: Tika-App

Demo: Geo-Tagged Images in Alfresco Share via Tika

Any Questions?