XTF in Depth

Powerful Search and Display for Electronic Text

Martin Haye, California Digital Library

January 2009 presentation at University of Sydney

XTF in Depth

Part 1:
• What is XTF and how does it compare?
• Who is using it?
• What needs does it address?
• New features in 2.1
• Design and data flow
• Adapting Lucene and Saxon
• Planned improvements

Part 2: Interactive demos

XTF in 5 minutes
• eXtensible Text Framework
• Search and display technology from CDL
• Open-source Java framework
• Powerful and highly configurable
• All about rapid prototyping, fast deployment, and incremental improvement
• XML + full-text search
• Also indexes PDF, HTML, Word
  • Excel and PowerPoint coming soon

XTF in 5 minutes
• Search: query power and speed of Lucene, plus:
  • search results shown in context
  • keyword search, facets, spelling, lots more
• View: processing power of Saxon, plus:
  • large file optimizations, hit markup
• Configure and customize exclusively in XSLT
• Flexible, overlapping collections
• Mature, tightly integrated, well documented
• In use at CDL and many other places

What XTF is not
• It is not a content management system; it does not handle:
  • Creation (conversion, scanning, manual)
  • Ingest / administration
  • Editing
  • Preservation
• Not built for remote administration
• Not a true XML database, but close
• Not Google
  • Google: one interface to a vast grab-bag of data
  • XTF: crafted interfaces to high-quality data sets

How does XTF compare?

Customizable / Powerful ---------------------------------------->

Greenstone

XTF 2.0

XTF 2.1

* caveat: based on my limited experience with Greenstone and Solr

Online Archive of California


eScholarship Editions

calisphere

Mark Twain Project Online

UC Berkeley

University of Sydney

Encyclopedia of Chicago

Indiana University: Newton

Indiana University: Swinburne

Sweden

Brazil

Let’s look at four needs that XTF was created to address:
• Diverse data
• Open software
• Rapid deployment
• Community involvement

Needs: 1. Diverse data

Our collections: many and diverse
• eScholarship (TEI, PDF)
  • UC Press monographs (a text may be > 10 MB)
  • 25,000 scholarly articles in PDF
• Mark Twain
  • Hand-crafted critical edition (TEI + MODS)
• OAC: finding aids, images, books, manuscripts
  • Japanese American Relocation Digital Archives
  • TEI, EAD, MODS
• Book scanning projects (Google, Internet Archive)
  • Thousands of scanned books (PDF + DC)
  • Millions of Melvyl catalog records (MARC)

Needs: 2. Open software

Digital Publishing Products
• “Black box” (no control over fixes & features)
• Often not standards-based
• Tech companies have short lifespans
• Support often spotty
• Data can be held hostage, or even lost
• $$$$$

Needs: 3. Rapid deployment

• New collections arriving
• Users don't want to wait a year for access
• Many “what if” and “wouldn't it be cool” requests from our staff
• Java programmers are expensive
• Look & feel goes stale quickly
• Barrage of feature requests

Needs: 4. Community involvement

• We want to share the load
• For XTF 2.1, we asked the XTF community to vote for features they wanted
• At CDL we try to align our development to the needs of the community
• Result: everybody benefits

New and improved in 2.1

• Faceted browse
• Search flexibility
• Bookbag
• Spelling correction
• Similar items
• OAI-PMH

Faceted browse

• Previously, implementing faceted browse required lots of XSLT programming.
• Hierarchical facets were even harder.
• Supporting them required us to deeply refactor the stylesheets, but now it’s simple to add new facets.

Faceted browse

Hierarchical facets

Search flexibility

Keyword search: single box (now default). Internally, searches multiple fields.

Advanced search: explicitly fill in constraints for various fields

Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.

Keyword search

Advanced search

Freeform search

OAI-PMH

• This fit nicely into XTF’s architecture
• Simple but conforming implementation

Bookbag

• Refactored the AJAX to use YUI (Yahoo User Interface widgets)
• Still session based
• Now supports emailing the bookbag

Bookbag

Spelling correction

• Unicode bug fixes
• On by default and fully integrated

Spelling correction

Similar items

Other changes in XTF 2.1
• Built-in NLM “Blue”, TEI P5, MS Word support (still support TEI P4, EAD, PDF, HTML, text)
• Valid XHTML output
• RawQuery servlet to provide a query back-end to a front-end (e.g. Ruby) or mash-up
• Bug fixes and minor changes (many reported/requested by users)
• Wiki documentation

Design philosophy
• Adaptation through programming
• XTF is still about building what you want using a set of powerful tools
• But now:
  • Stylesheets are more modular
  • Build interfaces faster using honed widgets
  • Prettier UI to start with

XTF is open, standards based
• Based on free, open-source tools:
  • Java SDK 1.5+
  • Lucene 2.1 full-text search toolkit
  • Saxon 8.9 XSLT processor
• Unicode support throughout
• XTF itself is open-source (BSD license)
• No native code: pure Java and XSLT 2.0
• Runs on Windows, Solaris, Linux, Mac OS
• Drops right into Tomcat or Resin
• Lots of user-fixable documentation

Modular
• Use the crossQuery servlet to search, dynaXML to display and navigate. Deploy one or both.
• Stylesheets govern the flow of data; no Java programming required
• Easy to add features incrementally
• 100% configurable “look and feel”
• Skin & slice: one system can have several interfaces and multiple “brands”
• Collection subsetting driven by meta-data

Why XSLT?
• XSLT is a natural fit for XML
  • Powerful, dynamic language
  • Incredibly high-quality, free processor (Saxon)
• Why not Java/Struts?
  • Poor for rapid prototyping, steep learning curve
• Why not Ruby?
  • Not necessarily a good match for XML data
  • Can be too clever by half
  • But a smart mash-up might be cool...

Indexing Process

Indexing
• Input filters adapt to many doc types
  • Any XML doc type
  • PDF, MS Word, plain text, untidy HTML
• XTF is agnostic regarding:
  • Document identifiers
  • Filesystem organization (uses a document selector stylesheet to identify and classify documents in the filesystem)
  • Meta-data storage
• Incremental indexing
  • Simply update the filesystem, then run the indexer.

crossQuery servlet

Flexible Search/Display
• One query, many collections
  • XTF enables “virtual collections”
• Output filters for various result views
  • e.g. simple vs. advanced search form, results in brief vs. long format, etc.
• Query parsers for different search interfaces
• Interface to other query protocols
  • SRU and OAI-PMH already implemented
  • Should be easy to adapt other queries:
    • Very extensive set of query operators
    • Flexible query composition

Faceted browse

Query Power
• Many operators
  • AND, OR, NEAR, NOT, phrase, range, wildcard
  • OR-NEAR, multi-field AND, “more like this”
• Arbitrarily complex queries
  • Combine full-text search with meta-data
  • Unusual queries like: "dynamic duo" near "red phone"
• Structure-aware searching
  • e.g. search only headings, or only bibliographies
  • But must pre-define which structures to search

More Power
• Fixed-length snippets
• Highlight the hit and just the hit
• Sort by relevance, or any meta-data fields
• Spelling correction
• No penalty for huge documents
  • XTF “lazily” pulls in only those parts used by a particular request (e.g. show just Chapter 1)
• Scalable
  • Proven with 10 million records / 14 GB of data
  • but beyond that, Solr looks better
• Authentication: IP lists, LDAP, or external

dynaXML servlet

Adapting Lucene and Saxon
• Adapting Lucene
  • Chunking, flattening, hit marking, stop-words, setting limits, insensitivity, special queries, faceted browsing, spelling correction
• Adapting Saxon
  • Lazy trees, misc. extensions

Adapting Lucene: Chunking
• Why
  • Lucene's proximity searches perform best on small documents
  • Small chunks enable efficient generation of the 80-character “snippet” surrounding each hit
• How
  • XTF breaks text blocks into 200-word chunks
  • Chunks overlap to detect a hit starting in one and ending in the next.
  • Each chunk carries structural info, plus a pointer to its location in the XML doc.
  • Only the first chunk carries meta-data for the doc
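
A minimal sketch of overlapping chunking, in Python rather than XTF's Java. The 200-word chunk size comes from the slide; the overlap size and the name `chunk_words` are assumptions made for illustration, not XTF's actual values or API.

```python
def chunk_words(words, chunk_size=200, overlap=20):
    """Split a list of words into overlapping chunks.

    Returns (start_offset, chunk) pairs; the offset stands in for the
    pointer back into the source document. The overlap of 20 words is
    an assumed value, not taken from XTF.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append((start, words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

Because each chunk repeats the tail of the previous one, a phrase that straddles a chunk boundary still appears whole in at least one chunk, provided it is no longer than the overlap.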

Adapting Lucene: Flattening XML
• XSLT prefilter flattens XML structure
  • Series of text blocks
  • Block tagged with structural info for search
  • Prefilter can boost or suppress sections
  • Fine control over proximity matching
• Prefilter gathers/marks meta-data
  • Can come from within the document, from an XML doc in the filesystem, or fetched from a URL.
  • Synthesize meta-data (e.g. sort fields, facets)
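
To illustrate the idea of flattening, here is a rough Python stand-in. XTF does this with an XSLT prefilter, not with this code; the function name `flatten` and the path-style tag are hypothetical. It turns an XML tree into a series of text blocks, each tagged with the element path it came from.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Return (element_path, text) blocks in document order."""
    root = ET.fromstring(xml_text)
    blocks = []

    def walk(elem, path):
        here = path + [elem.tag]
        # Text directly inside this element, before any child.
        if elem.text and elem.text.strip():
            blocks.append(("/".join(here), elem.text.strip()))
        for child in elem:
            walk(child, here)
            # Text following the child still belongs to this element.
            if child.tail and child.tail.strip():
                blocks.append(("/".join(here), child.tail.strip()))

    walk(root, [])
    return blocks
```

A search restricted to headings would then only consider blocks whose tag ends in the heading element, which is the kind of structure-aware query the prefilter makes possible.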

Adapting Lucene: Hit Marking
• Marking search hits in context
  • Lucene doesn't pinpoint the location of hits, only gives a score per document
  • Custom enhancements to Lucene's “span” logic score and locate each hit.
  • dynaXML dynamically adds ranked hits to the original XML doc, then sends it to the XSLT formatter.
  • crossQuery forms a snippet around each hit and highlights it.
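
Once a hit's character offsets are known, forming a fixed-length snippet around it is straightforward. The following Python sketch assumes the offsets are already available; `make_snippet` and the bracket markers are hypothetical and do not reflect crossQuery's real output format.

```python
def make_snippet(text, hit_start, hit_end, width=80,
                 open_mark="[", close_mark="]"):
    """Return about `width` characters of context with the hit marked."""
    hit_len = hit_end - hit_start
    window = max(width, hit_len)
    # Centre the window on the hit, then clamp it to the text bounds.
    left = max(0, hit_start - (window - hit_len) // 2)
    right = min(len(text), left + window)
    left = max(0, right - window)
    return (text[left:hit_start] + open_mark + text[hit_start:hit_end]
            + close_mark + text[hit_end:right])
```

For example, `make_snippet("one small step for man", 10, 14, width=10)` returns `"ll [step] fo"`: ten characters of text with only the hit itself marked.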

Adapting Lucene: Stop-words
• Robust, efficient stop-word handling
  • “the, a, an, it, on...”
  • People do use them, and expect corresponding results.
  • Lucene normally ignores stop-words, for speed.
  • XTF quietly joins stop-words to adjacent words, forming “n-grams”
  • Example: “man on the moon” -> man-on on-the the-moon
  • Queries are internally rewritten to search for n-grams automatically.
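
The n-gram example above can be reproduced with a short Python sketch. The stop-word list is the abbreviated one from the slide, the hyphen joiner follows the slide's notation, and `stopword_bigrams` is a hypothetical name; XTF's internal token format may differ.

```python
STOP_WORDS = {"the", "a", "an", "it", "on"}


def stopword_bigrams(text, stop_words=STOP_WORDS):
    """Join each stop-word to its neighbours, yielding two-word n-grams."""
    tokens = text.lower().split()
    grams = []
    for left, right in zip(tokens, tokens[1:]):
        # Only pairs that involve at least one stop-word are joined.
        if left in stop_words or right in stop_words:
            grams.append(f"{left}-{right}")
    return grams
```

Calling `stopword_bigrams("man on the moon")` gives `["man-on", "on-the", "the-moon"]`. Running the same function over the query text is one way to picture the automatic query rewriting.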

Adapting Lucene: Setting Limits
• Limits on aberrant queries
  • Adjustable limits on the number of terms matched by range or wildcard queries
  • N-grams naturally make most queries efficient
  • Configurable limits on the amount of “work” performed by a single query.
• Numeric range query
  • Avoids term expansion
  • Efficiently filters very granular data, e.g. timestamps: 2006-11-14:12:46:03.77

Adapting Lucene: Insensitivity
• Accent/diacritic marks
  • Many users can't or don't know how to type them
  • XTF indexer uses a configurable map to remove accents
  • crossQuery applies the same map to query terms
• Plural
  • Convenient for “cat” to match “cats” also
  • Configurable map of plural to singular used at index and query time
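
A small Python sketch of both kinds of insensitivity. Note two substitutions: XTF uses a configurable accent map, whereas this example strips accents via Unicode decomposition; and `PLURAL_MAP` is a tiny hypothetical stand-in for XTF's configurable plural map. The key point is that the same normalization runs at index time and at query time.

```python
import unicodedata

# Hypothetical plural-to-singular map; XTF's real map is configurable.
PLURAL_MAP = {"cats": "cat", "mice": "mouse"}


def strip_accents(term):
    """Remove combining marks after Unicode canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_term(term, plural_map=PLURAL_MAP):
    """Lowercase, remove accents, then map plural forms to singular."""
    folded = strip_accents(term.lower())
    return plural_map.get(folded, folded)
```

With this in place, an indexed “Café” and a query for “cafe” both normalize to `cafe`, and “Cats” matches “cat”.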

Adapting Lucene: Special Queries
• OR-NEAR
  • Standard OR query doesn't use proximity
  • OR-NEAR: if words are nearby, score is boosted
• Multi-field AND
  • All terms must be present, in any field.
  • Essential for certain keyword searches: against all enemies clarke (matches against title and author)
• More like this
  • Auto-calculates “interesting” terms in meta-data
  • Creates OR-NEAR query to find similar docs
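
The difference between an ordinary per-field AND and the multi-field AND can be shown with a toy Python example. Both function names and the dictionary-of-fields document model are assumptions for illustration; this covers only the matching rule, not scoring or OR-NEAR.

```python
def single_field_and(query, doc_fields):
    """True if some single field contains every query term."""
    terms = query.lower().split()
    return any(
        all(term in text.lower().split() for term in terms)
        for text in doc_fields.values()
    )


def multi_field_and(query, doc_fields):
    """True if every query term appears in at least one field."""
    terms = query.lower().split()
    words = set()
    for text in doc_fields.values():
        words.update(text.lower().split())
    return all(term in words for term in terms)
```

For a record with title “Against All Enemies” and author “Richard Clarke”, the keyword query `against all enemies clarke` fails the single-field test but passes the multi-field one, which is why the latter suits a single search box.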

Adapting Lucene: Faceted Browsing
• Draws facet term list from Lucene index
• Each facet cached in-memory
• Counts per group created dynamically
• Special mini-language to sort/select (esp. useful for hierarchical facets)
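
As a sketch of how per-group counts for a hierarchical facet might be built from the facet values of the matching documents, consider the Python below. The `::` level separator and the name `facet_counts` are assumptions for this example; the mini-language for sorting and selecting is not modelled.

```python
from collections import Counter


def facet_counts(doc_values, separator="::"):
    """Count documents at every level of a hierarchical facet value."""
    counts = Counter()
    for value in doc_values:
        parts = value.split(separator)
        # A document under US::California also counts toward US.
        for depth in range(1, len(parts) + 1):
            counts[separator.join(parts[:depth])] += 1
    return counts
```

Given the values `US::California::Berkeley`, `US::California::Oakland` and `Australia::Sydney`, the group `US::California` gets a count of 2 and `Australia` a count of 1.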

Adapting Lucene: Spelling Correction
• Any standard dictionary won't match place and proper names
• Idea: use the index as the source of suggestions
• XTF searches words within edit distance 2
• Candidates ranked by weighted score:
  • Edit distance (transpositions discounted)
  • Frequency of use in the index
  • Double-metaphone match
• Multi-word correction uses pair frequencies
• On test data, 80% right suggestion
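
A simplified Python sketch of index-driven suggestion. It covers two of the three ranking signals: edit distance with an adjacent transposition counted as one edit, and term frequency. The double-metaphone match is omitted, and the scoring formula and its weight of 2.0 are invented for illustration; XTF's actual weighting is not given in the talk.

```python
import math


def edit_distance(a, b):
    """Edit distance where swapping two adjacent characters costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def suggest(word, term_freqs, max_distance=2):
    """Pick the best-scoring indexed term within max_distance edits."""
    best, best_score = None, float("-inf")
    for term, freq in term_freqs.items():
        if term == word:
            continue
        dist = edit_distance(word, term)
        if dist > max_distance:
            continue
        # Assumed weighting: frequent terms win, each edit is penalized.
        score = math.log(1 + freq) - 2.0 * dist
        if score > best_score:
            best, best_score = term, score
    return best
```

Because the candidates come from the index itself, proper names such as “twain” can be suggested even though no standard dictionary lists them. A real implementation would also avoid scanning every term.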

Adapting Saxon: Lazy Trees
• The need: display small parts of large (> 10 MB) XML documents
• Solution: create a binary, random-access version of each document
  • XSL keys calculated once and stored
  • Only elements accessed by a given request are loaded from disk
  • Care must be taken in stylesheets
  • Profile mode is useful for optimization

Adapting Saxon: Extensions
• More complete SQL database connection
• Ability to call external tools
  • Automatic XML conversion in/out
  • Timeout enforcement
• File utilities
  • Check file existence
  • Get file length and timestamp
• Session data
  • Key/value pairs
  • Value can be XML or plain string

The future
• XTF 2.2:
  • Better out-of-box for large EADs
  • Fixes for incremental indexing; other bug fixes
  • Specify any number of sub-dirs to index
  • Possible TEI P5 refactoring
  • Background auto-warming of new index
  • Support for indexing PowerPoint and Excel files
• Further out:
  • A page-turner for scanned texts and converted PDFs
  • Pop-up image/PDF page snippets
  • And of course, features suggested by users

I’ll demonstrate the features we talked about on several different XTF sites “out in the wild.”

Fin

Project: xtf.sourceforge.net

Docs: xtf.wiki.sourceforge.net

Discuss: groups.google.com/group/xtf-user

This talk: xtf.sourceforge.net/talks/2009-01-23.ppt

Me: martin.haye@ucop.edu
