Taming Text

Post on 10-May-2015

DESCRIPTION

Presentation from March 18th, 2013 Triangle Java User Group on Taming Text. Presentation covers search, question answering, clustering, classification, named entity recognition, etc. See http://www.manning.com/ingersoll for more.

Transcript

Taming Text

Grant Ingersoll, CTO, LucidWorks

@tamingtext, @gsingers

About the Book

• Goal: An engineer’s guide to search and Natural Language Processing (NLP) and Machine Learning

• Target Audience: You
• All examples in Java, but concepts easily ported
• Covers:
  – Search, fuzzy string matching, human language basics, clustering, classification, Question Answering, intro to advanced topics

Answer Me This!

• What is trimethylbenzene?
  – http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body
• who is ten minute warning?
  – http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body
• what station serves the A train?
  – http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
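The demo URLs all follow the same pattern: the question is form-encoded into the q parameter and the QA parameters are appended. A minimal sketch of building such a URL, assuming the local Solr QA demo from the slides is running; the class and method names are hypothetical and not from the book's code:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QaUrlBuilder {
    // Base URL and query parameters are copied from the demo URLs on the slide.
    private static final String BASE = "http://localhost:8983/solr/answer";

    /** Builds the request URL for a natural language question. */
    public static String buildUrl(String question) {
        // Form encoding turns spaces into '+' and '?' into %3F.
        // URLEncoder.encode(String, Charset) requires Java 10 or later.
        String q = URLEncoder.encode(question, StandardCharsets.UTF_8);
        return BASE + "?q=" + q + "&defType=qa&qa=true&qa.qf=body";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("What is trimethylbenzene?"));
    }
}
```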

Fact-based QA Demo

What does it take to build this system?

Agenda

• Question Answering In Detail
  – Building Blocks
  – Indexing
  – Search/Passage Retrieval
  – Classification
  – Scoring
• Other Interesting Topics
  – Clustering
  – Fuzzy-Wuzzy Strings
• What’s next?
• Resources

A Grain of Salt

• Text is a strange and magical world filled with…
  – Evil villains
  – Jesters
  – Wizards
  – Unicorns
  – Heroes!

• In other words, no system will be perfect

http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg

The Ugly Truth

• You will spend most of your time in NLP, search, etc. doing “grunt” work nicely labeled as:
  – Preprocessing
  – Feature Selection
  – Sampling
  – Validation/testing/etc.
  – Content extraction
  – ETL

• Corollary: Start with simple, tried and true algorithms, then iterate

Getting Started

• git clone git@github.com:tamingtext/book.git
• See the README for prerequisites
• ./bin contains useful scripts to get started
• You’ll need to download some pretty big dependencies:
  – OpenNLP Models
  – WordNet
  – Wikipedia subset

Question Answering (QA)

What is QA?

• You’ve seen QA in action already thanks to IBM and Jeopardy!

• Instead of providing 10 blue links, provide the answer!

• Exercises many search and NLP features

• See Ch. 8

Simple QA Workflow

Building Blocks

• Sentence Detection

• Part of Speech Tagging

• Parsing

• Ch. 2
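Sentence detection is the first of these building blocks. The book uses Apache OpenNLP for it; the sketch below swaps in the JDK's rule-based java.text.BreakIterator so that it runs without model downloads. It only illustrates the idea and is not the book's code; the class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {
    /** Splits text into sentences using the JDK's rule-based BreakIterator. */
    public static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Text is strange. No system will be perfect.")) {
            System.out.println(s);
        }
    }
}
```

A statistical detector such as OpenNLP's usually copes better with abbreviations and unusual punctuation than fixed rules do, which is one reason to prefer a trained model.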

QA in Taming Text

• Apache Solr for Passage Retrieval and integration

• Apache OpenNLP for sentence detection, parsing, POS tagging and answer type classification

• Custom code for Query Parsing, Scoring
  – See com.tamingtext.qa package

• Wikipedia for “truth”

Demo

• $TT_HOME/bin/start-solr.sh solr-qa
  – http://localhost:8983/solr/answer
• Once that is up and running
  – $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv

• When done, you can ask questions!

Indexing

• Ingest raw data into the system and make it available for search

• Garbage In, Garbage Out
  – Need to spend some time understanding and modeling your data just like you would with a DB
  – Lather, rinse, repeat

• See the $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup

• WikipediaWexIndexer.java for indexing code

Aside: Named Entity Recognition

• NER is the process of extracting proper names, etc. from text

• Plays a vital role in QA and many other NLP systems
• Often solved using classification approaches

• Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query

• Retrieve candidate passages that match keywords and expected answer type

• Unlike keyword search, we need to know exactly where matches occur
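To make "knowing exactly where matches occur" concrete, here is a plain-Java sketch that records the token offset of every keyword hit in a passage. The slides do not show how the book's Solr-based code obtains positions, so this is only an illustration of the kind of positional information a QA scorer needs; the class name and the naive tokenization are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class MatchPositions {
    /** Returns, for each keyword found, the token offsets at which it occurs. */
    public static Map<String, List<Integer>> find(String passage, Set<String> keywords) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        // Assumption: lowercase and split on non-word characters; real analyzers do more.
        String[] tokens = passage.toLowerCase(Locale.ROOT).split("\\W+");
        for (int i = 0; i < tokens.length; i++) {
            if (keywords.contains(tokens[i])) {
                positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String passage = "The A train serves the station at 207th Street";
        System.out.println(find(passage, Set.of("train", "station")));
    }
}
```

With offsets in hand, a scorer can prefer candidate answers that sit close to the matched keywords rather than anywhere in the document.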

Answer Type Classification

• Answer Type examples:
  – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
  – See page 248 for more
• Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
  – P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
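The annotated example puts the answer type code first, followed by the question text. A small sketch of reading that format into a label and a question, assuming one example per line with a single space after the code; it does not train anything and is not the book's code. The class name is hypothetical and the question in main is shortened for brevity.

```java
import java.util.Map;

public class AnnotatedQuestion {
    // Answer type codes as listed on the slide.
    private static final Map<String, String> TYPES = Map.of(
            "P", "Person", "L", "Location", "O", "Organization",
            "T", "Time Point", "R", "Duration", "M", "Money");

    public final String answerType;
    public final String question;

    private AnnotatedQuestion(String answerType, String question) {
        this.answerType = answerType;
        this.question = question;
    }

    /** Parses a line of the form "<type code> <question text>". */
    public static AnnotatedQuestion parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<type> <question>': " + line);
        }
        return new AnnotatedQuestion(trimmed.substring(0, space),
                trimmed.substring(space + 1).trim());
    }

    /** Human-readable name for the code, or "Unknown" if it is not in the slide's list. */
    public String answerTypeName() {
        return TYPES.getOrDefault(answerType, "Unknown");
    }

    public static void main(String[] args) {
        AnnotatedQuestion q = parse("P Which French monarch was known as `The Sun King'?");
        System.out.println(q.answerTypeName() + ": " + q.question);
    }
}
```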

Scoring

Other Areas of NLP/Machine Learning

Clustering

• Group together content based on some notion of similarity

• Book covers (ch. 6):
  – Search result clustering using Carrot2
  – Whole collection clustering using Mahout
  – Topic Modeling
• Mahout comes with many different algorithms

Clustering Use Cases

• Google News

• Outlier detection in smart grids

• Recommendations
  – Products
  – People, etc.

In Focus: K-Means

http://en.wikipedia.org/wiki/K-means_clustering
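A minimal sketch of the standard (Lloyd's) k-means loop in plain Java: assign each point to its nearest centroid, recompute centroids as cluster means, and repeat until assignments stop changing. The book uses Mahout for this; the code below is only an illustration, and seeding the centroids with the first k points is a simplifying assumption (real implementations typically seed randomly or with k-means++).

```java
import java.util.Arrays;

public class KMeansSketch {
    /** Runs k-means and returns the cluster index assigned to each point. */
    public static int[] cluster(double[][] points, int k, int maxIterations) {
        int dims = points[0].length;
        // Assumption: seed centroids with the first k points.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[p], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 1.5}, {2, 2}, {10, 10}, {11, 11}, {12, 12}};
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

For text, each point would be a document vector (for example TF-IDF weights), and cosine distance is a common substitute for Euclidean distance.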

Fuzzy-Wuzzy Strings

• Fuzzy string matching is a common, and difficult, problem

• Useful for solving problems like:
  – “Did you mean” spell checking
  – Auto-suggest
  – Record linkage

Common Approaches

• See com.tamingtext.fuzzy package
• Jaccard
  – Measure character overlap
• Levenshtein (Edit Distance)
  – Count the number of edits required to transform one word into the other
• Jaro-Winkler
  – Account for position
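Two of these measures are short enough to sketch directly. The code below is an illustrative plain-Java version, not the com.tamingtext.fuzzy implementation: Jaccard is computed over the sets of distinct characters in each string, and Levenshtein uses the usual dynamic-programming table reduced to two rows. Jaro-Winkler is omitted to keep the sketch short.

```java
import java.util.HashSet;
import java.util.Set;

public class FuzzySketch {
    /** Jaccard similarity over distinct characters: |A ∩ B| / |A ∪ B|. */
    public static double jaccard(String a, String b) {
        Set<Character> setA = chars(a);
        Set<Character> setB = chars(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0; // convention chosen here: two empty strings are identical
        }
        Set<Character> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Character> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    /** Levenshtein distance: minimum insertions, deletions, and substitutions. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(jaccard("night", "nacht"));
    }
}
```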

Trie

• The Trie is a very useful data structure for working with strings

• Find common subsequences

• Auto-suggest, others

• Ternary Search Trie
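A minimal sketch of a trie used for auto-suggest: words are inserted character by character, and a prefix lookup walks down to the prefix node and collects every word beneath it. This is a plain map-based trie, not a ternary search trie, and it is not the book's implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TrieSketch {
    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come out in alphabetical order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character as needed. */
    public void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns all inserted words that start with the given prefix. */
    public List<String> suggest(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return Collections.emptyList();
            }
        }
        List<String> results = new ArrayList<>();
        collect(node, new StringBuilder(prefix), results);
        return results;
    }

    // Depth-first walk that appends each complete word found below the node.
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            path.append(entry.getKey());
            collect(entry.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        for (String w : new String[] {"tame", "taming", "text", "trie"}) {
            trie.insert(w);
        }
        System.out.println(trie.suggest("ta"));
    }
}
```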

What’s Next?

Much Harder Problems

• Chapter 9
• Semantics, Pragmatics and beyond
• Sentiment Analysis
• Document and collection summarization
• Relationship Extraction
• Cross-language Search
• Importance

Thank You!

• 3 copies of Taming Text

Resources

• http://www.manning.com/ingersoll
  – http://github.com/tamingtext/book
• http://www.tamingtext.com
• @tamingtext
• Me:
  – @gsingers
  – grant@lucidworks.com
