Taming Text Grant Ingersoll CTO, LucidWorks @tamingtext, @gsingers

Taming Text

May 10, 2015


Grant Ingersoll

Presentation on Taming Text from the March 18th, 2013 Triangle Java User Group meeting. Covers search, question answering, clustering, classification, named entity recognition, and more. See http://www.manning.com/ingersoll for details.
Transcript
Page 1: Taming Text

Taming Text

Grant Ingersoll

CTO, LucidWorks

@tamingtext, @gsingers

Page 2: Taming Text

About the Book

• Goal: An engineer’s guide to search, Natural Language Processing (NLP), and Machine Learning

• Target Audience: You

• All examples in Java, but concepts easily ported

• Covers:
  – Search, fuzzy string matching, human language basics, clustering, classification, Question Answering, intro to advanced topics

Page 3: Taming Text

Answer Me This!

• What is trimethylbenzene?
  – http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body

• who is ten minute warning?
  – http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body

• what station serves the A train?
  – http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
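
The three request URLs above share one shape, so they can be generated rather than typed. The following is a minimal sketch, assuming only the handler path and parameters visible on the slide (`q`, `defType=qa`, `qa=true`, `qa.qf=body`); the class and method names are hypothetical and not taken from the book's code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class QaUrlBuilder {
    // Base URL of the /answer handler as shown on the slide (local Solr instance).
    static final String BASE = "http://localhost:8983/solr/answer";

    /** Builds the request URL for a natural-language question (hypothetical helper). */
    static String build(String question) {
        // Form encoding: spaces become '+', and '?' becomes %3F.
        String q = URLEncoder.encode(question, StandardCharsets.UTF_8);
        return BASE + "?q=" + q + "&defType=qa&qa=true&qa.qf=body";
    }

    public static void main(String[] args) {
        System.out.println(build("What is trimethylbenzene?"));
    }
}
```

Encoding the question matters because the trailing "?" would otherwise be read as part of the URL syntax.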

Page 4: Taming Text

Fact-based QA Demo

Page 5: Taming Text

What does it take to build this system?

Page 6: Taming Text

Agenda

• Question Answering In Detail
  – Building Blocks
  – Indexing
  – Search/Passage Retrieval
  – Classification
  – Scoring

• Other Interesting Topics
  – Clustering
  – Fuzzy-Wuzzy Strings

• What’s next?

• Resources

Page 7: Taming Text

A Grain of Salt

• Text is a strange and magical world filled with…
  – Evil villains
  – Jesters
  – Wizards
  – Unicorns
  – Heroes!

• In other words, no system will be perfect

http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg

Page 8: Taming Text

The Ugly Truth

• You will spend most of your time in NLP, search, etc. doing “grunt” work nicely labeled as:
  – Preprocessing
  – Feature Selection
  – Sampling
  – Validation/testing/etc.
  – Content extraction
  – ETL

• Corollary: Start with simple, tried and true algorithms, then iterate

Page 9: Taming Text

Getting Started

• git clone [email protected]:tamingtext/book.git

• See the README for prerequisites

• ./bin contains useful scripts to get started

• You’ll need to download some pretty big dependencies:
  – OpenNLP Models
  – WordNet
  – Wikipedia subset

Page 10: Taming Text

Question Answering (QA)

Page 11: Taming Text

What is QA?

• You’ve seen QA in action already thanks to IBM and Jeopardy!

• Instead of providing 10 blue links, provide the answer!

• Exercises many search and NLP features

• See Ch. 8

Page 12: Taming Text

Simple QA Workflow

Page 13: Taming Text

Building Blocks

• Sentence Detection

• Part of Speech Tagging

• Parsing

• Ch. 2
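
To make the first building block concrete, here is a minimal sketch of sentence detection. It uses the JDK's rule-based `java.text.BreakIterator` as a stand-in, which is an assumption for illustration only; the talk itself relies on Apache OpenNLP's trained models, and the class name below is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitSketch {
    /** Splits text into sentences using the JDK's rule-based boundary detector. */
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Solr retrieves passages. OpenNLP tags them.")) {
            System.out.println(s);
        }
    }
}
```

A rule-based splitter like this tends to stumble on abbreviations and unusual punctuation, which is one reason a trained model is the more common choice for real text.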

Page 14: Taming Text

QA in Taming Text

• Apache Solr for Passage Retrieval and integration

• Apache OpenNLP for sentence detection, parsing, POS tagging and answer type classification

• Custom code for Query Parsing, Scoring
  – See com.tamingtext.qa package

• Wikipedia for “truth”

Page 15: Taming Text

Demo

• $TT_HOME/bin/start-solr.sh solr-qa
  – http://localhost:8983/solr/answer

• Once that is up and running
  – $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv

• When done, you can ask questions!

Page 16: Taming Text

Indexing

• Ingest raw data into the system and make it available for search

• Garbage In, Garbage Out
  – Need to spend some time understanding and modeling your data just like you would with a DB
  – Lather, rinse, repeat

• See the $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup

• WikipediaWexIndexer.java for indexing code

Page 17: Taming Text

Aside: Named Entity Recognition

• NER is the process of extracting proper names, etc. from text

• Plays a vital role in QA and many other NLP systems

• Often solved using classification approaches

Page 18: Taming Text

• Custom Query Parser takes in the user’s natural language query, classifies it to find the Answer Type, and generates a Solr query

• Retrieve candidate passages that match keywords and expected answer type

• Unlike keyword search, we need to know exactly where matches occur
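
The last point is the key difference from ordinary search: a document-level hit is not enough, because scoring needs the positions of the matches inside the passage. The sketch below illustrates the idea on a plain string with whitespace-style tokenization. It is an assumption-laden toy, not how the book's Solr-based code does it, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class MatchPositions {
    /** Returns the token offsets at which any of the (lowercase) query terms occurs. */
    static List<Integer> positions(String passage, Set<String> terms) {
        // Crude tokenizer: lowercase, then split on runs of non-word characters.
        String[] tokens = passage.toLowerCase(Locale.ROOT).split("\\W+");
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (terms.contains(tokens[i])) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(positions("The A train stops at Inwood station", Set.of("train", "station")));
    }
}
```

With positions in hand, a scorer can look at the window of tokens around each match for an entity of the expected answer type.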

Page 19: Taming Text

Answer Type Classification

• Answer Type examples:
  – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
  – See page 248 for more

• Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
  – P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
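
The annotated example above puts the answer type code first and the question after it. A minimal sketch of reading such a line follows, assuming a single space separates the code from the question (inferred from the one example shown). The class is hypothetical and does not call OpenNLP; it only prepares the labeled data a classifier would be trained on.

```java
import java.util.Map;

final class AnswerTypeExample {
    // Codes and names exactly as listed on the slide; the book lists more.
    static final Map<String, String> TYPE_NAMES = Map.of(
            "P", "Person",
            "L", "Location",
            "O", "Organization",
            "T", "Time Point",
            "R", "Duration",
            "M", "Money");

    final String label;
    final String question;

    AnswerTypeExample(String label, String question) {
        this.label = label;
        this.question = question;
    }

    /** Parses "<code> <question text>" into a labeled example. */
    static AnswerTypeExample parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<code> <question>': " + line);
        }
        return new AnswerTypeExample(trimmed.substring(0, space), trimmed.substring(space + 1).trim());
    }

    public static void main(String[] args) {
        AnswerTypeExample ex = parse("P Which French monarch reinstated the divine right of the monarchy to France?");
        System.out.println(TYPE_NAMES.get(ex.label) + " <- " + ex.question);
    }
}
```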

Page 20: Taming Text

Scoring

Page 21: Taming Text

Other Areas of NLP/Machine Learning

Page 22: Taming Text

Clustering

• Group together content based on some notion of similarity

• Book covers (ch. 6):
  – Search result clustering using Carrot2
  – Whole collection clustering using Mahout
  – Topic Modeling

• Mahout comes with many different algorithms

Page 23: Taming Text

Clustering Use Cases

• Google News

• Outlier detection in smart grids

• Recommendations
  – Products
  – People, etc.

Page 24: Taming Text

In Focus: K-Means

http://en.wikipedia.org/wiki/K-means_clustering
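
As a rough illustration of the algorithm, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional points. It is a toy under stated assumptions: the caller supplies the initial centroids, distance is absolute difference, and an empty cluster keeps its old centroid. It is not Mahout's implementation, and the class name is hypothetical.

```java
import java.util.Arrays;

final class KMeans1D {
    /** Runs Lloyd's algorithm on 1-D points and returns the final centroids. */
    static double[] cluster(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: each point joins its nearest centroid.
            for (double p : points) {
                int best = nearest(p, centroids);
                sum[best] += p;
                count[best]++;
            }
            // Update step: move each centroid to the mean of its points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) {
                    continue; // empty cluster: leave the centroid where it is
                }
                double mean = sum[c] / count[c];
                if (mean != centroids[c]) {
                    changed = true;
                }
                centroids[c] = mean;
            }
            if (!changed) {
                break; // converged: no centroid moved
            }
        }
        return centroids;
    }

    /** Index of the centroid closest to p (ties go to the lower index). */
    static int nearest(double p, double[] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] result = cluster(new double[] {1, 2, 3, 10, 11, 12}, new double[] {0, 5}, 100);
        System.out.println(Arrays.toString(result));
    }
}
```

For text, the points would be document vectors and the distance something like cosine, but the assign-then-update loop stays the same.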

Page 25: Taming Text

Fuzzy-Wuzzy Strings

• Fuzzy string matching is a common, and difficult, problem

• Useful for solving problems like:
  – “Did you mean” spell checking
  – Auto-suggest
  – Record linkage

Page 26: Taming Text

Common Approaches

• See com.tamingtext.fuzzy package

• Jaccard
  – Measure character overlap

• Levenshtein (Edit Distance)
  – Count the number of edits required to transform one word into the other

• Jaro-Winkler
  – Account for position
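
Two of the measures above fit in a few lines each. The following is a minimal sketch, assuming Jaccard is computed over the sets of distinct characters in each string and Levenshtein uses unit cost for insert, delete, and substitute. It is written for illustration and is not the code in the com.tamingtext.fuzzy package; the class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class FuzzySketch {
    /** Jaccard similarity over distinct characters: |A ∩ B| / |A ∪ B|. */
    static double jaccard(String a, String b) {
        Set<Character> sa = chars(a);
        Set<Character> sb = chars(b);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        Set<Character> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<Character> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }

    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    /** Levenshtein distance using two rolling rows of the dynamic-programming table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(jaccard("abc", "bcd"));            // prints 0.5
    }
}
```

Character-set Jaccard ignores order entirely, so anagrams score 1.0; edit distance is order-sensitive but costs time proportional to the product of the string lengths.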

Page 27: Taming Text

Trie

• The Trie is a very useful data structure for working with strings

• Find common subsequences

• Auto-suggest, others

• Ternary Search Trie
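
To show why a trie suits auto-suggest, here is a minimal sketch of a map-based trie with prefix lookup. It is a plain trie rather than the ternary search trie named above, chosen only because it is shorter; the class and method names are hypothetical and not from the book's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SimpleTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come back in alphabetical order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character as needed. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns every stored word that starts with the given prefix. */
    List<String> suggest(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>(); // no stored word has this prefix
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk below the prefix node, rebuilding each word as it goes.
    private void collect(Node node, StringBuilder current, List<String> out) {
        if (node.isWord) {
            out.add(current.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            current.append(e.getKey());
            collect(e.getValue(), current, out);
            current.setLength(current.length() - 1);
        }
    }

    public static void main(String[] args) {
        SimpleTrie trie = new SimpleTrie();
        for (String w : new String[] {"solr", "search", "sentence", "mahout"}) {
            trie.insert(w);
        }
        System.out.println(trie.suggest("se")); // prints [search, sentence]
    }
}
```

Reaching the prefix node costs one step per typed character regardless of how many words are stored, which is what makes the structure attractive for suggest-as-you-type.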

Page 28: Taming Text

What’s Next?

Page 29: Taming Text

Much Harder Problems

• Chapter 9

• Semantics, Pragmatics and beyond

• Sentiment Analysis

• Document and collection summarization

• Relationship Extraction

• Cross-language Search

• Importance

Page 30: Taming Text

Thank You!

• 3 copies of Taming Text

Page 31: Taming Text

Resources

• http://www.manning.com/ingersoll
  – http://github.com/tamingtext/book

• http://www.tamingtext.com

• @tamingtext

• Me:
  – @gsingers
  – [email protected]