Deep Machine Reading: Taming Unstructured, Natural Language Data Naveen Ashish University of Southern California & Cognie Inc., BigData TECHCON, San Francisco, October 29 th 2014

Deep Machine Reading

Jun 29, 2015


Technology

Naveen Ashish

Talk on Deep Machine Reading (technologies) given at BigData Techcon 2014
Transcript
Page 1: Deep Machine Reading

Deep Machine Reading:

Taming Unstructured, Natural

Language Data

Naveen Ashish

University of Southern California & Cognie Inc.,

BigData TECHCON, San Francisco, October 29th 2014

Page 2: Deep Machine Reading

This is about …..

DEEP MACHINE READING

The hard nut of having computers “understand” natural language (text) ….

Pushing the boundaries of what we can achieve ….

Page 3: Deep Machine Reading

A True AI Challenge

“It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it.” - Ray Kurzweil (2013)

Alan Turing based the Turing Test entirely on written language….To really master natural language …that’s the key to the Turing Test–to a human requires the full scope of human intelligence. …So the point is that natural language is a very profound domain to do artificial intelligence in. -Ray Kurzweil (2013)

“Another example of a good language problem is question answering, like What’s the second-biggest city in California that is not near a river?” Michael Jordan, in response to “What would you do with $1B?”, IEEE Spectrum Interview Oct 2014

Page 4: Deep Machine Reading

Commercial Relevance Today

the problem of taming unstructured data is far from solved ….. !!!!

search

text analytics

big data analytics

health informatics

social-media intelligence

mining research literature

Page 5: Deep Machine Reading

Cognie Inc.

Incorporated in 2006

High-end consulting for semantic-search

Focus is on machine reading technologies

Work leverages information extraction work and systems conceptualized as part of university research

XAR: eXtraction with Adaptive Rules (Ashish and Mehrotra, 2009)

PEP: Pathology Extraction Pipeline (Ashish, Dahm and Boicey 2014)

Team: developers, student interns, researchers

Blog: http://cognie.blog.com

Today: building custom text analytics engines

Page 6: Deep Machine Reading

Model

Build custom text understanding engines for domains

Cognie™ Platform for Building Text Analytics Engines

Retail Text Engine

Health NLP Engine

Research Mining Engine

Customization, Application Integration, Evolution

Page 7: Deep Machine Reading

Outline

Deep machine reading: what it is, and why it is needed

State-of-the-art

Fundamentals

Approach

Details

Case studies

Retail, Health, Risk assessment, Customer support, Intelligence

Conclusions

Page 8: Deep Machine Reading

What is “Deep” machine reading ?

Page 9: Deep Machine Reading

Deep Machine Reading is ….

The ability to distill the abstract from text

The ability to comprehensively extract multiple concepts and relationships from the text

The ability to link extracted elements to known concepts

The ability to use the text (data) itself, to improve understanding of that text

Page 10: Deep Machine Reading

The Abstract, in Text

The abstract, not explicitly mentioned !

What falls in this category

Expressions

Contextual sentiment

Aspects or Categories

I think you need better chefs SUGGESTION

The mocha is too sweet NEGATIVE

I used to take Lipitor for … PERSONAL EXPERIENCE

The dim lights have a cozy effect …. AMBIENCE

Page 11: Deep Machine Reading

Classification, rather than Extraction

Much of the technology, until recently, is extraction focused

Extract particular terms, elements, concepts from the text

Extraction

Named-Entity extraction PERSONS, ORGANIZATIONS, LOCATIONS, …

Sentiment extraction Based on polar words

Need for much more sophisticated classification of text snippets

Along different dimensions of interest

Page 12: Deep Machine Reading

A Comprehensive Signature of Text

Cognie experience

Many applications have unique requirements of what they want from the text

“…and for six months I was indeed taking Lipitor but I must say ….” PERSONAL EXPERIENCE

“…there is direct correlation between Cadmium exposure and lung …” CAUSALITY

But many groups of applications have common requirements within

Primary elements required from text

Expressions

Entities

Sentiment: Contextual, Qualified

Emotion

Topics

Categories/Aspects

Specific signal (“directionality”)

Relationships

Page 13: Deep Machine Reading

Deeper Text Analysis → Better Insights

Goal: Get actionable insights from data !

Hypothesis: Deeper extraction → Better insights !

The top items advised for skin rash are aloe vera, vitamin E oil and oatmeal

Complaints comprise 36% of the overall feedback with top issues being slow service, drinks and coffee

73% of all research articles indicate that Cadmium is a causal factor for lung irritation

Page 14: Deep Machine Reading

Context

COGNIE™: A PLATFORM for text analytics

[Diagram labels: COGNIE™, XAR, UCI-PEP, SHIP, SURVEY ANALYTICS, RETAIL ANALYTICS, RISK ASSESSMENT]

Page 15: Deep Machine Reading

Modus Operandi

All applications require a structured representation of the (unstructured) data

A structured database/meta-base that powers Analytics dashboards

Data coding processes

Risk assessment computations

Consumer health portals

….

Manual extraction processes are typically in place

Goal is to eliminate or alleviate manual effort

Page 16: Deep Machine Reading

Text Analytics Spectrum

Gamut of Text Analytics Engines in Market

• Lexalytics

• Alchemy API

• Semantria

• Clarabridge

• ConveyAPI

• Linguamatics

• ….

Engines Aiming Deeper

• Luminoso

• Attensity

• …

Availability of Open-source Text Analysis Tools

• UIMA

• GATE

• Deep Learning for Sentiment Analysis (Stanford)

• Recursive Neural Networks

• http://openair.allenai.org

Page 17: Deep Machine Reading

Approach

Page 18: Deep Machine Reading

Approach

natural language processing

machine learning

semantics

Page 19: Deep Machine Reading

Architecture: COGNIE™ Platform

NLP: Segmentation, POS Tagging, Entity extraction, Anaphora, Parsing, Gram analysis

Knowledge Engineering: Existing (DMOZ, SNOMED, UMLS), Creation, Declarative

Machine Learning: Naïve-Bayes, MaxEnt, TFIDF, CRF, RNN Deep Learning, ENSEMBLE

Page 20: Deep Machine Reading

COGNIE™: Open-source Leverage

Framework: UIMA

Classification: Weka, Mallet

NLP: Stanford CoreNLP

Indexing: Lucene

Databases: MySQL, MongoDB

Knowledge Engineering: Protégé

Topic mining: Mallet

Sentiment: Stanford Deep Learner

Page 21: Deep Machine Reading

Step 0: Basic Text Analysis

Text Segmentation

In many cases the “unit” of distillation is a sentence

Segmentation strategies

Built-in, such as in UIMA or GATE

Custom segmentation

Sentence decomposition: decompose a sentence into individual clauses
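
The custom segmentation and sentence decomposition steps above can be sketched as follows. This is a minimal illustration, not the Cognie implementation: it assumes sentence boundaries can be found from end punctuation, and the connective list and function names are hypothetical.

```python
import re

# Hypothetical list of connectives that separate clauses (an assumption).
CLAUSE_CONNECTIVES = ("but", "although", "however", "because")

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_clauses(sentence):
    """Split one sentence into clauses on connectives, commas and semicolons."""
    connectives = "|".join(CLAUSE_CONNECTIVES)
    # First alternative: an optional comma/semicolon followed by a connective.
    # Second alternative: a bare comma or semicolon.
    pattern = r"\s*[,;]?\s*\b(?:" + connectives + r")\b\s+|\s*[,;]\s*"
    clauses = re.split(pattern, sentence.rstrip(".!?"))
    return [c.strip() for c in clauses if c and c.strip()]
```

For example, split_clauses("The food was great, but the service is slow.") yields two clauses, each of which can then be classified separately. A production system would more likely rely on a parser than on a fixed connective list.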

Page 22: Deep Machine Reading

Expressions

Beyond entities and sentiment: EXPRESSIONS

EXPRESSIONS

Introduced in [Ashish et al, 2011]

Page 23: Deep Machine Reading

Expressions

…showers had no hot water !… COMPLAINT

..you should have more veggie options… SUGGESTION

RETAIL/ENTERPRISE

..meats on special this weekend… ANNOUNCEMENT

..this is the best store on the west side… ADVOCACY

There is hardly any evidence to suggest a link between salt and diabetes -

These results confirm that high intake of salt leads to an increase in BP +

RISK ASSESSMENT

Page 24: Deep Machine Reading

Expressions

You should try Vitamin E oil … ADVICE

..I have had arthritis since 1991… EXPERIENCE

HEALTH

..for me lipitor worked like a charm… OUTCOME

Page 25: Deep Machine Reading

The Indicators: “Give Aways”

A combination of multiple types of elements !

…showers had no hot water !… COMPLAINT

(You) should have more veggie options… SUGGESTION

..i have been on lipitor… EXPERIENCE

..this is the best store on the west side… ADVOCACY

Page 26: Deep Machine Reading

Approach: Given Indicators

NLP

Identification of individual elements: unsupervised

Relationships between elements

Semantics

Identification of individual elements: knowledge driven

Machine Learning Classification

Combine elements, then classify

Page 27: Deep Machine Reading

Expression Classification: Relevant Features

Curated lexicons of specific indicative phrases

Examples: “could you”, “I took”, ….

Approach

Manual creation of “seed” lexicons

Automated expansion from data plus resources such as WordNet

The Sentiment

For instance, a Complaint would almost always have negative sentiment

Punctuation, other expressions, or emoticons
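
A rough sketch of how the features listed above (lexicon hits, sentiment, punctuation) might be turned into a feature dictionary for a classifier. The lexicon entries, the polar-word list and the feature names are illustrative assumptions; a real lexicon would be curated and then expanded from data.

```python
# Hypothetical seed lexicons of indicative phrases (assumed, not from the talk
# apart from "could you" and "i took").
SEED_LEXICONS = {
    "SUGGESTION": ["could you", "you should", "i think you need"],
    "EXPERIENCE": ["i took", "i have had", "i used to"],
}

# Toy polar-word list standing in for a real sentiment component.
NEGATIVE_WORDS = {"no", "slow", "bad", "worst"}

def extract_features(sentence):
    """Return a flat dict of boolean features for one sentence."""
    lowered = sentence.lower()
    tokens = lowered.replace("!", " ").replace(".", " ").split()
    features = {}
    # One feature per lexicon: does any indicative phrase occur (substring match)?
    for label, phrases in SEED_LEXICONS.items():
        features["lex_" + label] = any(p in lowered for p in phrases)
    # Crude sentiment signal and a punctuation signal.
    features["has_negative_word"] = any(t in NEGATIVE_WORDS for t in tokens)
    features["has_exclamation"] = "!" in sentence
    return features
```

The resulting dictionary could be fed to any of the classifiers named in the talk; the substring matching here is deliberately naive and would need token-boundary checks in practice.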

Page 28: Deep Machine Reading

Expression Classification Features

Positional information of words, phrases, or part-of-speech patterns in the sentence

Suggestions will usually begin with certain ‘request’ words

Custom patterns

Such as subject-verb-object for PERSONAL EXPERIENCE

Ontology concepts

Page 29: Deep Machine Reading

Expression Classification: Results

Have achieved 75% precision and recall for all expressions considered

Factors

Feature engineering

Classifier selection

Knowledge engineering
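
For reference, precision and recall for one expression label can be computed as below. This is a generic sketch; the labels and predictions in the usage note are toy data, not results from the talk.

```python
def precision_recall(gold, predicted, label):
    """Compute (precision, recall) for one label from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # true positives
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false positives
    fn = sum(1 for g, p in pairs if g == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With gold labels COMPLAINT, COMPLAINT, SUGGESTION, COMPLAINT and predictions COMPLAINT, SUGGESTION, COMPLAINT, COMPLAINT, the COMPLAINT label has two true positives, one false positive and one false negative, so precision and recall are both 2/3.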

Page 30: Deep Machine Reading

Before Automated Classification: Manual Patterns

SoL: Sequences of Labels

Labels

LEX-FOODADJ: spicy

LEX-EXCESS: too, very

ONT-FOOD

POS-NOUN

Sequences (Patterns)

ANY LEX-EXCESS LEX-FOODADJ ANY → Negative

POS-VB POS-MD * → Suggestion
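
A sequence-of-labels pattern such as the first one above could be matched roughly as follows. This sketch makes several assumptions: ANY and * are treated as wildcards matching zero or more tokens, only lexicon labels are implemented (POS and ontology labels would need a tagger and an ontology), and "sweet" is added to LEX-FOODADJ purely for the example.

```python
# Lexicon labels; "spicy", "too" and "very" come from the slide,
# "sweet" is an added example entry.
LEXICONS = {
    "LEX-FOODADJ": {"spicy", "sweet"},
    "LEX-EXCESS": {"too", "very"},
}

# Assumed semantics: a wildcard matches zero or more tokens.
WILDCARDS = {"ANY", "*"}

PATTERNS = [
    (["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"], "Negative"),
]

def labels_for(token):
    """Return the set of lexicon labels that apply to a token."""
    return {name for name, words in LEXICONS.items() if token.lower() in words}

def matches(pattern, tokens):
    """Recursively check whether the whole token list fits the pattern."""
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head in WILDCARDS:
        # Let the wildcard consume 0..len(tokens) tokens.
        return any(matches(rest, tokens[i:]) for i in range(len(tokens) + 1))
    return bool(tokens) and head in labels_for(tokens[0]) and matches(rest, tokens[1:])

def classify(sentence):
    """Return the label of the first matching pattern, or None."""
    tokens = sentence.split()
    for pattern, label in PATTERNS:
        if matches(pattern, tokens):
            return label
    return None
```

Here classify("The mocha is too sweet") returns "Negative" even though the sentence contains no conventional polar word, which is the point of combining labels in sequence.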

Page 31: Deep Machine Reading

Classification: Machine Learning

Classification tasks

Expression

(Contextual) Sentiment

Aspect category

Frameworks

Weka

Mallet

Page 32: Deep Machine Reading

Baseline Classifiers for Expressions

Mallet and Weka

NaiveBayes

MaxEnt

CRF

Gram-based

Uni, Bi and Trigram features

Baseline

~ 10% accuracy
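
To make the gram-based baseline concrete, here is a small multinomial Naive Bayes classifier over unigram, bigram and trigram features. It is a plain-Python stand-in written for illustration, not the Mallet or Weka code used in the talk; the class name, add-one smoothing and training sentences in the usage note are assumptions.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, max_n=3):
    """Return all unigrams, bigrams and trigrams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over n-gram counts."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(ngrams(text))
        self.vocab = {g for counts in self.gram_counts.values() for g in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for gram in ngrams(text):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a handful of sentences such as "showers had no hot water" (COMPLAINT) and "you should have more veggie options" (SUGGESTION), the model labels new text by n-gram overlap alone, which is why it serves only as a baseline.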

Page 33: Deep Machine Reading

Expression Classifiers

Trees

Decision Tree (J48)

Functions

Logistic Regression

SVM

Sequence Tagging

CRF: Conditional Random Fields

Page 34: Deep Machine Reading

Entities

Named-entity extractors

The generic PERSON, ORGANIZATION, LOCATION

Ngram and part-of-speech analysis

Frequently mentioned ‘entities’

Improves recall

Ontology driven concept mapping

Using pre-assembled domain ontologies/taxonomies/dictionaries

Based on modules like UIMA ConceptMapper

Scale is a challenge
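
Ontology-driven concept mapping can be illustrated with a longest-match dictionary lookup. This is a simplified sketch, not the UIMA ConceptMapper API; the mini-ontology and concept identifiers are hypothetical.

```python
# Hypothetical mini-ontology: surface phrase -> concept identifier.
ONTOLOGY = {
    "lipitor": "DRUG/Lipitor",
    "vitamin e oil": "REMEDY/VitaminEOil",
    "skin rash": "CONDITION/SkinRash",
    "arthritis": "CONDITION/Arthritis",
}
MAX_PHRASE_LEN = max(len(phrase.split()) for phrase in ONTOLOGY)

def map_concepts(text):
    """Return (phrase, concept) pairs, preferring the longest match at each position."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in ONTOLOGY:
                found.append((phrase, ONTOLOGY[phrase]))
                i += n
                break
        else:
            i += 1  # no phrase starts here; move on
    return found
```

With a large ontology the dictionary itself becomes the bottleneck, which is where the scale challenge noted above comes in.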

Page 35: Deep Machine Reading

Contextual Sentiment

(Just) polar words can be misleading !

Polar words may not be present at all !

Combination of elements

The mocha is too sweet

Wait time is over an hour

Aisles are too narrow

Service is slow

Page 36: Deep Machine Reading

Qualified Sentiment

Classify negative comments

Further segregate into

Immediately actionable items

‘Long term’ issues

Approach

Curation of Ngrams for each type of negative comments

Classifier

Page 37: Deep Machine Reading

Topic Mining

Motivated by feedback survey analytics

People can talk about “anything”

Interested in broad ‘topics’ of discussion

But the set of topics is dynamic, not necessarily known

Unsupervised topic mining

LDA: Latent Dirichlet Allocation

As-is, led to very fragmented topics that were not semantically meaningful

Solution: consolidation of terms using WordNet

Expand terms using WordNet synonyms

Consolidate with manual curation afterwards

Semi-automated approach

Page 38: Deep Machine Reading

Cohesive Topic Mining

Problem with WordNet (synonym) expansion

Prone to semantic divergence

Example

Presentation → Project(or) → Milestones

(Almost) strongly connected components in relationship graph

Manual review after
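
One way to read the strongly-connected-components idea: treat synonym expansion as a directed graph and keep together only terms that can reach each other, so a one-way chain of expansions does not merge unrelated terms. The sketch below computes exact strongly connected components (the slide says "almost", which is not modeled here), and the synonym graph is a made-up example.

```python
def reachable(graph, start):
    """Return every node reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def cohesive_groups(graph):
    """Group terms that are mutually reachable (strongly connected components)."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    reach = {n: reachable(graph, n) for n in nodes}
    groups, assigned = [], set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        group = {m for m in reach[n] if n in reach[m]}
        groups.append(sorted(group))
        assigned |= group
    return groups

# Hypothetical expansion graph: term -> terms it was expanded to.
SYNONYM_GRAPH = {
    "presentation": ["talk", "projector"],
    "talk": ["presentation"],
    "projector": ["project"],
    "project": ["milestones"],
    "milestones": ["project"],
}
```

On this graph the groups are (milestones, project), (presentation, talk) and (projector), so "presentation" is no longer consolidated with "milestones". The quadratic reachability computation is fine for small term sets; a linear-time algorithm such as Tarjan's would be the usual choice at scale.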

Page 39: Deep Machine Reading

Aspect Classification

Binning data into few broad categories

Approach

Ngram mining

Classification

Page 40: Deep Machine Reading

Categories over Topics

Consolidate topics into broad, fixed categories

Ontology mapping approach

Each category has associated concepts

Topic signature maps to category concepts

[Diagram labels: Hershey, Bieber, Cocoa beans, Personnel, Competitors, Yearly reviews]
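
The mapping from a topic signature to a fixed category can be sketched as a simple overlap score against each category's concepts. The category-to-concept assignments below are guesses made for illustration; the slide does not state which terms belong to which category.

```python
# Assumed category -> concept sets (illustrative only).
CATEGORY_CONCEPTS = {
    "Competitors": {"hershey"},
    "Personnel": {"yearly reviews", "staff"},
}

def map_topic_to_category(topic_terms):
    """Return the category sharing the most concepts with the topic, or None."""
    terms = {t.lower() for t in topic_terms}
    scores = {category: len(terms & concepts)
              for category, concepts in CATEGORY_CONCEPTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A topic whose terms match no category concept returns None and could be routed to manual review.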

Page 41: Deep Machine Reading

Emotion Extraction

Plutchik wheel of emotions

Fundamental emotion concepts captured in ontology

Augmented with indicator terms, and their synonyms

Ontology driven extraction for emotion concepts

Page 42: Deep Machine Reading

Semantics is Key

Page 43: Deep Machine Reading

Semantics

Domain knowledge is not ‘nice-to-have’ but critical

HEALTH

• Condition names

• Drug names

• Symptoms

• Procedures

• ..

RETAIL

• Food items

• Other products

• Competitors

• …

RESEARCH

• Chemical substances

• Harmful conditions

• …

INTELLIGENCE

• Manufacturers

• Vehicles

Page 44: Deep Machine Reading

Leverage Existing Knowledge Sources

Health informatics

UMLS http://www.nlm.nih.gov/research/umls/

NCI Thesaurus http://ncit.nci.nih.gov/

SNOMED http://www.nlm.nih.gov/snomed

Retail

DMOZ http://www.dmoz.org

Many others

Freebase http://www.freebase.com

Wikipedia, DBPedia

OpenData data.gov

Page 45: Deep Machine Reading

Knowledge Engineering Tools

Getting available ontologies into usable formats

Available as database dumps, RDF, or Web data

“Mini” ontology creation

Curate manually when possible (small dictionaries)

Example: list of competitors

API access

Freebase https://www.freebase.com/query

Query using ‘MQL’, the Metaweb Query Language (SPARQL-like)

BioPortal http://data.bioontology.org/documentation

Provided sometimes by customer !

Page 46: Deep Machine Reading

Practical Requirements

Confidence Measures

Quantitative confidence score for extracted elements

Binary confidence (Y/N)

Not confident: routed for manual review

‘Explanation’ for classification

Relevant snippets

“….and the checkout times continue to be long despite …”

Complaint
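
The confidence and explanation requirements above might look like this in code. The threshold value, field names and queue names are assumptions made for the sketch.

```python
# Assumed cut-off below which a result is not considered confident.
REVIEW_THRESHOLD = 0.7

def route(text, label, score, snippet):
    """Attach a binary confidence flag and pick a queue for one classified item."""
    confident = score >= REVIEW_THRESHOLD
    return {
        "text": text,
        "label": label,
        "score": score,                      # quantitative confidence score
        "confident": "Y" if confident else "N",
        "explanation": snippet,              # relevant snippet behind the label
        "queue": "auto" if confident else "manual_review",
    }
```

An item classified as Complaint with a score of 0.55 would be flagged "N" and placed in the manual review queue together with its explanatory snippet.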

Page 47: Deep Machine Reading

Feedback Learning Mechanisms

Manual review is not dismissed entirely

Comprehensive pipeline for manual review

Learn and improve from feedback

Page 48: Deep Machine Reading

Applications

Page 49: Deep Machine Reading

Applications

Core Cognie Platform

Retail Analytics Engine

Health Distillation Engine

Survey Analytics Engine

Research Mining Engine

Coding Validation Engine

Risk Analysis System

Coding Processes

Health Insights Portal

Page 50: Deep Machine Reading

Scale

Page 51: Deep Machine Reading

Scalability

Scale requirements

Large numbers of documents as opposed to large document size

Throughput can be an issue

Complex language processing algorithms

Feature extraction can be complex

Large ontologies in some cases

Solutions

Multi-threading and thread pooling architecture

Hadoop MapReduce [Kahn and Ashish, 2014]
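
Because the workload is many small documents rather than a few large ones, it parallelizes per document. A minimal thread-pool sketch follows; the analyze function is a placeholder for the real per-document pipeline, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(document):
    """Placeholder for the per-document analysis pipeline (hypothetical)."""
    return {"length": len(document.split())}

def analyze_all(documents, workers=4):
    """Process documents concurrently with a fixed-size thread pool.

    pool.map returns results in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, documents))
```

In CPython, threads mainly help when the pipeline waits on I/O or calls into native code; for CPU-bound analysis in pure Python, ProcessPoolExecutor from the same module is the usual substitute.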

Page 52: Deep Machine Reading

Conclusions

Page 53: Deep Machine Reading

Grand Challenge Projects

Aristo

At AI2, the Allen Institute for AI http://www.allenai.org

Areas

Knowledge Extraction

Reasoning

Question Answering

Can the system answer 4th, 6th grade exams ?

Project NELL

Never Ending Language Learning http://rtw.ml.cmu.edu/rtw/

“Learnt” 50+ million facts from Web data

Page 54: Deep Machine Reading

Conclusions

Deeper distillation from text is required

Can be achieved by

Detecting and combining multiple elements in text

Feature engineering

Knowledge engineering

Classifier selection

Semantics and Knowledge Engineering are key

Have been successful in leveraging the Cognie™ Platform to develop custom text analytics engines in multiple domains