Deep Distillation from Text
Naveen Ashish
University of Southern California & Cognie Inc.
March 18th, 2014
May 10, 2015
This is about “DEEP TEXT DISTILLATION”
The hard nut of having computers “understand” natural language (text). Pushing the boundaries of what we can achieve.
"It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it." - Ray Kurzweil (2013)
Alan Turing based the Turing Test entirely on written language … To really master natural language … that’s the key to the Turing Test – to a human requires the full scope of human intelligence … So the point is that natural language is a very profound domain to do artificial intelligence in. - Ray Kurzweil (2013)
Why?
The problem is far from solved!
Unstructured data everywhere
95%!
search
text analytics
big data analytics
health informatics
social-media intelligence
Introduction
About myself
Associate Professor (Informatics), Keck School of Medicine, University of Southern California
Cognie Inc.
Work leverages information extraction research and systems developed at UC Irvine: XAR, UCI-PEP
Advisory consulting engagements with several companies and start-ups
Outline
Deep distillation: what it is and why
State of the art
Fundamentals
Approach details: expressions, entities, sentiment
Case studies: retail, health, risk assessment
Conclusions
What is “Deep” text distillation ?
Data
Abstract: This paper describes the results of a study investigating … We conclude that salt and diabetes are largely unrelated.
Deep Distillation
The abstract, not explicitly mentioned!
What falls in this category: expressions, contextual sentiment, aspect classification
I think you need better chefs SUGGESTION
The mocha is too sweet NEGATIVE
I used to take Lipitor for … PERSONAL EXPERIENCE
The dim lights have a cozy effect …. AMBIENCE
A Common Intersection
Distill at sentence level
Aggregate to entire feedback, post, comment or thread
Three primary elements: Expression/Intent, Entities/Aspects (and Classes), Sentiment
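A minimal sketch, in Python, of distilling at sentence level and then aggregating to the whole feedback item. The `SentenceResult` record and the `aggregate` function are hypothetical names, and the labels are hand-assigned for illustration; in practice they would come from the classifiers described later.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SentenceResult:
    """Distilled output for one sentence (hypothetical record layout)."""
    text: str
    expression: str                               # e.g. "COMPLAINT", "SUGGESTION"
    sentiment: str                                # e.g. "NEGATIVE", "POSITIVE"
    aspects: list = field(default_factory=list)   # e.g. ["Food"]


def aggregate(results):
    """Roll sentence-level results up to the whole feedback, post, or thread."""
    return {
        "expressions": Counter(r.expression for r in results),
        "sentiments": Counter(r.sentiment for r in results),
        "aspects": Counter(a for r in results for a in r.aspects),
    }


# Hand-labeled sentences standing in for classifier output.
feedback = [
    SentenceResult("The mocha is too sweet", "COMPLAINT", "NEGATIVE", ["Food"]),
    SentenceResult("I think you need better chefs", "SUGGESTION", "NEGATIVE", ["Food"]),
    SentenceResult("The dim lights have a cozy effect", "OTHER", "POSITIVE", ["Ambience"]),
]
summary = aggregate(feedback)
```

Counting is only one possible aggregation; weighting by classifier confidence would be another reasonable choice.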
Why Deeper ?
Goal: Get actionable insights from data!
Hypothesis: Deeper extraction leads to better insights!
The top items advised for skin rash are aloe vera, vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top issues being slow service, drinks and coffee
Context
COGNIE TM: A PLATFORM for text analytics
XAR UCI-PEP
SHIP SURVEY ANALYTICS
RETAIL ANALYTICS
RISK ASSESSMENT
Expressions
Beyond entities and sentiment: EXPRESSIONS
Introduced in [Ashish et al, 2011]
Expressions
You should try Vitamin E oil … ADVICE
..I have had arthritis since 1991… EXPERIENCE
HEALTH
..for me lipitor worked like a charm… OUTCOME
Expressions
…showers had no hot water !… COMPLAINT
..you should have more veggie options… SUGGESTION
RETAIL/ENTERPRISE
..meats on special this weekend… ANNOUNCEMENT
..this is the best store on the west side… ADVOCACY
There is hardly any evidence to suggest a link between salt and diabetes -
These results confirm that high intake of salt leads to an increase in BP +
RISK ASSESSMENT
The Landscape
Text Analytics Spectrum
Wide offering of text analytics engines
Text analysis tools (many open-source)
Largely still for “spotting things”: entities, concepts, sentiment, topics, emotions …
Going deeper: Luminoso, Attensity (Intents)
Deep learning for sentiment: Stanford, Recursive Neural Networks
Approach
natural language processing
machine learning
semantics
Architecture: COGNIE TM Platform
NLP: Segmentation, POS Tagging, Entity extraction, Anaphora, Parsing, Gram analysis
Knowledge Engineering: Existing (DMOZ, SNOMED, UMLS), Creation, Declarative
Machine Learning: Naïve-Bayes, MaxEnt, TFIDF, CRF, RNN Deep Learning, ENSEMBLE
The Indicators: “Give Aways”
A combination of multiple types of elements !
…showers had no hot water !… COMPLAINT
(You) should have more veggie options… SUGGESTION
..i have been on lipitor… EXPERIENCE
..this is the best store on the west side… ADVOCACY
Approach: Given Indicators
NLP: Identification of individual elements (unsupervised), relationships between elements
Semantics: Identification of individual elements (knowledge driven)
Machine Learning: Classification (combine elements, classify)
Natural Language Processing
UIMA and GATE
Stanford NLP Tools
POS tagging, Parsing, NE Recognizer, Geo-tagger …
Natural Language Processing
Text Segmentation
In many cases the “unit” of distillation is a sentence
Segmentation: UIMA (or GATE), custom
Complex sentence segmentation: break up into individual clauses
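A rough sketch of the two segmentation steps above, using only Python regular expressions. This is not the UIMA or GATE segmenter; the splitting rules (sentence-final punctuation, commas, semicolons, and "but") are simplifying assumptions made for illustration.

```python
import re

# Assumption: a sentence ends with ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumption: clauses are separated by a comma or semicolon (optionally
# followed by "but"/"and"), or by a bare "but".
CLAUSE_SPLIT = re.compile(r"\s*[,;]\s*(?:(?:but|and)\s+)?|\s+but\s+", re.IGNORECASE)


def split_sentences(text):
    """Break raw text into sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def split_clauses(sentence):
    """Break a complex sentence into individual clauses."""
    return [c.strip() for c in CLAUSE_SPLIT.split(sentence) if c.strip()]


sentences = split_sentences("Service is slow. Wait time is over an hour! Aisles are too narrow")
clauses = split_clauses("food was awesome, service needs improvement")
```

Each clause can then be distilled on its own, so that a sentence such as "food was awesome, service needs improvement" yields one positive and one negative unit.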
NLP
Part-of-speech tags are key indicators for expression distillation
Entity extraction: names, locations, organizations
Parsing: if required
Anaphora
NGram Analysis
Unigram and bigram analysis
Obtain: grams, frequency, entropy
Grams of tokens as well as POS patterns, e.g. VB VBD
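A small sketch of gram analysis in Python. The slides do not say how entropy is defined; here it is assumed to be the entropy of a gram's distribution over class labels, so a low value marks a gram that points strongly to one class. The function names and toy data are illustrative.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """All contiguous n-grams of a token (or POS tag) sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def gram_stats(labeled_sentences, n):
    """Return {gram: (frequency, entropy in bits over class labels)}."""
    by_gram = defaultdict(Counter)
    for tokens, label in labeled_sentences:
        for gram in ngrams(tokens, n):
            by_gram[gram][label] += 1
    stats = {}
    for gram, label_counts in by_gram.items():
        total = sum(label_counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in label_counts.values())
        stats[gram] = (total, entropy)
    return stats


data = [
    ("you should try vitamin e oil".split(), "ADVICE"),
    ("you should have more veggie options".split(), "SUGGESTION"),
    ("you should try aloe vera".split(), "ADVICE"),
]
bigrams = gram_stats(data, 2)
```

In this toy data the bigram "should try" occurs only under ADVICE (entropy zero), while "you should" is split across two classes. Passing POS tag sequences instead of tokens gives the same statistics for POS patterns.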
Before Automated Classification: Manual Patterns
SoL: Sequences of Labels
Labels
LEX-FOODADJ: spicy
LEX-EXCESS: too, very
ONT-FOOD
POS-NOUN
Sequences (Patterns)
ANY LEX-EXCESS LEX-FOODADJ ANY POS-VB POS-MD ….
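One way such label-sequence patterns could be matched is sketched below in Python. The lexicon and ontology entries beyond those listed above are illustrative, `ANY` is assumed to match zero or more tokens, and POS labels are left out because they would need a tagger.

```python
# Hypothetical lexicons and ontology; "sweet", "mocha" and "curry" are
# illustrative additions to the entries shown above.
LEXICONS = {
    "LEX-FOODADJ": {"spicy", "sweet"},
    "LEX-EXCESS": {"too", "very"},
}
ONTOLOGY = {"ONT-FOOD": {"mocha", "curry"}}


def labels_for(token):
    """Set of labels that apply to a single token."""
    token = token.lower()
    found = {name for name, words in LEXICONS.items() if token in words}
    found |= {name for name, words in ONTOLOGY.items() if token in words}
    return found


def matches(pattern, label_sets):
    """True if the whole sequence matches; ANY spans zero or more tokens."""
    if not pattern:
        return not label_sets
    head, rest = pattern[0], pattern[1:]
    if head == "ANY":
        return any(matches(rest, label_sets[i:])
                   for i in range(len(label_sets) + 1))
    return bool(label_sets) and head in label_sets[0] and matches(rest, label_sets[1:])


def match_sentence(pattern, sentence):
    return matches(pattern, [labels_for(t) for t in sentence.split()])


too_much = ["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"]
hit = match_sentence(too_much, "The mocha is too sweet")
miss = match_sentence(too_much, "Service is slow")
```

A sentence that matches a pattern like this could be labeled directly, or the match could be passed on as a feature to the classifiers.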
Classification: Machine Learning
Classification tasks: Expression, (Contextual) Sentiment, Aspect category
Frameworks: Weka, Mallet
Baseline Classifiers
Mallet and Weka: NaiveBayes, MaxEnt, CRF
Gram-based: uni-, bi- and trigram features
Baseline: ~10% accuracy
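To make the gram-based baseline concrete, here is a plain-Python stand-in for a Naive Bayes classifier over uni-, bi- and trigram features. It is not the Mallet or Weka implementation; the class name, the add-one smoothing, and the four training sentences (taken from the expression examples earlier) are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def gram_features(text, max_n=3):
    """Unigram, bigram and trigram strings for one sentence."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(gram_features(text))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feat in gram_features(text):
                if feat in self.vocab:      # ignore grams never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(
    ["showers had no hot water",
     "you should have more veggie options",
     "this is the best store on the west side",
     "meats on special this weekend"],
    ["COMPLAINT", "SUGGESTION", "ADVOCACY", "ANNOUNCEMENT"],
)
label = model.predict("you should have more staff")
```

With so little training data this only shows the mechanics; it says nothing about the accuracy figures reported on the slides.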
Expression Classification: Features
Features: Polar words, Punctuation, Ngrams, POS patterns, Length (!), Beginning, Ontology, …
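A sketch of how several of these features might be extracted for one sentence, in Python. The polar-word list and ontology entries are illustrative placeholders, the feature names are hypothetical, and POS patterns are omitted because they would require a tagger.

```python
POLAR_WORDS = {"best", "awesome", "slow", "narrow"}      # illustrative lexicon
ONTOLOGY_TERMS = {"lipitor": "Drug", "mocha": "Food"}    # illustrative entries


def expression_features(sentence):
    """Feature dictionary for one sentence (hypothetical feature names)."""
    tokens = sentence.lower().replace("!", " ! ").replace("?", " ? ").split()
    words = [t.strip(".,;:") for t in tokens if t not in {"!", "?"}]
    words = [w for w in words if w]
    return {
        "polar_count": sum(w in POLAR_WORDS for w in words),
        "has_exclamation": "!" in tokens,
        "has_question": "?" in tokens,
        "length": len(words),
        "first_word": words[0] if words else "",   # the "beginning" feature
        "ontology_classes": sorted({ONTOLOGY_TERMS[w] for w in words
                                    if w in ONTOLOGY_TERMS}),
        "bigrams": [" ".join(words[i:i + 2]) for i in range(len(words) - 1)],
    }


suggestion = expression_features("You should have more veggie options!")
outcome = expression_features("for me lipitor worked like a charm")
```

The resulting dictionaries would still need to be converted to whatever input format the chosen classifier expects.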
Classifiers
Trees: Decision Tree (J48)
Functions: Logistic Regression, SVM
Sequence Tagging: CRF (Conditional Random Fields)
Expression Classification: Results
Have achieved 75% precision and recall for all expressions considered
Factors: feature engineering, classifier selection, knowledge engineering
Contextual Sentiment
(Just) polar words can be misleading!
Polar words may not be present at all!
Combination of elements
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow
Semantics: Ontologies
Health: Drugs, Conditions, Procedures, Symptoms …
Retail (Dining): Food/Entrees, Service, Ambience …
Leverage Existing Knowledge Sources
Health informatics: UMLS, NCI Thesaurus, SNOMED
Retail: DMOZ
Many others: Freebase; Wikipedia, DBpedia; OpenData (data.gov)
Knowledge Engineering Tools
“Mini” ontology creation
API access: Freebase, BioPortal
Wrappers: DMOZ, …
Practical Requirements
Confidence measures: below threshold routed to manual transcription teams
Polarity
Snippets
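The confidence-based routing above could look roughly like this in Python. The function name, the tuple layout, and the 0.7 threshold are assumptions; the slides do not give an actual threshold value.

```python
def route(predictions, threshold=0.7):
    """Split (text, label, confidence) triples into accepted and manual lists.

    The 0.7 default is an assumed value, not one taken from the platform.
    """
    accepted, manual = [], []
    for text, label, confidence in predictions:
        target = accepted if confidence >= threshold else manual
        target.append((text, label, confidence))
    return accepted, manual


predictions = [
    ("Service is slow", "COMPLAINT", 0.92),
    ("you need to be open longer !", "SUGGESTION", 0.55),
]
accepted, manual = route(predictions)
```

Items in `manual` would be the ones handed to the manual teams.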
Open-Source Leverage
COGNIE TM: Open Source Tools
Framework: UIMA
Classification: Weka, Mallet
NLP: Stanford tools
Indexing: Lucene
Databases: MySQL, MongoDB
Knowledge Engineering: Protégé
Select Case Studies
Case Study: Health Informatics
Distillation
Case Study: Retail & Survey Analytics
Feedback: direct, device-collected; social media
Typically short, few sentences
Strong requirement for aspect classification: [Food, Service, Ambience, Pricing, Other]
Negative: “Immediate” vs “Long Term” classification
…food was awesome, service needs improvement ….
you need to be open longer !
Case Study: Risk Assessment
Biomedical literature abstracts
Correlation direction (+ -), subject, article type
Features: clauses, negation and triggers
Semantic heterogeneity
Performance
MapReduce
Throughput can be an issue: complex language processing algorithms, large ontologies in some cases
Hadoop MapReduce [Kahn and Ashish, 2014]
Conclusions
Deeper distillation from text is important
Can be achieved by: detecting and combining multiple elements in text, feature engineering, knowledge engineering, classifier selection
Does not have to be perfect
Every domain and dataset has its nuances