Text Analytics in Enterprise Search - Daniel Ling
Post on 20-Jun-2015
Text Analytics in Enterprise Search Daniel Ling (Findwise)
What will I cover?
Intro
About Text Analytics
Benefits and possibilities
Examples
Solution Techniques to Examples
Conclusions
My Background
Daniel Ling
Findwise
Enterprise Search and Findability Consultant
Experience and expertise
5+ years of Enterprise Search Experience
20+ enterprise search implementations across a range of industries
Lucene, FAST ESP, Solr
Apache Solr is my primary search platform
Focus areas include Findability, Search Architecture and Implementation, Text Analytics, and Document Processing.
About Text Analytics
Text Analytics in the Enterprise
Challenges:
80% of data in the Enterprise is unstructured.
Reduce the time spent looking for information (currently 9.6 hours per week)
Reduce the time spent reading documents and e-mails (currently 14.5 hours per week)
Benefits:
More predictable scale and domain
Well-understood domain
Supporting content for analytics can be identified
Text Analytics
The definition
A set of linguistic, statistical and machine learning techniques used to model and structure the information content of textual sources.
- Wikipedia.org
Types of Applications
• Entity Extraction
• Document Categorization
• Sentiment Analysis
• Summarization
Frameworks and Techniques
Framework                     Techniques
Solr                          Statistics, linguistics
Mallet, Classifier4j, etc.    Statistical natural language processing
Mahout (Hadoop)               Machine learning, statistics
GATE                          General language processing framework
UIMA                          Content analytics, text mining, pipeline
OpenNLP                       Machine learning toolkit for NLP
Benefits and possibilities
Text analytics can bring some structure to the unstructured content
Enhance discovery and findability of content
• Works well together with search
Increase relevance and precision with extracted keywords and meta-data
Generating content for dynamic pages / topic pages
• Selection of documents and extracts from documents
Track and discover sentiments
Reduce the time users need to analyze content
Examples
Entity Extraction
Types of Entities for Extraction:
• Dates
• Places
• Companies
• Objects (Product names, etc)
• People
• Events
Example – Presenting the data
Example – Facets on the data
Example Solution: Entity Extraction
Rule-based entity extraction
Combination of lists and regular expressions
Works within well-understood domains.
Requires maintaining lists.
Lists from: Country lists from World Factbook, Public Companies from Google Finance, Customers from CRM.
Workflow: Document for indexing > Update Request Handler > Update Chain (lookup and match entities) > Writes to index
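The lookup-and-match step in the workflow above could be sketched roughly as follows. This is a minimal Python illustration, not the Solr update processor itself: the country list, the ISO-style date pattern, the field names (`entity_country`, `entity_date`) and the function name are all assumptions made for the example.

```python
import re

# Hypothetical lookup list; in practice it would be loaded from a maintained resource.
COUNTRIES = ["Sweden", "Norway", "United Kingdom"]

# One alternation pattern for the list, longest names first so multi-word names win.
COUNTRY_RE = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(COUNTRIES, key=len, reverse=True))
    + r")\b"
)

# Assumed date format for illustration only: YYYY-MM-DD.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(doc, input_field="text"):
    """Return a copy of doc with entity fields filled from list and regex matches."""
    text = doc.get(input_field, "")
    enriched = dict(doc)
    # Sorted, de-duplicated values keep the fields stable for faceting.
    enriched["entity_country"] = sorted(set(COUNTRY_RE.findall(text)))
    enriched["entity_date"] = sorted(set(DATE_RE.findall(text)))
    return enriched


# Made-up document, used only to show the shape of the output.
doc = {"id": "1", "text": "The goods were shipped from Sweden to Norway on 2011-05-02."}
print(extract_entities(doc))
```

The enriched dictionary stands in for the document that the update chain would write to the index, with the entity fields then available for facets.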
[Diagram: the Update Chain (processor) uses lists and input fields to produce entity fields, which are written to the Lucene index.]
Register a custom class that looks up the resources and writes the entities it finds to specific Solr fields. The class is set up in solrconfig.xml.
Document Categorization
To assign a label to the document / content / data.
Labels for the category or for the sentiment.
Threshold values for matching a category before labeling.
Statistics and “knowledge” from previous examples can be used.
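As a rough illustration of the threshold idea above, the sketch below assigns a label only when the best category score is high enough. The function name, the 0.6 threshold and the example scores are assumptions for the example; real scores would come from whichever classifier is used.

```python
def assign_label(scores, threshold=0.6, fallback=None):
    """Return the best-scoring category if it clears the threshold, else fallback."""
    if not scores:
        return fallback
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else fallback


# Made-up scores: the first document is labeled, the second is left unlabeled.
print(assign_label({"finance": 0.82, "sports": 0.11, "politics": 0.07}))
print(assign_label({"finance": 0.41, "sports": 0.35, "politics": 0.24}))
```

Leaving a document unlabeled when no category is a confident match keeps weak guesses out of the category facets.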
Example – Facets from Categories
Example Solution: Document Categorization
Training the component, Mallet (Machine Learning for Language Toolkit).
• Alternative components include a Lucene (TF-IDF) index (MoreLikeThis), OpenNLP, Textcat, Classifier4j.
Running the new documents against the model/index of trained documents.
Training can be done from an interface, ad hoc, or by indexing pre-categorized documents.
(Figure from the book Taming Text.)
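The train-then-evaluate flow described above can be illustrated with a small classifier. The sketch below is not Mallet's API: it is a from-scratch multinomial Naive Bayes classifier in Python, used as a stand-in, and the class name, training sentences and labels are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()               # training documents per label
        self.term_counts = defaultdict(Counter)   # term frequencies per label
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for term in tokenize(text):
            self.term_counts[label][term] += 1
            self.vocab.add(term)

    def classify(self, text):
        """Return (label, probability) for the most likely label."""
        if not self.doc_counts:
            raise ValueError("classifier has not been trained")
        total_docs = sum(self.doc_counts.values())
        terms = tokenize(text)
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(n_docs / total_docs)
            for term in terms:
                # Add-one smoothing so unseen terms do not zero out the score.
                count = self.term_counts[label][term] + 1
                score += math.log(count / (total_terms + len(self.vocab)))
            log_scores[label] = score
        # Turn the log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


# Train on a few pre-categorized examples, then evaluate a new document.
model = NaiveBayes()
model.train("quarterly revenue and profit grew", "finance")
model.train("shares and revenue forecast", "finance")
model.train("the team won the match", "sports")
model.train("a late goal decided the match", "sports")

label, probability = model.classify("revenue forecast for the quarter")
print(label, round(probability, 2))
```

The returned probability could be compared against a threshold before the label is written to the document's category field.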
Example Solution: Document Categorization
Mallet: the process of setting up and training the classifier.
Evaluation of a new document against the trained model.
[Diagram: the Update Chain (processor) takes the input document and writes the category field to the Lucene index.]
Setting the evaluated category tag on the document in the pipeline.
Document Summarization
Summarize a document, at index time or on-demand.
Leverages the knowledge and term statistics of the document and the index.
Picks the “most important” sentences based on the statistics and displays those.
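One simple way to pick the "most important" sentences is to score each sentence by how frequent its terms are. The Python sketch below makes several assumptions: it uses only the document's own term counts (not index-wide statistics), a tiny hand-written stopword list, and a naive sentence splitter. The function and variable names are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real setup would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "for", "on", "that", "this", "with"}


def content_words(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        terms = content_words(sentence)
        # Average term frequency, so long sentences do not win on length alone.
        return sum(freq[t] for t in terms) / len(terms) if terms else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


# Made-up text, used only to show the call.
text = ("Search engines index documents. "
        "Search relevance depends on index quality. "
        "The weather was nice.")
print(summarize(text, max_sentences=2))
```

The same scoring can run at index time to store a static summary, or on demand when a summary is requested.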
Example – Summarize content
Static Summaries
Dynamic Summaries
Example – Summarize content - 1
Example – Summarize content - 2
Example Solution: Document Summarization
Custom RequestHandler that receives document ID and field to summarize.
Custom Search Component making the selection of top sentences.
The component selects a subset of sentences and sends these back in a field.
[Diagram: RequestHandler (SearchComponent for summarization) reading from the Lucene index.]
Wrap Up
• Examples: Entity Extraction, Document Categorization, Summarization.
• Technology: You can take small steps and gain a great deal, since you can leverage features and components of Solr and Lucene (as well as other open source NLP frameworks).
• Value: Benefits of text analytics include increased discovery, findability and productivity from the solution.
Questions?
daniel.ling@findwise.com
www.findabilityblog.com