One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005.
One Tool, Many Industries
Text Mining with Oracle
Omar Alonso, Chuck Adams
Oracle Corp.
Text Mining Summit, Boston, 2005
Agenda
• Introduction
• Text mining
• Define problems
• Present solutions
• A look at Oracle’s technology stack
• Oracle’s roadmap
• A case study
• Conclusions
Data mining and Text mining
[Diagram: a spectrum from structured data to unstructured data]
Structured Data: OLTP, OLAP, DM
Unstructured Data: Keyword search, BK, TM
TM includes:
• Classification
• Clustering
• Ontologies
• NLP
• Inexact match
An analogy
• RFID and robot vision
  – Put tags on everything instead of having the robot do the vision
• Similar approach for text mining
  – Language is very social, not technical
  – Instead, start with a unified storage model
  – Then do mining
What about text mining?
• Text mining is one of many features in text technology
• Real future of text technology is business intelligence (BI)
• What is BI?
  – Ability to make better decisions
• What are the obstacles today?
  – Structured data is well understood
  – Unstructured data is different
Text and XML
[Diagram: systems ordered by increased exploitation of structure]
• Plain Old File System
• File System on Steroids (WinFS)
• Records Mgmt, ECM
• Dynamic Doc Generation
• Traditional Content Mgmt
• XML Content Mgmt.
First problem: access
• No uniform access over all sources
• Each source has separate storage and algebra
• Examples
  – Email
  – Databases
  – Applications
  – Web
Second problem: management
• Management of unstructured data is very poor compared with structured data
• Cleaning
• Noise is larger than in structured data
• Security
• Multilingual
Third problem: user needs
• Perception with current search engines
• Large data -> 80/20 rule
• Doesn't provide uniform information
• Two users type the same query and get the same results
  – Cricket the game or cricket the bug?
Foundations
• XML as the common model
• XML allows:
  – Manipulating data with standards
  – Mining becomes more like data mining
  – RDF emerging as a complementary model
• The more structure you can exploit, the better you can do mining
• Integration use cases
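A minimal sketch of the "XML as the common model" idea, assuming two hypothetical sources (an email message and a database row) are wrapped in the same invented `<doc>` envelope so that one query serves both. The element names, helper functions, and sample data are illustrative assumptions, not Oracle's schema or API.

```python
import xml.etree.ElementTree as ET


def email_to_xml(sender, subject, body):
    """Wrap an email in the common <doc> envelope (hypothetical schema)."""
    doc = ET.Element("doc", {"source": "email"})
    ET.SubElement(doc, "author").text = sender
    ET.SubElement(doc, "title").text = subject
    ET.SubElement(doc, "body").text = body
    return doc


def row_to_xml(row):
    """Wrap a database row (a dict here) in the same envelope."""
    doc = ET.Element("doc", {"source": "database"})
    ET.SubElement(doc, "author").text = row["owner"]
    ET.SubElement(doc, "title").text = row["name"]
    ET.SubElement(doc, "body").text = row["notes"]
    return doc


def titles_mentioning(collection, term):
    """One query for every source: titles of docs whose body mentions term."""
    term = term.lower()
    return [
        doc.findtext("title")
        for doc in collection.findall("doc")
        if term in (doc.findtext("body") or "").lower()
    ]


collection = ET.Element("collection")
collection.append(
    email_to_xml("ann@example.com", "Q3 forecast", "Revenue outlook for the new deal")
)
collection.append(
    row_to_xml({"owner": "bob", "name": "Bug 1042", "notes": "Crash in revenue report"})
)

print(titles_mentioning(collection, "revenue"))
```

Once both sources share one structure, the same query code runs over email and database content alike, which is the uniform access that the first problem says is missing.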
Foundations - II
• Unstructured data is too AI
• Too easy to get fooled by the complexity
• Hybrid solution
• Domain knowledge
  – You know your domain
  – You own the content
  – You can do better
Remember?
Personalization problem
• Lack of personalization
• You own the content, you own the user
• Two users type the same query: “financials”
  – Sales rep looks for customers and other deals
  – Tech guy looks for bugs, architecture, etc.
• LDAP shows who they are
• Combination with query logs shows patterns in the same peer group
• Recommendation systems
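A minimal sketch of the peer-group idea, assuming a directory lookup tells us each user's group and a query log records which documents that group clicked for the same query. The `DIRECTORY` dict stands in for an LDAP lookup; all names, log entries, and documents are hypothetical.

```python
from collections import Counter

# Hypothetical directory: user -> peer group (stands in for an LDAP lookup).
DIRECTORY = {
    "alice": "sales",
    "carol": "sales",
    "bob": "engineering",
    "dave": "engineering",
}

# Hypothetical query log: (user, query, clicked document).
QUERY_LOG = [
    ("carol", "financials", "customer-deals.doc"),
    ("carol", "financials", "customer-deals.doc"),
    ("dave", "financials", "bug-4711.txt"),
    ("dave", "financials", "architecture.pdf"),
    ("dave", "financials", "bug-4711.txt"),
]


def personalize(user, query, hits):
    """Reorder hits so documents clicked by the user's peer group come first."""
    group = DIRECTORY[user]
    clicks = Counter(
        doc
        for who, logged_query, doc in QUERY_LOG
        if logged_query == query and DIRECTORY[who] == group
    )
    # sorted() is stable, so hits with equal click counts keep engine order.
    return sorted(hits, key=lambda doc: -clicks[doc])


hits = ["architecture.pdf", "bug-4711.txt", "customer-deals.doc"]
print(personalize("alice", "financials", hits))  # sales view
print(personalize("bob", "financials", hits))    # engineering view
```

The same query and the same hit list produce different orderings for the sales rep and the engineer, without either user having searched before.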
Better Answers: Beyond Keywords
• Noise theory
  – As you cast your nets ever wider, you catch disproportionately more junk
• Must develop new models of Quality in the face of comprehensiveness
  – Combine Link-Analysis with Context-sensitive relevance
  – Personalization
• Must summarize information
  – Theme Maps, Gists
• Show patterns in information vs. many pages of hit-lists
  – Tree Maps, Stretch Viewer
• Ability to post-process and refine search hit lists
  – Dynamic categories for navigation
  – Reorder by date
• Progressive query relaxation
  – Nearest inexact match
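Progressive query relaxation can be sketched as a sequence of matchers run from strictest to loosest, stopping once enough hits are found. This is an illustrative in-memory version and not Oracle Text's implementation: the step order (phrase, all terms, any term, inexact), the `difflib` similarity cutoff, and the sample documents are all assumptions.

```python
import difflib

# Hypothetical mini-collection; document 3 contains a misspelling on purpose.
DOCS = {
    1: "oracle text mining summit",
    2: "text mining with clustering",
    3: "data minning for retail",
    4: "mining equipment catalog",
}


def tokens(text):
    return text.lower().split()


def phrase(query, text):
    return query.lower() in text.lower()


def all_terms(query, text):
    return set(tokens(query)) <= set(tokens(text))


def any_term(query, text):
    return bool(set(tokens(query)) & set(tokens(text)))


def inexact(query, text):
    words = tokens(text)
    return any(
        difflib.get_close_matches(term, words, n=1, cutoff=0.8)
        for term in tokens(query)
    )


# Strictest matcher first; each later step relaxes the previous one.
STEPS = [
    ("phrase", phrase),
    ("all terms", all_terms),
    ("any term", any_term),
    ("inexact", inexact),
]


def relaxed_search(query, docs, wanted=3):
    """Relax the query step by step until at least `wanted` hits are found."""
    results, seen = [], set()
    for name, matcher in STEPS:
        for doc_id, text in docs.items():
            if doc_id not in seen and matcher(query, text):
                seen.add(doc_id)
                results.append((doc_id, name))
        if len(results) >= wanted:
            break
    return results


print(relaxed_search("text mining", DOCS, wanted=3))
print(relaxed_search("text mining", DOCS, wanted=4))
```

Hits from stricter steps always rank ahead of hits from looser ones, so relaxing the query adds recall without burying the best matches.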
Technology Stack
Better Answers
Relevance Toward BI
Progressive Relaxation
Multi-Criterion Support
Visualization
Classification
Personalization
Direct Answers
Link Analysis
Query Log Analysis
Metadata Extraction
Keyword Ranking
Intelligent Match
Duplicate Elimination
Oracle’s position
• Text mining is one of many tools for information retrieval and discovery in many assets
• Text mining is best used in the context of other techniques
  – Personalization
  – Search query logs
  – Visualization
• Product: one integrated platform
Oracle platform
Integrated platform vs. niche technology

Capability              Niche technology
Full-text searching     Google, FAST
XML                     Tamino
Classification          Autonomy
Clustering              Vivisimo
Visualization           Inxight
Application search      SAP/TREX

• Integrated platform: one platform, low cost, low complexity
• Niche technology: several products, different APIs, performance, maintenance cost, etc.
Oracle platform
“If I can see further than anyone else, it is only because I am standing on the shoulders of giants” – Isaac Newton
• Oracle provides you all the functionality
  – Plus you get backup, recovery, scalability, and other benefits
• You build the mining application
Case study
• Federal customer
• High Performance Text Information Mining and Entity Extraction
Business Need
• Enterprise Search Capability
• Information Fusion
• Profiles and alerting
• Security – user need to know
• Entity identification and extraction
• High Performance ingestion, search, and indexing
• Scalability
Challenges
• Search quality
• Performance
• Scalability
• Document formats
• Integration
• Operations and maintenance
Solutions Architecture
• Oracle 10g Integrated Framework
• 10g release 2
  – Oracle Real Application Clusters
  – Oracle Text
      • Full text and rule-based indexing
      • Extensible thesauri
      • Document classification
      • Document filters
  – Oracle Partitioning
  – Oracle Virtual Private Database
  – Oracle Advanced Security
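Rule-based classification, as listed under Oracle Text above, can be pictured as stored queries matched against each incoming document; the same mechanism could drive profiles and alerts. The sketch below is a simplified in-memory illustration and not the Oracle Text API: the categories, the rule format (any one set of required words), and the function names are hypothetical.

```python
import re

# Hypothetical rules: a category matches when ANY of its word sets
# is fully contained in the document.
RULES = {
    "finance": [{"revenue", "forecast"}, {"quarterly", "earnings"}],
    "security": [{"intrusion"}, {"password", "breach"}],
}


def words(text):
    """Lowercased alphanumeric tokens of a document, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(text, rules=RULES):
    """Return the sorted categories whose rules match the document."""
    found = words(text)
    return sorted(
        category
        for category, alternatives in rules.items()
        if any(required <= found for required in alternatives)
    )


print(classify("Quarterly earnings fell after the password breach."))
print(classify("Lunch menu for Friday"))
```

Because the rules are data, a new category or a new user alert profile is just another entry in the rule table, with no change to the matching code.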
Technical Architecture
[Architecture diagram. Recoverable component labels:]
• EDL Portal Users
• Load Balancer
• Application Servers
• Oracle 10g RAC and Oracle 10g RAC Interconnect
• Enterprise MetaData Layer
• Scalar, Domain, and B*Tree Indices
• ADS OID
• Existing Mission Systems
[Recoverable annotations:]
• Process-isolated RAC DB nodes: one tuned for user query and the other for data synchronization
• Key metadata consolidated and indexed for enterprise data layer access
• CIA PKI authentication from ADSN clients
• ADS LDAP integrated for client and server authentication
• Network-based integration hub and EDL synchronization services
• Federated data access J2EE services for mission system drill
Scalable load and indexing
[Pipeline diagram. Recoverable labels:]
• Data Collection
• Preprocess & Filtering
• UTF8 Text Extracted from Collection
• Standardized XML DTD
• Java Load Distribution Process
• Java Load Threads (eight shown)
• Oracle 9i & 9i Text: Raw Payload, Payload Index, Scalar Indexes, XML Indexes
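The load stage in the diagram distributes documents across several load threads. A minimal sketch of that pattern follows, written in Python rather than the Java the slide names, with an in-memory store standing in for the database. The `<doc>` envelope, class name, and helper names are hypothetical; only the worker count of eight is taken from the diagram.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape


class InMemoryStore:
    """Stand-in for the database; a real loader would insert rows instead."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rows = []

    def insert(self, doc_id, xml_text):
        with self._lock:  # serialize appends coming from the worker threads
            self.rows.append((doc_id, xml_text))


def to_standard_xml(doc_id, raw_bytes):
    """Decode the payload as UTF-8 and wrap it in a hypothetical envelope."""
    text = raw_bytes.decode("utf-8", errors="replace")
    return f'<doc id="{doc_id}"><payload>{escape(text)}</payload></doc>'


def load_all(raw_docs, store, workers=8):
    """Distribute documents across `workers` load threads."""

    def load_one(item):
        doc_id, raw = item
        store.insert(doc_id, to_standard_xml(doc_id, raw))
        return doc_id

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(load_one, raw_docs.items()))


raw_docs = {i: f"document {i} & more".encode("utf-8") for i in range(20)}
store = InMemoryStore()
loaded = load_all(raw_docs, store)
print(len(loaded), "documents loaded")
```

Normalizing every payload to one encoding and one envelope before the load step keeps the worker threads identical, so throughput can be tuned by changing only the worker count.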
Real world results
• Single search for user
• Profiles and alerts
• Query response within a couple of seconds
• 80,000,000+ documents indexed
• 1.2 TB raw text and growing
• 700 GB index size
• Incremental index: 1-2 GB/day
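For a sense of scale, the figures above imply an average document size and an index-to-text ratio. The arithmetic below assumes decimal units (1 TB = 10^12 bytes) and treats the "80,000,000+" count as exactly 80 million, so the outputs are rough estimates rather than reported results.

```python
DOCS = 80_000_000      # documents indexed (lower bound from the slide)
RAW_BYTES = 1.2e12     # 1.2 TB raw text, decimal units assumed
INDEX_BYTES = 700e9    # 700 GB index, decimal units assumed

avg_doc_kb = RAW_BYTES / DOCS / 1e3
index_ratio = INDEX_BYTES / RAW_BYTES

print(f"average document size: about {avg_doc_kb:.0f} KB")
print(f"index size relative to raw text: about {index_ratio:.0%}")
```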
Next Steps
[Diagram: Entity Extraction and Relationship Awareness. Recoverable labels:]
• Oracle 10g Text Index structure
• Entity identification and extraction engine
• Language-specific dictionaries (four shown)
• Extracted Entities
• XML Interface
• Relationship detection engine

Oracle database 10g release 2
• Enterprise Search Capability
• Information Fusion
• Profiles and alerting
• Security – user need to know
• Entity identification and extraction
• High Performance ingestion, search, and indexing
• Scalability
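The entity extraction and relationship detection components named in this slide can be illustrated with a small dictionary-driven sketch. It is an assumption-laden toy and not the engine described in the talk: the dictionaries, entity types, sentence splitter, and the "same sentence" relationship rule are all hypothetical.

```python
import re
from itertools import combinations

# Hypothetical language-specific dictionaries: surface form -> entity type.
DICTIONARIES = {
    "en": {"oracle": "ORG", "boston": "LOC", "omar alonso": "PERSON"},
    "es": {"oracle": "ORG", "madrid": "LOC"},
}


def extract_entities(text, lang):
    """Return (surface form, type, sentence index) for each dictionary hit."""
    entities = []
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for idx, sentence in enumerate(sentences):
        lowered = sentence.lower()
        for form, entity_type in DICTIONARIES[lang].items():
            if re.search(r"\b" + re.escape(form) + r"\b", lowered):
                entities.append((form, entity_type, idx))
    return entities


def detect_relationships(entities):
    """Pair entities that occur in the same sentence (co-occurrence only)."""
    return sorted(
        (a[0], b[0])
        for a, b in combinations(entities, 2)
        if a[2] == b[2] and a[0] != b[0]
    )


text = "Omar Alonso spoke in Boston. Oracle sponsored the summit."
entities = extract_entities(text, "en")
print(entities)
print(detect_relationships(entities))
```

Keeping the dictionaries per language means adding a language is a data change, which matches the diagram's separate language-specific dictionaries feeding one extraction engine.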
Conclusions
• Text mining is one of many features needed for BI on unstructured data
  – Not a silver bullet in itself
• Must exploit other approaches: metadata (XML, RDF), personalization, classification, entity extraction, full-text search, …
  – Hybrid solution
• Focus on an integrated platform that gives you all the functionality
• Drive the platform for your information need