Mark Davis, Distinguished Engineer, Dell Software Group
IEEE Cloud Computing Initiative
IEEE Intercloud Interoperability Testbed
Industries
Vertical domains
Concepts and Algorithms
Technologies
Data volume: 100 TB, 10 EB, 200 EB, 1 ZB (1 EB = 10^18 B; 1 ZB = 10^21 B)
Timeline: 1750, 1900, 1985, 2000
Industrial Revolution #1, #2, #3, #4
Gordon, R. J. Is US economic growth over? Faltering innovation confronts the six headwinds. CEPR Policy Insight No. 63.
Krugman, P. Is Growth Over? New York Times, 12 December 2012.
"I visualize a time when we will be to robots what dogs are to humans, and I'm rooting for the machines."
Claude Shannon, IEEE Medal of Honor 1966
Faster reporting
Interactive visualization
$$$
Data warehousing
Advanced analytics
Just-in-time decision-making
The Perl Scripting option
4 TB of data on disk
4 channels of I/O at 100 MB/s
About 2.8 hours to read the data once
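The scan time on this slide can be checked with a few lines of arithmetic. The sketch below assumes the 100 figure is megabytes per second per channel, that the four channels read in parallel, and that units are decimal (1 TB = 10^12 B); the function name is illustrative.

```python
def scan_hours(data_bytes, channels, bytes_per_sec_per_channel):
    """Hours needed to read data_bytes once across parallel I/O channels."""
    seconds = data_bytes / (channels * bytes_per_sec_per_channel)
    return seconds / 3600.0


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

hours = scan_hours(4 * TB, channels=4, bytes_per_sec_per_channel=100 * MB)
print(f"{hours:.2f} hours")  # 2.78 hours
```

That is the cost of a single pass before any parsing or analysis, which is the point of the slide: a single-machine script is bounded by I/O.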
Server farms are too expensive
Oracle is too expensive
We just need key-value stores
VCs say no CAPEX
Be lean, young entrepreneurs, be lean
Social Media
Analyze log files from web servers
Group users based on behavior patterns
Build machine learning models of behavior
Recommend news and information
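As a rough illustration of the first two steps (analyzing web server logs and grouping users by behavior), the sketch below parses Common Log Format lines and groups clients by the site section they visit most. It is a simple rule-based stand-in, not a machine learning model; the use of the client address as a user ID, the sample log lines, and all function names are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Matches the start of a Common Log Format line: client, timestamp, request path.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')


def section_counts(lines):
    """Count, per client, how often each top-level site section is requested."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user, path = match.groups()
        section = path.strip("/").split("/")[0] or "home"
        counts[user][section] += 1
    return counts


def group_by_dominant_section(counts):
    """Group clients by their most frequently visited section."""
    groups = defaultdict(list)
    for user, counter in counts.items():
        top_section, _ = counter.most_common(1)[0]
        groups[top_section].append(user)
    return {section: sorted(users) for section, users in groups.items()}


sample = [
    '10.0.0.1 - - [10/Oct/2013:13:55:36 -0700] "GET /sports/scores HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2013:13:56:01 -0700] "GET /sports/news HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Oct/2013:13:57:12 -0700] "GET /tech/cloud HTTP/1.1" 200 1024',
    '10.0.0.3 - - [10/Oct/2013:13:58:40 -0700] "GET /sports/live HTTP/1.1" 200 99',
]
groups = group_by_dominant_section(section_counts(sample))
print(groups)
```

A recommender could then suggest content from a group's dominant section; a real system would replace the dominant-section rule with a learned model.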
Industrial Control
Collect millions of data points from sensors
Analyze failure modes
Proactively improve products
Optimize systems
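One simple way to look for failure signals in sensor data is to flag readings that sit far from the typical value. The sketch below uses a median-based modified z-score; the 0.6745 constant and 3.5 threshold are the commonly used defaults for that score, and the readings and function name are made up for illustration. The slides do not say which method is actually used.

```python
from statistics import median


def mad_outliers(readings, threshold=3.5):
    """Return indices of readings whose modified z-score exceeds the threshold."""
    center = median(readings)
    deviations = [abs(x - center) for x in readings]
    mad = median(deviations)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold]


temperatures = [20.1, 19.8, 20.3, 20.0, 35.0, 19.9, 20.2]  # hypothetical sensor values
print(mad_outliers(temperatures))  # [4]
```

The median-based score is used here because a single extreme reading inflates the ordinary standard deviation and can hide itself in small samples.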
Collective Intelligence
Anomaly Detection
Distribution Network Optimization
Scientific Analysis and Visualization
Spam Filtering
Defense Intelligence
Product Design
Internet of Things
Social Network Analysis
Search
Linguistic
Understand language
Use linguistic concepts to create computational systems
Statistical
Data driven
Language is structured information
Hybrid
Understand some of the language
Use statistics and machine learning to help with recall
Part-of-Speech Tagging
Tokenization
Lemmatization
Finite State Transducers
Machine Learning
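To make the finite state transducer idea concrete, here is a toy character-level tokenizer written as a two-state transducer: it reads one character at a time and emits either the (lowercased) character, a token boundary, or nothing. This is only a sketch of the technique; the state names, the boundary symbol, and the rule for what counts as a word character are assumptions, and production transducers for lemmatization or tagging are far larger.

```python
# Two states: outside a token and inside a token (names are illustrative).
OUTSIDE, INSIDE = "outside", "inside"
BOUNDARY = "\n"  # output symbol that marks the end of a token


def step(state, ch):
    """Transition function: return (next_state, output) for one input character."""
    if ch.isalnum() or ch == "'":
        # Word characters are copied to the output, lowercased.
        return INSIDE, ch.lower()
    if state == INSIDE:
        # Leaving a word: emit a boundary in place of the character.
        return OUTSIDE, BOUNDARY
    # Between words: consume the character and emit nothing.
    return OUTSIDE, ""


def tokenize(text):
    """Run the transducer over the text and split its output into tokens."""
    state, output = OUTSIDE, []
    for ch in text:
        state, emitted = step(state, ch)
        output.append(emitted)
    return [tok for tok in "".join(output).split(BOUNDARY) if tok]


print(tokenize("Big Data: search & analytics!"))
# ['big', 'data', 'search', 'analytics']
```

Because each step depends only on the current state and character, the tokenizer runs in a single linear pass over the text.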
Random Indexing:
Assign random, sparse vectors to words (or entities)
Add together the vectors for a context
Related contexts cluster in high-dimensional space
Assign new metadata from discovered clusters to documents
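The first three steps above can be sketched in plain Python. Each word gets a sparse random index vector, each word's context vector is the sum of the index vectors of its neighbours, and words used in similar contexts end up with similar context vectors. The dimensionality, the number of nonzero entries, the window size, and the tiny corpus are all assumptions chosen to keep the example small.

```python
import math
import random
from collections import defaultdict

DIM = 512    # size of the random space (assumed; real systems use thousands)
NONZERO = 8  # nonzero entries per index vector (assumed)


def make_index_vector(rng):
    """Sparse ternary vector stored as {position: +1 or -1}."""
    positions = rng.sample(range(DIM), NONZERO)
    return {p: rng.choice((-1, 1)) for p in positions}


def train(sentences, window=2, seed=0):
    """Add the index vectors of neighbouring words into each word's context vector."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbour = sentence[j]
                if neighbour not in index:
                    index[neighbour] = make_index_vector(rng)
                for pos, val in index[neighbour].items():
                    context[word][pos] += val
    return context


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(p, 0) for p, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


corpus = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "chased", "the", "mouse"],
    ["the", "cat", "ate", "the", "food"],
    ["the", "dog", "ate", "the", "food"],
    ["investors", "sold", "the", "stock", "quickly"],
    ["traders", "bought", "the", "stock", "quickly"],
]
vectors = train(corpus)
print("cat~dog  ", round(cosine(vectors["cat"], vectors["dog"]), 3))
print("cat~stock", round(cosine(vectors["cat"], vectors["stock"]), 3))
```

In this toy corpus "cat" and "dog" occur in identical contexts, so their similarity is essentially 1, while "stock" scores lower. Clustering the context vectors and attaching cluster labels to documents (the last step) is left out of the sketch.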
Why?
Model for human/animal sparse distributed memory
Success on the TOEFL test (64.5-67%)
Distributional hypothesis: words with similar co-occurrence patterns have similar meanings
Johnson-Lindenstrauss Lemma: projecting points through a random matrix R approximately preserves the relative distances between them, provided the target space is high-dimensional enough
Fixed context vectors (say, 4000 bits) reduce complexity
Versatile: words-words contexts, words-document contexts, entity-entity contexts, x-y contexts
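For reference, the Johnson-Lindenstrauss claim can be stated more precisely. What follows is the standard textbook form rather than anything taken from the slides; the symbols $\varepsilon$, $n$, $d$, and $k$ are introduced here for the statement.

```latex
% Johnson-Lindenstrauss lemma (standard form).
% For any 0 < \varepsilon < 1 and any set X of n points in \mathbb{R}^d,
% there is a linear map f : \mathbb{R}^d \to \mathbb{R}^k with
\[
  k = O\!\left(\varepsilon^{-2} \log n\right)
\]
% such that for all u, v \in X:
\[
  (1-\varepsilon)\,\lVert u - v \rVert^2
  \;\le\; \lVert f(u) - f(v) \rVert^2
  \;\le\; (1+\varepsilon)\,\lVert u - v \rVert^2 .
\]
% A random projection works with high probability, for example
\[
  f(x) = \frac{1}{\sqrt{k}}\, R\, x ,
  \qquad R \in \mathbb{R}^{k \times d},\quad R_{ij} \sim \mathcal{N}(0, 1)\ \text{i.i.d.}
\]
```

The useful consequence for random indexing is that the target dimension $k$ depends on the number of points and the tolerated distortion, not on the original dimension $d$, which is why a fixed-size context vector can stand in for a very large co-occurrence matrix.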
MapReduce RI requires a reduce phase that merges random spaces
Or… Precompute sparse word vectors
Or… Serve the map phase from a common term signature service
Or… Use the same hash function across instances to compute a sparse vector (1% occupancy)
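The last option can be sketched as follows: if every mapper derives a term's sparse vector from a stable hash of the term, all instances agree on the vector without sharing state, and the reduce phase only has to add partial context vectors. The 4000 dimensions and 1% occupancy come from the slides; the use of SHA-256, the way positions and signs are drawn from the digest, and the function names are assumptions for this example. Python's built-in `hash()` is avoided because string hashes are salted per process by default.

```python
import hashlib

DIM = 4000                      # "say, 4000" from the slides
OCCUPANCY = 0.01                # 1% occupancy from the slides
NONZERO = int(DIM * OCCUPANCY)  # 40 nonzero entries


def signature(term):
    """Deterministic sparse vector {position: +1 or -1} derived from the term."""
    vec = {}
    counter = 0
    while len(vec) < NONZERO:
        digest = hashlib.sha256(f"{term}|{counter}".encode("utf-8")).digest()
        pos = int.from_bytes(digest[:4], "big") % DIM
        sign = 1 if digest[4] & 1 else -1
        vec.setdefault(pos, sign)  # keep the first sign if a position repeats
        counter += 1
    return vec


def merge(a, b):
    """Reduce step: add two sparse context vectors, dropping entries that cancel."""
    out = dict(a)
    for pos, val in b.items():
        total = out.get(pos, 0) + val
        if total:
            out[pos] = total
        else:
            out.pop(pos, None)
    return out


# Two "mappers" compute the same signature independently.
assert signature("cloud") == signature("cloud")
partial = merge(signature("cloud"), signature("data"))
print(len(signature("cloud")), len(partial))
```

With hash-derived signatures the reduce phase becomes a plain element-wise sum, so no random spaces need to be reconciled across instances.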
Clustering of terms, entities, or documents
Autosuggestion
Categorization
Abstractly: generalized similarity engine
Query Language
Metadata Extraction
Indexing
Facet Browsing
Facet Charting
Resource Integration
Autosuggest
Spellcheck
Big Data search and analytics has many challenges:
Volume of data
Variety of data
Velocity of data
Extracting structure from unstructured information:
▪ Machine Learning
▪ Human Intelligence
▪ Knowledge Engineering
Enabling the 21st Century