Securely explore your data WHAT'S NEXT FOR BIGTABLE? Adam Fuchs, CTO Sqrrl Data, Inc. May 22, 2014
Aug 07, 2015
TODAY’S TALK
• History of the World: Part 3
• Bigtable/Accumulo Technology Overview
• Accumulo Demonstration
• Database Technology Survey
© 2014 Sqrrl Data, Inc. | All Rights Reserved 2
TIMELINE OF RELEVANT EVENTS
• 2006: Google’s Bigtable paper
• 2008: NSA builds Accumulo
• 2011: NSA open-sources Accumulo
• 2012: Sqrrl founded
• 2013: First Sqrrl release and customers
Accumulo is a:
• Apache Software Foundation (ASF) Open-Source Software Project
• Clone of Google’s Bigtable
• Secure, Sorted Key-Value Store
• Row-level ACID (locally) Distributed NoSQL Database
Sqrrl is:
• A commercial software company located in Cambridge, MA
• A search and exploration platform built with Apache Accumulo
• An exciting startup with a long roadmap of challenging problems to solve
• Hiring!
BIGTABLE & ACCUMULO TECH OVERVIEW
1. Data Model & API
2. Underlying Architecture
3. Distinguishing Features
ACCUMULO DATA FORMAT
An Accumulo key is a 5-tuple, consisting of:
• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning

Accumulo Key/Value Example
Row       Col. Fam.     Col. Qual.     Visibility    Timestamp  Value
John Doe  Notes         PCP            PCP_JD        20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD     20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD   20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD    20120513   1010110110100…
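The 5-tuple key and the visibility column can be illustrated with a small Python model. This is a sketch under assumptions, not Accumulo code: the names Key, sort_key, is_visible, and scan are hypothetical, the sort order (newest timestamp first) is modelled by negating the timestamp, and the label check handles only "|" and "&" without parentheses.

```python
from typing import NamedTuple

class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

def sort_key(k):
    # Ascending on the first four fields, newest timestamp first.
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)

def is_visible(expression, authorizations):
    # Simplified label check (assumption): '|' separates alternatives,
    # '&' joins tokens that are all required. Parentheses are not handled.
    if not expression:
        return True
    return any(
        all(token in authorizations for token in alternative.split("&"))
        for alternative in expression.split("|")
    )

# The example rows from the slide, inserted out of order on purpose.
entries = [
    (Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513), "1010110110100…"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "Patient suffers from an acute …"),
    (Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
]
entries.sort(key=lambda kv: sort_key(kv[0]))

def scan(authorizations):
    # Return the (family, qualifier) of every entry the caller may see.
    return [
        (k.family, k.qualifier)
        for k, _ in entries
        if is_visible(k.visibility, authorizations)
    ]
```

With this model, a reader holding only the PCP_JD authorization sees the notes and the cholesterol result, while a reader holding JD sees all three test results but not the notes.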
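THE ACCUMULO CLIENT API
(Diagram of client classes and the calls that connect them)
• Instance: created with new ZooKeeperInstance(...) or new MockInstance()
• getConnector(...): returns a Connector
• Connector: gives access to TableOperations, InstanceOperations, and SecurityOperations
• createScanner(...) / createBatchScanner(...): return a Scanner / BatchScanner
• Scanner and BatchScanner: configured with Range and IteratorOption; iterator() yields Map.Entry of Key and Value
• createBatchWriter(...): returns a BatchWriter
• addMutation(...): submits a Mutation to the BatchWriter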
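The write-then-scan flow of the client API can be mimicked with a toy in-memory table. This is only a sketch: ToyTable and its methods are hypothetical stand-ins written for illustration, not the Accumulo API, and the scan range is assumed to be inclusive at both ends.

```python
import bisect

class ToyTable:
    """In-memory stand-in for a sorted key-value table (not the Accumulo API)."""

    def __init__(self):
        self._keys = []     # sorted list of (row, family, qualifier)
        self._values = {}   # key -> value

    def add_mutation(self, row, updates):
        # Apply every (family, qualifier, value) update for one row.
        for family, qualifier, value in updates:
            key = (row, family, qualifier)
            if key not in self._values:
                bisect.insort(self._keys, key)
            self._values[key] = value

    def scan(self, start_row, end_row):
        # Yield entries with start_row <= row <= end_row in sorted key order.
        # The 1-tuple (start_row,) sorts before every full key of that row.
        start = bisect.bisect_left(self._keys, (start_row,))
        for key in self._keys[start:]:
            if key[0] > end_row:
                break
            yield key, self._values[key]

table = ToyTable()
table.add_mutation("row2", [("cf", "b", "2")])
table.add_mutation("row1", [("cf", "a", "1"), ("cf", "b", "x")])
table.add_mutation("row3", [("cf", "a", "3")])
result = list(table.scan("row1", "row2"))
```

Entries come back in key order regardless of write order, which is the property the Scanner and Range pairing relies on.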
ACCUMULO TABLETS
• Collections of KV pairs form Tables
• Tables are partitioned into Tablets
• Metadata tablets hold info about other tablets, forming a 3-level hierarchy
• A Tablet is a unit of work for a Tablet Server

Tablet hierarchy (diagram):
• Well-Known Location (ZooKeeper)
  • Root Tablet: -∞ to ∞
    • Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
    • Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
• Table: Adam’s Table
  • Data Tablet: -∞ : thing
  • Data Tablet: thing : ∞
• Table: Encyclopedia
  • Data Tablet: -∞ : Ocelot
  • Data Tablet: Ocelot : Yak
  • Data Tablet: Yak : ∞
• Table: Foo
  • Data Tablet: -∞ to ∞
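Locating the tablet that holds a given row amounts to a binary search over the table's split points. The sketch below assumes that each tablet's upper bound is inclusive and uses None for -∞ and ∞; tablet_for_row is a hypothetical helper, not part of any real client library.

```python
import bisect

def tablet_for_row(split_points, row):
    """Return the (lower, upper) bounds of the tablet holding `row`.

    `split_points` is the sorted list of split rows for one table.
    None stands for -inf (lower) or +inf (upper).
    Assumption: a tablet covers rows in (lower, upper].
    """
    i = bisect.bisect_left(split_points, row)
    lower = split_points[i - 1] if i > 0 else None
    upper = split_points[i] if i < len(split_points) else None
    return lower, upper

# Split points of the "Encyclopedia" table from the diagram.
encyclopedia_splits = ["Ocelot", "Yak"]
```

The metadata tablets apply the same idea one level up: their keys combine table name and row (for example “Encyclopedia:Ocelot”), so one sorted lookup finds the data tablet.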
ACCUMULO PROCESSES
(Diagram: processes and the relationships between them)
• Applications: Read/Write against Tablet Servers
• Tablet Servers: each hosts Tablets; Store/Replicate in HDFS
• Master: Assign/Balance
• ZooKeeper (three nodes shown): Delegate Authority
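TABLET DATA FLOW
(Diagram: how data moves through a tablet)
• Writes: go to the In-Memory Map and the Write Ahead Log (for recovery)
• Minor Compaction: In-Memory Map to a Sorted, Indexed File, through an Iterator Tree
• Merging / Major Compaction: Sorted, Indexed Files to a Sorted, Indexed File, through an Iterator Tree
• Reads: a Scan reads the In-Memory Map and the Sorted, Indexed Files through an Iterator Tree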
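The data flow can be sketched as a toy log-structured store. Everything here is illustrative: ToyTablet, its flush threshold, and its method names are assumptions made for this example, and a real tablet server handles timestamps, deletes, and file indexes that are omitted.

```python
import heapq

class ToyTablet:
    def __init__(self, flush_threshold=3):
        self.wal = []        # write-ahead log, kept only until data is flushed
        self.memtable = {}   # in-memory map
        self.files = []      # sorted, immutable lists of (key, value); newest last
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.wal.append((key, value))   # log first, for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file.
        if self.memtable:
            self.files.append(sorted(self.memtable.items()))
            self.memtable = {}
            self.wal = []

    def major_compact(self):
        # Merge all files into one; later (newer) files overwrite older ones.
        merged = dict(kv for f in self.files for kv in f)
        self.files = [sorted(merged.items())] if merged else []

    def scan(self):
        # Merge all sources in key order; age 0 is newest, so the first
        # entry seen for each key is the current one.
        sources = [sorted(self.memtable.items())] + self.files[::-1]
        tagged = [[(k, age, v) for k, v in src] for age, src in enumerate(sources)]
        last = object()
        for k, _, v in heapq.merge(*tagged):
            if k != last:
                yield k, v
                last = k
```

Reads never modify files: a scan merges the in-memory map with every file, which is why compaction (fewer files) keeps read cost down.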
ITERATOR FRAMEWORK
Iterator Operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
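Several of these operations compose naturally as a stack of iterators, each wrapping a sorted source. The sketch below uses Python generators for versioning, filtering, and column selection; the function names and the (row, column, timestamp) key shape are assumptions for illustration, not the server-side iterator interface.

```python
def versioning_iterator(source, max_versions=1):
    """Keep only the newest `max_versions` entries per (row, column).

    Assumes `source` is sorted by (row, column), newest timestamp first.
    """
    current, seen = None, 0
    for (row, column, timestamp), value in source:
        if (row, column) != current:
            current, seen = (row, column), 0
        seen += 1
        if seen <= max_versions:
            yield (row, column, timestamp), value

def filtering_iterator(source, predicate):
    # Pass through only the entries accepted by the predicate.
    for key, value in source:
        if predicate(key, value):
            yield key, value

def column_iterator(source, columns):
    # Column selection expressed as a filter on the column part of the key.
    return filtering_iterator(source, lambda key, _: key[1] in columns)

cells = [
    (("r1", "age", 5), "31"),
    (("r1", "age", 3), "30"),
    (("r1", "name", 4), "Ann"),
    (("r2", "age", 2), "45"),
]
stack = column_iterator(versioning_iterator(iter(cells)), {"age"})
latest_ages = list(stack)
```

Because every stage consumes and produces a sorted stream, stages can be reordered or added without changing the others.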
WORD COUNT: SUMMING AGGREGATING ITERATOR
Input Corpus
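A summing aggregation for word count can be sketched as follows. The idea, under the assumption that ingest writes one (word, 1) entry per occurrence, is that the sum is applied later to the sorted entries, so no read is needed at write time. The function names and the sample corpus are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def ingest(corpus):
    # One (word, 1) entry per occurrence; writing never reads existing counts.
    return [(word, 1) for doc in corpus for word in doc.lower().split()]

def summing_combiner(sorted_entries):
    # Collapse consecutive entries that share a key into one summed entry.
    for word, group in groupby(sorted_entries, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

corpus = ["the quick brown fox", "the lazy dog", "the fox"]
entries = sorted(ingest(corpus))
counts = dict(summing_combiner(entries))
```

Summation is associative, so partially combined entries can be combined again with newer writes and still give the right total, which is what lets the same logic run at both compaction time and scan time.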
ACCUMULO LATENCIES
(Diagram: data path from ingesters through tablet servers to queriers)
• Ingesters: Input → Batch Writer
• Tablet Servers: In-Memory Map, RFiles, Compaction Iterators, Scan Iterators
• Queriers: Scanner/Batch Scanner → Output
• Latency labels shown: ~ms, ~ms, ~ms, and ms - min
ACCUMULO THROUGHPUT
(Diagram: data path from ingesters through tablet servers to queriers)
• Ingesters: Input → Batch Writer
• Tablet Servers: In-Memory Map, RFiles, Compaction Iterators, Scan Iterators
• Queriers: Scanner/Batch Scanner → Output
• Latency labels shown: ~ms, ~ms, ~ms, and ms - min
• Scan: ~1M entries/s per node
• Ingest: ~200K entries/s per node
• Read-Modify-Write latency: ~ms, so >1K entries/s is challenging with R-M-W
DEMO
R-M-W VS. COMPACTION-TIME AGGREGATION
Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)
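The difference between the two approaches shows up in client round trips. The sketch below counts them against a toy store; ToyStore and both increment functions are hypothetical, and the store stands in for a remote server where each round trip costs on the order of a millisecond.

```python
class ToyStore:
    """Counts client round trips to a pretend remote store."""

    def __init__(self):
        self.cells = {}        # key -> list of written values
        self.round_trips = 0

    def read(self, key):
        self.round_trips += 1
        return sum(self.cells.get(key, []))

    def overwrite(self, key, value):
        self.round_trips += 1
        self.cells[key] = [value]

    def write_batch(self, updates):
        self.round_trips += 1
        for key, value in updates:
            self.cells.setdefault(key, []).append(value)

    def compact(self):
        # Server-side combiner: sum each cell's pending values into one.
        for key, values in self.cells.items():
            self.cells[key] = [sum(values)]

def increment_rmw(store, key, n):
    # Read-modify-write: two serial round trips per increment.
    for _ in range(n):
        current = store.read(key)
        store.overwrite(key, current + 1)

def increment_combiner(store, key, n):
    # Blind writes in one batch; the server combines them later.
    store.write_batch([(key, 1)] * n)
```

For 1,000 increments the read-modify-write client makes 2,000 serial round trips, while the combiner client makes one batched write and leaves the summing to the server.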
SURVEY OF DATABASE TECHNOLOGY
• Exercises in Center-Seeking
  • SQL vs. NoSQL
  • Ingest-time vs. Query-time Analytics
  • ACID vs. BASE
  • Normalized vs. Denormalized Data Models
• Primary Use Cases for Sqrrl+Accumulo
SQL VS. NOSQL
NoSQL
• Optimized for get/put operations
• Specialized for client languages
• High concurrency
• More client-side control

Hybrid
• Extend and evolve SQL
• Standardize and incorporate NoSQL paradigms

SQL
• Optimized for joins
• Strong mathematical roots in set theory
• Automatic query optimization
INGEST-TIME VS. QUERY-TIME ANALYTICS
Ingest-Time
• Optimized for online statistics
• Can reduce storage footprint
• Can be indexed for low latency
• Leverages a variety of indexes
• Requires extensive data organization at ingest

Hybrid
• Create partial summary at ingest (question-focused datasets, knowledge bases, etc.)
• Support ad-hoc queries over summaries
• Leverage all known indexing strategies

Query-Time
• Can compute holistic statistics, like ranking, topN, etc.
• Ad-hoc analytics: don’t know the query ahead of time
• High latency and low concurrency at scale
• Leverages block indexes, columnar layout
• Ingest can be “stream to disk”
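The trade-off can be made concrete with a small sketch: an ingest-time summary keeps a running count per key and answers lookups without touching raw data, while a query-time computation scans every raw record to produce a holistic statistic such as top N. The class, the function, and the "host" field are assumptions chosen for illustration.

```python
from collections import Counter
import heapq

class IngestTimeSummary:
    """Maintains a running per-key count as records arrive (online statistic)."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, record):
        self.counts[record["host"]] += 1

    def count(self, host):
        # Constant-time lookup at query time; no raw records are scanned.
        return self.counts[host]

def query_time_top_n(records, n):
    """Scan all raw records at query time to compute the top N hosts."""
    counts = Counter(r["host"] for r in records)
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

records = [{"host": h} for h in ["a", "b", "a", "c", "a", "b"]]
summary = IngestTimeSummary()
for record in records:
    summary.ingest(record)
```

A hybrid design would run the top-N query over the ingest-time summary instead of the raw records, which is the "ad-hoc queries over summaries" point in the Hybrid column.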
ACID VS. BASE
ACID
• Atomicity: all or nothing for a group of operations
• Consistency and Isolation: support simple reasoning for distributed, multithreaded clients
• Durability: simple reasoning for whether data might be lost

Hybrid
• Must make some relaxations for performance at scale (under failure modes)
• Many options for “Lightweight” transaction support
• Accumulo limits atomicity, consistency, and isolation to row-level operations

BASE
• Basically Available: ensure that core operations always complete in an advertised time
• Soft-State: relaxation of referential integrity, etc.
• Eventual Consistency: relaxation of
NORMALIZED VS. DENORMALIZED DATA MODELS
Normalized
• “Normal Form Relational Database”
• Minimizes data footprint
• Minimizes cost of data maintenance
• Can lead to expensive joins at query time

Hybrid
• Start with document store
• Introduce links/edges for quick joins
• Dynamically adapt to flexible or sparse schemas
• Similar to property graphs

Denormalized
• “Document Store”
• Flexible schema lets applications adapt quickly to changing environments
• Pre-joined to eliminate joins at query-time
• Optimized for “append-only” data
• Can inflate data sizes and slow data ingest
KNOWLEDGE-BASE USE CASE
Data sources shown:
• HR
• Netflow
• Proxy Logs
• Social Media

Sample log record:
2014-04-14 06:36:09 429 73.105.179.202 [email protected] 500 POST application/json HTTPS "wikipedia.org:443/grouchinesses/?215=felled&297=wading&768=shimmies..." "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31" 208.80.152.201
STREAM PROCESSING USE CASE
Diagram components: DATA, SPE, Dashboards, Actions, Interactive Analysis Tools (Discovery + Forensics)
1. SPE queries Sqrrl to enrich streaming data
2. SPE persists results in Sqrrl for future query
3. SPE takes action automatically
4. SPE issues data-driven alerts
5. Sqrrl provides context for dashboards
6. Analysis tools use Sqrrl to search and manipulate historical data
SQRRL OPERATIONALIZES ACCUMULO WITH...
Data-Centric Security
Petabyte Scale and Operational Speeds
Document and Graph Data Models
SqrrlQL, including Aggregates, Secure Full-Text Search, and Secure Graph Search
Analytics, including Real-Time Statistics and Hadoop Integrations
MODERNIZING VISUALIZATION
Sqrrl is building the next generation of operational analytics visualizations
UPCOMING EVENTS
Accumulo Summit 2014
• June 12 in College Park, MD
• http://accumulosummit.com
• Multiple tracks of talks from the leaders of the Accumulo community

IEEE HPEC Conference 2014
• September 9-11 in Waltham, MA
• http://www.ieee-hpec.org/
• Accumulo Users Group Meeting as a Special Event
• Accumulo tutorial

Watch for more meetup opportunities coming soon!