Accumulo @ Bloomberg Accumulo Summit 2015 Skand Gupta Bloomberg LP
Bloomberg
• Bloomberg technology helps drive the world’s financial markets
– We build our own software, digital platforms, mobile applications and state-of-the-art hardware
– We run one of the world’s largest private networks, with over 20,000 routers
– We have the largest server-side JavaScript deployment in the world: 22 million lines of JavaScript code
– We developed “cloud computing” and deployed “software as a service” well ahead of the general marketplace
– Our technology has brought transparency to the global financial markets
• Bloomberg technologists
– More than 3,000 software developers and designers located around the world (London, NYC, SF “tech hubs”)
– BloombergLabs.com (@BloombergLabs) is our platform for dialogue between our experts and the broader tech community
• Our clients
– Over 320,000 subscribers
– Primarily financial professionals, including investment bankers, CFOs, investor relations, hedge fund managers, foreign exchange, etc.
Source: Wall Street Journal, CFTC, New York Times, Marketplace.org
Source: Wall Street Journal, CFTC, New York Times
Importance of Compliance
Source: Commodity Futures Trading Commission
Hiding in Plain Sight
Compliance Platform and Processing Pipeline
Chat
Reference Data
Trade Data
Customer Data
Product Data
Market Data
Counterparty
Social Media Voice
Human- and Machine-generated Data
Surveillance Pipeline
Communication Data
Transactional Data
User Data
Case Management
Compliance Platform
Compliance Storage
Compliance Officers
Search, Review, Analyze
HDFS
Spark
Kafka Storm
Mesos (Cluster Resource Manager)
Elastic data-processing and analytics stack
Open REST API (Play)
WORM
Pre-fabricated Hardware
Applications
Need for a robust, scalable, high-performance, geo-distributed data storage and retrieval system
❑ More than 3 petabytes of archived data
❑ 80+ billion indexed objects
❑ Real-time scanning of 35 million objects per day
[Charts: Communication Data Growth (100’s of gigabytes/year); Cumulative Data Growth (over 3 petabytes today)]
[Chart: Storing 1GB of Data. Categories: List Price, Replication, DR, Isolation. Bar labels: $2.31, $1.15, $0.58, $0.19. Axis: $0.00 to $3.00]
[Chart: Storage Cost, 2000 to 2012]
Need for Low Level Security Primitives
Document Level Security
Company Level Security
Data Store
Data Pipe
Application
User Level Security
Data Store
Security Solutions
• Post-process the queries
– Too slow
– Nasty bugs
• Generate unique document for each view
– Exponential growth in number of documents
• Use application specific features
– Solr dynamic fields, Mangled Fields
• Accumulo Visibility
– Fast, Clean, Generic
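The “Accumulo Visibility” option can be illustrated with a small simulation: each stored cell carries a label expression, and a read returns only the cells whose expression is satisfied by the reader's authorizations. The Python sketch below is not the Accumulo API. The helper names `is_visible` and `scan`, the label tokens (`companyA`, `compliance`, `userX`) and the sample cells are assumptions made for illustration, and the parser supports only `&`, `|` and parentheses.

```python
import re

def is_visible(expression, authorizations):
    """Evaluate a simplified visibility expression against a set of tokens.

    Operators are applied left to right with equal precedence, so mixed
    '&' and '|' should be parenthesized explicitly.
    """
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    def parse_term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            pos += 1  # skip the closing ")"
            return value
        return tok in authorizations

    if not tokens:
        return True  # an empty label is treated as visible to everyone
    return parse_expr()

# Hypothetical cells: (row id, visibility label, value).
CELLS = [
    ("CompanyA_userX_20150426", "companyA&(compliance|userX)", b"chat-1"),
    ("CompanyB_userX_20150428", "companyB&compliance", b"chat-2"),
]

def scan(cells, authorizations):
    """Return the row ids of the cells the reader is allowed to see."""
    return [row for row, label, _ in cells if is_visible(label, authorizations)]
```

Because the check runs inside the data store, the application never receives cells it is not cleared for, which is what removes the need for post-processing queries or generating one document per view.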
Data Model
Row ID Value
CompanyA_userX_20150426 <bytes>
CompanyA_userX_20150426 <bytes>
CompanyA_userX_20150427 <bytes>
CompanyA_userX_20150428 <bytes>
CompanyA_userY_20150427 <bytes>
CompanyB_userX_20150428 <bytes>
CompanyB_userX_20150428 <bytes>
CompanyB_userX_20150428 <bytes>
Find all Communications for a Set of Users for a Date Range
Batch Scanner
Application
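Because the row ID concatenates company, user and date, the records for one user over a date range sit next to each other in sorted order, so a query for a set of users becomes one key range per user. The sketch below simulates this in plain Python with a sorted list and binary search. It does not use the Accumulo client library; `user_range`, `batch_scan` and the sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Sorted (row_id, value) pairs mirroring the Data Model slide; values are made up.
TABLE = sorted([
    ("CompanyA_userX_20150426", b"msg-1"),
    ("CompanyA_userX_20150426", b"msg-2"),
    ("CompanyA_userX_20150427", b"msg-3"),
    ("CompanyA_userX_20150428", b"msg-4"),
    ("CompanyA_userY_20150427", b"msg-5"),
    ("CompanyB_userX_20150428", b"msg-6"),
    ("CompanyB_userX_20150428", b"msg-7"),
    ("CompanyB_userX_20150428", b"msg-8"),
])
ROW_IDS = [row_id for row_id, _ in TABLE]

def user_range(company, user, start_day, end_day):
    """Build an inclusive (start, end) row-ID range for one user; days are YYYYMMDD."""
    prefix = f"{company}_{user}_"
    return (prefix + start_day, prefix + end_day)

def batch_scan(ranges):
    """Return every entry whose row ID falls inside any of the given ranges."""
    results = []
    for start, end in ranges:
        lo = bisect_left(ROW_IDS, start)
        hi = bisect_right(ROW_IDS, end)
        results.extend(TABLE[lo:hi])
    return results

ranges = [
    user_range("CompanyA", "userX", "20150426", "20150427"),
    user_range("CompanyA", "userY", "20150426", "20150427"),
]
hits = batch_scan(ranges)
```

The fixed-width YYYYMMDD date is what makes lexicographic order agree with chronological order. A real batch scanner fetches the ranges in parallel and does not guarantee result order, whereas this sketch returns them range by range.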
Find all Records with “Libor”
Filter
Batch Scanner
Application
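The slide places the filter between the table and the batch scanner, so non-matching records are dropped before they reach the application. A minimal Python sketch of that arrangement follows, using generators to stand in for the scan and the filter. The function names and message contents are assumptions, and the match here is a simple case-insensitive substring test rather than whatever matching the production filter performs.

```python
# Hypothetical stored entries: (row_id, value); message bodies are made up.
TABLE = [
    ("CompanyA_userX_20150426", b"Where will LIBOR fix today?"),
    ("CompanyA_userX_20150426", b"Lunch at noon?"),
    ("CompanyA_userY_20150427", b"Sending the report now"),
    ("CompanyB_userX_20150428", b"libor submission attached"),
]

def scan(table):
    """Stand-in for the storage layer streaming entries in sorted order."""
    yield from table

def term_filter(entries, term):
    """Pass through only the entries whose value contains the term."""
    needle = term.lower().encode("utf-8")
    for row_id, value in entries:
        if needle in value.lower():
            yield row_id, value

# The application only ever sees what survives the filter.
matches = list(term_filter(scan(TABLE), "Libor"))
```

Running the filter next to the data means only the matching records cross the network, rather than every record in the scanned ranges.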
Count Number of Objects that Match a Filter
Counting Iterator Filter
Batch Scanner
Application
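Stacking a counting step on top of the filter means the application receives a single number instead of the matching records. The sketch below models that stack with Python generators. It is a simulation under stated assumptions, not Accumulo iterator code: `counting_iterator`, `term_filter`, `count_matching` and the sample data are hypothetical names and values.

```python
# Hypothetical stored entries: (row_id, value); message bodies are made up.
TABLE = [
    ("CompanyA_userX_20150426", b"Where will LIBOR fix today?"),
    ("CompanyA_userX_20150426", b"Lunch at noon?"),
    ("CompanyA_userY_20150427", b"Sending the report now"),
    ("CompanyB_userX_20150428", b"libor submission attached"),
]

def scan(table):
    """Stand-in for the storage layer streaming entries."""
    yield from table

def term_filter(entries, term):
    """Keep only the entries whose value contains the term (case-insensitive)."""
    needle = term.lower().encode("utf-8")
    return ((row_id, value) for row_id, value in entries if needle in value.lower())

def counting_iterator(entries):
    """Consume the upstream entries and emit one summary entry with the count."""
    total = 0
    for _ in entries:
        total += 1
    yield ("count", total)

def count_matching(table, term):
    """Build the stack shown on the slide: table, filter, counting iterator."""
    return list(counting_iterator(term_filter(scan(table), term)))
```

For example, `count_matching(TABLE, "libor")` yields one `("count", n)` entry regardless of how many records matched, so the response size stays constant as the data grows.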
Scaling Out
Application
Counting Iterator Filter Batch Scanner
Counting Iterator Filter Batch Scanner
Counting Iterator Filter Batch Scanner
Spark Processing
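The scaling-out slide repeats the same counting and filter stack three times, once per slice of the table, with the partial results combined downstream. The following Python sketch imitates that shape with a thread pool standing in for the distributed workers. It does not use Spark or Accumulo; the partitioning, `count_matches`, `parallel_count` and the sample values are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions standing in for slices of the table; values are made up.
PARTITIONS = [
    [("CompanyA_userX_20150426", b"LIBOR fixing"),
     ("CompanyA_userX_20150426", b"lunch?")],
    [("CompanyA_userX_20150427", b"re: libor"),
     ("CompanyA_userY_20150427", b"hello")],
    [("CompanyB_userX_20150428", b"Libor submission"),
     ("CompanyB_userX_20150428", b"ok")],
]

def count_matches(partition, term):
    """Filter and count within one partition, returning only the partial count."""
    needle = term.lower().encode("utf-8")
    return sum(1 for _, value in partition if needle in value.lower())

def parallel_count(partitions, term, workers=3):
    """Run the per-partition count concurrently, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda part: count_matches(part, term), partitions))
    return partials, sum(partials)

partials, total = parallel_count(PARTITIONS, "libor")
```

Each worker returns one integer, so adding partitions adds parallelism without increasing the amount of data the combining step has to handle.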
Low Latency Writes using Accumulo ‘File System’
RowID Family Qualifier Value
attach.pdf chunk “00001” <bytes>
attach.pdf chunk “00002” <bytes>
… … … …
attach.pdf metadata file_size <file size>
attach.pdf metadata chunk_size <chunk size>
attach.pdf metadata sha256 <checksum>
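The row layout above can be sketched as a pair of functions: one splits a file into fixed-size chunks plus metadata entries, the other reassembles the file and verifies the checksum. This is a plain-Python illustration of the layout only. The function names, the tuple representation and the encoding of metadata values as ASCII bytes are assumptions, and nothing here writes to Accumulo.

```python
import hashlib

def file_to_entries(name, data, chunk_size):
    """Split a file into (row_id, family, qualifier, value) entries."""
    entries = []
    offsets = range(0, len(data), chunk_size)
    for index, offset in enumerate(offsets, start=1):
        # Zero-padded qualifiers keep the chunks in order under lexicographic sorting.
        entries.append((name, "chunk", f"{index:05d}", data[offset:offset + chunk_size]))
    entries.append((name, "metadata", "file_size", str(len(data)).encode("ascii")))
    entries.append((name, "metadata", "chunk_size", str(chunk_size).encode("ascii")))
    digest = hashlib.sha256(data).hexdigest().encode("ascii")
    entries.append((name, "metadata", "sha256", digest))
    return entries

def entries_to_file(entries):
    """Reassemble the file from its entries and verify size and checksum."""
    chunks = sorted((qual, value) for _, fam, qual, value in entries if fam == "chunk")
    meta = {qual: value for _, fam, qual, value in entries if fam == "metadata"}
    data = b"".join(value for _, value in chunks)
    if len(data) != int(meta["file_size"]):
        raise ValueError("file size mismatch")
    if hashlib.sha256(data).hexdigest().encode("ascii") != meta["sha256"]:
        raise ValueError("checksum mismatch")
    return data

payload = bytes(range(256)) * 10          # 2,560 bytes of sample content
entries = file_to_entries("attach.pdf", payload, chunk_size=1000)
restored = entries_to_file(entries)
```

Storing the chunk size and checksum alongside the chunks lets a reader validate a file without any external catalog, since everything needed lives under the same row ID.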
[Chart: Write Times (ms), axis 0 to 20, comparing HDFS and Accumulo File System]
Conclusion
• Understand the data
• Free your data… but enforce access control
• Need sensible systems that help achieve these goals
Thank You!
http://careers.bloomberg.com [email protected]
We Are Hiring!