Cassandra at Proofpoint (& Nexgate)
Harold Nguyen, Data Scientist, Nexgate division of Proofpoint
Slides created with the help of Proofpoint colleagues: Bryan Burns, Brian Hawkins, Wayne Lewis, Andy Maas, Anand Somani, Grey Saylor, and Rich Sutton
Cassandra Uses for Proofpoint
• Nexgate social media security and compliance
• Spam multiplicity
• Trending topics
• Archive Search
• Data integrity and connectedness across the globe
• Data engineer/scientist
• Responsible for content classification, fraud detection, and security research
• Work with engineering, marketing, and research teams
• Security and compliance for enterprise messaging (email, social, and mobile)
• Founded 2002
• 1100 employees worldwide
• $2.5B public company: PFPT
• $200M revenue
• Cassandra used all over the organization
What is a Targeted Attack?
• An attack aimed at a specific user or organization, designed to breach a specific target
What is TAP?
• Combats targeted threats by monitoring suspicious messages containing malicious URLs and attachments, and by analyzing user clicks
• Predictive defense: uses machine learning techniques to determine what could likely be malicious and takes preemptive steps
• Insight into threats: determines whether an organization is under attack, who is being targeted, what threats are received, and whether they are still valid threats
C* use case with TAP
• Uses Cassandra as an indexer: maps URLs (row key) to the email messages (columns) that contain them
• Stores a blob of each email message to display on the dashboard when alerting on malicious content
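The indexing idea on this slide can be sketched in a few lines. This is only an in-memory model of the wide-row layout, not the actual TAP schema; the class name `UrlIndex`, its methods, and the sample URL and message IDs are all hypothetical.

```python
from collections import defaultdict


class UrlIndex:
    """In-memory stand-in for a wide row: row key = URL, columns = message IDs."""

    def __init__(self):
        # url -> {message_id: message blob}; each inner dict plays the role of one wide row
        self._rows = defaultdict(dict)

    def index_message(self, message_id, urls, blob=b""):
        # One column per (URL, message) pair; re-indexing the same message overwrites in place
        for url in urls:
            self._rows[url][message_id] = blob

    def messages_for(self, url):
        # A single row read answers "which messages contained this URL?"
        return sorted(self._rows.get(url, {}))


index = UrlIndex()
index.index_message("msg-1", ["http://bad.example/a", "http://ok.example/"], b"raw email 1")
index.index_message("msg-2", ["http://bad.example/a"], b"raw email 2")
```

When a URL is later judged malicious, `index.messages_for(url)` gives every message that carried it with one key lookup, which is what makes the row-per-URL layout attractive for alerting.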
C* infrastructure
• 40-node cluster in AWS, c3.2xlarge nodes
• About 2 TB of EBS storage on each node
• Replication factor of 4
• Data volume has increased by 100% over the past year
KairosDB and C*
• JMX metrics are inserted into KairosDB, where they are read and monitored
• Across the 3 clusters (9, 6, 6), 14 billion metrics a month from 1000s of machines
• Has become critical to Proofpoint’s ability to track metrics from its systems
Problem:
• Proofpoint collects billions of threat data points a day that aren’t being correlated
Solution:
• Build a custom graph database on top of C*
• Key is vertex, wide rows are edges
• 18 nodes, 24 TB of data, ingest peaks of 1M events per second
Benefits:
• Security researchers can now identify relationships between hosts, actors, and threats that they couldn’t before
• Dridex campaign, detection of numerous targeted attacks
Proofpoint security research team created a graph database on top of Cassandra (CQL application)
Why didn’t we use TitanDB (or other existing graph DBs)?
• These DBs want to generate their own IDs, which caused unnecessary querying for us
• This killed insert performance
• Created our own ID generation scheme so that an ID can be generated deterministically without querying the DB
• Cassandra lets us overwrite the same data multiple times if needed, without querying the DB to reconcile duplicates
• Titan could be “hacked” to use a hash-based ID and not call Cassandra for ID generation, but its keys were constrained to 64-bit integers (too small for us)
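A minimal sketch of deterministic, hash-based ID generation follows. The slides do not say which hash or key width was used, so SHA-256 truncated to 128 bits is an assumption here, as are the names `vertex_id` and `upsert_vertex` and the sample values. A plain dict stands in for the Cassandra table.

```python
import hashlib


def vertex_id(kind: str, value: str) -> bytes:
    """Derive a 128-bit ID from the vertex's natural key; no database read is needed."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from hashing to the same payload
    payload = f"{kind}\x00{value}".encode("utf-8")
    # Assumption: SHA-256 truncated to 16 bytes, i.e. wider than a 64-bit integer key
    return hashlib.sha256(payload).digest()[:16]


store = {}  # stands in for a table keyed by vertex ID


def upsert_vertex(kind, value, properties):
    # Blind write: the same vertex always lands on the same key, so a repeat
    # insert simply overwrites instead of creating a duplicate to reconcile
    store[vertex_id(kind, value)] = properties


upsert_vertex("host", "bad.example", {"source": "feed-a"})
upsert_vertex("host", "bad.example", {"source": "feed-a"})  # replayed event
```

Because the ID depends only on the input, any importer process can compute it independently, and replaying the same event is harmless.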
Other design differences from Titan:
• A key cache is used in the import application, so we avoid having to write the vertex key over and over
• Data is sharded into many subgraphs, so queries can include time ranges and compaction overhead is reduced
Edge design is similar to Titan: the edges of a vertex are kept in the same data partition
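One way to picture subgraph sharding together with per-vertex edge partitions is the sketch below. It assumes one subgraph per UTC day and a partition key of (shard, vertex); the slides confirm neither the shard granularity nor the key layout, and `ShardedGraph`, `shard_for`, and the sample vertices are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime, timezone


def shard_for(ts: datetime) -> str:
    # Assumption: one subgraph per day; ts is expected to already be in UTC
    return ts.strftime("%Y-%m-%d")


class ShardedGraph:
    def __init__(self):
        # (shard, source vertex) -> set of neighbour vertices.
        # All edges of a vertex within one shard share a single partition.
        self._partitions = defaultdict(set)

    def add_edge(self, src, dst, ts):
        self._partitions[(shard_for(ts), src)].add(dst)

    def neighbours(self, src, shards):
        # A time-range query reads only the partitions for the requested shards
        found = set()
        for shard in shards:
            found |= self._partitions.get((shard, src), set())
        return found


graph = ShardedGraph()
graph.add_edge("host:a", "actor:x", datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc))
graph.add_edge("host:a", "actor:y", datetime(2015, 6, 2, 9, 30, tzinfo=timezone.utc))
```

Calling `graph.neighbours("host:a", ["2015-06-01"])` touches a single partition, and older shards stop receiving writes, which is consistent with the reduced compaction overhead mentioned above.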
Email General Purpose Infrastructure (Use Case 3)
• We also have a 6-node cluster in 2 datacenters (3 nodes in each DC)
• Stores email and attachments as large encrypted blobs (from 20 MB to 2 GB) for “SecureShare”, a product that securely shares emails
• Also used as an identity database (users, customers, etc.)
Chosen over SQL because of its distributed / multi-DC nature
• Uses Cassandra as a store for clustered email topics
• Uses the Word2Vec algorithm with 100-dimensional vectors, and applies the Spark Streaming MLlib k-means clustering algorithm to the incoming stream of email subjects
• Tried k = 20, 50, and 100
• Word2Vec maps words into a vector space in which synonymous words lie close together
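The clustering step can be illustrated without Spark. The sketch below shows only the assignment half of k-means: a subject line is turned into a vector and matched to its nearest centroid. Averaging word vectors to embed a subject is an assumption (the slides do not say how subjects were vectorized), the 2-dimensional `TOY_VECTORS` and `CENTROIDS` are made-up stand-ins for trained 100-dimensional Word2Vec vectors and learned centroids, and `embed` and `nearest_centroid` are hypothetical names.

```python
import math

# Toy 2-D "word vectors"; a real model would supply 100-dimensional ones
TOY_VECTORS = {
    "invoice": [1.0, 0.0],
    "payment": [0.9, 0.1],
    "party": [0.0, 1.0],
    "lunch": [0.1, 0.9],
}

# Toy centroids, as if k-means had already been run with k = 2
CENTROIDS = [[1.0, 0.0], [0.0, 1.0]]


def embed(subject, vectors):
    """Average the vectors of the known words in a subject line (assumed scheme)."""
    words = [w for w in subject.lower().split() if w in vectors]
    if not words:
        return None  # nothing recognisable to embed
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def nearest_centroid(point, centroids):
    """Return the index of the closest centroid by Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


cluster = nearest_centroid(embed("Invoice payment overdue", TOY_VECTORS), CENTROIDS)
```

Because near-synonyms such as "invoice" and "payment" sit close together in the vector space, subjects that use different words for the same topic tend to fall into the same cluster.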
Content classification is what we do. The completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built.
• Look up content quickly (by hitting a hashed index)
• Number of columns = number of times the content was seen
• Value provides information for offline analysis (time series, patterns in content, etc.)
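The layout described here can be modelled roughly as follows. The hash function (SHA-1), the use of a sighting timestamp as the column name, and the names `ContentIndex`, `record`, and `times_seen` are assumptions for illustration; the slides only state that the index is hashed and that there is one column per sighting.

```python
import hashlib
from collections import defaultdict


class ContentIndex:
    """In-memory model: row key = hash of content, one column per sighting."""

    def __init__(self):
        # content hash -> {sighting timestamp: value kept for offline analysis}
        self._rows = defaultdict(dict)

    @staticmethod
    def key(content: str) -> str:
        # Assumption: SHA-1 hex digest as the hashed row key
        return hashlib.sha1(content.encode("utf-8")).hexdigest()

    def record(self, content, seen_at, info):
        # A new timestamp adds a column; a repeated timestamp overwrites it
        self._rows[self.key(content)][seen_at] = info

    def times_seen(self, content):
        # Number of columns = number of times the content was seen
        return len(self._rows.get(self.key(content), {}))


seen = ContentIndex()
seen.record("Win a free phone!", "2015-06-01T10:00:00Z", {"network": "example"})
seen.record("Win a free phone!", "2015-06-01T10:05:00Z", {"network": "example"})
```

Here `seen.times_seen("Win a free phone!")` counts the columns in one row, and the stored values per timestamp form the time series used for offline analysis.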
8 use cases that take advantage of Cassandra:
• Data modeling
• Distributed nature
• Other tools can easily plug in (Solr, Spark)
• Ease of use
• Community’s amazing support