Apache Eagle: Secure Hadoop in Real Time http://eagle.incubator.apache.org/ Hao Chen (@haozch) | Ralph Su (@raphaelsu)
About Us
Apache Eagle Co-creator, PMC & [email protected]
Sr. Software Engineer at [email protected]
Apache Eagle PMC & [email protected]
MTS. Software Engineer at [email protected]
Ralph Su / 苏良飞
Hao Chen / 陈浩
Agenda
• Introduction
• Architecture
• Case Study
• Q & A
Apache Eagle
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop from eBay.
Open-sourced as an Apache Incubator project on Oct 26th, 2015.
Secure Hadoop in real time: a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks and malicious activity, and block access in real time.
See http://eagle.incubator.apache.org or http://github.com/apache/incubator-eagle
Hadoop @ eBay Inc
Eagle was started at the end of 2013 for Hadoop ecosystem monitoring, because existing tools such as Zabbix and Ganglia could not handle the huge volume of metrics and logs generated by the Hadoop systems at eBay.
• 2007: 1-10 nodes
• 2009: 50+ nodes
• 2010: 100+ nodes, 1000+ cores, 1 PB
• 2011: 1000+ nodes, 10,000+ cores, 10+ PB
• 2012: 3000+ nodes, 10,000+ cores, 50+ PB
• 2014/2015: 10,000 nodes, 150,000+ cores, 200 PB, 2000+ users
Hadoop Data
• Security
• Activity
Hadoop Platform
• Health
• Availability
• Performance
Initiative – Why build Eagle?
• 7+ CLUSTERS
• 10000+ NODES
• 200+ PB DATA
• 10 B+ EVENTS / DAY
• 500+ METRIC TYPES
• 50,000+ JOBS / DAY
• 50,000,000+ TASKS / DAY
Initiative – Core Challenges
1 Large Scale Processing and Storage
• Scale IO Complexity (Stream)
• Scale Computation Complexity (Policy)
• Scale Storage & Query (Event, Log, Metric)
2 Real-Time Alerting
• Real-time Data Collection
• Real-time Stream Processing
• Real-time Alerting
3 Expressive Correlation Model
• Complex policy model
• Stream GroupBy, Join, Window
• Machine Learning
4 Hadoop Ecosystem Integration
• Data Source Integration
• Anomaly Detection Model
Eagle Ecosystem
1 Eagle Framework: Distributed real-time framework for efficiently developing highly scalable monitoring applications
2 Eagle Apps: Security / Hadoop / Operational Intelligence / …
3 Eagle Interface: REST Service / Management UI / Customizable Analytics Visualization
4 Eagle Integration: Ambari / Docker / Ranger / Dataguise
5 Open Source: Community-driven and cross-community cooperation
[Diagram labels: Eagle Framework; Apps → Security, Hadoop, Cloud, Database; Interface → Web Portal, REST Services, Analytics Visualization; Integration → Ambari, Docker, Ranger, Dataguise]
Eagle @ eBay Inc.
• 7+ CLUSTERS
• 10000+ NODES
• 200+ PB DATA
• 10 B+ EVENTS / DAY
• 500+ METRIC TYPES
• 50,000+ JOBS / DAY
• 50,000,000+ TASKS / DAY
Eagle Deployment at eBay Production
• 100+ security policies
• 8 nodes
• 30 worker processes
• 64 Kafka partitions
Eagle Performance
• Avg Latency: ~50 ms
• Max Throughput/Cluster: 300 k/s
Eagle Architecture
Distributed Policy Engine
[Diagram labels: Metadata Manager; Policy Management; Policy; Dynamic Policy Deployment; Dynamic Stream Schema; Real-time Event Stream; Stream_{1} … Stream_{*}; Stream Processing; AlertExecutor_{1} … AlertExecutor_{N}; Alerts; Real Time Alerts]
• Real-time Streaming: Apache Storm (Execution Engine) + Kafka (Message Bus)
• Declarative Policy: SQL (CEP) on Streaming + Hot Deploy
• Linear Scalability: Data volume scale + Computation scale
• Metadata-Driven: Schema Management and Dynamic Policy Lifecycle
Declarative Policy
• Filter
• Join
• Aggregation: Avg, Sum, Min, Max, etc.
• Group by
• Having
• Stream handlers for window: TimeWindow, Batch Window, Length Window
• Conditions and Expressions: and, or, not, ==, !=, >=, >, <=, <, and arithmetic operations
• Pattern processing
• Sequence processing
• Event Tables: integrate historical data in real-time processing
• SQL-like Query: Query, Stream Definition and Query Plan compilation
Distributed SQL on Streaming: Siddhi CEP + Storm by default
from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
Declarative Policy - Examples
Example 1: Alert if Hadoop NameNode capacity usage exceeds 90 percent
from hadoopJmxMetricEventStream[metric == "hadoop.namenode.fsnamesystemstate.capacityused" and value > 0.9]
select metric, host, value, timestamp, component, site
insert into alertStream;
Example 2: Alert if the Hadoop NameNode HA state switches
from every a = hadoopJmxMetricEventStream[metric == "hadoop.namenode.fsnamesystem.hastate"]
  -> b = hadoopJmxMetricEventStream[metric == a.metric and b.host == a.host and a.value != value]
within 10 min
select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as timestamp, b.metric as metric, b.component as component, b.site as site
insert into alertStream;
Distributed Policy Engine - Scalability
Distributed Streaming Cluster Environment
[Diagram labels: Stream_{1} … Stream_{*}; Stream Processing; AlertExecutor_{1} … AlertExecutor_{N}]
Linear Scalability Principle
Dynamic policy partition by {event} * {policy}
• N users with 3 partitions and M policies with 2 partitions give 3*2 physical tasks
• Physical partition + policy-level partition
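The {event} * {policy} partitioning above can be sketched as a small routing function. This is only an illustration under stated assumptions: the class name `PolicyPartitionSketch`, the method names, and the use of hash-modulo routing are hypothetical and are not taken from Eagle's code base.

```java
import java.util.Objects;

// Hypothetical sketch: map an (event key, policy id) pair to one physical task.
final class PolicyPartitionSketch {
    private final int eventPartitions;   // e.g. 3 partitions over users
    private final int policyPartitions;  // e.g. 2 partitions over policies

    PolicyPartitionSketch(int eventPartitions, int policyPartitions) {
        if (eventPartitions <= 0 || policyPartitions <= 0) {
            throw new IllegalArgumentException("partition counts must be positive");
        }
        this.eventPartitions = eventPartitions;
        this.policyPartitions = policyPartitions;
    }

    // Total number of physical tasks: event partitions times policy partitions.
    int physicalTasks() {
        return eventPartitions * policyPartitions;
    }

    // Assumption: plain hash-modulo routing on both dimensions.
    int taskFor(String eventKey, String policyId) {
        int e = Math.floorMod(Objects.hashCode(eventKey), eventPartitions);
        int p = Math.floorMod(Objects.hashCode(policyId), policyPartitions);
        return e * policyPartitions + p;
    }
}
```

With 3 event partitions and 2 policy partitions, `physicalTasks()` returns 6, matching the 3*2 figure on the slide. Adding partitions on either dimension grows the task count multiplicatively, which is one way to read the linear scalability claim.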
Distributed Policy Engine – Optimization
Stream Partition Problem: https://en.wikipedia.org/wiki/Partition_problem
Weights of executors when partitioning by user:
Algorithm | Weights of Executors
Random    | 0.0484 0.152 0.3535 0.105 0.203 0.072 0.042 0.024
Greedy    | 0.0837 0.0837 0.0837 0.0837 0.0737 0.0637 0.0437 0.0837
Stream Partition Skew (15:1)
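The slide names a "Greedy" strategy without giving details. A minimal sketch of the classic greedy heuristic for the partition problem (heaviest key first, always into the least-loaded executor) is shown here; it is an assumption that Eagle's optimizer works this way, and `GreedyPartitionSketch` is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical sketch of greedy (largest-first) assignment of weighted keys to executors.
final class GreedyPartitionSketch {

    // Returns the total weight that ends up on each of the `buckets` executors.
    static double[] assign(double[] weights, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        double[] sorted = weights.clone();
        Arrays.sort(sorted);                              // ascending order
        double[] load = new double[buckets];
        for (int i = sorted.length - 1; i >= 0; i--) {    // walk from heaviest to lightest
            int min = 0;
            for (int b = 1; b < buckets; b++) {
                if (load[b] < load[min]) {
                    min = b;                              // least-loaded executor so far
                }
            }
            load[min] += sorted[i];
        }
        return load;
    }

    // Skew as the ratio of the heaviest to the lightest executor load.
    static double skew(double[] load) {
        double max = Arrays.stream(load).max().orElse(0.0);
        double min = Arrays.stream(load).min().orElse(0.0);
        return max / min;
    }
}
```

For example, per-user weights {5, 4, 3, 3, 2, 1} spread over 2 executors end up as loads of 9 and 9, a skew of 1:1, whereas an unlucky random assignment could leave one executor with most of the traffic.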
Distributed Policy Engine – Extensibility
Policy Engine Extensibility
• Support WSO2 Siddhi CEP as first class
• Extensible Policy Engine Implementation
• Extensible Policy Lifecycle Management
• Metadata-based Module Management
Extensible Policy Evaluator
import java.util.List;

public interface PolicyEvaluatorServiceProvider {
    public String getPolicyType();      // literal string to identify one type of policy
    public Class getPolicyEvaluator();  // get policy evaluator implementation
    public List getBindingModules();    // policy text with json format to object mapping
}
[Diagram labels: Distributed Real-time Policy Engine; Siddhi CEP Policy Evaluator; Machine Learning Policy Evaluator; Metadata Manager; Policy/Metadata]
Stream Processing Framework
Optimizer
1. Development 2. Optimization 3. Compile to native app
Use the Eagle alert framework as a library
Large Scale Storage and Query
• Light-weight ORM Framework for HBase/RDBMS
• Full-function SQL-like REST Query
• Optimized Rowkey design for time-series data
• Native HBase Coprocessor
• Secondary Index Support
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})
@Indexes({
    @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity {
    @Column("a")
    private String desc;
    @Column("b")
    private String policyDef;
    @Column("c")
    private String dedupeDef;
    // ... (rest of the class is not shown on the slide)
}
Query=AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
Large Scale Storage and Query
Uniform HBase rowkey design
Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Metric
  Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
• Entity
  Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Log
  Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
  Rowvalue ::= Log Content
http://opentsdb.net
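The rowkey layout above can be illustrated with a short byte-packing sketch. Everything specific here is an assumption made for illustration: the class name `RowkeySketch`, the choice to store 4-byte hash codes for the prefix, partition keys and tags, and the sorting of tags by name are hypothetical and not confirmed by the slides.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of: Prefix | Partition Keys | timestamp | tagName | tagValue | ...
final class RowkeySketch {

    static byte[] build(String prefix, List<String> partitionKeys,
                        long timestamp, Map<String, String> tags) {
        // Assumption: tags are ordered by name so the same tag set always yields the same key.
        TreeMap<String, String> sortedTags = new TreeMap<>(tags);
        int size = 4 + 4 * partitionKeys.size() + 8 + 8 * sortedTags.size();
        ByteBuffer buf = ByteBuffer.allocate(size);   // big-endian by default
        buf.putInt(prefix.hashCode());                // Prefix (metric name, log type, ...)
        for (String key : partitionKeys) {
            buf.putInt(key.hashCode());               // Partition Keys
        }
        buf.putLong(timestamp);                       // timestamp
        for (Map.Entry<String, String> tag : sortedTags.entrySet()) {
            buf.putInt(tag.getKey().hashCode());      // tagName
            buf.putInt(tag.getValue().hashCode());    // tagValue
        }
        return buf.array();
    }
}
```

Because the prefix and partition keys come first and the timestamp is written big-endian, rows for one prefix and partition sit next to each other and sort by time for non-negative timestamps, which suits time-range scans.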
Multi-Tenants – Topology Scheduler
• Dynamic Topology Management
• No-downtime Topology Maintenance
• Topology High Availability & Balance
• Resource Scheduling & Isolation: Runtime, Worker, Topology or Cluster
Multi-Tenants - Dynamic Correlation
• Dynamic Correlation at Runtime:
  • Sort, Groupby, Join, Window
  • Hot Deploy Logic
  • Policy Management
• Multi-Correlation on a Single Stream
  • Group by different fields of the same stream
  • Re-sort the same stream in a different order
  • Join a given stream in different ways
• Multi-Correlation on Multiple Streams
  • Cross-stream Join
  • Real-time & Historical Stream Join
Eagle Use Cases
1 Data Activity Monitoring: Secure Hadoop in real time, a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks and malicious activity, and block access in real time.
2 Job Performance Monitoring: Hadoop and Spark job profiling and performance monitoring, cluster health anomaly detection.
3
4 eBay Unified Monitoring Platform: Provide unified monitoring-as-a-service for everything around infrastructure or business.
Eagle Data Activity Monitoring
• Data Loss Prevention: Get alerted and stop a malicious user trying to copy, delete or move sensitive data from the Hadoop cluster.
• Malicious Logins: Detect logins when a malicious user tries to guess a password. Eagle creates user profiles using machine learning algorithms to detect anomalies.
• Unauthorized access: Detect and stop a malicious user trying to access classified data without privilege.
• Malicious user operation: Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
Eagle Data Activity Monitoring
Hadoop Data Security: Detect anomalies in accessing HDFS and Hive
HDFS Policies
§ Access to Sensitive files
§ HDFS Commands used (read, write, update…)
§ Client host
§ Destination
§ Security Zones
Hive Policies
§ Access to tables with PII Data
§ SQL Query Profiles
§ Client Host
§ Security Zone
[Diagram labels: User, Privileges, Common Data Sets, Patterns, Commands, Zones, Query, Columns]
Eagle Machine Learning User Profile
Kernel Density Function
• Offline: From the training dataset, determine the bandwidth and the kernel density function parameters (KDE)
• Online: If a test data point lies outside the trained bandwidth, it is an anomaly (Policy)
PCs (Principal Components) in EVD (Eigenvalue Decomposition)
Eagle Job Performance Monitoring
Task Failure based Anomaly Host Detection
• Use Case: Detect node anomaly by analyzing task failure ratio across all nodes
• Assumption: Task failure ratio for every node should be approximately equal
• Algorithm: Node-by-node comparison (symmetry violation) and per-node trend
[Screenshot labels: Alerting: Anomaly Detection Alerting; Insight: Task failure drill-down]
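The node-by-node comparison can be sketched as follows. This is a simplified illustration, not Eagle's algorithm: comparing each node against the median ratio, the `tolerance` parameter, and the name `NodeFailureSketch` are all assumptions, and the per-node trend part is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: flag hosts whose task failure ratio breaks the "roughly equal" assumption.
final class NodeFailureSketch {

    // counts maps host -> {failedTasks, totalTasks}.
    static List<String> anomalousHosts(Map<String, long[]> counts, double tolerance) {
        Map<String, Double> ratios = new TreeMap<>();
        for (Map.Entry<String, long[]> e : counts.entrySet()) {
            long failed = e.getValue()[0];
            long total = e.getValue()[1];
            if (total > 0) {                      // skip hosts that ran no tasks
                ratios.put(e.getKey(), (double) failed / total);
            }
        }
        List<String> anomalies = new ArrayList<>();
        if (ratios.isEmpty()) {
            return anomalies;
        }
        double[] sorted = ratios.values().stream()
                .mapToDouble(Double::doubleValue).sorted().toArray();
        double median = sorted[sorted.length / 2];   // upper median as the cluster baseline
        for (Map.Entry<String, Double> e : ratios.entrySet()) {
            if (e.getValue() - median > tolerance) {  // symmetry violation
                anomalies.add(e.getKey());
            }
        }
        return anomalies;                            // host names in sorted order
    }
}
```

Using the median rather than the mean keeps a single badly failing host from shifting the baseline it is compared against.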
Hadoop Job Data Skew Detection
• Use Case: Detect data skew by statistics and distributions for attempt execution durations and counters
• Assumption: Duration and counters should be in normal distribution
Counters & Features
• mapDuration
• reduceDuration
• mapInputRecords
• reduceInputRecords
• combineInputRecords
• mapSpilledRecords
• reduceShuffleRecords
• mapLocalFileBytesRead
• reduceLocalFileBytesRead
• mapHDFSBytesRead
• reduceHDFSBytesRead
Modeling & Statistics
• Avg
• Min
• Max
• Distributions
• Max z-score
• Top-N
• Correlation
Threshold & Detection
• Correlation > 0.9 & Max(Z-Score) > 90%
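One of the statistics listed above, the max z-score, can be computed as in this sketch. The class name `SkewSketch` and the use of the population standard deviation are assumptions; the slide does not say how Eagle computes the score or how the 90% threshold is applied, so the cut-off is left to the caller.

```java
// Hypothetical sketch: largest absolute z-score among task attempt measurements
// (for example mapDuration values of all attempts in one job).
final class SkewSketch {

    static double maxZScore(double[] values) {
        int n = values.length;
        if (n < 2) {
            return 0.0;                       // not enough data to talk about skew
        }
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= n;
        double variance = 0.0;
        for (double v : values) {
            variance += (v - mean) * (v - mean);
        }
        variance /= n;                        // population variance (assumption)
        double std = Math.sqrt(variance);
        if (std == 0.0) {
            return 0.0;                       // all attempts identical: no skew
        }
        double max = 0.0;
        for (double v : values) {
            max = Math.max(max, Math.abs(v - mean) / std);
        }
        return max;
    }
}
```

For durations {10, 10, 10, 10, 60} the mean is 20 and the standard deviation is 20, so the slow attempt has a z-score of 2.0; a job would be flagged when this value exceeds whatever threshold the policy defines.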
Learn More about Apache Eagle
Community
• Website: http://eagle.incubator.apache.org
• Github: http://github.com/apache/incubator-eagle
• Mailing list: [email protected]
Resources
• Documentation: http://eagle.incubator.apache.org/docs/
• Docker images: https://hub.docker.com/r/apacheeagle/sandbox/
Publications & Patents
• EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER (IEEE)
• EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP CLUSTER
Q & A
http://eagle.incubator.apache.org
These slides are licensed under the Creative Commons Attribution 4.0 International license.