BI with Apache Hadoop (EN)
Post on 14-Jun-2015
Business Integration with CDH 4
(including Apache Hadoop)
Alexander Alten-Lorenz, Cloudera Inc.
Munich, 22 February 2013
Challenges
Volume Velocity Variety
Business Integration
• CRM
• Analytics
• Social Networks
• Marketing
• Document Store
• Search-Indices
• Invoicing
• Risk Management
• Universal Data Access
• Data Governance
• SAP / Salesforce
• Article and Storage Management
Use Cases
Risk Management
• Problem: Scoring of Customers and Projects
• Solution: Finance History, Communication and Pattern Detection
• Users: Finance, Insurance
Recommendations
• Problem: Recommend products that complement past purchases and match the customer's interests
• Solution: Statistical analysis of interests and purchase history, detection of matching swarm patterns
• Users: eCommerce, Advertising
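As a rough illustration of the statistical analysis of purchase history named above, the following sketch counts how often products appear together in the same basket and suggests the most frequent companions. The basket data, the product names and the functions `co_purchase_counts` and `recommend` are assumptions made for this example; a production setup on Hadoop would run this kind of counting as distributed jobs rather than in a single process.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each ordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Use a set so a product listed twice in one basket counts once.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(product, baskets, top_n=2):
    """Return the products most often bought together with `product`."""
    counts = co_purchase_counts(baskets)
    scored = [(other, n) for (p, other), n in counts.items() if p == product]
    # Highest count first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [other for other, _ in scored[:top_n]]


# Hypothetical purchase history, one list per order.
baskets = [
    ["camera", "tripod", "sd_card"],
    ["camera", "sd_card"],
    ["camera", "bag"],
    ["tripod", "bag"],
]
suggestions = recommend("camera", baskets)
```

Plain co-occurrence counting is only the simplest possible stand-in for the pattern detection the slide mentions; it ignores product popularity and individual interests.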
Graph-Analytics
• Problem: Detect trends and curves in large distributed networks (Wired, Social, Mesh)
• Solution: Collect and mine all data, then apply self-learning pattern detection to identify trends and produce forecasts
• Users: Enterprises, Gov, NGO, Provider, Telco, Stock Exchange
Detection of Dangerous Use
• Problem: Spam, Credit Card Abuse
• Solution: Pattern Detection, Prioritization, Heuristic Analytics
• Users: Retail, Finance, Reseller
Text Analysis
• Problem: Detect the meaning of the written word (Sentiment Analysis)
• Solution: Keyword patterns, Coherence detection, Path detection
• Users: eCommerce, Social Media Service Provider, Attitude Research
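The keyword-pattern approach above can be sketched in a few lines. The word lists and the functions `sentiment_score` and `label` are assumptions chosen for illustration, not part of the talk; real sentiment analysis needs far larger vocabularies and has to handle negation and context.

```python
import re

# Hypothetical keyword lists; a real system would use a much larger lexicon.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "broken"}


def sentiment_score(text):
    """Return (#positive keywords) minus (#negative keywords) in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "Great camera, I love it",
    "Poor battery and broken lens",
    "It arrived on Monday",
]
labels = [label(r) for r in reviews]
```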
Real-World Data Volumes
• eBay: 12 PB, Search Optimization
• Facebook: 50 PB, Logs, Reports
• Walmart: 4.5 PB, Customer Transactions
http://wiki.apache.org/hadoop/PoweredBy
http://en.wikipedia.org/wiki/Big_data
Apache Hadoop
• Software Framework for large amounts of unstructured data
• Apache-License
• Two main cores
• HDFS: Distributed data storage
• MapReduce: Distributed data handling
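To make the MapReduce model more concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) using the classic word-count example. It does not use the Hadoop API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are assumptions for this illustration. On a real cluster the map and reduce steps run in parallel on the data nodes and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
```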
Hadoop Cluster
Data Node: 4-16 Cores, 4-16 Disks, 8-64 GB RAM, 1-10 GbE Network
[Diagram: a cluster made up of many identical data nodes]
Hadoop Distributed File System
[Diagram: a file is split into blocks, and the blocks are distributed across the data nodes]
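The idea of splitting a file into blocks and spreading them over data nodes can be sketched as follows. The 128 MB block size, the replication factor of 3, the node names and the round-robin placement are all assumptions for this example; actual HDFS block size and replication are configurable, and its real placement policy is rack-aware rather than round-robin.

```python
def split_into_blocks(file_size, block_size):
    """Return the size of each block; only the last one may be smaller."""
    count = (file_size + block_size - 1) // block_size  # integer ceiling
    return [min(block_size, file_size - i * block_size) for i in range(count)]


def place_blocks(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + replica) % len(nodes)]
            for replica in range(replication)
        ]
    return placement


MB = 1024 * 1024
# Hypothetical 300 MB file on a four-node cluster.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"], 3)
```

Because every block lives on several nodes, the loss of a single node leaves at least two copies of each of its blocks available, which is the basis of the fault tolerance claimed for HDFS.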
MapReduce
[Diagram: relationship between Data and Query, compared for RDBMS and Hadoop]
Features
                  HDFS   MapReduce
Distribution       ✔        ✔
Fault Tolerance    ✔        ✔
Scalability        ✔        ✔
Hadoop Ecosystem
• Core: HDFS, MapReduce, Java API
• Sqoop: RDBMS
• Flume: Logs
• Connectors: ...
• Pig: Scripts
• Hive: SQL
• HBase
• Oozie
• ZooKeeper
• Mahout
• Hue
• Whirr
• Avro
Example of an Integration
Scope
• Successful Audits per ISO 27001
• Analyze different Data Sources from different Data Bases and CRM Systems
• Realtime and Lifetime Statistics per Product
• Periodic Analytics and Statistics Jobs
• Weekly Re-Import into CRM
• Single Queries per User (Analyst) over a Secured GUI
Solution Path
• Cluster Authentication and Authorization via Kerberos and encrypted data communication / Data Protection
• Sqoop Connector to CRM / DB
• Teradata, Oracle, Postgres, MySQL, MS SQL
• Hive - HBase Integration
• Hive Analytics, controlled automatically over Oozie Workload Orchestrator
• Hue Shell, Authentication via Kerberos SPNEGO
[Architecture diagram. Components: CRM Park, Sqoop, CDH (Hive, HBase), Kerberos (AD, MIT v5), Oozie, Hue, End user. Labels: Integration, Authentication, Automation, Real Time]
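The Sqoop step of the solution path could look roughly like the sketch below, which only assembles a `sqoop import` command line for loading one CRM table into Hive and does not execute it. The JDBC URL, user, file path and table names are hypothetical, the helper `build_sqoop_import` is an assumption of this example, and `--password-file` is assumed to be supported by the installed Sqoop version. In the setup described here such a command would typically be triggered by an Oozie workflow rather than by hand.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       hive_table, mappers=4):
    """Build the argument list for a Sqoop import of one table into Hive."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,                  # source table in the CRM database
        "--hive-import",                   # load the result into Hive
        "--hive-table", hive_table,
        "--num-mappers", str(mappers),     # number of parallel import tasks
    ]


# Hypothetical connection details, for illustration only.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://crm-db.example.com/crm",
    username="etl_user",
    password_file="/user/etl/crm.password",
    table="customers",
    hive_table="crm_customers",
)
command_line = " ".join(shlex.quote(part) for part in cmd)
```

Building the command as a list keeps each argument separate, so it can be passed to a process launcher without shell-quoting problems; `command_line` is only for logging or display.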
How to Manage?
Cloudera Manager
• Automated Deployment
• Monitoring
• Service Management
• Log Management
• Events and Alerts
• Reporting
• Support Integration
Cloudera
• Founded 2009 in Palo Alto
• Cloudera's Distribution Including Hadoop
• CDH4 / Cloudera Manager 4
• > 320 employees worldwide
• Training, Consulting, Support, Development
• Enterprise Tools
Thank You!
• alexander@cloudera.com
• Twitter: @mapredit
• Blog: mapredit.blogspot.com
• http://www.cloudera.com/
• http://hadoop.apache.org/