Transcript
Petabyte Scale Data Warehousing at Facebook
Ning Zhang
Data Infrastructure, Facebook
Overview
Motivations
– Data-driven model
– Challenges
Data Infrastructure
– Hadoop & Hive
– In-house tools
Hive Details
– Architecture
– Data model
– Query language
– Extensibility
Research Problems
Motivations
Facebook is just a Set of Web Services …
… at Large Scale
The social graph is large
– 400 million monthly active users
– 250 million daily active users
– 160 million active objects (groups/events/pages)
– 130 friend connections per user on average
– 60 object (groups/events/pages) connections per user on average
Activities on the social graph
– People spend 500 billion minutes per month on FB
– The average user creates 70 pieces of content each month
– 25 billion pieces of content are shared each month
– Millions of search queries per day
Facebook is still growing fast
– New users, features, services …
Facebook is still growing and changing
[Figure: Timeline of Monthly Active Users (MAU)]
Under the Hood
Data flow from users’ perspective
– Clients (browser/phone/3rd-party apps) → Web Services → Users
– Web Services themselves are another big topic
To complete the feedback system …
– The developers want to know how a new app/feature is received by the users (A/B testing)
– The advertisers want to know how their ads perform (dashboards/reports)
– Based on historical data, how to construct a model and predict the future (machine learning)
Need data analytics!
– Data warehouse: ETL, data processing, BI …
– Closing the loop: decision-making based on analyzing the data (users’ feedback)
Data-driven Business/R&D/Science …
DSS is not new, but the Web gives it new elements.
“In 2009, more data will be generated by individuals than the entire history of mankind through 2008.”
-- Andreas Weigend, Harvard Business Review
“The center of the universe has shifted from e-business to me-business.”
-- same as above
“Invariably, simple models and a lot of data trump more elaborate models based on less data.”
-- Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data
Problems and Challenges
Data-driven development/business
– Huge amounts of log data/user data generated every day
– Need to analyze these data and feed the results back into development/business decisions
– Machine learning, report/dashboard generation, A/B testing
And many more problems
– Scalability (more than petabytes)
– Availability (HA)
– Manageability (e.g., scheduling)
– Performance (CPU, memory, disk/network I/O)
– And many more …
Facebook Engineering Teams (backend)
Facebook Infrastructure
– Building foundations that serve end users/applications
– OLTP workload
– Components include MySQL, memcached, HipHop (PHP), Thrift, Cassandra, Haystack, flashcache, …
Facebook Data Infrastructure (data warehouse)
– Building systems that serve data analysts, research scientists, engineers, product managers, executives, etc.
– OLAP workload
– Components include Hadoop, Hive, HDFS, Scribe, HBase, tools (ETL, UI, workflow management, etc.)
Other engineering teams
– Platform, search, site integrity, monetization, apps, growth, etc.
DI Key Challenges (I) – scalability
Data, data and more data
– 200 GB/day in March 2008 → 12 TB/day at the end of 2009
– About 8x increase per year
– Total size is 5 PB now (x3 when considering replication)
– Same order as the Web (~25 billion indexable pages)
DI Key Challenges (II) – Performance
Queries, queries and more queries
– More than 200 unique users query the data warehouse every day
– 7K queries/day at the end of 2009
– 25K queries/day now
– Workload is a mixture of ad-hoc queries and ETL/reporting queries
Fast, faster and real-time
– Users expect faster response times on fresher data (e.g., fighting spam/fraud in near real time)
– Sampling a subset of the data is not always good enough
Other Requirements
Accessibility
– Everyone should be able to log & access data easily, not only engineers (a lot of our users do not have CS degrees!)
– Schema discovery (more than 20K tables)
– Data exploration and visualization (learning the data by looking)
– Leverage existing prevalent and familiar tools (e.g., BI tools)
Flexibility
– Schemas change frequently (adding new columns, changing column types, different partitions of tables, etc.)
– Data formats could be different (plain text, row store, column store, complex data types)
Extensibility
– Easy to plug in user-defined functions, aggregations, etc.
– Data storage could be files, web services, “NoSQL stores” …
Why not Existing Data Warehousing Systems?
Cost of analysis and storage on proprietary systems does not support the trend towards more data
– Cost based on data size (15 PB costs a lot!)
– Expensive hardware and support
Limited scalability does not support the trend towards more data
– Products designed decades ago (not suitable for a petabyte DW)
– ETL is a big bottleneck
Long product development & release cycles
– User requirements change frequently (agile programming practice)
Closed and proprietary systems
Let’s try Hadoop (MapReduce + HDFS) …
Pros
– Superior availability/scalability/manageability (99.9%)
– Large and healthy open-source community (popular in both industry and academic organizations)
But not quite …
Cons: programmability and metadata
– Efficiency not that great, but can throw more hardware at it
– MapReduce is hard to program (users know SQL/bash/Python) and hard to debug, so it takes longer to get results
– No schema
Solution: Hive!
What is Hive ?
A system for managing and querying structured data built on top of Hadoop
– MapReduce for execution
– HDFS for storage
– RDBMS for metadata
Key building principles:
– SQL is a familiar language on data warehouses
– Extensibility – types, functions, formats, scripts (connecting to HBase, Pig, Hypertable, Cassandra, etc.)
– Scalability and performance
– Interoperability (JDBC/ODBC/Thrift)
Hive: Familiar Schema Concepts
Concept     Example                              HDFS directory
Table       pvs                                  /wh/pvs
Partition   ds = 20090801, ctry = US             /wh/pvs/ds=20090801/ctry=US
Bucket      user hashed into 32 buckets          /wh/pvs/ds=20090801/ctry=US/part-00000 (file for hash 0)
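The bucket row in the table above can be sketched in code: a bucketed column is hashed and taken modulo the bucket count to pick the part file. This is an illustrative sketch only; the exact hash Hive uses (and the masking here to keep it non-negative) is an assumption, not taken from the talk.

```python
# Sketch of Hive-style bucketing: hash the bucketing column, take it
# modulo the number of buckets, and map the bucket to a part file.
# The hash function below is Java-like but illustrative, not Hive's exact one.

def bucket_for(user: str, num_buckets: int = 32) -> int:
    """Map a user id to a bucket number in [0, num_buckets)."""
    h = 0
    for ch in user:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF  # keep the running hash non-negative
    return h % num_buckets

def bucket_file(user: str) -> str:
    """Bucket N lands in HDFS file part-0000N, matching the layout above."""
    return f"part-{bucket_for(user):05d}"
```

Because the hash is deterministic, all rows for the same user land in the same file, which is what makes bucketed sampling and bucketed map joins possible.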
Column Data Types
• Primitive types
  • integer types, float, string, date, boolean
• Nest-able collections
  • array<any-type>
  • map<primitive-type, any-type>
• User-defined types
  • structures with attributes, which can be of any-type
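As a rough mental model, the nest-able collection types above map onto ordinary Python containers (array → list, map → dict, struct → dict of attributes). The column names and values below are made up for illustration.

```python
# Sketch of how a row with Hive's collection types might look in memory.
row = {
    "uhash": "u123",                                      # string (primitive)
    "friend_ids": [7, 42, 99],                            # array<int>
    "page_views": {"home": 3, "profile": 1},              # map<string, int>
    "location": {"city": "Palo Alto", "country": "US"},   # struct<city:string, country:string>
}

# Collections nest: map<string, array<int>> is a dict of lists.
views_by_day = {"20090801": [1, 2], "20090802": [3]}
```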
Hive Query Language
DDL
– {create/alter/drop} {table/view/partition}
– create table as select
DML
– insert overwrite
QL
– Sub-queries in the FROM clause
– Equi-joins (including outer joins)
– Multi-table insert
– Sampling
– Lateral views
Interfaces
– JDBC/ODBC/Thrift
Optimizations
Column pruning
– Also pushed down to the scan in columnar storage (RCFile)
Predicate pushdown
– Not pushed below non-deterministic functions (e.g., rand())
Partition pruning
Sample pruning
Handling small files
– Merge while writing
– CombineHiveInputFormat while reading
Small jobs
– SELECT * with partition predicates runs in the client
Restartability (work in progress)
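Partition pruning, listed above, falls directly out of the directory layout from the schema slide: since partition values are encoded in path names (ds=…/ctry=…), a predicate on partition columns can eliminate whole directories before any data is read. A minimal sketch, with made-up partition metadata:

```python
# Sketch of partition pruning: apply partition-column predicates to the
# partition list so only matching HDFS directories are ever scanned.

def prune_partitions(partitions, ds=None, ctry=None):
    """Keep only partition entries matching the given partition predicates."""
    out = []
    for p in partitions:
        if ds is not None and p["ds"] != ds:
            continue  # whole directory skipped, no data read
        if ctry is not None and p["ctry"] != ctry:
            continue
        out.append(p)
    return out

parts = [
    {"ds": "20090801", "ctry": "US", "path": "/wh/pvs/ds=20090801/ctry=US"},
    {"ds": "20090801", "ctry": "UK", "path": "/wh/pvs/ds=20090801/ctry=UK"},
    {"ds": "20090802", "ctry": "US", "path": "/wh/pvs/ds=20090802/ctry=US"},
]
survivors = prune_partitions(parts, ds="20090801", ctry="US")
```

Only the surviving paths are handed to the MapReduce job as input, which is why pruning on a date partition can cut a scan from the whole table to a single day.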
Hive: Simplifying Hadoop Programming
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
    -input /user/hive/warehouse/kv1 \
    -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh \
    -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*

vs.

hive> SELECT key, count(1) FROM kv1 WHERE key > 100 GROUP BY key;
MapReduce Scripts Examples
add file page_url_to_id.py;
add file my_python_session_cutter.py;

FROM (
  SELECT TRANSFORM(uhash, page_url, unix_time)
         USING 'page_url_to_id.py'
         AS (uhash, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY uhash
  SORT BY uhash, unix_time
) mylog2
SELECT TRANSFORM(uhash, page_id, unix_time)
       USING 'my_python_session_cutter.py'
       AS (uhash, session_info);
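The DISTRIBUTE BY / SORT BY in the inner query guarantees that each user's rows arrive at the script contiguously and time-ordered, so a session cutter only needs one pass. A sketch of what such a script's core logic might look like; the 30-minute gap threshold and the output shape are assumptions, not from the talk:

```python
# Sketch of the logic inside a session-cutter script: rows arrive sorted
# by (uhash, unix_time); a new session starts when the gap between two
# consecutive events for the same user exceeds a timeout.

SESSION_GAP = 30 * 60  # assumed session timeout, in seconds

def cut_sessions(rows):
    """rows: iterable of (uhash, page_id, unix_time), sorted per user by time.
    Yields one (uhash, session_id) pair per input row."""
    last_user, last_time, session_id = None, None, 0
    for uhash, page_id, unix_time in rows:
        if uhash != last_user:
            session_id = 0                       # new user: restart numbering
        elif unix_time - last_time > SESSION_GAP:
            session_id += 1                      # same user, long gap: new session
        yield (uhash, session_id)
        last_user, last_time = uhash, unix_time

rows = [("u1", "home", 1000), ("u1", "profile", 1100),
        ("u1", "home", 1100 + 31 * 60), ("u2", "home", 50)]
sessions = list(cut_sessions(rows))
```

In the real TRANSFORM setting the rows would be read from stdin and the pairs written to stdout, tab-separated.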
Hive Architecture
Hive: Making Optimizations Transparent
Joins:
– Joins try to reduce the number of map/reduce jobs needed
– Memory-efficient joins by streaming the largest tables
– Map joins
  User-specified small tables stored in hash tables on the mapper
  No reducer needed
Aggregations:
– Map-side partial aggregations
  Hash-based aggregates
  Serialized key/values in hash tables
– 90% speed improvement on the query SELECT count(1) FROM t;
– Load balancing for data skew
Hive: Making Optimizations Transparent
Storage:
– Column-oriented data formats
– Column and partition pruning to reduce scanned data
– Lazy de-serialization of data
Plan execution:
– Parallel execution of parts of the plan
Hive: Open & Extensible
Different on-disk storage (file) formats
– Text File, Sequence File, …
Different serialization formats and data types
– LazySimpleSerDe, ThriftSerDe, …
User-provided map/reduce scripts
– In any language; use stdin/stdout to transfer data
User-defined functions
– Substr, Trim, From_unixtime, …
User-defined aggregation functions
– Sum, Average, …
User-defined table functions
– Explode, …
Hive: Interoperability with Other Tools
JDBC
– Enables integration with JDBC-based SQL clients
ODBC
– Enables integration with Microstrategy
Thrift
– Enables writing cross-language clients
– Main form of integration with the PHP-based Web UI
Powered by Hive
Usage in Facebook
Usage
Types of applications:
– Reporting
  E.g., daily/weekly aggregations of impression/click counts
  Measures of user engagement
  Microstrategy reports
– Ad-hoc analysis
  E.g., how many group admins, broken down by state/country
– Machine learning (assembling training data)
  Ad optimization
  E.g., user engagement as a function of user attributes
– Many others
Hadoop & Hive Cluster @ Facebook
Hadoop/Hive cluster
– 13,600 cores
– Raw storage capacity ~ 17 PB
– 8 cores + 12 TB per node
– 32 GB RAM per node
– Two-level network topology
  1 Gbit/sec from node to rack switch
  4 Gbit/sec to top-level rack switch
2 clusters
– One for ad-hoc users
– One for strict-SLA jobs
Hive & Hadoop Usage @ Facebook
Statistics per day:
– 800 TB of I/O per day
– 10K – 25K Hive jobs per day
Hive simplifies Hadoop:
– New engineers go through a Hive training session
– Analysts (non-engineers) use Hadoop through Hive
– Most jobs are Hive jobs
Data Flow Architecture at Facebook
[Diagram: data flow among Web Servers, Scribe-Hadoop Cluster, Scribe-HDFS, Production Hive-Hadoop Cluster, Adhoc Hive-Hadoop Cluster (fed via Hive replication), Oracle RAC, and Federated MySQL]
Scribe-HDFS: 101
[Diagram: multiple Scribed daemons send <category, msgs> to HDFS Data Nodes, appending to /staging/<category>/<file>]
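The append path in the diagram can be sketched as a small routing function: each message carries a category, and messages for a category are appended to a file under /staging/<category>/. The path template comes from the slide; the in-memory sink, method names, and the "current" filename are assumptions for illustration.

```python
# Sketch of Scribe-HDFS routing: append each <category, msg> to a
# per-category staging file, mirroring /staging/<category>/<file>.

def staging_path(category: str, filename: str) -> str:
    return f"/staging/{category}/{filename}"

class ScribeHdfsSink:
    """Toy in-memory stand-in for the HDFS append target (hypothetical)."""
    def __init__(self):
        self.files = {}  # path -> list of appended messages

    def append(self, category: str, msg: str, filename: str = "current"):
        path = staging_path(category, filename)
        self.files.setdefault(path, []).append(msg)

sink = ScribeHdfsSink()
sink.append("ad_clicks", "click:1")
sink.append("ad_clicks", "click:2")
sink.append("page_views", "view:1")
```

Keeping one file per category is what lets downstream loaders pick up fresh data for a single log stream without scanning everything else.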
Scribe-HDFS: Near real time Hadoop
Clusters collocated with the web servers
Network is the biggest bottleneck
Typical cluster has about 50 nodes.
Stats:
– 50 TB/day of raw data logged
– 99% of the time, data is available within 20 seconds
Warehousing at Facebook
Instrumentation (PHP/Python etc.)
Automatic ETL
– Continuous copy of data to Hive tables
Metadata discovery (CoHive)
Query (Hive)
Workflow specification and execution (Chronos)
Reporting tools
Monitoring and alerting
Future Work
Scaling in a dynamic and fast-growing environment
– Erasure codes for Hadoop
– Namenode scalability past 150 million objects
Isolating ad-hoc queries from jobs with strict deadlines
– Hive replication
Resource sharing
– Pools for slots
More scalable loading of data
– Incremental load of site data
– Continuous load of log data
Future Work
Discovering data from > 20K tables
– Collaborative Hive
Finding unused/rarely used data
Future:
– Dynamic inserts into multiple partitions
– More join optimizations
– Persistent UDFs, UDAFs and UDTFs
– Benchmarks for monitoring performance
– IN, EXISTS and correlated sub-queries
– Statistics
– Materialized views
Research Challenges
Reducing response time for small/medium jobs
– 20 thousand queries per day → 1 million queries per day
– Indexes on Hadoop, data-mart strategy
– Near real-time query processing – pipelining MapReduce
Distributed-systems problems at large scale:
– Job scheduling: mixed throughput and response-time workloads
– Orchestrating commits on thousands of machines (Scribe conf files)
– Cross-data-center replication and consistency
Full SQL compliance
– Required by 3rd-party tools (e.g., BI) through ODBC/JDBC
Query Optimizations
Efficiently compute histograms, median, distinct values in a distributed shared-nothing architecture
Cost models in the MapReduce framework
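One common shared-nothing approach to the histogram/median problem above: each node summarizes its partition with a small equi-width histogram, the histograms are merged, and the median is estimated from the merged counts. The bin layout and bin-midpoint estimate below are illustrative assumptions, not a method stated in the talk.

```python
# Sketch of a distributed approximate median: per-partition equi-width
# histograms, merged centrally, with the median read off the merged bins.

def local_histogram(values, lo, hi, bins):
    """Build one partition's equi-width histogram over [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

def merge_histograms(hists):
    """Merge per-partition histograms by summing bin-wise."""
    return [sum(col) for col in zip(*hists)]

def approx_median(counts, lo, hi):
    """Estimate the median as the midpoint of the bin holding the middle element."""
    width = (hi - lo) / len(counts)
    half, seen = sum(counts) / 2, 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= half:
            return lo + (i + 0.5) * width

h1 = local_histogram([1, 2, 3], 0, 10, 10)
h2 = local_histogram([7, 8, 9, 9], 0, 10, 10)
merged = merge_histograms([h1, h2])
```

Only the fixed-size histograms cross the network, so the cost is independent of data volume per node; accuracy trades off against bin count.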
Social Graph
Every user sees a different, personalized stream of information (news feed)
– 130 friend + 60 object updates in real time
– Edge-rank: ranking of updates that should be shown at the top
The social graph is stored in distributed MySQL databases
– Data replication between data centers: an update to one data center should be replicated to other data centers as well
– How to partition a dense graph so that data transfer between partitions is minimized
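The partitioning question above has a simple objective to make concrete: every edge whose endpoints live on different machines implies cross-machine data transfer, so a good partition minimizes cut edges. A sketch of the objective only (the toy graph and assignment are made up; finding a good assignment is the hard, open part):

```python
# Sketch of the graph-partitioning objective: count edges that cross
# partitions, since each cut edge means data transfer between machines.

def cut_edges(edges, part):
    """edges: iterable of (u, v) pairs; part: dict node -> partition id.
    Returns the number of edges whose endpoints are in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
part = {"a": 0, "b": 0, "c": 1, "d": 1}
crossings = cut_edges(edges, part)
```

For a dense social graph no balanced partition has few cut edges, which is why this remains a research problem rather than a solved engineering task.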
Questions?