BigDataBench: Benchmarking Big Data Systems
Lei Wang
Institute of Computing Technology, CAS
2013-10-31
http://prof.ict.ac.cn/BigDataBench/
BPOE HPCChina 2013
A big opportunity in the big data era
It is a chance to innovate, but how do we seize it?
Measuring big data architectures, systems, and data management quantitatively
What is BigDataBench
An open source project on big data benchmarking:
• http://prof.ict.ac.cn/BigDataBench/
• Six raw real data sets
  – Synthetic data can scale up to PB
• Six application scenarios
  – Micro-benchmarks, search engine, social network and e-commerce
• A full spectrum of system software stacks
  – Hadoop, MPI, Spark, Hive, Impala, …
Who can use BigDataBench
BigDataBench can support:
• Architecture design
  – Innovative processors, memory, networks, …
• System design
  – Innovative OS for big data, innovative file systems for big data, …
• Data management design
  – …
• Performance optimization
  – Micro-architecture characterization
• Distributed system optimization
  – Scheduling policy, programming model
Outline
• Benchmarking Methodology and Decision
• Scalable Data Generation Tool
• Case Study
• How to Use BigDataBench
Methodology
• Investigate typical application domains
• Representative real data sets
  – Data sources: text data, graph data, table data, extended …
  – Data types: structured, semi-structured, unstructured
• Diverse and important workloads
  – Application types: offline analytics, realtime analytics, online services
  – Basic and important operations and algorithms, extended …
  – Representative software stacks, extended …
• Big data sets preserving the 4V
  – Synthetic data generation tool preserving data characteristics
• Big data workloads
• Outcome: BigDataBench
Typical Application Domains
Search engines, social networks and electronic commerce account for 80% of the page views of all Internet services.
Data Sets Chosen
• Data type: pay equal attention to structured, semi-structured and unstructured data
• Data source: important data sources in the application domain
• Application domain: search engine, social network and e-commerce
Representative Data sets
Application Domain | Data Type | Data Source | Data Set
Search Engine | Unstructured data | Text data | Wikipedia Entries
Search Engine | Unstructured data | Graph data | Google Web Graph
Search Engine | Semi-structured data | Table data | ProfSearch Person Resume
E-commerce | Semi-structured data | Text data | Amazon Movie Reviews
E-commerce | Structured data | Table data | ABC Transaction Data
Social Network | Unstructured data | Graph data | Facebook Social Graph
Workloads Chosen
• Covering workloads in diverse and representative application scenarios
  – Search engine, e-commerce, social network
• Paying equal attention to different application types
  – Online services, real-time data analysis, offline data analysis
• Including different data sources
  – Text data, graph data, table data
• Covering the representative software stacks
  – Data store systems, data management systems, programming frameworks
Chosen Workloads Summary
Application scenarios (operations & algorithms):
• Micro-benchmarks
  – Basic Operations
  – Basic Cloud OLTP
  – Basic Relational Query
• Search Engine
• E-commerce
• Social Network
Basic Operations
Operations & Algorithm | Data Type | Data Source | Software Stack | Application Type
Sort | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics
Grep | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics
WordCount | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics
BFS | Unstructured | Graph | MapReduce, Spark, MPI | Offline Analytics
Basic Cloud OLTP
Operations & Algorithm | Data Type | Data Source | Software Stack | Application Type
Read | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services
Write | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services
Scan | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services
Basic Relational Query
Operations & Algorithm | Data Type | Data Source | Software Stack | Application Type
Select Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics
Aggregate Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics
Join Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics
Operations & Algorithms in Search Engine
Operations & Algorithm | Data Type | Data Source | Software Stack | Application Type
Nutch Server | Structured | Table | Hadoop | Online Services
PageRank | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics
Index | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics
Operations & Algorithms in Social Network
Operations & Algorithm | Data Type | Data Source | Software Stack | Application Type
Olio Server | Structured | Table | MySQL | Online Services
Kmeans | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics
Connected Components | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics
Operations & Algorithms in E-commerce
Operations & Algorithm | Data Type | Data Source | Software Stack | Application Type
Rubis Server | Structured | Table | MySQL | Online Services
Collaborative Filtering | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics
Naive Bayes | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics
Outline
• Benchmarking Methodology and Decision
• Scalable Data Generation Tool
• Case Study
• How to Use BigDataBench
Data Generation Tools
Seed data sources
• Text, graph and table data
• Six real raw data sets
Synthetic data scale
• From GB to PB
Features of the synthetic data
• Preserves the characteristics of real-world data
Text generator
Use latent Dirichlet allocation (LDA), a topic model and generative probabilistic model, to generate the text corpus.
David M. Blei, et al., "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
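The LDA generative process the text generator relies on can be illustrated with a minimal sketch. It is not the BigDataBench tool itself: the vocabulary, the topic-word probabilities, and the function names below are hypothetical, whereas the real generator would learn its model from the seed data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return words

# Hypothetical toy model: two topics over a six-word vocabulary.
VOCAB = ["data", "graph", "query", "node", "table", "edge"]
TOPIC_WORD = [
    [0.40, 0.00, 0.30, 0.00, 0.30, 0.00],  # a "database" topic
    [0.10, 0.40, 0.00, 0.25, 0.00, 0.25],  # a "graph" topic
]

rng = random.Random(42)
doc = generate_document(TOPIC_WORD, VOCAB, alpha=[0.5, 0.5], length=12, rng=rng)
print(" ".join(doc))
```

Scaling the corpus then amounts to generating more documents from the same learned model, which is why the synthetic text can keep the topic structure of the seed data.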
Graph generator
Use the Stochastic Kronecker Graph model (Jure Leskovec, et al.) to generate graphs.
Application-specific: obtained from representative real data sets of specific applications.
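As a rough illustration of the stochastic Kronecker idea, the sketch below places each edge by descending through a small initiator matrix once per level. This is the common edge-placement approximation, not necessarily how the BigDataBench generator is implemented; the initiator values and the function name are hypothetical, and a real tool would fit the initiator to the seed graph.

```python
import random

def kronecker_edges(initiator, levels, num_edges, rng):
    """Sample distinct directed edges of a graph with len(initiator)**levels nodes."""
    n = len(initiator)
    if num_edges > n ** (2 * levels):
        raise ValueError("more edges requested than the graph can hold")
    cells = [(i, j) for i in range(n) for j in range(n)]
    weights = [initiator[i][j] for i, j in cells]
    edges = set()
    while len(edges) < num_edges:
        row = col = 0
        for _ in range(levels):
            # Each level picks one cell of the initiator and refines the position.
            i, j = rng.choices(cells, weights=weights)[0]
            row = row * n + i
            col = col * n + j
        edges.add((row, col))
    return sorted(edges)

# Hypothetical 2x2 initiator; all entries are positive so every edge is reachable.
INITIATOR = [[0.9, 0.5], [0.5, 0.1]]
edges = kronecker_edges(INITIATOR, levels=4, num_edges=20, rng=random.Random(7))
print(len(edges), "edges over", 2 ** 4, "nodes")
```

Increasing `levels` grows the node count exponentially while reusing the same small initiator, which is what makes the model convenient for scaling a graph up from a seed.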
Table generator
Generate related structured tables with the Parallel Data Generation Framework (Tilmann Rabl, et al.)
Outline
• Benchmarking Methodology and Decision
• Scalable Data Generation Tool
• Case Study
• How to Use BigDataBench
Case study of BigDataBench
Topics studied with BigDataBench:
• Evaluating different platforms
• Performance evaluation
• Characterizing workloads
• Performance diagnosis
• Evaluating energy efficiency
Participating institutions: USTC; ICT, CAS; SIAT, CAS; CNCERT; XJTU; SJTU
Evaluating Different Platforms
Evaluating the performance of different system platforms in big data computing
• University of Science and Technology of China
"The Implications from Benchmarking Three Different Data Center Platforms," First BPOE, in conjunction with IEEE Big Data 2013
Big Data Workload Characterization
Analyzing the redundancy among big data benchmarks
• Shenzhen Institutes of Advanced Technology, CAS
"The Implications from Benchmarking Three Different Data Center Platforms," First BPOE, in conjunction with IEEE Big Data 2013
Performance diagnosis
An ensemble MIC (Maximum Information Criterion)-based approach to pinpoint the culprits of performance problems in a big data platform
• Xi'an Jiaotong University
"An Ensemble MIC-based Approach for Performance Diagnosis in Big Data Platform," First BPOE, in conjunction with IEEE Big Data 2013
Evaluating energy efficiency
New metrics that measure the power usage effectiveness of IT equipment and data center systems
• National Computer Network Emergency Response Technical Team Coordination Center of China
"AxPUE: Application Level Metrics for Power Usage Effectiveness in Data Centers," First BPOE, in conjunction with IEEE Big Data 2013
Evaluating Virtualization Systems
A new network socket library for virtualization scenarios that uses shared memory for data transmission
• Shanghai Jiao Tong University
"Virtualization I/O Optimization Based on Shared Memory," First BPOE, in conjunction with IEEE Big Data 2013
Big Data Workload Characterization
"BigDataBench: a Big Data Benchmark Suite from Internet Services," Lei Wang, et al., ICT Technical Report
Big data workloads have very low floating-point operation intensities (0.009 on average), two orders of magnitude lower than the theoretical value of state-of-practice CPUs.
Big Data Workload Characterization
"BigDataBench: a Big Data Benchmark Suite from Web Search Engines," Wanling Gao, et al., ASBD 2013, in conjunction with the 40th ISCA
Architecture research that uses only simple applications and limited data sets is not feasible for big data scenarios.
BigDataBench
• One international benchmark workshop
  – http://prof.ict.ac.cn/bpoe
• A full-day tutorial at HPCA 2013
  – http://prof.ict.ac.cn/HPCA/
• Two invited talks at WBDB (Workshop on Big Data Benchmarking)
BigDataBench Website
English website
• http://prof.ict.ac.cn/BigDataBench/
Chinese website
• http://prof.ict.ac.cn/BigDataBench/zh/
Highlights
• Benchmark introduction
• Benchmark download
• Publications & news
• User ……
We hope more users will join us, and we will do our best to support you.
Outline
• Benchmarking Methodology and Decision
• Scalable Data Generation Tool
• Case Study
• How to Use BigDataBench
BigDataBench Class
For architecture
• 19 of 19 workloads
For OS
• 19 of 19 workloads
For runtime environment (Hadoop)
• 9 of 19 workloads
  – Sort, Grep, WordCount, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering and Naive Bayes
For data management
• 6 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query
BigDataBench Class: data source
Text related
• 6 of 19 workloads
  – Sort, Grep, WordCount, Index, Collaborative Filtering and Naive Bayes
Graph related
• 4 of 19 workloads
  – BFS, PageRank, Kmeans and Connected Components
Table related
• 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Nutch Server, Olio Server and Rubis Server
BigDataBench Class: application type
Online Services
• 6 of 19 workloads
  – Read, Write, Scan, Nutch Server, Olio Server and Rubis Server
Offline Analytics
• 10 of 19 workloads
  – Sort, Grep, WordCount, BFS, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering and Naive Bayes
Realtime Analytics
• 3 of 19 workloads
  – Select Query, Aggregate Query and Join Query
BigDataBench Class: application domains
Search engine related: Basic Operations + Search Engine
• 7 of 19 workloads
  – Sort, Grep, WordCount, BFS, PageRank, Index and Nutch Server
Social network related: Basic Cloud OLTP + Basic Relational Query + Social Network
• 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Olio Server, Kmeans and Connected Components
E-commerce related: Basic Cloud OLTP + Basic Relational Query + E-commerce
• 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Rubis Server, Collaborative Filtering and Naive Bayes
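The classifications above can be checked mechanically by treating the 19 workloads as a small catalog. The sketch below is only an illustration: the `Workload` record, `select`, and `by_domain` are hypothetical helpers rather than part of BigDataBench, and the field values are transcribed from the workload tables earlier in the deck.

```python
from collections import namedtuple

Workload = namedtuple("Workload", ["name", "source", "app_type", "scenario"])

OFFLINE, ONLINE, REALTIME = "Offline Analytics", "Online Services", "Realtime Analytics"

CATALOG = [
    Workload("Sort", "Text", OFFLINE, "Basic Operations"),
    Workload("Grep", "Text", OFFLINE, "Basic Operations"),
    Workload("WordCount", "Text", OFFLINE, "Basic Operations"),
    Workload("BFS", "Graph", OFFLINE, "Basic Operations"),
    Workload("Read", "Table", ONLINE, "Basic Cloud OLTP"),
    Workload("Write", "Table", ONLINE, "Basic Cloud OLTP"),
    Workload("Scan", "Table", ONLINE, "Basic Cloud OLTP"),
    Workload("Select Query", "Table", REALTIME, "Basic Relational Query"),
    Workload("Aggregate Query", "Table", REALTIME, "Basic Relational Query"),
    Workload("Join Query", "Table", REALTIME, "Basic Relational Query"),
    Workload("Nutch Server", "Table", ONLINE, "Search Engine"),
    Workload("PageRank", "Graph", OFFLINE, "Search Engine"),
    Workload("Index", "Text", OFFLINE, "Search Engine"),
    Workload("Olio Server", "Table", ONLINE, "Social Network"),
    Workload("Kmeans", "Graph", OFFLINE, "Social Network"),
    Workload("Connected Components", "Graph", OFFLINE, "Social Network"),
    Workload("Rubis Server", "Table", ONLINE, "E-commerce"),
    Workload("Collaborative Filtering", "Text", OFFLINE, "E-commerce"),
    Workload("Naive Bayes", "Text", OFFLINE, "E-commerce"),
]

# Scenario groups that make up each application domain.
DOMAIN_SCENARIOS = {
    "Search engine": {"Basic Operations", "Search Engine"},
    "Social network": {"Basic Cloud OLTP", "Basic Relational Query", "Social Network"},
    "E-commerce": {"Basic Cloud OLTP", "Basic Relational Query", "E-commerce"},
}

def select(**criteria):
    """Return names of workloads whose fields equal every given keyword value."""
    return [w.name for w in CATALOG
            if all(getattr(w, field) == value for field, value in criteria.items())]

def by_domain(domain):
    """Return names of workloads belonging to an application domain."""
    return [w.name for w in CATALOG if w.scenario in DOMAIN_SCENARIOS[domain]]

for source in ("Text", "Graph", "Table"):
    print(source, len(select(source=source)), "of", len(CATALOG))
```

Filters can be combined, for example `select(source="Graph", app_type=OFFLINE)`, to pick the subset that matches a particular experiment.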
Usage Examples
Designing experiments
• What will I do?
Choosing workloads and data sets
• Workloads are chosen according to your needs
• Data sets are chosen according to your platform scale and workload requirements
Configuring the experiments
Running the experiments and analyzing the results
One Example
Motivation
• Suppose I have a five-node Xeon cluster and want to evaluate the performance of an optimized version of Hadoop
• Native Hadoop vs. optimized Hadoop: how do we evaluate performance at different data scales?
Step 1: Designing Experiments
Test bed
• The five-node cluster is chosen as the platform
Setup
• Set up native Hadoop and optimized Hadoop
Metric
• DPS (data processed per second) = (input data size) / (wall time)
Data scale
• 1 GB to 500 GB
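The DPS metric is simple enough to compute by hand, but a small helper makes the comparison across data scales explicit. In the sketch below the function name `dps` and all wall times are hypothetical placeholders, not measured results.

```python
GB = 1024 ** 3

def dps(input_bytes, wall_seconds):
    """Data processed per second: input data size divided by wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall time must be positive")
    return input_bytes / wall_seconds

# Hypothetical wall times in seconds (NOT measurements):
# {data scale in GB: (native Hadoop, optimized Hadoop)}
wall_times = {1: (60.0, 50.0), 100: (4000.0, 3200.0), 500: (25000.0, 20000.0)}

for scale_gb, (native_s, optimized_s) in wall_times.items():
    native = dps(scale_gb * GB, native_s)
    optimized = dps(scale_gb * GB, optimized_s)
    print(f"{scale_gb:>4} GB  native {native / 2**20:8.1f} MB/s  "
          f"optimized {optimized / 2**20:8.1f} MB/s  "
          f"speedup {optimized / native:.2f}x")
```

Because both runs process the same input, the DPS ratio equals the inverse ratio of the wall times; reporting DPS rather than raw time keeps results comparable across data scales.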
Step 2-1: Choosing workloads
MapReduce-related workloads
• Sort, Grep, WordCount, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering and Naive Bayes
• 9 of the 19 workloads in BigDataBench
Step 2-2: Choosing data sets
Text and graph data
• Wikipedia Entries: Sort, Grep, WordCount, Index, Naive Bayes
• Google Web Graph: PageRank, Connected Components, Kmeans
• Amazon Movie Reviews: Collaborative Filtering
Data scale
• Varies from 1 GB to 500 GB
Generating data
• Use the data generation tool
Step 3: Experiments configurations
Hadoop configuration
• One master node, four slave nodes
• Map slots, reduce slots, Java heap, …
• http://hadoop.apache.org/
Monitoring
• perf: architecture level
• Linux /proc: OS level
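For OS-level monitoring, one common approach is to sample the aggregate `cpu` line of `/proc/stat` before and after a run and derive CPU utilization from the difference. The sketch below assumes that approach; the helper names and the two sample lines are hypothetical, and the slides do not say which `/proc` files the authors actually used.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) jiffies."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already counted in user/nice, so only the first eight are summed.
    total = sum(values[:8])
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    return total - idle, total

def cpu_utilization(before, after):
    """Fraction of CPU time spent busy between two /proc/stat samples."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    if total1 == total0:
        raise ValueError("samples are identical; take them further apart")
    return (busy1 - busy0) / (total1 - total0)

# Hypothetical samples standing in for two reads of /proc/stat.
sample_before = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_after = "cpu  200 0 100 1600 100 0 0 0 0 0"
print(f"CPU utilization: {cpu_utilization(sample_before, sample_after):.0%}")
```

On a Linux node the two strings would come from reading the first line of `/proc/stat` at the start and end of a workload run.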
Step 4: Doing the experiments
Running the workloads one by one
• Clear the runtime environment after each experiment
• Repeat each experiment multiple times
Analysis
• ……
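The run, clean up, repeat loop described above can be sketched as a small timing harness. The function name `run_repeated` and the stand-in callables are hypothetical; in a real experiment `run` might launch a Hadoop job and `cleanup` might remove its output directory, but those details are assumptions rather than part of the slides.

```python
import time

def run_repeated(run, cleanup, repeats=3):
    """Time `run` several times, calling `cleanup` after every run."""
    wall_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        wall_times.append(time.perf_counter() - start)
        cleanup()  # reset the environment so runs do not influence each other
    return wall_times

# Stand-in callables that only record the call order.
calls = []
times = run_repeated(lambda: calls.append("run"),
                     lambda: calls.append("clean"),
                     repeats=3)
mean_time = sum(times) / len(times)
print(calls)
print(f"mean wall time over {len(times)} runs: {mean_time:.6f} s")
```

Averaging over several runs, each followed by a cleanup, reduces the influence of caching and leftover state on the reported wall time.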
Visit http://prof.ict.ac.cn/BigDataBench/ for more…
Quick Tutorial
Running Naive Bayes as an example
• Generating text data
  – Analyzing the seed data
  – Generating data
• Running the workloads
Generating Text Data
Analyzing the Seed Data
Generating Big Data
Generating Data
• An example
  - $HADOOP_HOME/bin/hadoop jar TextProduce.jar test file-100G 20 75000000 5
Running programs
Training
./run-train.sh <in-dir> <out-dir>
• <in-dir>: the directory of the data used for training
• <out-dir>: the output directory for the trained model
Classification
./run-bayes.sh <in-dir> <out-dir>
• <in-dir>: the training model directory
• <out-dir>: the input data directory
An example
./run-train.sh file-1G file-1G-Model
./run-bayes.sh file-1G-Model file-100G
THANKS
Visit http://prof.ict.ac.cn/BigDataBench/ for more…
BACKUP
Metrics
User-observable metrics
• For Cloud OLTP online services: the number of processed operations per second (OPS)
• For other online services: the number of processed requests per second (RPS)
• For analytics applications: the data processed per second (DPS)