
Using Embulk at Treasure Data

Jan 09, 2017

Transcript
Page 1: Using Embulk at Treasure Data

Muga Nishizawa (西澤 無我)

Using Embulk at Treasure Data

Page 2: Using Embulk at Treasure Data

Today’s talk

> What’s Embulk?
> Why do our customers use Embulk?
  > Embulk
  > Data Connector
> Data Connector
  > The architecture
  > The use case
> With MapReduce Executor
  > How do we configure MapReduce Executor?

Page 3: Using Embulk at Treasure Data

What’s Embulk?

> An open-source parallel bulk data loader
  > loads records from “A” to “B”
> using plugins
  > for various kinds of “A” and “B” (storage, RDBMS, NoSQL, cloud services, etc.)
> to make data integration easy
  > which was very painful… (broken records, transactions (idempotency), performance, …)
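In config terms, “A” and “B” are the `in` and `out` sections of a YAML file. A minimal sketch, assuming a local CSV input and the built-in stdout output plugin (the paths and columns are illustrative):

```yaml
in:
  type: file
  path_prefix: ./mydata/csv/sample_
  parser:
    type: csv
    columns:
      - {name: id, type: long}
      - {name: account, type: string}
      - {name: time, type: timestamp, format: "%Y-%m-%d %H:%M:%S"}
out:
  type: stdout
```

Running `embulk run config.yml` then bulk-loads every matching file through the parser into the output plugin.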

Page 4: Using Embulk at Treasure Data

[Diagram: Embulk bulk-loads between systems such as HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis, via input and output plugins]

✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Resuming

Page 5: Using Embulk at Treasure Data

Why do our customers use Embulk?

> Upload various types of their data to TD with Embulk
  > Various file formats: CSV, TSV, JSON, XML, …
  > Various data sources: local disk, RDBMS, SFTP, …
  > Various network environments
> embulk-output-td
  > https://github.com/treasure-data/embulk-output-td
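With embulk-output-td, the `out` section points at TD instead of a local target. A hedged sketch (the option names follow the plugin’s README; the API key, database, and table names are placeholders):

```yaml
out:
  type: td
  apikey: YOUR_TD_API_KEY
  endpoint: api.treasuredata.com
  database: my_db
  table: my_table
```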

Page 6: Using Embulk at Treasure Data

Out of scope for Embulk

> They develop scripts for
  > generating Embulk configs
  > changing schema on a regular basis
  > logic to select some files but not others
  > managing cron settings
    > e.g. some users want to upload yesterday’s data as a daily batch
> Embulk is just a “bulk loader”
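The “yesterday’s data” case, for example, can be scripted around Embulk using its Liquid templating (a config saved as config.yml.liquid); the TARGET_DATE variable and the paths here are illustrative assumptions:

```yaml
in:
  type: file
  # the wrapper script decides which day to load
  path_prefix: /data/logs/{{ env.TARGET_DATE }}/access_
  parser:
    type: csv
out:
  type: stdout
```

A cron entry can then export the date and invoke Embulk, e.g. `TARGET_DATE=$(date -d yesterday +%Y-%m-%d) embulk run config.yml.liquid` (GNU date syntax).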

Page 7: Using Embulk at Treasure Data

Best practice to manage Embulk!!

http://www.slideshare.net/GONNakaTaka/embulk5

Page 8: Using Embulk at Treasure Data

Yes, yes,..


Page 9: Using Embulk at Treasure Data

Data Connector

[Architecture diagram: Users/Customers submit connector jobs and see loaded data on the Console; jobs flow through the Guess/Preview API and a Connector Worker into PlazmaDB]

Page 10: Using Embulk at Treasure Data

Data Connector

[Same architecture diagram as Page 9]

Page 11: Using Embulk at Treasure Data

2 types of hosted Embulk service

> Import (Data Connector): MySQL, PostgreSQL, Redshift, AWS S3, Google Cloud Storage, SalesForce, Marketo, …etc.
> Export (Result Output): MySQL, PostgreSQL, Redshift, BigQuery, …etc.

Page 12: Using Embulk at Treasure Data

Guess/Preview API

[Architecture diagram from Page 9, highlighting the Guess/Preview API]

Page 13: Using Embulk at Treasure Data

Guess/Preview API

> Guesses Embulk config based on sample data
  > Creates parser config
    > Adds schema, escape char, quote char, etc.
  > Creates rename filter config
    > TD requires uncapitalized column names
> Previews data before uploading
> Ensures quick response
  > Embulk performs this functionality running on our web application servers
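This mirrors Embulk’s own guess workflow: a seed config with only the input location is expanded into a full config. A sketch (the file names are illustrative):

```yaml
# seed.yml: only the input location is given; the parser is guessed
in:
  type: file
  path_prefix: ./sample_data/access_log
out:
  type: stdout
```

`embulk guess seed.yml -o config.yml` fills in columns, delimiter, quote and escape characters from sample rows, and `embulk preview config.yml` shows the parsed records before any load.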

Page 14: Using Embulk at Treasure Data

Connector Worker

[Architecture diagram from Page 9, highlighting the Connector Worker]

Page 15: Using Embulk at Treasure Data

Connector Worker

> Generates Embulk config and executes Embulk
  > Uses a private output plugin instead of embulk-output-td to upload users’ data to PlazmaDB directly
> Appropriate retry mechanism
> Embulk runs on our Job Queue clients

Page 16: Using Embulk at Treasure Data

Timestamp parsing

[Architecture diagram from Page 9; timestamp parsing happens in the Connector Worker]

Page 17: Using Embulk at Treasure Data

Timestamp parsing

> Implemented strptime in Java
  > Ported from the CRuby implementation
  > Can precompile the format
  > Faster than JRuby’s strptime
  > Has been maintained in the Embulk repo obscurely…
> It will be merged into JRuby
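That parser runs behind the `format` option of timestamp columns; a hedged fragment (the column name and timezone are illustrative):

```yaml
in:
  parser:
    type: csv
    default_timezone: "UTC"
    columns:
      - {name: time, type: timestamp, format: "%Y-%m-%d %H:%M:%S"}
```

The format string is compiled once and reused for every record, which is where the Java port earns its speedup.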

Page 18: Using Embulk at Treasure Data

How we use Data Connector at TD

> a. Monitoring our S3 bucket access
  > e.g. “Which IAM users accessed our S3 buckets?”, “Access frequency”
  > {in: {type: s3}} and {parser: {type: csv}}
> b. Measuring KPIs for the development process
  > e.g. “phases that we took a long time on in the process”
  > {in: {type: jira}}
> c. Measuring Business & Support performance
  > {in: {type: Salesforce, Marketo, ZenDesk, …}}

Page 19: Using Embulk at Treasure Data

Scaling Embulk

> Requests for massive data loading from users
  > e.g. “Upload 150GB of data in an hourly batch”, “Start a PoC and upload 500GB of data today”
> Local Executor cannot handle this scale
> MapReduce Executor enables us to scale

Page 20: Using Embulk at Treasure Data

W/ MapReduce

[Architecture diagram from Page 9, with Hadoop clusters attached to the Connector Worker]

Page 21: Using Embulk at Treasure Data

What’s MapReduce Executor?

[Diagram: Embulk tasks are pulled from a task queue and run as map tasks on Hadoop]

Page 22: Using Embulk at Treasure Data

MapReduce Executor with TimestampPartitioning

[Diagram: map tasks partition records by timestamp, shuffle them, and reduce tasks write each time partition]

Page 23: Using Embulk at Treasure Data

Built Embulk configs

exec:
  type: mapreduce
  job_name: embulk.100000
  config_files:
    - /etc/hadoop/conf/core-site.xml
    - /etc/hadoop/conf/hdfs-site.xml
    - /etc/hadoop/conf/mapred-site.xml
  config:
    fs.defaultFS: "hdfs://my-hdfs.example.net:8020"
    yarn.resourcemanager.hostname: "my-yarn.example.net"
    dfs.replication: 1
    mapreduce.client.submit.file.replication: 1
  state_path: /mnt/xxx/embulk/
  partitioning:
    type: timestamp
    unit: hour
    column: time
    unix_timestamp_unit: hour
    map_side_partition_split: 3
  reducers: 3
in: ...

Connector Workers (single-machine workers) are still able to generate the config.

Page 24: Using Embulk at Treasure Data

Different sized files

[Diagram: input files of different sizes yield unevenly sized map tasks]

Page 25: Using Embulk at Treasure Data

Same time range data

[Diagram: records in the same time range are shuffled to the same reducer]

Page 26: Using Embulk at Treasure Data

Grouping input files - {in: {min_task_size}}

[Diagram: many small input files are grouped into fewer, larger map tasks]

It can also reduce the mappers’ launch cost.
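A hedged fragment for the grouping knob named in the slide title (the byte value is an illustrative assumption):

```yaml
in:
  type: s3
  # group small files so each task covers at least this many bytes
  min_task_size: 134217728  # 128 MiB
```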

Page 27: Using Embulk at Treasure Data

One partition into multi-reducers - {exec: {partitioning: {map_side_split}}}

[Diagram: one hot time partition is split on the map side and handled by multiple reducers]
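And the corresponding executor-side knob, as named in the slide title (note the sample config on Page 23 spells it map_side_partition_split; the spelling may vary across versions):

```yaml
exec:
  type: mapreduce
  partitioning:
    type: timestamp
    # split one time partition across several reducers
    map_side_split: 3
```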

Page 28: Using Embulk at Treasure Data

Prototype of Console integration

Page 29: Using Embulk at Treasure Data


Page 30: Using Embulk at Treasure Data


Page 31: Using Embulk at Treasure Data

Conclusion

> What’s Embulk?
> Why do we use Embulk?
  > Embulk
  > Data Connector
> Data Connector
  > The architecture of Data Connector
  > The use case
> With MapReduce Executor