Data Integration with Embulk DATA SCIENCE WEEKEND 2016, YOGYAKARTA TEGUH NUGRAHA
Multi Data Formats and Storages
MySQL
PostgreSQL
MongoDB
CSV files
BigQuery
Redshift
HDFS
Google Analytics
Mixpanel
Zendesk
Elasticsearch
Multi Data Sources
User data in MySQL
Offline data in CSV
Traffic data in Google Analytics
Log data
Bulk Data Loading: Load data from A to B
Problems
Parsing files
Error handling
Idempotent Retrying
Performance
Scalability
Format compatibility
Solution
Reliable framework with parallel execution, data validation, error recovery, auto guessing, resuming and extensive plugins
github.com/embulk/embulk
Embulk: Bulk Data Loader
Plugins by Category
•Input plugins
•Output plugins
•Filter plugins
•File parser plugins
•File decoder plugins
•File formatter plugins
•File encoder plugins
•Executor plugins
Getting Started
1. Embulk requires Java
2. Download embulk: http://dl.embulk.org/embulk-latest.jar
3. Make it executable, then check the version:
$ embulk --version
4. Run an example:
$ embulk example
Installing Embulk Plugin
$ embulk gem install embulk-input-mysql
$ embulk gem install embulk-output-postgresql
List of plugins:
https://embulk.org/plugins
Embulk Configuration File
Embulk Configuration File (YAML)
in: Input plugin options.
◦ parser: If the input is file-based, the parser plugin parses the file format (built-in csv, json, etc.).
◦ decoder: If the input is file-based, the decoder plugin decodes compression or encryption (built-in gzip, bzip2, zip, tar.gz, etc.).
out: Output plugin options.
◦ formatter: If the output is file-based, the formatter plugin writes the file format (such as built-in csv, JSON).
◦ encoder: If the output is file-based, the encoder plugin applies compression or encryption (such as built-in gzip or bzip2).
filters: Filter plugin options (optional).
exec: Executor plugin options. An executor plugin controls parallel processing (such as the built-in thread executor or the Hadoop MapReduce executor).
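As a rough illustration of how these sections fit together, here is a minimal config sketch. The path prefix and column names are hypothetical, and the optional filters section is left out. Note that in an actual config file the decoder entry is written as a `decoders` list.

```yaml
# config.yml: minimal sketch, all paths and column names are hypothetical
in:
  type: file                      # built-in file input
  path_prefix: ./data/users_      # hypothetical prefix of the files to load
  decoders:                       # decoders are given as a list
  - {type: gzip}
  parser:
    type: csv
    charset: UTF-8
    delimiter: ','
    skip_header_lines: 1
    columns:
    - {name: id, type: long}
    - {name: name, type: string}
    - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
exec:
  max_threads: 4                  # upper bound on tasks run in parallel
out:
  type: stdout                    # print records instead of loading them
```

Using stdout as the output keeps the sketch safe to try before pointing it at a real destination.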
Using Guess Command
The guess command guesses parser and decoder options
$ embulk guess seed.yml -o config.yml
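A seed file only needs the parts that cannot be guessed. The sketch below assumes a file input with a hypothetical path prefix; guess then fills in the parser and decoder settings and writes the result to config.yml.

```yaml
# seed.yml: minimal sketch, path_prefix is a hypothetical location
in:
  type: file
  path_prefix: ./data/users_
out:
  type: stdout
```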
Previewing and Running
$ embulk preview config.yml
$ embulk run config.yml
Set up a cron schedule
embulk-input-mysql
https://github.com/embulk/embulk-input-jdbc/tree/master/embulk-input-mysql
$ embulk gem install embulk-input-mysql
embulk-output-postgresql
https://github.com/embulk/embulk-output-jdbc/tree/master/embulk-output-postgresql
$ embulk gem install embulk-output-postgresql
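With both plugins installed, a MySQL to PostgreSQL load can be described in one config file. The following is a sketch only: host names, credentials, database names and table names are hypothetical placeholders.

```yaml
# config.yml: sketch of a MySQL -> PostgreSQL load, all values are hypothetical
in:
  type: mysql
  host: localhost
  port: 3306
  user: embulk_user
  password: secret
  database: app_db
  table: users
  select: "id, name, created_at"
out:
  type: postgresql
  host: localhost
  port: 5432
  user: embulk_user
  password: secret
  database: warehouse
  table: users
  mode: replace                   # rebuild the target table on each run
```

The replace mode is chosen here because rerunning the job after a failure does not duplicate rows; the other modes offered by the plugin (such as insert or merge) may suit incremental loads better.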
Using Variables
Configuration file name must end with .yml.liquid
Environment variables are set to env variable
Include file
File will be searched from the relative path of the input configuration file and file name will be _<name>.yml.liquid
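A sketch of both features in one file follows. The environment variable names (MYSQL_HOST and so on) and the included file name are hypothetical. The include line assumes a file named _out_postgresql.yml.liquid sits next to this config and holds the out: section.

```yaml
# config.yml.liquid: sketch, variable and file names are hypothetical
in:
  type: mysql
  host: {{ env.MYSQL_HOST }}            # read from the MYSQL_HOST environment variable
  user: {{ env.MYSQL_USER }}
  password: "{{ env.MYSQL_PASSWORD }}"  # quoted so special characters stay a string
  database: app_db
  table: users
# pulls in _out_postgresql.yml.liquid from the same directory
{% include 'out_postgresql' %}
```

Keeping credentials in environment variables means the same config file can be shared across environments without editing it.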
Thank You
TEGUH NUGRAHA
DATA SCIENCE LEAD, [email protected]
WWW.SLIDESHARE.NET/TEGUHN
References
https://embulk.org
https://github.com/embulk/embulk
http://www.slideshare.net/frsyuki/fighting-against-chaotically-separated-values-with-embulk