Big Data for Dummies using DataStage By Peter Bjelvert InfoSphere Architect Middlecon AB

Dec 22, 2015

Page 1

Big Data for Dummies using DataStage

By Peter Bjelvert, InfoSphere Architect

Middlecon AB

Page 2

ETL – Relational DB

Extract, Transform in DataStage, Load

Your powerful DataStage server handles all complex transformations; the database is used only for reading and writing.
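
The ETL pattern above can be sketched in a few lines of Python. This is only an illustration, not DataStage itself: plain Python stands in for the ETL engine, SQLite stands in for the relational database, and the table and column names (`bank_branch`, `branch_dim`) are hypothetical.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Extract rows, transform them outside the database, then load them."""
    # Extract: the source database only serves a plain read.
    rows = source.execute(
        "SELECT branch_city, branch_state FROM bank_branch"
    ).fetchall()

    # Transform: all clean-up and de-duplication happens in the ETL engine.
    cleaned = sorted(
        {(city.strip().title(), state.strip().upper()) for city, state in rows}
    )

    # Load: the target database only serves a plain write.
    target.execute("CREATE TABLE IF NOT EXISTS branch_dim (city TEXT, state TEXT)")
    target.executemany("INSERT INTO branch_dim VALUES (?, ?)", cleaned)
    target.commit()
    return len(cleaned)


# Hypothetical sample data with inconsistent spacing and casing.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE bank_branch (branch_city TEXT, branch_state TEXT)")
source.executemany(
    "INSERT INTO bank_branch VALUES (?, ?)",
    [(" stockholm", "se"), ("Stockholm ", "SE"), ("malmo", "se")],
)
target = sqlite3.connect(":memory:")
loaded = run_etl(source, target)
```

All the work sits between the read and the write, which is why this pattern needs a powerful ETL server.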

Page 3

ELT – Relational DB

Extract, Load with Transform

If you have powerful database servers, you can push much of the work down to the database; DataStage then mostly controls the flow.
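
A minimal sketch of the ELT idea, again with SQLite standing in for the database and hypothetical table names (`staging_branch`, `branch_dim`). The client loads raw data into a staging table and then issues a single SQL statement, so the database engine performs the transform.

```python
import sqlite3


def run_elt(db: sqlite3.Connection) -> int:
    """Transform already-loaded staging data inside the database."""
    db.execute("CREATE TABLE IF NOT EXISTS branch_dim (city TEXT, state TEXT)")
    # The whole transform is one statement run by the database engine;
    # the client only issues it and controls the flow.
    db.execute(
        """
        INSERT INTO branch_dim (city, state)
        SELECT DISTINCT TRIM(branch_city), UPPER(TRIM(branch_state))
        FROM staging_branch
        """
    )
    db.commit()
    return db.execute("SELECT COUNT(*) FROM branch_dim").fetchone()[0]


# Hypothetical raw data, loaded untouched into a staging table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging_branch (branch_city TEXT, branch_state TEXT)")
db.executemany(
    "INSERT INTO staging_branch VALUES (?, ?)",
    [(" Stockholm", "se"), ("Stockholm ", "SE"), ("Malmo", "se")],
)
loaded = run_elt(db)
```

No row ever travels to the client here, which is the point of pushing the work down.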

Page 4

Balanced Optimization

Bal. Opt. creates a second copy of the job that pushes everything into the target. It creates one big SQL statement.

Bal. Opt. creates a new copy of the job that pushes the load into source and target.

Use DataStage Balanced Optimization to select how to push the load:
- To Source
- To Target
- To Both

The DataStage job is re-written into SQL code.
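
To make the idea of rewriting a job into SQL concrete, here is a toy sketch. It is not how Balanced Optimization is implemented: the function `push_down_to_sql` and its parameters are hypothetical, and the table and column names are borrowed from the SQL shown elsewhere in this deck. It only shows that a read, a de-duplication, a filter and a write can collapse into one statement.

```python
from typing import Optional, Sequence


def push_down_to_sql(
    source_table: str,
    columns: Sequence[str],
    distinct: bool = False,
    where: Optional[str] = None,
    target_table: Optional[str] = None,
) -> str:
    """Collapse simple job steps (read, dedupe, filter, write) into one SQL string."""
    column_list = ", ".join(columns)
    select = "SELECT DISTINCT" if distinct else "SELECT"
    sql = f"{select} {column_list} FROM {source_table}"
    if where:
        # A filter step becomes a WHERE clause.
        sql += f" WHERE {where}"
    if target_table:
        # A write step becomes INSERT ... SELECT, so no rows leave the database.
        sql = f"INSERT INTO {target_table} ({column_list}) {sql}"
    return sql


statement = push_down_to_sql(
    "JK_BANK2.BANK_BRANCH",
    ["BRANCH_CITY", "BRANCH_STATE", "BRANCH_ZIP"],
    distinct=True,
)
```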

Page 5

ETL and ELT – PushDown: Balanced Optimization feature of DataStage

DB
ETL: DataStage is doing the main work.
ELT: Bal. Opt. creates a new copy of the job with SQL code:
SELECT * FROM (SELECT distinct BRANCH_CITY, BRANCH_STATE, BRANCH_ZIP FROM JK_BANK2.BANK_BRANCH) AS A, ( Select distinct BRANCH_CITY,
The DB server is doing the main job.

Page 6

Hadoop Distributed File System - HDFS

Application Layer

Workload mgmt Layer

Data Layer

One file, 3 copies
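
The "one file, 3 copies" idea can be sketched as follows. This is a simplified illustration, not the HDFS placement algorithm: the function `place_blocks` is hypothetical, the 128 MB block size is an assumption, and the round-robin choice of nodes ignores the rack awareness a real cluster uses.

```python
from typing import Dict, List


def place_blocks(
    file_size_mb: int,
    block_size_mb: int,
    nodes: List[str],
    replication: int = 3,
) -> Dict[int, List[str]]:
    """Split a file into blocks and assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    n_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Round-robin placement for illustration only.
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


nodes = [f"node{i}" for i in range(1, 7)]  # six nodes, as in the diagrams
placement = place_blocks(300, 128, nodes)
```

Because every block lives on three different nodes, losing a single node does not lose any data.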

Page 7

MapReduce example
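
The example shown on the original slide is not preserved in this transcript. As a stand-in, here is the classic word count expressed as map, shuffle and reduce steps in plain Python; the function names are illustrative and nothing here uses the Hadoop API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


lines = ["big data for dummies", "big data using DataStage"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

On a cluster, the map calls run in parallel on the nodes that hold the data, and the reduce calls run after the shuffle has brought equal keys together.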

Page 8

Hadoop application stack

Application Layer: Jaql, AQL, …

Workload mgmt Layer: MapReduce

Data Layer: HDFS

Page 9

IBM’s Hadoop implementation

Page 10

ETL – HDFS

Extract, Transform in DataStage, Load

[Diagram: two HDFS clusters, six nodes each]

Your powerful DataStage server can read and write to the distributed file system

Page 11

DataStage HDFS example

Read and write to a Hadoop system using the new Big Data File (BDFS) stage.

Page 12

ELT – Hadoop system

Extract

Use DataStage Balanced Optimization to select how to push the load:
- To Source
- To Target
- To Both

The DataStage job is rewritten into Jaql code.

Load with Transform

[Diagram: two Hadoop clusters, six nodes each]

Page 13

DataStage Jaql example

Bal. Opt. creates a second copy of the job that pushes everything into the target. It creates one big Jaql statement.

Page 14

ETL and ELT – PushDown: Balanced Optimization feature of DataStage

DB
ETL: DataStage is doing the main work.
ELT: Bal. Opt. creates a new copy of the job with SQL code:
SELECT * FROM (SELECT distinct BRANCH_CITY, BRANCH_STATE, BRANCH_ZIP FROM JK_BANK2.BANK_BRANCH) AS A, ( Select distinct BRANCH_CITY,
The DB server is doing the main job.

HDFS
ETL: DataStage is doing the main work.
ELT: Bal. Opt. creates a new copy of the job with Jaql code:
setOptions({conf:{"mapred.job.name":"DataStage BalOp job BIGDATA:dstage1 ff_read_write_to_hadoop_jaql_balopt_join CustomerTarget 16_#DSJobInvocationId#"}});
setOptions({conf:{"mapred.reduce.tasks":1}});
The Hadoop application server executes the Jaql code on all nodes.

[Diagram: BDFS cluster (six nodes) and Hadoop cluster (six nodes)]

Page 15

Extract, transform and filter in DataStage

Load good data into HDFS

[Diagram: BDFS cluster with six nodes]

DataStage can read from many different sources. Convert common data (like time/date) to a consistent format to facilitate later queries. Send unwanted data to the garbage.
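
A minimal sketch of that clean-up step in Python, standing in for what a DataStage job would do. The list of accepted date formats, the `date` field name and the function names are assumptions made for this illustration.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Assumed input formats; a real job would list whatever its sources deliver.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_good_and_bad(records: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Keep records with a parseable date; route the rest to a reject list."""
    good, rejected = [], []
    for record in records:
        iso = normalize_date(record.get("date") or "")
        if iso is None:
            rejected.append(record)  # unwanted data goes to the garbage
        else:
            good.append({**record, "date": iso})
    return good, rejected


records = [
    {"id": 1, "date": "2015-12-22"},
    {"id": 2, "date": "22/12/2015"},
    {"id": 3, "date": "20151222"},
    {"id": 4, "date": "not a date"},
]
good, rejected = split_good_and_bad(records)
```

Only the `good` list would be loaded into HDFS, so later queries can rely on a single date format.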

A good scenario for DataStage customers

Analytic functions, AQL, …

Page 16

LIVE DEMO

Page 18

Handling Big Data without angst