
CloverETL Cluster - Big Data Parallel Processing Explained

May 13, 2015


An explanation of how CloverETL Cluster processes big data in parallel on multiple nodes, including features such as load balancing, the data locality principle, and robustness.
Transcript
Page 1: CloverETL Cluster - Big Data Parallel Processing Explained

HANDLING BIG DATA

The CloverETL Cluster Architecture Explained

Page 2: CloverETL Cluster - Big Data Parallel Processing Explained

The Reality: You have a really big pile to deal with.

One traditional digger might not be enough.

[Illustration: a pile labeled “Really Big Data”]

Page 3: CloverETL Cluster - Big Data Parallel Processing Explained

You could get a really big, expensive digger...

[Illustration: a pile labeled “Really Big Data”]

Page 4: CloverETL Cluster - Big Data Parallel Processing Explained

…or several smaller ones and get the job done faster & cheaper.

[Illustration: a pile labeled “Really Big Data”]

Page 5: CloverETL Cluster - Big Data Parallel Processing Explained

But what if the big one suffers a mechanical failure?

[Illustration: a pile labeled “Really Big Data”]

Page 6: CloverETL Cluster - Big Data Parallel Processing Explained

With small diggers, failure of one does not affect the rest.

[Illustration: a pile labeled “Really Big Data”]

Page 7: CloverETL Cluster - Big Data Parallel Processing Explained

Which one do you choose?

vs

Page 8: CloverETL Cluster - Big Data Parallel Processing Explained

CloverETL Cluster resiliency features

Optimizing for robustness...


Page 9: CloverETL Cluster - Big Data Parallel Processing Explained

Fault resiliency – HW & SW

automatic fail-over

[Diagram: Before and After – the workload of the failed Node 1 automatically fails over to Node 2]

Page 10: CloverETL Cluster - Big Data Parallel Processing Explained

automatic load balancing

Load Balancing

[Diagram: Before and After – a new task is routed to the less loaded node]

Page 11: CloverETL Cluster - Big Data Parallel Processing Explained

CloverETL Cluster - BIG DATA features

Optimizing for speed...


Page 12: CloverETL Cluster - Big Data Parallel Processing Explained

Traditionally, data transformations were run on a single, big server with multiple CPUs and plenty of RAM.

And it was expensive.


Page 13: CloverETL Cluster - Big Data Parallel Processing Explained

Then the CloverETL team developed the concept of a data transformation cluster. The CloverETL Cluster was born.

It creates a powerful data transformation beast from a set of low-cost commodity hardware machines.

Page 14: CloverETL Cluster - Big Data Parallel Processing Explained

Now, one data transformation can be set to run in parallel on all available nodes of the CloverETL Cluster.


Page 15: CloverETL Cluster - Big Data Parallel Processing Explained

Each cluster node executing the transformation is automatically fed with a different portion of the input data.

[Diagram: the input data is split into Part 1, Part 2, Part 3]

Page 16: CloverETL Cluster - Big Data Parallel Processing Explained

[Diagram: Before – one big machine processes everything; Now – Part 1, Part 2 and Part 3 are processed on separate machines in parallel]

Working in parallel, the nodes finish the job faster, with fewer resources needed individually.

Page 17: CloverETL Cluster - Big Data Parallel Processing Explained

That sounds nice and simple. But how is it really done?

Page 18: CloverETL Cluster - Big Data Parallel Processing Explained

CloverETL allows certain transformation components to be assigned to multiple cluster nodes.

[Diagram: a CloverETL Cluster with Node 1, Node 2 and Node 3 – two components are each allocated to a single node and run 1x; one component is allocated to all three nodes and runs 3x]

Such components then run in multiple instances.

We call this Allocation.


Page 19: CloverETL Cluster - Big Data Parallel Processing Explained

Special components allow incoming data to be split and sent in parallel flows to multiple nodes where the processing flow continues.

[Diagram: serial data on Node 1 is partitioned into parallel flows feeding the 1st, 2nd and 3rd instances on Node 1, Node 2 and Node 3]
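Outside of CloverETL, the splitter’s behavior can be sketched as a simple round-robin partitioner. A minimal Java illustration of the idea (not CloverETL’s actual API – the class and its queue-based streams are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: one serial stream in, N parallel streams out. */
class RoundRobinPartitioner<T> {
    private final List<BlockingQueue<T>> outputs = new ArrayList<>();
    private int next = 0;

    RoundRobinPartitioner(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            outputs.add(new LinkedBlockingQueue<>());
        }
    }

    /** Each incoming record goes to the next parallel stream in turn. */
    void send(T record) throws InterruptedException {
        outputs.get(next).put(record);
        next = (next + 1) % outputs.size();
    }

    /** The i-th parallel stream, consumed by the i-th component instance. */
    BlockingQueue<T> stream(int i) {
        return outputs.get(i);
    }
}
```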

Page 20: CloverETL Cluster - Big Data Parallel Processing Explained

Other components gather data from parallel flows back into a single, serial one.

[Diagram: the 1st, 2nd and 3rd instances on Node 1, Node 2 and Node 3 send their partitioned data back into a single serial flow on Node 1]

Page 21: CloverETL Cluster - Big Data Parallel Processing Explained

The original transformation is automatically “rewritten” into several smaller ones, which are executed by cluster nodes in parallel.

Which nodes will be used is determined by Allocation.

[Diagram: serial data is partitioned to the 1st, 2nd and 3rd instances on Nodes 1–3, processed in parallel, then gathered back into serial data on Node 3]

Page 22: CloverETL Cluster - Big Data Parallel Processing Explained

Let’s take a look at an example.


Page 23: CloverETL Cluster - Big Data Parallel Processing Explained

In this example, we’ll read data about company addresses. There are 10,499,849 records in total.

We also calculate the number of companies residing in each US state.

We get a total of 51 records – one record per US state.

serial processing


Page 24: CloverETL Cluster - Big Data Parallel Processing Explained

Here, we’re processing the same input data, but in parallel now.

We get a total of 51 records again.

[Diagram: Split → work in 3 parallel streams, each stream getting a portion of the input data → partial results → Gather]
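To make the split/work/gather flow concrete outside of CloverETL, here is a minimal, self-contained Java sketch of the same pattern: round-robin the records into three portions, count companies per state in each portion in parallel, then gather and merge the partial counts. The record values are hypothetical stand-ins for the real 10.5 million rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelStateCount {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: the US state of each company record.
        List<String> states = List.of("NY", "CA", "NY", "TX", "CA", "NY");
        int parallelism = 3;

        // Split: round-robin the records into 3 portions.
        List<List<String>> portions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) portions.add(new ArrayList<>());
        for (int i = 0; i < states.size(); i++) portions.get(i % parallelism).add(states.get(i));

        // Work in 3 parallel streams: each aggregates its own portion (partial results).
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        List<Future<Map<String, Long>>> partials = new ArrayList<>();
        for (List<String> portion : portions) {
            partials.add(pool.submit(() ->
                portion.stream().collect(Collectors.groupingBy(s -> s, Collectors.counting()))));
        }

        // Gather: merge the partial counts into the final per-state result.
        Map<String, Long> total = new HashMap<>();
        for (Future<Map<String, Long>> f : partials) {
            f.get().forEach((state, n) -> total.merge(state, n, Long::sum));
        }
        pool.shutdown();
        System.out.println(total); // e.g. {TX=1, NY=3, CA=2}
    }
}
```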

Page 25: CloverETL Cluster - Big Data Parallel Processing Explained

Go parallel in 1 minute.

[Diagram: converting the serial graph into the parallel one with two drag & drop steps]

Page 26: CloverETL Cluster - Big Data Parallel Processing Explained

What’s the Trick?

Split the input data into parallel streams.

Do the heavy lifting on smaller data portions in parallel.

Bring the individual pieces of results together at the end.

☜DONE


Page 27: CloverETL Cluster - Big Data Parallel Processing Explained

Let’s continue.

More on allocation and partitioned sandboxes


Page 28: CloverETL Cluster - Big Data Parallel Processing Explained

A Sandbox

We assume you are familiar with the CloverETL Server’s concept of a SANDBOX.

SANDBOX is a logical name for a file directory structure managed by the Server. It allows individual projects on the Server to be separated into logical units. Each CloverETL data transformation can access multiple sandboxes either locally or remotely.

Let’s look at a special type of sandbox – the partitioned sandbox.

Page 29: CloverETL Cluster - Big Data Parallel Processing Explained

In a partitioned sandbox, the input file is split into subfiles, each residing on a different node of the Cluster in a similarly structured folder. The sandbox presents the “originals” – the combined data.

[Diagram: partitioned sandbox “SboxP” – Part 1, Part 2 and Part 3 reside on Node 1, Node 2 and Node 3]

Page 30: CloverETL Cluster - Big Data Parallel Processing Explained

Partitioned Sandboxes

A partitioned sandbox is a logical abstraction on top of similarly structured folders on different Cluster nodes.

[Screenshot: the sandbox’s logical structure – a unified view of folders & files]

[Screenshot: the sandbox’s physical structure – listing the locations/nodes of the files’ portions]
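Conceptually, a partitioned sandbox maps a single logical path onto similarly structured physical folders, one per node. A hypothetical Java sketch of that mapping (the class and path layout are illustrative only, not CloverETL’s API), reusing the “SboxP” name from the earlier slide:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: one logical sandbox path, N physical locations. */
class PartitionedSandbox {
    private final String name;
    private final List<String> nodes;

    PartitionedSandbox(String name, List<String> nodes) {
        this.name = name;
        this.nodes = nodes;
    }

    /** The unified logical view: a single path inside the sandbox. */
    String logicalPath(String relative) {
        return name + "/" + relative;
    }

    /** The physical view: a similarly structured folder on each cluster node. */
    List<String> physicalPaths(String relative) {
        return nodes.stream()
                    .map(node -> node + ":/sandboxes/" + name + "/" + relative)
                    .collect(Collectors.toList());
    }
}

// new PartitionedSandbox("SboxP", List.of("node1", "node2", "node3"))
//     .physicalPaths("data/input.txt")
// -> [node1:/sandboxes/SboxP/data/input.txt, node2:..., node3:...]
```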

Page 31: CloverETL Cluster - Big Data Parallel Processing Explained

Partitioned Sandbox – defines how data is partitioned across nodes of the CloverETL Cluster.

Allocation – defines how a transformation’s run is distributed across nodes of the CloverETL Cluster.

The allocation can be set to derive from the sandbox layout.

Data processing happens where data resides: we tell the cluster to run our transformation components on nodes that also contain the portions of data we want to process.

Page 32: CloverETL Cluster - Big Data Parallel Processing Explained

Allocation Determined By a Partitioned Sandbox:

4 partitions ⇒ 4 parallel transformations.

There’s no gathering at the end – partitioned results are stored directly to the partitioned sandbox. Allocation for the aggregator is derived from the sandbox being used.

Page 33: CloverETL Cluster - Big Data Parallel Processing Explained

Allocation Determined By an Explicit Number:

8 parallel transformations.

Partitioning at the beginning and gathering at the end is necessary, as we need to cross the serial ⇿ parallel boundary twice.

Page 34: CloverETL Cluster - Big Data Parallel Processing Explained

Data Skew

This is called data skew.

Data is not uniformly distributed across the partitions, which indicates that the chosen partitioning key is not optimal for maximum performance.

However, the chosen key allows us to perform a single-pass aggregation (no semi-results), so it’s a good tradeoff.

The busiest worker has to process 2.5 million rows whereas the least busy one processes only 0.67 million – roughly 3.7x less.

Page 35: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

When processing data in parallel, a few things should be considered.

Aggregating, Sorting, Joining…

Working in parallel means producing partial (“semi-”) results.

First, we produce 4 aggregated semi-results. Then we aggregate the semi-results to get the final result.

[Diagram: record streams 1–4 each produce semi-results 1–4]

These partial results have to be further processed to get the final result.

[Diagram: semi-results 1, 2, 3, 4 ➔ final result]

The good news: when increasing or changing the number of parallel streams, we don’t have to change the transformation.

Page 36: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Full transformation – parallel aggregation & post-processing of semi-results.

[Diagram callouts: count() here in step 1; sum() here in step 2]

Why?

Example: counting occurrences of companies per state in parallel using count().

In step 1, we produce partial results. Because records are partitioned round-robin, data for one state may appear in multiple parallel streams. For example, we might get data for NY as 4 partial results in 4 different streams.

In step 2, we merge all the partial results from the 4 parallel streams into a single sequence and then aggregate again to get the final numbers. At this step the aggregation function is sum() – we sum the partial counts.
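In plain Java, the two-step pattern looks like this. The sketch (with hypothetical partial counts) also shows why step 2 must sum() the partial results: applying count() again would count the streams, not the companies.

```java
import java.util.List;
import java.util.Map;

public class TwoStepAggregation {
    public static void main(String[] args) {
        // Step 1 produced these hypothetical partial counts for NY,
        // one per parallel stream (round-robin spreads NY across all 4).
        List<Map<String, Long>> partials = List.of(
            Map.of("NY", 250_000L), Map.of("NY", 248_500L),
            Map.of("NY", 251_200L), Map.of("NY", 249_300L));

        // Step 2: sum() the partial counts per state.
        long nyTotal = partials.stream()
            .mapToLong(m -> m.getOrDefault("NY", 0L))
            .sum();                       // 999_000 – the (hypothetical) total for NY

        // count()-ing again would just count the streams, not the companies:
        long wrong = partials.stream()
            .filter(m -> m.containsKey("NY"))
            .count();                     // 4 – wrong!

        System.out.println(nyTotal + " vs " + wrong);
    }
}
```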

Page 37: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Parallel sorting

[Diagram callouts: sort here, within each parallel stream; merge here, when gathering]

Why?

Sorting in parallel ➔ records are sorted within individual parallel streams, but not across all streams.

Bringing the parallel sorted streams together into a serial stream ➔ records have to be merged by the same key used in the parallel sort ➔ producing an overall sorted serial result.
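A plain-Java sketch of the same idea: three streams, each already sorted on the key, are combined by a k-way merge on that same key to produce the overall sorted result. The keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedStreamMerge {
    public static void main(String[] args) {
        // Each parallel stream has already sorted its own records (hypothetical keys).
        List<List<String>> sortedStreams = List.of(
            List.of("AK", "DE", "NY"),
            List.of("AZ", "IL", "OR"),
            List.of("MD", "PA", "VA"));

        // K-way merge on the same key used for sorting: repeatedly take the
        // smallest head element across all streams. h = {streamIndex, position}.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            Comparator.comparing((int[] h) -> sortedStreams.get(h[0]).get(h[1])));
        for (int i = 0; i < sortedStreams.size(); i++) heads.add(new int[]{i, 0});

        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] h = heads.poll();
            merged.add(sortedStreams.get(h[0]).get(h[1]));
            if (h[1] + 1 < sortedStreams.get(h[0]).size())
                heads.add(new int[]{h[0], h[1] + 1});
        }
        System.out.println(merged); // [AK, AZ, DE, IL, MD, NY, OR, PA, VA]
    }
}
```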

Page 38: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Parallel joining

Why?

Joining in parallel ➔ master & slave records must be partitioned by the same key/field, and that same key must be used for joining the records.

Otherwise, there is a danger that master & slave records sharing a key will not join, because they end up in different parallel streams – a joiner joins only within one stream, never across streams.
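The rule can be sketched in plain Java: hash-partition both sides by the join key, so records with the same key always land in the same parallel stream. The keys below are hypothetical.

```java
import java.util.List;

public class JoinKeyPartitioning {
    /** Same key ⇒ same stream, applied to both master and slave records. */
    static int streamFor(String joinKey, int parallelism) {
        return Math.floorMod(joinKey.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 3;
        List<String> masterKeys = List.of("NY", "DE", "OR", "MD");
        List<String> slaveKeys  = List.of("MD", "NY", "OR", "DE");

        // Matching master and slave records end up in the same stream,
        // so a per-stream joiner sees both sides of every pair.
        for (String k : masterKeys)
            System.out.println("master " + k + " -> stream " + streamFor(k, parallelism));
        for (String k : slaveKeys)
            System.out.println("slave  " + k + " -> stream " + streamFor(k, parallelism));
        // With round-robin partitioning instead, "NY" on the master side and
        // "NY" on the slave side could land in different streams and never join.
    }
}
```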

Page 39: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Example: parallel joining – 3 parallel streams – partitioning by state

Slave records (partitioned by state):
stream 1 ⥤ [AL AK AZ AR CA CO CT DC DE FL]
stream 2 ⥤ [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND]
stream 3 ⥤ [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]

Master records (partitioned by state):
stream 1 ⥤ [AK AZ DE]
stream 2 ⥤ [IL MD NY]
stream 3 ⥤ [OR PA VA]

Result (all master records joined):
stream 1 ⥤ [AK AZ DE]
stream 2 ⥤ [IL MD NY]
stream 3 ⥤ [OR PA VA]

Page 40: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Example: parallel joining – 3 parallel streams – partitioning round robin

Slave records (partitioned round robin):
stream 1 ⥤ [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY]
stream 2 ⥤ [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV]
stream 3 ⥤ [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]

Master records (partitioned round robin):
stream 1 ⥤ [AK IL OR]
stream 2 ⥤ [AZ MD VA]
stream 3 ⥤ [DE NY PA]

Result (only some master records joined – only DE and NY happened to land in the same stream as their slave records):
stream 1 ⥤ []
stream 2 ⥤ []
stream 3 ⥤ [DE NY]

Page 41: CloverETL Cluster - Big Data Parallel Processing Explained

Bringing it all together…

Going parallel is easy! Try it out for yourself.

☞ BIG DATA problems are handled through the Cluster’s scalability

☞ Existing transformations can be easily converted to parallel

☞ There’s no magic – users have full control over what’s happening

☞ CloverETL Cluster has built-in fault resiliency and load balancing

Page 42: CloverETL Cluster - Big Data Parallel Processing Explained

If you have any questions, check out:

www.cloveretl.com
forum.cloveretl.com
blog.cloveretl.com