
CloverETL Cluster - Big Data Parallel Processing Explained

May 13, 2015


An explanation of how CloverETL Cluster processes big data in parallel on multiple nodes, including features such as load balancing, the data locality principle, and robustness.
Transcript
Page 1: CloverETL Cluster - Big Data Parallel Processing Explained

HANDLING BIG DATA

The CloverETL Cluster Architecture Explained

Page 2: CloverETL Cluster - Big Data Parallel Processing Explained

The Reality: You have a really big pile to deal with.

One traditional digger might not be enough.

[Illustration: a pile labeled “Really Big Data”]

Page 3: CloverETL Cluster - Big Data Parallel Processing Explained

You could get a really big, expensive digger...

[Illustration: a pile labeled “Really Big Data”]

Page 4: CloverETL Cluster - Big Data Parallel Processing Explained

…or several smaller ones and get the job done faster & cheaper.

[Illustration: a pile labeled “Really Big Data”]

Page 5: CloverETL Cluster - Big Data Parallel Processing Explained

But what if the big one suffers a mechanical failure?

[Illustration: a pile labeled “Really Big Data”]

Page 6: CloverETL Cluster - Big Data Parallel Processing Explained

With small diggers, failure of one does not affect the rest.

[Illustration: a pile labeled “Really Big Data”]

Page 7: CloverETL Cluster - Big Data Parallel Processing Explained

Which one do you choose?

vs

Page 8: CloverETL Cluster - Big Data Parallel Processing Explained

CloverETL Cluster resiliency features

Optimizing for robustness...


Page 9: CloverETL Cluster - Big Data Parallel Processing Explained

Fault resiliency – HW & SW

automatic fail-over

[Diagram: Before and After – the workload of the failed Node 1 automatically fails over to Node 2]

Page 10: CloverETL Cluster - Big Data Parallel Processing Explained

automatic load balancing

Load Balancing

[Diagram: Before and After – a new task is routed to the less loaded node]

Page 11: CloverETL Cluster - Big Data Parallel Processing Explained

CloverETL Cluster - BIG DATA features

Optimizing for speed...


Page 12: CloverETL Cluster - Big Data Parallel Processing Explained

Traditionally, data transformations were run on a single, big server with multiple CPUs and plenty of RAM.

And it was expensive.


Page 13: CloverETL Cluster - Big Data Parallel Processing Explained

Then the CloverETL team developed the concept of a data transformation cluster. The CloverETL Cluster was born.

It creates a powerful data transformation beast from a set of low-cost commodity hardware machines.

Page 14: CloverETL Cluster - Big Data Parallel Processing Explained

Now, one data transformation can be set to run in parallel on all available nodes of the CloverETL Cluster.


Page 15: CloverETL Cluster - Big Data Parallel Processing Explained

Each cluster node executing the transformation is automatically fed with a different portion of the input data.

[Diagram: the input data is split into Part 1, Part 2, Part 3]

Page 16: CloverETL Cluster - Big Data Parallel Processing Explained

[Diagram: Before – one big machine processes everything; Now – Part 1, Part 2 and Part 3 are processed on separate machines in parallel]

Working in parallel, the nodes finish the job faster, with fewer resources needed individually.

Page 17: CloverETL Cluster - Big Data Parallel Processing Explained

That sounds nice and simple. But how is it really done?

Page 18: CloverETL Cluster - Big Data Parallel Processing Explained

CloverETL allows certain transformation components to be assigned to multiple cluster nodes.

[Diagram: a CloverETL Cluster with Node 1, Node 2 and Node 3 – two components are each allocated to a single node and run 1x; one component is allocated to all three nodes and runs 3x]

Such components then run in multiple instances.

We call this Allocation.


Page 19: CloverETL Cluster - Big Data Parallel Processing Explained

Special components allow incoming data to be split and sent in parallel flows to multiple nodes where the processing flow continues.

[Diagram: serial data on Node 1 is partitioned into parallel flows feeding the 1st, 2nd and 3rd instances on Node 1, Node 2 and Node 3]
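Outside of CloverETL, the splitter’s behavior can be sketched as a simple round-robin partitioner. A minimal Java illustration of the idea (not CloverETL’s actual API – the class and its queue-based streams are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: one serial stream in, N parallel streams out. */
class RoundRobinPartitioner<T> {
    private final List<BlockingQueue<T>> outputs = new ArrayList<>();
    private int next = 0;

    RoundRobinPartitioner(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            outputs.add(new LinkedBlockingQueue<>());
        }
    }

    /** Each incoming record goes to the next parallel stream in turn. */
    void send(T record) throws InterruptedException {
        outputs.get(next).put(record);
        next = (next + 1) % outputs.size();
    }

    /** The i-th parallel stream, consumed by the i-th component instance. */
    BlockingQueue<T> stream(int i) {
        return outputs.get(i);
    }
}
```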

Page 20: CloverETL Cluster - Big Data Parallel Processing Explained

Other components gather data from parallel flows back into a single, serial one.

[Diagram: the 1st, 2nd and 3rd instances on Node 1, Node 2 and Node 3 send their partitioned data back into a single serial flow on Node 1]

Page 21: CloverETL Cluster - Big Data Parallel Processing Explained

The original transformation is automatically “rewritten” into several smaller ones, which are executed by cluster nodes in parallel.

Which nodes will be used is determined by Allocation.

[Diagram: serial data is partitioned to the 1st, 2nd and 3rd instances on Nodes 1–3, processed in parallel, then gathered back into serial data on Node 3]

Page 22: CloverETL Cluster - Big Data Parallel Processing Explained

Let’s take a look at an example.


Page 23: CloverETL Cluster - Big Data Parallel Processing Explained

In this example, we’ll read data about company addresses. There are 10,499,849 records in total.

We also calculate the number of companies residing in each US state.

We get a total of 51 records – one record per US state.

serial processing


Page 24: CloverETL Cluster - Big Data Parallel Processing Explained

Here, we’re processing the same input data, but in parallel now.

We get a total of 51 records again.

[Diagram: Split → work in 3 parallel streams, each stream getting a portion of the input data → partial results → Gather]
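To make the split/work/gather flow concrete outside of CloverETL, here is a minimal, self-contained Java sketch of the same pattern: round-robin the records into three portions, count companies per state in each portion in parallel, then gather and merge the partial counts. The record values are hypothetical stand-ins for the real 10.5 million rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelStateCount {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: the US state of each company record.
        List<String> states = List.of("NY", "CA", "NY", "TX", "CA", "NY");
        int parallelism = 3;

        // Split: round-robin the records into 3 portions.
        List<List<String>> portions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) portions.add(new ArrayList<>());
        for (int i = 0; i < states.size(); i++) portions.get(i % parallelism).add(states.get(i));

        // Work in 3 parallel streams: each aggregates its own portion (partial results).
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        List<Future<Map<String, Long>>> partials = new ArrayList<>();
        for (List<String> portion : portions) {
            partials.add(pool.submit(() ->
                portion.stream().collect(Collectors.groupingBy(s -> s, Collectors.counting()))));
        }

        // Gather: merge the partial counts into the final per-state result.
        Map<String, Long> total = new HashMap<>();
        for (Future<Map<String, Long>> f : partials) {
            f.get().forEach((state, n) -> total.merge(state, n, Long::sum));
        }
        pool.shutdown();
        System.out.println(total); // e.g. {TX=1, NY=3, CA=2}
    }
}
```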

Page 25: CloverETL Cluster - Big Data Parallel Processing Explained

Go parallel in 1 minute.

[Diagram: converting the serial graph into the parallel one with two drag & drop steps]

Page 26: CloverETL Cluster - Big Data Parallel Processing Explained

What’s the Trick?

Split the input data into parallel streams.

Do the heavy lifting on smaller data portions in parallel.

Bring the individual pieces of results together at the end.

☜DONE


Page 27: CloverETL Cluster - Big Data Parallel Processing Explained

Let’s continue.

More on allocation and partitioned sandboxes


Page 28: CloverETL Cluster - Big Data Parallel Processing Explained

A Sandbox

We assume you are familiar with the CloverETL Server’s concept of a SANDBOX.

SANDBOX is a logical name for a file directory structure managed by the Server. It allows individual projects on the Server to be separated into logical units. Each CloverETL data transformation can access multiple sandboxes either locally or remotely.

Let’s look at a special type of sandbox – the partitioned sandbox.

Page 29: CloverETL Cluster - Big Data Parallel Processing Explained

In a partitioned sandbox, the input file is split into subfiles, each residing on a different node of the Cluster in a similarly structured folder. The sandbox presents the “originals” – the combined data.

[Diagram: partitioned sandbox “SboxP” – Part 1, Part 2 and Part 3 reside on Node 1, Node 2 and Node 3]

Page 30: CloverETL Cluster - Big Data Parallel Processing Explained

Partitioned Sandboxes

A partitioned sandbox is a logical abstraction on top of similarly structured folders on different Cluster nodes.

[Screenshot: the sandbox’s logical structure – a unified view of folders & files]

[Screenshot: the sandbox’s physical structure – listing the locations/nodes of the files’ portions]
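Conceptually, a partitioned sandbox maps a single logical path onto similarly structured physical folders, one per node. A hypothetical Java sketch of that mapping (the class and path layout are illustrative only, not CloverETL’s API), reusing the “SboxP” name from the earlier slide:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: one logical sandbox path, N physical locations. */
class PartitionedSandbox {
    private final String name;
    private final List<String> nodes;

    PartitionedSandbox(String name, List<String> nodes) {
        this.name = name;
        this.nodes = nodes;
    }

    /** The unified logical view: a single path inside the sandbox. */
    String logicalPath(String relative) {
        return name + "/" + relative;
    }

    /** The physical view: a similarly structured folder on each cluster node. */
    List<String> physicalPaths(String relative) {
        return nodes.stream()
                    .map(node -> node + ":/sandboxes/" + name + "/" + relative)
                    .collect(Collectors.toList());
    }
}

// new PartitionedSandbox("SboxP", List.of("node1", "node2", "node3"))
//     .physicalPaths("data/input.txt")
// -> [node1:/sandboxes/SboxP/data/input.txt, node2:..., node3:...]
```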

Page 31: CloverETL Cluster - Big Data Parallel Processing Explained

Partitioned Sandbox – defines how data is partitioned across nodes of the CloverETL Cluster.

Allocation – defines how a transformation’s run is distributed across nodes of the CloverETL Cluster.

The allocation can be set to derive from the sandbox layout.

Data processing happens where data resides: we tell the cluster to run our transformation components on nodes that also contain the portions of data we want to process.

Page 32: CloverETL Cluster - Big Data Parallel Processing Explained

Allocation Determined By a Partitioned Sandbox:

4 partitions ⇒ 4 parallel transformations.

There’s no gathering at the end – partitioned results are stored directly to the partitioned sandbox. Allocation for the aggregator is derived from the sandbox being used.

Page 33: CloverETL Cluster - Big Data Parallel Processing Explained

Allocation Determined By an Explicit Number:

8 parallel transformations.

Partitioning at the beginning and gathering at the end is necessary, as we need to cross the serial ⇿ parallel boundary twice.

Page 34: CloverETL Cluster - Big Data Parallel Processing Explained

Data Skew

This is called data skew.

Data is not uniformly distributed across the partitions, which indicates that the chosen partitioning key is not optimal for maximum performance.

However, the chosen key allows us to perform a single-pass aggregation (no semi-results), so it’s a good tradeoff.

The busiest worker has to process 2.5 million rows whereas the least busy one processes only 0.67 million – roughly 3.7x less.

Page 35: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

When processing data in parallel, a few things should be considered.

Aggregating, Sorting, Joining…

Working in parallel means producing partial (“semi-”) results.

First, we produce 4 aggregated semi-results. Then we aggregate the semi-results to get the final result.

[Diagram: record streams 1–4 each produce semi-results 1–4]

These partial results have to be further processed to get the final result.

[Diagram: semi-results 1, 2, 3, 4 ➔ final result]

The good news: when increasing or changing the number of parallel streams, we don’t have to change the transformation.

Page 36: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Full transformation – parallel aggregation & post-processing of semi-results.

[Diagram callouts: count() here in step 1; sum() here in step 2]

Why?

Example: counting occurrences of companies per state in parallel using count().

In step 1, we produce partial results. Because records are partitioned round-robin, data for one state may appear in multiple parallel streams. For example, we might get data for NY as 4 partial results in 4 different streams.

In step 2, we merge all the partial results from the 4 parallel streams into a single sequence and then aggregate again to get the final numbers. At this step the aggregation function is sum() – we sum the partial counts.
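In plain Java, the two-step pattern looks like this. The sketch (with hypothetical partial counts) also shows why step 2 must sum() the partial results: applying count() again would count the streams, not the companies.

```java
import java.util.List;
import java.util.Map;

public class TwoStepAggregation {
    public static void main(String[] args) {
        // Step 1 produced these hypothetical partial counts for NY,
        // one per parallel stream (round-robin spreads NY across all 4).
        List<Map<String, Long>> partials = List.of(
            Map.of("NY", 250_000L), Map.of("NY", 248_500L),
            Map.of("NY", 251_200L), Map.of("NY", 249_300L));

        // Step 2: sum() the partial counts per state.
        long nyTotal = partials.stream()
            .mapToLong(m -> m.getOrDefault("NY", 0L))
            .sum();                       // 999_000 – the (hypothetical) total for NY

        // count()-ing again would just count the streams, not the companies:
        long wrong = partials.stream()
            .filter(m -> m.containsKey("NY"))
            .count();                     // 4 – wrong!

        System.out.println(nyTotal + " vs " + wrong);
    }
}
```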

Page 37: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Parallel sorting

[Diagram callouts: sort here, within each parallel stream; merge here, when gathering]

Why?

Sorting in parallel ➔ records are sorted within individual parallel streams, but not across all streams.

Bringing the parallel sorted streams together into a serial stream ➔ records have to be merged by the same key used in the parallel sort ➔ producing an overall sorted serial result.
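A plain-Java sketch of the same idea: three streams, each already sorted on the key, are combined by a k-way merge on that same key to produce the overall sorted result. The keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedStreamMerge {
    public static void main(String[] args) {
        // Each parallel stream has already sorted its own records (hypothetical keys).
        List<List<String>> sortedStreams = List.of(
            List.of("AK", "DE", "NY"),
            List.of("AZ", "IL", "OR"),
            List.of("MD", "PA", "VA"));

        // K-way merge on the same key used for sorting: repeatedly take the
        // smallest head element across all streams. h = {streamIndex, position}.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            Comparator.comparing((int[] h) -> sortedStreams.get(h[0]).get(h[1])));
        for (int i = 0; i < sortedStreams.size(); i++) heads.add(new int[]{i, 0});

        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] h = heads.poll();
            merged.add(sortedStreams.get(h[0]).get(h[1]));
            if (h[1] + 1 < sortedStreams.get(h[0]).size())
                heads.add(new int[]{h[0], h[1] + 1});
        }
        System.out.println(merged); // [AK, AZ, DE, IL, MD, NY, OR, PA, VA]
    }
}
```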

Page 38: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Parallel joining

Why?

Joining in parallel ➔ master & slave records must be partitioned by the same key/field, and that same key must be used for joining the records.

Otherwise, there is a danger that master & slave records sharing a key will not join, because they end up in different parallel streams – a joiner joins only within one stream, never across streams.
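The rule can be sketched in plain Java: hash-partition both sides by the join key, so records with the same key always land in the same parallel stream. The keys below are hypothetical.

```java
import java.util.List;

public class JoinKeyPartitioning {
    /** Same key ⇒ same stream, applied to both master and slave records. */
    static int streamFor(String joinKey, int parallelism) {
        return Math.floorMod(joinKey.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 3;
        List<String> masterKeys = List.of("NY", "DE", "OR", "MD");
        List<String> slaveKeys  = List.of("MD", "NY", "OR", "DE");

        // Matching master and slave records end up in the same stream,
        // so a per-stream joiner sees both sides of every pair.
        for (String k : masterKeys)
            System.out.println("master " + k + " -> stream " + streamFor(k, parallelism));
        for (String k : slaveKeys)
            System.out.println("slave  " + k + " -> stream " + streamFor(k, parallelism));
        // With round-robin partitioning instead, "NY" on the master side and
        // "NY" on the slave side could land in different streams and never join.
    }
}
```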

Page 39: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Example: parallel joining – 3 parallel streams – partitioning by state

Slave records (partitioned by state):
stream 1 ⥤ [AL AK AZ AR CA CO CT DC DE FL]
stream 2 ⥤ [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND]
stream 3 ⥤ [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]

Master records (partitioned by state):
stream 1 ⥤ [AK AZ DE]
stream 2 ⥤ [IL MD NY]
stream 3 ⥤ [OR PA VA]

Result (all master records joined):
stream 1 ⥤ [AK AZ DE]
stream 2 ⥤ [IL MD NY]
stream 3 ⥤ [OR PA VA]

Page 40: CloverETL Cluster - Big Data Parallel Processing Explained

Parallel Pitfalls

Aggregating, Sorting, Joining…

Example: parallel joining – 3 parallel streams – partitioning round robin

Slave records (partitioned round robin):
stream 1 ⥤ [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY]
stream 2 ⥤ [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV]
stream 3 ⥤ [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]

Master records (partitioned round robin):
stream 1 ⥤ [AK IL OR]
stream 2 ⥤ [AZ MD VA]
stream 3 ⥤ [DE NY PA]

Result (only some master records joined – only DE and NY happened to land in the same stream as their slave records):
stream 1 ⥤ []
stream 2 ⥤ []
stream 3 ⥤ [DE NY]

Page 41: CloverETL Cluster - Big Data Parallel Processing Explained

Bringing it all together…

Going parallel is easy! Try it out for yourself.

☞ BIG DATA problems are handled through the Cluster’s scalability

☞ Existing transformations can be easily converted to parallel

☞ There’s no magic – users have full control over what’s happening

☞ CloverETL Cluster has built-in fault resiliency and load balancing

Page 42: CloverETL Cluster - Big Data Parallel Processing Explained

If you have any questions, check out:

www.cloveretl.com
forum.cloveretl.com
blog.cloveretl.com