Semi-structured Data and Hadoop
Shannon Holgate
Why am I here?
Semi-structured Data
... and Hadoop
Agenda
Motivation for Analysing XML
Solution design in Spark on Hadoop
Tuning the Solution for performance
What to take away
Thumbs up to XML processing in Hadoop
Spark can process XML quite easily
Tuning is absolutely essential
Motivation
We should care about XML
XML is Incorrectly used
Extensive Schemas
SOAP bloat on the wire
XML is Incorrectly used
XML is also
Human Readable
Extremely Portable
Storable in Databases
Making XML perfect for communicating Data
XML is consistently the preferred data transfer protocol we see in Financial
Institutions
We should care about XML
Why should we care?
Financial Institutions have the data and financial backing to use Big Data technologies
Our first Big Data engagement is with a Global
Insurance Provider
This customer would like to process XML at Scale within
Budget
Along came an Elephant
Words from the customer:
"We want Hadoop, let's prove it can keep up with our current applications"
Create a solution which ingests, extracts and exports XML on Hadoop
And make sure this solution has the performance to replace Teradata
Cost
Hadoop provides the opportunity to pay only for what you need
This means no Oracle guy knocking on the maintenance door
And no absurd Teradata licensing fees
XML is worth analysing
Financial Institutions are ready for Big Data
Hadoop is a cost effective Big Data solution
We must find a way to process XML in a performant fashion on Hadoop
Solution
Gathering Requirements
Specifications of the Hadoop Cluster
Expected Load
Data sources
Integration points
Transformation Logic
Cluster Capacity
Installed Services
YARN
Kerberos and Sentry
Specifications of the Hadoop Cluster
Development: 6 nodes, 128GB, 12 cores
Test: 12 nodes, 128GB, 12 cores
Production: 12 nodes, 128GB, 12 cores
Installed Services (all environments): Spark v1.3, Flume v1.5, Sqoop v1.4.5, ...
YARN: enabled in all environments
Kerberos and Sentry: both enabled in all environments
Data Sources
Batch loads vs. Streaming
Apples vs. Oranges
Flume and Spark can be used for Streaming XML messages
Batch Loads are better suited to a scheduled Oozie job
Expected Load
Messages per Second
Message scheduling
Expected Message Size
Expected Load - Streaming Messages
Messages per Second: 48 MPS, 192 MPS peak
Message Scheduling: 24/7
Expected Message Size: 15KB
Expected Load - Batch Messages
Daily Volume: 15GB
Message Scheduling: 22:00 daily
Expected Message Size: 15KB
Transformation Logic
Capture User Stories for Extraction Criteria
Work with Product Owner to Create Data Mapping Spreadsheet
Integration points
XML source location
Export destination
Audit endpoints
Integration points
XML source location: JMS source
Export destination: Exadata
Audit endpoints: Exadata
Building the Pipeline
The Pipeline
Deep Dive
The Tech Stack
Worked Example - Streaming Messages
Data Flow
The Pipeline
The Tech Stack
Flume Ingestion
Spark XML to Avro
Spark Avro to PSV
Sqoop Export of PSV
Deep Dive
JMS XML message source
Flume Agent with JMS source and HDFS sink
Spark Job every 10 minutes
Reads 10 minutes of streamed XML
Converts to Avro Datafiles
Spark job running prior to export
Converts Avro to PSV for Sqoop
Sqoop export for each table in Exadata
Reads PSV versions of the Avro data
Data warehouse holding the completed pipeline data
XML Processing with Spark
Spark on Scala
Access to Java Libraries
Scala is Functional by design
Reading XML in Spark
Keep things simple
Use the Xml Input Format from Hadoop Streaming
Inputs are split from opening to closing tag
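A minimal sketch of reading tag-delimited XML into an RDD, assuming a Mahout-style XmlInputFormat class is available on the classpath; the config keys, tag name and HDFS path are illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: XmlInputFormat is the Mahout-style input format commonly reused
// for this job and is assumed to be on the classpath; the key names are assumptions
val sc = new SparkContext(new SparkConf().setAppName("xml-ingest"))

val hadoopConf = new Configuration()
hadoopConf.set("xmlinput.start", "<Message>")   // hypothetical opening tag
hadoopConf.set("xmlinput.end", "</Message>")    // hypothetical closing tag

// Each record is one full XML message, split from opening to closing tag
val xmlRecords = sc.newAPIHadoopFile(
  "/data/streamed/xml",                          // hypothetical HDFS path
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf
).map { case (_, text) => text.toString }
```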
Avro Support in Spark
Use Kryo Serialisation for correct Avro support
Avro serialisation and Parquet data format
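Enabling Kryo is a one-line change on the SparkConf; a minimal sketch, with hypothetical Avro-generated record classes standing in for the real ones.

```scala
import org.apache.spark.SparkConf

// Switch to Kryo serialisation so Avro records move between stages correctly;
// PolicyRecord and ClaimRecord are placeholders for the real Avro-generated classes
val conf = new SparkConf()
  .setAppName("xml-to-avro")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[PolicyRecord], classOf[ClaimRecord]))
```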
Design Extractors
In this case we want to turn one XML message into 5 different Avros
5 Extraction Classes should be created
Design Extractors - cont
Use DOM/SAX if you have no definite XSD for the XML
DOM is acceptable as the data is already in memory
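A minimal sketch of the extractor pattern: one trait and one of the five implementations. The class name, schema holder and element names are hypothetical.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.w3c.dom.Document

// One trait, five implementations, each turning the parsed message into one Avro type
trait Extractor[T] {
  def extract(doc: Document): T
}

object Dom {
  // DOM is acceptable here: the whole message is already in memory
  def parse(xml: String): Document =
    DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
}

// Hypothetical extractor: PolicySchema.schema would hold the parsed Avro schema
class PolicyExtractor extends Extractor[GenericRecord] {
  def extract(doc: Document): GenericRecord = {
    val record = new GenericData.Record(PolicySchema.schema)
    record.put("policyId", doc.getElementsByTagName("PolicyId").item(0).getTextContent)
    record
  }
}
```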
Spark Processing
All extractions should occur within a single Map
Map only job
Try not to cause any shuffles
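A minimal sketch of keeping the job map-only: every extractor runs inside the same map, so nothing shuffles. `xmlRecords` and the extractor classes are the hypothetical pieces sketched above.

```scala
import org.apache.avro.generic.GenericRecord

// All five extractions happen in a single map over the streamed XML; no shuffle
val extractors: Seq[Extractor[GenericRecord]] = Seq(new PolicyExtractor() /* ...four more */)

val extracted = xmlRecords.map { xml =>
  val doc = Dom.parse(xml)
  extractors.map(_.extract(doc))   // Seq of the five Avro records for this message
}
```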
Writing the Avro output
Use the AvroJob MapReduce output format
Create Avro Datafiles
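A minimal sketch of writing one of the extracted types as Avro datafiles through the avro-mapred output format; each of the five types would get its own schema and output path, and the names here are hypothetical.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

// Set the output schema on a MapReduce Job and write Avro datafiles from the RDD
val job = Job.getInstance(sc.hadoopConfiguration)
AvroJob.setOutputKeySchema(job, PolicySchema.schema)          // hypothetical schema holder

val policyRecords = xmlRecords.map(xml => new PolicyExtractor().extract(Dom.parse(xml)))

policyRecords
  .map(record => (new AvroKey[GenericRecord](record), NullWritable.get()))
  .saveAsNewAPIHadoopFile(
    "/data/avro/policy",                                       // hypothetical output path
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration
  )
```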
Take time to understand incoming XML
Design solution to fit the Hadoop Cluster
XML in Spark should be processed carefully
Tuning
Remember those Cluster Specs?
Development: 6 nodes, 128GB, 12 cores
Test: 12 nodes, 128GB, 12 cores
Production: 12 nodes, 128GB, 12 cores
Installed Services (all environments): Spark v1.3, Flume v1.5, Sqoop v1.4.5, ...
YARN: enabled in all environments
Kerberos and Sentry: both enabled in all environments
Time to use them
Tuning Flume
Build Flume Cluster and Load balance
Source should read enough events for 1 block
File channel vs. Memory channel
Tuning Spark - Executors
Vital tuning point
Memory allocation
Number of cores
Number of Executors
Memory allocation: --executor-memory
Max out but leave some for the daemons
Number of cores: Spark can run 1 task per core
HDFS Client doesn't like concurrent threads
Limit to 5
Number of Executors: more Executors, fewer cores
Big nodes? 3-5 Executors per node
Adjust memory and cores to match
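The same settings expressed as SparkConf properties (they map one-to-one onto the spark-submit flags); a sketch with illustrative figures for a 128GB, 12-core node.

```scala
import org.apache.spark.SparkConf

// Illustrative executor sizing; the numbers are assumptions, not the customer's values
val tuned = new SparkConf()
  .set("spark.executor.memory", "24g")    // max out, but leave room for the daemons
  .set("spark.executor.cores", "5")       // limit to 5: the HDFS client dislikes more concurrent threads
  .set("spark.executor.instances", "24")  // more Executors, fewer cores each
```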
Don't forget YARN
If the Cluster is YARN enabled it will limit memory and cores
Solution?
Number of Executors = (Cores per Node / Container Cores) * Nodes
Or,
Number of Executors = (Memory per Node / Memory per Container) * Nodes
Use the minimum value
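A worked sketch of the rule above for the production cluster (12 nodes, 128GB, 12 cores); the YARN container limits are hypothetical.

```scala
// Worked example of the Executor-count rule; the container limits are assumptions
val nodes             = 12
val coresPerNode      = 12
val memoryPerNodeGb   = 128
val containerCores    = 5     // the 5-core limit from the HDFS client advice above
val containerMemoryGb = 20    // hypothetical YARN container memory limit

val byCores  = coresPerNode / containerCores * nodes        // 2 * 12 = 24
val byMemory = memoryPerNodeGb / containerMemoryGb * nodes  // 6 * 12 = 72

val numExecutors = math.min(byCores, byMemory)              // use the minimum value: 24
```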
Tuning Spark - Monitoring
Spark Context Web UI
Coda Hale Metrics
Tuning Sqoop Exports
Use Direct Connectors
Tweak Number of Mappers
Cluster Flume where possible
More Spark Executors > more cores per Executor
Tune the Sqoop mappers
Summary
XML is absolutely a worthwhile data source to analyse
Focus on using Spark to Extract XML data and move into Avro and Parquet
Tuning should revolve around Spark allocated resources
Thanks
Shannon Holgate
Senior Analytics Engineer @ Kainos
@sholgate13