Semi-structured Data and Hadoop
Shannon Holgate
Why am I here?
Semi-structured Data
... and Hadoop
Agenda
Motivation for Analysing XML
Solution design in Spark on Hadoop
Tuning the Solution for performance
What to take away
Thumbs up to XML processing in Hadoop
Spark can process XML quite easily
Tuning is absolutely essential
Motivation
We should care about XML
XML is Incorrectly used
Extensive Schemas
SOAP bloat on the wire
XML is Incorrectly used
XML is also
Human Readable
Extremely Portable
Storable in Databases
Making XML perfect for communicating Data
XML is consistently the preferred data transfer protocol we see in Financial
Institutions
We should care about XML
Why should we care?
Financial Institutions have the data and financial backing to use Big Data technologies
Our first Big Data engagement is with a Global
Insurance Provider
This customer would like to process XML at Scale within
Budget
Along came an Elephant
Words from the customer:
"We want Hadoop, let's prove it can keep up with our current applications"
Create a solution which ingests, extracts and exports XML on Hadoop
And make sure this solution has the performance to replace Teradata
Cost
Hadoop provides the opportunity to pay only for what you need
This means no Oracle guy knocking on the maintenance door
And no absurd Teradata licensing fees
XML is worth analysing
Financial Institutions are ready for Big Data
Hadoop is a cost effective Big Data solution
We must find a way to process XML in a performant fashion on Hadoop
Solution
Gathering Requirements
Specifications of the Hadoop Cluster
Expected Load
Data sources
Integration points
Transformation Logic
Cluster Capacity
Installed Services
YARN
Kerberos and Sentry
Specifications of the Hadoop Cluster
Development: 6 nodes, 128GB, 12 cores
Test: 12 nodes, 128GB, 12 cores
Production: 12 nodes, 128GB, 12 cores
Installed Services (all environments): Spark v1.3, Flume v1.5, Sqoop v1.4.5, ...
YARN: enabled in all environments
Kerberos and Sentry: both enabled in all environments
Data Sources
Batch loads vs. Streaming
Apples vs. Oranges
Flume and Spark can be used for Streaming XML messages
Batch Loads are better suited to a scheduled Oozie job
Expected Load
Messages per Second
Message scheduling
Expected Message Size
Expected Load - Streaming Messages
Messages per Second: 48 MPS, 192 MPS peak
Message Scheduling: 24/7
Expected Message Size: 15KB
Expected Load - Batch Messages
Daily Volume: 15GB
Message Scheduling: 22:00 daily
Expected Message Size: 15KB
Transformation Logic
Capture User Stories for Extraction Criteria
Work with Product Owner to Create Data Mapping Spreadsheet
Integration points
XML source location
Export destination
Audit endpoints
Integration points
XML source location: JMS source
Export destination: Exadata
Audit endpoints: Exadata
Building the Pipeline
The Pipeline
Deep Dive
The Tech Stack
Worked Example - Streaming Messages
Data Flow
The Pipeline
The Tech Stack
Flume Ingestion
Spark XML to Avro
Spark Avro to PSV
Sqoop Export of PSV
Deep Dive
JMS XML message source
Flume Agent with JMS source and HDFS sink
Spark Job every 10 minutes
Reads 10 minutes of streamed XML
Converts to Avro Datafiles
Spark job running prior to export
Converts Avro to PSV for Sqoop
Sqoop export for each table in Exadata
Reads PSV versions of the Avro data
Data warehouse holding the completed pipeline data
XML Processing with Spark
Spark on Scala
Access to Java Libraries
Scala is Functional by design
Reading XML in Spark
Keep things simple
Use the Xml Input Format from Hadoop Streaming
Inputs are split from opening to closing tag
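A minimal sketch of reading tag-delimited XML into an RDD, assuming a Mahout-style XmlInputFormat class is available on the classpath; the config keys, tag name and HDFS path are illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: XmlInputFormat is the Mahout-style input format commonly reused
// for this job and is assumed to be on the classpath; the key names are assumptions
val sc = new SparkContext(new SparkConf().setAppName("xml-ingest"))

val hadoopConf = new Configuration()
hadoopConf.set("xmlinput.start", "<Message>")   // hypothetical opening tag
hadoopConf.set("xmlinput.end", "</Message>")    // hypothetical closing tag

// Each record is one full XML message, split from opening to closing tag
val xmlRecords = sc.newAPIHadoopFile(
  "/data/streamed/xml",                          // hypothetical HDFS path
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf
).map { case (_, text) => text.toString }
```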
Avro Support in Spark
Use Kryo Serialisation for correct Avro support
Avro serialisation and Parquet data format
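Enabling Kryo is a one-line change on the SparkConf; a minimal sketch, with hypothetical Avro-generated record classes standing in for the real ones.

```scala
import org.apache.spark.SparkConf

// Switch to Kryo serialisation so Avro records move between stages correctly;
// PolicyRecord and ClaimRecord are placeholders for the real Avro-generated classes
val conf = new SparkConf()
  .setAppName("xml-to-avro")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[PolicyRecord], classOf[ClaimRecord]))
```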
Design Extractors
In this case we want to turn one XML message into 5 different Avros
5 Extraction Classes should be created
Design Extractors - cont
Use DOM/SAX if you have no definite XSD for the XML
DOM is acceptable as the data is already in memory
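A minimal sketch of the extractor pattern: one trait and one of the five implementations. The class name, schema holder and element names are hypothetical.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.w3c.dom.Document

// One trait, five implementations, each turning the parsed message into one Avro type
trait Extractor[T] {
  def extract(doc: Document): T
}

object Dom {
  // DOM is acceptable here: the whole message is already in memory
  def parse(xml: String): Document =
    DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
}

// Hypothetical extractor: PolicySchema.schema would hold the parsed Avro schema
class PolicyExtractor extends Extractor[GenericRecord] {
  def extract(doc: Document): GenericRecord = {
    val record = new GenericData.Record(PolicySchema.schema)
    record.put("policyId", doc.getElementsByTagName("PolicyId").item(0).getTextContent)
    record
  }
}
```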
Spark Processing
All extractions should occur within a single Map
Map only job
Try not to cause any shuffles
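A minimal sketch of keeping the job map-only: every extractor runs inside the same map, so nothing shuffles. `xmlRecords` and the extractor classes are the hypothetical pieces sketched above.

```scala
import org.apache.avro.generic.GenericRecord

// All five extractions happen in a single map over the streamed XML; no shuffle
val extractors: Seq[Extractor[GenericRecord]] = Seq(new PolicyExtractor() /* ...four more */)

val extracted = xmlRecords.map { xml =>
  val doc = Dom.parse(xml)
  extractors.map(_.extract(doc))   // Seq of the five Avro records for this message
}
```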
Writing the Avro output
Use the AvroJob MapReduce output format
Create Avro Datafiles
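A minimal sketch of writing one of the extracted types as Avro datafiles through the avro-mapred output format; each of the five types would get its own schema and output path, and the names here are hypothetical.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

// Set the output schema on a MapReduce Job and write Avro datafiles from the RDD
val job = Job.getInstance(sc.hadoopConfiguration)
AvroJob.setOutputKeySchema(job, PolicySchema.schema)          // hypothetical schema holder

val policyRecords = xmlRecords.map(xml => new PolicyExtractor().extract(Dom.parse(xml)))

policyRecords
  .map(record => (new AvroKey[GenericRecord](record), NullWritable.get()))
  .saveAsNewAPIHadoopFile(
    "/data/avro/policy",                                       // hypothetical output path
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration
  )
```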
Take time to understand incoming XML
Design solution to fit the Hadoop Cluster
XML in Spark should be processed carefully
Tuning
Remember those Cluster Specs?
Development: 6 nodes, 128GB, 12 cores
Test: 12 nodes, 128GB, 12 cores
Production: 12 nodes, 128GB, 12 cores
Installed Services (all environments): Spark v1.3, Flume v1.5, Sqoop v1.4.5, ...
YARN: enabled in all environments
Kerberos and Sentry: both enabled in all environments
Time to use them
Tuning Flume
Build Flume Cluster and Load balance
Source should read enough events for 1 block
File channel vs. Memory channel
Tuning Spark - Executors
Vital tuning point
Memory allocation
Number of cores
Number of Executors
Memory allocation: --executor-memory
Max out but leave some for the daemons
Number of cores: Spark can run 1 task per core
HDFS Client doesn't like concurrent threads
Limit to 5
Number of Executors: more Executors, fewer cores
Big nodes? 3-5 Executors per node
Adjust memory and cores to match
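The same settings expressed as SparkConf properties (they map one-to-one onto the spark-submit flags); a sketch with illustrative figures for a 128GB, 12-core node.

```scala
import org.apache.spark.SparkConf

// Illustrative executor sizing; the numbers are assumptions, not the customer's values
val tuned = new SparkConf()
  .set("spark.executor.memory", "24g")    // max out, but leave room for the daemons
  .set("spark.executor.cores", "5")       // limit to 5: the HDFS client dislikes more concurrent threads
  .set("spark.executor.instances", "24")  // more Executors, fewer cores each
```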
Don't forget YARN
If the Cluster is YARN enabled it will limit memory and cores
Solution?
Number of Executors = (Cores per Node / Container Cores) * Nodes
Or,
Number of Executors = (Memory per Node / Memory per Container) * Nodes
Use the minimum value
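A worked sketch of the rule above for the production cluster (12 nodes, 128GB, 12 cores); the YARN container limits are hypothetical.

```scala
// Worked example of the Executor-count rule; the container limits are assumptions
val nodes             = 12
val coresPerNode      = 12
val memoryPerNodeGb   = 128
val containerCores    = 5     // the 5-core limit from the HDFS client advice above
val containerMemoryGb = 20    // hypothetical YARN container memory limit

val byCores  = coresPerNode / containerCores * nodes        // 2 * 12 = 24
val byMemory = memoryPerNodeGb / containerMemoryGb * nodes  // 6 * 12 = 72

val numExecutors = math.min(byCores, byMemory)              // use the minimum value: 24
```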
Tuning Spark - Monitoring
Spark Context Web UI
Coda Hale Metrics
Tuning Sqoop Exports
Use Direct Connectors
Tweak Number of Mappers
Cluster Flume where possible
More Spark Executors > more cores per Executor
Tune the Sqoop mappers
Summary
XML is absolutely a worthwhile data source to analyse
Focus on using Spark to Extract XML data and move into Avro and Parquet
Tuning should revolve around Spark allocated resources
Thanks
Shannon Holgate
Senior Analytics Engineer @ Kainos
@sholgate13