An Information Architecture for Hadoop Mark Samson – Systems Engineer, Cloudera

Dec 19, 2015

Page 1

An Information Architecture for Hadoop

Mark Samson – Systems Engineer, Cloudera

Page 2

© Cloudera, Inc. All rights reserved.

Background

• The trend is for organisations to build business-wide Hadoop implementations

• Enterprise Data Hub / Data Lake / Hadoop as a Service
• Many data sources
• Many lines of business
• Many use cases
• Many engines and tools available to process and analyse data
• Need to meet SLAs for data consumers

• How do I organise my information architecture within Hadoop to cope with this variety?

• Need a Logical Information Architecture for Hadoop!

Page 3

What are the requirements?

• Ingest data in its full fidelity, in as close to its original, raw form as possible

• Provide a data discovery and exploration facility for analysts and data scientists

• Bring together and link multiple data sets

• Serve data efficiently to business users and applications – meeting SLAs

Page 4

Where does an Enterprise Data Hub fit?

[Diagram: Data Sources, Enterprise Data Hub, Data Consumers]

Data consumers can be:
• Analysts
• Data Scientists
• Business Users (Reports)
• Applications

Data Sources can be:
• Databases / DWs
• File Sources
• Machines, Sensors (IoT)
• Internet (Social Media etc.)
• Mobile

Enterprise Data Hub sits in between!

(but it’s not the only thing in between)

Page 5

How does data arrive?

[Diagram: Data Sources, Enterprise Data Hub, Data Consumers]

Data can arrive in any form e.g.
• Event data
  • Log files
  • Streaming e.g. via MQ, Kafka
• Relational tables with any data model
  • Star schema
  • 3NF
• Files with any format
  • Text
  • JSON
  • XML
  • Avro
  • …

Page 6

Raw Layer

[Diagram: Data Sources, Raw Layer, Data Consumers]

Principle: Ingest data raw, in full fidelity – as close as possible to the form in which it arrives

Data organised in HDFS by data source e.g. /landing/<source>

Writeable by ingestion processes e.g. Flume, Sqoop

Readable by transformation processes e.g. Hive, Pig, MR, Spark
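As an illustration of the /landing/<source> convention, here is a minimal Python sketch that builds one landing directory per data source. The per-day sub-directory, the source name "crm_db" and the helper landing_path are assumptions made for this example; the slides only specify organising by source.

```python
import posixpath
from datetime import date

LANDING_ROOT = "/landing"  # root directory named on the slide


def landing_path(source: str, ingest_date: date) -> str:
    """Return the HDFS directory for one source and one ingest day.

    The per-day sub-directory is an assumption for illustration;
    the slide only states /landing/<source>.
    """
    if not source or "/" in source:
        raise ValueError("source must be a single path component")
    # HDFS paths use forward slashes, so posixpath is used on every OS.
    return posixpath.join(LANDING_ROOT, source, ingest_date.isoformat())


# Hypothetical source name, used only to show the resulting layout.
example = landing_path("crm_db", date(2015, 12, 19))
```

Keeping the source name as a single path component means each ingestion process can be granted write access to its own directory only, while transformation processes are granted read access.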

Page 7

Discovery Layer

[Diagram: Data Sources, Raw Layer, Discovery Layer, Data Consumers]

Used for Discovery and Exploration by small teams of Analysts and Data Scientists

Users or teams given their own “sandpits” (at a cost?)

Mix of views and materialised data

Some data sets “enriched” e.g. by joining reference data

Tools: Impala, Solr, Spark
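The enrichment step mentioned above (joining reference data onto a data set) can be sketched in plain Python. This is only an illustration of the idea: in practice the join would more likely be expressed in Impala or Spark, and the field names and the helper enrich are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def enrich(events: Iterable[dict], reference: Dict[str, dict], key: str) -> Iterator[dict]:
    """Left-join each event with its reference row, keeping unmatched events."""
    for event in events:
        ref = reference.get(event.get(key), {})
        # Event fields win on a name clash, so raw values are never overwritten.
        yield {**ref, **event}


# Hypothetical sample data: two events and one reference (customer) row.
events = [
    {"customer_id": "c1", "amount": 10},
    {"customer_id": "c9", "amount": 5},
]
customers = {"c1": {"customer_id": "c1", "country": "UK"}}

enriched = list(enrich(events, customers, "customer_id"))
```

A left join is used so that events without a matching reference row still appear in the sandpit, which suits exploratory work where gaps in reference data are themselves of interest.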

Page 8

Shared Layer

[Diagram: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Data Consumers]

Available across lines of business (LOBs), subject to security constraints

Incentives for Analyst / Data Science teams to move their data and use cases into this Layer

Data from multiple sources joined together

Tools: Impala, Hive, Pig, Spark

Page 9

Optimised Layer

[Diagram: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Data Consumers]

Build this when you need to operationalise the use case

Organised by data consumer and use case, not by source

Data modelled to provide optimised performance
• Often denormalised
• Uses optimised storage formats e.g. Parquet with partitioning, HBase
• Accessed by low latency query engines e.g. HBase, Impala, Solr
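To make "Parquet with partitioning" more concrete, the sketch below groups denormalised records into Hive-style partition directories (key=value path segments), the layout that Hive, Impala and Spark use for partitioned tables. It only computes directory names and does not write Parquet files. The base path /optimised/reporting/sales, the partition keys and both helper functions are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def partition_dir(base: str, record: dict, partition_keys: Sequence[str]) -> str:
    """Build a Hive-style partition directory such as base/year=2015/month=12."""
    parts = [f"{k}={record[k]}" for k in partition_keys]
    return "/".join([base.rstrip("/"), *parts])


def group_by_partition(
    base: str, records: Iterable[dict], partition_keys: Sequence[str]
) -> Dict[str, List[dict]]:
    """Group records by the partition directory they would be written to."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[partition_dir(base, record, partition_keys)].append(record)
    return dict(groups)


# Hypothetical denormalised records, organised by consumer and use case.
records = [
    {"year": 2015, "month": 12, "country": "UK", "amount": 10},
    {"year": 2015, "month": 12, "country": "FR", "amount": 7},
    {"year": 2015, "month": 11, "country": "UK", "amount": 3},
]
layout = group_by_partition("/optimised/reporting/sales", records, ["year", "month"])
```

Choosing partition keys that match the consumer's most common filters lets the query engine skip whole directories, which is one reason this layer is organised by use case rather than by source.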

Page 10

What About Real Time?

[Diagram: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Data Consumers]

To operationalise use cases in real time:
• Low latency components e.g. Kafka, Flume, Spark Streaming
• Consume data straight from sources
• Transform/analyse it
• Deliver it direct to the Optimised Layer for low-latency query
• Or deliver direct to consumer
• Generally still persist raw data in Raw Layer
• Follows the Lambda Architecture

[Diagram label: Speed Layer]
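The real-time flow described on this slide can be sketched as a toy, in-memory Python model. It is not a Kafka or Spark Streaming implementation: the list raw_log stands in for the Raw Layer, the counter realtime_view stands in for a low-latency table in the Optimised Layer, and the class name, event shape and page-count use case are all assumptions for illustration.

```python
from collections import Counter
from typing import List


class SpeedLayerSketch:
    """Toy model of the speed path: persist raw, update a low-latency view."""

    def __init__(self) -> None:
        self.raw_log: List[dict] = []            # stands in for the Raw Layer
        self.realtime_view: Counter = Counter()  # stands in for the Optimised Layer

    def on_event(self, event: dict) -> None:
        # Persist the event unchanged, in full fidelity.
        self.raw_log.append(event)
        # Transform and deliver straight to the low-latency view.
        self.realtime_view[event["page"]] += 1

    def batch_recompute(self) -> Counter:
        # Batch path: rebuild the same view from the raw data.
        return Counter(event["page"] for event in self.raw_log)


sketch = SpeedLayerSketch()
for page in ["home", "cart", "home"]:
    sketch.on_event({"page": page})
```

Because the raw events are still persisted, the view can always be rebuilt by the batch path, which is the property the Lambda Architecture relies on.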

Page 11

This is a Complex, Multi-Tenant Architecture

Critical Enablers

A broad and open ecosystem

Security and Governance
• Authentication
• Authorisation
• Auditing
• Lineage and Metadata
• Encryption

Resource Management

Chargeback

[Diagram: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers]

Page 12

Considerations

This is not prescriptive

• There could be more or fewer layers, depending on use cases

This is a logical architecture

There may be multiple physical clusters due to non-functional requirements e.g.
• Compliance and security e.g. some data can only be kept in the EU
• If there are tight SLAs, some engines perform better on dedicated clusters e.g. HBase, Kafka

[Diagram: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers]

Page 13

Conclusion

Move from Big Data Spaghetti

[Diagram: Data Sources, Data Consumers, EDWs, Marts, Search Servers, Document Stores, Storage]

Page 14

Conclusion

[Diagram: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers]

Move from Big Data Spaghetti

…to Big Data Lasagne!

Page 15

BOOK SIGNINGS, THEATER SESSIONS, TECHNICAL DEMOS, GIVEAWAYS

Visit us at Booth #101

HIGHLIGHTS:

Apache Kafka is now fully supported with Cloudera

Learn why Cloudera is the leader for security & governance in Hadoop

Page 16

Thank you