BDAM Data Abstractionscustomers.cask.co/rs/882-OYR-915/images/BDAM 11-9_Data... · 2020. 8. 23. · How Data Abstractions Help Build BigData Apps Andreas Neumann @caskoid ... - Apache

Peeling the Onion

How Data Abstractions Help Build BigData Apps

Andreas Neumann @caskoid

November 2016

Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.

cask.co

Abstractions are Everywhere

Piet Mondrian, Trafalgar Square, Broadway Boogie Woogie

The Case for Abstractions

Abstraction is a mental process we use when trying to discern what is essential or relevant to a problem.

Tom G. Palmer

Common Abstractions in Computing

- Programming Languages
  - Assembler > C > C++ > Java > Scala > ?
  - Memory management, Concurrency, Closures, …
- Web App Servers
  - CGI-bin > Servlets > JAX-RS
  - Connection Pools, Security, ...
- Relational Databases
  - Primitive types -> Semi-structured -> ORM
  - Transactions, rollback, isolation

Abstractions in Hadoop

- MapReduce
  - Input/OutputFormat provides some kind of abstraction
  - Intermediate data (mapper output) must be Writable
- HBase
  - Row/column keys and values are byte[]
  - Client must implement encoding of higher level types
  - Transactions: Isolation, Consistency
- Existing data abstractions for Hadoop
  - Apache Hive, Apache Phoenix, …
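To make the encoding burden on the client concrete, here is a minimal Python sketch of a hand-written codec that turns a (user, timestamp) pair into an order-preserving byte string, the kind of work a byte[]-only API leaves to the application. The names `encode_row_key` and `decode_row_key` and the key layout are assumptions made for this illustration; they are not part of HBase or CDAP.

```python
import struct


def encode_row_key(user: str, timestamp_ms: int) -> bytes:
    """Encode (user, timestamp) so that keys sort by user, then by time."""
    user_bytes = user.encode("utf-8")
    if b"\x00" in user_bytes:
        raise ValueError("user must not contain a NUL byte")
    # A NUL terminator separates the variable-length user from the timestamp.
    # Big-endian unsigned 64-bit keeps byte order equal to numeric order.
    return user_bytes + b"\x00" + struct.pack(">Q", timestamp_ms)


def decode_row_key(key: bytes) -> tuple[str, int]:
    """Invert encode_row_key: split at the first NUL, then unpack the time."""
    user_bytes, _, rest = key.partition(b"\x00")
    (timestamp_ms,) = struct.unpack(">Q", rest)
    return user_bytes.decode("utf-8"), timestamp_ms
```

Every application that talks to the raw byte interface has to write and maintain code like this, and all readers and writers must agree on the layout.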

Layers of Abstractions

[Figure: onion diagram of abstraction layers, labeled: engine, capability injection, data sharing, integrations, encapsulation, access patterns, consistency, isolation, storage format, schema]

Storage Engine Abstraction

- Storage Engine
  - Physical Storage Medium
  - Lowest level of the abstraction stack
- Benefits
  - Application code not “polluted” with low-level storage APIs
  - Portability across storage engines
  - Portability across different versions of the storage engine
  - Testability in environments with different storage engines
  - Reusability of code
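The portability and testability benefits can be sketched as follows: application code depends only on a small interface, and an in-memory engine stands in for the real one during tests. `KeyValueStore`, `InMemoryStore`, and `record_visit` are hypothetical names invented for this example, not CDAP APIs.

```python
from abc import ABC, abstractmethod
from typing import Optional


class KeyValueStore(ABC):
    """The only storage API the application is allowed to see."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...


class InMemoryStore(KeyValueStore):
    """Test-friendly engine; a production engine would implement the same API."""

    def __init__(self) -> None:
        self._data: dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


def record_visit(store: KeyValueStore, user: str) -> int:
    """Application logic: increment a per-user counter via the interface only."""
    key = user.encode("utf-8")
    current = store.get(key)
    count = int(current.decode("ascii")) + 1 if current is not None else 1
    store.put(key, str(count).encode("ascii"))
    return count
```

Swapping the engine means passing a different `KeyValueStore` implementation to `record_visit`; the application function itself does not change.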

Storage Format Abstraction

- Representation of data in the storage engine
  - Serialization of data types to native storage format
  - Mapping complex types to storage format (ORM)
  - Schema representation
  - Provided partially by some storage engines (SQL)
- Benefits
  - Application is not concerned with serialization/deserialization
  - Schema evolution
  - Enforces correct schema and representation
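A rough sketch of what such a format layer does, under the assumption that records are Python dataclasses stored as JSON bytes: it serializes, checks field types on read, and fills in defaults for fields that older records lack (a simple form of schema evolution). `UserV2`, `serialize`, and `deserialize` are hypothetical names for this illustration.

```python
import json
from dataclasses import MISSING, asdict, dataclass, fields


@dataclass
class UserV2:
    name: str
    age: int
    country: str = "unknown"  # added in version 2; older records lack it


def serialize(record) -> bytes:
    """Turn a dataclass instance into the stored byte representation."""
    return json.dumps(asdict(record), sort_keys=True).encode("utf-8")


def deserialize(cls, payload: bytes):
    """Rebuild a record, enforcing the schema declared by the dataclass."""
    raw = json.loads(payload.decode("utf-8"))
    kwargs = {}
    for f in fields(cls):
        if f.name in raw:
            if not isinstance(raw[f.name], f.type):
                raise TypeError(f"field {f.name} is not of type {f.type.__name__}")
            kwargs[f.name] = raw[f.name]
        elif f.default is not MISSING:
            kwargs[f.name] = f.default  # schema evolution: use the default
        else:
            raise ValueError(f"missing required field: {f.name}")
    return cls(**kwargs)
```

The application only ever handles `UserV2` objects; whether the bytes were written before or after the `country` field existed is the format layer's concern.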

Consistency Abstractions

- Strong vs. Eventual Consistency
- Transactional (ACID) consistency
  - Protect data from concurrent modification
  - Isolation / visibility guarantees
  - Optimistic Concurrency Control: Handling conflicts
- Benefits
  - Application code not concerned with consistency
  - “Framework level correctness”
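Optimistic concurrency control can be illustrated with one simple variant: a transaction buffers its writes, remembers the version of everything it read, and at commit time fails if any of those versions has changed. This is a single-threaded sketch with hypothetical class names (`VersionedStore`, `Transaction`, `ConflictError`); it is not how any particular transaction system, including the one in CDAP, is implemented.

```python
class ConflictError(Exception):
    """Raised at commit when data read by the transaction has changed."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys are version 0 with no value."""
        return self._data.get(key, (0, None))

    def commit(self, read_versions, writes):
        # Validation phase: everything the transaction read must be unchanged.
        for key, seen_version in read_versions.items():
            current_version, _ = self.read(key)
            if current_version != seen_version:
                raise ConflictError(key)
        # Write phase: apply buffered writes and bump versions.
        for key, value in writes.items():
            version, _ = self.read(key)
            self._data[key] = (version + 1, value)


class Transaction:
    def __init__(self, store):
        self._store = store
        self._read_versions = {}
        self._writes = {}

    def get(self, key):
        if key in self._writes:  # read your own uncommitted write
            return self._writes[key]
        version, value = self._store.read(key)
        self._read_versions.setdefault(key, version)
        return value

    def put(self, key, value):
        self._writes[key] = value  # buffered until commit

    def commit(self):
        self._store.commit(self._read_versions, self._writes)
```

With a framework providing this, application code would typically just call `get` and `put`; detecting the conflict and retrying the transaction is handled at the framework level.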

Data Sharing Abstractions

- Sharing/Reusing data across programming paradigms
  - Write with Spark Streaming, query with SQL
  - Share data between batch (MapReduce) and realtime streaming
  - Data as a Service (DaaS)
- Benefits
  - No data silos
  - Less redundancy in data access
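As a small stand-in for "write with one paradigm, query with another", the sketch below writes events one record at a time and then reads the same data through SQL. It uses Python's built-in `sqlite3` module purely for illustration; the table layout and the function names `open_dataset`, `ingest`, and `actions_per_user` are assumptions, and a real deployment would involve Spark Streaming and a Hadoop SQL engine instead.

```python
import sqlite3


def open_dataset() -> sqlite3.Connection:
    """Create an in-memory dataset shared by the writer and the SQL reader."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
    return conn


def ingest(conn: sqlite3.Connection, events) -> None:
    """Record-at-a-time writer, standing in for a streaming program."""
    for user, action in events:
        conn.execute("INSERT INTO events VALUES (?, ?)", (user, action))
    conn.commit()


def actions_per_user(conn: sqlite3.Connection):
    """Ad-hoc SQL query over the same data, with no export or copy step."""
    cursor = conn.execute(
        "SELECT user, COUNT(*) FROM events GROUP BY user ORDER BY user"
    )
    return cursor.fetchall()
```

Both access paths go through one dataset, so there is no second copy to keep in sync.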

Data Access Pattern Abstractions

- Encapsulation of common data access patterns
- Examples:
  - Indexed Table
  - TimeSeries
  - Cube
- Benefits
  - Cleaner application code
  - Enforcement of best practices
  - Avoid data corruption
  - Separation of concerns/responsibilities
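The indexed-table pattern can be sketched as a class that keeps a secondary index consistent on every write, so that application code cannot forget to update it. This `IndexedTable` is a hypothetical in-memory illustration of the pattern, not the CDAP dataset of the same name.

```python
class IndexedTable:
    """Row store with one secondary index maintained on every put."""

    def __init__(self, index_column: str):
        self._index_column = index_column
        self._rows = {}   # row key -> dict of columns
        self._index = {}  # indexed value -> set of row keys

    def put(self, row_key, columns: dict) -> None:
        """Replace the row and keep the index in step with the new value."""
        old = self._rows.get(row_key)
        if old is not None and self._index_column in old:
            self._unindex(old[self._index_column], row_key)
        self._rows[row_key] = dict(columns)
        if self._index_column in columns:
            value = columns[self._index_column]
            self._index.setdefault(value, set()).add(row_key)

    def get(self, row_key):
        return self._rows.get(row_key)

    def read_by_index(self, value) -> list:
        """Return all rows whose indexed column equals value, by row key order."""
        return [self._rows[k] for k in sorted(self._index.get(value, ()))]

    def _unindex(self, value, row_key) -> None:
        keys = self._index.get(value)
        if keys is not None:
            keys.discard(row_key)
            if not keys:
                del self._index[value]
```

Because the stale index entry is removed inside `put`, a caller who updates a row never leaves the index pointing at outdated data, which is the "avoid data corruption" benefit in miniature.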

Capability Injection

- Framework level Enterprise capabilities
  - Metrics
  - Meta Data
  - Lineage, Access Audit Trail, Usage stats
  - Access Control
- Benefits
  - Operational Capabilities solved at the framework level
  - Compliance, Governance
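One way to picture capability injection is a wrapper that the framework places around a dataset before handing it to a program: the program calls `read` and `write` as usual, while metrics and an audit trail are collected on the side. `DictDataset` and `InstrumentedDataset` are hypothetical names for this sketch; the real mechanism in any given framework may differ.

```python
from collections import Counter


class DictDataset:
    """Plain dataset with no operational features of its own."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key)

    def write(self, key, value):
        self._data[key] = value


class InstrumentedDataset:
    """Wraps a dataset and records metrics and an audit trail for each call."""

    def __init__(self, name, delegate, program):
        self._name = name
        self._delegate = delegate
        self._program = program
        self.metrics = Counter()
        self.audit_log = []  # (program, operation, dataset) tuples

    def _record(self, operation):
        self.metrics[f"{self._name}.{operation}"] += 1
        self.audit_log.append((self._program, operation, self._name))

    def read(self, key):
        self._record("read")
        return self._delegate.read(key)

    def write(self, key, value):
        self._record("write")
        self._delegate.write(key, value)
```

The audit entries record which program touched which dataset and how, which is the raw material for lineage and usage statistics.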

The Cost of Abstraction

First you learn the value of abstraction, then you learn the cost of abstraction, then you're ready to engineer.

Kent Beck

Clean Cut Abstractions

[Figure: abstraction layers shown as cleanly separated pieces, labeled: consistency, storage format, engine, encapsulation, data sharing, capability injection]

Abstractions Gone Wrong

Fried Abstractions

What Makes a Good Abstraction

- Minimal Overhead
  - Injection happens once
  - Not in critical path / inner loop
- Not more code
  - Separation from app code
  - Reusability
- Storage Optimization
  - May not expose all the knobs and dials of the storage engine
  - Allow bypassing the abstraction when necessary

• Application Development and Management

• Provides Data and Programming Abstractions

• Provides Integrations

• Data-As-A-Service

• Empower developers

• Simple Access to Powerful Tech

• WYSIWYG Data Pipelines: Streaming, Batch

• Ingestion, Transformation, Blending (complex joins) and Lookup.

• Machine Learning, Aggregation and Reporting

• Connectors for varied sources and sinks

• Easy way to catalog application and pipeline level metadata

• Search across technical, business and operational metadata

• Track Lineage and Provenance

• Data Quality Measure

• Integration with other MDM systems

Data Abstractions In Practice

- Use Case:
  - Ingest from Twitter into a Dataset
  - Run MapReduce over the Dataset to compute frequent #hashtags
  - Service to retrieve the top #hashtags
  - See the lineage for this Dataset
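The counting step of this use case can be sketched in plain Python, with a map function that emits a pair per hashtag and a reduce function that sums them. This is an illustrative stand-in for the MapReduce job, not the demo's code; the function names and the sample tweets in the usage note are invented.

```python
import re
from collections import Counter
from itertools import chain

HASHTAG = re.compile(r"#(\w+)")


def map_tweet(text: str):
    """Map phase: emit (hashtag, 1) for every hashtag in one tweet."""
    return [(tag.lower(), 1) for tag in HASHTAG.findall(text)]


def reduce_counts(pairs) -> Counter:
    """Reduce phase: sum the emitted counts per hashtag."""
    counts = Counter()
    for tag, n in pairs:
        counts[tag] += n
    return counts


def top_hashtags(tweets, k: int):
    """What the service would return: the k most frequent hashtags."""
    counts = reduce_counts(chain.from_iterable(map_tweet(t) for t in tweets))
    # Sort by count descending, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

For example, `top_hashtags(["Loving #Hadoop and #Spark", "#hadoop rocks"], 1)` yields `[("hadoop", 2)]`, since hashtags are lowercased before counting.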

Demo

Conclusion

Brevity is the soul of wit. William Shakespeare

Thank You!

cdap-user@googlegroups.com

@CaskData

github.com/caskdata/cdap

github.com/caskdata/hydrator-plugins

Questions?
