Massimo Brignoli, Principal Solution Architect
[email protected] | @massimobrignoli
The Next-Generation Enterprise-Class Architecture
Apr 06, 2017
Agenda
• The Birth of Data Lakes
• MongoDB Overview
• A Proposed EDM Architecture
• Case Studies & Scenarios
• Data Lake Lessons Learned
How Much Data?
• One thing companies are not short of: data
  – Sensor streams
  – Social media sentiment
  – Server logs
  – Mobile apps
• Analysts estimate data volume is growing 40% per year, 90% of it unstructured.
• Traditional technologies (some designed 40 years ago) are no longer sufficient.
The Promise of “Big Data”
• Discovering insights by collecting and analyzing data carries the promise of:
  – A competitive advantage
  – Cost savings
• A widespread example of Big Data technology in action is the “Single View”: aggregating everything known about a customer to improve engagement and revenue.
• The traditional EDW creaks under the load, overwhelmed by the volume and variety of the data (and by its high cost).
The Birth of Data Lakes
• Many companies have started looking at an architecture called the Data Lake:
  – A platform for managing data flexibly
  – A way to aggregate and correlate data across silos in a single place
  – A means of exploring all of the data
• The most popular platform for this today is Hadoop:
  – Scales horizontally on commodity hardware
  – Supports varied, read-optimized data schemas
  – Includes data-processing layers in SQL and in common programming languages
  – Strong references (Yahoo and Google above all)
Why Hadoop?
• The Hadoop Distributed File System is designed to scale for large batch operations
• It provides a write-once, read-many, append-only model
• Optimized for long scans over TB or PB of data
• This ability to handle multi-structured data is used for:
  – Customer segmentation for marketing campaigns and recommendations
  – Predictive analytics
  – Risk models
Is It Right for Everything?
• Data Lakes are expected to deliver Hadoop's output to online applications. Those applications have requirements such as:
  – Response latency in milliseconds
  – Random access to an indexed subset of the data
  – Support for expressive queries and data aggregations
  – Real-time updates to frequently changing values
Is Hadoop the Answer to Everything?
• In our data-driven world, milliseconds matter.
  – IBM researchers state that 60% of data loses its value within milliseconds of being generated
  – For example, identifying a fraudulent stock transaction can be useless after a few minutes
• Gartner predicts that 70% of Hadoop installations will fail to meet their cost and revenue-growth objectives.
Enterprise Data Management Pipeline

[Diagram: siloed source databases, external feeds (batch), and streams feed the pipeline via pub-sub, ETL, file imports, and stream processing; data flows through Store raw data → Transform → Aggregate → Analyze and out to users and other systems. Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png]
A Quick Overview of MongoDB
Document Model

{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: {
    type: 'Point',
    coordinates: [45.123, 47.232]
  },
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}

[Diagram: the single MongoDB document on one side maps to several normalized RDBMS tables on the other]
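As a minimal sketch of how such a document is stored and read back (the collection name 'customers' is an assumption for illustration), the mongo shell needs no schema migration:

// Insert the document exactly as modeled above.
db.customers.insertOne({
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: { type: 'Point', coordinates: [45.123, 47.232] },
  cars: [
    { model: 'Bentley', year: 1973, value: 100000 },
    { model: 'Rolls Royce', year: 1965, value: 330000 }
  ]
})

// Retrieve it, nested array included, in a single round trip.
db.customers.findOne({ surname: 'Miller' })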
Documents are Rich Data Structures

{
  first_name: 'Paul',                              // String
  surname: 'Miller',
  cell: 447557505611,                              // Number (typed field values)
  city: 'London',
  location: {                                      // Geo-location
    type: 'Point',
    coordinates: [45.123, 47.232]
  },
  profession: ['banking', 'finance', 'trader'],    // fields can contain arrays
  cars: [                                          // fields can contain an array of sub-documents
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}
Documents are Flexible

{
  product_name: 'Acme Paint',
  color: ['Red', 'Green'],
  size_oz: [8, 32],
  finish: ['satin', 'eggshell']
}

{
  product_name: 'T-shirt',
  size: ['S', 'M', 'L', 'XL'],
  color: ['Heather Gray', … ],
  material: '100% cotton',
  wash: 'cold',
  dry: 'tumble dry low'
}

{
  product_name: 'Mountain Bike',
  brake_style: 'mechanical disc',
  color: 'grey',
  frame_material: 'aluminum',
  no_speeds: 21,
  package_height: '7.5x32.9x55',
  weight_lbs: 44.05,
  suspension_type: 'dual',
  wheel_size_in: 26
}

Documents in the same product catalog collection in MongoDB
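A hedged sketch of how these differently shaped documents coexist and are queried in one collection (collection name 'products' assumed):

db.products.insertMany([
  { product_name: 'Acme Paint', color: ['Red', 'Green'], size_oz: [8, 32] },
  { product_name: 'Mountain Bike', color: 'grey', no_speeds: 21 }
])

// Array fields match on membership: this finds 'Acme Paint' even though color is an array.
db.products.find({ color: 'Red' })
// The same query syntax works when 'color' is a plain string.
db.products.find({ color: 'grey' })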
Drivers & Frameworks

[Logos: officially supported drivers for many languages, plus frameworks such as Morphia and the MEAN stack]
Development – The Past

[Diagram: APPLICATION ↔ OBJECT-RELATIONAL MAPPING ↔ RELATIONAL DATABASE, each layer with its own artifacts to maintain: { CODE }, XML CONFIG, DB SCHEMA]
Development – With MongoDB

[Diagram: the same stack with the object-relational mapping, XML config, and separate DB schema struck out; only { CODE } remains, talking directly to the database]
MongoDB is Full-Featured

Rich Queries – Find Paul's cars; find everybody in London with a car built between 1970 and 1980
Geospatial   – Find all of the car owners within 5 km of Trafalgar Sq.
Text Search  – Find all the cars described as having leather seats
Aggregation  – Calculate the average value of Paul's car collection
Map Reduce   – What is the ownership pattern of colors by geography over time (is purple trending in China?)
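Hedged mongo-shell sketches of these features against the 'customers' collection from the earlier slides (the 'cars.description' field and the Trafalgar Square coordinates are assumptions):

// Rich query: everybody in London with a car built between 1970 and 1980.
db.customers.find({ city: 'London', 'cars.year': { $gte: 1970, $lte: 1980 } })

// Geospatial: car owners within 5 km of Trafalgar Square (needs a 2dsphere index).
db.customers.createIndex({ location: '2dsphere' })
db.customers.find({
  location: {
    $nearSphere: {
      $geometry: { type: 'Point', coordinates: [-0.128, 51.508] },
      $maxDistance: 5000
    }
  }
})

// Text search: cars described as having leather seats (needs a text index).
db.customers.createIndex({ 'cars.description': 'text' })
db.customers.find({ $text: { $search: 'leather seats' } })

// Aggregation: the average value of Paul's car collection.
db.customers.aggregate([
  { $match: { first_name: 'Paul' } },
  { $unwind: '$cars' },
  { $group: { _id: '$first_name', avg_value: { $avg: '$cars.value' } } }
])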
Dynamic Lookup
• Combine data from multiple collections with left outer joins for richer analytics & more flexibility in data modeling
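A minimal sketch of such a left outer join with the $lookup aggregation stage (collection and field names are assumptions):

db.customers.aggregate([
  { $lookup: {
      from: 'orders',               // the collection to join against
      localField: '_id',            // field in 'customers'
      foreignField: 'customer_id',  // field in 'orders'
      as: 'orders'                  // output array of matching order documents
  } }
])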
Model of the Aggregation Framework
Analytics & BI Integration
Replica Sets
• Replica set – 2 to 50 copies• Replica sets make up a self-healing ‘shard’• Data center awareness• Replica sets address:
• High availability• Data durability, consistency• Maintenance (e.g., HW swaps)• Disaster Recovery
Application
Driver
Primary
Secondary
Secondary
Replication
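A hedged sketch of how an application targets a replica set (host names and the set name 'rs0' are assumptions); the driver discovers the topology and fails over automatically:

// Connection string listing seed members of the set:
// mongodb://node1:27017,node2:27017,node3:27017/edm?replicaSet=rs0

// Optionally route stale-tolerant reads (e.g., reporting) to secondaries:
db.getMongo().setReadPref('secondaryPreferred')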
Query Routing
• Multiple query optimization models
• Each of the sharding options is appropriate for different apps / use cases
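A minimal sketch of enabling sharding for a collection (database, collection, and shard key are assumptions):

// Run against a mongos router.
sh.enableSharding('edm')
// Range sharding on customer_id; a hashed key ({ customer_id: 'hashed' }) is the other common option.
sh.shardCollection('edm.raw_data', { customer_id: 1 })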
3.2 Features Relevant for EDM
• WiredTiger operational engine
• In-memory storage engine
• Encryption
• Compass (data viewer & query builder)
• Connector for BI (visualization)
• Connector for Hadoop
• Connector for Spark
• $lookup (left outer join)
Enterprise Data Management Pipeline

[The pipeline diagram from earlier is repeated: sources (siloed databases, external batch feeds, streams) → pub-sub, ETL, file imports, stream processing → Store raw data → Transform → Aggregate → Analyze → users and other systems]
How do you choose the data management layer for each of these stages?
Processing Layer – which store fits?

Document database (MongoDB) – when you want:
1. Secondary indexes
2. Sub-second latency
3. Aggregations in the DB
4. Updates of data

File system (e.g. HDFS) – for:
1. Scanning files
2. When indexes are not needed

Wide column store (e.g. HBase) – for:
1. Primary key queries
2. When multiple indexes & slices are not needed
3. Workloads optimized for writing, not reading
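A hedged sketch of the four document-database capabilities listed above (collection and field names are assumptions):

// 1. Secondary index on non-key fields.
db.raw_data.createIndex({ customer_id: 1, event_time: -1 })

// 2. Sub-second retrieval of one customer's slice via that index.
db.raw_data.find({ customer_id: 12345 }).sort({ event_time: -1 }).limit(10)

// 3. Aggregation executed inside the database.
db.raw_data.aggregate([
  { $match: { customer_id: 12345 } },
  { $group: { _id: '$customer_id', events: { $sum: 1 } } }
])

// 4. In-place update of a frequently changing value.
db.raw_data.updateOne({ customer_id: 12345 }, { $inc: { balance: -42 } })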
Data Store for Raw Dataset

[Pipeline diagram repeated, highlighting the 'Store raw data' stage]
Write requirements (Store raw data):
- Typically just writing record-by-record from source data
- Usually just need high write volumes
- All 3 options handle that

Read requirements (Transform):
- Benefits to reading multiple datasets sorted [by index], e.g. to do a merge
- Might want to look up across tables with indexes (and join functionality in MongoDB v3.2)
- Want high read performance while writes are happening

Interactive querying on the raw data could use indexes with MongoDB
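A hedged sketch of both sides of this stage (collection names and record shape are assumptions): record-by-record ingestion at high volume, then an index-sorted read a merge step could consume:

// 'batchOfRecords' stands in for incoming source records (hypothetical shape).
var batchOfRecords = [
  { customer_id: 12345, source: 'crm', event_time: ISODate('2017-01-05T10:00:00Z') },
  { customer_id: 67890, source: 'web', event_time: ISODate('2017-01-05T10:00:01Z') }
]
// Unordered bulk insert: high write volume, the server can parallelize.
db.raw_feed_a.insertMany(batchOfRecords, { ordered: false })

// Index-sorted read so a downstream merge can consume feeds in key order.
db.raw_feed_a.createIndex({ customer_id: 1 })
db.raw_feed_a.find().sort({ customer_id: 1 })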
Data Store for Transformed Dataset

[Pipeline diagram repeated, highlighting the 'Transform' and 'Aggregate' stages]
Write requirements (Transform):
- Often benefits to updating data while merging multiple datasets

Read requirements (Aggregate):
- Benefits to using indexes for grouping
- Aggregations natively in the DB would help
- With indexes, can do aggregations on slices of data
- Might want to look up across tables with indexes to aggregate

Dashboards & reports can have sub-second latency with indexes
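Hedged sketches of the two points above (collection and field names are assumptions): upserts while merging datasets, and an indexed aggregation over a slice:

// Merge: upsert keyed on customer_id, so re-processing a dataset updates in place.
db.transformed.updateOne(
  { customer_id: 12345 },
  { $set: { segment: 'gold' }, $inc: { total_spend: 250 } },
  { upsert: true }
)

// Aggregate a slice: the index on 'region' keeps this sub-second for dashboards.
db.transformed.createIndex({ region: 1 })
db.transformed.aggregate([
  { $match: { region: 'EMEA' } },
  { $group: { _id: '$segment', customers: { $sum: 1 }, spend: { $sum: '$total_spend' } } }
])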
Data Store for Aggregated Dataset

[Pipeline diagram repeated, highlighting the 'Aggregate' and 'Analyze' stages]
Read requirements (Analyze):
- For scanning all of the data, any of the data stores could work
- Often want to analyze a slice of data (using indexes)
- Querying on slices is best in MongoDB

Dashboards & reports can have sub-second latency with indexes
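A hedged sketch of materializing the aggregate step's output so the analyze step can scan or slice it (collection and field names are assumptions; $out writes a pipeline's results to a collection):

// Pre-compute daily aggregates and write them to their own collection.
db.transformed.aggregate([
  { $group: { _id: { day: '$day', region: '$region' }, revenue: { $sum: '$total_spend' } } },
  { $out: 'daily_aggregates' }
])

// Analysts then query a slice instead of rescanning everything.
db.daily_aggregates.find({ '_id.region': 'EMEA' })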
Data Store for Last Dataset

[Pipeline diagram repeated, highlighting the final 'Analyze' stage feeding users and other systems]
Last stage (Analyze → Users):
- At the last step, there are many consuming systems and users
- They need expressive querying with secondary indexes
- MongoDB is the best option for the publication or distribution of analytical results and for operationalizing data

Other systems are often digital applications:
- High scale
- Expressive querying
- JSON preferred
- Often RESTful services and APIs

Dashboards & reports can have sub-second latency with indexes
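A hedged sketch of the publication side (collection and field names such as 'single_view' and 'churn_score' are assumptions): a digital app or RESTful service issues an expressive, indexed query and gets JSON back directly:

db.single_view.createIndex({ customer_id: 1 })
db.single_view.find(
  { customer_id: 12345 },
  { _id: 0, customer_id: 1, segment: 1, churn_score: 1 }  // lean projection for the API payload
)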
The Complete EDM Architecture

[Diagram: siloed source databases, external feeds (batch), and streams enter the data processing pipeline via pub-sub, ETL, file imports, and stream processing; the Data Lake runs distributed processing on all data or on slices, with governance deciding where to load and process data. MongoDB provides the optimal location for operational response times & slices, feeding downstream systems: a Single CSR Application, Unified Digital Apps, Operational Reporting, Analytic Reporting via drivers & stacks, and analytics such as Customer Clustering, Churn Analysis, and Predictive Analytics]
Example scenarios
1. Single Customer View
   a. Operational
   b. Analytics on customer segments
   c. Analytics on all customers
2. Customer profiles & clustering
3. Presenting churn analytics on high-value customers
Single View of Customer
Spanish bank replaces Teradata and MicroStrategy to increase business and avoid significant cost

Problem:
- Took days to implement new functionality and business policies, inhibiting revenue growth
- Branches needed an app providing a single view of the customer and real-time recommendations for new products and services
- Multi-minute latency for accessing customer data stored in Teradata and MicroStrategy

Solution:
- Built a single view of the customer on MongoDB – a flexible, scalable app that is easy to adapt to new business needs
- Super-fast ad hoc query capabilities (milliseconds) and real-time analytics thanks to MongoDB's Aggregation Framework
- Can now leverage distributed infrastructure and commodity hardware for lower total cost of ownership and greater availability

Results:
- Cost avoidance of $10M+
- Application developed and deployed in less than 6 months; new business policies easily deployed and executed, bringing new revenue to the company
- Current capacity allows branches to load all customer info in milliseconds, providing a great customer experience

Large Spanish Bank
Case Study
Insurance leader generates coveted single view of customers in 90 days – “The Wall”

Problem:
- No single view of the customer, leading to poor customer experience and churn
- 145 years of policy data, 70+ systems, 15+ apps that are not integrated
- Spent 2 years and $25M trying to build a single view with Oracle – failed

Solution:
- Built “The Wall”, pulling in disparate data and serving a single view to customer service reps in real time
- Flexible data model to aggregate disparate data into a single data store
- Churn analysis done with Hadoop, with relevant results output to MongoDB

Results:
- Prototyped in 2 weeks
- Deployed to production in 90 days
- Decreased churn and improved ability to upsell/cross-sell
Kicking Out Oracle
Global bank with 48M customers in 50 countries terminates its Oracle ULA and makes MongoDB its database of choice

Problem:
- Slow development cycles due to the RDBMS's rigid data model, hindering the ability to meet business demands
- High TCO for hardware, licenses, development, and support (>$50M Oracle ULA)
- Poor overall performance of customer-facing and internal applications

Solution:
- Building dozens of apps on MongoDB, both net new and migrations from Oracle – e.g., a significant portion of retail banking, including customer-facing and back-office apps, fraud detection, card activation, and equity research content management
- Flexible data model to develop apps quickly and accommodate diverse data
- Ability to scale infrastructure and costs elastically

Results:
- Able to cancel the Oracle ULA; evaluating which apps can be migrated to MongoDB; for new apps, MongoDB is the default choice
- Apps built in weeks instead of months or years, e.g., an e-banking app prototyped in 2 weeks and in production in 4 weeks
- 70% TCO reduction

Top 15 Global Bank