Morning Tech#1 BigData - Oxalide Academy

Morning Tech #1 – Big Data, 15 December 2016 – Ludovic Piot

Oxalide events

Apérotech

• Objective: presentation of a business or technical topic
• Open to all: 80 to 100 people
• Format: 1 evening per quarter, from 6 pm to 9 pm
• Introduction of the topic by a partner
• Round table with customers and non-customers
• Informal discussion over a buffet-style aperitif

Workshop

• Objective: presentation of a technology
• Customers only: technical audience with laptops, 30 people
• Format: 1 morning per quarter, from 9 am to 1 pm
• Presentation of the technology
• Tutorial on configuration from the command line

Morning Tech

• Objective: presentation of a business or technical topic
• Customers only: 30 people
• Format: 1 morning per quarter, from 9 am to 12 pm
• Big picture
• Demonstration and lessons learned

The speakers

Ludovic Piot

Consulting / Architecture / DevOps @ Oxalide

@lpiot

Oxalide is hiring!

Contact us at [email protected]

Challenges & trends

SoLoMo and IoT: the data explosion

SOcial

LOcal

MObile

IoT: the data explosion

Copyright © 2014, Hortonworks, Inc. All rights reserved.

Enterprise Data Trends @ Scale

The volume of data that is available for analysis is transforming organizations, as well as the entire IT industry. Everyone is seeing data external to an organization as becoming just as strategic as internal data. Semi-structured and unstructured data volume is beginning to dwarf the traditional data in relational databases and data warehouses.

• Facebook has a warehouse of around 50 PB, and it is constantly growing.
• Twitter messages are 140 bytes each, generating 8 TB of data per day.
• Data is more than doubling every year.
• Almost 80% of data will be unstructured data.
• Netflix: 75% of streaming video results from recommendations.
• Amazon: 35% of product sales come from product recommendations.

Organizations are redefining data strategies due to the requirements of the evolving Enterprise Data Warehouse (EDW).

Enterprise Data

VoIP

Machine Data

Social Media

The 3 Vs: Gartner's dimensions

• Volume: the volume of data created and managed is constantly increasing (+59% per year in 2011).
• Variety: the types of data collected are highly varied (text, sound, images, logs…). Processing tools must take this diversity into account.
• Velocity: data must be usable as fast as it is collected. It has to be used quickly, or it has no value.

The 2 emerging new Vs:

• Veracity: a dimension that adds a notion of data quality for the business.
• Visibility: stresses that data must be accessible to the business so that decisions can be made quickly.

Evolution of Big Data trends

batch, real time, predictive

reports, alerts, forecasts

Principles

Big Data vs. traditional data management


Traditional Systems vs. Hadoop

Hadoop is not designed to replace existing relational databases or data warehouses. Relational databases are designed to manage transactions, and much of their functionality is built around managing transactions. They are based upon schema-on-write. Organizations have spent years building Enterprise Data Warehouses (EDW) and reporting systems for their traditional data. The traditional EDWs are not going anywhere either. EDWs are also based on schema-on-write.

Hadoop is not:

• Relational
• NoSQL
• Real-time
• A database

Hadoop is a data platform that complements existing data systems. Hadoop is designed for schema-on-read and can handle the large data volumes coming from semi-structured and unstructured data. With the low cost of storage on Hadoop, organizations are looking at using Hadoop more for archiving.

SCALE (storage & processing): Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Distribution

             | Traditional systems                    | Hadoop
schema       | Required on write                      | Required on read
speed        | Reads are fast                         | Writes are fast
governance   | Standards and structured               | Loosely structured
processing   | Limited, no data processing            | Processing coupled with data
data types   | Structured                             | Multi and unstructured
best fit use | Interactive OLAP Analytics,            | Data Discovery,
             | Complex ACID Transactions,             | Processing unstructured data,
             | Operational Data Store                 | Massive Storage/Processing
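The difference between schema-on-write and schema-on-read can be illustrated with a minimal Python sketch. The `SCHEMA` dictionary, the field names and both helper functions are hypothetical and only serve the illustration; they are not taken from Hadoop or from any database API.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"sensor_id": str, "temperature": float}


def write_with_schema(table, record):
    """Schema-on-write: validate before storing, reject non-conforming records."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: raw text is stored as-is, the schema is applied when reading."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            yield {field: cast(record[field]) for field, cast in SCHEMA.items()}
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that do not fit the schema applied at read time


# Schema-on-write: the bad record is rejected at write time.
table = []
write_with_schema(table, {"sensor_id": "a1", "temperature": 21.5})
try:
    write_with_schema(table, {"sensor_id": "a2", "temperature": "hot"})
except ValueError:
    pass

# Schema-on-read: everything is stored, interpretation happens at read time.
raw_store = ['{"sensor_id": "a1", "temperature": 21.5}', "not json at all",
             '{"sensor_id": "a3", "temperature": "19"}']
parsed = list(read_with_schema(raw_store))
```

Writes are cheap in the second model because nothing is checked on the way in; the cost of interpreting the data is paid by each reader.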

Distributed storage


Data Integrity – Writing Data

High-performing applications stream data to files. HDFS does this as well: the HDFS client caches packets of data in memory. Once that data reaches the HDFS block size, the client notifies the NameNode. The NameNode provides the DataNode information and the locations for the block replicas. The client then streams the packets of data to the first targeted DataNode. Replication is performed in a pipeline fashion: the first DataNode starts writing the block and then transfers that data to the second DataNode. The second DataNode starts sending the data to the third DataNode, and so on.

When the blocks in a directory reach a defined limit, which is controlled via dfs.datanode.numblocks, the DataNode defines a new subdirectory. After defining the subdirectory, it starts placing new data blocks and the corresponding metadata in that subdirectory. This is performed using a fan-out structure, ensuring no single directory is overloaded with files or becomes too deep.

Diagram: Data Integrity – Writing Data (data pipeline: DataNode 1, DataNode 4, DataNode 12)

1. Client to NameNode: "I want to write a block of data."
2. NameNode to client: "OK, please use DataNodes 1, 4, 12."
3. Data + checksum
4. Data and checksum; verify checksum
5. Success! (shown twice)
6. Success!
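The pipelined write with checksum verification can be sketched as a small in-memory simulation. This is a toy model, not the HDFS client API: the `DataNode` class, `write_block` and the 4-byte packet size are hypothetical, the NameNode's answer is hard-coded, and `zlib.crc32` stands in for whatever checksum the real cluster uses.

```python
import zlib


class DataNode:
    """Toy DataNode: verifies the checksum, stores the packet, forwards it downstream."""

    def __init__(self, name):
        self.name = name
        self.stored = []

    def receive(self, packet, checksum, downstream):
        if zlib.crc32(packet) != checksum:
            return False  # corrupted packet: report failure upstream
        self.stored.append(packet)
        if downstream:  # forward to the next DataNode in the pipeline
            return downstream[0].receive(packet, checksum, downstream[1:])
        return True  # last replica written: success flows back up the pipeline


def write_block(data, pipeline, packet_size=4):
    """Client side: split the block into packets and stream each with its checksum."""
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        if not pipeline[0].receive(packet, zlib.crc32(packet), pipeline[1:]):
            return False
    return True


# The NameNode's answer ("please use DataNodes 1, 4, 12") is hard-coded here.
pipeline = [DataNode("DataNode 1"), DataNode("DataNode 4"), DataNode("DataNode 12")]
success = write_block(b"hello hdfs", pipeline)
```

The client only ever talks to the first DataNode; each node forwards the packet to the next one, so the replication traffic is spread along the pipeline instead of being sent three times by the client.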

The CAP theorem

Map/Reduce


MapReduce

The original use case for Hadoop was distributed batch processing. MapReduce is a powerful application paradigm for processing massive amounts of data. Core features of MapReduce are:

• Co-locating processing with data blocks: take the computing to where the data lives, rather than querying or reading data into a remote application. Would you rather move hundreds of GB/TB of data around your network, or would you rather move an application that processes the same data to where the data actually lives?
• Map Phase: this is the initial phase of all MapReduce jobs. This is where raw data can be read, extracted and transformed, and where results are written out to HDFS or moved on to Reducers for aggregate processing, such as a final count, sum, min, max, etc. The Map phase can also be thought of as the ETL or projection step for MapReduce.
• Reduce Phase: this is the final phase, where data is sorted on a user-defined key and grouped by that same key. The Reducer has the option to perform an…

Diagram: MapReduce

• Map Phase: three Mappers, each running on an NM + DN node
• Shuffle/Sort: data is shuffled across the network and sorted
• Reduce Phase: two Reducers, each running on an NM + DN node
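The three phases above can be sketched on a single machine with a word count, the usual introductory example. This is plain Python, not the Hadoop API; the function names and the input lines are made up for the illustration, and a real job would run the mappers and reducers in parallel on the nodes holding the data.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word of an input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_sort(pairs):
    """Shuffle/sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: aggregate all values of one key (here, a sum)."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle_sort(mapped))
```

Because each mapper only needs its own line and each reducer only needs the values of its own key, both phases can be distributed without any shared state; only the shuffle moves data across the network.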

The latency table

The Big Data pipeline

data → ingest / collect → store → process → analyse → answers

Time to answer (latency)

Throughput

Cost

The Lambda Architecture


Defining Data Layers

There are multiple ways of organizing data in an Enterprise Data Warehouse, and the same goes for Hadoop.

One way is the Lambda Architecture, which defines different data layers. A Hadoop cluster can work by itself or be integrated with HBase and other EDWs and ODSs to build different data layers that meet the data needs of an organization.

The process of building different data layers is a familiar concept within data warehousing and analytics. The data layers are built in a Hadoop cluster for the same reason they have been built in data warehouses for the last 30 years: to facilitate speed. There are 3 data layers:

• Batch Layer: immutable master data set (source of truth). Used to create views for the batch layer.
• Serving Layer: contains pre-computed views.
• Speed Layer: contains additional levels of pre-computed views, structures and indexes to reduce the latency that exists in the serving layer.

Diagram: Defining Data Layers

• Batch Layer: Extract & Load
• Serving Layer: Standardize, Cleanse, Integrate, Filter, Transform
• Speed Layer: Conform, Summarize, Access

• Organize data based on source/derived relationships
• Allows for fault and rebuild process
• There are lots of different ways of organizing data in an enterprise data platform that includes Hadoop.
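One common reading of the three layers is that a query combines a view precomputed by the batch job with a small, fresh view maintained by the speed layer. The sketch below follows that reading, which goes slightly beyond the definitions on the slide; the page-view counter, the event format and all function names are hypothetical.

```python
from collections import Counter

master_dataset = []      # batch layer: immutable, append-only source of truth
batch_view = Counter()   # serving layer: view precomputed by the batch job
speed_view = Counter()   # speed layer: increments for events not yet batched


def ingest(event):
    """New events go to the master data set and update the speed view at once."""
    master_dataset.append(event)
    speed_view[event["page"]] += 1


def run_batch():
    """Recompute the batch view from the full master data set, then reset the speed view."""
    batch_view.clear()
    batch_view.update(event["page"] for event in master_dataset)
    speed_view.clear()


def query(page):
    """Answer = precomputed batch view + recent increments from the speed view."""
    return batch_view[page] + speed_view[page]


ingest({"page": "/home"})
ingest({"page": "/home"})
run_batch()                 # both events are now covered by the batch view
ingest({"page": "/home"})   # only visible through the speed view until the next batch
total = query("/home")
```

Since the batch view is always rebuilt from the immutable master data set, a bug in the view logic can be fixed and the view recomputed, which matches the "fault and rebuild process" mentioned on the slide.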

Ecosystem

Evolution of Big Data processing

Dataflow

Dataproc

BigQuery

BigTable

Cloud SQL

Cloud Pub/Sub

Demo Time

Amazon S3

http://bit.ly/2grJMMf

Shard 0

Amazon Kinesis

Amazon Cognito

Amazon EC2

R Shiny-Server

https://github.com/lpiot/amazon-kinesis-IoT-sensor-demo
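The demo stream has a single shard ("Shard 0"). As a hedged illustration of what changes with more shards: Kinesis routes each record by taking the MD5 hash of its partition key and finding the shard whose hash-key range contains it. The sketch below reproduces that routing locally without calling AWS. It assumes the 128-bit hash space is split evenly between shards, and the sensor names are hypothetical; it is not code from the demo repository.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for(partition_key, shard_count):
    """Return the index of the shard whose hash-key range contains the key's MD5 hash.

    Assumes the hash space is split evenly between the shards.
    """
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_value * shard_count // HASH_SPACE


sensor_ids = ["sensor-1", "sensor-2", "sensor-3"]  # hypothetical partition keys
single_shard = {key: shard_for(key, 1) for key in sensor_ids}
four_shards = {key: shard_for(key, 4) for key in sensor_ids}
```

With one shard every record lands on shard 0, as in the demo; with several shards, records sharing a partition key always land on the same shard, which preserves their order.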

Machine learning & deep learning

The data science approach

Machine Learning: supervised learning

• Dataset: labelled (with the answers)
• Learning objective:
• Regression (forecasting)
• Classification

Hypothesis and cost function

Goal: find a function h that faithfully represents the data.

Linear regression: $h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
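A minimal sketch of how such a hypothesis can be fitted, assuming a mean squared error cost and plain batch gradient descent (the slide names a cost function without giving its formula, so both choices are assumptions). The learning rate, step count and toy data are arbitrary.

```python
def h(theta, x):
    """Hypothesis: theta_0 + theta_1 * x_1 + ... + theta_n * x_n."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def cost(theta, xs, ys):
    """Mean squared error cost: J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((h(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, steps=2000):
    """Lower the cost step by step by moving theta against the gradient."""
    m, n = len(xs), len(xs[0])
    theta = [0.0] * (n + 1)
    for _ in range(steps):
        errors = [h(theta, x) - y for x, y in zip(xs, ys)]
        grad = [sum(errors) / m] + [
            sum(e * x[j] for e, x in zip(errors, xs)) / m for j in range(n)
        ]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 1 + 2 * x_1 (one feature, no noise).
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(xs, ys)
```

On this noise-free data the fitted coefficients approach theta_0 = 1 and theta_1 = 2, and the cost approaches zero.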

Machine Learning: unsupervised learning

• Dataset: unlabelled (without answers)
• Learning objective:
• Identify / detect structures in the data

Classification algorithms

Goal: find the algorithm that best distinguishes the structures in the data.
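As one possible illustration of detecting structure in unlabelled data, here is a minimal k-means sketch. The choice of k-means, the 2-D toy points and the initial centroids are assumptions made for the example; the slides do not name a specific algorithm.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points: assign to nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Unlabelled toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

No label is ever given to the algorithm; the two groups emerge only from the distances between points.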

Neural networks

• Based on how a brain works
• Non-linear hypothesis!
• Multi-class classification
• As before, we try to minimize the cost function by gradually adjusting the coefficients Θ(i)
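To make the "non-linear hypothesis" concrete, the sketch below runs forward propagation through a tiny 2-2-1 network that computes XNOR, a function no single linear unit can represent. The weights are hand-picked for the illustration rather than learned, so the cost minimization step is not shown; the layout of the weight matrices (one row per unit, bias first) is an assumption of this example.

```python
import math


def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, activations):
    """One layer: each row of theta is [bias, w_1, ..., w_n] for one unit."""
    return [g(row[0] + sum(w * a for w, a in zip(row[1:], activations))) for row in theta]


# Hand-picked (not learned) weights, one matrix per layer.
THETA_1 = [[-30.0, 20.0, 20.0],   # hidden unit 1: x1 AND x2
           [10.0, -20.0, -20.0]]  # hidden unit 2: (NOT x1) AND (NOT x2)
THETA_2 = [[-10.0, 20.0, 20.0]]   # output unit: OR of the two hidden units


def h(x1, x2):
    """Forward propagation through the 2-2-1 network; computes XNOR."""
    return layer(THETA_2, layer(THETA_1, [x1, x2]))[0]


outputs = {(x1, x2): round(h(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
```

In a trained network the same forward pass is used, but the entries of the weight matrices are found by minimizing the cost function instead of being chosen by hand.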

Questions?

Sources

• [6, 10] : Hortonworks : Operations Management with HDP

• [8, 11, 12] : http://www.slideshare.net/1Strategy/2016-utah-cloud-summit-big-data-architectural-patterns-and-best-practices-on-aws
