Presentation: Big Data Demystified

Jul 07, 2015

Engineering

ALTBrasov

Big Data: what is it? What opportunities does it offer? How can we use it? Find the answers in this presentation given by Pentalog at the ALT Festival, organized by the ALT Brasov Cluster for Innovation and Technology.
http://www.altbrasov.org/
Transcript
Page 1: Presentation: Big Data Demystified
Nov 2014 – Big Data, slide 2 of 53

What is Big Data?

* Data so large and complex that it becomes difficult to process with traditional systems

* Term first coined in 1997, in a NASA report

* Petabytes and Exabytes of data

Big data is everywhere

* Every 2 days we create as much information as we did from the beginning of time until 2003

* Google processes over 40 thousand search queries per second, making it over 3.5 billion in a single day.

* Around 100 hours of video are uploaded to YouTube every minute and it would take you around 15 years to watch every video uploaded by users in one day

* Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand Tweets, and upload 200,000 photos to Facebook

* Trillions of sensors monitor, track, and communicate with each other, populating the IoT with real-time data
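
Two of the figures above can be sanity-checked with simple arithmetic. The sketch below assumes the stated rates hold constant for a full day and that a year of viewing means watching around the clock; the variable names are illustrative only.

```python
# Back-of-the-envelope checks for two of the figures on the slide.
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

# 40,000 searches per second, sustained for a full day.
searches_per_day = 40_000 * SECONDS_PER_DAY

# 100 hours of video uploaded per minute, sustained for a full day.
uploaded_hours_per_day = 100 * MINUTES_PER_DAY

# Years of nonstop viewing needed to watch one day of uploads.
years_to_watch = uploaded_hours_per_day / (24 * 365)

print(f"{searches_per_day:,} searches per day")
print(f"{years_to_watch:.1f} years to watch one day of uploads")
```

At exactly 40,000 queries per second the daily total is about 3.46 billion, so the slide's "over 3.5 billion" implies a rate somewhat above 40,000. The viewing time comes out near 16 years, in line with the slide's "around 15 years".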

Big data is not new

Characteristics

Volume

* More data == better model

* Scalable storage and a distributed approach to querying

Variety

* Big data includes all data

* Data no longer fits into neatly structured tables

Velocity

* Frequency at which data is generated, captured, stored and processed

* Need for real-time processing

Data sources

Importance of Big Data

* Media

* Retailing

* Public service

* Health

* Industry

Importance of Big Data

* Gaining a more complete understanding of the business
→ customers
→ products
→ competitors

* Which can lead to efficiency improvements
→ increased sales
→ lower costs
→ better customer service
→ improved products

The problem

* Overall information available
→ 10% structured data: used in decision making
→ 90% unstructured data: wasted, not captured or analyzed

* Valuable information vs. data that is best left ignored

* 37.5% of large organizations said that analyzing big data is their biggest challenge

* More than 90% said that Big Data is a top-ten priority

It's not only the size

* Collect -> Analyze -> Understand -> Generate Value

* Find meaning

* Find interconnections

* Find hidden data

Purpose

* Take more precise actions that bring value and reduce costs

* Make the right decision within the right amount of time

How big will big data get?

* From 3.2 zettabytes today to 40 zettabytes in only six years

* More than 30 billion devices will be wirelessly connected by 2020
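
To put the projection in perspective, the implied yearly growth can be worked out as a compound annual growth rate. This is a rough sketch that assumes smooth exponential growth between the two figures on the slide; the variable names are illustrative.

```python
# Implied compound annual growth rate between the two data-volume figures.
start_zb = 3.2   # zettabytes "today" (per the slide)
end_zb = 40.0    # zettabytes six years later (per the slide)
years = 6

growth_factor = end_zb / start_zb               # overall multiplier
annual_rate = growth_factor ** (1 / years) - 1  # yearly growth, as a fraction

print(f"Overall growth: {growth_factor:.1f}x")
print(f"Implied annual growth: {annual_rate:.1%}")
```

Under that assumption the volume grows 12.5 times overall, which corresponds to roughly 52% growth per year.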

Challenges

* Storing data

* Analysis

* Search

* Sharing

* Transfer

* Visualization

NoSQL and Big Data Analytics

* Storing data

* Distribution

* Processing

NoSQL

* Scalability / cluster friendly

* Availability / fault tolerance

* Schema-less

* Low latency

* High performance

* Open source

Dynamic scaling

* Adding/removing nodes dynamically
→ storage/performance capacity can grow or shrink as needed

Auto-sharding

* Natively and automatically spread data across servers

* Data and query load automatically balanced across servers
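
The idea of spreading keys across servers can be illustrated with a minimal hash-based sketch. It is not how any particular NoSQL product implements sharding: the `shard_for` function, the key format and the shard count are all hypothetical. Production systems typically use consistent hashing or range partitioning, because plain modulo hashing moves most keys whenever the number of shards changes.

```python
import hashlib
from collections import Counter

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Distribute 10,000 made-up keys over 4 shards and count the load per shard.
keys = [f"user:{i}" for i in range(10_000)]
load = Counter(shard_for(key, 4) for key in keys)

for shard, count in sorted(load.items()):
    print(f"shard {shard}: {count} keys")
```

Because the hash spreads keys roughly uniformly, each shard ends up with close to a quarter of the data, which is the balancing effect the slide describes.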

Replication

* Support for automatic replication
→ high availability
→ disaster recovery
→ no need for separate applications to manage these tasks
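
A toy model can show why replication gives high availability. The `ReplicatedStore` class below is a hypothetical in-memory sketch, not the API of any real database: each write goes to several nodes, so a read still succeeds after one of them fails.

```python
import zlib

class ReplicatedStore:
    """Toy in-memory store that writes every key to several nodes."""

    def __init__(self, node_names, replication_factor=3):
        self.nodes = {name: {} for name in node_names}
        self.alive = set(node_names)
        self.replication_factor = replication_factor

    def replicas(self, key):
        # Pick consecutive nodes, starting from a position derived from the key.
        names = sorted(self.nodes)
        start = zlib.crc32(key.encode("utf-8")) % len(names)
        count = min(self.replication_factor, len(names))
        return [names[(start + i) % len(names)] for i in range(count)]

    def put(self, key, value):
        # Write to every replica that is currently reachable.
        for name in self.replicas(key):
            if name in self.alive:
                self.nodes[name][key] = value

    def get(self, key):
        # Read from the first reachable replica that holds the key.
        for name in self.replicas(key):
            if name in self.alive and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)

    def fail(self, name):
        # Simulate a node going down.
        self.alive.discard(name)

store = ReplicatedStore(["node-a", "node-b", "node-c", "node-d"])
store.put("order:42", {"total": 99})
store.fail(store.replicas("order:42")[0])  # lose the first replica
print(store.get("order:42"))               # still served by another replica
```

The sketch ignores re-replication and conflict resolution, which real systems handle automatically.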

Schemaless

* No predefined schema

* Insertion of aggregates
→ puts together data that is commonly accessed together
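
As an illustration of both points, the sketch below uses plain Python dictionaries as a stand-in for a schemaless collection. The field names and the `order_total` helper are made up for the example; each order is stored as one aggregate that embeds its customer and line items, and the two records do not share the same set of fields.

```python
# A list of dicts stands in for a schemaless collection (hypothetical data).
orders = []

# Each order is one aggregate: customer and line items live inside the record.
orders.append({
    "_id": 1,
    "customer": {"name": "Ana", "city": "Brasov"},
    "items": [
        {"sku": "A1", "qty": 2, "price": 10.0},
        {"sku": "B7", "qty": 1, "price": 5.5},
    ],
})

# No predefined schema: this record omits "city" and adds a new field.
orders.append({
    "_id": 2,
    "customer": {"name": "Dan"},
    "items": [{"sku": "C3", "qty": 1, "price": 20.0}],
    "gift_note": "Happy birthday!",
})

def order_total(order):
    """Compute the total from the embedded line items, with no joins needed."""
    return sum(item["qty"] * item["price"] for item in order["items"])

totals = {order["_id"]: order_total(order) for order in orders}
print(totals)
```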

NoSQL flavors

* Key-value store
→ Amazon DynamoDB, Redis
→ Content caching (focus on scaling to huge amounts of data, designed to handle massive load), logging, etc.

* Document store
→ CouchDB, MongoDB
→ Web applications

* Column family store
→ Cassandra, HBase
→ Distributed file systems

* Graph store
→ Neo4j, InfoGrid, InfiniteGraph
→ Social networking, recommendations (focus on modeling the structure of data: interconnectivity)
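
The four data models can be contrasted with plain Python structures. This is only a conceptual sketch with made-up data and helper names; it does not use or imitate the client API of any product listed above.

```python
# Key-value store: opaque values looked up by key only.
key_value = {"session:ana": "cart=A1,B7"}

# Document store: nested, self-describing records that can be filtered by field.
documents = [
    {"_id": 1, "name": "Ana", "tags": ["admin"]},
    {"_id": 2, "name": "Dan"},
]
admins = [doc["name"] for doc in documents if "admin" in doc.get("tags", [])]

# Column family store: row key -> column family -> column -> value.
column_families = {
    "ana": {"profile": {"city": "Brasov"}, "stats": {"logins": 12}},
    "dan": {"profile": {"city": "Cluj"}},
}

# Graph store: nodes plus explicit, typed relationships.
edges = [("ana", "FOLLOWS", "dan"), ("dan", "FOLLOWS", "ioana")]

def neighbors(node, relation):
    """Return the nodes reachable from `node` over one edge of type `relation`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

def friends_of_friends(node):
    """Two-hop traversal, the kind of query graph stores are built for."""
    return [fof for friend in neighbors(node, "FOLLOWS")
            for fof in neighbors(friend, "FOLLOWS")]

print(key_value["session:ana"])
print(admins)
print(column_families["ana"]["stats"]["logins"])
print(friends_of_friends("ana"))
```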

Reasons for choosing NoSQL

* Working on large amounts of data

* Scaling out with ease

* Need for:
→ high availability
→ low-latency systems with eventual consistency

* Model fits aggregates:
→ as a natural choice
→ structure changes over time

… and associates

What is Hadoop?

● Distributed file system

● Distributed processing system

● Batch / offline oriented

● Open source

In the beginning...

● Created by Doug Cutting and Mike Cafarella

● Intended as distribution support for …

● Built based on Google's MapReduce and Google File System papers

Who uses Hadoop?

Most notable users are …

+ many others

Hadoop in the real world

● Recommendation systems

● Data warehousing

● Financial analysis

● Market research/forecasting

● Log analysis

● Threat analysis

● Image processing

● Social networking

● Advertising

Why Hadoop?

● Scalable

● Cost effective

● Flexible

● Efficient

● Resilient to failure

● Schema on read
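
"Schema on read" means raw data is stored as-is and a structure is applied only when it is read. The sketch below illustrates the idea with the standard `csv` module. The sample lines, field names and the `read_with_schema` helper are assumptions made for the example, not part of Hadoop.

```python
import csv

# Raw lines are stored untouched; nothing validates them at write time.
raw_store = [
    "2014-11-01,login,ana",
    "2014-11-01,purchase,dan,19.99",
    "2014-11-02,login,dan",
]

def read_with_schema(lines, fields):
    """Apply a schema at read time, padding missing trailing fields with None."""
    for row in csv.reader(lines):
        padded = row + [None] * (len(fields) - len(row))
        yield dict(zip(fields, padded))

# The same raw data could later be read with a different field list.
events = list(read_with_schema(raw_store, ["date", "event", "user", "amount"]))
purchases = [event for event in events if event["event"] == "purchase"]
print(purchases)
```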

Why not Hadoop?

● Inefficient when used at small scale

● Not good for real-time systems

Hadoop major components

● Hadoop Common

● YARN

● HDFS

● MapReduce

Architecture (slides 34 to 38: diagrams only)

MapReduce

● Split input files

● Operate on key/value pairs

● Mappers filter & transform input data

● Reducers aggregate the mappers' output

● Move code to data
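
The steps above can be sketched as a word count in plain Python. This single-process version only imitates the programming model. It does not use the Hadoop API, and "move code to data" is not demonstrated because nothing is distributed. The function names (`mapper`, `shuffle`, `reducer`, `run_job`) are illustrative.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: aggregate the values collected for one key."""
    return key, sum(values)

def run_job(splits):
    """Run map, shuffle and reduce over a list of input splits."""
    mapped = (pair for split in splits for line in split for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())

# Two input splits, as if the input file had been cut in two.
splits = [["big data is big"], ["data is everywhere"]]
counts = run_job(splits)
print(counts)
```

In a real cluster each split would be processed by a mapper running on the node that stores it, and the shuffle would move grouped keys across the network to the reducers.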

… and associates

Apache Ambari

The project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters.

Apache Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

Apache Hive

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.

Apache Chukwa

Apache Chukwa is a data collection system for monitoring large distributed systems. Chukwa comes with a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Apache Avro

A remote procedure call and data serialization framework

Apache HBase

Apache HBase offers random, real-time read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.

Apache Mahout

The Apache Mahout™ project's goal is to build a scalable machine learning library

Apache Spark

Apache Spark™ is a fast and general engine for large-scale data processing

Apache ZooKeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Big data – in the future

● 87% of enterprises believe Big Data analytics will redefine the competitive landscape of their industries within the next three years

● 89% believe that companies that do not adopt a Big Data analytics strategy in the next year risk losing market share and momentum.

Thank you!