Real-Time Big Data with Apache Kafka, Spark Streaming, Scala, and Elasticsearch
By S Annu Ahmed (122N1A0573), V Indu Priyanka (122N1A0532), S Ravindra (122N1A0572), M Imran Basha (122N1A0556), P B Sravanthi (122N1A0558), B Baby Likhitha (122N1A0514)

Apr 13, 2017


Annu Ahmed


Contents:
• From Data Mining to Big Data
• Introduction to Data
• What is Big Data
• Hadoop
• Scala
• Spark Streaming
• Elasticsearch

From Data Mining to Big Data
• In the early 1990s, a buzzword called Data Mining appeared.
• Many years later, we have another one called Big Data.
• So, what's the difference?

Status of Data Mining and Machine Learning
• Over the years, we have developed all kinds of effective methods for classification, clustering, and regression.
• We also have good integrated tools for data mining (e.g., Weka, R, scikit-learn).
• However, mining useful information remains difficult for some real-world applications.

What's Big Data?
• Though many definitions are available, we consider the situation in which the data are larger than the capacity of a single computer.
• I think this is the main difference between data mining and big data.
• So in a sense we are talking about distributed data mining or machine learning.
[Figure: panels (a) and (b), distributed systems]

What is Data?
“A set of values that may be qualitative or quantitative in nature.”

What is Big Data?
“Data so large and voluminous that it overwhelms the existing data storage and processing infrastructure is said to be big enough to be called Big Data.”

What is Real-Time Big Data?
The demand for stream processing is increasing rapidly. Processing big volumes of data is often not enough: data has to be processed fast, so that a firm can react to changing business conditions in real time.

Parameters of big data:
• Huge amount of data
• Complex data, which includes a lot of unstructured data
• Speed at which data is generated

Big Data versus Fast Data
Big data is one of the most used buzzwords. You can best define it by thinking of three Vs: big data is not just about Volume, but also about Velocity and Variety.

Often, masses of structured and semi-structured historical data are stored in Hadoop (Volume + Variety). On the other side, stream processing is used for fast data requirements (Velocity + Variety). We focus on real-time and stream processing.

Challenges
• Capturing
• Privacy and security
• Data access and sharing information
• Duration Storage
• Search
• Analyzing & Visualization

What We Need?
• Fault tolerance
• Failure detection
• Low latency, distributed, data locality
• Data centers
• Partition awareness
• Elasticity
• Parallelism

Hadoop
Apache Hadoop is an open-source framework for distributed storage and processing of large data sets on commodity hardware. It includes HDFS (distributed storage) and MapReduce (distributed processing).
[Figure: Hadoop, made up of Hadoop HDFS and Hadoop MapReduce]
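
To make the MapReduce half of Hadoop more concrete, here is a minimal single-process sketch of the programming model, written in plain Python rather than against the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions and are not part of Hadoop.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster the map calls run in parallel across nodes;
    # in this sketch they run sequentially in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data is big", "fast data"]))
```

The point of the split is that map and reduce are independent per line and per key, which is what lets a framework spread them over many machines.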

Apache Hadoop Ecosystem:

Let's recall the basic concepts of a messaging system

Point-to-Point Messaging (Queue)

Publish-Subscribe Messaging (Topic)

Apache Kafka

Overview
• An Apache project initially developed at LinkedIn
• Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g. logs and metrics collections
• Written in Scala and Java

Features
• Persistent messaging
• High throughput
• Supports both queue and topic semantics
• Uses ZooKeeper for forming a cluster of nodes (producer/consumer/broker)
• and many more…

How it works

[Figure: Real-time transfer. Components: Producer, Kafka Broker, ZooKeeper, Consumer1 and Consumer2 (Group1), Consumer3 and Consumer4 (Group2). Arrow labels: get Kafka broker address, Streaming, Fetch messages, Update consumed message offset. Annotations: Queue Topology, Topic Topology.]

The broker does not push messages to consumers; consumers poll messages from the broker.
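
The pull model and the queue/topic semantics above can be sketched with a toy in-memory broker. This is a simplified illustration in plain Python, not the Kafka client API: the `Broker` class and its `publish` and `poll` methods are hypothetical, and it assumes a single topic with one offset per consumer group (real Kafka tracks offsets per partition).

```python
from collections import defaultdict

class Broker:
    """Toy stand-in for a broker holding one topic as an append-only log."""

    def __init__(self):
        self.log = []                    # messages persist after being read
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        self.log.append(message)

    def poll(self, group, max_messages=10):
        """Called by a consumer: fetch unread messages and advance the group's offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

broker = Broker()
for i in range(4):
    broker.publish(f"event-{i}")

# Queue semantics: two consumers share group1, so each message reaches only one of them.
c1 = broker.poll("group1", max_messages=2)
c2 = broker.poll("group1", max_messages=2)
# Topic semantics: group2 has its own offset, so it receives every message as well.
c3 = broker.poll("group2", max_messages=4)
```

Because the broker only stores the log and the offsets, each consumer decides when and how much to read, which is the design choice the slide's last sentence describes.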

About Apache Spark
• Initially started at UC Berkeley in 2009
• Fast and general-purpose cluster computing system
• Claimed to be up to 10x (on disk) to 100x (in memory) faster than Hadoop MapReduce
• Most popular for running iterative machine learning algorithms
• Provides high-level APIs in Java, Scala, and Python
• Integrates with Hadoop and its ecosystem and can read existing Hadoop data


Why Spark, why not Hadoop?

Spark Streaming
Makes it easy to build scalable, fault-tolerant streaming applications.
• Ease of use
• Fault tolerance
• Combine streaming with batch and interactive queries
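
One reason streaming and batch code combine easily in Spark Streaming is that its DStream model cuts the incoming stream into small batches and runs ordinary batch operations on each one. The sketch below imitates that idea in plain Python; it does not use the Spark API, and the names `micro_batches` and `process_stream` and the sample events are illustrative assumptions.

```python
from collections import Counter

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive windows of batch_interval seconds."""
    batches = {}
    for timestamp, value in events:
        window = int(timestamp // batch_interval)
        batches.setdefault(window, []).append(value)
    return [batches[w] for w in sorted(batches)]

def process_stream(events, batch_interval=1.0):
    """Run the same batch-style count on every micro-batch."""
    return [dict(Counter(batch)) for batch in micro_batches(events, batch_interval)]

# Hypothetical activity events: (seconds since start, event type).
events = [(0.1, "login"), (0.4, "click"), (0.9, "click"), (1.2, "login"), (2.5, "click")]
results = process_stream(events)
```

Each entry of `results` is the count for one interval, so the per-batch function could be reused unchanged on historical data.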

[Figure: Spark Streaming, with the labels "zillions of bytes" and "gigabytes per second"]

Input & Output Sources
[Figure: Spark Streaming input and output sources, including Kinesis and S3]

Scala
• Scala was created by Martin Odersky, who began designing it in 2001; the first public release followed in 2004.
• Scala is a language that addresses the major needs of the modern developer.
• It is a statically typed, mixed-paradigm JVM language with a succinct, elegant, and flexible syntax, a sophisticated type system, and idioms that promote scalability from small, interpreted scripts to large, sophisticated applications.

Why Scala?
• Functional
• Object-oriented programming
• On the JVM
• Static typing - easier to control performance

Why Scala? (continued)
• Scala is compelling because it feels like a dynamically typed scripting language, due to its succinct syntax and type inference.
• Yet Scala gives you all the benefits of static typing, a modern object model, functional programming, and an advanced type system.
• Scala's aim to provide advanced constructs for the abstraction and composition of components is shared by several recent research efforts.

What is Elasticsearch?
• In short, it can be thought of as “search engine software”.
• It provides the realistic potential for you to run your own search engine service (like Bing or Google), but with, say, private, sensitive, or confidential data/documents that you don’t want on the public web.
• A great extra capability for your company, enterprise, app, startup, or client.
• Elasticsearch is an open-source, distributed search server that runs on top of Lucene. It is written in Java and exposes a REST API.
• Apache Lucene is widely regarded as one of the best open-source search libraries, and it holds its own even when compared against the most expensive commercial alternatives.
• Very fast search.
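
Much of that search speed comes from the inverted index, the core data structure in Lucene: instead of scanning every document, the engine looks up each query term in a map from terms to the documents that contain them. The following is a minimal sketch of the idea in plain Python, not Lucene or Elasticsearch code; `build_index`, `search`, and the sample documents are illustrative assumptions, and real engines add analysis, scoring, and compression on top.

```python
from collections import defaultdict

def build_index(documents):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "Kafka is a distributed messaging system",
    2: "Spark is a cluster computing system",
    3: "Elasticsearch is a distributed search engine",
}
index = build_index(docs)
```

A query such as `search(index, "distributed system")` only touches the two posting sets for its terms, regardless of how many documents are stored.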

Where did Elasticsearch come from?
• Originally there was a search application project called Compass, which was primarily worked on by @kimchy (Shay Banon).
• Compass also relied on Lucene, but was not distributed.
• kimchy decided to write Elasticsearch to be distributed from the get-go, so you could say it was built with the cloud in mind.
• Add more servers and they play together nicely, and they know how to work together to split up the workload (search queries can be resource-intensive and expensive in terms of memory/disk requirements).
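
How several servers might split up the workload can be sketched as follows: each document is assigned to one shard by hashing its id, and a search is sent to every shard and the results are merged. This is a simplified single-process illustration, not Elasticsearch code; the `Cluster` class, the one-shard-per-node layout, and the CRC32 hash are assumptions made for the example.

```python
import zlib

class Cluster:
    """Toy cluster: each node holds one shard, a dict of doc_id -> text."""

    def __init__(self, num_nodes):
        self.shards = [{} for _ in range(num_nodes)]

    def shard_for(self, doc_id):
        # A stable hash of the id, modulo the shard count, picks the owning node.
        return zlib.crc32(str(doc_id).encode("utf-8")) % len(self.shards)

    def index(self, doc_id, text):
        self.shards[self.shard_for(doc_id)][doc_id] = text

    def search(self, term):
        # Scatter the query to every shard, then gather and merge the hits.
        hits = []
        for shard in self.shards:
            hits.extend(doc_id for doc_id, text in shard.items()
                        if term in text.lower().split())
        return sorted(hits)

cluster = Cluster(num_nodes=3)
samples = ["kafka streams", "spark streaming", "scala on the jvm", "spark and kafka"]
for doc_id, text in enumerate(samples):
    cluster.index(doc_id, text)
```

Each node only stores and scans its own share of the documents, which is why adding servers spreads both the memory and the query cost.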

Elasticsearch is an advanced distributed application
• It has some very cool properties and abilities when it comes to operations that involve lots of nodes.
• It scales extremely gracefully.
• It has its own optimized binary protocol and forms its own “internal network”…
• …as long as you know what you are doing when it comes to configuration.
• It is open source.