Introduction to Big Data Sri Kanajan

Big data Intro - Presentation to OCHackerz Meetup Group

May 06, 2015

Sri Kanajan

Introduction to Big Data
Hadoop, MapReduce – Storage, Processing
Machine Learning – Analytics
Visualization
Transcript
Page 1: Big data Intro - Presentation to OCHackerz Meetup Group

Introduction to Big Data

Sri Kanajan

Page 2: Big data Intro - Presentation to OCHackerz Meetup Group

Big Data

• When data is too VVV (volume, variety, velocity) to manage with a traditional RDBMS, you enter BIG DATA!
• Data Storage and Manipulation, at Scale
  – MapReduce, Hadoop, relationship to databases (framework)
  – Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (database type)
  – Entity resolution, record linkage, data cleaning (data integration)
• Analytics (Machine Learning)
  – Basic statistical modeling, experiment design, overfitting
  – Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression
  – Unsupervised learning: k-means, multi-dimensional scaling
  – Graph analytics: PageRank, community detection, recursive queries, iterative processing
  – Text analytics: latent semantic analysis
  – Collaborative filtering: slope-one
• Communicating Results
  – Visualization, data products, visual data analytics

Page 3: Big data Intro - Presentation to OCHackerz Meetup Group

Outline

• What is Big Data?
• Why is this important now?
• Key Concepts
  – Hadoop, MapReduce – Storage, Processing
  – Machine Learning – Analytics
  – Visualization

Page 4: Big data Intro - Presentation to OCHackerz Meetup Group

Big Data Everywhere!

• Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Purchases at department/grocery stores
  – Bank/credit card transactions
  – Social networks

Unknown hidden relationships within this data!

Page 5: Big data Intro - Presentation to OCHackerz Meetup Group
Page 6: Big data Intro - Presentation to OCHackerz Meetup Group

How much data?

• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
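
To put the daily figures above in perspective, they can be converted to sustained per-second rates. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a 24-hour day; the helper name `bytes_per_second` is illustrative and not from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Decimal (SI) units, assumed here for simplicity.
GB = 10**9
TB = 10**12
PB = 10**15


def bytes_per_second(bytes_per_day):
    """Convert a daily data volume into an average per-second rate."""
    return bytes_per_day / SECONDS_PER_DAY


google_rate = bytes_per_second(20 * PB)    # 20 PB/day processed (2008)
facebook_rate = bytes_per_second(15 * TB)  # 15 TB/day of new data (4/2009)

print(f"Google:   {google_rate / GB:.1f} GB/s")
print(f"Facebook: {facebook_rate / GB:.2f} GB/s")
```

Under these assumptions, 20 PB a day averages out to roughly 230 GB every second, which is far beyond what a single machine can read from disk.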

640K ought to be enough for anybody.

Page 7: Big data Intro - Presentation to OCHackerz Meetup Group

Type of Data

• Relational Data (Tables/Transaction/Legacy Data)
• Unstructured Text Data
  – Log data, comments, user-generated text
• Semi-structured Data (XML)
• Graph Data
  – Social Network, Semantic Web (RDF)
• Real-time Data
  – You can only scan the data once and need to do analytics quickly

Page 8: Big data Intro - Presentation to OCHackerz Meetup Group

What does Big Data Give You?

• Without Big Data
  – Many separate data warehouses on non-distributed architectures
  – Merging databases required modifying data structures and writing custom code
  – Scaling database size is a continual problem
  – Any large-scale analytics took days or weeks and a large coordination effort within IT to get database access
  – Data analysis is a large effort, and much data tends to remain unanalyzed or, even worse, not stored
• With Big Data
  – Hadoop provides a single view of all databases that can be distributed
  – Database size is a non-issue
  – Ability to perform advanced statistical analysis on very large datasets very quickly
  – Data analysis is the competitive edge for many companies, since barriers to entry are continually dropping through the development of platforms

Page 9: Big data Intro - Presentation to OCHackerz Meetup Group

Examples

• Norwegian Food Safety Authority
  – accumulates data on all farm animals
  – birth, death, movements, medication, samples, ...
• Hafslund
  – time series from hydroelectric dams, power prices, meters of individual customers, ...
• Social Security Administration
  – data on individual cases, actions taken, outcomes...
• Statoil
  – massive amounts of data from oil exploration, operations, logistics, engineering, ...
• Retailers
  – see Target example above
  – also, connection between what people buy, weather forecast, logistics, ...

Page 10: Big data Intro - Presentation to OCHackerz Meetup Group

Big Data

Page 11: Big data Intro - Presentation to OCHackerz Meetup Group

Power of Distribution

45 Minutes! 4.5 Minutes!

Page 12: Big data Intro - Presentation to OCHackerz Meetup Group

Outline

• What is Big Data?
• Why is this important now?
• Key Concepts
  – Hadoop, MapReduce – Storage, Processing
  – Machine Learning – Analytics
  – Visualization

Page 13: Big data Intro - Presentation to OCHackerz Meetup Group

Hadoop

• A framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model (i.e., MapReduce)
  – Distributed data processing
  – Works with structured and unstructured data
  – Open source
  – Master-slave architecture
  – Fault tolerant using commodity hardware

Page 14: Big data Intro - Presentation to OCHackerz Meetup Group

MapReduce

• Programming model on top of Hadoop
• Basic concept is to provide a programming model that immediately supports parallel processing (SQL, on the other hand, does not natively encourage parallel processing)
• Pig is a framework and programming language for developing MapReduce jobs
• Note – MapReduce is great for extremely large data sets with simple relations. SQL is great for medium-size data sets with complex relationships
  – i.e., you have to choose the right technology for your problem space

Page 15: Big data Intro - Presentation to OCHackerz Meetup Group

A Simple Example

• Counting words in a large set of documents

map(string key, string value)
    // key: document name
    // value: document contents
    for each word w in value
        EmitIntermediate(w, "1");

reduce(string key, iterator values)
    // key: word
    // values: list of counts
    int result = 0;
    for each v in values
        result += ParseInt(v);
    Emit(AsString(result));
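
The pseudocode above can be simulated on a single machine. The following is a minimal Python sketch, not Hadoop code: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, the sample documents are made up, and the in-memory grouping step stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in contents.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(documents):
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Made-up input: document name -> document contents.
docs = {"a.txt": "big data is big", "b.txt": "data at scale"}
word_counts = run_mapreduce(docs)
print(word_counts)
```

Because each call to `map_fn` depends only on its own document and each call to `reduce_fn` only on its own word, both steps could in principle run on different machines, which is the property the framework exploits.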

Page 16: Big data Intro - Presentation to OCHackerz Meetup Group

MapReduce

Page 17: Big data Intro - Presentation to OCHackerz Meetup Group

Outline

• What is Big Data?
• Why is this important now?
• Key Concepts
  – Hadoop, MapReduce – Storage architecture
  – Machine Learning – Analytics
  – Visualization

Page 18: Big data Intro - Presentation to OCHackerz Meetup Group

Machine Learning

• Essentially ways to analyze data to extract valuable information with or without training data
  – Prediction
    • predicting a variable from data
  – Classification
    • assigning records to predefined groups
  – Clustering
    • splitting records into groups based on similarity
  – Association learning
    • seeing what often appears together with what
  – And many others...
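
As one concrete illustration of association learning from the list above, the sketch below counts how often pairs of items appear in the same shopping basket. It is a simplified co-occurrence count, not a full association-rule algorithm, and the function name `pair_counts` and the baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so ("bread", "milk") and ("milk", "bread")
        # are counted as the same pair, once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical purchase data.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
]
counts = pair_counts(baskets)
print(counts.most_common(1))
```

Pairs with high counts are candidates for "what often appears together with what"; a real system would also normalize by how common each item is on its own.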

Page 19: Big data Intro - Presentation to OCHackerz Meetup Group
Page 20: Big data Intro - Presentation to OCHackerz Meetup Group

Now you have an optimization metric by which you can automate the exploration of all possible hypotheses!

Problems with this approach??

Page 21: Big data Intro - Presentation to OCHackerz Meetup Group

Two kinds of learning

• Supervised
  – we have training data with correct answers
  – use training data to prepare the algorithm
  – then apply it to data without a correct answer
• Unsupervised
  – no training data
  – throw data into the algorithm, hope it makes some kind of sense out of the data
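
The difference between the two kinds of learning can be sketched with two small algorithms named in the course overview: nearest neighbor (supervised) and k-means (unsupervised). This is a minimal one-dimensional sketch in plain Python; the function names, the data, and the starting centers are assumptions chosen for illustration.

```python
def nearest_neighbor(train, x):
    """Supervised: return the label of the labeled training point closest to x."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def k_means_1d(values, centers, iterations=10):
    """Unsupervised: refine the starting centers using unlabeled values only."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers


# Supervised: training data comes with correct answers (labels).
labeled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]
print(nearest_neighbor(labeled, 2.0))

# Unsupervised: no labels, the algorithm only groups similar values.
unlabeled = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
print(k_means_1d(unlabeled, centers=[0.0, 5.0]))
```

Note that k-means returns cluster centers, not names: deciding that one group means "small" and the other "large" is left to whoever interprets the result.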

Page 22: Big data Intro - Presentation to OCHackerz Meetup Group

Example: Collaborative Filtering

• Goal: predict what movies/books/… a person may be interested in, on the basis of
  – Past preferences of the person
  – Other people with similar past preferences
  – The preferences of such people for a new movie/book/…
• One approach based on repeated clustering
  – Cluster people on the basis of preferences for movies
  – Then cluster movies on the basis of being liked by the same clusters of people
  – Again cluster people based on their preferences for (the newly created clusters of) movies
  – Repeat above till equilibrium
• Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
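
The goal stated above (predict a person's interest from people with similar past preferences) can be sketched with a simple neighborhood method. Note that this is a swapped-in technique, a similarity-weighted average over other users, and not the repeated-clustering approach described on the slide. The ratings, user names, and function names are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Predict user's rating for item as a similarity-weighted average
    of the ratings given to that item by other users."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


# Made-up ratings on a 1-5 scale; "ann" has not rated "Casablanca".
ratings = {
    "ann": {"Alien": 5, "Brazil": 1},
    "bob": {"Alien": 5, "Brazil": 1, "Casablanca": 4},
    "cat": {"Alien": 1, "Brazil": 5, "Casablanca": 2},
}
prediction = predict(ratings, "ann", "Casablanca")
print(prediction)
```

Because "ann" agrees with "bob" far more than with "cat", the prediction lands closer to bob's rating of 4 than to cat's rating of 2.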

Page 23: Big data Intro - Presentation to OCHackerz Meetup Group

Outline

• What is Big Data?
• Why is this important now?
• Key Concepts
  – Hadoop, MapReduce – Storage architecture
  – Machine Learning – Analytics
  – Visualization

Page 24: Big data Intro - Presentation to OCHackerz Meetup Group

Is this an effective visual representation?

Page 25: Big data Intro - Presentation to OCHackerz Meetup Group

Better Mapping? Why?

Page 26: Big data Intro - Presentation to OCHackerz Meetup Group

Diagrams Showing O-Ring Damage that were Used to Decide to Launch Challenger in 1986

Page 27: Big data Intro - Presentation to OCHackerz Meetup Group

Representation of the Same Data

Page 28: Big data Intro - Presentation to OCHackerz Meetup Group

Strategies to Increase the Information Encoded by Spatial Position

• Composition
  – Orthogonal placement of axes
  – Creates a 2D metric space

Page 29: Big data Intro - Presentation to OCHackerz Meetup Group

Strategies to Increase the Information Encoded by Spatial Position

• Alignment

Page 30: Big data Intro - Presentation to OCHackerz Meetup Group

Folding

• Continuation of the Axes

Page 31: Big data Intro - Presentation to OCHackerz Meetup Group

Recursion

Page 32: Big data Intro - Presentation to OCHackerz Meetup Group

Overloading

Page 33: Big data Intro - Presentation to OCHackerz Meetup Group

Conclusion

• Big Data is a huge field that combines expertise from different domains in order to find interesting information from data

• Extracting interesting information from data is the next competitive edge for many companies as information becomes available instantly, anywhere