Big data Intro - Presentation to OCHackerz Meetup Group

Introduction to Big Data

Sri Kanajan

Big Data

• When data has too much VVV (volume, variety, velocity) to manage with a traditional RDBMS, you enter BIG DATA!
• Data Storage and Manipulation, at Scale
  – MapReduce, Hadoop, relationship to databases (framework)
  – Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (database type)
  – Entity resolution, record linkage, data cleaning (data integration)
• Analytics (Machine Learning)
  – Basic statistical modeling, experiment design, overfitting
  – Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression
  – Unsupervised learning: k-means, multi-dimensional scaling
  – Graph Analytics: PageRank, community detection, recursive queries, iterative processing
  – Text Analytics: latent semantic analysis
  – Collaborative Filtering: slope-one
• Communicating Results
  – Visualization, data products, visual data analytics

Outline

• What is Big Data?
• Why is this important now?
• Key Concepts
  – Hadoop, MapReduce
  – Storage, Processing
  – Machine Learning
  – Analytics
  – Visualization

Big Data Everywhere!

• Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Purchases at department/grocery stores
  – Bank/credit card transactions
  – Social networks

Unknown hidden relationships within this data!

How much data?

• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year

640K ought to be enough for anybody.

Types of Data

• Relational Data (Tables/Transaction/Legacy Data)
• Unstructured Text Data
  – Log data, comments, user-generated text
• Semi-structured Data (XML)
• Graph Data
  – Social Network, Semantic Web (RDF)
• Real-time Data
  – You can only scan the data once and need to do analytics quickly
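The "scan the data once" constraint on real-time data can be illustrated with a single-pass statistic. The sketch below is only an illustration, not something from the talk: it uses Welford's method to keep a running mean and variance without storing the stream, and the class name, function name, and sensor readings are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and variance maintained in a single pass (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Each reading is folded into the summary and then discarded.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; returns 0.0 for fewer than two readings.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(stream):
    """Consume an iterable exactly once and return its running statistics."""
    stats = RunningStats()
    for reading in stream:
        stats.update(reading)
    return stats


# Hypothetical sensor readings arriving one at a time.
stats = summarize(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the property that matters when the data cannot be replayed.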

What does Big Data Give You?

• Without Big Data
  – Many separate data warehouses on non-distributed architectures
  – Merging databases required modified data structures and custom programming
  – Scaling database size is a continual problem
  – Any large-scale analytics took days or weeks and a large coordination effort within IT to get database access
  – Data analysis is a large effort, and lots of data tends to remain unanalyzed or, even worse, not stored
• With Big Data
  – Hadoop provides a single view of all databases that can be distributed
  – Database size is a non-issue
  – Ability to perform advanced statistical analysis on very large datasets very quickly
  – Data analysis is the competitive edge for many companies, since barriers to entry are continually dropping through the development of platforms

Examples

• Norwegian Food Safety Authority
  – accumulates data on all farm animals
  – birth, death, movements, medication, samples, ...
• Hafslund
  – time series from hydroelectric dams, power prices, meters of individual customers, ...
• Social Security Administration
  – data on individual cases, actions taken, outcomes...
• Statoil
  – massive amounts of data from oil exploration, operations, logistics, engineering, ...
• Retailers
  – see Target example above
  – also, connection between what people buy, weather forecast, logistics, ...

Big Data

Power of Distribution

45 Minutes! 4.5 Minutes!


Hadoop

• A framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model (i.e. MapReduce)
  – Distributed data processing
  – Works with structured and unstructured data
  – Open source
  – Master-slave architecture
  – Fault tolerant using commodity hardware

MapReduce

• Programming model on top of Hadoop
• Basic concept is to provide a programming model that immediately supports parallel processing (SQL, on the other hand, does not natively encourage parallel processing)
• Pig is a framework and programming language for developing MapReduce programs
• Note: MapReduce is great for extremely large data sets with simple relations. SQL is great for medium-size data sets with complex relationships
  – I.e., you have to choose the right technology for your problem space

A Simple Example

• Counting words in a large set of documents

map(string key, string value)
  // key: document name
  // value: document contents
  for each word w in value
    EmitIntermediate(w, "1");

reduce(string key, iterator values)
  // key: word
  // values: list of counts
  int result = 0;
  for each v in values
    result += ParseInt(v);
  Emit(AsString(result));
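The word-count pseudocode can be tried out on one machine with a small Python sketch. This is an in-memory imitation of the map, shuffle, and reduce phases, not the Hadoop API; the function names and the two sample documents are invented for illustration.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Map: emit (word, 1) for every word in one document."""
    for word in contents.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    # Shuffle step: group the intermediate values by key (the word).
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_fn(doc_name, contents):
            grouped[word].append(count)
    # Each key is reduced independently of every other key.
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


# Hypothetical input: document name -> document contents.
docs = {"a.txt": "big data is big", "b.txt": "data at scale"}
counts = run_word_count(docs)
print(counts)
```

Because each call to map_fn sees only one document and each call to reduce_fn sees only one word, a framework is free to run those calls on different machines; the shuffle is the only step that has to move data between them.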

MapReduce


Machine Learning

• Essentially ways to analyze data to extract valuable information, with or without training data
  – Prediction
    • predicting a variable from data
  – Classification
    • assigning records to predefined groups
  – Clustering
    • splitting records into groups based on similarity
  – Association learning
    • seeing what often appears together with what
  – And many others...

Now you have an optimization metric by which you can automate the exploration of all possible hypotheses! Problems with this approach?


Two kinds of learning

• Supervised
  – we have training data with correct answers
  – use training data to prepare the algorithm
  – then apply it to data without a correct answer
• Unsupervised
  – no training data
  – throw data into the algorithm, hope it makes some kind of sense out of the data
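The difference between the two kinds of learning may be easier to see side by side. The sketch below is a toy illustration under assumed data: a one-nearest-neighbor classifier stands in for supervised learning and a basic k-means loop for unsupervised learning. The points, labels, and starting centroids are all made up.

```python
import math


def nearest_neighbor(train, query):
    """Supervised: return the label of the closest labeled training point."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label


def k_means(points, centroids, iterations=10):
    """Unsupervised: group unlabeled points around k centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Training data with correct answers (hypothetical).
labeled = [((1.0, 1.0), "small"), ((1.5, 2.0), "small"),
           ((8.0, 8.0), "large"), ((9.0, 9.5), "large")]
label = nearest_neighbor(labeled, (7.5, 9.0))

# The same points with the answers thrown away.
unlabeled = [point for point, _ in labeled]
centroids, clusters = k_means(unlabeled, [(0.0, 0.0), (10.0, 10.0)])
print(label, centroids)
```

The supervised function needs the labels to answer anything; the unsupervised one finds two groups on its own but cannot say what they mean.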

Example: Collaborative Filtering

• Goal: predict what movies/books/… a person may be interested in, on the basis of
  – Past preferences of the person
  – Other people with similar past preferences
  – The preferences of such people for a new movie/book/…
• One approach based on repeated clustering
  – Cluster people on the basis of preferences for movies
  – Then cluster movies on the basis of being liked by the same clusters of people
  – Again cluster people based on their preferences for (the newly created clusters of) movies
  – Repeat the above until equilibrium
• The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
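The goal stated above (use people with similar past preferences to predict a rating) can be sketched in a few lines. Note that this is not the repeated-clustering approach from the slide; it swaps in a simpler user-based neighborhood method with cosine similarity. The names, movies, and ratings are invented for the example.

```python
import math

# Hypothetical ratings on a 1-5 scale: person -> {movie: rating}.
ratings = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 4, "Brazil": 5, "Casablanca": 2, "Dune": 5},
    "cy": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1},
}


def similarity(a, b):
    """Cosine similarity over the items both people have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(person, item):
    """Similarity-weighted average of other people's ratings for the item."""
    weighted, total = 0.0, 0.0
    for other, theirs in ratings.items():
        if other == person or item not in theirs:
            continue
        weight = similarity(ratings[person], theirs)
        weighted += weight * theirs[item]
        total += weight
    return weighted / total if total else None


# Ann has not rated "Dune"; estimate it from Bob and Cy.
prediction = predict("ann", "Dune")
print(prediction)
```

Ann's past ratings look more like Bob's than Cy's, so Bob's opinion of "Dune" carries more weight in the estimate.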



Is this an effective visual representation?

Better Mapping? Why?

Diagrams Showing O-Ring Damage That Were Used to Decide to Launch Challenger in 1986

Representation of the Same Data

Strategies to Increase the Information Encoded by Spatial Position

• Composition
  – Orthogonal placement of axes
  – Creates a 2D metric space
• Alignment
• Folding
  – Continuation of the axes
• Recursion
• Overloading

Conclusion

• Big Data is a huge field that combines expertise from different domains in order to find interesting information from data

• Extracting interesting information from data is the next competitive edge for many companies as information becomes available instantly, anywhere
