MAKING BUSINESS INTELLIGENT www.pragmaticworks.com
Introduction to Mahout with HDInsight (Hadoop)
Chris Price, Senior BI Consultant, @BluewaterSQL
Introduction to Mahout with HDInsight

Jan 15, 2015

Chris Price

In this session, we will introduce Mahout, a machine learning library that has multiple algorithms implemented on top of Hadoop and HDInsight. We will start by introducing the foundational concepts needed to understand clustering, classification and collaborative filtering before demonstrating what it takes to get started with Mahout. In addition to learning how to get Mahout set up, you will learn what it takes to process and prepare data, how to execute an "embarrassingly parallel" batch recommendation job, and how to integrate the results back into your existing ecosystem.
Transcript
Page 3: Introduction to Mahout with HDInsight

Survey: Who's currently using machine learning?

Google, Facebook, LinkedIn, Twitter, Amazon, Wal-Mart

Page 4: Introduction to Mahout with HDInsight

Outline
  Mahout Introduction
  The Algorithms
  Hands On: A recommendation engine

Page 5: Introduction to Mahout with HDInsight

Riding the Elephant
  Born out of the Apache Lucene project
  Top-level Apache project
  A scalable machine learning library
  Fast, Efficient & Pragmatic
  Many of the algorithms can be run on Hadoop

Page 6: Introduction to Mahout with HDInsight

Algorithms
  Collaborative Filtering
    Item/User Recommenders
  Clustering
    Grouping movies by type
  Classification
    Categorizing documents
  Frequent Itemset
    Market basket analysis

Page 7: Introduction to Mahout with HDInsight

Collaborative Filtering
  Find the subset of users who have similar tastes/preferences to the target user, and use this subset for recommendations
  Types:
    User-Based
    Item-Based
  Examples:
    Amazon

Page 8: Introduction to Mahout with HDInsight

Clustering
  Group similar objects
  Examples:
    News Aggregator
    Customer Grouping

Page 9: Introduction to Mahout with HDInsight

Clustering
  Algorithms:
    K-Means
    Fuzzy K-Means
    Mean Shift
    Canopy
    Dirichlet
  Similarity Distance:
    Euclidean
    Squared Euclidean
    Cosine
    Tanimoto
    Manhattan
  ** Also weighted measures
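The distance measures listed on this slide can be written out in a few lines each. The following is a minimal Python sketch for illustration only; it is not Mahout's implementation, the function names are made up for this example, and it assumes equal-length, non-zero numeric vectors.

```python
import math

def squared_euclidean(a, b):
    # Sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance: square root of the squared Euclidean distance.
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # Sum of absolute coordinate differences ("city block" distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors (assumes non-zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 minus the Tanimoto coefficient: dot / (|a|^2 + |b|^2 - dot).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Which measure fits best depends on the data: Euclidean and Manhattan are sensitive to magnitude, while cosine and Tanimoto compare direction and overlap.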

Page 10: Introduction to Mahout with HDInsight

Clustering
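To make the idea of grouping similar objects concrete, here is a small sketch of K-Means, the first algorithm named on the previous slide. It is a plain single-machine Python illustration, not Mahout's Hadoop implementation; the function name, the caller-supplied starting centroids, and the iteration cap are assumptions for this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its members; repeat until nothing changes."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the closest centroid by Euclidean distance.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                # Coordinate-wise mean of the cluster members.
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centroids, which is one reason Canopy clustering is often run first to pick them.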

Page 11: Introduction to Mahout with HDInsight

Classification
  Using a pre-determined set of groups:
    Predict the type of a new object based on its features
  Classifiable Data
    Continuous – Quantitative value (e.g. stock price)
    Categorical – Small known set (e.g. colors)
    Word-Like – Large unknown set
    Text-Like – Many word-like values that are unordered
  Examples:
    Spam Identification
    Photo Facial Recognition
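As an illustration of predicting a group from text-like features, the spam example can be sketched with a tiny Naive Bayes classifier. This is a teaching sketch in Python with made-up training data and hypothetical function names; it is not the Mahout classifier, and it uses simple add-one smoothing.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (label, text) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior plus the sum of log likelihoods (add-one smoothing).
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data, for illustration only.
model = train([
    ("spam", "win cash now"),
    ("spam", "cheap cash prize"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon tomorrow"),
])
```

Working in log space avoids underflow when many word probabilities are multiplied together.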

Page 12: Introduction to Mahout with HDInsight

Frequent Itemset
  Examples:
    Product Placement
    Market Basket Analysis
    Query Recommendations
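Market basket analysis boils down to counting which items appear together often. The sketch below counts item pairs by brute force in Python; it is a simplification for illustration (real frequent-itemset mining at scale uses algorithms such as FP-Growth), and the function name and sample baskets are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
result = frequent_pairs(baskets, min_support=3)
```

Here only the pair ("beer", "diapers") reaches a support of 3; every other pair appears in just two baskets.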

Page 13: Introduction to Mahout with HDInsight

Mahout on HDInsight
  Installation
    Download: http://www.apache.org/dyn/closer.cgi/mahout/
    Unpack
    Add to Path (Environment Variable)

Page 14: Introduction to Mahout with HDInsight

Recommendation Engine
  Define the Business Objective
    Metrics
    Context
  Identify Data Sources
    Normalization
    Data Shift
  Which Algorithm?
  Integration?

Page 15: Introduction to Mahout with HDInsight

Business Objective (diagram labels)
  Navigational Inefficiency
  Cross-Sell
  Up-Sell
  Increase # of Orders
  Increase Items per Order
  Increase Average Item Price
  Increase Website Revenue

Page 16: Introduction to Mahout with HDInsight

Handling Context

January, 20 degrees & snowing... ???

Page 17: Introduction to Mahout with HDInsight

Data Acquisition
  Sources of Data for Recommendation
    Explicit
      Ratings
      Feedback
      Demographics
      Psychographics (Personality/Lifestyle/Attitude)
      Ephemeral Need (Need for a moment)
    Implicit
      Purchase History
      Click/Browse History
    Product/Item
      Taxonomy
      Attributes
      Descriptions

Page 18: Introduction to Mahout with HDInsight

Data Preparation
  Preparation
    Remove Outliers (Z-Score)
    Remove frequent buyers (Skew)
    Normalize Data (Unity-Based)
  Beware of Data Shift
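Two of the preparation steps above are easy to show in code: dropping outliers by z-score and unity-based (min-max) normalization to the range [0, 1]. This is a minimal Python sketch; the function names, the default threshold, and the sample values are assumptions for illustration.

```python
import statistics

def remove_outliers(values, threshold=3.0):
    """Drop values whose z-score (distance from the mean in standard
    deviations) exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(values)  # all values identical, nothing to drop
    return [v for v in values if abs(v - mean) / stdev <= threshold]

def unity_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

purchases = [10, 12, 11, 13, 12, 11, 10, 12, 11, 500]
cleaned = remove_outliers(purchases, threshold=2.0)
scaled = unity_normalize(cleaned)
```

With small samples a single extreme value inflates the standard deviation, so a threshold of 3 may never trigger; the example uses 2 for that reason.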

Page 19: Introduction to Mahout with HDInsight

Algorithms
  Collaborative Filtering (Mahout)
    User-Based
    Item-Based
  Content-Based (Mahout Clustering)
  Data Mining (SSAS)
    Association
    Clustering

Page 20: Introduction to Mahout with HDInsight

CF Recommendations
  Neighborhood Formation
  Similarity Metrics
    Pearson Correlation
    Euclidean Distance
    Spearman Correlation
    Cosine
    Tanimoto Coefficient
    Log-Likelihood
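Two of these metrics behave quite differently and are worth seeing side by side: Pearson correlation compares rating values on co-rated items, while the Tanimoto coefficient ignores rating values and compares only which items were rated. The Python sketch below is illustrative; the function names and the dictionary layout (item to rating) are assumptions, not Mahout's API.

```python
import math

def pearson(prefs_u, prefs_v):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(prefs_u) & set(prefs_v))
    n = len(common)
    if n < 2:
        return 0.0  # not enough overlap to correlate
    xs = [prefs_u[i] for i in common]
    ys = [prefs_v[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user with constant ratings has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def tanimoto_coefficient(prefs_u, prefs_v):
    """Size of the intersection of rated items divided by size of the union."""
    a, b = set(prefs_u), set(prefs_v)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

u = {"A": 1, "B": 2, "C": 3}
v = {"A": 2, "B": 4, "C": 6, "D": 5}
w = {"A": 3, "B": 2, "C": 1}
```

Tanimoto and log-likelihood are the usual choices when the data holds only boolean preferences, such as purchases with no rating.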

Page 21: Introduction to Mahout with HDInsight

CF Pseudo-Code

for each item i that u has no preference for
    for each user v that has a preference for i
        compute similarity s between u and v
        calculate running average of v's preference for i, weighted by s
return top ranked (weighted average) i

Restrict to Neighborhood
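The pseudo-code above translates almost line for line into a runnable sketch. The Python version below is an illustration, not Mahout's recommender: the similarity function (inverse Euclidean distance over co-rated items), the function names, the `neighborhood_size` parameter, and the sample preferences are all assumptions made for this example.

```python
import math

def similarity(a, b):
    # Assumed metric: 1 / (1 + Euclidean distance over co-rated items).
    common = set(a) & set(b)
    if not common:
        return 0.0
    distance = math.sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + distance)

def recommend(user, prefs, top_n=3, neighborhood_size=None):
    own = prefs[user]
    # Similarity s between u and every other user v.
    sims = {v: similarity(own, p) for v, p in prefs.items() if v != user}
    # "Restrict to Neighborhood": keep only the most similar users (None = all).
    neighbors = sorted(sims, key=lambda v: (-sims[v], v))[:neighborhood_size]
    # Items that u has no preference for.
    candidates = {i for v in neighbors for i in prefs[v]} - set(own)
    scores = {}
    for item in candidates:
        weighted_sum = weight_total = 0.0
        for v in neighbors:
            if item in prefs[v] and sims[v] > 0:
                # Average of v's preference for the item, weighted by s.
                weighted_sum += sims[v] * prefs[v][item]
                weight_total += sims[v]
        if weight_total > 0:
            scores[item] = weighted_sum / weight_total
    # Top ranked items by weighted average (ties broken by item name).
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

prefs = {
    "u": {"A": 5, "B": 3},
    "v": {"A": 5, "B": 3, "C": 4},
    "w": {"A": 1, "B": 5, "C": 2, "D": 5},
}
result = recommend("u", prefs)
```

Without a neighborhood, item D ranks first even though only the dissimilar user w rated it; with `neighborhood_size=1` only the closest user v contributes, which is the motivation for restricting to a neighborhood.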

Page 22: Introduction to Mahout with HDInsight

Testing
  Smell Test
  Built-In (Requires Java Coding)
    Root Mean Squared Error (RMSE)
    Average Absolute Difference

  RandomUtils.useTestSeed()
  evaluator.evaluate(builder, null, model, 0.7, 1.0)

  (0.7 = share of the data used for training; 1.0 = share of users evaluated)
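Both built-in scores compare predicted preferences against held-out actual preferences; lower is better. The formulas can be sketched in a few lines of Python. This is only an illustration of the arithmetic with invented sample ratings and hypothetical function names, not a replacement for Mahout's Java evaluators.

```python
import math

def rmse(actual, predicted):
    # Root mean squared error: penalizes large misses more heavily.
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def average_absolute_difference(actual, predicted):
    # Mean of the absolute errors: every miss counts in proportion to its size.
    diffs = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(diffs) / len(diffs)

actual = [4, 3, 5, 2]
predicted = [3, 3, 4, 4]
```

For these values the errors are 1, 0, 1 and 2, so the average absolute difference is 1.0 and the RMSE is the square root of 1.5.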

Page 23: Introduction to Mahout with HDInsight

Recommendation Engine Steps
  1 – Generate List of ItemIDs
  2 – Create Preference Vector
  3 – Count Unique Users
  4 – Transpose Preference Vectors
  5 – Row Similarity
        Compute Weights
        Compute Similarities
        Similarity Matrix
  6 – Pre-Partial Multiply, Similarity Matrix
  7 – Pre-Partial Multiply, Preferences
  8 – Partial Multiply (Steps 6 & 7)
  9 – Filter Items
  10 – Aggregate & Recommend
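At its core, the job above multiplies an item-item similarity matrix by each user's preference vector, filters out items the user already has, and keeps the top scores. The following single-machine Python sketch shows that idea under simplifying assumptions: it uses raw co-occurrence counts as the similarity (one of several possible measures), and the function names and sample data are invented.

```python
from collections import defaultdict
from itertools import permutations

def item_similarity(prefs):
    """Similarity matrix as co-occurrence counts: the number of users
    who have a preference for both items."""
    sim = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(items, 2):
            sim[a][b] += 1
    return sim

def recommend(user, prefs, sim, top_n=2):
    own = prefs[user]
    scores = defaultdict(float)
    # Multiply: each owned item contributes similarity * preference
    # to every item it co-occurs with.
    for item, rating in own.items():
        for other, weight in sim[item].items():
            if other not in own:  # filter items the user already has
                scores[other] += weight * rating
    # Aggregate and recommend the highest-scoring items.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

prefs = {
    1: {"A": 5.0, "B": 3.0},
    2: {"A": 4.0, "C": 2.0},
    3: {"B": 4.0, "C": 5.0, "D": 1.0},
}
sim = item_similarity(prefs)
result = recommend(1, prefs, sim)
```

Because each user's scores depend only on that user's vector and the shared similarity matrix, the work splits cleanly across users, which is what makes the batch job easy to parallelize.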

Page 24: Introduction to Mahout with HDInsight

Batch Integration
  ETL Data to HDFS
    SSIS
    Map/Reduce
  Process with Mahout
  ETL Results
    Map/Reduce
    Hive/Sqoop
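Before the results can be loaded into a relational table, each output line usually has to be flattened into one row per user and item. The Python sketch below assumes the recommendation job writes text lines of the form `userID<TAB>[itemID:score,itemID:score,...]` with numeric IDs; that format, the function name, and the sample line are assumptions to verify against your own job output.

```python
def parse_recommendations(line):
    """Turn one output line into (user_id, item_id, score) rows.
    Assumed line format: userID<TAB>[itemID:score,itemID:score,...]"""
    user_id, _, payload = line.strip().partition("\t")
    rows = []
    for entry in payload.strip("[]").split(","):
        if not entry:
            continue  # user with no recommendations
        item_id, _, score = entry.partition(":")
        rows.append((int(user_id), int(item_id), float(score)))
    return rows

rows = parse_recommendations("3\t[103:4.5,102:4.0]\n")
```

The flattened rows map directly onto a three-column table (user, item, score), which keeps the downstream Hive or Sqoop step simple.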

Page 25: Introduction to Mahout with HDInsight

Hands-On Demo

Page 26: Introduction to Mahout with HDInsight

Resources

Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman

Hadoop: The Definitive Guide, by Tom White