MAKING BUSINESS INTELLIGENT www.pragmaticworks.com
Introduction to Mahout with HDInsight (Hadoop)
Chris Price, Senior BI Consultant
@BluewaterSQL
Jan 15, 2015
Intro
Chris Price, Senior BI Consultant with Pragmatic Works
Author, Regular Speaker, Data Geek & Super Dad!
@BluewaterSQL http://bluewatersql.wordpress.com/ [email protected]
Survey
  Who's currently using Machine Learning?
  Google, Facebook, LinkedIn, Twitter, Amazon, Wal-Mart
Outline
  Mahout Introduction
  The Algorithms
  Hands On: A recommendation engine
Riding the Elephant
  Born out of the Apache Lucene project
  Top-level Apache project
  A scalable machine learning library
    Fast, Efficient & Pragmatic
    Many of the algorithms can be run on Hadoop
Algorithms
  Collaborative Filtering: Item/User Recommenders
  Clustering: Grouping movies by type
  Classification: Categorizing documents
  Frequent Itemset: Market basket analysis
Collaborative Filtering
  Find a subset of users whose tastes/preferences are similar to the target user's, and use this subset for recommendations
  Types: User-Based, Item-Based
  Examples: Amazon
Clustering
  Group similar objects
  Examples: News Aggregator, Customer Grouping
Clustering
  Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet
  Similarity Distance: Euclidean, Squared Euclidean, Cosine, Tanimoto, Manhattan
  ** Also weighted measures
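The distance measures listed above can be sketched in a few lines of plain Python. This is an illustration of the formulas only, not Mahout's own Java implementation; the function names are made up for this example, and vectors are assumed to be equal-length, non-zero lists of numbers.

```python
import math

def squared_euclidean(a, b):
    # Sum of squared per-dimension differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance: square root of the squared Euclidean distance.
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # Sum of absolute per-dimension differences ("city block" distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors (assumes non-zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 minus dot / (|a|^2 + |b|^2 - dot) (assumes at least one non-zero vector).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Euclidean and Manhattan distances are sensitive to vector magnitude, while cosine looks only at direction, which is often why it is preferred for sparse text-like vectors.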
Classification
  Using a pre-determined set of groups, predict the type of a new object based on its features
  Classifiable Data
    Continuous – Quantitative value (e.g. stock price)
    Categorical – Small known set (e.g. colors)
    Word-Like – Large unknown set
    Text-Like – Many word-like values that are unordered
  Examples: Spam Identification, Photo Facial Recognition
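To make the spam example concrete, here is a minimal sketch of a classifier over text-like data. The slide does not name an algorithm; naive Bayes with add-one smoothing is chosen here purely for illustration, and the function names and tiny training set are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    # examples: list of (label, list_of_words) pairs.
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, words):
    # Pick the label with the highest log-probability under naive Bayes.
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior for this label
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    ("spam", ["win", "money", "now"]),
    ("spam", ["free", "money"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "at", "noon"]),
])
print(classify(model, ["free", "money", "now"]))  # spam
```

The pre-determined groups are the labels seen in training ("spam" and "ham"), and the features are the words of each message.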
Frequent Itemset
  Examples: Product Placement, Market Basket Analysis, Query Recommendations
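A rough idea of what market basket analysis computes: count how often items appear together and keep the combinations that clear a minimum support. The sketch below counts pairs by brute force. It is a deliberately naive stand-in, not the algorithm Mahout uses, and the function name and sample baskets are invented for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    # Count every unordered item pair once per basket,
    # then keep pairs seen in at least min_support baskets.
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
print(frequent_pairs(baskets, min_support=2))
# {('bread', 'milk'): 3, ('eggs', 'milk'): 2}
```

Brute-force counting grows quickly with basket size, which is why scalable implementations avoid enumerating every combination.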
Mahout on HDInsight
  Installation
    Download: http://www.apache.org/dyn/closer.cgi/mahout/
    Unpack
    Add to Path (Environment Variable)
Recommendation Engine
  Define the Business Objective
    Metrics
    Context
  Identify Data Sources
    Normalization
    Data Shift
  Which Algorithm?
  Integration?
Business Objective
  Navigational Inefficiency
  Cross-Sell
  Up-Sell
  Increase # of Orders
  Increase Items per Order
  Increase Average Item Price
  Website Increase Revenue
Handling Context
  ??? January
  20 degrees & snowing...
Data Acquisition
  Sources of Data for Recommendation
    Explicit: Ratings, Feedback, Demographics, Psychographics (Personality/Lifestyle/Attitude), Ephemeral Need (Need for a moment)
    Implicit: Purchase History, Click/Browse History
    Product/Item: Taxonomy, Attributes, Descriptions
Data Preparation
  Preparation
    Remove Outliers (Z-Score)
    Remove frequent buyers (Skew)
    Normalize Data (Unity-Based)
  Beware of Data Shift
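Two of the preparation steps above lend themselves to a short sketch: dropping outliers by z-score and unity-based (min-max) normalization into the range 0 to 1. The function names and the default threshold of 3 standard deviations are assumptions made for this example, not values from the talk.

```python
import statistics

def remove_outliers(values, max_z=3.0):
    # Drop values whose z-score (distance from the mean in standard deviations)
    # exceeds max_z. Uses the population standard deviation.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs(v - mean) / stdev <= max_z]

def unity_normalize(values):
    # Rescale values linearly so the minimum maps to 0 and the maximum to 1.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

orders = [1, 2, 3, 2, 1, 2, 3, 2, 1, 100]
cleaned = remove_outliers(orders, max_z=2.0)  # the 100 is dropped
print(unity_normalize(cleaned))
```

Order matters here: normalizing before removing outliers would squeeze the ordinary values into a narrow band near 0.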
Algorithms
  Collaborative Filtering (Mahout)
    User-Based
    Item-Based
  Content-Based (Mahout Clustering)
  Data Mining (SSAS)
    Association
    Clustering
CF Recommendations
  Neighborhood Formation
  Similarity Metrics
    Pearson Correlation
    Euclidean Distance
    Spearman Correlation
    Cosine
    Tanimoto Coefficient
    Log-Likelihood
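As a sketch of how two of these metrics compare users, the functions below compute the Pearson correlation over items both users rated, and the Tanimoto coefficient over the sets of items each user has any preference for. Preferences are assumed to be plain dicts of item to rating; the names are illustrative and this is not Mahout's API.

```python
import math

def pearson(prefs_u, prefs_v):
    # Correlation of the two users' ratings over the items both have rated.
    common = sorted(set(prefs_u) & set(prefs_v))
    n = len(common)
    if n < 2:
        return 0.0  # not enough overlap to measure a trend
    xs = [prefs_u[i] for i in common]
    ys = [prefs_v[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who rates everything the same has no trend
    return cov / math.sqrt(var_x * var_y)

def tanimoto(prefs_u, prefs_v):
    # Overlap of the two item sets, ignoring the rating values.
    a, b = set(prefs_u), set(prefs_v)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Pearson uses the rating values, so it needs explicit ratings; Tanimoto ignores them, which suits boolean data such as purchase history. Spearman correlation is the same Pearson computation applied to rating ranks rather than raw values.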
CF Pseudo-Code
for each item i that u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    calculate running average of v's preference for i, weighted by s
return top ranked (weighted average) i
Restrict to Neighborhood
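The pseudo-code above, including the neighborhood restriction, can be turned into a small runnable sketch. The similarity used here (1 / (1 + Euclidean distance) over co-rated items) is an assumption picked for brevity, as are the function names, the neighborhood size k, and the sample data.

```python
import math

def similarity(prefs_u, prefs_v):
    # 1 / (1 + Euclidean distance) over items both users rated; 0 if no overlap.
    common = set(prefs_u) & set(prefs_v)
    if not common:
        return 0.0
    dist = math.sqrt(sum((prefs_u[i] - prefs_v[i]) ** 2 for i in common))
    return 1.0 / (1.0 + dist)

def recommend(target, prefs, k=2, top_n=3):
    # prefs: {user: {item: rating}}
    seen = prefs[target]
    # Restrict to the k most similar users (the neighborhood).
    sims = [(similarity(seen, p), v) for v, p in prefs.items() if v != target]
    neighborhood = sorted((sv for sv in sims if sv[0] > 0), reverse=True)[:k]
    totals, weights = {}, {}
    for s, v in neighborhood:
        for item, rating in prefs[v].items():
            if item in seen:
                continue  # only score items the target has no preference for
            totals[item] = totals.get(item, 0.0) + s * rating
            weights[item] = weights.get(item, 0.0) + s
    # Similarity-weighted average preference, highest first.
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in ranked[:top_n]]

prefs = {
    "u": {"A": 5, "B": 3},
    "v": {"A": 5, "B": 3, "C": 4},
    "w": {"A": 1, "B": 5, "C": 1, "D": 5},
    "x": {"E": 2},
}
print(recommend("u", prefs, k=1))  # [('C', 4.0)]
```

With k=1 only the closest user contributes. Widening the neighborhood to k=2 also lets the less similar user "w" contribute, which brings item D into the results.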
Testing
  Smell Test
  Built-In (Requires Java Coding)
    Root Mean Squared Error (RMSE)
    Average Absolute Difference
    RandomUtils.useTestSeed();
    evaluator.evaluate(builder, null, model, 0.7, 1.0);  // train on 70% of the data, evaluate 100% of users
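The two built-in scores are easy to state outside of Java. The sketch below shows a repeatable hold-out split and both error measures in plain Python. It mirrors the idea behind the evaluator call (fixed seed, 70% training data) but is not Mahout's implementation; the function names and the seed value are assumptions.

```python
import math
import random

def split(ratings, training_fraction=0.7, seed=42):
    # Shuffle with a fixed seed so the split is repeatable, then cut.
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]

def average_absolute_difference(actual, predicted):
    # Mean of |actual - predicted|; every error counts in proportion to its size.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Square root of the mean squared error; large misses are penalized more.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [3.0, 5.0, 4.0]
predicted = [3.5, 5.0, 2.0]
print(average_absolute_difference(actual, predicted))
print(rmse(actual, predicted))
```

For both scores lower is better, and 0 means the held-out preferences were predicted exactly.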
Recommendation Engine Steps
  1 – Generate List of ItemIDs
  2 – Create Preference Vector
  3 – Count Unique Users
  4 – Transpose Preference Vectors
  5 – Row Similarity
    Compute Weights
    Compute Similarities
    Similarity Matrix
  6 – Pre-Partial Multiply, Similarity Matrix
  7 – Pre-Partial Multiply, Preferences
  8 – Partial Multiply (Steps 6 & 7)
  9 – Filter Items
  10 – Aggregate & Recommend
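The steps above describe a distributed pipeline, but the underlying arithmetic fits in one in-memory function. The sketch below follows the same order on a toy data set, using simple co-occurrence counts as the row similarity. That choice, the function name, and the sample data are assumptions for illustration; on Hadoop each step would be a separate Map/Reduce pass.

```python
from collections import defaultdict

def item_based_recommend(prefs, user, top_n=2):
    # prefs: {user: {item: rating}} -- one preference vector per user (steps 1-3).
    items = sorted({i for p in prefs.values() for i in p})
    # Step 4: transpose to item -> {user: rating}.
    item_users = defaultdict(dict)
    for u, p in prefs.items():
        for i, r in p.items():
            item_users[i][u] = r
    # Step 5: row similarity; here, the number of users who have both items.
    sim = {
        i: {j: len(item_users[i].keys() & item_users[j].keys())
            for j in items if j != i}
        for i in items
    }
    # Steps 6-8: multiply the similarity matrix by the user's preference vector.
    scores = defaultdict(float)
    for j, r in prefs[user].items():
        for i, s in sim[j].items():
            scores[i] += s * r
    # Step 9: filter out items the user already has a preference for.
    for j in prefs[user]:
        scores.pop(j, None)
    # Step 10: aggregate and return the highest-scoring items.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(i, s) for i, s in ranked[:top_n] if s > 0]

prefs = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 2, "C": 5},
    "u3": {"B": 4, "C": 3, "D": 2},
    "u4": {"A": 5, "C": 4},
}
print(item_based_recommend(prefs, "u1"))  # [('C', 16.0), ('D', 3.0)]
```

Item C scores highest for u1 because it co-occurs with both of the items u1 already rated, weighted by how strongly u1 rated them.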
Batch Integration
  ETL Data to HDFS
    SSIS
    Map/Reduce
  Process with Mahout
  ETL Results
    Map/Reduce
    Hive/Sqoop
Hands-On Demo
Resources
  Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman
  Hadoop: The Definitive Guide, by Tom White
Thank you!
@BluewaterSQL http://bluewatersql.wordpress.com/ [email protected]