Applied Recommender Systems Bob Brehm 5/20/2014
Transcript
Page 1: Recommender.system.presentation.pjug.05.20.2014

Applied Recommender Systems Bob Brehm 5/20/2014

Page 2:

Presentation Topics

- Hadoop MapReduce Overview
- Mahout Overview
- Hive Overview
- Review of recommender systems
- Introduction to Spring XD
- Demonstrations as we go

Page 3:

Hadoop Overview

History [7]

2003: Apache Nutch (an open-source web search engine) was created by Doug Cutting and Mike Cafarella.

2004: The Google File System and MapReduce papers were published.

2005: Hadoop was created within Nutch as an open-source implementation of GFS and MapReduce.

Page 4:

Hadoop Overview

Today Hadoop is an independent Apache project consisting of 4 modules: [6]

- Hadoop Common
- HDFS – distributed, scalable file system
- YARN (v2) – job scheduling and cluster resource management
- MapReduce – system for parallel processing of large data sets

The Hadoop market size is over $3 billion!

Page 5:

Hadoop Overview

Other Hadoop-related projects include:

- Hive – data warehouse infrastructure
- Mahout – machine learning library

While there are many more projects, the rest of the talk will focus on these two, as well as MapReduce and HDFS.

Page 6:

Hadoop Overview

- NameNode – keeps track of all DataNodes
- JobTracker – main scheduler
- DataNode – stores the individual data blocks
- TaskTracker – runs and sequences tasks on each DataNode

Page 7:

Hadoop Overview

HDFS basic command examples:

- put – copies from local to HDFS:
  hadoop fs -put localfile /user/hadoop/hadoopfile
- mkdir – makes directories:
  hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
- tail – displays the last kilobyte of a file:
  hadoop fs -tail pathname

Very similar to Linux commands.

Page 8:

Hadoop Overview

- Input data – wrangling can be difficult
- Mapper – splits data into key-value pairs
- Sort – sorts values by key
- Reducer – combines values by key

Page 9:

Hadoop Overview

Wordcount (the "Hello World" of MapReduce) – counts occurrences of each word in a document

Half of TF-IDF
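The same map → sort → reduce flow can be sketched in a single process, without a cluster. Below is a minimal, illustrative Java sketch (plain collections stand in for Hadoop's shuffle machinery; the class and method names are made up):

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map phase: emit (word, 1) for every token. The sort/shuffle and
    // reduce phases are modeled by the sorted TreeMap, which groups
    // keys and sums their counts.
    static Map<String, Integer> wordCount(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // reduce: sum values by key
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same sample inputs as the deck's file01 and file02
        Map<String, Integer> counts =
            wordCount("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        counts.forEach((w, c) -> System.out.println(w + " " + c));
    }
}
```

Running it prints the same per-word counts that the Hadoop job on the later slides produces, in sorted key order.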

Page 10:

Hadoop Overview

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

Page 11:

Hadoop Overview

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}

Page 12:

Hadoop Overview

Setup the data:

/usr/joe/wordcount/input - input directory in HDFS

/usr/joe/wordcount/output - output directory in HDFS

$ hadoop fs -ls /usr/joe/wordcount/input/

/usr/joe/wordcount/input/file01

/usr/joe/wordcount/input/file02

$ hadoop fs -cat /usr/joe/wordcount/input/file01

Hello World Bye World

$ hadoop fs -cat /usr/joe/wordcount/input/file02

Hello Hadoop Goodbye Hadoop

Page 13:

Hadoop Overview

Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:

$ mkdir wordcount_classes

$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java

$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .

Run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000

Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

Page 14:

Hadoop Overview

Interesting facts about MapReduce:

- MapReduce can run on any type of file, including images.
- Hadoop streaming technology allows other languages to use MapReduce: Python, R, Ruby.
- Can include a Combiner method that can streamline traffic.
- Not required to include a Reducer (image processing, ETL).
- Hadoop includes a JobTracker WebUI.
- MRUnit – JUnit test framework.

Page 15:

Hadoop Overview

Spring for Apache Hadoop project:

- Configure and run MapReduce jobs as container-managed objects.
- Provides template helper classes for HDFS, HBase, Pig and Hive.
- Use the standard Spring approach for Hadoop!
- Access all Spring goodies – Messaging, Persistence, Security, Web Services, etc.

Page 16:

Hive

Hive is an alternative to writing MapReduce jobs. Hive compiles to MapReduce.

Hive programs are written in HiveQL, which is similar to SQL.

Examples:

Create table:
hive> CREATE TABLE pokes (foo INT, bar STRING);

Loading data:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Page 17:

Hive

Examples (cont):

Getting data out of Hive: INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

Join: FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

Hive may reduce the amount of code you have to write when you are doing data wrangling.

It's a tool that has its place and is useful to know.

Page 18:

Mahout

Started as a subproject of Lucene in 2008.

The idea behind Mahout is that it provides a framework for the development and deployment of machine learning algorithms.

Currently it has three distinct capabilities:

- Classification
- Clustering
- Recommenders

Page 19:

Mahout

Support for recommenders includes:

- DataModel – provides connections to data
- UserSimilarity – computes the similarity between users
- ItemSimilarity – computes the similarity between items
- UserNeighborhood – finds a neighborhood (mini-cluster) of like-minded users
- Recommender – the producer of recommendations

Algorithms!
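To make these roles concrete, here is a small self-contained Java sketch that mirrors the component breakdown above: a plain map plays the DataModel, cosine similarity plays UserSimilarity, and every other user serves as the neighborhood. This is illustrative only, not Mahout's actual API; all names and sample ratings are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyRecommender {
    // "DataModel": userId -> (itemId -> rating)
    static Map<Integer, Map<Integer, Double>> model = new HashMap<>();

    static void rate(int user, int item, double rating) {
        model.computeIfAbsent(user, u -> new HashMap<>()).put(item, rating);
    }

    // "UserSimilarity": cosine similarity over the users' rating vectors
    static double similarity(int u1, int u2) {
        double dot = 0, n1 = 0, n2 = 0;
        Map<Integer, Double> r1 = model.get(u1), r2 = model.get(u2);
        for (Map.Entry<Integer, Double> e : r1.entrySet()) {
            Double other = r2.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : r1.values()) n1 += v * v;
        for (double v : r2.values()) n2 += v * v;
        return dot == 0 ? 0 : dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }

    // "Recommender": score the items the user has not rated by the
    // similarity-weighted ratings of other users, then rank them.
    static List<Integer> recommend(int user, int howMany) {
        Map<Integer, Double> scores = new HashMap<>();
        for (int other : model.keySet()) {
            if (other == user) continue;
            double sim = similarity(user, other);
            for (Map.Entry<Integer, Double> e : model.get(other).entrySet()) {
                if (!model.get(user).containsKey(e.getKey())) {
                    scores.merge(e.getKey(), sim * e.getValue(), Double::sum);
                }
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked.subList(0, Math.min(howMany, ranked.size()));
    }

    public static void main(String[] args) {
        rate(1, 101, 5); rate(1, 102, 3);
        rate(2, 101, 5); rate(2, 102, 3); rate(2, 103, 4);
        rate(3, 101, 2); rate(3, 104, 5);
        // Item 103 ranks first: user 2, who is most similar to user 1, rated it highly
        System.out.println(recommend(1, 2));
    }
}
```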

Page 20:

Intro to Recommenders

Page 21:

What is a recommender?

Wikipedia [3]: a subclass of information filtering system that seeks to predict the 'rating' or 'preference' that a user would give to an item.

My addition: a subclass of machine learning.

Recommender model [2]:

- Users
- Items
- Ratings
- Community

Page 22:

What is a recommender? [2]

Page 23:

Recommender types

- Non-personalized [2]
- Content-based filtering (user-item) [2]
- Hybrid [3]
- Collaborative filtering (user-user, item-item) [2]

Page 24:

Recommender types

- Non-personalized [2]
- Content-based filtering (user-item) [2]
- Hybrid [3]
- Collaborative filtering (user-user, item-item) [2]

Page 25:

Collaborative Filtering

We will now look at item-item collaborative filtering as the recommendation algorithm.

It answers the question: which items are similar to the ones you like?

Popularized by Amazon, who found that item-item scales better, can be done in real time, and generates high-quality results. [8]

Specifically, we will look at the Pearson correlation coefficient algorithm.

Page 26:

Collaborative Filtering

Pearson's correlation coefficient - defined as the covariance of the two variables divided by the product of their standard deviations.
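That definition can be sketched directly in plain Java; the sample rating vectors below are made up for illustration:

```java
public class Pearson {
    // Pearson's r: the covariance of x and y divided by the product
    // of their standard deviations (the n terms cancel, so sums suffice).
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n; meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            cov  += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Two users' ratings of the same five movies (made-up data)
        double[] a = {4, 3, 5, 1, 2};
        double[] b = {5, 3, 4, 2, 1};
        System.out.println(pearson(a, b)); // 0.8: strongly similar tastes
    }
}
```

A value near +1 means the two users rate movies alike, near -1 means they rate in opposite ways, and near 0 means their ratings are unrelated.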

Page 27:

Collaborative Filtering

The idea is to examine a log file of users' movie ratings. The data looks like this:

109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"

109.170.148.120 - - [05/Jan/1998:22:48:57 -0800] "GET /rate?movie=345&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"

109.170.148.120 - - [05/Jan/1998:22:50:15 -0800] "GET /rate?movie=312&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"

Page 28:

Collaborative Filtering

Steps used for the analysis:

- Run a Hive script to extract the user data from a log file.
- Run the Mahout command from the command line (could be done programmatically as well).
- Examine the contents.

Page 29:

Collaborative filtering

<hive-runner id="hiveRunner">
  <script>
    CREATE TABLE MAHOUT_INPUT_A
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    AS SELECT
      cookie as user,
      regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 1) as movie,
      CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) as double) as rating
    FROM ACCESS_LOGS
    WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) != "";
  </script>
</hive-runner>

Page 30:

Collaborative Filtering

public class HiveApp {

    private static final Log log = LogFactory.getLog(HiveApp.class);

    public static void main(String[] args) throws Exception {
        AbstractApplicationContext context = new ClassPathXmlApplicationContext(
                "/META-INF/spring/hive-context.xml", HiveApp.class);
        context.registerShutdownHook();

        HiveRunner runner = context.getBean(HiveRunner.class);
        runner.call();
    }
}

Page 31:

Collaborative Filtering

Hive output looks like this (This is the format that Mahout requires):

UserId, MovieID, relationship strength

943,373,3.0
943,391,2.0
943,796,3.0
943,237,4.0
943,840,4.0
943,230,1.0
943,229,2.0
943,449,1.0
943,450,1.0
943,228,3.0

Page 32:

Collaborative filtering

Rerun Mahout with a different similarity measure, say SIMILARITY_EUCLIDEAN_DISTANCE:

- Do an A/B comparison in production.
- Gather statistics over time.
- See if one algorithm is better than the others.
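For comparison, a Euclidean-based similarity derives a score from the distance between two rating vectors; one common form is 1 / (1 + d). Below is a hedged plain-Java sketch of that idea (Mahout's SIMILARITY_EUCLIDEAN_DISTANCE may use a somewhat different formula internally; the sample data is made up):

```java
public class EuclideanSimilarity {
    // Similarity from Euclidean distance: identical vectors give 1.0,
    // and the score decays toward 0 as the distance between them grows.
    static double similarity(double[] x, double[] y) {
        double sumSq = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sumSq += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    public static void main(String[] args) {
        double[] a = {4, 3, 5, 1, 2};
        double[] b = {5, 3, 4, 2, 1};
        System.out.println(similarity(a, a)); // 1.0 for identical ratings
        System.out.println(similarity(a, b)); // lower, since the ratings differ
    }
}
```

Note that unlike Pearson correlation, this measure is sensitive to each user's rating scale, which is one reason A/B testing the two in production is worthwhile.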

Page 33:

Spring XD

XD – a Spring.io project that extends the work the Spring Data team did on the Spring for Apache Hadoop project.

High throughput distributed data ingestion into HDFS from a variety of input sources.

Real-time analytics at ingestion time, e.g. gathering metrics and counting values.

Hadoop workflow management via batch jobs that combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or Cascading).

High throughput data export, e.g. from HDFS to a RDBMS or NoSQL database.

Page 34:

Spring XD

Configure a stream using XD. Simple case:
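A stream in Spring XD is defined by piping a source module to a sink in the XD shell DSL. A sketch of the simple case (the stream name is illustrative; http and hdfs are stock modules):

```
xd:> stream create --name ratinglogs --definition "http | hdfs" --deploy
```

Data POSTed to the http source then lands in HDFS, ready for the kind of Hive and Mahout processing shown earlier.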

Page 35:

Spring XD

More typical Corporate Use Case Stream:

Page 36:

Spring XD

Admin UI

Page 37:

??

Page 38:

Thanks!

Page 39:

References

[1] Introduction to recommender systems. Joseph Konstan.

[2] Intro to recommendations. Coursera.

[3] Recommender system. Wikipedia.

[4] An Algorithmic Framework for Performing Collaborative Filtering.

[5] Hybrid Web Recommender Systems.

[6] Hadoop web site.

[7] Apache Hadoop. Wikipedia.

[8] Amazon.com Recommendations paper. cs.umd.edu.

[9] Cloudera Data Science Training. Cloudera.