Data Mining using Mahout

Source: search.iiit.ac.in/cloud/presentations/22.pdf

Balaji Thota (200999001), Aneesh Chivukula (200707038)

Project No. 7, Team No. 22

Machine Learning

Large amounts of digital (structured and unstructured) data:
- User-generated content on the web
- Customer and event logs
- Sensor data

Draw conclusions from raw data:
- Index and search the data
- Learn common process rules from data
- Build models and generalize what was observed

Analytics needed to turn data into useful information. Machine learning is concerned with techniques that allow computers to improve their outputs based on previous experiences.

Apache Mahout

Idea : Scale and go parallel

Designed for high throughput and low latency, and able to handle massive datasets both during training and when applied in a production environment.

Failure resistant: What if service X is unavailable?

Failover built in: Hardware failure does happen.

Documented logging: Understand message w/o code.

Monitoring: Which parameters indicate system's health?

Automated deployment: How long to bring up machines?

Backup: Where do backups go to, how to do restore?

Scaling: What if load or amount of data double, triple?

Apache Mahout Release 0.1

Mahout provides open-source, stable, scalable, parallelized implementations of machine-learning algorithms on the Hadoop platform.

Lucene – provides the use cases.
Hadoop – parallelization.
Hama – matrix and vector libraries.

Mahout functionality:
• Recommendation mining: Taste collaborative filtering
• Clustering: k-Means, fuzzy k-Means, Canopy, Dirichlet, Mean-Shift
• Classification: Distributed Naive Bayes, Complementary Naive Bayes, Random forests
• Frequent itemset mining: Parallel FP growth
• Evolutionary programming: Distributed fitness function, Watchmaker

Examples of all of the above algorithms.

Building Mahout

Pre-Requisite Software
* JDK 1.6 or higher
* Ant 1.7 or higher: Java-based build tool
* Maven 2.0.9 or 2.0.10: manages the project's build, reporting and documentation

Verify the Maven install. Configure the HTTP proxy in Maven.

Download the Mahout source. Compile the Core and Examples code using Maven; the required jar files are downloaded and the unit tests are executed.

Three-Module Architecture

[Diagram: User Module (JSP/HTML + JavaScript), Integrator Module (Servlets), and Algorithm Module or Engine (data mining and clustering algorithms). Labels: HTTP request; front end and the integrator; XML.]

Architectural Overview:

We adopt a three-tier architecture, with a separate module for each level of processing and data flow.

• All client-side and user-related logic goes into the User Module. This module is the client interface; it is implemented with JSPs and uses XML for communication.

• The Integrator forwards each client request to the Algorithm Engine. We implemented this module using Servlets.

• The Algorithm Engine/Module is the heart of this project and is built on top of Mahout. It consists of our wrappers for the algorithms already present in Mahout (Hadoop, Lucene, Hama), packaged as jar files.

• Datasets: UCI data sets.

Software Requirements of the System

What does our System do?

User's Perspective of the System

Different Map Reduce Jobs

Why use a static runJob method in each Algorithm Driver, Input Driver, and Output Driver?

1. For making algorithms chainable

2. We can call other algorithms from within an algorithm.
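A minimal sketch of how static runJob methods make jobs chainable. The three driver classes below are hypothetical stand-ins written for this illustration, not Mahout's actual drivers or signatures; each one only prints what it would do and returns its output path so the next driver can consume it.

```java
// Hypothetical stand-in drivers; names, parameters and paths are assumptions.
final class DriverChainSketch {

    static final class SketchInputDriver {
        // Would convert raw text rows into vectors.
        static String runJob(String input, String output) {
            System.out.println("convert " + input + " -> " + output);
            return output;
        }
    }

    static final class SketchCanopyDriver {
        // Would build canopies using the two distance thresholds.
        static String runJob(String input, String output, double t1, double t2) {
            System.out.println("canopy(" + t1 + ", " + t2 + ") " + input + " -> " + output);
            return output;
        }
    }

    static final class SketchKMeansDriver {
        // Would refine the initial clusters for at most maxIterations rounds.
        static String runJob(String input, String clustersIn, String output, int maxIterations) {
            System.out.println("kmeans(" + maxIterations + ") " + input
                    + " seeded by " + clustersIn + " -> " + output);
            return output;
        }
    }

    // Because every step is a static call that returns its output path,
    // one algorithm can feed (or be called from) another.
    static String runPipeline(String rawData, String workDir) {
        String vectors = SketchInputDriver.runJob(rawData, workDir + "/vectors");
        String canopies = SketchCanopyDriver.runJob(vectors, workDir + "/canopies", 3.0, 1.5);
        return SketchKMeansDriver.runJob(vectors, canopies, workDir + "/kmeans", 10);
    }

    public static void main(String[] args) {
        System.out.println("final output: " + runPipeline("data.txt", "out"));
    }
}
```

No driver object has to be constructed or configured between steps, which is what allows a call such as the canopy step to sit inside another algorithm's driver.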

Algorithms for which we made Generic Drivers

1. K-Means (distance measure, number of clusters, number of iterations, convergence, Threshold1, Threshold2)

2. Canopy (distance measure, Threshold1, Threshold2)

3. Mean Shift (same parameters as K-Means, except that there is no number of clusters)

4. Dirichlet (distribution model, number of clusters, iterations)

Other parameters: number of reducers.

Why make the Data and Parameters Generic?

• So that an algorithm can be applied to any dataset, i.e. to any domain.

• The user need not worry about converting the data set.

• If the parameters are generic, the user can observe the performance of the algorithm for various parameter values. This helps in determining the optimal parameters for the data set of a particular domain.

Making Dataset Generic and using the Data Structures present in Hama

    // Fragment from inside the mapper: 'values' is the input line and
    // 'types' lists one attribute type per column.
    String[] numbers = values.toString().split("[\\s]+"); // sometimes there are multiple separator spaces
    String[] attributeTypes = types.split("[\\s]+");
    List<Double> doubles = new ArrayList<Double>();
    int currentIndex = 0;

    for (String value : numbers) {
        if (value.length() > 0) {
            if (attributeTypes[currentIndex].equals("Integer")) {
                doubles.add(Double.valueOf(value)); // integers are stored as doubles
            } else if (attributeTypes[currentIndex].equals("Double")) {
                doubles.add(Double.valueOf(value));
            } else if (attributeTypes[currentIndex].equals("String")) {
                // convert to a double value (base-26 encoding) and add to the doubles list
            }
            currentIndex++;
        }
    }

    // Copy the parsed values into a vector and emit it as text.
    Vector result = new DenseVector(doubles.size());
    int index = 0;
    for (Double d : doubles) {
        result.set(index++, d);
    }
    output.collect(null, new Text(result.asFormatString()));
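The type-driven conversion can also be tried outside Hadoop. The following is a standalone sketch under stated assumptions: the class and method names are hypothetical, only the numeric types are handled, and the check that the number of values matches the number of declared types is an addition that the slides do not show.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout or Hama.
final class TypedRowParser {

    /** Parses a whitespace-separated row using one declared type per column. */
    static List<Double> parse(String line, String types) {
        String[] values = line.trim().split("\\s+");
        String[] attributeTypes = types.trim().split("\\s+");
        if (values.length != attributeTypes.length) {
            throw new IllegalArgumentException("expected " + attributeTypes.length
                    + " values but found " + values.length);
        }
        List<Double> doubles = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            switch (attributeTypes[i]) {
                case "Integer":
                    doubles.add((double) Integer.parseInt(values[i]));
                    break;
                case "Double":
                    doubles.add(Double.parseDouble(values[i]));
                    break;
                default:
                    // String attributes need a separate encoding step.
                    throw new IllegalArgumentException("unsupported type: " + attributeTypes[i]);
            }
        }
        return doubles;
    }

    public static void main(String[] args) {
        System.out.println(parse("5  3.5\t2", "Integer Double Integer")); // prints [5.0, 3.5, 2.0]
    }
}
```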

For Strings, how do you make them generic and generate double values from them?

• Just as decimal numbers are base 10 and binary numbers are base 2, we treated strings as base-26 numbers.

• For example, with a = 1 and b = 2, the value of "ba" is 2*26 + 1*1 = 53. This procedure generates large values, so the values need to be normalised. When passing the attribute types, the user therefore also passes the maximum length of each type, and we divide the value by (base)^maxlength.

Note: the long range is -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807
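The base-26 encoding can be sketched as follows. This is an illustration under assumptions: the letter values (a = 1 up to z = 26) are inferred from the "ba" = 53 example, the class and method names are hypothetical, and the overflow check is an addition prompted by the note on the long range.

```java
import java.util.Locale;

// Hypothetical helper illustrating the string-to-double encoding.
final class StringEncoder {
    private static final int BASE = 26;

    /** Reads a word as a base-26 number with a = 1 ... z = 26. */
    static long encode(String s) {
        long value = 0;
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only letters a-z are supported: " + s);
            }
            // multiplyExact/addExact throw ArithmeticException instead of silently overflowing a long
            value = Math.addExact(Math.multiplyExact(value, (long) BASE), (long) (c - 'a' + 1));
        }
        return value;
    }

    /** Divides the encoded value by BASE^maxLength, as described above. */
    static double normalize(String s, int maxLength) {
        return encode(s) / Math.pow(BASE, maxLength);
    }

    public static void main(String[] args) {
        System.out.println(encode("ba")); // prints 53
        System.out.println(normalize("ba", 2));
    }
}
```

With a maximum length of 2, "ba" becomes 53 / 676, roughly 0.078.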

Dealing with Outliers and Missing Values

1. If the max length doesn't match.

2. If there are missing values.

3. If the attribute type doesn't match.

Solution: Binning (Max Binning)

FP Growth

Stages involved:

a) Splitting Stage: Split the transaction dataset into smaller datasets.

MapReduce is then applied to the split data to solve the problem in two stages:

  a) Processing Stage 1 (Map Reduce Job 1): Find the frequency of the individual items in the transaction dataset.

  b) Processing Stage 2 (Map Reduce Job 2): Find the actual frequent patterns.

FP Growth Stage1

FPGrowthStage1Mapper:

    // 'word' (Text) and 'one' (IntWritable holding 1) are fields of the mapper class.
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Emit (item, 1) for every item in the transaction.
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }

FPGrowthStage1Reducer:

    // 'threshold' is a field of the reducer class.
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // Keep only the items whose count exceeds the threshold.
        if (sum > threshold) {
            output.collect(key, new IntWritable(sum));
        }
    }
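What Stage 1 computes can be reproduced in memory on a toy dataset. This is a sketch, not the MapReduce job itself: the class name, the sample transactions and the threshold value are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory equivalent of the Stage 1 map and reduce steps.
final class ItemFrequencyCount {

    /** Counts every item over all transactions and keeps those with count > threshold. */
    static Map<String, Integer> frequentItems(List<String> transactions, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String transaction : transactions) {
            for (String item : transaction.trim().split("\\s+")) {
                if (!item.isEmpty()) {
                    counts.merge(item, 1, Integer::sum); // map emits (item, 1); reduce sums
                }
            }
        }
        counts.values().removeIf(count -> count <= threshold);
        return counts;
    }

    public static void main(String[] args) {
        List<String> transactions = Arrays.asList("a b c", "a b", "a d");
        System.out.println(frequentItems(transactions, 1)); // prints {a=3, b=2}
    }
}
```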

FP Growth Stage2

FPGrowthStage2Mapper:

    // 'items' is a field mapping each frequent Item to its name; the iteration
    // order of its key set decides the order of the items in sortedRow.
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        Vector<String> row = new Vector<String>();
        Vector<String> sortedRow = new Vector<String>();

        // Keep only the frequent items of this transaction.
        while (tokenizer.hasMoreTokens()) {
            String currentColumn = tokenizer.nextToken();
            if (items.containsValue(currentColumn)) {
                row.add(currentColumn);
            }
        }

        // Reorder the remaining items to follow the order of 'items'.
        for (Item keys : items.keySet()) {
            for (int i = 0; i < row.size(); i++) {
                if (items.get(keys).equals(row.get(i))) {
                    sortedRow.add(row.get(i));
                    row.remove(i);
                    i--;
                }
            }
        }

        // For every item except the first, emit (item, all other items in the row).
        for (int i = sortedRow.size() - 1; i >= 1; i--) {
            String remainingString = "";
            for (int j = 0; j < sortedRow.size(); j++) {
                if (j != i) {
                    if (j != 0) {
                        remainingString += " ";
                    }
                    remainingString += sortedRow.get(j);
                }
            }
            output.collect(new Text(sortedRow.get(i)), new Text(remainingString));
        }
    }
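To see which pairs the Stage 2 mapper emits, the same steps can be run on a single transaction in memory. This is a sketch under assumptions: the class name is hypothetical, and the frequent items are passed as an already ordered list because the slides do not show how the mapper's item table is built or ordered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory version of the Stage 2 map step.
final class Stage2MapSketch {

    /** Returns the (key, value) pairs emitted for one transaction. */
    static List<String[]> emit(String transaction, List<String> frequentOrder) {
        List<String> tokens = Arrays.asList(transaction.trim().split("\\s+"));

        // Drop infrequent items and sort the rest by the given order.
        List<String> sortedRow = new ArrayList<>();
        for (String item : frequentOrder) {
            for (String token : tokens) {
                if (token.equals(item)) {
                    sortedRow.add(token);
                }
            }
        }

        // For every item except the first, pair it with all other items of the row.
        List<String[]> pairs = new ArrayList<>();
        for (int i = sortedRow.size() - 1; i >= 1; i--) {
            List<String> rest = new ArrayList<>(sortedRow);
            rest.remove(i); // removes by index
            pairs.add(new String[] {sortedRow.get(i), String.join(" ", rest)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : emit("c b x a", Arrays.asList("a", "b", "c"))) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

For the transaction "c b x a" and the order a, b, c, the infrequent item x is dropped and the pairs are (c, "a b") and (b, "a c").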

FP Growth Stage2

FPGrowthStage2Reducer:

    // 'itemSingle' is a field mapping each item name to a count, and
    // 'threshold' is a field of the reducer class.
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        HashMap<String, Integer> items = new HashMap<String, Integer>();
        items.put(key.toString(), itemSingle.get(key.toString()));

        // Count how often each item occurs together with the key item.
        while (values.hasNext()) {
            StringTokenizer tokenizer = new StringTokenizer(values.next().toString());
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                if (items.containsKey(token)) {
                    items.put(token, items.get(token) + 1);
                } else {
                    items.put(token, 1);
                }
            }
        }

        // Join the items above the threshold and track the smallest count among them.
        int value = Integer.MAX_VALUE; // was a fixed 10000, which fails for larger counts
        String outputString = "";
        boolean flag = false;
        for (String keys : items.keySet()) {
            if (items.get(keys) > threshold) {
                if (flag) {
                    outputString += " ";
                }
                flag = true;
                outputString += keys;
                if (items.get(keys) < value) {
                    value = items.get(keys);
                }
            }
        }
        if (!outputString.equals("")) {
            output.collect(new Text(outputString), new Text(Integer.toString(value)));
        }
    }
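The Stage 2 reduce step can likewise be checked on a few hand-made values. This is a sketch under assumptions: the class name and sample data are hypothetical, the key's own count is passed in directly instead of being looked up, and a TreeMap replaces the HashMap only so that the order of the items in the output is predictable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory version of the Stage 2 reduce step.
final class Stage2ReduceSketch {

    /** Returns "pattern<TAB>support", or an empty string if no item passes the threshold. */
    static String reduce(String key, int keyCount, List<String> values, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put(key, keyCount);
        for (String value : values) {
            for (String token : value.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        // Keep the items above the threshold; the support is the smallest of their counts.
        List<String> pattern = new ArrayList<>();
        int support = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                pattern.add(entry.getKey());
                support = Math.min(support, entry.getValue());
            }
        }
        return pattern.isEmpty() ? "" : String.join(" ", pattern) + "\t" + support;
    }

    public static void main(String[] args) {
        System.out.println(reduce("c", 3, Arrays.asList("a b", "a b", "a"), 1));
    }
}
```

For the key c with count 3 and the values "a b", "a b" and "a", the counts are a = 3, b = 2 and c = 3, so with a threshold of 1 the result is the pattern "a b c" with support 2.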
