Bigdata - Installation and Configuration


    HDFS

    HDFS is a filesystem designed for storing very large files with streaming data access

    patterns, running on clusters of commodity hardware.

WHERE WE CAN'T USE HDFS:

    1)Low-latency data access

    Applications that require low-latency access to data, in the tens of milliseconds

    range, will not work well with HDFS.

    2)Lots of small files

    Because the namenode holds filesystem metadata in memory, the limit to the

    number of files in a filesystem is governed by the amount of memory on the namenode.

    3)Multiple writers, arbitrary file modifications

    Files in HDFS may be written to by a single writer. Writes are always made at the

    end of the file. There is no support for multiple writers or for modifications at

    arbitrary offsets in the file.

HDFS Concepts: Blocks

A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default.

Namenodes and Datanodes

An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode

    (the master) and a number of datanodes (workers). The namenode manages the

    filesystem namespace. It maintains the filesystem tree and the metadata for all the files

    and directories in the tree. This information is stored persistently on the local disk in

    the form of two files: the namespace image and the edit log. The namenode also knows

    the datanodes on which all the blocks for a given file are located; however, it does

    not store block locations persistently, because this information is reconstructed from

    datanodes when the system starts.
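Clients work with this filesystem through the Java FileSystem API: metadata requests go to the namenode, while the file data itself is streamed to and from the datanodes. A minimal sketch of writing and reading a file (the path /tmp/hello.txt is only an illustrative assumption):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml (fs.default.name) from the classpath
        FileSystem fs = FileSystem.get(conf);       // metadata operations go to the namenode
        Path file = new Path("/tmp/hello.txt");     // illustrative path

        FSDataOutputStream out = fs.create(file);   // file data is streamed to datanodes in blocks
        out.writeBytes("hello hdfs\n");
        out.close();

        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());          // prints: hello hdfs
        in.close();
    }
}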

    Parallel Copying with distcp:-

    The canonical use case for distcp is for transferring data between two HDFS clusters.

    If the clusters are running identical versions of Hadoop, the hdfs scheme is

    appropriate:

% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

This will copy the /foo directory (and its contents) from the first cluster to the /bar


    directory on the second cluster, so the second cluster ends up with the directory structure

/bar/foo. If /bar doesn't exist, it will be created first. You can specify multiple source

    paths, and all will be copied to the destination. Source paths must be absolute.

    By default, distcp will skip files that already exist in the destination, but they can be

    overwritten by supplying the -overwrite option. You can also update only the files that

    have changed using the -update option.

Hadoop Archives

HDFS stores small files inefficiently, since each file is stored in a block, and block

    metadata is held in memory by the namenode. Thus, a large number of small files can

    eat up a lot of memory on the namenode.

    Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS

blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.

    Archive command:

    % hadoop archive -archiveName files.har /my/files /my

The first option is the name of the archive, here files.har. HAR files always have

    a .har extension.

Limitations

There are a few limitations to be aware of with HAR files. Creating an archive creates

    a copy of the original files, so you need as much disk space as the files you are archiving

    to create the archive (although you can delete the originals once you have created the

    archive). There is currently no support for archive compression, although the files that

    go into the archive can be compressed (HAR files are like tar files in this respect).

    Archives are immutable once they have been created. To add or remove files, you must

    re-create the archive.

    Hadoop-2.2.0 Installation Steps for Single-Node Cluster (On Ubuntu

    12.04)

1) Read chapter seven of Hadoop: The Definitive Guide

    Or

http://thecodeway.blogspot.com/2013/11/hadoop-220-installation-steps-for.html

FOLLOW THIS LINK -> http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html?m=1

    Or

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

    1. Download and install VMware Player depending on your Host OS (32 bit or 64 bit)

https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0

    2. Download the .iso image file of Ubuntu 12.04 LTS (32-bit or 64-bit depending on

    your requirements)

    http://www.ubuntu.com/download/desktop

3. Install Ubuntu from the image in VMware. (For efficient use, configure the virtual machine to have at least 2 GB of RAM (4 GB preferred) and at least 2 cores of processor.)

Note: Install it using any user id and password you prefer to keep for your Ubuntu installation. We will create a separate user for the Hadoop installation later.

    4. After Ubuntu is installed, login to it and go to User Accounts (right-top corner) to

    create a new user for Hadoop

    5. Click on Unlock and unlock the settings by entering your administrator password.


    6. Then click on + at the bottom-left to add a new user. Add the user type as

    Administrator (I prefer this but you can also select as Standard) and then add the

    username as hduser and create it.

    Note:After creating the account you may see it as disabled. Click on the Dropdown

    where Disabled is written and select Set Password to set the password for this

    account or select Enable to enable this account without password.

    7. Your account is set. Now login into your new hduser account.

    8. Open terminal window by pressing Ctrl + Alt + T

    9. Install openJDK using the following command

    $ sudo apt-get install openjdk-7-jdk

    10. Verify the java version installed

    $ java -version

    java version "1.7.0_25"

    OpenJDK Runtime Environment (IcedTea 2.3.12) (7u25-2.3.12-4ubuntu3)

    OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

    11. Create a symlink from openjdk default name to jdk using the following commands:

    $ cd /usr/lib/jvm

    $ ln -s java-7-openjdk-amd64 jdk


    12. Install ssh server:

    $ sudo apt-get install openssh-client

    $ sudo apt-get install openssh-server

    13. Add hadoop group and user

    $ sudo addgroup hadoop

$ sudo usermod -a -G hadoop hduser

    To verify that hduser has been added to the group hadoop use the command:

    $ groups hduser

    which will display the groups hduser is in.

    14. Configure SSH:

$ ssh-keygen -t rsa -P ""

    ...

    Your identification has been saved in /home/hduser/.ssh/id_rsa

    Your public key has been saved in /home/hduser/.ssh/id_rsa.pub

    ...

    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    $ ssh localhost

15. Disable IPv6, because it creates problems in Hadoop. Run the following command:

    $ gksudo gedit /etc/sysctl.conf

    16. Add the following line to the end of the file:

    # disable ipv6

    net.ipv6.conf.all.disable_ipv6 = 1


    net.ipv6.conf.default.disable_ipv6 = 1

    net.ipv6.conf.lo.disable_ipv6 = 1

    Save and close the file. Then restart the system and login with hduser again.

    17. Download Hadoop - 2.2.0 from the following link to your Downloads folder

    http://www.dsgnwrld.com/am/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz

    18. Extract Hadoop and move it to /usr/local and make this user own it:

    $ cd Downloads

    $ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local

    $ cd /usr/local

    $ sudo mv hadoop-2.2.0 hadoop

    $ sudo chown -R hduser:hadoop hadoop

    19. Open the .bashrc file to edit it:

    $ cd ~

    $ gksudo gedit .bashrc

    20. Add the following lines to the end of the file:

    #Hadoop variables

    export JAVA_HOME=/usr/lib/jvm/jdk/

    export HADOOP_INSTALL=/usr/local/hadoop

    export PATH=$PATH:$HADOOP_INSTALL/bin

    export PATH=$PATH:$HADOOP_INSTALL/sbin

    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL


    export HADOOP_COMMON_HOME=$HADOOP_INSTALL

    export HADOOP_HDFS_HOME=$HADOOP_INSTALL

    export YARN_HOME=$HADOOP_INSTALL

    #end of paste

    Save and close the file.

    21. Open hadoop-env.sh to edit it:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh

    and modify the JAVA_HOME variable in the File:

    export JAVA_HOME=/usr/lib/jvm/jdk/

    Save and close the file

    Restart the system and re-login

    22. Verify the Hadoop Version installed using the following command in the terminal:

    $ hadoop version

    The output should be like:

    Hadoop 2.2.0

    Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768

    Compiled by hortonmu on 2013-10-07T06:28Z


    Compiled with protoc 2.5.0

    From source with checksum 79e53ce7994d1628b240f09af91e1af4

    This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-

    common-2.2.0.jar

    This makes sure that Hadoop is installed and we just have to configure it now.

    23. Run the following command:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

24. Add the following between the <configuration> ... </configuration> tags:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

Then save and close the file.

    25. In extended terminal write:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

and paste the following between the <configuration> ... </configuration> tags:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

    26. Run the following command:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml.template

27. Add the following between the <configuration> ... </configuration> tags:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Note: In the folder that contains the slaves file (/usr/local/hadoop/etc/hadoop), create an empty file named masters, write localhost in it, and save it.

    28. Instead of saving the file directly, Save As and then set the filename as

mapred-site.xml. Verify that the file is being saved to the /usr/local/hadoop/etc/hadoop/ directory only.

    29. Type following commands to make folders for namenode and datanode:

    $ cd ~

    $ mkdir -p mydata/hdfs/namenode


    $ mkdir -p mydata/hdfs/datanode

    30. Run the following:

    $ gksudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

31. Add the following lines between the <configuration> ... </configuration> tags:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>

    32. Format the namenode with HDFS:


    $ hdfs namenode -format

    33. Start Hadoop Services:

    $ start-dfs.sh

    ....

    $ start-yarn.sh

    .

    34. Run the following command to verify that hadoop services are running

    $ jps

If everything was successful, you should see the following services running:

2583 DataNode
2970 ResourceManager
3461 Jps
3177 NodeManager
2361 NameNode
2840 SecondaryNameNode

    Run Hadoop Example

    hduser@ubuntu: cd /usr/local/hadoop

    hduser@ubuntu:/usr/local/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5

    Number of Maps = 2

    Samples per Map = 5

    13/10/21 18:41:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for

    your platform... using builtin-java classes where applicable

    Wrote input for Map #0


    Wrote input for Map #1

    Starting Job

    13/10/21 18:41:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

    13/10/21 18:41:04 INFO input.FileInputFormat: Total input paths to process : 2

    13/10/21 18:41:04 INFO mapreduce.JobSubmitter: number of splits:2

    13/10/21 18:41:04 INFO Configuration.deprecation: user.name is deprecated. Instead, use

    mapreduce.job.user.name

    ...

Installation of Hadoop on Linux in pseudo-distributed mode:

http://akbarahmed.com/2012/06/26/install-cloudera-cdh4-with-yarn-mrv2-in-pseudo-mode-on-ubuntu-12-04-lts/

http://www.analyticsby.me/2013/09/hadoop-pseudo-distributed-mode.html

    Map Reduce

MapReduce is a language-independent programming model for data processing. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

    Data Flow


A MapReduce job is a unit of work that the client wants to be performed: it consists of the

    input data, the MapReduce program, and configuration information. Hadoop runs the job by

    dividing it into tasks, of which there are two types:

    map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and

a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress

    reports to the jobtracker, which keeps a record of the overall progress of each job. If a

    task fails, the jobtracker can reschedule it on a different tasktracker.

    Hadoop divides the input to a MapReduce job into fixed-size pieces called input

splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined

    map function for each record in the split.

    MapReduce data flow with a single reduce task

HOW TO RUN A MAPREDUCE PROGRAM IN HADOOP:

Steps to run a program in Hadoop:

1) hadoop dfs -copyFromLocal /home/cloudera/yourfilename /

example:

hadoop dfs -copyFromLocal /home/cloudera/test /


    ***(where test is the name of the input file)

    2)hadoop dfs -ls /

    3)/usr/lib/hadoop-0.20$ hadoop jar yourjarfile.jar classname /InputFileName

    /OutputFileName.

    example:

1) /usr/lib/hadoop-0.20$ hadoop jar hadoop-0.20.2-cdh3u0-examples.jar wordcount /test /output

    2)/home/cloudera/ hadoop jar edu.jar count /test /output

***(where yourjarfile.jar = hadoop-0.20.2-cdh3u0-examples.jar, classname = wordcount, InputFileName = test, OutputFileName = output)

    4)hadoop fs -ls /OutputFileName

5)hadoop dfs -cat /OutputFilename/part-r-00000 (to see the output).

CASE STUDIES

    Case I: Word-Count problem

Problem: In this problem we have to calculate the number of occurrences of each word in a text file.

Solution: Here I am just giving the logic of the map and reduce functions rather than writing the whole code.

// Map function:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}

// Reduce function:
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
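The map and reduce functions above are wired together by a driver. A minimal sketch using the same old (org.apache.hadoop.mapred) API, assuming the functions above sit in inner classes Map and Reduce of a WordCount class:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);              // key type emitted by map and reduce
    conf.setOutputValueClass(IntWritable.class);     // value type emitted by map and reduce
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. /test
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. /output
    JobClient.runJob(conf);
}

This driver is what the "hadoop jar yourjarfile.jar classname /InputFileName /OutputFileName" command described earlier actually invokes.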

    Input:

    how how are you

    where have you been so long

    i was worried about you

    since the day you had gone

    Output:

    Case II: Max temperature problem

    Problem:

We have been provided with a file having year and temperature. In this case you have more than one temperature value corresponding to a year. You have to calculate the maximum temperature, the sum of the temperatures, and the minimum temperature year-wise. Let's understand it by an example. In your file you have more than one value of temperature for the year 1900, such as 32, 45, 67, 12, 68, 43. So this problem will give you output as

    1900 68

    1900 12


    1900 267

Map function:

public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    long airTemp;
    String line = value.toString();
    int a = line.length();
    int spec = line.indexOf(' ');
    String year = line.substring(0, 4);
    airTemp = Long.parseLong(line.substring(spec + 1, a));
    output.collect(new Text(year), new LongWritable(airTemp));
}

Reduce function:

public static class OldMaxTemperatureReducer extends MapReduceBase implements Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterator<LongWritable> values, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
        long maxvalue = 0L;
        long minvalue = Long.MAX_VALUE;
        long sum = 0;
        long s = 0;
        while (values.hasNext()) {
            s = values.next().get();
            sum += s;
            if (s < minvalue) { minvalue = s; }
            if (s > maxvalue) { maxvalue = s; }
        }
        output.collect(key, new LongWritable(sum));
        output.collect(key, new LongWritable(maxvalue));
        output.collect(key, new LongWritable(minvalue));
    }
}

Note: You can take DoubleWritable as well if you have double input.

    Output:


    Case III: Prime Number problem

Problem: In this problem you have a file containing numbers, and you have to find only the prime numbers from the given file.

// Map function:
// (number is a LongWritable field and bool is a BooleanWritable field of the Map class)
public void map(LongWritable key, Text value, OutputCollector<BooleanWritable, LongWritable> output, Reporter reporter) throws IOException {
    Boolean flag;
    long num = 0L;
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        num = Long.parseLong(tokenizer.nextToken());
        number.set(num);
        flag = true;
        // trial division: any divisor between 2 and num-1 means num is not prime
        for (long i = 2; i < num; i++) {
            if (num % i == 0) {
                flag = false;
                break;
            }
        }
        if (flag) {
            bool.set(true);
            output.collect(bool, number);
        }
    }
}

// Reduce function:
public void reduce(BooleanWritable key, Iterator<LongWritable> values, OutputCollector<BooleanWritable, LongWritable> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
        output.collect(key, values.next());
    }
}

    Output:

Case IV: Alphabets problem (Length)

Problem: In this problem we have a long text file and you have to calculate the length of each word included in the text file. It seems very similar to the WordCount problem, but there is a twist: the file is not a simple text file having only words. It also contains numeric data as well as punctuation symbols such as . , $ @ 1 2 3 4 5 etc., so it would be difficult to calculate the length of a word.

Let me elaborate. Suppose there is a field in the text file such as "world.". If you calculate its length, it will give 6, but it is very easy to see that the length of the word is 5. What is happening is that "." is counted as a character included in "world", while our problem is to calculate the length of each word. So what we have to do here is first remove all numeric and punctuation symbols from the file and then count the characters in each word.
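A minimal sketch of a map function for this case, in the same old-API style as the other cases (cleaning the line first, then emitting each word with its length):

// Map function (sketch): strip digits and punctuation, then emit (word, length)
// (uses the same classes as the other examples: org.apache.hadoop.io.*, org.apache.hadoop.mapred.*, java.util.StringTokenizer)
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    line = line.replaceAll("[\\p{Punct}]", "");   // drop punctuation such as . , $ @
    line = line.replaceAll("[0-9]", "");          // drop numeric data
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String w = tokenizer.nextToken();
        output.collect(new Text(w), new IntWritable(w.length()));
    }
}

An identity reduce (or a map-only job) is enough here, since the length is already computed in the map.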


Case V: Alphabets problem (Arrangements)

Problem: We have the same input file as above, but the problem statement is different. Here you have to list all the distinct words of the same length against that length. It means that if your file contains (hi how are you.i have sent a mail to you), then the output would be like this

    1 i,a

    2 hi,to

3 how,are,you (we have "you" twice in the file, but the output has it only once since only distinct words are listed)

    4 have,sent,mail

Map function:

public class mapreduce1 {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
        String b;
        private IntWritable word = new IntWritable();
        public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            line = line.replaceAll("[\\p{Punct}]", "");
            line = line.replaceAll("[0-9]", "");
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                b = tokenizer.nextToken();
                int a = b.length();
                word.set(a);
                output.collect(word, new Text(b));
            }
        }
    }

Reduce function:

public static class Reduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        String s = "";
        while (values.hasNext()) {
            String var = values.next().toString();
            if (!s.contains(var)) {
                s = s + var + ',';
            }
        }
        output.collect(key, new Text(s));
    }
}

    Output:


    Case VI: Weather data problem

Problem: This is one of the good problems. The task is to calculate the maximum temperature and the minimum temperature for each date, and to describe a day as a hot day if the temperature is greater than 40 and as a cold day if the temperature is less than 10.

The main difficulty arises when you analyse the input data. In the input file you have the date as the 2nd field and the remaining fields as temperatures. For simple analysis one can think of it as having many columns, where the 2nd column is the date and the remaining columns are temperatures. Though it is a flat file, one can think of it the way I have described. So you have to extract the date and the temperatures as described in the problem above.

Map function:

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    double temp;
    public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        line = line.replaceAll("U", "");
        int a = line.length();
        if (a > 2) {
            int spec = line.indexOf(' ');
            String s = line.substring(spec, spec + 9);     // the date field
            String b = line.substring(spec + 10, a);       // the temperature fields
            StringTokenizer tokenizer = new StringTokenizer(b);
            while (tokenizer.hasMoreTokens()) {
                temp = Double.valueOf(tokenizer.nextToken());
                output.collect(new Text(s), new DoubleWritable(temp));
            }
        }
    }
}

Reduce function:

public static class Reduce extends MapReduceBase implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<DoubleWritable> values, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
        Double maxValue = Double.MIN_VALUE;
        Double minvalue = Double.MAX_VALUE;
        Double a;
        while (values.hasNext()) {
            a = values.next().get();
            maxValue = Math.max(maxValue, a);
            minvalue = Math.min(minvalue, a);
        }
        if (maxValue > 40) {        // hot day
            output.collect(key, new DoubleWritable(maxValue));
        }
        if (minvalue < 10) {        // cold day
            output.collect(key, new DoubleWritable(minvalue));
        }
    }
}


Pig (Batch Processing)

Pig is a scripting language for exploring large datasets. Pig isn't suitable for all data

    processing tasks, however. Like MapReduce, it is designed for batch processing of data. If

    you want to perform a query that touches only a small amount of data in a large dataset,

    then Pig will not perform well, because it is set up to scan the whole dataset, or at least

    large portions of it.

    Pig is made up of two pieces:

    The language used to express data flows, called Pig Latin.

    The execution environment to run Pig Latin programs. There are currently two

    environments: local execution in a single JVM and distributed execution on a Hadoop

    cluster.


    A Pig Latin program is made up of a series of operations, or transformations, that are

    applied to the input data to produce output.
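Besides the grunt shell and script files, Pig Latin can also be run from a Java program through the PigServer class. A minimal sketch, where the two-statement pipeline and the local file input.txt are only illustrative assumptions:

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // local mode, like "pig -x local"
        pig.registerQuery("A = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        Iterator<Tuple> it = pig.openIterator("B");      // builds and runs the data flow, then iterates over the result
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}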

    Installation of Pig

1) Download a stable release from http://pig.apache.org/releases.html, and unpack the tarball in a suitable place on your workstation:

    % tar xzf pig-x.y.z.tar.gz

2) It's convenient to add Pig's binary directory to your command-line path. For example:

    % export PIG_INSTALL=/home/tom/pig-x.y.z

    % export PATH=$PATH:$PIG_INSTALL/bin

3) In the home folder, open the View menu and select Show Hidden Files. Then search for the .bashrc file and open it. Set the home path at the end of the file as

export JAVA_HOME=$HOMEPATH/programs/jdk1.7

export PATH=.:$JAVA_HOME/bin:$PATH

4) Then run the pig command to check whether the home path is set or not.

    Running Pig Programs

On the terminal, use the command pig -x local (to load files from the /user/cloudera folder); otherwise it will try to load data from HDFS.

    OR

Alternatively, you can set these two properties in the pig.properties file in Pig's conf directory (or the directory specified by PIG_CONF_DIR). Here's an example for a

    pseudodistributed

    setup:

    fs.default.name=hdfs://localhost/

    mapred.job.tracker=localhost:8021

1) You can also create a script file and save it with a .pig extension. Then run it as

$ pig -x local script.pig


    It will give the same result.

    2) Parameter substitution is also possible in pig.

    Parameter Substitution

If you have a Pig script that you run on a regular basis, it's quite common to want to

    be able to run the same script with different parameters. For example, a script that runs

daily may use the date to determine which input files it runs over. Pig supports parameter

    substitution, where parameters in the script are substituted with values supplied at

    runtime. Parameters are denoted by identifiers prefixed with a $ character; for example,

    $input and $output are used in the following script to specify the input and output paths:

    -- max_temp_param.pig

    records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);

    filtered_records = FILTER records BY temperature != 9999 AND

    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);

    grouped_records = GROUP filtered_records BY year;

    max_temp = FOREACH grouped_records GENERATE group,

    MAX(filtered_records.temperature);

    STORE max_temp into '$output';

    Parameters can be specified when launching Pig, using the -param option, one for each

    parameter:

% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/out \

    > ch11/src/main/pig/max_temp_param.pig

    You can also put parameters in a file and pass them to Pig using the -param_file option.

    For example, we can achieve the same result as the previous command by placing the

    parameter definitions in a file:

    # Input file

    input=/user/tom/input/ncdc/micro-tab/sample.txt

    # Output file

    output=/tmp/out

The pig invocation then becomes:

% pig -param_file ch11/src/main/pig/max_temp_param.param \

    > ch11/src/main/pig/max_temp_param.pig

    You can specify multiple parameter files using -param_file repeatedly. You can also

    use a combination of -param and -param_file options, and if any parameter is defined

    in both a parameter file and on the command line, the last value on the command line

    takes precedence.

    Dynamic parameters

    For parameters that are supplied using the -param option, it is easy to make the value

    dynamic by running a command or script. Many Unix shells support command substitution

for a command enclosed in backticks, and we can use this to make the output directory date-based.


Case III: Weather data problem

Problem: This is one of the good problems. The task is to calculate the maximum temperature. The main difficulty arises when you analyse the input data. In the input file you have the date as the 2nd field and the remaining fields as temperatures. For simple analysis one can think of it as having many columns, where the 2nd column is the date and the remaining columns are temperatures. Though it is a flat file, one can think of it the way I have described. So you have to extract the date and the temperatures as described in the problem above.

    Solution:

a = LOAD 'a1' AS (line:chararray);

b = FOREACH a GENERATE REPLACE(line,'U','') AS line;

c = FOREACH b GENERATE TRIM(SUBSTRING(line,6,14)) AS date, TRIM(SUBSTRING(line,15,216)) AS temp;

d = FOREACH c GENERATE date, FLATTEN(TOKENIZE(temp)) AS temp;

e = FOREACH d GENERATE date, (double)temp AS temp;

f = FILTER e BY temp >= 25;

g = GROUP f BY date;

h = FOREACH g GENERATE group, MAX(f.temp);

DUMP h;


    Output:

Case IV: Alphabets problem (Length)

Problem: In this problem we have a long text file and you have to calculate the length of each word included in the text file. It seems very similar to the WordCount problem, but there is a twist: the file is not a simple text file having only words. It also contains numeric data as well as punctuation symbols such as . , $ @ 1 2 3 4 5 etc., so it would be difficult to calculate the length of a word.

Let me elaborate. Suppose there is a field in the text file such as "world.". If you calculate its length, it will give 6, but it is very easy to see that the length of the word is 5. What is happening is that "." is counted as a character included in "world", while our problem is to calculate the length of each word. So what we have to do here is first remove all numeric and punctuation symbols from the file and then count the characters in each word.

    Solution:

A = LOAD 'alpha2' USING PigStorage('\t') AS (line:chararray);

B = FOREACH A GENERATE REPLACE(REPLACE(line,'[\\p{Punct}]',''),'[0-9]','') AS line;

C = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) AS line;


Note: You can also see the same result on the grunt shell using DUMP g;

    Output:

    Hive

    Hive was created to make it possible for analysts with strong SQL skills (but meager

    Java programming skills) to run queries on the huge volumes of data that Facebook

    stored in HDFS.

Installing Hive

In normal use, Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS. Metadata, such as table schemas, is stored in a database called the metastore. When starting out with Hive, it is convenient to run the metastore on your local machine. In this configuration, which is the default, the Hive table definitions that you create will be local to your machine, so you can't share them with other users.

    1) Download a release at http://hive.apache.org/releases.html, and unpack the tarball in a

    suitable place on your workstation:

    % tar xzf hive-x.y.z.tar.gz

2) It's handy to put Hive on your path to make it easy to launch:

% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev

% export PATH=$PATH:$HIVE_INSTALL/bin

    3)Now type hive to launch the Hive shell:

    % hive local

    hive>

4) Sometimes it will say it is unable to open the metastore. In that case, use the following lines on the command prompt and then run it again:

sudo chmod -R 777 /var/lib/hive/metastore/metastore_db

chmod -R a+rwx /var/lib/hive/metastore/metastore_db


    rm /var/lib/hive/metastore/metastore_db/*.lck

    Running Script on Hive

NOTE - Loading data from HDFS to the local system in Hive:

hadoop dfs -cat /user/hive/warehouse/filename/* > ~/output.txt

hadoop dfs -copyToLocal /user/hive/warehouse/filename/* ~/a.txt

We can create a script, save it with a .q extension, and use the command

$ hive -f a1.q

It will run successfully.

In order to drop a table from the command prompt:

$ hive -e 'drop table a1'    // no need of a semicolon (;)

We can create a table on the command prompt:

1) $ hive -e 'create table new(line string)'

load data local inpath 's2' overwrite into table new; (ENTER)

2) To view data:

$ hive -S -e 'select * from new' (ENTER)

-S is used to suppress output. It will suppress the "time taken" message shown with the data. You can do it without using -S too.
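The same queries can also be issued from a Java program over JDBC. This is only a sketch under the assumption that a HiveServer is running on localhost:10000 (the steps above use only the hive CLI):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");   // HiveServer1 JDBC driver
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select * from new");    // the table created above
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        con.close();
    }
}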

    CREATE TABLE ...

    ROW FORMAT DELIMITED

    FIELDS TERMINATED BY '\001'

    COLLECTION ITEMS TERMINATED BY '\002'

    MAP KEYS TERMINATED BY '\003'

LINES TERMINATED BY '\n'

STORED AS TEXTFILE;


Case III: Alphabets problem (Length)

Problem: In this problem we have a long text file and you have to calculate the length of each word included in the text file. The file is not a simple text file having only words: it also contains numeric data and punctuation symbols. If a field in the file is "world.", its length comes out as 6 because "." is counted as a character of "world", while the length of the word is 5. So what we have to do here is first remove all numeric and punctuation symbols from the file and then count the characters in each word.

    Solution:

CREATE TABLE a (line STRING);

LOAD DATA LOCAL INPATH 'alpha2' OVERWRITE INTO TABLE a;

FROM a INSERT OVERWRITE TABLE a SELECT REGEXP_REPLACE(line, '[0-9]+', '');

FROM a INSERT OVERWRITE TABLE a SELECT REGEXP_REPLACE(line, '\\p{Punct}', '');

SELECT word, LENGTH(word) FROM a LATERAL VIEW EXPLODE(SPLIT(line, ' ')) ltable AS word;

Create another table kk and store the output:

CREATE TABLE kk AS SELECT word, LENGTH(word) AS length FROM a LATERAL VIEW EXPLODE(SPLIT(line, ' ')) ltable AS word;

Load the output file from HDFS to the local system:

hadoop dfs -cat /user/hive/warehouse/kk/* > ~/a.txt

    Output:


    CASE IV: PASS-FAIL PROBLEM

    Problem:

In this problem you have two files: one is student and another is result. You have to find out the names of those students who have passed the exam.

    Solution:

CREATE TABLE student (name STRING, id INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'student' OVERWRITE INTO TABLE student;

CREATE TABLE result (id INT, status STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'results' OVERWRITE INTO TABLE result;

SELECT student.id, student.name, result.status FROM student JOIN result ON (student.id = result.id)
WHERE result.status = 'pass';

    Output:


    input

    Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591

    Haryana Ambala 404 20 80 30021 76.76746 30.373488 404-20-80-30021

    Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591

    pig script

    A = LOAD 'input.txt' AS line;

    B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\s+')) AS

    (f1:chararray,f2:chararray,f3:int,f4:int,f5:int,f6:int,f7:double,f8:double,f9:chararray);

    C = FOREACH B GENERATE f1,f2,f8,f9;

    DUMP C;

    output

    (Haryana,Ambala,30.373488,404-20-80-37591)

    (Haryana,Ambala,30.373488,404-20-80-30021)

    (Haryana,Ambala,30.373488,404-20-80-37591)

HBase


    HBase is a distributed column-oriented database built on top of HDFS. HBase is the

    Hadoop application to use when you require real-time read/write random access to

    very large datasets.

Installation

1) Download a stable release from an Apache Download Mirror and unpack it on your local filesystem. For example:

% tar xzf hbase-x.y.z.tar.gz

2) As with Hadoop, you first need to tell HBase where Java is located on your system. If you have the JAVA_HOME environment variable set to point to a suitable Java installation, then that will be used, and you don't have to configure anything further. Otherwise, you can set the Java installation that HBase uses by editing HBase's conf/hbase-env.sh and specifying the JAVA_HOME variable to point to version 1.6.0 of Java.

    3)For convenience, add the HBase binary directory to your command-line path. For

    example:

    % export HBASE_HOME=/home/hbase/hbase-x.y.z

    % export PATH=$PATH:$HBASE_HOME/bin

    1) To administer your HBase instance, launch the HBase shell by typing:

    % hbase shell

    HBase Shell; enter 'help' for list of supported commands.

    Type "exit" to leave the HBase Shell

    Version: 0.89.0-SNAPSHOT, ra4ea1a9a7b074a2e5b7b24f761302d4ea28ed1b2, Sun Jul 18

    15:01:50 PDT 2010 hbase(main):001:0>

    This will bring up a JRuby IRB interpreter that has had some HBase-specific commands

    added to it. Type help and then press RETURN to see the list of shell commands grouped

    into categories. Type help COMMAND_GROUP for help by category or help COMMAND for

    help on a specific command and example usage.

    2) Create a table:

To create a table, you must name your table and define its schema. A table's schema comprises table attributes and the list of table column families. Column families themselves have attributes that you in turn set at schema definition time. Examples of column family attributes include whether the family content should be compressed on the filesystem and how many versions of a cell to keep. Schemas can be edited later by offlining the table using the shell disable command, making the necessary alterations using alter, then putting the table back online with enable.

    To create a table named test with a single column family named data using defaults

    for table and column family attributes, enter:

    hbase(main):007:0> create 'test', 'data'

    0 row(s) in 1.3066 seconds

    3)To prove the new table was created successfully, run the list command. This willoutput all tables in user space:

    hbase(main):019:0> list

    test

    1 row(s) in 0.1485 seconds

    4)To insert data into three different rows and columns in the data column family, and

    then list the table content, do the following:

    hbase(main):021:0> put 'test', 'row1', 'data:1', 'value1'

    0 row(s) in 0.0454 seconds

    hbase(main):022:0> put 'test', 'row2', 'data:2', 'value2'

    0 row(s) in 0.0035 seconds

hbase(main):023:0> put 'test', 'row3', 'data:3', 'value3'

0 row(s) in 0.0090 seconds

    hbase(main):024:0> scan 'test'

    ROW COLUMN+CELL

    row1 column=data:1, timestamp=1240148026198, value=value1

    row2 column=data:2, timestamp=1240148040035, value=value2

    row3 column=data:3, timestamp=1240148047497, value=value3

    3 row(s) in 0.0825 seconds

Notice how we added three new columns without changing the schema.

To remove the table, you must first disable it before dropping it:

    hbase(main):025:0> disable 'test'

    09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test

    0 row(s) in 6.0426 seconds

    hbase(main):026:0> drop 'test'

    09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test

    0 row(s) in 0.0210 seconds

    hbase(main):027:0> list

    0 row(s) in 2.0645 seconds


5) To get data from a specific row:

get 'tablename', 'rowname' (ENTER)

get 'wc', 'row2' (it will print the data of the second row)

6) Delete a cell from a row:

delete 'wc', 'row1', 'data:1' (ENTER)
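The same put/get operations can also be performed from Java. A minimal sketch using the classic (pre-1.0) HBase client API, assuming a table like the test table with its data column family created earlier still exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "test");

        Put put = new Put(Bytes.toBytes("row4"));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("4"), Bytes.toBytes("value4"));
        table.put(put);                                      // like the shell: put 'test', 'row4', 'data:4', 'value4'

        Get get = new Get(Bytes.toBytes("row4"));
        Result result = table.get(get);                      // like the shell: get 'test', 'row4'
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("data"), Bytes.toBytes("4"))));

        table.close();
    }
}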

Import data from a flat file to an HBase table

1. Create an HBase table:

$ hbase shell
hbase> create 'noaastation', 'd'

    This creates a table called noaastation with one column family d.

2. Create a temporary HDFS folder to hold the data for the bulk load:

$ hdfs dfs -mkdir /user/john/hbase

3. Run importtsv to generate data in the temporary folder for the bulk load:

$ hadoop jar /usr/lib/hbase/hbase.jar importtsv '-Dimporttsv.separator=|' -Dimporttsv.bulk.output=/user/john/hbase/tmp -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2,d:c3,d:c4,d:c5,d:c6,d:c7,d:c8,d:c9,d:c10,d:c11,d:c12,d:c13,d:c14 noaastation /user/john/noaa/201212station.txt

    NOAA station data in file 201212station.txt has 15 fields separated by |. The first field is

    station id, which will be the row key in HBase. The rest of fields will be added as HBase

    columns.

4. Change the temporary folder permission:

$ hdfs dfs -chmod -R +rwx /user/john/hbase

5. Run the bulk load:

$ hadoop jar /usr/lib/hbase/hbase.jar completebulkload /user/john/hbase/tmp noaastation


Now that the data is loaded, run some HBase shell commands to query it:

hbase> scan 'noaastation'
hbase> get 'noaastation', '94994', {COLUMN => 'd:c7'}

    CASSANDRA

Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.

Brewer's CAP Theorem

The theorem states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency: Consistency, Availability, and Partition Tolerance.

    Consistency

    All database clients will read the same value for the same query, even given concurrent

    updates.

    Availability

    All database clients will always be able to read and write data.

    Partition Tolerance

    The database can be split into multiple machines; it can continue functioning in

    the face of network segmentation breaks.

Brewer's theorem is that in any given system, you can strongly support only two of the three. This is analogous to the saying you may have heard in software development: "You can have it good, you can have it fast, you can have it cheap: pick two."


sudo mkdir /var/lib/cassandra (CREATING DIRECTORY)

sudo chown -R cloudera /var/lib/cassandra (CHANGING OWNER FROM ROOT TO CLOUDERA)

OR

sudo chown -R `whoami` /var/lib/cassandra (assigning to the user who will be working on the server)

    4)Start cassandra daemon

    /home/cloudera/cassandra/bin/cassandra -f

5) Run commands on the COMMAND LINE INTERFACE and in CQLSH:

    /home/cloudera/cassandra/bin/cassandra-cli

    /home/cloudera/cassandra/bin/cqlsh

    6)netstat -ano |grep 9160 |grep LISTEN or netstat -nl|grep 9160

    To check whether a port is working or not.

    7)you can also use

    netstat -tulpen

8) Run commands on the CLI:

    a) CREATE KEYSPACE my_keyspace

    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'

    and strategy_options = {replication_factor:1};

then run USE my_keyspace;

    b)create column family account1

    with key_validation_class = UTF8Type

    and comparator = 'AsciiType'

    and default_validation_class = UTF8Type;

    update column family User with

    column_metadata =


e) deleting a cell

del account1['123']['first_name'];    // delete column first_name

    f)update column metadata

    update column family address

    with comparator = 'AsciiType'

    and key_validation_class = 'UTF8Type'

    and default_validation_class = 'UTF8Type'

    and column_metadata =

    [{column_name : city,

    validation_class : utf8},

    {column_name : zip,

    validation_class : utf8}];

    g) create super column family

    create column family student

    with column_type = 'Super'

    and key_validation_class = UTF8Type

and comparator = 'AsciiType'

and default_validation_class = UTF8Type;

    h)update super column metadata

    update column family student

    with comparator = 'AsciiType'

    and key_validation_class = 'UTF8Type'

    and default_validation_class = 'UTF8Type'

    and column_metadata =

    [{column_name : city,

    validation_class : utf8},

    {column_name : zip,

    validation_class : utf8}];

    i)execute a script

    bin/cassandra-cli -host localhost -port 9160 -f cassandrascript.txt


    WORKING ON CQLSH

(http://www.datastax.com/documentation/cql/3.1/webhelp/index.html#cql/cql_using/use_ttl_t.html)

    NOTE: FOCUS ON LIST,SET,MAP DATATYPE OF CASSANDRA

    1) CREATE KEYSPACE key

    WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 2};

    2)USE key;

    3) ALTER KEYSPACE Key WITH REPLICATION =

    { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

    4)Creating TABLE(COLUMN-FAMILY IN CLI) WITHIN A KEYSPACE

    a)CREATE TABLE users (

    user_name varchar,

    password varchar,

    gender varchar,

    session_token varchar,

    state varchar,

birth_year bigint,

PRIMARY KEY (user_name));    // PRIMARY KEY

    b)CREATE TABLE emp (

    empID int,

    deptID int,

    first_name varchar,

    last_name varchar,

    PRIMARY KEY (empID, deptID)); //COMPOSITE PRIMARY KEY

    5) INSERTION IN TABLE

    INSERT INTO emp (empID, deptID, first_name, last_name)

    VALUES (104, 15, 'jane', 'smith');
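The same CQL statements can be executed from a Java application as well. A minimal sketch using the DataStax Java driver (2.x-style API; the driver itself is not covered by this document, so treat its usage as an assumption), against the emp table above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CqlExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();  // local Cassandra node
        Session session = cluster.connect("key");                                  // the keyspace created above
        session.execute("INSERT INTO emp (empID, deptID, first_name, last_name) VALUES (105, 16, 'john', 'doe')");
        ResultSet rs = session.execute("SELECT * FROM emp WHERE empID = 105 AND deptID = 16");
        for (Row row : rs) {
            System.out.println(row.getString("first_name") + " " + row.getString("last_name"));
        }
        cluster.close();
    }
}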

    6)SYSTEM KEYSPACE ::

The system keyspace includes a number of tables that contain details about your Cassandra database objects and cluster configuration. Cassandra populates these tables and others in the system keyspace; they hold keyspace, table, and column information. An alternative to the Thrift API describe_keyspaces function is querying system.schema_keyspaces directly. You can also retrieve information about tables by querying system.schema_columnfamilies and about column metadata by querying system.schema_columns.


    Procedure

    Query the defined keyspaces using the SELECT statement

    SELECT * from system.schema_keyspaces;

Note: To kill all processes running under CassandraDaemon, use:

pgrep -u user -f cassandra | xargs kill -9    (here user is cloudera)

    MAHOUT

    Mahout is an open source machine learning library from Apache. The algorithms it

    implements fall under the broad umbrella of machine learning or collective intelligence. This

    can mean many things, but at the moment for Mahout it means primarily recommender

engines (collaborative filtering), clustering, and classification. It's also scalable. Mahout aims

    to be the machine learning tool of choice when the collection of data to be processed is very

    large, perhaps far too large for a single machine. In its current incarnation, these scalable

    machine learning implementations in Mahout are written in Java, and some portions are

built upon Apache's Hadoop distributed computation project.

Recommender engines

Recommender engines are the most immediately recognizable machine learning technique in use today. You'll have seen services or sites that attempt to recommend books or movies or articles based on your past actions. They try to infer tastes and preferences and identify unknown items that are of interest:

    1) Amazon.com is perhaps the most famous e-commerce site to deploy recommendations.

    Based on purchases and site activity, Amazon recommends books and other items likely to

    be of interest. See figure 1.2.

    2) Netflix similarly recommends DVDs that may be of interest, and famously offered a

    $1,000,000 prize to researchers who could improve the quality of their recommendations.

3) Dating sites like Líbímseti (discussed later) can even recommend people to people.

    4) Social networking sites like Facebook use variants on recommender techniques to identify

    people most likely to be as-yet-unconnected friends.
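As a concrete illustration of a recommender engine, a minimal user-based recommender with Mahout's Taste API might look like the sketch below; the file name ratings.csv, the neighbourhood size, and the user/item IDs are assumptions (the input is the usual userID,itemID,preference CSV format):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));          // userID,itemID,preference per line
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);   // top 3 items for user 1
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}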

    Clustering


    Clustering is less apparent, but it turns up in equally well-known contexts. As its name

    implies, clustering techniques attempt to group a large number of things together into

    clusters that share some similarity.

    1) Google News groups news articles by topic using clustering techniques, in order to

    present news grouped by logical story, rather than presenting a raw listing of all articles.

2) Search engines like Clusty group their search results for similar reasons.

3) Consumers may be grouped into segments (clusters) using clustering techniques based on

    attributes like income, location, and buying habits.

    Classification

Classification techniques decide how much a thing is or isn't part of some type or category, or how much it does or doesn't have some attribute. Classification, like clustering, is ubiquitous, but it's even more behind the scenes. Often these systems learn by reviewing

    many instances of items in the categories in order to deduce classification rules. This general

    idea has many applications:

1) Yahoo! Mail decides whether or not incoming messages are spam based on prior emails and spam reports from users, as well as on characteristics of the email itself.

2) Google's Picasa and other photo-management applications can decide when a region of

    an image contains a human face.

    3) Optical character recognition software classifies small regions of scanned text into

    individual characters.

4) Apple's Genius feature in iTunes reportedly uses classification to classify songs into

    potential playlists for users.