Bayesian Counters 0.1.0 Development Environment Tutorial For CentOS 6.3 x86_64 Workstation By Alex Kozlov and Chris Poulin Testing by Daniel Rule and Ken Krugler January 31 2013 Contact: Chris Poulin: [email protected]

Bayesian  Counters  0.1.0  

Development  Environment  Tutorial  For  CentOS  6.3  

x86_64  Workstation  

By  Alex  Kozlov  and  Chris  Poulin    

Testing  by  Daniel  Rule  and  Ken  Krugler  

January  31  2013  

Contact:  Chris  Poulin:  [email protected]  

 


©2013 Patterns and Predictions (Poulin Holdings, LLC) All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.

Table of Contents

1 Introduction
2 The Audience
3 The Goal
4 Provisioning
    4.1 Virus Risk Warning
    4.2 Software Archive Warning
    4.3 Development Workstation
    4.4 HBase Cluster
    4.5 Network
5 Conventions
6 Bayesian Counters 0.1.0 Development Environment
    6.1 Log into the Gnome Desktop on h13.demo.dev with the poulin account.
    6.2 Browse to Cloudera Manager at http://h12.demo.dev:7180/ and log in with the admin account using Firefox or another browser in the Gnome Desktop.
    6.3 Within Cloudera Manager, Hosts -> Add Host. This will start the Add Hosts Wizard. Follow the instructions of the wizard to add h13.demo.dev to the cluster but do not add any roles to h13.demo.dev, as we do not want the development host to participate as part of the cluster. When complete, h13.demo.dev will show up in the Hosts list.
    6.4 Within Cloudera Manager, Services -> HBase (hbase1) and Actions -> Download Client Configuration and save to /home/poulin/Desktop/hbase1-clientconfig.zip
    6.5 Within Cloudera Manager, Services -> MapReduce (mapreduce1) and Actions -> Download Client Configuration and save to /home/poulin/Desktop/mapreduce1-clientconfig.zip
    6.6 Open a Terminal (all shell commands going forward will be executed in the Terminal)
    6.7 Point hbase shell to h12.demo.dev
    6.8 Test HBase connectivity and hbase shell
    6.9 Install the Development Tools group
    6.10 Install Apache Maven 3.0.X
    6.11 Install Apache Maven 2.2.X
    6.12 Install asciidoc, doxygen, help2man, source-highlight and Python 2.7.3
    6.13 Install Eclipse Juno (in future, newer is probably OK, but optimal repeatability with Juno)
    6.14 Append to the end of /home/poulin/.bashrc
    6.15 Open a 2nd Terminal and Start Eclipse Juno
    6.16 Click OK on workspace launcher defaults, both now and when seen through the remainder of this tutorial.
    6.17 Install M2E extension for Eclipse Juno
    6.18 Close Eclipse
    6.19 Obtain a copy of bcounts-0.1.0-SNAPSHOT-project.tar.bz2 from Cloudera and save to /home/poulin/
    6.20 Prepare bcounts-0.1.0-SNAPSHOT for execution from Eclipse and CLI
    6.21 Check M2_REPO classpath variable set in Eclipse
    6.22 In Eclipse: File -> Import... -> (expand) General -> (highlight) Existing Projects into Workspace -> (click) Next -> and specify root directory /home/poulin/bcounts-0.1.0-SNAPSHOT (hit enter if pasted)
    6.23 In Eclipse: Window -> Preferences -> Java -> Code Style -> Formatter
    6.24 Click Import… and navigate to /home/poulin/bcounts-0.1.0-SNAPSHOT/eclipse_formatter_apache.xml Then Click Apply and then Click OK and then close Eclipse.
    6.25 Create schema for Bayesian Counters examples in HBase
    6.26 Load Iris data into HBase via CLI
    6.27 Load Iris data into HBase via Eclipse
    6.28 View Iris data and schema in HBase
    6.29 Perform NB inference on the Iris dataset. Note: NB inference must be executed within 300 seconds of loading Iris data into HBase, or modify the 300 in the following steps to a larger number of seconds while testing
    6.30 Perform clique scoring with random projections
    6.31 Create small delta of the ad.data file
    6.32 Load Ad data into HBase via Eclipse
    6.33 Perform NB inference on the Ad dataset Run -> Run Configurations...
    6.34 Create bag of words file from configuration file
    6.35 Edit /home/poulin/bcounts-0.1.0-SNAPSHOT/bin/sp_schema.py Change from: if len(sys.argv)<3 or sys.argv[1] is None:
    6.36 Create an XML Configuration file derived from a bag-of-words file
    6.37 Convert testing files into header-less files for storing in HDFS
    6.38 Generate a 'scored_' file in current directory
    6.39 Create small delta of sp-training-file
    6.40 Load small sample of SP into HBase via Eclipse
    6.41 Perform NB inference on the SP dataset Run -> Run Configurations...
7 Bayesian Counters Javadoc
8 Additional Resources – Technologies Used In this Tutorial


1 Introduction  

Bayesian Counters (B-counts) is a framework for online, near-real-time model building and prediction. It can be used to identify correlations in the data, and as a library for responding to unusual or rare events. The underlying technology for B-counts is HBase, a highly scalable and fault-tolerant key-value map storage engine. The solution can scale to thousands of nodes and billions of features. The initial prediction algorithm is Naïve Bayes (NB). The framework is currently being extended to incorporate Nearest Neighbors (NN) and general Bayesian Network (BN) learning algorithms.

2 The Audience

The steps in this tutorial are highly detailed and aim for optimal repeatability at the time of this writing. However, the audience must have Linux literacy, whether through experience, formal training or education, and a strong understanding of computer and network security. This tutorial does not cover the statistical analysis aspects of the solution.

3 The Goal

Preparing a development environment is usually a complex task, but it leads to powerful results and strong capabilities. This tutorial attempts to make the task as painless and repeatable as possible.

4 Provisioning

4.1 Virus Risk Warning

It is the responsibility of the customer to check every download mentioned in this document: verify signatures, run MD5 checks and virus scans, and take any other steps needed to ensure that no download poses a risk to the customer's trusted network. It is also the customer's responsibility to ensure that network security, firewalls and network-level port blocking are correctly configured for the trusted network specified in this document. Both the servers and the network referenced in this tutorial must be provisioned entirely for the purpose of learning from, experimenting with and completing this tutorial, and must never be adjacent to, or in any way share resources with, production or otherwise mission-critical environments.
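The MD5 check mentioned above can be scripted. The following is a minimal sketch, assuming the `md5sum` utility from GNU coreutils is available; the function name `verify_md5` is illustrative and not part of the tutorial, and the expected digest must come from the publisher of each download.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original tutorial):
# compare a file's MD5 digest with a published value.
# Usage: verify_md5 <file> <expected-md5-hex>
verify_md5() {
    local file="$1" expected="$2" actual
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
    # md5sum prints "<digest>  <file>"; keep only the digest field.
    actual=$(md5sum "$file" | awk '{print $1}')
    # Normalise case, since some sites publish upper-case hex digests.
    expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}
```

A download would then be unpacked only when the check succeeds, for example `verify_md5 archive.tar.gz <published-digest> && tar -xzf archive.tar.gz`. An MD5 match only guards against corrupted transfers; it does not replace signature verification.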

4.2 Software Archive Warning

It is the responsibility of the customer to maintain archives of all software specified in this document, as the URLs, URIs, IP addresses or other external references specified in this document may become invalid at any time without notice.


4.3 Development Workstation

The Workstation should be provisioned with the following:

1. CentOS Linux 6.3 x86_64 Workstation
2. jdk-6u31-linux-x64-rpm or a newer version of jdk-6. Note: Do not use jdk-6u18.
3. 16 GB RAM and 2 cores minimum hardware allocation
4. The hostname of the workstation for this tutorial is expected to be permanently assigned as h13.demo.dev
5. h13.demo.dev should be configured to use a certified copy of both the CentOS and EPEL repos
6. The workstation will need an example account called 'poulin' with the following permissions:
   a. Account poulin must be able to sudo with root-level credentials on h13.demo.dev.
   b. Account poulin must be able to log into the Gnome desktop of h13.demo.dev, either directly (for a local bare-metal, VMware or VirtualBox installation) or via a VNC client over an SSH tunnel (if the workstation is hosted on AWS or another trusted remote cloud/VPS, dedicated hosting service or datacenter).
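The hardware minimums above can be checked before going further. This is a rough sketch, assuming a Linux host with /proc/meminfo and the coreutils `nproc` command; the function names and the 15 GiB default threshold are assumptions (the kernel's MemTotal figure normally reads slightly below the installed RAM, so an exact 16 GiB comparison would fail on a 16 GB machine).

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of the original tutorial).
# check_minimum <mem_kb> <cores> [min_mem_kb] [min_cores]
check_minimum() {
    local mem_kb="$1" cores="$2"
    # Assumed defaults: 15 GiB expressed in kB, and 2 cores.
    local min_mem_kb="${3:-15728640}" min_cores="${4:-2}"
    local status=0
    if [ "$mem_kb" -lt "$min_mem_kb" ]; then
        echo "RAM below minimum: ${mem_kb} kB" >&2
        status=1
    fi
    if [ "$cores" -lt "$min_cores" ]; then
        echo "Too few cores: ${cores}" >&2
        status=1
    fi
    return "$status"
}

# Read the values for the current host and check them (Linux only).
report_host() {
    local mem_kb cores
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    cores=$(nproc)
    echo "host=$(hostname) mem_kb=$mem_kb cores=$cores"
    check_minimum "$mem_kb" "$cores"
}
```

Running `report_host` on h13.demo.dev prints the detected values and returns a non-zero status if either minimum is not met.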

4.4 HBase Cluster

1. Navigate to https://ccp.cloudera.com/display/DOC/Documentation
2. Download and archive a copy of all documents under:
   a. Cloudera Manager 4.1 Enterprise Edition Documentation
   b. Cloudera Manager 4.1 Free Edition Documentation
3. Follow the steps outlined in CM-4.1-free-installation-guide.pdf to provision a pseudo-distributed cluster in the same trusted network as the Development Workstation on h13.demo.dev. The cluster must consist of:
   a. A single host with the assigned hostname h12.demo.dev
   b. The single host should be installed with CM4.1 as well as HBase and all dependent roles via the CM4.1 UI.


4.5 Network

Both the HBase Cluster and the Workstation should be on the same trusted network, with:

- Two and only two hosts existing on the trusted network:
  o h12.demo.dev (pseudo-distributed HBase cluster and Cloudera Manager 4.1)
  o h13.demo.dev (CentOS Linux 6.3 x86_64 Workstation)
- All external inbound ports blocked for connections into the trusted network, except for SSH from other trusted locations only.
- The workstation must be able to connect out to the internet on HTTP, HTTPS, FTP and SFTP.
- No ports should be blocked between the HBase Cluster and the Workstation within the trusted network.
- Both h13.demo.dev and h12.demo.dev must have a permanent static IP address and hostname.
- The IP address on both h13.demo.dev and h12.demo.dev must support reverse lookup to hostname.
- The date and time on both h13.demo.dev and h12.demo.dev must be permanently in sync.
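The reverse-lookup and clock-sync requirements in the list above can be spot-checked from a terminal. The sketch below assumes the glibc `getent` tool and, for the remote clock reading in the usage note, working SSH access between the two hosts; both function names are illustrative.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original tutorial).

# Absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    local diff=$(( $1 - $2 ))
    # Strip a leading minus sign to get the absolute value.
    echo "${diff#-}"
}

# Resolve a host name to an address, then resolve that address back
# through the system resolver and print the name it maps to.
reverse_name() {
    local host="$1" ip
    ip=$(getent hosts "$host" | awk '{print $1; exit}')
    [ -n "$ip" ] || return 1
    getent hosts "$ip" | awk '{print $2; exit}'
}
```

For example, `reverse_name h12.demo.dev` should print h12.demo.dev when forward and reverse lookup agree, and `clock_skew "$(date +%s)" "$(ssh h12.demo.dev date +%s)"` should print a value close to zero when the clocks are in sync.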

5 Conventions

Single-line boxes delimit commands to execute in the Terminal

# example

Dashed-line boxes delimit some or all of the contents of a text file

# example

Double-line boxes delimit some or all standard output

# example

Wave-line boxes delimit hbase shell

# example

Candy-cane boxes delimit an overview of logic

# example


6 Bayesian Counters 0.1.0 Development Environment

6.1 Log into the Gnome Desktop on h13.demo.dev with the poulin account.

Note: Gnome Desktop comes with the CentOS Linux 6.3 x86_64 Desktop install.

6.2 Browse to Cloudera Manager at http://h12.demo.dev:7180/ and log in with the admin account using Firefox or another browser in the Gnome Desktop.

 


6.3 Within Cloudera Manager, Hosts -> Add Host

This will start the Add Hosts Wizard. Follow the wizard's instructions to add h13.demo.dev to the cluster, but do not add any roles to h13.demo.dev, as we do not want the development host to participate as part of the cluster. When complete, h13.demo.dev will show up in the Hosts list.

 

6.4 Within Cloudera Manager, Services -> HBase (hbase1) and Actions -> Download Client Configuration and save to /home/poulin/Desktop/hbase1-clientconfig.zip

6.5 Within Cloudera Manager, Services -> MapReduce (mapreduce1) and Actions -> Download Client Configuration and save to /home/poulin/Desktop/mapreduce1-clientconfig.zip

 

 


6.6 Open a Terminal (all shell commands going forward will be executed in the Terminal)

6.7 Point hbase shell to h12.demo.dev

su -l

cd /home/poulin/Desktop/

unzip ./hbase1-clientconfig.zip

cp /home/poulin/Desktop/hbase-conf/* /etc/hbase/conf/

rm -fr /home/poulin/Desktop/hbase-conf
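After copying the client configuration, it can be worth confirming which ZooKeeper quorum the files point at. The helper below is a sketch under the assumption that hbase-site.xml keeps each `<name>` and `<value>` element on a single line, as Hadoop-style site files commonly do; the function name `site_property` is illustrative.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original tutorial): print the <value>
# of a named property from a Hadoop-style *-site.xml file.
# Usage: site_property <file> <property-name>
site_property() {
    local file="$1" name="$2"
    # -F treats the property name literally (its dots are not wildcards);
    # -A 2 also returns the two lines after the match, where <value> usually sits.
    grep -F -A 2 "<name>${name}</name>" "$file" \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p' \
        | head -n 1
}
```

With the setup described in this tutorial, `site_property /etc/hbase/conf/hbase-site.xml hbase.zookeeper.quorum` would be expected to name h12.demo.dev, provided the client configuration was generated for that cluster.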

6.8 Test  HBase  connectivity  and  hbase  shell  

su poulin

echo "create 'mytest', 'cf1'" | hbase shell

echo "put 'mytest', 'row1', 'cf1', 'val1'" | hbase shell

echo "put 'mytest', 'row1', 'cf1', 'val2'" | hbase shell

echo "scan 'mytest'" | hbase shell

date --date="Fri Nov 11 11:11:11 PST 2011" +%s

# start hbase shell

hbase shell

org.apache.hadoop.hbase.util.Bytes.toString("abcde".to_java_bytes)

org.apache.hadoop.hbase.util.Bytes.toInt("abcde".to_java_bytes)

org.apache.hadoop.hbase.util.Bytes.toInt("\xa\xb\xc\xd".to_java_bytes)

org.apache.hadoop.hbase.util.Bytes.toInt("\x61\x62\x63\x64".to_java_bytes)

org.apache.hadoop.hbase.util.Bytes.toInt("\141\142\143\144".to_java_bytes)

import java.text.SimpleDateFormat

import java.text.ParsePosition

SimpleDateFormat.new("yy/MM/dd HH:mm:ss").parse("11/11/11 11:11:11", ParsePosition.new(0)).getTime()

exit

# Expected results of the expressions above, in order:

=> "abcde"

=> 1633837924

=> 168496141

=> 1633837924

=> 1633837924

=> 1321038671000
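The integer results above follow from reading the first four bytes as a big-endian 32-bit value, and they can be cross-checked from a plain terminal. This is a small sketch, assuming the POSIX `od` utility and GNU `date`; the function name `ascii_to_int` is illustrative.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original tutorial): interpret the first
# four bytes of a string as a big-endian 32-bit integer.
ascii_to_int() {
    local value=0 code
    # od prints the first four bytes (-N4) as unsigned decimals, e.g. "97 98 99 100".
    for code in $(printf '%s' "$1" | od -An -tu1 -N4); do
        value=$(( value * 256 + code ))
    done
    echo "$value"
}

ascii_to_int abcde                          # prints 1633837924
printf '%d\n' 0x0a0b0c0d                    # prints 168496141
date -u -d '2011-11-11 19:11:11' +%s        # prints 1321038671
```

The last command uses 19:11:11 UTC, which is 11:11:11 PST. The SimpleDateFormat call in the hbase shell parses the timestamp in the JVM's default time zone, so it returns 1321038671000 only on a host set to Pacific time.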

 


6.9 Install the Development Tools group

su -l

yum groupinstall "Development tools"

6.10 Install Apache Maven 3.0.X

su -l

cat > /etc/yum.repos.d/epel-apache-maven.repo << EOF

[epel-apache-maven]

name=maven from apache foundation.

baseurl=http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-6Server/x86_64/

enabled=1

skip_if_unavailable=1

gpgcheck=0

EOF

cat > /etc/yum.repos.d/epel-apache-maven-source.repo << EOF

[epel-apache-maven-source]

name=maven from apache foundation. - Source

baseurl=http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-6Server/SRPMS

enabled=0

skip_if_unavailable=1

gpgcheck=0

EOF

yum update yum

yum install apache-maven

ln -s /usr/share/apache-maven/bin/mvn /usr/bin/mvn3

# close and reopen terminal

# confirm Apache Maven 3.0.X

mvn3 --version
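Because this tutorial installs two Maven releases side by side, a scripted version check can help catch a wrapper that points at the wrong one. The sketch below assumes the first line of the version output has the form "Apache Maven X.Y.Z (...)", as Maven 2.2 and 3.0 print it; the function names are illustrative.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original tutorial).

# Read `mvn --version` output on stdin and print the version number
# taken from its first line.
maven_version() {
    sed -n '1s/^Apache Maven \([0-9][0-9.]*\).*/\1/p'
}

# Succeed when the version read from stdin starts with the given prefix.
maven_is() {
    local prefix="$1" version
    version=$(maven_version)
    case $version in
        "$prefix"*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

For example, `mvn3 --version | maven_is 3.0. && echo "Maven 3.0.x found"` confirms the symlink created above, and the same check with the prefix `2.2.` can be applied to the mvn2 wrapper.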

6.11 Install Apache Maven 2.2.X

su -l

cd /usr/share/

# Reminder: check signature of download and check maven site for alternate mirror if link broken

wget http://apache.petsads.us/maven/maven-2/2.2.1/binaries/apache-maven-2.2.1-bin.tar.gz

tar -xzf apache-maven-2.2.1-bin.tar.gz

touch /usr/bin/mvn2

chmod ugo+rx /usr/bin/mvn2

vi /usr/bin/mvn2

   

Edit  /usr/bin/mvn2  

#!/bin/bash

MAVEN_HOME=/usr/share/apache-maven-2.2.1

M2_HOME=$MAVEN_HOME


PATH=$MAVEN_HOME/bin:$PATH

export MAVEN_HOME

export M2_HOME

export PATH

/usr/share/apache-maven-2.2.1/bin/mvn "$@"

 

6.12 Install asciidoc, doxygen, help2man, source-highlight and Python 2.7.3

su -l

# Reminder: Downloads from the standard repos automatically do

# signature check if /etc/yum.repos.d directory is

# configured correctly.

yum install asciidoc doxygen help2man

yum install boost-devel

cd /tmp/

curl -O http://mirrors.kernel.org/gnu/src-highlite/source-highlight-3.1.7.tar.gz.sig

curl -O http://mirrors.kernel.org/gnu/src-highlite/source-highlight-3.1.7.tar.gz

# Reminder: Confirm signature is OK before running or installing anything.

# Reminder: If mirror downloads fail, locate an alternate mirror

# by searching the main site of the project.

tar xzvf source-highlight-3.1.7.tar.gz

cd source-highlight-3.1.7

./configure

make

make install

cd ~/

rm -fr /tmp/source-highlight-3.1.7*

cd /usr/lib/

wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz

wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz.asc

tar xzvf Python-2.7.3.tgz && cd /usr/lib/Python-2.7.3

./configure

make

# Note: the following line is not a typo and it really must be “altinstall”

make altinstall

chmod ugo+rx /usr/lib/Python-2.7.3

ln -s /usr/lib/Python-2.7.3/python /usr/bin/python273

python273 --version

ln -s /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar /usr/lib/hadoop/hadoop-core.jar
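The reminders in this step ask for the detached signatures (.sig and .asc files) to be confirmed before anything is built. One way to make that hard to skip is a small wrapper around `gpg --verify`. This is a sketch, assuming GnuPG is installed and the signer's public key has already been imported and checked; the function name `verify_signature` is illustrative.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original tutorial): verify a detached
# signature before an archive is unpacked.
# Usage: verify_signature <signature-file> <archive>
verify_signature() {
    local sig="$1" archive="$2"
    if [ ! -r "$sig" ] || [ ! -r "$archive" ]; then
        echo "missing file: $sig or $archive" >&2
        return 2
    fi
    # gpg exits non-zero when the signature is bad or cannot be checked.
    gpg --verify "$sig" "$archive"
}
```

For example, `verify_signature source-highlight-3.1.7.tar.gz.sig source-highlight-3.1.7.tar.gz && tar xzvf source-highlight-3.1.7.tar.gz` unpacks the archive only after a successful check. A "Good signature" message is only as trustworthy as the imported key, so the key fingerprint should still be compared against the project's published value.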

6.13 Install Eclipse Juno (newer releases will probably work, but Juno gives optimal repeatability)

• Navigate to http://www.eclipse.org/downloads/?osType=linux

• Download "Eclipse IDE for Java EE Developers", "Linux 64 Bit", to /home/poulin/Desktop/


su -l

mv /home/poulin/Desktop/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz /usr/lib/

chown root:root /usr/lib/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz

cd /usr/lib/

tar -xvf ./eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz

ln -s /usr/lib/eclipse/eclipse /usr/bin/eclipse

rm -fr /usr/lib/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz

6.14 Append  to  the  end  of  /home/poulin/.bashrc    

Close  and  re-­‐open  the  Terminal  after  adding  these  lines.  

export M2_OPTS="-server -Xms256m -Xmx512m"

export PATH=${PATH}:/home/poulin/bcounts-0.1.0-SNAPSHOT/bin

export CLASSPATH=`hbase classpath`

export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce

export HBASE_HOME=/usr/lib/hbase

source /home/poulin/bcounts-0.1.0-SNAPSHOT/bin/bcount-config.sh 2>> /dev/null

# Note: the following must be a single line

export BAYESIANCOUNTERS_CLASSPATH=conf:target/classes:~/.m2/repository/net/sf/trove4j/trove4j/3.0.1/*:~/.m2/repository/net/sf/opencsv/opencsv/2.3/*:~/.m2/repository/com/google/guava/guava/10.0.1/*
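A long colon-separated classpath is easy to mistype, so it can help to list its entries one per line and flag any whose directory is missing. The following is a minimal sketch; the function name `check_classpath` is illustrative. Relative entries such as conf and target/classes are resolved against the current directory, and the ~/.m2 entries only exist after the Maven build in a later step.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original tutorial): print each entry of
# a colon-separated classpath, marking entries whose directory does not exist.
check_classpath() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
        # Expand a leading "~/" by hand, since it is not expanded inside a variable.
        case $entry in
            "~/"*) dir=$HOME/${entry#"~/"} ;;
            *)     dir=$entry ;;
        esac
        # Drop a trailing "/*" wildcard so that the directory itself is tested.
        dir=${dir%"/*"}
        if [ -e "$dir" ]; then
            echo "ok      $entry"
        else
            echo "missing $entry"
        fi
    done
}
```

After re-opening the Terminal, `cd /home/poulin/bcounts-0.1.0-SNAPSHOT && check_classpath "$BAYESIANCOUNTERS_CLASSPATH"` lists the five entries with their status.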

6.15 Open  a  2nd  Terminal  and  Start  Eclipse  Juno    

su poulin

eclipse

 

6.16 Click OK on workspace launcher defaults, both now and when seen through the remainder of this tutorial.

 

6.17 Install M2E extension for Eclipse Juno

Note: this will update the /home/poulin/.eclipse directory.

o Help -> Eclipse Marketplace -> Search tab -> Find: "Maven Integration for Eclipse" (press Enter)

o Under "Maven Integration for Eclipse", click Install.

o Check all boxes under "Confirm Selected Features" if they are not already checked, and click Next.

o If you accept the Eclipse Foundation Software User Agreement, check the acceptance box and click Finish.

o When prompted to restart Eclipse, click Yes.

 

   

 

6.18 Close  Eclipse  

6.19 Obtain a copy of bcounts-0.1.0-SNAPSHOT-project.tar.bz2 from Cloudera and save to /home/poulin/

6.20 Prepare  bcounts-­‐0.1.0-­‐SNAPSHOT  for  execution  from  Eclipse  and  CLI  

su poulin

# Reminder: any previously cached Maven downloads for the poulin

# account are deleted later in this step by: rm -fr /home/poulin/.m2

cd /home/poulin/

tar jxf bcounts-0.1.0-SNAPSHOT-project.tar.bz2

cd /home/poulin/bcounts-0.1.0-SNAPSHOT/

mkdir /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old

mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/

mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/

mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/

mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/

cp /home/poulin/Desktop/*-clientconfig.zip /home/poulin/bcounts-0.1.0-SNAPSHOT/

unzip hbase1-clientconfig.zip

unzip mapreduce1-clientconfig.zip

cp ./hbase-conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/

cp ./hadoop-conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/

cp ./hadoop-conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/

cp ./hadoop-conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/

rm -fr ./mapreduce1-clientconfig.zip

rm -fr ./hbase1-clientconfig.zip

rm -fr ./hbase-conf

rm -fr ./hadoop-conf

cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/

cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/

cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/

cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/

mkdir /home/poulin/bcounts-0.1.0-SNAPSHOT/lib.old

mv /home/poulin/bcounts-0.1.0-SNAPSHOT/lib/* /home/poulin/bcounts-0.1.0-SNAPSHOT/lib.old/

cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.0.1.jar /home/poulin/bcounts-0.1.0-SNAPSHOT/lib/

# First build with Maven2

wget https://builds.apache.org/job/mrunit-trunk/ws/target/mrunit-1.0.0-SNAPSHOT-hadoop1.jar

rm -fr /home/poulin/.m2

# Install mrunit-1.0.0 into ~/.m2

mvn2 -DskipTests install:install-file -DgroupId=org.apache.mrunit -DartifactId=mrunit -Dversion=1.0.0-20130107.225915-832 -Dclassifier=hadoop1 -Dpackaging=jar -Dfile=/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar

mvn2 install:install-file -DgroupId=org.apache.mrunit -DartifactId=mrunit -Dversion=1.0.0-SNAPSHOT -Dclassifier=hadoop1 -Dpackaging=jar -Dfile=/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar

# if the following hangs for more than 5 minutes without output, ctrl-c and then re-run it

# can ignore: [INFO] Unable to find resource *

mvn2 -DskipTests install

# redo the build with mvn3

rm -fr /home/poulin/.m2

mvn3 -DskipTests install:install-file -DgroupId=org.apache.mrunit -DartifactId=mrunit -Dversion=1.0.0-20130107.225915-832 -Dclassifier=hadoop1 -Dpackaging=jar -Dfile=/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar

mvn3 install:install-file -DgroupId=org.apache.mrunit -DartifactId=mrunit -Dversion=1.0.0-SNAPSHOT -Dclassifier=hadoop1 -Dpackaging=jar -Dfile=/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar

# if the following hangs for more than 5 minutes without output, ctrl-c and then re-run it

mvn3 -DskipTests install

# run without -DskipTests switch

mvn3 install

# make maven project loadable into eclipse

mvn3 -Declipse.workspace=/home/poulin/workspace eclipse:configure-workspace

mvn3 -DdownloadSources=true -DdownloadJavadocs=true eclipse:clean eclipse:eclipse

mvn3 dependency:build-classpath

# Optional: Building a job jar only

mvn3 package

# Optional: Packaging the Source to /home/poulin/bcounts-0.1.0-SNAPSHOT/target/

mvn3 assembly:single

# Optional: Generating JavaDoc only

mvn3 javadoc:javadoc
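The two install:install-file calls in this step register the same mrunit jar under two versions: the timestamped snapshot and the plain 1.0.0-SNAPSHOT. As a rough sketch only, the Python below builds both argument lists from the shared coordinates so the single difference is easy to see. The helper name install_file_args is hypothetical and is not part of bcounts or Maven; the flags are copied from the commands above (the -DskipTests switch on the first call is omitted).

```python
def install_file_args(version, jar_path, mvn="mvn3"):
    """Build the argument list for one `install:install-file` call.

    Hypothetical helper for illustration; the flags mirror step 6.20.
    """
    return [
        mvn,
        "install:install-file",
        "-DgroupId=org.apache.mrunit",
        "-DartifactId=mrunit",
        "-Dversion=" + version,
        "-Dclassifier=hadoop1",
        "-Dpackaging=jar",
        "-Dfile=" + jar_path,
    ]


JAR = "/home/poulin/bcounts-0.1.0-SNAPSHOT/mrunit-1.0.0-SNAPSHOT-hadoop1.jar"
VERSIONS = ["1.0.0-20130107.225915-832", "1.0.0-SNAPSHOT"]

# One command per version; everything except -Dversion is identical.
commands = [install_file_args(version, JAR) for version in VERSIONS]
for command in commands:
    print(" ".join(command))
```

Each printed line can be pasted into the terminal, or the list can be passed to subprocess.run if the build is scripted.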

                         

6.21 Check that the M2_REPO classpath variable is set in Eclipse

o Open  a  new  Terminal  and  start  up  Eclipse  

su poulin

eclipse

o In Eclipse: Window->Preferences->Java->Build Path->Classpath Variables

o Once M2_REPO is observed, Click Cancel to return to the parent Eclipse window

 

6.22 In Eclipse: File->Import...->(expand) General->(highlight) Existing Projects into Workspace->(click) Next, and specify the root directory /home/poulin/bcounts-0.1.0-SNAPSHOT (hit enter if pasted)

Ensure that bcounts is checked in the Projects box and Click Finish

   

6.23 In Eclipse: Window->Preferences->Java->Code Style->Formatter

6.24 Click Import… and navigate to /home/poulin/bcounts-0.1.0-SNAPSHOT/eclipse_formatter_apache.xml

Then Click Apply, then Click OK, and then close Eclipse.

   

6.25 Create  schema  for  Bayesian  Counters  examples  in  HBase  

su poulin

echo "create 'sp', {NAME => '5min', VERSIONS => 1, TTL => 604800, BLOCKCACHE => false}, {NAME => '30min', VERSIONS => 1, TTL => 604800, BLOCKCACHE => false}, {NAME => '1day', VERSIONS => 1, TTL => 604800, BLOCKCACHE => false}, {NAME => 'All', VERSIONS => 1, TTL => 1209600, BLOCKCACHE => false}" | hbase shell

echo "create 'iris', {NAME => '5min', VERSIONS => 1, TTL => 86400, BLOCKCACHE => false}, {NAME => '30min', VERSIONS => 1, TTL => 86400, BLOCKCACHE => false}, {NAME => '1day', VERSIONS => 1, TTL => 86400, BLOCKCACHE => false}, {NAME => 'All', VERSIONS => 1, TTL => 432000, BLOCKCACHE => false}" | hbase shell

echo "create 'ad', {NAME => '5min', VERSIONS => 1, TTL => 259200, BLOCKCACHE => false}, {NAME => '30min', VERSIONS => 1, TTL => 259200, BLOCKCACHE => false}, {NAME => '1day', VERSIONS => 1, TTL => 259200, BLOCKCACHE => false}, {NAME => 'All', VERSIONS => 1, TTL => 432000, BLOCKCACHE => false}" | hbase shell

echo "create 'car', {NAME => '5min', VERSIONS => 2, TTL => 300, BLOCKCACHE => false}, {NAME => '30min', VERSIONS => 2, TTL => 1800, BLOCKCACHE => false}, {NAME => '1day', VERSIONS => 2, TTL => 259200, BLOCKCACHE => false}, {NAME => 'All', VERSIONS => 2, TTL => 432000, BLOCKCACHE => false}" | hbase shell
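The four create statements share one shape: a table name followed by the column families 5min, 30min, 1day and All, each with a VERSIONS count and a TTL in seconds. The Python below is a minimal sketch of generating such a statement from a list of (family, TTL) pairs, which may be easier to review than the long one-liners. The function create_statement is a hypothetical helper, not part of bcounts or HBase.

```python
def create_statement(table, ttl_by_family, versions=1):
    """Return an HBase shell `create` statement for the given column families.

    ttl_by_family is a list of (family name, TTL in seconds) pairs.
    Hypothetical helper for illustration only.
    """
    families = ", ".join(
        "{NAME => '%s', VERSIONS => %d, TTL => %d, BLOCKCACHE => false}"
        % (name, versions, ttl)
        for name, ttl in ttl_by_family
    )
    return "create '%s', %s" % (table, families)


DAY = 24 * 60 * 60  # TTL values are expressed in seconds

# Reproduces the iris statement from step 6.25: one day for the
# short tiers and five days for the All tier.
iris = create_statement(
    "iris",
    [("5min", DAY), ("30min", DAY), ("1day", DAY), ("All", 5 * DAY)],
)
print(iris)
```

The printed statement can be piped into hbase shell in the same way as the echo commands in this step.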

 

6.26 Load  Iris  data  into  HBase  via  CLI  

su poulin

cd /home/poulin/bcounts-0.1.0-SNAPSHOT

bcount com.cloudera.bayesiancounters.util.Driver loader examples/data/iris.data iris

echo "scan 'iris'" | hbase shell

 

Load Iris data into HBase

The iris data loaded into HBase is rectangular and newline delimited, in the format:

<count-delta>,<count-delta>,<count-delta>,…<classifier><newline>

During the load, the counts in HBase are incremented.

The human-readable meaning and schema of iris.data can be found in the Iris section of bayesiancounters-site.xml, which was added to the CLASSPATH in prior steps.

For a production pipeline, repeat this load at a regular interval as new deltas arrive, or bind the UI directly to the HBase calls used by the loader code.

The loader logic can be studied in Eclipse by making the following change to the next section of this tutorial (6.27):

Change from: “Run->Run Configurations”

Change to: “Run->Debug Configurations”

Then check “Stop in main” and step through the code.
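To make the idea of incrementing counts concrete, here is a minimal Python sketch under stated assumptions. It treats each record as comma-separated fields with the class label last and keeps the counters in memory instead of HBase. The key layout (column, value, label) and the sample records are invented for illustration; the actual loader in bcounts may organise its counters differently.

```python
from collections import Counter


def apply_delta(counts, line):
    """Increment counters for one comma-separated record whose last field is the class."""
    fields = [field.strip() for field in line.strip().split(",")]
    values, label = fields[:-1], fields[-1]
    for column, value in enumerate(values):
        # Joint count: this column value was seen together with this class.
        counts[(column, value, label)] += 1
    # Per-class total, needed later to turn counts into probabilities.
    counts[("class", label)] += 1
    return counts


# In-memory stand-in for the HBase table; the records are illustrative.
counts = Counter()
for record in ["5.1,3.5,1.4,0.2,setosa", "4.9,3.0,1.4,0.2,setosa"]:
    apply_delta(counts, record)

print(counts[(2, "1.4", "setosa")])
print(counts[("class", "setosa")])
```

Because every record only adds to existing counters, loading a new delta file never requires re-reading earlier data, which is what makes the repeated loads described above cheap.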

6.27 Load Iris data into HBase via Eclipse

o Open a new Terminal and start up Eclipse

su poulin

eclipse

o Run->Run Configurations...

o Java Application->(right click)->New

o Click on the New_configuration to edit its settings on the right

o Configure the runtime options as:

§ Name: IrisLoad

§ Main->Project: bcounts

§ Main->Main class: com.cloudera.bayesiancounters.util.Driver

§ Arguments->Program arguments: loader /home/poulin/bcounts-0.1.0-SNAPSHOT/examples/data/iris.data iris

§ Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

o Click on Apply, then Click on Run and then view the Console tab of the parent window

o Once the data is loaded, Close Eclipse

6.28 View  Iris  data  and  schema  in  HBase    

# view some of the data

echo "scan 'iris'" | hbase shell | tail

# view the schema via the Stargate interface

firefox http://h12.demo.dev:8080/iris/schema
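The same schema URL can also be queried from a script instead of a browser. The sketch below uses only the Python standard library and builds the request without sending it, so it runs without a cluster. The host and port are taken from the firefox command above; asking for JSON via the Accept header is an assumption about how the Stargate gateway is configured, so adjust it if the gateway only returns XML.

```python
import urllib.request


def schema_request(host, table, port=8080):
    """Build (but do not send) a GET request for a table schema on the REST gateway."""
    url = "http://%s:%d/%s/schema" % (host, port, table)
    # Requesting JSON is an assumption; the gateway may default to another format.
    return urllib.request.Request(url, headers={"Accept": "application/json"})


request = schema_request("h12.demo.dev", "iris")
print(request.full_url)

# To actually fetch the schema (requires network access to the gateway):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```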

6.29 Perform NB inference on the Iris dataset

Note: NB inference must be executed within 300 seconds of loading the iris data into HBase; otherwise, change the 300 in the following steps to a larger number of seconds while testing

o Open a new Terminal and start up Eclipse

su poulin

eclipse

o Run->Run Configurations...

o Java Application->(right click)->New

o Click on the New_configuration to edit its settings on the right

o Configure the runtime options as:

§ Name: IrisInference

§ Main->Project: bcounts

§ Main->Main class: com.cloudera.bayesiancounters.util.Driver

§ Arguments->Program arguments: nb iris 300 "sepal_length=5;petal_length=1.4" class=2

§ Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

o Click on Apply, then Click on Run and then view the Console tab of the parent window

o Wait for results in the Console tab and execution to complete.

 

 

 

NB inference on the Iris dataset

Opens a connection to the iris table

Loads iris classifications from bayesiancounters-site.xml into local memory

Moves columns in HBase between tiers, e.g. T5MIN, T30MIN, etc., while computing scores and tracking parent and child counts

The logic is derived from naive Bayes classifier theory

The resulting scores, counts and probabilities are displayed to standard output

The probabilities output by scoring can be used directly by more complex decision-making algorithms based on benefit/loss analysis.
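For readers who want to see how probabilities fall out of stored counts, the following Python sketch applies the textbook naive Bayes rule with Laplace smoothing. It is not the bcounts implementation: it ignores the time tiers and HBase entirely, and all counts and value-set sizes below are made up for illustration. Only the feature names and observed values are borrowed from the program arguments in step 6.29.

```python
def naive_bayes_posterior(class_counts, joint_counts, evidence, n_values, alpha=1.0):
    """Return P(class | evidence) under the naive Bayes independence assumption.

    class_counts: {label: count}
    joint_counts: {(feature, value, label): count}
    evidence:     {feature: observed value}
    n_values:     {feature: number of distinct values}, used for smoothing
    """
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for feature, value in evidence.items():
            n_joint = joint_counts.get((feature, value, label), 0)
            # Smoothed likelihood P(feature=value | class).
            score *= (n_joint + alpha) / (n_label + alpha * n_values[feature])
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Illustrative counts only; real values would come from the iris table.
class_counts = {"1": 50, "2": 50}
joint_counts = {
    ("sepal_length", "5", "1"): 10,
    ("sepal_length", "5", "2"): 2,
    ("petal_length", "1.4", "1"): 13,
    ("petal_length", "1.4", "2"): 0,
}
posterior = naive_bayes_posterior(
    class_counts,
    joint_counts,
    {"sepal_length": "5", "petal_length": "1.4"},
    {"sepal_length": 35, "petal_length": 43},
)
print(posterior)
```

The smoothing term alpha keeps a zero count, such as petal_length=1.4 for class 2 here, from forcing the whole product to zero.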

6.30 Perform  clique  scoring  with  random  projections  

o Run->Run Configurations...

o Java Application->(right click)->New

o Click on the New_configuration to edit its settings on the right

o Configure the runtime options as:

§ Name: CliqueRandom

§ Main->Project: bcounts

§ Main->Main class: com.cloudera.bayesiancounters.util.Driver

§ Arguments->Program arguments: cr iris 300 2 3

§ Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

o Click on Apply, then Click on Run and then view the Console tab of the parent window

o Wait for results in the Console tab and execution to complete.

Clique scoring can be used to perform variable importance analysis or for emerging trend identification.

6.31 Create a small delta of the ad.data file

su poulin

head -n 1 /home/poulin/bcounts-0.1.0-SNAPSHOT/examples/data/ad.data > /tmp/ad.small

tail -n 1 /home/poulin/bcounts-0.1.0-SNAPSHOT/examples/data/ad.data >> /tmp/ad.small
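The head and tail commands above build a two-line sample from the first and last records of the file. The same step can be sketched in Python as follows. The function name first_and_last_line and the throwaway sample file are illustrative only, and the sketch reads the whole file into memory, which is acceptable for a test file but wasteful for very large inputs.

```python
import os
import tempfile


def first_and_last_line(src_path, dst_path):
    """Write the first and last lines of src_path to dst_path."""
    with open(src_path) as src:
        lines = src.readlines()
    # Like the shell commands, a one-line file yields that line twice.
    with open(dst_path, "w") as dst:
        dst.writelines([lines[0], lines[-1]])


# Self-contained demonstration on a throwaway file.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "sample.data")
small = os.path.join(workdir, "sample.small")
with open(sample, "w") as handle:
    handle.write("a,1\nb,2\nc,3\n")

first_and_last_line(sample, small)
with open(small) as handle:
    result = handle.read()
print(result)
```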

 

6.32 Load Ad data into HBase via Eclipse

o Run->Run Configurations...

o Java Application->(right click)->New

o Click on the New_configuration to edit its settings on the right

o Configure the runtime options as:

§ Name: AdLoad

§ Main->Project: bcounts

§ Main->Main class: com.cloudera.bayesiancounters.util.Driver

§ Arguments->Program arguments: loader /tmp/ad.small ad

§ Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties -Xmx1024M

o Click on Apply, then Click on Run and then view the Console tab of the parent window

   

6.33 Perform NB inference on the Ad dataset

o Run->Run Configurations...

o Java Application->(right click)->New

o Click on the New_configuration to edit its settings on the right

o Configure the runtime options as:

§ Name: AdInference

§ Main->Project: bcounts

§ Main->Main class: com.cloudera.bayesiancounters.util.Driver

§ Arguments->Program arguments: nb ad 604800 "sepal_length=5;petal_length=1.4" class=2

§ Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

o Click on Apply, then Click on Run and then view the Console tab of the parent window

o Wait for results in the Console tab and execution to complete.

o Close Eclipse

Note: These results are from bcounts on 2 lines of the input data only. A small or medium-sized cluster is recommended for processing the entire ad.data file. See the Cloudera Manager documentation for cluster size specifications.

 

6.34 Create  bag  of  words  file  from  configuration  file  

su poulin

cd /home/poulin/bcounts-0.1.0-SNAPSHOT/

python273 ./bin/sp_bag_of_words.py ./conf/bayesiancounters-site.xml /tmp/bag-of-words

tail /tmp/bag-of-words

 

worker

working

wreckage

xvi

yates

young

SP_increase

 

6.35 Edit /home/poulin/bcounts-0.1.0-SNAPSHOT/bin/sp_schema.py

Change from: if len(sys.argv)<3 or sys.argv[1] is None:

Change to: if len(sys.argv)<2 or sys.argv[1] is None:
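The reason for the edit in step 6.35 is that sys.argv always includes the script name as its first entry, so a script called with one argument sees a list of length 2. The short Python sketch below reproduces both conditions on plain lists. The function name has_required_args is hypothetical, and what sp_schema.py does when the check fails is not shown here.

```python
def has_required_args(argv):
    """Return True when argv holds the script name plus one bag-of-words path."""
    return not (len(argv) < 2 or argv[1] is None)


# `python sp_schema.py /tmp/bag-of-words` yields two entries in sys.argv.
argv_one_argument = ["sp_schema.py", "/tmp/bag-of-words"]
argv_no_argument = ["sp_schema.py"]

print(has_required_args(argv_one_argument))
print(has_required_args(argv_no_argument))

# The original condition, len(sys.argv) < 3, is also true for the
# one-argument call, so it would treat a valid invocation as an error.
original_check_fails = len(argv_one_argument) < 3
print(original_check_fails)
```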

 

6.36 Create an XML Configuration file derived from a bag-of-words file

su poulin

cd /home/poulin/bcounts-0.1.0-SNAPSHOT/

python273 ./bin/sp_schema.py /tmp/bag-of-words > /tmp/bayesiancounters-example.xml

tail /tmp/bayesiancounters-example.xml

<value>SP_increase</value>

</property>

<property>

<name>bayesiancounters.dataset.sp.col.valueset.647</name>

<value>-100, -40, 10, 40, 100</value>

</property>
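The generated file uses name/value property elements, as the tail output shows. A minimal Python sketch for reading such properties back with the standard library follows. The enclosing configuration root element is an assumption based on the usual Hadoop configuration layout, since only the end of the file is visible above, and read_properties is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


# Sample built from the tail output above; the <configuration> root is assumed.
SAMPLE = """<configuration>
  <property>
    <name>bayesiancounters.dataset.sp.col.valueset.647</name>
    <value>-100, -40, 10, 40, 100</value>
  </property>
</configuration>"""

props = read_properties(SAMPLE)
key = "bayesiancounters.dataset.sp.col.valueset.647"
# Split the comma-separated value set into integers.
valueset = [int(item) for item in props[key].split(",")]
print(valueset)
```

To check the real output, read /tmp/bayesiancounters-example.xml into a string and pass it to read_properties.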

6.37 Convert training files into header-less files for storing in HDFS

su poulin

cd /home/poulin/bcounts-0.1.0-SNAPSHOT/

python273 ./bin/sp_training.py /tmp/bag-of-words \

./examples/data/training_19_2004-18_2005.dat > /tmp/sp-training-file

tail -c 32 /tmp/sp-training-file

 

0,0,0,0,0,0,0,0,0,0,0,0,0,0,7.2

 

6.38 Generate  a  'scored_'  file  in  current  directory  

su poulin

cd /home/poulin/bcounts-0.1.0-SNAPSHOT/

python273 ./bin/sp_testing.py /tmp/bag-of-words ./examples/data/testing_19_2005-19_2005.dat

tail -c 32 ./scored_testing_19_2005-19_2005.dat

 

0,1,0,0,0,0,0,0,2,2,2,1,1,3,6,0

 

6.39 Create a small delta of sp-training-file

su poulin

head -n 1 /tmp/sp-training-file > /tmp/sp-training.small

tail -n 1 /tmp/sp-training-file >> /tmp/sp-training.small

 

6.40 Load small sample of SP into HBase via Eclipse

o Open a new Terminal and start up Eclipse

su poulin

eclipse

o Run->Run Configurations...

o Java Application->(right click)->New

o Click on the New_configuration to edit its settings on the right

o Configure the runtime options as:

§ Name: SpLoad

§ Main->Project: bcounts

§ Main->Main class: com.cloudera.bayesiancounters.util.Driver

§ Arguments->Program arguments: loader /tmp/sp-training.small sp

§ Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

o Click on Apply, then Click on Run and then view the Console tab of the parent window

6.41 Perform NB inference on the SP dataset

o Run->Run Configurations...

o Java Application->(right click)->New

o Click on the New_configuration to edit its settings on the right

o Configure the runtime options as:

§ Name: SpInference

§ Main->Project: bcounts

§ Main->Main class: com.cloudera.bayesiancounters.util.Driver

§ Arguments->Program arguments: nb sp 604800 "sepal_length=5;petal_length=1.4" class=2

§ Arguments->VM arguments: -Dlog4j.configuration=file:debug-log4j.properties

o Click on Apply, then Click on Run and then view the Console tab of the parent window

o Wait for results in the Console tab and execution to complete.

o Close Eclipse

Note: These results are from bcounts on 2 lines of the input data only. A small or medium-sized cluster is recommended for processing the entire input file. See the Cloudera Manager documentation for cluster size specifications.

 

In this example, B-counts can be used to predict the effect of news on stock market movements (US Patent No. 7,516,050).

7 Bayesian  Counters  Javadoc    

su poulin

firefox file:///home/poulin/bcounts-0.1.0-SNAPSHOT/target/site/apidocs/index.html

 

   

8 Additional  Resources  –  Technologies  Used  In  this  Tutorial  

1. Apache Hadoop Training and Certification - http://university.cloudera.com/

2. Apache Maven Project - http://maven.apache.org/

3. Bash Reference Manual - http://www.gnu.org/software/bash/manual/bashref.html

4. Bayesian Counters on HBase - http://www.slideshare.net/Hadoop_Summit/bayesian-counters

5. CentOS Wiki - http://wiki.centos.org/

6. CentOS 6.X x86_64 (Same image for VM, cloud/VPS or bare metal) - http://isoredirect.centos.org/centos/6/isos/x86_64/

7. Cloudera Documentation - https://ccp.cloudera.com/display/DOC/Documentation

8. Eclipse Documentation – Current Release Eclipse Juno - http://help.eclipse.org/juno/index.jsp

9. EPEL Documentation - http://fedoraproject.org/wiki/EPEL

10. Gnome [Desktop] Library, Users, Administrators and Developers - http://library.gnome.org/

11. GNU Source-highlight 3.1.7 - http://www.gnu.org/software/src-highlite/source-highlight.html

12. Python v2.7.3 documentation - http://docs.python.org/release/2.7.3/

13. Ruby in Twenty Minutes - http://www.ruby-lang.org/en/documentation/quickstart/