Buzz Words Dunning Real-Time Learning

Oct 21, 2014


My talk at Buzzwords 2013 about real-time learning, with special reference to Lambda architectures and associative functions.
Transcript
Page 1: Buzz Words Dunning Real-Time Learning

1 ©MapR Technologies - Confidential

Real-time Learning for Fun and Profit

Page 2: Buzz Words Dunning Real-Time Learning

§ Contact:
  – [email protected]
  – @ted_dunning
§ Slides and such (available late tonight):
  – http://slideshare.net/tdunning
§ Hash tags: #mapr #storm #bbuzz

Page 3: Buzz Words Dunning Real-Time Learning

The Challenge

§ Hadoop is great for processing vats of data
  – But sucks for real-time (by design!)
§ Storm is great for real-time processing
  – But lacks any way to deal with batch processing
§ It sounds like there isn’t a solution
  – Neither fashionable solution handles everything

Page 4: Buzz Words Dunning Real-Time Learning

This is not a problem.

It’s an opportunity!

Page 5: Buzz Words Dunning Real-Time Learning

Hadoop is Not Very Real-time

[Figure: timeline along t ending at "now". Labels: Fully processed, Latest full period, Unprocessed Data. Annotation: "Hadoop job takes this long for this data".]

Page 6: Buzz Words Dunning Real-Time Learning

Real-time and Long-time together

[Figure: timeline along t ending at "now". Labels: "Hadoop works great back here", "Storm works here", "Blended view".]

Page 7: Buzz Words Dunning Real-Time Learning

One Alternative

[Diagram labels: Real-time, Long-time, Search Engine, NoSQL du Jour, Consumer, "?".]

Page 8: Buzz Words Dunning Real-Time Learning

Problems

§ Simply dumping into a NoSQL engine doesn’t quite work
§ Insert rate is limited
§ No load isolation
  – Big retrospective jobs kill real-time
§ Low scan performance
  – HBase pretty good, but not stellar
§ Difficult to set boundaries
  – where does real-time end and long-time begin?

Page 9: Buzz Words Dunning Real-Time Learning

Almost a Solution

§ Lambda architecture talks about function of long-time state
  – Real-time approximate accelerator adjusts previous result to current state
§ Sounds good, but …
  – How does the real-time accelerator combine with long-time?
  – What algorithms can do this?
  – How can we avoid gaps and overlaps and other errors?
§ Needs more work

Page 10: Buzz Words Dunning Real-Time Learning

A Simple Example

§ Let’s start with the simplest case … counting
§ Counting = addition
  – Addition is associative
  – Addition is on-line
  – We can generalize these results to all associative, on-line functions
  – But let’s start simple
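
A small illustration of why associativity matters for counting, not taken from the slides: the event list and the batch boundaries below are made up. Partial counts over arbitrary slices of a stream can be merged in any grouping and still match a single pass over all the data.

```python
from collections import Counter

def count_events(events):
    """Count labels in one batch of events (one partial aggregate)."""
    return Counter(events)

def merge(a, b):
    """Combine two partial counts; Counter addition is associative."""
    return a + b

# Hypothetical event stream, split at arbitrary points.
events = ["click", "view", "click", "buy", "view", "click"]
batches = [events[:2], events[2:5], events[5:]]
partials = [count_events(b) for b in batches]

# Two different groupings of the same merges, plus a single full pass.
left_first = merge(merge(partials[0], partials[1]), partials[2])
right_first = merge(partials[0], merge(partials[1], partials[2]))
all_at_once = count_events(events)

print(left_first == right_first == all_at_once)
```

Because the grouping does not matter, some partial counts can come from a batch job and others from a streaming job.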

Page 11: Buzz Words Dunning Real-Time Learning

Rough Design – Data Flow

[Diagram. Components: Data Sources, Catcher Cluster, ProtoSpout, Query Event Spout, Counter Bolt, Logger Bolt, Raw Logs, Semi Agg, Hadoop Aggregator, Snap, Long agg.]

Page 12: Buzz Words Dunning Real-Time Learning

Closer Look – Catcher Protocol

[Diagram: Data Sources and Catcher Cluster.]

The data sources and catchers communicate with a very simple protocol.

Hello() => list of catchers
Log(topic, message) => (OK|FAIL, redirect-to-catcher)

Page 13: Buzz Words Dunning Real-Time Learning

Closer Look – Catcher Queues

The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop.

Each topic file is appended by exactly one catcher.

Topic files are kept in shared file storage.

[Diagram: Catcher Cluster and Topic Files.]
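
A rough in-memory sketch of the Hello/Log protocol and the forwarding rule, under several assumptions that are not on the slides: catchers are plain objects in one process, topic files are lists, topic ownership is chosen by hashing the topic name, and the FAIL reply is left out. The class and attribute names are invented for illustration.

```python
import zlib

OK = "OK"

class Catcher:
    """In-memory stand-in for one catcher; topic 'files' are plain lists."""

    def __init__(self, name, cluster, storage):
        self.name = name
        self.cluster = cluster    # shared dict: catcher name -> Catcher
        self.storage = storage    # shared dict: topic -> list of messages

    def owner_of(self, topic):
        # Assumption: the owning catcher is picked by hashing the topic name.
        names = sorted(self.cluster)
        return names[zlib.crc32(topic.encode()) % len(names)]

    def hello(self):
        return sorted(self.cluster)

    def log(self, topic, message):
        owner = self.owner_of(topic)
        if owner != self.name:
            # Forward to the owner and tell the client where to go next time.
            status, _ = self.cluster[owner].log(topic, message)
            return status, owner
        self.storage.setdefault(topic, []).append(message)
        return OK, self.name

class Client:
    """Data source that remembers the redirect hint for each topic."""

    def __init__(self, cluster, first):
        self.cluster = cluster
        self.names = cluster[first].hello()
        self.route = {}    # topic -> catcher name learned from replies
        self.hops = 0      # how many requests had to be forwarded

    def log(self, topic, message):
        target = self.route.get(topic, self.names[0])
        status, redirect = self.cluster[target].log(topic, message)
        if redirect != target:
            self.hops += 1
        self.route[topic] = redirect
        return status

cluster, storage = {}, {}
for name in ("catcher-a", "catcher-b", "catcher-c"):
    cluster[name] = Catcher(name, cluster, storage)

client = Client(cluster, first="catcher-a")
for n in range(3):
    client.log("clicks", f"event {n}")
```

At most the first request for a topic takes the extra hop; later requests go straight to the catcher that owns the topic file.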

Page 14: Buzz Words Dunning Real-Time Learning

Closer Look – ProtoSpout

The ProtoSpout tails the topic files, parses log records into tuples, and injects them into the Storm topology.

The last fully acked position is stored in a shared, transactionally correct file system.

[Diagram: Topic Files and ProtoSpout.]
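
One way the tail-and-checkpoint idea could look, as a sketch only: it uses a local temporary directory in place of the shared file system, assumes tab-separated records ending in a newline, and uses write-then-rename as a stand-in for a transactionally correct update. None of these details come from the slides, and no Storm API is used.

```python
import os
import tempfile

def read_new_records(topic_path, offset):
    """Return (records, new_offset) for complete lines found past offset."""
    with open(topic_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1    # skip a partially written last line
    lines = data[:end].splitlines()
    records = [tuple(line.decode().split("\t")) for line in lines]
    return records, offset + end

def save_offset(checkpoint_path, offset):
    """Store the acked position atomically: write a temp file, then rename."""
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(offset))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, checkpoint_path)

def load_offset(checkpoint_path):
    try:
        with open(checkpoint_path) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0

workdir = tempfile.mkdtemp()
topic = os.path.join(workdir, "clicks.log")
ckpt = os.path.join(workdir, "clicks.offset")

with open(topic, "ab") as f:
    f.write(b"u1\tclick\nu2\tview\nu3\tcl")    # last record is incomplete

first, pos = read_new_records(topic, load_offset(ckpt))
save_offset(ckpt, pos)    # only once the topology has acked these tuples

with open(topic, "ab") as f:
    f.write(b"ick\n")     # the catcher finishes the record

second, pos2 = read_new_records(topic, load_offset(ckpt))
```

After a restart, reading resumes from the stored position, so acked records are not injected twice and the half-written record is picked up once it is complete.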

Page 15: Buzz Words Dunning Real-Time Learning

Closer Look – Counter Bolt

§ Critical design goals:
  – fast ack for all tuples
  – fast restart of counter
§ Ack happens when tuple hits the replay log (tens of milliseconds, group commit)
§ Restart involves replaying semi-aggs + replay log (very fast)
§ Replay log only lasts until next semi-aggregate goes out

[Diagram labels: Incoming records, Counter Bolt, Replay Log, Semi-aggregated records, Real-time, Long-time.]
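
A toy model of the replay-log idea, assuming two Python lists stand in for durable storage and ignoring group commit and the atomicity of emitting a semi-aggregate. The class and method names are invented for illustration; this is not the Storm bolt interface.

```python
from collections import Counter

class CounterBolt:
    def __init__(self, replay_log, semi_aggs):
        # Both lists stand in for storage that survives a crash.
        self.replay_log = replay_log    # tuples seen since the last emit
        self.semi_aggs = semi_aggs      # semi-aggregated records already out
        # Restart path: rebuild the open window from the replay log alone.
        self.window = Counter(replay_log)

    def process(self, label):
        self.replay_log.append(label)   # durable write first ...
        self.window[label] += 1
        return "ack"                    # ... then the tuple can be acked

    def emit(self):
        """Publish the window as a semi-aggregate, then drop the replay log."""
        if self.window:
            self.semi_aggs.append(dict(self.window))
        self.replay_log.clear()
        self.window = Counter()

    def totals(self):
        """Semi-aggregates plus whatever is still in the open window."""
        total = Counter(self.window)
        for agg in self.semi_aggs:
            total.update(agg)
        return total

replay_log, semi_aggs = [], []
bolt = CounterBolt(replay_log, semi_aggs)
for label in ["a", "b", "a"]:
    bolt.process(label)
bolt.emit()
bolt.process("a")

# Simulated crash: in-memory state is lost, the two durable lists remain.
restarted = CounterBolt(replay_log, semi_aggs)
print(restarted.totals())
```

The replay log stays short because it is cleared at every emit, which is what keeps the restart fast.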

Page 16: Buzz Words Dunning Real-Time Learning

A Frozen Moment in Time

§ Snapshot defines the dividing line
§ All data in the snap is long-time, all after is real-time
§ Semi-agg strategy allows clean combination of both kinds of data
§ Data-synchronized snap not needed

[Diagram labels: Semi Agg, Hadoop Aggregator, Snap, Long agg.]

Page 17: Buzz Words Dunning Real-Time Learning

Guarantees

§ Counter output volume is small-ish
  – the greater of k tuples per 100K inputs or k tuples/s
  – 1 tuple/s/label/bolt for this exercise
§ Persistence layer must provide guarantees
  – distributed against node failure
  – must have either readable flush or closed-append
§ HDFS is distributed, but provides no guarantees and strange semantics
§ MapRfs is distributed, provides all necessary guarantees

Page 18: Buzz Words Dunning Real-Time Learning

Presentation Layer

§ Presentation must
  – read recent output of Logger bolt
  – read relevant output of Hadoop jobs
  – combine semi-aggregated records
§ User will see
  – counts that increment within 0-2 s of events
  – seamless and accurate meld of short and long-term data
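
A sketch of how a presentation layer might meld the two sources, assuming each semi-aggregated record carries a timestamp and the long-term aggregate covers everything up to and including the snapshot time. The function names and the sample data are hypothetical.

```python
from collections import Counter

def blended_counts(long_agg, snapshot_time, semi_aggs):
    """long_agg covers everything up to and including snapshot_time;
    semi_aggs is a list of (timestamp, counts) from the real-time path."""
    total = Counter(long_agg)
    for ts, counts in semi_aggs:
        if ts > snapshot_time:    # strictly after the snap: no overlap, no gap
            total.update(counts)
    return total

# Hypothetical semi-aggregated records with timestamps 1..5.
semi_aggs = [(t, Counter({"click": t, "view": 1})) for t in range(1, 6)]

full = Counter()
for _, c in semi_aggs:
    full.update(c)

def long_agg_up_to(snapshot_time):
    """Stand-in for the batch job: aggregate everything inside the snapshot."""
    agg = Counter()
    for ts, c in semi_aggs:
        if ts <= snapshot_time:
            agg.update(c)
    return agg

# The blended view is the same wherever the snapshot happens to fall.
results = [blended_counts(long_agg_up_to(s), s, semi_aggs) for s in range(0, 6)]
print(all(r == full for r in results))
```

The snapshot can be taken at any convenient moment; only the dividing line has to be applied consistently on both sides.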

Page 19: Buzz Words Dunning Real-Time Learning

The Basic Idea

§ Online algorithms generally have relatively small state (like counting)
§ Online algorithms generally have a simple update (like counting)
§ If we can do this with counting, we can do it with all kinds of algorithms

Page 20: Buzz Words Dunning Real-Time Learning

Summary – Part 1

§ Semi-agg strategy + snapshots allow correct real-time counts
  – because addition is on-line and associative
§ Other on-line associative operations include:
  – k-means clustering (see Dan Filimon’s talk at 16.)
  – count distinct (see hyper-log-log counters from streamlib or kmv from Brickhouse)
  – top-k values
  – top-k (count(*)) (see streamlib)
  – contextual Bayesian bandits (see part 2 of this talk)
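
To make one item on this list concrete, here is a minimal k-minimum-values (KMV) sketch for count distinct. It is written from scratch for illustration and does not reproduce the streamlib or Brickhouse implementations; the sketch size and the hash function are arbitrary choices.

```python
import hashlib

K = 256  # sketch size; an arbitrary choice for this example

def h(item):
    """Hash an item to a float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def sketch(items, k=K):
    """Keep only the k smallest distinct hash values."""
    return sorted({h(x) for x in items})[:k]

def merge(a, b, k=K):
    """Union two sketches and keep the k smallest values again."""
    return sorted(set(a) | set(b))[:k]

def estimate(s, k=K):
    if len(s) < k:
        return len(s)          # fewer than k distinct hashes: exact count
    return (k - 1) / s[k - 1]  # standard KMV estimator

# Three overlapping batches that together cover 12000 distinct items.
a = sketch(range(0, 6000))
b = sketch(range(4000, 10000))
c = sketch(range(9000, 12000))

merged_left = merge(merge(a, b), c)
merged_right = merge(a, merge(b, c))
one_pass = sketch(range(12000))

print(estimate(merged_left))
```

Merging is associative and gives the same sketch as a single pass over all the data, so semi-aggregated sketches can be combined the same way as counts.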

Page 21: Buzz Words Dunning Real-Time Learning

Example 2 – AB testing in real-time

§ I have 15 versions of my landing page
§ Each visitor is assigned to a version
  – Which version?
§ A conversion or sale or whatever can happen
  – How long to wait?
§ Some versions of the landing page are horrible
  – Don’t want to give them traffic

Page 22: Buzz Words Dunning Real-Time Learning

A Quick Diversion

§ You see a coin
  – What is the probability of heads?
  – Could it be larger or smaller than that?
§ I flip the coin and while it is in the air ask again
§ I catch the coin and ask again
§ I look at the coin (and you don’t) and ask again
§ Why does the answer change?
  – And did it ever have a single value?

Page 23: Buzz Words Dunning Real-Time Learning

A Philosophical Conclusion

§ Probability as expressed by humans is subjective and depends on information and experience

Page 24: Buzz Words Dunning Real-Time Learning

I Dunno

[Plot: Prob(p) versus p, for p from 0 to 1.]

Page 25: Buzz Words Dunning Real-Time Learning

5 heads out of 10 throws

[Plot: Prob(p) versus p, for p from 0 to 1.]

Page 26: Buzz Words Dunning Real-Time Learning

2 heads out of 12 throws

[Plot: Prob(p) versus p, for p from 0 to 1, with the mean marked.]

Using any single number as a “best” estimate denies the uncertain nature of a distribution

Adding confidence bounds still loses most of the information in the distribution and prevents good modeling of the tails
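
The three plots can be approximated numerically. The sketch below assumes a uniform prior on p, so that h heads in n throws gives a Beta(h + 1, n - h + 1) distribution; that prior is an assumption here, chosen because it matches the +1 terms in the R code later in the deck. The sample count and seed are arbitrary.

```python
import random

def posterior(heads, throws):
    """Beta parameters after the throws, starting from a uniform prior."""
    return heads + 1, throws - heads + 1

def summarize(heads, throws, draws=20000, seed=1):
    """Return (mean, 5% quantile, 95% quantile) of the posterior for p."""
    a, b = posterior(heads, throws)
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    mean = a / (a + b)
    lo, hi = samples[int(0.05 * draws)], samples[int(0.95 * draws)]
    return mean, lo, hi

no_data = summarize(0, 0)       # "I dunno"
five_of_ten = summarize(5, 10)
two_of_twelve = summarize(2, 12)

for name, (mean, lo, hi) in [("0 of 0", no_data),
                             ("5 of 10", five_of_ten),
                             ("2 of 12", two_of_twelve)]:
    print(f"{name}: mean={mean:.3f}, 90% of mass in [{lo:.2f}, {hi:.2f}]")
```

Even the mean plus an interval is a summary; the sampled values themselves carry the full shape, which is what the bandit on the next slide uses.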

Page 27: Buzz Words Dunning Real-Time Learning

Bayesian Bandit

§ Compute distributions based on data
§ Sample p1 and p2 from these distributions
§ Put a coin in bandit 1 if p1 > p2
§ Else, put the coin in bandit 2
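
The four steps above can be sketched as a short simulation. It assumes win/lose payoffs with a Beta distribution per bandit; the payoff rates, step count, and seed are made up for the example.

```python
import random

def choose(wins, losses, rng):
    """Sample a plausible payoff rate per bandit and pick the largest."""
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return samples.index(max(samples))

def run(true_rates, steps, seed=0):
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(steps):
        i = choose(wins, losses, rng)
        if rng.random() < true_rates[i]:   # simulated payoff
            wins[i] += 1
        else:
            losses[i] += 1
    return wins, losses

# Hypothetical bandits: the second pays off three times as often.
wins, losses = run([0.1, 0.3], steps=2000)
pulls = [w + l for w, l in zip(wins, losses)]
print(pulls)
```

Early on both bandits get tried because both distributions are wide; as the counts grow, samples from the worse bandit rarely win and it stops getting coins.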

Page 28: Buzz Words Dunning Real-Time Learning

And it works!

[Plot: regret versus n, for n from 0 to 1100 and regret from 0 to 0.12. Series: ε-greedy, ε = 0.05; Bayesian Bandit with Gamma-Normal.]

Page 29: Buzz Words Dunning Real-Time Learning

Video  Demo  

Page 30: Buzz Words Dunning Real-Time Learning

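The Code

§ Select an alternative

    # k[i,1] and k[i,2] hold the two outcome counts for alternative i.
    # The function wrapper is reconstructed from the call select(k) below.
    select = function(k) {
      n = dim(k)[1]
      p0 = rep(0, length.out=n)
      for (i in 1:n) {
        # draw one sample from the Beta distribution for alternative i
        p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
      }
      return (which(p0 == max(p0)))
    }

§ Select and learn

    # The name "learn" is assumed. test(i) is not shown on the slide; it is
    # taken to return the outcome column (1 or 2) observed for alternative i.
    learn = function(k, steps) {
      for (z in 1:steps) {
        i = select(k)
        j = test(i)
        k[i,j] = k[i,j]+1
      }
      return (k)
    }

§ But we already know how to count!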

Page 31: Buzz Words Dunning Real-Time Learning

The Basic Idea

§ We can encode a distribution by sampling
§ Sampling allows unification of exploration and exploitation
§ Can be extended to more general response models
§ Note that learning here = counting = on-line algorithm

Page 32: Buzz Words Dunning Real-Time Learning

Generalized Banditry

§ Suppose we have an infinite number of bandits
  – suppose they are each labeled by two real numbers x and y in [0,1]
  – also that expected payoff is a parameterized function of x and y

      E[z] = f(x, y | θ)

  – now assume a distribution for θ that we can learn online
§ Selection works by sampling θ, then computing f
§ Learning works by propagating updates back to θ
  – If f is linear, this is very easy
§ Don’t just have to have two labels, could have labels and context
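
For the linear case, one standard set of assumptions (not stated on the slide) makes the update explicit: a hypothetical feature vector φ(x, y), a Gaussian distribution for θ with precision matrix Λ, and Gaussian payoff noise with known variance σ². After observing a payoff z for the chosen bandit:

```latex
\begin{aligned}
\theta &\sim \mathcal{N}\!\left(\mu, \Lambda^{-1}\right), \qquad
\mathrm{E}[z] = f(x, y \mid \theta) = \theta^{\top} \phi(x, y) \\
\Lambda &\leftarrow \Lambda + \frac{1}{\sigma^{2}}\, \phi \phi^{\top}, \qquad
b \leftarrow b + \frac{1}{\sigma^{2}}\, \phi z, \qquad
\mu = \Lambda^{-1} b
\end{aligned}
```

Under these assumptions the state (Λ, b) is updated by addition alone, so it is on-line and associative in the same sense as counting.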

Page 33: Buzz Words Dunning Real-Time Learning

Caveats

§ Original Bayesian Bandit only requires real-time
§ Generalized Bandit may require access to long history for learning
  – Pseudo online learning may be easier than true online
§ Bandit variables can include content, time of day, day of week
§ Context variables can include user id, user features
§ Bandit × context variables provide the real power

Page 34: Buzz Words Dunning Real-Time Learning

§ Contact:
  – [email protected]
  – @ted_dunning
§ Slides and such (available late tonight):
  – http://slideshare.net/tdunning
§ Hash tags: #mapr #storm #bbuzz

Page 35: Buzz Words Dunning Real-Time Learning

Thank  You