Hadoop & Yahoo!: Using and Improving Apache Hadoop at Yahoo!
Eric Baldeschwieler, VP, Hadoop Software

hadoop @ Ibmbigdata

Jan 15, 2015


Eric Baldeschwieler's talk about Hadoop at Yahoo, given at IBM Big Data Symposium
Transcript
Page 1: hadoop @ Ibmbigdata

HADOOP & YAHOO!

USING AND IMPROVING APACHE HADOOP AT YAHOO!

Eric Baldeschwieler, VP, Hadoop Software

Page 2: hadoop @ Ibmbigdata

AGENDA

• Brief Overview
• Hadoop @ Yahoo!
• Hadoop Momentum
• The Future of Hadoop

Page 3: hadoop @ Ibmbigdata

WHAT’S HAPPENING

- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical

Flickr : sub_lime79

Page 4: hadoop @ Ibmbigdata

TURNING DATA INTO INSIGHTS

• machine learning
• time series
• content clustering
• factorization models
• logic regression
• algorithms
• user interest prediction
• ad inventory modeling

Flickr : NASA Goddard Photo and Video

Page 5: hadoop @ Ibmbigdata

MAKING YAHOO RELEVANT

Flickr : ogimogi

Page 6: hadoop @ Ibmbigdata

HADOOP: POWERING YAHOO!

science + big data + insight = personal relevance = VALUE

Flickr : DDFic

Page 7: hadoop @ Ibmbigdata

WHAT IS HADOOP?

Stack diagram (top to bottom):
• Programming Languages: Pig, Hive
• Computation: MapReduce
• Storage: HDFS
• Commodity: Computers, Network

Focus on:
• Simplicity
• Redundancy
• Scale
• Availability

Transforms commodity equipment into a service that:
• HDFS: Stores petabytes of data reliably
• MapReduce: Allows huge distributed computations

Key Attributes
• Redundant and reliable: Doesn’t stop or lose data even as hardware fails
• Easy to program: Our rocket scientists use it directly!
• Very powerful: Allows the development of big data algorithms & tools
• Batch processing centric
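A minimal sketch of the MapReduce model named on this slide, written in plain Python rather than against the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and everything here runs in a single process; Hadoop runs the same three steps spread across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big insight"]))
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be run in parallel on separate machines, which is what makes the model suit commodity clusters.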

Page 8: hadoop @ Ibmbigdata

WHAT HADOOP ISN’T

• A replacement for relational and data warehouse systems
• A transactional / online / serving system
• A low latency or streaming solution

Page 9: hadoop @ Ibmbigdata

HADOOP IN THE ENTERPRISE

RDBMS, EDW, Data Marts

HADOOP CLUSTER(S)

Transactions, Structured Data

Business Applications

Web Logs, Server Logs, Social Media, etc.

Interactions, Semi-Structured or Unstructured Data

Business Intelligence Applications

Page 10: hadoop @ Ibmbigdata

HADOOP @ YAHOO!

Page 11: hadoop @ Ibmbigdata

HADOOP @ YAHOO!
“Where Science meets Data”

HADOOP CLUSTERS
• Tens of thousands of servers
• 10s of Petabytes

PRODUCTS
• Data Analytics
• Content Optimization
• Content Enrichment
• Yahoo! Mail Anti-Spam
• Advertising Products
• Ad Optimization
• Ad Selection
• Big Data Processing & ETL

APPLIED SCIENCE
• User Interest Prediction
• Ad inventory prediction
• Machine learning - search ranking
• Machine learning - ad targeting
• Machine learning - spam filtering

Page 12: hadoop @ Ibmbigdata

FROM PROJECT TO CORE PLATFORM

[Chart, 2006 to 2010. Axes: Thousands of Servers, Petabytes. Phases: Research, Science Impact, Daily Production. Callouts: 40K+ Servers, 170 PB Storage, 5M+ Monthly Jobs, “Behind every click”.]

Page 13: hadoop @ Ibmbigdata

HADOOP POWERS THE YAHOO! NETWORK

• advertising optimization
• ad selection
• Yahoo! Homepage
• machine learning search ranking
• ad inventory prediction
• Yahoo! Mail anti-spam
• user interest prediction
• audience, ad and search pipelines
• advertising data systems
• Content Optimization
• data analytics

Page 14: hadoop @ Ibmbigdata

CASE STUDY: YAHOO! HOMEPAGE

Personalized for each visitor
Result: twice the engagement

• +160% clicks vs. one size fits all
• +79% clicks vs. randomly selected
• +43% clicks vs. editor selected

Page modules shown: Recommended links, News Interests, Top Searches
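The click figures on this slide are relative lifts over a baseline, so +160% means 2.6 times the baseline clicks, not 1.6 times. A small sketch of that conversion follows; the function and dictionary names are illustrative assumptions, and only the three percentages come from the slide.

```python
def lift_to_multiplier(lift_percent: float) -> float:
    """Convert a relative lift such as +160% into a multiplier over the baseline."""
    return 1.0 + lift_percent / 100.0


# Lift figures quoted on the slide, keyed by the baseline they are compared against.
SLIDE_LIFTS = {
    "one size fits all": 160.0,
    "randomly selected": 79.0,
    "editor selected": 43.0,
}

if __name__ == "__main__":
    for baseline, lift in SLIDE_LIFTS.items():
        multiplier = lift_to_multiplier(lift)
        print(f"+{lift:.0f}% vs. {baseline}: {multiplier:.2f}x the baseline clicks")
```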

Page 15: hadoop @ Ibmbigdata

CASE STUDY: YAHOO! HOMEPAGE

• Serving Maps: Users - Interests
• Five Minute Production
• Weekly Categorization models

SCIENCE HADOOP CLUSTER
» Input: USER BEHAVIOR
» Machine learning to build ever better categorization models
» Output: CATEGORIZATION MODELS (weekly)

PRODUCTION HADOOP CLUSTER
» Input: USER BEHAVIOR
» Identify user interests using Categorization models
» Output: SERVING MAPS (every 5 minutes)

SERVING SYSTEMS
» Build customized home pages with latest data (thousands / second)
» Output: ENGAGED USERS
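The production step on this slide (turn recent user behavior into a users-to-interests serving map using the current categorization model) can be sketched as below. This is an assumption-laden illustration, not Yahoo!'s implementation: the keyword-lookup `CATEGORY_MODEL`, the event format, and the function names are all hypothetical stand-ins for a learned model and real click logs.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical stand-in for the weekly categorization model: keyword -> category.
CATEGORY_MODEL: Dict[str, str] = {
    "election": "politics",
    "senate": "politics",
    "playoffs": "sports",
    "nba": "sports",
}


def categorize(headline: str, model: Dict[str, str]) -> Set[str]:
    """Return the set of categories whose keywords appear in the headline."""
    return {model[word] for word in headline.lower().split() if word in model}


def build_serving_map(
    events: Iterable[Tuple[str, str]],
    model: Dict[str, str],
    top_n: int = 2,
) -> Dict[str, List[str]]:
    """Aggregate (user, clicked headline) events into user -> top interests.

    Each headline contributes at most one vote per category for its user.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for user, headline in events:
        votes[user].update(categorize(headline, model))
    return {
        user: [category for category, _ in counts.most_common(top_n)]
        for user, counts in votes.items()
    }


if __name__ == "__main__":
    recent_events = [
        ("u1", "NBA playoffs tonight"),
        ("u1", "Senate election results"),
        ("u1", "NBA trade rumors"),
        ("u2", "Election night coverage"),
    ]
    print(build_serving_map(recent_events, CATEGORY_MODEL))
```

In the slide's terms, swapping in a new `CATEGORY_MODEL` corresponds to the weekly model refresh, while rerunning `build_serving_map` on fresh events corresponds to the five-minute production cycle.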

Page 16: hadoop @ Ibmbigdata

CASE STUDY: YAHOO! MAIL
Enabling quick response in the spam arms race

• 450M mail boxes
• 5B+ deliveries/day
• Antispam models retrained every few hours on Hadoop

“40% less spam than Hotmail and 55% less spam than Gmail”

SCIENCE

PRODUCTION

Page 17: hadoop @ Ibmbigdata

YAHOO! & APACHE HADOOP

Yahoo! has contributed 70+% of Apache Hadoop code to date

Hadoop is not our business, but Hadoop is key to our business
• Yahoo! benefits from open source ecosystem around Hadoop
• Hadoop drives revenue at Yahoo! by making our core products better

We need Hadoop to be rock solid
• We invest heavily in core Hadoop development
• We focus on scalability, reliability, availability

We fix bugs before you see them
• We run very large clusters
• We have a large QA effort
• We run a huge variety of workloads

We are good Apache Hadoop citizens
• We contribute our work to Apache
• We share the exact code we run

Page 18: hadoop @ Ibmbigdata

HADOOP MOMENTUM

Page 19: hadoop @ Ibmbigdata

HADOOP IS GOING MAINSTREAM

[Figure covering 2007, 2008, 2009, 2010. Credit: The Datagraph Blog]

Page 20: hadoop @ Ibmbigdata

THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM

Apache Hadoop

• and other Early Adopters: Scale and productize Hadoop
• Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
• Service Providers: Grow ecosystem - Training, support, enhancements
• Mainstream / Enterprise adoption: Drive further development, enhancements

Enhance Hadoop Ecosystem

Virtuous Circle!
• Investment -> Adoption
• Adoption -> Investment

Page 21: hadoop @ Ibmbigdata

THE FUTURE OF HADOOP

Page 22: hadoop @ Ibmbigdata

WHAT’S NEXT: MAKING HADOOP ENTERPRISE-READY

Hadoop is far from “done”
• Current implementation is showing its age
• Need to address several deficiencies in scalability, flexibility, ease of use & performance

Yahoo! is working on Next Generation of Hadoop
• MapReduce: Rewrite to improve performance; pluggable support for new programming models
• HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS

Apache should remain the hub of Hadoop ecosystem
• Yahoo! contributes all Hadoop changes back to Apache Hadoop
• Everyone benefits from shared neutral foundation

 

Page 23: hadoop @ Ibmbigdata

Questions?