Building Data Applications with Hadoop

Tom White (@tom_e_white)
London Java Community (#ljcjug), 29 August 2013

What is Hadoop?

HDFS and MapReduce

A Hadoop Stack

Glossary

• Apache Avro – cross-language data serialization library
• Apache HCatalog – metadata storage system (part of Hive)
• Apache Flume – streaming log capture and delivery system
• Apache Oozie – workflow scheduler system
• Apache Crunch – Java API for writing data pipelines
• Parquet – column-oriented storage format for nested data
• Impala – interactive SQL on Hadoop

Hadoop Pain Points*

* Not exhaustive

Choosing a File Format

                 No compression   gzip   snappy   lzo   bzip2
Delimited text
JSON
Sequence File
Avro File
RCFile
Parquet
…

? ? ?

Defining a Data Model

Schema on read vs. schema on write

What is the user ID field called?

• uid
• userId
• userid
• user_id
• user_Id

Source: “Scaling Big Data Mining Infrastructure: The Twitter Experience” by Lin and Ryaboy
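
The cost of these competing spellings can be sketched in plain Java. This is only an illustration under assumptions: records are modelled as string maps, and the class name UserIdLookup, the method findUserId, and the lookup order are hypothetical, not taken from the talk. With schema on read, every consumer ends up carrying a fallback chain like this one.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class UserIdLookup {
    // Spellings listed on the slide; the order they are tried in is an arbitrary choice.
    static final List<String> USER_ID_ALIASES =
        Arrays.asList("user_id", "uid", "userId", "userid", "user_Id");

    // Returns the user ID stored under the first matching spelling, if any.
    static Optional<Long> findUserId(Map<String, String> record) {
        for (String alias : USER_ID_ALIASES) {
            String value = record.get(alias);
            if (value != null) {
                return Optional.of(Long.parseLong(value));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> fromOneTeam = new LinkedHashMap<>();
        fromOneTeam.put("uid", "42");
        Map<String, String> fromAnotherTeam = new LinkedHashMap<>();
        fromAnotherTeam.put("user_Id", "42");

        // Both records carry the same ID, but only the alias list makes that visible.
        System.out.println(findUserId(fromOneTeam));
        System.out.println(findUserId(fromAnotherTeam));
    }
}
```

A single agreed schema removes the need for the alias list altogether.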


Defining a File Layout

• Which is best?
    • /data/clickstream/20120101
    • /data/clickstream/date=20120101
    • /data/clickstream/2012/01/01
    • /data/clickstream/year=2012/month=01/day=01

A Pattern: Hadoop is Flexible

… but also low-level and complex

Some Best Practices

• Use Avro Schemas for the data model
• Use Avro Data Files for row-oriented data
• Use Parquet for column-oriented data
• Use a Hive/HCatalog compatible file layout:
    • /data/<dataset>/partition-1=<x>/partition-2=<y>
• Use a library like Crunch or Cascading for batch analysis
• Use Impala for interactive ad hoc analysis
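
One reason to prefer the key=value layout is that a path describes its own partitions. The following plain-Java sketch builds and parses such paths; the class PartitionPaths and its methods are hypothetical helpers written for this illustration and are not part of Hive, HCatalog, or the CDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PartitionPaths {
    // Builds <root>/<dataset>/<key>=<value>/... from partition values in iteration order.
    static String format(String root, String dataset, Map<String, String> partitions) {
        StringBuilder path = new StringBuilder(root).append('/').append(dataset);
        for (Map.Entry<String, String> e : partitions.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    // Recovers partition keys and values from a path; segments without '=' are skipped.
    static Map<String, String> parse(String path) {
        Map<String, String> partitions = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                partitions.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        Map<String, String> day = new LinkedHashMap<>();
        day.put("year", "2012");
        day.put("month", "01");
        day.put("day", "01");

        String path = format("/data", "clickstream", day);
        System.out.println(path);        // /data/clickstream/year=2012/month=01/day=01
        System.out.println(parse(path)); // {year=2012, month=01, day=01}
    }
}
```

A path such as /data/clickstream/2012/01/01 would yield nothing from parse, because the directory names alone do not say which field each level represents.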

The Cloudera Development Kit Codifies Best Practice as APIs, Tools, Docs and Examples

CDK

• A client-side library for writing Hadoop Data Applications
• First release was in April 2013
• 0.6.0 released earlier this month (August 2013)
• Open source, Apache 2 license
• Modular
    • Data module (HDFS, Flume, Crunch, HCatalog)
    • Morphlines transformation module
    • Maven plugin

A typical system (zoom 100:1)

A typical system (zoom 10:1)

A typical system (zoom 5:1)

Example

1. The Events Schema

{
  "name": "StandardEvent",
  "namespace": "com.cloudera.cdk.data.event",
  "type": "record",
  "fields": [
    { "name": "event_initiator", "type": "string" },
    { "name": "event_name", "type": "string" },
    { "name": "user_id", "type": "long" },
    { "name": "session_id", "type": "string" },
    { "name": "ip", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}

2. Define the Events Dataset

$ mvn cdk:create-dataset \
    -Dcdk.rootDirectory=hdfs://namenode/data \
    -Dcdk.datasetName=events \
    -Dcdk.avroSchemaFile=standard_event.avsc

(2. Or in Code)

// Open a dataset repository rooted at the same HDFS directory
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .rootDirectory(new URI("hdfs://namenode/data")).get();

// Load the Avro schema from the classpath
Schema schema = new Schema.Parser().parse(
    Resources.getResource("standard_event.avsc").openStream());

// Describe the dataset by its schema, then create it
DatasetDescriptor descriptor = new DatasetDescriptor.Builder().schema(schema).get();
Dataset events = repo.create("events", descriptor);

3. Write Events

public class LoggingServlet extends HttpServlet {

  // A: a log4j logger is the only handle the servlet needs
  private Logger logger = Logger.getLogger(LoggingServlet.class);

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    // Assumption: the slide omits where userId comes from; a request parameter is one option
    String userId = request.getParameter("user_id");

    // B: build the event with the builder generated from the StandardEvent schema
    StandardEvent event = StandardEvent.newBuilder()
        .setEventInitiator("server_user")
        .setEventName("web:message")
        .setUserId(Long.parseLong(userId))
        .setSessionId(request.getSession(true).getId())
        .setIp(request.getRemoteAddr())
        .setTimestamp(System.currentTimeMillis())
        .build();

    // C: log the event object itself
    logger.info(event);
  }
}
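
The builder methods above line up with the snake_case field names in the events schema (event_initiator becomes setEventInitiator, and so on). The sketch below reproduces that naming pattern in plain Java purely as an illustration; AccessorNames and setterFor are hypothetical names, and this is not Avro's own code generator.

```java
class AccessorNames {
    // Converts a snake_case field name to a camel-cased setter name.
    static String setterFor(String fieldName) {
        StringBuilder name = new StringBuilder("set");
        boolean upperNext = true; // the first letter after "set" is capitalised
        for (char c : fieldName.toCharArray()) {
            if (c == '_') {
                upperNext = true;  // drop the underscore, capitalise what follows
            } else if (upperNext) {
                name.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                name.append(c);
            }
        }
        return name.toString();
    }

    public static void main(String[] args) {
        String[] fields = {
            "event_initiator", "event_name", "user_id", "session_id", "ip", "timestamp"
        };
        for (String field : fields) {
            System.out.println(field + " -> " + setterFor(field));
        }
    }
}
```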

The resulting file layout

/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /year=2013
      /month=08
        /day=27
          /hour=15
            /FlumeData.1375378320979
            /FlumeData.1375378320980
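
How an event's timestamp could map onto this year/month/day/hour layout can be sketched in plain Java. This is an assumption-laden illustration rather than the CDK's partitioning code: the class EventPartitions and the method directoryFor are hypothetical, and the choice of UTC is an assumption the slides do not confirm.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

class EventPartitions {
    // Maps an event timestamp (epoch milliseconds) to its hourly partition directory.
    // Assumption: partitions are computed in UTC.
    static String directoryFor(String datasetRoot, long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "%s/year=%04d/month=%02d/day=%02d/hour=%02d",
            datasetRoot, t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour());
    }

    public static void main(String[] args) {
        long timestamp = ZonedDateTime.of(2013, 8, 27, 15, 30, 0, 0, ZoneOffset.UTC)
            .toInstant().toEpochMilli();
        // Prints /data/events/year=2013/month=08/day=27/hour=15
        System.out.println(directoryFor("/data/events", timestamp));
    }
}
```

Zero-padding the month, day, and hour keeps directory listings in chronological order when sorted as strings.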

4. Generate Derived Sessions

<build>
  <plugins>
    <plugin>
      <groupId>com.cloudera.cdk</groupId>
      <artifactId>cdk-maven-plugin</artifactId>
      <configuration>
        <toolClass>com.cloudera.cdk.examples.demo.CreateSessions</toolClass>
      </configuration>
    </plugin>
  </plugins>
</build>

$ mvn cdk:create-dataset -Dcdk.datasetName=sessions ...
$ mvn cdk:run-tool
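
The slides do not show what CreateSessions does internally. As a rough in-memory sketch of the idea, assuming a session's duration is the time between its first and last event, events can be grouped by session ID and reduced to one duration each. The names SessionSketch, Event, durations, and averageDuration are hypothetical; a real job would run as a batch pipeline over the events dataset rather than over a Java list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SessionSketch {
    // Only the two event fields this sketch needs.
    static final class Event {
        final String sessionId;
        final long timestamp;

        Event(String sessionId, long timestamp) {
            this.sessionId = sessionId;
            this.timestamp = timestamp;
        }
    }

    // Returns session ID -> duration in milliseconds (last timestamp minus first).
    static Map<String, Long> durations(List<Event> events) {
        Map<String, long[]> bounds = new LinkedHashMap<>(); // {earliest, latest}
        for (Event e : events) {
            long[] b = bounds.get(e.sessionId);
            if (b == null) {
                bounds.put(e.sessionId, new long[] {e.timestamp, e.timestamp});
            } else {
                b[0] = Math.min(b[0], e.timestamp);
                b[1] = Math.max(b[1], e.timestamp);
            }
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> entry : bounds.entrySet()) {
            result.put(entry.getKey(), entry.getValue()[1] - entry.getValue()[0]);
        }
        return result;
    }

    // Mean duration across sessions; 0.0 when there are none.
    static double averageDuration(Map<String, Long> durations) {
        return durations.values().stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event("s1", 1000L));
        events.add(new Event("s2", 500L));
        events.add(new Event("s1", 4000L));

        Map<String, Long> sessions = durations(events);
        System.out.println(sessions);                  // {s1=3000, s2=0}
        System.out.println(averageDuration(sessions)); // 1500.0
    }
}
```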

5. Run ad hoc queries

$ impala-shell -q 'SELECT AVG(duration) FROM sessions'
Query: select AVG(duration)
FROM sessions
Query finished, fetching results ...
+---------------+
| avg(duration) |
+---------------+
| 5475.5        |
+---------------+
Returned 1 row(s) in 0.24s

Extensions

• Use Oozie to generate sessions partitions every hour
• Tools to evolve the data model compatibly
• Read datasets using
    • Impala JDBC
    • CDK dataset API
    • Other Hadoop tools (Pig, Hive)
• https://github.com/cloudera/cdk-examples/tree/master/demo

Questions?

https://github.com/cloudera/cdk
