
Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content

May 26, 2015


Bill Graham

Transcript
Page 1

Using a Hadoop Data Pipeline to Build a Graph of Users and Content
Hadoop Summit - June 29, 2011

Bill Graham

[email protected]

Page 2

About me

• Principal Software Engineer

• Technology, Business & News BU (TBN)

• TBN Platform Infrastructure Team

• Background in SW Systems Engineering and Integration Architecture

• Contributor: Pig, Hive, HBase

• Committer: Chukwa

Page 3

About CBSi – who are we?

[Brand logo collage: Games & Movies; Tech, Biz & News; Sports; Entertainment; Music]

Page 4

About CBSi - scale

• Top 10 global web property

• 235M worldwide monthly uniques [1]

• Hadoop ecosystem: CDH3, Pig, Hive, HBase, Chukwa, Oozie, Sqoop, Cascading

• Cluster size:
– Current workers: 35 DW + 6 TBN (150 TB)
– Next quarter: 100 nodes (500 TB)

• DW peak processing: 400M events/day globally

[1] Source: comScore, March 2011

Page 5

Abstract

At CBSi we’re developing a scalable, flexible platform to provide the ability to aggregate large volumes of data, to mine it for meaningful relationships and to produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.

Page 6

The Problem

• Users are always voting on what they find interesting: got-it, want-it, like, share, follow, comment, rate, review, helpful vote, etc.

• Users have multiple identities:
– Anonymous
– Registered (logged in)
– Social
– Multiple devices

• Connections between entities live in siloed sub-graphs

• Wealth of valuable user connectedness going unrealized

Page 7

The Goal

• Create a back-end platform that enables us to assemble a holistic graph of our users and their connections to:
– Content
– Authors
– Each other
– Themselves

• Better understand how our users connect to our content

• Improved content recommendations

• Improved user segmentation and content/ad targeting

Page 8

Requirements

• Integrate with existing DW/BI Hadoop Infrastructure

• Aggregate data from across CBSi and beyond

• Connect disjointed user identities

• Flexible data model

• Assemble graph of relationships

• Enable rapid experimentation, data mining and hypothesis testing

• Power new site features and advertising optimizations

Page 9

The Approach

• Mirror data into HBase

• Use MapReduce to process data

• Export RDF data into a triple store

Page 10

Data Flow

[Diagram: CMS Publishing, CMS Systems, Social/UGC Systems, Content Tagging Systems, and DW Systems are bulk-loaded into HDFS, while the Site Activity Stream (a.k.a. Firehose, JMS) feeds HBase via atomic writes. MapReduce jobs (Pig, ImportTsv) transform and load data between HDFS and HBase. RDF is exported into a Triple Store, which the site queries via SPARQL.]

Page 11

NoSQL Data Models

[Chart: key-value stores, ColumnFamily stores, document databases, and graph databases plotted by data size versus data complexity. Credit: Emil Eifrem, Neo Technology]

Page 12

Conceptual Graph

[Diagram: a graph connecting user identities and content. An anonId node "is also" a regId and "had session" a SessionId, which "contains" PageEvent nodes (sourced from the DW, daily). Users "like" and "follow" Asset nodes (sourced from the activity firehose, in real time). Assets are "tagged with" tag nodes (sourced from the tagging systems, in batch), a Story is "authored by" an Author (sourced from the CMS, batch + incremental), and Asset nodes link via "is also" edges to Product, Author, Brand, and Story nodes.]
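In RDF terms, each edge above becomes a triple. A hypothetical sketch, reusing the URN namespaces from the SPARQL example on page 17 (the identity ids, the event id, and the isAlso predicate name are invented for illustration):

# Identity stitching: an anonymous id "is also" a registered id
<urn:com.cbs.dwh:ANON-abc123> <urn:com.cbs.dwh:isAlso> <urn:com.cbs.dwh:URS-42> .

# A Facebook "like" event linking the user to a content asset
<urn:com.cbs.dwh:ANON-abc123> <urn:com.cbs.trident:event:LIKE> <urn:com.cbs.trident:evt-1> .
<urn:com.cbs.trident:evt-1> <urn:com.cbs.trident:ssite> "www.facebook.com" .
<urn:com.cbs.trident:evt-1> <urn:com.cbs.trident:tasset> <urn:com.cbs.rb.contentdb:asset-99> .

# The asset's tag, which later powers the recommendation query
<urn:com.cbs.rb.contentdb:asset-99> <urn:com.cbs.cnb.bttrax:tag> <urn:com.cbs.cnb.bttrax:tag-7> .
<urn:com.cbs.cnb.bttrax:tag-7> <urn:com.cbs.cnb.bttrax:tagname> "hadoop" .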

Page 13

HBase Schema

user_info table

Row Key    | ALIAS: (col. name → value) | EVENT: (col. name → value)
-----------|----------------------------|----------------------------------------
ANON-<id1> | URS-<id1> → <ts>           | LIKE-<ts> → <json>; SHARE-<ts> → <json>
URS-<id1>  | ANON-<id1> → <ts>          |
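A rough sketch of how a job reads this schema back with Pig (assuming lowercase alias/event family names, matching the script on page 16, and HBaseStorage's support for loading a whole column family into a map):

-- Load each user row key plus its alias and event families as maps
RAW = LOAD 'hbase://user_info'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'alias:* event:*', '-loadKey true')
      AS (id:bytearray, alias_map:map[], event_map:map[]);

-- For row ANON-<id1>: alias_map is {URS-<id1> -> <ts>} and
-- event_map is {LIKE-<ts> -> <json>, SHARE-<ts> -> <json>}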

Page 14

HBase Loading

• Incremental
– Consuming from a JMS queue == real-time

• Batch
– Pig's HBaseStorage == quick to develop & iterate (see the sketch below)
– HBase's ImportTsv == more efficient
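A minimal sketch of the Pig batch-load path, assuming a tab-separated input of (user id, like-event JSON); the real table uses timestamped qualifiers like LIKE-<ts>, which this simplified version collapses to a fixed column:

-- Read raw like events from HDFS
LIKES = LOAD '/data/like_events.tsv' AS (id:chararray, like_json:chararray);

-- First field is the row key; remaining fields map to the listed columns
STORE LIKES INTO 'hbase://user_info'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:LIKE');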

Page 15

Generating RDF with Pig

• RDF [1] is a W3C standard for representing subject-predicate-object relationships (commonly serialized as XML)

• Philosophy: Store large amounts of data in Hadoop, be selective of what goes into the triple store

• For example:
– "First class" graph citizens we plan to query on
– Implicit-to-explicit (i.e., derived) connections:
  − Content recommendations
  − User segments
  − Related users
  − Content tags

• Easily join data to create new triples with Pig (sketched after this list)

• Run SPARQL [2] queries, examine, refine, reload

[1] http://www.w3.org/RDF, [2] http://www.w3.org/TR/rdf-sparql-query
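A hypothetical sketch of such a join, building on the user_event triples produced by the script on the next page (the tag data path, the tab-separated subject/predicate/object layout, and the interested_in predicate are invented for illustration):

-- Triples emitted by the RDF script: (subject, predicate, object)
EVENTS = LOAD 'trident/rdf/out/user_event' AS (s:chararray, p:chararray, o:chararray);

-- Asset-to-tag pairs from the content tagging systems
TAGS = LOAD 'trident/tags' AS (asset:chararray, tagname:chararray);

-- Split out user -> event edges and event -> asset edges
LIKES  = FILTER EVENTS BY p == 'urn:com.cbs.trident:event:LIKE';
TASSET = FILTER EVENTS BY p == 'urn:com.cbs.trident:tasset';

-- Stitch user -> event -> asset, then asset -> tag
J1 = JOIN LIKES BY o, TASSET BY s;
UA = FOREACH J1 GENERATE LIKES::s AS user, TASSET::o AS asset;
J2 = JOIN UA BY asset, TAGS BY asset;

-- Emit a derived user -> tag interest triple
INTEREST = FOREACH J2 GENERATE UA::user, 'urn:com.cbs.trident:interested_in', TAGS::tagname;
STORE INTEREST INTO 'trident/rdf/out/user_interest' USING PigStorage();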

Page 16

Example Pig RDF Script

Create RDF triples of users to social events:

-- (REGISTER/DEFINE statements for the custom UDFs mapToBag, jsonToMap,
-- and GenerateRDFTriple are omitted on the slide)

RAW = LOAD 'hbase://user_info'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:*', '-loadKey true')
      AS (id:bytearray, event_map:map[]);

-- Convert our maps to bags so we can flatten them out
A = FOREACH RAW GENERATE id, FLATTEN(mapToBag(event_map)) AS (social_k, social_v);

-- Convert the JSON events into maps
B = FOREACH A GENERATE id, social_k, jsonToMap(social_v) AS social_map:map[];

-- Pull values from the map
C = FOREACH B GENERATE id,
        social_map#'levt.asid'   AS asid,
        social_map#'levt.xastid' AS astid,
        social_map#'levt.event'  AS event,
        social_map#'levt.eventt' AS eventt,
        social_map#'levt.ssite'  AS ssite,
        social_map#'levt.ts'     AS eventtimestamp;

EVENT_TRIPLE = FOREACH C GENERATE GenerateRDFTriple(
    'USER-EVENT', id, astid, asid, event, eventt, ssite, eventtimestamp);

STORE EVENT_TRIPLE INTO 'trident/rdf/out/user_event' USING PigStorage();

Page 17

Example SPARQL query

Recommend content based on Facebook “liked” items:

SELECT ?asset1 ?tagname ?asset2 ?title2 ?pubdt2
WHERE {
  # anon-user who Like'd a content asset (news item, blog post) on Facebook
  <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ> <urn:com.cbs.trident:event:LIKE> ?x .
  ?x <urn:com.cbs.trident:eventt> "SOCIAL_SITE" .
  ?x <urn:com.cbs.trident:ssite> "www.facebook.com" .
  ?x <urn:com.cbs.trident:tasset> ?asset1 .
  ?asset1 a <urn:com.cbs.rb.contentdb:content_asset> .

  # a tag associated with the content asset
  ?asset1 <urn:com.cbs.cnb.bttrax:tag> ?tag1 .
  ?tag1 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .

  # other content assets with the same tag and their title
  ?asset2 <urn:com.cbs.cnb.bttrax:tag> ?tag2 .
  FILTER (?asset2 != ?asset1)
  ?tag2 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
  ?asset2 <http://www.w3.org/2005/Atom#title> ?title2 .
  ?asset2 <http://www.w3.org/2005/Atom#published> ?pubdt2 .
  FILTER (?pubdt2 >= "2011-01-01T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
}
ORDER BY DESC(?pubdt2) LIMIT 10

Page 18

Conclusions I - Power and Flexibility

• The architecture is flexible with respect to:
– Data modeling
– Integration patterns
– Data processing and querying techniques

• Multiple approaches to graph traversal:
– SPARQL
– HBase traversal
– MapReduce

Page 19

Conclusions II – Match Tool with the Job

• Hadoop – scale and computing horsepower

• HBase – atomic r/w access, speed, flexibility

• RDF Triple Store – complex graph querying

• Pig – rapid MR prototyping and ad-hoc analysis

• Future:
– HCatalog – schema & table management
– Oozie or Azkaban – workflow engine
– Mahout – machine learning
– Hama – graph processing

Page 20

Conclusions III – OSS, woot!

If it doesn’t do what you want, submit a patch.

Page 21

Questions?