Big Data: Guidelines and Examples for the Enterprise Decision Maker

Post on 29-Aug-2014


DESCRIPTION

This presentation covers how to use MongoDB with Hadoop to leverage big data within your company.

Transcript

Big Data: Examples and Guidelines for the Enterprise Decision Maker

Solutions Architect, MongoDB

Buzz Moschetti
buzz.moschetti@mongodb.com

#MongoDB

Who is your Presenter?

• Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at JPMorganChase, and at Bear Stearns before that
• Over 25 years of designing and building systems
  • Big and small
  • Super-specialized to broadly useful in any vertical
  • “Traditional” to completely disruptive
• Advocate of language leverage and strong factoring
• Still programming – using emacs, of course

Agenda

• (Occasionally) Brutal Truths about Big Data

• Review of Directed Content Business Architecture

• A Simple Technical Implementation

Truths

• Clear definition of Big Data still maturing
• Efficiently operationalizing Big Data is non-trivial
  • Developing, debugging, understanding MapReduce
  • Cluster monitoring & management, job scheduling/recovery
  • If you thought regular ETL Hell was bad…
• Big Data is not about math/set accuracy
  • The last 25,000 items in a 25,497,612-item set “don’t matter”
• Big Data questions are best asked periodically
  • “Are we there yet?”
• Realtime means … realtime

It’s About The Functions, not the Terms

DON’T ASK:• Is this an operations or an analytics

problem?• Is this online or offline?• What query language should we use?• What is my integration strategy across tools?ASK INSTEAD:• Am I incrementally addressing data (esp.

writes)?• Am I computing a precise answer or a

trend?• Do I need to operate on this data in

realtime?• What is my holistic architecture?

What We’re Going to “Build” today

Realtime Directed Content System

• Based on what users click, “recommended” content is returned in addition to the target
• The example is sector-neutral (manufacturing, financial services, retail)
• System dynamically updates behavior in response to user activity

The Participants and Their Roles

The Directed Content System brings together:

• Customers
• Content Creators – generate and tag content from a known domain of tags
• Management/Strategy – make decisions based on trends and other summarized data
• Analysts/Data Scientists – operate on data to identify trends and develop tag domains
• Developers/ProdOps – bring it all together: apps, SDLC, integration, etc.

Priority #1: Maximizing User value

Considerations/Requirements

• Maximize realtime user value and experience
• Provide management reporting and trend analysis
• Engineer for Day 2 agility on recommendation engine
• Provide scrubbed click history for customer
• Permit low-cost horizontal scaling
• Minimize technical integration
• Minimize technical footprint
• Use conventional and/or approved tools
• Provide a RESTful service layer
• …

The Architecture

[Diagram: App(s) ↔ mongoDB ↔ Hadoop MapReduce]

Complementary Strengths

mongoDB + App(s):

• Standard design paradigm (objects, tools, 3rd-party products, IDEs, test drivers, skill pool, etc.)
• Language flexibility (Java, C#, C++, Python, Scala, …)
• Webscale deployment model: appservers, DMZ, monitoring
• High-performance, rich-shape CRUD

Hadoop + MapReduce:

• MapReduce design paradigm
• Node deployment model
• Very large set operations
• Computationally intensive, longer duration
• Read-dominated workload

“Legacy” Approach: Somewhat unidirectional

• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Where’s the feedback?

Somewhat better approach

• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Move important summary data back to mongoDB for consumption by apps

…but the overall problem remains:

• How do you integrate and operate, in realtime, on both periodically generated data and current realtime data?

• Lackluster integration between OLTP and Hadoop

• It’s not just about the database: you need a realtime profile and profile update function

The legacy problem in pseudocode

onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}

• Realtime intraday state not well-handled

• Baselining is a different problem than click handling

The Right Approach

• Users have a specific Profile entity

• The Profile captures trend analytics as baselining information

• The Profile has per-tag “counters” that are updated with each interaction / click

• Counters plus baselining are passed to the fetch function

• The fetch function itself could be dynamic!
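The per-tag counter idea above can be sketched in Java. This is a minimal, hypothetical Profile class (the class shape, method names, and field layout are assumptions for illustration, not from the talk): each click appends to the tag's history, and a windowed counter supports the kind of "clicked N times in the past M minutes" checks the recommendation algorithms need, while the batch-computed baseline rides along in the same entity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a Profile with per-tag counters and a batch baseline.
class Profile {
    // per-tag click timestamps (millis); newest last
    private final Map<String, Deque<Long>> hist = new HashMap<>();
    // per-tag baseline produced by the nightly MapReduce job
    private final Map<String, int[]> baseline = new HashMap<>();

    // called on every interaction/click: record the click against the tag
    public void onClick(String tag, long tsMillis) {
        hist.computeIfAbsent(tag, t -> new ArrayDeque<>()).addLast(tsMillis);
    }

    // how many clicks on this tag fall within the trailing window?
    public int clicksInWindow(String tag, long nowMillis, long windowMillis) {
        Deque<Long> h = hist.get(tag);
        if (h == null) return 0;
        int n = 0;
        for (long ts : h) if (nowMillis - ts <= windowMillis) n++;
        return n;
    }

    public void setBaseline(String tag, int[] b) { baseline.put(tag, b); }
    public int[] getBaseline(String tag) { return baseline.getOrDefault(tag, new int[0]); }
}
```

Both the realtime click path and the batch baseline path write into the same entity, which is the point of the slide: one Profile feeds the fetch function.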

24 hours in the life of The System

• Assume some content has been created and tagged

• Two systematized tags: Pets & PowerTools

Monday, 1:30AM EST

• Fetch all user Profiles from mongoDB; load into Hadoop
  • Or skip if using the mongoDB-Hadoop connector!


mongoDB-Hadoop MapReduce Example

import java.io.IOException;
import java.util.Date;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class ProfileMapper
        extends Mapper<Object, BSONObject, IntWritable, IntWritable> {

    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
            throws IOException, InterruptedException {
        String user = (String) pValue.get("user");
        Date d1 = (Date) pValue.get("lastUpdate");

        // Sum the per-tag click-history sizes for this Profile document
        int count = 0;
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        for (String tag : keys) {
            BSONObject t = (BSONObject) tags.get(tag);
            count += ((List<?>) t.get("hist")).size();
        }
        int avg = count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}

Monday, 1:45AM EST

• Grind through all content data and user Profile data to produce:
  • Tags based on feature extraction (vs. creator-applied tags)
  • Trend baseline per user for tags Pets and PowerTools
• Load Profiles with new baseline back into mongoDB
  • Or skip if using the mongoDB-Hadoop connector!


Monday, 8AM EST

• User Bob logs in and his Profile is retrieved from mongoDB
• Bob clicks on Content X, which is already tagged as “Pets”
• Bob has clicked on Pets-tagged content many times
• Adjust Profile for tag “Pets” and save back to mongoDB

• Analysis = f(Profile)

• Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.


Monday, 8:02AM EST

• Bob clicks on Content Y, which is already tagged as “Spices”
• Spices is a new tag type for Bob
• Adjust Profile for tag “Spices” and save back to mongoDB
• Analysis = f(Profile)


Profile in Detail

{
  user: "Bob",
  personalData: { zip: "10024", gender: "M" },
  tags: {
    PETS: {
      algo: "A4",
      baseline: [0,0,10,4,1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 },
        { ts: datetime2, url: url2 }
        // 100 more
      ]
    },
    SPICE: {
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}

Tag-based algorithm detail

getRecommendedContent(profile, ["PETS", other]) {
    if (algo for a tag available) {
        filter = algo(profile, tag);
    }
    fetch N recommendations(filter);
}

A4(profile, tag) {
    weight = get tag ("PETS") global weighting;
    adjustForPersonalBaseline(weight, "PETS" baseline);
    if "PETS" clicked more than 2 times in past 10 mins then weight += 10;
    if "PETS" clicked more than 10 times in past 2 days then weight += 3;
    return new filter({"PETS", weight}, globals);
}
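The A4 pseudocode can be rendered as runnable Java. This is a sketch: the window sizes and weight bumps come from the slide, but the class, method names, and the `clicksWithin` helper are assumptions, and the personal-baseline adjustment step is omitted for brevity.

```java
import java.util.List;

// Hypothetical runnable rendering of the A4 weighting rules above.
class A4Algo {
    static final long TEN_MIN = 10L * 60 * 1000;          // 10 minutes in millis
    static final long TWO_DAYS = 2L * 24 * 60 * 60 * 1000; // 2 days in millis

    // clickTimestamps: millis of the user's clicks on this tag, any order
    public static int weight(int globalWeight, List<Long> clickTimestamps, long nowMillis) {
        int weight = globalWeight; // start from the tag's global weighting
        // (baseline adjustment would go here: adjustForPersonalBaseline)
        if (clicksWithin(clickTimestamps, nowMillis, TEN_MIN) > 2) weight += 10;
        if (clicksWithin(clickTimestamps, nowMillis, TWO_DAYS) > 10) weight += 3;
        return weight;
    }

    static int clicksWithin(List<Long> ts, long now, long window) {
        int n = 0;
        for (long t : ts) if (now - t <= window) n++;
        return n;
    }
}
```

Because the weighting is a pure function of the Profile's counters, it can be swapped per tag at runtime, which is what the "the fetch function itself could be dynamic" point alludes to.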

Tuesday, 1AM EST

• Fetch all user Profiles from mongoDB; load into Hadoop
  • Or skip if using the mongoDB-Hadoop connector!

Tuesday, 1:30AM EST

• Grind through all content data and user Profile data to produce:
  • Tags based on feature extraction (vs. creator-applied tags)
  • Trend baseline for Pets, PowerTools, and Spice
• Data can be specific to an individual or by group
• Load baseline back into mongoDB
  • Or skip if using the mongoDB-Hadoop connector!


New Profile in Detail

{
  user: "Bob",
  personalData: { zip: "10024", gender: "M" },
  tags: {
    PETS: {
      algo: "A4",
      baseline: [0,0,10,4,1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 },
        { ts: datetime2, url: url2 }
        // 100 more
      ]
    },
    SPICE: {
      baseline: [0],
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}

Tuesday, 1:35AM EST

• Perform maintenance on user Profiles
  • Click history trimming (variety of algorithms)
  • “Dead tag” removal
  • Update of auxiliary reference data
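Click history trimming admits a variety of algorithms, as the slide notes. One minimal sketch (an assumption, not the talk's implementation) is to cap each tag's history at the most recent N entries, which matches the example Profiles shrinking from "100 more" to "50 more" history items:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trimming pass run during Profile maintenance.
class HistoryTrimmer {
    // Keep only the newest maxEntries items; input is assumed oldest-first.
    public static <T> List<T> trimToMostRecent(List<T> hist, int maxEntries) {
        if (hist.size() <= maxEntries) return new ArrayList<>(hist);
        // keep the tail of the list: the most recent entries
        return new ArrayList<>(hist.subList(hist.size() - maxEntries, hist.size()));
    }
}
```

Time-based trimming (drop everything older than N days) is an equally plausible variant; the choice depends on how far back the baselining job needs to look.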


New Profile in Detail

{
  user: "Bob",
  personalData: { zip: "10022", gender: "M" },
  tags: {
    PETS: {
      algo: "A4",
      baseline: [ 1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 }
        // 50 more
      ]
    },
    SPICE: {
      algo: "Z1",
      baseline: [0],
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}

Feel free to run the baselining more frequently

… but avoid “Are We There Yet?”


Nearterm / Realtime Questions & Actions

With respect to the Customer:
• What has Bob done over the past 24 hours?
• Given an input, make a logic decision in 100ms or less

With respect to the Provider:
• What are all current users doing or looking at?
• Can we nearterm correlate single events to shifts in behavior?

Longterm / Not Realtime Questions & Actions

With respect to the Customer:
• Any way to explain historic performance / actions?
• What are recommendations for the future?

With respect to the Provider:
• Can we correlate multiple events from multiple sources over a long period of time to identify trends?
• What is my entire customer base doing over 2 years?
• Show me a time vs. aggregate tag hit chart
• Slice and dice and aggregate tags vs. XYZ
• What tags are trending up or down?
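As an illustration of the "trending up or down" question, here is a sketch that assumes the batch layer has already bucketed clicks into per-tag counts for two adjacent periods (say, last week vs. the week before); the class and method names are illustrative assumptions, not from the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trend computation over batch-produced per-tag click counts.
class TagTrends {
    // Returns tag -> delta (current - previous); positive means trending up.
    public static Map<String, Integer> trend(Map<String, Integer> previous,
                                             Map<String, Integer> current) {
        Map<String, Integer> delta = new HashMap<>();
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            delta.put(e.getKey(), e.getValue() - previous.getOrDefault(e.getKey(), 0));
        }
        // tags with no clicks this period trend down by their old count
        for (Map.Entry<String, Integer> e : previous.entrySet()) {
            delta.putIfAbsent(e.getKey(), -e.getValue());
        }
        return delta;
    }
}
```

The same delta table feeds both the management report and the per-tag global weightings consumed by the realtime fetch function.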

The Key To Success: It is One System

[Diagram: App(s), mongoDB, Hadoop, and MapReduce operating as one integrated system]

Webex Q&A

Thank You

Buzz Moschetti
buzz.moschetti@mongodb.com

#MongoDB
