The New CDO Challenge: Taming the Big Data Beast for Value & Insights. Usama Fayyad, Ph.D., Chief Data Officer, Barclays. Twitter: @usamaf. Feb 25th, 2014. CDO Leadership Forum, London.
Transcript
Page 1: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

1 CDO Leadership Forum – Copyright Usama Fayyad © 2014

The New CDO Challenge: Taming the Big Data Beast for Value & Insights

Usama Fayyad, Ph.D.

Chief Data Officer – Barclays

Twitter: @usamaf

Feb 25th, 2014

CDO Leadership Forum

London

Page 2: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Outline
• The CDO role
• Big Data all around us
• Some of the issues in BigData
• Introduction to Data Mining and Predictive Analytics Over BigData
• Case studies
• Summary and conclusions

Page 3: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


What Matters in the Age of Analytics?

1. Being able to exploit all the data that is available
• not just what you've got available
• what you can acquire and use to enhance your actions
2. Proliferating analytics throughout the organization
• make every part of your business smarter
3. Driving significant business value
• embedding analytics into every area of your business can help you drive top-line revenues and/or bottom-line cost efficiencies

Page 4: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Why a Chief Data Officer?
• There is a fundamental realisation that Data needs to become a primary value driver at organizations
– We have lots of Data
– We spend much on it: in tech and resources
– We are not realising the expected value we could get from it
• A strong business need to create the CDO role:
– New generation companies are not following, but adopting the model that actually works in other data-intensive industries
• CDO has a seat at executive table: the voice of Data
• Data done right is an essential element to unify large enterprises to unlock value from business synergies

4 | DSI Town Hall l February 2014

Page 5: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Why a Chief Data Officer?

• CDO has a governance, architectural, and advisory role, but also…

• CDO has execution and development groups to deliver on the Data agenda


Page 6: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Fundamental Data Principles

1. Data gains value exponentially when integrated and coalesced. When fragmented: dramatic value loss; increased costs; reduced utility/integrity; and increased security risks

2. Fusing Data together from disparate /independent sources is difficult to achieve and impossible to maintain: hence fixing at the source and controlling lifecycle and flow is the only viable approach

3. Standardisation is essential: for sustained ability to integrate data sources and hence growing value; for simplifying down-stream systems and apps

4. Data governance and policy must be centralised and need to be enforced strongly else we slip into chaos and a Babylon of terms/languages

– An Enterprise Data Architecture spanning structured and unstructured data


Page 7: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Fundamental Data Principles

5. Encryption and Masking: Persisting unencrypted confidential and secret data (even within our firewalls) is an invitation for problems and risks

6. Data infrastructure needs renewal & modernization: the pace of change and development of technology is very rapid:

– Design for migration and infrastructure replacement via abstraction layers that remove tech dependencies

7. Data is a primary competency and not a side-activity supporting other processes – hence specialized skills and know-how are a must
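
The masking half of principle 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any system mentioned in the talk: the function names, the choice to leave the last four characters visible, and the use of a keyed SHA-256 hash as a stand-in for tokenisation are all hypothetical. Encryption at rest would rely on a vetted cryptographic library and proper key management, which this sketch does not attempt.

```python
import hashlib
import hmac


def mask_account_number(account_number: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with '*'.

    Hypothetical helper: the 4-character default is an assumption.
    """
    if visible <= 0:
        return "*" * len(account_number)
    hidden = max(len(account_number) - visible, 0)
    return "*" * hidden + account_number[hidden:]


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Return a keyed hash of `value`.

    The same input and key always give the same token, so records can
    still be joined on the token without persisting the raw value.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # assumption: real keys come from a key store
    print(mask_account_number("12345678"))
    print(pseudonymise("12345678", key))
```

Keeping a consistent token rather than the raw value is one way to preserve the ability to integrate data sources (principle 1) while avoiding unencrypted confidential data at rest.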


Page 8: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Why Big Data?

A new term, with associated “Data Scientist” positions:
• Big Data: is a mix of structured, semi-structured, and unstructured data:
– Typically breaks barriers for traditional RDB storage
– Typically breaks limits of indexing by “rows”
– Typically requires intensive pre-processing before each query to extract “some structure” – usually using Map-Reduce type operations
• Above leads to “messy” situations with no standard recipes or architecture: hence the need for “data scientists”
– conduct “Data Expeditions”
– Discovery and learning on the spot

Page 9: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


The 4-V’s of “Big Data”

• Big Data is Characterized by the 3-V’s:

– Volume: larger than “normal” – challenging to load/process
• Expensive to do ETL

• Expensive to figure out how to index and retrieve

• Multiple dimensions that are “key”

– Velocity: Rate of arrival poses real-time constraints on what are typically “batch ETL” operations

• If you fall behind catching up is extremely expensive (replicate very expensive systems)

• Must keep up with rate and service queries on-the-fly

– Variety: Mix of data types and varying degrees of structure
• Non-standard schema

• Lots of BLOB’s and CLOB’s

• DB queries don’t know what to do with semi-structured and unstructured data.

Page 10: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Today’s Data: e.g. Yahoo! User DNA
• Male, age 32
• Lives in SF
• Lawyer
• Searched on from London last week
• Searched on: “Italian restaurant Palo Alto”
• Checks Yahoo! Mail daily via PC & Phone
• Has 25 IM Buddies, moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people
• Searched on: “Hillary Clinton”
• Clicked on Sony Plasma TV SS ad
• Spends 10 hours/week on the internet
• Purchased Da Vinci Code from Amazon
Registration | Campaign | Behavior | Unknown

Page 11: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


How Data Explodes: really big
• Male, age 32
• Lives in SF
• Lawyer
• Searched on from London last week
• Searched on: “Italian restaurant Palo Alto”
• Checks Yahoo! Mail daily via PC & Phone
• Has 25 IM Buddies, moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people
• Searched on: “Hillary Clinton”
• Clicked on Sony Plasma TV SS ad
• Spends 10 hours/week on the internet
• Purchased Da Vinci Code from Amazon
• Social Graph (FB)
• Likes & friends’ likes
• Professional network – reputation
• Web searches on this person, hobbies, work, location
• MetaData on everything
• Blogs, publications, news, local papers, job info, accidents

Page 12: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


The Distinction between “Data” and “Big Data” is fast disappearing

• Most real data sets nowadays come with a serious mix of semi-structured and unstructured components:
– Images
– Video
– Text descriptions and news, blogs, etc…
– User and customer commentary
– Reactions on social media: e.g. Twitter is a mix of data anyway

• Using standard transforms, entity extraction, and new generation tools to transform unstructured raw data into semi-structured analyzable data
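
As a rough illustration of turning unstructured text into a semi-structured record, the Python sketch below pulls a few simple entities out of a short social-media style message. The regular expressions, the field names, and the sample message are assumptions made for the example; real entity extraction typically relies on dedicated NLP tooling rather than patterns this simple.

```python
import re

# Hypothetical patterns: deliberately simple, for illustration only.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def extract_entities(text: str) -> dict:
    """Turn a raw message into a small semi-structured record."""
    urls = URL.findall(text)
    # Remove URLs first so '#' or '@' inside a link is not mistaken for a tag.
    stripped = URL.sub(" ", text)
    return {
        "mentions": MENTION.findall(stripped),
        "hashtags": HASHTAG.findall(stripped),
        "urls": urls,
        "word_count": len(stripped.split()),
    }


if __name__ == "__main__":
    message = "Loving the new #BigData talk by @usamaf http://example.com/slides"
    print(extract_entities(message))
```

Once each message is reduced to a record like this, it can be stored, indexed, and analysed with conventional tools.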

Page 13: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Text Data: The Big Driver

• We speak of “big data” and the “Variety” in 3-V’s
• Reality: biggest driver of growth of Big Data has been text data
– Most work on analysis of “images” and “video” data has really been reduced to analysis of surrounding text

Nowhere more so than on the internet

• Map-Reduce popularized by Google to address the problem of processing large amounts of text data:
– Many operations with each being a simple operation but done at large scale
– Indexing a full copy of the web
– Frequent re-indexing

Page 14: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Reality Check

So what do technology people worry about these days?

Page 15: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


To Hadoop or not to Hadoop?

When to use techniques requiring Map-Reduce and grid computing?
• Typically organizations try to use Map-Reduce for everything to do with Big Data
– This is actually very inefficient and often irrational
– Certain operations require specialized storage

• Updating segment memberships over large numbers of users

• Defining new segments on user or usage data

Page 16: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


To Hadoop or not to Hadoop?

When to use techniques requiring Map-Reduce and grid computing?
• Map-Reduce is useful when a very simple operation is to be applied on a large body of unstructured data
– Typically this is during entity and attribute extraction
– Still need Big Data analysis post Hadoop
• Map-Reduce is not efficient or effective for tasks involving deeper statistical modeling
– good for gathering counts and simple (sufficient) statistics

• E.g. how many times a keyword occurs, quick aggregation of simple facts in unstructured data, estimates of variances, density, etc…

– Mostly pre-processing for Data Mining
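
The kind of work described above, counts and simple sufficient statistics, can be sketched in plain Python using the map and reduce pattern. This is a single-process illustration of the idea, not Hadoop code: the function names and the sample inputs are assumptions, and a real job would distribute the map step across machines.

```python
from collections import Counter
from functools import reduce


def map_keywords(doc: str) -> Counter:
    """Map step: keyword counts for a single document."""
    return Counter(doc.lower().split())


def count_keywords(docs) -> Counter:
    """Reduce step: merge per-document counts into global counts."""
    return reduce(lambda a, b: a + b, map(map_keywords, docs), Counter())


def map_stats(values):
    """Map step: sufficient statistics (n, sum, sum of squares) for one partition."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine_stats(a, b):
    """Reduce step: sufficient statistics simply add across partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def pooled_variance(partitions) -> float:
    """Population variance recovered from the combined statistics.

    Assumes at least one non-empty partition.
    """
    n, total, total_sq = reduce(combine_stats, map(map_stats, partitions))
    mean = total / n
    return total_sq / n - mean * mean


if __name__ == "__main__":
    print(count_keywords(["big data big value", "data mining"]))
    print(pooled_variance([[1.0, 2.0], [3.0, 4.0]]))
```

Each partition only ships three numbers to the reducer, which is why estimates such as variances fit this model well. The sum-of-squares formula can lose precision on large or badly scaled data, so it is a sketch of the pattern rather than a recommended numerical method. Iterative model fitting, by contrast, needs many passes over the data and does not reduce to one cheap map and one cheap reduce.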

Page 17: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Hadoop Use Cases by Data Type
• ERP Financial Data: 1%
• Supply Chain Data: 2%
• Sensor Data: 2%
• Financial Trading Data: 4%
• CRM Data: 4%
• Science Data: 7%
• Advertising Data: 10%
• Social Data: 11%
• Text and Language Data: 16%
• IT Log Data: 19%
• Content and Preference Data: 24%

Page 18: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Analysis & Programming Software

PIG

HIPI

Page 19: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Page 20: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Reality: What is Biggest Driver for Hadoop Demand in the Enterprise?

Storage
• Good performance stores at “commodity” prices
– High end: $100,000/Terabyte
– Low end: $1000/Terabyte
• Hadoop typical cost: $3000/Terabyte
• Plus decent compression for unstructured data
• Plus suite of really cool tools…
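
To make the price gap concrete, the short Python sketch below applies the per-terabyte figures from this slide to a given capacity. The 100 TB capacity is a hypothetical input chosen for the example, and the calculation deliberately ignores replication, compression, and operating costs, so it is a back-of-the-envelope comparison only.

```python
# Per-terabyte prices in USD, taken from the slide.
COST_PER_TB_USD = {
    "high_end": 100_000,
    "low_end": 1_000,
    "hadoop": 3_000,
}


def storage_cost(terabytes: float) -> dict:
    """Raw storage cost per tier for the given capacity (no replication or compression)."""
    return {tier: terabytes * price for tier, price in COST_PER_TB_USD.items()}


if __name__ == "__main__":
    capacity_tb = 100  # assumption: illustrative capacity, not from the talk
    for tier, cost in storage_cost(capacity_tb).items():
        print(f"{tier}: ${cost:,.0f}")
```

At these prices, high-end storage costs roughly 33 times as much per terabyte as the typical Hadoop figure, which is consistent with the slide's point that storage economics drive enterprise demand.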

Page 21: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Turning the three Vs of Big Data into Value

Understand context and content
• What are appropriate actions?
• Is it OK to associate my brand with this content?
• Is content sad?, happy?, serious?, informative?

Understand community sentiment
• What is the emotion?
• Is it negative or positive?
• What is the health of my brand online?

Understand customer intent
• What is each individual trying to achieve?
• Can we predict what to do next?
• Critical in cross-sell, personalization, monetization, advertising, etc…

Page 22: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Many Business Uses of Predictive Analytics

Analytic technique | Uses in business
Marketing and sales | Identify potential customers; establish the effectiveness of a campaign
Understanding customer behavior | model churn, affinities, propensities, …
Web analytics & metrics | model user preferences from data, collaborative filtering, targeting, etc.
Fraud detection | Identify fraudulent transactions
Credit scoring | Establish credit worthiness of a customer requesting a loan
Manufacturing process analysis | Identify the causes of manufacturing problems
Portfolio trading | optimize a portfolio of financial instruments by maximizing returns & minimizing risks
Healthcare | Application fraud detection, cost optimization, detection of events like epidemics, etc...
Insurance | fraudulent claim detection, risk assessment
Security and Surveillance | intrusion detection, sensor data analysis, remote sensing, object/person detection, link analysis, etc...

Page 23: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

Case Studies:

1. Context Analysis (unstructured data)

2. Predictive Analysis w/ RapidMiner

Page 24: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Understanding Context

Page 25: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Reality Check

So who is the company we think is best at handling BigData?

Page 26: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Biggest BigData in Advertising?

Understanding Context for Ads

Page 27: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


The Display Ads Challenge Today

What Ad would you place here?

Page 28: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


The Display Ads Challenge Today

Damaging to Brand?

Page 29: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


The Display Ads Challenge Today

What Ad would you place here?

Page 30: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


The Display Ads Challenge Today

Irrelevant and Damaging to Brand

Completely Irrelevant

Page 31: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


NetSeer: Intent for Display

• Currently Processing 4 Billion Impressions per Day

Page 32: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Problem: Hard to Understand User Intent

Contextual Ad served by Google

What NetSeer Sees:

Page 33: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

Case Studies:

1. Context Analysis (unstructured data)

2. Predictive Analysis w/ RapidMiner

Page 34: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

RapidMiner’s Strengths


• Open Source Community & Marketplace – Crowd-sourced innovation, quality assurance, market awareness.

• Fully-integrated Platform – Integrated, process-based business analytics platform with focus on predictive analytics.

• No Programming Required – Easy-to-use, low maintenance costs, standard platform for business analysts.

• Advanced Analytics at Every Scale – In-memory, in-database and in-Hadoop analytics offer best option for every size of database.

• Connectivity – More than 60 connectors (incl. SAP & Hadoop), allowing easy access to structured and unstructured data.

Page 35: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

10,000+ Downloads per Month


SELECT LIST OF RECIPIENT ORGANIZATIONS


• Government & Defense
• Pharma & Healthcare
• Consulting
• Oil & Gas, Chemicals
• Financial Services
• Software & Analytics
• Retail
• Manufacturing
• Business Services
• Consumer Products
• Aerospace
• Technology
• Entertainment
• Academia

Page 36: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

Leader in Advanced Analytics


Source: Gartner Magic Quadrant for Advanced Analytics Platforms (February 2014)

Full report at: http://www.rapidminer.com/gartner2014

Page 37: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Select Customer Stories

PayPal
Who > world leading online payment services provider
Solution > Customer feedback and voice of the customer analysis, churn prediction and prevention, text mining and sentiment analysis

SmartSoft
Who > provider of solutions for preventing fraud, money laundering, and risks in financial institutions
Solution > Integration of Rapid-I’s predictive analytics engine into their solutions for fraud detection and fraud prevention for the financial and telecom sectors

Page 38: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

EXAMPLE:

FRAUD DETECTION

Information about fraud detection in the context of Medicaid and Medicare and how this can be used for financial transactions.

Page 39: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

Fraud in Medicaid: Problem

Problem:
– Cost for Medicaid and Medicare constantly increasing
– Significant proportion of the amount paid is due to fraud / waste
– Waste and fraud rate officially estimated to be about 2-3%, unofficial internal estimates assume 10-20%
– Example: State of Massachusetts $5-6b per year, probably more than $1b waste and fraud every year!

Task for Data Mining and Predictive Analytics:
– Identify patients, doctors, and pharmacies with suspicious behavior
– Identify patterns within the claims that show potential systemic issues
– Identify best alternatives to deploy State Auditor resources to maximize transparency and make state government work more effectively and reduce systemic issues


Page 40: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

Data Sources

External Data Available

Data Targeted

RapidMiner


Page 41: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

Example Results

May/June 2013: State Auditor of Massachusetts published that 1164 Social Security Numbers of dead persons were identified still receiving Medicaid
– Some of these persons had died more than a year earlier
– The risk analytics engine combined the Medicaid claims data with the SSN death records to identify these cases

Similarly, external data is used to identify lottery winners still receiving Medicaid payments

High risk patients, pharmacies, and prescribing doctors are identified automatically
– Example: A pharmacy with only 1117 customers in the last 8 years; 1087 of which have had at least 1 prescription over 500 pills
– Those 1087 got over 17 million pills and charged over 17 million dollars
– This is over 2000 pills per patient per month
– Each patient has filled on average over $16K worth of pills there
– Almost all patients happen to be disabled

6 patients have been prescribed over 1 million drug units, 1 patient 4,742,171 pills. Nearly 4.8 million pills at a cost of almost 6.5 M. That is nearly 1650 pills a day. Is that even possible?
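
The two checks described on this slide, matching claims against death records and sanity-checking pill volumes, can be sketched in Python as follows. This is an illustrative sketch, not the RapidMiner process used in the case study: the record layout, field names, and sample identifiers are invented for the example, and the 8-year window applied to the single-patient figure is an assumption borrowed from the pharmacy example.

```python
from datetime import date


def claims_after_death(claims, death_records):
    """Return claims whose service date falls after the recorded date of death.

    `claims` is a list of dicts with 'ssn' and 'service_date' keys;
    `death_records` maps SSN -> date of death. Both layouts are hypothetical.
    """
    flagged = []
    for claim in claims:
        died = death_records.get(claim["ssn"])
        if died is not None and claim["service_date"] > died:
            flagged.append(claim)
    return flagged


def pills_per_day(total_pills: int, years: float) -> float:
    """Average daily volume over the period, using 365-day years."""
    return total_pills / (years * 365)


if __name__ == "__main__":
    # Synthetic identifiers, not real Social Security Numbers.
    claims = [
        {"ssn": "000-00-0001", "service_date": date(2013, 3, 1)},
        {"ssn": "000-00-0002", "service_date": date(2013, 3, 1)},
        {"ssn": "000-00-0001", "service_date": date(2011, 1, 1)},
    ]
    deaths = {"000-00-0001": date(2012, 1, 15)}
    print(claims_after_death(claims, deaths))
    print(round(pills_per_day(4_742_171, 8)))
```

Under the 8-year assumption, the single patient's 4,742,171 pills work out to a little over 1,600 pills a day, which is in line with the slide's "nearly 1650" figure.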


Page 43: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London

Concluding Remarks

Page 44: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Experience summary at Yahoo!

• Dealing with one of the largest data sources (25 Terabyte per day)
• BT business was grown from $20M to about $500M in 3 years of investment!
• BigData critical to operations
– Ad targeting creates huge value
– Right teams to build technology (3 years of recruiting)
– Search is a BigData problem
• Big demands for grid computing (Hadoop)
– Not all BigData can be handled via Hadoop
– Spun off BigData segmentation data platform: nPario

Page 45: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Lessons Learned

• A lot more data than qualified talent

– Finding talent in BigData is very difficult

– Retaining talent in BigData is even harder

• At Yahoo! we created a central group that drove huge value to the company

• Data people need to feel like they have critical mass

– Makes it easier to attract

– Makes it easier to retain

• Drive data efforts by business need, not by technology priorities

– Chief Data Officer role at Yahoo! – now popular

Page 46: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


BigData Analytics for Organizations

• Key to competitive Intelligence:

– Understand context

– Understand intent

• Key to understanding consumer trends through social media analysis

– Brand issues

– Trend issues

– Anticipating the next shift

Page 47: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Threats & Opportunities

• Data world is changing, especially in on-line businesses
• Major shifts from relational DB to NoSQL, document-oriented stores
• Connecting new world to “old” world?
– Convenience of execution – integration with data platforms

– Appropriateness of algorithms to BigData

– Unstructured data algorithms:

• Text, Semi-structured and Unstructured data

• Entity extraction a must

• Appropriate theory and probability distributions (power laws, fat tails)

• Sparse Data

– Model management and proper aging of models

– Getting to basics so we can decide what models to use:

• Understanding noise and distributions

• Data tours

Page 48: Barclays Bank Presenting at the Chief Data Officer Forum Europe - London


Usama Fayyad - [email protected][email protected]

Twitter – @Usamaf

+1-206-529-5123

www.Oasis500.com

www.open-insights.com

Thank You! & Questions