Post on 03-Feb-2018


Power product innovation with Big Data technologies

Introducing:

Zhixuan Wang Experian

Hua Li Experian

©Experian 3

In the 21st century, data is the new oil, Big Data analytics is the new engine, and Big Data tools are the new machinery.

4/21/2017 Experian Public Vision 2017

Big Data and open source landscape

Apache Hadoop Stack

Tip: Use Hadoop Streaming to write the mapper and reducer in your favorite programming language

Credit Card Attrition Trigger: Transaction Data Insight System (TDIS)

1. Historical spend enables a probability expectation (profile) to be computed

2. As time passes, new transactions adjust the probability expectation

3. Notify when a transaction does not occur within the probability expectation threshold

Hadoop Streaming with secondary sort

Calculate triggers in reducer

• Build up the profile from transactions grouped by account and ordered by date-time

• Reuse old Python code

Results

• 10M accounts with 1.2B transactions over 24 months

• No profile data to be stored: ~50GB / snapshot

• Finishes in 1 hour 17 minutes on 6 machines with 8 cores each

• Trigger delivery moved from weekly to daily

Sort

• Primary key = account number

• Secondary key = date-time
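A minimal sketch of what such a streaming reducer could look like in Python. It assumes tab-separated input lines of the form account, date-time, amount, already sorted by account and then by date-time (the job of the secondary sort, for example via Hadoop's KeyFieldBasedPartitioner). The trigger rule, the snapshot date, and the names SNAPSHOT, GAP_FACTOR and triggers are hypothetical placeholders, not the rule used in TDIS.

```python
from datetime import datetime
from itertools import groupby
from statistics import median

SNAPSHOT = datetime(2017, 1, 1)  # hypothetical "as of" date for the run
GAP_FACTOR = 3.0                 # hypothetical threshold multiplier


def parse(lines):
    """Yield (account, timestamp, amount) from tab-separated lines."""
    for line in lines:
        account, ts, amount = line.rstrip("\n").split("\t")
        yield account, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), float(amount)


def triggers(lines, snapshot=SNAPSHOT, factor=GAP_FACTOR):
    """Yield (account, days_silent, typical_gap_days) for quiet accounts.

    Relies on the input being sorted by account, then date-time, so each
    account's transactions arrive together and in time order.
    """
    for account, rows in groupby(parse(lines), key=lambda r: r[0]):
        times = [ts for _, ts, _ in rows]
        if len(times) < 3:
            continue  # too little history to form a profile
        # Profile: the typical number of days between consecutive transactions.
        gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
        typical = median(gaps)
        silent = (snapshot - times[-1]).total_seconds() / 86400
        # Placeholder rule: flag accounts silent far longer than usual.
        if typical > 0 and silent > factor * typical:
            yield account, round(silent, 1), round(typical, 1)


def main(stream):
    """Reducer entry point: pass sys.stdin when run under Hadoop Streaming."""
    for account, silent, typical in triggers(stream):
        print(f"{account}\t{silent}\t{typical}")
```

Because the profile is rebuilt from the ordered transactions on every run, nothing has to be persisted between runs, which matches the "no profile data to be stored" result above.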

Apache Spark

Interactively explore credit card transaction data

Credit card transaction data: 24 months

• 25GB bzipped

• 1.2B transactions

– 18 fields / transaction

• 8 machines

– 32 cores / machine

– 256GB memory / machine

Split, convert, and load data

Fire up spark-shell

Cache data

Check cached data and executors

Cache it!

Explore data (fast!)

Take a peek

Five-number summary on TRAN_AMT
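The slides compute this summary in Spark; the plain-Python sketch below only shows which five quantities are meant (minimum, first quartile, median, third quartile, maximum). The function name five_number_summary and the choice of the "inclusive" quartile method are assumptions for illustration.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a non-empty sequence of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles; "inclusive" treats the data
    # as the full population rather than a sample.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]
```

On a cached data set of 1.2B rows an exact sort is expensive, so an approximate quantile method is usually the practical choice at that scale.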

Save results

Top merchant ZIP Codes™
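As a rough illustration of this step outside Spark, the sketch below counts transactions per merchant ZIP code and writes the top entries as CSV. The field name MERCH_ZIP and both function names are hypothetical; the slides do not show the actual schema.

```python
import csv
from collections import Counter


def top_zip_codes(transactions, n=10, zip_field="MERCH_ZIP"):
    """Return the n most frequent ZIP codes as (zip_code, count) pairs."""
    counts = Counter(t[zip_field] for t in transactions)
    return counts.most_common(n)


def save_results(rows, out_file):
    """Write (zip_code, count) pairs to an open text file as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["zip_code", "transactions"])
    writer.writerows(rows)
```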

Recap and tips

Start Spark shell

• Set a proper number of executors and memory per executor

Convert, load, cache data

• Spark 1.6 or later: memory efficient

• Partition data to fit the executor’s memory limit

Explore
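One way to arrive at "proper" numbers is simple arithmetic on the cluster shape. The sketch below applies a common rule of thumb, which is an assumption and not something stated in the slides: about 5 cores per executor, one core and a few GB per machine reserved for the OS and daemons, roughly 10% of executor memory left for overhead, and one executor slot left for the driver. The function names and the 128MB target partition size are likewise illustrative.

```python
import math


def executor_plan(nodes, cores_per_node, mem_gb_per_node,
                  cores_per_executor=5, reserved_cores=1,
                  reserved_mem_gb=8, overhead_fraction=0.10):
    """Suggest Spark executor settings from the cluster shape (rule of thumb)."""
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    # Leave part of each executor's share for off-heap overhead.
    heap_gb = int(mem_per_executor * (1 - overhead_fraction))
    return {
        # One slot is left free for the driver / application master.
        "spark.executor.instances": nodes * executors_per_node - 1,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
    }


def partition_count(data_gb, target_partition_mb=128):
    """Number of partitions so that each holds about target_partition_mb."""
    return max(1, math.ceil(data_gb * 1024 / target_partition_mb))
```

For the cluster described earlier (8 machines, 32 cores and 256GB each), executor_plan(8, 32, 256) suggests 47 executors with 5 cores and a 37g heap each. Treat that as a starting point to tune, not a recommendation from the authors.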

Graph database

Who are my family members?

Challenge: Finding the missing link

Potential applications:

• Healthcare: Elderly patients close to their children

• Wealth services: Identify the heirs of elderly customers

• Retail: Condolence / celebration / holiday gifts and services

• Anti-money laundering: Domestic politically exposed persons

• Fraud prevention: Synthetic ID fraud

What is a graph database?

Graph

• A collection of vertices (nodes) and edges (relationships) that connect them

Graph database

• Index-free adjacency: connected nodes physically “point” to each other in the database

Design the graph for family search

• Extremely flexible data format

• Most of the time, family members are not directly connected

• Nodes that are useful family indicators:

– Address

– Phone number

– Email address

– Last name

• Other uses:

– Meetup / eHarmony (based on hobbies, tastes, etc.)

– Facebook / LinkedIn (based on co-workers, classmates, etc.)
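The design above can be sketched with a small in-memory graph: people are not linked to each other directly, only to shared indicator nodes such as an address or a last name, and candidates are found two hops away. This is an illustrative stand-in for a graph database query, assuming hypothetical (person, indicator type, value) input records and a simple "shares at least two indicator types" rule.

```python
from collections import defaultdict


def build_graph(records):
    """Build an undirected graph from (person, indicator_type, value) records.

    Nodes are (type, value) tuples, so a person named "Lee" and the
    last name "Lee" remain distinct nodes.
    """
    graph = defaultdict(set)
    for person, kind, value in records:
        person_node, indicator_node = ("person", person), (kind, value)
        graph[person_node].add(indicator_node)
        graph[indicator_node].add(person_node)
    return graph


def family_candidates(graph, person, min_shared=2):
    """Return {other_person: shared indicator types} reachable in two hops."""
    start = ("person", person)
    shared = defaultdict(set)
    for indicator in graph[start]:        # hop 1: person -> indicator
        for other in graph[indicator]:    # hop 2: indicator -> person
            if other != start:
                shared[other[1]].add(indicator[0])
    return {name: kinds for name, kinds in shared.items()
            if len(kinds) >= min_shared}
```

Each neighbor lookup here is a direct set access rather than a join, which is the property the deck attributes to index-free adjacency.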

Comparison

SQL query (RDBMS) vs. Cypher query (graph database)

Geolocation database with PostgreSQL

Geolocation data

Exponential growth of mobile location data with the rise of smartphones

Wide range of applications:

• Home / work location detection

• Favorite shops

• Mobile marketing service

• Passenger analysis

Key question:

• Where has the consumer been?

Supporting components:

• Where is the point of interest (POI) data?

• Which POIs are around the consumer?

[Figure labels: where you live, where you work, where you shop, where you travel, where you spend free time, events you attend, how you get there]

OpenStreetMap: Best free source for points of interest

OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world

• Not as accurate as Google, but getting closer, especially in major cities

• Points, lines, polygons

• Rich tags:

– Addr: House number, street, city, etc.

– Shop: Alcohol, beverage, computer

– Admin_level: 2 (country), 4 (state), 6 (city)

– Highway: Residential, primary, cycleway, track, etc.

– Amenity: Library, school, parking area, bar

– Cuisine: Coffee, pizza, Chinese, sushi

• Can be easily imported into PostgreSQL with the PostGIS extension

What POIs are around me?

Geohash

[Figure: geohash grid. Cell 9 is divided into 32 two-character cells (90 through 9z), and cell 9q is divided again into 32 three-character cells (9q0 through 9qz).]

• Hierarchical group coding of (latitude, longitude) coordinates

• Arbitrary accuracy

• Fast encoding
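The encoding itself is short. The sketch below follows the standard public geohash scheme: alternately halve the longitude and latitude ranges, emit one bit per halving, and map every 5 bits to a base-32 character. The function name geohash_encode and the default precision are choices made here for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a (latitude, longitude) pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)
```

Because each extra character only refines the previous cell, a shorter geohash is always a prefix of a longer one for the same point. That prefix property is what makes it usable as a coarse filter for candidate lookups.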

Nearby points: Easy case of vicinity search

Which store am I visiting?

• Identify the search radius

• Find POI candidates within the candidate geohash cells

• Filter by actual distance calculation
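The two-stage search can be sketched as a cheap geohash prefix filter followed by an exact distance check. The sketch assumes each POI is a dict with hypothetical keys name, geohash, lat and lon, and that the candidate cells (typically the query point's cell plus its neighbors) have been worked out beforehand. The haversine formula is used for the distance step, which is an assumption since the slides do not name a formula.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(lat, lon, radius_m, pois, candidate_cells):
    """Return (name, distance_m) for POIs within radius_m, nearest first."""
    cells = tuple(candidate_cells)
    hits = []
    for poi in pois:
        # Stage 1: cheap string test against the candidate geohash cells.
        if not poi["geohash"].startswith(cells):
            continue
        # Stage 2: exact distance, computed only for the survivors.
        d = haversine_m(lat, lon, poi["lat"], poi["lon"])
        if d <= radius_m:
            hits.append((poi["name"], d))
    return sorted(hits, key=lambda h: h[1])
```

In a database, stage 1 would normally be an indexed lookup on the geohash column, so the distance calculation only touches a handful of rows.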

Nearby polygons: Advanced case of vicinity search

Which park am I visiting?

Challenge #1: The geohash of a polygon is the geohash of its center, but the boundary could be very far away from its center

Solution:

Categorize polygons by size first, then customize search radius by the search

Multi-level search: Find polygons of all sizes

Am I in the park?

Challenge #2: Given a point, how do we determine whether it is inside the polygon?

Solution:

ST_Within (PostGIS built-in function): using the ray-casting algorithm

• Draw a ray from the point in an arbitrary direction

• Count the number of intersections with the polygon boundary

• Odd: inside. Even: outside
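The odd/even rule fits in a few lines. The sketch below is a generic textbook version, not PostGIS source code: it fixes the ray direction to the positive x axis (any direction gives the same answer) and takes the polygon as a list of (x, y) vertices in order. Points lying exactly on the boundary are not treated specially.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside the polygon (ray-casting test).

    polygon is a list of (x, y) vertices in order; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside  # one more crossing to the right
    return inside
```

With longitude as x and latitude as y, this works for park-sized polygons where the curvature of the Earth can be ignored.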

Key takeaways

• Use OpenStreetMap + PostgreSQL (PostGIS) to handle your geolocation data

• Filter the candidates first before you calculate distance

Summary

Tips on some of the latest techniques, based on our experience

• Spark:

– Set a proper number of executors and memory per executor

– Partition data to fit the executor’s memory limit

• Graph database:

– Much more efficient when you would have to do multiple joins in a traditional RDBMS

– Much more flexible

• Geolocation data:

– OpenStreetMap + PostgreSQL

– Filter candidates before a proximity search

http://www.experian.com/big-data/datalabs.html

Questions and answers

Experian contact:

Hua Li Hua.Li@experian.com

Zhixuan Wang Zhixuan.Wang@experian.com

Share your thoughts about Vision 2017!

Please take the time now to give us your feedback about this session.

You can complete the survey at the kiosk outside.

How would you rate both the Speaker and Content?
