User Behaviour Tracking: Track - Store - Process // Florian Pfeiffer - Head of Data & Infrastructure - gutefrage.net
Datageeks

Jan 15, 2015

Technology

Fl Pfeiffer

Slides for Munich Datageek Meetup
Transcript
Page 1: Datageeks

User Behaviour Tracking: Track - Store - Process

// Florian Pfeiffer - Head of Data & Infrastructure - gutefrage.net

Page 2: Datageeks
Page 3: Datageeks

Vision: “Let’s build our own Google Analytics”

Page 4: Datageeks

Why

Google Analytics does sampling

we want the (raw) data

Page 5: Datageeks

Ideas, Thoughts & Goals

fast / minimal impact on page loading time

high availability

track user over multiple platforms

storage engine? -> HBase

Page 6: Datageeks

Infrastructure

Page 7: Datageeks

Numbers!

10-20ms Response Time per pixel

record so far: ~2500 concurrent requests

1.5 billion entries in HBase

10 Nodes in Hadoop Cluster

Page 8: Datageeks

Serving Infrastructure

Loadbalancers & RR DNS

nginx with empty_gif module (~2ms)

data is written to logfile

Page 9: Datageeks

Storing Infrastructure

every nginx node has flume-ng

flume ingests logfile

AsyncHBaseSink with custom Serializer

direct writes to HBase

Page 10: Datageeks

why flume?

we had it already in production ;)

Storm might be an interesting alternative

Page 11: Datageeks

HBase rowkey design

Page 12: Datageeks

Why?

You can scan through all data and use filters for selecting specific data

But scanning with start & stop row speeds things up (a lot)
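The difference between a filtered full scan and a bounded scan can be illustrated with a toy model. This is only a sketch: a sorted Python list stands in for an HBase table, and the row keys and the `|` separator are made up for the example.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: row keys kept in lexicographic order.
rows = sorted([
    b"client-a|001", b"client-a|002", b"client-b|001",
    b"client-b|002", b"client-b|003", b"client-c|001",
])

def full_scan_with_filter(rows, prefix):
    """Touch every row and keep the matching ones (filter-only scan)."""
    touched = 0
    hits = []
    for key in rows:
        touched += 1
        if key.startswith(prefix):
            hits.append(key)
    return hits, touched

def range_scan(rows, start_row, stop_row):
    """Jump to start_row and stop before stop_row (exclusive bound)."""
    lo = bisect_left(rows, start_row)
    hi = bisect_left(rows, stop_row)
    return rows[lo:hi], hi - lo

filtered, touched_full = full_scan_with_filter(rows, b"client-b|")
# "}" is the byte right after "|", so it bounds every key with this prefix.
ranged, touched_range = range_scan(rows, b"client-b|", b"client-b}")
```

Both calls return the same rows, but the bounded scan only touches the rows inside the key range, which is where the speed-up comes from on a large table.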

Page 13: Datageeks

HBase rowkey design

Do I need a fast user or a fast timespan lookup?

User - clientid,ts<,connectionId>

Timespan - ts,clientid<,connectionId>
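A minimal sketch of the two key layouts, assuming a comma separator as on the slide and a zero-padded millisecond timestamp so that byte order matches numeric order. The function names and sample values are hypothetical.

```python
def _ts(ts_millis):
    # Fixed width, so lexicographic order equals chronological order.
    return f"{ts_millis:013d}"

def user_first_key(client_id, ts_millis, connection_id=None):
    """Layout for fast per-user lookups: clientid,ts<,connectionId>."""
    parts = [client_id, _ts(ts_millis)]
    if connection_id is not None:
        parts.append(connection_id)
    return ",".join(parts).encode("utf-8")

def time_first_key(client_id, ts_millis, connection_id=None):
    """Layout for fast timespan lookups: ts,clientid<,connectionId>."""
    parts = [_ts(ts_millis), client_id]
    if connection_id is not None:
        parts.append(connection_id)
    return ",".join(parts).encode("utf-8")

events = [("u2", 1000), ("u1", 3000), ("u1", 2000)]
by_user = sorted(user_first_key(c, t) for c, t in events)
by_time = sorted(time_first_key(c, t) for c, t in events)
```

With the user-first layout all rows of one client sit next to each other; with the time-first layout all rows of one time window do. Whichever field leads the key decides which lookup can use a start and stop row.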

Page 14: Datageeks

Inverse Timestamps

Data in HBase is stored lexicographically sorted

Normal TS - scan would yield oldest results first

Inverse TS - newer entries come first (and you can cancel the scan if you have enough data)
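One common way to build an inverse timestamp is to subtract it from the largest signed 64-bit value. The sketch below assumes that approach and a user-first key layout; the slides do not say which exact encoding was used.

```python
from itertools import islice

MAX_LONG = 2**63 - 1  # same value as Java's Long.MAX_VALUE

def inverse_ts(ts_millis):
    """Larger (newer) timestamps map to smaller values."""
    return MAX_LONG - ts_millis

def row_key(client_id, ts_millis):
    # Pad to 19 digits (the width of MAX_LONG) to keep byte order numeric.
    return f"{client_id},{inverse_ts(ts_millis):019d}".encode("utf-8")

def original_ts(key):
    """Recover the original timestamp from a row key."""
    return MAX_LONG - int(key.split(b",")[1])

# Sorting mimics the order in which a scan would return the rows.
table = sorted(row_key("u1", ts) for ts in [1000, 3000, 2000])
scan_order = [original_ts(k) for k in table]

# "Cancel the scan once you have enough": take only the 2 newest rows.
newest_two = [original_ts(k) for k in islice(table, 2)]
```

Because the newest entries sort first, a "latest N events for this user" query can stop after N rows instead of reading the user's whole history.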

Page 15: Datageeks

Cross Domain Tracking

(Flash)Cookies

Fingerprinting

Etag

HTML5 Storage

Page 16: Datageeks

The olden times… or

Cookies

Easy to drop a 3rd party cookie with userId on different websites

Increasingly blocked (Safari, Firefox, ...)

Page 17: Datageeks

Fingerprinting

Yields interesting results on desktop, difficult on e.g. iPhone

invisible to user

Last resort if everything else fails?
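As a rough illustration of the idea, a fingerprint can be a hash over whatever attributes the browser exposes. The attribute names and values below are hypothetical, and real fingerprinting uses many more signals. The sketch also hints at why phones are presumably harder: devices of the same model tend to report identical attributes.

```python
import hashlib

def fingerprint(attributes):
    """Hash a dict of browser attributes into a stable identifier."""
    # Sort keys so the same attributes always yield the same hash.
    canonical = "\n".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical desktop browsers: fonts and plugins differ between users.
desktop_1 = {"user_agent": "ExampleBrowser/1.0", "screen": "1920x1080",
             "fonts": "Arial,Helvetica,Ubuntu", "plugins": "pdf,flash"}
desktop_2 = {"user_agent": "ExampleBrowser/1.0", "screen": "1920x1080",
             "fonts": "Arial,Helvetica", "plugins": "pdf"}

# Hypothetical phones of the same model: every attribute is identical.
phone_1 = {"user_agent": "ExamplePhone/8.0", "screen": "640x1136",
           "fonts": "", "plugins": ""}
phone_2 = dict(phone_1)

desktops_distinct = fingerprint(desktop_1) != fingerprint(desktop_2)
phones_collide = fingerprint(phone_1) == fingerprint(phone_2)
```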

Page 18: Datageeks

Etag

Quite new, based on browser cache

sounds interesting
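The mechanism can be sketched as follows: the server hands out a unique ETag with the tracking pixel, and the browser sends it back in `If-None-Match` when it revalidates its cached copy. The handler below is a simplified, framework-free sketch; the function name is hypothetical and header lookup is case-sensitive here for brevity.

```python
import uuid

def handle_pixel_request(request_headers):
    """Return (status, response_headers, visitor_id) for a pixel request."""
    etag = request_headers.get("If-None-Match")
    if etag:
        # Returning visitor: the cached ETag doubles as the visitor ID.
        visitor_id = etag.strip('"')
        return 304, {"ETag": f'"{visitor_id}"'}, visitor_id
    # First visit: mint a new ID and ship it as the ETag.
    visitor_id = uuid.uuid4().hex
    return 200, {"ETag": f'"{visitor_id}"'}, visitor_id

# First request: nothing cached yet.
status_1, headers_1, id_1 = handle_pixel_request({})
# Second request: the browser revalidates with the stored ETag.
status_2, headers_2, id_2 = handle_pixel_request(
    {"If-None-Match": headers_1["ETag"]}
)
```

The ID survives as long as the cached pixel does, so clearing the browser cache resets it.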

Page 19: Datageeks

HTML5 Storage

Store data in local HTML5 storage

Retrieve data with Cross Domain Messaging

Page 20: Datageeks

Store data

e.g. UserId, SessionId, GeoIP, URL, action, data

Page 21: Datageeks

Batch Processing

Calculate how many users are active on platform A and also on B

Get Traffic of all Questions belonging to Channel X sorted by Country
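Both questions reduce to simple aggregations over the stored events. The sketch below runs them in plain Python on a handful of made-up records; the field names are hypothetical, and on the real cluster this would presumably be a job over the HBase data rather than an in-memory loop.

```python
from collections import Counter

# Made-up tracking events; field names are assumptions for this sketch.
events = [
    {"user_id": "u1", "platform": "A", "channel": "X", "country": "DE"},
    {"user_id": "u1", "platform": "B", "channel": "Y", "country": "DE"},
    {"user_id": "u2", "platform": "A", "channel": "X", "country": "AT"},
    {"user_id": "u3", "platform": "B", "channel": "X", "country": "DE"},
    {"user_id": "u2", "platform": "A", "channel": "X", "country": "AT"},
]

def users_on_both(events, first, second):
    """Users seen on platform `first` and also on platform `second`."""
    on_first = {e["user_id"] for e in events if e["platform"] == first}
    on_second = {e["user_id"] for e in events if e["platform"] == second}
    return on_first & on_second

def traffic_by_country(events, channel):
    """Hit counts for one channel, grouped and ordered by country code."""
    counts = Counter(e["country"] for e in events if e["channel"] == channel)
    return sorted(counts.items())

overlap = users_on_both(events, "A", "B")
channel_x = traffic_by_country(events, "X")
```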

Page 22: Datageeks

Now to something completely different…

Page 23: Datageeks

demo

Page 24: Datageeks

Recommendations

with Myrrix

Page 25: Datageeks

Myrrix

Evolution: Taste -> Mahout -> Myrrix (-> Oryx)

Recommender based on ALS (alternating least squares)

Page 26: Datageeks

Recommendations @ GF.net

Users emit signals on questions

view, like, give an answer, answer is voted best

Application sends signals through RabbitMQ to recommendation servers
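Before signals reach a recommender they have to be turned into numeric preference strengths. The sketch below shows one way to do that, collapsing signals into (user, question, strength) triples. The weights are purely hypothetical and not taken from the talk, and the RabbitMQ transport is left out.

```python
from collections import defaultdict

# Hypothetical weights: stronger engagement counts for more.
SIGNAL_WEIGHTS = {
    "view": 1.0,
    "like": 2.0,
    "answer": 3.0,
    "best_answer": 5.0,
}

def aggregate(signals):
    """Sum signal weights per (user, question) pair."""
    strengths = defaultdict(float)
    for user_id, question_id, signal in signals:
        strengths[(user_id, question_id)] += SIGNAL_WEIGHTS[signal]
    return dict(strengths)

signals = [
    ("u1", "q1", "view"),
    ("u1", "q1", "like"),
    ("u2", "q1", "answer"),
    ("u2", "q2", "best_answer"),
]
strengths = aggregate(signals)
```

Triples of this shape are the kind of input an ALS-based recommender typically consumes.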

Page 27: Datageeks

But what happens when a new user signs up?

YEAH

Page 28: Datageeks

?

Page 29: Datageeks

Fetch data from tracking

and feed it into Myrrix

Page 30: Datageeks

Collecting & storing data works great

using & processing is another thing ;)

Page 31: Datageeks