User Behaviour Tracking: Track - Store - Process // Florian Pfeiffer - Head of Data & Infrastructure - gutefrage.net
Jan 15, 2015
Vision: "Let's build our own Google Analytics"
Why
Google Analytics samples the data
we want the (raw) data
Ideas, Thoughts & Goals
fast / minimal impact on page loading time
high availability
track user over multiple platforms
storage engine? -> HBase
Infrastructure
Numbers!
10-20ms Response Time per pixel
record so far: ~2500 concurrent requests
1.5 billion entries in HBase
10 Nodes in Hadoop Cluster
Serving Infrastructure
Load balancers & round-robin DNS
nginx with empty_gif module (~2ms)
data is written to logfile
Storing Infrastructure
every nginx node has flume-ng
flume ingests logfile
AsyncHBaseSink with custom Serializer
direct writes to HBase
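A minimal sketch of the parsing step a custom serializer has to perform before writing to HBase, shown in Python for brevity (a real Flume serializer would be a Java class). The log format is an assumption: nginx's `$msec` followed by the quoted `$request`. The pixel path and the parameter names `uid`, `sid`, `action` and `url` are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical log line: "$msec" followed by the quoted "$request".
SAMPLE_LINE = (
    '1421312400.123 '
    '"GET /pixel.gif?uid=u42&sid=s7&action=view&url=%2Ffrage%2F1 HTTP/1.1"'
)

def parse_pixel_line(line):
    """Split one log line into a millisecond timestamp and the tracked fields."""
    ts_text, request = line.split(" ", 1)
    _method, target, _protocol = request.strip('"').split(" ")
    query = parse_qs(urlsplit(target).query)
    # parse_qs returns a list per parameter; keep the first value only.
    fields = {key: values[0] for key, values in query.items()}
    # "$msec" looks like "seconds.milliseconds"; avoid float rounding.
    seconds, millis = ts_text.split(".")
    fields["ts_ms"] = int(seconds) * 1000 + int(millis)
    return fields

event = parse_pixel_line(SAMPLE_LINE)
```

Each key in `event` would then map to one column of the HBase row.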
why flume?
we had it already in production ;)
Storm might be an interesting alternative
HBase rowkey design
Why?
You can scan through all data and use filters for selecting specific data
But scanning with start & stop row speeds things up (a lot)
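The difference can be illustrated without an HBase cluster. The sketch below models a table as a sorted list of row keys: a filter has to look at every row, while a start/stop row lookup jumps straight to the matching range. The keys and the `,` separator are made up for the example.

```python
from bisect import bisect_left

# Row keys kept in lexicographic order, as HBase stores them.
rows = sorted(["u1,100", "u1,200", "u2,100", "u2,150", "u2,300", "u3,050"])

def full_scan_with_filter(rows, prefix):
    """Touches every row and keeps the ones that match."""
    return [key for key in rows if key.startswith(prefix)]

def range_scan(rows, start_row, stop_row):
    """Binary-searches the boundaries; stop_row is exclusive, as in HBase."""
    lo = bisect_left(rows, start_row)
    hi = bisect_left(rows, stop_row)
    return rows[lo:hi]

filtered = full_scan_with_filter(rows, "u2,")
# "-" is the character right after "," in ASCII, so "u2-" closes the prefix range.
ranged = range_scan(rows, "u2,", "u2-")
```

Both calls return the same rows; only the amount of data touched differs.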
HBase rowkey design
Do I need a fast user or a fast timespan lookup?
User - clientid,ts<,connectionId>
Timespan - ts,clientid<,connectionId>
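A sketch of the two key layouts. The zero-padded 13-digit millisecond timestamp and the `,` separator are assumptions, chosen so that string order matches numeric order. The function names are hypothetical.

```python
def user_first_key(client_id, ts_ms, connection_id=None):
    """clientid,ts<,connectionId>: all rows of one user are adjacent."""
    parts = [client_id, f"{ts_ms:013d}"]
    if connection_id is not None:
        parts.append(connection_id)
    return ",".join(parts)

def time_first_key(client_id, ts_ms, connection_id=None):
    """ts,clientid<,connectionId>: all rows of one timespan are adjacent."""
    parts = [f"{ts_ms:013d}", client_id]
    if connection_id is not None:
        parts.append(connection_id)
    return ",".join(parts)

events = [("u2", 1421312400000), ("u1", 1421312460000), ("u2", 1421312520000)]
by_user = sorted(user_first_key(c, t) for c, t in events)
by_time = sorted(time_first_key(c, t) for c, t in events)
```

Sorting shows the trade-off: `by_user` groups both `u2` rows together, `by_time` keeps the rows in chronological order.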
Inverse Timestamps
Data in HBase is stored lexicographically sorted
Normal TS - scan would yield oldest results first
Inverse TS - newer entries come first (and you can cancel the scan if you have enough data)
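A sketch of the inversion. `MAX_TS_MS` is an assumption that fits 13-digit millisecond timestamps; with binary keys, subtracting from the largest 64-bit value is a common alternative.

```python
MAX_TS_MS = 10**13 - 1  # assumption: timestamps fit into 13 decimal digits

def inverse_ts(ts_ms):
    """Larger timestamps map to smaller, zero-padded strings."""
    return f"{MAX_TS_MS - ts_ms:013d}"

def row_key(client_id, ts_ms):
    return f"{client_id},{inverse_ts(ts_ms)}"

events = [1421312400000, 1421312460000, 1421312520000]
keys = sorted(row_key("u42", ts) for ts in events)

# Decode the keys again to see the order in which a scan would return them.
scan_order = [MAX_TS_MS - int(key.split(",")[1]) for key in keys]
```

`scan_order` starts with the newest event, so a scan can stop after the first few rows.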
Cross Domain Tracking
(Flash) Cookies
Fingerprinting
ETag
HTML5 Storage
The olden times… or
Cookies
Easy to drop a 3rd party cookie with userId on different websites
Increasingly blocked by browsers (Safari, Firefox, ...)
Fingerprinting
Yields interesting results on desktop, difficult on e.g. iPhone
invisible to user
Last resort if everything else fails?
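A rough sketch of the idea: hash a set of browser attributes into a stable identifier. The attribute names and values are hypothetical; which attributes are actually collected is not stated in the slides.

```python
import hashlib

def fingerprint(attributes):
    """Hash browser attributes in a fixed order so the ID is reproducible."""
    canonical = "\n".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

desktop = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/35.0",
    "screen": "1920x1080",
    "timezone": "Europe/Berlin",
    "fonts": "Arial,DejaVu Sans,Ubuntu",
}
# Same attributes, different insertion order: the ID must not change.
desktop_reordered = dict(reversed(list(desktop.items())))
other_desktop = dict(desktop, screen="1366x768")

same_id = fingerprint(desktop) == fingerprint(desktop_reordered)
different_id = fingerprint(desktop) != fingerprint(other_desktop)
```

One plausible reason for the poor results on phones: devices of the same model expose nearly identical attributes, so they collapse into the same hash.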
ETag
Quite new, based on the browser cache
sounds interesting
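A sketch of how the browser cache can carry an ID: the first response sets an `ETag`, and the browser sends it back in `If-None-Match` when it revalidates the cached pixel. The handler name and return shape are hypothetical and not tied to any web framework.

```python
import uuid

def handle_pixel_request(request_headers):
    """Return (status, response_headers, visitor_id) for one pixel request."""
    etag = request_headers.get("If-None-Match")
    if etag:
        # Returning visitor: the cached ETag is the ID.
        visitor_id = etag.strip('"')
        return 304, {"ETag": etag}, visitor_id
    # New visitor: mint an ID and hand it out as the ETag.
    visitor_id = uuid.uuid4().hex
    headers = {
        "ETag": f'"{visitor_id}"',
        # "no-cache" lets the browser store the pixel but forces revalidation.
        "Cache-Control": "private, no-cache",
    }
    return 200, headers, visitor_id

first_status, first_headers, first_id = handle_pixel_request({})
second_status, _, second_id = handle_pixel_request(
    {"If-None-Match": first_headers["ETag"]}
)
```

The second request is answered with `304 Not Modified` and resolves to the same visitor ID without any cookie.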
HTML5 Storage
Store data in local HTML5 storage
Retrieve data with Cross Domain Messaging
Store data
e.g. UserId, SessionId, GeoIP, URL, action, data
Batch Processing
Calculate how many users are active on platform A and also on B
Get Traffic of all Questions belonging to Channel X sorted by Country
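Both queries can be sketched on a handful of in-memory events; on the cluster they would run as batch jobs over the HBase data. The event tuples, platform names and channel names are made-up sample data.

```python
from collections import Counter

# (user_id, platform, country, channel): hypothetical sample events.
events = [
    ("u1", "web", "DE", "X"),
    ("u1", "app", "DE", "X"),
    ("u2", "web", "AT", "X"),
    ("u3", "app", "DE", "Y"),
    ("u4", "web", "DE", "X"),
    ("u4", "app", "CH", "Y"),
]

def users_on_both(events, platform_a, platform_b):
    """Users seen on platform A and also on platform B."""
    on_a = {user for user, platform, _, _ in events if platform == platform_a}
    on_b = {user for user, platform, _, _ in events if platform == platform_b}
    return on_a & on_b

def traffic_by_country(events, channel):
    """Event counts for one channel, largest country first."""
    counts = Counter(country for _, _, country, ch in events if ch == channel)
    return counts.most_common()

both = users_on_both(events, "web", "app")
channel_x = traffic_by_country(events, "X")
```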
Now to something completely different…
Recommendations with Myrrix
demo
Myrrix
Evolution: taste -> mahout -> myrrix (-> oryx)
Recommender based on ALS
Recommendations @ GF.net
Users emit signals on questions
view, like, gives answer, answer is voted best
Application sends signals through RabbitMQ to recommendation servers
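A sketch of how such signals could be collapsed into (user, question) strengths, the kind of input an ALS-based recommender typically consumes. The weights are hypothetical; the slides do not say how the signals are weighted.

```python
from collections import defaultdict

# Hypothetical weights: stronger engagement counts more.
SIGNAL_WEIGHTS = {"view": 1.0, "like": 2.0, "answer": 3.0, "best_answer": 5.0}

def aggregate(signals):
    """Sum the weights of all signals per (user, question) pair."""
    strengths = defaultdict(float)
    for user_id, question_id, signal in signals:
        strengths[(user_id, question_id)] += SIGNAL_WEIGHTS[signal]
    return dict(strengths)

signals = [
    ("u1", "q1", "view"),
    ("u1", "q1", "like"),
    ("u2", "q1", "view"),
    ("u2", "q2", "best_answer"),
]
strengths = aggregate(signals)
```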
But what happens when a new user signs up?
YEAH
?
Fetch data from tracking and feed it into Myrrix
Collecting & storing data works great; using & processing it is another thing ;)