Using Lucene/Solr to build Advertising Systems

Hide (Hatayama Hideharu)

Big Data Department, Targeting Section, Advertising Group

Rakuten, Inc. May 2nd 2013


Intro

Agenda | www.lucenerevolution.org

http://www.lucenerevolution.org/2013/agenda





35 min...orz

my talk is NOT about... m(_ _)m

NRT

SolrCloud

complicated queries

or other Solr hot topics

my talk is just about

Overview of Solr, most common features

Our empirical knowledge about Solr





Agenda

1 Introduction of Me & Rakuten

2 Solr centered Advertising Systems

3 Solr performance

4 Solr plug-in

5 (Solr with Japanese language)


Who am I?

Hatayama Hideharu (call me Hide)

M.Eng, Tokyo Institute of Technology, Japan

Worked on advertising systems at Rakuten for 3 years

ad management system development

ad distribution system development

system architecture design

increase the performance of systems

increase profitability of ad services

User of Solr, not an implementer

http://6109.hidepiy.com/




Who are we?

Rakuten, Inc.

Internet services company

Founded : Feb. 7th 1997, Tokyo, Japan

The first service: Rakuten Ichiba (shopping mall)


Rakuten in Japan


Rakuten Ichiba

Ichiba: The largest online shopping mall in Japan

[Screenshot of the Ichiba top page. Labeled parts: user info, campaign, other services, item search, category navigation, personalized item, item history, sale event, shop history, bookmarked item, service tab, ...]

Rakuten’s Global Expansion

[World map of Rakuten services. Legend: E-Commerce, eBook, Travel, Other services & businesses, Development center]

http://www.rakuten.com.br/ecservice/e-commerce/

http://rakutenloyalty.com/

http://golf.rakuten.com/

http://travel.rakuten.com/


Types of advertisements on Rakuten Ichiba [1/3]

Listing Ad (search word related ad)

item search

searched ads

searched items

Types of advertisements on Rakuten Ichiba [2/3]


Display Ad (placement related ad)

where, when …

Targeting Ad (user related ad)

sex, age, browsing history …

Types of advertisements on Rakuten Ichiba [3/3]

... Ad ?

120 ads on 1 page ...orz



ad system function landscape

[Diagram: functions of the ad system, between its users and the media]

users: Rakuten staff, merchants, other staff (via tools)

media: Rakuten owned media (Web/Email), owned ad network, external ADNW / AdEx

ad types: Tenancy Ad (fixed placement/fee/term), P4P Ad (CPM/CPC/CPA etc.)

ad management: ad placement def., sales mgmt., creative mgmt., campaign mgmt., budget mgmt., bidding, delivery mgmt., reporting, merchant tool

ad distribution: ad server, beacon server, redirect server, log processing, targeting (placement, keyword, behavioral, demographic, etc.)

additional functions: big data analysis (attribution, behavior, optimization), advanced targeting, creative optimization, connect to affiliate network, programmatic media buying

devices: PC, mobile, smartphone, tablet

ad distribution system [1/2]

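[Diagram: a request goes into the ad distribution system (???) and an ad comes back]

request: parameter (placement, keyword, ad type, ...), cookie

processing: ad searching, ad filtering, ad sorting, logging, ...

response: JSON, HTML, JavaScript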

ad distribution system [2/2]

needs high performance and high availability

e.g., more than 7,000 req / sec per server with 100.00% availability

collect & analyze logs, then improve profitability

the basic architecture is the same across our various ad types

using...

Kyoto Tycoon


system design: few years ago [1/5]

[Diagram: SLBs in front of web svr, app svr, slave, and master server clusters. Legend: 1 physical server, SLB, 1 server cluster]

system design: few years ago [2/5]

cluster

web server x 4

app server x 5

system design: few years ago [3/5]

SLB connect

app <-> Solr


system design: few years ago [4/5]

High availability, robust

simplified task for each server

Web server only runs Apache

Solr server only does searching

...

make full use of resources, on demand provisioning

e.g., add 1 front cluster

e.g., swap a broken Apache server

e.g., tune up performance, decrease app servers 5 -> 3


system design: few years ago [5/5]

so many servers, so many configurations

we didn’t have automatic deployment or operation tools

so much networking between servers

Apache <-> Tomcat

app <-> Solr

...

Apache, Tomcat, Solr, and Redis never died,

but performance was our biggest issue.

system design: little bit changed [1/4]

[Diagram: SLBs in front of server clusters, with slave and master servers. Legend: 1 physical server, SLB, 1 server cluster]

system design: little bit changed [2/4]

merged web & app server

1 physical server contains both Apache & Tomcat

system design: little bit changed [3/4]

easy to understand the whole system network

easy to operate

easy to deploy or change configurations

system design: little bit changed [4/4]

Solr is still far from apps

system design: current [1/4]

[Diagram: SLBs in front of app servers, with the master behind them. Legend: 1 physical server, SLB]

system design: current [2/4]

Solr slave is included in app server


system design: current [3/4]

SLB connect

master <-> slave


system design: current [4/4]

no SPOF (Solr master is redundant)

easy to understand the whole system process

easy to operate

easy to deploy or change configurations

easy to scale out

good performance (7,000 req / sec per server)

but we can’t make full use of server resources

e.g., we want 0.7 Solr instances for 1 app instance...

system design: in the near future

server instance

physical on-premise, private cloud, public cloud

PaaS

Apache or Nginx?

shared cache

master <-> slave or SolrCloud?

Solr or Elasticsearch?

abolish servlet & tomcat style?

collaborate more with Hadoop family members


m(_ _)m

UNDER CONSTRUCTION

operation e.g. Solr schema update [1/8]

[Diagram: SLB in front of three app servers, with the master behind them. Legend: 1 physical server, SLB]


operation e.g. Solr schema update [2/8]

Stop replication of Solr & Redis


operation e.g. Solr schema update [3/8]

Separated from the net

Service IN, Service IN, Service OUT


operation e.g. Solr schema update [4/8]

update schema & app


operation e.g. Solr schema update [5/8]

update schema

operation e.g. Solr schema update [6/8]

restart replication

operation e.g. Solr schema update [7/8]

test app functions with reverse proxy

operation e.g. Solr schema update [8/8]

Service IN, Service IN, Service IN

connected to the net
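
The "stop replication" and "restart replication" steps of this procedure map onto HTTP commands of Solr's ReplicationHandler. The following is a minimal sketch, assuming a master/slave setup with the handler registered at /replication. The host name and path in main are hypothetical, and the code only builds the URLs (nothing is sent). Taking a server in and out of the SLB and pausing Redis replication are separate operations that are not shown.

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaUpdateSketch {

    /** Builds a ReplicationHandler URL such as .../replication?command=details. */
    public static String replicationUrl(String solrBaseUrl, String command) {
        return solrBaseUrl + "/replication?command=" + command;
    }

    /** Lists the Solr-side calls for one slave, in the order of the slides. */
    public static List<String> solrSteps(String slaveBaseUrl) {
        List<String> steps = new ArrayList<>();
        // [2/8] stop replication: the slave stops polling the master
        steps.add(replicationUrl(slaveBaseUrl, "disablepoll"));
        // [3/8]-[5/8] take the server out of service, update schema & app (manual, not shown)
        // [6/8] restart replication
        steps.add(replicationUrl(slaveBaseUrl, "enablepoll"));
        // check the replication state before the server goes back into service
        steps.add(replicationUrl(slaveBaseUrl, "details"));
        return steps;
    }

    public static void main(String[] args) {
        // Hypothetical slave URL, for illustration only
        for (String url : solrSteps("http://app01.example.com:8080/solr")) {
            System.out.println("GET " + url);
        }
    }
}
```

Keeping the steps as a plain list makes it easy to run them one server at a time, which is what keeps the other servers "Service IN" during the update.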


Solr cache

the various kinds of Lucene/Solr caches

fieldCache (Lucene level)

fieldValueCache

documentCache

filterCache

queryResultCache

HTTP cache

and user defined caches

filter cache

we’re using it for caching the results of filter queries

 <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>

query result cache

we used to enable it to avoid repeating the same searches

 <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

application cache

about the cache on the app side

processing time excluding searching is 0 – 1 msec

-> converting from doc to DTO is relatively wasteful

-> SolrJ with javabin works well, but...
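
One way to avoid repeating the doc to DTO conversion is a small LRU cache on the app side. The sketch below is an illustration only, not the cache used in the talk: the class name is hypothetical, it is built on java.util.LinkedHashMap, and it is not thread-safe (an app server would need to wrap or replace it).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A tiny LRU cache: the least recently accessed entry is evicted first. */
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCacheSketch(int maxEntries) {
        // accessOrder = true makes get() move an entry to the "most recent" end
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put(); returning true drops the eldest entry
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> dtoCache = new LruCacheSketch<>(2);
        dtoCache.put("ad1", "dto1");
        dtoCache.put("ad2", "dto2");
        dtoCache.get("ad1");         // ad1 becomes the most recently used
        dtoCache.put("ad3", "dto3"); // evicts ad2, the least recently used
        System.out.println(dtoCache.keySet());
    }
}
```

The size limit plays the same role as the size attribute of the Solr caches: it bounds memory use at the price of cache misses.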


sizing & memory usage

monitoring -> tuning configuration, memory allocation

server: traffic, load, cpu, memory, page, swap

Apache: busy, rps, bps, cpu, state, processing time

Tomcat: thread, rps, bps, eps, memory, jmx

Solr: index size, doc num, memory, cache hit ratio

admin page, admin/Luke, replication?command=details...

[Screenshots: server monitoring, GrowthForecast, Solr admin, command, Luke]

avoid Full GC

Full GC

if we allocate 2GB for a tomcat heap

-> “Stop the World” would be more than 1 sec

Concurrent GC (we’re still struggling with tuning)

e.g.)

HEAP_OPTS="-Xmx2g -Xms2g -Xss512k"

GC_LOG_OPTS="-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails"

FULL_GC_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=32 -XX:TargetSurvivorRatio=90"

JMX_OPTS="-Dcom.sun.management.config.file=${CATALINA_HOME}/conf/management.properties"

CATALINA_OPTS="-server ${HEAP_OPTS} ${GC_LOG_OPTS} ${FULL_GC_OPTS} ${JMX_OPTS}"
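
Besides the GC log, the same numbers can be read inside the JVM through the standard management beans, which are also what JMX exposes. A minimal sketch using only java.lang.management; the class name is hypothetical and the values depend on the JVM and collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStatsSketch {

    /** Sums the accumulated collection time (msec) over all collectors of this JVM. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " msec");
        }
        System.out.println("total GC time: " + totalGcMillis() + " msec");
    }
}
```

Sampling these counters periodically shows whether a configuration change actually reduces the time spent in collections.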


Solr plugin

RequestHandler, SearchHandler

SearchComponent, QueryComponent

QParserPlugin, PostFilter

QueryResponseWriter

-> implemented these classes for our own use


RequestHandler & SearchHandler

for logging

for health check

like /admin/ calls AdminHandlers

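import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OurRequestHandler extends RequestHandlerBase {
    /** Logger */
    private static Logger log = LoggerFactory.getLogger(OurRequestHandler.class);

    @Override
    public void init(NamedList args) {
        super.init(args);
    }

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        // log every request and disable HTTP caching for the response
        log.info(req.toString());
        rsp.setHttpCaching(false);
        ...
    }
}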

Solr index situation [1/2]

We thought Solr indexing would be very costly (just thought...)

-> so we separated the data into these two

basic stable data (in the Solr index)

additional unstable data (in MySQL or Redis)

Solr index situation [2/2]

Solr index: for searching

keyword, placement data (Japan, Ichiba, footer...)

a few GB

Redis data (previously MySQL): for filtering or sorting

ad status (active or not)

ad price, ad rank (based on CTR, CVR...)

and ad contents data (image path, link, text...)

100MB – 10GB (depends on advertisement types)
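
The split above means that searching returns candidate ads, and the frequently changing data decides which of them are shown and in what order. A minimal sketch of that filter and sort step, assuming the volatile data has already been loaded into a map; the class, field names, and sample values are hypothetical and stand in for the Redis data.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AdFilterSortSketch {

    /** Frequently changing per-ad data (kept in Redis in the talk; a plain object here). */
    public static final class AdState {
        final boolean active;
        final double rank;

        public AdState(boolean active, double rank) {
            this.active = active;
            this.rank = rank;
        }
    }

    /** Keeps active ads only, sorts them by rank (highest first), and returns at most rows. */
    public static List<String> filterAndSort(List<String> searchedAdIds,
                                             Map<String, AdState> states, int rows) {
        return searchedAdIds.stream()
                .filter(id -> states.containsKey(id) && states.get(id).active)
                .sorted(Comparator.comparingDouble((String id) -> states.get(id).rank).reversed())
                .limit(rows)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, AdState> states = new HashMap<>();
        states.put("ad1", new AdState(true, 0.2));
        states.put("ad2", new AdState(false, 0.9)); // inactive: filtered out
        states.put("ad3", new AdState(true, 0.5));
        // "ad4" has no state entry and is dropped as well
        System.out.println(filterAndSort(List.of("ad1", "ad2", "ad3", "ad4"), states, 10));
    }
}
```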

searching: handle ads in app [1/2]

handle req -> search -> filter -> sort -> ...

searching: handle ads in Solr [2/2]

handle req -> search -> ...

Solr with Redis data handling [1/2]

ResponseWriter

-> unsuitable for searching or filtering

SearchComponent

-> easy to implement, configure

-> basic process is already handled in QueryComponent


Solr with Redis data handling [2/2]

modify QueryComponent

-> good position in terms of functionality

-> base for default searching

-> relatively big component

ConstantScoreQuery with our own Filter?

QParserPlugin & PostFilter [1/2]

e.g.)

<!-- solrconfig.xml -->
<lib dir=".../orochi_search" />
<queryParser name="redis" class="...orochi.search.ExtendedQParserPlugin" />

public class ExtendedQParserPlugin extends QParserPlugin {
    public void init(NamedList args) { /* NOOP */ }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            ...
            @Override
            public Query parse() throws ParseException {
                return new RedisPostFilter(rows, preview, currentTimeMillis);
            }
        };
    }
}

QParserPlugin & PostFilter [2/2]

public class RedisPostFilter extends ExtendedQueryBase implements PostFilter {
    public RedisPostFilter(int rows, long preview, long currentTimeMillis) {
        // a post filter must not be cached
        setCache(false);
        ...
    }

    // returns whether the document is valid or not
    public boolean isValid(int docId, IndexSearcher indexSearcher) {
        document = indexSearcher.doc(docId, fieldSelector);
        ...
    }

    public DelegatingCollector getFilterCollector(final IndexSearcher indexSearcher) {
        return new DelegatingCollector() {
            @Override
            public void collect(int docId) throws IOException {
                // note: docId is relative to the current segment here
                if (isValid(docId, indexSearcher)) {
                    super.collect(docId);
                    ...
                }
            }
        };
    }

    @Override
    public int getCost() {
        // Solr runs a filter as a post filter only when its cost is 100 or more
        return Math.max(super.getCost(), 100);
    }
    ...
}
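
A parser registered under the name "redis" can be selected from a request with Solr's local params syntax, {!redis}, typically in a filter query (fq). A minimal sketch of building such a request string, assuming Java 10 or later; the field name keyword and the sample values are hypothetical, and the parameters that the real plugin reads are not shown in the slides.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PostFilterRequestSketch {

    /** Builds the query string of a search that applies the "redis" parser as a filter. */
    public static String buildQuery(String mainQuery, int rows) {
        String q = URLEncoder.encode(mainQuery, StandardCharsets.UTF_8);
        // {!redis} picks the query parser registered with name="redis" in solrconfig.xml
        String fq = URLEncoder.encode("{!redis}", StandardCharsets.UTF_8);
        return "q=" + q + "&fq=" + fq + "&rows=" + rows;
    }

    public static void main(String[] args) {
        // Hypothetical field name and keyword
        System.out.println("/select?" + buildQuery("keyword:shoes", 10));
    }
}
```

Because the filter disables caching and reports a cost of at least 100, Solr applies it after the main query and the cheaper filters, so the per-document check only runs on documents that already matched.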

Merge Solr & Redis

handle req -> search -> ...

Japanese linguistics

すもももももももも

(pronunciation) sumomomomomomomomo

すもも も もも も もも

(words) sumomo mo momo mo momo

李も桃も桃

(meaning) Plums and peaches are both part of peaches


Japanese linguistics

最中を食べている最中ですm(_ _)m

(pronunciation) monakawotabeteirusaichudesu

(meaning) I’m in the middle of eating monaka. (excuse me)

(最中 is read “monaka” the first time and “saichu” the second)

how to separate this sentence into tokens for indexing?


Tokenize approach: N-gram


unigram

最 中 を 食 べ て い る 最 中 で す m ( _ _ ) m

bigram

最中 中を を食 食べ べて てい いる る最 最中 中で です すm m( (_ __ _) )m

trigram

最中を 中を食 を食べ 食べて べてい ている いる最 る最中 最中で 中です ですm すm( m(_ (__ __) _)m
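
The N-gram approach needs no dictionary: it simply slides a window of n characters over the text. A minimal sketch of that idea, not the tokenizer Solr ships; the class name is hypothetical, and whitespace is dropped before the window is applied, as in the example above.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    /** Returns all character n-grams of the text, ignoring whitespace. */
    public static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Work on code points so that characters outside the BMP are not split
        int[] cps = text.codePoints().filter(cp -> !Character.isWhitespace(cp)).toArray();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            grams.add(new String(cps, i, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("最中を食べている最中です", 2));
    }
}
```

A text of k characters yields k - n + 1 tokens, which is why the N-gram index grows large compared with a dictionary-based one.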

Tokenize approach: Morphological Analysis [1/2]

using dictionary

最中を食べている最中です m(_ _)m

-> 最中 | を | 食べ | て | いる | 最中 | です | m(_ _)m

text: partOfSpeech, pronunciation

最中: noun-common, monaka

を: particle-case-misc, o

食べ: verb-main, tabe

て: particle-conjunctive, te

いる: verb-auxiliary, iru

最中: noun-adverbial, saichu

です: auxiliary-verb, desu

m(_ _)m: -, -

Tokenize approach: Morphological Analysis [2/2]

Tokenize approach: compare 2 ways

                  N-gram                                        Morphological Analysis
index size        big                                           small
preparation       not needed                                    make & maintain word dictionary
implementation    very easy                                     hard (NLP, ML, statistics)
new word          no problem                                    update dictionary, re-index
search relevancy  without omission, contains trivial matches    with omission, human like
processing time   ...                                           ...

Solr with Morphological Analysis

up to ver. 3.5: set up component & dictionary manually

Sen

lucene-gosen

...

ver. 3.6 and later: field type text_ja works well

“kuromoji” is inside

issues of kuromoji

some adjustments are needed for migration

supported dictionaries may differ between the previous engine & kuromoji

half width & full width characters

Windows8 <-> Ｗｉｎｄｏｗｓ８

AKB48 <-> ＡＫＢ４８
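
The width problem can be illustrated outside Solr with Unicode normalization. The sketch below uses java.text.Normalizer with form NFKC, which folds full width letters and digits into their ASCII forms. It is an illustration only and not what the analyzer does internally: NFKC also rewrites other compatibility characters, so it is broader than a pure width filter. The class name is hypothetical.

```java
import java.text.Normalizer;

public class WidthNormalizeSketch {

    /** Folds full width ASCII variants (and other compatibility forms) via NFKC. */
    public static String normalizeWidth(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(normalizeWidth("Ｗｉｎｄｏｗｓ８"));
        System.out.println(normalizeWidth("ＡＫＢ４８"));
    }
}
```

Applying the same folding at index time and at query time is what lets both spellings match each other.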


Japanese Analyzer

JapaneseTokenizer

JapaneseBaseFormFilter

JapanesePartOfSpeechStopFilter

CJKWidthFilter

StopFilter

JapaneseKatakanaStemFilter

LowerCaseFilter


Thank you, San Diego

any question?

any comment?

any advice?

If you have some, let’s talk later (not now...?)

Hide (Hatayama Hideharu)

Big Data Department, Targeting Section, Advertising Group

Rakuten Inc.

blog: http://6109.hidepiy.com

facebook: http://www.facebook.com/hatayama.hideharu

twitter: ... I don’t remember

