Using Lucene/Solr to build Advertising Systems
Hide (Hatayama Hideharu)
Big Data Department, Targeting Section, Advertising Group
Rakuten, Inc. May 2nd 2013
2
Intro
Agenda | www.lucenerevolution.org
http://www.lucenerevolution.org/2013/agenda
35 min...orz
my talk is NOT about... m(_ _)m
NRT
SolrCloud
complicated queries
or other Solr hot topics
my talk is just about
an overview of Solr and its most common features
what we have learned from running Solr in practice
4
Agenda
1 Introduction of Me & Rakuten
2 Solr centered Advertising Systems
3 Solr performance
4 Solr plug-in
5 (Solr with Japanese language)
10
Who am I?
Hatayama Hideharu (call me Hide)
M.Eng, Tokyo Institute of Technology, Japan
Have worked on advertising systems at Rakuten for 3 years
ad management system development
ad distribution system development
system architecture design
improving system performance
improving the profitability of ad services
A user of Solr, not an implementer
http://6109.hidepiy.com/
11
Who are we?
Rakuten, Inc.
Internet services company
Founded : Feb. 7th 1997, Tokyo, Japan
The first service: Rakuten Ichiba (shopping mall)
13
Rakuten in Japan
14
Rakuten Ichiba
Ichiba: The largest online shopping mall in Japan
(screenshot labels: user info, campaign, other services, item search, category navigation, personalized item, item history, sale event shop history, bookmarked item, service tab, ...)
15
Rakuten’s Global Expansion
(world map of Rakuten locations; legend: E-Commerce, eBook, Travel, Other services & businesses, Development center)
16
Agenda
2 Solr centered Advertising Systems
17
Types of advertisements on Rakuten Ichiba [1/3]
Listing Ad (ad related to the search keyword)
(screenshot labels: item search, searched ads, searched items)
18
Types of advertisements on Rakuten Ichiba [2/3]
Display Ad (placement related ad)
where, when …
Targeting Ad (user related ad)
sex, age, browsing history …
19
Types of advertisements on Rakuten Ichiba [3/3]
... Ad ?
120 ads on 1 page ...orz
20
ad system function landscape
(diagram labels, grouped as far as they can be recovered)
tool users: Rakuten staff, Merchants, Other staff
media: Rakuten Owned Media (Web/Email), Owned Ad Network, External ADNW, AdEx
ad types: Tenancy Ad (Fixed placement/fee/term), P4P Ad (CPM/CPC/CPA etc.)
ad management: Ad placement def., Sales mgmt., Creative mgmt., Campaign mgmt., Budget mgmt., Bidding, Delivery mgmt., Targeting/media, Reporting, Merchant Tool
ad distribution: Ad server, Log processing, Targeting (Placement, keyword, behavioral, demographic, etc.), Beacon server, Redirect server
Additional Function: Big Data Analysis (Attribution, Behavior, Optimization), Advanced targeting, Creative optimization, Connect to affiliate network, Programmatic media buying
Device: PC, Mobile, Smartphone, Tablet
21
ad distribution system [1/2]
(diagram)
input: parameter (placement, keyword, ad type, ...), cookie
process (???): ad searching, ad filtering, ad sorting, logging, ...
output: JSON, HTML, JavaScript
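The flow on this slide (parameters in; searching, filtering, sorting; a response out) can be sketched in plain Java. This is only an illustration under stated assumptions: the `Ad` class, its fields, and `AdPipeline.select` are hypothetical names, an in-memory list stands in for the search index, and logging and response rendering are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical ad record; the field names are illustrative, not from the talk.
class Ad {
    final String id;
    final String placement;
    final String keyword;
    final boolean active;
    final double rank;

    Ad(String id, String placement, String keyword, boolean active, double rank) {
        this.id = id;
        this.placement = placement;
        this.keyword = keyword;
        this.active = active;
        this.rank = rank;
    }
}

class AdPipeline {
    // search -> filter -> sort -> cut to the requested number of rows
    static List<Ad> select(List<Ad> ads, String placement, String keyword, int rows) {
        List<Ad> hits = new ArrayList<>();
        for (Ad ad : ads) {
            // ad searching: match the request parameters
            boolean matches = ad.placement.equals(placement) && ad.keyword.equals(keyword);
            // ad filtering: drop ads that are not active
            if (matches && ad.active) {
                hits.add(ad);
            }
        }
        // ad sorting: highest rank first
        hits.sort(Comparator.comparingDouble((Ad ad) -> ad.rank).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(rows, hits.size())));
    }

    public static void main(String[] args) {
        List<Ad> ads = Arrays.asList(
                new Ad("a1", "footer", "shoes", true, 0.80),
                new Ad("a2", "footer", "shoes", false, 0.90),
                new Ad("a3", "footer", "shoes", true, 0.95),
                new Ad("a4", "header", "shoes", true, 0.99));
        for (Ad ad : select(ads, "footer", "shoes", 2)) {
            System.out.println(ad.id);
        }
    }
}
```

With the sample data, `a2` is filtered out as inactive, `a4` does not match the placement, and the remaining ads come back ordered by rank (`a3`, then `a1`).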
22
ad distribution system [2/2]
needs high performance and high availability
e.g., more than 7,000 req/sec per server with 100.00% availability
collect & analyze logs, then improve profitability
the basic architecture is the same across our various ad types
using...
Kyoto Tycoon
23
system design: few years ago [1/5]
(architecture diagram: SLB, web svr, app svr, master and slave servers, grouped into server clusters; legend: 1 physical server, SLB, 1 server cluster)
24
system design: few years ago [2/5]
(architecture diagram with one cluster highlighted)
cluster
web server x 4
app server x 5
25
system design: few years ago [3/5]
(architecture diagram with the app <-> Solr link highlighted)
SLB connect
app <-> Solr
26
system design: few years ago [4/5]
(architecture diagram)
High availability, robust
a simple task for each server
web servers only run Apache
Solr servers only search
...
make full use of resources, on-demand provisioning
e.g., add 1 front cluster
e.g., swap out a broken Apache server
e.g., after performance tuning, reduce app servers 5 -> 3
27
system design: few years ago [5/5]
(architecture diagram)
so many servers, so many configurations
we didn’t have automated deployment or operation tools
so many network connections between servers
Apache <-> Tomcat
app <-> Solr
...
Apache, Tomcat, Solr, and Redis never went down,
but performance was our biggest issue.
28
system design: little bit changed [1/4]
(revised architecture diagram; legend: 1 physical server, SLB, 1 server cluster)
29
system design: little bit changed [2/4]
(revised architecture diagram)
merged web & app servers
1 physical server runs both Apache & Tomcat
30
system design: little bit changed [3/4]
(revised architecture diagram)
easy to understand the whole system network
easy to operate
easy to deploy or change configurations
31
system design: little bit changed [4/4]
(revised architecture diagram)
Solr is still far from the apps
32
system design: current [1/4]
(current architecture diagram: SLB, app servers, master; legend: 1 physical server, SLB)
33
system design: current [2/4]
(current architecture diagram)
the Solr slave is included in the app server
34
system design: current [3/4]
(current architecture diagram with the master <-> slave link highlighted)
SLB connect
master <-> slave
35
system design: current [4/4]
(current architecture diagram)
no SPOF (Solr master is redundant)
easy to understand the whole system process
easy to operate
easy to deploy or change configurations
easy to scale out
good performance (7,000 req/sec per server)
but we can’t make full use of server resources
e.g., we want 0.7 Solr instances per app instance...
36
system design: in the near future
server instance
physical on-premise, private cloud, public cloud
PaaS
Apache or Nginx?
shared cache
master <-> slave or SolrCloud?
Solr or Elasticsearch?
move away from the servlet & Tomcat style?
integrate more with the Hadoop family
m(_ _)m UNDER CONSTRUCTION
38
operation e.g. Solr schema update [1/8 to 8/8]
(diagram repeated on every slide: SLB, three groups of app servers, master; legend: 1 physical server, SLB)
[1/8] starting state
[2/8] stop replication of Solr & Redis
[3/8] separate one app group from the net (Service IN, Service IN, Service OUT)
[4/8] update schema & app
[5/8] update schema
[6/8] restart replication
[7/8] test app functions through a reverse proxy
[8/8] reconnect to the net (Service IN, Service IN, Service IN)
46
Agenda
3 Solr performance
47
Solr cache
the various kinds of Lucene/Solr caches
fieldCache (Lucene level)
fieldValueCache
documentCache
filterCache
queryResultCache
HTTP cache
and user-defined caches
48
filter cache
we use it to cache the results of filter queries
<!-- default in solrconfig.xml -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
49
query result cache
we used to enable it to avoid repeating identical searches
<!-- default in solrconfig.xml -->
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
50
application cache
caching on the app side
processing time excluding search is 0 – 1 msec
-> converting docs to DTOs is relatively wasteful
-> SolrJ with javabin works well, but...
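One way to avoid repeating the doc-to-DTO conversion is a small LRU cache on the app side, keyed by the query. The sketch below is an assumption about how such a cache could look, not the cache used in the talk; `LruCache`, `lookup`, and `hitRatio` are hypothetical names. It is built on `LinkedHashMap` in access order and is not thread-safe, so a real app server would need synchronization or a concurrent cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache; the least recently used entry is evicted once maxEntries is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;
    private long lookups;
    private long hits;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, which gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    // Like get(), but also counts lookups and hits for the hit ratio.
    V lookup(K key) {
        lookups++;
        V value = get(key);
        if (value != null) {
            hits++;
        }
        return value;
    }

    double hitRatio() {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("q=shoes", "dto-1");
        cache.put("q=bags", "dto-2");
        cache.lookup("q=shoes");       // hit, and "q=shoes" becomes most recently used
        cache.put("q=hats", "dto-3");  // evicts "q=bags"
        System.out.println(cache.keySet() + " hitRatio=" + cache.hitRatio());
    }
}
```

Tracking the hit ratio matters because a cache that rarely hits only adds memory pressure, which ties into the sizing and GC points in this section.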
51
sizing & memory usage
monitoring -> tune configuration and memory allocation
server: traffic, load, cpu, memory, page, swap
Apache: busy, rps, bps, cpu, state, processing time
Tomcat: thread, rps, bps, eps, memory, jmx
Solr: index size, doc num, memory, cache hit ratio
admin page, admin/Luke, replication?command=details...
tools: server monitoring, GrowthForecast, Solr admin (command, Luke)
52
avoid Full GC
Full GC
if we allocate 2GB for a Tomcat heap
-> “Stop the World” pauses would last more than 1 sec
Concurrent GC (we’re still struggling with tuning)
e.g.)
HEAP_OPTS="-Xmx2g -Xms2g -Xss512k"
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails"
FULL_GC_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=32 -XX:TargetSurvivorRatio=90"
JMX_OPTS="-Dcom.sun.management.config.file=${CATALINA_HOME}/conf/management.properties"
CATALINA_OPTS="-server ${HEAP_OPTS} ${GC_LOG_OPTS} ${FULL_GC_OPTS} ${JMX_OPTS}"
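Besides the GC log, the JVM exposes collection counts and accumulated collection time through its standard management beans, which is the same data a JMX monitor would read. The sketch below uses only `java.lang.management`; the class name `GcStats` is hypothetical. Note that the reported time is accumulated elapsed collection time per collector, which for a concurrent collector is not the same as the "Stop the World" pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStats {
    // Sum of the accumulated collection time (ms) over all collectors of this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC time (ms): " + totalGcMillis());
    }
}
```

Sampling these counters at a fixed interval and graphing the deltas shows whether a tuning change actually reduced the time spent in old-generation collections.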
53
Agenda
4 Solr plug-in
54
Solr plugin
RequestHandler, SearchHandler
SearchComponent, QueryComponent
QParserPlugin, PostFilter
QueryResponseWriter
-> implemented these classes for our own use
55
RequestHandler & SearchHandler
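for logging
for health check
similar to how /admin/ calls AdminHandlers
// imports use the package names of Solr 3.x/4.x
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OurRequestHandler extends RequestHandlerBase {
    /** Logger */
    private static final Logger log = LoggerFactory.getLogger(OurRequestHandler.class);

    @Override
    public void init(NamedList args) {
        super.init(args);
    }

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        log.info(req.toString());   // log every incoming request
        rsp.setHttpCaching(false);  // keep HTTP caches from storing this response
        ...
    }
}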
56
Solr index situation [1/2]
We thought Solr indexing was very costly (just thought...)
-> so we separated the data into these two
basic stable data
additional unstable data
57
Solr index situation [2/2]
Solr index: for searching
keyword, placement data (Japan, Ichiba, footer...)
a few GB
Redis data (previously MySQL): for filtering or sorting
ad status (active or not)
ad price, ad rank (based on CTR, CVR...)
and ad contents data (image path, link, text...)
100MB – 10GB (depends on advertisement types)
58
searching: handle ads in app [1/2]
(flow: handle req -> search -> filter -> sort -> ...)
59
searching: handle ads in Solr [2/2]
(flow: handle req -> search -> ...)
60
Solr with Redis data handling [1/2]
ResponseWriter
-> unsuitable for searching or filtering
SearchComponent
-> easy to implement, configure
-> basic process is already handled in QueryComponent
61
Solr with Redis data handling [2/2]
modify QueryComponent
-> good position in terms of functionality
-> base for default searching
-> relatively big component
ConstantScoreQuery with our own Filter?
62
QueryParserPlugin & PostFilter [1/2]
e.g.)
<!-- solrconfig.xml -->
<!-- put jar file here -->
<lib dir=".../orochi_search" />
<!-- define implemented class -->
<queryParser name="redis" class="...orochi.search.ExtendedQParserPlugin" />

public class ExtendedQParserPlugin extends QParserPlugin {
    public void init(NamedList args) { /* NOOP */ }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            ...
            @Override
            public Query parse() throws ParseException {
                // the parser returns our post filter instead of a normal query
                return new RedisPostFilter(rows, preview, currentTimeMillis);
            }
        };
    }
}
63
QueryParserPlugin & PostFilter [2/2]
public class RedisPostFilter extends ExtendedQueryBase implements PostFilter {

    public RedisPostFilter(int rows, long preview, long currentTimeMillis) {
        setCache(false);  // a PostFilter must not be cached
        ...
    }

    /** Returns whether the document is valid. */
    public boolean isValid(int docId, IndexSearcher indexSearcher) throws IOException {
        document = indexSearcher.doc(docId, fieldSelector);
        ...
    }

    public DelegatingCollector getFilterCollector(final IndexSearcher indexSearcher) {
        return new DelegatingCollector() {
            @Override
            public void collect(int docId) throws IOException {
                // docId is relative to the current segment; adding docBase
                // (inherited from DelegatingCollector) gives the index-wide id for doc()
                if (isValid(docBase + docId, indexSearcher)) {
                    super.collect(docId);
                    ...
                }
            }
        };
    }

    @Override
    public int getCost() {
        // a cost of 100 or more tells Solr to run this filter after the main query
        return Math.max(super.getCost(), 100);
    }
    ...
}
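The idea behind the post filter is a collector that sits in front of the real one and forwards only the documents that pass a check against external data. The sketch below shows that delegation pattern with plain Java stand-ins. It does not use Solr's or Lucene's API: `DocCollector`, `StatusPostFilter`, and `PostFilterDemo` are hypothetical names, and a `Map` plays the role of the Redis lookup of ad status.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a result collector; NOT the Lucene/Solr Collector API.
interface DocCollector {
    void collect(int docId);
}

// Forwards a matched document to the delegate only if the external store marks it active.
class StatusPostFilter implements DocCollector {
    private final DocCollector delegate;
    private final Map<Integer, Boolean> activeByDocId; // stand-in for a Redis lookup

    StatusPostFilter(DocCollector delegate, Map<Integer, Boolean> activeByDocId) {
        this.delegate = delegate;
        this.activeByDocId = activeByDocId;
    }

    @Override
    public void collect(int docId) {
        // unknown documents are treated as inactive
        if (Boolean.TRUE.equals(activeByDocId.get(docId))) {
            delegate.collect(docId);
        }
    }
}

class PostFilterDemo {
    // Runs the documents matched by the main query through the filter and returns the survivors.
    static List<Integer> run(int[] matchedDocs, Map<Integer, Boolean> activeByDocId) {
        List<Integer> kept = new ArrayList<>();
        DocCollector sink = docId -> kept.add(docId);
        DocCollector filter = new StatusPostFilter(sink, activeByDocId);
        for (int docId : matchedDocs) {
            filter.collect(docId);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Integer, Boolean> status = new HashMap<>();
        status.put(1, true);
        status.put(2, false);
        status.put(3, true);
        System.out.println(run(new int[] {1, 2, 3, 4}, status));
    }
}
```

Because the check runs only for documents that already matched the query, the per-document lookup cost is paid for a small candidate set instead of the whole index.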
64
Merge Solr & Redis
handle req
search
...
65
Agenda
5 (Solr with Japanese language)
66
Japanese linguistics
すもももももももも
(pronunciation) sumomomomomomomomo
すもも も もも も もも
(words) sumomo mo momo mo momo
李も桃も桃
(meaning) Both plums and peaches are kinds of peach
67
Japanese linguistics
最中を食べている最中ですm(_ _)m
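(pronunciation) monaka wo tabete iru saichu desu
(meaning) I’m in the middle of eating monaka. (excuse me)
(最中 appears twice: first read "monaka", a sweet, then "saichu", "in the middle of")
How should this sentence be split into tokens for indexing?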
68
Tokenize approach: N-gram
最中を食べている最中ですm(_ _)m
unigram
最 中 を 食 べ て い る 最 中 で す m ( _ _ ) m
bigram
最中 中を を食 食べ べて てい いる る最 最中 中で です すm m( (_ _ _ _) )m
trigram
最中を 中を食 を食べ 食べて べてい ている いる最 る最中 最中で 中です ですm すm( m(_ (_ _ _ _) _)m
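The N-gram lists above come from sliding a window of n characters over the text, one position at a time. A minimal sketch in Java follows; `NGram.ngrams` is a hypothetical helper, not Solr's tokenizer, and it works on Unicode code points so that characters outside the basic plane stay whole. The demo defaults to romaji input so the source file stays ASCII; pass any text as the first argument.

```java
import java.util.ArrayList;
import java.util.List;

class NGram {
    // Returns every run of n consecutive code points, in order of appearance.
    static List<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        int[] codePoints = text.codePoints().toArray();
        List<String> grams = new ArrayList<>();
        for (int start = 0; start + n <= codePoints.length; start++) {
            grams.add(new String(codePoints, start, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String text = args.length > 0 ? args[0] : "sumomo";
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + "-gram: " + String.join(" ", ngrams(text, n)));
        }
    }
}
```

For 最中を食べている最中です (12 characters) this yields 11 bigrams and 10 trigrams, matching the lists on this slide up to です and 中です. A text of length L always gives L - n + 1 grams, which is why the N-gram index grows large.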
69
Tokenize approach: Morphological Analysis [1/2]
最中を食べている最中ですm(_ _)m
using dictionary
最中 を 食べ て いる 最中 です m(_ _)m
text       partOfSpeech           pronunciation
最中       noun-common            monaka
を         particle-case-misc     o
食べ       verb-main              tabe
て         particle-conjunctive   te
いる       verb-auxiliary         iru
最中       noun-adverbial         saichu
です       auxiliary-verb         desu
m(_ _)m    -                      -
70
Tokenize approach: Morphological Analysis [2/2]
最中を食べている最中ですm(_ _)m
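To see why dictionary-based tokenizing is harder than N-gram, here is a deliberately naive tokenizer that always takes the longest dictionary word at the current position. This greedy longest-match is a swapped-in simplification, not what kuromoji does (kuromoji searches a lattice of candidate words with costs); `GreedyTokenizer` and its dictionary are hypothetical, and the romaji form of the earlier sumomo example is used as input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GreedyTokenizer {
    // At each position take the longest dictionary word; fall back to a single character.
    static List<String> tokenize(String text, Set<String> dictionary) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = Math.min(maxLen, text.length() - pos);
            while (len > 1 && !dictionary.contains(text.substring(pos, pos + len))) {
                len--;
            }
            tokens.add(text.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("sumomo", "momo", "mo"));
        System.out.println(String.join(" ", tokenize("sumomomomomomomomo", dictionary)));
    }
}
```

The greedy result is `sumomo momo momo momo`, not the intended `sumomo mo momo mo momo`: every token is a valid word, yet the particles are lost. Choosing the right split needs part-of-speech and connection information, which is what a morphological analyzer and its dictionary provide.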
71
Tokenize approach: compare 2 ways
                  N-gram                               Morphological Analysis
index size        big                                  small
preparation       not needed                           make & maintain word dictionary
implementation    very easy                            hard (NLP, ML, statistics)
new word          no problem                           update dictionary, re-index
search relevancy  without omission, contains trivial   with omission, human like
processing time   ...                                  ...
72
Solr with Morphological Analysis
up to ver. 3.5: set up the component & dictionary manually
Sen
Lucene gosen
...
ver. 3.6 and later: field type text_ja works well
“kuromoji” is built in
73
issues of kuromoji
some adjustments are needed for migration
the supported dictionaries may differ between
the previous engine & kuromoji
half-width & full-width characters
Windows8 <-> Ｗｉｎｄｏｗｓ８
AKB48 <-> ＡＫＢ４８
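Half-width and full-width variants of the same text can be folded to one form before matching. The sketch below uses the JDK's `java.text.Normalizer` with NFKC, which maps full-width Latin letters and digits to ASCII and half-width katakana to regular katakana. It is an application-side illustration, not the Solr analysis chain itself, and the class name `WidthFold` is hypothetical. Full-width characters are written as Unicode escapes so the source file stays ASCII.

```java
import java.text.Normalizer;

class WidthFold {
    // NFKC folds compatibility variants such as full-width Latin and half-width katakana.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String fullWidthWindows8 = "\uFF37\uFF49\uFF4E\uFF44\uFF4F\uFF57\uFF53\uFF18";
        String fullWidthAkb48 = "\uFF21\uFF2B\uFF22\uFF14\uFF18";
        System.out.println(fold(fullWidthWindows8));
        System.out.println(fold(fullWidthAkb48));
    }
}
```

The two lines printed are `Windows8` and `AKB48`. NFKC also changes other compatibility characters (ligatures, circled digits and so on), so the same folding has to be applied at both index time and query time.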
74
Japanese Analyzer
JapaneseTokenizer
JapaneseBaseFormFilter
JapanesePartOfSpeechStopFilter
CJKWidthFilter
StopFilter
JapaneseKatakanaStemFilter
LowerCaseFilter
76
Thank you, San Diego
Any questions?
Any comments?
Any advice?
If you have any, let’s talk later (not now...?)
Hide (Hatayama Hideharu)
Big Data Department, Targeting Section, Advertising Group
Rakuten Inc.
blog: http://6109.hidepiy.com
facebook: http://www.facebook.com/hatayama.hideharu
twitter: ... I don’t remember