1 Rakuten, Inc. Development Unit, Architect Group, Hiroaki Kubota | January 18, 2012
The performance of a KVS,
the indexes of an RDBMS,
and MapReduce on top:
an all-in-one NoSQL
2
Agenda
• Introduction
• How we use MongoDB on news.infoseek.co.jp
3
Introduction
4
Who am I ?
5
Profile
Name: Hiroaki Kubota (窪田 博昭)
Company: Rakuten Inc.
Unit: ACT = Development Unit Architect Group
Mail: [email protected]
Hobby: Futsal, Golf
Recent: My physical strength has gradually declined...
twitter: crumbjp
github: crumbjp
6
How we take advantage of MongoDB
for Infoseek News
7
An example of our pages
8
Page structure
9
Layout / Components
[Screenshot: the page layout and its components]
10
Albatross structure
[Diagram: requests come from the Internet to the WEB servers, which get the page layout from the LayoutDB (MongoDB ReplSet), call the API servers, and retrieve the data. The API servers get components from the ContentsDB (MongoDB ReplSet). Sessions are kept in the SessionDB (MongoDB ReplSet) and Memcache.]
11
Albatross structure
[Diagram: through the CMS, a developer sets the page layout (HTML markup & API settings) into the LayoutDB (MongoDB ReplSet), deploys APIs to the API servers, and sets components into the ContentsDB (MongoDB ReplSet). Batch servers insert the data.]
12-14
CMS
Layout editor
[Screenshots of the CMS layout editor]
15
MapReduce
16
Our usage
We have never used MapReduce in regular operation.
However, we have used it for some irregular cases:
• To find the invalid articles that should be removed because of someone's mistake...
• To analyze the number of new articles posted per day.
• To analyze how many times an article has been updated.
• Before long, we will start considering using it regularly for social-data analysis...
17
Structure & Performance
18
Structure
• Intel(R) Xeon(R) CPU X5650 2.67GHz 1core!!
• 4GB memory
• 50 GB disk space ( iScsi )
• CentOS5.5 64bit
• mongodb 1.8.0
– ReplicaSet 5 nodes ( + 1 Arbiter)
– Oplog size 1.2GB
– Average object size 1KB
We are using very poor (virtual) machines !!
19
Structure
We've also benchmarked the following environments...
• Virtual machine 1 core
– 1kb data , 6,000,000 documents
– 8kb data , 200,000 documents
• Virtual machine 3 core
– 1kb data , 6,000,000 documents
– 8kb data , 200,000 documents
• EC2 large instance
– 2kb data , 60,000,000 documents. ( 100GB )
Researched environments
20
Performance
1~8 kb documents + 1 unique index
I found a formula for a rough estimation of QPS:
C = number of CPU cores (Xeon 2.67 GHz)
DD = score of the 'dd' command (bytes/sec)
S = document size (bytes)
• GET qps = 4500 × C
• SET(fsync) bytes/s = 0.05 × DD ÷ S
• SET(nsync) qps = 4500, BUT...
there is a chance of becoming STALE
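As a sanity check, the estimation formulas above can be turned into a tiny calculator. This is a sketch: the constants 4500 and 0.05 are the empirical values from this benchmark, not universal truths, and the function names are mine.

```javascript
// Rough QPS estimation from the slide's empirical formulas.
// cores: CPU cores; ddBytesPerSec: sequential write speed measured with `dd`;
// docSizeBytes: average document size.
function estimateGetQps(cores) {
  return 4500 * cores; // empirical constant from the benchmark
}

function estimateFsyncSetBytesPerSec(ddBytesPerSec) {
  return 0.05 * ddBytesPerSec; // fsync'd writes reach ~5% of raw disk speed
}

function estimateFsyncSetQps(ddBytesPerSec, docSizeBytes) {
  return estimateFsyncSetBytesPerSec(ddBytesPerSec) / docSizeBytes;
}

// Example: 1 core, dd = 100 MB/s, 1 KB documents
console.log(estimateGetQps(1));                // 4500
console.log(estimateFsyncSetQps(100e6, 1024)); // 4882.8125 docs/sec
```

For the 3-core virtual machine in the previous slide, the same formula predicts 13500 GET qps.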
21
Performance example (on EC2 large)
22
Performance example (on EC2 large)
Environment and amount of data
EC2 large instance
– 2kb data, 60,000,000 documents (100GB)
– 1 unique index
Data-type
{
  shop: 'someone',
  item: 'something',
  description: 'item explanation sentences...'
}
23
Performance example (on EC2 large)
Batch insert (1000 documents per batch), fsync=true
17906 sec (= 298 min) (= 3358 docs/sec)
Ensure index (background=false)
4049 sec (= 67 min)
1. primary: 2101 sec (= 35 min)
2. secondary: 1948 sec (= 32 min)
24
Performance example (on EC2 large)
Add one node
5833 sec (= 97 min)
1. Get files (2GB × 48): 2120 sec (= 35 min)
2. _id indexing: 1406 sec (= 23 min)
3. unique indexing: 2251 sec (= 38 min)
4. other processes: 56 sec (= 1 min)
25
Performance example (on EC2 large)
Group by
• Reduce by unique index & map & reduce
– 368 msec
db.data.group({
  key: { shop: 1 },
  cond: { shop: 'someone' },
  reduce: function (o, p) { p.sum++; },
  initial: { sum: 0 }
});
26
Performance example (on EC2 large)
MapReduce
• Scan all data: 3116 sec (= 52 min)
– number of keys = 39092
db.data.mapReduce(
  function () { emit(this.shop, 1); },
  function (k, v) {
    var ret = 0;
    v.forEach(function (value) { ret += value; });
    return ret;
  },
  { query: {}, out: { inline: 1 } });
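To see what the shop-count MapReduce above actually does, the map and reduce phases can be mimicked in plain JavaScript. This is a local simulation with made-up sample documents, not the server-side execution:

```javascript
// Local simulation of the shop-count MapReduce above.
const docs = [
  { shop: 'someone', item: 'a' },
  { shop: 'someone', item: 'b' },
  { shop: 'other', item: 'c' },
];

// Map phase: emit (shop, 1) per document; the server groups emits by key.
const emitted = new Map();
for (const doc of docs) {
  if (!emitted.has(doc.shop)) emitted.set(doc.shop, []);
  emitted.get(doc.shop).push(1);
}

// Reduce phase: the same reduce function as in the slide, summing per key.
function reduce(k, v) {
  let ret = 0;
  v.forEach(function (value) { ret += value; });
  return ret;
}

const result = {};
for (const [shop, values] of emitted) {
  result[shop] = reduce(shop, values);
}
console.log(result); // { someone: 2, other: 1 }
```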
27
Major problems...
28
Indexing
29
Index problem
Indexing is a blocking operation by default.
An indexing operation can run in the background
on the primary. But...
It CANNOT run in the background on a secondary.
Moreover, all the secondaries run their indexing
at the same time !!
The result of the above...
All the secondaries freeze ! orz...
Online indexing is completely useless, even in the latest version (2.0.2)
30
Present indexing ( default )
31-33
Index problem
Present indexing ( default )
[Diagram sequence: a batch saves to the Primary while three Secondaries serve the clients. On ensureIndex, the Primary locks and starts indexing, so the batch cannot write. When the Primary finishes, all three Secondaries SYNC and start indexing at the same time, each locked, so the clients cannot read !!]
34
Index problem
Ideal indexing ( default )
[Diagram: the Primary and every Secondary reach "Complete" without ever blocking the batch or the clients.]
35
Present indexing ( background )
36-40
Index problem
Present indexing ( background )
[Diagram sequence: on ensureIndex(background) the Primary merely slows down while indexing, so the batch only slows down. But when the Primary finishes, the Secondaries still SYNC and build the index in the foreground at the same time, each locked, so the clients cannot read !! Background indexing doesn't work on the secondaries.]
41
Index problem
Ideal indexing ( background )
[Diagram: the Primary and every Secondary reach "Complete" without ever blocking the clients.]
42
Probable 2.1.X indexing
43
Index problem
According to mongodb.org, this problem will be fixed in 2.1.0.
But it is not formally released yet,
so I checked out the up-to-date source code.
It certainly will be fixed !
Moreover, it sounds like indexing will run in the foreground
when the slave status isn't SECONDARY
(that means RECOVERING).
44-47
Index problem
Probable 2.1.X indexing
[Diagram sequence: on ensureIndex(background) the Primary slows down while indexing and the batch slows down too. When the Primary finishes, each Secondary SYNCs and likewise only slows down while indexing in the background, until every node reaches "Complete".]
48
Index problem
Background indexing in 2.1.X
But I think it's not enough.
I think it can be fatal for a system
that all the secondaries slow down at the same time !!
So...
49
Ideal indexing
50-55
Index problem
Ideal indexing
[Diagram sequence: the Primary indexes in the background, only slowing down; then the Secondaries build their indexes one at a time. Each one goes into RECOVERING, indexes while the other Secondaries keep serving the clients, and comes back "Complete", until all nodes are done.]
56
Index problem
It would be great if I could trigger indexing manually
on each secondary.
But... I can easily guess it's difficult to fit into the current Oplog.
57
I suggest Manual indexing
58-68
Index problem
Manual indexing
[Diagram sequence: ensureIndex(manual, background) makes the Primary index in the background (the batch only slows down) while the Secondaries do NOT sync the index automatically. Then ensureIndex(manual) is issued on each Secondary one at a time: it goes into RECOVERING, indexes, and comes back "Complete" while the others keep serving the clients. It also needs to support a background mode, just in case the ReplSet has only one Secondary left, until every node reaches "Complete".]
69
That's all about the indexing problem
70
Struggle to control the sync
71
STALE
72
Unknown log & ReplSet out of control
We often suffered from the Secondaries going out of control...
• Secondaries changed status repeatedly, flipping in a moment
between Secondary and Recovering (1.8.0)
• Then we found a strange line in the log...
[rsSync] replSet error RS102 too stale to catch up
73
What's Stale ?
stale [stéil] (level: must-know for working adults) powered by goo.ne.jp
• (of food or drink) not fresh (⇔ fresh);
• flat (of a drink), having lost its aroma (of coffee),
• dried out and hard (of bread),
• stuffy (of air or a smell),
• smelling bad
74
What's Stale ?
Apparently it's something very bad for us...
75
Mechanism of being stale
76
ReplicaSet
[Diagram: a Client talks to the Primary mongod, which has a Database and an Oplog; the Secondary mongod, with its own Database and Oplog, replicates from it.]
77
Replication (simple case)
79-80
Insert & Replication 1
[Diagram: the Client inserts A. The Primary writes A to its Database and records "Insert A" in its Oplog. The Secondary syncs: it applies "Insert A" to its Database and records it in its own Oplog.]
81
Replication (busy case)
82-87
Insert & Replication 2
[Diagram sequence: starting from a synced state (both nodes hold A and "Insert A" in their Oplogs), the Client issues Insert B, Insert C, and Update A in quick succession. The Primary applies each one and appends it to its Oplog while the Secondary lags behind. The Secondary then checks the Primary's Oplog, finds its own last entry ("Insert A") still present, and syncs: it replays Insert B, Insert C, and Update A, catching up completely.]
88
Replication (more busy)
89-96
Stale
[Diagram sequence: again starting from a synced state, the Client issues Insert B, Insert C, Update A, Update C, and Insert D faster than the Secondary can replay them. The Primary's Oplog has a fixed size, so the oldest entry ("Insert A") is discarded to make room. When the Secondary finally checks the Primary's Oplog, it looks for its own last entry ("Insert A") and does not find it. It cannot get the information about "Insert B" and the rest, so it cannot sync and goes into RECOVERING. This is called STALE.]
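The mechanism above can be sketched as a tiny simulation: the oplog behaves like a capped ring buffer, and a secondary can only sync if its last applied entry is still present in the primary's oplog. This is a toy model of the idea, not MongoDB's actual implementation:

```javascript
// Toy model of oplog-based replication and staleness.
class Primary {
  constructor(oplogCapacity) {
    this.oplog = [];               // capped: oldest entries fall off
    this.capacity = oplogCapacity;
  }
  apply(op) {
    this.oplog.push(op);
    if (this.oplog.length > this.capacity) this.oplog.shift();
  }
}

// A secondary is stale when its last applied op is no longer in the
// primary's oplog: the gap in between can never be replayed.
function trySync(primary, lastAppliedOp) {
  const idx = primary.oplog.indexOf(lastAppliedOp);
  if (idx === -1) return { status: 'STALE', replayed: [] };
  return { status: 'SECONDARY', replayed: primary.oplog.slice(idx + 1) };
}

const p = new Primary(3);
p.apply('Insert A');
const lastApplied = 'Insert A';    // the secondary synced up to here

// A burst of writes overflows the capped oplog...
['Insert B', 'Insert C', 'Update A', 'Update C'].forEach(op => p.apply(op));

console.log(trySync(p, lastApplied).status); // 'STALE': 'Insert A' fell off
```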
97
Stale
We have to understand the importance of adjusting the oplog size.
We can specify the oplog size as a command-line option,
but it only takes effect the first time mongod starts
on a given dbpath (also specified on the command line).
We cannot change the oplog size afterwards
without clearing the dbpath.
Be careful !
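A rough sizing rule: the oplog must hold more write traffic than the longest lag you want a secondary to survive. A back-of-the-envelope helper (the write rate here is illustrative, not a figure from the talk; only the 1.2GB oplog size is):

```javascript
// How long can a secondary lag before it goes stale?
// retention (sec) ≈ oplog size / average oplog write rate.
function oplogRetentionSeconds(oplogSizeBytes, writeBytesPerSec) {
  return oplogSizeBytes / writeBytesPerSec;
}

// Example: the talk's 1.2GB oplog, assuming 1KB docs written at 200 docs/sec
const retention = oplogRetentionSeconds(1.2e9, 1024 * 200);
console.log((retention / 3600).toFixed(1) + ' hours'); // 1.6 hours
```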
98
Replication (Join as a new node)
99-107
InitialSync
[Diagram sequence: a new node starts up with an empty Database and Oplog and goes into RECOVERING. It first gets the Primary's last Oplog entry ("Insert D") as a checkpoint, then clones the whole Database (A, B, C, D) while the Primary keeps taking writes (Insert E, Update B). When cloning completes, it checks the Primary's Oplog from the checkpoint, replays everything after it (Insert E, Update B), and becomes SECONDARY.]
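The initial-sync steps above (remember a checkpoint, clone the data, then replay the oplog tail) can be sketched as follows. This is a toy model; real initial sync also has to cope with updates and deletes applied during the clone:

```javascript
// Toy model of initial sync: snapshot + oplog replay from a checkpoint.
const primaryDb = ['A', 'B', 'C', 'D'];
const primaryOplog = ['Insert C', 'Update A', 'Update C', 'Insert D'];

// 1. Remember the primary's last oplog entry as a checkpoint.
const checkpoint = primaryOplog[primaryOplog.length - 1];

// 2. Clone the database (meanwhile the primary keeps taking writes).
const newNodeDb = primaryDb.slice();
primaryDb.push('E');
primaryOplog.push('Insert E', 'Update B');

// 3. Replay every oplog entry after the checkpoint.
const tail = primaryOplog.slice(primaryOplog.indexOf(checkpoint) + 1);
for (const op of tail) {
  const [kind, doc] = op.split(' ');
  if (kind === 'Insert') newNodeDb.push(doc);
  // An Update is applied in place; the clone already holds that doc.
}

console.log(newNodeDb); // [ 'A', 'B', 'C', 'D', 'E' ]
```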
108
Additional information
A Secondary will try to sync from another Secondary
when it cannot reach the Primary, or
might be stale against the Primary.
So there is a chance the sync problem does not occur, if a
secondary holds older Oplog entries, or a larger Oplog space, than the Primary.
(From reading the source code; I've never tested this...)
109-112
Sync from another secondary
[Diagram sequence: a Secondary checks the Primary's Oplog for its last applied entry ("Insert A") and does not find it, so against the Primary it would be stale. But another Secondary still has "Insert A" in its Oplog, so it is able to sync from that Secondary instead, replaying Insert B, Insert C, Update A, Update C, and Insert D.]
113
That’s all about sync
114
Others...
115
Disk space
116
Disk space
Data fragments sparsely across the DB files...
We met this unfavorable circumstance in our DBs.
It appeared in some of our collections
around 3 months after we launched the services:
db.ourcol.storageSize() = 16200727264 (15GB)
db.ourcol.totalSize() = 16200809184
db.ourcol.totalIndexSize() = 81920
db.ourcol.dataSize() = 2032300 (2MB)
What's happening to them !!
117
Disk space
Data fragments sparsely across the DB files...
It seems to be caused by specific access patterns
that insert, update, and delete over and over.
Anyway, we have to shrink the used disk space regularly,
just like PostgreSQL's vacuum.
But how do we do it ?
118
Disk space
Shrinking the used disk space
MongoDB offers some functions for this case,
but we couldn't use them in our case !
repairDatabase:
Only runnable on the Primary.
It needs a long time and BLOCKS all operations !!
compact:
Only runnable on a Secondary.
It zero-fills the blank space instead of shrinking the disk space,
so it cannot shrink...
119
Disk space
Our measures
For temporary collections:
Issue a drop command regularly.
For other collections:
1. Remove one secondary from the ReplSet.
2. Shut it down.
3. Remove all its DB files.
4. Rejoin it to the ReplSet.
5. Repeat these operations on each secondary, one after another.
6. Step down the Primary. (Change the Primary node)
7. At last, do operations 1-4 on the prior Primary.
120
PHP client
121
PHP client
We tried 1.4.4 and 1.2.2
1.4.4:
There are some critical bugs around the connection pool.
We struggled to invalidate broken connections.
I think you should use 1.2.X instead of 1.4.X.
1.2.2:
It seems to be fixed around the connection pool,
but there are 2 critical bugs !
– Socket handle leak
– Useless sleep
However, this version is relatively stable
as long as you fix these bugs.
122
PHP client
We tried 1.4.4 and 1.2.2
https://github.com/crumbjp/Personal
- mongo1.2.2.non-wait.patch
- mongo1.2.2.sock-leak.patch
123
PHP client
124
Closing
125
Closing
What's MongoDB ?
It has very good READ performance.
We can use mongo instead of memcached,
if we can accept the limited write performance.
Die hard !
MongoDB has high availability, even under severe stress.
It can be used easily, without deep consideration.
We can manage to do anything after getting started.
Let's forget the awkward trivial things that have bothered us:
How to handle the huge data ?
How to put in the cache system ?
How to keep the availability ?
And so on ....
126
Closing
Keep in mind
Sharding is challenging...
It's a last resort !
It's hard to operate; in particular, maintaining the config servers.
[Mongos] is also difficult to keep alive.
I want a way to fail over Mongos.
Mongo is able to run in a poor environment, but...
you should set aside plenty of disk space.
Heavy writes are sensitive:
adjust the oplog size carefully.
The indexing function is unfinished:
indexes cannot be applied online.
127
All right, have fun !!
128
Thank you for listening