1 Rakuten, Inc. Development Unit, Architect Group, Hiroaki Kubota | January 18, 2012
The performance of a KVS,
the indexes of an RDBMS,
and MapReduce on top:
an all-in-one NoSQL
2
Agenda
• Introduction
• How we use MongoDB on news.infoseek.co.jp
3
Introduction
4
Who am I ?
5
Profile
Name: Hiroaki Kubota (窪田 博昭)
Company: Rakuten Inc.
Unit: ACT = Development Unit Architect Group
Mail: [email protected]
Hobby: Futsal, Golf
Recent: My physical strength has gradually declined...
twitter: crumbjp
github: crumbjp
6
How we take advantage of MongoDB
for Infoseek News
7
An example of our pages
8
Page structure
9
Layout / Components
[Screenshot: the page layout and its components]
10
Albatross structure
[Diagram: requests come from the Internet to the WEB servers, which get the page layout from the LayoutDB (MongoDB ReplSet), call the API servers, and retrieve the data. The API servers get components from the ContentsDB (MongoDB ReplSet). Sessions are kept in the SessionDB (MongoDB ReplSet) and Memcache.]
11
Albatross structure
[Diagram: through the CMS, a developer sets the page layout (HTML markup & API settings) into the LayoutDB (MongoDB ReplSet), deploys APIs to the API servers, and sets components into the ContentsDB (MongoDB ReplSet). Batch servers insert the data.]
12-14
CMS
Layout editor
[Screenshots of the CMS layout editor]
15
MapReduce
16
Our usage
We have never used MapReduce in regular operation.
However, we have used it for some irregular cases:
• To find the invalid articles that should be removed because of someone's mistake...
• To analyze the number of new articles posted per day.
• To analyze how many times an article has been updated.
• Before long, we will start considering using it regularly for social-data analysis...
17
Structure & Performance
18
Structure
• Intel(R) Xeon(R) CPU X5650 2.67GHz 1core!!
• 4GB memory
• 50 GB disk space ( iScsi )
• CentOS5.5 64bit
• mongodb 1.8.0
– ReplicaSet 5 nodes ( + 1 Arbiter)
– Oplog size 1.2GB
– Average object size 1KB
We are using very poor (virtual) machines !!
19
Structure
We've also benchmarked the following environments...
• Virtual machine 1 core
– 1kb data , 6,000,000 documents
– 8kb data , 200,000 documents
• Virtual machine 3 core
– 1kb data , 6,000,000 documents
– 8kb data , 200,000 documents
• EC2 large instance
– 2kb data , 60,000,000 documents. ( 100GB )
Researched environments
20
Performance
1~8 kb documents + 1 unique index
I found a formula for a rough estimation of QPS:
C = number of CPU cores (Xeon 2.67 GHz)
DD = score of the 'dd' command (bytes/sec)
S = document size (bytes)
• GET qps = 4500 × C
• SET(fsync) bytes/s = 0.05 × DD ÷ S
• SET(nsync) qps = 4500, BUT...
there is a chance of becoming STALE
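As a sanity check, the estimation formulas above can be turned into a tiny calculator. This is a sketch: the constants 4500 and 0.05 are the empirical values from this benchmark, not universal truths, and the function names are mine.

```javascript
// Rough QPS estimation from the slide's empirical formulas.
// cores: CPU cores; ddBytesPerSec: sequential write speed measured with `dd`;
// docSizeBytes: average document size.
function estimateGetQps(cores) {
  return 4500 * cores; // empirical constant from the benchmark
}

function estimateFsyncSetBytesPerSec(ddBytesPerSec) {
  return 0.05 * ddBytesPerSec; // fsync'd writes reach ~5% of raw disk speed
}

function estimateFsyncSetQps(ddBytesPerSec, docSizeBytes) {
  return estimateFsyncSetBytesPerSec(ddBytesPerSec) / docSizeBytes;
}

// Example: 1 core, dd = 100 MB/s, 1 KB documents
console.log(estimateGetQps(1));                // 4500
console.log(estimateFsyncSetQps(100e6, 1024)); // 4882.8125 docs/sec
```

For the 3-core virtual machine in the previous slide, the same formula predicts 13500 GET qps.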
21
Performance example (on EC2 large)
22
Performance example (on EC2 large)
Environment and amount of data
EC2 large instance
– 2kb data, 60,000,000 documents (100GB)
– 1 unique index
Data-type
{
  shop: 'someone',
  item: 'something',
  description: 'item explanation sentences...'
}
23
Performance example (on EC2 large)
Batch insert (1000 documents per batch), fsync=true
17906 sec (= 298 min) (= 3358 docs/sec)
Ensure index (background=false)
4049 sec (= 67 min)
1. primary: 2101 sec (= 35 min)
2. secondary: 1948 sec (= 32 min)
24
Performance example (on EC2 large)
Add one node
5833 sec (= 97 min)
1. Get files (2GB × 48): 2120 sec (= 35 min)
2. _id indexing: 1406 sec (= 23 min)
3. unique indexing: 2251 sec (= 38 min)
4. other processes: 56 sec (= 1 min)
25
Performance example (on EC2 large)
Group by
• Reduce by unique index & map & reduce
– 368 msec
db.data.group({
  key: { shop: 1 },
  cond: { shop: 'someone' },
  reduce: function (o, p) { p.sum++; },
  initial: { sum: 0 }
});
26
Performance example (on EC2 large)
MapReduce
• Scan all data: 3116 sec (= 52 min)
– number of keys = 39092
db.data.mapReduce(
  function () { emit(this.shop, 1); },
  function (k, v) {
    var ret = 0;
    v.forEach(function (value) { ret += value; });
    return ret;
  },
  { query: {}, out: { inline: 1 } });
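To see what the shop-count MapReduce above actually does, the map and reduce phases can be mimicked in plain JavaScript. This is a local simulation with made-up sample documents, not the server-side execution:

```javascript
// Local simulation of the shop-count MapReduce above.
const docs = [
  { shop: 'someone', item: 'a' },
  { shop: 'someone', item: 'b' },
  { shop: 'other', item: 'c' },
];

// Map phase: emit (shop, 1) per document; the server groups emits by key.
const emitted = new Map();
for (const doc of docs) {
  if (!emitted.has(doc.shop)) emitted.set(doc.shop, []);
  emitted.get(doc.shop).push(1);
}

// Reduce phase: the same reduce function as in the slide, summing per key.
function reduce(k, v) {
  let ret = 0;
  v.forEach(function (value) { ret += value; });
  return ret;
}

const result = {};
for (const [shop, values] of emitted) {
  result[shop] = reduce(shop, values);
}
console.log(result); // { someone: 2, other: 1 }
```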
27
Major problems...
28
Indexing
29
Index problem
Indexing is a blocking operation by default.
An indexing operation can run in the background
on the primary. But...
It CANNOT run in the background on a secondary.
Moreover, all the secondaries run their indexing
at the same time !!
The result of the above...
All the secondaries freeze ! orz...
Online indexing is completely useless, even in the latest version (2.0.2)
30
Present indexing ( default )
31-33
Index problem
Present indexing ( default )
[Diagram sequence: a batch saves to the Primary while three Secondaries serve the clients. On ensureIndex, the Primary locks and starts indexing, so the batch cannot write. When the Primary finishes, all three Secondaries SYNC and start indexing at the same time, each locked, so the clients cannot read !!]
34
Index problem
Ideal indexing ( default )
[Diagram: the Primary and every Secondary reach "Complete" without ever blocking the batch or the clients.]
35
Present indexing ( background )
36-40
Index problem
Present indexing ( background )
[Diagram sequence: on ensureIndex(background) the Primary merely slows down while indexing, so the batch only slows down. But when the Primary finishes, the Secondaries still SYNC and build the index in the foreground at the same time, each locked, so the clients cannot read !! Background indexing doesn't work on the secondaries.]
41
Index problem
Ideal indexing ( background )
[Diagram: the Primary and every Secondary reach "Complete" without ever blocking the clients.]
42
Probable 2.1.X indexing
43
Index problem
According to mongodb.org, this problem will be fixed in 2.1.0.
But it is not formally released yet,
so I checked out the up-to-date source code.
It certainly will be fixed !
Moreover, it sounds like indexing will run in the foreground
when the slave status isn't SECONDARY
(that means RECOVERING).
44-47
Index problem
Probable 2.1.X indexing
[Diagram sequence: on ensureIndex(background) the Primary slows down while indexing and the batch slows down too. When the Primary finishes, each Secondary SYNCs and likewise only slows down while indexing in the background, until every node reaches "Complete".]
48
Index problem
Background indexing in 2.1.X
But I think it's not enough.
I think it can be fatal for a system
that all the secondaries slow down at the same time !!
So...
49
Ideal indexing
50-55
Index problem
Ideal indexing
[Diagram sequence: the Primary indexes in the background, only slowing down; then the Secondaries build their indexes one at a time. Each one goes into RECOVERING, indexes while the other Secondaries keep serving the clients, and comes back "Complete", until all nodes are done.]
56
Index problem
It would be great if I could trigger indexing manually
on each secondary.
But... I can easily guess it's difficult to fit into the current Oplog.
57
I suggest Manual indexing
58-68
Index problem
Manual indexing
[Diagram sequence: ensureIndex(manual, background) makes the Primary index in the background (the batch only slows down) while the Secondaries do NOT sync the index automatically. Then ensureIndex(manual) is issued on each Secondary one at a time: it goes into RECOVERING, indexes, and comes back "Complete" while the others keep serving the clients. It also needs to support a background mode, just in case the ReplSet has only one Secondary left, until every node reaches "Complete".]
69
That's all about the indexing problem
70
Struggle to control the sync
71
STALE
72
Unknown log & ReplSet out of control
We often suffered from the Secondaries going out of control...
• Secondaries changed status repeatedly, flipping in a moment
between Secondary and Recovering (1.8.0)
• Then we found a strange line in the log...
[rsSync] replSet error RS102 too stale to catch up
73
What's Stale ?
stale [stéil] (level: must-know for working adults) powered by goo.ne.jp
• (of food or drink) not fresh (⇔ fresh);
• flat (of a drink), having lost its aroma (of coffee),
• dried out and hard (of bread),
• stuffy (of air or a smell),
• smelling bad
74
What's Stale ?
Apparently it's something very bad for us...
75
Mechanism of being stale
76
ReplicaSet
[Diagram: a Client talks to the Primary mongod, which has a Database and an Oplog; the Secondary mongod, with its own Database and Oplog, replicates from it.]
77
Replication (simple case)
79-80
Insert & Replication 1
[Diagram: the Client inserts A. The Primary writes A to its Database and records "Insert A" in its Oplog. The Secondary syncs: it applies "Insert A" to its Database and records it in its own Oplog.]
81
Replication (busy case)
82-87
Insert & Replication 2
[Diagram sequence: starting from a synced state (both nodes hold A and "Insert A" in their Oplogs), the Client issues Insert B, Insert C, and Update A in quick succession. The Primary applies each one and appends it to its Oplog while the Secondary lags behind. The Secondary then checks the Primary's Oplog, finds its own last entry ("Insert A") still present, and syncs: it replays Insert B, Insert C, and Update A, catching up completely.]
88
Replication (more busy)
89-96
Stale
[Diagram sequence: again starting from a synced state, the Client issues Insert B, Insert C, Update A, Update C, and Insert D faster than the Secondary can replay them. The Primary's Oplog has a fixed size, so the oldest entry ("Insert A") is discarded to make room. When the Secondary finally checks the Primary's Oplog, it looks for its own last entry ("Insert A") and does not find it. It cannot get the information about "Insert B" and the rest, so it cannot sync and goes into RECOVERING. This is called STALE.]
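The mechanism above can be sketched as a tiny simulation: the oplog behaves like a capped ring buffer, and a secondary can only sync if its last applied entry is still present in the primary's oplog. This is a toy model of the idea, not MongoDB's actual implementation:

```javascript
// Toy model of oplog-based replication and staleness.
class Primary {
  constructor(oplogCapacity) {
    this.oplog = [];               // capped: oldest entries fall off
    this.capacity = oplogCapacity;
  }
  apply(op) {
    this.oplog.push(op);
    if (this.oplog.length > this.capacity) this.oplog.shift();
  }
}

// A secondary is stale when its last applied op is no longer in the
// primary's oplog: the gap in between can never be replayed.
function trySync(primary, lastAppliedOp) {
  const idx = primary.oplog.indexOf(lastAppliedOp);
  if (idx === -1) return { status: 'STALE', replayed: [] };
  return { status: 'SECONDARY', replayed: primary.oplog.slice(idx + 1) };
}

const p = new Primary(3);
p.apply('Insert A');
const lastApplied = 'Insert A';    // the secondary synced up to here

// A burst of writes overflows the capped oplog...
['Insert B', 'Insert C', 'Update A', 'Update C'].forEach(op => p.apply(op));

console.log(trySync(p, lastApplied).status); // 'STALE': 'Insert A' fell off
```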
97
Stale
We have to understand the importance of adjusting the oplog size.
We can specify the oplog size as a command-line option,
but it only takes effect the first time mongod starts
on a given dbpath (also specified on the command line).
We cannot change the oplog size afterwards
without clearing the dbpath.
Be careful !
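A rough sizing rule: the oplog must hold more write traffic than the longest lag you want a secondary to survive. A back-of-the-envelope helper (the write rate here is illustrative, not a figure from the talk; only the 1.2GB oplog size is):

```javascript
// How long can a secondary lag before it goes stale?
// retention (sec) ≈ oplog size / average oplog write rate.
function oplogRetentionSeconds(oplogSizeBytes, writeBytesPerSec) {
  return oplogSizeBytes / writeBytesPerSec;
}

// Example: the talk's 1.2GB oplog, assuming 1KB docs written at 200 docs/sec
const retention = oplogRetentionSeconds(1.2e9, 1024 * 200);
console.log((retention / 3600).toFixed(1) + ' hours'); // 1.6 hours
```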
98
Replication (Join as a new node)
99-107
InitialSync
[Diagram sequence: a new node starts up with an empty Database and Oplog and goes into RECOVERING. It first gets the Primary's last Oplog entry ("Insert D") as a checkpoint, then clones the whole Database (A, B, C, D) while the Primary keeps taking writes (Insert E, Update B). When cloning completes, it checks the Primary's Oplog from the checkpoint, replays everything after it (Insert E, Update B), and becomes SECONDARY.]
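The initial-sync steps above (remember a checkpoint, clone the data, then replay the oplog tail) can be sketched as follows. This is a toy model; real initial sync also has to cope with updates and deletes applied during the clone:

```javascript
// Toy model of initial sync: snapshot + oplog replay from a checkpoint.
const primaryDb = ['A', 'B', 'C', 'D'];
const primaryOplog = ['Insert C', 'Update A', 'Update C', 'Insert D'];

// 1. Remember the primary's last oplog entry as a checkpoint.
const checkpoint = primaryOplog[primaryOplog.length - 1];

// 2. Clone the database (meanwhile the primary keeps taking writes).
const newNodeDb = primaryDb.slice();
primaryDb.push('E');
primaryOplog.push('Insert E', 'Update B');

// 3. Replay every oplog entry after the checkpoint.
const tail = primaryOplog.slice(primaryOplog.indexOf(checkpoint) + 1);
for (const op of tail) {
  const [kind, doc] = op.split(' ');
  if (kind === 'Insert') newNodeDb.push(doc);
  // An Update is applied in place; the clone already holds that doc.
}

console.log(newNodeDb); // [ 'A', 'B', 'C', 'D', 'E' ]
```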
108
Additional information
A Secondary will try to sync from another Secondary
when it cannot reach the Primary, or
might be stale against the Primary.
So there is a chance the sync problem does not occur, if a
secondary holds older Oplog entries, or a larger Oplog space, than the Primary.
(From reading the source code; I've never tested this...)
109-112
Sync from another secondary
[Diagram sequence: a Secondary checks the Primary's Oplog for its last applied entry ("Insert A") and does not find it, so against the Primary it would be stale. But another Secondary still has "Insert A" in its Oplog, so it is able to sync from that Secondary instead, replaying Insert B, Insert C, Update A, Update C, and Insert D.]
113
That’s all about sync
114
Others...
115
Disk space
116
Disk space
Data fragments sparsely across the DB files...
We met this unfavorable circumstance in our DBs.
It appeared in some of our collections
around 3 months after we launched the services:
db.ourcol.storageSize() = 16200727264 (15GB)
db.ourcol.totalSize() = 16200809184
db.ourcol.totalIndexSize() = 81920
db.ourcol.dataSize() = 2032300 (2MB)
What's happening to them !!
117
Disk space
Data fragments sparsely across the DB files...
It seems to be caused by specific access patterns
that insert, update, and delete over and over.
Anyway, we have to shrink the used disk space regularly,
just like PostgreSQL's vacuum.
But how do we do it ?
118
Disk space
Shrinking the used disk space
MongoDB offers some functions for this case,
but we couldn't use them in our case !
repairDatabase:
Only runnable on the Primary.
It needs a long time and BLOCKS all operations !!
compact:
Only runnable on a Secondary.
It zero-fills the blank space instead of shrinking the disk space,
so it cannot shrink...
119
Disk space
Our measures
For temporary collections:
Issue a drop command regularly.
For other collections:
1. Remove one secondary from the ReplSet.
2. Shut it down.
3. Remove all its DB files.
4. Rejoin it to the ReplSet.
5. Repeat these operations on each secondary, one after another.
6. Step down the Primary. (Change the Primary node)
7. At last, do operations 1-4 on the prior Primary.
120
PHP client
121
PHP client
We tried 1.4.4 and 1.2.2
1.4.4:
There are some critical bugs around the connection pool.
We struggled to invalidate broken connections.
I think you should use 1.2.X instead of 1.4.X.
1.2.2:
It seems to be fixed around the connection pool,
but there are 2 critical bugs !
– Socket handle leak
– Useless sleep
However, this version is relatively stable
as long as you fix these bugs.
122
PHP client
We tried 1.4.4 and 1.2.2
https://github.com/crumbjp/Personal
- mongo1.2.2.non-wait.patch
- mongo1.2.2.sock-leak.patch
123
PHP client
124
Closing
125
Closing
What's MongoDB ?
It has very good READ performance.
We can use mongo instead of memcached,
if we can accept the limited write performance.
Die hard !
MongoDB has high availability, even under severe stress.
It can be used easily, without deep consideration.
We can manage to do anything after getting started.
Let's forget the awkward trivial things that have bothered us:
How to handle the huge data ?
How to put in the cache system ?
How to keep the availability ?
And so on ....
126
Closing
Keep in mind
Sharding is challenging...
It's a last resort !
It's hard to operate; in particular, maintaining the config servers.
[Mongos] is also difficult to keep alive.
I want a way to fail over Mongos.
Mongo is able to run in a poor environment, but...
you should set aside plenty of disk space.
Heavy writes are sensitive:
adjust the oplog size carefully.
The indexing function is unfinished:
indexes cannot be applied online.
127
All right, have fun !!
128
Thank you for listening