Clustering: For Geeks... & for Normal People Too!
George Chiesa <[email protected]>
Daniel Nashed <[email protected]>
Copyright 2000-2006 by George Chiesa and dotNSF, Inc - ALL RIGHTS RESERVED
It is kindly requested that this presentation is NOT publicly posted, see "license" slide.
This presentation was not researched nor conceived at the British Library (BL.uk)
This is bubble-bath-ware!
License: You have a limited license to this presentation.
Copyright 2000-2006 dotNSF and its suppliers. This presentation is non-exclusively LICENSED to you for internal usage within your own entity, company or organization.
For fair-usage purposes, please quote the source as "Bubble-Bath Ideas presentation at DNUG 2006, by G. Chiesa and D. Nashed"
We request this presentation NOT to be publicly reposted, please!
Public abstracts will be posted at http://dotNSF.com & http://nashcom.de
Disclaimers: NO Proofs...
This presentation is based upon empirical info
Observed behaviours, features, bugs, beyond...
I can NOT prove many of the hypotheses here
Please accept these pearls of wisdom "as is"
Some of this information may be obsolete soon
but it's useful to know what the state of the art is
We ALWAYS report security issues to IBM in private.
and no, we will not discuss security bugs (all fixed :-)
Ok, just one hack from a redbook where I wrote something in...
Download and get this redbook: SG24-7017 Lotus Security Handbook (2004)
Hint: Firefox's "Modify Headers" extension (free)
to "automagically" provide a better and cheaper servICE (not serVER)
In some cases,
thinking quite outside of the box
pushing the product to the limits !
The 50/50 rule/s:
50% of what you KNOW about clusters...
is quite useless!
50% of what you don't know about clusters
is quite useful!!!
Value Proposition: 50% + 50% = 100%
50% of DDTs (Don't Do That!)
And 50% of DO this!
What we're covering today
60' version of a much longer workshop...
What is called "1352 Native Clustering"
Which pieces are client/server based
How each major piece works "per se"
How to make the puzzle work for you
About questions...
IT IS "OK" (not impolite)
To interrupt...
to ASK questions...
à la easyJet...
"within reason" :-)
We reserve the right to postpone the answers, but, when in doubt, raise hand!
100% of what you do not understand can, and WILL probably hurt you!
Once upon a time... last millennium...
The STATE of the ART in 1995...
was THIN Ethernet (Ethernet 10, as in 10 Mb/s)
if you were an IBM SHOP, you had Token Ring (TR 4/16)
Each adapter had one and only one address
And in 1995 LOTUS was already shipping
Clustering and Failover embedded in Notes 4.01
(at the time called NPN = Notes Public Networks)
So a LOT within Notes has a strong LEGACY.
So, we're going to provoke your brain to think!
Server Configured in 1995...
This is the MOST controversial!
If I were you I would use...
JUST ONE TCPIP NOTES PORT
You can still have as many addresses as you like
You can still listen to 0.0.0.0 in notes.ini
You can still have complex TCP/IP routing tables
YOU DO NOT NEED THE EXTRA LOGIC
of Notes trying to cope with Ethernet 10
and just one IP address per physical card.
K.I.S.S. (at the Notes/Domino Layer!!!)
Stay awake, more controversy to come...
Listen... (Bonus HACK): ( 42 443 )
This time the answer is not 42 ;-) but instead: 443!
You can specify what you are "listening to"
You must understand netstat -an | find "LISTEN"
If you bind addresses you will listen on just those, BUT
You CAN specify "0.0.0.0" as a specific address!
You can use this to listen to all addresses at a port
Example: You can set a Notes server to
also listen for NRPC on port 443 on 0.0.0.0
this is a useful hack when you are behind a proxy
and want to access your home server
and the proxy only allows access to ports 80 and 443
for port 443, proxies use the transparent "CONNECT method"
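To check what a server is actually listening on, the output of netstat -an can be filtered programmatically. The following is a minimal sketch, assuming the Windows column layout (protocol, local address, foreign address, state); the sample text and the function name are made up for illustration.

```python
def listening_endpoints(netstat_output):
    """Return (address, port) pairs for TCP sockets in a listening state.

    Assumes the Windows `netstat -an` layout:
    protocol, local address, foreign address, state.
    """
    endpoints = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].upper().startswith("TCP"):
            continue
        if not fields[-1].upper().startswith("LISTEN"):
            continue
        # Split on the last colon so the address part stays intact.
        address, _, port = fields[1].rpartition(":")
        if port.isdigit():
            endpoints.append((address, int(port)))
    return endpoints


# Illustrative sample, not captured from a real server.
SAMPLE = """\
  TCP    0.0.0.0:1352     0.0.0.0:0        LISTENING
  TCP    0.0.0.0:443      0.0.0.0:0        LISTENING
  TCP    10.0.0.5:1352    10.0.0.9:50211   ESTABLISHED
"""

print(listening_endpoints(SAMPLE))
```

An address of 0.0.0.0 in the result means the port is open on every address of the machine, which is what the hack relies on.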
This is how I connect to my server
When visiting customers
that use HTTP proxies and do not allow direct access to port 1352.
If the customer agrees to allow me to connect to my own server while at their premises... using their proxy
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
HACK! How does that work?
In my server's Notes.ini
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
Voilà: I can connect using the HTTP proxy "transparent CONNECT method" to port 443
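The reason port 443 works is that an HTTP proxy handles it with the CONNECT method: the client asks for a raw tunnel and, once the proxy answers with a 2xx status, bytes are relayed in both directions without inspection. A minimal sketch of that handshake follows; the host name is a placeholder and the helper names are hypothetical, not part of Notes or of any proxy product.

```python
def build_connect_request(host, port=443):
    """Build the HTTP CONNECT request that asks a proxy to open a raw tunnel."""
    target = f"{host}:{port}"
    return (
        f"CONNECT {target} HTTP/1.1\r\n"
        f"Host: {target}\r\n"
        "\r\n"
    ).encode("ascii")


def tunnel_established(status_line):
    """True when the proxy's first response line reports a 2xx status."""
    parts = status_line.split()
    return len(parts) >= 2 and parts[1].isdigit() and 200 <= int(parts[1]) < 300


# "home.example.com" is a placeholder for the server listening on 0.0.0.0:443.
print(build_connect_request("home.example.com"))
print(tunnel_established("HTTP/1.1 200 Connection established"))
```

After a successful reply the proxy no longer cares what protocol flows through the tunnel, so NRPC traffic can pass as if it were any other connection to port 443.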
Cluster Aware "1352" Notes Clients:
a.k.a. Cluster-READY clients
Definition:
A Notes Client is said to be cluster-aware when it will perform custom logic to transparently and automatically fail-over from one server to another, upon server directive or LACK of reply
QUIZ:
what % of Notes Clients are CLUSTER Aware?
hint: what was the first version of Cluster Aware Notes client?
If I told you Notes 4.01 was the first one...
Cluster.NCF (client side)
Servers also use it to connect to other servers!
Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
Time=03/01/2002 16:18:24 (80256B36:0059935B)
TheConifers.com
CN=dotNSF.TheConifers.com/O=TheConifers
CN=Linux.TheConifers.com/O=TheConifers
CN=WebSphere.TheConifers.com/O=TheConifers
CN=Win2k.TheConifers.com/O=TheConifers
CN=www.TheConifers.com/O=TheConifers
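The listing above can be read mechanically. The sketch below assumes the layout is exactly what the sample shows (a Time= line, then the cluster name, then one CN= line per mate); that layout is inferred from the sample only, not from a documented file format, and the function name is hypothetical.

```python
def parse_cluster_cache(text):
    """Group 'CN=' member lines under the cluster name that precedes them."""
    clusters = {}
    current = None
    expect_name = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Time="):
            expect_name = True          # the next line names the cluster
        elif expect_name:
            current = line
            clusters[current] = []
            expect_name = False
        elif line.startswith("CN=") and current is not None:
            clusters[current].append(line)
    return clusters


SAMPLE = """\
Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
Time=03/01/2002 16:18:24 (80256B36:0059935B)
TheConifers.com
CN=dotNSF.TheConifers.com/O=TheConifers
CN=Linux.TheConifers.com/O=TheConifers
"""

for name, mates in parse_cluster_cache(SAMPLE).items():
    print(name, len(mates))
```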
Clustering
COMPLEX SET of design methodologies, techniques and heuristics
applied to "stuff"
that you can use to "make"
"n" things to be perceived as ONE bigger/better & "more reliable"
The key words of this slide are "PERCEIVED as"
NB: We're going to focus on
MultiPlatform SOFTWARE Clustering
The "i" in RAID stands for: In-Expensive
In 1987, Patterson, Gibson and Katz at the University of California Berkeley, published "A Case for Redundant Arrays of Inexpensive Disks (RAID)" . This paper described various types of disk arrays, referred to by the acronym RAID. The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array of disk drives which yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, this array of drives appears to the computer as a single logical storage unit or drive.
Perspective...
Cluster Examples: 3, 5 or 20+
Cluster.ncf: default max 2 mates TIMES 20 clusters (LKB 185700: Cluster_Name_Cache_Size=n in notes.ini)
Clustering & Failover in Action
Server QUIT while reading...
Cluster Mates:
"Mate" is an industry NON-PC (non politically correct!) std term
Definition: A cluster of something is composed of mates
Servers regularly check the state of their Cluster Mates
Portfolio techniques / Sizing heuristics
There are always 2 practical limits:
Lower:
at LEAST how many you need to reduce risk
Upper:
at MOST how many you can manage effectively
Tip: Start with 3 or 4, fine tune afterwards
but please do NOT start with 2 or 6
Class of service:
by "n" instances of resource
Say, for the purpose of example, you have 3
"whatevers": OSs, Sites, Servers, Routers, ISPs
say you name the 3 elements A, B and C
With 3 elements you can define the following
Classes of Service:
Top, simultaneously present in A+B+C
Middle, present in either: AB, AC or BC
Single, present just in A or B or C
Homework: Try the combinations for 4 units,
C(4,4) + C(4,3) + C(4,2) + C(4,1)
Nota benissimo: DO STOP AT 4 ! ! !
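The homework can be checked by enumeration. This is a small sketch that lists the placements for each class of service and adds up the binomial terms for 4 units; the function name is made up for illustration.

```python
from itertools import combinations
from math import comb


def classes_of_service(units):
    """Map each class size to the unit combinations that can host it."""
    return {
        size: ["".join(combo) for combo in combinations(units, size)]
        for size in range(len(units), 0, -1)
    }


three = classes_of_service("ABC")
print(three)   # {3: ['ABC'], 2: ['AB', 'AC', 'BC'], 1: ['A', 'B', 'C']}

# C(4,4) + C(4,3) + C(4,2) + C(4,1)
total_for_four = sum(comb(4, k) for k in range(1, 5))
print(total_for_four)   # 15
```

Going from 3 units to 4 already more than doubles the placements to keep track of (7 versus 15), which is one way to read the advice to stop at 4.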
Almost Real Time Replication...
a) we need to define how we will synchronize
Bad News: Scheduled replication is not good enough...
Some apps must be cluster-aware enabled!
Good News:
NATIVE Event/Queue Driven = CLREPL
(aka Almost Real Time)
Most apps will automatically work better
b) we still need to spread the load/access.
ClDbDir
It's a Notes database, similar to the catalog, cluster-specific (RepId depends on ClusterName)
Maintained by a server task of the same name
It's in the Enterprise Edition of Domino
Contains info about databases deployed in a cluster
Is used by Notes/Domino cluster-aware modules
to know where to push what (and what NOT to!!!)
and for "failovers": a server finds the resource elsewhere!
Like CATALOG, each server updates its OWN dbs
BEWARE: 8192 is the maximum number of useful entries; you do NOT get a warning NOR an error message!
ClDbDir (contents)
Bonus Hack: Set Config Cluster_Admin_On=1
It also works IN NON Clustered servers!
You can afterwards do:
CL DEL filename (cluster delete)
CL COPY source dest REPLICA
CL OUT database (out of service)
CL IN database (in service again)
(both work but are only meaningful in clusters)
Useful to OUT-of-service databases BEFORE adding an OLD server to a cluster
Useful for decommissioning an old server
you HAVE to add a server to get it into
the CLIENT's Cluster.NCF
From LKB: How Push-Pull (std) Replica works
[Diagram: a DATABASE (VIEW + DATA) and its REPLICA (DATA + VIEW); SERVER UPDATE on each side, with Push and Pull in both directions]
From LKB: How Push Cluster Replica works!
[Diagram: a DATABASE (VIEW + DATA) and its replica (DATA + VIEW); SERVER UPDATE on each side, with CLREPL doing Push only in each direction]
How does Cluster Replication work (details)
Document changes are captured and trigger the Cluster Replicator via a message queue
The Cluster Replicator reads the message queue and pushes changes to all other replicas in the cluster regardless of replication settings (aka almost "real time" replication)
CLREPL
CLREPL is a server task
It's an in-memory, QUEUE-driven event replicator (REMEMBER BATH TUB!)
that SHOULD push content within 15 seconds at most, 7 on average
thus ClRepl is also sometimes called RTR
or "ALMOST" REAL TIME REPLICATOR
the KEY here is in "ALMOST"
ClRepl (cont'd)
ClRepl PUSHES content modified locally to all cluster mates containing replicas of the modified database
Tips: It PUSHES ignoring the source ACL
Check that the queue is not overfilled
Always schedule CLASS+1 of them
NB: CLREPL does NOT initialize "Replica Stubs"
It also knows what to push and what NOT to push:
Out Of Service (for quite obvious reasons) but also
Pending Delete (cldbdir does the final push, not clrepl!)
ClRepl (cont'd)
ClRepl will keep an IN-memory queue
It's a QUEUE, and can be overfilled
It's in MEMORY and is NOT disk persistent
THUS, also schedule normal replicas
Tip: within reason, over-scheduling pull replicas is not a huge issue, because the deltas are small
i.e. Enabled Replica from */Srv/Whatever to <each>/Srv/Whatever, PULL, every 60 mins
Will make servers catch up fast, pulling at restart time.
TIP: SH ST REPLICA.CLUSTER.*Q* (Daniel to explain detailed stats)
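Why a scheduled pull is still needed next to an in-memory queue can be shown with a toy model. This is only an illustration of the reasoning on this slide: the class, its overflow behaviour (events simply dropped) and the dictionaries standing in for databases are all assumptions, not how CLREPL is implemented.

```python
from collections import deque


class ToyClusterQueue:
    """Toy model: a bounded in-memory event queue (hypothetical, not CLREPL)."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.pending = deque()
        self.dropped = 0

    def record_change(self, note_id):
        if len(self.pending) >= self.max_depth:
            self.dropped += 1            # overfilled: the event never gets pushed
        else:
            self.pending.append(note_id)

    def push_all(self, source, replica):
        while self.pending:
            note_id = self.pending.popleft()
            replica[note_id] = source[note_id]


def scheduled_pull(source, replica):
    """Scheduled replication: copy whatever the replica is missing or has stale."""
    pulled = 0
    for note_id, value in source.items():
        if replica.get(note_id) != value:
            replica[note_id] = value
            pulled += 1
    return pulled


source, replica = {}, {}
queue = ToyClusterQueue(max_depth=3)
for i in range(5):                       # 5 changes, but room for only 3 events
    source[i] = f"doc {i}"
    queue.record_change(i)

queue.push_all(source, replica)
print(len(replica), queue.dropped)       # 3 2
print(scheduled_pull(source, replica))   # 2
print(replica == source)                 # True
```

The pull only has to move the two missed documents, which matches the point that the deltas stay small when scheduled replication runs alongside the queue.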
Cluster Replicator Performance & Statistics
General Rule: number of clrepl = cluster members "minus" 1
My Tip: set it to CLASS_OF_SERVICE PLUS one, not minus one; over-schedule it and it's cheap, under-schedule it and you will have problems!
Check if clustering works properly via
Show Stat Replica.Cluster.*
Replica.Cluster.WorkQueueDepth should be "small", i.e. less than 10
Replica.Cluster.RetryWaiting should also be "small", i.e. less than 5
Replica.Cluster.Failed should be zero if possible (easy to say :-)
Check the Max and Average times in queue; they should be < 10 seconds
Show Stat Server.Cluster.*
Server.Cluster.OpenRedirects.xxx.Unsuccessful = 0
check for unsuccessful redirects!
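The rules of thumb for the Replica.Cluster statistics can be turned into a simple check. The sketch below assumes the console output has been captured as text with one "name = value" pair per line; the sample values and function names are invented for illustration.

```python
# Upper bounds taken from the rules of thumb above (value must stay below them).
THRESHOLDS = {
    "Replica.Cluster.WorkQueueDepth": 10,
    "Replica.Cluster.RetryWaiting": 5,
    "Replica.Cluster.Failed": 1,         # i.e. should be zero
}


def parse_stats(console_text):
    """Collect integer statistics from 'name = value' lines."""
    stats = {}
    for line in console_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value.strip())
    return stats


def cluster_warnings(stats):
    """Return one message per statistic that reaches or exceeds its bound."""
    warnings = []
    for name, limit in THRESHOLDS.items():
        value = stats.get(name)
        if value is not None and value >= limit:
            warnings.append(f"{name} = {value} (expected < {limit})")
    return warnings


# Invented sample values.
SAMPLE = """\
  Replica.Cluster.Failed = 0
  Replica.Cluster.RetryWaiting = 7
  Replica.Cluster.WorkQueueDepth = 2
"""

print(cluster_warnings(parse_stats(SAMPLE)))
```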
How to restrict access (LKB 7002910)
Domino server clusters have an optional workload balancing feature that lets you distribute the workload of heavily-used databases across multiple servers in a cluster. To distribute workload, you limit or restrict the work that a server can perform using the following settings in the NOTES.INI:
Server_Availability_Threshold
This setting allows you to specify the availability level below which the server attempts to redirect user requests to other servers in the cluster. A server's availability index is recalculated each minute and compared against any threshold you set. If the index falls below the server threshold, the server becomes BUSY. The Cluster Manager redirects access requests from a BUSY server to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives access to the BUSY server. Each time a redirection occurs, Notes generates a workload balancing event in the Notes log (LOG.NSF).
Server_MaxUsers
This setting specifies the maximum number of user sessions allowed on a server. When the server reaches this limit, the server goes into a MAXUSERS state. The Cluster Manager then attempts to redirect new user requests to other servers in the cluster. To see how often requests are being redirected, check the LOG.NSF for failover events. If redirection of the user request is unsuccessful, the user receives a message, and is not allowed access to the server.
Server_Restricted
This setting enables a server to deny new open database requests and places the server in a RESTRICTED state. Users who have active connections to databases retain their connections. The Cluster Manager attempts to redirect new requests to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives a message and is not allowed access to the server. For each redirection attempt, Notes generates a failover event in the LOG.NSF.
Note: You can use the Server_Restricted setting for any Domino server. This setting is not restricted to clusters.
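The three settings boil down to a small decision table. The sketch below restates it in code; the order in which the states are tested, the "AVAILABLE" label and the treatment of 0 as "not set" are assumptions of this sketch, not documented behaviour.

```python
def server_state(availability_index, threshold, sessions, max_users, restricted):
    """Classify a server from the three NOTES.INI settings described above."""
    if restricted:                                   # Server_Restricted
        return "RESTRICTED"
    if max_users and sessions >= max_users:          # Server_MaxUsers
        return "MAXUSERS"
    if threshold and availability_index < threshold:  # Server_Availability_Threshold
        return "BUSY"
    return "AVAILABLE"                               # label invented for this sketch


def access_when_redirect_fails(state):
    """A BUSY server still admits the user; MAXUSERS and RESTRICTED refuse."""
    return state in ("AVAILABLE", "BUSY")


state = server_state(availability_index=40, threshold=60,
                     sessions=120, max_users=500, restricted=False)
print(state, access_when_redirect_fails(state))      # BUSY True
```

The asymmetry is the useful part: a threshold only nudges users elsewhere, while MaxUsers and Restricted can lock them out if no cluster mate can take the request.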
SAI examples, un/touched
You may want to smooth this (or not)
Best Practices for Cluster Replication
Ensure you have full Manager access for LocalDomainServers as a Server group or, better, */Srv/Org as Manager of type Server in all ACLs. I prefer hardcoding OUs to groups. Works always!
Make sure all applications provide roles to give access to documents with Readers fields (remember computed Authors fields)
Give servers all rights and roles to "see" all documents
Don't use replication formulas for clustered databases
Have a scheduled replication in case some events in the CLREPL queue get lost or the server is down...
Add startup replication documents "from *" to ensure databases are up to date after a server restart
Schedule replication to the name of the cluster instead of single server names (load balancing & failover)
Cluster Replication & Database Quotas
There are issues with Database Quotas before R5.0.10
Good news:
New option in R5.0.10: CLREPL_OVERRIDE_QUOTAS=1
Domino 6 overrides quotas by default
you get the old behavior with Clrepl_Obeys_Quotas=1 (DDT)
Bad news:
If you already have this problem you need to delete the replication history and the CutOff Date to resolve existing replication problems
LotusScript can clear the replication history (db is an open NotesDatabase):
Dim rep As NotesReplication
Set rep = db.ReplicationInfo
Call rep.ClearHistory()
Call rep.Save()
But it cannot remove the CutOffDate (in most cases not needed)
Notes Named Network & Directory Assistance
Customer was using Notes Named Networks (NNN) across WAN connection
Caused unintended traffic
Directory Assistance (DA)
Multiple replicas of 4 directories were used
The first server in the list was a remote server in the same NNN in some cases!
Changed configuration to use the local server only
All servers had replicas of all directories
One external directory had a huge number of deletion stubs due to an external company always reimporting the directory :-(
Changes/Recommendations
Only local servers in the same NNN
Use only local directories in DA
Used "*" to specify the local replica only (TN #1087708)
Evaluating Extended Directory Catalog for further optimization
Directory catalog could simplify working with external addresses and allow more flexibility
Avoid large numbers of changes in Domino directories
Less need to update views in the Domino Directory
Fewer deletion stubs
Not the first time we have seen nightly complete delete/add import agents in customer environments
How to use NNNs (KISS)
One for TCPIP (and one per Cluster Port )
Other High Availability Tips
Domino 6/7 supports multiple versions on one logical UNIX/Linux box
much easier updates and coexistence of multiple releases, and it allows an easy-to-handle "go back" scenario
Fault Recovery
Maximize server availability
Faster server restart after a crash!
Automatically collect NSDs for faster troubleshooting
Domino Server Availability Index (SAI or AI)
Domino 6+ uses a new algorithm to calculate the workload of a server and the resulting AI
A number of customers reported unpredictable, alternating AI which caused Clustering to fail.
Algorithm was enhanced in D6.0.2CF2 and additional notes.ini parameters have been introduced.
But there is another bug that is hopefully finally fixed in D6.5.6 and D7.0.2!
We traced AI at customer site
Live Environment
Test Environment with Server.Load
LoadMon
Domino 6/7 use a module called "LoadMon"
A routine that measures the speed of current transactions, summarizes them and compares them with previous intervals and minimum values (RunningAvgTime & MinAvgTransTime)
unit: microseconds
OPEN_DB
OPEN_NOTE
CLOSE_DB
DB_INFO_GET
DB_REPLINFO_GET
GET_OBJECT_SIZE
READ_OBJECT
GET_SPECIAL_NOTE_ID
DB_READ_HIST
DB_WRITE_HIST
SERVER_AVAILABLE_LITE
NIF_OPEN_NOTE
Expansion Factor (XF)
XF is calculated based on the performance values of current transactions in relation to minimum time for a transaction
It's the number of times the current transactions take longer than the minimum transaction time
XF values for different transactions build an overall XF
This XF is computed and converted into AI based on a range used to scale the XF (TN #1112352)
Notes.ini Server_Transinfo_Range=n (6 by default) specifies the maximum Expansion Factor of a Domino server. The maximum XF is calculated as 2 raised to the power n (64 by default)
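A small worked example of the numbers on this slide follows. The first two functions restate what the slide says (XF as a ratio to the minimum transaction time, maximum XF as 2 to the power n). The third, which maps XF to AI on a logarithmic scale, is an ASSUMPTION of this sketch chosen so that XF = 1 gives AI = 100 and the maximum XF gives AI = 0; the slide does not state the conversion formula.

```python
import math


def expansion_factor(current_avg_time, min_avg_time):
    """How many times longer current transactions take than the minimum."""
    return current_avg_time / min_avg_time


def max_expansion_factor(transinfo_range=6):
    """Server_Transinfo_Range=n gives a maximum XF of 2**n (64 by default)."""
    return 2 ** transinfo_range


def availability_index(xf, transinfo_range=6):
    """ASSUMED logarithmic scaling of XF into a 0..100 index (not from the source)."""
    xf = min(max(xf, 1.0), max_expansion_factor(transinfo_range))
    return round(100 - 100 * math.log2(xf) / transinfo_range)


print(expansion_factor(6000, 3000))   # 2.0 (times in microseconds)
print(max_expansion_factor())         # 64
print(availability_index(1))          # 100
print(availability_index(64))         # 0
```

Under this assumed scaling, a larger Server_Transinfo_Range stretches the same XF over a wider scale, so the index drops more slowly as transactions slow down.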
LoadMon Notes.ini Settings
SERVER_TRANSINFO_MAX (default 5 / max 60)
number of statistics collections stored in LoadMon
SERVER_TRANSINFO_UPDATE_INTERVAL (default 15)
interval for statistics capturing & calculation
SERVER_MIN_TRANS (default 5)
minimum transactions needed for a statistic value to be valid
SERVER_TRANSINFO_NORMALIZE (default 3000)
SERVER_TRANSINFO_HTTP_NORMALIZE (12000)
as far as we found out, these are used to initialize empty statistics (zero in loadmon.ncf) on startup in Domino 6
Debugging LoadMon
debug_loadmon=1
Enables LoadMon debugging, writes additional information to the server console
07.10.2003 07:08:09 Loadmon: Domino AI = 100, XF = 1
And adds 46 additional statistics counters (server.loadmon.*)
Can be captured locally or remotely via "show server" or a statistics collection program.
nstats servername or C-API NSFGetServerStats (...)
loadmon.ncf
loadmon.ncf in the Domino data directory stores the last information from LoadMon before the server is shut down
loaded on server start to initialize statistics counters
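When tracing the AI over time, the console lines written with debug_loadmon=1 can be collected from a log. The sketch below is based only on the single sample line shown above; the regular expression and the function name are assumptions and may need adjusting for other releases.

```python
import re

# Pattern derived from the one sample console line on this slide.
LOADMON_LINE = re.compile(r"Loadmon: Domino AI = (\d+), XF = (\d+(?:\.\d+)?)")


def parse_loadmon(line):
    """Return (availability_index, expansion_factor) or None if no match."""
    match = LOADMON_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


print(parse_loadmon("07.10.2003 07:08:09 Loadmon: Domino AI = 100, XF = 1"))
```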
Which you can get by asking for them at the back of your business card...
We politely request NO REPOSTING...
Mission Critical Service
Much better defined by the
Total Cost of NOT HAVING IT
when you need it
In other words, something that despite having a (well known?) TCO
may prove to be much more significantly
painful & expensive "NOT TO HAVE"
Keys: TOTAL costs of NOT having
The "Nines":
2 nines (99%) =circa= 88 hours/year
3 nines (99.9%) =circa= 9 hours/year
4 nines (99.99%) =circa= 52 minutes/year
5 nines (99.999%) =circa= 5 minutes/year
Downtime costs per user = [ (Total hours of unscheduled downtime X 25% of user population X Hourly user salary) + (Total hours of scheduled downtime X Hourly messaging administrator salary) ] / Number of messaging users
NOTA BENE: R.S.E. and Change Management/Control needs
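The "nines" table and the downtime-cost formula can be reproduced with a few lines of arithmetic. This sketch assumes a 365-day year and reads "25% of user population" as a multiplier on the number of users (an interpretation of the formula); the figures in the cost example are invented.

```python
HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year allowed by a given availability."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


def downtime_cost_per_user(unscheduled_hours, users, hourly_user_salary,
                           scheduled_hours, hourly_admin_salary,
                           affected_share=0.25):
    """Downtime cost per user, following the formula on this slide."""
    unscheduled = unscheduled_hours * affected_share * users * hourly_user_salary
    scheduled = scheduled_hours * hourly_admin_salary
    return (unscheduled + scheduled) / users


for percent in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(percent)
    print(f"{percent}%: {hours:.2f} hours = {hours * 60:.1f} minutes per year")

# Invented figures: 8 h unscheduled, 1000 users at 40/h, 20 h scheduled, admin at 50/h.
print(downtime_cost_per_user(8, 1000, 40, 20, 50))   # 81.0
```

In this invented example the unscheduled part is 80 of the 81 per user, which lines up with the point that unplanned downtime is what hurts the business.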
Business Users do NOT care what you do
with your PLANNED down time
as much as they care NOT to have ANY UN-PLANNED down times during "biz time"
Business users can plan around PLANNED un-availability of mission critical systems
What Business Users can NOT usually accept
is having to have both Planned and UN-Planned
YOU CAN NOT REDUCE BOTH TO ZERO
on an individual component basis
Key: "individual component basis"
Never begin asking for the budget...
ask for preference/aversion
acceptable time of UNplanned downtime against the money needed to prevent it
Have the user KEEP updated a contingency "Plan B" for alternative/manual processing, so they realise how mission critical their system really is...
TEST their plan B (fire drill :-)
Ask again for the "TC of not Having"
Ask again for "Not Having Aversion"
RunFaster=1
RunSafer=1
DoNotCrash=1
DoNotGetHacked=1
DoNotScrewMySLA=1
DoNotRuinMyBonus=1
DoNotGetMeSacked=1
Which of these do ACTUALLY EXIST ?
High Availability
My petty own TWO definitions
Historical = (ex-post)
the FACT that a service has been available in the past
Predicted = (ex-ante)
a "PERCEPTION" in terms of Probability that a service will be up
when it will be needed in the future
KEY: do NOT extrapolate past availability
Strategic Planning:
My petty own definition (borrowed from many :-)
Analyze possible future scenarios/events, their value and impact to you
What can go wrong, and how much will it cost me/my entity NOT to have the service
Estimate the "a priori" / "pari passu" probability of these events
Analyze, decide and take actions TODAY that will improve the probability of the desired events and scenarios actually happening
Keyword of this slide is TODAY
There is no such thing as
"THE BEST" practice as absolute recipe
Does it make sense to ask:
Will the server be up tomorrow?
NO SLA will make it happen... at most you will get damages/penalties
It makes sense to Actively Plan & Design:
WHAT CAN I DO TODAY to IMPROVE the probability or likelihood that a Service will be perceived as available when needed?
The (pre) Works
You must apply generally agreed Best Practices
for making the individual items more reliable
Examples:
Clean your network of unwanted traffic
Deploy Storage & IO sensibly, i.e. http://www.Lotus.com/Performance
Automate the deployment customizations
The Works: Networking
Apply standard tuning to OS and TCP
DELETE every single other protocol you can
PRINT and understand relevant KB notes
Examples of advised TCP/IP hacks:
EnablePMTUDiscovery=0
TcpTimedWaitDelay=30
etc
Analyze your network, then Investigate and Eliminate
ALL non-essential traffic