Cluster in Detail

Apr 24, 2015

Copyright 2000-2006 by George Chiesa and dotNSF, Inc - ALL RIGHTS RESERVED. It is kindly requested that this presentation is NOT publicly posted, see "license" slide.

Clustering: For Geeks...

& for Normal People Too!

George Chiesa <[email protected]>

Daniel Nashed <[email protected]>

[Figure: DATABASE (VIEW, DATA) and its REPLICA, each with SERVER and UPDATE tasks, shown twice: once with Push and Pull arrows, once with Push-only arrows labelled CLREPL]


This Presentation was not researched

nor conceived at the British Library


This was not conceived at BL.uk

This is bubble-bath-ware!


License: You have a limited license to this presentation.

Copyright 2000-2006 dotNSF and its suppliers. This presentation is non-exclusively LICENSED to you for internal usage within your own entity, company or organization.

For fair-usage purposes, please quote the source as "Bubble-Bath Ideas presentation at DNUG 2006, by G. Chiesa and D. Nashed"

We request this presentation NOT to be publicly reposted, please!

Public abstracts will be posted at http://dotNSF.com & http://nashcom.de


Disclaimers: NO Proofs...

This presentation is based upon empirical info

Observed behaviours, features, bugs, beyond...

I can NOT prove many of the hypotheses here

Please accept these pearls of wisdom "as is"

Some of this information may be obsolete soon

but it's useful to know what the state of the art is

We ALWAYS report security issues to IBM in private.

and no, we will not discuss security bugs (all fixed:-)


Ok, just one hack from a red book

where I wrote something in...

Download and get this redbook:

SG24-7017 Lotus Security Handbook (2004)

Hint: Firefox's "modify header" plugin extension (free)


If you are using Reverse Proxies:


What is "Clustering for Geeks"

Clustering 101 (definitions/vocabulary)

"Clustering For Geeks" is the art of

using documented functionality

and "stable observed behaviours"

to "automagically" provide a better and cheaper servICE (not serVER)

In some cases,

thinking quite outside of the box

pushing the product to the limits !


The 50/50 rule/s:

50% of what you KNOW about clusters...

is quite useless!

50% of what you don't know about clusters

is quite useful!!!

Value Proposition 50%+50%=100%

50% of DDTs (Don't Do That!s)

And 50% of DO this!


What we're covering today

60' version of a much longer workshop...

what is called "1352 Native Clustering"

Which pieces are client/server based

How each major piece works "per se"

How to make the puzzle work for you


About questions...

IT IS "OK" (not impolite)

To interrupt...

to ASK questions...

'ala' easyjet...

"within reason" :-)

We reserve the right to postpone the answers, but, when in doubt, raise hand!

100% of what you do not understand can, and WILL probably hurt you!


Once upon a time... last millennium...

The STATE of the ART in 1995...

was THIN ethernet (ethernet 10 as in 10Mb)

if you were an IBM SHOP, you had TR/4/16

Each adaptor had one and only one address

And in 1995 LOTUS was already shipping

Clustering and Failover embedded in Notes 4.01

(at the time called NPN=Notes Public Networks)

So a LOT within Notes has a strong LEGACY.

So, we're going to provoke your brain to think!


Server Configured in 1995...


This is the MOST controversial!

If I were you I would use...

JUST ONE TCPIP NOTES PORT

You can still have as many addresses

You can still listen to 0.0.0.0 in notes.ini

You can still have complex tcpip routing tables

YOU DO NOT NEED THE EXTRA LOGIC

of Notes trying to cope with Ethernet 10

and just one IP address per physical card.

K.I.S.S. (at the Notes/Domino Layer!!!)

Stay awake, more controversy to come...


Listen... (Bonus HACK): ( 42 443 )

This time the answer is not 42 ;-) but instead: 443!

You can specify what you are "listening to"

You must understand netstat -an | find "LISTEN"

If you bind addresses you will listen on just those, BUT

You CAN specify "0.0.0.0" as a specific address!

You can use this to listen on all addresses at a given port

Example: You can set a Notes server to

also listen for NRPC on port 443 on 0.0.0.0

this is a useful hack when you are behind a proxy

and want to access your home server

and the proxy only allows access to ports 80 and 443

for port 443, proxies use the transparent "connect method"


This is how I connect to my server

when visiting customers

that use HTTP proxies and do not allow direct access to port 1352,

if the customer agrees to let me connect to my own server while at their premises... using their proxy

PORTS=TCPIP,TCPIP2

TCPIP=TCP,0,15,0,,45088,

TCPIP_TCPIPADDRESS=0,0.0.0.0:1352

TCPIP2=TCP,0,15,0,,45088,

TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
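
The settings above can be read back mechanically, which helps when checking what a server is supposed to listen on. The following is a minimal Python sketch, assuming the `<port>_TCPIPADDRESS=0,address:port` layout is exactly as shown on the slide; the helper name `listen_endpoints` is made up for illustration and is not part of Notes/Domino.

```python
def listen_endpoints(ini_text):
    """Return {port_name: (address, tcp_port)} for each name listed in PORTS=."""
    settings = {}
    for line in ini_text.splitlines():
        line = line.strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip().upper()] = value.strip()

    endpoints = {}
    for name in settings.get("PORTS", "").split(","):
        name = name.strip().upper()
        raw = settings.get(name + "_TCPIPADDRESS")
        if not name or raw is None:
            continue
        # Layout assumed from the slide: "0,<address>:<port>"
        _, _, addr_port = raw.partition(",")
        address, _, port = addr_port.rpartition(":")
        endpoints[name] = (address, int(port))
    return endpoints


SAMPLE = """PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
"""

print(listen_endpoints(SAMPLE))
```

Comparing this result with the LISTEN lines from netstat is one way to confirm that both 1352 and 443 are really bound.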


HACK! How does that work?

In my server's Notes.ini

PORTS=TCPIP,TCPIP2

TCPIP=TCP,0,15,0,,45088,

TCPIP_TCPIPADDRESS=0,0.0.0.0:1352

TCPIP2=TCP,0,15,0,,45088,

TCPIP2_TCPIPADDRESS=0,0.0.0.0:443

Voilà: I can connect using the HTTP proxy "transparent connect method" to 443
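
For readers who have not seen the "connect method" on the wire, here is a minimal Python sketch of the HTTP CONNECT exchange. It only builds the request and checks the proxy's status line; the host name is hypothetical, and how the Notes client itself is pointed at the proxy is not shown on the slide and is not covered here.

```python
def build_connect_request(host, port):
    """Build the CONNECT request a client sends to an HTTP proxy."""
    target = f"{host}:{port}"
    return (
        f"CONNECT {target} HTTP/1.1\r\n"
        f"Host: {target}\r\n"
        "\r\n"
    ).encode("ascii")


def tunnel_established(status_line):
    """True if the proxy answered the CONNECT with a 2xx status."""
    parts = status_line.split()
    return len(parts) >= 2 and parts[1].isdigit() and 200 <= int(parts[1]) < 300


# Hypothetical target: a home server listening for NRPC on port 443.
request = build_connect_request("myserver.example.com", 443)
print(request.decode("ascii"))
```

After a 2xx reply the proxy simply relays bytes in both directions, so it does not matter that the traffic on port 443 is NRPC rather than TLS.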


Cluster Aware "1352" Notes Clients:

a.k.a. Cluster-READY clients

Definition:

A Notes Client is said to be cluster-aware when it will perform custom logic to transparently and automatically fail-over from one server to another, upon server directive or LACK of reply

QUIZ:

what % of Notes Clients are CLUSTER Aware?

hint: what was the first version of Cluster Aware Notes client?

If I told you Notes 4.01 was the first one...


Cluster.NCF (client side)

Servers also use it to connect to other servers!

Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)

Cluster/NotesWeb

CN=Notes2/O=Notesweb

CN=Notes1/O=Notesweb

Time=03/01/2002 16:18:24 (80256B36:0059935B)

TheConifers.com

CN=dotNSF.TheConifers.com/O=TheConifers

CN=Linux.TheConifers.com/O=TheConifers

CN=WebSphere.TheConifers.com/O=TheConifers

CN=Win2k.TheConifers.com/O=TheConifers

CN=www.TheConifers.com/O=TheConifers
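
To make the layout of the listings above explicit, here is a minimal Python sketch that reads such text into a structure. It assumes the layout is exactly what the slide shows (a `Time=` line, then the cluster name, then one member per line); the function name `parse_cluster_ncf` is invented for this example.

```python
def parse_cluster_ncf(text):
    """Return a list of {"time", "cluster", "members"} dicts, one per Time= block."""
    entries = []
    current = None
    for line in (raw.strip() for raw in text.splitlines()):
        if not line:
            continue
        if line.startswith("Time="):
            current = {"time": line[len("Time="):], "cluster": None, "members": []}
            entries.append(current)
        elif current is None:
            continue  # ignore anything before the first Time= line
        elif current["cluster"] is None:
            current["cluster"] = line  # first line after Time= is the cluster name
        else:
            current["members"].append(line)
    return entries


SAMPLE_NCF = """Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
"""

for entry in parse_cluster_ncf(SAMPLE_NCF):
    print(entry["cluster"], entry["members"])
```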


Clustering

COMPLEX SET of design methodologies, techniques and heuristics

applied to "stuff"

that you can use to "make"

"n" things to be perceived as ONE bigger/better & "more reliable"

The key words of this slide are "PERCEIVED as"

NB: We're going to focus on

MultiPlatform SOFTWARE Clustering


The "i" in RAID stands for: In-Expensive

In 1987, Patterson, Gibson and Katz at the University of California, Berkeley published "A Case for Redundant Arrays of Inexpensive Disks (RAID)". This paper described various types of disk arrays, referred to by the acronym RAID. The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array of disk drives which yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, this array of drives appears to the computer as a single logical storage unit or drive.

Perspective...

Cluster Examples: 3, 5 or 20+


Cluster.ncf: default max 2 mates TIMES 20 clusters (LKB 185700: Cluster_Name_Cache_Size=n in notes.ini)


Clustering & Failover in Action


Server QUIT while reading...


Cluster Mates:

"Mate" is an industry NON-PC (non politically correct!) std term

Definition:

A cluster of something is composed of mates,

logically siblings among them (no master)

Domino-wise, a Cluster Mate can be:

Available (normal) (SAI > SAT)

Busy (Server_Availability_Index <= Server_Availability_Threshold)

Tip: You CAN BUSY a server by setting SAT=100

Unavailable (or unreachable/perceived as such)

Restricted (Temp=1 or Perm=2)

Invalid (never contacted)
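
The states listed above can be written down as a small decision function. This is a sketch only: the SAI/SAT comparison comes from the slide, but the order in which the other states take precedence is an assumption made for illustration, and the function and parameter names are invented.

```python
def mate_state(sai, sat, contacted=True, reachable=True, restricted=0):
    """Classify a cluster mate; sai and sat are availability index and threshold (0..100)."""
    # Precedence of the first three checks is an assumption, not documented behaviour.
    if not contacted:
        return "INVALID"
    if not reachable:
        return "UNAVAILABLE"
    if restricted in (1, 2):  # 1 = temporary, 2 = permanent (per the slide)
        return "RESTRICTED"
    # From the slide: Available when SAI > SAT, Busy when SAI <= SAT.
    return "AVAILABLE" if sai > sat else "BUSY"


print(mate_state(sai=100, sat=0))    # normal server
print(mate_state(sai=100, sat=100))  # the "SAT=100" tip: always busy
```

Because the index never exceeds 100, a threshold of 100 can never be beaten, which is why the tip works.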


Server Tasks involved

cladmin: server task in R5 that takes care of administrative things

(D6+ not in servertasks=, launched automatically)

cldbdir: keeps the cluster directory up to date (D6+ not in servertasks=, launched automatically)

clrepl: pushes changes to other replicas based on information from the cluster directory

(D6+ not in servertasks=, launched automatically)

logs periodically into the replication log (manual: tell clrepl log)

the standard replicator (replica) should still be active as a fallback and to init replicas!


Servers regularly check the state of their Cluster Mates

The API-level call NSPingServer returns a list of cluster mates and their availability

You can check this information via

> show cluster

Cluster Information

Cluster name: nsh-cluster, Server name: nsh-dus-02/Srv/NashCom/DE

Server cluster probe timeout: 1 minute(s)

Server cluster probe count: 185

Server availability threshold: 0

Server availability index: 100 (state: AVAILABLE)

Cluster members (2)...

server: nsh-dus-02/Srv/NashCom/DE, availability index: 100

server: nsh-dus-01/Srv/NashCom/DE, availability: 42
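
If you capture this console output in a script, the member lines are easy to pick apart. The sketch below is an illustration in Python, assuming the output looks exactly like the sample on the slide (including the two spellings "availability index:" and "availability:"); `parse_show_cluster` is a made-up helper name.

```python
import re

# Matches both "availability index: 100" and "availability: 42" as seen on the slide.
MEMBER_RE = re.compile(
    r"server:\s*(?P<name>[^,]+),\s*availability(?: index)?:\s*(?P<ai>\d+)"
)


def parse_show_cluster(output):
    """Return {server_name: availability_index} for each listed cluster member."""
    return {m.group("name").strip(): int(m.group("ai")) for m in MEMBER_RE.finditer(output)}


SAMPLE_OUTPUT = """Cluster Information
Cluster name: nsh-cluster, Server name: nsh-dus-02/Srv/NashCom/DE
Server availability threshold: 0
Server availability index: 100 (state: AVAILABLE)
Cluster members (2)...
server: nsh-dus-02/Srv/NashCom/DE, availability index: 100
server: nsh-dus-01/Srv/NashCom/DE, availability: 42
"""

members = parse_show_cluster(SAMPLE_OUTPUT)
least_loaded = max(members, key=members.get)  # higher index = more available
print(members)
print(least_loaded)
```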


Portfolio techniques / Sizing heuristics

There are always 2 practical limits:

Lower:

at LEAST how many you need to reduce risk

Upper:

at MOST how many you can manage effectively

Tip: Start with 3 or 4, fine tune afterwards

but please do NOT start with 2 or 6


Class of service:

by "n" instances of resource

Say, for the purpose of example, you have "3" "whatevers": OSs, Sites, Servers, Routers, ISPs

say you name the 3 elements A, B and C

With 3 elements you can define the following Classes of Service:

Top, simultaneously present in A+B+C

Middle, present in either: AB, AC or BC

Single, present just in A or B or C

Homework: Try the combinations for 4 units, C(4,4) + C(4,3) + C(4,2) + C(4,1)

Nota benissimo: DO STOP AT 4 ! ! !
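
The homework can be checked mechanically. The following short Python sketch enumerates the classes of service for a set of units with the standard library's `itertools.combinations`; the unit names A to D are just the labels from the example, and `classes_of_service` is a name invented here.

```python
from itertools import combinations


def classes_of_service(units):
    """Map subset size -> list of subsets of that size, largest size first."""
    return {
        size: ["".join(combo) for combo in combinations(units, size)]
        for size in range(len(units), 0, -1)
    }


for size, subsets in classes_of_service("ABC").items():
    print(size, subsets)

# C(4,4) + C(4,3) + C(4,2) + C(4,1) = 1 + 4 + 6 + 4
total_for_four = sum(len(subsets) for subsets in classes_of_service("ABCD").values())
print(total_for_four)  # 15
```

Going from 3 to 4 units already doubles the number of placements to reason about (7 versus 15), which gives some feel for why the slide says to stop at 4.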


Almost Real Time Replication...

a) we need to define how we will synchronize

Bad News: Scheduled replication is not good enough...

Some apps must be made cluster-aware!

Good News:

NATIVE Event/Queue Driven = CLREPL

(aka Almost Real Time)

Most apps will automatically work better

b) we still need to spread the load/access.


ClDbDir

It's a Notes Database, similar to the catalog, Cluster Specific (RepId depends on ClusterName)

Maintained by a server task of the same name

It's in the Enterprise Edition of Domino

Contains info about databases deployed in a cluster

Is used by Notes/Domino Cluster Aware modules

to know where to push what (and what NOT to!!!)

and for "failovers": a server finds the resource elsewhere!

Like CATALOG, each server updates its OWN dbs

BEWARE: 8192 is the maximum number of useful entries; you get NEITHER a warning NOR an error message!


ClDbDir (contents)


Bonus Hack: Set Config Cluster_Admin_On=1

It also works IN NON Clustered servers!

You can afterwards do:

CL DEL filename (cluster delete)

CL COPY source dest REPLICA

CL OUT database (out of service)

CL IN database (in service again)

(both work but are only meaningful in clusters)

Useful to OUT-of-service databases BEFORE adding an OLD server to a cluster

useful for decommissioning an old server

you HAVE to add a server to get it into the CLIENT's Cluster.NCF

From LKB: How Push-Pull (std) Replica works

[Figure: DATABASE (VIEW, DATA) and REPLICA DATABASE (DATA, VIEW), each with SERVER and UPDATE tasks, connected by Push and Pull arrows]


From LKB: How Push Cluster Replica works!

[Figure: DATABASE (VIEW, DATA) and replica DATABASE (DATA, VIEW), each with SERVER and UPDATE tasks, connected by Push-only arrows labelled CLREPL]


How does Cluster Replication work (details)

Document changes are captured and trigger the Cluster Replicator via a message queue

The Cluster Replicator reads the message queue and pushes changes to all other replicas in the cluster regardless of replication settings (aka almost "real time" replication)


CLREPL

CLREPL is a server task

It's an in-memory QUEUE-driven event replicator (REMEMBER BATH TUB!)

that SHOULD push content within 15 seconds at most, 7 on average

thus ClRepl is also sometimes called RTR

or "ALMOST" REAL TIME REPLICATOR

the KEY here is in "ALMOST"


ClRepl (cont'd)

ClRepl PUSHES content modified locally to all cluster mates containing replicas of the modified database

Tips: It PUSHES ignoring source ACL

Check that the queue is not overfilled

Always schedule CLASS+1 of them

NB: CLREPL does NOT initialize "Replica Stubs"

It also knows what to push and what NOT to push:

Out Of Service (for quite obvious reasons) but also

Pending Delete (cldbdir does final push, not clrepl!)


ClRepl (cont'd)

ClRepl will keep an IN-memory queue

It's a QUEUE, and can be overfilled

It's in MEMORY and is NOT disk persistent

THUS, also schedule normal replicas:

Tip: within reason, overscheduling pull replicas is not a huge issue, because the deltas are small

i.e. Enabled Replica From */Srv/Whatever to <each>/Srv/Whatever, PULL, every 60 Mins

Will make servers catch up fast, pulling at restart time.

TIP: SH ST REPLICA.CLUSTER.*Q* (Daniel to explain detail stats)


Cluster Replicator Performance & Statistics

General Rule: number of clrepl = cluster members "minus" 1

R5: servertasks=events4, repl, router, clrepl, clrepl, clrepl, ...

D6: Cluster_Replicators=n

My Tip: set it to CLASS_OF_SERVICE PLUS one, not minus one; overscheduling it is cheap, underschedule it and you will have problems!

Check if clustering works properly via

Show Stat Replica.Cluster.*

Replica.Cluster.WorkQueueDepth should be "small", i.e. less than 10

Replica.Cluster.RetryWaiting should also be "small", i.e. less than 5

Replica.Cluster.Failed should be zero if possible (easy to say :-)

Check the Max and Average Times in queue, should be < 10 seconds

Show Stat Server.Cluster.*

Server.Cluster.OpenRedirects.xxx.Unsuccessful = 0

check for unsuccessful redirects!
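
The rules of thumb above lend themselves to an automated check. Here is a minimal Python sketch that applies the slide's thresholds to captured statistics. It assumes the console prints one `Name = value` pair per line; the sample values are invented, and `parse_show_stat` and `cluster_warnings` are hypothetical helper names.

```python
def parse_show_stat(output):
    """Turn 'Name = value' lines into a dict of strings."""
    stats = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            stats[name.strip()] = value.strip()
    return stats


# Thresholds taken from the slide; a value at or above the limit is flagged.
LIMITS = {
    "Replica.Cluster.WorkQueueDepth": 10,  # should be less than 10
    "Replica.Cluster.RetryWaiting": 5,     # should be less than 5
    "Replica.Cluster.Failed": 1,           # should be zero
}


def cluster_warnings(stats):
    """Return one message per statistic that breaks its rule of thumb."""
    warnings = []
    for name, limit in LIMITS.items():
        value = int(stats.get(name, "0").replace(",", ""))
        if value >= limit:
            warnings.append(f"{name} = {value} (expected < {limit})")
    return warnings


# Invented sample values, for illustration only.
SAMPLE_STATS = """  Replica.Cluster.Failed = 0
  Replica.Cluster.RetryWaiting = 0
  Replica.Cluster.WorkQueueDepth = 12
"""

print(cluster_warnings(parse_show_stat(SAMPLE_STATS)))
```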


How to restrict access (LKB 7002910)

Domino server clusters have an optional workload balancing feature that lets you distribute the workload of heavily-used databases across multiple servers in a cluster. To distribute workload, you limit or restrict the work that a server can perform using the following settings in the NOTES.INI:

Server_Availability_Threshold

This setting allows you to specify the maximum availability level beyond which the server attempts to redirect user requests to other servers in the cluster. A server's availability index is recalculated each minute and compared against any threshold you set. If the index falls below the server threshold, the server becomes BUSY. The Cluster Manager redirects access requests from a BUSY server to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives access to the BUSY server. Each time a redirection occurs, Notes generates a workload balancing event in the Notes log (LOG.NSF).

Server_MaxUsers

This setting specifies the maximum number of user sessions allowed on a server. When the server reaches this limit, the server goes into a MAXUSERS state. The Cluster Manager then attempts to redirect new user requests to other servers in the cluster. To see how often requests are being redirected, check the LOG.NSF for failover events. If redirection of the user request is unsuccessful, the user receives a message, and is not allowed access to the server.

Server_Restricted

This setting enables a server to deny new open database requests and places the server in a RESTRICTED state. Users who have active connections to databases retain their connections. The Cluster Manager attempts to redirect new requests to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives a message and is not allowed access to the server. For each redirection attempt, Notes generates a failover event in the LOG.NSF.

Note: You can use the Server_Restricted setting for any Domino server. This setting is not restricted to clusters.
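
The difference between the three states is what happens when redirection fails. The sketch below restates the text above as a small Python function; it is a paraphrase of the described behaviour, not Domino code, and the function name and return strings are invented.

```python
def handle_open_request(state, redirect_ok):
    """Outcome of a new open-database request, per the descriptions above.

    state: "AVAILABLE", "BUSY", "MAXUSERS" or "RESTRICTED"
    redirect_ok: whether another cluster mate could take the request
    """
    if state == "AVAILABLE":
        return "serve locally"
    if redirect_ok:
        return "redirect to cluster mate"
    if state == "BUSY":
        # A BUSY server still lets the user in when redirection fails.
        return "serve locally"
    # MAXUSERS and RESTRICTED: the user gets a message and no access.
    return "deny access"


for state in ("BUSY", "MAXUSERS", "RESTRICTED"):
    print(state, "->", handle_open_request(state, redirect_ok=False))
```

In other words, BUSY is a soft limit, while MAXUSERS and RESTRICTED are hard limits.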


SAI examples, un/touched

You may want to smooth this (or not)


Best Practices for Cluster Replication

Ensure LocalDomainServers (as a Server group) or, better, */Srv/Org has full Manager access of type Server in all ACLs. I prefer hardcoding OUs to groups. Works always!

Make sure all applications provide roles to give access to documents with reader fields (remember computed auth fields)

Give Servers all rights and roles to "see" all documents

Don't use replication formulas for clustered databases

Have a scheduled replication in case some events in the clrepl queue get lost or the server is down...

Add startup replication documents "from *" to ensure databases are up to date after server restart

Schedule replication to the name of the cluster instead of single server names (load balancing & failover)

Cluster Replication & Database Quotas

There are issues with Database Quotas before R5.0.10

Good news:

New option in R5.0.10: CLREPL_OVERRIDE_QUOTAS=1

Domino 6 overrides quotas by default

you get the old behavior with Clrepl_Obeys_Quotas=1 (DDT)

Bad news:

If you already have this problem you need to delete the replication history and the CutOff Date to resolve existing replication problems

LotusScript can clear the replication history:

Set rep = db.ReplicationInfo

Call rep.ClearHistory()

Call rep.Save()

But it cannot remove the CutOffDate (in most cases not needed)


Notes Named Network & Directory Assistance

Customer was using Notes Named Networks (NNN) across a WAN connection

Caused unintended traffic

Directory Assistance (DA)

Multiple replicas of 4 directories were used

The first server in the list was a remote server in the same NNN in some cases!

Changed configuration to use the local server only

All servers had replicas of all directories

One external directory had a huge number of deletion stubs due to an external company always reimporting the directory :-(


Changes/Recommendations

Only local servers in the same NNN

Use only local directories in DA

Used "*" to specify the local replica only (TN #1087708)

Evaluating Extended Directory Catalog for further optimization

Directory catalog could simplify working with external addresses and allow more flexibility

Avoid large numbers of changes in Domino directories

Less need to update views in Domino Directory

Fewer deletion stubs

Not the first time we have seen nightly complete delete/add import agents in customer environments


How to use NNNs (KISS)

One for TCPIP (and one per Cluster Port)


Other High Availability Tips

Domino 6/7 supports multiple versions on one logical UNIX/Linux box

Much easier updates and coexistence of multiple releases, with an easy-to-handle "go back" scenario

Fault Recovery

Maximize server availability

Faster server restart after a crash!

Automatically collects NSDs for faster troubleshooting

Domino Server Availability Index (SAI or AI)

Domino 6+ uses a new algorithm to calculate the workload of a server and the resulting AI

A number of customers reported unpredictable, alternating AI which caused Clustering to fail.

Algorithm was enhanced in D6.0.2CF2 and additional notes.ini parameters have been introduced.

But there is another bug that is hopefully finally fixed in D6.5.6 and D7.0.2!

We traced AI at customer site

Live Environment

Test Environment with Server.Load

LoadMon

Domino 6/7 use a module called "LoadMon"

Routine that calculates the speed of current transactions, summarizes them, and compares them with previous intervals and minimum values (RunningAvgTime & MinAvgTransTime)

Unit: microseconds

OPEN_DB

OPEN_NOTE

CLOSE_DB

DB_INFO_GET

DB_REPLINFO_GET

GET_OBJECT_SIZE

READ_OBJECT

GET_SPECIAL_NOTE_ID

DB_READ_HIST

DB_WRITE_HIST

SERVER_AVAILABLE_LITE

NIF_OPEN_NOTE

Expansion Factor (XF)

XF is calculated based on the performance values of current transactions in relation to minimum time for a transaction

It's the number of times the current transactions take longer than the minimum transaction time

XF values for different transactions build an overall XF

This XF is computed and converted into AI based on a Range to scale the XF (TN #1112352)

The notes.ini setting Server_Transinfo_Range=n (default 6) specifies the maximum Expansion Factor of a Domino server. The maximum XF is calculated as 2 raised to the power n (64 by default)
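
As a rough illustration of the arithmetic on this slide, here is a minimal Python sketch. The function names are invented for the example. The per-transaction ratio is only the "number of times slower than the minimum" described above; how Domino combines the per-transaction values into the overall XF is not covered here. The OPEN_DB figures are taken from the server.loadmon.* statistics shown in this presentation.

```python
def expansion_factor(running_avg_us: float, min_avg_us: float) -> float:
    """How many times slower current transactions are than the minimum time."""
    if min_avg_us <= 0:
        raise ValueError("minimum transaction time must be positive")
    return running_avg_us / min_avg_us


def max_expansion_factor(transinfo_range: int = 6) -> int:
    """Upper bound for XF: 2 raised to the power Server_Transinfo_Range."""
    return 2 ** transinfo_range


# RunningAvgTime and MinAvgTransTime for OPEN_DB (microseconds)
xf_open_db = expansion_factor(4143, 429.166666666667)
print(f"OPEN_DB is about {xf_open_db:.1f} times slower than its minimum")
print("max XF with the default range:", max_expansion_factor())
print("max XF with range 15:", max_expansion_factor(15))
```

With the default range the scale tops out at an XF of 64, so anything slower than that can no longer be told apart.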

LoadMon Notes.ini Settings

SERVER_TRANSINFO_MAX (default 5 / max 60)

number of statistics collections stored in LoadMon

SERVER_TRANSINFO_UPDATE_INTERVAL (default 15)

interval for statistics capturing & calculation

SERVER_MIN_TRANS (default 5)

minimum transactions needed for a statistic value to be valid

SERVER_TRANSINFO_NORMALIZE (default 3000)

SERVER_TRANSINFO_HTTP_NORMALIZE (12000)

As far as we found out, these are used to initialize empty statistics (zero in loadmon.ncf) on startup in Domino 6

Debugging LoadMon

debug_loadmon=1

Enables LoadMon debugging, writes additional information to the server console

07.10.2003 07:08:09 Loadmon: Domino AI = 100, XF = 1

And adds 46 additional statistics counters (server.loadmon.*)

Can be captured locally or remotely via "show server" or statistics collection program.

nstats servername or C-API NSFGetServerStats (...)

loadmon.ncf

loadmon.ncf in the Domino data directory stores the last information from LoadMon before the server is shut down

Loaded on server start to initialize statistics counters

Server.LoadMon.TransInfo.AI.Type = 0

Server.LoadMon.TransInfo.CurrentTransCount.CLOSE_DB = 3

Server.LoadMon.TransInfo.CurrentTransCount.DB_INFO_GET = 2

Server.LoadMon.TransInfo.CurrentTransCount.DB_READ_HIST = 0

Server.LoadMon.TransInfo.CurrentTransCount.DB_REPLINFO_GET = 5

Server.LoadMon.TransInfo.CurrentTransCount.DB_WRITE_HIST = 0

Server.LoadMon.TransInfo.CurrentTransCount.GET_NOTE_INFO = 0

Server.LoadMon.TransInfo.CurrentTransCount.GET_OBJECT_SIZE = 0

Server.LoadMon.TransInfo.CurrentTransCount.GET_SPECIAL_NOTE_ID = 0

Server.LoadMon.TransInfo.CurrentTransCount.NIF_OPEN_NOTE = 0

Server.LoadMon.TransInfo.CurrentTransCount.OPEN_DB = 3

Server.LoadMon.TransInfo.CurrentTransCount.OPEN_NOTE = 7

Server.LoadMon.TransInfo.CurrentTransCount.READ_OBJECT = 0

Server.LoadMon.TransInfo.CurrentTransCount.SERVER_AVAILABLE_LITE = 2

Server.LoadMon.TransInfo.HttpNormalize = 12000

Server.LoadMon.TransInfo.IntervalInSeconds = 15

Server.LoadMon.TransInfo.Max = 5

Server.LoadMon.TransInfo.MinAvgTransTime.CLOSE_DB = 58.1818181818182

Server.LoadMon.TransInfo.MinAvgTransTime.DB_INFO_GET = 119.875

Server.LoadMon.TransInfo.MinAvgTransTime.DB_READ_HIST = 210.666666666667

Server.LoadMon.TransInfo.MinAvgTransTime.DB_REPLINFO_GET = 88.5714285714286

Server.LoadMon.TransInfo.MinAvgTransTime.DB_WRITE_HIST = 240.2

Server.LoadMon.TransInfo.MinAvgTransTime.GET_NOTE_INFO = 110.235087719298

Server.LoadMon.TransInfo.MinAvgTransTime.GET_OBJECT_SIZE = 141.777777777778

Server.LoadMon.TransInfo.MinAvgTransTime.GET_SPECIAL_NOTE_ID = 93.333333333

Server.LoadMon.TransInfo.MinAvgTransTime.NIF_OPEN_NOTE = 1,031.4285714286

Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_DB = 429.166666666667

Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_NOTE = 272.987714987715

Server.LoadMon.TransInfo.MinAvgTransTime.READ_OBJECT = 134.285714285714

Server.LoadMon.TransInfo.MinAvgTransTime.SERVER_AVAILABLE_LITE = 95.3333333

Server.LoadMon.TransInfo.MinTrans = 5

Server.LoadMon.TransInfo.Normalize = 3000

Server.LoadMon.TransInfo.Range = 15

Server.LoadMon.TransInfo.RunningAvgTime.CLOSE_DB = 214.333333333333

Server.LoadMon.TransInfo.RunningAvgTime.DB_INFO_GET = 172

Server.LoadMon.TransInfo.RunningAvgTime.DB_READ_HIST = 0

Server.LoadMon.TransInfo.RunningAvgTime.DB_REPLINFO_GET = 187

Server.LoadMon.TransInfo.RunningAvgTime.DB_WRITE_HIST = 0

Server.LoadMon.TransInfo.RunningAvgTime.GET_NOTE_INFO = 0

Server.LoadMon.TransInfo.RunningAvgTime.GET_OBJECT_SIZE = 0

Server.LoadMon.TransInfo.RunningAvgTime.GET_SPECIAL_NOTE_ID = 0

Server.LoadMon.TransInfo.RunningAvgTime.NIF_OPEN_NOTE = 0

Server.LoadMon.TransInfo.RunningAvgTime.OPEN_DB = 4,143

Server.LoadMon.TransInfo.RunningAvgTime.OPEN_NOTE = 738

Server.LoadMon.TransInfo.RunningAvgTime.READ_OBJECT = 0

Server.LoadMon.TransInfo.RunningAvgTime.SERVER_AVAILABLE_LITE = 104

46 statistics found

se co DEBUG_LOADMON=1

BEWARE LARGE OVERFLOW INTO NEGATIVE VALUES

Quit, delete loadmon.ncf, restart server (do after upgrades!)

AI in D6.0.1 without Optimizing of Loadmon

Domino 6.0.1 AIX 5.1 dropping AI

What did we find out?

AI with default interval 15 sec and 5 sampling values does not always result in steady AI

we needed to find values which provide

steady values for cluster-failover not to occur "randomly" or cause Ping-Pong effects

reasonable time to reflect current workload in AI

Standard interval and sampling (15 seconds * 5 values) cover 75 seconds

Interval 10 seconds with 20 sampling values cover 200 seconds

Standard Server.Load Scripts do not help much because most transactions are not used in standard scripts
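
The time span that the AI reflects is simply the capture interval multiplied by the number of stored collections. A minimal sketch of that calculation follows; the function name is made up for the example, and the limit of 60 samples is the maximum quoted for SERVER_TRANSINFO_MAX earlier in this deck.

```python
MAX_SAMPLES_LIMIT = 60  # upper bound for SERVER_TRANSINFO_MAX quoted in this deck


def loadmon_window_seconds(update_interval: int = 15, max_samples: int = 5) -> int:
    """Seconds of history covered by interval * number of stored collections."""
    if update_interval <= 0:
        raise ValueError("update interval must be positive")
    if not 1 <= max_samples <= MAX_SAMPLES_LIMIT:
        raise ValueError("sample count must be between 1 and 60")
    return update_interval * max_samples


print("defaults (15 s, 5 samples):", loadmon_window_seconds(), "seconds")
print("tuned (10 s, 20 samples):", loadmon_window_seconds(10, 20), "seconds")
```

A longer window smooths the AI and avoids Ping-Pong failovers, at the price of reacting more slowly to a real change in workload.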

Listen... (HACK 2)

You need to understand which fields are

Listens (usually in specific tabs)

HostNames that are NOT Listens

For example: you can tell Domino that its HTTP hostname is the name of something else, even on a different machine

URLs will be created nicely

HACK 3: How to use clustering for server consolidation

Add ALL servers to ONE CLUSTER...

Make sure no DB exists more than 3 times in the cluster

SET the SAT (Server Availability Threshold) of the OLD servers to 100

This will BUSY them out

Users will LOAD BALANCE to the new servers

(applies to all NON-Admin / NON-Manager users)

Unless you forgot an app that exists only on the old servers,

because its users will continue to access the old servers
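
The effect of the threshold can be sketched in a few lines of Python. This is an illustration under assumptions, not Domino's code: the function name is invented, and the rule for values between 1 and 99 (BUSY once the index drops below the threshold) is our reading of how the threshold works. The two extremes match the console output in this deck: a threshold of 100 shows BUSY, a threshold of 0 shows AVAILABLE.

```python
def availability_state(availability_index: int, threshold: int) -> str:
    """Return "BUSY" or "AVAILABLE" for a server's AI and its SAT."""
    if not 0 <= availability_index <= 100:
        raise ValueError("availability index must be between 0 and 100")
    if not 0 <= threshold <= 100:
        raise ValueError("availability threshold must be between 0 and 100")
    if threshold == 0:
        return "AVAILABLE"  # threshold 0: load balancing never busies the server
    if threshold == 100:
        return "BUSY"       # threshold 100: the server is busied out on purpose
    # Assumption: in between, the server is BUSY once AI falls below the threshold
    return "BUSY" if availability_index < threshold else "AVAILABLE"


print(availability_state(81, 100))  # old server during consolidation
print(availability_state(28, 0))    # server with load balancing switched off
print(availability_state(64, 5))    # new server with a low threshold
```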

For example, to get this

Cluster name: DOMPMAC01, Server name: DOMAGP01/SRV/Customer

Server cluster probe timeout: 1 minute(s)

Server cluster probe count: 47191

Server cluster default port: *

Server availability threshold: 100

Server availability index: 0 (state: BUSY)

Server availability default minimum transaction time: 3000

Cluster members (11):

Server: DOMPMA01/SRV/Customer, availability index: 81

Server: DOMPMA02/SRV/Customer, availability index: 78

Server: DOMPIN02/SRV/Customer, availability index: 65

Server: DOMPIN01/SRV/Customer, availability index: 63

Server: DOMMYP01/OLD/SRV/Customer, availability index: 0

Server: DOMMYP02/OLD/SRV/Customer, availability index: 0

server: DOMOEP01/SRV/Customer, availability: BUSY

server: DOMHEP01/SRV/Customer, availability: BUSY

server: DOMCVP01/SRV/Customer, availability: BUSY

server: DOMVGP01/SRV/Customer, availability: BUSY

server: DOMAGP01/SRV/Customer, availability: BUSY

Cluster information:

Cluster name: DOMPMAC01, Server name: DOMMYP01/SRV/Customer

Server cluster probe timeout: 1 minute(s)

Server cluster probe count: 62831

Server cluster default port: *

Server availability threshold: 0

Server availability index: 28 (state: AVAILABLE)

Server availability default minimum transaction time: 3000

Cluster members (11):

Server: DOMPMA02/SRV/Customer, availability index: 79 )) SERVER_AVAILABILITY_THRESHOLD=5

Server: DOMPMA01/SRV/Customer, availability index: 78 )) SERVER_AVAILABILITY_THRESHOLD=5

Server: DOMPIN01/SRV/Customer, availability index: 64 )) SERVER_AVAILABILITY_THRESHOLD=5

Server: DOMPIN02/SRV/Customer, availability index: 39 )) SERVER_AVAILABILITY_THRESHOLD=5

Server: DOMMYP02/OLD/SRV/Customer, availability index: 0 )) SERVER_TRANSINFO_RANGE=2 & SAT=0

Server: DOMMYP01/OLDSRV/Customer, availability index: 0 )) SERVER_TRANSINFO_RANGE=2 & SAT=0

server: DOMHEP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100

server: DOMVGP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100

server: DOMCVP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100

server: DOMAGP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100

server: DOMOEP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100

Fine tuning via SAI/SAT/Range

And when you turn off a server...

Remember to ignore the probe failures

If they annoy you, increase the probe period

Server_Cluster_Probe_Timeout=1 (minute)

Dead servers do not run CLDBDIR, thus (hack!)

In the new servers' CLDBDIR, manually DELETE ALL instances of DBs on the old servers

Failover by replica ID then uses the new servers!

CLREPL will NOT attempt to keep dead servers updated (EXTREMELY IMPORTANT!!!!!!!!)

You can keep old dead servers in the cluster for a reasonably long time

BUT you must check the logs and

sh st replica.cluster.*q*

You can't have lost transactions...

because CLDBDIR thinks the old servers are EMPTY but alive

CL Manager will say once a minute that they are unreachable, which is what you want for AUTOMATIC user failover... over time...

To finally delete the server

use AdminP !!!

Other Caveats/Tips/Tricks:

You must make sure you edit the old servers' records in the NAB to remove mail routing

You do not want mail to be routed via old dead servers

You'd better run the server decommission report BEFORE turning them off...

A machine that is turned off produces no reports

DO NOT remove the old server from the cluster yet

Always TEST failovers

with a TEST user ID that is

NON Administrator

NON manager of apps databases

It is assumed that managers know where they want to access DBs

and will NOT attempt failover

If you test with ADMIN.id, it will drive you MAD

Cluster Analysis

Cluster Analysis is a great feature for finding problems in your cluster

It's part of the Admin Client (Server / Analysis / Cluster ...)

Run it to find problems with ACLs, replication, non-existent databases, ...

Tips

Run it, print it and sign off all warnings you find

Use FT Search to remove multiple occurrences of similar or already fixed problems until the DB is empty

Run Analysis again to confirm you addressed all problems

Failover

Definition:

Server Initiated

due to reactive Load Balance or failures

Client Initiated

server is dead or perceived as dead

requires client to know how to connect to cluster mates without server assistance!

Tip: put the address in the server name: CN=<FullyQualifiedDomainName>/Whatever

CN=194.196.39.11/Srv/LotusEmea/Net

DEBUG_NOSTDOUT=1

If you leave debug parameters ON in prod

capture the debug in files

debug_Outfile=

and NOT in StdOut

for performance reasons and also...

for sanity of old 3rd party apps (&BACKUPs)

DO NOT USE THIS PLEASE

DEBUG_RUN_AS_ROOT=1

it WILL allow you to run as root in UNIX/Linux

it will NOT allow you later to run as non root

unless you fix all the owners, permissions, etc.

of everything it created. (just DDT please!)

Exception: Some custom restores required root

GET A NEW VERSION OF RESTORE TOOL

Replication Debugging

DEBUG_REPL=2 & DEBUG_REPL_ALL=1

Log_Replication: (not ORable, different values, -1 does not work!)

Log_Replication=0....No replication logging

Log_Replication=1....Logs server replication events

Log_Replication=2....Adds logging of replication activity at the database level

Log_Replication=3....Adds logging of replication activity at the note level

Log_Replication=4....Adds logging of replication activity at the field level

Log_Replication=5....Adds summary logging

RTR_Logging: (Tip: You can OR (sum) these, i.e. 63 is a LOT!)

RTR_Logging= 1....Default level of logging (major routines, events, etc.)

RTR_Logging= 2....Log all context structure changes

RTR_Logging= 4....Log replications: attempted & performed

RTR_Logging= 8....Log iterations through main polling loop

RTR_Logging=16...Verbose debug logging

RTR_Logging=32...Log all lock operations
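
Because the RTR_Logging values are powers of two, they combine like bit flags. The sketch below shows the summing with Python's `enum.IntFlag`; the member names are shorthand invented for this example (only the numeric values come from the list above), and `ini_line` is a hypothetical helper.

```python
from enum import IntFlag


class RtrLogging(IntFlag):
    DEFAULT = 1           # major routines, events, etc.
    CONTEXT_CHANGES = 2   # all context structure changes
    REPLICATIONS = 4      # replications attempted & performed
    POLLING_LOOP = 8      # iterations through main polling loop
    VERBOSE = 16          # verbose debug logging
    LOCKS = 32            # all lock operations


def ini_line(flags: RtrLogging) -> str:
    """Format the combined value as a notes.ini line."""
    return f"RTR_Logging={int(flags)}"


# Pick only what you need...
print(ini_line(RtrLogging.DEFAULT | RtrLogging.REPLICATIONS))

# ...or switch everything on (this is the "63 is a LOT" case)
everything = RtrLogging(sum(flag.value for flag in RtrLogging))
print(ini_line(everything))
```

Log_Replication, by contrast, takes a single level from 0 to 5 and cannot be combined this way.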

Replica.Cluster.Docs.Added = 26790

Replica.Cluster.Docs.Deleted = 16060

Replica.Cluster.Docs.Updated = 378378

Replica.Cluster.Failed = 30

Replica.Cluster.Files.Local = 83

Replica.Cluster.Files.Remote = 83

Replica.Cluster.Retry.Skipped = 222

Replica.Cluster.Retry.Waiting = 0

Replica.Cluster.SecondsOnQueue = 13

Replica.Cluster.SecondsOnQueue.Avg = 2

Replica.Cluster.SecondsOnQueue.Max = 3593

Replica.Cluster.Servers = 1

Replica.Cluster.SessionBytes.In = 160450213

Replica.Cluster.SessionBytes.Out = 824894460

Replica.Cluster.Successful = 13484

Replica.Cluster.WorkQueueDepth = 0

Replica.Cluster.WorkQueueDepth.Avg = 0

Replica.Cluster.WorkQueueDepth.Max = 4

sh st replica.cluster.* (if you do not read the stats, why bother clustering?)
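
Reading the statistics is easier when a script does the first pass. The following Python sketch parses captured "sh st replica.cluster.*" output and flags the queue-related counters. The function names and both limits (60 seconds on queue, a queue depth of 10) are arbitrary assumptions for the example, not recommended values; the sample figures are copied from the listing above.

```python
SAMPLE = """\
Replica.Cluster.Failed = 30
Replica.Cluster.Retry.Waiting = 0
Replica.Cluster.SecondsOnQueue = 13
Replica.Cluster.SecondsOnQueue.Avg = 2
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.WorkQueueDepth = 0
Replica.Cluster.WorkQueueDepth.Avg = 0
Replica.Cluster.WorkQueueDepth.Max = 4
"""


def parse_cluster_stats(text: str) -> dict[str, int]:
    """Turn 'Name = value' lines into a dict, keeping Replica.Cluster.* only."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        name = name.strip()
        if not sep or not name.startswith("Replica.Cluster."):
            continue
        try:
            stats[name] = int(value.strip().replace(",", ""))
        except ValueError:
            continue  # skip non-numeric values
    return stats


def queue_warnings(stats, max_seconds_on_queue=60, max_queue_depth=10):
    """List counters above the (hypothetical) limits."""
    checks = [
        ("Replica.Cluster.SecondsOnQueue", max_seconds_on_queue),
        ("Replica.Cluster.SecondsOnQueue.Max", max_seconds_on_queue),
        ("Replica.Cluster.WorkQueueDepth", max_queue_depth),
        ("Replica.Cluster.WorkQueueDepth.Max", max_queue_depth),
        ("Replica.Cluster.Retry.Waiting", 0),
    ]
    return [f"{name} = {stats[name]} (limit {limit})"
            for name, limit in checks if stats.get(name, 0) > limit]


for warning in queue_warnings(parse_cluster_stats(SAMPLE)):
    print("CHECK:", warning)
```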

Network_Sprayer_Address=*

Useful to disable name checking after connect

I just wish it worked better (it does not always work)

DO_NOT_USE_REMEMBERED_ADDRESSES=1

Failover by Path

Normally, you should NOT get it

Most failovers you get should be by replica ID

It is a sign that you have multiple instances of the same replica ID on one server

You should (almost) never have duplicates

SH DIR on the server shows you the duplicates

Requested to be added to the ADMIN client next

Server_TransInfo_Normalize

default = 3000

Unit is milliseconds * 100 of a standard transaction

3000 is a BAAAAAAAAD default

Fortunately loadmon.ncf helps by saving the old real times for all transactions

USE: AvailabilityIndexType=1 (for nonHTTP)

Server_TransInfo_Range

If you don't know better,

set between 10 and 40

default is 6 and is WAAAAAAAAY TOOO LOW

Allegedly (rumour)

it helps also NON clustered HTTP servers

Apparently some code in http checks SAI

for self tuning, and a better SAI uses HW better

Tell CLREPL pause/resume

Useful to be able to read something

If you are using a very high debug level

Remember to resume it, or you will go nuts trying to figure out what happened.

Show AI (in AIx but is different)

What should be seen is this:

> show ai

nconsole DOMPHU00 "sh ai"

Range  XF    Hits   Min AI  Max AI
1      2     48406  93      100
2      4     1380   77      93
3      8     1226   64      77
4      16    821    51      64
5      32    106    38      51
6      64    39     26      37
7      128   16     20      25
...
Current value of SERVER_TRANSINFO_RANGE is 6.

<<changes suggested for SERVER_TRANSINFO_RANGE>>

nconsole DOMPHU01 "sh ai"

Range  XF    Hits   Min AI  Max AI
1      2     48826  93      100
2      4     1052   77      93
3      8     1148   64      77
4      16    711    51      64
5      32    197    38      51
6      64    40     27      38
7      128   0
8      256   4      1       5
9      512   13     0       0
10     1024  11     0       0
11     2048  1      0       0
12     4096  1      0       0
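
In the output above each Range row n corresponds to an XF of 2 to the power n. As a small arithmetic aid, the Python sketch below finds the smallest range whose maximum XF still covers an observed value. The function name is invented, and this is plain arithmetic rather than a sizing rule from IBM; the advice in this deck remains to pick a value between 10 and 40 if you do not know better.

```python
def range_needed_for(xf: float) -> int:
    """Smallest n such that 2**n is greater than or equal to xf."""
    if xf < 1:
        raise ValueError("XF cannot be below 1")
    n = 0
    while 2 ** n < xf:
        n += 1
    return n


# XF values taken from the XF column of the "sh ai" output
for observed_xf in (64, 128, 4096):
    print(f"XF {observed_xf} needs SERVER_TRANSINFO_RANGE >= {range_needed_for(observed_xf)}")
```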

Clustering: For Geeks... and for Normal People Too! Q&A

George Chiesa <[email protected]>

Daniel Nashed <[email protected]>

[Diagram repeated from the title slide: DATABASE and REPLICA (VIEW / DATA) exchanging changes by Push and Pull through SERVER and UPDATE, next to DATABASE and (replica) exchanging changes by Push only through CLREPL]

SUPPORT "EXTRA" MATERIAL

These are the support pages...

Which you can get by asking for them at the back of your business card...

We politely request NO REPOSTING...

Mission Critical Service

Much better defined by the

Total Cost of NOT HAVING IT

when you need it

In other words, something that despite having a (well known?) TCO

may prove much more painful & expensive "NOT TO HAVE"

Keys: TOTAL costs of NOT having

The "Nines":

2 nines (99%) =circa= 88 hours/year

3 nines (99.9%) =circa= 9 hours/year

4 nines (99.99%) =circa= 52 minutes/year

5 nines (99.999%) =circa= 5 minutes/year

Downtime costs per user = [ (Total hours of unscheduled downtime X 25% of user population X Hourly user salary) + (Total hours of scheduled downtime X Hourly messaging administrator salary) ] / Number of messaging users

NOTA BENE: R.S.E. and Change Management/Control needs
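
The "nines" and the cost formula are easy to check with a few lines of Python. This is a sketch under assumptions: the function names are invented, the year is taken as 365 days, and the cost function follows one reading of the formula on this slide (unscheduled hours times the affected users, here 25% of the population, times the hourly user salary). All figures in the cost example are made up.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


def downtime_cost_per_user(unscheduled_hours, affected_users, hourly_user_salary,
                           scheduled_hours, hourly_admin_salary, messaging_users):
    """One reading of the slide's formula; every input is supplied by the caller."""
    user_cost = unscheduled_hours * affected_users * hourly_user_salary
    admin_cost = scheduled_hours * hourly_admin_salary
    return (user_cost + admin_cost) / messaging_users


for nines in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(nines)
    print(f"{nines}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")

# Hypothetical example: 1000 users, 25% affected, 10 h unscheduled, 20 h scheduled
print(downtime_cost_per_user(10, 250, 40, 20, 50, 1000))
```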

Business Users do NOT care what you do with your PLANNED down time

as much as they care NOT to have ANY UN-PLANNED down time during "biz time"

Business users can plan around PLANNED un-availability of mission critical systems

What Business Users can NOT usually accept is having to have both Planned and UN-Planned

YOU CAN NOT REDUCE BOTH TO ZERO

on an individual component basis

Key: "individual component basis"

Never begin asking for the budget...

ask for preference/aversion

acceptable amount of UNplanned downtime against the money needed to prevent it

Have the users KEEP a contingency "Plan B" for alternative/manual processing updated, so they realise how mission critical their system really is...

TEST their plan B (fire drill :-)

Ask again for the "TC of not Having"

Ask again for "Not Having Aversion"

RunFaster=1

RunSafer=1

DoNotCrash=1

DoNotGetHacked=1

DoNotScrewMySLA=1

DoNotRuinMyBonus=1

DoNotGetMeSacked=1

Which of these ACTUALLY EXIST?

High Availability

My petty own TWO definitions

Historical = (ex-post)

the FACT that a service has been available in the past

Predicted = (ex-ante)

a "PERCEPTION" in terms of Probability that a service will be up

when it will be needed in the future

KEY: do NOT extrapolate past availability

Strategic Planning:

My petty own definition (borrowed from many :-)

Analyze possible future scenarios/events, their value and impact to you

What can go wrong, and how much will it cost me/my entity NOT to have the service

Estimate the "a priori" / "pari passu" probability of these events

Analyze, decide and take actions TODAY that will improve the probability of the desired events and scenarios actually happening

Keyword of this slide is TODAY

There is no such thing as

"THE BEST" practice as absolute recipe

Does it make sense to ask ?

Will the server be up tomorrow?

NO SLA will make it happen...at most you will get damages/penalties

It makes sense to Actively Plan & Design:

WHAT CAN I DO TODAY to IMPROVE the probability or likelihood that a Service will be perceived as available when needed?

The (pre) Works

You must apply generally agreed Best Practices

for making the individual items more reliable

Examples:

Clean your network of unwanted traffic

Deploy Storage & IO sensibly, i.e. http://www.Lotus.com/Performance

Automate the deployment customizations

The Works: Networking

Apply standard tuning to OS and TCP

DELETE every single other protocol you can

PRINT and understand relevant KB notes

Examples of advised TCP/IP hacks:

EnablePMTUDiscovery=0

TcpTimedWaitDelay=30

etc

Analyze your network, then investigate and eliminate ALL non-essential traffic

Domino and I/O Optimization

single RAID5 volume

I/O controller

Don't do this!

bottlenecks

Controller Channels

OS Kernel

Page file

Notes executables

Log files

Domino data

Domino and I/O Optimization

(better)

RAID5 volume

I/O controller

Separate drive

OS Kernel

Page file

bottlenecks

Controller Channels

Notes executables

Log files

Domino data

Domino and I/O Optimization

(even better)

RAID(1, 5) volume

I/O controller

Separate drive

OS Kernel

Page file

Controller Channels

bottlenecks

Notes executables

Log files

Domino data

OS

Page

Domino

Domino and I/O Optimization

(much better)

RAID(1, 5) volume

Separate drive

OS kernel

Page files

Controllers

Notes executables

Log files

bottlenecks

Apps, Domino

I/O technology

OS technology

I/O controller

OS

Page

Domino

I/O controller

RAID(1, 5) volume

\data

\data

Hardening HARDWARE Install and Post-upgrade script

Any/Everything in the box installed CAN fail

If something is not installed it can not fail

Physically remove from the boxes ANY hw not used

Modems, Audio, etc

DISABLE everything you can't take out

Classic: lpt1, com1, com2, etc

BOOT SEQUENCE: C, CD, A

DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPS

Hardening SOFTWARE Install and Post-upgrade scripts

MOST SW vulnerabilities are based on SW Bugs

ALL software has (some) known + unknown BUGS

If a software is not installed it can not run :-)

If a software is not running its Bugs don't matter

UNINSTALL everything you do not absolutely need

Remove all un-needed online-documentation

Win32: SPECIFICALLY UNINSTALL WORKSTATION LAN SERVICES!!!

Remote Management: do NOT mix/share intranet security/passwords/domains/etc


Hardening SOFTWARE Install and Post-upgrade scripts (cont'd)

UNPLUG ALL NETWORK CABLES BEFORE UPGRADE, install from "safe" CDs, NEVER via LAN/WAN/etc

After a new install, WindowsUpdate or equivalent:

disable everything you do not need

better yet, UNINSTALL what you do not need

check what services are running / started / auto

netstat -an | find /i "listening" (check EACH)

Beware of R.S.E. (Reverse Social Engineering)

DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPS
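The "check EACH listening port" step can be scripted so that it becomes part of the documented post-upgrade checklist. The following is a minimal sketch, assuming the Windows column layout of "netstat -an" (Proto, Local Address, Foreign Address, State). The function name listening_sockets and the sample text are illustrative assumptions, not part of the original slides.

```python
def listening_sockets(netstat_output):
    """Return (protocol, local_address) pairs for TCP sockets in LISTENING state.

    Assumes the Windows 'netstat -an' layout:
    Proto, Local Address, Foreign Address, State.
    """
    sockets = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Header rows and UDP rows (which carry no state column) are skipped.
        if (len(fields) >= 4
                and fields[0].upper().startswith("TCP")
                and fields[3].upper() == "LISTENING"):
            sockets.append((fields[0], fields[1]))
    return sockets


# Hypothetical sample output, for illustration only.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    0.0.0.0:1352           0.0.0.0:0              LISTENING
  TCP    10.0.0.5:1352          10.0.0.9:50312         ESTABLISHED
  UDP    0.0.0.0:500            *:*
"""

for proto, address in listening_sockets(SAMPLE):
    print(proto, address)
```

Each address printed should map to a service you can name and justify; anything else is a candidate for disabling or uninstalling.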


Hard trends / environmental changes

It's a wild world out there...

There is a lot of Win32 out there...

online / aDSL! unpatched / running "Admin"

Most Win32 patches require "reboots"

Linux is only as secure as its admins are senior

and vice versa, which is also true at the lower end

Vulnerabilities (KNOWN and not)

13% of DNS servers have known vulnerabilities, according to ICANN

PACE of change in OS patching levels

External and "Internal" ScriptKiddies


Dilbertian Examples or WYPIWYG

IF you pay people to keep up the UPTIME of individual machines (stress on individual)

They WILL schedule + preventative maintenance time

They will NOT apply patches a.s.a.p. / as they become available

They will NOT down a service EVEN when at risk

99% of hacked/virus-infected machines were hit through "already well known vulnerabilities"

It will cost you much MORE money and trouble

and you will get LESS value for your money


SLAs are as useful to prevent damage as insurance/assurance [ :-) ]

They make you feel better about the OUTCOMES of evil things

but they do NOTHING TO prevent evil things from happening in the first place

Some "Dilbertian" examples:

"I will insure my house in order for it NOT to catch fire", when you'd better buy insurance in case of disaster BUT ALSO

get a smoke detector (detection)

get fire extinguishers (response !)

"I will ask people to sign NDAs..."


High Availability

Something that is "likely" to be available...

Must be architected and run as such

"Architected" implies "HEURISTICS", most of which are "difficult to quantify"

It's easier to measure square feet of grass mown

than to quantify "Garden Landscaping Work"

"Run" requires having meaningful WYPIWYG


The HUMAN Factor: WYPIWYG

WYPIWYG is actually W.Y.P.I.W.Y.G.

"What You Pay (not Print) Is What You Get"

If you measure the wrong things...

you WILL get wrong behaviours and outputs


SPOFs = Single Points of Failure

Definition:

A single point of failure is anything that is not redundant enough and whose failure will damage the availability of a service

I will NOT repeat the trivial ones here

Some "hidden SPOFs":

check the bill of materials for anything that appears only once (quantity 1)

mouse/keyboard/switch ==> IMPLIES SAME RACK

UPS/ISP/Site:

you may have to consider multi-site / multi-homed
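The bill-of-materials check lends itself to a trivial script. This is a minimal sketch, assuming the bill of materials is available as a mapping from component name to quantity. The function name hidden_spofs and the example inventory are hypothetical and only illustrate the idea.

```python
def hidden_spofs(bill_of_materials):
    """Return the names of components that appear exactly once."""
    return sorted(name for name, quantity in bill_of_materials.items()
                  if quantity == 1)


# Hypothetical inventory, for illustration only.
EXAMPLE_BOM = {
    "server": 2,
    "power supply": 4,
    "KVM switch": 1,
    "UPS": 1,
    "ISP uplink": 1,
    "network switch": 2,
}

print(hidden_spofs(EXAMPLE_BOM))
```

A quantity of one does not prove a SPOF, and a quantity of two does not prove redundancy (both units may share a rack, a UPS or a site), so the list is only a starting point for review.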


The HUMAN Factor: WTPIWTD

WTPIWTD is actually W.T.P.I.W.T.D.

"What THEY Pay Is What THEY Demand"

Make sure the BizSponsor pays by BILLBACK

a class of service with expected resilience

a % of your fulfillment platform

Never let a user "own" a box that you run

easier said than done, but try :-)


The beauty of Notes/Domino: Secure Replication

Deploy to more than one site, enabled by replication of databases

scheduled replication

event driven replication

both

Tips:

do NOT deploy by OS copy nor FTP, use replicas

Hardcode the Cluster OU in ACLs, i.e. */Srv/<whatever>

[Names]: Add to prevent pull replication issues


Credits:

Our Teachers

Lotus/IBM/Iris:

too many links, thanx to all !

Our Partners:

Penumbra Partnering Inc. http://www.PENUMBRA.org

Our Customers

Some names in our site :-)


From Lotus Operating Principles:

"Establish Purpose Before Action" as in

Alice (In wonderland)

Tell me Mr. Cat, which "Route" should I use?

Cat:

Where do you want to go ?

Alice:

Dunno, haven't figured that out yet!

Cat:

it does not matter which one you choose!


ALL the "answers" are already out there

somewhere, most of them on the internet

the VALUE question is how to figure out

WHAT ARE THE RELEVANT QUESTIONS ?

It's useful to define "relevant"

the "YOU ARE HERE" has changed

from "my Domino World"

to "my Enterprise choices"

Who moved my cheese ?

MTBF = MEAN TIME BETWEEN (guaranteed !!!) FAILURES

Average of when you can expect something to fail

Assumes everything will eventually fail - by design!

MTBF implies P(F,eventually)=1.0

Murphy's LAW ...and...

Never Let a Machine Know You Need It :-)

Please engrave on my tombstone:

The devil is in the variances to averages
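To make "P(F,eventually)=1.0" concrete, here is a minimal sketch under one common modelling assumption that is not stated on the slide: a constant failure rate, i.e. exponentially distributed time to failure with mean equal to the MTBF. The function name failure_probability and the 100,000-hour figure are illustrative assumptions.

```python
import math


def failure_probability(hours, mtbf_hours):
    """P(component has failed by 'hours'), assuming a constant failure rate.

    Under this (assumed) exponential model: P = 1 - exp(-hours / MTBF).
    """
    return 1.0 - math.exp(-hours / mtbf_hours)


MTBF_HOURS = 100_000  # hypothetical vendor figure

for hours in (8_760, 50_000, 100_000, 500_000):
    print(hours, round(failure_probability(hours, MTBF_HOURS), 3))
```

Under this model the probability of having failed by the time one full MTBF has elapsed is already about 63%, and it approaches 1 as time goes on. The average says little about when a particular unit fails, which is the point about variances.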


PLAN AROUND UnPlanned Failures

you KNOW, with P(X fails, eventually) = 1,

that individual components (= something) will fail (eventually)

but you do not know WHEN, WHAT, HOW

TRY to make cross-correlations work for you

Don't forget Murphy's Law


Leverage on differences

reduce risk by using stuff that will fail eventually BUT with negative or zero correlation

Win32 code-streams have a huge built-in correlation, and so do the UNIXes/Linuxes

Lower correlation between Win32, Linux, etc.

Lower correlation between AS400/iSeries and the rest

Use this to weigh how you "spread" stuff
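The effect of correlation on the chance of losing two boxes at once can be sketched numerically. This is an illustration only, treating each failure as a yes/no event in a given time window. For two such events with probabilities p_a and p_b and correlation rho, P(both fail) = p_a * p_b + rho * sqrt(p_a (1 - p_a) p_b (1 - p_b)). The function name and the 1% figures are assumptions, not measurements.

```python
import math


def joint_failure_probability(p_a, p_b, correlation):
    """P(A and B both fail) for two yes/no failure events.

    Uses P(A and B) = p_a * p_b + rho * sqrt(p_a(1-p_a) * p_b(1-p_b)).
    Not every rho is feasible for given p_a and p_b, so a negative
    result is clamped to zero.
    """
    covariance = correlation * math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return max(0.0, p_a * p_b + covariance)


P_FAIL = 0.01  # hypothetical: each box has a 1% chance of failing in the window

for rho in (0.0, 0.5, 1.0):
    print(rho, joint_failure_probability(P_FAIL, P_FAIL, rho))
```

With zero correlation the joint failure probability is the product of the two. With correlation close to 1 (same OS, same patch level, same bug) the second box adds almost no protection.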


Embedded Dis-Services

Anything having EITHER an MTBF, an SLA or windowsupdate.com or liveupdates

has "Embedded individual outages"

SLA implies Dis-Service agreement trade-offs

The Business User does NOT care for INDIVIDUAL SLAs/MTBFs

So you could, can and must

Architect and Design

a CLUSTERed Solution

and offer a CLUSTER SLA
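A rough way to see why a cluster SLA can be better than any component SLA: if the service stays up as long as at least one clustermate is up, and failures are independent (a strong assumption, see the correlation discussion), cluster availability is 1 minus the product of the individual unavailabilities. The function name and the availability figures below are hypothetical.

```python
def cluster_availability(member_availabilities):
    """Probability that at least one member is up, assuming independent failures."""
    all_down = 1.0
    for availability in member_availabilities:
        all_down *= (1.0 - availability)
    return 1.0 - all_down


# Hypothetical figures: two members at 99%, three members at 95%.
print(round(cluster_availability([0.99, 0.99]), 6))
print(round(cluster_availability([0.95, 0.95, 0.95]), 6))
```

This ignores failover time, shared SPOFs and capacity limits when a member is down, so it is an upper bound on what to promise rather than a figure to put in a contract.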


Manage measurables, the right ones

If you measure & pay people for the cluster SLA

and "free" them from the components' SLAs:

For individual Machines/OS/HW/Components:

they will get downed to investigate/fix/update sooner, a.s.a.p., for known vulnerabilities/problems

+ preventive maintenance done during prime time

less dependency on graveyard-shift work

The user will get better and overall cheaper service

less dis-service, and smoother/safer Operations

Operators will match demand of services with + offer


Portfolio Principles

"there is nothing wrong with putting all your eggs in one basket, just watch that basket" (usually attributed to Mark Twain or Andrew Carnegie)

don't put all your eggs in one basket, because you can't watch it closely enough

don't put your eggs in too many baskets, because you can't watch them all closely enough


Testing Tips & Tricks:

my first SW manager taught me in 1980:

Design with Testing in mind;

what you can NOT PROVE works will either NOT work from day one (but remain hidden until needed) or fail in the future...

Document the testing... for regression testing

A fellow Penumbra member told me: You do not need a boat, you need a friend who has one and knows how to use it....

Same for a protocol analyser: you just can NOT guess the client/server dialogue (e.g. caching)


Copyright 2002 dotNSF, Inc. - All rights reserved

Please contact dotNSF at +44 771 85 87 673 for more presentations & information.... visit: http://dotNSF.com

WYPIWYG is actually W.Y.P.I.W.Y.G.

"What You Pay (not Print) Is What You Get"

If you measure the wrong things...

you WILL get wrong behaviours and outputs


High Availability

The art of doing something "automagically" to improve the perceived performance of the cluster, usually by making intelligent usage of idle resources.

Proactive:

Load Spreading

Reactive:

Performing Load "re-"Balancing by trying to fail over to less busy clustermates
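The reactive case, failing over to a less busy clustermate, can be sketched as a simple selection rule. This is an illustration of the idea only and not how Domino itself chooses a clustermate. The function name, the load scale (0.0 idle to 1.0 saturated) and the busy threshold are all assumptions.

```python
def pick_failover_target(load_by_server, unavailable_server, busy_threshold=0.8):
    """Return the least busy clustermate below the threshold, or None.

    load_by_server maps a server name to a load figure between 0.0 and 1.0.
    """
    candidates = {
        name: load
        for name, load in load_by_server.items()
        if name != unavailable_server and load < busy_threshold
    }
    if not candidates:
        return None  # every clustermate is too busy or there is none
    return min(candidates, key=candidates.get)


# Hypothetical cluster state, for illustration only.
CLUSTER_LOAD = {"Srv1": 0.9, "Srv2": 0.4, "Srv3": 0.6}

print(pick_failover_target(CLUSTER_LOAD, "Srv1"))
```

Returning None when all clustermates are busy forces the caller to decide explicitly between queueing, refusing, or overloading a server, instead of silently piling users onto a struggling box.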