Sunfire Design and Configuation

8/3/2019 Sunfire Design and Configuation

http://slidepdf.com/reader/full/sunfire-design-and-configuation 1/36

Send comments about this document to: [email protected]

Sun Fire ™ Systems Design andConfiguration Guide

Nathan Wiger

Roger Blythe

Part No. 816-7882-10September 2002 Revision 04

Sun Microsystems, Inc.4150 Network CircleSanta Clara, CA 95054 U.S.A.650-960-1300



PleaseRecycle

Copyrigh t 2002Sun Microsystems, Inc., 901San Antonio Road, Palo Alto, CA 94303-4900 U.S.A.A ll rights reserved.

This product or document is distributed un der licenses restricting its use,copying, distribution, and decompilation.N o part of this product or

docum ent may be reprod uced in any form by any means with out prior written authorization of Sun and its licensors, if any.Third -party

software, including font technology, is copyrighted an d licensed from Sun sup pliers.

Parts of the prod uct may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademar k in

the U.S.an d other coun tries, exclusively licensed throu gh X/ Op en Compan y, Ltd.

Sun, Sun Microsystems, the Sun logo, Answ erBook, Answ erBook2, docs.sun.com, Solaris, Sun Man agement Center, Sun BluePrints, Sun Q uad

FastEthernet, Sun StorEdge, Op enBoot, Sun Enterprise, Sun Fireplane, and Sun Fire are trademar ks, registered tra dem arks, or service marks

of Sun Microsystems, Inc. in the U.S. and o ther coun tries. All SPARC tradema rks are used un der license and are tradem arks or registered

trad emarks of SPARC Internat ional, Inc. in th e U.S. and oth er countries. Products bearing SPARC tradema rks are based u pon an architecture

develop ed by Sun M icrosystems, Inc.

ORACLE is a registered trademark of Oracle Corporation. Netscape is a trademark or registered trademark of Netscape Communications

Corpora tion in the Un ited States and oth er countries. Legato NetWorker is a registered tr adem ark of Legato Systems, Inc. Adob e is a registered

trademark of Adobe Systems, Incorporated.

The OPEN LOOK and Sun Graph ical User Interface was d eveloped by Sun Microsystems, Inc. for its users and licensees. Sun acknowled ges

the p ioneering efforts of Xerox in researching an d d eveloping th e concept of visua l or graph ical user interfaces for the comp uter ind ustry. Sun

hold s a non -exclusiv e license from Xerox to the Xerox Graph ical User Interface, wh ich license also covers Sun’s licensees wh o implem ent OPEN

LOOK GUIs and otherw ise comp ly with Sun’s written license agreements.

RESTRICTED RIGH TS: Use, du plication, or disclosu re by the U.S. Govern men t is subject to restrictions of FAR 52.227-14(g)(2)(6/ 87)a nd

FAR 52.227-19(6/ 87), or D FAR 252.227-7015(b)(6/ 95) and DFAR 227.7202-3(a).

DOCUMENTATION IS PROVIDED “ AS IS” AND ALL EXPRESS OR IMPLIED CON DITIONS, REPRESENTATION S AND WARRAN TIES,

INCLUDING ANY IMPLIED WARRANTY OF MERCHAN TABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NO N-INFRINGEMENT,

ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Copyrigh t 2002 Sun Microsystems, Inc., 901San Antonio Road, Palo Alto,CA 94303-4900 Etats-Unis.Tous droits réservés.

Ce prod uit ou documen t est distribué avec des licences qui en restreignent l’utilisation, la copie, la distribu tion, et la décomp ilation. Aucun e

partie de ce prod uit ou docum ent ne peut être reprod uite sous aucun e forme, par qu elque moyen que ce soit, sans l’autorisation pr éalable et

écrite de Sun et de ses bailleurs de licence, s’il y en a. Le logiciel détenu par d es tiers, et qui comprend la techno logie relative aux polices de

caractères, est protégé par u n copyright et licencié par des fournisseurs d e Sun .

Des parties de ce prod uit p ourron t être dérivées des systèmes Berkeley BSD licenciés par l’Université de Californie.UN IXest un e marqu e

dép osée aux Etats-Unis et dans d ’autres pay s et licenciée exclusivement par X/ Op en Comp any,Ltd .

Sun, Sun Microsystems, le Sun logo, Answ erBook, Answ erBook2, docs.sun.com, Solaris, Sun Managem ent Center,Sun BluePrints, Sun Qua d

FastEthernet, Sun StorEdge, Open Boot, Sun Enterprise, Sun Fireplane, and Sun Fire,son t des marques de fabrique ou des marqu es dép osées,o umarqu es de service,d e Sun Microsystems, Inc.a ux Etats-Unis et dan s d’autres pays. Toutes les marques SPARC sont utilisées sous licence et

sont des marqu es de fabrique ou des marqu es déposées de SPARC International, Inc.au x Etats-Unis et dan s d’autres pays. Les prod uits portan t

les marques SPARC sont basés sur u ne architecture développ ée par Sun Microsystems, Inc.

ORACLE est une marque déposée registre de Oracle Corporation. Netscape est une marque de Netscape Communications Corporation aux

Etats-Unis et dan s d’autres pays. Legato NetWorker est une marqu e de fabrique ou un e marqu e déposée de Legato Systems, Inc. Adob e est une

marque enregistree de Adobe Systems, Incorporated.

L’interface d’utilisation graphiqu e OPEN LOOK et Sun a été dévelop pée par Sun Microsystems, Inc.p our ses utilisateurs et licenciés.Sun

reconnaît les efforts de pionn iers de Xerox pour la recherche et le dévelop pem ent d u concept d es interfaces d’utilisation visuelle ou grap hique

pou r l’indu strie de l’informatiqu e. Sun d étient un e licence non exclusive de Xerox sur l’interface d’utilisation graph ique Xerox, cette licencecouvran t également les licenciés de Sun q ui mettent en p lace l’interface d’utilisation graphiqu e OPEN LOOK et qui en outre se conforment aux

licences écrites de Sun .

LA DOCUMENTATION EST FOURNIE “EN L’ETAT” ET TOUTES AUTRES CONDITIONS, DECLARATIONS ET GARANTIES EXPRESSES

OU TACITES SONT FORMELLEMENT EXCLUES, DANS LA MESURE AUTORISE’E PAR LA LOI APPLICABLE, Y COMPRIS

NO TAMMEN T TOUTE GARAN TIE IMPLICITE RELATIVE A LA QUALITE MARCHAN DE, A L’APTITUDE A UNE U TILISATION

PARTICULIERE OU A L’ABSENCE DE CONTREFAÇON.



1

CHAPTER 1

Designing You r System

Now that you have completed your statement of requirements you can w ork on the

first ha lf of d esigning a Sun Fire system —designing t he logical server. By the en d of

this chapter, you will have completed a logical design containing a list of how many

of each of the components you need, and a listing of your RAS requirements. Then,

you can apply this configuration in Chapter 4 wh en you choose the p hysical system

in wh ich to p lace your d esign.

Systems design is done in this somewh at “backwards” m anner for two imp ortant

reasons:

s To make su re your requ irements are clearly stated and met.

s Multiple servers can be located inside one physical chassis because Sun Fire systems

support domains.

Followin g this process w ill also help ensure that you p urchase a system w ith enoug h

room for future expan sion.

This chapter covers the following topics, which describe the logical design process:

s Und erstanding a Runn ing System

s Design Rules of Thum b

s Analyzing an Existing System

s Designing for RAS

s A Logical Design Specification

Understand ing a Runn ing System

This section reviews the basics of a computer system. While this is likely all

"refresher" material, many people m isunderstand the real roles of comp uter

components somew hat. As such, an an alogy to a reception desk is used to h elp

http://c4_5.06.02.pdf/

http://c4_5.06.02.pdf/



2 Designing Your System

better illustrate the d ifferent role each major component plays. In th is analogy we

follow a receptionist answering various typ es of incoming calls to show how a

compu ter ma nages the requests it receives.

Every comp uter system h as three main compon ents that can be configured:

1. I/ O devices

2. CPUs

3. Memory

Of course, a Sun Fire system has m any other components too, including rep eater

boards, the Fireplane, and so on. How ever, in the Sun Fire system (as with m ostcompu ter systems), these are part of the fundam ental architecture of the m achine

and cannot be configured by the customer. This fact mean s that to d esign you r

system, you should pay close attention to th e decisions you make regard ing CPUs,

mem ory, and I/ O because th ese decisions w ill d irectly affect the effectiveness of

your design.

Notice the use of the term CPUs. Because the Sun Fire system board is sold with a

minimum of two processors, it is not possible to buy a single-CPU Sun Fire system.

All Sun Fires are multiprocessor systems.

I/ O Devices

The Sun Fire system uses the PCI bus for all I/ O. The I/ O is what allows you to do

anything p rodu ctive with the system. Without I/ O, you wou ld have no keyboard,

no netw ork connection, no d isks, and so forth.

Und erstanding the imp act I/ O has on the system is imp ortant. When something has

to be done w ith I/ O, an interrup t is generated. The CPUs mu st hand le this interrupt.

Frequen tly, I/ O is the single-biggest resource sink on a system. This fact is especially

true w hen you have m ultiple types of I/ O running h eavy loads concurren tly, which

generates a large num ber of interrup t requests.

For examp le, consider a backend d atabase server that is front-ended by a dozen or

more concurren t web servers. When a web server needs some dyn amic data, it hasto make a request via the network to the server, wh ich then m ust do the ap propr iate

database selects and retrieve the data from its local disk, finally shuffling the reply

back across the network to the w eb server that requ ested it. This can result in a

nu mber of I/ O interrup ts, as the system mu st hand le all of the netw ork packets as

well as all of the disk seeks to get the d atabase information off disk.

When you mu ltiply one request times a dozen or more w eb servers, each request

times a d ozen or m ore clients, you can see that the d atabase server could easilybecome swam ped w ith I/ O interrup ts, wh ich excludes the comp uting pow er needed

to run the operating system, manage m emory, and ru n the d atabase itself.



Understanding a Running System 3

To tie everything together, think of I/ O as each individual phone call received by a

receptionist. Each p hone call generates an interru pt that th e receptionist mu st

hand le. Depend ing on the requ est, it m ay result in a lot of data tran sfer (talking)

back to the caller. More calls generate m ore interru pts. Eventu ally, the ph one system(server) hits a limit either in the am ount of concurrent requ ests that it can hand le

(memory), the speed with which the requests can be fulfilled (by the CPUs), or how

fast the caller and the CPUs can comm unicate (I/ O speed ).

CPUs

The CPU is actually responsible for mu ch more than compu tation. Anything that

pu ts a load on the system, includ ing d atabases, web servers, email, NFS, NTP, and

general networ k and user tra ffic, requires a lot of CPU p ower. The CPU d oes not d o

as much thinking as it does handling. Any time the system m ust d o anything, it mu st

ask the CPU, which has to prioritize the task, schedule it, and allocate resources for

it, and do so in a w ay that allows all the other mu ltitud e of things going on to

continue running too.

In this way, the CPU can be thou ght of as a bu sy receptionist. The receptionist ha s anu mber of stand ard routines. These may include forwarding calls to employees,

taking messages, setting u p ap pointments, and even providing d irect responses to

simple requests, “What is your ad dress?” When an incoming p hone call is received,

the receptionist executes the proper routine, and completes the request if possible. If

the request cannot be fulfilled in a reasonable amoun t of time, the receptionist may

have to place the caller on hold temp orarily to handle some other tasks and free up

some time.

In some cases, the receptionist may receive a request that is too complex to behand led by standard routines. For example, the receptionist ma y receive a call that

the boss is running late, and th at several meetings need to be rescheduled. H ere, the

receptionist must d o some thinking to determ ine wh ich m eetings can be moved to

wh en. At the same time, the receptionist mu st still pay attention to other incoming

calls, to ensure an important request is not missed.

If things get too busy for one receptionist to hand le, you m ay need two or m ore

receptionists. Some callers may even get frustrated and hang up. Even for those thatdo get throu gh, there will likely not be enough time to prop erly answer their

queries.

So, it is imp ortan t to consider not on ly the difficulty of each requ est, but the volum e

too. In our analogy, each incoming request requires a certain baseline of time to

han dle p roper ly. Typ ically, the receptionist will have to p ress a butt on to p ick up th e

app ropriate line, answ er the call with a greeting, listen and analyze the requ est, then

pr ioritize it and comp lete it app ropr iately. Even if a request consists of noth ing more

than “Is Mr. John son in?”, it still takes a certain am oun t of time to fulfill the requ est.




Memory

In the Sun Fire system, the system m emory is dynam ic rand om access memory

(DRAM). The system u ses mem ory to store thing s that it is using actively such as theoperating system, programs, and their data.

When asked to execute a program , the system m ust allocate space in m emory to

hold an image of the program and its associated data. This space can grow or shrink

as the program run s, since its resource requirements may change. In reality, most

app lications grow over time because they d o a p oor job of cleaning up after

themselves.

When a system is und er a very heavy load, it may ru n out of room in memory tohold a ll the information it n eeds. In this case, it u ses predeterm ined d isk space,

known as swap space, to tempora rily store lesser-used things from m emory

temporarily to make room for other things. This is known as paging, since it involves

selectively moving specific data out of memory in sections know as pages. When

those pages are needed , the system incurs a page fault , and the d ata is moved from

disk back into memory.

In extreme situations, the system may und ergo swapping. In this case, memoryimages of entire programs a re moved from m emory ou t to d isk. This is a significant

performance hit, and if the system starts swa pp ing, some serious problems may

occur. Unfortunately, the terms p aging and swap ping are often u sed

interchangeably, perhap s because the d isk storage is called “swap space,” bu t they

are really very different.

Do not undervalue how important memory is to a running system. Not having

enough m emory is perhap s the single greatest cause of performance problems.

With th e receptionist analogy, you can th ink of mem ory as th e nu mber of incoming

ph one lines available. Even if you h ave five reception ists (CPUs), it w ill not h elp the

situation if you only hav e four ph one lines (mem ory). The ph one system w ill still be

slow, since you have a bottleneck in th e amou nt of requests you can h and le

concurren tly. To accept anoth er call, the cur rent caller w ill hav e to be p laced on h old

( page-out ) in order to get back to the first caller ( page-in).

If the load gets too heavy for the phone system, and no more lines can be p ut on

hold, calls will have to be d isconnected (swap -out) to make room for others. Thereceptionists w ill then h ave to call the person back (swap -in), a mu ch more time-

intensive process.



Design Rules of Thumb 5

Design Rules of ThumbYou can u se a num ber of rules of thumb to design a system. Properly using these rules

requires a firm grasp on you r needs—that you ha ve comp leted a statement of your

requirements using the information and tables in Chapter 1 an d Chapter 2.

This section describes the following design rules of thumb:

s Spread your I/ O devices across as man y PCI buses as p ossible.

s Decide how man y CPUs the system need s.

s Decide how much memory the system needs.

s A well-designed system should seldom page, and never swap.

s The system shou ld always ha ve some idle time.

s Whenever you add additional CPUs, you should also add memory.

I/ O Devices

You shou ld always d etermine your I/ O design first, as this along w ith your

app lication need s determ ines your comp uting requ irements. To get the best

performance and reliability from your Sun Fire server, you should lay out the I/ O

carefully. An easy rule of thumb is:

Spread your I/O devices across as many PCI buses as possible.

Doing so distributes your I/ O load across as many different controllers as possible,

thus imp roving performance. In ad dition, you are red ucing the nu mber of single

points of failure that could cause your data to go offline. Unfortunately, this rule of

thum b has m any caveats. Unlike CPUs and mem ory, the layout of your I/ O

intimately affects the reliability of your m achine, and wh ether or not you can use

features such as dynamic reconfiguration (DR). Chapter 4 d iscusses th e issue of I/ O

design in detail, taking all of these factors into consideration.

CPUs

Regardless of what your tasks are—NFS service, CAD simulations, or compiling

software builds—handling each requ est requires a certain baseline of time, as th e

receptionist examp le shows. Not only is the type of request imp ortant, but th e

qu antity of requests is impor tant too. In fact, it is often hard er for a system to han d le100 sma ll requ ests than 10 large ones, due to the inh erent overh ead of han dling each

request.

http://c1_5.06.02.pdf/

http://c2_3.24.02.pdf/

http://c4_5.06.02.pdf/

http://c2_3.24.02.pdf/

http://c1_5.06.02.pdf/

http://c4_5.06.02.pdf/




How man y CPUs are enough? The rules of thum b you can use to help you

determine how many CPUs you need are:

s One-half CPU per n etwork card

s One-eighth CPU per I/ O device (disk or tape)

s Two CPUs per application for mostly I/ O-based applications (NFS, web servers

and so forth)

s Four+ CPUs p er ap plication for m ostly CPU-based ap plications (simu lations,

databases and so forth).

These figures assume a mod erate load on you r system. If you are expecting a high

load on certain aspects, you should d ouble the correspond ing num bers. For

example, if a system is going to h ave a lot of network traffic, you sh ould have on e

CPU for every netw ork card to h and le the interrupts. Conversely, if you are

designing a system you expect to have a very light load, cut the nu mbers in half or

consider w hether th e tasks that system is going to be performing could be combined

with another server to lower overhead.

To get an id ea of how m any CPUs you need, ad d up each of the criteria that affect

you, then round up to the nearest multiple of two. We recomm end th at you only buy

the four-CPU boards for you r Sun Fire system. Purchasing a tw o-CPU board limitsyour futu re expansion room, since it takes up the sam e amou nt of space as a four-

CPU board. H owever, there are m erits to the 2-CPU board if you do not need

expansion room, and th e examples later in the book dem onstrate a good u se for it.

So, if you are d esigning an NFS server w ith a gigabit network card and six Sun

StorEdge T3 arrays, you h ave the following CPU requ irements (TABLE 3-1).

Rounding up, you should buy a four-CPU board to run this system.

Note – These rules work well for average systems. H owever, for high-intensity

app lications such as on line transaction p rocessing (OLTP), data m ining, and so forth ,

you should research your needs more carefully. For details, see “Analyzing an

Existing System” on page 8.

TABLE 1-1 CPU Requirements

Description Number

Gigabit network card (1) 1/ 2

Disk arrays (6) 3/ 4

NFS server 2

Total 3 1/ 4



Design Rules of Thumb 7

When buying CPUs, you should make sure you have enough memory to

accomm odate them , or else you run the risk of thrashing. This means that the system

spend s all its time m oving things around in mem ory, and never d oes any real work.

This is like the receptionist who sp ends time picking u p p hone lines and saying”Please hold,“ without actually fulfilling any requests. ”Memory” discusses this in

detail.

Finally, in terms of speed, getting the fastest processor you can buy is always an

adva ntage. In ad dition to the speed of the processor, you also should consider the

size of its cache. Generally this is decided for you based on the p rocessor mod el, but

you w ant to m ake sure to get as large a cache as possible. The cache determin es how

man y operations can be hand led at one time by the processor without having to

make a trip back out to system memory. Processor cache is several orders of magnitude faster than memory , so a large cache is always beneficial.

Memory

Memory is perhaps the single most imp ortant par t of a comp uter system, and h as

the most d irect impact on p erformance. The more memory you ha ve, the morethings you can d o, and th e faster you can do th em, since less disk access is needed.

There is usu ally a greater correlation betw een perceived performan ce and m emory

than processors. With relational databases, for instance, being able to fit as much of

the da tabase in memory as p ossible can yield a big improvemen t in performan ce.

If your system is ru nning slowly, you should p robably buy more m emory, not more

processors. It is more likely that your system is run ning ou t of mem ory, not

processor cycles, and is having to use sw ap space to run your a pp lications.

It is possible to waste m oney and overbuy m emory as w ell, though, so here some

specific rules. On the Sun Fire system, m emor y is tied to a p rocessor (see TABLE 1-5 in

Chapter 1). So, you cannot buy a board with just mem ory and no CPUs. This

actually simplifies the design process considerably because there are only two

decisions to m ake:

s Whether to h alf-popu late or fully-populate each CPU board

s Whether to bu y larger or smaller DIMMs

Fully-populating a CPU board allows you to pu t more mem ory on it. In add ition,

though , you get better interleaving, which increases performance. Thu s, the rules of

thumb for mem ory are:

s For I/ O-based app lications, half-pop ulate the CPU/ Memory board.

s For CPU-based app lications, fully-populate th e CPU/ Memory board.

Then, choose the app ropriate DIMM size to provide enou gh m emory for your

app lication. Following these ru les will naturally lead to smaller mem ory sizes inNFS servers (where memory is basically solely used for the file buffer cache), and

http://c1_5.06.02.pdf/

http://c1_5.06.02.pdf/

http://c1_5.06.02.pdf/

http://c1_5.06.02.pdf/




larger, faster memory configurations in database and compute servers. Most systems

tend to be in one category or the other, but if there is a mix, fully-popu late all

boards.

Remember th at, as d iscussed previously, paging is und esirable. So, another good

rule of thum b is:

A w ell-designed system should seldom page, and n ever swap.

It is possible, in fact, to run a large memory system with very little (if any) swap

space. This fact is somewhat different from other commonly available information.

One comm only-used phr ase is “Your sw ap sp ace should be dou ble the size of your

ph ysical mem ory.” Consid er this for a mom ent. You can easily design a Sun Fire

system that has 64 gigabytes of memory. If you were to follow this advice, you

wou ld hav e to have 128 gigabytes of swap space. While a few vend ors may require

you to have a large swap spa ce, you should not rely on swap for real-world mem ory

usage, as it is too slow. When d esigning a system, make sure that you p urchase

enough m emory so that your system d oes not swa p. If it does, you need more

memory.

Analyzing an Existing System

Often, the pur pose of designing a new system is to replace an existing system in

your infrastructure. If so, you can benefit from analyzing your existing system

because this analysis w ill give you a better idea of w hat p roblems you are facing.

This analysis is also useful if you are trying to up grade a Sun Fire server. A prop eranalysis will ensure that you are up grading th e right parts of the system to add ress

the issues.

Before you go any further, you should revisit your design goals discussed and

developed in Chapter 2. Doing so will help you p roperly formulate your statement

of the performance problems you are encountering. A good p roblem statemen t is:

When many users are logged in, NFS performance is very slow.

A bad p roblem statement is:

There is a large number in the w column of the vmstat output.

Always start with the p erceived problems and requ irements. An imp rovement in

these area s is the only w ay you can tell if you r d esign is a success. You can only

make use of statistics if you know wh at you are looking for.

http://c2_3.24.02.pdf/

http://c2_3.24.02.pdf/



Analyzing an Existing System 9

The easiest way to analyze a system is by using the stat command s that ship w ith

the Solaris OE, and w hich can be u sed to m onitor performance of a runn ing system.

You can get a fu ll list of the available comm and s by typing th e following com mand

in a shell prom pt:

This command will display a series of comm and s, such as vmstat, iostat,

netstat, and so on.

You shou ld never u se the uptime comm and to analyze a system . You can u se it to

show h ow long your system ha s been up, but the notion of a load is very outd ated

and fairly useless in th e Solaris OE. Most notab ly, load varies w idely from system to

system; a load of 10 may indicate a lack of activity on one machine, but extreme

activity on another. We recommend you get in the habit of using vmstat 5 instead

of uptime wh en a machine seems sluggish.

Some stat command s are m ore useful than others, so the following sections focus

on the useful commands (TABLE 3-2).

Collecting and understanding the output from these commands should give you a

good idea of what p roblems your current system is having, and how to improve

up on these problem areas in the d esign of your new server.

The following sections review each comm and in turn , along w ith how to p roperlyuse each one, so you can gather the best statistics possible. It is important to note

that n ot all options of a given stat command produce useful—or even

trustw orthy—outpu t in all situations. The focus is on th e specific parts of the ou tpu t

of each command that are the most important.

# ls /usr/bin/*stat

TABLE 1-2 Useful Stat Commands

Command Description

/usr/bin/vmstat Virtual mem ory/ paging statistics with CPU/ process summ aries

/usr/bin/mpstat Extensive per-processor statistics

/usr/bin/iostat I/ O and NFS statistics

/usr/bin/netstat Netw ork statistics

/usr/bin/prstat Sum mary of active processes very similar to the top utility




How and When

How and when you monitor a system is just as important as what commands you

use and wh y you use them to collect statistics. You shou ld m ake sure that you aremonitoring the system when it is doing what you want it to do.

In some situations, this is relatively straightforwar d, such as on a mu ltiuser

interactive system. In this case, you w ant to ru n your stats during th e day, wh en

everyone is doing th eir normal w ork. Conversely, if you h ave a system that serves

mainly as a d atabase server, and th e load gets very heavy at night w hen batch jobs

are running, you should gather your stats overnight.

When collecting stats un attended (such as overnight), use a simple shell script tha twrites to a log file in /var/tmp with periodic timestamps. You can use something

like the script in CODE EXAMPLE 1-1 to run the stat comman ds mentioned

previously:

The way th is script w orks you w ill get timestamps at each interval count you

specify. So, if you run:

CODE EXAMPLE 1-1 nightstats—Script for Una ttend ed Stat Collection

#!/bin/sh

# nightstats - Script for unattended stat collection

if [ $# -lt 2 ]; then

echo "Usage: $0 stat args ..." >&2

exit 1

fi

# some basic vars holding date, etc.stat=$1

shift

date=‘date +%Y%m%d‘

logfile=/var/tmp/$stat.$LOGNAME.$date

# run the stat, writing output to our logfile

exec 1>$logfile

echo "Running ’$stat $@’ as ’$LOGNAME’"

while true

do

date$stat "$@"

done

# nightstats vmstat 5 12




you get a timestamp every 12 repetitions. Note that th e nightstats script loops

indefinitely so you mu st man ually kill it when you w ant it to stop. You can u se this

script to kick off stats in the background, either using cron or before you go hom e.

For example:

Then, wh en you com e to work the next d ay, you will have a log file in /var/tmp for

each stat, with timestamp s every five minutes. Each file will be named with the

name of the stat command , the date, and your user nam e ($LOGNAME is

au toma tically set to your u ser nam e by the shell). This will allow you to collect stats

during the times when your system is under the type of load you care about.

Note – You also w ant to collect some stats w hen you r system is not busy, which you

can then use as a baseline for comp arison. Otherwise, you w ill not be able to tell

wh at stats change w hen the load increases.

Simulating Loads

Trying to simu late loads is not very u seful. In general, trying to simu late a load gives

you a poor—if not m isleading—picture of w hat th e system is trying to d o. For

example, the common practice of using dd to w rite to disks is usually a

misrep resentative m easure of I/ O load . While dd read s and w rites sequentially, most

real-world disk access is random , and is an u npred ictable combination of reads and

wr ites. Thu s, your configuration could look good on p aper, and work well when

running dd, but work poorly in a real-world app lication.

To get an accurate picture of your requirements, you should monitor a system that is

run ning w hat you wan t it to be running. If you need to simu late this, the best way is

to create a test environment that m irrors what you wan t to design as closely as

possible. If you cannot do so, then w e recomm end that you u se the design rules of

thum b, and avoid a nalyzing a dissimilar system as this can cause you to make p oordecisions.

What and Why

Now that you understand how and when to measure your system, the following

sections examine each of the different stat command s and what they tell you.

# nohup nightstats vmstat 5 300 2>/dev/null &

# nohup nightstats iostat -xcnz 5 300 2>/dev/null &




prstat Command

When looking at your stats, the first thing you should know is what you r system is

doing. Seeing a large amou nt of d isk activity by itself does not tell you anything

other tha n th e system is un dergoing a large amoun t of disk activity. This is where

th e prstat command comes in. It show s you w hat p rocesses are active on the

system, along w ith how m uch CPU time they are using, wh at processor they are

bound to, their size in mem ory, their priority, and more. If you h ave used the

freeware tool top, the outpu t should look very familiar.

Unlike all of the other stat commands, to use prstat you just type the comm and

with no argum ents:

The display fills the terminal w ind ow an d refreshes every 5 second s. You shou ld

launch prstat in a separate wind ow an d keep it going as you use each of the

following stat command s. That w ay, you can correlate the performan ce of your

system with w hat is actively running on it.

vmstat Command

Memory is always the first place to start. If you h ave a mem ory bottleneck, then all

of your other stats are going to be u nreliable, since the system will be introducing

extra delays trying to m anage mem ory. Often mem ory problems are misdiagnosed

as I/ O or CPU problems, since disk access or applications seem slow to the user. In

reality, these operations are slow because the system is paging or even sw app ing.

So remem ber: Always start by looking at memory. Repeat that over an d over a s a kind

of mantra w henever you are analyzing or designing a system.

The virtual memory man agement algorithms in the Solaris OE are complex.

Basically everyth ing is seen as a pag e of mem ory, includin g files. While this is a

benefit as far as the system is concerned , it makes an alysis more d ifficult. Therefore,

properly analyzing m emory takes several steps.

# prstat




The simplest way to look at memory is by specifying a time interval to the vmstat

command, and letting it run u ntil you press Ctrl-C to interrupt it. The following

vmstat command monitors the system in five-second intervals:

The first line of th e vmstat output is a summary.

Note – Always ignore th e first line of any stat command. It does not provide any

useful information because it is a sum mar y for as long as the system has been u p.

Sum maries span too long a p eriod of time, and they give you no indication as to the

use of the system du ring that time.

When looking at the outpu t from vmstat (CODE EXAMPLE 1-2), you w ill notice a lotof columns. You shou ld ignore all the fields about disks and device interru pts, as

there are better tools for m onitoring these stats, which w e w ill describe in

subsequ ent sections. In fact, only some of these colum ns (TABLE 3-3) are really useful.

CODE EXAMPLE 1-2 How to Use the vmstat Command

# vmstat 5procs memory page disk faults cpu

r b w swap free re mf pi po fr de sr s0 -- -- -- in sy cs us sy id

0 0 20 1461688 510080 37 185 30 1 2 0 0 1 0 0 0 667 650 292 4 2 94

0 0 64 1468888 197976 8 43 0 0 0 0 0 0 0 0 0 638 571 269 1 1 98

0 0 64 1469320 198528 0 0 1 0 0 0 0 0 0 0 0 642 467 256 0 1 98

TABLE 1-3 Important vmstat Command Output Columns

Column Heading Meaning

r Nu mber of run nable processes (waiting for CPU time)

b Nu mber of blocked processes (waiting for I/ O, paging, and so on)

w Number of runnable but swapped-out processes (normally 0)

re Page reclaims (memory pages taken from other p rocesses)

mf Minor page faults

pi Kilobytes paged in (including process startup and file access)

po Kilobytes paged ou t (should be close to 0)

sr Pages scanned by p age-out scanner (also close to 0)

us Percentage of CPU time spen t in user mod e

sy Percentage of CPU time spent in system m ode

id Percentage of CPU sp ent id le




First, look at t he procs head ings. Norm ally, the r, b, and w columns are fairly low

nu mber s, if not 0. This is because, generally, these column s only become n onzero if a

process is waiting for something, either a CPU (r), I/ O (b), or enough m emory (w).

Large nu mbers in these colum ns are usu ally bad.

One caveat is that you may occasionally see a steady, unchanging num ber in the w

column . This means that th e Solaris software has d ecided these processes have been

idle so long they should be swap ped out to make room for other things. Do not be

concerned a bout this.

The cpu colum ns give you a good system-at-a-glance snapshot of what th e system is

doing, averaged across all processors. In general, non-idle time should be spent in

roug hly a 2-to-1 ratio in usr -to-sys modes. Also, if idle time (id

) is close to zeroconsistently, you probably need some additional CPUs, especially if the r column is

a large num ber. Beyond th is, to get a good view of your CPUs you should u se the

mpstat command , as explained in “mpstat Command” on page 18.

On to memory. First, note that the free colum n shou ld be comp letely ignored, as it

does not in any way correspond to w hat is thought of as free memory. Because of the

wa y the Solaris software ma nages m emory, the free list does not p roperly count

mu ltiple processes sharing the same p ages, or un used pages that h ave yet to be

reclaimed. In ad dition, the file cache grows to consume m ost of free mem ory toimprove performance.

Consequently, the free list tend s to d ecrease steadily over the up time of a system,

when in fact the system is efficiently reclaiming and reusing memory.

If you w ant a better picture of available virtual mem ory, you can u se the swap

command:

If both the free colum n from the first comm and , and the available column fromthe second comma nd are nonzero, the system is all right. Beyond th at, you can

ignore the concept of free memory.

Instead, the most imp ortant colum n of vmstat is the scan rate (sr). This colum n

shows the nu mber of pages scanned in an attemp t to free unu sed mem ory. The

pageout scanner starts run ning only when free memory goes below the kernel

parameter lotsfree, wh ich is a small p ercentage of p hysical m emory. When you

see an increase in the scan rate, you shou ld also see a jum p in the pa ge-outs (po),

indicating that pages are being moved from physical memory to sw ap sp ace. If you

# swap -l

swapfile dev swaplo blocks free

/dev/dsk/c0t1d0s0 227,6 16 4093712 4093712

# swap -s

total: 494360k bytes allocated + 35568k reserved = 529928k used,

25137440k available




see this consistently, it is evidence of a memory shortage—the system needs more

memory. If this only happ ens occasionally, then you should explore w hether better

job sched uling or /etc/system tuning could h elp. If not, you need more m emory.

Note – A high nu mber in the pa ge-ins (pi) colum n is not necessarily significant.

This is because w hen a new process starts, its executable image an d data mu st be

read into memory. Also, file system access appears in the pi column too. A large

num ber in the pi colum n is only relevant if the po column is large too.

Here is an examp le of a system th at is und ergoing heavy paging because it is

reading in a large file.

CODE EXAMPLE 1-3 vmstat 5 Command Outpu t Reading a Large File

# vmstat 5

procs memory page disk faults cpu

r b w swap free re mf pi po fr de sr f0 s0 s6 s7 in sy cs us sy id

0 0 0 2406032 431280 8 72 2 0 0 0 0 0 1 0 1 121 87 202 1 5 94

0 0 24 2489472 643792 0 0 1 0 0 0 0 0 0 0 0 328 86 108 0 2 98

0 0 24 2489472 643784 61 252 483 0 0 0 0 0 5 0 9 466 718 260 3 8 90

0 0 24 2452936 605616 1396 1753 10950 0 0 0 0 0 9 0 77 1266 2363 801 50 45 50 0 24 2383216 531176 1484 1860 11822 0 0 0 0 0 53 0 40 790 1897 357 55 33 12

0 0 24 2309576 458256 1435 1773 11475 0 0 0 0 0 69 0 23 697 1791 247 51 30 19

0 0 24 2236608 391168 1374 1761 11008 0 0 0 0 0 52 0 40 775 1613 235 49 35 17

0 0 24 2165824 324224 1411 1700 11291 0 0 0 0 0 75 0 16 751 1652 239 47 32 21

0 0 24 2097680 253816 1378 1720 11012 0 0 0 0 0 0 0 87 746 1687 246 47 33 20

0 0 24 2028800 184168 1330 2020 10614 0 0 0 0 0 73 0 11 719 1608 239 52 33 16

0 0 24 1948016 110880 1350 1649 10790 0 0 0 0 0 56 0 37 764 1605 246 49 32 19

0 0 24 1886176 48208 1282 1666 10187 8 8 0 13 0 1 0 89 793 1934 312 44 37 19

0 0 24 1835416 7280 688 836 5598 5328 5529 0 6586 0 94 0 47 1088 889 238 24 30 46

0 0 24 1803768 6680 353 675 3052 6657 6808 0 6749 0 80 0 80 1287 478 435 16 26 58

0 1 24 1790704 15856 236 393 832 4470 4481 0 1665 0 68 0 107 2579 792 1416 11 41 48

0 1 24 1784160 18152 35 388 812 3136 3144 0 839 0 60 0 92 1473 724 800 10 29 62

0 1 24 1777192 18488 29 317 988 2536 2540 0 634 0 66 0 97 988 446 422 7 15 78

0 1 24 1770768 18664 20 326 942 2334 2345 0 616 0 77 0 77 953 518 409 7 18 75

procs memory page disk faults cpu

r b w swap free re mf pi po fr de sr f0 s0 s6 s7 in sy cs us sy id

0 0 24 1764704 18528 37 339 820 2636 2648 0 699 0 105 0 48 961 509 343 8 20 71

0 1 24 1757544 18640 30 264 1051 2206 2214 0 602 0 124 0 43 963 331 398 5 15 80

0 1 24 1753544 18248 19 255 1081 2048 2056 0 880 0 97 0 70 960 323 412 5 13 81

0 1 24 1749440 18664 20 258 1046 2048 2062 0 632 0 99 0 63 974 491 443 8 14 77

0 1 24 1744720 18720 17 255 1009 2152 2153 0 552 0 102 0 58 1012 344 449 6 15 790 1 24 1739992 18920 16 256 1008 1974 1982 0 529 0 101 0 55 929 324 379 6 16 78

0 1 24 1735416 18800 16 261 998 2048 2052 0 536 0 107 0 55 966 315 379 5 15 80

0 0 24 1729704 18768 54 268 833 2177 2179 0 546 0 83 0 55 862 352 338 18 13 69

0 1 24 1728480 18816 105 403 1140 1971 1974 0 552 0 110 0 62 1027 569 492 7 11 83

0 1 24 1728600 18888 48 196 1118 1484 1489 0 470 0 110 0 53 1014 261 484 5 10 85

0 1 24 1728496 18832 42 191 1304 1536 1544 0 525 0 123 0 51 1000 160 455 2 8 89

1 0 24 1728344 37712 372 143 946 1176 1178 0 318 0 103 0 43 789 84 335 3 21 76

0 0 24 2489048 652144 3 78 32 0 0 0 0 0 6 0 5 427 310 168 1 4 95

0 0 24 2488840 651872 0 1 11 0 0 0 0 0 1 0 1 351 134 138 0 2 98

0 0 24 2488792 651776 0 0 3 0 0 0 0 0 0 0 0 349 114 128 0 2 98




Notice that for the first half of the outpu t, there is a large nu mber of pi, but no po,

du e to the file system activity of reading the file. As it prog resses, though , notice the

abrupt jump in po as w ell as the sr. Also notice how m uch the pi and u ser time

(us) drop. The system is spending an inordinate ratio of time m anaging m emory,

slowing d own how quickly it can read in the file.

As with a ll stats, brief periods of paging are not imp ortant. The pur pose of having

virtual mem ory is to allow you to tem porarily exceed you r available physical

mem ory. You just w ant to make sure the system is not paging continu ously for

extended periods of time.

By now you shou ld ha ve a rough idea of wh at your system is doing. To really

und erstand wh at is going on, though , you mu st be able to differentiate between file

system pages, executable pages, and so on. To d o this you can use th e vmstat

-p option.

vmstat Command -p Option

Using the vmstat -p option funda mentally changes the type of data reported by the

command. The -p option replaces the columns on processes, CPUs, disks, and

interrup ts with extend ed statistics on mem ory and paging, and d isplays for

executable, anonymous and file system pi, po, and pf.

Examine each of the three types of pages show n by the vmstat -p option:

Und er each page typ e head ing are the following fields, wh ere ? is replaced with th e

first letter of the p age typ e:

TABLE 1-4 Page Types Shown By vmstat -p Option

Page type Meaning

executable Images of executable program s and th eir data

anonymous Used for a p rocess heap sp ace, stack, and private p ages

filesystem Files mapped into address space through the mmap command

TABLE 1-5 Page Stats Shown By vmstat -p Option


?pi Kilobytes paged in

?po Kilobytes paged ou t

?pf Page faults




As with the vmstat output, the key field is still sr, showing th e scan rate. The

benefit you get with -p is that you can now see wh at types of pages need the space,

allowing you to better understand what the system is doing.

Look again at the system that is reading in a large file, only this time with thevmstat -p option.

As you can see, this makes wh at is happ ening to the system m uch clearer. Thesystem starts by paging in t he file very effectively, until it hits th e lotsfree limit and

the page-out scanner starts. At this point, there is a big jump in the sr column. Also

notice the abrupt shift from file system page-ins (fpi) to anonymous pi, po, and

pf. This means that p ages are being taken from other p rocesses to make room for the

file in mem ory. Thu s, if you see a lot of activity in the apo an d sr colum ns, you

need more memory.

While memory analysis can be complicated if you pay attention solely to the sr an d

po colum ns, you should be able to tell if your system needs ad ditional memory.

CODE EXAMPLE 1-4 vmstat -p 5 Command Output Reading a Large File

# vmstat -p 5

memory page executable anonymous filesystem

swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf

2406040 431296 8 72 0 0 0 0 0 0 0 0 0 2 0 0

2489992 630792 0 0 0 0 0 0 0 0 0 0 0 0 0 02480080 620344 785 1021 1 0 0 6 0 0 67 0 0 6174 1 1

2417296 557472 1514 2830 0 0 0 0 0 0 0 0 0 10777 0 0

2349520 493576 1330 2515 0 0 0 0 0 0 0 0 0 9523 0 0

2293456 459296 1295 2684 0 0 0 0 0 0 0 0 0 9088 0 0

2230072 399424 1256 1751 0 0 0 0 0 0 0 0 0 9881 0 0

2164832 334864 1403 1700 0 0 0 0 0 0 0 0 0 11212 0 0

2097432 267288 1415 1716 0 0 0 0 0 0 0 0 0 11212 0 0

2021736 192344 1330 2024 0 0 0 0 0 0 0 0 0 10638 0 0

1947168 122688 1330 1604 0 0 0 0 0 0 0 0 0 10558 0 0

1883288 59216 1324 1658 0 0 0 0 0 0 0 0 0 10568 0 0

1832784 12056 836 863 3936 0 5059 1 0 76 1 3548 3846 6808 4 12

1798648 8656 207 654 6531 0 6519 0 0 72 353 6374 6446 1502 1 12

1787016 17864 49 461 4094 0 927 6 0 6 646 4076 4084 12 3 3

memory page executable anonymous filesystem

swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf

1776040 18448 38 530 3036 0 678 8 0 6 774 3020 3028 3 0 1

1766800 18488 32 319 2592 0 625 4 0 3 952 2585 2585 1 1 3

1761080 18696 32 309 2465 0 549 0 0 1 963 2457 2460 1 0 3

1754696 18600 31 302 2420 0 534 0 0 8 937 2406 2412 1 0 0

1748608 18640 30 308 2488 0 534 3 0 3 945 2483 2484 1 0 0

1742504 18784 23 285 2318 0 508 3 0 6 968 2304 2307 3 1 41736960 18784 21 291 2268 0 491 3 0 8 979 2252 2259 3 1 1

1731008 18584 94 291 2369 0 535 0 0 9 811 2355 2358 3 0 1

1729800 18744 75 214 1697 0 497 0 0 4 1112 1689 1692 1 0 0

1729840 18664 57 202 1601 0 538 0 0 4 1156 1587 1595 0 1 1

1881984 149608 470 122 984 0 366 30 0 6 728 972 976 0 0 1

2490440 672488 0 0 0 0 0 0 0 0 0 0 0 4 0 0

2490744 672672 10 168 0 0 0 8 0 0 6 0 0 16 0 0

2490768 672512 0 2 0 0 0 0 0 0 16 0 0 0 0 0




mpstat Command

The Sun Fire system is designed to be a multiprocessor system, as evidenced by the

fact that you cannot even buy a system with only one CPU. Even thou gh you are

looking at CPUs secondarily, being processor-bound is the least likely candidate for

bad performance. If anything, you are exploring CPUs secondarily so that you can

d ouble-check this assum ption , and ru le it out as a possible factor. CPUs usu ally only

become a factor in heavily loaded systems that are doing lots of interactive or

transactional processing. In m ost other cases, if you buy enough system boards to

hold a ll your m emory, the CPUs that are includ ed are usu ally sufficient.

As mentioned previously, the cpu colum ns of the vmstat outp ut are a good place to

start. Generally, a large percentage of idle time ind icates that you r processing p oweris sufficient. However, measuring idle time across a lot of processors can mask

situations such as one processor getting swa mp ed w ith interrupts w hile the rest do

nothing. So, it is important to look at your CPUs in d etail to make sure you are not

missing an ything.

Like vmstat, just laun ch mpstat with a time interval and let it run:

This command prod uces a lot of column s, only some of w hich you care about:

CODE EXAMPLE 1-5 How to Use the mpstat Command

# mpstat 5CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl

0 372 2 836 447 300 393 47 25 26 1 918 23 10 0 67

1 370 2 622 543 523 301 40 23 35 0 932 24 11 0 65

2 376 2 527 151 100 396 48 25 26 0 926 24 10 0 66

3 372 2 531 151 100 397 48 25 26 0 921 23 10 0 67

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl

0 229 0 546 400 300 458 0 12 13 1 563 2 9 0 89

1 132 0 2018 585 585 111 0 9 16 0 621 4 8 1 88

2 265 0 199 100 100 354 1 9 15 0 770 21 9 1 68

3 363 0 491 101 100 671 1 14 18 0 1339 22 11 0 67CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl

0 155 0 445 400 300 495 0 12 7 0 398 1 6 0 92

1 99 0 145 348 347 134 1 10 10 0 487 13 4 0 83

2 154 0 401 101 100 255 1 8 4 0 723 21 5 0 73

3 307 0 227 100 100 178 0 11 9 0 989 23 8 1 69

TABLE 1-6 Important mpstat Command Output Columns


xcal Interprocessor cross-calls

intr Interrupts

csw Context sw itches

icsw Involuntary context switches




A cross-call (xcal

) is a call used by a processor to tell other processors to dosometh ing. Cross-calls are used for a variety of things, such as d elivering a signal to

another processor or ensuring virtual m emory consistency. This latter use is very

common, as it happens during file system activity. Heavy file system activity (such

as N FS) can result in a lot of cross-calls. Also, it is not u nu sual for th e boot proc to

show thousands of xcals, as it maintains lots of information about the others.

An interrupt (intr) is the m echanism th at a d evice uses to signal to the kernel that

it needs attention, and some imm ediate processing is required on its behalf. I/ O is

the m ajor contributor of interrup ts, although there are also “special” interrup ts suchas the system-wide clock thread that occurs regularly. Interrupts, u nlike everything

else, are not d istributed across all CPUs. Instead , the Solaris OE binds each sou rce of

interrupts to a specific CPU.

The term context switch (csw) refers to the p rocess of moving a thread on and off a

CPU. Context sw itches are a norma l but som ewh at expensive occurrence because

switching context involves certain overh ead , such as pop ulating th e stack. Norm ally,

a context switch occurs w hen a process is done with th e CPU and another p rocess is

given a chance to run. Thus, a steady number of context switches is insignificant.

Involuntary context switches (icsw), on the other hand , are mu ch less favorable.

When a process is given access to the CPU, it is has a limited t ime w ind ow in w hich

to run, dep ending on h ow m any other p rocesses are runn ing, what their priority is,

and so on. This is the nature of scheduling. An involuntary context switch m eans

that th e process was forcibly stopped by the schedu ler before it wa s finished; the

time allotted was too short for the p rocess to finish in, or a h igher-priority thread

preemp ted it. A few of these is nothing to be concerned abou t, but getting a largenum ber of these regularly indicates that the system d oes not have enough

processing pow er to hand le all of the things that need to run . You n eed ad ditional

CPUs.

Finally, a spin on a m utex lock (smtx) happen s when a thread cannot access a

section of the kernel that it needs on the first try. The term mutex is short for a

mu tu al exclusion lock, and is used in mu ltithread ed op erating system s like the Solaris

OE to allow mu ltiple threads to ru n concurrently in system m ode. When a thread

enters system mod e, it locks the par t of the kernel it is using by acquiring the m utex

smtx Spins on mu tex locks

usr Percent u ser time

sys Percent system tim e

wt Percent w ait time

idl Percent id le time

TABLE 1-6 Important mpstat Command Output Columns (Continued)







One n ice thing is that the solution to all of these problems is the same. The system

needs m ore and/ or faster CPUs. Once again though, the imp ortance of having

enough memory is emphasized here. When you add a CPU, you incur additional

overhead in the form of more kernel space needed to m anage that CPU, and spa ce

for that CPU to d o its own work. Therefore, the rule of thumb is:

Whenever you add additional CPUs, you should also add memory.

Doing so will help p revent accidental mem ory shortages, which can actually m ake

your system run slower as you add more CPUs.

iostat Command

Proper I/ O layout is complicated; it is almost never done right the first time. Part of

the reason for this is that usage p atterns and requirements change over time. Also,

wh ere you add m emory and CPUs is somew hat predeterm ined. Where you add d isk

dev ices and controller cards, thoug h, has a big imp act on the system . Therefore, it is

importan t to make su re that the I/ O layout is flexible enough to h and le futu re

changes and expansion.

Highsmtx

High sys or idl Contention for kernel system resources exists.

High icsw Contention for basic CPU resources exists.

High csw or xcal

High sys

Low usr

If this happen s consistently, you m ay require m ore CPUs,

dep ending on your ap plications. If you are not noticing any

slowness in ap plications or system problems, how ever, ignore it.

High sys

Low usr

All other stats low

Your system is spend ing too m uch time man aging resources. Check

vmstat first.

TABLE 1-7 Analyzing the mpstat Command Output (Continued)

If you see this... It probably means...




Fortunately, I/ O analysis is very straightforward. There is only one version of the

iostat comman d to run, iostat -zxcn.

For this version of the iostat command , the outpu t shows extended statistics for

only those disk devices with nonzero activity, by physical device path instead of the

logical kernel disk name (that is, c0t0d0 instead of sd0). If you are using

individual d isk partitions, you m ay also wan t to use the -p option. How ever, most

production environments manage their disks with some type of volume manager

package, so in practice this option is not that useful.

CODE EXAMPLE 1-6 How to Use the iostat Command

# iostat -zxcn 5<summary omitted>

cpu

us sy wt id

0 1 5 93

extended device statistics

r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device

0.0 0.2 0.0 1.6 0.0 0.0 0.0 8.2 0 0 c0t0d0

0.0 0.2 0.0 1.6 0.0 0.0 0.0 10.1 0 0 c5t0d0

0.0 34.6 0.0 2201.0 0.0 1.0 0.0 27.9 0 97 c12t1d0

0.0 0.2 0.0 1.6 0.0 0.0 0.0 12.2 0 0 c20t122d00.0 0.2 0.0 1.6 0.0 0.0 0.0 14.1 0 0 c20t98d0

0.0 14.2 0.0 113.6 0.0 0.1 0.0 5.5 0 8 c20t101d0

0.0 0.2 0.0 0.5 0.0 0.0 0.0 30.1 0 1 c10t1d0

0.0 58.4 0.0 135.4 0.0 0.3 0.0 5.6 0 30 c2t17d0

1.0 12.8 8.0 135.0 0.0 0.2 0.0 11.6 0 13 c2t16d0

0.0 3.4 0.0 19.2 0.0 0.1 0.0 17.9 0 4 c2t9d0

0.0 0.4 0.0 0.8 0.0 0.0 0.0 5.3 0 0 c2t21d0

0.0 1.8 0.0 1.4 0.0 0.0 0.0 4.2 0 1 c27t42d0

0.4 9.2 3.2 155.9 0.0 0.1 0.0 10.9 0 8 c28t69d0

0.0 9.0 0.0 157.5 0.0 0.1 0.0 9.0 0 6 c28t68d0

0.0 1.8 0.0 1.4 0.0 0.0 0.0 4.8 0 1 c29t1d0

0.0 9.0 0.0 157.5 0.0 0.1 0.0 9.5 0 7 c30t35d00.0 0.4 0.0 0.8 0.0 0.0 0.0 5.1 0 0 c30t52d0

0.0 9.2 0.0 155.9 0.0 0.1 0.0 10.3 0 7 c30t36d0

0.0 58.4 0.0 135.4 0.0 0.4 0.0 6.5 0 35 c31t66d0

0.4 12.8 3.2 135.0 0.0 0.2 0.0 12.1 0 12 c31t64d0

0.0 3.4 0.0 19.2 0.0 0.1 0.0 17.7 0 4 c31t90d0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 tomax:/export/mirrors/

pkg.eng/export/pkg

0.0 0.2 0.0 0.4 0.0 0.0 0.1 1.0 0 0 twinsun-n1:/export/workspace/

d0/nwiger




As with the other stat command s, there are only a few colum ns you care about

(TABLE 3-8).

You can ignore two comm only used colum ns, %w an d %b, which are sup posedly the

percentage of time sp ent w aiting an d b usy, respectively. Because of the comp lexity of

mod ern d isks and controllers, these calculations are very inaccurate. Often the two

will total more th an 100 percent, w hich shou ld be imp ossible. Besides, these colum ns

do n ot tell you anyth ing that you cannot find ou t by looking at wsvc_t or asvc_t.

Analogous to the mpstat command, when looking at iostat you should always

watch th e first tw o colum ns listed (kr/s an d kw/s) to see how m uch activity the

disks are undergoing. Then, basically, the last three columns should be as close to

zero as possible. This indicates that the system has very fast disks, and that the I/ O

is laid out correctly to avoid controller bottlenecks.1

In practice, asvc_t will be nonzero for any disks undergoing activity, since it

always takes some am ount of time for a d isk to fulfill a request. As with any stat,

you will only be able to tell if the system is particularly busy after establishing a

baseline. However, several facts are true:

1. Service times across equally active disks should be fairly even.

2. You should not see huge peaks and valleys und er norm al conditions.

3. You sh ould rarely, if ever, see a non zero nu mber in wait or wsvc_t.

You may, occasionally, see a tem por ary jum p in service times (asvc_t) even thoughthere is nothing app arently going on (that is, kr/s an d kw/s are a lmost 0). This is

due to a somewhat strange behavior of fsflush, the d aemon responsible for

flush ing d isk buffers. Periodically, it will generate a lon g, rand om series of writes in

a short time period. This results in a qu eue forming, which bum ps u p th e service

time, even th ough there is no real appa rent activity on th e disk. If you see this,

ignore it.

TABLE 1-8 Important iostat Command Columns


kr/s Kilobytes read per second

kw/s Kilobytes written p er second

wait Nu mber of tran sactions w aiting for service

wsvc_t Average service time in w ait queu e, in millisecond s

asvc_t Average service time for active transactions, in milliseconds

1. Withou t the -n option,wsvc_t an d asvc_t are combined in to a single svc_t column.








Despite its limitations, you can tell several things from the netstat command

outp ut. Unlike the other stats, you m ust run the netstat command separately for

each interface you have configured by specifying the -I option along with the

interface n ame.

You can tell two th ings from this d isplay:

1. Total number of packets received (inpu t) and transm itted (outp ut) du ring that

interva l, both for that int erface (left set of colum ns) and for all interfaces (right set

of column s). This is not an average p er second, but a total count.

2. Nu mber of errors and collisions, which should alw ays be low or zero.

Netw ork capacity is very d ifficult to gauge with this limited informa tion. Without

the sizes of each p acket, it is imp ossible to know if you are anyw here near the

through pu t limits for the interface you are analyzing. Given this information, if the

netw ork seems slow, and you are seeing thou sands and thousand s of packets each

second, try add ing another network interface card to see if it helps. If not, you

should examine your n etwork as a w hole to see if you have more w idespread issues.

Man y ava ilable freeware tools, such as th e SE Toolkit and Multi Router Traffic

Graph er (MRTG), provide better netw ork an alysis than netstat. You can u se tools

such as these to more properly gauge the band wid th being used by each interface.

MRTG is especially useful, as it graphs utilization over time so you can easily see

wh en your n etwork interfaces are getting busy, as well as how mu ch bandw idth

they are pushing.

Analysis Reveals...

By this point, you shou ld have a good idea abou t wh ere the system is w eak. Make

sure you hav e good notes, as you need th is information in the next chap ter wh en

you design your new system.

Giving p erformance tuning a full treatment is beyond the scope of this book. True

performan ce tuning gets exponentially h arder; it is much more d ifficult to get thelast 10 percent ou t of a system th an th e first 90 percent. If you are interested in high-

CODE EXAMPLE 1-8 How to Use the netstat Command

# netstat -I ge0 5input hme0 output input (Total) output

packets errs packets errs colls packets errs packets errs colls

909076714 0 837319344 0 0 918674892 0 846917522 0 0

667 0 681 0 0 673 0 687 0 0

426 0 402 0 0 428 0 404 0 0

1886 0 3684 0 0 1886 0 3684 0 0

1878 0 3117 0 0 1882 0 3121 0 0411 0 391 0 0 411 0 391 0 0



Designing for RAS 27

end p erformance tuning, read Sun Performance and Tuning—Java and the Internet, 2nd

Edition by Adrian Cockcroft and Richard Pettit (ISBN 0-13-095249-4) and

“Application Performance Optimization” by Börje Lindh—Sun Microsystems AB,

Sweden Sun BluePrints™ OnLine—March 2002.

Designing for RAS

This is the final step in the design process. By now, you should have a fairly clear

un derstand ing of wh at your requ irements are, as well as any possible problems w ithyour existing system. Up un til now, this book focused mainly on performan ce

because you shou ld make sure any solution you develop can meet your fund amental

application requirements. However, properly designing for RAS is just as important,

and requires some thought.

Always keep th ree principles in mind wh en d esigning for RAS:

s The more RAS you w ant, the more hardware you must add to the system.

s RAS is not just a fun ction of the Sun Fire server, but of you r entire site.s Maximizing RAS can decrease performance.

The first point is almost always overlooked. As an example, to effectively use DR,

you should add boards in your design beyond th ose required for your app lications.

Why? Because otherwise, when the system d ynam ically reconfigures a board out of

the system, it will not have enough resources to ru n you r ap plications. The system

could start paging, or the CPUs could get too busy hand ling I/ O interrupts to d o

any real work.The requirements you have formed up to this point are the minimum you

need for your system.

As for the second p oint, pu rchasing redu nd ant p ower su pp lies does not benefit you

if your site has only a single pow er grid with no UPS system. RAS is a fun ction of

your entire site, not just one server in isolation. As w ith performa nce, getting that

final 10 percent of reliability out of a site gets exp onen tially more d ifficult—and

costly. Therefore, you should be realistic about both your requirements and

expectations—and your ability to fund them.

Third, taking advantage of certain RAS features and methodologies can decrease the

performance of your system. For example, if you mirror file systems, for each write

the system must now perform tw o writes, one to each half of the mirror. Some of

these effects can be m itigated, for instance by placing the tw o halves of the mirror on

different I/ O controllers.1 How ever, such p erformance hits can ad d u p, so it is

importan t to realize it is imp ossible to maximize both RAS and p erformance.

1. In fact, man y volum e manag ers will "round rob in" between th e two halves of a mirror on read s, actuallyincreasing your read p erform ance over a single disk.




Uptime Requirements

You w ere first asked to consider your u ptime requ irements in Chapter 2, “What are

the u ptime requ irements of the system?” To help an swer th is question, you can

consider the following:

s How much time do you have available for planned maintenance?

s How long can you afford to be offline during an unplanned downtime?

There are two types of dow ntime—planned and unp lanned. Planned down time

includes hardw are and software up grades, whereas unp lanned dow ntime includes

system crashes and emergency reboots. All comp uter systems have some am ount of

dow ntime; the goal of a good server d esign is to minimize the impact this down timehas on your organ ization.

For some organizations, schedu led m aintenance is not an issue; the systems u nd ergo

heavy usage d uring the d ay from employees, so taking the m achine down a fter-

hour s is a viable solution. Other organizations, however, serve a w orldw ide

aud ience and can afford little scheduled m aintenance du e to time zone d ifferences.

Also, it is not un common to have a m ix of different requ irements for different

systems at a single site. One thing that every organization has in common, though, is

the desire to minimize unplanned downtime as much as possible.

There is no reason to d ifferentiate between the tw o types of dow ntime, other than to

help you come to a conclusion regard ing your overall requirements. When you h ave

a good idea of the uptim e required for this system, TABLE 1-9 will help you d etermine

wh at your design should include to ensu re that its RAS properties meet your

requirements.

http://c2_3.24.02.pdf/

http://c2_3.24.02.pdf/




Note – You shou ld always p urchase redu nd ant SCs for a system to ensu re

availability in the event of a System Controller board failure. Without a functioning

System Controller board , none of the dom ains in a system w ill work.

Note – Even though you can use DR to replace failed components, a critical

component failure on a run ning system (such as a failed CPU) w ill still cause the

system to crash. If you cann ot afford th is type of d own time, you fit in the almost none

category, and should use a clustering prod uct to guard against system failures.

For most organ izations, the little downtime category is a good cost/ benefit tradeoff.

You w ill hav e a system th at is resilient to failures and , if prop erly configured,relatively easy to service. You can u se DR to add m ore CPU/ Memory board s for

increased capacity, or to replace failed components.

Make a note of w hat category your system fits into, as well as the ad ditional

components you will need. You are going to u se this in the next chapter to d esign

your system. You w ill also use it later in the book du ring th e discussion on

configuring the system to integrate w ith your site.

TABLE 1-9 RAS Design Decision Table

Allowable

downtime Your design should include...

Some Red undant fan trays

Redundant power supplies and transfer switches1

1. Remem ber, redund ant pow er helps only if your site is equipped to sup ply it.

Little Red un dan t CPU/ Mem ory board s

DR for CPU/ Memory board s

Volum e man agem ent softw are (such as Solaris™ Volum e Manag er (SVM)

or VERITAS Volume Manager (VxVM)

Very li tt le Redundant paths to I/ O devices

Multipathing software for I/ O (such as Multipath I/ 0 (MPxIO) or

VERITAS Dynam ic Multip athin g (VxDMP)

Redundant network connectionsMultipathing software for netw orks—such as Internet p rotocol

multipathing (IPMP)

DR for I/ O devices and networks

Almost none Multiple instances of fully redund ant systems

Clustering software (such as Sun™ Cluster 3.0)




Finally, some closing words on RAS. It is very important that you do not sacrifice

parts of your required configuration for add itional RAS features. For examp le, do

not d ecide to buy less mem ory so tha t you can afford add itional fan trays. You

should ensure that your base requ irements are met, or else you w ill not benefit from

add itional RAS because you r system will have fund amental shortcomings.Disk Redundancy and RAID Basics

To ensure the integrity of the data, some type of disk redu nd ancy should be u sed on

any system with importan t local data storage. The d ifferent schemes for achieving

such redun dan cy are often denoted by their RAID level. The term RAID comes from

Redu nd ant Arra y of Inexpensive Disks, and there are num bers from 0 all the way u p

through 53 denoting d ifferent w ays of laying ou t sets of disks.

For most app lications, how ever, only th ree RAID levels are useful: 0, 1, and 5. Each

of these allow you to combine multiple physical disks into a single logical volume.

The operating system then sees this volume just like a norm al disk, and it can be

mounted an d u sed in the regular manner.

RAID 0

RAID 0, comm only called striping, provides n o ad ditional data safety. Instead, it isd esigned to increase the sp eed of file system access. With striping , disks in a volum e

are interleaved at a certain data interval, called the stripe unit size. This means that

wh en read ing or w riting d ata, multiple disks are accessed in parallel, decreasing the

amou nt of time it takes to access the data. Striping is very common on any system

that n eeds fast data access, such as d atabase servers.

RAID 1

RAID 1, also referred to as mirroring, is just the reverse. It provides full data

redu nd ancy, but w ith some p erformance costs. In mirroring, twice the num ber of

disks are used for the d ata that n eeds to be stored. These disks are then arran ged in

pairs, and identical data is stored on both disks. On a file system write, two physical

wr ites must be performed , one to each d isk of the pair. The advantage is you now

have tw o complete copies of your d ata.

This means you can lose half of your d isks and still continue ru nning w ithout d ata

loss. In a large volume, this is obviously an advan tage.

RAID 0+1

RAID 0+1, usually called striping and mirroring, is a combination of these tw o

techniques. In a striped/ m irrored volum e, a set of disks is striped together to form

each half. Then, these tw o halves are mirrored to on e anoth er. It is possible to design




a striped/ mirrored volum e so that the performance is better than the individu al

disks (due to striping), and th at fully half the d isks can fail without imp acting the

volume (du e to mirroring). This technique is w idely-used in prod uction systems.

RAID 1+0

RAID 1+0 is very similar to RAID 0+1, except the volumes are assembled in the

reverse order. Here, pairs of disks are mirrored to on e another, and then these

mirrored p airs are striped togeth er. Volum es created in this man ner are slightly more

complicated to ma nage, but are slightly more reliable because of the w ays in w hichdisks typically fail. Generally, vendors decide to implement either RAID 0+1 or

RAID 1+0, but not both, so the choice of which to use is often made for you.

RAID 5

Finally, RAID 5 is one of the m ost econom ical forms of redu nd ancy. In this schem e, a

portion of each disk in a volum e is used to hold p arity. On a w rite, data isdistributed across all the d isks in the volume except on e, with the p arity being

written to the rem aining d isk. This process is repeated in a "roun d robin " fashion, so

that each w rite places the p arity for that wr ite on a d ifferent d isk. In the event of a

single disk failure, the par ity is used to recreate data th at was on the failed disk. This

allows you to lose a single disk (the most common typ e of failure) and continue

run ning w ithout interru ption. RAID 5 is somewh at slow, though , since it must

perform all those add itional w rites for the parity.

While RAID 5 is not a s reliable as RAID 0+1 (striping and mirroring), it can still be agood solution, especially for NFS servers. While you can only lose one disk, it is

un comm on to lose a whole enclosure barring hu man error or a pow er failure, both

of which will probably affect much more than your disks. To make use of RAID 5,

you shou ld consider on ly those enclosures that su pp ort hard ware RAID, since

otherwise it is too slow for many app lications.

Once you h ave selected wh at type of RAID you w ish to use for each of your

different volum es, you shou ld a djust your storage pu rchase accordingly. For

example, if you w ant to mirror a set of data, you m ust pu rchase dou ble the amou nt

of disk you calculated above. You w ill need to make sure to increase you r controller

cards as w ell.

With RAID 5, check the enclosure you are considering purchasing to verify that it

supports hard ware RAID.




A Logical Design Specification

By now you sh ould hav e available all of the information you need to create a logical

d esign sp ecification:

s Design ru les of thumb

s Your existing sy stem ana lysis (if ap plicable)

s Your RAS design requ iremen ts

As you did in the Statement of Requirements Worksheet in Chapter 2, formalize this

informa tion into a d esign for you r logical system that will serve as an accurate

picture of your needs, so you can use it in the next chapter to choose the appropriate

ph ysical system. List you r sp ecifications in TABLE 1-10.

TABLE 1-10 Logical Design Specification Worksh eet

Item Description

http://c2_3.24.02.pdf/

http://c2_3.24.02.pdf/






Sunfire Design and Configuation

Documents