Sun Fire™ Systems Design and Configuration Guide
Nathan Wiger
Roger Blythe
Part No. 816-7882-10
September 2002, Revision 04
Sun Microsystems, Inc.
4150 Network Circle
Santa Clara, CA 95054 U.S.A.
650-960-1300
Send comments about this document to: [email protected]
Copyright 2002 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, CA 94303-4900 U.S.A. All rights reserved.
This product or document is distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.
Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd.
Sun, Sun Microsystems, the Sun logo, AnswerBook, AnswerBook2, docs.sun.com, Solaris, Sun Management Center, Sun BluePrints, Sun Quad FastEthernet, Sun StorEdge, OpenBoot, Sun Enterprise, Sun Fireplane, and Sun Fire are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
ORACLE is a registered trademark of Oracle Corporation. Netscape is a trademark or registered trademark of Netscape Communications Corporation in the United States and other countries. Legato NetWorker is a registered trademark of Legato Systems, Inc. Adobe is a registered trademark of Adobe Systems, Incorporated.
The OPEN LOOK and Sun Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.
RESTRICTED RIGHTS: Use, duplication, or disclosure by the U.S. Government is subject to restrictions of FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a).
DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
better illustrate the different role each major component plays. In this analogy we follow a receptionist answering various types of incoming calls to show how a computer manages the requests it receives.
Every computer system has three main components that can be configured:
1. I/O devices
2. CPUs
3. Memory
Of course, a Sun Fire system has many other components too, including repeater boards, the Fireplane, and so on. However, in the Sun Fire system (as with most computer systems), these are part of the fundamental architecture of the machine and cannot be configured by the customer. This means that to design your system, you should pay close attention to the decisions you make regarding CPUs, memory, and I/O, because these decisions will directly affect the effectiveness of your design.
Notice the use of the term CPUs. Because the Sun Fire system board is sold with a minimum of two processors, it is not possible to buy a single-CPU Sun Fire system. All Sun Fires are multiprocessor systems.
I/O Devices
The Sun Fire system uses the PCI bus for all I/O. The I/O is what allows you to do anything productive with the system. Without I/O, you would have no keyboard, no network connection, no disks, and so forth.
Understanding the impact I/O has on the system is important. When something has to be done with I/O, an interrupt is generated. The CPUs must handle this interrupt. Frequently, I/O is the single biggest resource sink on a system. This is especially true when you have multiple types of I/O running heavy loads concurrently, which generates a large number of interrupt requests.
For example, consider a backend database server that is front-ended by a dozen or more concurrent web servers. When a web server needs some dynamic data, it has to make a request via the network to the server, which then must do the appropriate database selects and retrieve the data from its local disk, finally shuffling the reply back across the network to the web server that requested it. This can result in a number of I/O interrupts, as the system must handle all of the network packets as well as all of the disk seeks to get the database information off disk.
When you multiply one request times a dozen or more web servers, each request times a dozen or more clients, you can see that the database server could easily become swamped with I/O interrupts, which crowds out the computing power needed to run the operating system, manage memory, and run the database itself.
To tie everything together, think of I/O as each individual phone call received by a receptionist. Each phone call generates an interrupt that the receptionist must handle. Depending on the request, it may result in a lot of data transfer (talking) back to the caller. More calls generate more interrupts. Eventually, the phone system (server) hits a limit either in the number of concurrent requests that it can handle (memory), the speed with which the requests can be fulfilled (by the CPUs), or how fast the caller and the CPUs can communicate (I/O speed).
CPUs
The CPU is actually responsible for much more than computation. Anything that puts a load on the system, including databases, web servers, email, NFS, NTP, and general network and user traffic, requires a lot of CPU power. The CPU does not do as much thinking as it does handling. Any time the system must do anything, it must ask the CPU, which has to prioritize the task, schedule it, and allocate resources for it, and do so in a way that allows the multitude of other things going on to continue running too.
In this way, the CPU can be thought of as a busy receptionist. The receptionist has a number of standard routines. These may include forwarding calls to employees, taking messages, setting up appointments, and even providing direct responses to simple requests such as "What is your address?" When an incoming phone call is received, the receptionist executes the proper routine and completes the request if possible. If the request cannot be fulfilled in a reasonable amount of time, the receptionist may have to place the caller on hold temporarily to handle some other tasks and free up some time.
In some cases, the receptionist may receive a request that is too complex to be handled by standard routines. For example, the receptionist may receive a call that the boss is running late, and that several meetings need to be rescheduled. Here, the receptionist must do some thinking to determine which meetings can be moved to when. At the same time, the receptionist must still pay attention to other incoming calls, to ensure an important request is not missed.
If things get too busy for one receptionist to handle, you may need two or more receptionists. Some callers may even get frustrated and hang up. Even for those that do get through, there will likely not be enough time to properly answer their queries.
So, it is important to consider not only the difficulty of each request, but the volume too. In our analogy, each incoming request requires a certain baseline of time to handle properly. Typically, the receptionist will have to press a button to pick up the appropriate line, answer the call with a greeting, listen to and analyze the request, then prioritize it and complete it appropriately. Even if a request consists of nothing more than "Is Mr. Johnson in?", it still takes a certain amount of time to fulfill the request.
Memory
In the Sun Fire system, the system memory is dynamic random access memory (DRAM). The system uses memory to store things that it is using actively, such as the operating system, programs, and their data.
When asked to execute a program, the system must allocate space in memory to hold an image of the program and its associated data. This space can grow or shrink as the program runs, since its resource requirements may change. In reality, most applications grow over time because they do a poor job of cleaning up after themselves.
When a system is under a very heavy load, it may run out of room in memory to hold all the information it needs. In this case, it uses predetermined disk space, known as swap space, to temporarily store lesser-used things from memory to make room for other things. This is known as paging, since it involves selectively moving specific data out of memory in sections known as pages. When those pages are needed, the system incurs a page fault, and the data is moved from disk back into memory.
In extreme situations, the system may undergo swapping. In this case, memory images of entire programs are moved from memory out to disk. This is a significant performance hit, and if the system starts swapping, some serious problems may occur. Unfortunately, the terms paging and swapping are often used interchangeably, perhaps because the disk storage is called "swap space," but they are really very different.
Do not undervalue how important memory is to a running system. Not having enough memory is perhaps the single greatest cause of performance problems.
With the receptionist analogy, you can think of memory as the number of incoming phone lines available. Even if you have five receptionists (CPUs), it will not help the situation if you only have four phone lines (memory). The phone system will still be slow, since you have a bottleneck in the number of requests you can handle concurrently. To accept another call, the current caller will have to be placed on hold (page-out) in order to get back to the first caller (page-in).
If the load gets too heavy for the phone system, and no more lines can be put on hold, calls will have to be disconnected (swap-out) to make room for others. The receptionists will then have to call the person back (swap-in), a much more time-consuming process.
Design Rules of Thumb
You can use a number of rules of thumb to design a system. Properly using these rules requires a firm grasp on your needs: that you have completed a statement of your requirements using the information and tables in Chapter 1 and Chapter 2.
This section describes the following design rules of thumb:
• Spread your I/O devices across as many PCI buses as possible.
• Decide how many CPUs the system needs.
• Decide how much memory the system needs.
• A well-designed system should seldom page, and never swap.
• The system should always have some idle time.
• Whenever you add additional CPUs, you should also add memory.
I/O Devices
You should always determine your I/O design first, as this along with your application needs determines your computing requirements. To get the best performance and reliability from your Sun Fire server, you should lay out the I/O carefully. An easy rule of thumb is:
Spread your I/O devices across as many PCI buses as possible.
Doing so distributes your I/O load across as many different controllers as possible, thus improving performance. In addition, you reduce the number of single points of failure that could cause your data to go offline. Unfortunately, this rule of thumb has many caveats. Unlike CPUs and memory, the layout of your I/O intimately affects the reliability of your machine, and whether or not you can use features such as dynamic reconfiguration (DR). Chapter 4 discusses the issue of I/O design in detail, taking all of these factors into consideration.
CPUs
Regardless of what your tasks are (NFS service, CAD simulations, or compiling software builds), handling each request requires a certain baseline of time, as the receptionist example shows. Not only is the type of request important, but the quantity of requests is important too. In fact, it is often harder for a system to handle 100 small requests than 10 large ones, due to the inherent overhead of handling each request.
How many CPUs are enough? The rules of thumb you can use to help you determine how many CPUs you need are:
• One-half CPU per network card
• One-eighth CPU per I/O device (disk or tape)
• Two CPUs per application for mostly I/O-based applications (NFS, web servers, and so forth)
• Four or more CPUs per application for mostly CPU-based applications (simulations, databases, and so forth)
These figures assume a moderate load on your system. If you are expecting a high load on certain aspects, you should double the corresponding numbers. For example, if a system is going to have a lot of network traffic, you should have one CPU for every network card to handle the interrupts. Conversely, if you are designing a system you expect to have a very light load, cut the numbers in half, or consider whether the tasks that system is going to be performing could be combined with another server to lower overhead.
To get an idea of how many CPUs you need, add up each of the criteria that affect you, then round up to the nearest multiple of two. We recommend that you buy only the four-CPU boards for your Sun Fire system. Purchasing a two-CPU board limits your future expansion room, since it takes up the same amount of space as a four-CPU board. However, there are merits to the two-CPU board if you do not need expansion room, and the examples later in the book demonstrate a good use for it.
So, if you are designing an NFS server with a gigabit network card and six Sun StorEdge T3 arrays, you have the following CPU requirements (TABLE 3-1). Rounding up, you should buy a four-CPU board to run this system.
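TABLE 3-1 itself is not reproduced here, but the arithmetic it summarizes can be sketched from the rules above. This is an illustrative calculation only; the 0.5, 0.125, and 2 figures are the rules of thumb from this section, and the rounding helper is my own:

```shell
# Worked example: CPU sizing for the NFS server above
# (1 gigabit network card, 6 StorEdge T3 arrays, NFS = I/O-based app).
awk 'BEGIN {
    nics = 1; devices = 6
    raw = nics * 0.5 + devices * 0.125 + 2   # rules of thumb from the text
    buy = int(raw / 2) * 2
    if (buy < raw) buy += 2                  # round UP to a multiple of two
    printf "raw CPUs = %.2f, CPUs to buy = %d\n", raw, buy
}'
```

This yields 3.25 raw CPUs, which rounds up to the four-CPU board recommended above.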
Note – These rules work well for average systems. However, for high-intensity applications such as online transaction processing (OLTP), data mining, and so forth, you should research your needs more carefully. For details, see "Analyzing an Existing System."
When buying CPUs, you should make sure you have enough memory to accommodate them, or else you run the risk of thrashing. This means that the system spends all its time moving things around in memory, and never does any real work. This is like the receptionist who spends time picking up phone lines and saying "Please hold," without actually fulfilling any requests. "Memory" discusses this in detail.
Finally, in terms of speed, getting the fastest processor you can buy is always an advantage. In addition to the speed of the processor, you also should consider the size of its cache. Generally this is decided for you based on the processor model, but you want to make sure to get as large a cache as possible. The cache determines how many operations can be handled at one time by the processor without having to make a trip back out to system memory. Processor cache is several orders of magnitude faster than memory, so a large cache is always beneficial.
Memory
Memory is perhaps the single most important part of a computer system, and has the most direct impact on performance. The more memory you have, the more things you can do, and the faster you can do them, since less disk access is needed. There is usually a greater correlation between perceived performance and memory than processors. With relational databases, for instance, being able to fit as much of the database in memory as possible can yield a big improvement in performance.
If your system is running slowly, you should probably buy more memory, not more processors. It is more likely that your system is running out of memory, not processor cycles, and is having to use swap space to run your applications.
It is possible to waste money and overbuy memory as well, though, so here are some specific rules. On the Sun Fire system, memory is tied to a processor (see TABLE 1-5 in Chapter 1). So, you cannot buy a board with just memory and no CPUs. This actually simplifies the design process considerably because there are only two decisions to make:
• Whether to half-populate or fully populate each CPU board
• Whether to buy larger or smaller DIMMs
Fully populating a CPU board allows you to put more memory on it. In addition, though, you get better interleaving, which increases performance. Thus, the rules of thumb for memory are:
• For I/O-based applications, half-populate the CPU/Memory board.
• For CPU-based applications, fully populate the CPU/Memory board.
Then, choose the appropriate DIMM size to provide enough memory for your application. Following these rules will naturally lead to smaller memory sizes in NFS servers (where memory is used basically for the file buffer cache alone), and larger, faster memory configurations in database and compute servers. Most systems tend to be in one category or the other, but if there is a mix, fully populate all boards.
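These two decisions reduce to a simple lookup, sketched below. The category names (io, cpu, mixed) are my own shorthand for the application classes above, not terminology from this guide:

```shell
#!/bin/sh
# Sketch: map an application category to the board-population rule above.
# Category names are illustrative, not official.
category=${1:-io}
case "$category" in
    io)    echo "half-populate the CPU/Memory board" ;;
    cpu)   echo "fully populate the CPU/Memory board" ;;
    mixed) echo "fully populate all boards" ;;
    *)     echo "unknown category: $category" >&2 ;;
esac
```

Running it with no argument prints the I/O-based rule; pass `cpu` or `mixed` for the others.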
Remember that, as discussed previously, paging is undesirable. So, another good rule of thumb is:
A well-designed system should seldom page, and never swap.
It is possible, in fact, to run a large-memory system with very little (if any) swap space. This advice differs somewhat from other commonly available information. One commonly used phrase is "Your swap space should be double the size of your physical memory." Consider this for a moment. You can easily design a Sun Fire system that has 64 gigabytes of memory. If you were to follow this advice, you would have to have 128 gigabytes of swap space. While a few vendors may require you to have a large swap space, you should not rely on swap for real-world memory usage, as it is too slow. When designing a system, make sure that you purchase enough memory so that your system does not swap. If it does, you need more memory.
Analyzing an Existing System
Often, the purpose of designing a new system is to replace an existing system in your infrastructure. If so, you can benefit from analyzing your existing system, because this analysis will give you a better idea of what problems you are facing. This analysis is also useful if you are trying to upgrade a Sun Fire server. A proper analysis will ensure that you are upgrading the right parts of the system to address the issues.
Before you go any further, you should revisit your design goals discussed and developed in Chapter 2. Doing so will help you properly formulate your statement of the performance problems you are encountering. A good problem statement is:
When many users are logged in, NFS performance is very slow.
A bad problem statement is:
There is a large number in the w column of the vmstat output.
Always start with the perceived problems and requirements. An improvement in these areas is the only way you can tell if your design is a success. You can only make use of statistics if you know what you are looking for.
The easiest way to analyze a system is by using the stat commands that ship with the Solaris OE, which can be used to monitor the performance of a running system. You can get a full list of the available commands by typing the following command at a shell prompt:
# ls /usr/bin/*stat
This command will display a series of commands, such as vmstat, iostat, netstat, and so on.
You should never use the uptime command to analyze a system. You can use it to show how long your system has been up, but the notion of a load is very outdated and fairly useless in the Solaris OE. Most notably, load varies widely from system to system; a load of 10 may indicate a lack of activity on one machine, but extreme activity on another. We recommend you get in the habit of using vmstat 5 instead of uptime when a machine seems sluggish.
Some stat commands are more useful than others, so the following sections focus on the useful commands (TABLE 3-2).
TABLE 3-2 Useful Stat Commands
Command            Description
/usr/bin/vmstat    Virtual memory/paging statistics with CPU/process summaries
Collecting and understanding the output from these commands should give you a good idea of what problems your current system is having, and how to improve upon these problem areas in the design of your new server.
The following sections review each command in turn, along with how to properly use each one, so you can gather the best statistics possible. It is important to note that not all options of a given stat command produce useful (or even trustworthy) output in all situations. The focus is on the specific parts of the output of each command that are the most important.
How and when you monitor a system is just as important as what commands you use and why you use them to collect statistics. You should make sure that you are monitoring the system when it is doing what you want it to do.
In some situations, this is relatively straightforward, such as on a multiuser interactive system. In this case, you want to run your stats during the day, when everyone is doing their normal work. Conversely, if you have a system that serves mainly as a database server, and the load gets very heavy at night when batch jobs are running, you should gather your stats overnight.
When collecting stats unattended (such as overnight), use a simple shell script that writes to a log file in /var/tmp with periodic timestamps. You can use something like the script in CODE EXAMPLE 1-1 to run the stat commands mentioned previously:
The way this script works, you will get timestamps at each interval count you specify. So, if you run:
CODE EXAMPLE 1-1 nightstats—Script for Unattended Stat Collection
#!/bin/sh
# nightstats - Script for unattended stat collection
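Only the first two lines of the nightstats script survive in this reproduction. A minimal sketch of what the remainder might look like, consistent with the description above (periodic timestamps, log file under /var/tmp), is below; the variable names, defaults, and choice of stat commands are my assumptions, not the book's actual script:

```shell
#!/bin/sh
# nightstats (sketch) - unattended stat collection with timestamps.
# Defaults are kept tiny so the sketch runs quickly; raise them in practice.
LOG=${LOG:-/var/tmp/nightstats.log}
INTERVAL=${INTERVAL:-1}    # seconds between samples (e.g., 5 in practice)
COUNT=${COUNT:-2}          # samples per batch
BATCHES=${BATCHES:-1}      # batches per run (e.g., run all night)
i=0
while [ "$i" -lt "$BATCHES" ]; do
    date >> "$LOG"                               # periodic timestamp
    vmstat "$INTERVAL" "$COUNT" >> "$LOG" 2>&1   # add iostat/mpstat similarly
    i=`expr $i + 1`
done
```

Launched from cron before the batch window, this leaves a timestamped log you can review in the morning.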
The simplest way to look at memory is by specifying a time interval to the vmstat command, and letting it run until you press Ctrl-C to interrupt it. The following vmstat command monitors the system in five-second intervals:
CODE EXAMPLE 1-2 How to Use the vmstat Command
# vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 -- -- --   in   sy   cs us sy id
The first line of the vmstat output is a summary.
Note – Always ignore the first line of any stat command. It does not provide any useful information because it is a summary for as long as the system has been up. Summaries span too long a period of time, and they give you no indication as to the use of the system during that time.
When looking at the output from vmstat (CODE EXAMPLE 1-2), you will notice a lot of columns. You should ignore all the fields about disks and device interrupts, as there are better tools for monitoring these stats, which we will describe in subsequent sections. In fact, only some of these columns (TABLE 3-3) are really useful.
First, look at the procs headings. Normally, the r, b, and w columns are fairly low numbers, if not 0. This is because, generally, these columns only become nonzero if a process is waiting for something, either a CPU (r), I/O (b), or enough memory (w). Large numbers in these columns are usually bad.
One caveat is that you may occasionally see a steady, unchanging number in the w column. This means that the Solaris software has decided these processes have been idle so long that they should be swapped out to make room for other things. Do not be concerned about this.
The cpu columns give you a good system-at-a-glance snapshot of what the system is doing, averaged across all processors. In general, non-idle time should be spent in roughly a 2-to-1 ratio in usr-to-sys modes. Also, if idle time (id) is consistently close to zero, you probably need some additional CPUs, especially if the r column is a large number. Beyond this, to get a good view of your CPUs you should use the mpstat command, as explained in "mpstat Command" on page 18.
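To make the relationship between r and id concrete, here is a hypothetical filter over vmstat-style output. The sample data line and the thresholds (r above 2, id below 5) are invented for illustration; the column positions follow the Solaris layout shown in CODE EXAMPLE 1-2 (r first, id last):

```shell
# Hypothetical check: flag samples with a long run queue (r) and
# near-zero idle time (id). The data line below is fabricated.
awk 'NR > 1 && $1 > 2 && $NF < 5 { busy++ }
     END { print (busy ? "possible CPU shortage" : "CPU looks OK") }' <<'EOF'
 r b w   swap  free re mf pi po fr de sr s0 -- -- --  in  sy  cs us sy id
 6 0 0 466576 23180  0  5  0  0  0  0  0  0  0  0  0 120 220 150 65 33  2
EOF
```

In real use you would pipe `vmstat 5` through a filter like this instead of a here-document.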
On to memory. First, note that the free column should be completely ignored, as it does not in any way correspond to what is thought of as free memory. Because of the way the Solaris software manages memory, the free list does not properly count multiple processes sharing the same pages, or unused pages that have yet to be reclaimed. In addition, the file cache grows to consume most of free memory to improve performance. Consequently, the free list tends to decrease steadily over the uptime of a system, when in fact the system is efficiently reclaiming and reusing memory.
If you want a better picture of available virtual memory, you can use the swap command:
If both the free column from the first command and the available column from the second command are nonzero, the system is all right. Beyond that, you can ignore the concept of free memory.
Instead, the most important column of vmstat is the scan rate (sr). This column shows the number of pages scanned in an attempt to free unused memory. The pageout scanner starts running only when free memory goes below the kernel parameter lotsfree, which is a small percentage of physical memory. When you see an increase in the scan rate, you should also see a jump in the page-outs (po), indicating that pages are being moved from physical memory to swap space. If you
As with the vmstat output, the key field is still sr, showing the scan rate. The benefit you get with -p is that you can now see what types of pages need the space, allowing you to better understand what the system is doing.
Look again at the system that is reading in a large file, only this time with the vmstat -p option.
CODE EXAMPLE 1-4 vmstat -p 5 Command Output Reading a Large File
# vmstat -p 5
     memory           page          executable      anonymous      filesystem
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
As you can see, this makes what is happening to the system much clearer. The system starts by paging in the file very effectively, until it hits the lotsfree limit and the page-out scanner starts. At this point, there is a big jump in the sr column. Also notice the abrupt shift from file system page-ins (fpi) to anonymous pi, po, and pf. This means that pages are being taken from other processes to make room for the file in memory. Thus, if you see a lot of activity in the apo and sr columns, you need more memory.
While memory analysis can be complicated, if you pay attention solely to the sr and po columns, you should be able to tell if your system needs additional memory.
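The same style of check works for memory. The sketch below flags vmstat -p samples where the scan rate (sr, column 7 in the layout above) is high while anonymous page-outs (apo, column 12) are nonzero; the sample data line and the sr threshold are invented for illustration:

```shell
# Hypothetical check: high sr plus nonzero apo suggests a memory shortage.
awk 'NR > 1 && $7 > 200 && $12 > 0 { short++ }
     END { print (short ? "needs more memory" : "memory looks OK") }' <<'EOF'
  swap  free re mf  fr de  sr epi epo epf api apo apf fpi fpo fpf
 41600 12800  0  0 500  0 900   0   0   0   0 160   0  80   0 500
EOF
```

As before, in real use you would pipe `vmstat -p 5` through the filter rather than a here-document.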
The Sun Fire system is designed to be a multiprocessor system, as evidenced by the fact that you cannot even buy a system with only one CPU. Even though you are looking at CPUs secondarily, being processor-bound is the least likely candidate for bad performance. If anything, you are exploring CPUs secondarily so that you can double-check this assumption, and rule it out as a possible factor. CPUs usually only become a factor in heavily loaded systems that are doing lots of interactive or transactional processing. In most other cases, if you buy enough system boards to hold all your memory, the CPUs that are included are usually sufficient.
As mentioned previously, the cpu columns of the vmstat output are a good place to start. Generally, a large percentage of idle time indicates that your processing power is sufficient. However, measuring idle time across a lot of processors can mask situations such as one processor getting swamped with interrupts while the rest do nothing. So, it is important to look at your CPUs in detail to make sure you are not missing anything.
Like vmstat, just launch mpstat with a time interval and let it run:
This command produces a lot of columns, only some of which you care about:
A cross-call (xcal) is a call used by one processor to tell other processors to do something. Cross-calls are used for a variety of things, such as delivering a signal to another processor or ensuring virtual memory consistency. This latter use is very common, as it happens during file system activity. Heavy file system activity (such as NFS) can result in a lot of cross-calls. Also, it is not unusual for the boot processor to show thousands of xcals, as it maintains lots of information about the others.
An interrupt (intr) is the mechanism that a device uses to signal to the kernel that it needs attention, and that some immediate processing is required on its behalf. I/O is the major contributor of interrupts, although there are also "special" interrupts, such as the system-wide clock thread that occurs regularly. Interrupts, unlike everything else, are not distributed across all CPUs. Instead, the Solaris OE binds each source of interrupts to a specific CPU.
The term context switch (csw) refers to the process of moving a thread on and off a
CPU. Context switches are a normal but somewhat expensive occurrence because
switching context involves certain overhead, such as populating the stack. Normally,
a context switch occurs when a process is done with the CPU and another process is
given a chance to run. Thus, a steady number of context switches is insignificant.
Involuntary context switches (icsw), on the other hand, are much less favorable.
When a process is given access to the CPU, it has a limited time window in which
to run, depending on how many other processes are running, what their priority is,
and so on. This is the nature of scheduling. An involuntary context switch means
that the process was forcibly stopped by the scheduler before it was finished; either
the time allotted was too short for the process to finish in, or a higher-priority thread
preempted it. A few of these is nothing to be concerned about, but getting a large
number of them regularly indicates that the system does not have enough
processing power to handle all of the things that need to run. You need additional
CPUs.
Finally, a spin on a mutex lock (smtx) happens when a thread cannot access a
section of the kernel that it needs on the first try. The term mutex is short for a
mutual exclusion lock, and is used in multithreaded operating systems like the Solaris
OE to allow multiple threads to run concurrently in system mode. When a thread
enters system mode, it locks the part of the kernel it is using by acquiring the
corresponding mutex.
TABLE 1-6 Important mpstat Command Output Columns (Continued)
Column Heading  Meaning
smtx            Spins on mutex locks
usr             Percent user time
sys             Percent system time
wt              Percent wait time
idl             Percent idle time
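The kind of scan described above can be automated. The following sketch flags CPUs whose icsw or smtx counts stand out in mpstat-style output; the sample text and the threshold values are illustrative assumptions, not measurements or limits from this book.

```python
# Sketch: scan mpstat-style output for warning signs on each CPU.
# SAMPLE and the thresholds below are assumed values for illustration.

SAMPLE = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   12   0   45   312  220  540   18    9   30    0   880   35  12   3  50
  1   10   0 5200   118   20  610  240   11  950    0   910   28  55   5  12
"""

def flag_busy_cpus(text, icsw_limit=100, smtx_limit=500):
    """Return CPU ids whose icsw or smtx exceed the (assumed) limits."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    icsw_i, smtx_i = header.index("icsw"), header.index("smtx")
    flagged = []
    for line in lines[1:]:
        cols = line.split()
        if int(cols[icsw_i]) > icsw_limit or int(cols[smtx_i]) > smtx_limit:
            flagged.append(int(cols[0]))
    return flagged

print(flag_busy_cpus(SAMPLE))  # [1] -- CPU 1 shows heavy icsw and smtx
```

In the sample, CPU 1 would merit a closer look: its involuntary context switches and mutex spins are both far above its neighbor's, even though overall idle time might look acceptable when averaged across both CPUs.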
As with the other stat commands, there are only a few columns you care about
(TABLE 1-8).
You can ignore two commonly used columns, %w and %b, which are supposedly the
percentage of time spent waiting and busy, respectively. Because of the complexity of
modern disks and controllers, these calculations are very inaccurate. Often the two
will total more than 100 percent, which should be impossible. Besides, these columns
do not tell you anything that you cannot find out by looking at wsvc_t or asvc_t.
Analogous to the mpstat command, when looking at iostat you should always
watch the first two columns listed (kr/s and kw/s) to see how much activity the
disks are undergoing. Then, basically, the last three columns should be as close to
zero as possible. This indicates that the system has very fast disks, and that the I/O
is laid out correctly to avoid controller bottlenecks.1
In practice, asvc_t will be nonzero for any disks undergoing activity, since it
always takes some amount of time for a disk to fulfill a request. As with any stat,
you will only be able to tell if the system is particularly busy after establishing a
baseline. However, several facts are true:
1. Service times across equally active disks should be fairly even.
2. You should not see huge peaks and valleys under normal conditions.
3. You should rarely, if ever, see a nonzero number in wait or wsvc_t.
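The first of these checks, evenness across equally active disks, is easy to mechanize. The sketch below flags any disk whose asvc_t is more than twice the lowest active service time; the disk names, numbers, and the 2x-spread rule are assumptions for illustration, not figures from this book.

```python
# Sketch: compare average service times (asvc_t) across active disks.
# The device names, times, and the 2x-spread threshold are assumed.

def uneven_disks(asvc_t_by_disk, spread=2.0):
    """Return disk names whose service time exceeds `spread` times the
    smallest nonzero service time in the set."""
    active = {d: t for d, t in asvc_t_by_disk.items() if t > 0}
    if not active:
        return []
    floor = min(active.values())
    return sorted(d for d, t in active.items() if t > spread * floor)

times = {"c1t0d0": 8.2, "c1t1d0": 7.9, "c2t0d0": 41.5, "c2t1d0": 8.4}
print(uneven_disks(times))  # ['c2t0d0'] stands out from its peers
```

A disk that stands out this way, while its peers service requests in single-digit milliseconds, suggests either an uneven data layout or a bottleneck on that disk's controller.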
You may, occasionally, see a temporary jump in service times (asvc_t) even though
there is nothing apparently going on (that is, kr/s and kw/s are almost 0). This is
due to a somewhat strange behavior of fsflush, the daemon responsible for
flushing disk buffers. Periodically, it will generate a long, random series of writes in
a short time period. This results in a queue forming, which bumps up the service
time, even though there is no real apparent activity on the disk. If you see this,
ignore it.
TABLE 1-8 Important iostat Command Columns
Column Heading  Meaning
kr/s            Kilobytes read per second
kw/s            Kilobytes written per second
wait            Number of transactions waiting for service
wsvc_t          Average service time in wait queue, in milliseconds
asvc_t          Average service time for active transactions, in milliseconds
1. Without the -n option, wsvc_t and asvc_t are combined into a single svc_t column.
Despite its limitations, you can tell several things from the netstat command
output. Unlike the other stats, you must run the netstat command separately for
each interface you have configured by specifying the -I option along with the
interface name.
You can tell two things from this display:
1. Total number of packets received (input) and transmitted (output) during that
interval, both for that interface (left set of columns) and for all interfaces (right set
of columns). This is not an average per second, but a total count.
2. Number of errors and collisions, which should always be low or zero.
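Because netstat reports totals per interval, any capacity estimate requires converting to a per-second rate and guessing an average packet size. The sketch below shows that arithmetic; the packet count, interval, packet size, and link speed are all assumed values, so treat the result as a back-of-the-envelope bound only.

```python
# Sketch: netstat -I prints packet *totals* per interval, not rates.
# Converting to packets per second and bounding throughput with an
# assumed average packet size gives only a rough estimate.

def rough_utilization(packets, interval_s, avg_pkt_bytes, link_bits_per_s):
    """Estimate the fraction of link capacity used, given an assumed
    average packet size."""
    bits_per_s = packets / interval_s * avg_pkt_bytes * 8
    return bits_per_s / link_bits_per_s

# Assumed numbers: 60,000 packets in a 5-second interval, 500-byte
# average packets, on a 100 Mbit/s interface.
u = rough_utilization(60_000, 5, 500, 100_000_000)
print(f"{u:.0%}")  # 48% of the link, if the size guess holds
```

This is exactly why the packet counts alone are hard to interpret: halving the assumed packet size halves the estimated utilization, which is why tools that measure actual bytes, such as MRTG, are preferable.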
Network capacity is very difficult to gauge with this limited information. Without
the sizes of each packet, it is impossible to know if you are anywhere near the
throughput limits for the interface you are analyzing. Given this information, if the
network seems slow, and you are seeing thousands and thousands of packets each
second, try adding another network interface card to see if it helps. If not, you
should examine your network as a whole to see if you have more widespread issues.
Many available freeware tools, such as the SE Toolkit and Multi Router Traffic
Grapher (MRTG), provide better network analysis than netstat. You can use tools
such as these to more properly gauge the bandwidth being used by each interface.
MRTG is especially useful, as it graphs utilization over time so you can easily see
when your network interfaces are getting busy, as well as how much bandwidth
they are pushing.
Analysis Reveals...
By this point, you should have a good idea about where the system is weak. Make
sure you have good notes, as you need this information in the next chapter when
you design your new system.
Giving performance tuning a full treatment is beyond the scope of this book. True
performance tuning gets exponentially harder; it is much more difficult to get the
last 10 percent out of a system than the first 90 percent. If you are interested in high-
end performance tuning, read Sun Performance and Tuning: Java and the Internet, 2nd
Edition by Adrian Cockcroft and Richard Pettit (ISBN 0-13-095249-4) and
“Application Performance Optimization” by Börje Lindh (Sun Microsystems AB,
Sweden), Sun BluePrints™ OnLine, March 2002.
Designing for RAS
This is the final step in the design process. By now, you should have a fairly clear
understanding of what your requirements are, as well as any possible problems with
your existing system. Up until now, this book focused mainly on performance
because you should make sure any solution you develop can meet your fundamental
application requirements. However, properly designing for RAS is just as important,
and requires some thought.
Always keep three principles in mind when designing for RAS:
- The more RAS you want, the more hardware you must add to the system.
- RAS is not just a function of the Sun Fire server, but of your entire site.
- Maximizing RAS can decrease performance.
The first point is almost always overlooked. As an example, to effectively use DR,
you should add boards in your design beyond those required for your applications.
Why? Because otherwise, when the system dynamically reconfigures a board out of
the system, it will not have enough resources to run your applications. The system
could start paging, or the CPUs could get too busy handling I/O interrupts to do
any real work. The requirements you have formed up to this point are the minimum
you need for your system.
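The headroom argument above amounts to a small calculation: size the board count from your minimum requirement, then add spares equal to the number of boards you want to be able to reconfigure out at once. The sketch below shows this; the CPU counts and the 4-CPUs-per-board figure are assumed example numbers, not a statement about any particular Sun Fire model.

```python
# Sketch: if DR can take a board out of service, the remaining boards
# must still cover the minimum requirement. Assumes identical boards;
# the workload numbers below are made up for illustration.
import math

def boards_to_buy(cpus_needed, cpus_per_board, boards_removable=1):
    """Boards to purchase so that removing `boards_removable` boards
    still leaves enough CPUs for the minimum requirement."""
    minimum = math.ceil(cpus_needed / cpus_per_board)
    return minimum + boards_removable

# Assumed workload: 14 CPUs needed, 4 CPUs per CPU/Memory board.
print(boards_to_buy(14, 4))  # 4 boards cover the minimum, plus 1 spare = 5
```

The same reasoning applies to memory: the boards left after a DR operation must still hold enough memory to avoid paging, so spare boards should carry their full complement of memory as well.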
As for the second point, purchasing redundant power supplies does not benefit you
if your site has only a single power grid with no UPS system. RAS is a function of
your entire site, not just one server in isolation. As with performance, getting that
final 10 percent of reliability out of a site gets exponentially more difficult, and
costly. Therefore, you should be realistic about both your requirements and
expectations, and your ability to fund them.
Third, taking advantage of certain RAS features and methodologies can decrease the
performance of your system. For example, if you mirror file systems, for each write
the system must now perform two writes, one to each half of the mirror. Some of
these effects can be mitigated, for instance by placing the two halves of the mirror on
different I/O controllers.1 However, such performance hits can add up, so it is
important to realize it is impossible to maximize both RAS and performance.
1. In fact, many volume managers will "round robin" between the two halves of a mirror on reads, actually increasing your read performance over a single disk.
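The two effects just described, doubled writes and round-robin reads, pull in opposite directions, so the net cost of mirroring depends on your read/write mix. The following is a deliberately crude first-order model; the per-disk throughput and write fraction are assumed numbers, and real volume managers will not hit these idealized rates.

```python
# Sketch: first-order effect of mirroring on throughput. Writes cost
# two physical writes; round-robin reads can use both halves. The
# 50 MB/s disks and 30% write mix are assumptions for illustration.

def mirrored_throughput(disk_mb_s, write_fraction):
    """Very rough effective MB/s for a mirrored pair under a mixed
    workload, in the idealized best case."""
    write_rate = disk_mb_s / 2   # each logical write = 2 physical writes
    read_rate = disk_mb_s * 2    # reads round-robin across both disks
    return write_fraction * write_rate + (1 - write_fraction) * read_rate

print(mirrored_throughput(50, 0.3))  # 77.5 MB/s under these assumptions
```

Under this model, a read-heavy workload can actually come out ahead of a single disk, while a write-heavy workload pays the full mirroring penalty, which is why the write mix matters when weighing RAS against performance.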
Note – You should always purchase redundant SCs for a system to ensure
availability in the event of a System Controller board failure. Without a functioning
System Controller board, none of the domains in a system will work.
Note – Even though you can use DR to replace failed components, a critical
component failure on a running system (such as a failed CPU) will still cause the
system to crash. If you cannot afford this type of downtime, you fit in the almost none
category, and should use a clustering product to guard against system failures.
For most organizations, the little downtime category is a good cost/benefit tradeoff.
You will have a system that is resilient to failures and, if properly configured,
relatively easy to service. You can use DR to add more CPU/Memory boards for
increased capacity, or to replace failed components.
Make a note of what category your system fits into, as well as the additional
components you will need. You are going to use this in the next chapter to design
your system. You will also use it later in the book during the discussion on
configuring the system to integrate with your site.
TABLE 1-9 RAS Design Decision Table
Allowable downtime  Your design should include...
Some                Redundant fan trays
                    Redundant power supplies and transfer switches1
Little              Redundant CPU/Memory boards
                    DR for CPU/Memory boards
                    Volume management software (such as Solaris™ Volume Manager (SVM)
                    or VERITAS Volume Manager (VxVM))
Very little         Redundant paths to I/O devices
                    Multipathing software for I/O (such as Multipath I/O (MPxIO) or
                    VERITAS Dynamic Multipathing (VxDMP))
                    Redundant network connections
                    Multipathing software for networks, such as Internet protocol
                    multipathing (IPMP)
                    DR for I/O devices and networks
Almost none         Multiple instances of fully redundant systems
1. Remember, redundant power helps only if your site is equipped to supply it.
Finally, some closing words on RAS. It is very important that you do not sacrifice
parts of your required configuration for additional RAS features. For example, do
not decide to buy less memory so that you can afford additional fan trays. You
should ensure that your base requirements are met, or else you will not benefit from
additional RAS because your system will have fundamental shortcomings.
Disk Redundancy and RAID Basics
To ensure the integrity of the data, some type of disk redundancy should be used on
any system with important local data storage. The different schemes for achieving
such redundancy are often denoted by their RAID level. The term RAID comes from
Redundant Array of Inexpensive Disks, and there are numbers from 0 all the way up
through 53 denoting different ways of laying out sets of disks.
For most applications, however, only three RAID levels are useful: 0, 1, and 5. Each
of these allows you to combine multiple physical disks into a single logical volume.
The operating system then sees this volume just like a normal disk, and it can be
mounted and used in the regular manner.
RAID 0
RAID 0, commonly called striping, provides no additional data safety. Instead, it is
designed to increase the speed of file system access. With striping, disks in a volume
are interleaved at a certain data interval, called the stripe unit size. This means that
when reading or writing data, multiple disks are accessed in parallel, decreasing the
amount of time it takes to access the data. Striping is very common on any system
that needs fast data access, such as database servers.
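The interleaving described above is purely arithmetic: the stripe unit size and the number of disks determine which disk any given byte lands on. The sketch below shows that mapping; the 64 KB stripe unit and four-disk volume are assumed example values, not recommendations.

```python
# Sketch: how striping interleaves data across disks. The stripe unit
# size and disk count below are assumptions for illustration.

def stripe_location(offset, stripe_unit, ndisks):
    """Map a logical byte offset to (disk index, offset on that disk)."""
    unit = offset // stripe_unit      # which stripe unit, counting from 0
    disk = unit % ndisks              # stripe units round-robin across disks
    stripe_row = unit // ndisks       # full rows of units before this one
    return disk, stripe_row * stripe_unit + offset % stripe_unit

# With a 64 KB stripe unit across 4 disks, offsets 0-255 KB land on
# disks 0-3 in turn; offset 256 KB wraps back around to disk 0.
print(stripe_location(256 * 1024, 64 * 1024, 4))  # (0, 65536)
```

Because consecutive stripe units live on different disks, a large sequential read touches all four spindles at once, which is where striping's speed advantage comes from.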
RAID 1
RAID 1, also referred to as mirroring, is just the reverse. It provides full data
redundancy, but with some performance costs. In mirroring, twice the number of
disks are used for the data that needs to be stored. These disks are then arranged in
pairs, and identical data is stored on both disks. On a file system write, two physical
writes must be performed, one to each disk of the pair. The advantage is you now
have two complete copies of your data.
This means you can lose half of your disks and still continue running without data
loss. In a large volume, this is obviously an advantage.
RAID 0+1
RAID 0+1, usually called striping and mirroring, is a combination of these two
techniques. In a striped/mirrored volume, a set of disks is striped together to form
each half. Then, these two halves are mirrored to one another. It is possible to design
a striped/mirrored volume so that the performance is better than the individual
disks (due to striping), and that fully half the disks can fail without impacting the
volume (due to mirroring). This technique is widely used in production systems.
RAID 1+0
RAID 1+0 is very similar to RAID 0+1, except the volumes are assembled in the
reverse order. Here, pairs of disks are mirrored to one another, and then these
mirrored pairs are striped together. Volumes created in this manner are slightly more
complicated to manage, but are slightly more reliable because of the ways in which
disks typically fail. Generally, vendors decide to implement either RAID 0+1 or
RAID 1+0, but not both, so the choice of which to use is often made for you.
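The reliability difference between the two layouts can be made concrete by counting which two-disk failures are fatal. In RAID 0+1, losing one disk kills its entire striped half, so any pair of failures that hits both halves takes the volume down; in RAID 1+0, only losing both disks of the same mirrored pair is fatal. The sketch below enumerates this for an assumed eight-disk volume.

```python
# Sketch: count the fatal two-disk failure combinations for RAID 0+1
# versus RAID 1+0. The 8-disk volume size is an assumed example.
from itertools import combinations

def fatal_pairs_0_plus_1(ndisks):
    # Disks 0..n/2-1 form stripe A, the rest stripe B. One failed disk
    # kills its stripe, so the volume dies if both stripes are hit.
    half = ndisks // 2
    return sum(1 for a, b in combinations(range(ndisks), 2)
               if (a < half) != (b < half))

def fatal_pairs_1_plus_0(ndisks):
    # Disks (0,1), (2,3), ... are mirrored pairs; the volume dies only
    # when both disks of the same pair fail.
    return sum(1 for a, b in combinations(range(ndisks), 2)
               if a // 2 == b // 2)

print(fatal_pairs_0_plus_1(8), fatal_pairs_1_plus_0(8))  # 16 4
```

Of the 28 possible two-disk failures in an eight-disk volume, 16 are fatal to RAID 0+1 but only 4 to RAID 1+0, which is the arithmetic behind RAID 1+0 being the slightly more reliable arrangement.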
RAID 5
Finally, RAID 5 is one of the most economical forms of redundancy. In this scheme, a
portion of each disk in a volume is used to hold parity. On a write, data is
distributed across all the disks in the volume except one, with the parity being
written to the remaining disk. This process is repeated in a "round robin" fashion, so
that each write places the parity for that write on a different disk. In the event of a
single disk failure, the parity is used to recreate data that was on the failed disk. This
allows you to lose a single disk (the most common type of failure) and continue
running without interruption. RAID 5 is somewhat slow, though, since it must
perform all those additional writes for the parity.
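The parity reconstruction just described works because parity is the XOR of the data blocks in a stripe, so any single lost block equals the XOR of the survivors. The sketch below demonstrates this in miniature; the three-data-disk stripe and the block contents are assumed example values.

```python
# Sketch: RAID 5 parity in miniature. The parity block is the XOR of
# the data blocks in a stripe, so any one lost block can be rebuilt
# from the rest. Three data disks and 4-byte blocks are assumed here.
from functools import reduce

def parity(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data disks
p = parity(data)                     # the parity block for this stripe

# Disk 1 fails: rebuild its block from the surviving data plus parity.
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])  # True
```

This also makes the write penalty visible: updating any one data block means recomputing and rewriting the parity block as well, which is the extra work that makes RAID 5 writes slow without hardware assistance.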
While RAID 5 is not as reliable as RAID 0+1 (striping and mirroring), it can still be a
good solution, especially for NFS servers. While you can only lose one disk, it is
uncommon to lose a whole enclosure barring human error or a power failure, both
of which will probably affect much more than your disks. To make use of RAID 5,
you should consider only those enclosures that support hardware RAID, since
otherwise it is too slow for many applications.
Once you have selected what type of RAID you wish to use for each of your
different volumes, you should adjust your storage purchase accordingly. For
example, if you want to mirror a set of data, you must purchase double the amount
of disk you calculated above. You will need to make sure to increase your controller
cards as well.
With RAID 5, check the enclosure you are considering purchasing to verify that it