BP Networking Concepts and Technology

Sun Microsystems, Inc.www.sun.com

Submit comments about this document at: http://www.sun.com/hwdocs/feedback

Networking Concepts andTechnology: A Designer’s Resource

Deepak Kakadia and Francesco DiMambro

Part No. 817-1046-10June 2004, Revision A

DENA.book Page i Friday, June 11, 2004 11:30 AM

PleaseRecycle

Copyright 2004 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved.

Sun Microsystems, Inc. has intellectual property rights relating to technology that is described in this document. In particular, and withoutlimitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one ormore additional patents or pending patent applications in the U.S. and in other countries.

This document and the product to which it pertains are distributed under licenses restricting their use, copying, distribution, anddecompilation. No part of the product or of this document may be reproduced in any form by any means without prior written authorization ofSun and its licensors, if any.

Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.

Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark inthe U.S. and in other countries, exclusively licensed through X/Open Company, Ltd.

Sun, Sun Microsystems, the Sun logo, AnswerBook2, docs.sun.com, iPlanet, Java, JavaDataBaseConnectivity, JavaServer Pages, EnterpriseJavaBeans, Netra Sun ONE , Sun Trunking, JumpStart, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S.and in other countries.

All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and in othercountries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledgesthe pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sunholds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s licensees who implement OPENLOOK GUIs and otherwise comply with Sun’s written license agreements.

U.S. Government Rights—Commercial use. Government users are subject to the Sun Microsystems, Inc. standard license agreement andapplicable provisions of the FAR and its supplements.

DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES,INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT,ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Copyright 2004 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, Californie 95054, Etats-Unis. Tous droits réservés.

Sun Microsystems, Inc. a les droits de propriété intellectuels relatants à la technologie qui est décrit dans ce document. En particulier, et sans lalimitation, ces droits de propriété intellectuels peuvent inclure un ou plus des brevets américains énumérés à http://www.sun.com/patents etun ou les brevets plus supplémentaires ou les applications de brevet en attente dans les Etats-Unis et dans les autres pays.

Ce produit ou document est protégé par un copyright et distribué avec des licences qui en restreignent l’utilisation, la copie, la distribution, et ladécompilation. Aucune partie de ce produit ou document ne peut être reproduite sous aucune forme, par quelque moyen que ce soit, sansl’autorisation préalable et écrite de Sun et de ses bailleurs de licence, s’il y en a.

Le logiciel détenu par des tiers, et qui comprend la technologie relative aux polices de caractères, est protégé par un copyright et licencié par desfournisseurs de Sun.

Des parties de ce produit pourront être dérivées des systèmes Berkeley BSD licenciés par l’Université de Californie. UNIX est une marquedéposée aux Etats-Unis et dans d’autres pays et licenciée exclusivement par X/Open Company, Ltd.

Sun, Sun Microsystems, le logo Sun, AnswerBook2, docs.sun.com, iPlanet, Java, JavaDataBaseConnectivity, JavaServer Pages, EnterpriseJavaBeans, Netra Sun ONE , Sun Trunking, JumpStart, et Solaris sont des marques de fabrique ou des marques déposées de Sun Microsystems,Inc. aux Etats-Unis et dans d’autres pays.

Toutes les marques SPARC sont utilisées sous licence et sont des marques de fabrique ou des marques déposées de SPARC International, Inc.aux Etats-Unis et dans d’autres pays. Les produits portant les marques SPARC sont basés sur une architecture développée par SunMicrosystems, Inc.

L’interface d’utilisation graphique OPEN LOOK et Sun™ a été développée par Sun Microsystems, Inc. pour ses utilisateurs et licenciés. Sunreconnaît les efforts de pionniers de Xerox pour la recherche et le développement du concept des interfaces d’utilisation visuelle ou graphiquepour l’industrie de l’informatique. Sun détient une license non exclusive de Xerox sur l’interface d’utilisation graphique Xerox, cette licencecouvrant également les licenciées de Sun qui mettent en place l’interface d ’utilisation graphique OPEN LOOK et qui en outre se conforment auxlicences écrites de Sun.

LA DOCUMENTATION EST FOURNIE "EN L’ÉTAT" ET TOUTES AUTRES CONDITIONS, DECLARATIONS ET GARANTIES EXPRESSESOU TACITES SONT FORMELLEMENT EXCLUES, DANS LA MESURE AUTORISEE PAR LA LOI APPLICABLE, Y COMPRIS NOTAMMENTTOUTE GARANTIE IMPLICITE RELATIVE A LA QUALITE MARCHANDE, A L’APTITUDE A UNE UTILISATION PARTICULIERE OU AL’ABSENCE DE CONTREFAÇON.

DENA.book Page ii Friday, June 11, 2004 11:30 AM

DENA.book Page iii Friday, June 11, 2004 11:30 AM

Acknowledgements

Deepak says, “I would like to thank the many people who have gone out of theirway to help me not only with writing this book, but also teaching me about variousaspects of my professional and academic career.

“First, I am very grateful for the tremendous corporate support I received from ScottMcNealy, Clark Masters, Gary Beck, Brad Carlile, and Bill Sprouse. I feel extremelyfortunate to be a part of a team with the greatest, most unselfish corporateleadership of modern time. I would like to thank Kemer Thomson, Vicky Hardman,Gary Rush, Barb Jugo, Alice Kemp, a veteran book writer and also my book mentor,and Diana Lins, who did the illustrations in this book.

“Much of the data center work developed in this book was built on the shoulders ofgiants – Richard Croucher, a world-renowned expert in datacenter technologies, Dr.Jim Baty, Dr. Joseph Williams, Mikael Lofstrand, and Jason Carolan.

“Frank and I are very grateful to the technical reviewers who spent considerabletime reviewing and providing feedback and comments: David Auslander, MartinLorenz, Ken Pepple, Mukund Buddhikot, my good friends and collegues: MarkGarner, who sacrificed many pub nights for me, David Deeths, Don Devitt, and JohnHoward.

I would also like to thank Dr. Nick McKeown, professor at Stanford University, RuiZhang-Shen, Ph.D. student at Stanford University, John Fong, NortelNetworks, JohnReuter and David Bell of Foundry Networks, Dan Mercado and Bill Cormier ofExtreme Networks, and Sunil Cherian of ArrayNetworks.

“Above all,” says Deepak, “I thank my wife, Jagruti, and daughters, Angeli andKristina, for their patience, sacrifice, and understanding of why I had to miss bothschool programs and family events. Finally, I want thank my mother, mother-in-law,and father-in-law for helping out at home during my many absences.”

iii

DENA.book Page iv Friday, June 11, 2004 11:30 AM

Frank says, “We must also remember the unsung heros who implement, test, sustain,and measure performance of the Sun network technology, device drivers, and devicedriver framework for their outstanding contribution to the Sun networkingtechnology. Their assistance and support helped gain the collective experience thathas been essential to providing the best possible networking capability for Sun and,in turn, the material necessary for some parts of this book.

“These include the Network ASIC development team, in particular Shimon Muller,Binh Pham, George Chu, and Carlos Castil. The device driver development team,including Sumanth Kamatala, Joyce Yu, Paul Simons, Raghunath Shenbagam, DavidGordon and Paul Lodrige. The Networking Quality Assurance team, including LalitBhola, Benny Chin, Jie Zhu, Alan Hanson, Neeraj Gupta, Deb Banerjee, CharleenYee, and Ovid Jacob. The Solaris Device Driver Framework Development team,including Adi Masputra, Jerry Chu, Priyanka Agarwal, and Paul Durrant. TheSystem Performance Measurement team: Patrick Ong, Jian Huang, Paul Rithmuller,Charles Suresh, and Roch Borbonnais.”

“I would love to say a big thank you to my wife, Bridget,” Frank says, “for herpatience and encouragment as I progressed with my contribution to this book.

“Thanks to my two sons, Francesco and Antonio, for distracting me from time totime and forcing me to take a break from the book and play. For them, ‘Dad’swriting his book’ was not a reasonable excuse. God bless them, for they were right.Who would imagine someone three feet tall could have that much insight? Thosebreaks made all the difference.”

iv Networking Concepts and Technology: A Designer’s Resource

DENA.book Page v Friday, June 11, 2004 11:30 AM

Contents

1. Overview 1

Evolution of Web Services Infrastructures 1

The Data Center IP Network 4

Network Traffic Characteristics 7

End-to-End Session: Tuning the Transport Layer 9

Network Edge Traffic Steering: IP Services 10

Server Networking Internals 12

Network Availability Design Patterns 13

Reference Implementations 14

2. Network Traffic Patterns: Application Layer 17

Services on Demand Architecture 18

Multi-Tier Architecture and Traffic Patterns 20

Mapping Tiers to the Network Architecture 21

Inter-tier Traffic Flows 22

Web Services Tier 23

Application Services Tier 26

Architecture Examples 29

Designing for Vertical Scalability and Performance 31

Designing for Security and Vertical Scalability 32

v

DENA.book Page vi Friday, June 11, 2004 11:30 AM

Designing for Security and Horizontal Scalability 33

Example Solution 34

3. Tuning TCP: Transport Layer 37

TCP Tuning Domains 38

TCP Queueing System Model 40

Why the Need to Tune TCP 41

TCP Packet Processing Overview 44

TCP STREAMS Module Tunable Parameters 46

TCP State Model 48

Connection Setup 49

Connection Established 50

Connection Shutdown 50

TCP Tuning on the Sender Side 51

Startup Phase 51

Steady State Phase 53

TCP Congestion Control and Flow Control – Sliding Windows 53

TCP Tuning for ACK Control 54

TCP Example Tuning Scenarios 56

Tuning TCP for Optical Networks – WANS 56

Tuning TCP for Slow Links 59

TCP and RDMA Future Data Center Transport Protocols 62

4. Routers, Switches, andAppliances—IP-Based Services: Network Layer 65

Packet Switch Internals 66

Emerging Network Services and Appliances 71

Server Load Balancing 72

Hash 73

Round-Robin 74

vi Networking Concepts and Technology: A Designer’s Resource

DENA.book Page vii Friday, June 11, 2004 11:30 AM

Smallest Queue First /Least Connections 74

Finding the Best SLB Algorithm 76

How the Proxy Mode Works 78

Advantages of Using Proxy Mode 80

Disadvantages of Using Proxy Mode 80

How Direct Server Return Works 80

Advantages of Direct Server Return 82

Disadvantages of Direct Server Return 82

Server Monitoring 83

Persistence 83

Commercial Server Load Balancing Solutions 84

Foundry ServerIron XL—Direct Server Return Mode 84

Extreme Networks BlackDiamond 6800 Integrated SLB— ProxyMode 86

Layer 7 Switching 88

Network Address Translation 91

Quality of Service 92

The Need for QoS 92

Classes of Applications 92

Data Transfers 93

Video and Voice Streaming 93

Interactive Video and Voice 93

Mission-Critical Applications 94

Web-Based Applications 94

Service Requirements for Applications 94

QoS Components 95

Implementation Functions 95

QoS Metrics 95

Network and Systems Architecture Overview 96

Contents vii

DENA.book Page viii Friday, June 11, 2004 11:30 AM

Implementing QoS 98

ATM QoS Services 98

Sources of Unpredictable Delay 99

QoS-Capable Devices 102

Implementation Approaches 102

Functional Components—High-Level Overview 102

QoS Profile 103

Deployment of Data and Control Planes 104

Packet Classifier 105

Metering 105

Marking 106

Policing and Shaping 107

IP Forwarding Module 107

Queuing 107

Congestion Control 107

Packet Scheduler 109

Secure Sockets Layer 109

SSL Protocol Overview 110

SSL Acceleration Deployment Considerations 112

Software-SSL Libraries—Packet Flow 112

The Crypto Accelerator Board—Packet Flow 113

SSL Accelerator Appliance—Packet Flow 115

SSL Performance Tests 117

Test 1: SSL Software Libraries versus SSL AcceleratorAppliance—Netscaler 9000 117

Test 2: Sun Crypto Accelerator 1000 Board 118

Test 3: SSL Software Libraries versus SSL Accelerator Appliance—ArrayNetworks 119

Conclusions Drawn from the Tests 121

viii Networking Concepts and Technology: A Designer’s Resource

DENA.book Page ix Friday, June 11, 2004 11:30 AM

5. Server Network Interface Cards: Datalink and Physical Layer 123

Token Ring Networks 123

Token Ring Interfaces 125

Configuring the SunTRI/S Adapter with TCP/IP 125

Setting the Maximum Transmission Unit 126

Disabling Source Routing 127

Disabling ARI/FCI Soft Error Reporting 127

Configuring the Operating Mode 127

Resource Configuration Parameter Tuning 128

Configuring the SunTRI/P Adapter with TCP/IP 128


Configuring the Ring Speed 129

Configuring the Locally Administered Address 130

Fiber Distributed Data Interface Networks 131

FDDI Stations 132

Single-Attached Station 132

Dual-Attached Station 133

FDDI Concentrators 134

Single-Attached Concentrator 134

Dual-Attached Concentrator 135

FDDI Interfaces 136

Configuring the SunFDDI/S Adapter with TCP/IP 137


Target Token Rotation Time 137

Configuring the SunFDDI/P Adapter with TCP/IP 138


Target Token Rotation Time 139

Ethernet Technology 139

Contents ix

DENA.book Page x Friday, June 11, 2004 11:30 AM

Software Device Driver Layer 140

Transmit 140

Receive 144

Jumbo Frames 152

Ethernet Physical Layer 152

Basic Mode Control Layer 153

Basic Mode Status Register 154

Link-Partner Auto-negotiation Advertisement Register 155

Gigabit Media Independent Interface 157

Ethernet Flow Control 161

Example 1 164

Example 2 164

Fast Ethernet Interfaces 165

10/100 hme Fast Ethernet 165

Current Device Instance in View for ndd 168

Operational Mode Parameters 168

Transceiver Control Parameter 169

Inter-Packet Gap Parameters 170

Local Transceiver Auto-negotiation Capability 171

Link Partner Capability 173

Current Physical Layer Status 174

10/100 qfe Quad Fast Ethernet 175







x Networking Concepts and Technology: A Designer’s Resource

DENA.book Page xi Friday, June 11, 2004 11:30 AM


10/100 eri Fast Ethernet 182





Receive Interrupt Blanking Parameters 187




10/100 dmfe Fast Ethernet 190





Fiber Gigabit Ethernet 195

1000 vge Gigabit Ethernet 196

1000 ge Gigabit Ethernet 196









Performance Tunable Parameters 206

10/100/1000 ce GigaSwift Gigabit Ethernet 209

Contents xi

DENA.book Page xii Friday, June 11, 2004 11:30 AM



Flow Control Parameters 213

Gigabit Link Clock Mastership Controls 213




Random Early Drop Parameters 216

PCI Bus Interface Parameters 217

Jumbo Frames Enable Parameter 217

Performance Tunables 218

10/100/1000 bge Broadcom BCM 5704 Gigabit Ethernet 220





Sun VLAN Technology 228

VLAN Configuration 230

Sun Trunking Technology 230

Trunking Configuration 231

Trunking Policies 232

Network Configuration 233

Configuring the System to Use the Embedded MAC Address 233

Configuring the Network Host Files 234

Setting Up a GigaSwift Ethernet Network on a Diskless Client System 235

Installing the Solaris Operating System Over a Network 236

Configuring Driver Parameters 238

Setting Network Driver Parameters Using the ndd Utility 238

xii Networking Concepts and Technology: A Designer’s Resource

DENA.book Page xiii Friday, June 11, 2004 11:30 AM

Using the ndd Utility in Non-interactive Mode 240

Using the ndd Utility in Interactive Mode 240

Reboot Persistence Using driver.conf 242

Global driver.conf Parameters 242

Per-Instance driver.conf Parameters 243

Using /etc/system to Tune Parameters 244

Network Interface Card General Statistics 245

Ethernet Media Independent Interface Kernel Statistics 246

Maximizing the Performance of an Ethernet NIC Interface 249

Ethernet Physical Layer Troubleshooting 250

Deviation from General Ethernet MII/GMII Conventions 253

Ethernet Performance Troubleshooting 254

ge Gigabit Ethernet 255

ce Gigabit Ethernet 256

6. Network Availability Design Strategies 261

Network Architecture and Availability 261

Layer 2 Strategies 264

Trunking Approach to Availability 264

Theory of Operation 265

Availability Issues 265

Load-Sharing Principles 267

Availability Strategies Using SMLT and DMLT 271

Availability Using Spanning Tree Protocol 274

Availability Issues 274

Layer 3 Strategies 278

VRRP Router Redundancy 279

IPMP—Host Network Interface Redundancy 279

Integrated VRRP and IPMP 280

Contents xiii

DENA.book Page xiv Friday, June 11, 2004 11:30 AM

OSPF Network Redundancy—Rapid Convergence 281

RIP Network Redundancy 288

Conclusions Drawn from Evaluating Fault Detection and Recovery Times 292

7. Reference Design Implementations 295

Logical Network Architecture 296

IP Services 298

Stateless Server Load Balancing 298

Stateless Layer 7 Switching 299

Stateful Layer 7 Switching 300

Stateful Network Address Translation 302

Stateful Secure Sockets Layer Session ID Persistence 302

Stateful Cookie Persistence 304

Design Considerations: Availability 305

Collapsed Layer 2/Layer 3 Network Design 308

Multi-Tier Data Center Logical Design 309

How Data Flows Through the Service Modules 312

Physical Network Implementations 315

Secure Multi-Tier 315

Multi-Level Architecture Using Many Small Switches 316

Flat Architecture Using Collapsed Large Chassis Switches 317

Physical Network—Connectivity 320

Switch Configuration 322

Configuring the Extreme Networks Switches 323

Configuring the Foundry Networks Switches 324

Master Core Switch Configuration 326

Standby Core Switch Configuration 327

Server Load Balancer 328

Server Load Balancer 329

xiv Networking Concepts and Technology: A Designer’s Resource

DENA.book Page xv Friday, June 11, 2004 11:30 AM

Network Security 330

Netscreen Firewall 333

A. Lyapunov Analysis 337

Glossary 341

Index 355

Contents xv

DENA.book Page xvi Friday, June 11, 2004 11:30 AM

DENA.book Page xvii Friday, June 11, 2004 11:30 AM

Figures

FIGURE 1-1 Web Services Infrastructure Impact on Data Center Network Architectures 2

FIGURE 1-2 High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners(a) 5

FIGURE 1-3 High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners(b) 6

FIGURE 1-4 Influence of Multi-Tier Software Architectures on Network Architecture 8

FIGURE 1-5 Transport Layer Traffic Flows Tuned According to Client Links 10

FIGURE 1-6 Data Center Edge IP Services 11

FIGURE 1-7 Data Center Networking Considerations on the Server 12

FIGURE 1-8 Availability Strategies in the Data Center 14

FIGURE 1-9 Example Implementation of an Enterprise Muli-Tier Data Center 15

FIGURE 2-1 Main Components of Multi-Tier Architecture 19

FIGURE 2-2 Logical View of Multi-Tier Service on Demand Architecture 20

FIGURE 2-3 Network Inter-tier Traffic Flows of a Web-based Transaction 22

FIGURE 2-4 Model of Presentation/Web Tier Components and Interfacing Elements 24

FIGURE 2-5 High-Level Survey of EJB Availability Mechanisms 27

FIGURE 2-6 Decoupled Web Tier and Application Server Tier—Vertically Scaled 31

FIGURE 2-7 Tightly Coupled Web Tier and Application Server Tier—Vertically Scaled 32

FIGURE 2-8 Decoupled Web Tier and Application Server Tier—Horizontally Scaled 33

FIGURE 2-9 Tested and Implemented Architecture Solution 35

FIGURE 3-1 Overview of Overlapping Tuning Domains 39

xvii

DENA.book Page xviii Friday, June 11, 2004 11:30 AM

FIGURE 3-2 Closed-Loop TCP System Model 40

FIGURE 3-3 Perfectly Tuned TCP/IP System 42

FIGURE 3-4 Tuning Required to Compensate for Faster Links 43

FIGURE 3-5 Tuning Required to Compensate for Slower Links 44

FIGURE 3-6 Complete TCP/IP Stack on Computing Nodes 45

FIGURE 3-7 TCP and STREAM Head Data Structures Tunable Parameters 47

FIGURE 3-8 TCP State Engine Server and Client Node 49

FIGURE 3-9 TCP Startup Phase 52

FIGURE 3-10 TCP Tuning for ACK Control 55

FIGURE 3-11 Comparison between Normal LAN and WAN Packet Traffic 57

FIGURE 3-12 Tuning Required to Compensate for Optical WAN 59

FIGURE 3-13 Comparison between Normal LAN and WAN Packet Traffic—Long Low Bandwidth Pipe 60

FIGURE 3-14 Increased Performance of InfiniBand/RDMA Stack 63

FIGURE 4-1 Internal Architecture of a Multi-Layer Switch 68

FIGURE 4-2 High-Level Model of Server Load Balancing 73

FIGURE 4-3 High-Level Model of the Shortest Queue First Technique 75

FIGURE 4-4 Round-Robin and Weighted Round-Robin 76

FIGURE 4-5 Server Load Balanced System Modeled as N - M/M/1 Queues 77

FIGURE 4-6 System Model of One Queue 78

FIGURE 4-7 Server Load Balance—Packet Flow: Proxy Mode 79

FIGURE 4-8 Direct Server Return Packet Flow 81

FIGURE 4-9 Content Switching Functional Model 90

FIGURE 4-10 Overview of End-to-End Network and Systems Architecture 97

FIGURE 4-11 One-Way End-to-End Packet Data Path Transversal 100

FIGURE 4-12 QoS Functional Components 104

FIGURE 4-13 Traffic Burst Graphic 106

FIGURE 4-14 Congestion Control: RED, WRED Packet Discard Algorithms 108

FIGURE 4-15 High-Level Condensed Protocol Overview 111

FIGURE 4-16 Packet Flow for Software-based Approach to SSL Processing 113

FIGURE 4-17 PCI Accelerator Card Approach to SSL Processing—Partial Offload 114

xviii Networking Concepts and Technology: A Designer’s Resource

DENA.book Page xix Friday, June 11, 2004 11:30 AM

FIGURE 4-18 SSL Appliance Offloads Frontend Client SSL Processing 116

FIGURE 4-19 SSL Test Setup with No Offload 117

FIGURE 4-20 Throughput Increases Linearly with More Processors 119

FIGURE 4-21 SSL Test Setup for SSL Software Libraries 119

FIGURE 4-22 SSL Test Setup for an SSL Accelerator Appliance 120

FIGURE 4-23 Effect of Number of Threads on SSL Performance 120

FIGURE 4-24 Effect of File Size on SSL Performance 121

FIGURE 5-1 Token Ring Network 124

FIGURE 5-2 Typical FDDI Dual Counter-Rotating Ring 132

FIGURE 5-3 SAS Showing Primary Output and Input 133

FIGURE 5-4 DAS Showing Primary Input and Output 134

FIGURE 5-5 SAC Showing Multiple M-ports with Single-Attached Stations 135

FIGURE 5-6 DAC Showing Multiple M-ports with Single-Attached Stations 136

FIGURE 5-7 Communication Process between the NIC Software and Hardware 140

FIGURE 5-8 Transmit Architecture 141

FIGURE 5-9 Basic Receive Architecture 145

FIGURE 5-10 Hardware Transmit Checksum 147

FIGURE 5-11 Hardware Receive Checksum 148

FIGURE 5-12 Software Load Balancing 149

FIGURE 5-13 Hardware Load Balancing 150

FIGURE 5-14 Basic Mode Control Register 153

FIGURE 5-15 Basic Mode Status Register 154

FIGURE 5-16 Link Partner Auto-negotiation Advertisement 155

FIGURE 5-17 Link Partner Priority for Hardware Decision Process 156

FIGURE 5-18 Auto-negotiation Expansion Register 157

FIGURE 5-19 Extended Basic Mode Control Register 158

FIGURE 5-20 Basic Mode Status Register 158

FIGURE 5-21 Gigabit Extended Status Register 159

FIGURE 5-22 Gigabit Control Status 159

FIGURE 5-23 Gigabit Status Register 160

Figures xix

DENA.book Page xx Friday, June 11, 2004 11:30 AM

FIGURE 5-24 GMII Mode Link Partner Priority 161

FIGURE 5-25 Flow Control Pause Frame Format 161

FIGURE 5-26 Link Partner Auto-negotiation Advertisement Register 162

FIGURE 5-27 Rx/Tx Flow Control in Action 163

FIGURE 5-28 Typical hme External Connectors 166

FIGURE 5-29 Typical qfe External Connectors 175

FIGURE 5-30 Typical vge and ge MMF External Connectors 196

FIGURE 5-31 Sun GigaSwift Ethernet MMF Adapter Connectors 209

FIGURE 5-32 Sun GigaSwift Ethernet UTP Adapter Connectors 209

FIGURE 5-33 Example of Servers Supporting Multiple VLANs with Tagging Adapters 229

FIGURE 6-1 Network Topologies and Impact on Availability 263

FIGURE 6-2 Trunking Software Architecture 265

FIGURE 6-3 Trunking Failover Test Setup 266

FIGURE 6-4 Correct Trunking Policy on Switch 268

FIGURE 6-5 Incorrect Trunking Policy on Switch 268

FIGURE 6-6 Correct Trunking Policy on Server 269

FIGURE 6-7 Incorrect Trunking Policy on a Server 270

FIGURE 6-8 Incorrect Trunking Policy on a Server 271

FIGURE 6-9 Layer 2 High-Availability Design Using SMLT 272

FIGURE 6-10 Layer 2 High-Availability Design Using DMLT 273

FIGURE 6-11 Spanning Tree Network Setup 275

FIGURE 6-12 High-Availability Network Interface Cards on Sun Servers 280

FIGURE 6-13 Design Pattern—IPMP and VRRP Integrated Availability Solution 281

FIGURE 6-14 Design Pattern—OSPF Network 282

FIGURE 6-15 RIP Network Setup 289

FIGURE 7-1 Logical Network Architecture Overview 297

FIGURE 7-2 IP Services—Switch Functions Operate on Incoming Packets 299

FIGURE 7-3 Application Redirection Functional Model 300

FIGURE 7-4 Content Switching Functional Model 301

FIGURE 7-5 Network Switch with Persistence Based on SSL Session ID 303

xx Networking Concepts and Technology: A Designer’s Resource

DENA.book Page xxi Friday, June 11, 2004 11:30 AM

FIGURE 7-6 Tested SSL Accelerator Configuration—RSA Handshake and Bulk Encryption 304

FIGURE 7-7 Network Availability Strategies 305

FIGURE 7-8 Logical Network Architecture—Design Details 306

FIGURE 7-9 Traditional Availability Network Design Using Separate Layer 2 Switches 308

FIGURE 7-10 Availability Network Design Using Large Chassis-Based Switches 309

FIGURE 7-11 Logical Network Architecture with Virtual Routers, VLANs, and Networks 310

FIGURE 7-12 Logical Network 313

FIGURE 7-13 Secure Multi-Tier 315

FIGURE 7-14 Multi-Tier Data Center Architecture Using Many Small Switches 316

FIGURE 7-15 Network Configuration with Extreme Networks Equipment 318

FIGURE 7-16 Sun ONE Network Configuration with Foundry Networks Equipment 319

FIGURE 7-17 Physical Network Connections and Addressing 321

FIGURE 7-18 Collapsed Design Without Layer 2 Switches 322

FIGURE 7-19 Foundry Networks Implementation 325

FIGURE 7-20 Firewalls between Service Modules 331

FIGURE 7-21 Virtual Firewall Architecture Using Netscreen and Foundry Networks Products 332

Figures xxi

DENA.book Page xxii Friday, June 11, 2004 11:30 AM

DENA.book Page xxiii Friday, June 11, 2004 11:30 AM

Tables

TABLE 2-1 Network Inter-tier Traffic Flows of a Web-based Transaction 23

TABLE 5-1 tr.conf Parameters 126

TABLE 5-2 MTU Sizes 126

TABLE 5-3 Source Routing Values 127

TABLE 5-4 ARI/FCI Soft Error Reporting Values 127

TABLE 5-5 Operating Mode Values 128

TABLE 5-6 trp.conf Parameters 129

TABLE 5-7 Maximum Transmission Unit 129

TABLE 5-8 Ring Speed 130

TABLE 5-9 nf.conf Parameters 137


TABLE 5-11 Request Operating TTRT 138

TABLE 5-12 pf.conf Parameters 138


TABLE 5-14 Request Operating Target Token Rotation Time 139

TABLE 5-15 Multi-Data Transmit Tunable Parameter 144

TABLE 5-16 Possibilities for Resolving Pause Capabilities for a Link 163

TABLE 5-17 Driver Parameters and Status 167

TABLE 5-18 Instance Parameter 168

TABLE 5-19 Operational Mode Parameters 168

xxiii

DENA.book Page xxiv Friday, June 11, 2004 11:30 AM

TABLE 5-20 Transceiver Control Parameter 170

TABLE 5-21 Inter-Packet Gap Parameter 171

TABLE 5-22 Local Transceiver Auto-negotiation Capability Parameters 172

TABLE 5-23 Link Partner Capability Parameters 173

TABLE 5-24 Current Physical Layer Status Parameters 174











TABLE 5-35 Inter-Packet Gap Parameters 186

TABLE 5-36 Receive Interrupt Blanking Parameters 187














xxiv Networking Concepts and Technology: A Designer’s Resource

DENA.book Page xxv Friday, June 11, 2004 11:30 AM




TABLE 5-53 Performance Tunable Parameters 207




TABLE 5-57 Read-Write Flow Control Keyword Descriptions 213

TABLE 5-58 Gigabit Link Clock Mastership Controls 214



TABLE 5-61 Rx Random Early Detecting 8-Bit Vectors 216

TABLE 5-62 PCI Bus Interface Parameters 217

TABLE 5-63 Jumbo Frames Enable Parameter 217

TABLE 5-64 Performance Tunable Parameters 218






TABLE 5-70 General Network Interface Statistics 245

TABLE 5-71 General Network Interface Statistics 246

TABLE 34 Physical Layer Configuration Properties 253

TABLE 5-72 List of ge Specific Interface Statistics 255

TABLE 5-73 List of ce Specific Interface Statistics 256

TABLE 7-1 Network and VLAN Design 311

TABLE 7-2 Sequence of Events for FIGURE 7-12 314

TABLE 7-3 Physical Network Connections and Addressing 320

Tables xxv

DENA.book Page xxvi Friday, June 11, 2004 11:30 AM

DENA.book Page xxvii Friday, June 11, 2004 11:30 AM

Preface

Networking Concepts and Technology: A Designer’s Resource is a resource for networkarchitects who must create solutions for emerging network environments inenterprise data centers. You’ll find information on how to leverage Sun™ OpenNetwork Environment (Sun ONE) technologies to create Services on Demandsolutions as well as technical details about the networking internals. You’ll also learnhow to integrate your environment with advanced networking switchingequipment, providing sophisticated Internet Protocol (IP) services beyond plainvanilla Layer 2 and Layer 3 routing. Based upon industry standards, expertknowledge, and hands-on experience, this book provides a detailed technicaloverview of the following:

■ Design of highly available, scalable, manageable gigabit network architectureswith a focus on the server-to-switch tier. We will share key ingredients forsuccessful deployments based on actual experiences.

■ Emerging IP services that vastly improve Sun ONE-based solutions, giving you acentralized source of concise information about these services, the benefits theyprovide, how to implement them, and where to use them. Example servicesinclude quality of service (QoS), server load balancing (SLB), Secure SocketsLayer (SSL), and IPSec.

■ Sun Networking software and hardware technologies available. We describe andexplain how Sun differs from the competition in the networking arena, and thensummarize the internal operations and describe technical details that lead into thetuning section. Currently there are only blind recommendations for tuning, withno explanations. This book fills that void by first describing the networkingtechnology, which variables serve what purpose, what tuning will do, and why.

xxvii

DENA.book Page xxviii Friday, June 11, 2004 11:30 AM

The Sun BluePrints ProgramThe mission of the Sun BluePrints program is to empower Sun s customers with thetechnical knowledge required to implement reliable, extensible, and secureinformation systems within the data center using Sun products. This programprovides a framework to identify, develop, and distribute preferred practicesinformation that applies across the Sun product lines. Experts in technical subjects invarious areas contribute to the program and focus on the scope and advantages ofthe information.

The Sun BluePrints program includes books, guides, and online articles. Throughthese vehicles, Sun can provide guidance, installation and implementationexperiences, real-life scenarios, and late-breaking technical information.

The monthly electronic magazine, Sun BluePrints OnLine, is located on the Web at:

http://www.sun.com/blueprints.

To be notified about updates to the Sun BluePrints program, please register on thissite.

Who Should Read This BookThis book is intended for readers with varying degrees of experience with andknowledge of computer system and server technology, who are designing,deploying, and managing a data center within their organizations. Typically theseindividuals already have UNIX® knowledge and a clear understanding of their IPnetwork architectural needs.

The book is targeted at network architects who must design and implement highlyavailable, scalable data centers.

How This Book Is OrganizedThis book is organized into the following chapters:

■ Chapter 1 provides an overview of this book and its concepts.

xxviii Networking Concepts and Technology: A Designer’s Resource

DENA.book Page xxix Friday, June 11, 2004 11:30 AM

■ Chapter 2 explores the main components of a typical enterprise Services onDemand network architecture and some of the more important underlying issuesthat impact network architecture design decisions.

■ Chapter 3 describes some of key Transport Control Protocol (TCP) tunableparameters related to performance tuning: how these tunables work, how theyinteract with each other, and how they impact network traffic when they aremodified.

■ Chapter 4 describes the internal architecture of a basic network switch andprovides a comprehensive discussion of server load balancing.

■ Chapter 5 discusses the networking technologies that are regularly found in adata center.

■ Chapter 6 provides an overview of the various approaches and describes where itmakes sense to apply that solution.

■ Chapter 7 describes network implementation concepts and details.

■ Appendix A provides an example of the Lyapunov function.

■ Glossary provides definitions for the technical terms and acronyms used in thisbook.

Shell Prompts

Shell Prompt

C shell machine-name%

C shell superuser machine-name#

Bourne shell and Korn shell $

Bourne shell and Korn shell superuser #

Preface xxix

DENA.book Page xxx Friday, June 11, 2004 11:30 AM

Typographic Conventions

Accessing Sun DocumentationYou can view, print, or purchase a broad selection of Sun documentation, includinglocalized versions, at:

http://www.sun.com/documentation

Typeface Meaning Examples

AaBbCc123 The names of commands, files,and directories; on-screencomputer output

Edit your.login file.Use ls -a to list all files.% You have mail.

AaBbCc123 What you type, when contrastedwith on-screen computer output

% su

Password:

AaBbCc123 Book titles, new words or terms,words to be emphasized.Replace command-line variableswith real names or values.

Read Chapter 6 in the User’s Guide.These are called class options.You must be superuser to do this.To delete a file, type rm filename.

xxx Networking Concepts and Technology: A Designer’s Resource

DENA.book Page 1 Friday, June 11, 2004 11:30 AM

CHAPTER 1

Overview

This book provides a resource for network architects who design IP networkarchitectures for the typical data center. It provides abstractions as well as detailedinsights based on actual network engineering experience that include networkproduct development and real-world customer experiences. The focus of this book islimited to network architectures that support Web services-based multi-tierarchitectures. However, this includes everything from the edge data center switch(which connects the data center to the existing backbone network) to the servernetwork protocol stacks. While there is tremendous acceptance of Web servicestechnologies and multi-tier software architectures, there is limited information abouthow to create the network infrastructures required in the data center to optimallysupport these new architectures. This book also provides a new perspective on howto think about solving this problem, leveraging new emerging technologies that helpcreate superior solutions. It explains in detail how certain key technologies work andwhy they are recommended procedures.

One of the complexities of networking in general is that the technology requiresbreadth of knowledge. Networking connectivity spans many completely differenttechnologies. It is a complex, interconnected, commingled, interrelated set ofcomponents and devices, including hardware, software, and different solutionapproaches for each segment. We try to simplify this complexity by extracting keysegments and taking a layered approach in describing an end-to-end solution whilelimiting the scope of material to the data center.

Evolution of Web ServicesInfrastructuresThe Web service infrastructure is middleware, which is embraced by both earlyadopters and mainstream enterprises interested in reducing costs and integratinglegacy applications. The Web service paradigm is platform neutral as well as

1


hardware and software neutral. Thus enterprises can easily communicate withemployees, customers, business partners, and vendors while maintaining the specificsecurity requirements, access, and privileges needs of all. The examples used in thisbook will focus on Web-based systems, which simplifies our discussion from anetworking perspective.

FIGURE 1-1 shows a conceptual model of how this new paradigm impacts the datacenter network architecture, which must efficiently support this infrastructure. TheWeb services-based infrastructure allows different applications to integrate throughthe exposed interface, which is the advertised service. The internal details of theservice and subservices required to provide the exposed service are hidden. Thisapproach has a profound impact on the various networks, including the serviceprovider network and the data center network that support the bulk of theintelligence required to deliver these Web-based services.

FIGURE 1-1 Web Services Infrastructure Impact on Data Center Network Architectures

Data center network architectures are driven by computing paradigms. One canargue that the computing paradigm has now come full circle. From the 1960s to the1980s, the industry was dominated by a centralized data center architecture that

Business Partner

NetworkCustomerNetwork Vendor

Network Service Provider

Network

Data Center

Network

Network Architectures

Network Architectures

Federated Networks

Federated Networks

Legacy Application 1

Legacy Application 2

CustomerApplication

BusinessPartner

Application

VendorApplication

Web Services Middleware InfrastructureWSDL

UDDISOAP

HTTPXML

Web-basedService B

Web-basedService A

APPLICATIONS

Protocols

Protocols

2 Networking Concepts and Technology: A Designer’s Resource


revolved around a mainframe with remote terminal clients. Systems NetworkArchitecture (SNA) and Binary Synchronous Communication (BSC) were dominantprotocols. In the early to mid 1990s, client-server computing influenced a distributednetwork architecture. Departments had their local workgroup server with localclients and an occasional link to the corporate database or mainframe. Now,computing has returned to a centralized architecture (where the enterprise datacenter is more consolidated) for improved manageability and security. Thiscentralized data center architecture is required to provide access to intranet andInternet clients, with different devices, link speeds, protocols, and security levels.Clients include internal corporate employees, external customers, partners, andvendors—each with different security requirements. A single flexible and scalablearchitecture is required to provide all these different services. Now the networkarchitect requires a wider and deeper range of knowledge, including Layer 2 andLayer 3 networking equipment vendors, emerging startup appliance makers, andserver-side networking features. Creating optimal data center edge architectures isnot only about routing packets from the client to the target server or set of serversthat collectively expose a service, but also about processing, steering, and providingcascading services at various layers.

For the purposes of this book, we distinguish network design from architecture asfollows:

■ Architecture is a high-level description of how the major components of the systeminterconnect from a logical and physical perspective.

■ Design is a process that specifies, in sufficient detail for implementation, how toconstruct a network of interconnected nodes that meets or exceeds functional andnon-functional requirements (performance, availability, scalability, and such).

Advances in networking technologies, combined with the rapid deployment of Web-based, mission-critical applications, brought growth and significant changes inenterprise IP network architectures. The ubiquitous deployment of Web-basedapplications that has streamlined business processes has further accelerated Webservices deployments. These deployments have a profound impact on thesupporting infrastructures, often requiring a complete paradigm shift in the way wethink about building the network architectures. Early client-server deployments hadnetwork traffic pattern characteristics that were predominantly localized traffic overlarge Layer 2 networks. As the migration towards Web-based applicationsaccelerated, client-server deployments evolved to multi-tier architectures, resultingin different network traffic patterns, often outgrowing the old architectures.Traditional network architectures were designed on the assumption that the bulk ofthe traffic would be local or Layer 2, with proportionately less inter-company orInternet traffic. Now, traffic is very different due to the changing landscape ofcorporate business policies towards virtual private networks (VPNs), consumer-to-business, and business-to-business e-commerce. These innovations have also givenrise to new challenges and opportunities in the design and deployment of emergingenterprise data center IP network architectures.

Chapter 1 Overview 3


This book describes why these network traffic patterns have changed, definingmulti-tier data centers that support these emerging applications and then describinghow to design and build suitable network architectures that will optimally supportmulti-tier data centers. The focus of this book spans the edge of the data centernetwork to the servers. The scope of this book is limited to the data center edge, soit does not cover the core of the enterprise network.

The Data Center IP NetworkTogether FIGURE 1-2 and FIGURE 1-3 provide a high-level overview of the variousinterconnected networks, collectively referred to as the Internet. These illustrationsalso show the relationship between the client-side and enterprise-side networks.They show two distinct networks that can be segregated based on business entities:

■ The Internet Service Provider (ISP) that provides connectivity to the publicInternet for both clients and enterprises

■ The owners of the physical plant and communications equipment, which fall intoone of the following categories:

■ The Incumbent Local Exchange Carrier (ILEC), which provides local access tosubscribers in a local region

■ The Inter Exchange Carrier (IXC), which provides national and internationalaccess to subscribers

■ The Tier 2 ISP, which is usually a private company that leases lines and cagespace at an ILEC facility, or it can be the ILEC or IXC itself.

The diagram shows Tier 2 ISPs as being relatively local ISPs, situated in the accessnetworks, whereas Tier 1 ISPs have their own longhaul backbone and provide widerregional coverage, situated in the Core, or national backbone. Tier 1 often aggregatesthe traffic of many Tier 2 ISPs, in addition to providing services directly toindividual subscribers. Large networks connect to each other through peeringpoints, such as MAE-East/MAE-West, which are public peering points, or throughNetwork Access Points (NAPs), such as SPRINTS NAP, which are private.



FIGURE 1-2 High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (a)

PDA

Cable modem

DSLmodem

CSU/DSU

DSLAM

Cable

HFC

PSTN

DSL

Mobile Wireless

Dial Up

Remote AccessTier 2 ISP/

Access Networks

IPEthernet

IPEthernet

IPEthernet

IPPPPAAL5/ATMSONET

IPPPPL2TPEthernet

IPPPPCDMA/GPRS/UMTS

IPPPPV.34, V.90,56kbps

IPDOCSIS

Cablehead end

ATM

ATM

ATM

Metro areanetwork

CO/POP

Accessswitch

GPRS

GPRSEdge

Voice Circuit

T1-LeasedLine



FIGURE 1-3 High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (b)

MAENAP

Enterprise Data Center Network

Enterprise RemoteBranch Office

Vendors

Partners

Tier 1 ISPBackbone Network

Peeringpoints



A client can be any software that initiates a request for a service. This means that aWeb server itself can be a client. For example, when in the process of replying backto a client for a Web page, it needs to fetch images from an image server. FIGURE 1-2and FIGURE 1-3 show remote dial-up clients as well as corporate clients. Dependingon the distances between the client and the server hosting the Web service, datamight need to traverse a variety of networks for end-to-end communication. Thefocus of this book is the network that interconnects the servers located in anenterprise data center or the data center of an ISP offering collocation services. Wedescribe the features and functions of the networking equipment and servers insufficient depth to help a network architect in the design of enterprise IP networkarchitectures.

We take a layered approach, starting from the application layer down to the physicallayer to describe the implications on the design of network architectures. Wedescribe not only high-level architectural principles, but also key details such astuning the transport layer for optimal working networks. We discuss in detail howthe building blocks of the network architecture are constructed and work to makemore informed design decisions. Finally, we present actual tested configurations,providing a baseline for customizing and extending these configurations to meetactual customer requirements.

Network Traffic CharacteristicsOne of the first steps in designing network architectures for the data center isunderstanding network traffic patterns. FIGURE 1-4 shows how the application layerfits within the overall networking infrastructure. The network architecture must bedesigned to meet the bandwidth and latency requirements of the enterprise networkapplications, both at steady state and during episodes of congestion. We will providean overview of some of the tiers that are typically deployed in all multi-tiersolutions. There are several reasons for partitioning the solution into tiers, includingthe ability to control network traffic access between tiers. Switches are designed tolook at traffic from source to destination by dividing the application layer into tiers.This allows each tier one-to-one mapping to a corresponding virtual local areanetwork (VLAN), which is also directly mapped to a specific network or subnet.



FIGURE 1-4 Influence of Multi-Tier Software Architectures on Network Architecture

Chapter 2 offers insight into the applications that generate the traffic flows across thetiers. Inter-tier traffic starts with a client request, which can originate from remotedial-up, intranet corporate employee, Internet partner, and so on. This HyperTextTransport Protocol (HTTP) or HyperText Transport Protocol over SSL (HTTPS)packet is usually about a hundred bytes. The server response is usually a 1000-byteto 200-kilobyte file, often consisting of Web page images. Chapter 2 describes the keycomponents and technologies used at the application layer and provides somedeeper insights into achieving availability.

Enterprise Multi-Tier

Data Center

ORB



The processing of client Web requests and generation of an HTTP response mayrequire significant processing across various Web and application or legacy servers.Examples of typical applications include business applications implemented usingEnterprise JavaBeans™ (EJBs) on an application server, mail messaging, anddynamic Web page generation using JavaServer Pages™ (JSP) and servlets. Thenature of the traffic requirements should be clearly identified and quantified. Mostimportant are the identification and specification of handling peaks or bursts. Weprovide detailed Web, application, and database tier traffic flows and availabilitystrategies, which directly impact inter-tier traffic flows.

End-to-End Session: Tuning the TransportLayerOne key factor in successful deployments is the optimal working network. InChapter 3, we describe how to tune server-side Transport Control Protocol (TCP) tomeet the challenges that arise as a result of clients connecting at different networkbandwidths and latencies. TCP is a complex window-based protocol that must betuned in order to achieve good throughput. FIGURE 1-5 shows the importance of end-to-end connectivity and tuning TCP due to the widely different latencies,bandwidths, and congestion that each client may be subjected to when connecting tothe data center network server. Almost all current material provides blindrecommendations about which parameters to tune. We describe exactly whichparameters are important to tune and the impact that tuning has on other parts ofTCP. A network architect is often consulted about how to improve performance aftera network has been designed and implemented. Chapter 3, although used after thedesign phase, was included to fill a void in this area.



FIGURE 1-5 Transport Layer Traffic Flows Tuned According to Client Links

Network Edge Traffic Steering: IPServicesFIGURE 1-6 shows an overview of the various IP services that can be used in creatinga multi-tier data center network architecture. These services are essentially packetprocessing functions that alter the flow of traffic from client to server for increasingcertain aspects of the architecture. Firewalls and Secure Sockets Layer (SSL) areadded for increasing security. Server load balancing (SLB) is used to increaseavailability, scalability, flexibility, and performance. In Chapter 4, we describe keyservices that the architect can leverage, including in-depth explanations of how theywork and which variant is the best and why. A question that is often asked is whichserver load balancing algorithm is best and why. We provide a detailed technicalanalysis explaining exactly why one algorithm is the best. We also provide detailed

TCP

IP

Ethernet


Data Center



explanation of the new emerging quality of service (QoS), which has gained moreimportance because of increasing deployment time-dependent applications, such asVoice over IP (VoIP) or multimedia.

Most enterprise networks are overprovisioned. Normal steady-state network flowsare usually not an issue. What really concerns most competent network architects ishow to handle peaks or bursts. Every potential incoming HTTP request could be arevenue-producing opportunity that cannot be discarded. Here is where QoS playsan essential role. One of the missing pieces in most network architectures is planningfor handling peak workloads and providing differentiated services. When there iscongestion, we absolutely must prioritize and service the important customers aheadof casual browsing Web surfers. Quality of service will be discussed in detail: itsimportance, where to use it, and how it works.

FIGURE 1-6 Data Center Edge IP Services

SLB

QoS

NAT

SSL

FW

ServiceAccessPoint


Data Center



Server Networking InternalsThe design of data center network architectures not only involves an understandingof networking equipment, but equally important, the network interface cards (NICs)and software protocols and drivers on the servers that provide services (seeFIGURE 1-7). There are a variety of different cards that the architect can choose from.Depending on the requirements, appropriate design choices dictate which NIC ismost suitable. Further tuning drivers that directly interface with NICs and the rest ofthe server protocol stack requires understanding of server networking internals.Chapter 5 presents a survey of available NICs and then provides an overview of theinternal operations and insights ito tuning key parameters.

FIGURE 1-7 Data Center Networking Considerations on the Server

IP

Driver

Network

IP

Driver

NetworkInterfaceCard

InterfaceCard


Data Center



Network Availability Design PatternsChapter 6 presents the various approaches to achieving high-availability networkdesigns, including trade-offs and recommendations. We describe various techniques,provide some detailed configuration examples of the different ways to connectservers to the edge switches, and touch on how data center switches can beconfigured for increased availability, as shown in FIGURE 1-8. The material in Chapter6 is based on actual customer experiences and has proven to be quite valuable to thenetwork architect. We describe the following Layer 2 approaches:

■ Trunking – NIC, server side

■ Trunking – Switch side, including Distributed Multi-link Trunking (DMLT)

■ Spanning Tree Protocol (STP)

The following Layer 3 strategies are described:

■ Virtual Router Redundancy Protocol (VRRP) – default router redundancymechanisms

■ IP Multipathing (IPMP) – NIC redundancy

■ Open Shortest Path First (OSPF) and Routing Information Protocol (RIP) – datacenter routing protocol availability features.

The advantages and disadvantages will be described, along with suggestions onwhich approach makes sense for which situation.



FIGURE 1-8 Availability Strategies in the Data Center

Reference ImplementationsThe final chapter ties together all the concepts from previous chapters and describesactual network architectures from a complete solution standpoint. The material inChapter 7 is based on actual tested configurations. FIGURE 1-9 shows an example ofthe tested configurations. The solution described in Chapter 7 is generic enough tobe useful for actual solutions, yet customizable for specific requirements. The logicalnetwork architecture describes a high-level overview of the Layer 3 networks andsegregates the various service tiers. IP services, which are implemented at keyboundary points of the architecture, are then reviewed. We describe the designconsiderations that lead to the physical architecture. We discuss two differentarchitectural approaches and show how different network switch vendors can bedeployed, including the advantages and disadvantages of each. We present detaileddescriptions of configurations and implementations. A hardware-based firewall isused to show a logical firewall solution, providing security between each tier, yetusing only one appliance. For increased availability, a second appliance is optionallyadded.


Data Center



FIGURE 1-9 Example Implementation of an Enterprise Muli-Tier Data Center

Client 1 Client 2

Level 2-3 edge switch

WebserviceTier

Sun Fire6800

Sun Fire680010.30.0.101

Sun Fire6800

Sun Fire6800

10.30.0.100

ClientaccessTier

ApplicationserviceTier

DatabaseserviceTier

DirectoryserviceTier

Master core192.168.10.2

Sun Fire 280R Sun Fire 280R Sun Fire 280R Sun Fire 280R


T3

T3

192.168.10.1

10.10.0.110.20.0.110.40.0.110.30.0.1

10.50.0.1

10.10.0.100 10.10.0.101 10.10.0.102 10.10.0.103

10.20.0.100 10.20.0.101 10.20.0.102 10.20.0.103

10.40.0.100 10.40.0.101

Standby core192.168.10.3

Server load-balancer switches

Foundryswitches




CHAPTER 2

Network Traffic Patterns:Application Layer

In the design of network architectures, it is essential to understand the applicationsused and the resulting network traffic injected into the environment. There are manyclasses of networked applications. This chapter focuses on Web-based applicationsthat are predominant in the data center. Most Web-based applications can bepartitioned into the following functional tiers:

■ Presentation Tier■ Web Tier■ Application Tier■ Naming Services Tier■ Data Tier

In this chapter we will explore the main components of a typical enterprise Serviceson Demand network architecture and some of the more important underlying issuesthat impact network architecture design decisions. We describe in detail the Web tierand Application tier, pointing out issues that impact design decisions for availability,performance, manageability, security, and scalability. We then describe someexample architectures that were actually deployed in industry. Topics include:

■ “Services on Demand Architecture” on page 18 describes the overall architecturefrom a software perspective, showing the applications that generate the networktraffic.

■ “Multi-Tier Architecture and Traffic Patterns” on page 20 describes the mappingprocess from the logical architecture to the physical realization onto the networkarchitecture. It also describes the inter-tier traffic patterns and the reasons behindthe network traffic.

■ “Web Services Tier” on page 23 describes the most important tier, which is presentin all Web-based architectures. This section provides detailed insights into theapplications that run on this tier and directly impact the network architecture.

■ “Application Services Tier” on page 26 describes the Application tier and therelationship to the Web services tier. This section provides detailed insights intothe applications that run on this tier and directly impact the network architecture.

17


■ “Architecture Examples” on page 29 provides examples of various architecturesbased design trade offs and the reasons behind them. It’s important to note that itis the application characteristics that influence the design of the networkarchitecture.

■ “Example Solution” on page 34 describes an actual tested and implementedmulti-tier architecture, using the design concepts and principles described in thischapter.

Services on Demand ArchitectureEnterprise architects are faced with a wide and often confusing variety ofdeployment options for configuring corporate applications that are accessed notonly by employees but also by customers and business partners. These applicationscan include legacy mainframe, traditional client-server, Web-based, and otherapplications. The challenge is to create a unified framework where all these differentsoftware technologies can be easily managed, deployed, and universally accessible.Conceptually, all these applications can be considered as services. When a userneeds a service, it should be immediately available on demand. A Services onDemand architecture consists of a framework of completely integrated technologiesthat empowers enterprise architects to develop, manage, integrate, and deploy anyapplication within an open standards-based set of protocols that can be accessed byany Web client running on any device such as PC desktops, laptops, PDAs, or cellphones.

Sun ONE is a set of products that uses an open standards-based softwarearchitecture. Sun ONE is designed for implementing and deploying Web-basedapplications, also known as Services on Demand.

The tiers are usually segregated based on functionality and security. The front tierusually performs some form or presentation functionality. The Application tierusually performs the business logic, and the Data tier maintains persistent storage ofdata. In actual practice, the tiers are usually less distinct because of optimizations forsecurity or performance. In addition, designs will vary due to optimizations forperformance, availability, and security. These directly impact the networkarchitecture.

The Sun ONE architecture spans the Web, Application, and Data tiers. However, thefocus is on the Application tier. Sun ONE provides a common framework where allthese components can be developed, implemented, and deployed, being assured oftested and proven integration capabilities.

FIGURE 2-1 illustrates the main components of a typical multi-tier architecture fordeploying Web-based applications. Note the firewall component is not included inthis illustration to simplify the focus of this discussion.



FIGURE 2-1 Main Components of Multi-Tier Architecture

Service Creation, Assembly and Deployment

Identity and PolicyService

Delivery

Service

Integration

Platform

Web

Server

Web

Server

Router

Router

JDBC

Connections

Naming

Messaging

Mail

Security

JPDA

Admin & Monitor

Servlet

Container

Sun ONE Application Server

EJB

Container

HTTP

Server

ORB

JDBC

Oracle

Legacy

Connections

Naming

Messaging

Mail

Security

JPDA

Admin & Monitor

Servlet

Container

Sun ONE Application Server

EJB

Container

HTTP

Server

ORB

Internet

Service

Container

Web

Services

Applications

Core Web

Services

Fire

wal

l

Fire

wal

l

Fire

wal

l

Chapter 2 Network Traffic Patterns: Application Layer 19


Multi-Tier Architecture and TrafficPatternsWe first take a look at a the high-level logical view of a typical multi-tierarchitecture. FIGURE 2-2 shows the main logical networks that support an enterprisemulti-tier based service infrastructure:

FIGURE 2-2 Logical View of Multi-Tier Service on Demand Architecture

ExternalNetwork

192.168.10.0.

Web TierNetwork

10.10.0.0.

Directory TierNetwork

10.20.0.0.

ManagementNetwork

10.100.0.0. Accessto all Networks

Database TierNetwork

10.100.0.0.

BackupNetwork

10.110.0.0.

SANNetwork

10.50.0.0.

App Serv TierNetwork

10.30.0.0.

ClientNetwork

172.16.0.0.



FIGURE 2-2 shows how the various services map directly to a corresponding logicalLayer 3 network cloud, shown above in boxes, which then maps directly onto aLayer 2 VLAN. The mapping process starts with the high-level model of the servicesto be deployed onto the physical model. This top-down approach allows networkarchitects to maintain some degree of platform neutrality. The target hardware canchange or scale, but the high-level model remains intact.1

Mapping Tiers to the Network ArchitectureThis mapping process allows the software architecture to be decoupled from thehardware architecture, resulting in a flexible modular solution. From a networkarchitecture perspective, there are two key tools you can use:

■ Layer 2 VLANs segregate Layer 2 broadcast domains and service domains. Anexample of a service domain would be a group of Web servers, load balanced,horizontally scaled, and aggregated to provide a highly available service with asingle IP access point, commonly deployed in actual practice as a VIP on a loadbalancer.

■ Layer 3 IP networking segregates Layer 3 routed domains and service domains.Segregating service domains based on IP addresses makes this service networkaccessible to any host on any Layer 3 IP network. One advantage of this approachis that the service interface for each cloud only needs to be one endpoint, which iseasily implemented by a virtual IP (VIP) address. The service is actually providedacross many subservice instances running on physically separated servers,collectively forming a logical cluster. The external world does not need to know(and should not know for many reasons, especially security) about the individualservers that provide the service. By creating a layer of indirection, the requestingclient need not be modified if any one server is removed or replaced. Thisdecoupling improves manageability and serviceability.

This mapping process allows better control of the network traffic by providing amechanism for routers and switches to steer the traffic according to user-definedrules. In actual practice, these user-defined rules are accomplished by configuringVLANs, static routes, and access control lists (ACLs). A further benefit allows trafficto be filtered at wirespeed to identify flows for other services such as Quality ofService (QoS).

1Keep in mind the physical constraints imposed by the actual target hardware. Examples of physical constraintsinclude the number of ports on a network switch and computing capacities.



Inter-tier Traffic FlowsUnderstanding the traffic flows is important when determining the inter-tier linkbandwidth requirements. FIGURE 2-3 illustrates the typical network traffic flows as aresult of a Web-based transaction. TABLE 2-1 describes each flow in detail,corresponding to the numbers in the illustration.

FIGURE 2-3 Network Inter-tier Traffic Flows of a Web-based Transaction

Clients

SwitchingServices

WebServices

ApplicationServices

DatabaseServices

DirectoryServices

1 10

2 9

7 6

3 4 8 5



The Item column in TABLE 2-1 corresponds with the numbers in FIGURE 2-3.

Web Services TierDesign strategies for the Web Services tier are directly influenced by thecharacteristics of the software components that run on the Web server. Thesecomponents include static Web pages, Java Server Pages (JSP), and servlets. In thissection we describe the important characteristics, how they work, and how theyimpact the design of the network architecture.

The Sun ONE Web server will be used to illustrate the concepts presented in thischapter. It is important to note that the Sun ONE Web server can be deployed as astandalone product or as an integrated component within the Sun ONE Applicationserver. This has implications on the design strategies detailed in this chapter.

TABLE 2-1 Network Inter-tier Traffic Flows of a Web-based Transaction

Item Interface1 Interface2 Protocol Description

1 Client Switch HTTP Client initiates Web request.

2 Switch Web service HTTP Switch redirects client request toparticular Web server based on L2-L7and SLB configuration.

3 Web service Directory service LDAP Web service request directory service.

4 Directory service Web service LDAP Directory service resolves request.

5 Web service Applicationservice

RMI Servlet obtains handle to EJB bean,invokes a method on remote object.Web server talks to the iAS through aWeb connector, which uses NSAPI,ISAPI, or optimized CGI.

6 Applicationservice

Database service OracleProprietary TNS

Entity Bean requests to retrieve orupdate row in DB table.

7 Database service Applicationservice

OracleProprietary TNS

Entity Bean request completed.

8 Applicationservice

Web service RMI Application server returns dynamiccontent to Web server.

9 Web service Switch HTTP Switch receives reply from Webserver.

10 Switch Client HTTP Switch rewrites IP header, returnsHTTP request to client.



FIGURE 2-4 Model of Presentation/Web Tier Components and Interfacing Elements

FIGURE 2-4 provides an overview of a high-level model of the Presentation, Web,Application, and Data-tier components and interfacing elements. The followingdescribes the sequence of interaction between the client and the multi-tierarchitecture:

1. Client initiates a Web request; HTTP request reaches Web server.

2. Web server processes client request and passes request to backend Applicationserver, containing another Web server with a servlet engine.

3. Servlet processes a portion of the request and requests spurring service from anEnterprise Java Bean™ (EJB) running on Application server containing an EJBcontainer.

4. EJB retrieves data from database.

The Sun ONE Application server comes with a bundled Web server container.However, reasons for deploying a separate Web tier include security, loaddistribution, and functional distribution. There are two availability strategies thatdepend on the type of operations that are executed between the client and Webserver processes:

FIREWALL FIREWALL

ClientBrowser

LoadBalancerSession

Stickyness

Sun ONEWeb Server 6.0

Sun ONEWeb Server 6.0

Sun ONEApp Server

Sun ONEApp Server



■ Stateless and Idempotent – If the nature of transactions is idempotent (where theorder of transactions is not dependent on each other), then the availabilitystrategy at the Web tier is trivial. Both availability and scalability are achieved byreplication. Web servers are added behind a load-balancer switch. This class oftransactions includes static Web pages and simple servlets that perform a singlecomputation.

■ Stateful – If the transactions between the client and server require that state bemaintained between individual client HTTP requests and server HTTP responses,then the problem of availability is more complicated and discussed in this section.Examples of this class of applications include shopping carts, bankingtransactions, and the like.

The Sun ONE Web servers provide various services including SSL, a Web containerthat serves static content, JSP software, and a servlet engine. Availability strategiesinclude a front-end multilayer switch with load balancing capabilities and the abilityto switch based on SSL session IDs and cookies. If the Web servers are only servingstatic pages, then the load balancer will provide sufficient availability; if any Webserver fails, subsequent client requests will be forwarded to the remaining survivingservers. However, if the Web servers are running JSP software or servlets thatrequire session persistence, the availability strategy is more complex.

Implementing session failover capabilities can be accomplished by coding, Webcontainer support, or a combination of both. There are actually severalcomplications, including the fact that even if the transparent session failoverproblem is solved for failures that occur at the beginning of transactions, idempotenttransactions still pose a problem for transactions that have started and then failedbecause the client is unaware of the server state. A programmatic session failoversolution can involve leveraging the javax.servlet.http.HttpSession object,storing and retrieving user session state to or from an LDAP directory or databaseusing cookies in the client’s HTTP request. Some Web containers provide the abilityto cluster HttpSession objects using elaborate schemes, but they still have flaws suchas failures in the middle of a transaction. These clustering schemes involve memory-based session persistence or database-based persistence and a replicated HttpSessionobject on a backup server. If the primary server fails, the replica takes over. The SunONE Web server availability strategy for HttpSession persistence offers extendingthe IWSSessionManager, which in multiprocess mode can share session informationacross multiple processes running on multiple Web servers. This means that a clientrequest has an associated session ID, which identifies the specific client. Thisinformation can be saved and subsequently retrieved either in a file that resides on aNetwork File System (NFS) mounted directory or by having the databaseIWSSessionManager create an IWSHttpSession object for each client session. TheIWSSessionManager will require some coding efforts to support distributed sessionsso that if the primary server that maintained a particular session fails, the standbyserver running another IWSSessionManager should retrieve the persistent sessioninformation from persistent store based on the session ID. Logic is also required toensure the load balancer would redirect the client’s HTTP request to the backup Webserver based on additional cookie information.



Currently there is no support for SSL session failover in Sun ONE Web Server 6.0.

HttpSession failover can be implemented by extending IWSSessionManager using ashared NFS file or database session persistence strategies, providing user control andflexibility.

Application Services TierDesign strategies for increased availability for the Application Services tier canbecome complex because of the different entities with varying availabilityrequirements. Some entities require failover mechanisms and some do not. Thissection presents a survey of various availability strategies implemented by variousvendors and then discusses examples of the various Sun ONE Application Serverarchitectures and associated availability strategies.

At the Application Server tier, you are working with the following entities:

■ HttpSession – This is the client session object that the Web container creates andmanages for each client HTTP request. Session failover mechanisms weredescribed in the previous section.

■ Stateless Session Bean – This type of EJB does not require any session failoverservices. If a client request requires logic to be executed in a stateless sessionbean, and the server where that bean is deployed fails, an alternative server canredo the operation correctly without any knowledge of the failed bean. Failuredetection by the client plug-in or application login must detect when theoperation has failed and reinitiate the same operation on a secondary server withthe appropriately deployed EJB component.

■ Stateful Session Bean – This type of EJB component requires sophisticatedmechanisms to maintain state between the primary and backup—in addition tothe required failover mechanisms described in the Stateless Session Bean case.

■ Entity Bean – There are two types of Entity Beans: Container ManagedPersistence (CMP) and Bean Managed Persistence (BMP). These essentially differin whether the container or the user code is responsible for ensuring persistence.In either case, session failover mechanisms other than those already provided inthe EJB 2.0 specification are not required because Entity Beans represent a row ina database and the notion of session is replaced by transaction. Clients usuallyaccess Entity Beans at the start of transactions. If a failure occurs, the entiretransaction is rolled back. An alternative server can redo the transaction, resultingin correct operation.



The degree of transparency of the failover requires some consideration. In somecases, the client is completely unaware that a failure occurred and an automaticfailover action took place. In other situations, the client times out and must reinitiatea transaction.

FIGURE 2-5 High-Level Survey of EJB Availability Mechanisms

FIGURE 2-5 shows an abstract logical overview of the various transactions that cantranspire among a client, a Web server instance, and an Application server instance.Note that the firewall is not shown to simplify our discussion. FIGURE 2-5 illustratesthree scenarios: Points 1 through 7 depict one scenario, Point 8 depicts the secondscenario, and Point 9 depicts the third scenario. The numbered arrows in the figurecorrespond to the following:

CookiePrim - AS1Sec - AS2

1

2

WS1

WS2

JSP/Servlet

Web Container EJB Container

Web Container

JNDI

HomeHomeSTUS

ObjectSTUS

WebPlug-in

RemoteStaticHTML

3

6

5

4

9

9

7

8

AS1 Primary

NamespaceJNDI local tree JNDI global tree

Modified JNDI

System Services - DSYNCGXDSyncModule

Home Object EJB

EJBClass

Instance

EJBClass

Instance

EJBClass

Instance

EJB Object

EJB Container

AS1 Replicated

NamespaceJNDI local tree JNDI global tree

Modified JNDI

System Services - DSYNCGXDSyncModule

Home Object EJB

EJBClass

Instance

EJBClass

Instance

EJBClass

Instance

EJB Object



Scenario 1

1. A client makes an HTTP request, which may contain some cookie stateinformation to preserve state between that individual’s HTTP requests to aparticular server.

2. The load balancer switch ensures that the client’s request is forwarded to theappropriate server.

3. The JSP software or servlet retrieves a handle to a remote EJB object residing inthe application server instance.

4. The client must first find the home object using a naming service such as JavaNaming and Directory Interface (JNDI). The returned object is cast to the homeinterface type.

5. The client uses this home interface reference to create instances.

6. The client continues to create instances.

7. The application server provides replication services. When an EJB object isupdated on the active application server instance, the standby server updates thecorresponding backup EJB object’s state information. These replication servicesare provided by the application server systems services.

Scenario 2

8. A JNDI tree cluster that manages replication of the EJB state updates and keepstrack of the primary and replicated objects. This scenario occurs when vendorimplementations use a modified JNDI as a clustering mechanism. In the standardJNDI implementation, multiple objects cannot bind to a single name, but usingadded logic, each member of a cluster can have a local and shared global JNDItree. If the primary object fails, the JNDI will return the backup object bound to aparticular name. If a failure occurs after a client has performed a JNDI lookup, theclient will hang or time out and try again. The subsequent request will be directedto a secondary server, which will have the correct state of the failed node of aparticular entity.

Scenario 3

9. This scenario simply forwards the HTTP request to the Application server using aplug-in. The HTTP request would be received by the Application server’s HTTPserver. The HTTP request would recursively arrive at point 2 in FIGURE 2-5.

Another mechanism includes adding a replica-aware or cluster-aware stub to the EJBobjects and system services including a cluster module that runs on the appserver,which is loaded in the deployment descriptor, if specified. The cluster module might



consist of various subsystems that provide data synchronization services, keep thestate of the backup EJB object synchronized with the primary, manage clusterfailovers, and monitor the health of the appserver instances. If the primaryappserver instance fails, the cluster failover manager can redirect client-side EJBmethod invocations to the backup node.

Another approach involves the primary and secondary cluster nodes inserting andaltering a cookie on the client’s HTTP request, which would apply in the case wherethe Web server and app server reside on the same server. If the primary node of thecluster fails, the load-balancing switch must be configured to redirect the request tothe backup node of the cluster. The backup node must look at the client’s cookie andretrieve state information.

However, most of these solutions suffer one drawback: Idempotent transactions arenot handled transparently or properly in the event that a failure occurs after amethod invocation has commenced.

At the time of this writing, the Sun ONE Application Server 7 Enterprise Edition isexpected to provide a highly available and scalable EJB clustering solution thatallows enterprise customers to create solutions with minimal downtime.

Architecture ExamplesThis section describes three architecture designs. Deciding which architecture tochoose can be reduced to identifying the following design objectives:

■ Application partitioning – The application itself might make better use ofresources by segregating or collapsing the Web tier from the Application tier. If anapplication makes heavy use of static Web pages, JSP software, or servlet code,and minimal EJB architecture, it might make sense to horizontally scale the Webtier and have only one or two small application servers. Similarly, at the other endof the spectrum, it might make sense for an application to deploy all the servletand EJB war, jar, and ear files on the same application server if there is a lot ofservlet-to-EJB communication.

■ Security level – Separating the Web tier and Application Server tier with afirewall creates a more secure solution. The potential drawbacks includehardware and software costs, increased communication latencies between servletsand EJB components, and increased manageability costs.

■ Performance – In some cases, customers are willing to forego tight securityadvantages for increased performance. For example, the firewall between the Webtier and the Application Server tier might be considered overkill because theingress traffic is already firewalled in front of the Web tier.

■ Scalability – Applications can be partitioned and deployed in two ways:



■ Horizontally scaled, where many small separate Web systems are utilized■ Vertically scaled, where a few monolithic systems support many instances of

Web servers

■ Manageability – In general, the fewer the number of servers, the lower the totalcost of operation (TCO).

The next three sections describe three architecture designs.

■ “Designing for Vertical Scalability and Performance” on page 31 describes avertically scaled design where the primary objectives are security and verticalscalability.

■ “Designing for Security and Vertical Scalability” on page 32 describes a tightly-coupled design where the primary objectives are performance between the Webtier and Application tier and vertical scalability.

■ “Designing for Security and Horizontal Scalability” on page 33 describes a highlydistributed solution with the primary design objective being horizontal scalabilityand security. It is the application characteristics that directly influence thenetwork architecture.



Designing for Vertical Scalability and Performance

FIGURE 2-6 Decoupled Web Tier and Application Server Tier—Vertically Scaled

The architecture example shown in FIGURE 2-6 provides enhanced security. The Webserver can be configured as a reverse proxy by receiving an HTTP request on theingress network side from a client, then opening another socket connection on theappserver side to send an HTTP request to the Web server running inside the SunONE Application Server instance. Alternatively, the Web server instance couldinstantiate EJB components after performing a lookup on the home interface of aparticular EJB component. One advantage of this decoupled architecture isindependent scaling. If it turns out that the Web server servlets need to scalehorizontally, they can do so independently of the application server logic. Similarly,if the EJB architecture’s logic needs to scale or be modified, it can do so

App Server 7.0 Instance

servlet container

EJBcontainer

servlet container

EJBcontainer

PersistenceJDBCJCAJNDIJMS

JavaMailSecurityJPDA

Admin &Monitoring


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring

LoadBalancing

Switch

Web Server Instance

http listener virtual servervirtual server

virtual serverhttp listener

http listener

Web Server Instance



http listener

Web Server Instance



http listener

Web Server Instance



http listener


servlet container

EJBcontainer

servlet container

EJBcontainer

Enterprise Information Tier

Web/presentationTier

IngressNetwork

FIR

EW

ALL

FIR

EW

ALL



Admin &Monitoring


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring



independently of the Web tier. Potential disadvantages include increased latencybetween the Web tier and Application Server tier communications and increasedmaintenance.

Designing for Security and Vertical Scalability

FIGURE 2-7 Tightly Coupled Web Tier and Application Server Tier—Vertically Scaled

The example shown in FIGURE 2-7 represents a collapsed architecture that takesadvantage of the Web server already included in the Sun ONE Application Serverinstance process. This architecture is suitable for applications that have relativelyintensive servlet-to-EJB communications and less stringent security requirements.

LoadBalancing

Switch

Web Server Instance



http listener

Web Server Instance



http listener

Web Server Instance



http listener

Web Server Instance



http listener


ApplicationServer Tier


IngressNetwork

FIR

EW

ALL

FIR

EW

ALL


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring



From an availability standpoint, fewer horizontal servers result in lower availability.A potential advantage of this architecture is lower maintenance cost because thereare fewer servers to manage and configure.

Designing for Security and Horizontal Scalability

FIGURE 2-8 Decoupled Web Tier and Application Server Tier—Horizontally Scaled

The architecture shown in FIGURE 2-8 is a more horizontally scaled variant of thearchitecture shown in FIGURE 2-6. This results in increased availability. More serverfailures can be tolerated without bringing down services in this configuration.


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring


servlet container

EJBcontainer

servlet container

EJBcontainer



Admin &Monitoring

LoadBalancing

Switch

Web Server Instance



http listener

Web Server Instance



http listener

Web Server Instance



http listener

Web Server Instance



http listener


ApplicationServer Tier


IngressNetwork

FIR

EW

ALL

FIR

EW

ALL

FIR

EW

ALL



Example SolutionThis section describes an example of a tested and implemented data center, multi-tier network architecture shown in FIGURE 2-9. The network design is composed ofsegregated networks, implemented physically using VLANs configured by thenetwork switches. This internal network used the 10.0.0.0 private IP address spacefor security and portability advantages.

This design is an implementation of the design described in “Designing for Securityand Horizontal Scalability” on page 33. It includes availability design principles,which will be discussed further in Chapter 6.

The management network allows centralized data collection and management of alldevices. Each device has a separate interface to the management network to avoidcontaminating the production network performance measurements. Themanagement network is also used for jumpstart installation and terminal serveraccess.

Although several networks physically reside on a single active core switch, networktraffic is segregated and secured using static routes, access control lists (ACLs), andVLANs. From a practical perspective, this can be as secure as separate individualswitches, depending on the switch manufacturer’s implementation of VLANs.



FIGURE 2-9 Tested and Implemented Architecture Solution

Client 1 Client 2


WebserviceTier

Sun Fire6800

Sun Fire680010.30.0.101

Sun Fire6800

Sun Fire6800

10.30.0.100

ClientaccessTier


DatabaseserviceTier





T3

T3

192.168.10.1

10.10.0.110.20.0.110.40.0.110.30.0.1

10.50.0.1

10.10.0.100 10.10.0.101 10.10.0.102 10.10.0.103

10.20.0.100 10.20.0.101 10.20.0.102 10.20.0.103

10.40.0.100 10.40.0.101



Foundryswitches




CHAPTER 3

Tuning TCP: Transport Layer

This chapter describes some of key Transport Control Protocol (TCP) tunableparameters related to performance tuning. More importantly it describes how thesetunables work, how they interact with each other, and how they impact networktraffic when they are modified.

Applications often recommend TCP settings for tunable parameters, but offer fewdetails on the meaning of the parameters and adverse effects that might result fromthe recommended settings. This chapter is intended as a guide to understandingthose recommendations. This chapter is intended for network architects andadministrators who have an intermediate knowledge of networking and TCP. This isnot an introductory chapter on TCP terminology. The concepts discussed in thischapter build on basic terminology concepts and definitions. For an excellentresource, refer to Internetworking with TCP/IP Volume 1, Principles, Protocols, andArchitectures by Douglas Comer, Prentice Hall, New Jersey.

Network architects responsible for designing optimal backbone and distribution IPnetwork architectures for the corporate infrastructure are primarily concerned withissues at or below the IP layer—network topology, routing, and so on. However, indata center networks, servers connect either to the corporate infrastructure or theservice provider networks, which host applications. These applications providenetworked application services with additional requirements in the area ofnetworking and computer systems, where the goal is to move data as fast as possiblefrom the application out to the network interface card (NIC) and onto the network.Designing network architectures for performance at the data center includes lookingat protocol processing above Layer 3, into the transport and application layers.Further, the problem becomes more complicated because many clients’ statefulconnections are aggregated onto one server. Each client connection might havevastly different characteristics, such as bandwidth, latencies, or probability of packetloss. You must identify the predominant traffic characteristics and tune the protocolstack for optimal performance. Depending on the server hardware, operatingsystem, and device driver implementations, there could be many possible tuningconfigurations and recommendations. However, tuning the connection-orientedtransport layer protocol is often most challenging.

37


This chapter includes the following topics:

■ “TCP Tuning Domains” on page 38 provides an overview of TCP from a tuningperspective, describing the various components that contain tunable parametersand where they fit together from a high level, thus showing the complexities oftuning TCP.

■ “TCP State Model” on page 48 proposes a model of TCP that illustrates thebehavior of TCP and the impact of tunable parameters. The system model thenprojects a network traffic diagram baseline case showing an ideal scenario.

■ “TCP Congestion Control and Flow Control – Sliding Windows” on page 53shows various conditions to help explain how and why TCP tuning is needed andwhich are the most effective TCP tunable parameters needed to compensate foradverse conditions.

■ “TCP and RDMA Future Data Center Transport Protocols” on page 62 describesTCP and RDMA, promising future networking protocols that may overcome thelimitations of TCP.

TCP Tuning DomainsTransport Control Protocol (TCP) tuning is complicated because there are manyalgorithms running and controlling TCP data transmissions concurrently, each withslightly different purposes.



FIGURE 3-1 Overview of Overlapping Tuning Domains

FIGURE 3-1 shows a high-level view of the different components that impact TCPprocessing and performance. While the components are interrelated, each has itsown function and optimization strategy.

■ The STREAMS framework looks at raw bytes flowing up and down the streamsmodules. It has no notion of TCP, congestion in the network, or the client load. Itonly looks at how congested the STREAMS queues are. It has its own flow controlmechanisms.

■ TCP-specific control mechanisms are not tunable, but they are computed based onalgorithms that are tunable.

■ Flow control mechanisms and congestion control mechanisms are functionallycompletely different. One is concerned with the endpoints, and the other isconcerned with the network. Both impact how TCP data is transmitted.

■ Tunable parameters control scalability. TCP requires certain static data structuresthat are backed by non-swappable kernel memory. Avoid the following twoscenarios:

■ Allocating large amounts of memory. If the actual number of simultaneousconnections is fewer than anticipated, memory that could have been used byother applications is wasted.

STREAMS

Timers

TCP InternalVariables, RTT

ssthresh

FlowControl

TCP StateData Structures

tcp_conn_hash

CongestionControl

Chapter 3 Tuning TCP: Transport Layer 39


■ Allocating insufficient memory. If the actual number of connections exceedsthe anticipated TCP load, there will not be sufficient free TCP data structures tohandle the peak load.

This class of tunable parameters directly impacts the number of simultaneousTCP connections a server can handle at peak load and control scalability.

TCP Queueing System ModelThe goal of TCP tuning can be reduced to maximizing the throughput of a closedloop system, as shown in FIGURE 3-2. This system abstracts all the main componentsof a complete TCP system, which consists of the following components:

■ Server—The focus of this chapter.

■ Network—The endpoints can only infer the state of the network by measuringand computing various delays, such as round-trip times, timers, receipt ofacknowledgments, and so on.

■ Client—The remote client endpoint of the TCP connection.

FIGURE 3-2 Closed-Loop TCP System Model

This section requires basic background in queueing theory. For more information,refer to Queueing Systems, Volume 1, by Dr. Lenny Kleinrock, 1975, Wiley, New York.In FIGURE 3-2, we model each component as an M/M/1 queue. An M/M/1 queue is

write ()

ServerApplication

SendBuffer

Server TCP/IPStack

ClientTCP/IP

Stack

read ()

ClientApplication

ReceiveBuffer

ACK Receive Window

Network



a simple queue that has packet arrivals at a certain speed, which we’ve designatedas λ. At the other end of the queue, these packets are processed at a certain speed,which we’ve designated as µ.

TCP is a full duplex protocol. For the sake of simplicity, only one side of the duplexcommunication process is shown. Starting from the server side on the left inFIGURE 3-2, the server application writes a byte stream to a TCP socket. This ismodeled as messages arriving at the M/M/1 queue at the rate of l. These messagesare queued and processed by the TCP engine. The TCP engine implements the TCPprotocol and consists of various timers, algorithms, retransmit queues, and so on,modeled as the server process µ, which is also controlled by the feedback loop asshown in FIGURE 3-2. The feedback loop represents acknowledgements (ACKs) fromthe client side and receive windows. The server process sends packets to thenetwork, which is also modeled as an M/M/1 queue. The network can be congested,hence packets are queued up. This captures latency issues in the network, which area result of propagation delays, bandwidth limitations, or congested routers. InFIGURE 3-2 the client side is also represented as an M/M/1 queue, which receivespackets from the network and the client TCP stack, processes the packets as quicklyas possible, forwards them to the client application process, and sends feedbackinformation to the server. The feedback represents the ACK and receive window,which provide flow control capabilities to this system.

Why the Need to Tune TCP

FIGURE 3-3 shows a cross-section view of the sequence of packets sent from the serverto the client of an ideally tuned system. Send window-sized packets are sent oneafter another in a pipelined fashion continuously to the client receiver.Simultaneously, the client sends back ACKs and receive windows in unison with theserver. This is the goal we are trying to achieve by tuning TCP parameters. Problemscrop up when delays vary because of network congestion, asymmetric networkcapacities, dropped packets, or asymmetric server/client processing capacities.Hence, tuning is required. To see the TCP default values for your version of Solaris,refer to the Solaris documentation at docs.sun.com.



FIGURE 3-3 Perfectly Tuned TCP/IP System

In a perfectly tuned TCP system spanning several network links of varying distancesand bandwidths, the clients send back ACKs to sender in perfect synchronizationwith the start of sending the next window.

The objective of an optimal system is to maximize the throughput of the system. Inthe real world, asymmetric capacities require tuning on both the server and clientside to achieve optimal throughput. For example, if the network latency is excessive,the amount of traffic injected into the network will be reduced to more closelymaintain a flow that matches the capacity of the network. If the network is fastenough, but the client is slow, the feedback loop will be able to alert the sender TCPprocess to reduce the amount of traffic injected into the network. Later sections willbuild on these concepts to describe how to tune for wireless, high-speed wide areanetworks (WANs), and other types of networks that vary in bandwidth and distance.

FIGURE 3-4 shows the impact of the links increasing in bandwidth; therefore, tuningis needed to improve performance. The opposite case is shown in FIGURE 3-5, wherethe links are slower. Similarly, if the distances increase or decrease, delays attributedto propagation delays require tuning for optimal performance.

Time

NetworkSegments

(hops)

Router 1

Client - ACKs sent back, A1...A4

Long Fast Link -Optical WAN

Short Med Speed Link-Ethernet

Short Slow Link -Dial Up-POTS

Send Window = 1RTTFirst Batch

Router 2

D1

D1'

D1"

D2 D3 D4 D5 D6 D7 D8

D2'

D2"

D3'

D3"

D4'

D4"

A1 A2 A3 A4

Send Window = 1RTTSecond Batch

Server continuously sendsPackets D1, D2,...D8



FIGURE 3-4 Tuning Required to Compensate for Faster Links

Time

NetworkSegments

(hops)

Router 1


Long Very Fast Link -Optical WAN

High Speed Link-Ethernet

Short Med Link -Dial Up-POTS


Router 2

D1

D1'

D1"

D2

D2'

D2"

D3

D3'

D3"

D4

D4'

D4"

D5

D5'

D5"

D6

D6'

D6"

D7

D7'

D7"

D8

D8'

D8"




FIGURE 3-5 Tuning Required to Compensate for Slower Links

TCP Packet Processing Overview

Now let’s take a look at the internals of the TCP stack inside the computing node.We will limit the scope to the server on the data center side for TCP tuningpurposes. Since the clients are symmetrical, we can tune them using the exact sameconcepts. In a large enterprise data center, there could be thousands of clients, eachwith a diverse set of characteristics that impact network performance. Eachcharacteristic has a direct impact on TCP tuning and hence on overall networkperformance. By focusing on the server, and considering different networkdeployment technologies, we essentially cover the most common cases.

Time

NetworkSegments

(hops)

Router 1


Long Med Speed Link -TDM WAN

Short Low Speed Link-Ethernet

Short Very Slow Link -Dial Up-POTS


Router 2

D1

D1'

D1"

D2

D2'

D2"

D3

D3'

D3"




FIGURE 3-6 Complete TCP/IP Stack on Computing Nodes

FIGURE 3-6 shows the internals of the server and client nodes in more detail.

To gain a better understanding of TCP protocol processing, we will describe how apacket is sent up and down a typical STREAMS-based TCP implementation.Consider the server application on the left side of FIGURE 3-6 as a starting point. Thefollowing describes how data is moved from the server to the client on the right.

1. The server application opens a socket. (This triggers the operating system to setup the STREAMS stack, as shown.) The server then binds to a transport layerport, executes listen, and waits for a client to connect. Once the client connects,the server completes the TCP three-way handshake, establishes the socket, andboth server and client can communicate.

2. Server sends a message by filling a buffer, then writing to the socket.

3. The message is broken up and packets are created, sent down the streamhead(down the read side of each STREAMS module) by invoking the rput routine. Ifthe module is congested, the packets are placed on the service routine fordeferred processing. Each network module will prepend the packet with anappropriate header.

socket () close ()write ()

read ()

ClientApplication

socket () close ()bind () write ()

listen () read ()accept()

Server Node

ClientNode

Libsocket

SYS CALL

rq wq

libnsl

read write Stream Head

TCP

IP

NIC Driver

rput()rsrv()

wsrv()wput()

rq wqrput()rsrv()

wsrv()wput()

rq wqrput()rsrv()

wsrv()wput()

Network Interface Card

Libsocket

rq wq

libnsl

read write Stream Head

TCP

IP

NIC Driver

rput()rsrv()

wsrv()wput()

rq wqrput()rsrv()

wsrv()wput()

rq wqrput()rsrv()

wsrv()wput()

connect ()

Network

ServerApplication

Network Interface Card

SYS CALL



4. Once the packet reaches the NIC, the packet is copied from system memory to theNIC memory, transmitted out of the physical interface, and sent into the network.

5. The client reads the packet into the NIC memory and an interrupt is generatedthat copies the packet into system memory and goes up the protocol stack asshown on right in the Client Node.

6. The STREAMS modules read the corresponding header to determine theprocessing instructions and where to forward the packet. Headers are stripped offas the packet is moved upwards on the write side of each module.

7. The client application reads in the message as the packet is processed andtranslated into a message, filling the client read buffer.

The Solaris™ operating system (Solaris OS) offers many tunable parameters in theTCP, User Datagram Protocol (UDP), and IP STREAMS module implementation ofthese protocols. It is important to understand the goals you want to achieve so thatyou can tune accordingly. In the following sections, we provide a high-level modelof the various protocols and provide deployment scenarios to better understandwhich parameters are important to tune and how to go about tuning them.

We start off with TCP, which is, by far, the most complicated module to tune and hasthe greatest impact on performance. We then describe how to modify these tunableparameters for different types of deployments. Finally, we describe IP and UDPtuning.

TCP STREAMS Module Tunable ParametersThe TCP stack is implemented using existing operating system applicationprogramming interfaces (APIs). The Solaris OS offers a STREAMS framework,originating from AT&T, which was originally designed to allow a flexible modularsoftware framework for network protocols. The STREAMS framework has its owntunable parameters, for example sq_max_size, which controls the depth of aSTREAMS syncq. This impacts how raw data messages are processed for TCP.FIGURE 3-7 provides a more detailed view of the facilities provided by the SolarisSTREAMS framework.



FIGURE 3-7 TCP and STREAM Head Data Structures Tunable Parameters

FIGURE 3-7 shows some key tunable parameters for the TCP-related data path. At thetop is the streamhead, which has a separate queue for TCP traffic, where anapplication reads data. STREAMS flow control starts here. If the operating system issending up the stack to the application and the application cannot read data as fastas the sender is sending it, the stream read queue starts to fill. Once the number ofpackets in the queue exceeds the high-water mark, tcp_sth_recv_hiwat, streams-based flow control triggers and prevents the TCP module from sending any morepackets up to the streamhead. There is some space available for critical controlmessages (M_PROTO, M_PCPROTO). The TCP module will be flow controlled as longas the number of packets is above tcp_sth_recv_lowat. In other words, the

STREAMHEAD

TCP STREAMS Module

rsrv

rput

TCP Server Process Listening on Socket TCP State Lookup

tcp_conn_req_max_q

tcp_conn_req_max_q0

Listener Backlog

SYN Recvd, Pending3 way handshake

tcp_sth_rcv_hiwat

tcp_sth_rcv_lowat

wsrv

wput

rsrv

rput

tcp_sth_rcv_hiwat

tcp_sth_rcv_lowat

wsrv

wput

tcp_bind_hash

tcp_conn_hash

tcp_acceptor_hash

tcp_listen_hash



streamhead queue must drain below the low-water mark to reactivate TCP toforward data messages destined for the application. Note that the write side of thestreamhead does not require any high-water or low-water marks because it isinjecting packets into the downstream, and TCP will flow control the streamheadwrite side by its high-water and low-water marks tcp_xmit_hiwat andtcp_xmit_lowat. Refer to the Solaris AnswerBook2™ at docs.sun.com for thedefault values of your version of the Solaris OS.

TCP has a set of hash tables. These tables are used to search for the associated TCPsocket state information on each incoming TCP packet to maintain state engine foreach socket and perform other TCP tasks to maintain that connection, such asupdate sequence numbers, update windows, round trip time (RTT), timers, and soon.

The TCP module has two new queues for server processes. The first queue, shownon the left in FIGURE 3-7, is the set of packets belonging to sockets that have not yetestablished a connection. The server side has not yet received and processed a client-side ACK. If the client does not send an ACK within a certain window of time, thenthe packet will be dropped. This was designed to prevent synchronization (SYN)flood attacks, where a bunch of unacknowledged client SYN requests caused serversto be overwhelmed and prevented valid client connections from being processed.The next queue is the listen backlog queue, where the client has sent back the finalACK, thus completing the three-way handshake. The server socket for this client willmove the connection from LISTEN to ACCEPT. But the server has not yet processedthis packet. If the server is slow, then this queue will fill up. The server can overridethis queue size with the listen backlog parameter. TCP will flow control on IP on theread side with its parameters tcp_recv_lowat and tcp_recv_hiwat similar tothe streamhead read side.

TCP State ModelTCP is a reliable transport layer protocol that offers a full duplex connection bytestream service. The bandwidth of TCP makes it appropriate for wide area IPnetworks where there is a higher chance of packet loss or reordering. What reallycomplicates TCP are the flow control and congestion control mechanisms. Thesemechanisms often interfere with each other, so proper tuning is critical for high-performance networks. We start by explaining the TCP state machine, then describein detail how to tune TCP, depending on the actual deployment. We also describehow to scale the TCP connection-handling capacity of servers by increasing the sizeof TCP connection state data structures.

FIGURE 3-8 presents an alternative view of the TCP state engine.



FIGURE 3-8 TCP State Engine Server and Client Node

This figure shows the server and client socket API at the top and the TCP modulewith the following three main states:

Connection SetupThis includes the collection of substates that collectively set up the socket connectionbetween the two peer nodes. In this phase, the set of tunable parameters includes:

socket () close ()write ()

read ()

ClientApplication

connect ()

server ACK

bind ()

socket ()

connect ()

send syn/ACK()

recv syn/ACK()

RSTtcp_ip_linterval

Socket Opened

Port Bound

Connect

SYN SENT

SYN RECVD

Established

Close

ConnectionSetup

TCP

SERVER NODE CLIENT NODE

SEND

Socket Closed

RECEIVE Connection Established

Data Transfer

Fin_Wait_1

Fin_Wait_2 Closing

Time_Wait

Close_Wait

Last_Ack

Passive Close(Remote Close)

Active Close(Local Close)

recv ACK

2MSL Timeoutrecv ACK

recvACK

recvACK

recv FIN,send ACK

recv FIN,send ACK

Connection Shutdown

client ACK

bind ()

socket ()

listen ()

client connects()

send syn/ACK()

RSTtcp_ip_linterval

Socket Opened

Port Bound

Listen

SYN RECVD

SYN SENT

Established

Close

ConnectionSetup

CP

SEND

Socket Closed

RECEIVEConnection EstablishedData Transfer

Fin_Wait_1

Fin_Wait_2 Closing

Time_Wait

Close_Wait

Last_Ack

Passive Close(Remote Close)

Active Close(Local Close)

close (), send FIN close (), send FIN

recv ACK

2MSL Timeoutrecv ACK

recvACK

recvACK

recv FIN,send ACK

recv

FIN

,se

nd A

CK

recv

FIN

, A

CK

send

AC

K

recv

FIN

, A

CK

send

AC

K

recv FIN,send ACK

Connection Shutdown

socket () close ()bind () write ()

listen () read ()accept()

ServerApplication



■ tcp_ip_abort_cinterval: the time a connection can remain in half-openstate during the initial three-way handshake, just prior to entering anestablished state. This is used on the client connect side.

■ tcp_ip_abort_linterval: the time a connection can remain in half-openstate during the initial three-way handshake, just prior to entering anestablished state. This is used on the server passive listen side.

For a server, there are two trade-offs to consider:

■ Long Abort Intervals – The longer the abort interval, the longer the server willwait for the client to send information pertaining to the socket connection. Thismight result in increased kernel consumption and possibly kernel memoryexhaustion. The reason is that each client socket connection requires stateinformation, using approximately 1–2 kilobytes of kernel memory. Rememberthat kernel memory is not swappable, and as the number of connectionsincreases, the amount of consumed memory and time delays for lookups forconnections increases. Hackers exploit this fact to initiate Denial of Service(DoS) attacks, where attacking clients constantly send only SYN packets to aserver, eventually tying up all kernel memory, not allowing real clients toconnect.

■ Short Abort Intervals – If the interval is too short, valid clients that have aslow connection or go through slow proxies and firewalls could get abortedprematurely. This might help reduce the chances of DoS attacks, but slowclients might also be mistakenly terminated.

Connection EstablishedThis includes the main data transfer state (the focus of our tuning explanations inthis chapter). The tuning parameters for congestion control, latency, and flow controlwill be described in more detail. FIGURE 3-8 shows two concurrent processes thatread and write to the bidirectional full-duplex socket connection.

Connection ShutdownThis includes the set of substates that work together to shut down the connection inan orderly fashion. We will see important tuning parameters related to memory.Tunable parameters include:

■ until this time has expired. However, if this value is too short and there havebeen many routing changes, lingering packets in the network might be lost.

■ tcp_fin_wait_2_flush_interval: how long this side will wait for theremote side to close its side of the connection and send a FIN packet to closethe connection. There are cases where the remote side crashes and never sends



a FIN. So to free up resources, this value puts a limit on the time the remoteside has to close the socket. This means that half-open sockets cannot remainopen indefinitely.

Note – tcp_close_wait is no longer a tunable parameter. Instead, usetcp_time_wait_interval.

TCP Tuning on the Sender SideTCP tuning on the sender side controls how much data is injected into the networkand the remote client end. There are several concurrent schemes that complicatetuning. So to better understand, we will separate the various components and thendescribe how these mechanisms work together. We will describe two phases: Startupand Steady State. Startup Phase TCP tuning is concerned with how fast we can rampup sending packets into the network. Steady State Phase tuning is concerned aboutother facets of TCP communication such as tuning timers, maximum window sizes,and so on.

Startup Phase

In Startup Phase tuning, we describe how the TCP sender starts to initially send dataon a particular connection. One of the issues with a new connection is that there isno information about the capabilities of the network pipe. So we start by blindlyinjecting packets at a faster and faster rate until we understand the capabilities andadjust accordingly. Manual TCP tuning is required to change macro behavior, suchas when we have very slow pipes as in wireless or very fast pipes such as 10Gbit/sec. Sending an initial maximum burst has proven disastrous. It is better toslowly increase the rate at which traffic is injected based on how well the traffic isabsorbed. This is similar to starting from a standstill on ice. If we initially floor thegas pedal, we will skid, and then it is hard to move at all. If, on the other hand, westart slowly and gradually increase speed, we can eventually reach a very fast speed.In networking, the key concept is that we do not want to fill buffers. We want toinject traffic as close as possible to the rate at which the network and target receivercan service the incoming traffic.

During this phase, the congestion window is much smaller than the receive window.This means the sender controls the traffic injected into the receiver by computing thecongestion window and capping the injected traffic amount by the size of thecongestion window. Any minor bursts can be absorbed by queues. FIGURE 3-9 showswhat happens during a typical TCP session starting from idle.



FIGURE 3-9 TCP Startup Phase

The sender does not know the capacity of the network, so it starts to slowly sendmore and more packets into the network trying to estimate the state of the networkby measuring the arrival time of the ACK and computed RTT times. This results in aself-clocking effect. In FIGURE 3-9, we see the congestion window initially starts witha minimum size of the maximum segment size (MSS), as negotiated in the three-wayhandshake during the socket connection phase. The congestion window is doubledevery time an ACK is returned within the timeout. The congestion window iscapped by the TCP tunable variable tcp_cwnd_max, or until a timeout occurs. Atthat point, the ssthresh internal variable is set to half of tcp_cwnd_max.ssthresh is the point where upon a retransmit, the congestion window growsexponentially. After this point it grows additively, as shown in FIGURE 3-9. Once atimeout occurs, the packet is retransmitted and the cycle repeats.

FIGURE 3-9 shows that there are three important TCP tunable parameters:

■ tcp_slow_start_initial: sets up the initial congestion window just after thesocket connection is established.

■ tcp_slow_start_after_idle: initializes the congestion window after a periodof inactivity. Since there is some knowledge now about the capabilities of thenetwork, we can take a shortcut to grow the congestion window and not startfrom zero, which takes an unnecessarily conservative approach.

Time

idle

ssthresh 1

tcp_cwnd_max

tcp_slow_start_initial

ssthresh 2

CongestionWindow

Size (KB)

Congestion Windowincreases exponentially,doubly, up until packetloss or ssthresh After ssthresh, Congestion

window increases additively

Timeout

tcp_slow_start__after_idle



■ tcp_cwnd_max: places a cap on the running maximum congestion window. If thereceive window grows, then tcp_cwnd_max grows to the receive window size.

In different types of networks, you can tune these values slightly to impact the rateat which you can ramp up. If you have a small network pipe, you want to reduce thepacket flow, whereas if you have a large pipe, you can fill it up faster and injectpackets more aggressively.

Steady State Phase

In Steady State Phase, after the connection has stabilized and completed the initialstartup phase, the socket connection reaches a phase that is fairly steady and tuningis limited to reducing delays due to network and client congestion. An averagecondition must be used because there are always some fluctuations in the networkand client data that can be absorbed. Tuning TCP in this phase, we look at thefollowing network properties:

■ Propagation Delay – This is primarily influenced by distance. This is the time ittakes one packet to traverse the network. In WANs, tuning is required to keep thepipe as full as possible, increasing the allowable outstanding packets.

■ Link Speed – This is the bandwidth of the network pipe. Tuning guidelines forlink speeds from 56kbit/sec dial-up connections differ from 10Gbit/sec opticallocal area networks (LANs).

In short, tuning will be adjusted according to the type of network and associated keyproperties: propagation delay, link speed, and error rate. These properties actuallyself-adjust in some instances by measuring the return of acknowledgments. We willlook at various emerging network technologies: optical WAN, LAN, wireless, and soon—and describe how to tune TCP accordingly.

TCP Congestion Control and FlowControl – Sliding WindowsOne of the main principles for congestion control is avoidance. TCP tries to detectsigns of congestion before it happens and to reduce or increase the load into thenetwork accordingly. The alternative of waiting for congestion and then reacting ismuch worse because once a network saturates, it does so at an exponential growthrate and reduces overall throughput enormously. It takes a long time for the queuesto drain, and then all senders again repeat this cycle. By taking a proactivecongestion avoidance approach, the pipe is kept as full as possible without thedanger of network saturation. The key is for the sender to understand the state ofthe network and client and to control the amount of traffic injected into the system.



Flow control is accomplished by the receiver sending back a window to the sender.The size of this window, called the receive window, tells the sender how much datato send. Often, when the client is saturated, it might not be able to send back areceive window to the sender to signal it to slow down transmission. However, thesliding windows protocol is designed to let the sender know, before reaching ameltdown, to start slowing down transmission by a steadily decreasing windowsize. At the same time these flow control windows are going back and forth, thespeed at which ACKs come back from the receiver to the sender provides additionalinformation to the sender that caps the amount of data to send to the client. This iscomputed indirectly.

The amount of data that is to be sent to the remote peer on a specific connection iscontrolled by two concurrent mechanisms:

■ The congestion in the network - The degree of network congestion is inferred bythe calculation of changes in Round Trip Time (RTT): that is the amount of delayattributed to the network. This is measured by computing how long it takes apacket to go from sender to receiver and back to the client. This figure is actuallycalculated using a running smoothing algorithm due to the large variances intime. The RTT value is an important value to determine the congestion window,which is used to control the amount of data sent out to the remote client. Thisprovides information to the sender on how much traffic should be sent to thisparticular connection based on network congestion.

■ Client load - The rate at which the client can receive and process incoming traffic.The client sends a receive window that provides information to the sender on howmuch traffic should be sent to this connection based on client load.

TCP Tuning for ACK ControlFIGURE 3-10 shows how senders and receivers control ACK waiting and generation.The general strategy is that clients want to reduce receiving many small packets.Receivers try to buffer up a bunch of received packets before sending back anacknowledgment (ACK) to the sender, which will trigger the sender to send morepackets. The hope is that the sender will also buffer up more packets to send in onelarge chunk rather than many small chunks. The problem with small chunks is thatthe efficiency ratio or useful link ratio utilization is reduced. For example, a one-bytedata packet requires 40 bytes of IP and TCP header information and 48 bytes ofEthernet header information. The ratio works out to be 1/(88+1) = 1.1 percentutilization. When a 1500-byte packet is sent, however, the utilization can be1500/(88+1500) = 94.6 percent. Now, consider many flows on the same Ethernetsegment. If all flows are small packets, the overall throughput is low. Hence, anyeffort to bias the transmissions towards larger chunks without incurring excessivedelays is a good thing, especially interactive traffic such as Telnet.



FIGURE 3-10 TCP Tuning for ACK Control

FIGURE 3-10 provides an overview of the various TCP parameters. For a completedetailed description of the tunable parameters and recommended sizes, refer to yourproduct documentation or the Solaris AnswerBooks at docs.sun.com.

There are two mechanisms that are used by senders and receivers to controlperformance:

■ Senders—timeouts waiting for ACK. This class of tunable parameters controlsvarious aspects of how long to wait for the receiver to send back an ACK of thedata that was sent. If tuned too short, then excessive retransmissions occur. Iftuned too long, then excess wasted idle time elapses before the sender realizes thepacket was lost and retransmits.

■ Receivers—timeouts and number of bytes received before sending an ACK tosender. This class of tunable parameters allows the receiver to control the rate atwhich the sender sends data. The receiver does not want to send an ACK forevery packet received because the sender will send many small packets,increasing the ratio of overhead to actual useful data ratio and reducing theefficiency of the transmission. However, if the receiver waits too long, there isexcess latency that increases the burstiness of the communication. The receiverside can control ACKs with two overlapping mechanisms based on timers and thenumber of bytes received.

tcp_remit_interval_min,4000ms [1ms to 20 secs]

tcp_

loca

l_da

ck_i

nter

val,

50m

s [1

ms

to 6

0 se

cs]

-for

dire

ct c

onne

cted

end

poin

ts

tcp_

defe

rred

_ack

_int

erva

l, 10

0ms

[1m

s to

60

secs

]-f

or n

on d

irect

con

nect

ed e

ndpo

ints

tcp_remit_interval_min, 3s[1ms to 20 secs]

tcp_remit_interval_min, 60s [1ms to 120 min]

tcp_ip_abort_interval, 8min [500ms to 1193 hrs]

Data

Sender

ACK

reset

tcp_deferred_ack_max, 2 [1 to 16]-max received tcp segments (multiple of mss) received from non-direct connected endpoints

tcp_local_dacks_max, 8 [0 to 16]-max received tcp segments (multiple of mss received before forcing out ACK)

rexmit Data

rexmit Data

rexmit Data

Receiver



TCP Example Tuning ScenariosThe following sections describe example scenarios where TCP require tuning,depending on the characteristics of the underlying physical media.

Tuning TCP for Optical Networks – WANS

Typically, WANS are high-speed, long-haul network segments. These segmentsintroduce some interesting challenges because of their properties. FIGURE 3-11 showshow the traffic changes as a result of a longer, yet faster, link, comparing a normalLAN and an Optical WAN. The line rate has increased, resulting in more packets perunit time, but the delays have also increased from the time a packet leaves thesender to the time it reaches the receiver. This has the strange effect that morepackets are now in flight.



FIGURE 3-11 Comparison between Normal LAN and WAN Packet Traffic

FIGURE 3-11 shows a comparison of the number of packets that are in the pipebetween a typical LAN of 10 mbps/100 meters with RTT of 71 microseconds, whichis what TCP was originally designed for, and an optical WAN, which spans NewYork to San Francisco at the rate of 1 Gbps with RTT of 100 milliseconds. Thebandwidth delay product represents the number of packets that is actually in thenetwork and implies the amount of buffering the network must provide. This also

Time

Host 2

Host 2

Receiver - ACKs sent, but delayed at sender due to slow link.

Host 1

Long Slow PipeFew packets in Flight,Line delay incurs huge cost in packet transmission, hence selective ACK is a major improvement.

Sender sends fewer data packets due to higher error rates, but there are wasted time slots until the first ACK returns.

Sender, continously sendsPackets Data1, Data2,...

Normal Data Sent and ACK Received Timings Synchronized.

Packet over Sonet (POS)WAN 1Gbs - 2500 miles

Receiver - ACKS sent back, ACK1, ACK2, . . .

Data1

Data1 Data2 ACK1 ACK2

Data2

ACK1 ACK2 ACK3

Data3 Ethernet LAN 100mbs - 100 meters

Time

Send Window 1RTTFirst Batch




gives some insight into the minimum window size, which we discussed earlier. Thefact that the optical WAN has a very large bandwidth delay product as compared toa normal network requires tuning as follows:

■ The window size must be much larger. The current window size allows for 216

bytes. To achieve larger windows, RFC 1323 was introduced to allow the windowsize to scale to larger sizes while maintaining backwards compatibility. This isachieved during the initial socket connection, where during the SYN-ACK three-way handshake, window scaling capabilities are exchanged by both sides, andthey try to agree on the largest common capabilities. The scaling parameter is anexponent base 2. The maximum scaling factor is 14, hence allowing a maximumwindow size of 230 bytes. The window scale value is used to shift the windowsize field value up to a maximum of 1 gigabyte. Like the MSS option, the windowscale option should only appear in SYN and SYN-ACK packets during the initialthree-way handshake. Tunable parameters include:

■ tcp_wscale_always: controls who should ask for scaling. If set to zero, theremote side needs to request; otherwise, the receiver should request.

■ tcp_tstamp_if_wscale: controls adding timestamps to the window scale.This parameter is defined in RFC 1323 and used to track the round-tripdelivery time for data in order to detect variations in latency, which impacttimeout values. Both ends of the connection must support this option.

■ During the slow start and retransmissions, the minimum initial window size,which can be as small as one MSS, is too conservative. The send window sizegrows exponentially, but starting at the minimum is too small for such a largepipe. Tuning in this case requires that the following tunable parameters beadjusted to increase the minimum start window size:

■ tcp_slow_start_initial: controls the starting window just after theconnection is established.

■ tcp_slow_after_idle: controls the starting window after a lengthy periodof inactivity on the sender side.

Both of these parameters must be manually increased according to the actual WANcharacteristics. Delayed ACKs on the receiver side should also be minimizedbecause this will slow the increasing of the window size when the sender is trying toramp up.

RTT measurements require adjustment less frequently due to the long RTT times,hence interim additional RTT values should be computed. The tunabletcp_rtt_updates parameter is somewhat related. The TCP implementation knowswhen enough RTT values have been sampled, and then this value is cached.tcp_rtt_updates is on by default, but a value of 0 forces it to never be cached,which is the same as the case of not having enough for an accurate estimate of RTTfor this particular connection.



■ tcp_recv_hiwat and tcp_xmit_hiwat: control the size of the STREAMSqueues before STREAMS-based flow control is activated. With more packets inflight, the size of the queues must be increased to handle the larger number ofoutstanding packets in the system.

FIGURE 3-12 Tuning Required to Compensate for Optical WAN

Tuning TCP for Slow Links

Wireless and satellite networks have a common problem of a higher bit error rate.One tuning strategy to compensate for the lengthy delays is to increase the sendwindow, sending as much data as possible until the first ACK arrives. This way, thelink is utilized as much as possible. FIGURE 3-13 shows how slow links and normallinks differ. If the send window is small, then there will be significant dead timebetween the time the send window sends packets over the link and the time an ACKarrives and the sender can either retransmit or send the next window of packets inthe send buffer. But due to the increased error probability, if one byte is notacknowledged by the receiver, the entire buffer must be re-sent. Hence, there is atrade-off to increase the buffer to increase throughput. But you don’t want toincrease it so much that if there is an error the performance is degraded by morethan was gained due to retransmissions. This is where manual tuning comes in.You’ll need to try various settings based on an estimation of the link characteristics.One major improvement in TCP is the selective acknowledgement (SACK), whereonly the one byte that was not received can be retransmitted, not the entire buffer.

sender

Propagation = 2x100m/2.8x108m/s = 7.14x10-7s Delay

Bandwidth Delay = Prop Delay x Bandwidth = 7.14x10-7s x 100x106 bits/s = 71.4 bits

receiver

LAN - 10 mbs - 100 meters - 71 s RTT

WAN - 1 Gbps - 100 ms RTT

sender

Propagation = 100 ms round trip Delay

Bandwidth Delay = Prop Delay x Bandwidth = 1x10-1sx1x109 bits/s = 1x108 bits

receiver



FIGURE 3-13 Comparison between Normal LAN and WAN Packet Traffic—Long Low Bandwidth Pipe

Another problem introduced in these slow links is that the ACKs play a major role.If ACKs are not received by the sender in a timely manner, the growth of windowsis impacted. During initial slow start, and even slow start after an idle, the sendwindow needs to grow exponentially, adjusting to the link speed as quickly as

Time

Host 2

Host 1

Host 2

Receiver - ACKs sent, but delayed at sender due to slow link

Host 1

Long Slow PipeMore packets in Flight,Line delay incurs huge cost in packet transmission, hence selective ACK is a major improvement.

Normal Data sent and ACK Received Timings Synchronized

Sender continuously sendsPackets Data 1, Data 2. . .

Sender sends fewer data packets due to higher error rates, but there are wasted time slots until the first ACK returns.

Receiver - ACKS sent back, Ack1, Ack2. . .

Racket over Sonet (POS)QAN 1Gbs -2500miles

Data1 Data2

ACK1

Ethernet LAN 100mbs - 100 metersData3

ACK2 ACK3

Time



Data1Data2 ACK2

ACK1



possible for coarser tuning. It then grows linearly after reaching ssthresh for finer-grained tuning. However, if the ACK is lost, which has a higher probability in thesetypes of links, then the performance throughput is again degraded.

Tuning TCP for slow links includes the following parameters:

■ tcp_sack_permitted: activates and controls how SACK will be negotiatedduring the initial three-way handshake:

■ 0 = no sack – disabled.

■ 1 = TCP will not initiate a connection with SACK information, but if anincoming connection has the SACK-permitted option, TCP will respond withSACK information.

■ 2 = TCP will both initiate and accept connections with SACK information.

TCP SACK is specified in RFC 2018 TCP selective acknowledgement. TCP neednot retransmit the entire send buffer, only the missing bytes. Due to the highercost of retransmission, it is far more efficient to only re-send the missing bytes tothe receiver.

Like optical WANs, satellite links also require the window scale option to increasethe number of packets in flight to achieve higher overall throughput. However,satellite links are more susceptible to bit errors, so too large a window is not agood idea because one bad byte will force a retransmission of one enormouswindow. TCP SACK is particularly useful in satellite transmissions to avoid thisproblem because it allows the sender to select which packets to retransmitwithout requiring an entire window (which contained that one bad byte) forretransmission.

■ tcp_dupack_fast_retransmit: controls the number of duplicate ACKsreceived before triggering the fast recovery algorithm. Instead of waiting forlengthy timeouts, fast recovery allows the sender to retransmit certain packets,depending on the number of duplicate ACKs received by the sender from thereceiver. Duplicate ACKs are an indication that possibly later packets have beenreceived, but the packet immediately after the ACK might have been corrupted orlost.

Adjust all timeouts to compensate for long-delay satellite transmissions and possiblylonger-distance WANs; the timeout values must be compensated.



TCP and RDMA Future Data CenterTransport ProtocolsTCP is ideally suited for reliable end-to-end communications over disparatedistances. However, it is less than ideal for intra-data center networking primarilybecause over-conservative reliability processing drains CPU and memory resources,thus impacting performance. During the last few years, networks have grown fasterin terms of speed and reduced cost. This implies that the computing systems arenow the bottleneck—not the network—which was not the case prior to the mid-1990s. Two issues have resulted due to multi-gigabit network speeds:

■ Interrupts generated to the CPU – The CPU must be fast enough to service allincoming interrupts to prevent losing any packets. Multi-CPU machines can beused to scale. However, the PCI bus then introduces some limitations. It turns outthat the real bottleneck is memory.

■ Memory Speed – An incoming packet must be written and read from the NIC tothe operating system kernel address space to the user address. You can reduce thenumber of memory-to-memory copies to achieve zero copy TCP by usingworkarounds such as page flipping, direct data placement, and scatter-gatherI/O. However, as we approach 10-gigabit Ethernet interfaces, memory speedcontinues to be a source of performance issues. The main problem is that over thelast few years, memory densities have increased, but not speed. Dynamic randomaccess memory (DRAM) is cheap but slow. Static random access memory (SRAM)is fast but expensive. New technologies such as reduced latency DRAM(RLDRAM) show promise, but these seem to be dwarfed by the increases innetwork speeds.

To address this concern, there have been some innovative approaches to increase thespeed and reduce the network protocol processing latencies in the area of remotedirect memory access (RDMA) and infiniband. New startup companies such asTopspin are developing high-speed server interconnect switches based on infinibandand network cards with drivers and libraries that support RDMA, Direct AccessProgramming Library (DAPL), and Sockets Direct Protocol (SDP). TCP wasoriginally designed for systems where the networks were relatively slow ascompared to the CPU processing power. As networks grew at a faster rate thanCPUs, TCP processing became a bottleneck. RDMA fixes some of the latency.



FIGURE 3-14 Increased Performance of InfiniBand/RDMA Stack

FIGURE 3-14 shows the difference between the current network stack and the new-generation stack. The main bottleneck in the traditional TCP stack is the number ofmemory copies. Memory access for DRAM is approximately 50 ns for setup and then9 ns for each subsequent write or read cycle. This is orders of magnitude longer thanthe CPU processing cycle time, so we can neglect the TCP processing time. Savingone memory access on every 64 bits results in huge savings in message transfers.Infiniband is well suited for data center local networking architectures, as both sidesmust support the same RDMA technology.

Application

TCP

Stream Head

User Memory

Kernel Memory

Network Traffic

Network Traffic

CPU

NICPCI Driver

IP

NetworkInterface

Card

PCI Bus

Mac

PHY

TCP/IP Stack

Application

InfinibandHCA

User Memory

Kernel Memory

CPU

Driver

uDAPL SDP

Infiniband/RDMA Stack

Mem

ory




CHAPTER 4

Routers, Switches, andAppliances—IP-Based Services:Network Layer

Traditional Ethernet packet forwarding decisions were based on Layer 2 and Layer 3destination Media Access Control (MAC) and IP addresses. As performance,availability, and scalability requirements grew, advances in switching decisionsbased on intelligent packet processing tried to keep pace by offloading functionstraditionally implemented in software and executed on general purpose RISCprocessors onto network processors, Field Programmable Gate Arrays (FPGAs), orApplication Specific Integrated Circuits (ASICS). Early server load balancingimplementations were implemented in software and executed on general purposeRISC processors, which then evolved to services implemented in the data plane andcontrol plane of packet switches. For example, a server load-balancingimplementation now involves health checks implemented in the control plane. Thehealth check results then update specialized forwarding tables and enableforwarding decisions to be performed at wirespeed by consulting these specializedforwarding tables and rewriting the packet.

SSL was first implemented by Netscape as software libraries, originally executed ongeneral-purpose CPUs. Performance was then improved somewhat by offloading themathematical computations onto ASICS, which were actually delivered on PCI cardsinstalled on servers. Recent startup companies are now working on performing allSSL processing in ASICS, allowing SSL to be a dataplane service.

This chapter reviews internal switching architectures as well as some of the newfeatures that have been integrated in multilayer Ethernet switches due to evolvingrequirements that surfaced during deployment of Internet Web-based applications. Itdiscusses in varying detail the following IP services:

■ Server Load Balancing—a mechanism to distribute loads across a group ofservers, which host identical applications, that logically behaves as oneapplication

■ Layer 7 Switching—packet forwarding decisions based on packet payload

65


■ Network Address Translation (NAT)—rewriting packet source and destinationaddresses and ports for the purpose of decoupling the external public interfacefrom internal interfaces of servers in particular IP addresses and ports

■ Quality of Service (QoS)—providing differentiated services to packet flows

■ Secure Socket Layers (SSL)—encrypting traffic at the application layer for HTTP-based traffic

This chapter first describes the internal architecture of a basic network switch andthen describes more advanced features. It also provides a comprehensive discussionof server load balancing from a detailed conceptual perspective to actual practicalswitch configuration details. Because of the stateless nature of HTTP, server loadbalancing (SLB) has proven to be ideal for scaling the Web tier. However, there aremany different flavors of SLB in terms of fundamental algorithm and deploymentstrategies that this chapter discusses and describes in detail. This chapter alsoanswers a question that crops up over and over and is rarely answered: How do weknow which is the best SLB algorithm, and what is the proof? The chapter thenbriefly describes Layer 7 switching and NAT and variants thereof. This is followedby a detailed look at QoS, showing where and how to use it and how it works.Finally, we look at SSL from a conceptual layer and describe configuring acommercially available SSL appliance.

Packet Switch InternalsThe terms router and switch are often confusing because of the marketing adaptationsfrom vendor to vendor. Original routers performed Layer 3 packet forwardingdecisions on general-purpose computing devices with multiple network interfacecards. The bulk of the packet processing was performed in software. The inherentdesign of Ethernet has limited scalability. As the number of nodes increased on anEthernet segment belonging to one collision domain, the latency increasedexponentially, hence bridges were introduced. Bridges segregated Ethernet segmentsby only allowing broadcasts to certain segments and learning MAC addresses. Thisallowed the number of nodes to increase.

The next advance in network switches was introduced in the early 1990s with theintroduction of the Ethernet switch, which allowed multiple simultaneousforwarding of packets based on Layer 2 MAC addresses. This increased thethroughput of networks dramatically, since the single talker shared-bus Ethernetonly allowed one flow to communicate at any one single instant in time. Packetswitches evolved by making forwarding decisions not only on Layer 2, but also onLayer 3 and Layer 4. These higher layer forwarding packet switches are morecomplicated because more complex software is required to update the correspondingforwarding tables and more memory is needed. Memory bandwidth is a significantbottleneck for wirespeed packet forwarding. Another advance in network switches



was due to cut-through mode, which allowed the switch to immediately make aforwarding decision even before the entire packet was read into the memory of theswitch. Traditional switches were of the store-and-forward type, which needed toread the entire packet before making a forwarding decision.

FIGURE 4-1 shows the internal architecture of a multi-layer switch, including asignificant amount of integration of functions. Most of the important repetitive tasksare implemented in ASIC components, in contrast to the early routers describedpreviously, which performed forwarding tasks in software on a general-purposecomputer CPU card. Here the CPU mostly runs control plane and background tasksand does very little data forwarding. Modern network switches break down tasksinto those that need to be completed quickly and those that do not need to beperformed in real time into layers or “Planes” as follows:

■ Control Plane—the set of functions that controls how incoming packets should beprocessed or how the data path is managed. This includes routing processes andprotocols that populate the forwarding tables that contain routing (Layer 3) andswitching (Layer 2) entries. This is also commonly referred to the “slow path”because timing is not as crucial as in the data path.

■ Data Plane—the set of functions that operates on the data packets, such as Routelookup and rewrite of destination MAC address. This is also commonly referredto as the “fast path.” Packets must be forwarded at wire speed, hence packetprocessing has a much higher priority than control processing and speed is of theessence. The following section describes various common components andfeatures of a modern network switch.

Chapter 4 Routers, Switches, and Appliances—IP-Based Services: Network Layer 67


FIGURE 4-1 Internal Architecture of a Multi-Layer Switch

The following numbered sections describe the main functional components of atypical network switch and correlate to the numbers in FIGURE 4-1.

1. PHY Transceiver

FIGURE 4-1 shows that as a packet enters a port, the physical layer (PHY) chip is inReceive Mode (Rx). The data stream will be in some serialized encoded format,where a 4-bit nibble is built and sent to the MAC to build a complete Ethernet frame.The PHY chip implements critical functions such as collision detection, which isneeded only in half-duplex mode, link monitoring to detect tpe-link-test, andauto negotiation to synchronize with the sender.

FIB Lookup

RXFIFO

TXFIFO

MAC

PHY Transceiver

VLAN Lookup

Packet Classification

10

9

8

6

2

1

FIB Lookup

RXFIFO

TXFIFO

MAC

PHY Transceiver

VLAN Lookup


FIB Lookup

RXFIFO

TXFIFO

MAC

PHY Transceiver

VLAN Lookup


14

TrunkingLACP

RISC Processor(s)

RoutingProtocols

FlowControl

SPTBPDU

Switching Fabric

SNMP MGT SW

FLOWData

StructsCLIS/W

PacketScheduler

Memory

Addressing Tables

Packet/Frame Buffers

ExpressPort

0x00ff

0x0100

Tx, RXDescriptors

0x0101

0xffff

DestMac

SACMac

SRCIP

DestIP

QoSQueues

5

3

11

12

13

4

7



2. Media Access Control

The Media Access Control (MAC) ASIC takes the 4-bit nibble from PHY andconstructs a complete Ethernet frame. The MAC chip inserts a Start Frame Delimiterand Preamble when in Transmit Mode (Tx) and strips off these bytes when in Rxmode. The MAC implements the 802.3u,z functions depending on the link speed.MAC implements functions such as collision backoff and flow control. The flowcontrol feature prevents slower link queues from being overrun. This is an importantfeature. For example, when a 1 Gbit/sec link is transmitting to a slower 100Mbit/sec link, a finite amount of buffer or queue memory is available. By sendingPAUSE frames, the sender slows down, hence using fewer switch resources toaccommodate the fast senders and slow receiver data transmissions. Once a frame isconstructed, the MAC first checks if the destination MAC address is in the range of01-80-C2-00-00-00 to 01-80-C2-00-00-0F. These are special reserved multicastaddresses used for MAC functions such as link aggregation, spanning tree, or pauseframes for flow control.

3. Flow Control (MAC Pause Frames)

When a flow control frame is received, a timer module is invoked to wait until acertain time elapses before sending out the subsequent frame. For example, a flowcontrol frame is sent out when the queues are being overrun, so the MAC is free tocatch up and allow the switch to process the ingress frames that are queued up.

4. Spanning Tree

When Bridge Protocol Data Units (BPDUs) are received by the MAC, the spanningtree process parses the BPDU, determines the advertised information, and comparesit with stored state. This allows the process to compute the spanning tree and controlwhich ports to block or unblock.

5. Trunking

When a Link Aggregation Control Protocol (LACP) frame is received, a linkaggregation sublayer parses the LACP frame, processes the information, andconfigures the collector and distributor functions. The LACP frame containsinformation about the peer trunk device such as aggregation capabilities and stateinformation. This information is used to control the data packets across the trunkedports. The collector is an ingress module that aggregates frames across the ports of atrunk. The distributor spreads out the frames across the trunked ports on egress.

6. Receive FIFO

If the MAC frame is not a control frame, then the MAC frame is stored in theReceive Queue or Rx FIFO (first in, first out), which are buffers that are referencedby Rx descriptors. These descriptors are simply pointers, so when moving packetsaround for processing, small 16-bit pointers are moved around instead of 1500-byteframes.



7. Flow Structures

The first thing that occurs after the Ethernet frame is completely constructed is thata flow structure is looked up. This flow structure will have a pointer to an addresstable that will be able to immediately identify the egress port so that the packet canbe quickly stored, queued, and forwarded out the egress port. On the first packet ofa flow, this flow data structure will not exist, so the lookup will return a failure. TheCPU must be interrupted to create this flow structure and return to caller. This flowstructure has enough information about where to store the packet in a region ofmemory used for storing entire packets. There are associated data structures calledTx or Rx descriptors; these are handles to the packet itself. As with FIFO descriptors,the reason for these data structures is speed. Instead of moving around large 1500-byte packets for queuing up, only 32-bit pointers are moved around.

8. Packet Classification

A switch has many flow-based rules for firewalls, NAT, VPN, and so on. The packetclassification performs a quick lookup for all the rules that apply to this packet.There are many algorithms and implementations that basically inspect the IP headerand try to find a match in the table that contains all the rules for this packet.

9. VLAN Lookup

The VLAN module needs to identify the VLAN membership of this frame bylooking at the VLAN ID (VID) in the tag. If the frame is untagged, then dependingon whether the VLAN is port based or MAC address based, the set of output portsneeds to be looked up. This is usually implemented by vendors in ASICs due towirespeed timing requirements.

10. Forwarding Information Base (FIB) Lookup

After a packet has passed through all the Layer 2 processing, the next step is todetermine the egress ports that this packet must be forwarded to. The routing tablesdetermine the next hop, which is populated in the control plane. There are twoapproaches to implementing this function:

■ Centralized: One central database contains all the forwarding entries.■ Distributed: Each port has a local database for quick lookups.

The distributed implementation is much faster. It will be discussed further later inthis chapter.

11. Routing Protocols

All routing packets are sent to the appropriate routing process, such as RIP, OSPF, orBGP, and this process populates the routing tables. This process is performed in thecontrol plane or slow path. The routing tables are used to populate the Forwarding



Information Base (FIB), which can be in a central memory area or downloaded toeach port’s local memory, providing faster data path performance in the FIB lookupphase.

The next step occurs when the packet is ready to be scheduled for transmission bythe packet scheduler by pulling out the descriptor out of the appropriate QoS queue.Finally, the packet is sent out the egress port.

12. Switch Fabric Module (SFM)

Once the FIB lookup is completed, the packet scheduler must queue the packet ontothe output queues. The output queues can be implemented as a set of multiplequeues, each with a certain priority, to implement different classes of services. TheSFM links the ingress processing to the egress processing. SFM can be implementedusing Shared Memory or CrossPoint Architectures. In a shared memory approach,the packets can be written and read to a shared memory location. An arbitratormodule controls access to the shared memory. In a CrossPoint Architecture, there isno storage of packets; instead, there is a connection from one to another. CrossPointfurther requires that the packet be broken into fixed-sized cells. CrossPoint usuallyhas very high bandwidth used only for backplanes. The bandwidths must be higherbecause of the extra overhead and padding required in the construction anddestruction of fixed-sized cells. Both approaches suffer from Head of Line Blocking(HOL), but usually use some form of virtual output queue workaround to mitigatethe effects. HOL occurs when a large packet holds up smaller packets farther downthe queue when being scheduled.

13. Packet Scheduler

The packet scheduler simply chooses packets that need to be moved from one set ofqueues to another based on some algorithm. The packet scheduler is usuallyimplemented in an ASIC. Instead of moving entire frames, sometimes 1500 bytes,only 16-bit or 32-bit descriptors are moved.

14. Transmit FIFO

The transmit queue or Tx FIFO is the final store before the frame is sent out theegress port. The same functions are performed as those described on the ingress (RxFIFO) but in the opposite directions.

Emerging Network Services andAppliancesOver the past years, enterprise networks have evolved significantly to handle Webtraffic. Enterprise customers are realizing the benefits as a result, embracingintelligent IP-based services in addition to traditional stateless Layer 2 and Layer 3



services at the data center edge. Services such as SLB, Web caching, SSL accelerators,NAT, QoS, firewalls, and others are now common in every data center edge. Thesedevices are either deployed adjacent to network switches or integrated as an addedservice inside the network switch. Often a multitude of vendors can potentiallyimplement a particular set of functions. The following sections describe some thekey IP services you can use in the process of crafting high-quality network designs.

Server Load BalancingNetwork SLB is essentially the distribution of load across a pool of servers. Incomingclient requests that are destined to a specific IP address and port are redirected to apool of servers. The SLB algorithm determines the actual target server. The first formof server load balancing was introduced using DNS round-robin, where a DomainName Service (DNS) resource record allowed multiple IP addresses to be mapped toa single domain name. The DNS server then returned one of the IP addresses usinga round-robin scheme. Round-robin provides a crude way to distribute the loadacross different servers. The limitations of this scheme include the need for a serviceprovider to register each IP address. Some Web farms now grow to hundreds offront-end servers, and every client migh inject a different load, resulting in unevendistribution of load. Modern SLB, where one virtual IP address maps to a pool ofreal servers, was introduced in the mid 1990s. One of the early successes includedthe introduction of the Cisco local director, where it became apparent that round-robin was also an ideal solution for increasing not only the availability but also theaggregate service capacity for HTTP-based Web requests.

FIGURE 4-2 describes a high-level model of server load balancing.



FIGURE 4-2 High-Level Model of Server Load Balancing

In FIGURE 4-2 the incoming load = λ. It is spread out evenly across N servers, eachhaving a service capacity rate = µ. How does the SLB device determine where toforward the client request? The answer depends on the algorithm.

One of the challenges faced by network architects is choosing the right SLBalgorithm from the plethora of SLB algorithms and techniques available. Thefollowing sections explore the more important SLB derivatives, as well as whichtechnique is best for which problem.

HashThe hash algorithm pulls certain key fields from the client incoming request packet,usually the source/destination IP address and TCP/UDP port numbers, and usestheir values as an index to a table that maps to the target server and port. This is ahighly efficient operation because the network processor can execute this instructionin very few clock cycles, only performing expensive read operations for the indextable lookup. However, the network architect needs to be careful about the followingpitfalls:

■ Megaproxy architectures, such as those used by some ISPs, remap the dial-inclient’s source IP addresses to that of the megaproxy, not the client’s actualdynamically allocated IP address, which might not be routable. So be careful notto assume stickiness properties for the hash algorithm.

Total IncomingClient Requestsat Rate =

Distributed IncomingClient Requestsat Uneven Rate =

/N

/N

/N

/N

SLB

1

2

N



■ Hashing bases its assumption of even load distribution on heuristics, whichrequire careful monitoring. It is entirely possible that due to the mathematics, thehash values will skew the load distribution, resulting in worse performance thanround-robin.

Round-RobinRound-robin (RR)—or weighted round-robin (WRR)—is the most widely used SLBalgorithm because it is simple to implement efficiently. The RR/WRR algorithmlooks at the incoming packet and remaps the destination IP address/portcombination to the target IP/port from a fixed table and moving pointer. The LeastConnections algorithm requires at least one more process to continually monitor therequests sent or received to or from each server, hence estimating the queueoccupancy. From that information, the incoming packet can determine the targetIP/port. The major flaw with this algorithm is that the servers must be evenlyloaded or the resulting architecture will be unstable, as requests can build up on oneserver and eventually overload it.

Smallest Queue First /Least ConnectionsThe Smallest Queue First (SQF) is one of the best SLB algorithms because it is self-adapting. This method considers the actual capabilities of the server and knowsexactly which server can best absorb the next request. It also provides the leastaverage delay and above all is stable. In commercial switches, this is close to what isreferred to as Least Connections. However, in commercial implementations, thereare some cost reduction short-cuts that approximate SQF. FIGURE 4-3 provides a high-level model of the SQF algorithm.



FIGURE 4-3 High-Level Model of the Shortest Queue First Technique

Data centers often have servers that all perform the same function but vary inprocessing speed. Even when the servers have identical hardware and software, theactual client requests may exercise different code paths on the servers, henceinjecting different loads on each server. This results in an uneven distribution ofload. The SQF algorithm determines where to spread out the incoming load bylooking at the queue occupancies. If server i is more overloaded than the otherservers, the Queue i of one server i begins to build up. The SQF algorithmautomatically adjusts itself and stops forwarding requests to server i. Because theother SLB variations do not have this crucial property, SQF is the best SLB algorithm.Further analysis shows that SQF has another more important property: stability.Stability describes the long-term behavior of the system.


Unequal Server CapacitiesUneven Distributed IncomingClient Requestsat Uneven Rate =

1

1

2

N

SLB 2

N

1

1

1

2

N



FIGURE 4-4 Round-Robin and Weighted Round-Robin

Finding the Best SLB AlgorithmRecently, savvy customers have begun to ask network architects to substantiate whyone SLB algorithm is better than another. Although this section requires significanttechnical background knowledge, it provides definite proof and explains why theSQF is the best algorithm in terms of system stability.

The SLB system, which is composed of client requests and the servers, can beabstracted for the purposes of analysis as shown in FIGURE 4-5. This shows that initialclient Web requests—that is, when the client picks the first home page, excludingcorrelated subsequent requests—can be modeled as a Poisson process with rate λ.The Poisson process is a probability function with an exponential distribution, and itis reasonably accurate for telecommunication network theory as well as Internetsession initiation traffic analysis. The Web servers or application servers can bemodeled as M/M/1 queues. We have a number (N) of independent and potentiallydifferent capacities. Hence you can model each one with its own range of servicetimes and corresponding average. This model is reasonable because it captures thefact that the client request can invoke software code path traversals that vary as wellas hardware configuration differences. The SLB shown is subjected to an aggregateload from many clients. Each client has its own Poisson process request traffic.However, because one fundamental property of the Poisson process is that the sumof all Poisson processes is also a Poisson process, we can simplify the complete clientside, which we can now model as one large Poisson process of rate λ. The SLB

Client requests are forwarded blindly to servers. Weights W1 ..WNdetermine proportion of incoming load


Equal Server Capacities

SLB

1

1

1

1

/N*W1

/N*W2

/N*WN



device forwards the initial client request to the least-occupied queue. There are Nqueues, each with a Poisson arrival process and an exponential service time. Hencewe can model all the servers as N M/M/1 queues.

To prove that this system is stable, we must show that under all admissible time andinjected load conditions the queues will never grow without any bounds. There aretwo approaches we can take:

■ Model the state of the queues as a stochastic process, determine the MarkovChain, and then solve the long-term equilibrium distribution π.

■ Craft and utilize a Lyapunov Function L (t) which accurately models the growthof the queues, and then show that over the long term—that is, after the systemhas time to warm up and reach a steady state and a certain threshold—the rate ofchange of queue size is negative and remains negative for large enough L (t). Thisis a common and proven technique found in many network analysis researchpapers.

We will show that:

dL/dt = some negative value, for all values of L (t) greater than some threshold.It turns out that the Expected Value of the single step drift is equivalent, butmuch easier to calculate, which is the technique that we will use.

FIGURE 4-5 Server Load Balanced System Modeled as N - M/M/1 Queues

Server Process with Exponential Service Rate =

M/M/1

M/M/1

M/M/1

11

2

N

2

N

SLB 2

N

1

i

= i

Client Requests Session Initiations Modeled as a Poisson Process with Rate =



We will perform this analysis by first obtaining the discrete time model of oneparticular queue and then generalizing the result to all the N queues, as shown inthe system model. If we take the discrete model, the state of one of the queues can bemodeled as shown in FIGURE 4-6.

FIGURE 4-6 System Model of One Queue

Queue Occupancy

at time t+1 = Queue Occupancy at t + Number of Arrivals(t+1) - Number ofDepartures(Serviced)(t+1)

Q(t+1) = Q (t) + A (t+1) - D (t+1)

Because the state of the queue depends only on the previous state, this is easilymodeled as a valid Markov Process, for which there are known, proven methods ofanalysis to find the steady-state distribution. However, since we have N queues, theactual mathematics is very complex. The Lyapunov function is an extremelypowerful and accurate method to obtain the same results, and it is far simpler. SeeAppendix A for more information about the Lyapunov analysis.

How the Proxy Mode WorksThe SQF algorithm is only one component in understanding how to best deploy SLBin network architectures. There are several different deployment scenarios availablefor creating solutions. In Proxy Mode, the client points to the server load balancingdevice, and the server load balancer remaps the destination IP address and port tothe target server as selected by the SLB algorithm. Additionally, the source IP/port ischanged so that the server will return the response to the server load balancer andnot to the client directly. The server load balancer keeps state information to returnthe packet to the correct client.

1iArrival Rate =

Queue State = # of queued Client requests at time t

Web serverService rate =

1

1



FIGURE 4-7 Server Load Balance—Packet Flow: Proxy Mode

FIGURE 4-7 illustrates how the packet is modified from client to SLB to server, back toSLB, and finally back to client. The following numbered list correlates with thenumbers in FIGURE 4-7.

1. The client submits an initial service request targeted to the virtual IP (VIP)address of 120.141.0.19 on port 80. This VIP address is configured as the IPaddress of the SLB appliance.

5

6

7

192.191.3.89 3201 120.141.0.19 80 <hl>..... </hl>

33 80 <hl>..... </hl>

192.191.3.89 3201 120.141.0.19 80 GET://www.abc.com/index.html

SLB

SLB Table

SLB TableInfo

33 10.0.0.1

10.0.0.1

80 GET://www.abc.com/index.html

8

1

2

3

4

120.141.0.19

10.0.0.1

dst ip dst port

dst ipdstport

dstport

src ip src port

src ip src port

dst ipsrc ip src port

payload

dst ip dst portsrcip src port payload

payload

payload

120.141.0.19

120.141.0.19



2. The SLB receives this packet from the client and recognizes that this incomingpacket must be forwarded to a server selected by the SLB algorithm.

3. The SLB algorithm identifies server 10.0.0.1 at port 80 to receive this client requestand modifies the packet so that the server sends it to the SLB and not to the client.Hence, the source and port are also modified.

4. The server receives the client request.

5. Perceiving that the request has come from the SLB, the server returns therequested Web page back to the SLB device.

6. The SLB receives this packet from the server. Based on the state information, itknows that this packet must be sent back to client 192.191.3.89.

7. The SLB device rewrites the packet and sends it out the appropriate egress port.

8. Client receives receives response packet.

Advantages of Using Proxy Mode■ Increases security and flexibility by decoupling the client from the backend

servers

■ Increases switch manageability because servers can be added and removeddynamically without any modifications to the SLB device configuration after it isinitially configured

■ Increases server manageability because any IP address can be used

Disadvantages of Using Proxy Mode■ Limits throughput because the SLB must process packets on ingress as well as

return traffic from server to client

■ Increases client delays because each packet requires more processing

How Direct Server Return WorksOne of the main limitations of Proxy Mode is performance. Proxy Mode requiresdouble work in the sense that incoming traffic from client to servers must beintercepted and processed, as well as return traffic from server to clients. DirectServer Return (DSR) addresses this limitation by requiring that only incoming trafficbe processed by the SLB, thereby increasing performance considerably. To betterunderstand how this works, see FIGURE 4-8. In DSR Mode, the client points to theSLB device, which only remaps the destination MAC address. This is accomplishedby leveraging the loopback interface of the Sun Solaris servers and other servers that



support loopback. Every server has a regular unique IP address and a loopback IPaddress, which is the same as the external VIP address of the SLB. When the SLBforwards a packet to a particular server, the server looks at the MAC address todetermine whether this packet should be forwarded up to the IP stack. The IP stackrecognizes that the destination IP address of this packet is not the same as thephysical interface, but it is identical to the loopback IP address. Hence, the stack willforward the packet to the listening port.

FIGURE 4-8 Direct Server Return Packet Flow

5

6

6

33 80 <hl>..... </hl>

192.191.3.89 3201 120.141.0.19 80 GET://www.abc.com/index.html

SLB

SLB Table

SLB StateInfo

33 10.0.0.1

10.0.0.1

80 GET://www.abc.com/index.html

1

2

3

4

120.141.0.19

real ip:10.0.0.1

loopback lo0:120.141.0.19

mac:0:8:3e:4:4c:84

dst ipdstport

dstport

src ip src port

dst ipsrc ip src port

dst ip dst portsrcip src port payload

payload

payload

120.141.0.19

120.141.0.19



FIGURE 4-8 shows the DSR packet flow process. The following numbered listcorrelates with the numbers in FIGURE 4-8.

1. The client submits an initial service request targeted to the VIP address of120.141.0.19 port 80. This VIP address is configured as the IP address of the SLBappliance.

2. The SLB receives this packet from the client and forwards this incoming packet toa server selected by the SLB algorithm.

3. The SLB algorithm identifies server 10.0.0.1 port 80 to receive this client requestand modifies the packet by only changing the destination MAC Address to0:8:3e:4:4c:84 which is the MAC address of the real server.

Note – Statement 3 implies that the SLB and the servers must be on the same Layer2 VLAN. Hence, DSR is less secure than the proxy mode approach.

4. The server receives the client request and processes the incoming packet.

5. The server returns the incoming packet directly back to the client by swappingthe destination/source IP and TCP address pair.

6. The destination IP address is the same as that configured on the loopback and issent back directly to the client.

Advantages of Direct Server Return■ Increases security and flexibility by decoupling the client from the back-end

servers.

■ Increases switch manageability because servers can be added and removeddynamically without any modifications to the SLB device configuration after it isinitially configured.

■ Increases performance and scalability. The server load-balancing work is reducedby half because the return path is the same as the incoming path. Thus, morecycles are free to process more incoming traffic.

Disadvantages of Direct Server Return■ The SLB must be on same Layer 2 network as the server because they have the

same IP network number, only differing by MAC address.

■ All the servers must be configured with the same loopback address as the SLBVIP. This might be an issue for securing critical servers.



Server MonitoringAll SLB algorithms, except the family of fixed round-robin, require knowledge of thestate of the servers. SLB implementations vary enormously from vendor to vendor.Some poor implementations simply monitor link state on the port to which the realserver is attached. Some monitor using ping request on Layer 3. Port-based healthchecks are superior because the actual target application is verified for availabilityand response time. In some cases, the Layer 2 state might be fine, but the actualapplication has failed, and the SLB device mistakenly forwards requests to thatfailed real server. The features and capabilities of switches are changing rapidly,often in simple flash updates, and you must be aware of the limitations.

PersistenceOften when a client is initially load-balanced to a specific server, it is crucial thatsubsequent requests are forwarded to the same server within the pool. There areseveral approaches to accomplishing this:

■ Allow the server to insert a cookie in the client’s HTTP request.

■ Configure the SLB to look for a cookie pattern and make a forwarding decisionbased on the cookie. The first request of the client will have no cookie, so the SLBwill forward to the best server based on the algorithm. The server will install acookie, which is a name-value pair. On the return of the packet, the SLB will readthe cookie value and record client-server pair. Subsequent requests from the sameclient will have a cookie, which triggers the SLB to forward based on the recordedcookie information, not on the SLB algorithm.

■ Hash, based on the client’s source IP address. This is risky if the client requestcomes from a megaproxy.

It is best to avoid persistence because HTTP was designed to be stateless. Trying tomaintain state across many stateless transactions causes serious issues if there arefailures. In many cases, the application software can maintain state. For example,when a servlet receives a request, it can identify the client based on its own cookievalue and retrieve state information from the database. However, switch persistencemight be required. If so, you should look at the exact capabilities of each vendor anddecide which features are most critical.



Commercial Server Load Balancing SolutionsMany commercial SLB implementations are available both hardware and software.

■ Resonate provides a Solaris library offering, where a STREAMS Module/Driver isinstalled on a server that accepts all traffic, inspects the ingress packet, andforwards it to another server that actually services the request. As the cost ofhardware devices falls and performance increases, the Resonate product is lesspopular.

■ Various companies such as Cisco, F5, and Foundry ServerIron sell hardwareappliances that perform only server load balancing. One important factor toexamine carefully is the method used to implement the server load-balancingfunction.

■ The F5 is limited because it is a PC Intel box, running BSD UNIX®, with two ormore network interface cards.

Wirespeed performance can be limited because these general purpose computer-based appliances are not optimized for packet forwarding. When a packet arrives ata NIC, an interrupt must first be generated and serviced by the CPU. Then the PCIbus arbitration process will grant access to traverse the bus. Finally, the packet iscopied into memory. These events cumulatively contribute to significant delays. Insome newer implementations, wirespeed SLB forwarding can be achieved. DataPlane Layer 2/Layer 3 forwarding tables are integrated with the server load-balancing updates. Hence as soon as a packet is received, a packet classifierimmediately performs an SLB lookup in the data plane with hardware using tablespopulated and maintained by the SLB process that resides in the control plane,which also monitors the health of the servers.

Foundry ServerIron XL—Direct Server Return Mode

CODE EXAMPLE 4-1 shows the configuration file for the setup of a simple server loadbalancer. Refer to the Foundry ServerIron XL user guide for detailed explanations ofconfiguration parameters. This shows the level of complexity for configuring atypical SLB device. This device is assigned a VIP address of 172.0.0.11, which is theIP address exposed to the outside world. On the internal LAN, this SLB device isassigned an IP address of 20.20.0.50, which can be used as the source IP address thatis sent to the servers if you are using proxy mode. However, this device is



configured in DSR mode, where the SLB forwards to the servers, which then returndirectly to the client. Notice that the servers are on the same VLAN as this SLBdevice on the internal LAN side of the 20.0.0.0 network.

CODE EXAMPLE 4-1 Configuration for a Simple Server Load Balancer

!ver 07.3.05T12global-protocol-vlan!!server source-ip 20.20.0.50 255.255.255.0 172.0.0.10!!!!!!server real s1 20.20.0.1 port http port http url "HEAD /"!server real s2 20.20.0.2 port http port http url "HEAD /"!!server virtual vip1 172.0.0.11 port http port http dsr bind http s1 http s2 http

!vlan 1 name DEFAULT-VLAN by port no spanning-tree!hostname SLB0ip address 172.0.0.111 255.255.255.0ip default-gateway 172.0.0.10web-management allow-no-passwordbanner motd ^CReference Architecture -- Enterprise Engineering^CServer Load Balancer-- SLB0 129.146.138.12/24^C

!!



Extreme Networks BlackDiamond 6800 Integrated SLB—Proxy Mode

CODE EXAMPLE 4-2 shows an excerpt of the SLB configuration for a large chassis-based Layer 2/Layer 3 switch with integrated SLB capabilities. Various VLANs andIP addresses are configured on this switch in addition to the SLB. Pools of serverswith real IP addresses are configured. The difference is that this switch is configuredin the more secure proxy mode instead of DSR, shown in the previous example.

CODE EXAMPLE 4-2 SLB Configuration for a Chassis-based Switch

## MSM64 Configuration generated Thu Dec 6 21:27:26 2001# Software Version 6.1.9 (Build 11) By Release_Master on 08/30/0111:34:27

..# Config information for VLAN app.config vlan "app" tag 40 # VLAN-ID=0x28 Global Tag 8config vlan "app" protocol "ANY"config vlan "app" qosprofile "QP1"config vlan "app" ipaddress 10.40.0.1 255.255.255.0configure vlan "app" add port 4:1 untagged..

## Config information for VLAN dns...configure vlan "dns" add port 5:3 untaggedconfigure vlan "dns" add port 5:4 untaggedconfigure vlan "dns" add port 5:5 untagged....configure vlan "dns" add port 8:8 untaggedconfig vlan "dns" add port 6:1 tagged## Config information for VLAN super.config vlan "super" tag 1111 # VLAN-ID=0x457 Global Tag 10config vlan "super" protocol "ANY"config vlan "super" qosprofile "QP1"# No IP address is configured for VLAN super.config vlan "super" add port 1:1 taggedconfig vlan "super" add port 1:2 taggedconfig vlan "super" add port 1:3 taggedconfig vlan "super" add port 1:4 taggedconfig vlan "super" add port 1:5 taggedconfig vlan "super" add port 1:6 taggedconfig vlan "super" add port 1:7 tagged



config vlan "super" add port 1:8 taggedconfig ......config vlan "super" add port 6:4 taggedconfig vlan "super" add port 6:5 taggedconfig vlan "super" add port 6:6 taggedconfig vlan "super" add port 6:7 taggedconfig vlan "super" add port 6:8 tagged..

enable web access-profile none port 80configure snmp access-profile readonly Noneconfigure snmp access-profile readwrite Noneenable snmp accessdisable snmp dot1dTpFdbTableenable snmp trapconfigure snmp community readwrite encrypted "r~`|kug"configure snmp community readonly encrypted "rykfcb"configure snmp sysName "MLS1"configure snmp sysLocation ""configure snmp sysContact "Deepak Kakadia, Enterprise Engineering"..

# ESRP Interface Configurationconfig vlan "edge" esrp priority 0config vlan "edge" esrp group 0config vlan "edge" esrp timer 2config vlan "edge" esrp esrp-election ports-track-priority-mac

..

..

# SLB Configurationenable slbconfig slb global ping-check frequency 1 timeout 2config vlan "dns" slb-type serverconfig vlan "app" slb-type serverconfig vlan "db" slb-type serverconfig vlan "ds" slb-type serverconfig vlan "web" slb-type serverconfig vlan "edge" slb-type clientcreate slb pool webpool lb-method round-robinconfig slb pool webpool add 10.10.0.10 : 0config slb pool webpool add 10.10.0.11 : 0create slb pool dspool lb-method least-connection

CODE EXAMPLE 4-2 SLB Configuration for a Chassis-based Switch (Continued)

#



Layer 7 SwitchingThe recent explosive demand for application hosting and increased security fueledthe demand for a new concept called content switching, also known as Layer 7switching, proxy switching, or URL switching. This switching technology basicallyinspects the payload, which is expected to be some HTTP request, such as a static ordynamic Web page. The content switch searches for a certain string, and if there is amatch, it takes some type of action. For example, the content switch might rewritethe content or redirect it to a pool of servers that specializes in these services or to acaching server for increased performance. The main idea is that a forwardingdecision is made based on the application data, not traditional Layer 2 or Layer 3destination network addresses.

Some major technical challenges arise in performing this type of processing. The firstis a tremendous performance impact. In traditional Layer 2 and Layer 3 processing,the destination addresses and corresponding egress port are found by looking at afixed offset in the packet. This allows for extremely cheap and fast ASICs. Usually,the packet header is read in from the MAC and copied into SRAM, which has an

config slb pool dspool add 10.20.0.20 : 0config slb pool dspool add 10.20.0.21 : 0create slb pool dbpool lb-method least-connectionconfig slb pool dbpool add 10.30.0.30 : 0config slb pool dbpool add 10.30.0.31 : 0create slb pool apppool lb-method least-connectionconfig slb pool apppool add 10.40.0.40 : 0config slb pool apppool add 10.40.0.41 : 0create slb pool dnspool lb-method least-connectionconfig slb pool dnspool add 10.50.0.50 : 0config slb pool dnspool add 10.50.0.51 : 0create slb vip webvip pool webpool mode translation 10.10.0.200 :0 unit 1create slb vip dsvip pool dspool mode translation 10.20.0.200 : 0unit 1create slb vip dbvip pool dbpool mode translation 10.30.0.200 : 0unit 1create slb vip appvip pool apppool mode translation 10.40.0.200 :0 unit 1create slb vip dnsvip pool dnspool mode translation 10.50.0.200 :0 unit 1....

CODE EXAMPLE 4-2 SLB Configuration for a Chassis-based Switch (Continued)

#



access time of around five nanoseconds. The variable size and bulky payload areusually copied into DRAM, which has a higher initial setup time. The forwardingdecision requires two SRAM memory accesses, where the header is read, modified,written, and a quick lookup is performed—usually a Telecommunications AccessMethod (TCAM) or Patricia Tree lookup in SRAM, which takes a few nanoseconds.However, for Layer 7 forwarding decisions, almost all commercial switches, exceptthe Extreme Px1, must perform this function in much slower CPU, running a real-time operating system, such as VxWorks. The payload, which resides in DRAM,must be read, processed, and written. This string search is also time intensive. (Therehave been recent advances in Layer 7 technology such as that offered by Solidumand PMC-Sierras ClassiPI, which perform this at wirespeed rates. However, at thetime of this writing, we are not aware of any major switch manufacturer using thistechnology.) This operation takes orders of magnitude more time.

NAT can be extended not only to hide internal private IP addresses but also to basepacket forwarding decisions on the payload. There are two approaches toaccomplish this function:

■ Application Gateway – This approach terminates the socket connection on theclient side and creates another connection on the server side, providing completeisolation between the client and the server. This requires more processing timeand resources on the switch. However, it allows the switch to make acomprehensive application-layer forwarding decision.

■ TCP Splicing – This approach simply rewrites the TCP/IP packet headers, therebyreducing the amount of processing required on the switch. This makes it moredifficult for the switch to make application-layer forwarding decisions if thecomplete payload spans many small TCP packets.

This section describes an application gateway approach to NAT and performingLayer 7 processing.

FIGURE 4-9 shows an overview of the functional content switching model.



FIGURE 4-9 Content Switching Functional Model

Content switching with full network address translation (NAT) serves the followingpurposes:

■ Isolates internal IP addresses from being exposed to the public Internet.

■ Allows reuse of a single IP address. For example, clients can send their Webrequests to www.a.com or www.b.com, where DNS maps both domains to asingle IP address. The proxy switch receives this request with the packetcontaining an HTTP header in the payload that contains the target domain, forexample a.com or b.com, and determines to which group of servers to redirectthis request.

servergroup 1stata

Proxy Switching Function

client httprequest

-Terminate socket connection-get url-check against rules-forward to servergroup /slb function-or get valid cookie, with server id,and forward to same server

-http:www.a.com/SMA/stata/index.html servergroup1

-http:www.a.com/SMA/dnsa/index.html servergroup2

-http:www.a.com/SMA/statb/index.html servergroup3

-http:www.a.com/SMA/CHACHEB/index.html servergroup4

-http:www.a.com/SMA/DYNA/index.html servergroup5

servergroup 2dnsa

servergroup 3statb

servergroup 4cacheb

servergroup 5dynab



■ Allows parallel fetching of different parts of Web pages from servers optimizedand tuned for that type of data. For example, a complex Web page might needGIFs, dynamic content, cached content, and so on. With content switching, one setof Web servers can hold the GIFs, while another can hold the dynamic content orcached content. The proxy switch can make parallel fetches and retrieve the entirepage at a faster rate than would be possible otherwise.

■ Ensures requests with cookies or SSL session IDs are redirected to the same serverto take advantage of persistence.

FIGURE 4-9 shows that the client’s socket connection is terminated by the proxyfunction. The proxy retrieves as much of the URL as is needed to make a decisionbased on the retrieved URL. In FIGURE 4-9, various URLs map to various servergroups, which are VIP addresses. The proxy determines whether to forward the URLdirectly or pass it off to a server load-balancing function that is waiting for trafficdestined to the server group.

The proxy is configured with a VIP address, so the switch forwards all clientrequests destined to this VIP address to the proxy function. The proxy function alsorewrites the IP header, particularly the source IP and port, so that the server sendsback the requested data to the proxy, not to the client directly.

Network Address TranslationNetwork Address Translation (NAT) is a critical component for security and propertraffic direction. There are two basic types of NAT: half and full. Half NAT rewritesthe destination IP address and MAC address to a redirected location such as a Webcache, which returns the packet directly to the client because the source IP address isunchanged. Full NAT is where the socket connection is terminated by a proxy, so thesource IP and MAC are changed to that of the proxy server.

NAT serves the following purposes:

■ Security—Prevents exposing internal private IP addresses to the public.

■ IP Address Conservation—Requires only one valid exposed IP address to fetchInternet traffic from internal networks with invalid IP addresses.

■ Redirection—Intercepts traffic destined to one set of servers and redirects it toanother by rewriting the destination IP and MAC addresses. The redirectedservers can send back the request directly to the clients with half NAT-translatedtraffic because the original source IP has not been rewritten.

NAT is configured with a set of filters, usually a 5-tuple Layer 3 rule. If the incomingtraffic matches a certain filter rule, the packet IP header is rewritten or anothersocket connection is initiated to the target server, which itself can be changed,depending on the particular rule. NAT is often combined with other IP services suchas SLB and content switching. The basic idea is that the client and servers are



completely decoupled from each other, and the NAT device manages the IP addressconversions, while the partner service is responsible for another decision such asdetermining which server will handle the request based on load or other rules.

Quality of ServiceAs a result of emerging real-time and mission-critical applications, enterprisecustomers realize that the traditional “Best Effort” IP network service model isunsuitable. The main concern is that poorly behaved flows adversely affect otherflows that share the same resources. It is difficult to tune resources to meet therequirements of all deployed applications.

Quality of Service (QoS) measures the ability of network and computing systems toprovide different levels of services to selected applications and associated networkflows. Customers that deploy mission-critical applications and real-time applicationshave an economic incentive to invest in QoS capabilities so that acceptable responsetimes are guaranteed within certain tolerances.

The Need for QoSTo understand why QoS is critical, it helps to understand what has happened toenterprise applications over the past decade. In the late 1980s and early 1990s, theclient/server was the dominant architecture. The main principle involved a thickclient and local server, where 80 percent of the traffic was from the client to a localserver and 20 percent of the client traffic needed to traverse the corporate backbone.In the late 1990s, with the rapid adoption of Internet-based applications, thearchitecture changed to a thin client, and servers were located anywhere andeverywhere. This had one significant implication: The network became a criticallyshared resource, where priority traffic was dangerously impacted by nonessentialtraffic. A common example is the difference between downloading images andprocessing sales orders. Different applications have different resource needs. Thefollowing section describes why different applications have different QoSrequirements and why QoS is a critical resource for enterprise data centers andservice providers whose customers drive the demand for QoS.

Classes of ApplicationsThere are five classes of applications, having different network and computingrequirements. They are:



■ Data transfers■ Video and voice streaming■ Interactive video and voice■ Mission-critical■ Web-based

These classes are important in classifying, prioritizing, and implementing QoS. Thefollowing sections detail these five classes.

Data Transfers

Data transfers include applications such as FTP, email, and database backup. Datatransfers tend to have zero tolerances for packet loss and high tolerances for delayand jitter. Typical acceptable response times range from a few seconds for FTPtransfers to hours for email. Bandwidth requirements in the order of Kbyte/sec areacceptable, depending on the file size, which keeps response times to a few seconds.Depending on the characteristics of the application (for example, the size of a file),disk I/O transfer times can contribute cumulatively to delays along with networkbottlenecks.

Video and Voice Streaming

Video and voice streaming includes applications such as Apple QuickTimeStreaming or Real Networks streaming video and voice products. Video and voicestreams have low tolerances for packet loss, and medium tolerances for delay andjitter. Typical acceptable response times are only a few seconds. This is possiblebecause the server can pre-buffer multimedia data on the client to a certain degree.This buffer drains at a constant rate on the client side, while simultaneouslyreceiving bursty streaming data from the server with variations in delay. As long asthe buffer can absorb all variations (without draining to empty), the client receives aconstant stream of video and voice. Typical bandwidth requirements are about oneMbyte/sec, depending on the frame rate, compression/decompression algorithms,and the size of images. Disk I/O and CPU also contribute to delays. Large MPEGfiles must be read from disks and compression/decompression algorithms.

Interactive Video and Voice

Interactive video and voice tends to have low tolerance for packet loss and lowtolerance for delay and jitter. Typical bandwidth requirements are tremendous(depending on the number of simultaneous participants in the conference, growingexponentially). Due to the interactive nature of the data being transferred, tolerancesare very low for delay and jitter. As soon as one participant moves or talks, all otherparticipants need to immediately see and hear this change. Response time



requirements range from 250 to 500 milliseconds. This response time is compoundedby the bandwidth requirements, with each stream requiring a few Mbit/sec. In aconference of five participants, each participant pumps out a voice and video streamwhile at the same time receiving streams from the other participants.

Mission-Critical Applications

Mission-critical applications vary in bandwidth requirements, but they tend to havezero tolerance for packet loss. Depending on the application, bandwidthrequirements are about one Kbyte/sec. Response times range from 500 ms to a fewseconds. Server resource requirements (CPU, disk, and memory) vary, depending onthe application.

Web-Based Applications

Web-based applications tend to have low bandwidth requirements (unless largeimage files are associated with the requested Web page) and grow in CPU and diskrequirements, due to dynamically generated Web pages and Web transaction-basedapplications. Response time requirements range from 500 milliseconds to onesecond.

Different classes of applications have different network and computingrequirements. The challenge is to align the network and computing services to theapplication’s service requirements from a performance perspective.

Service Requirements for Applications

The two most common approaches used to satisfy the service requirements forapplications are:

■ Overprovisioning■ Managing and controlling

Overprovisioning allows overallocation of resources to meet or exceed peak loadrequirements. Depending on the deployment, overprovisioning can be viable if it isa simple matter of just upgrading to faster lLAN switches and NICs or addingmemory, CPUs, or disks. However, overprovisioning might not be viable in certaincases, for example when dealing with relatively expensive long-haul WAN links,resources that on average are underutilized, or sources that are busy only duringshort peak periods.

Managing and controlling allows allocation of network and computing resources.Better management of existing resources attempts to optimize utilization of existingresources such as limited bandwidth, CPU cycles, and network switch buffermemory.



QoS ComponentsTo give you enough background on the fundamentals and an implementationperspective, this section describes the overall network and systems architecture andidentifies the sources of delays. It also explains why QoS is essentially aboutcontrolling network and system resources in order to achieve more predictabledelays for preferred applications.

Implementation FunctionsThree necessary implementation functions are:

■ Traffic Rate Limiting and Traffic Shaping – Token Leaky Bucket Algorithm.Network traffic is always bursty. The level of burstiness is controlled by the timeresolution of the measurements. Rate limiting controls the burstiness of the trafficcoming into a switch or server. Shaping refers to the smoothing of the egresstraffic. Although these two functions are opposite, the same class of algorithms isused to implement both.

■ Packet Classification – Individual flows must be identified and classified at linerate. Fast packet classification algorithms are crucial, as every packet must beinspected and matched against a set of rules that determine the class of servicethe specific packet should receive. The packet classification algorithm has seriousscalability issues; as the number of rules increases, it takes longer to classify apacket.

■ Packet Scheduling – To provide differentiated services, the packet schedulermust decide quickly which packet to schedule and when. The simplest packetscheduling algorithm is strict priority. However, this often does not work becauselow-priority packets are starved and might never get scheduled.

QoS Metrics

QoS is defined by a multitude of metrics. The simplest is bandwidth, which can beconceptually viewed as a logical pipe of a larger pipe. However, actual networktraffic is bursty, so a fixed bandwidth would be wasteful because at one instant intime one flow might use 1 percent of this pipe while another might need 110 percentof the allocated pipe. To reduce waste, certain burst metrics are used to determinehow much of a burst and how long a burst can be tolerated. Other important metricsthat directly impact the quality of service include packet loss rate, delay, and jitter(variation in delay). The network and computing components that control thesemetrics are described later in this chapter.



Network and Systems Architecture OverviewTo fully understand where QoS fits into the overall picture of network resources, it isuseful to take a look at the details of the complete network path traversal, startingfrom the point where a client sends a request, traverses various network devices,and finally arrives at the destination where the server processes the request.

Different classes of applications have different characteristics and requirements (seethe “The Need for QoS” on page 92 for additional details). Because several federatednetworks with different traffic characteristics are combined, end-to-end QoS is acomplex issue.

FIGURE 4-10 illustrates a high-level overview of the components involved in an end-to-end packet traversal for an enterprise that relies on a service provider. Twodifferent paths are shown. Both originate from the client and end at a server.



FIGURE 4-10 Overview of End-to-End Network and Systems Architecture

PDA

PSTNPSTN

MAE

NAP

Switch delay=ƒ(queueing, scheduling, packet classification lookup time, route lookup time, congestion, backplane

Link delay=ƒ(Propagation, LineRate)

Server delay=ƒ(CPU,memory, disk )

Cable modem DSLmodem

Cable

HFC

DSL Mobile wireless Dial upRemote access users

Tier 2 ISP/access networks

IPEthernet

IPEthernet

IPPPPAALS/ATMSONET

IPPPPL2TPEthernet

IPEthernetCDMA/GPRS/UMTS

IPPPPV.34, V.90,56kbps

IPDOCSIS

Cableheadend

ATM

ATM

ATM

Metro areanetwork

CO/POPAccessswitch

GPRS

GPRS Edge


ATM

ATM

ATM

Metroarea

network

Accessswitch

Enterprise network

Firewall

CSU/DSU

Switch

Enterprise network

FirewallCSU/DSU

Corporateuser

Switch

Tier 1 ISPbackbone network

A

C

E

F

G

H

D

1

2

3

4

B

Peeringpoints

Voice circuit

Leas

ed li

ne

T1T1



Path A-H is a typical scenario, where the client and server are connected to differentlocal ISPs and must traverse different ISP networks. Multiple Tier 1 ISPs can betraversed, connected together by peering points such as MAE-East or privatepeering points such as Sprint’s NAP.

Path 1-4 shows an example of the client and server connected to the same localTier 2 ISP, when both client and server are physically located in the samegeographical area.

In either case, the majority of the delays are attributed to the switches. In the Tier 2ISPs, the links from the end-user customers to the Tier 2 ISP tend to be slow links,but the Tier 2 ISP aggregates many links, hoping that not all subscribers will use thelinks at the same time. If they do, packets get buffered up and eventually aredropped.

Implementing QoS

You can implement QoS in many different ways. Each domain has control over itsresources and can implement QoS on its portion of the end-to-end path usingdifferent technologies. Two domains of implementation are enterprises and networkservice providers.

■ Enterprise – Enterprises can control their own networks and systems. From alocal Ethernet or token ring LAN perspective, IEEE 801.p can be used to markframes according to priorities. These marks allow the switch to offer preferentialtreatment to certain flows across VLANS. For computing devices, there arefacilities that allow processes to run at higher priorities, thus obtainingdifferentiated services from a process computing perspective.

■ Network Service Provider (NSP) – The NSP aggregates traffic and forwards eitherwithin its own network or hands off to another NSP. The NSP can usetechnologies such as DiffServ or IntServ to prioritize the handling traffic withinits networks. Service Level Agreements (SLAs) are required between NSPs toobtain a certain level of QoS for transit traffic.

ATM QoS Services

It is interesting that NSPs implement QoS at both the IP layer and the asynchronoustransfer mode (ATM) layer. Most ISPs still have ATM networks that carry IP traffic.ATM itself offers six types of QoS services:

■ Constant Bit Rate (CBR) – Provides a constant bandwidth, delay, and jitterthroughout the life of the ATM connection.

■ Variable Bit Rate-Real Time (VBR-rt) – Provides constant delay and jitter, butvariations in bandwidth.



■ Variable Bit Rate-Non Real Time (VBR-nrt) – Provides variable bandwidth, delay,and jitter, but has a low cell loss rate.

■ Unspecified Bit Rate (UBR) – Provides “Best Effort” service but no guarantees.

■ Available Bit Rate (ABR) – Provides no guarantees and expects the applications toadapt according to network availability.

■ Guaranteed Frame Rate (GFR) – Provides some minimum frame rate, deliversentire frame or none, and is used for ATM Adaptation Layer 5 (AAL5).

One of the main difficulties in providing an end-to-end QoS solution is that so manyprivate networks must be traversed, and each network has its own QoSimplementations and business objectives. The Internet is constructed so thatnetworks interconnect or “peer” with other networks. One network might need toforward traffic of other networks. Depending on the arrangements, competitorsmight not forward the traffic in the most optimal manner. This is what is meant bybusiness objectives.

Sources of Unpredictable DelayFrom a system computing perspective, unpredictable delays are often due to limitedCPU resources or disk I/O latencies. These degrade during a heavy load. From anetwork perspective, many components add up to the cumulative end-to-end delay.This section describes some of the important components that contribute to delayand explains the choke points at the access networks, where the traffic is aggregatedand forwarded to a backbone or core. Service providers overallocate their networksto increase profits and hope that not all subscribers will access the network at thesame time.

FIGURE 4-11 was constructed by taking out path A-G in FIGURE 4-10 and projecting itonto a Time-Distance plane. This is a typical Web client accessing the Internet site ofan enterprise. The vertical axis indicates the time that elapsed for a packet to travela certain link segment. The horizontal axis indicates the link segment that the packettraverses. At the top, we see the network devices and vertical lines that project downto the distance axis, showing the corresponding link segment. In this illustration, anIP packet’s journey starts when a user clicks on a Web page. The HTTP request mapsfirst to a TCP three-way handshake to create a socket connection. The first TCPpacket is the initial SYN packet, which first traverses segment 1 and is usually quiteslow because this link is typically 30 kbit/sec over a 56 kbit/sec modem, dependingon the quality and distance of the “last mile” wiring.



FIGURE 4-11 One-Way End-to-End Packet Data Path Transversal

Network Delay is composed of the following components:

■ Propagation delay that depends on the media and distance

■ Line rate that primarily depends on the link rate and loss rate or Bit Error Rate(BER)

■ Node transit delay that is the time it takes a packet to traverse an intermediatenetwork switch or router

The odd-numbered links of FIGURE 4-11 represent the link delays. Note that segmentand link are used interchangeably.

■ Link 1, in a typical deployment, is the copper wire, or the “last mile” connectionfrom the home or Small Office/Home Office (SOHO) to the Regional BellOperating Company (RBOC). This is how a large portion of consumer clientsconnect to the Internet.

Dial up



Enterprisenetwork

Tier 1 ISPbackbone network

1

2

3

4

5

6

789

1011

12

13

141516

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Tim

e de

lays

Distance - various physical networks(Not to scale)

Accessswitch

AccessswitchATM ATM

OC3

T1-OC3

T1-OC3

Core

MPLS

D1-OC48

T1Leased

line

Ethernet

56 KbpsPOTS

Packet



■ Link 3 is an ATM link inside the carrier’s internal network, usually aMetropolitan Area Network link.

■ Link 5 connects the Tier 2 ISP to the Tier 1 ISP.

This provides a Backbone Network. This link is a larger pipe, which can rangefrom T1 to OC-3 while growing.

■ Link 7 is the Core Network of the backbone Tier 1 provider.

Typically, this core is extremely fast, consisting of DS3 links (the same ones usedby IDT) or more modern links (like those used by VBNS of OC-48) and links thatare beta testing OC-192 links while running Packet over SONET and eliminatingthe inefficiencies of ATM altogether.

■ Links 9 and 11 are a reflection of links 5 and 3.

■ Link 13 is a typical leased line, T1 link to the enterprise. This is how mostenterprises connect to the Internet. However, after the 1996 TelecommunicationsAct, competitive local exchange carriers (CLECs) emerged. CLECs providesuperior service offerings at lower prices. Providers such as Qwest and Telseonprovide gigabit Ethernet connectivity at prices that are often below OC-3 costs.

■ Link 15 is the enterprise’s internal network.

There should be a channel service time division multiplexing (TDM) and dataservice device (data side) that terminates the T1 line and converts it to Ethernet.

The even-numbered links of FIGURE 4-11 represent the delays experienced inswitches. These delays are composed of switching delays, route lookups, packetclassification, queueing, packet scheduling, and internal switch forwarding delays,such as sending a packet from the ingress unit through the backplane to the egressunit.

As FIGURE 4-11 illustrates, QoS is needed to control access to shared resources duringepisodes of congestion. The shared resources are servers and specific links. Forexample, Link 1 is a dedicated point-to-point link, where a dedicated voice channelis set up at call time with a fixed bandwidth and delay. Link 13 is a permanentcircuit as opposed to a switched dedicated circuit. However, this is a digital line.QoS is usually implemented in front of a congestion point. QoS restricts the trafficthat is injected into the congestion point. Enterprises have QoS functions that restrictthe traffic being injected into their service provider. The ISP has QoS functions thatrestrict the traffic injected into their core. Tier 2 ISPs oversubscribe their bandwidthcapacities, hoping that not all their customers will need bandwidth at the same time.During episodes of congestion, switches buffer packets until they can be transmitted.Links 5 and 9 are boundary links that connect two untrusted parties. The Tier 2 ISPmust control the traffic injected into the network that must be handled by the Tier 1ISP’s core network. Tier 1 polices the traffic that customers inject into the network atLinks 5 and 9. At the enterprise, many clients need to access the servers.



QoS-Capable DevicesThis section describes the internals of QoS-capable devices. One of the difficulties ofdescribing QoS implementations is the number of different perspectives that can beused to describe all the features. The scope of this section is limited to the priority-based model and the related functional components to implement this model. Thepriority-based model is the most common implementation approach because of itsscalability advantage.

Implementation ApproachesThere are two completely different approaches to implementing a QoS-capable IPswitch or server:

The Reservation Model, also known as Integrated Services/RSVP or ATM, is theoriginal approach, requiring applications to signal their traffic handlingrequirements. After signaling, each switch that is in the path from source todestination reserves resources, such as bandwidth and buffer space, that eitherguarantee the desired QoS service or ensure that the desired service is provided.This model is not widely deployed because of scalability limitations. Each switch hasto keep track of all this information for each flow. As the number of flows increases,the amount of memory and processing increases, hence limiting scalability.

The Precedence Priority Model, also known as Differentiated Services, IP PrecedenceTOS, or IEEE 802.1pQ, takes aggregated traffic, segregates the traffic flows intoclasses, and provides preferential treatment of classes. It is only during episodes ofcongestion that noticeable differentiated services effects are realized. Packets aremarked or tagged according to priority. Switches then read these markings and treatthe packets according to their priority. The interpretation of the markings must beconsistent within the autonomous domain.

Functional Components—High-Level Overview“Implementation Functions” on page 95 describes the three high-level QoScomponents: traffic shaping, packet classification, and packet scheduling. Thissection describes these QoS components in further detail.

A QoS-capable device consists of the following functions:



■ Admission Control accepts or rejects access to a shared resource. This is a keycomponent for Integrated Services and ATM networks. Admission control ensuresthat resources are not oversubscribed. Due to this, admission control is moreexpensive and less scalable than other components.

■ Congestion Management prioritizes and queues traffic access to a sharedresource during congestion periods.

■ Congestion Avoidance prevents congestion early, using preventive measures.Algorithms such as Weighted Random Early Detection (WRED) exploit TCP’scongestion avoidance algorithms to reduce traffic injected into the network,preventing congestion.

■ Traffic Shaping reduces the burstiness of egress network traffic by smoothing thetraffic and then forwarding it out to the egress link.

■ Traffic Rate Limiting controls the ingress traffic by dropping packets that exceedburst thresholds, thereby reducing device resource consumption such as buffermemory.

■ Packet Scheduling schedules packets out the egress port so that differentiatedservices are effectively achieved.

The next section describes the modules that implement these high-level functions inmore detail.

QoS ProfileThe QoS Profile contains information put in by the network or systems administratoron the definition of classes of traffic flows and how these flows should be treated interms of QoS. For example, a QoS profile might have a definition that Web trafficfrom the CEO should be given EF DiffServ Marking, Committed Information Rate(CIR) 1 Mbit/sec, Peak Information Rate (PIR) 5 Mbit/sec, Excess Burst Size (EBS)100 Kbyte, and Committed Burst Size (CBS) 50 Kbyte. This profile defines the flowand level of QoS the Web traffic from the CEO should receive. This profile iscompared against the actual measured traffic flow. Depending on how the actualtraffic flow compares against this profile, the type of service (TOS) field of the IPheader is re-marked or an internal tag is attached to the packet header, whichcontrols how the packet is handled inside this device.

FIGURE 4-12 shows the main functional components involved in delivering prioritizeddifferentiated services that apply to a switch or a server. These include the packetclassification engine, the metering, the marker function, policing/shaping, I/Pforwarding module, queuing, congestion control management, and the packetscheduling function.



FIGURE 4-12 QoS Functional Components

Deployment of Data and Control PlanesTypically, if the example in FIGURE 4-12 were deployed on a network switch, therewould be an ingress board and an egress board connected together through abackplane. It would be deployed on a server. These functions would be

Flow 1

Flow n

Packetclassification

Marker Policer/shaper

IPforwarding

Meter

MarkerIP forwardinginformation

base

Packetscheduler

Policer/shaper

Meter

QoS

profiles

QoS

profiles

Queuingcongestion

control

Control and

management

plane

Data plane



implemented in the network protocol stack, either in the IP module, adjacent to theIP module, or possibly on the network interface card, offering superior performancedue to the ASIC/FPGA implementation.

There are two planes:

■ The Data Plane operates the functional components that actually read and writethe IP header.

■ The Control Plane operates the functional components that control how thefunctional units read information from the Network Administrator, directly orindirectly.

Packet Classifier

The Packet Classifier is a functional component responsible for identifying a flowand matching it with a filter. The filter is composed of source and destination, IPaddress, port, protocol, and the type of service field—all in the IP Header. The filteris also associated with information that describes the treatment of this packet.Aggregate ingress traffic flows are compared against these filters. Once a packetheader is matched with a filter, the QoS profile is used by the meter, marker,policing, and shaping functions.

Metering

The metering function compares the actual traffic flow against the QoS profiledefinition. FIGURE 4-13 illustrates the different measurement points. On average, theinput traffic arrives at 100 Kbyte/sec. However, for a short period of time, the switchor server allows the input flow rate to reach 200 Kbyte/sec for one second, whichcomputes to a buffer of 200 Kbyte. For the time period of t=3 to t=5, the buffer drainsat a rate of 50 Kbyte/sec as long as the input packets arrive at 50 Kbyte/sec, keepingthe output constant. Another more aggressive burst arrives at the rate of 400Kbyte/sec for 5.5 sec, filling up the 200 Kbyte buffer. From t=5.0 to 5.5, however, 50Kbyte are drained, leaving 150 Kbyte at t=5.5 sec. This buffer drains for 1.5 sec at arate of 100 Kbyte/sec. This example is simplified, so the real figures need to beadjusted to account for the fact that the buffer is not completely filled at t=5.5 secbecause of the concurrent draining. Notice that the area under the graph, or theintegral, represents the approximate number of bytes in the buffer, and burstsrepresent the high sloped lines above the dotted line, representing the average rateor the CIR.



FIGURE 4-13 Traffic Burst Graphic

Marking

Marking is tied in with metering so that when the metering function compares theactual measured traffic against the agreed QoS profile the traffic is handledappropriately. The measured traffic measures the actual burst rate and amount ofpackets in the buffer against the CIR, PIR, CBS, and EBS. The Two Rate Three Color(TrTCM) algorithm is a common algorithm that marks the packets green if the actualtraffic is within the agreed-upon CIR. If the actual traffic is between CIR and PIR, thepackets are marked yellow. Finally, if the actual metered traffic is at PIR or above,the packets are marked red. The device then uses these markings on the packet inthe policing and shaping functions to determine how the packets are treated (forexample, whether the packets should be dropped, shaped, or queued in a lowerpriority queue).

900

800

700

600

500

400

300

200

100

00 1 2 3 4 5 6 7 8 9

EBS= 200 Kbyte

CBS= 200 Kbyte

Packets drain

Packets drain

PIR= 400 Kbyte/sec.

Burst rate at200 Kbyte/sec.

CIR= 100 Kbyte/sec.

Time

Kby

teBursts - must buffer packets



Policing and Shaping

The policing functional component uses the metering information to determine if theingress traffic should be buffered or dropped. Shaping pumps out the packets at aconstant rate, buffering packets to achieve a constant output rate. The commonalgorithm used here is the Token Bucket algorithm to shape the egress traffic and topolice ingress traffic.

IP Forwarding Module

The IP forwarding module inspects the destination IP address and determines thenext hop using the forwarding information base. The forwarding information base isa set of tables populated by routing protocols and/or static routes. The packet isthen forwarded internally to the egress board, which places the packet in theappropriate queue.

Queuing

Queuing encompasses two dimensions, or functions. The first function is congestioncontrol that controls the number of packets queued up in a particular queue (see thefollowing section). The second function is differential services. Differential servicesqueues are serviced by the packet scheduler in a certain manner (providingpreferential treatment to preselected flows) by servicing packets in certain queuesmore often than others.

Congestion Control

There is a finite amount of buffer space or memory, so the number of packets thatcan be buffered within a queue must be controlled. The switch or server forwardspackets at line rate. However, when a burst occurs, or if the switch is oversubscribedand congestion occurs, packets are buffered. There are several packet discardalgorithms. The simplest is Tail Drop: Once the queue fills up, any new packets aredropped. This works well for UDP packets, but causes severe disadvantages for TCPtraffic. Tail Drop causes TCP traffic in already-established flows to quickly go intocongestion avoidance mode, and it exponentially drops the rate at which packets aresent. This problem is called global synchronization. It occurs when all TCP trafficsimultaneously increases and decreases flow rates. What is needed is to have someof the flows slow down so that the other flows can take advantage of the freed-upbuffer space. Random Early Detection (RED) is an active queue managementalgorithm that drops packets before buffers fill up and randomly reduces globalsynchronization.



FIGURE 4-14 describes the RED algorithm. Looking at line C on the far right, when theaverage queue occupancy goes from empty up to 75 percent full, no packets aredropped. However, as the queue grows past 75 percent, the probability that randompackets are discarded quickly increases until the queue is full, where the probabilityreaches certainty. Weighted Random Early Detection (WRED) takes RED one stepfurther by giving some of the packets different thresholds at which packetprobabilities of discard start. As illustrated in FIGURE 4-14, Line A starts to getrandom packets dropped at only 25 percent average queue occupancy, making roomfor higher-priority flows B and C.

FIGURE 4-14 Congestion Control: RED, WRED Packet Discard Algorithms

1.0

.75

.50

.25

00 25% 50% 75% 100%

Full queue

Average queue occupancy

Dro

p pr

obab

ility

Packets

WRED queue

A B C



Packet Scheduler

The packet scheduler is one of the most important QoS functional components. Thepacket scheduler pulls packets from the queues and sends them out the egress portor forwards them to the adjacent STREAMS module, depending on implementation.There are several packet scheduling algorithms that service the queues in a differentmanner. Weighted Round-Robin (WRR) scans each queue, and depending on theweight assigned a certain queue, allows a certain number of packets to be pulledfrom the queue and sent out. The weights represent a certain percentage of thebandwidth. In actual practice, unpredictable delays are still experienced because alarge packet at the front of the queue can hold up smaller-sized packets behind it.Weight Fair Queuing (WFQ) is a more sophisticated packet scheduling algorithmthat computes the time the packet arrives and the time to actually send out the entirepacket. WFQ is then able to handle varying-sized packets and optimally selectpackets for scheduling. WFQ conserves work, meaning that no packets wait idlywhen the scheduler is free. WFQ can also put a bound on the delay, as long as theinput flows are policed and the lengths of the queues are bound. In Class-BasedQueuing (CBQ), used in many commercial products, each queue is associated with aclass, where higher classes are assigned a higher weight translating to relativelymore service time from the scheduler than the lower-priority queues.

Competitive product offerings by Packeteer and Allot offer hardware solutions thatsit between the clients and servers. These products offer pure QoS solutions, butthey use the term policy as a specific QoS rule. These products are limited in theirflexibility and integration with policy servers.

Secure Sockets LayerIn 1994, Netscape Communications proposed SSL V.1 and shipped the first productswith SSL V.2. SSL V3 was introduced to address some of the limitations of SSL V2 inthe area of cryptographic security limitations and functionality. Transport LayerSecurity (TLS) was created to allow an open standard to prevent any one companyfrom controlling this important technology. However, it turns out that even thoughNetscape was granted a patent for SSL, SSL is now the defacto standard for securedWeb transactions.

This section provides a brief overview of the SSL protocol, and it then describesstrategies for deploying SSL processing in the design of data center networkarchitectures.



SSL Protocol OverviewThe basic operation of SSL includes the following phases:

1. Initial Full Handshake – The client and server authenticate each other, exchangekeys, negotiate preferred cryptographic algorithms (such as RSA or 3DES) andperform a CPU-intensive public key cryptographic mathematical computation.This full handshake can occur again during the life of a client servercommunication if the session information is not cached or reused and needs to beregenerated. More details are described below in the Handshake Layer.

2. Data Transfer Phase–Bulk Encryption – Once the session is established, data isauthenticated and encrypted using the master secret.

A typical Web request can span many HTTP requests, requiring that each HTTPsession establish an individual SSL session. The resulting higher performance impactmight not outweigh the marginal incremental security benefit. Hence, a techniquecalled SSL resumption can be exploited to save the session information for aparticular client connection that has already been authenticated at least once.

SSL is composed of two sublayers:

■ Record Layer – This layer operates in two directions:

■ Downstream – The record layer receives clear messages from the handshakelayer. The record layer encapsulates, encrypts, fragments, and compresses themessages using the Message Authentication Code (MAC) operations beforesending the messages downstream to the TCP Layer.

■ Upstream – The record layer receives TCP packets from the TCP layer anduncompresses, reassembles, decrypts, runs a MAC verification, anddecapsulates the packets before sending them to higher layers.

■ Handshake Layer – This layer exchanges messages between client and server inorder to exchange public keys, negotiate and advertise capabilities, and agree on:■ SSL version■ Cryptographic algorithm■ Cipher suite

The cipher suite contains key exchange method, data transfer cipher, andMessage Digest for Message Authentication Code (MAC). SSL 3.0 supports avariety of key exchange algorithms.

FIGURE 4-15 illustrates an overview of the SSL-condensed protocol exchanges.



FIGURE 4-15 High-Level Condensed Protocol Overview

Once the first set of messages is successfully completed, an encryptedcommunication channel is established.

The following sections describe the differences between using a pure softwaresolution and an SSL accelerator appliance in terms of packet processing andthroughput.

Transport: TCP

Fragment Message Reassemble Message

Compress Message Uncompress Message

Encrypt Message Decrypt Message

Certificate

CertificateVerifyChangeCipherSpec

Finished

ClientKeyExchange

AuthenticateMessage (MAC)

VerifyMessage (MAC)

HTTP

Network: IP

Data Link:Ethernet

PHY

Transport: TCP

Fragment Message Reassemble Message

Compress Message Uncompress Message

Encrypt Message Decrypt Message

AuthenticateMessage (MAC)

VerifyMessage (MAC)

SSL Handshake SubLayer

HTTP

Client Server

Network: IP

Data Link:Ethernet

PHY

ClientHelloServerHelloCertificate

CertificateRequestServerKeyExchange

ServerHelloDone

ChangeCipherSpec

Finished

SSL Record SubLayer



We will not be discussing SSL in depth. The purpose of this section is to describe thedifferent network architectural deployment scenarios you can apply to SSLprocessing. The following sections describe various approaches to scaling SSLprocessing capabilities from a network architecture perspective.

SSL Acceleration Deployment ConsiderationsOne of the fundamental limitations of SSL is performance. When SSL is added to aWeb server, performance drops dramatically because of the strain on the CPUcaused by the mathematical computations and the number of sessions thatconstantly need to be set up. There are three common SSL approaches:

■ Software-SSL libraries – This approach uses the bundled SSL libraries and offersthe most cost-effective option for processing SSL transactions.

■ Crypto Accelerator Board – This approach can offer a massive improvement inperformance for SSL processing for certain types of SSL traffic. “ConclusionsDrawn from the Tests” on page 121 suggests when best to use the Sun™ CryptoAccelerator 1000 board, for example.

■ SSL Accelerator Appliance – This solution might have a high initial cost, but itproves to be very effective and manageable for large-scale SSL Web server farms.“Conclusions Drawn from the Tests” on page 121 suggests when best to deployan appliance such as Netscaler or ArrayNetworks.

There are several deployment options for SSL acceleration. This section describeswhere it makes sense to deploy different SSL acceleration options. It is important toconsider certain characteristics, including:

■ The level or degree of security■ The number of client SSL transactions■ The volume of bulk encrypted data to be transferred in the secure channel■ Cost■ The number of horizontally scaled SSL Web servers.

Software-SSL Libraries—Packet Flow

FIGURE 4-16 shows the packet flow for a software-based approach to SSL processing.Although the path seems direct, the SSL processing is bottlenecked by millions ofCPU cycles consumed in the processing of cryptographic algorithms such RSA and3DES.



FIGURE 4-16 Packet Flow for Software-based Approach to SSL Processing

The Crypto Accelerator Board—Packet Flow

FIGURE 4-17 shows the packet flow using a PCI accelerator card for SSL acceleration.In this case, the incoming encrypted packet reaches the SSL libraries. The SSLlibraries maintain various session information and security associations, but themathematical computations are offloaded to the PCI accelerator card, which contains

Application

Memory

SSL

NIC

TCP

IP

PCI

STREAM Head

NIC Driver

PCI Bridge



an ASIC that can compute the cryptographic algorithms in very few clock cycles.However, there is a an overhead of transferring data to the card, as the PCI bus mustfirst be arbitrated and traversed. Note that in the case of small data transfers, theoverhead of PCI transfers might not outweigh the benefit of the cryptographiccomputation acceleration offered by the card. Further, it is important to make surethe PCI slot used is 64 bit and 66 MHz. Using a 32-bit slot could have a performanceimpact.

FIGURE 4-17 PCI Accelerator Card Approach to SSL Processing—Partial Offload

NIC

Application

Memory

SSL

SSLAccelerator

STREAM Head

TCP

IP

NIC Driver

PCI Bridge

PCI

1 32



SSL Accelerator Appliance—Packet Flow

FIGURE 4-18 illustrates how a typical SSL accelerator appliance can be exploited toreduce load on servers by offloading front-end client SSL processing. CommercialSSL accelerators at the time of this writing are all PC-based boxes with PCIaccelerator cards. The operating systems and network protocol stack are optimizedfor SSL processing. The major benefit to the backend servers is that CPU cycles arefreed up by not having to process thousands of client SSL transactions. Theaccelerator can either offload all SSL processing and forward cleartext to the serveror terminate all client SSL connections and maintain only one SSL connection to thetarget server, depending on the customer’s requirements.



FIGURE 4-18 SSL Appliance Offloads Frontend Client SSL Processing

Application

Memory

SSL Appliance offloads clientfront-end SSL processing

SSL

STREAMHead

TCP

IP

NIC Driver

PCI Bridge

PCI

SSL Accelerator

Persistent SSL connection appliance - server

NIC

Server

Client 1Client 2

Client 5000

Server

Server



SSL Performance TestsTo gain a better understanding of the trade-offs of the three approaches to SSLacceleration, we ran various tests using the Sun Crypto Accelerator 1000 board,Netscaler 9000 SSL Accelerator appliance, and ArrayNetworks SSL acceleratorappliance.

Due to limited time and resources, tests were selected that enabled us to comparekey attributes among approaches. In the first test, we compare raw SSL processingdifferences between SSL libraries and an appliance.

Test 1: SSL Software Libraries versus SSL AcceleratorAppliance—Netscaler 9000

In this test, we looked closely at CPU utilization and network traffic with a softwaresolution. We found a tremendous load on the CPU, which completely pinned theCPU. It took over two minutes to complete 100 SSL transactions.

We then looked at CPU utilization and network traffic using an SSL appliance withthe exact same server and client used in the first example. With this setup, it tookunder one second to complete 100 SSL transactions.

The main reason the SSL appliance is so much faster is that the appliance maintainsfew long-lived SSL connections on the server. Hence the server is less burdened withrecalculating cryptographic computations, which are CPU intensive—as is setting upand tearing down SSL sessions. The appliance terminates the SSL session betweenthe client and the appliance and then reuses the SSL connection at the backend withthe servers.

FIGURE 4-19 SSL Test Setup with No Offload

deepak2129.146.138.98

deepak129.146.138.99

Client - benchmark Alteon Switch SSL Libs

Sun ONE Web Server



Test 1 (A)—Software-SSL Libraries

We used an industry standard benchmark load generator on the client to generateSSL traffic. Both tests ran the same tests on the same 100-megabyte server file.

100 requests were injected into the SSL Web server in concurrency of 10 requests.

Test 1 (B)—SSL Accelerator Appliance

We used the Netscaler 9000 SSL Accelerator device on the client to generate SSLtraffic. Both tests ran on the same 100-megabyte server file.

The performance gains using the SSL offload device were significant. Some of thekey reasons include:

■ Hardware SSL implementation, including hardware coprocessor formathematically intensive computations of cryptographic algorithms.

■ Reuse of backend SSL tunnel. By keeping one SSL tunnel alive and reusing it, theresult is massive server SSL offload.

We ran the benchmark load generator on client (deepak2). The client points to theVIP on the Netscaler, which terminates one side of the SSL connection. The Netscalerthen reuses the backend SSL connection. This is also more secure because the clientis unaware of the backend servers and hence can do less damage:

612 packets were transferred to complete 100 SSL handshakes in less that onesecond!

Test 2: Sun Crypto Accelerator 1000 Board

In this test set, we leveraged the work done by the Performance Availability andEngineering group regarding performance tests of the Sun Crypto Accelerator 1000board. The test setup consisted of a Sun Fire™ 6800, using eight 900-MHzUltraSPARC™ III processors, and a single Sun Crypto Accelerator 1000 board.FIGURE 4-20 shows that the throughput increases linearly as the number of processorsincreases on the software approach versus the near-constant performance at 500Mbit/sec using the Sun Crypto Accelerator 1000 board. Tests show that the idealbenefit of the accelerator board results when the minimum message size exceeds1000 bytes. If the messages are too small, the benefit of the card acceleration does notoutweigh the overhead of diverting SSL computations from the CPU to the boardand back.

#abc -n 100 -c 10 -v 4 http://129.146.138.52:443/100m.file1>./netscaler100mfel1n100c10.softwareonly



FIGURE 4-20 Throughput Increases Linearly with More Processors

Test 3: SSL Software Libraries versus SSL AcceleratorAppliance—Array Networks

In this set of tests, we performed more detailed tests to better understand not onlythe value of the SSL appliance, but the impact of threads and file size. FIGURE 4-21shows the basic test setup for the SSL software test, where an Sun Enterprise™ 450server used as the client was sufficient to saturate the Sun Blade™ server.

FIGURE 4-21 SSL Test Setup for SSL Software Libraries

FIGURE 4-22 shows the SSL appliance tests. Larger clients were required to saturatethe servers. We used two additional Sun Fire 3800 servers in addition to theEnterprise 450 server. The reason for this was that the SSL appliance terminated theSSL connection, performed all SSL processing, and maintained very few socketconnections to the backend servers, thereby reducing the load on the servers.

1 2 3 4 5 6 7 8

100

200

300

400

500Mbps

Number of CPUS (Ultra SPARCIII - 900MHz)

SSLThroughput

Client E4504 cpu 450 mhz

FoundryBig Iron

4000

B1600SunBlade650 Mhz



FIGURE 4-22 SSL Test Setup for an SSL Accelerator Appliance

FIGURE 4-23 suggests that there is a sweet spot for the number of threads to be usedfor the client load generator. After a certain point, performance drops. This suggeststhat the SSL processing of software only approaches benefits from increased threadsup to a certain maximum point. These are initial tests and not comprehensive by anymeans. Our intent is to show that this is one potentially important configurationconsideration, which might be beyond the scope of pure design.

FIGURE 4-23 Effect of Number of Threads on SSL Performance

FIGURE 4-24 shows the impact of file size on SSL performance. Note that these are SSLencrypted bulk files. The SSL appliance has a dramatic impact on increasingperformance of SSL throughput for large files. However, the number of transactions

Client SF38008 cpu 900 mhz

Client E4504 cpu 450 mhz

FoundryBig Iron

4000

ArrayNetworks

SSL Appliance

B1600SunBlade650 Mhz

Client SF38008 cpu 900 mhz

5 10 15 20 30 40 50

200

400

600

800

SSLFetches/Sec

Number of Threads



decreases in direct proportion to the file size. The link was a 1-gigabit pipe, whichcan support 125 MByte/sec throughput. The results show that the limiting factoractually is not the network pipe.

FIGURE 4-24 Effect of File Size on SSL Performance

Conclusions Drawn from the TestsThe software solution is best used in situations that require relatively few SSLtransactions throughput, which is typical for sign-on and credit card Web-basedtransactions, where only certain aspects require SSL encryption.

The PCI accelerator card dramatically increases performance at relatively low cost.The PCI card also offers true end-to-end security and is often desirable for highlysecure environments.

1KB 100KB 500KB

20

40

60

80

100

Fetches/Sec

SSL Performance and File Size

800KB

600KB

400KB

200KB

1MB

Kilobytes (KB)Transferred/Sec

File Size Kilobytes (KB)



The accelerator device can be installed in an existing infrastructure and can offervery good performance. The servers do not need to be modified. Hence, only onedevice must be managed for SSL acceleration. Another benefit is that the applianceexploits the fact that not every server will be loaded with SSL at the same time.Hence, from a utilization standpoint, an appliance is more economically feasible.



CHAPTER 5

Server Network Interface Cards:Datalink and Physical Layer

This chapter discusses the networking technologies available through SunMicrosystems, which are regularly found in a data center. In many cases, a high-level overview of the networking technology is provided for completeness. Thetechnologies covered include:

■ Token Ring Networks■ Fiber Distributed Data Interface (FDDI) Networking■ Ethernet Networking

Token Ring NetworksA token ring network is a physically star-wired local area network that interconnectsvarious devices such as personal computers and workstations into a logical ringconfiguration. The cabling system consists of wiring concentrators, connectors, andend stations.

The Sun token ring protocol conforms to the IEEE 802.5-1988 standard. Token ringrefers to the media access control (MAC) portion of the link layer (DLC) as well asthe entire physical layer (PHY). Access to the ring is controlled by a bit pattern,called a token, that circulates from station to station around the ring. Any station canuse the ring. Capturing the token means that a station changes the token bit patternso that it is no longer that of a token but that of a data frame. The sending stationthen sends its data within the information field of the frame. The frame also includesthe destination address of the destination station. The frame is passed from stationto station until it arrives at the proper destination. At the destination station, theframe is altered to indicate that the address was recognized and that the data wascopied. The frame is then passed back to the original sending station, where the

123


sending station checks to see that the destination station copied the data. If there isno more data to be sent, the sending station alters the frame’s bit configuration sothat it now functions as a free token available to another station on the ring.

If a station fails, it is physically switched out of the ring, dynamically. The ring isthen automatically reconfigured. When the station has been repaired, the ring isagain automatically reconfigured to include the added station.

FIGURE 5-1 Token Ring Network

A

BFreetoken

Freetoken

Data

Data

Station A changes free token to busy.Station A sends data.Station D repeats data.Station C receives data.Station B will repeat data.

Station A receives its own busy token.Station A generates new free token.

Station A is waiting to send data to Station C.Station A waits for free token.

C

D

A

B

C

D

A

B

C

D



Token Ring InterfacesSun supports two token ring drivers for its range of SPARC® platforms. This sectiondescribes the token ring interfaces in detail.

The SBus-based and the PCI token ring interfaces provide access to 4-Mbit/sec or 16-Mbit/sec token ring local area networks. SunTRI/S software supports the IEEE 802.5standards for token ring networks.

The IEEE standard specifies the lower two layers of the OSI 7-layer model. The twolayers are the Physical layer (Layer 1) and the Data Link layer (Layer 2). The DataLink layer is further divided into the Logical Link sublayer (LLC) and the MediaAccess Control (MAC) sublayar.

The token ring driver is a multi-threaded, loadable, clonable, STREAMS hardwaredriver that supports the connectionless Data Link Provider Interface, dlpi(7p), over atoken ring controller. Multiple token ring controllers installed within the system aresupported by the driver. SunTRI/S software can support different protocolarchitectures concurrently, via the SNAP encapsulation technique of RFC1042. Fromthis SNAP encapsulation, high-level applications can communicate through theirdifferent protocols over the same SunTRI/S interface. Support also exists for addingdifferent protocol packages (not included with SunTRI/S). These protocol packagesinclude OSI and other protocols available directly from Sun or through third-partyvendors. TCP/IP is implicit with the Solaris operating system.

The software driver also provides source routing, which enables the workstation toaccess multiple ring networks connected by source-route bridges. Locallyadministered addressing is also supported and aids in management of certain user-specific and vendor-specific network configurations.

Support for IBM LAN Manager is provided by the TMS380 MAC-level firmware thatcomplies with the IEEE 802.5 standard.

Configuring the SunTRI/S Adapter with TCP/IPThe Sbus token ring driver is called tr and can be configured using ifconfig onceyou have established that the interface is physically present in the system and thedevice driver is installed. Refer to “Configuring the Network Host Files” onpage 234. The rest of this section describes the configuration of individual

Chapter 5 Server Network Interface Cards: Datalink and Physical Layer 125


parameters of the tr device that can be altered in the driver.conf file and globalparameters that can be altered using /etc/system. TABLE 5-1 describes thetr.conf parameters.

Setting the Maximum Transmission Unit

Sun supports the IEEE 802.5 Token Ring Standard Maximum Transmission Unit(MTU) size of 17800 bytes. All hosts should use the same MTU size on any particularnetwork. Additionally, if different types of IEEE 802 networks are connected bytransparent link layer bridges, all hosts on all of these networks should use the sameMTU size.

The maximum MTU sizes supported are 4472 for 4 Mbit/sec operation and 17800 for16 Mbit/sec operation. These are the rates specified by the token ring chip set on theSunTRI/S adapter. TABLE 5-2 lists the MTU indices and their corresponding sizes.

The default value of the MTU index is 3 (4472 bytes).

TABLE 5-1 tr.conf Parameters

Parameter Description

mtu Maximum transfer unit index

sr Source routing enable

ari Disabling ARI/FCI Soft Error Reporting

TABLE 5-2 MTU Sizes

MTU Index MTU Size (bytes)

0 516

1 1470

2 2052

3 4472

4 8144

5 11407

6 17800



Disabling Source RoutingSource routing is the method used within the token ring network architecture toroute frames through a multiple-ring local area network. A route is a path taken bya frame as it travels through a network from the originating station to a destinationstation.

By default, source routing is enabled. To disable source routing, set the sr value inthe tr.conf file to 0.

Disabling ARI/FCI Soft Error Reporting

In 1989, the Token Ring committee changed its recommendations on the use of theAddress Recognized Indicator/Frame Copied Indicator (ARI/FCI). The oldrecommendation was to use the bits to confirm the receipt or delivery of frames. Thenew recommendation is to use the bits to report soft errors. This recommendationgave rise to issues in networks that had devices developed based on the oldrecommendation and in networks with devices developed since 1989 that did notadhere to the new recommendation.

The ari parameter in the tr.conf file can be used to set ARI/FCI soft errorreporting. By default, the ARI/FCI soft error reporting parameter is enabled. If youhave an older network device, you might need to disable ARI/FCI error reportingby setting the ari parameter to 1.

Configuring the Operating Mode

The SunTRI/S adapter supports both classic and Dedicated Token Ring (DTR)modes of operation. You can use the mode command to set the operating mode.

TABLE 5-3 Source Routing Values


sr 0 Enables source routing

1 Disables source routing

TABLE 5-4 ARI/FCI Soft Error Reporting Values


ari 0 Enables ARI/FCI error reporting

1 Disables ARI/FCI error reporting



By default, the adapter is set to classic mode (half duplex). If the mode is set to DTR,the adapter will come up in full duplex mode. If the mode is set to auto, the adapterwill automatically choose between classic and DTR mode, depending on thecapabilities of the switch or media access unit (MAU).

Resource Configuration Parameter Tuning

The SunTRI/S driver is shipped with 64 2-kilobyte buffers for receiving andtransmitting packets. This configuration should be adequate under normalsituations. However, the token ring interface throughput can be sluggish underheavy load and may even lock up for an indefinite time. This is especially true underNFS-related operations.

This problem can be resolved by increasing the number of buffers available in thedriver. The tunable parameter tr_nbufs can be set in the file /etc/system. Addthis line in the file if it does not already exist:

xxx is the number of 2-kilobyte buffers desired. You should not see a value less thanthe default value of 64. Proper setting of this parameter requires “tuning.” Numbersbetween 400 and 500 should be reasonable for medium load. You must reboot thesystem after you have updated the /etc/system file for the changes to take effect.

Configuring the SunTRI/P Adapter with TCP/IPThe PCI Bus token ring driver is called trp and can be configured using ifconfigonce you have established that the interface is physically present in the system andthe device driver is installed. Refer to “Configuring the Network Host Files” on

TABLE 5-5 Operating Mode Values


mode The Operation mode values can be set to:

0 Classic mode

1 Auto mode

2 DTR mode

# set tr:tr_nbufs=<xxx>



page 234. The rest of this section describes the configuration of individualparameters of the trp device that can be altered in the driver.conf file and globalparameters that can be altered using /etc/system.


Sun supports the IEEE 802.5 Token Ring Standard Maximum Transmission Unit(MTU) size of 17800 bytes.

Configuring the Ring Speed

The ring speed is the number of megabits per second (Mbit/sec) at which theadapter transmits and receives data. The SunTRI/P software sets the ring speed toauto-detect by default. When the workstation enters the token ring,it willautomatically detect the speed at which the ring is running and set itself to that ringspeed. If your workstation is the first workstation on the token ring, the ring speedis set by the hub. However, if your workstation is the first workstation on the tokenring and the token ring has no active hubs, you must set the ring speed manually.Additional workstations that join the token ring will set their ring speedautomatically.

You can set the ring speed using the trpinstance_ring_speed parameter in thetrp.conf file. The trpinstance_ring_speed parameter can be set for eachinterface. For example, setting the trp0_ring_speed parameter affects the trp0adapter.

TABLE 5-6 trp.conf Parameters


mtu Maximum Transfer unit

sr Source routing enable

ari Disabling ARI/FCI Soft Error Reporting

TABLE 5-7 Maximum Transmission Unit


mtu The maximum MTU sizes supported are 4472 for 4 Mbit/secoperation and 17800 for 16 Mbit/sec operation. The default MTUsize is 4472 bytes.



This parameter can be changed to the following settings.

To change the value of the ring speed on trp0 to 4 Mbit/sec and the ring speed ontrp1 to 16 Mbit/sec, change the following settings in the trp.conf file:

Configuring the Locally Administered AddressThe Locally Administered Address (LAA) is part of the token ring standardspecification. You might need to use an LAA for some protocols, such as DECNET orSNA. To use an LAA, create a file with execute permission in the /etc/rcS.ddirectory, such as /etc/rcS.d/S20trLAA, with the ifconfig trinstance etherXX:XX:XX:XX:XX:XX command. The adapter instance is represented by trinstanceand the LAA for that adapter is used in place of XX:XX:XX:XX:XX:XX.

TABLE 5-8 Ring Speed


trpinstance_ring_speed The ring speed setting applied to the node:0= auto-detect (default)4= 4 Mbit/sec16= 16 Mbit/sec

trp0_ring_speed = 4trp1_ring_speed = 16

# /sbin/shcase “$1” in

'start')echo “Configuring Token Ring LAA...”/sbin/ifconfig trX ether XX:XX:XX:XX:XX:XX;;'stop')echo “Stop of Token Ring LAA is not implemented.”;;*)echo “Usage: $0 { start | stop }”;;

esac



For example, to use an LAA of 04:00:ab:cd:11:12 on the tr0 interface, use the followingcommand within the /etc/rcS.d/S20trLAA file:

The least significant bit of the most significant byte of the address used in the abovecommand should never be 1. That bit is individual/group bit and used bymulticasting. For example, the address 09:00:ab:cd:11:12 would be invalid and wouldcause unexpected networking problems.

Fiber Distributed Data InterfaceNetworksA typical Fiber Distributed Data Interface (FDDI) network is based on a dualcounter-rotating ring, as illustrated in FIGURE 5-2. Each FDDI station is connected insequence to two rings simultaneously—a primary ring and a secondary ring. Dataflows in one direction on the primary ring and in the other on the secondary ring.

The secondary ring serves as a redundant path. It is used during stationinitialization and can be used as a backup to the primary ring in the event of astation or cable failure. When a failure occurs, the dual ring is “wrapped” around toisolate the fault and to create a single one-way ring. The components of a typicalFDDI network and the failure recovery mechanism are described in more detail inthe following sections.

# /sbin/ifconfig tr0 ether 04:00:ab:cd:11:12



FIGURE 5-2 Typical FDDI Dual Counter-Rotating Ring

FDDI StationsAn FDDI station is any device that can be attached to a fiber FDDI network throughan FDDI interface. The FDDI protocols define two types of FDDI stations:

■ Single-attached station (SAS)■ Dual-attached station (DAS)

Single-Attached Station

A SAS is attached to the FDDI network through a single connector, called the S-port.The S-port has a primary input (Pin) and a primary output (Pout). Data from anupstream station enters through Pin and exits from Pout to a downstream station,as shown in FIGURE 5-3. Single-attached stations are normally attached to single- anddual-attached concentrators as described in “FDDI Concentrators” on page 134.

FDDI Stations

FDDI Stations

FDDI Stations

FDDI Stations

Primary RingSecondary Ring



FIGURE 5-3 SAS Showing Primary Output and Input

Dual-Attached Station

A DAS is attached to the FDDI network through two connectors, called the A-portand the B-port, respectively. The A-port has a primary input (Pin) and a secondaryoutput (Sout); the B-port has a primary output (Pout) and a secondary input (Sin).

The primary input/output is attached to the primary ring and the secondaryinput/output is attached to the secondary ring. The flow of data during normaloperation is shown in FIGURE 5-4.

To complete the ring, you must ensure that the B-port of an upstream station isalways connected to the A-port of a downstream station. For this reason, most FDDIDAS connectors are keyed to prevent connections between two ports of the sametype.

Single-Attached Station(SAS)

Data to downstream station

Data from upstream station

Pout Pin

MAC

PHY

S-port



FIGURE 5-4 DAS Showing Primary Input and Output

FDDI ConcentratorsFDDI concentrators are multiplexers that attach multiple single-attached stations tothe FDDI ring. An FDDI concentrator is analogous to an Ethernet hub.

The FDDI protocols define two types of concentrator:

■ Single-attached concentrator (SAC)■ Dual-attached concentrator (DAC)

Single-Attached Concentrator

A SAC is attached to the FDDI network through a single connector, which isidentical to the S-port on a single-attached station. It has multiple M-ports to whichsingle-attached stations are connected, as shown in FIGURE 5-5.

Dual-Attached Station(DAS)


Pout Sin

MAC

PHY B

B-port


Data from downstream station

Data to upstream station

Sout Pin

PHY A

A-port



FIGURE 5-5 SAC Showing Multiple M-ports with Single-Attached Stations

Dual-Attached Concentrator

A DAC is attached to the FDDI network through two ports—the A-port and the B-port, which are identical to the ports on a dual-attached station. A DAC has multipleM-ports, to which single-attached stations are connected as shown in FIGURE 5-6.

Dual-attached concentrators and FDDI stations are often arranged in a flexiblenetwork topology called the ring of trees. Additionally, many failover capabilities arebuilt into the FDDI network to ensure it is robust.


S-port

M-port


Single-Attached Concentrator(SAC)

S-port

M-port

S-port


S-port

M-port



Pout Pin



FIGURE 5-6 DAC Showing Multiple M-ports with Single-Attached Stations

FDDI InterfacesSun supports two FDDI drivers for its range of SPARC platforms: the SBus driver,known as the SunFDDI/S, and the PCI driver, known as SunFDDI/P.

The SBus-based and the PCI FDDI interfaces provide access to 100 mbit/s FDDIlocal area networks.


S-port

M-port


Duel-Attached Concentrator(DAC)

S-port

M-port


S-port

M-port


Pout Sin

B-port




Sout Pin

B-port



Configuring the SunFDDI/S Adapter withTCP/IPThe Sbus FDDI driver is called nf and can be configured using ifconfig once youhave established that the interface is physically present in the system and the devicedriver is installed. Refer to “Configuring the Network Host Files” on page 234. Therest of this section describes the configuration of individual parameters of the nfdevice that can be altered in the driver.conf file.


Sun Supports the FDDI maximum transmission unit (MTU) that has been optimizedfor pure FDDI networks.

Target Token Rotation Time

Target token rotation time (TTRT) is the key FDDI parameter used for networkperformance tuning. In general, increasing the TTRT increases throughput andincreases access delay.

For SunFDDI, the TTRT must be between 4000 and 165,000 ms. The TTRT is set to8000 ms by default. The optimum value for the TTRT is dependent on theapplication and the type of traffic on the network:

■ If the network load is irregular (bursty traffic), the TTRT should be set as high aspossible to avoid lengthy queueing at any one station.

■ If the network is used for the bulk transfer of large data files, the TTRT should beset relatively high to obtain maximum throughput without allowing any onestation to monopolize the network resources.

TABLE 5-9 nf.conf Parameters


nf_mtu Maximum Transmission Unit

nf_treq Target Token Rotation Time



nf_mtu The maximum MTU size can be set to a maximum of 4500 bytes.



■ If the network is used for voice, video, or real-time control applications, the TTRTshould be set low to decrease access delay.

The TTRT is established during the claim process. Each station on the ring bids avalue (T_req) for the operating value of the TTRT (T_opr) and the station with thelowest bid wins the claim. Setting the value of T_req on a single station does notguarantee that this bid will win the claim process.

Configuring the SunFDDI/P Adapter withTCP/IPThe PCI bus FDDI driver is called pf and can be configured using ifconfig onceyou have established that the interface is physically present in the system and thedevice driver is installed. Refer to “Configuring the Network Host Files” onpage 234. The rest of this section describes the configuration of individualparameters of the pf device that can be altered in the driver.conf file.


Sun supports the FDDI maximum transmission unit (MTU) that has been optimizedfor pure FDDI networks.

TABLE 5-11 Request Operating TTRT


nf_treq Requested TTRT should be a value in the range 4000 through165,000.

TABLE 5-12 pf.conf Parameters


pf_mtu Maximum transfer unit

pf_treq Target token rotation time



pf_mtu The maximum MTU size can be set up to a maximum of 4500 bytes.



Target Token Rotation TimeThe target token rotation time (TTRT) can also be programmed with the SunFDDI/P.A detailed explanation of the TTRT is provided above with the SunFDDI/S.

Ethernet TechnologyThis section discusses low-level network interface controller (NIC) architecturefeatures. It explains the elements that make up a NIC adapter, breaking it down intothe transmit (Tx) data path and the receive (Rx) data path followed by accelerationfeatures available with more modern NICs.

The components are broken down in this manner to provide the necessary high-levelunderstanding required to discuss the finer details of the Sun NIC devices available.

These broad concepts plus the finer details included in this explanation will helpyou understand the operation of these devices and how to tune them for maximumbenefit in throughput, request/response performance, and level of CPU utilization.These concepts are also useful in explaining the development path that Sun took forits NIC technology. Each concept is retained from one Sun NIC to the next as eachnew product builds on the strengths of the last.

With the NIC architecture concepts in place, the next area of discussion is by far thelargest source of customer discomfort with Ethernet technology: the physical layer.The original ubiquitous Ethernet technology was 10 Mbit/sec. Ethernet technologyhas been improved continuously over the years, going from 10 Mbit/sec to 100Mbit/sec and most recently to 1 Gbit/sec. Along the way Ethernet technologyalways promised to be backward compatible and accomplished this using atechnology called auto-negotiation, which allows new Ethernet arrivals to connect tothe existing infrastructure and establish the correct speed to operate with and bepart of that infrastructure. On the whole, the technology works very well, but thereare some difficulties with understanding the Ethernet physical layer. Hopefully, ourexplanation of this layer will facilitate better use of this feature.

The last addition to the Ethernet technology is network congestion control usingpause flow control. This is a useful but under-utilized feature of Ethernet that wehope to demystify.

TABLE 5-14 Request Operating Target Token Rotation Time


pf_tReq Requested TTRT should be a value in the range 4000 through165,000.



Software Device Driver LayerThis section discusses the low-level NIC architecture features required to understandthe tuning capabilities of each NIC. To discuss this we will divide the process ofcommunication into the software device driver layer relative to TCP/IP and thenfurther into Transmit and Receive.

The software device driver layer conforms to the data link provider interface (DLPI).The DLPI interface layer is how protocols like TCP/IP, Appletalk, and so on talk tothe software driving the Ethernet device. This is illustrated further in FIGURE 5-7.

FIGURE 5-7 Communication Process between the NIC Software and Hardware

Transmit

The Transmit portion of the software device driver level is the simpler of the twoand basically is made up of a Media Access Control module (MAC), a direct memoryaccess (DMA) engine, and a descriptor ring and buffers. FIGURE 5-8 illustrates theseitems in relation to the computer system.

TCP/IP Protocol Stack

DLPI Interface

Software Domain

Receive Transmit

Hardware Domain

NIC Device Driver

NIC Device



FIGURE 5-8 Transmit Architecture

The key element to this transmit architecture is the descriptor ring. This is the part ofthe architecture where the transmit hardware and the device driver transmitsoftware share information required to move data from the system memory to theEthernet network connection.

The transmit descriptor ring is a circular array of descriptor elements that areconstantly being used by the hardware to find data to be transmitted from mainmemory to the Ethernet media. At a minimum, the transmit descriptor elementcontains the length of the Ethernet packet data to be transmitted and a physicalpointer to a location in system physical memory to find the data.

The transmit descriptor element is created by the NIC device driver as a result of arequest at the DLPI interface layer to transmit a packet. That element is placed onthe descriptor ring at the next available free location in the array. Then the hardwareis notified that a new element is available. The hardware fetches the new descriptor,and using the pointer to the packet data physical memory, moves the data from thephysical memory to the Ethernet media for the given length of the packet providedin the Tx descriptor.

Note that requests for more packets to be transmitted by the DLPI interface continuewhile the hardware is transmitting the packets already posted on the descriptor ring.Sometimes the arrival rate of the transmit packets at the DLPI interface exceeds therate of transmission of the packets by the hardware to the media. In that case, thedescriptor ring fills up and further attempts to transmit must be postponed untilpreviously posted transmissions are completed by the hardware and more descriptorelements are made available by the device driver software. This is a typicalproducer-consumer effect where the producer is the DLPI interface producingrequests for the transmit descriptor ring and the hardware is the consumerconsuming those requests and moving data to the media.

PhyDevice

MediaAccessControlModule

TXDMA

Engine

System Memory

Tx DescriptorRing

DataBuffer

Tx Descriptor



This producer-consumer effect can be reduced by increasing the size of the transmitdescriptor ring to accommodate the delay that the hardware or the underlyingmedia imposes on the movement of the data. This delay is also known astransmission latency.

Later sections describe how many of the device drivers give a measurement of howoften the transmission latency becomes so large that data transmission is postponed,awaiting transmit descriptor ring space. The aim is to avoid this situation. In somecases, NIC hardware allows you to increase the size of the descriptor ring, allowinga larger transmit latency. In other cases, the hardware has a fixed upper limit for thesize of the transmit descriptor ring. In those cases, there’s a hard limit to how muchlatency the transmit can endure before postponing packets is inevitable.

Transmit DMA Buffer Method Thresholds

The packets staged for transmission are buffers present in the kernel virtual addressspace. A mapping is created that provides a physical address that the hardware usesas a base address of the bytes to be fetched from main memory for transmission. Theminimum granularity of a buffer is an 8-kilobyte page, so if an Ethernet packetcrosses an 8-kilobyte page in the virtual address space, there’s no guarantee that thetwo pages will also be adjacent physical address space. To make sure the physicalpages are adjacent in the physical address space, SPARC systems provide aninput/output memory management unit (IOMMU), which is designed to make surethat the device view of main memory matches that of the CPU and hence simplifiesDMA mapping.

The IOMMU simplifies the mapping of virtual to physical address space andprovides a level of error protection against rogue devices that exceed their allowedmemory regions, but it does so at some cost to the mapping set up and tear down.The generic device driver interface for creating this mapping is known as ddi_dma.In SPARC platforms, a newer set of functions for doing the mapping known as fastdvma is now available. With fast dma it is possible to further optimize the mappingfunctions. The use of fast dvma is limited, so when that resource is unavailable,falling back to ddi_dma is necessary.

Another aspect of DMA is the CPU cache coherence. DMA buffers on bridges in thedata path between the NIC device and main memory must be synchronized with theCPU cache prior to the CPU reading or writing data. Two different modes ofmaintaining DMA-to-CPU cache coherency form two types of DMA transaction,know as consistent mode and streaming mode.

■ Consistent mode uses a consistency protocol in hardware, which is common inboth x86 platforms and SPARC platforms.

■ Streaming mode uses a software synchronization method.



The trade-offs between consistent and streaming modes are largely due to the pre-fetch capability of the DMA transaction. In the consistent mode there’s no pre-fetch,so when a DMA transaction is started by the device, each cache line of data isrequested individually. In streaming mode a few extra cache lines can be pre-fetchedin anticipation of being required by the hardware, hence reducing per cache line re-arbitration costs.

All of these trade-offs lead to the following rules for using ddi_dma, fast dvma,and consistent versus streaming mode:

■ If the packets are small, avoid setting up a mapping on a per-packet basis. Thismeans that small packets are copied out of the message and passed down fromthe upper layer to a pre-mapped buffer. That pre-mapped buffer is usually aconsistent mode buffer, as the benefits of streaming mode are difficult to realizefor small packets.

■ Large packets should use the fast dvma mapping interface. Streaming mode isassumed in this mode. On x86 platforms, streaming mode is not available.

■ Mid-range packets should use the ddi_dma mapping interface. This range appliesto all cases where fast dvma is not available. The mid-range can be further split,as one can control explicitly whether the DMA transaction uses consistent modeor streaming mode. Given that streaming mode pre-fetch capability works bestfor larger transactions, the upper half should use streaming mode while the lowerhalf uses consistent mode.

Setting the thresholds for these rules requires clear understanding of the memorylatencies of the system and the distance between the I/O Expander card and theCPU card in a system. The rule of thumb here is the larger the system, the larger thememory latency.

Once the course-grain tuning is applied, more fine-grain tuning is required. The besttuning is established by experimentation. A good way to do this is by running FTPor NFS transfers of large files and measuring the throughput.

Multi-data Transmit Capability

A new development in the process of transmission in the Solaris TCP/IP protocolstack is known as multi-data transmission (MDT). Delivering one packet at a time tothe transmit driver caused a lot of networking layer traversals. Furthermore, theindividual cost of setting up network device DMA hardware to transmit one packetat a time was too expensive. Therefore, a method was devised that allowed multiplepackets to be passed to the driver for transmission. At the same time, every effortwas made to ensure that the data for those packets remained in one contiguousbuffer that could be enabled for transmission with one setup of Network DeviceDMA hardware.



This feature requires a new interface to the driver, so only the most recent deviceshave implemented it. Furthermore, it can only be enabled if TCP/IP is configured toallow it. Even with that, it will only attempt to build an MDT transaction to thedriver if the TCP connection is operating in a Bulk transfer mode such as FTP orNFS.

The Multi-Data transmit capability is also included as part of the performanceenhancements provided in the ce driver. This feature is negotiated with the upperlayer protocol so it can be enabled in the ce driver as well as the upper layerprotocol. If there’s no negotiation, the feature is disabled. The TCP/IP protocolbegan supporting Multi-Data Transmit capability in the Solaris 9 8/03 operatingsystem, but by default it will not negotiate with the driver to enable it. The first stepto making this capability available is to enable the negotiations through an/etc/system tunable parameter.

● To enable the multi-data transmit capability, add the following line to the/etc/system file:

Receive

The receive side of the interface looks much like the transmission side, but itrequires more from the device driver to ensure that packets are passed to the correctstream. There are also multithreading techniques to ensure that the best advantage ismade of multiprocessor environments. FIGURE 5-9 shows the basic Rx architecture.

TABLE 5-15 Multi-Data Transmit Tunable Parameter

Parameter Value Description

ip_use_dl_cap 0-1 Enables the ability to negotiate special hardwareaccelerations with a lower layer.1 Enable0 DisableDefault 0

set ip:ip_use_dl_cap = 1



FIGURE 5-9 Basic Receive Architecture

The receive descriptor plays a key role in the process of receiving packets. Unliketransmission, receive packets originate from remote systems. Therefore, the Rxdescriptor ring refers to buffers where those incoming packets can be placed.

At a minimum, the receive descriptor element provides a buffer length and a pointerto an available buffer. When a packet arrives, it’s received first by the PHY deviceand then passed to the MAC, which notifies the Rx DMA engine of an incomingpacket. The Rx DMA takes that notification and uses it to initiate a Rx descriptorelement fetch. The descriptor is then used by the Rx DMA to post the data from theMAC device internal FIFOs to system main memory. The length provided by thedescriptor ensures the Rx DMA doesn’t exceed the buffer space provided for theincoming packet.

The Rx DMA continues to move data until the packet is complete. Then it places inthe current descriptor location a new completion descriptor containing the size ofthe packet that was just received. In some cases, depending on the hardwarecapability, there might be more information in the completion descriptor associatedwith the incoming packet (for example, a TCP/IP partial checksum).

When the completion descriptor is placed back onto the Rx descriptor ring, thehardware advances its pointer to the next free Rx descriptor. Then the hardwareinterrupts the CPU to notify the device driver that it has a packet that needs to bepassed to the DLPI layer.

Once the device driver receives the packet, it is responsible for replenishing the Rxdescriptor ring. That process requires the driver to allocate and map a new buffer forDMA and post it to the ring. When the new buffer is posted to the ring, thehardware is notified that this new descriptor is available. Once the buffer isreplenished, the current packet can be passed up for classification to the streamexpecting that packet’s arrival.

PHYDevice

MediaAccessControlModule

RXDMA

Engine

System Memory

Rx DescriptorRing

DataBuffer

Rx Descriptor



It is possible that the allocation and mapping can fail. In that case, the current packetcannot be received, as its buffer is reposted to the ring to allow the hardware tocontinue to receive packets. This condition is not very likely, but it is an example ofan overflow condition.

Other overflow conditions can occur on the Rx path starting from the DLPI layer:

■ Overflow can be caused when the DLPI layer cannot receive the incoming packet.In that case, packets are typically dropped, even though they were successfullyreceived by the hardware.

■ Overflow can be caused when the device driver software is unable to replenishthe Rx descriptor elements faster than the NIC hardware consumes them. Thisusually occurs because the system doesn’t have enough CPU performance to keepup with the network traffic.

Overflow also occurs within the NIC device between the MAC and the Rx DMAinterface. This is known as MAC overflow. It is caused when the descriptor overflowsand backfill occurs because of that condition. MAC overflow can occur when a high-latency system bus makes the MAC overflow its internal buffer as it waits for the RxDMA to get access to the bus to move the data from the MAC buffer to mainmemory. Finally, if a MAC overflow condition exists, any packet coming in cannot bereceived. Hence that packet is considered missed.

In some cases, overflow conditions can be avoided by careful tuning of the devicedriver software. The extent of available tuning depends on the NIC hardware. Incases where the Rx descriptor ring is overflowed, many devices allow increases inthe number of descriptor elements available. This will be discussed further withrespect to example NIC cards in later sections.

You can avoid the MAC overflow condition by careful system configuration, whichcan require more memory, faster CPUs, or more CPUs. It might also require that NICcards not share the system bus with other devices. Newer devices have the ability toadjust the priority of the Rx DMA versus the Tx DMA, giving one a more favorableopportunity to access the system bus than the other. Therefore, if the MAC overflowcondition occurs, it might be possible to adjust the Rx DMA priority to make Rxaccesses to the system bus more favorable than the Tx DMA, thus reducing thelikelihood of MAC overflow.

The overflow condition from the DLPI layer is caused by an overwhelmed CPU.There are a few new hardware features that help reduce this effect. Those featuresinclude hardware checksumming, interrupt blanking, and CPU load balancing.

Checksumming

The hardware checksumming feature accelerates the one’s complement checksumapplied to TCP/IP packets. The TCP/IP checksum is applied to each packet sent bythe TCP/IP protocol. The TCP/IP checksum is made up of a one’s complement



addition of the bytes in the pseudo header plus all the bytes in the payload. Thepseudo header is made up of bytes from the source and destination IP address plusthe TCP source and destination port numbers.

The hardware checksumming feature is merely an acceleration. Most hardwaredesigns don’t implement the TCP/IP checksumming directly. Instead, the hardwaredoes the bulk of the one’s complement additions over the data and allows thesoftware to take that result and mathematically adjust it to make it appear thecomplete hardware checksum was calculated. On transmission, the TCP checksumfield is filled with an adjustment value that is considered just another two bytes ofdata that the hardware is applying during the one’s complement addition of all thebytes of the packet. The end result of that sequence is a mathematically correctchecksum that can be placed in the TCP header on transmission by the MAC to thenetwork.

FIGURE 5-10 Hardware Transmit Checksum

On the Rx path, the hardware completes the one’s complement checksum based on astarting point in the packet. That same starting point is passed to TCP/IP along withthe one’s complement checksum from the bytes in the incoming packet. The TCP/IPsoftware again does a mathematical fix-up, using this information before it finallycompares the result with the TCP/IP checksum bytes that arrived as part of thepacket.

The main advantage of hardware checksumming is the reduction in cost of requiringthe system CPU to calculate the checksum for large packets by allowing the majorityof the checksum calculation to be completed by the NIC hardware. Because thehardware does not do the complete TCP/IP checksum calculation, this form ofTCP/IP checksum acceleration is called partial checksumming.

EthernetHeader

Hardwarehere

Start Checksum From

Hardwarehere

Place Checksum Result

IPHeader

TCPHeader Payload Data

End Checksum Here



FIGURE 5-11 Hardware Receive Checksum

Interrupt Blanking

Interrupt blanking is another hardware acceleration. Typically with regular NICdevices, the CPU is interrupted when a receive packet arrives. Hence the CPU isinterrupted on a per-packet basis. While this is reasonable for transactional requests,where you would expect a response to a request immediately, it is not alwaysrequired, especially in large bulk data transfers. In the single-interrupt-per-packetcase, a packet arrival interrupting the CPU adds the cost of processing eachindividual packet to the overhead of the interrupt processing. The interrupt blankingtechnique allows a set number of packets to arrive before the next receive interruptis generated. This allows the overhead of the interrupt processing to be distributed,or amortized, across the number of received packets. If that number of packets is notreached, then the packets that have arrived so far will not generate an interrupt andhence would not be processed. A timeout ensures that the receive packet interruptwill be generated and those received packets will be processed. The best setting forthe interrupt blanking depends on the type of traffic—transactional versus bulk datatransfers—and the speed of the system. The best way to tune these parameters canbe achieved empirically when the given parameters are well known and theinterrupt blanking can be tuned dynamically to match. This will be discussed furtherin the context of individual NICs that provide this feature.

CPU Load Balancing

CPU load balancing is the latest hardware acceleration to become available. It isdesigned to take maximum advantage of the large number of CPUs available inmany UltraSPARC-based systems. There are two forms of CPU load balancing:software load balancing and hardware load balancing.

EthernetHeader

Hardwarehere

Start Checksum From

Hardware Calculate Checksum

Pkt Buffer Pkt Length

IPHeader

TCPHeader Payload Data

End Checksum Here



Software load balancing can be enhanced with hardware support, but it can also beimplemented without hardware support. Essentially, it requires the ability toseparate the workload of different connections from the same protocol stack intoflows that can then be processed on different CPUs. The interrupt thread is nowrequired only to replenish buffers for the descriptor ring, allowing more packets toarrive. Packets taken off the receive rings are then load balanced into packet flowsbased on connection information from the packet. A packet flow has a circular arraythat is updated with receive packets from the interrupt service routine while packetsposted earlier are being removed and post-processed in the protocol stack by thekernel worker thread. Usually more than one flow is set up within a system made upof a circular array and a corresponding kernel worker thread. The more CPUsavailable, the more flows can be allowed. The kernel worker threads are available torun whenever packet data is available on the flows array. The system schedulerparticipates using its own CPU load-balancing technique to ensure a fair distributionof workload for incoming data.

FIGURE 5-12 demonstrates the architecture of software load balancing.

FIGURE 5-12 Software Load Balancing

Hardware load balancing requires that the hardware provide built-in load balancingcapability. The PCI bus enables receive hardware load balancing by using its fouravailable interrupt lines together with the ability of the UltraSPARC III systems toallow each of those four interrupt lines to be serviced by different CPUs within thesystem. The advantage of having the four lines receive interrupts running ondifferent CPUs is that it allows not only the protocol post processing to happen in

Protocol Stack

Flow Workerthread

Circulararray

Flow Workerthread

Circulararray

Flow Workerthread

Circulararray

Rx Interrupt Servicesroutine



parallel, as in the case of software load balancing, but it also allows the processing ofthe descriptor rings in the interrupt service routines to run in parallel, as shown inFIGURE 5-13.

FIGURE 5-13 Hardware Load Balancing

It is possible to combine the concept of software load balancing with the concept ofhardware load balancing if enough CPUs are available to allow all the parallel Rxprocessing to happen.

However, there is a gotcha with this load balancing capability: To realize its benefit,you must have multiple connections in order to provide the load balancing in thefirst place.

Received Packet Delivery Method

The received packet delivery method is the way packets are posted to the upperlayer for protocol processing and refers to which thread of execution takesresponsibility for that last leg of the journey for the received packet.

CPU software load balancing is an example of a received packet delivery methodwhere the interrupt processing is decoupled from the protocol stack processing. Ahint provided by the hardware helps decide which worker thread completes thedelivery of the packet to the protocol. In this model, many CPUs get to participate inthe protocol stack processing.

Protocol Stack

Rx Interruptservice routine




PCI BUS

NIC hardwareproviding hardware

load balancing



The Streams Service Queue model also requires the driver to decouple interruptprocessing from protocol stack processing. In this model, there’s no requirement toprovide a hint because there is only one protocol processing thread per queue opento the driver—with respect to TCP/IP, that’s only one stream. This method worksbest on systems with a small number, but greater than one, of CPUs. Like CPU loadbalancing, it compromises on latency.

The most common received packet delivery method is to do all the interruptprocessing and protocol processing in the interrupt thread. This is a widely acceptedmethod, but it is restricted by the available CPU bandwidth taking all the NIC driverinterrupts. This is really the only option on a single CPU system. In a multi-CPUsystem, you can choose one of the other two methods if it’s established that the CPUtaking the NIC interrupts is being overwhelmed. That situation becomes apparentwhen the system starts to become unresponsive.

Random Early Discard

The Random Early Discard feature was introduced recently to try to reduce the illeffects of a network card going into an overflow state.

Under normal circumstances, there are a couple of overflow possibilities:

■ The internal device memory is full and the adapter is unable to get timely accessto the system bus in order to move data from that device memory to systemmemory.

■ The system is so busy servicing packets that the descriptor rings fill up withinbound packets and no further packets can be received. This overflow conditionis very likely to also trigger the first overflow condition at the same time.

When these overflow conditions occur, the upper layer connections effectively stopreceiving packets and a connection appears to have stopped in motion, at least forthe duration of the overflow condition. With TCP/IP in particular, this leads tomany packets being lost. The connection state is modified to assume a less reliableconnection, and in some cases the connections might be lost completely.

The impact of a lost connection is obvious, but if the TCP/IP protocol assumes a lessreliable connection it will further contribute to the congestion on the network byreducing the number of packets outstanding without an ACK from the regular eightto a smaller value. A technique that can avoid this scenario can take advantage ofthe TCP/IP ability to allow for the occasional single packet loss associated with aconnection and still maintain the same number of packets outstanding without anACK. The lost packet is simply requested again, and the transmitting end of theconnection will perform a retry.

Completely avoiding the overflow scenario is impossible, but you can reduce itslikelihood by beginning to drop random packets already received in the devicememory, avoiding propagating them further into the system and adding to the



workload already piled up for the system. This technique, known as Random EarlyDiscard (RED), has the desired effect of avoiding overwhelming the system, while atthe same time having minimal negative effect on the TCP/IP connections.

The rate of random discard is done relative to how many bytes of packet dataoccupy the device internal memory. The internal memory is split into regions. Asone region fills up with packet data, it spills into the next until all regions of memoryare filled and overflow occurs. When the packet data spills from one region to thenext, that’s the trigger to randomly discard. The number of packets discarded isbased on the number of regions filled; the more regions filled, the more you need todiscard, as you’re getting closer to the overflow state.

Jumbo Frames

Jumbo frames technology allows the size of an Ethernet data packet to be extendedpast the current 1514 standard limit, which is the norm for Ethernet networks. Thetypical size of jumbo frames has been set to 9000 bytes when viewed from the IPlayer. Once the Ethernet header is applied, that grows by 14 bytes for the regularcase or 18 bytes for VLAN packets.

When jumbo frames are enabled on a subnet or VLAN, every member of that subnetor VLAN should be enabled to support jumbo frames. To ensure that this is the case,configure each node for jumbo frames. The details of how to set and check a nodefor jumbo frames capability tend to be NIC device/driver-specific and are discussedfor interfaces that support them below. If any one node in the subnet is not enabledfor jumbo frames, no members in the subnet can operate in jumbo frame mode,regardless of their preconfiguration to support jumbo frames.

The big advantage of jumbo frames is similar to that provided by MDT. Theyprovide a huge improvement in bulk data transfer throughput with correspondingreduction in CPU utilization, but with the addition that the same level ofimprovement is also available in the receive direction. Therefore, best bulk transferresults can be achieved using this mode.

The jumbo frames mode should be used with care because not all switches ornetworking infrastructure elements are jumbo frames capable. When you enablejumbo frames, make sure that they’re contained within a subnet or VLAN where allthe components in that subnet or VLAN are jumbo frames capable.

Ethernet Physical LayerThe Ethernet Physical layer has developed along with the Ethernet technology.When Ethernet moved from 10 Mbit/sec to 100 Mbit/sec, there were a number oftechnologies and media available to provide 100 Mbit/sec line rate. To allow thosemedia and technologies to develop without altering the long-established Ethernet



protocol, a partition was made between the media-specific portion and the Ethernetprotocol portion of the overall Ethernet technology. At that partition was placed theMedia Independent Interface (MII). The MII allowed Ethernet to operate over fiber-optic cables to switches built to support fiber. It also allowed the introduction of anew twisted-pair copper technology, 100BASE-T4. These differing technologies forthe supporting 100 Mbit/sec ultimately did not survive the test of time, leaving100BASE-T as the standard 100 Mbit/sec media type.

The existing widespread adoption of 10 Mbit/sec Ethernet brought with it arequirement that Ethernet media for 100 Mbit/sec should allow for backwardcompatibility with existing 10 Mbit/sec networks. Therefore, the MII was required tosupport 10 Mbit/sec operation as well as 100 Mbit/sec and allow the speed to beuser selectable or automatically detected or negotiated. Those requirements led tothe ability to force a particular speed setting for a link, known as Forced mode, orbased on link speed signaling set a particular speed, known as auto-sensing, whereboth sides of the link share information about link speed and duplex capabilities andnegotiate the best speed and duplex to set for the link, known as Auto-negotiation.

Basic Mode Control Layer

For the benefit of this discussion, the MII is restricted to the registers and bits usedby software to allow the Ethernet physical layer to operate in Forced or Auto-negotiation mode.

The first register of interest is the Basic Mode Control Register (BMCR). This registercontrols whether the link will auto-negotiate or use Forced mode. If Forced mode ischosen, then auto-negotiation is disabled and the remaining bits in the registerbecome meaningful.

FIGURE 5-14 Basic Mode Control Register

■ The Reset bit is a self-clearing bit that allows the software to reset the physicallayer. This is usually the first bit touched by the software in order to begin theprocess of synchronizing the software state with the hardware link state.

■ Speed Selection is a single bit, and it is only meaningful in Forced mode. Forcedmode of operation is available when auto-negotiation is disabled. If this bit is setto 0, then the speed selected is 10 Mbit/sec. If set to 1, then the speed selected is100 Mbit/sec.

Reset Speed Select DuplexMode

Auto-NegotiationEnable

RestartAuto-Negotiation



■ When the Auto-negotiation Enable bit is set to 1, auto-negotiation is enabled, andthe speed selection and duplex mode bits are no longer meaningful. The speedand Duplex mode of the link are established based on auto-sensing or auto-negotiation advertisement register exchange.

■ The Restart Auto-negotiation bit is used to restart auto-negotiation. This isrequired during the transition from Forced mode to Auto-negotiation mode orwhen the Advertisement register has been updated to a different set of auto-negotiation parameters.

■ The Duplex mode bit is only meaningful in Forced mode. When set to 1, the linkis set up for full duplex mode. When set to 0, the link operates in Half-duplexmode.

Basic Mode Status RegisterThe next register of interest is the Basic Mode Status Register (BMSR). This read-onlyregister provides the overall capabilities of the MII physical layer device. From thesecapabilities you can choose a subset to advertise, using the Auto-negotiationAdvertisement register during the auto-negotiation process.

FIGURE 5-15 Basic Mode Status Register

■ When the 100BASE-T4 bit is set to 1, it indicates that the physical layer device iscapable of 100BASE-T4 networking. When set to 0, it is not.

■ When the 100BASE-T Full-duplex bit is set to 1, it indicates that the physical layerdevice is capable of 100BASE-T full-duplex networking. When set to 0, it is notcapable.

■ When the 100BASE-T Half-duplex bit is set to 1, it indicates that the physicallayer device is capable of 100BASE-T half-duplex networking. When set to 0, it isnot capable.

■ When the 10BASE-T Full-duplex bit is set to 1, it indicates that the physical layerdevice is capable of 10BASE-T full-duplex networking. When set to 0, it is not.

■ When the 10BASE-T Half-duplex bit is set to 1, it indicates that the physical layerdevice is capable of 10BASE-T half-duplex networking. When set to 0, it is not.

100BASE-T4100BASE-TFull Duplex

100BASE-THalf Duplex


LinkStatus

Auto-negotiation

capable

Auto-negotiationcomplete

100BASE-TFull Duplex



■ The Auto-negotiation Complete bit is only meaningful when the physical layerdevice is capable of auto-negotiation and it is enabled. Auto-negotiationComplete indicates that the auto-negotiation process has completed and theinformation in the link partner auto-negotiation accurately reflects the linkcapabilities of the link partner.

■ When the Link Status bit is set to 1, it indicates that the physical link is up. Whenset to 0, the link is down. When used in conjunction with auto-negotiation, thisbit must be set together with the Auto-negotiation Complete before the softwarecan establish that the link is actually up. In Forced mode, as soon as this bit is setto 1, the software can assume the link is up.

■ When the Auto-negotiation Capable bit is set to 1, it indicates that the physicallayer device is capable of auto-negotiation. When set to 0, the physical layerdevice is not capable. This bit is used by the software to establish any furtherauto-negotiation processing that should occur.

Link-Partner Auto-negotiation Advertisement Register

The next registers of interest are the Auto-negotiation Advertisement Register(ANAR) and Link-Partner Auto-negotiation Advertisement (LPANAR). These tworegisters at the heart of the Auto-negotiation process, and they both share the samebit definitions.

The ANAR is a read/write register that can be programmed to control the linkpartner’s view of the local link capabilities. The LPANAR is a read-only register thatis used to discover the remote link capabilities. Using the information in theLPANAR together with the information available in the ANAR register, the softwarecan establish what shared link capability has been established once auto-negotiationhas completed.

FIGURE 5-16 Link Partner Auto-negotiation Advertisement

When the 100BASE-T4 is set to 1 in the ANAR, it advertises the intention of the localphysical layer device to use 100BASE-T4. When set to 0, this capability is notadvertised. This same bit, when set to 1 in the LPANAR, indicates that the linkpartner physical layer device has advertised 100BASE-T4 capability. When set to 0,the link partner is not advertising this capability.


10BASE-TFull Duplex

10BASE-THalf Duplex




The 100BASE-T Full-duplex, 100BASE-T Half-duplex, 10BASE-T Full-duplex, and100BASE-T Half-duplex bits all have the same functionality as 100BASE-T4 andprovide the ability to decide what link capabilities should be shared for the link. Thedecision process is made by the physical layer hardware and is based on priority, asshown in FIGURE 5-17. It is the result of logically ANDing ANAR and the LPANARon completion of auto-negotiation.

FIGURE 5-17 Link Partner Priority for Hardware Decision Process

Auto-negotiation in the purest sense requires that both sides participate in theexchange of ANAR. This allows both sides to complete loading of the LPANAR andestablish a link that operates at the best negotiated value.

It is possible that one side, or even both sides, of the link might be operating inForced mode instead of Auto-negotiation mode. This can happen because the newdevice is connected to an existing 10/100 Mbit/sec link that was never designed tosupport auto-negotiation or because the auto-negotiation is switched off on one orboth sides.

If both sides are in Forced mode, one needs to set the correct speed and duplex forboth sides. If the speed is not matched, the link will not come up, so speedmismatches can be easily tracked down once the physical connection is checked andconsidered good. If the duplex is not matched yet the speed is matched, the link willcome up, but there’s an often unnoticed gotcha in that. If one side is set to halfduplex while the other is set to full duplex, then the half-duplex side will operatewith the Ethernet protocol Carrier Sense Multiple Access with Collision Detection(CSMA/CD) while the full-duplex side will not. To the physical layer, this meansthat the full-duplex side is not adhering to the half-duplex CSMA/CD protocol andwill not back off if someone is currently transmitting. For the half-duplex side of theconnection, this appears as a collision, and its transmit is stopped. These collisionswill occur frequently, preventing the link from operating to its best capacity.

If one side of the connection is running Auto-negotiation mode and the other isrunning Forced mode and the auto-negotiating side is capable and advertising allavailable MII speeds and duplex settings, the link speed will always be negotiatedsuccessfully by the auto-sensing mechanism provided as part of the auto-negotiationprotocol. Auto-sensing uses physical layer signaling to establish the operating speed

Highest 100BASE-T Full Duplex

100BASE-T4

100BASE-T Half Duplex

10BASE-T Full Duplex

10BASE-T Half DuplexLowest



of the Forced side of the link. This allows the link to at least come up at the correctspeed. The link duplex, on the other hand, needs the Advertisement registerexchange and cannot be established by auto-sensing. Therefore, if the link duplexsetting on the Forced mode side of the link is full duplex, then the best guess theauto-negotiating side of the link can make is half duplex. This gives rise to the sameeffect discussed when both sides are in Forced mode and there’s a duplex mismatch.The only solution to the issue of duplex mismatch is to be aware that it can happenand make every attempt to configure both sides of the link to avoid it.

In most cases, enabling auto-negotiation on both sides wherever possible willeliminate the duplex mismatch issue. The alternative is Forced mode, which shouldonly be employed in infrastructures that have full-duplex configurations. Wherepossible, those configurations should be replaced with an auto-negotiationconfiguration.

There’s one more MII register worthy of note. The Auto-negotiation Expansionregister (ANER) can be useful in establishing whether a link partner is capable ofauto-negotiation or not, and providing information about the auto-negotiationalgorithm.

FIGURE 5-18 Auto-negotiation Expansion Register

The Parallel Detection Fault bit indicates that the auto-sensing part of the auto-negotiation protocol was unable to establish the link speed, and the regular ANARexchange was also unsuccessful in establishing a common link parameter. Thereforeauto-negotiation failed. If this condition happens, the best course of action is tocheck each side of the link manually and ensure that the settings are mutuallycompatible.

Gigabit Media Independent InterfaceAs time progressed, Ethernet was increased in speed by another multiple of 10 togive 1000 Mbit/sec or 1 Gbit/sec. The MII remained and was extended to supportthe new 1 Gbit/sec operation, giving rise to the Gigabit Media Independent Interface(GMII).

Link PartnerAuto-Negotiation

able

ParallelDetection Fault



The GMII was first implemented using fiber-optic physical layer known as1000BASE-X and was later extended to support twisted-pair copper known as1000BASE-Tx. Those extensions led to additional bits in registers in the MIIspecification and some completely new registers, giving a GMII register setdefinition.

The first register to be extended was the BMCR because it can be used to forcespeed. Then the ability to force 1-gigabit operation was added. All existing bitdefinition was maintained with the addition of one bit taken from the existingreserved bits to allow the enumeration of the different speeds that can now be forcedwith GMII devices.

FIGURE 5-19 Extended Basic Mode Control Register

The next register of interest was the BMSR. This register was extended to indicate tothe driver software that there are more registers that apply to 1-gigabit operation.

FIGURE 5-20 Basic Mode Status Register

When the 1000BASE-T Extended Status is set, that’s the indication to the driversoftware to look at the new 1-gigabit operating registers. The function is similar tothe Basic Mode Status and the ANAR.

The Gigabit Extended Status Register (GESR) is the first of the gigabit operatingregisters. Like the BMSR, it gives an indication of the types of gigabit operation thephysical layer device is capable of.

Reset Speed Select (0)

DuplexMode

SpeedSelect (1)

Auto-NegotiationEnable

RestartAuto-Negotiation



10BASE-THalf Duplex

1000BASE-TExtended

Status

LinkStatus

Auto-Negotiation

Capable

Auto-NegotiationComplete

100BASE-TFull Duplex



FIGURE 5-21 Gigabit Extended Status Register

The 1000BASE-X full duplex indicates that the physical layer device is capable ofoperating with 1000BASE-X fiber media with full-duplex operation.

The 1000BASE-X half duplex indicates that the physical layer device is capable ofoperating with 1000BASE-X fiber media with half-duplex operation.

The 1000BASE-T full duplex indicates that the physical layer device is capable ofoperating with 1000BASE-T twisted-pair copper media with full-duplex operation.

The 1000BASE-T half duplex indicates that the physical layer device is capable ofoperating with 1000BASE-T twisted-pair copper media with half-duplex operation.

The information provided by the GESR gives the possible 1-gigabit capabilities ofthe physical layer device. From that information you can choose the gigabitcapabilities that will be advertised through the Gigabit Control Register (GCR). Inthe case of twisted-pair copper physical layer, there is also the ability to advertise theClock Mastership.

FIGURE 5-22 Gigabit Control Status

Clock Mastership is a new concept that only applies to copper media running at 1gigabit. At such high signaling frequencies, it becomes increasingly difficult tocontinue to have separate clocking for the remote and the local physical layerdevices. Hence, a single clocking domain was introduced, which the remote and thelocal physical layer devices share while a link is established. To achieve the singleclocking domain only one end of the connection provides the clock (the link master),and the other (the link slave) simply uses it. The Gigabit Status Register (GSR) bits,Master/Slave Manual Config Enable, and Master/Slave Config Value control howyour local physical layer device will behave in this master/slave relationship.

1000BaseXFull Duplex

1000BaseXHalf Duplex

1000BaseTFull Duplex

1000BaseTHalf Duplex

Master/SlaveManual config

enable

Master/SlaveConfigValue





When the Master/Slave manual config enable bit is set, the master slaveconfiguration is controlled by the master config value. When it is cleared, theMaster/Slave configuration is established during auto-negotiation by a clock-learning sequence, which automatically establishes a clock master and slave for thelink.

Typically in a network, the master is the switch port and the slave is the end port orNIC.

The Master/Slave manual config enable setting is only meaningful when theMaster/Slave manual config enable bit is set. If set to 1, it will force the localclock mastership setting to be Master. If set to 0, the local clock becomes the Slave.

When using the Master/Slave manual configuration, take care to ensure that the linkpartner is set accordingly. For example, if 1-gigabit Ethernet switches are set up tooperate as link masters, then the computer system attached to the switches shouldbe set up as a slave.

FIGURE 5-23 Gigabit Status Register

When the driver fills in the bits in the GSR, it’s equivalent to filling in the ANAR inMII: It controls the 1-gigabit capabilities that are advertised. Likewise the GSR is likethe LPANAR, providing the capabilities of the link partner. The register definitionfor the GSR is similar to the GCR.

With GMII operation, once auto-negotiation is complete, the contents of the GCR arecompared with those in the GSR and the highest-priority shared capability is used todecide the gigabit speed and duplex.

It is possible to disable 1-gigabit operation. In that case, the shared capabilities mustbe found in the MII registers as described above.

In GMII mode, at the end of auto-negotiation, once the GCR and GSR are comparedand the ANAR and LPANAR are compared, then the choice of the operating speedand duplex is established by the hardware based on the following descendingpriority:

Master/SlaveManual config

enable

Master/SlaveConfigValue





FIGURE 5-24 GMII Mode Link Partner Priority

Once the correct setting is established, the device software makes that setting knownto the user through kernel statistics. It is also possible to manipulate theconfiguration using the ndd utility.

Ethernet Flow Control

One area of MII/GMII that appeared after the initial definition of MII but beforeGMII was the introduction of Ethernet Flow Control. Ethernet Flow Control is aMAC Layer feature that controls the rate of packet transmission in both directions.

FIGURE 5-25 Flow Control Pause Frame Format

The key to this feature is the use of MAC control frames known as pause frames,which have the following formats:

■ The Destination Address is a 6-byte address defined for Ethernet Flow Control asa multicast address of 01:80:C2:00:00:1.

■ The Source Address is a 6-byte address that is the same as the Ethernet stationaddress of the producer of the pause frame.

■ The Protocol Type Field is a 2-byte address set to the MAC control protocol0x8808. Pause capability is one example of the usage of MAC control protocol.

Highest 1000BASE-T Full Duplex



100BASE-T4



10BASE-T Half DuplexLowest

DestinationAddress

SourceAddress

ProtocolType Field

MAC Pause Opcode

MACPause Opcode

PAD to 42bytes

FrameCRC



■ The MAC Control Pause Opcode is a 2-byte value, 0x0001, that indicates the typeof MAC control feature to be used, in this case pause.

■ The Mac Control Pause Parameter is a 2-byte value that indicates whether theflow control is started—also referred to as XOFF or XON. When the MAC ControlPause Parameter is non zero, you have an XOFF pause frame. When the value is0, you have an XON pause frame. The value of the parameter is in units of slottime.

To understand the Flow Control capability, consider symmetric flow control first.With symmetric flow control, a network node can generate flow control frames orreact to flow control frames.

Generating a flow control frame is known as Transmit Pause capability and istriggered by congestion on the Rx side. The Transmit Pause sends an XOFF flowcontrol message to the link partner, who should react to pause frames (Receivepause capability). By reacting to pause frames, the link partner uses the transmittedpause parameter as a duration that the link partner’s transmitter should remainsilent while the Rx congestion clears. If the Rx congestion clears within that pauseparameter period, an XON flow control message can be transmitted telling the linkpartner that the congestion has cleared and transmission can continue as normal.

In many cases, Flow Control capability is available in only one direction. This isknown as Asymmetric Flow Control. This might be a configuration choice or simplya result of a hardware design.

Therefore, the MII/GMII specification was altered to allow Flow Control capabilityto be advertised to a link partner along with the best type of flow control to be usedfor the shared link. The changes were applied to the ANAR along with two new bits:Pause Capability and Asymmetric Pause Capability. FIGURE 5-26 shows the updatedregister.

FIGURE 5-26 Link Partner Auto-negotiation Advertisement Register

Starting with Asymmetric Pause Capability, if the value of this bit is set to 0, then theability to pause is managed by the Pause Capability. If Pause Capability is set to 1, itindicates the local ability to pause in both Rx and Tx direction. If the AsymmetricPause Capability is set to 1, it indicates the local ability to pause in either the Rx orthe Tx direction. When it’s set to 1, it indicates that the local setting is to Receiveflow control. In other words, reception of XOFF can stop transmitting, and XON can


10BASE-TFull Duplex


10BASE-THalf Duplex

PauseCapability

AsymmetricPause

Capability



restart it. When set to 0, it indicates transmit flow control, which means when Rxbecomes congested, it will transmit XOFF, and once the congestion clears, it cantransmit XON.

FIGURE 5-27 Rx/Tx Flow Control in Action

Now that the Pause Capability and Asymmetric Pause Capability are established, itis required to advertise these parameters to the link partner and negotiate the pausesetting to be used for the link.

TABLE 5-16 enumerates all the possibilities for resolving the pause capabilities for alink.

TABLE 5-16 Possibilities for Resolving Pause Capabilities for a Link

Local Device Remote Device Link Resolution

cap_pause cap_asmpause lp_cap_pause lp_cap_asmpause link_pause link_asmpause

0 0 X X 0 X

0 1 0 X 0 0

0 1 1 0 0 0

1 0 0 X 0 0

1 0 1 X 1 0

1 1 0 0 0 0

1 1 1 X 1 0

Remote Machine Local Machine

On Reception of an XOFF, stop transmitting until Pause time has elapsed or XON Arrives.

Enough packets have been serviced to reduce the RX FIFO occupancy to below the threshold; send XON.

Receive in the FIFO has exceeded the Pause threshold; send XOFF.

Incoming PacketTx FIFO Rx FIFO



The link_pause and link_asmpause parameters have the same meanings as thecap_pause and cap_asmpause parameters and enumerate meaningful informationfor a link given the pause capabilities available for both sides of the link.

Example 1

cap_asmpause = 1 The device is capable of asymmetric pause.

cap_pause = 0 The device will send pauses if the Receive sidebecomes congested.

lp_cap_asmpause = 0 The device is capable of symmetric pause.

lp_cap_pause = 1 The device will send pauses if the Receive sidebecomes congested, and it will respond to pause bydisabling transmit.

link_asmpause = 0 Because both the local and remote partner are set tosend a pause on congestion, only the remote partnerwill respond to that pause. This is equivalent to no flowcontrol, as it requires both ends to stop transmitting toalleviate the Rx congestion.

link_pause = 0 Further indication that no meaningful flow control ishappening on the link.

Example 2

cap_asmpause = 1 The device is capable of asymmetric pause.

cap_pause = 1 The device will send pauses if the receive side becomescongested.

lp_cap_asmpause = 0 The device is capable of symmetric pause.

lp_cap_pause = 1 The device will send pauses if the receive side becomescongested, and it will respond to pause by disablingtransmit.

link_asmpause = 1 Because the local setting is to stop sending on arrival ofa flow control message and the remote end is set tosend flow control messages when it gets congested, wehave flow control on the receive direction of the link.Hence it’s asymmetric.

link_pause = 1 The direction of the pauses is incoming.



There are more examples of flow control from the table that can be discussed interms of flow control in action. We’ll return to this topic when discussing individualdevices that support this feature.

This concludes all the options for controlling the configuration of the EthernetPhysical Layer MII/GMII. The preceding information should come in useful whenconfiguring your network and making sure that each Ethernet link is coming up asrequired by the configuration.

Finally, there is a perception that auto-negotiation has difficulties, but most of thesewere cleared up with the introduction of Gigabit Ethernet technology. Therefore it isno longer required to disable auto-negotiation to achieve reliable operation withavailable Gigabit switches.

Fast Ethernet InterfacesSun supports four Fast Ethernet drivers for its range of SPARC platforms:

■ 10/100 hme Fast Ethernet■ 10/100 qfe Quad Fast Ethernet■ 10/100 eri Fast Ethernet■ 10/100 dmfe Fast Ethernet

The following sections describe the details of these interfaces.

10/100 hme Fast EthernetThe hme interface is available for Sbus systems and PCI bus systems. In both cases,they share the same device driver.

The interface at the lowest level provides an RJ-45 twisted-pair Ethernet physicallayer that supports auto-negotiation. There are also motherboard implementationsusing hme that provide a MII connection. This allows an external transceiver to beattached and alternative physical layer interfaces like 100BASE-T4 or 100BASE-SX tobe made available.



FIGURE 5-28 Typical hme External Connectors

At a MAC level, the interface stages packets for transmission using a single 256-element descriptor ring array. Once that ring is exhausted, other packets get queuedwaiting in the stream’s queue. When the hardware completes transmission ofpackets currently occupying space on the ring, the packets waiting on the stream’squeue are moved to the descriptor ring.

The Rx side of the descriptor ring is again a maximum of 256 elements. Once thoseelements are exhausted, no further buffering is available for incoming packets andoverflows begin to occur.

When hme was introduced to the market, 256 descriptors to Tx and Rx werereasonable because the CPU frequencies for that time were around 100 Mhz to 300Mhz, so the arrival rate of transmission packets posted to the descriptor rings closelymatched the transmission capability of the physical media.

As time progressed, CPUs became faster and this number of descriptors becameinadequate. Often the interface began to exhaust the elements in the transmit ringand incurred more scheduling overhead for transmission.

On the Rx side, as CPUs became faster, the ability to receive packets became simplerbecause less time was required to service each packet. The occupancy on the Rx ringof packets needing to be serviced diminished.

The hme interface is limited in tuning capability. If you experience low performancebecause of overflows or the transmit ring being constantly full, no corrective actionis possible.

MII ConnectorsRJ-45

100BASE-TConnector

Media Attachment unit allows a connection to Ethernet via an

alternative Physical layer media type.

Typical HME External Connectors



The physical layer of hme is fully configurable using the driver.conf file and nddcommand.

TABLE 5-17 Driver Parameters and Status

Parameter Type Description

instance Read and Write Current device instance in view for ndd

adv_autoneg_cap Read and Write Operational mode parameters

adv_100T4_cap Read and Write Operational mode parameters

adv_100fdx_cap Read and Write Operational mode parameters

adv_100hdx_cap Read and Write Operational mode parameters



use_int_xcvr Read and Write Transceiver control parameter

lance_mode Read and Write Inter-packet gap parameters

ipg0 Read and Write Inter-packet gap parameters



autoneg_cap Read only Local transceiver auto-negotiation capability

100T4_cap Read only Local transceiver auto-negotiation capability

100fdx_cap Read only Local transceiver auto-negotiation capability

100hdx_cap Read only Local transceiver auto-negotiation capability



lp_autoneg_cap Read only Link partner capability

lp_100T4_cap Read only Link partner capability

lp_100fdx_cap Read only Link partner capability

lp_100hdx_cap Read only Link partner capability



transceiver_inuse Read only Current physical layer status

link_status Read only Current physical layer status

link_speed Read only Current physical layer status

link_mode Read only Current physical layer status



Current Device Instance in View for ndd

The current device instance in view allows you to point ndd to a specific deviceinstance that needs configuration. This must be done prior to altering or viewing anyof the other parameters or you might not be viewing or altering the correctparameters.

Before you view or alter any of the other parameters, make a quick check of thevalue of instance to ensure that it is actually pointing to the device you want toview or alter.

Operational Mode Parameters

The operational mode parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest priority value is taken asthe mode of operation. See “Ethernet Physical Layer” on page 152 regarding MII.

TABLE 5-18 Instance Parameter

Parameter Values Description

instance 0-1000 Current device instance in view for the rest of thendd configuration variables

TABLE 5-19 Operational Mode Parameters


adv_autoneg_cap 0-1 Local interface capability of auto-negotiationsignaling advertised by the hardware.0 = Forced mode1 = Auto-negotiationDefault is set to the autoneg_cap parameter.

adv_100T4_cap 0-1 Local interface capability of 100-T4 advertised bythe hardware.0 = Not 100 Mbit/sec T4 capable1 = 100 Mbit/sec T4 capableDefault is set to the 100T4_cap parameter.

adv_100fdx_cap 0-1 Local interface capability of 100 full duplex isadvertised by the hardware.0 = Not 100 Mbit/sec full-duplex capable1 = 100 Mbit/sec full-duplex capableDefault is set based on the 100fdx_capparameter.



If you are using the interactive mode of ndd with this device to alter theadv_100fdx_cap parameter to adv_10hdx_cap, the changes applied to thoseparameters are not actually applied to the hardware until adv_autoneg_cap ischanged to its alternative value and then back again.

Transceiver Control Parameter

The hme driver can have an external MII physical layer device connected though theMII interface. In this case, the driver has two choices of physical layer connection tothe network. Either connection is valid. Therefore a policy is implemented thatassumes that if an external physical layer is attached, it’s because the given internalphysical layer device doesn’t provide the required media, so the driver assumes thatthe external physical layer device is the one to use.

adv_100hdx_cap 0-1 Local interface capability of 100 half duplex isadvertised by the hardware.0 = Not 100 Mbit/sec half-duplex capable1 = 100 Mbit/sec half-duplex capableDefault is set based on the 100hdx_capparameter.



TABLE 5-19 Operational Mode Parameters (Continued)




In some cases it might be necessary to override that policy. Therefore the nddparameter use_int_xcvr is provided.

Inter-Packet Gap Parameters

The Inter-Packet Gap (IPG) parameters are ipg0, ipg1, and ipg2. The total IPG isthe sum of ipg1 and ipg2 plus an optional ipg0 that will only be present when thelance_mode parameter is set. The total default IPG is 9.6 microseconds when thelink speed set by the auto-negotiation protocol is 10 Mbit/sec. When the link speedis 100 Mbit/sec, the total IPG is 0.96 microseconds.

The additional delay set by ipg0 helps to reduce collisions. Systems that havelance_mode enabled might not have enough time on the network. If lance_modeis disabled, the value of ipg0 is ignored and no additional delay is set. Only thedelays set by ipg1 and ipg2 are used. Disable lance_mode if other systems keepsending a large number of back-to-back packets.

You can add the additional delay by setting the ipg0 parameter, which is the nibbletime delay from 0 to 31. Note that nibble time is the time it takes to transfer four bitson the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the linkspeed is 100 Mbit/sec, nibble time is equal to 40 ns.

Note – IPG is sometimes increased on older systems using slower NICs, wherenewer NICs and systems are hogging the network. When a server dominates a half-duplex network it's known as server capture effect.

TABLE 5-20 Transceiver Control Parameter


use_int_xcvr 0-1 Override for the policy that the external XCVRtakes priority over the internal transceiver.0 = If an external transceiver is present, use itinstead of the internal (default).1 = If an external transceiver is present, ignore itand continue to use the internal.



For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times,multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.

All of the IPG parameters can be set using ndd or can be hard-coded into thehme.conf files. Details of the methods of setting these parameters are provided in“Configuring Driver Parameters” on page 238.

Local Transceiver Auto-negotiation Capability

The local transceiver auto-negotiation capability parameters are read-onlyparameters and represent the fixed set of capabilities associated with the currentPHY that is in use. This allows an external MII PHY device to be attached to the

TABLE 5-21 Inter-Packet Gap Parameter


lance_mode 01

lance_mode disabled

lance_mode enabled (default)

ipg0 0-31 Additional IPG before transmitting a packetDefault = 4

ipg1 0-255 First inter-packet gap parameterDefault = 8

ipg2 0-255 Second inter-packet gap parameterDefault = 8



external MII port. Therefore the capabilities presented in these statistics might varyaccording to the capabilities of the external MII physical layer device that isattached.

TABLE 5-22 Local Transceiver Auto-negotiation Capability Parameters


autoneg_cap 0-1 Local interface is capable of auto-negotiationsignaling.0 = Can only operate in forced mode1 = Capable of auto-negotiation

100T4_cap 0-1 Local interface is capable of 100-T4 operation.0 = Not 100 Mbit/sec T4 capable1 = 100 Mbit/sec T4 capable

100fdx_cap 0-1 Local interface is capable of 100 full duplexoperation.0 = Not 100 Mbit/sec full-duplex capable1 = 100 Mbit/sec full-duplex capable

100hdx_cap 0-1 Local interface is capable of 100 half-duplexoperation.0 = Not 100 Mbit/sec half-duplex capable1 = 100 Mbit/sec half-duplex capable

10fdx_cap 0-1 Local interface is capable of 10 full-duplexoperation.0 = Not 10 Mbit/sec full-duplex capable1 = 10 Mbit/sec full-duplex capable




Link Partner Capability

The link partner capability parameters are read-only parameters and represent thefixed set of capabilities associated with the attached link partner set of advertisedauto-negotiation parameters. These parameters are only meaningful when auto-negotiation is enabled and can be used in conjunction with the operation parametersto establish why there might be problems bringing up the link.

TABLE 5-23 Link Partner Capability Parameters


lp_autoneg_cap 0-1 Link partner interface is capable of auto-negotiation signaling.0 = Can only operate in forced mode1 = Capable of auto-negotiation

lp_100T4_cap 0-1 Link partner interface is capable of 100-T4operation.0 = Not 100 Mbit/sec T4 capable1 = 100 Mbit/sec T4 capable

lp_100fdx_cap 0-1 Link partner interface is capable of 100 full-duplex operation.0 = Not 100 Mbit/sec full-duplex capable1 = 100 Mbit/sec full-duplex capable

lp_100hdx_cap 0-1 Link partner interface is capable of 100 half-duplex operation.0 = Not 100 Mbit/sec half-duplex capable1 = 100 Mbit/sec half-duplex capable

lp_10fdx_cap 0-1 Link partner interface is capable of 10 full-duplexoperation.0 = Not 10 Mbit/sec full-duplex capable1 = 10 Mbit/sec full-duplex capable




Current Physical Layer Status

The current physical layer status gives an indication of the state of the link, whetherit’s up or down, or what speed and duplex it’s operating at. These parameters arederived based on the result of establishing the highest priority shared speed andduplex capability when auto-negotiation is enabled. They can be preconfigured withForced mode.

Note that the physical layer status parameters are only meaningful while ndd isrunning in interactive mode or while the interface being viewed is already initializedby virtue of the presence of open streams such as snoop -d hme0 or ifconfighme0 plumb inet up. If these streams don’t exist, then the device is uninitialized andthe state gets set up when you probe these parameters with ndd. As a result, theparameters are subject to a race between the user viewing them and the linkreaching a steady state. This makes these parameters unreliable unless an existingstream is associated with an instance prior to checking. A good rule to follow is toonly trust these parameters if the interface is configured up using the ifconfigcommand.

TABLE 5-24 Current Physical Layer Status Parameters


transceiver_inuse 0-1 This parameter indicates which transceiver iscurrently in use.0 = Internal transceiver is in use.1 = External transceiver is in use.

link_status 0-1 Current link status0 = Link down1 = Link up

link_speed 0-1 This parameter provides the link speed and isonly valid if the link is up.0 = Not 100 Mbit/sec half-duplex capable1 = 100 Mbit/sec half-duplex capable

link_mode 0-1 This parameter provides the link duplex and isonly valid if the link is up.0 = Half duplex1 = Half duplex



10/100 qfe Quad Fast EthernetThe qfe interface was developed based on the same ASIC as the hme, and the driveris very similar in terms of capabilities and limitations. One key difference is thatthere’s no external MII connector and therefore no possibility to use any physicalmedia other than 100BASE-T.

FIGURE 5-29 Typical qfe External Connectors

With the introduction of qfe came the introduction of trunking technology, whichwill be discussed later.

The physical layer of qfe is fully configurable using the driver.conf file and nddcommand.





















The current device instance in view allows you to point ndd to a particular deviceinstance that needs configuration. This must be applied prior to altering or viewingany of the other parameters or you might not be viewing or altering the correctparameters.


















TABLE 5-25 Driver Parameters and Status (Continued)





The operational mode parameters adjust the MII capabilities that are used for auto-negotiation. When auto-negotiation is disabled, the highest priority value is taken asthe mode of operation. See “Fast Ethernet Interfaces” on page 165 regarding MII.



adv_autoneg_cap 0-1 Local interface capability of auto-negotiationsignaling is advertised by the hardware.0 = Forced mode1 = Auto-negotiationDefault is set to the autoneg_cap parameter.

adv_100T4_cap 0-1 Local interface capability of 100-T4 isadvertised by the hardware.0 = Not 100 Mbit/sec T4 capable1 = 100 Mbit/sec T4 capableDefault is set to the 100T4_cap parameter.







If you are using the interactive mode of ndd with this device to alter theadv_100fdx_cap parameter to adv_10hdx_cap, the changes applied to thoseparameters are not actually applied to the hardware until adv_autoneg_cap ischanged to its alternative value and then back again.


The qfe driver has the capability to have an external MII physical layer deviceconnected, but there’s no implemented hardware to allow this feature to be utilized.The use_int_xcvr parameter should never be altered in the case of qfe.




You can add the additional delay by setting the ipg0 parameter, which is the nibbletime delay, from 0 to 31. Note that nibble time is the time it takes to transfer four bitson the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the linkspeed is 100 Mbit/sec, nibble time is equal to 40 ns.

For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times,multiply 20 by 400 ns to get 800 ns. If the link speed is 100 Mbit/sec, and you setipg0 to 30 nibble times, multiply 30 by 40 ns to get 120 ns.



lance_mode 01

lance_mode disabled




All of the IPG parameters can be set using ndd or can be hard-coded into theqfe.conf files. Details about setting these parameters are provided in “RebootPersistence Using driver.conf” on page 242.


The local transceiver auto-negotiation capability parameters are read-onlyparameters and represent the fixed set of capabilities associated with the currentPHY that is in use. This device allows an external MII PHY device to be attached tothe external MII port. Therefore the capabilities presented in these statistics mightvary according to the capabilities of the external MII physical layer device that isattached.


ipg1 0-255 First IPG parameterDefault = 8

ipg2 0-255 Second IPG parameterDefault = 8












100hdx_cap 0-1 Local interface is capable of 100 half duplexoperation.0 = Not 100 Mbit/sec half-duplex capable1 = 100 Mbit/sec half-duplex capable


10hdx_cap 0-1 Local interface is capable of 10 half duplexoperation.0 = Not 10 Mbit/sec half-duplex capable1 = 10 Mbit/sec half-duplex capable











The current physical layer status gives an indication of the state of the link, whetherit’s up or down, or what speed and duplex it’s operating at. These parameters arederived based on the result of establishing the highest priority shared speed andduplex capability when auto-negotiation is enabled or can be pre-configured withForced mode.






transceiver_inuse 0-1 Indicates which transceiver is currently in use.0 = Internal transceiver is in use.1 = External transceiver is in use.




TABLE 5-30 Link Partner Capability Parameters (Continued)




Note that the physical layer status parameters are only meaningful while ndd isrunning in interactive mode or the interface being viewed is already initialized byvirtue of the presence of open streams such as snoop -d qfe0 or ifconfig qfe0plumb inet up. If these streams don’t exist, the device is uninitialized and the stategets set up when you probe these parameters with ndd. As a result, the parametersare subject to a race between the user viewing them and the link reaching a steadystate. This makes these parameters unreliable unless an existing stream is associatedwith an instance prior to checking. A good rule to follow is to only trust theseparameters if the interface is configured up using the ifconfig command.

10/100 eri Fast EthernetWhen the eri interface was applied to the UltraSPARC III desktop systems, itaddressed many of the shortcomings of the hme interface and also eliminated theexternal MII interface.

The detailed architecture of the eri interface is made up again of a single transmitdescriptor ring and a single receive descriptor ring. However, with eri, themaximum size of the descriptor ring was increased to 8-Kbyte elements. So theopportunity to store packets for transmission is much larger and the probability ofnot having a descriptor element when attempting to transmit is reduced—alongwith the need to use the streams queue behind the transmit descriptor ring. Overall,the scheduling overhead described as a problem with hme and qfe is vastly reducedwith eri.

The eri interface is also the first interface capable of supporting the hardwarechecksumming capability, allowing it to be more efficient in bulk transferapplications.

The physical layer and performance features of eri are fully configurable using thedriver.conf file and ndd command.
















































adv_100T4_cap 0-1 Local interface capability of 100-T4 is advertisedby the hardware.0 = Not 100 Mbit/sec T4 capable1 = 100 Mbit/sec T4 capableDefault is set to the 100T4_cap parameter.




If you are using the interactive mode of ndd with this device to alter theadv_100fdx_cap parameter to adv_10hdx_cap, the changes applied to thoseparameters are not actually applied to the hardware until the adv_autoneg_cap ischanged to its alternative value and then back again.


The eri driver is capable of having an external MII physical layer device connected,but there’s no implemented hardware to allow this feature to be utilized. Theuse_int_xcvr parameter should never be altered in the case of eri.











You can add the additional delay by setting the ipg0 parameter, which is the nibbletime delay, from 0 to 31. Note that nibble time is the time it takes to transfer four bitson the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the linkspeed is 100 Mbit/sec, nibble time is equal to 40 ns.

For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times,multiply 20 by 400 ns to get 800 ns. If the link speed is 100 Mbit/sec and you setipg0 to 30 nibble times, multiply 30 by 40 ns to get 120 ns.

All of the IPG parameters can be set using ndd or can be hard-coded into theeri.conf files. Details of the methods of setting these parameters are provided in“Configuring Driver Parameters” on page 238.

TABLE 5-35 Inter-Packet Gap Parameters


lance_mode 01

lance_mode disabled


ipg0 0-31 Additional IPG before transmitting a packet

Default = 4

ipg1 0-255 First Inter-packet gap parameterDefault = 8

ipg2 0-255 Second Inter-packet gap parameterDefault = 8



Receive Interrupt Blanking Parameters

The eri device introduces the receive interrupt blanking capability to 10/100Mbit/sec ports on the UltraSPARC III desktop systems. The following table providesthe parameter names, value range, and given defaults.


The local transceiver auto-negotiation capability parameters are read-onlyparameters and represent the fixed set of capabilities associated with the currentPHY that is in use.

TABLE 5-36 Receive Interrupt Blanking Parameters


intr_blank_time 0-127 Interrupt after this number of clock cycleshas passed and the packets pending have notreached the number ofintr_blank_packets. One clock cycleequals 2048 PCI clock cycles. (Default = 6)

intr_blank_packets 0-255 Interrupt after this number of packets hasarrived since the last packet was serviced. Avalue of zero indicates no packet blanking.(Default = 8)


















TABLE 5-37 Local Transceiver Auto-negotiation Capability Parameters (Continued)











transceiver_inuse 0-1 Indicates which transceiver is currently in use.0 = Internal transceiver is in use.1 = External transceiver is in use.








Note that the physical layer status parameters are only meaningful while ndd isrunning in interactive mode or the interface being viewed is already initialized byvirtue of the presence of open streams such as snoop -d eri0 or ifconfig eri0plumb inet up. If these streams don’t exist, the device is uninitialized and the stategets set up when you probe these parameters with ndd. As a result, the parametersare subject to a race between the user viewing them and the link reaching a steadystate. This makes these parameters unreliable unless an existing stream is associatedwith an instance prior to checking. A good rule to follow is to only trust theseparameters if the interface is configured up using the ifconfig command.

10/100 dmfe Fast EthernetThe dmfe interface is a another Ethernet system interface applied to the UltraSPARCrack-mounted Netra™ X1 and Sun Fire™ V100 server systems. This interface ismuch like the others in that its architecture supports a single transmit and receivedescriptor run. The number of elements in its descriptor ring is fixed to 32 elements.

The physical layer of dmfe is fully configurable using the driver.conf file andndd command.







































The local transceiver auto-negotiation capability parameters are read-onlyparameters and represent the fixed set of capabilities associated with the currentPHY that is in use. This device allows an external MII PHY device to be attached to








the external MII port. Therefore, the capabilities presented in these statistics mightvary according to the capabilities of the external MII physical layer device that isattached.



autoneg_cap 0-1 Local interface is capable of auto-negotiationsignaling.0 = Can only operate in Forced mode1 = Capable of auto-negotiation






















Fiber Gigabit EthernetThe first physical media available at 1 gigabit was the fiber media. This mediaallows Ethernet to stretch to the one-kilometer range using fiber-optic cable.

The first interface introduced to provide the 1-gigabit capability was the Sun GigabitEthernet adapter, vge. This was quickly followed by the ge interface, which wasthen followed by the high-performance ce interface. This section describes theseinterfaces in detail and explains how they can be best utilized to maximize theperformance of the network that they drive or are simply part of.








FIGURE 5-30 Typical vge and ge MMF External Connectors

1000 vge Gigabit EthernetThe vge gigabit interface exists only as a fiber interface 1000BASE-SX and isavailable for support of existing Sbus-capable systems or PCI bus systems.

The vge interface was also the first available interface to support VLAN capability.

1000 ge Gigabit EthernetThe ge gigabit interface exists only as a fiber interface 1000BASE-SX and is availablefor support of existing Sbus-capable systems or PCI bus systems. The architecture isthe same as the eri interface, with one transfer ring and one receive ring.

The ge interface employs the hardware checksumming capability described above toreduce the cost of the TCP/IP checksum calculation.

During its development, the interface was always challenging the limits of theSPARC systems, so it has many tunable features that can be set to provide the bestsystem and application performance.



The ge interface also provides Layer 2 flow control capability. The physical layerand performance features of ge are fully configurable using the driver.conf fileand ndd command.

















intr_blank_time Read and Write Receive interrupt blanking parameters

intr_blank_packets Read and Write Receive interrupt blanking parameters








































adv_1000fdx_cap 0-1 Local interface capability of 1000 full duplex isadvertised by the hardware.0 = Not 1000 Mbit/sec full-duplex capable1 = 1000 Mbit/sec full-duplex capableDefault is set to the 1000fdx_cap parameter.

adv_1000hdx_cap 0-1 Local interface capability of 1000 half duplex isadvertised by the hardware.0 = Not 1000 Mbit/sec half-duplex capable1 = 1000 Mbit/sec half-duplex capableDefault is set to the 1000hdx_cap parameter.







The ge driver has the capability to have an external MII physical layer deviceconnected, but there’s no implemented hardware to allow this feature to be utilized.The use_int_xcvr parameter should never be altered in the case of ge.


The Inter-Packet Gap (IPG) parameters are ipg0, ipg1, and ipg2. The total IPG isthe sum of ipg1 and ipg2 plus an optional ipg0 that will only be present when thelance_mode parameter is set. The total default IPG is 9.6 microseconds when thelink speed set by the auto-negotiation protocol is 10 Mbit/sec. When the link speedis 100 Mbit/sec, the total IPG is 0.96 microseconds. When the link speed is 1000Mbit/sec, the total IPG is 0.096 microseconds.









You can add the additional delay by setting the ipg0 parameter, which is the nibbletime delay, from 0 to 31. Note that nibble time is the time it takes to transfer four bitson the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the linkspeed is 100 Mbit/sec, nibble time is equal to 40 ns. If the link speed is 1000Mbit/sec, the nibble time is 4 ns.

For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times,multiply 20 by 400 ns to get 800 ns. If the link speed is 100 Mbit/sec and you setipg0 to 30 nibble times, multiply 30 by 40 ns to get 120 ns.

All of the IPG parameters can be set using ndd or can be hard-coded into thege.conf files. Details of the methods of setting these parameters are provided in“Configuring Driver Parameters” on page 238.



lance_mode 01

lance_mode disabled








The ge device introduces the receive interrupt blanking capability to 1-Gbit/secports. TABLE 5-49 lists and describes the parameters.

Note – ge and ce fiber devices do not support 100 Mbit/sec capabilities. Theysupport 1000 Mbit/sec only.



intr_blank_time 0-127 Interrupt after this number of clock cycles haspassed and the packets pending have not reachedthe number of intr_blank_packets. One clockcycle equals 2048 PCI clock cycles. Note: Giventhat this time is linked to PCI clock, an adapterplugged into a 66-MHz PCI slot will have ashorter blanking time. Relative to one 33-MHzslot, it will be a multiple of two. (Default = 6)

intr_blank_packets 0-255 Interrupt after this number of packets has arrivedsince the last packet was serviced. A value ofzero indicates no packet blanking. (Default = 8)




The local transceiver auto-negotiation capability parameters are read-onlyparameters and represent the fixed set of capabilities associated with the currentPHY that is in use.































transceiver_inuse 0-1 This parameter indicates which transceiver iscurrently in use.0 = Internal transceiver is in use.1 = External transceiver is in use.





Note that the physical layer status parameters are only meaningful while ndd isrunning in interactive mode, or the interface being viewed is already initialized byvirtue of the presence of open streams such as snoop -d ge0 or ifconfig hme0plumb inet up. If these streams don’t exist, the device is uninitialized and the stategets set up when you probe these parameters with ndd. As a result, the parametersare subject to a race between the user viewing them and the link reaching a steadystate. This makes these parameters unreliable unless an existing stream is associatedwith an instance prior to checking. A good rule to follow is to only trust theseparameters if the interface is configured up using the ifconfig command.

Performance Tunable Parameters

Gigabit Ethernet pushes systems to their limits, and in some cases it can overwhelmthem. Therefore, much analysis has occurred and special system parameters areavailable that help in tuning the ge card for a particular system or application.




TABLE 5-52 Current Physical Layer Status Parameters (Continued)




Note that just as the tunables can be used to enhance performance, they can alsodegrade performance.

TABLE 5-53 Performance Tunable Parameters


ge_intr_mode 0-1 Enables the ge driver to send packets directlyto the upper communication layers rather thanqueueing.0 = Packets are not passed in the interruptservice routine but are placed in a streamsservice queue and passed to the protocol stacklater, when the streams service routine runs.(default)1 = Packets are passed directly to the protocolstack in the interrupt context.Default = 0 (queue packets to upper layers)

ge_dmaburst_mode 0-1 Enables infinite burst mode for PCI DMAtransactions rather than using cache-line sizePCI DMA transfers. This feature is supportedonly on Sun platforms with the UltraSparc IIICPU.0 = Disabled (default)1 = Enabled

ge_tx_fastdvma_min 59-1500 Minimum packet size to use fast dvmainterfaces rather than standard dma interfaces.Default = 1024

ge_nos_tmd 32-8192 Number of transmit descriptors used by thedriver.Default = 512

ge_tx_bcopy_max 60-256 Maximum packet size to use copy of bufferinto premapped dma buffer rather thanremapping.Default = 256



The ge tunable parameters require that the /etc/system file be modified and thesystem rebooted to apply the changes. See “Using /etc/system to TuneParameters” on page 244.

The tuning variables ge_use_rx_dvma and ge_do_fastdvma are of particularinterest because they control whether the ge driver uses fast dvma or the regularddi_dma interface. Currently the setting applied is fast dvma, but with every newoperating system release the ddi_dma interface is being improved and theperformance difference between the two interfaces might be eliminated.

The ge_nos_tmd can be used to adjust the size of the transmit descriptor ring. Thismight be required if the driver is experiencing a large number of notmd, as thisindicates that the arrival rate of packets for the descriptor ring exceeds the rate thatthe hardware can transmit. In that case, increasing the descriptor ring might be aremedy.

The ge_put_cfg—in conjunction with ge_intr_mode controls the receive packetdelivery model. When the ge_intr_mode is 1, the interface passes packets to theprotocol stack in the interrupt context. When set to 0, the delivery model iscontrolled by ge_put_cfg. When it is set to 0, the ge driver provides a special-casesoftware load balancing where there’s only one worker thread; when set to 1, it usesthe regular streams service routine.

The transmit control tunables, ge_tx_bcopy_max, ge_tx_stream_min, andge_tx_fastdvma_min, define the thresholds for the transmit buffer method.

The ge_tx_onemblk controls coalescing of multiple message blocks that make up asingle packet into one message block. In many cases where system memory latencyis high, it makes sense to avoid individually mapping packet fragments. Instead,

ge_nos_txdvma 0-8192 Number of dvma buffers (for transmit) used inthe driver.Default = 256

ge_tx_onemblk 1-100 Number of fragments that must exist in anyone packet before ge_tx_onemblk coalescesthem into a fresh mblk.Default = 2

ge_tx_stream_min 256-1000 For DMA, this parameter determines whetherto use DDI_DMA_CONSISTENT orDDI_DMA_STREAMING. If the packet length isless than ge_tx_stream_min, then we useDDI_DMA_CONSISTENT.Default = 512

TABLE 5-53 Performance Tunable Parameters (Continued)




you can have the driver create a new buffer, bring all the fragments together, anduse only one DMA buffer. This feature is especially useful for HTTP serverapplications.

The ge_nos_txdvma controls the pool of fast dvma resources associated with adriver. Since fast dvma resources are finite within a system, it is possible for onedevice to monopolize all of those resources. The tunable is designed to avoid thisscenario and allow the ge driver to allocate a limited number of resources that canbe shared at runtime with instances switching to transmit packets using the dvmainterface. A clearer description of this will be presented later based on kstatinformation feedback.

10/100/1000 ce GigaSwift Gigabit EthernetThe Sun GigaSwift Ethernet adapter relieves congestion experienced at the backboneand server levels by today’s networks, and provides a future upgrade path for high-end workstations that require more bandwidth than fast Ethernet can provide.

The Sun GigaSwift Ethernet MMF adapter is a single-port gigabit Ethernet fiber-optics PCI Bus card. It operates in 1000 Mbit/sec Ethernet networks only. Theconfiguration capability of the GigaSwift Ethernet is exactly the same as that of thecopper GigaSwift adapter except that it is unable to negotiate any speeds other than1000 Mbit/sec. The detailed discussion of the copper GigaSwift adapter will coverany configuration details that apply to the MMF interface.

FIGURE 5-31 Sun GigaSwift Ethernet MMF Adapter Connectors

The Sun GigaSwift Ethernet UTP adapter is a single-port gigabit Ethernet copper-based PCI Bus card. It can be configured to operate in 10 Mbit/sec, 100 Mbit/sec, or1000 Mbit/sec Ethernet networks.

FIGURE 5-32 Sun GigaSwift Ethernet UTP Adapter Connectors

There is also a Dual Fast Ethernet/Dual SCSI PCI adapter card that is supported bythe GigaSwift Ethernet device driver yet is limited to 100BASE-TX capability.



The ce interface employs the hardware checksumming capability described above toreduce the cost of the TCP/IP checksum calculation.

The ce interface also provides Layer 2 flow control capability, RED, and InfiniteBurst. The physical layer and performance features of ce are configurable using thedriver.conf file and ndd command.




adv-autoneg-cap Read and Write Operational mode parameters

adv-1000fdx-cap Read and Write Operational mode parameters

adv-1000hdx-cap Read and Write Operational mode parameters

adv-100T4-cap Read and Write Operational mode parameters





adv-asmpause-cap Read and Write Flow control parameter

adv-pause-cap Read and Write Flow control parameter

master-cfg-enable Read and Write Gigabit link clock mastership controls

master-cfg-value Read and Write Gigabit link clock mastership controls

use-int-xcvr Read and Write Transceiver control parameter

enable-ipg0 Read and Write Inter-packet gap parameters




rx-intr-pkts Read and Write Receive interrupt blanking parameters

rx-intr-time Read and Write Receive interrupt blanking parameters

red-dv4to6k Read and Write Random early detection and packet dropvectors





With the ce driver, any changes applied to the above parameters take effectimmediately.


The current device instance in view allows you to point ndd to a particular deviceinstance for configuration. This must be applied prior to altering or viewing any ofthe other parameters or you might not be able to view or alter the correctparameters.

Before viewing or altering any of the other parameters, be sure to check of the valueof instance to ensure that it is actually pointing to the device you want toconfigure.


tx-dma-weight Read and Write PCI interface parameters

rx-dma-weight Read and Write PCI interface parameters

infinite-burst Read and Write PCI interface parameters

disable-64bit Read and Write PCI interface parameters

accept-jumbo Read and Write Jumbo frames enable parameter









The following parameters adjust the MII capabilities which are used for auto-negotiation. When auto-negotiation is disabled, the highest priority value is taken asthe mode of operation. See “Ethernet Physical Layer” on page 152 regarding MII.



adv-autoneg-cap 0-1 Local interface capability is advertised by thehardware.0 = Forced mode1 = Auto-negotiation (default)

adv-1000fdx-cap 0-1 Local interface capability is advertised by thehardware.0 = Not 1000 Mbit/sec full-duplex capable1 = 1000 Mbit/sec full-duplex capable (default)

adv-1000hdx-cap 0-1 Local interface capability is advertised by thehardware.0 = Not 1000 Mbit/sec half-duplex capable1 = 1000 Mbit/sec half-duplex capable (default)

adv-100T4-cap 0-1 Local interface capability is advertised by thehardware.0 = Not 100-T4 capable (default)1 = 100-T4 capable







Flow Control Parameters

The ce device is capable of sourcing (transmitting) and terminating (receiving)pause frames conforming to the IEEE 802.3x Frame Based Link Level Flow ControlProtocol. In response to received flow control frames, the ce device can slow downits transmit rate. On the other hand, the ce device is capable of sourcing flow controlframes, requesting the link partner to slow down, provided that the link partnersupports this feature. By default, the driver advertises both transmit and receivepause capability during auto-negotiation.

TABLE 5-57 provides flow control keywords and describes their function.

Gigabit Link Clock Mastership Controls

The concept of link clock mastership was introduced with one-gigabit twisted-pairtechnology. This concept requires one side of the link to be the master that providesthe link clock and the other to be the slave that uses the link clock. Once thisrelationship is established, the link is up and data can be communicated. Two

TABLE 5-57 Read-Write Flow Control Keyword Descriptions

Keyword Description

adv-asmpause-cap The adapter supports asymmetric pause, which means itcan pause only in one direction.0 = Off (default)1 = On

adv-pause-cap This parameter has two meanings depending on thevalue of adv-asmpause-cap. (Default = 0)If adv-asmpause-cap = 1 while adv-pause-cap = 1,pauses are received.If adv-asmpause-cap = 1 while adv-pause-cap = 0,pauses are transmitted.If adv-asmpause-cap = 0 while adv-pause-cap = 1,pauses are sent and received.If adv-asmpause-cap = 0, then adv-pause-capdetermines whether Pause capability is on or off.



physical layer parameters control whether a side is the master or the slave orwhether mastership is negotiated with the link partner. Those parameters are asfollows.


The ce driver is capable of having an external MII physical layer device connected,but there’s no implemented hardware to allow this feature to be utilized. Theuse_int_xcvr parameter should never be altered.


The Inter-Packet Gap (IPG) parameters are ipg0, ipg1, and ipg2. The total IPG isthe sum of ipg1 and ipg2 plus an optional ipg0 that will only be present when theenable-ipg0 parameter is set. The total default IPG is 9.6 microseconds when thelink speed set by the auto-negotiation protocol is 10 Mbit/sec. When the link speedis 100 Mbit/sec, the total IPG is 0.96 microseconds, and for 1 Gbit/sec, it dropsdown to 0.096 microseconds.

The additional delay set by ipg0 helps to reduce collisions. Systems that haveipg0-enable set might not have enough time on the network. If ipg0 is disabled,the value of ipg0 is ignored and no additional delay is set. Only the delays set byipg1 and ipg2 are used. Clear enable-ipg0 if other systems keep sending a largenumber of back-to-back packets.

TABLE 5-58 Gigabit Link Clock Mastership Controls


master-cfg-enable Determines whether or not during the auto-negotiation process thelink clock mastership is set up automatically.

master-cfg-value If the master-cfg-enable parameter is set, the mastership is notset up automatically but is dependant on the value of master-cfg-value. If the master-cfg-value is set, the physical layerexpects the local device to be the link master. If it is not set, thephysical layer expects the link partner to be the master.If auto-negotiation is not enabled, the value of master-cfg-enable is ignored and the value of master-cfg-value is key tothe link clock mastership. If the master-cfg-value is set, thephysical layer expects the local device to be the link master. If it’snot set, the physical layer expects the link partner to be the master.



You can add the additional delay by setting the ipg0 parameter, which is the mediabyte time delay, from 0 to 255. Note that nibble time is the time it takes to transferfour bits on the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns.If the link speed is 100 Mbit/sec, nibble time is equal to 40 ns.

For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times,multiply 20 by 400 ns to get 800 ns. If the link speed is 100 Mbit/sec and you setipg0 to 30 nibble times, multiply 30 by 40 ns. to get 120 ns. If the link speed is 1000Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.

All of the IPG parameters can be set using ndd or can be hard-coded into thece.conf files. Details of the methods of setting these parameters are provided in“Configuring Driver Parameters” on page 238.


The ce device introduces the receive interrupt blanking capability to 1-Gbit/secports. TABLE 5-60 describes the receive interrupt blanking values.



enable-ipg0 0-1 Enables ipg0.0 = ipg0 disabled

1 = ipg0 enabled

Default = 1





Field Name Values Description

rx-intr-pkts 0 to 511 Interrupt after this number of packets have arrived sincethe last packet was serviced. A value of zero indicates nopacket blanking. (Default = 8)

rx-intr-time 0 to 524287 Interrupt after 4.5 microsecond ticks have elapsed sincethe last packet was serviced. A value of zero indicates notime blanking. (Default = 3)



Random Early Drop Parameters

TABLE 5-61 describes the Rx random early detection 8-bit vectors, which allow you toenable random early drop (RED) thresholds. When received packets reach the REDrange, packets are dropped according to the preset probability. The probabilityshould increase when the FIFO level increases. Control packets are never droppedand are not counted in the statistics.

TABLE 5-61 Rx Random Early Detecting 8-Bit Vectors

Field Name Values Description

red-dv4to6k 0 to 255 Random early detection and packet drop vectors whenFIFO threshold is greater than 4096 bytes and less than6144 bytes. Probability of drop can be programmed ona 12.5 percent granularity. For example, if bit 0 is set,the first packet out of every eight will be dropped inthis region. (Default = 0)

red-dv6to8k 0 to 255 Random early detection and packet drop vectors whenFIFO threshold is greater than 6144 bytes and less than8192 bytes. Probability of drop can be programmed ona 12.5 percent granularity. For example, if bit 0 is set,the first packet out of every eight will be dropped inthis region. (Default = 0)

red-dv8to10k 0 to 255 Random early detection and packet drop vectors whenFIFO threshold is greater than 8192 bytes and less than10,240 bytes. Probability of drop can be programmedon a 12.5 percent granularity. For example, if bits 1 and6 are set, the second and seventh packets out of everyeight will be dropped in this region. (Default = 0)

red-dv10to12k 0 to 255 Random early detection and packet drop vectors whenFIFO threshold is greater than 10,240 bytes and lessthan 12,288 bytes. Probability of drop can beprogrammed on a 12.5 percent granularity. If bits 2, 4,and 6 are set, then the third, fifth, and seventh packetsout of every eight will be dropped in this region.(Default = 0)



PCI Bus Interface Parameters

These parameters allow you to modify PCI interface features to gain better PCIperformance for a given application.

Jumbo Frames Enable Parameter

This new feature, only recently added to the GigaSwift driver, allows the ce deviceto communicate with larger MTU frames.

Once jumbo frames capability is enabled, the MTU can be controlled usingifconfig. The MTU can be raised to 9000 or reduced to the regular 1500-byteframes.

TABLE 5-62 PCI Bus Interface Parameters


tx-dma-weight 0-3 Determines the multiplication factor for granting credit tothe Tx side during a weighted round-robin arbitration.Values are 0 to 3. (Default = 0) Zero means no extraweighting. The other values are powers of 2 extraweighting, on that traffic. For example, if tx-dma-weight = 0 and rx-dma-weight = 3, then as long as Rxtraffic is continuously arriving, its priority will be eighttimes greater than Tx to access the PCI.

rx-dma-weight 0-3 Determines the multiplication factor for granting credit tothe Rx side during a weighted round-robin arbitration.Values are 0 to 3. (Default = 0)

infinite-burst 0-1 Allows the infinite burst capability to be utilized. Whenthis is in effect and the system supports infinite burst, theadapter will not free the bus until complete packets aretransferred across the bus. Values are 0 or 1. (Default = 0)

disable-64bit 0-1 Switches off 64-bit capability of the adapter. In somecases, it is useful to switch off this feature.Values are 0 or 1. (Default = 0, which enables 64-bitcapability)

TABLE 5-63 Jumbo Frames Enable Parameter


accept-jumbo 0-1 0 = Jumbo frames are disabled1 = Jumbo frames are enable(Default = 0)



Performance Tunables

GigaSwift Ethernet pushes systems even further than ge did. Many lessons werelearned from ge, leading to a collection of special system tunables that assist intuning the ce card for a specific system or application. Note that just as the tunablescan be used to enhance performance, they can also degrade performance. Handlewith great care.

TABLE 5-64 Performance Tunable Parameters


ce_taskq_disable 0-1 Disables the use of task queues and forces allpackets to go up to Layer 3 in the interruptcontext.Default depends on whether the number of CPUsin the system exceeds the ce_cpu_threshold.

ce_inst_taskqs 0-64 Controls the number of taskqs set up per cedevice instance. This value is only meaningful ifce_taskq_disable is false.Any value less than 64 is meaningful.(Default = 4).

ce_srv_fifo_depth 30-100000 The size of the service FIFO, in number ofelements. This variable can be any integer value.(Default = 2048)

ce_cpu_threshold 1-1000 The threshold for the number of CPUs requiredin the system and online before the taskqs areutilized to Rx packets.(Default = 4)

ce_start_cfg 0-1 An enumerated type that can have a value of 0 orand 1.0 = Transmit algorithm doesn’t do serialization,1 = Transmit algorithm does serialization.(Default = 0)

ce_put_cfg 0-2 An enumerated type that can have a value of 0, 1,or 2.0 = Receive processing occurs in the interruptcontext.1 = Receive processing occurs in the workerthreads.2 = Receive processing occurs in the streamsservice queues routine.(Default = 0)



ce_reclaim_pending 1-4094 The threshold when reclaims start happening.Currently 32 for both ge and ce drivers.Keep it less than ce_tx_ring_size/3.(Default = 32)

ce_ring_size 32-8216 The size of the Rx buffer ring, a ring of bufferdescriptors for Rx. One buffer = 8K. This valuemust be Modulo 2, and its maximum value is 8K.(Default = 256)

ce_comp_ring_size 0-8216 The size of each Rx completion descriptor ring. Italso is Modulo 2.(Default = 2048)

ce_comp_ring_size 0-8216 The size of each Tx descriptor ring. It also isModulo 2.(Default = 2048)

ce_tx_ring_mask 0-3 A mask to control which Tx rings are used.(Default = 3)

ce_no_tx_lb 0-1 Disables the Tx load balancing and forces alltransmission to be posted to a single descriptorring.0 = Tx load balancing is enabled.1 = Tx load balancing is disabled.(Default = 1)

ce_bcopy_thresh 0-8216 The mblk size threshold used to decide when tocopy a mblk into a pre-mapped buffer asopposed to using DMA or other methods.(Default = 256)





The performance tunables require an understanding of some key kernel statisticsfrom the ce driver to be used successfully. There might also be an opportunity to usethe RED features and interrupt blanking both configurable using the nddcommands. A clearer description of this will be presented later based on kstatinformation feedback.

10/100/1000 bge Broadcom BCM 5704 GigabitEthernetThe bge interface is a another Ethernet system interface applied to the UltraSPARCIII rack-mounted Sun Blade V210 and Sun Blade V240 server systems. This interfaceis much like the others in that its architecture supports a single Tx and Rx descriptorring.

ce_dvma_thresh 0-8216 The mblk size threshold used to decide when touse the fast path DVMA interface to transmitmblk.(Default = 1024)

ce_dma_stream_thresh 0-8216 This global variable splits the ddi_dma mappingmethod further by providing Consistentmapping and Streaming mapping. In the Txdirection, Streaming is better for largertransmissions than Consistent mappings. Themblk size falls in the range greater than 256 bytesbut less than 1024 bytes; then mblk fragment willbe transmitted using ddi_dma methods.(Default = 512)

ce_max_rx_pkts 32-1000000

The number of receive packets that can beprocessed in one interrupt before it must exit.(Default = 512)





The physical layer of bge is fully configurable using the bge.conf file and nddcommands.











adv_asm_pause_cap Read and Write Operational mode parameters

adv_pause_cap Read and Write Operational mode parameters







asm_pause_cap Read and Write Local transceiver auto-negotiation capability

pause_cap Read and Write Local transceiver auto-negotiation capability







lp_asm_pause_cap Read only Link partner capability

lp_pause_cap Read only Link partner capability










adv_autoneg_cap 0-1 Local interface capability of auto-negotiation signaling isadvertised by the hardware.0 = Forced mode1 = Auto-negotiationDefault is set to the autoneg_cap parameter.

adv_100T4_cap 0-1 Local interface capability of 100-T4 advertised is by thehardware.0 = Not 100 Mbit/sec T4 capable1 = 100 Mbit/sec T4 capableDefault is set to the 100T4_cap parameter.

adv_100fdx_cap 0-1 Local interface capability of 100 full duplex is advertisedby the hardware.0 = Not 100 Mbit/sec full-duplex capable1 = 100 Mbit/sec full-duplex capableDefault is set based on the 100fdx_cap parameter.

adv_100hdx_cap 0-1 Local interface capability of 100 half duplex is advertisedby the hardware.0 = Not 100 Mbit/sec half-duplex capable1 = 100 Mbit/sec half-duplex capableDefault is set based on the 100hdx_cap parameter.

adv_10fdx_cap 0-1 Local interface capability of 10 full duplex is advertised bythe hardware.0 = Not 10 Mbit/sec full-duplex capable1 = 10 Mbit/sec full-duplex capableDefault is set based on the 10fdx_cap parameter.






adv_10hdx_cap 0-1 Local interface capability of 10 half duplex is advertisedby the hardware.0 = Not 10 Mbit/sec half-duplex capable1 = 10 Mbit/sec half-duplex capableDefault is set based on the 10hdx_cap parameter.

adv_asm_pause_cap 0-1 The adapter supports asymmetric pause, which means itcan pause only in one direction.0 = Off1 = On(Default = 1)

adv_pause_cap 0-1 This parameter has two meanings, depending on the valueof adv_asm_pause_cap.If adv_asm_pause_cap = 1 while adv_pause_cap = 1,pauses are received and Transmit is limited.If adv_asm_pause_cap = 1 while adv_pause_cap = 0,pauses are transmitted.If adv_asm_pause_cap = 0 while adv_pause_cap = 1,pauses are sent and received.If adv_asm_pause_cap = 0, adv_pause_capdetermines whether pause capability is on or off.(Default = 0)






The local transceiver auto-negotiation capability parameters are read-onlyparameters and represent the fixed set of capabilities associated with the PHY that iscurrently in use.





100fdx_cap 0-1 Local interface is capable of 100 full-duplex operation.0 = Not 100 Mbit/sec full-duplex capable1 = 100 Mbit/sec full-duplex capable

100hdx_cap 0-1 Local interface is capable of 100 half-duplex operation.0 = Not 100 Mbit/sec half-duplex capable1 = 100 Mbit/sec half-duplex capable

10fdx_cap 0-1 Local interface is capable of 10 full-duplex operation.0 = Not 10 Mbit/sec full-duplex capable1 = 10 Mbit/sec full-duplex capable



10hdx_cap 0-1 Local interface is capable of 10 half-duplex operation.0 = Not 10 Mbit/sec half-duplex capable1 = 10 Mbit/sec half-duplex capable

asm_pause_cap 0-1 The adapter supports asymmetric pause, which meansit can pause only in one direction.0 = Off1 = On(Default = 1)

pause_cap 0-1 This parameter has two meanings depending on thevalue of asm_pause_cap.If asm_pause_cap = 1 while pause_cap = 1,pauses are received, and transmit is limited.If asm_pause_cap = 1 while pause_cap = 0,pauses are transmitted.If asm_pause_cap = 0 while pause_cap = 1,pauses are sent and received.If asm_pause_cap = 0,pause_cap determines whether pause capability ison or off.(Default = 0)

TABLE 5-67 Local Transceiver Auto-negotiation Capability Parameters (Continued)








lp_autoneg_cap 0-1 Link partner interface is capable of auto-negotiationsignaling.0 = Can only operate in Forced mode1 = Capable of auto-negotiation

lp_100T4_cap 0-1 Link partner interface is capable of 100-T4 operation.0 = Not 100 Mbit/sec T4 capable1 = 100 Mbit/sec T4 capable


lp_100hdx_cap 0-1 Link partner interface is capable of 100 half-duplexoperation.0 = Not 100 Mbit/sec half-duplex capable1 = 100 Mbit/sec half-duplex capable




lp_10hdx_cap 0-1 Link partner interface is capable of 10 half-duplexoperation.0 = Not 10 Mbit/sec half-duplex capable1 = 10 Mbit/sec half-duplex capable

lp_asm_pause_cap 0-1 The adapter supports asymmetric pause, which means itcan pause only in one direction.0 = Off1 = On(Default = 1)

lp_pause_cap 0-1 This parameter has two meanings depending on thevalue of lp_asm_pause_cap.If lp_asm_pause_cap = 1 while lp_pause_cap = 1,pauses are received, and Transmit is limited.If lp_asm_pause_cap = 1 while lp_pause_cap = 0,pauses are transmitted.If lp_asm_pause_cap = 0 while lp_pause_cap = 1,pauses are sent and received.If lp_asm_pause_cap = 0, then lp_pause_capdetermines whether pause capability is on or off.(Default = 0)







Sun VLAN TechnologyVLANs allow you to split your physical LAN into logical subparts, providing anessential tool for increasing the efficiency and flexibility of your network. VLANs arecommonly used to separate groups of network users into manageable broadcastdomains, to create logical segmentation of workgroups, and to enforce securitypolicies within each logical segment. Each defined VLAN behaves as its ownseparate network, with its traffic and broadcasts isolated from the others, increasingthe bandwidth efficiency within each logical group.

VLAN technology is also useful for containing jumbo frames. If a VLAN isconfigured with the ability to use jumbo frames, then the fact that the jumbo frameconfiguration is part of a VLAN ensures that the jumbo frames never leave theVLAN network.





link_mode 0-1 This parameter provides the link duplex and isonly valid if the link is up.0 = Half duplex1 = Full duplex



Although VLANs are commonly used to create individual broadcast domainsand/or separate IP subnets, it is sometimes useful for a server to have a presence onmore than one VLAN simultaneously. Several Sun products support multipleVLANs on a per-port or per-interface basis, allowing very flexible networkconfigurations.

FIGURE 5-33 shows an example network that uses VLANs.

FIGURE 5-33 Example of Servers Supporting Multiple VLANs with Tagging Adapters

SoftwarePC 1(VLAN 2)

VLAN 1VLAN 2VLAN 3

SoftwarePC 2(VLAN 2)

AccountingServer(VLAN 3)

Main ServerAdapterGigabit/Tagged(All VLANs)

EngineeringPC 3(VLAN 1)

AccountingPC 4(VLAN 3)

Engineering/Software PC 5Adapter Gigabit/Tagged(VLAN 1 & 2)

Shared Media Segment



VLAN ConfigurationVLANs can be created according to various criteria, but each VLAN must beassigned a VLAN tag or VLAN ID (VID). The VID is a 12-bit identifier between 1and 4094 that identifies a unique VLAN. For each network interface (ce0, ce1, ce2,and so on), 4094 possible VLAN IDs can be selected over an individual ce instance.

Once the VLAN tag is chosen, a VLAN can be configured on a subnet using a ceinterface with the ifconfig command. The VLAN tag is multiplied by 1000 and theinstance number of the device, also the device Primary Point of Attachment (PPA), isadded to give a VLAN PPA.

For a VLAN with VID 123 that needs to be configured over ce0, the new VLAN PPAwould be 123000.

With this new PPA you can proceed to configure the ce interface within the VLAN.

You can also set up a configuration that is persistent through a reboot by creating ahostname file.

In summary, the VLAN PPA is calculated using the simple formula:

VLAN PPA = VID * 1000 + Device PPA

Note – Only GigaSwift NICs using the ce driver and Solaris 8 VLAN packages haveVLAN tagging capabilities. Other NICs do not.

Sun Trunking TechnologySun Trunking™ software provides the ability to aggregate multiple links between apair of devices so that they work in parallel as if they were a single link. Onceaggregated, these point-to-point links operate as a single highly available “fat pipe,”providing increased network bandwidth as well as high availability. For a given linklevel connection, trunking enables you to add bandwidth up to the maximumnumber of network interface links supported.

# ifconfig ce123000 plumb inet up

# hostaname.ce123000 inet



Note – Sun Trunking is not included with the Solaris operating system. This is anunbundled software product.

Sun Trunking provides trunking support for the following network interface cards:

■ Sun Quad FastEthernet adapter, qfe■ Sun GigabitEthernet adapter, ge■ Sun GigaSwift Ethernet UTP or MMF adapter, ce■ Sun Dual FastEthernet and Dual SCSI/P adapter, ce■ Sun Quad GigaSwift Ethernet adapter, ce

The key to enabling the trunking capability is the nettr command. This commandcan be used to trunk devices of the same technology together. Once trunked, a trunkhead interface is established, and that interface is used by ifconfig to complete theconfiguration.

For example, if the two qfe instances (qfe0 and qfe1) need to be trunked, once thenettr command is complete, the trunk head would be assigned to qfe0. Then youcould proceed to ifconfig to make the trunk operate under the TCP/IP protocolstack.

Trunking ConfigurationThe nettr(1M) utility is used to configure trunking. nettr(1M)can be used to:

■ set up a trunk■ release a trunk■ display a trunk configuration■ display statistics of trunked interfaces

Following is the command syntax for nettr for setting up a trunk or modifying theconfiguration of the trunk members. The items in the square brackets are optional.

nettr -setup head-instance device=<qfe | ce | ge>members=<instance,instance,.,.> [ policy=<number> ]



Trunking Policies

MAC■ Is the default policy used by the Sun Trunking software. MAC is the preferred

policy to use with switches. Most trunking-capable switches require use of theMAC hashing policy, but check your switch documentation.

■ Uses the last three bits of the MAC address of both the source and destination.For two ports, the MAC address of the source and destination is first XORed:Result = 00, 01, which selects the port.

■ Favors a large population of clients. For example, using MAC ensures that 50percent of the client connections will go through one of two ports in a two-porttrunk.

Round-Robin■ Is the preferred policy with a back-to-back connection used between the output of

a transmitting device and the input of an associated receiving device.

■ Uses each network interface of the trunk in turn as a method of distributingpackets over the assigned number of trunking interfaces.

■ Could have an impact on performance because the temporal ordering of packetsis not observed.

IP Destination Address■ Uses the four bytes of the IP destination address to determine the transmission

path.

■ If a trunking interface host has one IP source address and it is necessary tocommunicate to multiple IP clients connected to the same router, then the IPDestination Address policy is the preferred policy to use.

IP Source Address/IP Destination Address■ Connects the source server to the destination based on where the connection

originated or terminated.

■ Uses the four bytes of the source and destination IP addresses to determine thetransmission path.

■ The primary use of the IP Source/IP Destination Address policy occurs whereyou use the IP virtual address feature to give multiple IP addresses to a singlephysical interface.



For example, you might have a cluster of servers providing network services inwhich each service is associated with a virtual IP address over a given interface. Ifa service associated with an interface fails, the virtual IP address migrates to aphysical interface on a different machine in the cluster. In such an arrangement,the IP Source Address/IP Destination Address policy gives you a greater chanceof using more different links within the trunk than would the IP DestinationAddress policy.

Network ConfigurationThis section describes how to edit the network host files after any of the Sunadapters have been installed on your system. The section contains the followingtopics:

■ “Configuring the System to Use the Embedded MAC Address” on page 233■ “Configuring the Network Host Files” on page 234■ “Setting Up a GigaSwift Ethernet Network on a Diskless Client System” on page 235■ “Installing the Solaris Operating System Over a Network” on page 236

Configuring the System to Use the EmbeddedMAC AddressAll Sun networking adapters have a MAC address embedded on their PROMassociated with each port available on the adapter. To use the adapter’s embeddedMAC address instead of the MAC address on the system’s IDPROM, set the local-mac-address\? OBP property to true. You must reboot your system for thesechanges to become active. As a rule, this is something that should be configured toeffectively operate with Solaris software.

● As superuser, set the local-mac-address\? OBP property to true:

Alternatively, this can be set at the Open Boot Prom level:

# eeprom local-mac-address\?=true

ok setenv local-mac-address? true



Configuring the Network Host FilesAfter installing the driver software, you must create a hostname.ceinstance file forthe network interface. You must also create both an IP address and a host name forthat interface in the /etc/hosts file. For the remaining test, we will assume the ceinterface as an example interface for this description.

▼ To Configure the Network Host files

1. At the command line, use the grep command to search the /etc/path_to_instfile for ce interfaces.

In the previous example, the device instance is from a Sun GigaSwift Ethernetadapter installed in slot 1. For clarity, the instance number is in bold italics.

Be sure to write down your device path and instance, which in the example is“/pci@1f,0/pci@1/network@4 0”. While your device path and instance mightbe different, they will be similar. You will need this information to make changes tothe ce.conf file. See “Setting Network Driver Parameters Using the ndd Utility”on page 238.

2. Use the ifconfig command to set up the adapter’s ce interface.

3. Use the ifconfig command to assign an IP address to the network interface.Type the following at the command line, replacing ip_address with the adapter’sIP address:

Refer to the ifconfig(1M) man page and the Solaris documentation for moreinformation.

If you want a setup that will remain the same after you reboot, create an/etc/hostname.ceinstance file, where instance corresponds to the instance numberof the ce interface you plan to use.

To use the adapter’s ce interface in the Step 1 example, create an/etc/hostname.ce0 file where 0 is the instance number of the ce interface. If theinstance number were 1, the filename would be /etc/hostname.ce1.

Do not create an /etc/hostname.ceinstance file for a Sun GigaSwift Ethernetadapter interface you plan to leave unused.

■ The /etc/hostname.ceinstance file must contain the host name for theappropriate ce interface.

# grep ce /etc/path_to_inst"/pci@1f,4000/network@4" 0 "ce"

# ifconfig ce0 plumb ip_address up



■ The host name should have an IP address and should be listed in the/etc/hosts file.

■ The host name should be different from any other host name of any otherinterface; for example: /etc/hostname.ce0 and /etc/hostname.ce1cannot share the same hostname.

The following example shows the /etc/hostname.ceinstance file required for asystem called zardoz that has a Sun GigaSwift Ethernet adapter (zardoz-11).

4. Create an appropriate entry in the /etc/hosts file for each active ce interface.

For example:

Setting Up a GigaSwift Ethernet Network on aDiskless Client SystemBefore you can boot and operate a diskless client system across a network, you mustfirst install the network device driver software packages into the root directory ofthe diskless client.

▼ To Set Up a Network Port on a Diskless Client

1. Locate the root directory of the diskless client on the host server.

The root directory of the diskless client system is commonly installed in the hostserver’s /export/root/client_name directory, where client_name is the disklessclient’s host name. In this procedure, the root directory will be:

# cat /etc/hostname.hme0zardoz# cat /etc/hostname.ce0zardoz-11

# cat /etc/hosts## Internet host table#127.0.0.1 localhost129.144.10.57 zardoz loghost129.144.11.83 zardoz-11

/export/root/client_name



2. Use the pkgadd -R command to install the network device driver softwarepackages to the diskless client’s root directory on the server.

3. Create a hostname.ceinstance file in the diskless client’s root directory.

You will need to create an/export/root/client_name/etc/hostname.deviceinstance file for the networkinterface. See “Configuring the Network Host Files” on page 234 for instructions.

4. Edit the hosts file in the diskless client’s root directory.

You will need to edit the /export/root/client_name/etc/hosts file to include theIP address of the Network interface. See “Configuring the Network Host Files” onpage 234 for instructions.

■ Be sure to set the MAC address on the server side and rebuild the device tree ifyou want to boot from the GigaSwift Ethernet port.

5. To boot the diskless client from the Network interface port, type the followingboot command:

Installing the Solaris Operating SystemOver a NetworkThe following procedure assumes that you have created an install server, whichcontains the image of the Solaris CD, and that you have set up the client system tobe installed over the network.

Before you can install the Solaris operating system on a client system with a givennetwork interface, you must first add the driver software packages to the installserver. These software packages are generally available on the driver installation CD.

# pkgadd -R root_directory/Solaris_2.7/Tools/Boot -d . SUNWced

ok boot path-to-device:link-param, -v



▼ To Install the Solaris Software Over a GigaSwiftEthernet Network

1. Prepare the install server and client system to install the Solaris operating systemover the network.

2. Find the root directory of the client system.

The client system’s root directory can be found in the install server’s/etc/bootparams file. Use the grep command to search this file for the rootdirectory.

In the previous example, the root directory for the Solaris 7 client is /netinstall.In Step 4, you would replace root_directory with /netinstall.

3. Use the pkgadd -R command to install the network device driver softwarepackages to the diskless client’s root directory on the server.

4. Shut down and halt the client system.

5. At the ok prompt, boot the client system using the full device path of thenetwork device.

6. Proceed with the Solaris operating system installation.

7. After installing the Solaris operating system, install the network interfacesoftware on the client system.

This step is required because the software installed in Step 2 was required to bootthe client system over the network interface. Often network interface cards are not abundled option with Solaris. Therefore, after installation is complete, you will needto install the software in order for the operating system to use the client’s networkinterfaces in normal operation.

8. Confirm that the network host files have been configured correctly during theSolaris installation.

Although the Solaris software installation creates the client’s network configurationfiles, you might need to edit these files to match your specific networkingenvironment. See “Configuring the Network Host Files” on page 234 for moreinformation about editing these files.

# grep client_name /etc/bootparamsclient_name root=server_name:/netinstall/Solaris_2.7/Tools/Bootinstall=server_name:/netinstall boottype=:in rootopts=:rsize=32768

# pkgadd -R root_directory/Solaris_2.7/Tools/Boot -d . SUNWced



Configuring Driver ParametersThis section describes how to configure the driver parameters used by thenetworking adapter. This section contains the following topics:

■ Setting networking driver parameters using the ndd utility■ Reboot persistence with driver.conf

Setting Network Driver Parameters Using the nddUtilityMany of the network drivers allow you to configure device driver parametersdynamically while the system is running using the ndd utility. Once configuredusing ndd, those parameters are only valid until you reboot the system—hence therequirement for boot persistence, which is provided by driver.conf.

The following sections describe how you can use the ndd utility to modify (with the-set option) or display (with the -get option) the parameters for a network driverand individual devices.

▼ To Specify Device Instances for the ndd Utility

There are two ways to specify the ndd command line based on the type ofnetworking driver style. The difference in style can be established by looking at the/dev directory for the driver node.

■ Style 1 drivers have a /dev/name instance symbolic link to a physical networkdevice instance.

■ Style 2 drivers have a /dev/name symbolic link to a physical network deviceinstance.

Once the style is established, the way you use the ndd command has to be adjusted,as the way of getting exclusive access to the device instance with ndd is differentbased on the style.

1. Determine the style of driver you’re using.



a. If there exists a Style 1 node /dev/name instance, then you can use the Style 1command form.

b. If there exists a Style 2 node /dev/name, then you cannot use the Style 1 form.You must use the Style 2 form. This requires an initial step, which is to set theconfiguration context.

2. Once you are pointing to the correct instance, you can alter as many parameters asrequired for that instance.

In all networking drivers, the instance number is allocated at the time ofenumeration once the adapter is installed. The instance number is recordedpermanently in the /etc/path_to_inst file. Take note of the instance numbers in/etc/path_to_inst so you can configure the instance using ndd.

The instance association is shown in bold italics and can be used in both theconfiguration styles.

The preceding examples show the ndd utility being used in the non-interactivemode. In that mode, only one parameter can be modified per command line. There isalso an interactive mode that allows you to enter an ndd shell where you can read orissue writes to ndd parameters associated with a device.

# ndd /dev/bge0 -get adv_autoneg_cap 1# ndd /dev/bge0 -set adv_autoneg_cap 0# ndd /dev/bge0 -get adv_autoneg_cap 0

# ndd /dev/hme -set instance 0

# ndd /dev/hme -set instance 0

# grep ce /etc/path_to_inst"/pci@1f,2000/pci@1/network@0" 2 "ce""/pci@1f,2000/pci@2/network@0" 1 "ce""/pci@1f,2000/pci@4/network@0" 0 "ce"



Using the ndd Utility in Non-interactive ModeIn the non-interactive mode, the command line allows you to read (get) the currentsetting of the parameter or write (set) a new setting to a parameter.

This mode assumes that you remember all the parameter options of the networkinterface.

Using the ndd Utility in Interactive ModeThe ndd utility offers an interactive mode that allows you to query the parametersthat you can read or write. The interactive mode is entered by simply pointing thendd at a particular device instance or (in the case of Style 2) a driver instance.

If ndd is pointed to a device node that can only be a Style 1 device, then ndd isalready pointing to a device instance.

If the device node can be a Style 2 device, then ndd is pointing to a driver and notnecessarily a device instance. Therefore, you must always first set the instancevariable to ensure that ndd is pointing to the right device instance beforeconfiguration begins.

# ndd device node -[get/set] parameter [value]

# ndd device node

# ndd /dev/bge0

# ndd /dev/cename to get/set? instancevalue ? 0



A very useful feature of ndd is the ? query, which you can use to get a list ofpossible parameters that a particular driver supports.

Once you have set the desired parameters, they will persist until reboot. To havethem persist through a reboot, you must set those parameters in the driver.conffile.

# ndd /dev/cename to get/set ? ?? (read only)instance (read and write)adv_autoneg_cap (read and write)adv_1000fdx_cap (read and write)adv_1000hdx_cap (read and write)adv_100T4_cap (read and write)adv_100fdx_cap (read and write)adv_100hdx_cap (read and write)adv_10fdx_cap (read and write)adv_10hdx_cap (read and write)adv_asmpause_cap (read and write)adv_pause_cap (read and write)master_cfg_enable (read and write)master_cfg_value (read and write)use_int_xcvr (read and write)enable_ipg0 (read and write)ipg0 (read and write)ipg1 (read and write)ipg2 (read and write)rx_intr_pkts (read and write)rx_intr_time (read and write)red_dv4to6k (read and write)red_dv6to8k (read and write)red_dv8to10k (read and write)red_dv10to12k (read and write)tx_dma_weight (read and write)rx_dma_weight (read and write)infinite_burst (read and write)disable_64bit (read and write)name to get/set ?#



Reboot Persistence Using driver.conf

Reboot persistence is often required to avoid repeatedly invoking ndd to adjustdevice parameters after reboot. There are two common ways to specify driverparameters in the driver.conf:

■ Global driver.conf parameters■ Per instance driver.conf parameters

In both cases, the driver.conf file resides in the same directory as the devicedriver. For example, ge resides in /kernel/drv. Therefore, the ge.conf file alsoresides in /kernel/drv. Note that even when a system is booted in 64-bit mode,the driver.conf file is still located in the same directory as the 32-bit driver.

Global driver.conf Parameters

Global driver.conf parameters apply to all instances of the driver and to alldevices. Parameters specified alone will apply globally. For example, you can disableauto-negotiation for all ce devices in the system using a global property in thece.conf file.

There are other older examples of global driver.conf parameters, which are adesign choice of the driver developer. Those configuration parameters embodyinformation about the device name and instance in the property.

A more common method is to take advantage of the driver.conf framework toidentify an instance that is unique for you.

ce.confadv_autoneg_cap = 0;

trp.conftrp0_ring_speed = 4;



Per-Instance driver.conf Parameters

This method requires you to identify a unique device instance. This method usesthree device driver properties associated with each device in the system: the name,parent, and unit-address. Once these parameters are established, everyproperty-value pair is a parameter that is familiar to the driver. The followingexample illustrates disabling auto-negotiation on an hme instance.

The name is simply the driver name. In the previous example, hme is the name. Theparent and unit address are found using the /etc/path_to_inst file.

It is assumed that when you write a driver.conf file and apply instanceproperties, you know the instance to which you are applying the parameters. Theinstance becomes the key to finding the parent and unit address from the/etc/path_to_inst file.

In the example above, the instance number being configured is 1. The instancenumbers in the example are shown in bold italics to guide the discussion.

Taking the second line as being the line associated with hme1, you can begin toextract the information required to write the driver.conf file.

The unit-address and parent are part of the leaf node information, which is thefirst string in quotes.

The leaf node can be thought of as a file in a directory structure, so you can addressit relative to root or relative to a parent. If it is relative to a parent, then the leafnode breaks down to the string to the right of the last ‘/’ and the string remaining tothe left of the ‘/’ is the parent.

hme.confname = "hme" parent = "pci@if,2000" unit-address = "0"adv_autoneg_cap = 0;

# grep ce /etc/path_to_inst"/pci@1f,2000/network@2" 2 "hme""/pci@1f,2000/network@1" 1 "hme""/pci@1f,2000/network@0" 0 "hme"

"/pci@1f,2000/network@1" 1 "hme"

leaf node instance name



Therefore, in the above example the parent = “/pci@1f,2000”, the unit address isthe number or byte sequence to the right of the ‘@’ in the remaining leaf node, andthe unit address = “1”.

The resulting driver.conf file to disable auto-negotiation for instance 1 is asfollows:

Using /etc/system to Tune ParametersSolaris software provides a system-wide file for tuning system parameters. In thissection, we will look only at how to set tuning parameters in the NIC adapters thathave been described. Exercise great care when using this file because direct access isgiven to global variables of the drivers. You cannot assume that drivers willcompensate for exceeding minimum and maximum values set here.

Each parameter is added to the end of the file and is made of the set commandfollowed by the driver module name, a colon (:), and then the parameter and itsvalue. The following example illustrates this.

Once this file has been modified, the system must be rebooted for the changes totake effect.

hme.confname = "hme" parent = "/pci@1f,2000" \unit-address = "1" adv_autoneg_cap = 0;

"/pci@1f,2000/network@1"

parent leaf node

/etc/systemset ge:ge_dmaburst_mod = 1;

Driver Module nameParameter

Value



Network Interface Card GeneralStatisticsAll the network interface cards described in this chapter export a collection of kernelstatistics information. In the previous sections describing individual networkinterfaces, we described kernel statistics that are unique to each interface. Thissection describes the kernel statistics common to all interfaces. You can use either thekstat(1M) utility or the netstat(1M) utility to gather statistics about eachinterface.

In many cases, these statistics help establish whether packets are moving properlythrough the interfaces or whether the interface is in a state that will even allowpackets to be communicated properly.

TABLE 5-70 General Network Interface Statistics

kstat name Type Description

ipackets counter The number of packets received by the interface

ipackets64 counter A 64-bit version of ipackets so a larger count canbe kept

rbytes counter The number of bytes received by the interface

rbytes64 counter A 64-bit version of rbytes so a larger count can bekept

multircv counter The number of multicast packets received by theinterface

brdcstrcv counter The number of broadcast packets received by theinterface

unknowns counter The number of packets that are received by aninterface but cannot be classified to any Layer 3 orabove protocol available in the system

ierrors counter The number of receive packet errors that led to apacket being discarded

norcvbuf counter The number receive packets that could not bereceived because the NIC had no buffers available

opackets counter The number of packets transmitted by the interface

opackets64 counter A 64-bit version of opackets so a larger count canbe kept

obytes counter The number of bytes transmitted by the interface



Ethernet Media Independent InterfaceKernel StatisticsThe Ethernet Media Independent Interface (MII) Kernel Statistics help keep the stateof the link. They are very useful statistics for Ethernet interfaces, as they can be usedfor troubleshooting physical layer problems with a network connection. Thesestatistics cover MII and GMII. Note that in some cases with the Fast Ethernet orFiber-only devices some of the statistics might not apply.

obytes64 counter A 64-bit version of obytes so a larger count can bekept

multixmt counter The number of multicast packets transmitted by theinterface

brdcstxmt counter The number of broadcast packets transmitted by theinterface

oerrors counter The number of packets that encountered an error ontransmission, causing the packets to be dropped

noxmtbuf counter The number of transmit packets that were stalled fortransmission because the NIC had no buffersavailable

collisions counter The number of collisions encountered whiletransmitting packets

ifspeed state The current speed of the network connection inmegabits per second

mac_mtu state The current MTU allowed by the driver, includingthe Ethernet header and 4-byte CRC

TABLE 5-71 General Network Interface Statistics


xcvr_addr state Provides the MII address of the transceiver currently inuse.

xcvr_id state Provides the specific Vendor/Device ID of thetransceiver currently in use.

xcvr_inuse state Indicates the type of transceiver currently in use.

TABLE 5-70 General Network Interface Statistics (Continued)




cap_1000fdx state Indicates the device is 1 Gbit/sec full-duplex capable.

cap_1000hdx state Indicates the device is 1 Gbit/sec half-duplex capable.

cap_100fdx state Indicates the device is 100 Mbit/sec full-duplexcapable.

cap_100hdx state Indicates the device is 100 Mbit/sec half-duplexcapable.

cap_10fdx state Indicates the device is 10 Mbit/sec full-duplex capable.

cap_10hdx state Indicates the device is 10 Mbit/sec full-duplex capable.

cap_asmpause state Indicates the device is capable of asymmetric pauseEthernet flow control.

cap_pause state Indicates the device is capable of symmetric pauseEthernet flow control when set to 1 andcap_asmpause is 0.If cap_asmpause = 1 while cap_pause = 0, transmitpauses based on receive congestion.cap_pause = 1, receive pauses and slows downtransmit to avoid congestion.

cap_rem_fault state Indicates the device is capable of remote faultindication.

cap_autoneg state Indicates the device is capable of auto-negotiation.

adv_cap_1000fdx state Indicates the device is advertising 1 Gbit/sec Fullduplex capability.

adv_cap_1000hdx state Indicates the device is advertising 1 Gbit/sec Halfduplex capability.

adv_cap_100fdx state Indicates the device is advertising 100M bits/s Fullduplex capability.

adv_cap_100hdx state Indicates the device is advertising 100 Mbit/sec half-duplex capability.

adv_cap_10fdx state Indicates the device is advertising 10 Mbit/sec full-duplex capability.

adv_cap_10hdx state Indicates the device is advertising 10 Mbit/sec full-duplex capability.

adv_cap_asmpause state Indicates the device is advertising the capability ofasymmetric pause Ethernet flow control.





adv_cap_pause state Indicates the device is advertising the capability ofsymmetric pause Ethernet flow control whenadv_cap_pause = 1 and adv_cap_asmpause = 0.If adv_cap_asmpause = 1 while adv_cap_pause = 0,transmit pauses based on receive congestion.If adv_cap_pause = 1, receive pauses and slowsdown transmit to avoid congestion.

adv_cap_rem_fault state Indicates the device is experiencing a fault that it isgoing to forward to the link partner.

adv_cap_autoneg state Indicates the device is advertising the capability ofauto-negotiation.

lp_cap_1000fdx state Indicates the link partner device is 1 Gbit/sec full-duplex capable.

lp_cap_1000hdx state Indicates the link partner device is 1 Gbit/sec half-duplex capable.

lp_cap_100fdx state Indicates the link partner device is 100 Mbit/sec full-duplex capable.

lp_cap_100hdx state Indicates the link partner device is 100 Mbit/sec half-duplex capable.

lp_cap_10fdx state Indicates the link partner device is 10 Mbit/sec full-duplex capable.

lp_cap_10hdx state Indicates the link partner device is 10 Mbit/sec half-duplex capable.

lp_cap_asmpause state Indicates the device is advertising the capability ofasymmetric pause Ethernet flow control.

lp_cap_pause state Indicates the link partner device is capable ofsymmetric pause Ethernet flow control when set to 1and lp_cap_asmpause is 0.If lp_cap_asmpause = 1 while lp_cap_pause = 0,transmit pauses based on receive congestion.If lp_cap_pause = 1, receive pauses and slows downtransmit to avoid congestion.

lp_cap_rem_fault state Indicates the link partner is experiencing a fault withthe link.

lp_cap_autoneg state Indicates the link partner device is capable of auto-negotiation.





Maximizing the Performance of anEthernet NIC InterfaceThere are many ways to maximize the performance of your Ethernet NIC interface,and there are a few tools that are valuable in achieving that. The ndd parameters andkernel statistics provide a means to get the best out of your NIC. But there are someother tools for looking at the system behavior and establishing if more tuning can beachieved to better utilize the system as well as the NIC.

link_asmpause state Indicates the shared link asymmetric pause setting thevalue is based on local resolution column of Table 37-4IEEE 802.3 spec.link_asmpause = 0 Link is symmetric Pauselink_asmpause = 1 Link is asymmetric Pause

link_pause state Indicates the shared link pause setting. The value isbased on local resolution shown above.If link_asmpause = 0 while link_pause = 0,the link has no flow control.If link_pause = 1, link can flow control in bothdirections.If link_asmpause = 1 while link_pause = 0,local flow control setting can limit link partner.If link_pause = 1, link will flow control local Tx.

link_speed state The current speed of the network connection inmegabits per second.

link_duplex state Indicates the link duplex.link_duplex = 0, indicates link is down and duplexwill be unknown.link_duplex = 1, indicates link is up and in halfduplex mode.link_duplex = 2, indicates link is up and in fullduplex mode.

link_up state Indicates whether the link is up or down.link_up = 1, indicates link is up.link_up = 0, indicates link is down.





The starting point for this discussion is the physical layer because that layer is themost important with respect to creating the link between two systems. At thephysical layer, failures can prevent the link from coming up. Or worse, the linkcomes up and the duplex is mismatched, giving rise to less-visible problems. Thenthe discussion will move to the data link layer, where most problems areperformance related. During that discussion, the architecture features describedabove can be used to address many of these performance problems.

Ethernet Physical Layer TroubleshootingThe possibility of problems at the physical layer is huge. The problems range fromno cable being present to duplex mismatch. The key tool for looking at the physicallayer is the kstat command. See the kstat man page.

The first step in checking the physical layer is to check if the link is up.

If the link_up variable is set, then things are positive, and a physical connection ispresent. But also check that the speed matches your expectation. For example, if theinterface is 1000BASE-Tx interface and you expect it to run at 1000 Mbit/sec, thenthe link_speed parameter shown should indicate 1000. If this is not the case, thena check of the link partner capabilities might be required to establish if they are thelimiting factor. The following kstat command line will show output similar to thefollowing:

kstat ce:0 | grep link_link_asmpause 0link_duplex 2link_pause 0link_speed 1000link_up 1

kstat ce:0 | grep lp_caplp_cap_1000fdx 1lp_cap_1000hdx 1lp_cap_100T4 1lp_cap_100fdx 1lp_cap_100hdx 1lp_cap_10fdx 1lp_cap_10hdx 1lp_cap_asmpause 0lp_cap_autoneg 1lp_cap_pause 0



If the link partner appears to be capable of all the desired speed, then the problemmight be local. There are two possibilities: The NIC itself is not capable of thedesired speed. Or the configuration has no shared capabilities that can be agreedon— hence the link will not come up. You can check this using the following kstatcommand line.

If all the required capabilities are available for the desired speed and duplex, yetthere remains a problem with achieving the desired speed, the only remainingpossibility is an incorrect configuration. You can check this by looking at individualndd adv_cap_* parameters or you can use the kstat command:

Configuration issues are where most problems lie. All the issues of configuration canbe addressed using the kstat command above to establish the local and remoteconfiguration, and adjusting the adv_cap_* parameters using ndd to correct theproblem.

kstat ce:0 | grep cap_cap_1000fdx 1cap_1000hdx 1cap_100T4 1cap_100fdx 1cap_100hdx 1cap_10fdx 1cap_10hdx 1cap_asmpause 0cap_autoneg 1cap_pause 0.....

kstat ce:0 | grep adv_cap_adv_cap_1000fdx 1adv_cap_1000hdx 1adv_cap_100T4 1adv_cap_100fdx 1adv_cap_100hdx 1adv_cap_10fdx 1adv_cap_10hdx 1adv_cap_asmpause 0adv_cap_autoneg 1adv_cap_pause 0



The most common configuration problem is duplex mismatch, which is inducedwhen one side of a link is enabled for auto-negotiation and the other is not. This isknown as Forced mode and can only be guaranteed for 10/100 Mode operation. For1000BASE-T UTP Mode operation, the Forced mode (auto-negotiation disabled)capability is not guaranteed because not all vendors support it.

If Auto-negotiation is turned off, you must ensure that both ends of the connectionare also in Forced mode, and that the speed and duplex are matched perfectly.

If you fail to match Forced mode in gigabit operation, the impact will be that the linkwill not come up at all. Note that this result is quite different from the 10/100 Modecase. While in 10/100 Mode operation, if only one end of the connection is auto-negotiating (with full capabilities advertised) the link will come up with the correctspeed, but the duplex will always be set to half duplex (creating the potential for aduplex mismatch if the forced end is set to full duplex).

If both sides are set to Forced mode and you fail to match speeds, the link will nevercome up.

If both sides are set to forced mode and you fail to match duplex, the link will comeup, but you will have a duplex mismatch.

Duplex mismatch is a silent failure that manifests itself from an upper layer point ofview as really poor performance as many of the packets get lost because of collisionsand late collisions occurring on the half-duplex end of the connection due toviolations of Ethernet protocol induced by the full-duplex end.

The half-duplex end experiences collisions and late collisions while the full-duplexend experiences a whole manner of smashed packets, leading to MIB countersmeasuring, crc, runts, giants, alignment errors all being incremented.

If the node experiencing poor performance is the half duplex end of the connection,you can look at the kstat values for collisions and late_collisions.

If the node experiencing poor performance is the full duplex end of the connection,you can look at the packet corruption counters, for example, crc_err,alignment_err.

kstat ce:0 | grep collisionscollisions 22332late_collisions 15432

kstat ce:0 | grep crc_errcrc_err 22332

kstat ce:0 | grep alignment_erralignment_err 224532



Depending on the capability of the switch end or remote end of the connection, itmay be possible to do similar measurements there.

Forced mode while having the problem of creating a potential duplex mismatch alsohas the drawback of isolating the link partner capabilities from the local station. InForced mode, you cannot view the lp_cap* values and determine the capabilities ofthe remote link partner locally.

Where possible, use the default of Auto-negotiation with all capabilities advertisedand avoid tuning the physical link parameters.

Given the maturity of the Auto-negotiation protocol and its requirement in the802.3z specification for one gigabit UTP Physical implementations, ensure that Auto-negotiation to enabled.

Deviation from General Ethernet MII/GMIIConventionsWe must address some remaining deviation from the general Ethernet MII/GMIIkernel statistics.

In the case of the ge interface, all of the statistics for getting local capabilities andlink partner capabilities are read-only ndd properties, so they cannot be read usingthe kstat command, as described previously, although the debug mechanism is stillvalid.

To read the corresponding lp_cap_* using ge, use the following commands:

Or you could use the interactive mode, described previously. The mechanism usedfor enabling Ethernet Flow control on the ge interface is also different, using theparameters in the table below.

hostname# ndd -set /dev/ge instance 0hostname# ndd -get /dev/ge lp_1000fdx_cap

TABLE 34 Physical Layer Configuration Properties

Statistic Values Description

adv_pauseTX 0-1 Transmit Pause if the Rx buffer is full.

adv_pauseRX 0-1 When you receive a pause slow down Tx.



There’s also a deviation in ge for adjusting ndd parameters. For example, whenmodifying ndd parameters like adv_1000fdx_cap, the changes will not take effectuntil the adv_autoneg_cap parameter is toggled to change state (from 0-1 or from1-0). This is a deviation from the General Ethernet MII/GMII convention for the“take affect immediately rule” of ndd.

Ethernet Performance TroubleshootingEthernet performance troubleshooting is device specific because not all devices havethe same architecture capabilities. Therefore, the discussion of troubleshootingperformance issues will have to be tackled on a per-device basis.

The following Solaris™ tools aid in the analysis of performance issues:

■ kstat to view device-specific statistics■ mpstat to view system utilization information■ lockstat to show areas of contention

You can use the information from these tools to tune specific parameters. The tuningexamples that follow describe where this information is most useful.

You have two options for tuning: using the /etc/system file or the ndd utility.

Using the /etc/system file to modify the initial value of the driver variablesrequires a system reboot for the to take effect.

If you use the ndd utility for tuning, the changes take effect immediately. However,any modifications you make using the ndd utility will be lost when the system goesdown. If you want the ndd tuning properties to persist through a reboot, add theseproperties to the respective driver.conf file.

Parameters that have kernel statistics but have no capability to tune forimprovement are omitted from this discussion because no troubleshooting capabilityis provided in those cases.



ge Gigabit Ethernet

The ge interface provides some kstats that can be used to measure theperformance bottlenecks in the driver in the Tx or the Rx. The kstats allow you todecide what corrective tuning can be applied based on the tuning parameterspreviously described. The useful statistics are shown in TABLE 5-72.

When rx_overflow is incrementing, packet processing is not keeping up with thepacket arrival rate. If rx_overflow is incrementing and no_free_rx_desc is not,this indicates that the PCI bus or SBus bus is presenting an issue to the flow ofpackets through the device. This could be because the ge card is plugged into aslower I/O bus. You can confirm the bus speed by looking at the pci_bus_speedstatistic. An SBus bus speed of 40 MHz or a PCI bus speed of 33 MHz might not besufficient to sustain full bidirectional one-gigabit Ethernet traffic.

Another scenario that can lead to rx_overflow incrementing on its own is sharingthe I/O bus with another device that has similar bandwidth requirements to those ofthe ge card.

These scenarios are hardware limitations. There is no solution for SBus. For PCI bus,a first step in addressing them is to enable infinite burst capability on the PCI bus.You can achieve that by using the /etc/system tuning parameterge_dmaburst_mode.

Alternatively, you can reorganize the system to give the ge interface a 66-MHz PCIslot, or you can separate devices that contend for a shared bus segment by givingeach of them a bus segment.

TABLE 5-72 List of ge Specific Interface Statistics


rx_overflow counter Number of times the hardware is unable to receive apacket due to the Internal FIFO’s being full.

no_free_rx_desc counter Number of times the hardware is unable to post apacket because there are no more Rx Descriptorsavailable.

no_tmds counter Number of times transmit packets are posted on thedriver streams queue for processing some time later,the queue’s service routine.

nocanput counter Number of times a packet is simply dropped by thedriver because the module above the driver cannotaccept the packet.

pci_bus_speed value The PCI bus speed that is driving the card.



The probability that rx_overflow incrementing is the only problem is small.Typically, Sun systems have a fast PCI bus, and memory subsystem, so delays areseldom induced at that level. It is more likely is that the protocol stack softwaremight fall behind and lead to the Rx descriptor ring being exhausted of free elementswith which to receive more packets. If this happens, then the kstat parameterno_free_rx_desc will begin to increment, meaning the CPU cannot absorb theincoming packet in the case of a single CPU. If more than one CPU is available, it isstill possible to overwhelm a single CPU. But given that the Rx processing can besplit using the alternative Rx data delivery models provided by ge, it might bepossible to distribute the processing of incoming packets to more than one CPU. Youcan do this by first ensuring that ge_intr_mode is not set to 1. Also be sure to tunege_put_cfg to enable the load-balancing worker thread or streams service routine.

Another possible scenario is where the ge device is adequately handling the rate ofincoming packets, but the upper layer is unable to deal with the packets at that rate.In this case, the kstat nocanputs parameter will be incrementing. The tuning thatcan be applied to this condition is available in the upper layer protocols. If you'rerunning the Solaris 8 operating system or an earlier version, then upgrading to theSolaris 9 version will help your application experience fewer nocanputs. Theupgrade might reduce nocanput errors due to improved multithreading and IPscalability performance improvements in the Solaris 9 operating system.

While the Tx side is also subject to an overwhelmed condition, this is less likely thanany Rx-side condition. If the Tx side is overwhelmed, it will be visible when theno_tmds parameter begins to increment. If the Tx descriptor ring size can beincreased, the /etc/system tunable parameter ge_nos_tmd provides thatcapability.

ce Gigabit Ethernet

The ce interface provides a far more extensive list of kstats that can be used tomeasure the performance bottlenecks in the driver in the Tx or the Rx. The kstatsallow you to decide what corrective tuning can be applied based on the tuningparameters described previously. The useful statistics are shown in TABLE 5-73.

TABLE 5-73 List of ce Specific Interface Statistics


rx_ov_flow counter Number of times the hardware is unable to receive apacket due to the Internal FIFO’s being full.

rx_no_buf counter Number of times the hardware is unable to receive apacket due to Rx buffers being unavailable.

rx_no_comp_wb counter Number of times the hardware is unable to receive apacket due to no space in the completion ring to postReceived packet descriptor.



When rx_ov_flow is incrementing, it indicates that packet processing is notkeeping up with the packet arrival rate. If rx_ov_flow is incrementing whilerx_no_buf or rx_no_comp_wb is not, this indicates that the PCI bus is presentingan issue to the flow of packets through the device. This could be because ce card is

ipackets_cpuXX counter Number of packets being directed to load balancingthread XX.

mdt_pkts counter Number of packets sent using Multidata interface.

rx_hdr_pkts counter Number of packets arriving that are less than 252bytes in length.

rx_mtu_pkts counter Number of packets arriving that are greater than 252bytes in length.

rx_jumbo_pkts counter Number of packets arriving that are greater than1522 bytes in length.

rx_ov_flow counter Number of times a packet is simply dropped by thedriver because the module above the driver cannotaccept the packet.

rx_pkts_dropped counter Number of packets dropped due to Service FifoQueue being full.

tx_hdr_pkts counter Number of packets hitting the small packettransmission method. Packets are copied into a pre-mapped DMA buffer.

tx_ddi_pkts counter Number of packets hitting the mid-range DDI DMAtransmission method.

tx_dvma_pkts counter Number of packets hitting the top range DVMA fastpath DMA Transmission method.

tx_jumbo_pkts counter Number of packets being sent that are greater than1522 bytes in length.

tx_max_pend counter Measure of the maximum number of packets everqueued on a Tx ring.

tx_no_desc counter Number of times a packet transmit was attemptedand Tx Descriptor Elements were not available. Thepacket is postponed until later.

tx_queueX counter Number of packets transmitted on a particularqueue.

mac_mtu value The maximum packet size allowed past the MAC.

pci_bus_speed value The PCI bus speed that is driving the card.

TABLE 5-73 List of ce Specific Interface Statistics (Continued)




plugged into a slower PCI bus. This can be established by looking at thepci_bus_speed statistic. A bus speed of 33 MHz might not be sufficient to sustainfull bidirectional one gigabit Ethernet traffic.

Another scenario that can lead to rx_ov_flow incrementing on its own is sharingthe PCI bus with another device that has bandwidth requirements similar to those ofthe ce card.

These scenarios are hardware limitations. A first step in addressing them is to enablethe infinite burst capability on the PCI bus. Use the ndd tuning parameterinfinite-burst to achieve that.

Infinite burst will help give ce more bandwidth, but the Tx and Rx of the ce devicewill still be competing for that PCI bandwidth. Therefore, if the traffic profile showsa bias toward Rx traffic and this condition is leading to rx_ov_flow, you can adjustthe bias of PCI transactions in favor of the Rx DMA channel relative to the Tx DMAchannel, using ndd parameters rx-dma-weight and tx-dma-weight

Alternatively, you can reorganize the system by giving the ce interface a 66-MHzPCI slot, or you can separate devices that contend for a shared bus segment bygiving each of them a bus segment.

If this doesn’t contribute much to reducing the problem, then you should considerusing Random Early Detection (RED) to ensure that the impact of dropping packetsis minimized with respect to keeping connections alive that normally would beterminated due to regular overflow. The following parameters that allow enablingRED are configurable using ndd: red-dv4to6k, red-dv6to8k, red-dv8to10k,and red-dv10to12k.

The probability that rx_overflow incrementing is the only problem is small.Typically Sun systems have a fast PCI bus and memory subsystem, so delays areseldom induced at that level. It is more likely that the protocol stack software mightfall behind and lead to the Rx buffers or completion descriptor ring being exhaustedof free elements with which to receive more packets. If this happens, then thekstats parameters rx_no_buf and rx_no_comp_wb will begin to increment. Thiscan mean that there’s not enough CPU power to absorb the packets, but it can alsobe due to a bad balance of the buffer ring size versus the completion ring size, leading

to the rx_no_comp_wb incrementing without the rx_no_buf incrementing. The default

configuration is one buffer to four completion elements. This works great provided that the

packets arriving are larger than 256 bytes. If they are not and that traffic dominates, then 32

packets will be packed into a buffer leading to a greater probability that configuration

imbalance will occur. For that case, more completion elements need to be made available.

This can be addressed using the /etc/system tunables ce_ring_size to adjust the

number of available Rx buffers and ce_comp_ring_size to adjust the number of Rx packet

completion elements. To understand the traffic profile of the Rx so you can tune these

parameters, use kstat to look at the distribution of Rx packets across the rx_hdr_pkts and

rx_mtu_pkts.



If ce is being run on a single CPU system and rx_no_buf and rx_no_comp_wb are

incrementing, then you will have to resort again to RED or enable Ethernet flow control.

If more than one CPU is available, it is still possible to overwhelm a single CPU.Given that the Rx processing can be split using the alternative Rx data deliverymodels provided by ce, it might be possible to distribute the processing ofincoming packets to more than one CPU, described earlier as Rx load balancing. Thiswill happen by default if the system has four or more CPUs, and it will enable fourload-balancing worker threads. The threshold of CPUs in the system and the numberof load-balancing worker threads enabled can be managed using the /etc/systemtunables ce_cpu_threshold and ce_inst_taskqs.

The number of load balancing worker threads and how evenly the Rx load is beingdistributed to each worker thread can be viewed with the ipacket_cpuxx kstats.The highest number of xx tells you how many load balancing worker threads arerunning while the value of these parameters gives you the spread of the work acrossthe instantiated load balancing worker threads. This, in turn, gives an indication ifthe load balancing is yielding a benefit. For example, if all ipacket_cpuxx kstatshave an approximately even number of packets counted on each, then the loadbalancing is optimal. On the other hand, if only one is incrementing and the othersare not, then the benefit of Rx load balancing is nullified.

It is also possible to measure whether the system is experiencing a even spread ofCPU activity using mpstat. In the ideal case, if you experience good load balancingas shown in the kstats ipackets_cpuxx, it should also be visible in mpstat thatthe workload is evenly distributed to multiple CPUs.

If none of this benefit is visible, then disable the load balancing capabilitycompletely, using the /etc/system variable ce_taskq_disable.

The Rx load balancing provides packet queues, also known as service FIFOs,between the interrupt threads that fan out the workload and the service FIFOworker threads that drain the service FIFO and complete the workload. Theseservice FIFOs are of fixed size and are controlled by the /etc/system variablece_srv_fifo_depth. It is possible that the service FIFOs can also overflow anddrop packets as the rate of packet arrival exceeds the rate with which the serviceFIFO draining thread can complete the post processing. These dropped packets canbe measured using the rx_pkts_dropped kstat. If this is measured as occurring,you can increase the size of the service FIFO or you can increase the number ofservice FIFOs, allowing more Rx load balancing. In some cases, it may be possible toeliminate increments in rx_pkts_dropped, but the problem may move torx_nocanputs, which is generally only addressable by tuning that can be appliedby upper layer protocol. If you're running the Solaris 8 operating system or anearlier version, then upgrading to the Solaris 9 version will help your applicationexperience fewer nocanputs. The upgrade might reduce nocanput errors due toimproved multithreading and IP scalability performance improvements in theSolaris 9 operating system.



There is a difficulty is maximizing the Rx load balancing, and it is contingent on theTx ring processing. This is measurable using the lockstat command and will showcontention on the ce_start routine at the top as the most contended driverfunction. This contention cannot be eliminated, but it is possible to employ a new Txmethod known as Transmit serialization, which keeps contention to a minimumwhile forcing the Tx processes on a fixed set of CPUs. Keeping the Tx process on afixed CPU reduces the risk of CPUs spinning waiting for other CPUs to completetheir Tx activity, ensuring CPUs are always kept busy doing useful work. Thistransmission method can be enabled using the /etc/system variablece_start_cfg, setting it to 1. When you enable Transmit serialization, you will betrading off Transmit latency for avoiding mutex spins induced by contention.

The Tx side is also subject to overwhelmed condition, although this is less likelythan any Rx side condition. This becomes visible when tx_max_pending valuematches the size of the /etc/system variable ce_tx_ring_size. If this occurs,then you know that packets are being postponed because Tx descriptors are beingexhausted. Therefore the size of the ce_tx_ring_size should be increased.

The tx_hdr_pkts, tx_ddi_pkts, and tx_dvma_pkts are useful for establishingthe traffic profile of an application and matching that with the capabilites of asystem. For example, many small systems have very fast memory access timesmaking the cost of setting up DMA transactions more expensive than transmittingdirectly from a pre-mapped DMA buffer, in which case you can adjust the DMAthresholds programmable via /etc/system to push more packet into thepreprogrammed DMA versus the per packet programming. Once the tuning iscomplete, these statistics can be viewed again to see if the tuning took effect.

The tx_queueX kstats give a good indication if Tx load balancing matches the Rxside. If no load balancing is visible, meaning all the packets appear to be gettingcounted by only one tx_queue, then it may make sense to switch this feature off.The /etc/system variable that does that is ce_no_tx_lb.

The mac_mtu gives an indication of the maximum size of packet that will make itthrough the ce device. It is useful to know if jumbo frames is enabled at the DLPIlayer below TCP/IP. If jumbo frames is enabled, then the MTU indicated bymac_mtu will be 9216.

This is helpful, as it will show if there’s a mismatch between the DLPI layer MTUand the IP layer MTU, allowing troubleshooting to occur in a layered manner.

Once jumbo frames is successfully configured at the driver layer and the TCP/IPlayer, then you should ensure that jumbo frames packets are being communicatedusing the rx_jumbo_pkts and tx_jumbo_pkts to ensure Transmits and Receivesof jumbo frame packets respectively is happening correctly.



CHAPTER 6

Network Availability DesignStrategies

This chapter provides a survey of availability strategies from a networkingperspective. Keep in mind the required degree of availability during the networkdesign process. Availability has always been an important design goal for networkarchitectures. As enterprise customers increasingly deploy mission-critical Web-based services, they require a deeper understanding of designing optimal networkavailability solutions.

There are several approaches to implementing high-availability network solutions.This chapter provides a high-level survey of possible approaches to increasingnetwork availability and shows possible deployments using actual implementedexamples.

Network Architecture and AvailabilityOne of the first items to consider for network availability is the architecture itself.Network architectures fall into two basic categories: flat and multi-level.

■ A flat architecture is composed of a multi-layer switch that performs multipleswitching functions in one physical network device. This implies that a packetwill traverse fewer network switching devices when communicating from theclient to the server. This results in higher availability.

■ A multi-level architecture is composed of multiple small switches where eachswitch performs one or two switching functions. This implies that a packet willtraverse more network switching devices when communicating from the client tothe server. This results in a lower availability.

261


Serial components reduce availability and parallel components increase availability.A serial design requires that every component is functioning at the same time. If anyone component fails, the entire system fails. A parallel design offers multiple pathsin case one path fails. In a parallel design, if any one component fails, the entiresystem still survives by using the backup path.

Three network architecture aspects impact network availability:

■ Component failure – This aspect is the probability of the device failing. It ismeasured using statistics averaging the amount of time the device works dividedby the average time the device works plus the failed time. This value is called theMTBF. In calculating the MTBF, components that are connected seriallydramatically reduce the MTBF, while components that are in parallel increase theMTBF.

■ System failure – This aspect captures failures that are caused by external factors,such as a technician accidentally pulling out a cable. The number of componentsthat are potential candidates for failure is directly proportional to the complexityof the system. Design B in FIGURE 6-1 has more components that can go wrong,which contributes to the increased probability of failure.

■ Single points of failure – This aspect captures the number of devices that can failand still have the system functioning. Neither Design A nor Design B shown inFIGURE 6-1 has a single point of failure (SPOF), so they are equal in this regard.However, Design B is somewhat more resilient because if a network interface card(NIC) fails, that failure is isolated by the Layer 2 switch and does not impact therest of the architecture. This issue has a trade-off to consider, where availability issacrificed for increased resiliency and isolation of failures.

FIGURE 6-1 shows two network designs. In both designs, Layer 2 switches providephysical connectivity for one virtual local area network (VLAN) domain. Layer 2-7 switches are multilayer devices providing routing, load balancing, and other IPservices in addition to physical connectivity.

Design A shows a flat architecture, often seen with multilayer chassis-basedswitches using Extreme Networks Black Diamond, Foundry Networks BigIron, orCisco switches. The switch can be partitioned into VLANs, isolating traffic fromone segment to another, yet providing a much better overall solution. In thisapproach, the availability will be relatively high because there are two parallelpaths from the ingress to each server and only two serial components that apacket must traverse to reach the target server.

In Design B, the architecture provides the same functionality, but across manysmall switches. From an availability perspective, this solution will have arelatively lower mean time between failures (MTBF) because there are more serialcomponents that a packet must traverse to reach a target server. Otherdisadvantages of this approach include manageability, scalability, andperformance. However, one can argue that there might be increased securityusing this approach, which for some customers outweighs all other factors. InDesign B, multiple switches must be hacked to control the network, whereas inDesign A, only one switch needs to be hacked to bring down the entire network.



FIGURE 6-1 Network Topologies and Impact on Availability

1

2 3

54

6 7 8 9

10 11 16 17

12 13 18 19

14 15

Switch Layer 3

Switch Layer 2-7

Switch lLayer 2

A) Flat architecture, higher MTBF

B) Multilayered architecture, lower MTBF

Inte

grat

ion

laye

rD

istr

ibut

ion

laye

rS

ervi

ce m

odul

es

Web serviceDirectory service

Application serviceDatabase serviceIntegration service

Redundantmultilayer switches

Chapter 6 Network Availability Design Strategies 263


Layer 2 StrategiesThere are several Layer 2 availability design options. Layer 2 availability designs aredesirable because any fault detection and recovery is transparent to the IP layer.Further, the fault detection and recovery can be relatively fast if the correct approachis taken. In this section, we explain the operation and recovery times for threeapproaches:

■ Trunking and variants based on IEEE 802.3ad

■ SMLT and DMLT, a relatively new and promising approach available from NortelNetworks

■ Spanning Tree, a time-tested and proven Layer 2 availability strategy, originallydesigned for bridged networks by the brilliant Dr. Radia Perlman from DEC andnow with Sun Microsystems.

Trunking Approach to AvailabilityLink aggregation or trunking increases availability by distributing network trafficover multiple physical links. If one link breaks, the load on the broken link istransferred to the remaining links.

IEEE 802.3ad is an industry standard created to allow the trunking solutions ofvarious vendors to interoperate. Like most standards, there are many ways toimplement the specifications. Link aggregation can be thought of as a layer ofindirection between the MAC and PHY layers. Instead of having one fixed MACaddress that is bound to a physical port, a logical MAC address is exposed to the IPlayer and implements the Data Link Provider Interface (DLPI). This logical MACaddress can be bound to many physical ports. The remote side must have the samecapabilities and algorithm for distributing packets among the physical ports.FIGURE 6-2 shows a breakdown of the sublayers.



FIGURE 6-2 Trunking Software Architecture

Theory of Operation

The Link Aggregation Control Protocol (LACP) allows both ends of the trunk tocommunicate trunking or link aggregation information. The first command sent isthe Query command, where each link partner discovers the link aggregationcapabilities of the other. If both partners are willing and capable, a Start Groupcommand is sent. The Start Group command indicates that a link aggregationgroup is to be created followed by adding segments to this group that include linkidentifiers tied to the ports participating in the aggregation.

The LACP can also delete a link, which might be due to the detection of a failed link.Instead of balancing the load across the remaining ports, the algorithm simply placesthe failed link’s traffic onto one of the remaining links. The collector reassemblestraffic coming from the different links. The distributor takes an input stream andspreads out the traffic across the ports belonging to a trunk group or linkaggregation group.

Availability Issues

To understand suitability for network availability, Sun Trunking 1.2 software wasinstalled on several quad fast Ethernet cards. The client has four trunks connected tothe switch. The server also has four links connected to the switch. This setup allowsthe load to be distributed across the four links, as shown in FIGURE 6-3.

MACclient

LogicalMAC

PhysicalMAC

LACP

PMAC

IP

LMAC

Framecollector

PHY1 PHY2

Aggregatorparser/Mux

Aggregatorparser/Mux

Framedistributor



FIGURE 6-3 Trunking Failover Test Setup

The highlighted line (in bold italic) in the CODE EXAMPLE 6-1 output shows the trafficfrom the client qfe0 moved to the server qfe1 under load balancing.

CODE EXAMPLE 6-1 Output Showing Traffic from Client qfe0 to Server qfe1

Jan 10 14:22:05 2002

Name Ipkts Ierrs Opkts Oerrs Collis Crc %Ipkts %Opkts

qfe0 210 0 130 0 0 0 100.00 25.00qfe1 0 0 130 0 0 0 0.00 25.00qfe2 0 0 130 0 0 0 0.00 25.00qfe3 0 0 130 0 0 0 0.00 25.00

(Aggregate Throughput(Mb/sec): 5.73(New Peak) 31.51(Past Peak)18.18%(New/Past))

Jan 10 14:22:06 2002


qfe0 0 0 0 0 0 0 0.00 0.00qfe1 0 0 0 0 0 0 0.00 0.00qfe2 0 0 0 0 0 0 0.00 0.00qfe3 0 0 0 0 0 0 0.00 0.00


Jan 10 14:22:07 2002


qfe0 0 0 0 0 0 0 0.00 0.00qfe1 0 0 0 0 0 0 0.00 0.00

qfe0qfe1qfe2qfe3

Switch-trunkingcapable

qfe0qfe1qfe2qfe3

Trunked linkspoint topoint

Trunked linkspoint topoint

Client Server



Several test transmission control protocol (TTCP) streams were pumped from onehost to the other. When all links were up, the load was balanced evenly and eachport experienced a 25 percent load. When one link was cut, the traffic of the failedlink (qfe0) was transferred onto one of the remaining links (qfe1), which thenshowed a 51 percent load.

The failover took three seconds. However, if all links were heavily loaded, thealgorithm might force one link to be saturated with its original link load in additionto the failed link’s traffic. For example, if all links were running at 55 percentcapacity and one link failed, one link would be saturated at 55 percent + 55 percent= 110 percent traffic. Link aggregation is suitable for point-to-point links forincreased availability, where nodes are on the same segment. However, there is atrade-off of port cost on the switch side as well as the host side.

Load-Sharing Principles

The Trunking Layer will break up packets on a frame boundary. This means as longas the server and switch know that a trunk is spanning certain physical ports,neither side needs to know about which algorithm is being used to distribute theload across the trunked ports. What is important is to understand the trafficcharacteristics in order to optimally distribute the load as evenly as possible acrossthe trunked ports. The following diagrams describe how load sharing across trunksshould be configured based on the nature of the traffic, which is often asymmetric.

qfe2 0 0 0 0 0 0 0.00 0.00qfe3 0 0 0 0 0 0 0.00 0.00


Jan 10 14:22:08 2002


qfe0 0 0 0 0 0 0 0.00 0.00qfe1 1028 0 1105 0 0 0 100.00 51.52qfe2 0 0 520 0 0 0 0.00 24.24qfe3 0 0 520 0 0 0 0.00 24.24


CODE EXAMPLE 6-1 Output Showing Traffic from Client qfe0 to Server qfe1 (Continued)

Jan 10 14:22:05 2002



FIGURE 6-4 Correct Trunking Policy on Switch

FIGURE 6-4 shows that a correct trunking policy on a switch with ingress traffic thathas distributed Source IP address and Source MAC address can use a TrunkingPolicy based on round-robin, Source MAC/Destination MAC address, and SourceIP/Destination IP address. Such a policy will distribute load evenly acrossphysically trunked links.

FIGURE 6-5 Incorrect Trunking Policy on Switch

Client 1IP:10.0.0.1Mac:0:8:c:a:b:1



Network

Ingress Traffic is distributed across Trunked ports

Sun ServerIP20.0.0.2

Mac:8:0:20:1:a:1

Router-int10IP10.0.0.100

Mac:0:0:8:8:8:2


Mac:0:0:8:8:8:1




Network


Mac:8:0:20:1:a:1


Mac:0:0:8:8:8:2


Mac:0:0:8:8:8:1

Ingress Traffic is not distributed across Trunked ports



FIGURE 6-5 shows an incorrect trunking policy on a switch. In this case, the ingresstraffic, which has single target IP address and target MAC, should not use a trunkingpolicy based solely on the destination IP address or destination or source MAC.

FIGURE 6-6 Correct Trunking Policy on Server

FIGURE 6-6 shows a correct trunking policy on a server with egress traffic that hasdistributed target IP address, but the target MAC of the default router should onlyuse a trunking policy based on round-robin or destination IP address. A destinationMAC will not work because the destination MAC will only point to the defaultrouter :0:0:8:8:1, not the actual client MAC.




Network


Mac:8:0:20:1:a:1


Mac:0:0:8:8:8:2


Mac:0:0:8:8:8:1

Egress Return Traffic is distributed across Trunked ports



FIGURE 6-7 Incorrect Trunking Policy on a Server

FIGURE 6-7 shows an incorrect trunking policy on the server. In this example, theegress traffic has a distributed target IP address, but the target MAC of the defaultrouter should not use a trunking policy based on the destination MAC because thedestination MAC will only point to the default router :0:0:8:8:1, not the actual clientMAC. Trunking policy should not use either the source IP address or the sourceMAC. The trunking policy should use the target IP addresses because that willspread the load across the physical interfaces evenly.




Network


Mac:8:0:20:1:a:1


Mac:0:0:8:8:8:2


Mac:0:0:8:8:8:1

Egress Return Traffic is not distributed across Trunked ports



FIGURE 6-8 Incorrect Trunking Policy on a Server

FIGURE 6-8 shows an incorrect trunking policy on a server. Even though the egresstraffic is using round-robin, it is not distributing the load evenly because all thetraffic belongs to the same session. In this case, trunking is not effective indistributing load across physical interfaces.

Availability Strategies Using SMLT and DMLTIn the past, server network resiliency leveraged IPMP and VRRP. Our deploymentsrevealed that network switches with relatively low-powered CPUs had problemsprocessing large numbers of ping requests due to ICMP health checks whencombined with other control processing such as VRRP routing calculations. Networkswitches were not designed to process a steady stream of ping requests in a timelymanner. Ping requests were traditionally used occasionally to troubleshoot networkissues. Hence, processing of ping requests was a lower priority than processingrouting updates and other control plane network tasks. As the number of IPMPnodes increases, the network switch soon runs out of CPU processing resources anddrops ping requests. This results in IPMP nodes falsely detecting router failures,which often result in a ping-pong effect of failing over back and forth acrossinterfaces. One recent advance, introduced in NortelNetworks switches, is calledSplit MultiLink Trunking (SMLT) and Distributed Multilink Trunking (DMLT). Inthis section, we describe several key tested configurations using NortelNetworksPassport 8600 Core switches and the smaller Layer 2 switches NortelNetworks




Network


Mac:8:0:20:1:a:1


Mac:0:0:8:8:8:2


Mac:0:0:8:8:8:1

Egress Return Traffic is not distributed across Trunked ports



Business Policy Switch 2000. These configurations illustrate how network highavailability can be achieved without encountering the scalability issues that haveplagued IPMP and VRRP deployments.

SMLT is a Layer 2 trunking redundancy mechanism. It is similar to plain trunkingexcept that it spans two physical devices. FIGURE 6-9 shows a typical SMLTdeployment using two NortelNetworks Passport 8600 Switches and a Sun serverwith dual GigaSwift cards. The trunk spans both cards, but each card is connected toa separate switch. SMLT technology, in effect, exposes one logical trunk to the Sunserver, when actually there are two physically separate devices.

FIGURE 6-9 Layer 2 High-Availability Design Using SMLT

FIGURE 6-10 shows another integration point where workgroup servers connect to thecorporate network at an edge point. In this case, instead of integrating directly intothe enterprise core, the servers connect to a smaller Layer 2 switch, which runsDMLT, a scaled version of the SMLT, but similar in functionality. DMLT has fewerfeatures and a smaller binary image than SMLT. This means that DMLT can run onsmaller network devices. The switches are viewed as one logical trunking deviceeven though packets are load shared across the links, with the switches ensuring

SW1

Passport8600 Core Passport8600 Core

Passport 8600 Core Passport 8600 Core

IST34

SMLT

trunk - head ce

ce0 ce1

SMLT BLOCK 2 - 20.0.0.0/24

IST12

SW2

SMLT BLOCK 1- 10.0.0.0/24

SW3 SW4



packets arrive in order at the remote destination. FIGURE 6-10 illustrates a server-to-edge integration of a Layer 2 high-availability design using Sun Trunking 1.3 andNortelNetworks Business Policy 2000 Wiring Closet Edge Switches.

FIGURE 6-10 Layer 2 High-Availability Design Using DMLT

CODE EXAMPLE 6-2 shows a sample configuration of the Passport 8600.

CODE EXAMPLE 6-2 Sample Configuration of the Passport 8600

## MLT CONFIGURATION PASSPORT 8600#

SW1

Passport8600 Core Passport8600 Core

Passport 8600 Core Passport 8600 Core

IST34

SMLT

trunk - head ce

ce0 ce1

SMLT BLOCK 2 - 20.0.0.0/24

IST12

SW2

SMLT BLOCK 1- 10.0.0.0/24

SW3 SW4

DMLT

BusinessPolicy2000

BusinessPolicy2000



Availability Using Spanning Tree ProtocolThe spanning tree algorithm was developed by Radia Perlman, currently with SunMicrosystems. The Spanning Tree Protocol (STP) is used on Layer 2 networks toeliminate loops. For added availability, redundant Layer 2 links can be added.However, these redundant links introduce loops, which cause bridges to forwardframes indefinitely.

By introducing STP, bridges communicate with each other by sending bridgeprotocol data units (BPDUs), which contain information that a bridge uses todetermine which ports forward traffic and which ports don’t based on the spanningtree algorithm. A typical BPDU contains information including a unique bridgeidentifier, the port identifier, and cost to the root bridge, which is the top of thespanning tree.

From these BPDUs, each bridge can compute a spanning tree and decide which portsto direct all forwarding of traffic. If a link fails, this tree is recomputed, andredundant links are activated by turning on certain ports—hence creating increasedavailability. A network needs to be designed to ensure that every possible link thatcould fail has some redundant link.

In older networks, bridges are still used. However, with recent advances in networkswitch technology and smaller Layer 2 networks, bridges are not used as much.

Availability Issues

To better understand failure detection and recovery, a testbed was created, as shownin FIGURE 6-11.

mlt 1 createmlt 1 add ports 1/1,1/8mlt 1 name "IST Trunk"mlt 1 perform-tagging enablemlt 1 ist create ip 10.19.10.2 vlan-id 10mlt 1 ist enablemlt 2 createmlt 2 add ports 1/6mlt 2 name "SMLT-1"mlt 2 perform-tagging enablemlt 2 smlt create smlt-id 1

#

CODE EXAMPLE 6-2 Sample Configuration of the Passport 8600 (Continued)

#



FIGURE 6-11 Spanning Tree Network Setup

The switches sw1, sw2, sw3, and sw4 were configured in a Layer 2 network with anobvious loop, which was controlled by running the STP among these switches. Onthe client, we ran the traceroute server command, resulting in the followingoutput, which shows that the client sees only two Layer 3 networks: the 11.0.0.0 andthe 16.0.0.0 network.

client># traceroute servertraceroute: Warning: Multiple interfaces found; using 16.0.0.51 @hme0traceroute to server (11.0.0.51), 30 hops max, 40 byte packets 1 16.0.0.1 (16.0.0.1) 1.177 ms 0.524 ms 0.512 ms 2 16.0.0.1 (16.0.0.1) 0.534 ms !N 0.535 ms !N 0.529 ms !N

Server11.0.0.51

Client16.0.0.51

s48t

s48b

sw1

sw2 sw3

sw4

Spanning tree7

7

7

78

8

8

8

Blocked port 8on sw4



Similarly, the server sees only two Layer 3 networks. We ran the tracerouteclient command on the server and got the following output:

The following outputs show the STP configuration and port status of theparticipating switches, showing the port MAC address of the root switches.

server># traceroute clienttraceroute: Warning: Multiple interfaces found; using 11.0.0.51 @hme0traceroute to client (16.0.0.51), 30 hops max, 40 byte packets 1 11.0.0.1 (11.0.0.1) 0.756 ms 0.527 ms 0.514 ms 2 11.0.0.1 (11.0.0.1) 0.557 ms !N 0.546 ms !N 0.531 ms !N

* sw1:17 # sh s0 ports 7-8Stpd: s0 Port: 7 PortId: 4007 Stp: ENABLED Path Cost: 4Port State: FORWARDING Topology Change Ack: FALSEPort Priority: 16Designated Root: 80:00:00:01:30:92:3f:00 Designated Cost: 0Designated Bridge: 80:00:00:01:30:92:3f:00 Designated PortId: 4007

Stpd: s0 Port: 8 PortId: 4008 Stp: ENABLED Path Cost: 4Port State: FORWARDING Topology Change Ack: FALSEPort Priority: 16Designated Root: 80:00:00:01:30:92:3f:00 Designated Cost: 0Designated Bridge: 80:00:00:01:30:92:3f:00 Designated PortId: 4008

* sw2:12 # sh s0 ports 7-8Port Mode State Cost Flags Priority Port ID DesignatedBridge7 802.1D FORWARDING 4 e-R-- 16 1639180:00:00:01:30:92:3f:008 802.1D FORWARDING 4 e-D-- 16 1639280:00:00:01:30:92:3f:00

Total Ports: 8

Flags: e=Enable, d=Disable, T=Topology Change Ack R=Root Port, D=Designated Port, A=Alternative Port



The following output shows that STP has blocked Port 8 on sw4.

To get a better understanding of failure detection and fault recovery, we conducted atest where the client continually sent a ping to the server, and we pulled a cable onthe spanning tree path.

* sw3:5 # sh s0 ports 7-8Stpd: s0 Port: 7 PortId: 4007 Stp: ENABLED Path Cost: 4Port State: FORWARDING Topology Change Ack: FALSEPort Priority: 16Designated Root: 80:00:00:01:30:92:3f:00 Designated Cost: 0Designated Bridge: 80:00:00:01:30:92:3f:00 Designated PortId: 4001

Stpd: s0 Port: 8 PortId: 4008 Stp: ENABLED Path Cost: 4Port State: FORWARDING Topology Change Ack: FALSEPort Priority: 16Designated Root: 80:00:00:01:30:92:3f:00 Designated Cost: 4Designated Bridge: 80:00:00:e0:2b:98:96:00 Designated PortId: 4008

* sw4:10 # sh s0 ports 7-8Stpd: s0 Port: 7 PortId: 4007 Stp: ENABLED Path Cost: 4Port State: FORWARDING Topology Change Ack: FALSEPort Priority: 16Designated Root: 80:00:00:01:30:92:3f:00 Designated Cost: 4Designated Bridge: 80:00:00:01:30:f4:16:a0 Designated PortId: 4008

Stpd: s0 Port: 8 PortId: 4008 Stp: ENABLED Path Cost: 4Port State: BLOCKING Topology Change Ack: FALSEPort Priority: 16Designated Root: 80:00:00:01:30:92:3f:00 Designated Cost: 4Designated Bridge: 80:00:00:e0:2b:98:96:00 Designated PortId: 4008



The following output shows that it took approximately 58 seconds for failuredetection and recovery, which is not acceptable in most mission-criticalenvironments. (Each ping takes about one second. The following output shows thatfrom icmp_seq=16 to icmp_seq=74, the pings did not succeed.)

Layer 3 StrategiesThere are several Layer 3 availability design options. Layer 3 availability designs aredesirable because there could be a fault at the IP layer, but not at the lower layers. Byimplementing a Layer 3 availability strategy, we can infer the status of the networkat all layers below, but not at the layers above. The fault detection and recovery canbe relatively slower than Layer 2 strategies, depending on the strategy.

In this section we explain the operation and recovery times for three approaches:

■ VRRP and IPMP – proven to be very useful at the server-to-default routernetwork connectivity segment of the data center network

■ OSPF – a proven and effective link-state routing protocol, suitable for inter-switchconnectivity

■ RIP – a time-tested distance vector routing protocol, suitable for inter-switchconnectivity.

on client---------

4 bytes from server (11.0.0.51): icmp_seq=12. time=1. ms64 bytes from server (11.0.0.51): icmp_seq=13. time=1. ms64 bytes from server (11.0.0.51): icmp_seq=14. time=1. ms64 bytes from server (11.0.0.51): icmp_seq=15. time=1. ms64 bytes from server (11.0.0.51): icmp_seq=16. time=1. msICMP Net Unreachable from gateway 16.0.0.1 for icmp from client (16.0.0.51) to server (11.0.0.51)......ICMP Net Unreachable from gateway 16.0.0.1 for icmp from client (16.0.0.51) to server (11.0.0.51)ICMP Net Unreachable from gateway 16.0.0.1 for icmp from client (16.0.0.51) to server (11.0.0.51)for icmp from client (16.0.0.51) to server (11.0.0.51)64 bytes from server (11.0.0.51): icmp_seq=74. time=1. ms64 bytes from server (11.0.0.51): icmp_seq=75. time=1. ms64 bytes from server (11.0.0.51): icmp_seq=76.



We describe how these network design strategies work and actually testedconfigurations.

VRRP Router RedundancyThe Virtual Router Redundancy Protocol (VRRP) was designed to remove a singlepoint of failure where hosts connected to the rest of the enterprise network orInternet through one default router. The VRRP is based on an election algorithm,where there are two routers: one master that owns both a MAC and an IP addressand one that is a backup. Both routers reside on one LAN or VLAN segment. Thehosts all point to one IP address that points to the master router. The master andbackup constantly send multicast messages to each other. Depending on the vendor-specific implementation, the backup will assume the master role if the master is nolonger functioning or has lowered in priority based on some criteria. The newmaster also assumes the same MAC address so that the clients do not need to updatetheir Address Resolution Protocol (ARP) caches.

The VRRP, by itself, has left open many aspects so that switch manufacturers canimplement and add features to differentiate themselves. All vendors offer a varietyof features that alter the priority, which can be tied to server health checks, numberof active ports, and so on. Whichever router has the highest priority becomes themaster. These configurations need to be closely monitored to prevent oscillations.Often, a switch is configured to be too sensitive, causing it to constantly changepriority and hence fluctuate between master and backup.

IPMP—Host Network Interface RedundancyThe purpose of the server redundant network interface capability is to increaseoverall system availability. If one server NIC fails, the backup will take over withintwo seconds. This is IP Multipathing (IPMP) on the Solaris operating system.

IPMP is a feature bundled with the Solaris operating system that is crucial increating highly available network designs. IPMP has a daemon that constantly sendspings to the default router, which is intelligently pulled from the kernel routingtables. If that router is not reachable, another standby interface in the same IPMPgroup then assumes ownership of the floating IP address. The switch re-runs theARP for the new MAC address and can contact the server again.

A typical highly available configuration includes a Sun server with dual NIC cards,which increases the availability of these components by several orders of magnitude.For example, the GigabitEthernet card, part number 595-5414-01, by itself has anMTBF of 199156 hours, and assuming approximately two hours’ mean time to



recovery (MTTR), it has an availability of 0.999989958. With two cards, the MTBFbecomes nine 9’s at .9999999996 availability. This small incremental cost has a bigimpact on the overall availability computation.

FIGURE 6-12 shows the Sun server redundant NIC model using IPMP. The server hastwo NICs, ge0 and ge1, with a fixed IP addresses of a.b.c.d and e.f.g.h. Thevirtual IP address of w.x.y.z is the IP address of the service. Client requests usethis IP address as the destination. This IP address floats between the two interfacesge0 or ge1. Only one interface can be associated with the virtual IP address at anyone time. If the ge0 interface owns the virtual IP address, then data traffic willfollow the P1 path. If the ge0 interface fails, then the ge1 interface will take overand associate the virtual IP address and data traffic will follow the P2 path. Failurescan be detected within two seconds, depending on the configuration.

FIGURE 6-12 High-Availability Network Interface Cards on Sun Servers

Integrated VRRP and IPMPBy combining the availability technologies of routers and server NICs, we can createa cell that can be reused in any deployment where servers are connected to routers.This reusable cell is highly available and scalable. FIGURE 6-13 shows how this isimplemented. Lines 1 and 2 show the VRRP protocol used by the routers to monitoreach other. If one router detects that the other has failed, the surviving routerassumes the role of master and inherits the IP address and MAC address of themaster.

Lines 3 and 5 in FIGURE 6-13 show how a switch can verify that a particularconnection is up and running. This verification can be port-based, link-based, orbased on Layers 3, 4, and 7. The router can make synthetic requests to the server and

HCS HCS

VLAN

ge0:a.b.c.d

w.x.y.z

ge1:e.f.g.h

P1

P2

IPMP-dual NIC



verify that a particular service is up and running. If it detects that the service hasfailed, then the VRRP can be configured, on some switches, to take this intoconsideration to impact the election algorithm and tie this failure to the priority ofthe VRRP router. Simultaneously, the server also monitors links. Currently, IPMPconsists of a daemon, in.mpathd, that constantly sends pings to the default router.As long as the default router can receive a ping, the master interface (ge0) assumesownership of the IP address. If the in.mpathd daemon detects that the defaultrouter is not reachable, automatic failover will occur, which brings down the linkand floats the IP address of the server to the surviving interface (ge1).

In the lab, we can tune IPMP and Extreme Standby Routing Protocol (ESRP) toachieve failure detection and recovery within one second. Because the ESRP is aCPU-intensive task and the control packets are on the same network as theproduction network, the trade-off is that if the switches, networks, or serversbecome overloaded, false failures can occur because the device can take longer thanthe strict timeout to respond to the peer’s heartbeat.

FIGURE 6-13 Design Pattern—IPMP and VRRP Integrated Availability Solution

OSPF Network Redundancy—Rapid ConvergenceOpen Shortest Path First (OSPF) is an intra-domain, link-state routing protocol. Themain idea of OSPF is that each OSPF router can determine the state of the link to allneighbor routers and the costs associated with sending data over that link. Oneproperty of this routing protocol is that each OSPF router has a view of the entirenetwork, which allows it to find the best path to all participating routers.

All OSPF routers in the domain flood each other with link state packets (LSPs),which contain the unique ID of the sending router; a list of directly connectedneighbor routers and associated costs; a sequence number and a time to live,

1

43 6

5

2VRRP

VRRP

ge0

ge1

in.mpathd



authentication, hierarchy, and load balancing; and checksum information. From thisinformation, each node can reliably determine if this LSP is the most recent bycomparing seq numbers and computing the shortest path to every node and thencollecting all LSPs from all nodes and comparing costs using Dijstras’ shortest pathalgorithm. To prevent continuous flooding, the sender never receives the same LSPpacket that it sent out.

To better understand OSPF for suitability from an availability perspective, thefollowing lab network was set up, consisting of Extreme Network switches and Sunservers. FIGURE 6-14 describes the actual setup used to demonstrate availabilitycharacteristics of the interior routing protocol OSPF.

FIGURE 6-14 Design Pattern—OSPF Network

Server11.0.0.51

Client16.0.0.51

s48t

12.0.0.0

18.0.0.0

17.0.0.0

15.0.0.0

13.0.0.0

14.0.0.0

s48b

sw1

sw2 sw3

sw4

Default pathfrom clientto server

Backuppath



To confirm correct configuration, traceroute commands were issued from client toserver. In the following output, the highlighted lines show the path through sw2:

The following tables show the initial routing tables of the core routers. The first twohighlighted lines in CODE EXAMPLE 6-3 show the route to the client through sw2. Thesecond two highlighted lines show the sw2 path.

client># traceroute servertraceroute: Warning: Multiple interfaces found; using 16.0.0.51 @hme0traceroute to server (11.0.0.51), 30 hops max, 40 byte packets 1 16.0.0.1 (16.0.0.1) 1.168 ms 0.661 ms 0.523 ms 2 15.0.0.1 (15.0.0.1) 1.619 ms 1.104 ms 1.041 ms3 17.0.0.1 (17.0.0.1) 1.527 ms 1.197 ms 1.043 ms4 18.0.0.1 (18.0.0.1) 1.444 ms 1.208 ms 1.106 ms 5 12.0.0.1 (12.0.0.1) 1.237 ms 1.274 ms 1.083 ms 6 server (11.0.0.51) 0.390 ms 0.349 ms 0.340 ms

CODE EXAMPLE 6-3 Router sw1 Routing Table

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 12.0.0.1 1 UG---S-um 63 0 net12 0*oa 11.0.0.0/8 12.0.0.1 5 UG-----um 98 0 net12 0*d 12.0.0.0/8 12.0.0.2 1 U------u- 1057 0 net12 0*d 13.0.0.0/8 13.0.0.1 1 U------u- 40 0 net13 0*oa 14.0.0.0/8 13.0.0.2 8 UG-----um 4 0 net13 0*oa 15.0.0.0/8 18.0.0.2 12 UG-----um 0 0 net18 0*oa 15.0.0.0/8 13.0.0.2 12 UG-----um 0 0 net13 0*oa 16.0.0.0/8 18.0.0.2 13 UG-----um 0 0 net18 0*oa 16.0.0.0/8 13.0.0.2 13 UG-----um 0 0 net13 0*oa 17.0.0.0/8 18.0.0.2 8 UG-----um 0 0 net18 0*d 18.0.0.0/8 18.0.0.1 1 U------u- 495 0 net18 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0

Origin(OR): b - BlackHole, bg - BGP, be - EBGP, bi - IBGP, bo - BOOTP, ct - CBT d - Direct, df - DownIF, dv - DVMRP, h - Hardcoded, i - ICMP

mo - MOSPF, o - OSPF, oa - OSPFIntra, or - OSPFInter, oe - OSPFAsExt o1 - OSPFExt1, o2 - OSPFExt2, pd - PIM-DM, ps - PIM-SM, r - RIP ra - RtAdvrt, s - Static, sv - SLB_VIP, un - UnKnown.

Flags: U - Up, G - Gateway, H - Host Route, D - Dynamic, R - Modified, S - Static, B - BlackHole, u - Unicast, m - Multicast.

Total number of routes = 12.



Mask distribution: 11 routes at length 8 1 routes at length 24


sw2:8 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 18.0.0.1 1 UG---S-um 27 0 net18 0*oa 11.0.0.0/8 18.0.0.1 9 UG-----um 98 0 net18 0*oa 12.0.0.0/8 18.0.0.1 8 UG-----um 0 0 net18 0*oa 13.0.0.0/8 18.0.0.1 8 UG-----um 0 0 net18 0*oa 14.0.0.0/8 17.0.0.2 8 UG-----um 0 0 net17 0*oa 15.0.0.0/8 17.0.0.2 8 UG-----um 9 0 net17 0*oa 16.0.0.0/8 17.0.0.2 9 UG-----um 0 0 net17 0*d 17.0.0.0/8 17.0.0.1 1 U------u- 10 0 net17 0*d 18.0.0.0/8 18.0.0.2 1 U------u- 403 0 net18 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0##


sw3:5 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 13.0.0.1 1 UG---S-um 26 0 net13 0*oa 11.0.0.0/8 13.0.0.1 9 UG-----um 0 0 net13 0*oa 12.0.0.0/8 13.0.0.1 8 UG-----um 121 0 net13 0*d 13.0.0.0/8 13.0.0.2 1 U------u- 28 0 net13 0*d 14.0.0.0/8 14.0.0.1 1 U------u- 20 0 net14 0*oa 15.0.0.0/8 14.0.0.2 8 UG-----um 0 0 net14 0*oa 16.0.0.0/8 14.0.0.2 9 UG-----um 0 0 net14 0*oa 17.0.0.0/8 14.0.0.2 8 UG-----um 0 0 net14 0*oa 18.0.0.0/8 13.0.0.1 8 UG-----um 0 0 net13 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0

CODE EXAMPLE 6-3 Router sw1 Routing Table (Continued)



The first two highlighted lines in CODE EXAMPLE 6-6 show the route back to theserver through sw4. The second two highlighted lines show the sw2 path.

To check failover capabilities on the OSPF, the interface on the switch sw2 wasdamaged to create a failure and a constant ping command was run from the clientto the server.

The interface on the switch sw2 was removed, and the measurement of failover wasperformed as shown in the following output. The first highlighted line shows whenthe interface sw2 fails. The second highlighted line shows that the new switchinterface sw3 route is established in two seconds.

OSPF took approximately two seconds to detect and recover from the failed node.

CODE EXAMPLE 6-6 Switch sw4 Routing Table

sw4:8 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 14.0.0.1 1 UG---S-um 29 0 net14 0*oa 11.0.0.0/8 17.0.0.1 13 UG-----um 0 0 net17 0*oa 11.0.0.0/8 14.0.0.1 13 UG-----um 0 0 net14 0*oa 12.0.0.0/8 17.0.0.1 12 UG-----um 0 0 net17 0*oa 12.0.0.0/8 14.0.0.1 12 UG-----um 0 0 net14 0*oa 13.0.0.0/8 14.0.0.1 8 UG-----um 0 0 net14 0*d 14.0.0.0/8 14.0.0.2 1 U------u- 12 0 net14 0*d 15.0.0.0/8 15.0.0.1 1 U------u- 204 0 net15 0*oa 16.0.0.0/8 15.0.0.2 5 UG-----um 0 0 net15 0*d 17.0.0.0/8 17.0.0.2 1 U------u- 11 0 net17 0*oa 18.0.0.0/8 17.0.0.1 8 UG-----um 0 0 net17 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0

client reading:

64 bytes from server (11.0.0.51): icmp_seq=11. time=2. ms64 bytes from server (11.0.0.51): icmp_seq=12. time=2. msICMP Net Unreachable from gateway 17.0.0.1for icmp from client (16.0.0.51) to server (11.0.0.51)ICMP Net Unreachable from gateway 17.0.0.1 for icmp from client (16.0.0.51) to server (11.0.0.51)64 bytes from server (11.0.0.51): icmp_seq=15. time=2. ms64 bytes from server (11.



The highlighted lines in the following output from the traceroute servercommand shows the new path from the client to the server through the switchinterface sw3.

The following code examples show the routing tables after the node failure. The firsthighlighted line in CODE EXAMPLE 6-7 shows the new route to the server through theswitch sw3. The second highlighted line shows that the switch sw2 link is down.

client># traceroute servertraceroute: Warning: Multiple interfaces found; using 16.0.0.51 @hme0traceroute to server (11.0.0.51), 30 hops max, 40 byte packets 1 16.0.0.1 (16.0.0.1) 0.699 ms 0.535 ms 0.581 ms 2 15.0.0.1 (15.0.0.1) 1.481 ms 0.990 ms 0.986 ms3 14.0.0.1 (14.0.0.1) 1.214 ms 1.021 ms 1.002 ms 4 13.0.0.1 (13.0.0.1) 1.322 ms 1.088 ms 1.100 ms 5 12.0.0.1 (12.0.0.1) 1.245 ms 1.131 ms 1.220 ms 6 server (11.0.0.51) 1.631 ms 1.200 ms 1.314 ms

CODE EXAMPLE 6-7 Switch sw1 Routing Table After Node Failure

sw1:27 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 12.0.0.1 1 UG---S-um 63 0 net12 0*oa 11.0.0.0/8 12.0.0.1 5 UG-----um 168 0 net12 0*d 12.0.0.0/8 12.0.0.2 1 U------u- 1083 0 net12 0*d 13.0.0.0/8 13.0.0.1 1 U------u- 41 0 net13 0*oa 14.0.0.0/8 13.0.0.2 8 UG-----um 4 0 net13 0*oa 15.0.0.0/8 13.0.0.2 12 UG-----um 0 0 net13 0*oa 16.0.0.0/8 13.0.0.2 13 UG-----um 22 0 net13 0*oa 17.0.0.0/8 13.0.0.2 12 UG-----um 0 0 net13 0d 18.0.0.0/8 18.0.0.1 1 --------- 515 0 -------- 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0


sw1:4 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 12.0.0.1 1 UG---S-um 63 0 net12 0*oa 11.0.0.0/8 12.0.0.1 5 UG-----um 168 0 net12 0*d 12.0.0.0/8 12.0.0.2 1 U------u- 1102 0 net12 0



The highlighted line in CODE EXAMPLE 6-10 shows the new route back to the clientthrough sw3.

OSPF is a good routing protocol with enterprise networks. It has fast failuredetection and recovery.

*d 13.0.0.0/8 13.0.0.1 1 U------u- 41 0 net13 0*oa 14.0.0.0/8 13.0.0.2 8 UG-----um 4 0 net13 0*oa 15.0.0.0/8 13.0.0.2 12 UG-----um 0 0 net13 0*oa 16.0.0.0/8 13.0.0.2 13 UG-----um 22 0 net13 0*oa 17.0.0.0/8 13.0.0.2 12 UG-----um 0 0 net13 0d 18.0.0.0/8 18.0.0.1 1 --------- 515 0 -------- 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0


sw3:6 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 13.0.0.1 1 UG---S-um 26 0 net13 0*oa 11.0.0.0/8 13.0.0.1 9 UG-----um 24 0 net13 0*oa 12.0.0.0/8 13.0.0.1 8 UG-----um 134 0 net13 0*d 13.0.0.0/8 13.0.0.2 1 U------u- 29 0 net13 0*d 14.0.0.0/8 14.0.0.1 1 U------u- 20 0 net14 0*oa 15.0.0.0/8 14.0.0.2 8 UG-----um 0 0 net14 0*oa 16.0.0.0/8 14.0.0.2 9 UG-----um 25 0 net14 0*oa 17.0.0.0/8 14.0.0.2 8 UG-----um 0 0 net14 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0


sw4:9 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 14.0.0.1 1 UG---S-um 29 0 net14 0*oa 11.0.0.0/8 14.0.0.1 13 UG-----um 21 0 net14 0*oa 12.0.0.0/8 14.0.0.1 12 UG-----um 0 0 net14 0*oa 13.0.0.0/8 14.0.0.1 8 UG-----um 0 0 net14 0*d 14.0.0.0/8 14.0.0.2 1 U------u- 12 0 net14 0*d 15.0.0.0/8 15.0.0.1 1 U------u- 216 0 net15 0*oa 16.0.0.0/8 15.0.0.2 5 UG-----um 70 0 net15 0*d 17.0.0.0/8 17.0.0.2 1 U------u- 12 0 net17 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0

CODE EXAMPLE 6-8 Switch sw2 Routing Table After Node Failure (Continued)

sw1:4 # sh ipr



RIP Network RedundancyThe Routing Information Protocol (RIP) is based on the Bellman-Ford distance vectoralgorithm. The idea behind the RIP is that each RIP router builds a one-dimensionalarray that contains a scalar notion of hops to reach all other hops. (In theory, OSPFwas able to use the notion of cost with greater accuracy, which could captureinformation such as link speed. However, in actual practice, this might not bepractical because of the increased burden of maintaining correct link costs in largechanging environments.) RIP routers flood each other with their view of the networkby first starting with directly connected neighbor routers and then modifying theirvector if peer updates show that there is a shorter path.

After a few updates, a complete routing table is constructed. When a router detects afailure, the distance is updated to infinity. Ideally, all routers would eventuallyreceive the proper update and adjust their tables accordingly. However, if thenetwork is designed with redundancy, there can be issues in properly updating thetables to reflect a failed link. There are problems such as count to infinity that havefixes such as “split horizon” and “poison reverse.” The RIP was a firstimplementation of the distance vector algorithm. The RIPv2, the most common,addresses scalability and other limitations of the RIP.

To better understand the failover capabilities of RIPv2, the test network shown inFIGURE 6-15 was set up.



FIGURE 6-15 RIP Network Setup

The following output shows the server-to-client path before node failure. Thehighlighted lines show the path through the switch sw3.

server># traceroute clienttraceroute: Warning: Multiple interfaces found; using 11.0.0.51@ hme0traceroute to client (16.0.0.51), 30 hops max, 40 byte packets 1 11.0.0.1 (11.0.0.1) 0.711 ms 0.524 ms 0.507 ms 2 12.0.0.2 (12.0.0.2) 1.448 ms 0.919 ms 0.875 ms 3 13.0.0.2 (13.0.0.2) 1.304 ms 0.977 ms 0.964 ms 4 14.0.0.2 (14.0.0.2) 1.963 ms 1.091 ms 1.151 ms 5 15.0.0.2 (15.0.0.2) 1.158 ms 1.059 ms 1.037 ms 6 client (16.0.0.51) 1.560 ms 1.170 ms 1.107 ms

Server11.0.0.51

Client16.0.0.51

s48t

12.0.0.0

18.0.0.0

17.0.0.0

15.0.0.0

13.0.0.0

14.0.0.0

s48b

sw1

sw2 sw3

sw4

Default pathfrom serverto client

If sw2 fails, backup pathbecomes active route



The following code examples show the initial routing tables. The highlighted line inCODE EXAMPLE 6-11 shows the path to the client through the switch sw3.

CODE EXAMPLE 6-11 Switch sw1 Initial Routing Table

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 12.0.0.1 1 UG---S-um 32 0 net12 0*r 11.0.0.0/8 12.0.0.1 2 UG-----um 15 0 net12 0*d 12.0.0.0/8 12.0.0.2 1 U------u- 184 0 net12 0*d 13.0.0.0/8 13.0.0.1 1 U------u- 52 0 net13 0*r 14.0.0.0/8 13.0.0.2 2 UG-----um 1 0 net13 0*r 15.0.0.0/8 18.0.0.2 3 UG-----um 0 0 net18 0*r 16.0.0.0/8 13.0.0.2 4 UG-----um 10 0 net13 0*r 17.0.0.0/8 18.0.0.2 2 UG-----um 0 0 net18 0*d 18.0.0.0/8 18.0.0.1 1 U------u- 12 0 net18 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0


sw2:3 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 18.0.0.1 1 UG---S-um 81 0 net18 0*r 11.0.0.0/8 18.0.0.1 3 UG-----um 9 0 net18 0*r 12.0.0.0/8 18.0.0.1 2 UG-----um 44 0 net18 0*r 13.0.0.0/8 18.0.0.1 2 UG-----um 0 0 net18 0*r 14.0.0.0/8 17.0.0.2 2 UG-----um 0 0 net17 0*r 15.0.0.0/8 17.0.0.2 2 UG-----um 0 0 net17 0*r 16.0.0.0/8 17.0.0.2 3 UG-----um 3 0 net17 0*d 17.0.0.0/8 17.0.0.1 1 U------u- 17 0 net17 0*d 18.0.0.0/8 18.0.0.2 1 U------u- 478 0 net18 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0##


sw3:3 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 13.0.0.1 1 UG---S-um 79 0 net13 0*r 11.0.0.0/8 13.0.0.1 3 UG-----um 3 0 net13 0



The highlighted line in CODE EXAMPLE 6-14 shows the path to the server through theswitch sw3.

The highlighted lines in the following output from running the tracerouteclient command show the new path from the server to the client through theswitch sw2 after the switch sw3 fails.

*r 12.0.0.0/8 13.0.0.1 2 UG-----um 44 0 net13 0*d 13.0.0.0/8 13.0.0.2 1 U------u- 85 0 net13 0*d 14.0.0.0/8 14.0.0.1 1 U------u- 33 0 net14 0*r 15.0.0.0/8 14.0.0.2 2 UG-----um 0 0 net14 0*r 16.0.0.0/8 14.0.0.2 3 UG-----um 10 0 net14 0*r 17.0.0.0/8 14.0.0.2 2 UG-----um 0 0 net14 0*r 18.0.0.0/8 13.0.0.1 2 UG-----um 0 0 net13 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0


sw4:7 # sh ipr

OR Destination Gateway Mtr Flags Use M-Use VLAN Acct-1*s 10.100.0.0/24 14.0.0.1 1 UG---S-um 29 0 net14 0*r 11.0.0.0/8 14.0.0.1 4 UG-----um 9 0 net14*r 12.0.0.0/8 14.0.0.1 3 UG-----um 0 0 net14 0*r 13.0.0.0/8 14.0.0.1 2 UG-----um 0 0 net14 0*d 14.0.0.0/8 14.0.0.2 1 U------u- 13 0 net14 0*d 15.0.0.0/8 15.0.0.1 1 U------u- 310 0 net15 0*r 16.0.0.0/8 15.0.0.2 2 UG-----um 16 0 net15 0*d 17.0.0.0/8 17.0.0.2 1 U------u- 3 0 net17 0*r 18.0.0.0/8 17.0.0.1 2 UG-----um 0 0 net17 0*d 127.0.0.1/8 127.0.0.1 0 U-H----um 0 0 Default 0

server># traceroute clienttraceroute: Warning: Multiple interfaces found; using 11.0.0.51 @hme0traceroute to client (16.0.0.51), 30 hops max, 40 byte packets 1 11.0.0.1 (11.0.0.1) 0.678 ms 0.479 ms 0.465 ms 2 12.0.0.2 (12.0.0.2) 1.331 ms 0.899 ms 0.833 ms3 18.0.0.2 (18.0.0.2) 1.183 ms 0.966 ms 0.953 ms 4 17.0.0.2 (17.0.0.2) 1.379 ms 1.082 ms 1.062 ms 5 15.0.0.2 (15.0.0.2) 1.101 ms 1.024 ms 0.993 ms 6 client (16.0.0.51) 1.209 ms 1.086 ms 1.074 ms

CODE EXAMPLE 6-13 Switch sw3 Initial Routing Table (Continued)

sw3:3 # sh ipr



The following output shows the result of the server ping commands.

The fault detection and recovery took in excess of 21 seconds. The RIPv2 is widelyavailable. However, the failure detection and recovery is not optimal.

Conclusions Drawn from EvaluatingFault Detection and Recovery TimesWe presented several approaches for increased availability for network designs byevaluating fault detection and recovery times and the adverse impact on computingand memory resources. In comparing these networking designs, we drew thefollowing conclusions:

■ Link aggregation is suitable for increasing the bandwidth capacity andavailability on point-to-point links only.

■ Layer 2 availability designs using Sun Trunking 1.3 and Split MultiLink Trunkingavailable on Nortel Networks Passport 8600 Switches were configured and tested.Distributed MultiLink Trunking was also configured and tested using Nortel’ssmaller Layer 2 Business Policy switches. Both switches were found to providerapid failure detection and failover recovery within two to five seconds. Furtherbenefit of this approach was that the failure and recovery events were transparentto the IP layer.

64 bytes from client (16.0.0.51): icmp_seq=18. time=2. ms64 bytes from client (16.0.0.51): icmp_seq=19. time=2. ms64 bytes from client (16.0.0.51): icmp_seq=20. time=2. msICMP Net Unreachable from gateway 12.0.0.2 for icmp from server (11.0.0.51) to client (16.0.0.51)ICMP Net Unreachable from gateway 12.0.0.2....for icmp from server (11.0.0.51) to client (16.0.0.51)ICMP Net Unreachable from gateway 12.0.0.2 for icmp from server (11.0.0.51) to client (16.0.0.51)ICMP Net Unreachable from gateway 12.0.0.2 for icmp from server (11.0.0.51) to client (16.0.0.51)64 bytes from client (16.0.0.51): icmp_seq=41. time=2. ms64 bytes from client (16.0.0.51): icmp_seq=42. time=2. ms64 bytes from client (16.0.0.51): icmp_seq=43. time=2. ms



■ Spanning Tree Protocol is not suitable because failure detection and recovery areslow. A recent improvement, IEEE 802.3w Rapid Spanning Tree, designed toimprove these limitations, might be worth considering in the future.

■ Layer 3 availability designs using VRRP and IPMP offer an alternative availabilitystrategy combination for server-to-network connection. This approach providesrapid failure detection and recovery and is economically feasible whenconsidering the increased MTBF calculations. Be sure to investigate the processingcapabilities of the control processor and consult with the vendor on the impact ofadditional load due to the ICMP ping commands caused by IPMP.




CHAPTER 7

Reference Design Implementations

This chapter describes network implementation concepts and details. It firstdescribes how the multi-tier services map to networks and VLANS. Then it describessome of the more important IP services to consider when crafting architectures formulti-tier data centers:

■ Server Load Balancing—how to achieve increased availability and performanceby redundancy of stateless applications

■ Layer 7 Switching—how to decouple internal applications from externalreferences

■ Network Address Translation—how to decouple internal IP addresses fromexternal references

■ Cookie Persistence—how to achieve stateful transactions over a stateless protocol

■ Secure Sockets Layer (SSL)—how to achieve secure transactions over a publicnetwork

■ IPMP—how to achieve network interface redundancy on servers that istransparent to applications

■ VRRP—how to achieve router redundancy.

The chapter then describes the logical network architecture and various physicalrealizations. Most important, it describes actual tested network referenceimplementations. It first describes the original secure multi-tier architecture and itslimitations. Then it describes a second architecture based on many small multi-layerand simple Layer 2 switches and their limitations. Finally, it describes in detail acollapsed network architecture based on large chassis-based switches. It is importantto note that these designs are vendor independent and could have been realized byCisco, Nortel, and other similar vendors or combinations thereof.

Network Equipment Providers usually implement standard Layer 2 and Layer 3functions using ASICs and there are few differences in their basic implementations.However, additional features such as load balancing can differentiate vendorssignificantly in how their products actually impact the network architecture. Weexplore two vendors and describe reference implementations that were configured

295


and tested. We then describe where it makes sense to use each design. We alsodiscuss how to create virtual firewalls between tiers to increase the level of securitywithout sacrificing wirespeed performance. In particular, we describe the testedconfiguration of Netscreen firewall and show how one box can be configured tocreate virtual firewalls, segregating and filtering inter-tier network traffic.

Logical Network ArchitectureThe logical network design is composed of segregated networks that areimplemented physically using virtual local area networks (VLANs) defined bynetwork switches. The internal network uses private IP address space (10.0.0.0)for security and portability advantages. FIGURE 7-1 shows a high-level overview ofthe logical network architecture.

The management network provides centralized data collection and management ofall devices. Each device has a separate interface to the management network toavoid contaminating the production network in terms of security and performance.The management network is also used for automating the installation of thesoftware using Solaris JumpStart™ technology.

Although several networks physically reside on a single active core switch, networktraffic is segregated and secured using static routes, ACLs, and VLANs. From apractical perspective, this is as secure as separate individual switches.



FIGURE 7-1 Logical Network Architecture Overview

Externalnetwork

192.168.10.0.

Web servicenetwork

10.10.0.0.

Namingservices network

10.20.0.0.

Managementnetwork

10.100.0.0.

Accessto allnetworks

Databaseservice network

10.100.0.0.

Backupnetwork

10.110.0.0.

SANnetwork

10.50.0.0.

Applicationservices network

10.30.0.0.

Clientnetwork

172.16.0.0.

Productionnetwork

Externalnetwork

Managementnetwork

Chapter 7 Reference Design Implementations 297


IP ServicesThe following subsections provide a description of some emerging IP services thatare often an important component in a complete network design for a Sun ONEdeployment. The IP services are divided into two categories:

■ Stateful Session Based – This class of IP services requires that the switchmaintain session state information so that a particular client’s session state ismaintained across all packets. This requirement has severe implications for highlyavailable solutions and limits scalability and performance.

■ Stateless Session Based – This class of IP services does not require that the switchmaintain any state information associated with a particular flow.

Many functions can be implemented either by network switches and appliances orby the Sun ONE software stack. This section describes how these new IP serviceswork and the benefit they provide. It then discusses availability strategies. Latersections describe similar functions that are included in the Sun ONE integratedstack.

Modern multilayer network switches perform many Layer 3 IP services in additionto vanilla routing. These services are implemented as functions that operate on apacket by modifying the packet headers and controlling the rate at which the packetis forwarded. IP services include functions such as QoS, server load balancing,application redirection, network address translation, and others. This section startsour discussion on an important service for data centers—server load balancing—andthen describes adjacent services that can be cascaded.

Stateless Server Load Balancing

The server load balancing (SLB) function maps incoming client requests destined toa virtual IP (VIP) address and port to a real server IP address and port. The targetserver is selected from a set of identically configured servers based on a predefinedalgorithm that considers the loads on the servers as criteria for choosing the bestserver at any instant in time. The purpose of SLB is to provide one layer ofindirection to decouple servers from the network service that clients interface with.Thus, the server load balancer can choose the best server to service a client request.Decoupling increases availability because if some servers fail, the service is stillavailable from the remaining functioning servers. Flexibility is increased becauseservers can be added or removed without impacting the service. Other redirectionfunctions can be cascaded to provide compound functionality.

SLB mapping functions differ from other mapping functions such as redirection,which makes mapping decisions based on criteria such as ensuring that a particularclient is redirected to the same server to take advantage of caches or cookiepersistence. FIGURE 7-2 shows an overview of the various mapping functions andhow the IP header is rewritten by various functions.



FIGURE 7-2 IP Services—Switch Functions Operate on Incoming Packets

FIGURE 7-2 shows that a typical client request is destined for an external VIP with IPaddress a.b.c.d and port 123. Various functions, as shown, can intercept thisrequest and rewrite it according to the provisioned configuration rules. The SLBalgorithm will eventually intercept the packet and rewrite the destination IP addressdestined to the real server, which was chosen by a particular algorithm. The packetis then returned as indicated by the source IP address.

Stateless Layer 7 Switching

Stateless Layer 7 switching, which is also called the application redirection function,intercepts a client’s HTTP request and redirects the request to anotherdestination—usually a group of cache servers. Application redirection rewrites theIP destination field. This is different from proxy switching, where the socketconnection is terminated and a new one is created to the server to fetch therequested Web page.

Application redirection serves the following purposes:

■ Reduces the load on one set of Web servers and redirects it to another set, whichis usually cache servers for specific content

■ Intercepts client requests and redirects to another destination for control of certaintypes of traffic based on filtered criteria

VIP1 = a.b.c.d:123 VIP2 = e.f.g.h:456

Real Dest. IP[1,2,3,...n]

srcIP:e.f.g.h srcPort: 456 dstIP:RealIP http://www.a.com/index.html

URLStringMatch

HTTPHeader Cookie Cache

SSLSession

ID

srcIP:a.b.c.d srcPort: 123 dstIP:VIP1 http://www.a.com/index.html

Packet A

Packet A1 Packet A2

SLB function finds the best server and rewrites the dstIP of the target real server—Real IP.

First Function rewrote srcIP and port so that the real server will reply to this switch, which is at srcIP e.f.g.h and port 456. Dest is set to the SLB function.

Client

srcIP:e.f.g.h srcPort: 456 dstIP:VIP2 http://www.a.com/index.html

SLB

CustomAlgorithm

RoundRobin

LeastConnections



FIGURE 7-3 illustrates the functional model of application redirection, which onlyrewrites the IP header.

FIGURE 7-3 Application Redirection Functional Model

Stateful Layer 7 Switching

Stateful Layer 7 switching, which is also called content switching, proxy switching,or URL switching, accepts a client’s incoming HTTP request, terminates the socketconnection, and creates another socket connection to the target Web server, which ischosen based on a user-defined rule. The difference between this and applicationredirection is the maintenance of state information. In application redirection, thepacket is rewritten and continues on its way. In content switching, state informationis required to keep track of client requests and server responses and to make surethey are tied together. The content switching function fetches the requested Webpage and returns it to the client.

FIGURE 7-4 shows an overview of the functional content switching model.

Application Redirection

client httprequest - DEST = A

-Filter has a defined destination IP-Client request meets filter criteria, request is intercepted, IP dest is rewritten to new desitination IP addr, DEST = B

servergroup 1DEST = A

servergroup 2DEST = B



FIGURE 7-4 Content Switching Functional Model

Content switching with full NAT serves the following purposes:

■ Isolates internal IP addresses from being exposed to the public Internet.

■ Allows reuse of a single IP address. For example, clients can send their Webrequests to www.a.com or www.b.com, where DNS maps both domains to asingle IP address. The proxy switch receives this request with the packetcontaining an HTTP header in the payload that contains the target domain, forexample a.com or b.com, and makes a decision to which group of servers toredirect this request.

■ Allows parallel fetching of different parts of Web pages from servers optimizedand tuned for that type of data. For example, a complex Web page might needGIFs, dynamic content, or cached content. With content switching, one set of Webservers can hold the GIFs and another can hold the dynamic content. The proxyswitch can make parallel fetches and retrieve the entire page at a faster rate thanwould be possible otherwise.

■ Ensures that requests with cookies or SSL session IDs are redirected to the sameserver to take advantage of persistence.

FIGURE 7-3 shows that the client’s socket connection is terminated by the proxyfunction. The proxy retrieves as much of the URL as needed to make a decisionbased on the retrieved URL. FIGURE 7-3 shows various URLs mapped to various

client httprequest

Layer 7 Switching Function

VIP

http://www.a.com/SMA/stata/index.html servergroup1

http://www.a.com/SMA/dnsa/index.html servergroup2

http://www.a.com/SMB/statb/index.html servergroup3

http://www.a.com/SMB/CACHEB/index.html servergroup4

http://www.a.com/SMB/DYNA/index.html servergroup1

servergroup 1stata

servergroup 2dnsa

servergroup 3statb

servergroup 4cacheb

servergroup 5dynab

- terminate socket connection- get URL- check against rules- forward to servergroup/slb function- or get valid cookie with server ID and forward it to the same server



server groups, which are VIP addresses. The next step is to forward the URL directlyor pass it off to the SLB function that is waiting for traffic destined to the servergroup.

The proxy is configured with a VIP, so the switch forwards all client requestsdestined to this VIP to the proxy function. The proxy function rewrites the IP header,particularly the source IP and port, so that the server sends back the requested datato the proxy, not the client directly.

Stateful Network Address Translation

Network Address Translation (NAT) is a critical component for security and propertraffic direction. There are two basic types of NAT: half and full. Half NAT rewritesthe destination IP address and MAC address to a redirected location such as Webcache, which returns the packet directly to the client because the source IP address isunchanged. Full NAT is where the socket connection is terminated by a proxy, so thesource IP and MAC are changed to that of the proxy server.

NAT serves the following purposes:

■ Security – Prevents exposing internal private IP addresses to the public.

■ IP Address Conservation – Requires only one valid exposed IP address to fetchInternet traffic from internal networks with non-valid IP addresses.

■ Redirection – Intercepts traffic destined to one set of servers and redirects it toanother by rewriting the destination IP and MAC addresses. The redirectedservers can send the request directly back to the clients with half NAT translatedtraffic because the original source IP address has not been rewritten.

NAT is configured with a set of filters, usually a 5-tuple Layer 3 rule. If the incomingtraffic matches a certain filter rule, the packet IP header is rewritten or anothersocket connection is initiated to the target server, which itself can be changed,depending on the rule.

Stateful Secure Sockets Layer Session ID Persistence

Secure Sockets Layer (SSL) can be implemented in software, hardware, or both. SSLcan be terminated at the target server, an intermediate server, an SSL networkappliance, or at an SSL-capable network switch. An SSL appliance, such as netscaleror array networks, tends to be implemented with a PC board and have a PCI-basedcard, which contains the SSL accelerator ASIC. Hence, the SSL acceleration isimplemented in libraries, which offload only the mathematical computations. Therest of the SSL processing is implemented in software, with selective functions beingdirected to the hardware accelerator. Clearly, one immediate limitation is the PCIbus. Other newer SSL devices have an SSL accelerator integrated in the datapath of



the network switch. These advanced products are just emerging from startups suchas Wincom Systems. This section discusses the switch and appliance interactions. Alater section covers the server SSL implementation.

FIGURE 7-5 shows that once a client makes initial contact to a particular server, whichmay have been selected based on SLB, the switch ensures that subsequent requestsare forwarded to the same SSL server based on the SSL ID that the switch has storedduring the initial SSL handshake. The switch keeps state information about theclient’s initial request based on HTTPS and port 443, which contain a hello message.This first request is then forwarded to the server selected by the SLB algorithm or byanother function. The server responds to the client’s hello message with an SSLsession ID. The switch then intercepts this SSL session and stores it in a table. Theswitch forwards all of the client’s subsequent requests to the same server as long aseach request contains the SSL session ID in the HTTP header. FIGURE 7-5 shows theremay be several different TCP socket connections that span the same SSL session.State is maintained by the SSL session ID in each HTTP request sent by the sameclient.

FIGURE 7-5 Network Switch with Persistence Based on SSL Session ID

An appliance can be added for increased performance in terms of SSL handshakesand bulk encryption throughput. FIGURE 7-7 illustrates how an SSL appliance wouldbe potentially deployed. Client requests first come in on a specific URL with theHTTPS protocol on port 443. The switch recognizes that these requests must bedirected to the appliance, which is configured to provide that SSL service. A typicalappliance such as Netscaler can also be configured, in addition to SSL acceleration,to provide content switching and load balancing. The appliance then reads or insertscookies and resubmits the HTTP request to an appropriate server, which canmaintain state based on the cookie that was in the HTTP header.

Client

HTTP Session 3HTTP Session 2HTTP Session 1

HTTP Session 4

SSI Server 1

Switch - Stores SSL Session ID and switches client to

same SSL Server

SSTunnel - SSL Session ID

SSI Server 2



FIGURE 7-6 Tested SSL Accelerator Configuration—RSA Handshake and Bulk Encryption

Stateful Cookie Persistence

The HTTP 1.0 protocol was originally designed to provide static pages in onetransaction. As more complex Web sites evolved, requiring that multiple HTTPrequests access the same server, performance was severely limited by the closing andopening of TCP socket connections. This was solved by HTTP 1.1, which allowedpersistent connection. Immediately after a socket connection, the client can pipelinemultiple requests. However, as more complex Web sites evolved to includeapplications such as the shopping cart, which required persistence across multipleHTTP 1.1 requests that were further complicated by proxies and load balancers thatinterfere with the traffic being redirected to the same Web server, anothermechanism was required to maintain state across multiple HTTP 1.1 requests. Thesolution was the introduction of two new headers in the HTTP request: Set-Cookieand Cookies as defined in RFC 2109. These headers carried the state informationbetween the client and server. Typically, most load-balancing switches have enoughintelligence to ensure that a particular client’s session with a particular server ismaintained based on the cookie inserted by the server and maintained by the client.

Client Multilayer switch

Sun ONE web servers

SSL accelerator appliancekey exchange and bulk encryption

Session persistencebased on SSL

session ID Session persistencebased on cookie

Loadbalancer

Internet

SSL http

http

https



FIGURE 7-7 Network Availability Strategies

Design Considerations: AvailabilityFIGURE 7-7 shows a cross section of the tier types and functions that are performed ateach tier. Also shown are the availability strategies for the Network and Web tier.External tier availability strategies are outside the scope of this book. We will limitour discussion to the services tiers, which include Web, Application Services,Naming, and so on.

Designing network architectures for optimal availability requires maximizing twoorthogonal components:

■ Intra Availability – Refers to maximizing the function that estimates failureprobability of the components themselves. The components that cause the failureare only considered by the following equation:

FAvailability = MTBF ÷ (MTBF + MTTR)

■ Inter Availability – Refers to minimizing the impact of failures caused by factorsexternal to the system such as single points of failure (SPOFs), power outages, ora technician accidently pulling out a cable.

Client Tier

SW1

SW2

External NetworkConnectivity

Network Tier - Layer 3 Redundancy

Session-Based Services require Session Sharing. Other stateless services failover with no problem.

Services Tier -Sun Serverswith IPMP -dual NICs



It is not sufficient to simply maximize the FAvailability function. The SPOF andenvironmental factors also must be considered. The networks designed in thischapter describe a highly available architecture that conforms to these designprinciples and is described in further detail later.

FIGURE 7-8 Logical Network Architecture—Design Details

Externalnetwork

192.168.10.0.

Web servicenetwork

10.10.0.0.

Namingservices network

10.20.0.0.

Managementnetwork

10.100.0.0.

Accessto allnetworks

Databaseservice network

10.100.0.0.

Backupnetwork

10.110.0.0.

SANnetwork

10.50.0.0.

Applicationservices network

10.30.0.0.

Clientnetwork

172.16.0.0.

Productionnetwork

Externalnetwork

Managementnetwork



FIGURE 7-8 is repeated here to simplify a detailed discussion. The diagram shows anoverview of the logical network architecture, showing how the tiers map to thedifferent networks, which are also mapped to segregated VLANs. This segregationallows inter-tier traffic to be controlled by filters on the switch or a firewall, which isthe only bridge point between VLANs. The following describes each subnetwork:

■ External network – The external facing network that directly connects to theInternet. All IP addresses must be registered and should be secured with afirewall.

The following networks are assigned non-routable IP addresses based on RFC1918, which can also be based on the following:

10.0.0.0 – 10.255.255.255 (10/8 prefix)

172.16.0.0 – 172.31.255.255 (172.16/12 prefix)

192.168.0.0 – 192.168.255.255 (192.168/16 prefix)

■ Web services network – A dedicated network that contains Web servers. Typicalconfigurations include a load-balancing switch, which can be configured to allowthe Web server to return the client’s HTTP request directly or to require the loadbalancing device to return the request on behalf of the provider Web server.

■ Naming services network – A dedicated network that consists of servers thatprovide LDAP, DNS, NIS, and other naming services. The services are for internaluse only and should be highly secure. Internal infrastructure support servicesmust be sure that requests originate and are destined to internal servers. Mostrequests tend to be read intensive, hence their potential for caching strategies forincreased performance.

■ Management network – A dedicated service network that provides managementand configuration of all servers, including jumpstart of new systems.

■ Backup network – A dedicated service network that provides backup and restoreoperations pivotal to minimizing disturbances to other production servicenetworks during backup and other network bandwidth-intensive operations.

■ Device network – This is a dedicated network that attaches IP storage devices andother devices.

■ Application services network – A dedicated network that typically consists oflarge multi-CPU servers that host multiple instances of the Sun ONE Applicationserver software image. These requests tend to be low network bandwidthintensive but may span multiple protocols, including HTTP, CORBA, proprietaryTCP, and UDP. The network traffic can also be significant when Sun ONEApplication server clustering is enabled. Every update to a stateful session beantriggers a multicast update to all servers on this dedicated network so thatparticipating cluster nodes update the appropriate stateful session bean. Networkutilization increases in direct proportion to the intensity of session bean updates.



■ Database network – A dedicated network that typically consists of one or twomulti-CPU database servers. The network traffic typically consists of JavaDataBase Connectivity™ (JDBC) traffic between the application server or the Webserver.

Collapsed Layer 2/Layer 3 Network DesignEach service is deployed in a dedicated Class C network where the first three octetsrepresent the network number. The design represents an innovative approach whereseparate Layer 2 devices are not required because the functionality is collapsed intothe core switch. Decreasing the management and configuration of separate deviceswhile maintaining the same functionality is a major step toward cutting costs andincreasing reliability.

FIGURE 7-9 shows how a traditional configuration requires two Layer 2 switches. Aspecific VLAN spans the six segments that give each interface access to the VLANon failover.

FIGURE 7-9 Traditional Availability Network Design Using Separate Layer 2 Switches

Edge switch

Masterswitch

(Layer 3)

Standbyswitch(Layer 3)

Layer 2switch

Layer 2switch

Client network

Networkinterface 0

Networkinterface 1

Sun server



The design shown in FIGURE 7-10 results in the same network functionality, buteliminates the need for two Layer 2 devices. This is accomplished using a taggedVLAN interconnect between the two core switches. By collapsing the Layer 2functionality, there is a reduction in the number of network devices, providing fewerunits that might fail, lower cost, and reduced manageability issues.

FIGURE 7-10 Availability Network Design Using Large Chassis-Based Switches

Multi-Tier Data Center Logical Design

The logical network design for the multi-tier data center (FIGURE 7-11) incorporatesserver redundant network interfaces and integrated VRRP and IPMP. See“Integrated VRRP and IPMP” on page 280 for more information.

Client network

Edgeswitch

Standby core switch 2Master core switch 1

Networkinterface 0

Networkinterface 1

Sun server



FIGURE 7-11 Logical Network Architecture with Virtual Routers, VLANs, and Networks

172.16.0.1

Clients

192.168.0.1

Master

Servers

Slave

192.168.0.2

10.10.0.110.50.0.1

10.20.0.110.40.0.1

10.30.0.1

192.168.0.2

10.10.0.1 10.50.0.1

10.20.0.1 10.40.0.1

10.30.0.1



TABLE 7-1 summarizes the eight separate networks and associated VLANs.

The edge network connects to the internal network in a redundant manner. One ofthe core switches has ownership of the 192.16.0.2 IP address, which means thatswitch is the master and the other is in slave mode. When the switch is in slavemode, it does not respond to any traffic, including ARPs. The master also assumesownership of the MAC that floats along with the virtual IP address of 192.16.0.2.

Note – If you have multiple NICs, make sure each NIC uses its unique MACaddress.

Each switch is configured with the identical networks and associated VLANS, asshown in TABLE 7-1. An interconnect between the switches extends each VLAN but istagged to allow multiple VLAN traffic to share a physical link (this requires anetwork interface, such as the Sun ge, that supports tagged VLANS). The Sunservers connect to both switches in the appropriate slot, where only one of the twointerfaces will be active.

Although most switches support Routing Information Protocol (RIP and RIPv2),Open Shortest Path First (OSPF), and Border Gateway Protocol v4 (BGP4), staticroutes provide a more secure environment. A redundancy protocol based on virtualrouter redundancy protocol (VRRP, RFC 2338) runs between the virtual routers. TheMAC address of the virtual routers floats among the active virtual routers so that theARP caches of the servers do not need any updates when a failover occurs.

TABLE 7-1 Network and VLAN Design

Name Network Default Router VLAN Purpose

client 172.16.0.0 172.16.0.1 client Client load generation

edge 192.16.0.0 192.16.0.1 edge Connects client network to the data center

web 10.10.0.0 10.10.0.1 web Web services

ds 10.20.0.0 10.20.0.1 ds Directory services

db 10.30.0.0 10.30.0.1 db Database services

app 10.40.0.0 10.40.0.1 app Application services

dns 10.50.0.0 10.50.0.1 dns DNS services

mgt 10.100.0.0 10.100.0.1 mgt Management and administration



How Data Flows Through the Service ModulesWhen a client makes a request, it can be handled in one of two ways, depending onthe type of request. A Web server might return information to the client directly or itmight forward the request to an application server for further processing.

In the case where the client’s request is for static content such as images, the requestis handled directly by the Web server module. These requests are handled quicklyand do not present a heavy load to the client or server.

In the case where the client requests dynamically generated content that requiresJava Server Pages (JSP) or servlet processing, the request is passed to the applicationservice module for processing. This is often the bottleneck for large-scaleenvironments.

The application server runs the core of the application that handles the businesslogic to service the client request, either directly or indirectly. Over the course ofhandling the business logic, the application server can use many supportingresources, including directory servers, databases, and perhaps even other Webapplication services.

FIGURE 7-12 illustrates how the data flows through the various system interfacesduring a typical application services request. TABLE 7-2 provides a description of eachnumbered interaction.



FIGURE 7-12 Logical Network

1 12

2 11

3 54

89

10

76

Clients

Switchingservices

Web services

Directoryservices

Applicationservices

Databaseservices



TABLE 7-2 Sequence of Events for FIGURE 7-12

Item Interface1 Interface2 Protocol Description

1 Client Switch HTTP/HTTPS

Client initiates Web request. Client communication canbe HTTP or HTTPS (HTTP with secure socket layer).HTTPS can be terminated at the switch or at the Webserver.

2 Switch Web server HTTP/HTTPS

Switch redirects client request to appropriate Web server.

3 Web Server Applicationserver

Applicationserver Webconnectorover TCP

The Web server redirects the request to the applicationserver for processing. Communication passes through aWeb server plug-in over a proprietary TCP-basedprotocol.

4 Applicationserver

Directoryserver

LDAP The Java™ 2 Enterprise Edition (J2EE) application hostedby the application server identifies the requested processas requiring specific authorization. It sends a request tothe directory server to verify that the user has validauthorization.

5 Directoryserver

Applicationserver

LDAP The directory server successfully verifies theauthorization through the user’s LDAP role. Thevalidated response is returned to the application server.Application server then processes business logicrepresented in J2EE application.

6 Applicationserver

Databaseserver

JDBC The business logic requests data from a database as inputfor processing. The requests may come from servlets,Java™ Data Objects, or Enterprise Java Beans (EJBs) thatin turn use Java DataBase Connectivity (JDBC) to accessthe database.

7 Databaseserver

Applicationserver

JDBC The JDBC request can contain any valid SQL statement.The database processes the request natively and returnsthe appropriate result through JDBC to the applicationserver.

8 Applicationserver

Web server Applicationserver Webconnectorover TCP

The J2EE application completes the business logicprocessing, packages the data for display (usuallythrough a JSP that renders HTML) and returns theresponse to the Web server.

9 Web server Switch HTTP/HTTPS

Switch receives reply from Web server.

10 Switch Client HTTP/HTTPS

Switch rewrites IP header and returns request to client.



Physical Network ImplementationsThe next step involves constructing a real network based on the logical networkarchitecture. You can use several approaches to realize the network that functionallysatisfies the logical architectural requirements.

The multi-tier data center is vendor independent, and you can use the networkequipment that best suits your environment. We briefly describe the original multi-tier data center implementation (secure multi-tier architectures), then we describethe multiswitch approach, and finally we describe the collapsed approach.

Secure Multi-TierFIGURE 7-13 shows the overall structure of a classic multi-tier design.

FIGURE 7-13 Secure Multi-Tier

The advantages of this approach are simplicity and security. Clearly the only way toaccess the Data tier is through the application servers. There are no other possiblenetwork paths to access the Data tier. The drawbacks are limited flexibility andmanageability. If an application running on the Web server needs to connect to an

Web Web

App App

DB DB



LDAP server or a database through a JDBC connection, a fundamental change to thearchitecture will be needed. As the number of tiers increases, so does the number ofswitches, which becomes a management issue.

Multi-Level Architecture Using Many SmallSwitchesFIGURE 7-14 shows the overall structure of a multi-level architecture that is composedof many small port density switches.

FIGURE 7-14 Multi-Tier Data Center Architecture Using Many Small Switches

MultilayerSwitch

MultilayerSwitch

MultilayerSwitch

Layer 2

Layer 2 Layer 2

Layer 2 Layer 2

MultilayerSwitch

MultilayerSwitch

MultilayerSwitch

Layer 2

Layer 2 Layer 2

Layer 2 Layer 2



This approach has few advantages and many disadvantages. One advantage is thatthe entry cost is low. One can start from a very small deployment, procuring smalleight-port multilayer switches and Layer 2 switches and increasing the tiers andservers to the point where the ingress links become a bottleneck or the port densityof the small multilayer switches becomes an issue. Actual tested configurationsleveraged Alteon 180 switches as the multilayer switches and Extreme NetworksSummit 48i for the Layer 2 switches, which had gigabit uplinks and 10/100 ports forconnections to the server. This architecture has the following disadvantages:

■ Lower Availability – Because of the number of links and devices, more things cango wrong. In particular, the serial connections drastically reduce the MBTF. Thelinks are often prone to accidents and should be kept to a minimum. Due to thearchitecture, the link failure detection time and recovery is much slower becauseof the number of layers.

■ Waste – In any network architecture, stateless functionality should be deployedtowards the center of the network and complex processing should be deployed atthe outermost edge. Having two layers of multilayer switches is a tremendouswaste in terms of packet processing and equipment cost. When a packetundergoes Layer 7 processing, especially by software, it is extremely slow. Thecost of a multilayer switch is much more than that of a plain Layer 2 orLayer 3 device.

■ Manageability – As the number of switches increases, so does the manageabilityworkload.

Flat Architecture Using Collapsed Large ChassisSwitchesThe flat network architecture using collapsed large chassis switches was found to bethe best design for large-scale multi-tier deployments in availability, performance,and manageability.

In the lab, we built two different network configurations. One configuration usedExtreme Networks equipment (FIGURE 7-15), and the other used Foundry Networksequipment (FIGURE 7-16).

The Extreme Networks switch that we used has built-in load balancing, so there wasno need for an external load-balancing device.

The Foundry Networks products required use of a separate load-balancing switch.



FIGURE 7-15 Network Configuration with Extreme Networks Equipment

Client 1 Client 2

L2-L3 edge switch

WebserviceTier

Sun Fire6800

Sun Fire680010.30.0.101

Sun Fire6800

Sun Fire6800

10.30.0.100

Clientaccess

Extremeswitches


DatabaseserviceTier

DirectoryseviceTier

Extreme switch192.168.10.2



T3

T3

192.168.10.1

10.10.0.110.20.0.110.40.0.110.30.0.1

10.50.0.1

10.10.0.100 10.10.0.101 10.10.0.102 10.10.0.103

10.20.0.100 10.20.0.101 10.20.0.102 10.20.0.103

10.40.0.100 10.40.0.101

Extreme switch192.168.10.3Core Core



FIGURE 7-16 Sun ONE Network Configuration with Foundry Networks Equipment

Client 1 Client 2


WebserviceTier

Sun Fire6800

Sun Fire680010.30.0.101

Sun Fire6800

Sun Fire6800

10.30.0.100

ClientaccessTier


DatabaseserviceTier





T3

T3

192.168.10.1

10.10.0.110.20.0.110.40.0.110.30.0.1

10.50.0.1

10.10.0.100 10.10.0.101 10.10.0.102 10.10.0.103

10.20.0.100 10.20.0.101 10.20.0.102 10.20.0.103

10.40.0.100 10.40.0.101



Foundryswitches



Physical Network—ConnectivityThe physical wiring of the architecture is shown in FIGURE 7-17 and described inTABLE 7-3

TABLE 7-3 Physical Network Connections and Addressing

Switch Description PortPHYSpeed Base Address Netmask

edge Client network to externalnetwork router

1,2,3,4 ge 172.16.0.1 255.255.255.0

edge External network - mls1 5,6 ge 192.168.10.1 255.255.255.0

mls1 External network 1 ge 192.168.10.2 255.255.255.0

mls1 Web/app service router 3,4,5,6 ge 10.10.0.1 255.255.255.0

mls1 Directory service router 7,8 ge 10.20.0.1 255.255.255.0

mls1 Database services router 9,10 ge 10.30.0.1 255.255.255.0

mls2 External network 1 ge 192.168.10.2 255.255.255.0

mls2 Web/app service router 3,4,5,6 ge 10.10.0.1 255.255.255.0

mls2 Directory services router 7,8 ge 10.20.0.1 255.255.255.0

mls2 Database services router 9,10 ge 10.30.0.1 255.255.255.0



FIGURE 7-17 Physical Network Connections and Addressing

Client1Client2

Edge

mls1 mls2

ge0:172.16.0.101/24ge0:172.16.0.102/24

172.16.0.1/24

192.168.0.1/24

192.168.0.2/24 192.168.0.2/24

192.168.0.101/24 192.168.0.102/24

1 2 3 4

5 67 hme0:10.100.16.101hme0:10.100.16.102

10.100.16.1

10.100.168.2

hme010.100.10.101

10.100.168.2

hme0:10.100.10.105

hme0:10.100.20.101 hme0:10.100.20.103

192.168.0.2/24 192.168.0.2/24

hme0:10.100.30.101 hme0:10.100.30.103

ge0:10.10.0.101/24

ge1:10.10.0.102/24

ge0:10.10.0.103/24

ge1: 10.10.0.104/24

ge0:10.10.0.105/24

ge1:10.10.0.106/26

ge0:10.10.0.107/24

ge1:10.10.0.108/26

app1

app2

app1

app2

10.10.0.1/24 10.10.0.1/24

ge0:10.40.0.101/24 ge1:10.40.0.102/24

ge0:10.40.0.103/24 ge1:10.40.0.104/24

ge0:10.40.0.105/24 ge1:10.40.0.106/24

ge0:10.10.0.107/24 ge1:10.40.0.108/24

web1 web1 web1 web1

ds1 ds2

10.20.0.1/24 10.20.0.1/24

ge0:10.20.0.101/24

ge1:10.20.0.102/24 ge1:10.20.0.104/24

ge0:10.20.0.103/24

10.30.0.1/24 10.30.0.1/24

ge1:10.30.0.104/24

db1 db2

ge0:10.30.0.101/24

ge1:10.30.0.102/24

ge0:10.30.0.103/24



Switch ConfigurationA high-level overview of the switch configuration is shown in FIGURE 7-18.

FIGURE 7-18 Collapsed Design Without Layer 2 Switches

Extreme Networks Summit 7i

Extreme Networks - Black Diamond 6808

Slot 1

client172.16.0.1

edge192.168.0.2

web10.10.0.1

edge192.168.0.2

Slot 2ds

10.20.0.1

Slot 3db

10.30.0.1

Slot 4app

10.40.0.1

Slot 5dns

10.50.0.1

Slot 6 esrp interconnect - web, ds, db, app, dns

Slot 7

Slot 8mgt

10.100.0.1

Extreme Networks - Black Diamond 6808

Slot 1web

10.10.0.1edge

192.168.0.2

Slot 2ds

10.20.0.1

Slot 3db

10.30.0.1

Slot 4app

10.40.0.1

Slot 5dns

10.50.0.1

Slot 6esrp interconnect - web, ds, db, app, dns

Slot 7

Slot 8mgt

10.100.0.1



Configuring the Extreme Networks SwitchesFor the multi-tier data center, two Extreme Networks BlackDiamond switches wereused for the core switches and one Summit7i switch was used for the edge switch.

Note – Network equipment from Foundry Networks can be used instead. See“Configuring the Foundry Networks Switches” on page 324.

▼ To Configure the Extreme Networks Switches1. Configure the core switches.

The following example shows an excerpt of the switch configuration file.

## MSM64 Configuration generated Thu Dec 6 20:19:20 2001# Software Version 6.1.9 (Build 11) By Release_Master on 08/30/0111:34:27

configure slot 1 module g8xconfigure slot 2 module g8xconfigure slot 3 module g8xconfigure slot 4 module g8xconfigure slot 5 module g8xconfigure slot 6 module g8xconfigure slot 7 module f48tconfigure slot 8 module f48t.....................................................configure dot1q ethertype 8100configure dot1p type dot1p_priority 0 qosprofile QP1configure dot1p type dot1p_priority 1 qosprofile QP2configure dot1p type dot1p_priority 2 qosprofile QP3configure dot1p type dot1p_priority 3 qosprofile QP4.....................................................enable sys-health-checkconfigure sys-health-check alarm-level logenable system-watchdogconfig qosprofile QP1 minbw 0% maxbw 100% priority Low minbuf 0%maxbuf 0 Kconfig qosprofile QP2 minbw 0% maxbw 100% priority LowHi minbuf 0%maxbuf 0 K



2. Configure the edge switch.

The following example shows an excerpt of the switch configuration file.

Configuring the Foundry Networks SwitchesThis section describes the network architecture implementation using FoundryNetworks equipment instead of Extreme Networks equipment. The overall setup isshown in FIGURE 7-19.

## Summit7i Configuration generated Mon Dec 10 14:39:46 2001# Software Version 6.1.9 (Build 11) By Release_Master on 08/30/0111:34:27configure dot1q ethertype 8100configure dot1p type dot1p_priority 0 qosprofile QP1....................................................enable system-watchdogconfig qosprofile QP1 minbw 0% maxbw 100% priority Low minbuf 0%maxbuf 0 K....................................................delete protocol ipdelete protocol ipxdelete protocol netbiosdelete protocol decnetdelete protocol appletalk....................................................# Config information for VLAN Default.config vlan “Default” tag 1 # VLAN-ID=0x1 Global Tag 1config vlan “Default” protocol “ANY”config vlan “Default” qosprofile “QP1”enable bootp vlan “Default”....................................................



FIGURE 7-19 Foundry Networks Implementation

Client Client Client Client

Servers Servers

Servers Servers

Servers Servers

Servers Servers

Servers Servers

Webservicemodule

Directoryservicemodule

Applicationservicemodule

Databaseservicemodule

Netscreenfirewall

NS5200Netscreen

firewall

NS5200

Server loadbalancer

SLB1BigIron

layer 2/3 switch

MLS0BigIron

layer 2/3 switch

MLS1Server load

balancer

SLB0

Extreme NetworksSummit 7i

S7i



Master Core Switch Configuration

CODE EXAMPLE 7-1 shows an example of the configuration file for the master coreswitch (called MLS0 in the lab). We used the Foundry Networks BigIron switch.

CODE EXAMPLE 7-1 MLS0 Configuration File

module 1 bi-jc-8-port-gig-m4-management-modulemodule 3 bi-jc-48e-port-100-module!global-protocol-vlan!vlan 1 name DEFAULT-VLAN by portvlan 10 name refarch by port untagged ethe 1/1 ethe 3/1 to 3/16 router-interface ve 10vlan 99 name mgmt by port untagged ethe 3/47 to 3/48 router-interface ve 99!hostname MLS0ip default-network 129.146.138.0/16ip route 192.168.0.0 255.255.255.0 172.0.0.1ip route 129.148.181.0 255.255.255.0 129.146.138.1ip route 0.0.0.0 0.0.0.0 129.146.138.1!router vrrp-extendedinterface ve 10 ip address 20.20.0.102 255.255.255.0 ip address 172.0.0.70 255.255.255.0 ip vrrp-extended vrid 1 backup priority 100 track-priority 20 advertise backup ip-address 172.0.0.10 dead-interval 1 track-port e 3/1 enable ip vrrp-extended vrid 2 backup priority 100 track-priority 20 advertise backup ip-address 20.20.0.100 dead-interval 1 track-port e 3/13 enable!interface ve 99 ip address 129.146.138.10 255.255.255.0end



Standby Core Switch Configuration

CODE EXAMPLE 7-2 shows a partial listing of the configuration file for the standbycore switch (called MLS1 in the lab). Again we used the Foundry Networks BigIronswitch.

CODE EXAMPLE 7-2 MLS1 Configuration File

ver 07.5.05cT53!module 1 bi-jc-8-port-gig-m4-management-modulemodule 3 bi-jc-48e-port-100-module!global-protocol-vlan!vlan 1 name DEFAULT-VLAN by port!vlan 99 name swan by port untagged ethe 1/6 to 1/8 router-interface ve 99!vlan 10 name refarch by port untagged ethe 3/1 to 3/16 router-interface ve 10!!hostname MLS1ip default-network 129.146.138.0/1ip route 192.168.0.0 255.255.255.0 172.0.0.1ip route 0.0.0.0 0.0.0.0 129.146.138.1!router vrrp-extendedinterface ve 10 ip address 20.20.0.102 255.255.255.0 ip address 172.0.0.71 255.255.255.0 ip vrrp-extended vrid 1 backup priority 100 track-priority 20 advertise backup ip-address 172.0.0.10 dead-interval 1 track-port e 3/1 enable ip vrrp-extended vrid 2 backup priority 100 track-priority 20 advertise backup ip-address 20.20.0.100 dead-interval 1 track-port e 3/13enable

interface ve 99



Server Load Balancer

The following code box shows a partial listing of the configuration file used for theserver load balancer (called SLB0 in the lab). We used the Foundry Networks ServerXL.

ip address 129.146.138.11 255.255.255.0!!!!!sflow sample 512sflow source ethernet 3/1sflow enable!!end

CODE EXAMPLE 7-3 SLB0 Configuration File

ver 07.3.05T12global-protocol-vlan!!server source-ip 20.20.0.50 255.255.255.0 172.0.0.10!!!!server real web1 10.20.0.1 port http port http url "HEAD /"!server real web2 10.20.0.2 port http port http url "HEAD /"!!server virtual WebVip1 192.168.0.100 port http port http dsr bind http web1 http web2 http!

!vlan 1 name DEFAULT-VLAN by port

CODE EXAMPLE 7-2 MLS1 Configuration File (Continued)



Server Load Balancer

The following code box shows a partial listing of the configuration file used for thestandby server load balancer (called SLB1 in the lab). Again we used the FoundryNetworks Server XL.

no spanning-tree!hostname SLB0ip address 192.168.0.111 255.255.255.0ip default-gateway 192.168.0.10web-management allow-no-passwordbanner motd ^CReference Architecture -- Enterprise Engineering^CServer Load Balancer-- SLB0 129.146.138.12/24^C!!end

CODE EXAMPLE 7-4 SLB1 Configuration File

ver 07.3.05T12global-protocol-vlan!!server source-ip 20.20.0.51 255.255.255.0 172.0.0.10!!!!server real s1 20.20.0.1 port http port http url "HEAD /"!server real s2 20.20.0.2 port http port http url "HEAD /"!!server virtual vip1 172.0.0.11 port http port http dsr bind http s1 http s2 http!

!vlan 1 name DEFAULT-VLAN by port

CODE EXAMPLE 7-3 SLB0 Configuration File (Continued)

ver 07.3.05T12



Network SecurityFor the Sun ONE network configuration, firewalls were configured between eachservice module to provide network security. FIGURE 7-20 shows the relationshipbetween the firewalls and the service modules.

!hostname SLB1ip address 172.0.0.112 255.255.255.0ip default-gateway 172.0.0.10web-management allow-no-passwordbanner motd ^CReference Architecture - Enterprise Engineering^CServer Load Balancer - SLB1 - 129.146.138.13/24^C!

CODE EXAMPLE 7-4 SLB1 Configuration File (Continued)

ver 07.3.05T12



FIGURE 7-20 Firewalls between Service Modules

In the lab, one physical firewall device was used to create multiple virtual firewalls.Network traffic was directed to pass through the firewalls between the servicemodules, as shown in FIGURE 7-21.

The core switch is only configured for Layer 2 with separate port-based VLANs. Theconnection between the Netscreen and the core switch uses tagged VLANS. Trustzones are created on the Netscreen device, and they map directly to the taggedVLANs. The Netscreen firewall device performs the Layer 3 routing. Thisconfiguration directs all traffic through the firewall, resulting in firewall protectionbetween each service module.

Firewall

Firewall

Firewall

Web serviceTier

Application serviceTier

Database serviceTier

Edgeswitch

Intranet/Internet

Client Client



FIGURE 7-21 Virtual Firewall Architecture Using Netscreen and Foundry Networks Products

Edgeswitch

Intranet/Internet

Client Client

Coreswitch

CoreswitchVLAN*

Database VLAN

Netscreendevice Netscreen

deviceVLAN*

Application VLAN

Web VLAN

Database VLAN

Application VLAN

Web VLAN

*Web, application, and database traffic multiplexed on one VLAN

Web serviceTier

Application serviceTier

Database serviceTier



Netscreen Firewall

CODE EXAMPLE 7-5 shows a partial example of a configuration file used to configurethe Netscreen device.

CODE EXAMPLE 7-5 Configuration File Used for Netscreen Device

set auth timeout 10

set clock "timezone" 0

set admin format dos

set admin name "netscreen"

set admin password nKVUM2rwMUzPcrkG5sWIHdCtqkAibn

set admin sys-ip 0.0.0.0

set admin auth timeout 0

set admin auth type Local

set zone id 1000 "DMZ1"

set zone id 1001 "web"

set zone id 1002 "appsrvr"

set zone "Untrust" block

set zone "DMZ" vrouter untrust-vr

set zone "MGT" block

set zone "DMZ1" vrouter trust-vr

set zone "web" vrouter trust-vr

set zone "appsrvr" vrouter trust-vr

set ip tftp retry 10

set ip tftp timeout 2

set interface ethernet1 zone DMZ1

set interface ethernet2 zone web

set interface ethernet3 zone appsrvr

set interface ethernet1 ip 192.168.0.253/24

set interface ethernet1 route





unset interface vlan1 bypass-others-ipsec

unset interface vlan1 bypass-non-ip

set interface ethernet1 manage ping

unset interface ethernet1 manage scs

unset interface ethernet1 manage telnet

unset interface ethernet1 manage snmp

unset interface ethernet1 manage global

unset interface ethernet1 manage global-pro

unset interface ethernet1 manage ssl

set interface ethernet1 manage web



unset interface ethernet1 ident-reset

set interface vlan1 manage ping

set interface vlan1 manage scs

set interface vlan1 manage telnet

set interface vlan1 manage snmp

set interface vlan1 manage global

set interface vlan1 manage global-pro

set interface vlan1 manage ssl

set interface vlan1 manage web

set interface v1-trust manage ping

set interface v1-trust manage scs

set interface v1-trust manage telnet

set interface v1-trust manage snmp

set interface v1-trust manage global

set interface v1-trust manage global-pro

set interface v1-trust manage ssl

set interface v1-trust manage web

unset interface v1-trust ident-reset

unset interface v1-untrust manage ping

unset interface v1-untrust manage scs

unset interface v1-untrust manage telnet

unset interface v1-untrust manage snmp

unset interface v1-untrust manage global

unset interface v1-untrust manage global-pro

unset interface v1-untrust manage ssl

unset interface v1-untrust manage web

unset interface v1-untrust ident-reset

set interface v1-dmz manage ping

unset interface v1-dmz manage scs

unset interface v1-dmz manage telnet

unset interface v1-dmz manage snmp

unset interface v1-dmz manage global

unset interface v1-dmz manage global-pro

unset interface v1-dmz manage ssl

unset interface v1-dmz manage web

unset interface v1-dmz ident-reset








CODE EXAMPLE 7-5 Configuration File Used for Netscreen Device (Continued)



unset interface ethernet2 manage web









unset interface ethernet3 manage web


set interface v1-untrust screen tear-drop

set interface v1-untrust screen syn-flood

set interface v1-untrust screen ping-death

set interface v1-untrust screen ip-filter-src

set interface v1-untrust screen land

set flow mac-flooding

set flow check-session

set address DMZ1 "dmznet" 192.168.0.0 255.255.255.0

set address web "webnet" 10.10.0.0 255.255.255.0

set address appsrvr "appnet" 20.20.0.0 255.255.255.0

set snmp name "ns208"

set traffic-shaping ip_precedence 7 6 5 4 3 2 1 0

set ike policy-checking

set ike respond-bad-spi 1

set ike id-mode subnet

set l2tp default auth local

set l2tp default ppp-auth any

set l2tp default radius-port 1645

set policy id 0 from DMZ1 to web "dmznet" "webnet" "ANY" Permit

set policy id 1 from web to DMZ1 "webnet" "dmznet" "ANY" Permit

set policy id 2 from DMZ1 to appsrvr "dmznet" "appnet" "ANY" Permit

set policy id 3 from appsrvr to DMZ1 "appnet" "dmznet" "ANY" Permit

set ha interface ethernet8

set ha track threshold 255

set pki authority default scep mode "auto"

set pki x509 default cert-path partial

_____________________

CODE EXAMPLE 7-5 Configuration File Used for Netscreen Device (Continued)




APPENDIX A

Lyapunov Analysis

This appendix provides an outline of the mathematical proof that shows why theleast connections server load balancing (SLB) algorithm is inherently stable. Thismeans that over a long period of time, the system will ensure that the load is evenlybalanced. This analysis can be used to model and verify the stability of any networkdesign, which may be of tremdous value if you are an advanced network architect.

Building on what was discussed in Chapter 3, we will extend the model of the singlequeue to that of the entire system and then show that the entire system is stable. Theentire system consists of an aggregate ingress load of l, N server processes of varyingservice rates µ1, µ2, . . .µn, hence we get the following equation:

EQN 1: S = λ + µ1 + µ2 +. . . µn

We will use this equation later. It states that the value S is the sum of the aggregateload and the sum of all the service rates. This means in one time slot:

■ average sum of all incoming loads■ average sum of all server processing capacity

Since the incoming packets are modeled as Poisson arrivals, which is in continuoustime, we will map the time domain to an index N, which increases whenever thestate of the system changes. The state is defined as the queue occupancy. If a packetarrives, it will increase the size of one of the queues in the system. If a packet isserviced, then the size of one queue of the system will decrease.

Let Qs(t) = min(Q1(t), Q2(t). . . QN(t)). This Qs is the least occupied queue among allN queues.

Let Qb(t) = set {Q1(t), Q2(t). . . QN(t)} - {Qs(t)}, which is all the queues except for theleast occupied.

Let Qa(t) = Qb(t) + Qs(t), which is all the queues.

337


We know that the next state of all queues in the set Qb(t) can change only due to aWeb service, which is a reduction by one request. There can be no increase in thisqueue size because the SLB will not forward any new requests. Therefore, thesequeues cannot grow in the next time slot, so we get:

Qb(t+1) = Qb(t) - 1 with probability of µib/S

We can also figure out the next possible state of Qs(t), which can change due to aWeb service, resulting in a reduction of queue size by 1 or an increase in queue size,due to the SLB forwarding a request to this queue. Hence we get the next state asfollows:

Qs(t+1) = Qs(t) -1 with probability of µis/S

or Qs(t) +1 with a probability of λis/S

We can assign the Lyapunov Function to the sum of all the occupancies of all Nqueues. We will use t, representing a particular time slot:

L(t) = Q1(t) + Q2(t).... QN(t)

= ΣQib(t) + Qis(t)

L(t+1) = ΣQib(t+1) + Qis(t)

= Σ[µib/S(Qib(t) -1)] + µis/S[Qs(t) -1] + λis/S[Qs(t) +1]

Now if we look at one particular queue, Qi(t), keeping time discrete, the state ofQi(t) only changes due to events of arrivals and/or departures. We can see how thisqueue increases and decreases in size or queue occupancy.

For stability, we need to show:

EL = E[L(t+1) - L(t)) | L(t)] <= -e || Q || + k

This says that the expected value of the single step drift—that is, the LyapunovFunction at time t+1–Lyapunov Function at time t, given the Lyapunov Function attime t—must be a negative constant times the queue size plus some constant k. Thevalue of EL becomes negative when the queue size times -e is larger than k. This istypical in almost all systems in that before the system reaches a steady state there isan initial unstable period, but after some time a steady state is reached. This is wherewe need to look at the system to determine the behavior of the system in steadystate.

EL = E[L(t+1) - L(t)|L(t)]

= E [Σ[µib/S(Qib(t) -1)] + µis/S*[Qs(t) -1] + λis/S*[Qs(t) +1] - µib/S*ΣQib(t) - (λis/S +µis/S)*Qis(t)| L(t)]

= E [ λis/S*[Qs(t) +1] - λis/S*Qis(t) + µis/S*[Qs(t) -1] - µis/S*Qis(t) + Σ[µib/S(Qib(t) -1)] -µib/S*ΣQib(t)]|



= E [ λis/S - µis/S - Σµib/S]

= λis/S - µis/S - Σµib/S

This is always negative as long as:

λis < µis/S - Σµib/S

This means that since:

■ All incoming traffic is being redirected by the SLB algorithm to the least occupiedqueue,

λ = λis

■ All Web server capacity at time slot t is:

µis + Σµib

From this we conclude that as long as the incoming traffic is admissible or

λ < µ

The system is stable!

This proves that the SQF algorithm is guaranteed to drain all queues in such a wayas to make sure the system is stable.

If we had a round-robin SLB algorithm instead, we would not get this mathematicalresult. In particular, there is no way we can enforce the following:

λi < µi, resulting in Qi(t) overflowing, even though the overall average incomingtraffic is less than the overall average server capacity that is, λ < µ.

The SLB blindly forwards incoming traffic to servers, without considering theoccupancy of Qi(t). The round-robin scheme can easily have some idle servers andstill continue to forward traffic to an overloaded server, resulting in instability. In theSQF algorithm, we know that only the shortest queue is forwarded traffic and thatthe other queues can only drain. As long as Qis does not overflow, the entire systemis stable. We know that λ < µ, hence we know that Qis(t) is stable.

Appendix A Lyapunov Analysis 339



Glossary

This glossary defines terms and acronyms used in this book.

AABR Available Bit Rate

ACK Acknowledgement flag, TCP header

ACL Access Control List

ANAR Auto-Negotiation Advertisement Register

ANER Auto-Negotiation Expansion Register

API Application Programming Interface

Application Server A host computer that provides access to a software application. In the contextof this Reference Architecture, it is used to mean a J2EE Application Server,which essentially serves as an enterprise platform for Java applications. SeeJ2EE.

ARI/FCI Address Recognition Indicator/Frame Copier Indicator

ARP Address Resolution Protocol

ASIC Application Specific Integrated Circuit

ATM Asynchronous Transfer Mode

341


BBER Bit Error Rate

BGP Border Gateway Protocol

BMCR Basic Mode Control Register

BMP Bean Managed Persistence

BMSR Basic Mode Status Register

BPDU Bridge Protocol Data Unit

CCBQ Class-Based Queuing

CBR Constant Bit Rate

CBS Committed Burst Size

CGI Common Gateway Interface

CIR Committed Information Rate

CLEC Competitive Local Exchange Carrier

CMP Container Managed Persistence

Congestion Window A congestion window added by slow start to the sender’s TCP: the congestionwindow, called cwnd. When a new connection is established with a host onanother network, the congestion window is initialized to one segment (that is,the segment size announced by the other end).

DDAC Dual-Attached Concentrator

DAPL Direct Access Programming Library

DAS Dual-Attached Station



DLPI Data Link Provided Interface

DMA Direct Memory Access

DMLT Distributed Multilink Trunking

DNS Domain Name Service

DoS Denial of Service

DRAM Dynamic Random Access Memory

DSR Direct Server Return

DTR Dedicated Token Ring

EEBS Excess Burst Size

EJB Enterprise JavaBean

ESRP Extreme Standby Routing Protocol

Edge data centerswitch The integration point to the customer’s existing backbone network. This is the

switch that connects the data center to the customer’s backbone network.

FFailover A characteristic of a highly available component or service that describes the

ability to switch to another equivalent component or service so that the overallavailability is still maintained. See High Availability.

FDDI Fiber Distributed Data Interface

FIB Forwarding Information Base

FIFO First In, First Out

FPGA Field Programmable Gate Array

Glossary 343


GGCR Gigabit Control Register

GESR Gigabit Extended Status Register

GFR Guaranteed Frame Rate

GMII Gigabit Media Independent Interface

GSR Gigabit Status Register

HHigh Availability

(HA) General term used to describe the ability of a component or service to berunning and therefore available.

HOL Head of Line Blocking

HTTP (Hypertext Transfer Protocol) The Internet protocol based on TCP/IP thatfetches hypertext objects from remote hosts.

HTTPS HTTP over SSL

Iias iPlanet Application Server

ILEC Incumbent Local Exchange Carrier

Integratable In the context of an integrated stack, it represents a mixture of third-partysoftware products that support open standards such as Java and JavaTechnologies for SOAP, UDDI, XML, and WSDL. These products can becombined to deliver a customer solution and should work together given theirsupport of these open standards.

Integrated In the context of an integrated stack, it represents Sun’s software products thatimplement the Sun ONE architecture to deliver a fully optimized, tested, andsupported system to maximize value to customers.

IOMMU input/output memory management unit



IP Internet Protocol

IPG Inter-Packet Gap

IPMP Internet Protocol Multipathing

isapi Microsoft’s internet server application programming interface

ISP Internet Service Provider

IXC Inter Exchange Carrier

JJ2EE (Java™ 2 Platform Enterprise Edition) Set of standards that leverages J2SE

technology and simplifies Java development by offering standardized, modularcomponents by providing a complete set of services to those components, andby handling many details of application behavior automatically, withoutcomplex programming. This is the standard on which the Sun ONEApplication Server is based. See http://java.sun.com/j2ee/.

J2SE (Java 2 Platform Standard Edition) Represents the set of technologies thatprovides the run time environment and Software Development Kit for Javadevelopment. See http://java.sun.com/j2se/.

Java An object-oriented programming language developed by Sun Microsystems.The Write Once, Run Anywhere programming language.

Java 2 SDK The software development kit that developers need to build applications forthe Java 2 Platform, Standard Edition, v. 1.2. See also JDK.

JavaBeans A portable, platform-independent reusable component model. Seehttp://java.sun.com/.

Java RMI (Java Remote Method Invocation) (n.) A distributed object model for Javaprogram to Java program, in which the methods of remote objects written inthe Java programming language can be invoked from other Java virtualmachines, possibly on different hosts.

JAXM (Java API for XML Messaging) Enables applications to send and receivedocument oriented XML messages using a pure Java API. JAXM implementsSimple Object Access Protocol (SOAP) 1.1 with Attachments messaging so thatdevelopers can focus on building, sending, receiving, and decomposingmessages for their applications instead of programming low-level XMLcommunications routines. Seehttp://java.sun.com/xml/jaxm/index.html.

Glossary 345


JAXR (Java API for XML Registries) Provides a uniform and standard Java API for accessingdifferent kinds of XML Registries. Seehttp://java.sun.com/xml/jaxr/index.html.

JAXRPC (Java API for XML-based RPC) Enables Java technology developers to build Webapplications and Web services incorporating XML-based RPC functionality accordingto the SOAP 1.1 specification. Seehttp://java.sun.com/xml/jaxrpc/index.html.

JDBC Java DataBase Connectivity

JDK (Java Development Kit) The software that includes the APIs and tools thatdevelopers need to build applications for those versions of the Java platformthat preceded the Java 2 Platform. See also Java 2 SDK.

JNDI Java Naming and Directory Interface

JRE (Java runtime environment) A subset of the Java Development Kit (JDK) forusers and developers who want to redistribute the runtime environment. TheJava runtime environment consists of the Java virtual machine (JVM), the Javacore classes, and supporting files.

JSP (JavaServer Pages™) Technology that allows Web developers and designers torapidly develop and easily maintain information-rich, dynamic Web pages thatleverage existing business systems. Seehttp//java.sun.com/products/jsp/.

LLAA Locally Administered Address

LACP Link aggregation Control Protocol

LAN Local Area Network

LDAP (Lightweight Directory Access Protocol) The Internet standard for directorylookups.

LLC Logical Link Control

LPNAR Link Partner Auto-negotiation Advertisement Register

LSP Link State Packet



MMAC Media Access Control

MAU Media Access Unit

MDT Multi-Data Transmission

MII Media Independent Interface

M/M/1 queue

MSS Maximum Segment Size

MTBF Mean Time Between Failures

MTTR Mean Time Till Recovery

MTU Maximum Transmission Unit

Multi-tierarchitecture For a given custom application, multiples of any of these tiers may be

used—thus n-tier. There is no implied relationship between tiers and machines,but collapsing all the tiers onto a single machine would not be network centric.

1. Client Tier

2. Web Tier

3. Application Tier

4. Database Tier

NNAP Network Access Point

NAT Network Address Translation

NFS Network File System

NIC Network Interface Card (or Controller)

NSAPI Netscape Application Programming Interface

NSP Network Service Provider

Glossary 347


OOperating System A collection of programs that monitor the use of the system and supervise the

other programs executed by it.

OSPF Open Shortest Path First

PPHY Physical layer

ping (1) (n.) (Packet Internet Groper) A small program (ICMP ECHO) that acomputer sends to a host and times on its return path.

(2) (v.) To test the reach of destinations by sending them an ICMP ECHO: “Pinghost X to see if it is up!”

PIR Peak Information Rate

PPA Primary Point of Attachment

Presentation Service Term used to describe a service that “presents” the data that is returned to theend user. In this context, the presentation service was delivered by a tier ofWeb servers that served up JSP/servlet traffic for viewing by the client Webbrowsers.

Protocol Manager Enables communication between Sun ONE Application Server and a client.Manages and provides services for all active, loaded listeners. Supports HTTP,HTTPS (HTTP over SSL), and IIOP.

Q

QoS (Quality of Service) Measures the ability of network and computingsystems to provide different levels of services to selected applications andassociated network flows.



R

RBOC (n.) Regional Bell Operating Company

RDMA Remote Direct Memory Access

RED Random Early Detection

Remote system (n.) A system other than the one on which you are working.

RLDRAM reduced latency DRAM

RIP (n.) Routing Information Protocol, An IGP with Berkeley UNIX®

RJ-45 connector (n.) A modular cable connector standard used with consumertelecommunications equipment, such as systems equipped for ISDNconnectivity.

RMI (n.) Remote Method Invocation (See Java RMI.)

root (n.) In a hierarchy of items, the one item from which all other items aredescended. The root item has nothing above it in the hierarchy. See also class,hierarchy, package, root directory, root file system, and root user name.

root directory (n.) The base directory from which all other directories stem, directly orindirectly.

root disk (n.) On Sun™ server systems, the disk drive where the operating systemresides. The root disk is located in the SCSI tray behind the front panel.

root file system (n.) A file system residing on the root device (a device predefined by thesystem at initialization) that anchors the overall file system.

root user name (n.) The SunOS™ user name that grants special privileges to the person whologs in with that ID. The user who can supply the correct password for the rootuser name is given superuser privileges for the particular machine.

root window (1) (n.) In the X protocol, a window with no parent window. Each screen has aroot window that covers it.

(2) (adj.) Characteristic of an input method that uses a pre-editing window thatis a child of the root window.

Router A system that assigns a path for network (or Internet) traffic to follow based onIP Address.

RR Round-robin method of load balancing.

RTT Round Trip Time

Glossary 349


SSAC single-attached concentrator

SACK selective acknowledgement

SAS single-attached station

SBus (n.) A 32-bit self-identifying bus used mainly on SPARCTM workstations, theSBus provides information to the system so that it can identify the devicedriver that needs to be used. An SBus device might need to use hardwareconfiguration files to augment the information provided by the SBus card. Seealso PCI bus.

SBus bridge (n.) A device providing additional SBus slots by connecting two SBuses.Generally, a bus bridge is functionally transparent to devices on the SBus.However, there are cases (for example, bus sizing) in which bus bridges canchange the exact way a series of bus cycles is performed. Also called an SBuscoupler.

SBus controller (n.) The hardware responsible for performing arbitration, addressingtranslation and decoding, driving slave selects and address strobe, andgenerating timeouts.

SBus device (n.) A logical device attached to the SBus. This device might be on themotherboard or on an SBus expansion card.

SBus expansion card (n.) A physical printed circuit assembly that conforms to the single- or double-width mechanical specifications and that contains one or more SBus devices.

SBus expansion slot (n.) An SBus slot into which you can install an SBus expansion card.

SBus ID (n.) A special series of bytes at address 0 of each SBus slave that identifies theSBus device.

SDP Sockets Direct Protocol

Security—SSL (Secure Sockets Layer) A protocol developed for transmitting privatedocuments via the Internet. SSL works by using a public key to encrypt datathat’s transferred over the SSL connection.

Services on Demand Ability to provide information, data, and applications to anyone, anytime,anywhere on any device. Includes Web services technology, but also includestechnology you are using today and could use in the future.

SFM Switch Fabric Module

SLA Service Level Agreement

SLB server load balancing



Sliding window A TCP flow control protocol that allows the sender to transmit multiplepackets before it stops and waits for an acknowledgment.

SMLT Split Multilink Trunking

SNA Systems Network Architecture

SOHO Small Office/Home Office

Solaris OperatingSystem The Sun Microsystems open standards-based UNIX operating system. The

Solaris Operating System, the foundation for Sun™ ONE software architecture,delivers the security, manageability, and performance.

SPOF single point of failure

SQF Smallest Queue First

SRAM static random access memory

STP Spanning Tree Protocol

Stream (n.) A kernel aggregate created by connecting STREAMS components, resultingfrom an application of the STREAMS mechanism. The primary components arethe Stream head, the driver, and zero or more pushable modules between theStream head and driver.

Stream end (n.) A Stream component that is farthest from the user process and contains adriver.

Stream head (n.) A Stream component closest to the user process. It provides the interfacebetween the Stream and the user process.

Streaming server Handles data streams from the Sun ONE Application Server to the Web serverand to the Web browser. A streaming service improves performance byallowing users to begin viewing results of requests sooner rather than waitinguntil the complete operation has been processed.

STREAMS (n.) A kernel mechanism that supports development of network services anddata communications drivers. STREAMS defines interface standards forcharacter input/output within the kernel and between the kernel and userlevel. The STREAMS mechanism includes integral functions, utility routines,kernel facilities, and a set of structures.

STREAMS-basedpipe (n.) A mechanism for bidirectional data transfer implemented using STREAMS

and sharing properties of STREAMS-based devices.

Sun ONE (Sun Open Net Environment) The Sun Microsystems software strategy thatcomprises the vision, architecture, platform, and expertise for developing anddeploying Services on Demand today. See http://www.sun.com/sunone.

Glossary 351


Switch Any device or mechanism that moves data from one network to anotherwithout any routing tables.

SYN synchronization

TTCAM Telecommunications Access Method

TDM time division multiplexing

TTCP Test Transmission Control Protocol

TTRT Target Token rotation

UUDDI Universal Description, Discovery, and Integration. The UDDI Project is an

industry initiative that is working to enable businesses to quickly, easily, anddynamically find and transact with one another via Web services. UDDIenables a business to(i) describe its business and its services,(ii) discover other businesses that offer desired services, and(iii) integrate with these other businesses.See http://www.uddi.org. An alternative to UDDI is JAXR, created by SunMicrosystems. See JAXR.

WWeb connectors Web connectors and listeners manage the passing of requests from the Web

server to the Sun ONE Application Server. Listeners distribute and handlerequests from the Web connectors. New listeners can be added with the HTTPhandler.

Web server The easy-to-use, extensible, easy-to-administer, secure, platform-independentsolution to speed up and simplify the deployment and management of yourInternet and intranet Web sites. It provides immediate productivity for full-featured, Java™ technology-based server applications.



Web service A fine-grained, component-style service

– Advertised and described in a service registry

– Based on standardized protocols – JAXR, UDDI, JAXRPC, JAXM, SOAP,WSDL, and so on

– Accessible programmatically by applications or other Web services

WSDL (Web Services Description Language) An XML format for describing networkservices as a set of endpoints operating on messages containing eitherdocument-oriented or procedure-oriented information. Seehttp://www.w3.org/TR/.

Glossary 353



Index

Aaccess control lists 296active switches 311application redirection function 299architecture 3architecture, network security 330asynchronous transfer mode (ATM) 98Auto-negotiation 153Auto-negotiation Advertisement Register 155Auto-negotiation Expansion register 157BBasic Mode Control Register 153Basic Mode Status Register 154BlackDiamond switches 323border gateway protocol v4 (BGP4) 311bridges 66business logic 312Cchecksumming 146cipher suite 110class C network 308client requests 312configuring the

Extreme Networks switches 323Foundry Networks switches 324

congestion control 107congestion window 51, 54consistent mode 142Constant Bit Rate (CBR) 98content switching 300Control Plane 67

CPU load balancing 148Ddata flow through service modules 312Data Plane 67descriptor ring 141design 3disable source routing 127Dual-attached concentrator 134Dual-attached station 132EEnterprise 98Enterprise Java Beans (EJBs) 314Extreme Networks equipment 317Extreme Networks switches, configuring 323FFDDI concentrators 134FDDI interfaces 136FDDI station 132Fiber Distributed Data Interface network 131firewall architecture 332firewalls between service modules (figure) 331flat architecture 261flow control keywords 213Forced mode 153Foundry Networks equipment 317Foundry Networks switches, configuring 324Full NAT 91, 302functional tiers 17GGigabit Media Independent Interface 157global synchronization 107

355


HHalf NAT 91, 302Handshake Layer 110Iinterface specifications 314interrupt blanking 148IP address space (private) 296IP forwarding module 107IP header 314JJ2EE application 314Java data access objects 314Java DataBase Connectivity (JDBC) 314Java Server Pages (JSP) 312Jumbo frames 152jumbo frames 217JumpStart, Solaris 296LLayer 3 routing 331Link-partner Auto-negotiation Advertisement 155load balancing, built-in 317local area networks, virtual (VLANs) 296logical network architecture 296logical network architecture overview (figure) 297logical network design 296MMAC overflow 146management network 296mapping process 21media access unit 128multi-data transmission 143multi-level architecture 261Nnetmask values 320Netscreen Firewall configuration file 333network

configuration (Extreme Networks equipment)318

configuration (Foundry Networks equipment)319

physical 315security architecture 330

Network Address Translation 302Network Address Translation (NAT) 91network architecture with virtual routers 310

network design 3traditional 308using chassis-based switches 309

Network Service Provider (NSP) 98Oopen shortest path first (OSPF) 311PParallel Detection Fault 157partial checksumming 147pause frames 161physical network 315policing 107Precedence Priority Model 102private IP address space 296proxy switching 300QQoS Profile 103Quality of Service (QoS) 92queuing 107Rrandom early detection register 216Random Early Discard 151receive interrupt blanking values 215receive window 54received packet delivery method 150Reservation Model 102ring of trees 135ring speed 129round-robin 74router 66routers 320routing information protocol (RIP) 311Ssecure socket layer 314sequence of events (data flow) 314server load balancing 298Service Level Agreements (SLAs) 98Services on Demand architecture 18shaping 107Single-attached concentrator 134Single-attached station 132sliding windows 54Startup Phase 51stateful 25Stateful Layer 7 switching 300



Stateful Session Based 298stateless and idempotent 25Stateless Session Based 298static routes 296, 311Steady State Phase 51streaming mode 142Streams Service Queue model 151switch 66switch configuration 322switch configuration file (Extreme switch) 323switch configuration file (Foundry switch) 326symmetric flow control 162Ttagged VLAN 309Tail Drop 107token ring interfaces 125token ring network 123transmission latency 142Transmit Pause capability 162Trunking Policies 232trust zones 331UURL switching 300VVariable Bit Rate-Real Time (VBR-rt) 98virtual firewalls 331virtual local area networks (VLANs) 296virtual routers 310VLAN, tagged 309WWeb-based applications 17weighted round-robin 74

Index 357


BP Networking Concepts and Technology

Documents

et sans

recherche et

et solaris sont

copyright et licenci

sous licence et sont

copyright et distribu

sparc trademarks

des parties