Page 1: Software Testing Doesn’t Scale

James Hamilton
JamesRH@microsoft.com
Microsoft SQL Server

Page 2: Overview

The Problem:
  S/W size & complexity inevitable
  Short cycles reduce S/W reliability
  S/W testing is the real issue
  Testing doesn’t scale
    trading complexity for quality
Cluster-based solution
  The Inktomi lesson
  Shared-nothing cluster architecture
  Redundant data & metadata
  Fault isolation domains

Page 3: S/W Size & Complexity Inevitable

Successful S/W products grow large
# features used by a given user small
  But union of per-user feature sets is huge
Reality of commodity, high volume S/W
  Large feature sets
  Same trend as consumer electronics
Example mid-tier & server-side S/W stack:
  SAP: ~47 MLOC
  DB: ~2 MLOC
  NT: ~50 MLOC
Testing all feature interactions impossible
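
To put rough numbers on why exhaustive interaction testing is out of reach, the sketch below counts feature pairs and feature combinations for a product with n features. The feature counts are illustrative assumptions, not figures from the talk.

```python
from math import comb


def pairwise_interactions(n_features: int) -> int:
    """Number of distinct feature pairs: n choose 2."""
    return comb(n_features, 2)


def feature_combinations(n_features: int) -> int:
    """Number of non-empty sets of features that could be used together: 2^n - 1."""
    return 2 ** n_features - 1


# Hypothetical feature counts, chosen only to show the growth rate.
for n in (10, 100, 1000):
    digits = len(str(feature_combinations(n)))
    print(f"{n} features: {pairwise_interactions(n)} pairs, "
          f"a {digits}-digit number of combinations")
```

Pairs alone grow quadratically; once interactions among three or more features matter, the count grows exponentially with the feature set.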

Page 4: Short Cycles Reduce S/W Reliability

Reliable TP systems typically evolve slowly & conservatively
Modern ERP systems can go through 6+ minor revisions/year
Many e-commerce sites change even faster
  Fast revisions a competitive advantage
Current testing and release methodology:
  As much testing as dev time
  Significant additional beta-cycle time
Unacceptable choice:
  reliable but slow evolving or fast changing yet unstable and brittle

Page 5: Testing the Real Issue

15 yrs ago test teams tiny fraction of dev group
  Now test teams of similar size as dev & growing rapidly
Current test methodology improving incrementally:
  Random grammar-driven test case generation
  Fault injection
  Code path coverage tools
Testing remains effective at feature testing
  Ineffective at finding inter-feature interactions
  Only a tiny fraction of Heisenbugs found in testing (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
Beta testing because test known to be inadequate
Test team growth scales exponentially with system complexity
Test and beta cycles already intolerably long
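
As an illustration of random grammar-driven test case generation, here is a minimal sketch that expands a toy grammar into random query strings. The grammar, table names, and column names are hypothetical and do not come from any real product's test suite.

```python
import random

# Hypothetical toy grammar for a tiny SQL subset. Keys are non-terminals;
# each maps to a list of alternative productions.
GRAMMAR = {
    "<query>": [["SELECT ", "<cols>", " FROM ", "<table>", "<where>"]],
    "<cols>": [["*"], ["<col>"], ["<col>", ", ", "<col>"]],
    "<col>": [["id"], ["name"], ["price"]],
    "<table>": [["orders"], ["items"]],
    "<where>": [[""], [" WHERE ", "<col>", " ", "<op>", " ", "<value>"]],
    "<op>": [["="], ["<"], [">"]],
    "<value>": [["0"], ["42"], ["7"]],
}


def generate(symbol: str, rng: random.Random) -> str:
    """Expand a symbol by picking one production at random for each non-terminal."""
    if symbol not in GRAMMAR:
        return symbol  # terminal text is emitted unchanged
    production = rng.choice(GRAMMAR[symbol])
    return "".join(generate(part, rng) for part in production)


def generate_cases(count: int, seed: int = 0) -> list:
    """Produce `count` random queries; a fixed seed makes a failing run reproducible."""
    rng = random.Random(seed)
    return [generate("<query>", rng) for _ in range(count)]


for case in generate_cases(3, seed=1):
    print(case)
```

A generator like this covers single-feature syntax cheaply, which fits the slide's point: it helps with feature testing but does not by itself reach the timing-dependent inter-feature failures.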

Page 6: The Inktomi Lesson

Inktomi web search engine (SIGMOD’98)
Quickly evolving software:
  Memory leaks, race conditions, etc. considered normal
  Don’t attempt to test & beta until quality high
System availability of paramount importance
  Individual node availability unimportant
Shared nothing cluster
Exploit ability to fail individual nodes:
  Automatic reboots avoid memory leaks
  Automatic restart of failed nodes
  Fail fast: fail & restart when redundant checks fail
  Replace failed hardware weekly (mostly disks)
Dark machine room
  No panic midnight calls to admins
Mask failures rather than futile attempt to avoid
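
The fail-fast and automatic-restart idea can be sketched as below. This is a toy simulation under stated assumptions, not Inktomi's implementation: the `Node` class, the `checksum_ok` flag standing in for a redundant check, and the `serve` dispatcher are all hypothetical.

```python
class NodeFault(Exception):
    """Raised when a redundant check fails; the node stops instead of repairing."""


class Node:
    def __init__(self, name: str):
        self.name = name
        self.restarts = 0

    def handle(self, request: str, checksum_ok: bool = True) -> str:
        # Fail fast: a failed redundant check stops the node immediately.
        if not checksum_ok:
            raise NodeFault(f"{self.name}: redundant check failed")
        return f"{self.name} served {request}"

    def restart(self) -> None:
        # Stand-in for an automatic reboot; no operator is involved.
        self.restarts += 1


def serve(nodes, request, faulty=()):
    """Try nodes in turn; restart any that fail and let another node mask the failure."""
    for node in nodes:
        try:
            return node.handle(request, checksum_ok=node.name not in faulty)
        except NodeFault:
            node.restart()
    raise RuntimeError("all nodes failed")


cluster = [Node("a"), Node("b")]
print(serve(cluster, "q1", faulty={"a"}))
```

The caller never sees node "a" fail: the fault is masked by node "b" while "a" restarts.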

Page 7: Apply to High Value TP Data?

Inktomi model:
  Scales to 100’s of nodes
  S/W evolves quickly
  Low testing costs and no-beta requirement
Exploits ability to lose individual node without impacting system availability
  Ability to temporarily lose some data w/o significantly impacting query quality
Can’t lose data availability in most TP systems
  Redundant data allows node loss w/o loss of data availability
Inktomi model with redundant data & metadata a solution to exploding test problem
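
A back-of-the-envelope sketch of why redundant copies let unreliable nodes add up to available data. It assumes node failures are independent, which real clusters only approximate, and the availability figures are hypothetical rather than taken from the talk.

```python
def data_availability(node_availability: float, copies: int) -> float:
    """Probability that at least one of `copies` independent replicas is reachable."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - node_availability) ** copies


# Hypothetical node availability of 99%, with 1 to 3 copies of each data item.
for copies in (1, 2, 3):
    print(copies, data_availability(0.99, copies))
```

Under the independence assumption, each extra copy multiplies the probability of losing access to a data item by the per-node failure probability.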

Page 8: Connection Model/Architecture

[Diagram labels: Client, Server Node, Server Cloud]

All data & metadata multiply redundant
Shared nothing
Single system image
Symmetric server nodes
Any client connects to any server
All nodes SAN-connected

Page 9: Compilation & Execution Model

[Diagram labels: Client, Server Cloud]

Server thread:
  Lex analyze
  Parse
  Normalize
  Optimize
  Code generate
  Query execute
Query execution on many subthreads synchronized by root thread
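
A minimal sketch of the root-thread model: one thread compiles the query into per-partition tasks, fans them out to subthreads, and waits for all of them before combining results. The partition data and the single query this toy compiler accepts are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioned table; each partition would live on a different node.
PARTITIONS = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]


def compile_plan(query: str):
    """Stand-in for lex analyze, parse, normalize, optimize, and code generate."""
    if query.strip().lower() != "select sum(x) from t":
        raise ValueError("this sketch understands exactly one query")
    # One partial-sum task per partition, plus a final combine step.
    tasks = [lambda part=part: sum(part) for part in PARTITIONS]
    return tasks, sum


def execute(query: str) -> int:
    tasks, combine = compile_plan(query)
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        # The root thread blocks here until every subthread has finished.
        partials = [future.result() for future in futures]
    return combine(partials)


print(execute("SELECT SUM(x) FROM t"))
```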

Page 10: Node Loss/Rejoin

[Diagram labels: Client, Server Cloud]

Execution in progress
Rejoin:
  Node local recovery
  Rejoin cluster
  Recover global data at rejoining node
  Rejoin cluster
Lose node:
  Recompile
  Re-execute
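
The lose-node path (recompile, then re-execute) might look like the loop below. It is a sketch under assumptions: `plan_for` and `run_plan` are hypothetical stand-ins for the compiler and executor, and a node is "lost" simply by appearing in the `failing` set.

```python
class NodeLost(Exception):
    """Raised by the executor when a node in the plan stops responding."""


def plan_for(live_nodes):
    """Stand-in for recompilation: build a plan that uses only surviving nodes."""
    if not live_nodes:
        raise RuntimeError("no nodes left to run the query")
    return sorted(live_nodes)


def run_plan(plan, failing):
    """Stand-in for execution: fails as soon as it touches a lost node."""
    for node in plan:
        if node in failing:
            raise NodeLost(node)
    return f"ok on {len(plan)} nodes"


def execute_with_retry(nodes, failing=frozenset()):
    live = set(nodes)
    while True:
        plan = plan_for(live)  # recompile against the current cluster membership
        try:
            return run_plan(plan, failing)  # re-execute from the start
        except NodeLost as lost:
            live.discard(lost.args[0])  # drop the lost node and try again


print(execute_with_retry({"n1", "n2", "n3"}, failing={"n2"}))
```

The loop always terminates because every failure removes one node from the live set.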

Page 11: Redundant Data Update Model

[Diagram labels: Client, Server Cloud]

Updates are standard parallel plans
Optimizer knows all redundant data paths
Generated plan updates all
No significant new technology
  Like materialized view & index updates today
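
One way to picture "generated plan updates all": the optimizer consults a catalog of every physical copy of a table and expands a single logical update into one write per copy, much as an index is maintained alongside its base table. The catalog contents and path names below are hypothetical.

```python
# Hypothetical catalog: every physical copy that stores a logical table's rows.
CATALOG = {
    "orders": ["node1/orders", "node2/orders", "node3/orders_by_customer_idx"],
}


def plan_update(table: str, key, value):
    """Expand one logical update into a write step for every redundant copy."""
    return [(path, key, value) for path in CATALOG[table]]


def apply_plan(plan, storage: dict) -> dict:
    """Run the write steps; a real system would run them in parallel in one transaction."""
    for path, key, value in plan:
        storage.setdefault(path, {})[key] = value
    return storage


storage = apply_plan(plan_update("orders", 7, "shipped"), {})
print(sorted(storage))
```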

Page 12: Fault Isolation Domains

Trade single-node perf for redundant data checks:
  Fairly common…but complex error recovery is even more likely to be wrong than original forward processing code
  Many of the best redundant checks are compiled out of “retail versions” when shipped (when needed most)
Fail fast rather than attempting to repair:
  Bring down node for mem-based data structure faults
  Never patch inconsistent data…other copies keep system available
If anything goes wrong “fire” the node and continue:
  Attempt node restart
  Auto-reinstall O/S, DB and recreate DB partition
  Mark node “dead” for later replacement
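
The two ideas on this slide, keeping redundant checks in shipped code and escalating when a node is "fired", can be sketched as follows. The page layout, the `restart` and `reinstall` callbacks, and the state names are assumptions made for illustration.

```python
class NodeFault(Exception):
    """A redundant check failed; the node should be taken down, not repaired."""


def check_page(page: dict) -> None:
    # Explicit raise rather than `assert`: Python drops assert statements
    # when run with -O, much like checks compiled out of retail builds.
    if page["row_count"] != len(page["rows"]):
        raise NodeFault("page header disagrees with page contents")


def fire_node(node, restart, reinstall) -> str:
    """Escalate in the slide's order: restart, then reinstall, then mark dead."""
    if restart(node):
        return "restarted"
    if reinstall(node):  # stand-in for reinstalling O/S and DB, recreating the partition
        return "reinstalled"
    return "dead"  # left for later hardware replacement


try:
    check_page({"row_count": 3, "rows": ["r1", "r2"]})
except NodeFault:
    print(fire_node("n7", restart=lambda n: False, reinstall=lambda n: True))
```

Note that nothing here tries to repair the inconsistent page: the node is removed from service and the other copies keep the system available.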

Page 13: Summary

100 MLOC of server-side code and growing:
  Can’t fight it & can’t test it …
  quality will continue to decline if we don’t do something different
Can’t afford 2 to 3 year dev cycle
60’s large system mentality still prevails:
  Optimizing precious machine resources is false economy
Continuing focus on single-system perf dead wrong:
  Scalability & system perf rather than individual node performance
Why are we still incrementally attacking an exponential problem?
Any reasonable alternatives to clusters?

Page 14: Software Testing Doesn’t Scale

James Hamilton
JamesRH@microsoft.com
Microsoft SQL Server