JUG Poznan - 2017.01.31

Post on 13-Apr-2017

Avoiding software fails

Few metrics to improve application reliability

slawomir.michalik@omnilogy.pl

Poznań, 2017/01/31

2 COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE #Dynatrace

What to do with the fastest car …

… if it fails to reach the finish line

In 2005, only 2% of performance incidents had been predicted

Source: Gartner

What % of problems were predicted in 2015?

A. 75%
B. 46%
C. 11%
D. 3%
E. None of the above

Why do software projects fail so often?

http://spectrum.ieee.org/computing/software/why-software-fails

Unrealistic or unarticulated project goals

Inaccurate estimates of needed resources

Badly defined system requirements

Poor reporting of the project's status

Unmanaged risks

Poor communication among customers, developers, and users

Commercial pressures

Stakeholder politics

Poor project management

Sloppy development practices

Inability to handle the project's complexity

Use of immature technology

Performance issues increase costs

63% of IT organizations spend 20% or more of their time working on performance issues

Inability to innovate

40% of developers’ time is wasted in triage, stealing focus from activities that drive innovation

The good thing is

80:20

Let’s start on the frontend: the 80/20 rule from Steve

But then we’d focus on the backend

5 Use cases

&

metrics that really pay off…

#1

Pushing without a Plan

Web Site: this shouldn’t happen

Some Ad Company during the American Super Bowl

Total size ~ 20MB

434 resources on that page

Web Site: this could be easily eliminated

Obama Care

16 individual jQuery-related files that should be merged

Most JavaScript files contain developer documentation, which accounts for up to 80% of the file size

Web Site: this shouldn’t happen

Fifa.com during the World Cup

Favicon: the largest element

Some heavy CSS & JS: +150 KB

• Developers not using the browser’s built-in diagnostics tools
• Testers not doing sanity checks with the same tools

• Some tools for you
  • Built-in Inspectors via Ctrl-Shift-I in Chrome and Firefox
  • YSlow, PageSpeed
  • Dynatrace Ajax Edition

• Level-Up: Automate Testing & Diagnostics Check

Lessons Learnt – NO Excuses for …

# Resources

# of Domains

Usage of CDNs

Page Load & Size
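The frontend metrics above (# resources, # of domains, total size) are simple enough to check automatically. The following is a minimal sketch of such a check, not a real tool's API: the class `PageMetrics`, the `Resource` type, the sample URLs and the budget values are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical page-weight check; names and sample data are illustrative only. */
final class PageMetrics {

    /** One downloaded resource: its absolute URL and its size in bytes. */
    static final class Resource {
        final String url;
        final long bytes;

        Resource(String url, long bytes) {
            this.url = url;
            this.bytes = bytes;
        }
    }

    /** Number of resources requested by the page. */
    static int resourceCount(List<Resource> resources) {
        return resources.size();
    }

    /** Number of distinct hosts the page pulls resources from. */
    static int domainCount(List<Resource> resources) {
        Set<String> hosts = new HashSet<>();
        for (Resource r : resources) {
            hosts.add(URI.create(r.url).getHost());
        }
        return hosts.size();
    }

    /** Total transfer size in bytes. */
    static long totalBytes(List<Resource> resources) {
        long sum = 0;
        for (Resource r : resources) {
            sum += r.bytes;
        }
        return sum;
    }

    /** True when the page stays within the (assumed) resource and size budgets. */
    static boolean withinBudget(List<Resource> resources, int maxResources, long maxBytes) {
        return resourceCount(resources) <= maxResources && totalBytes(resources) <= maxBytes;
    }

    public static void main(String[] args) {
        List<Resource> page = Arrays.asList(
            new Resource("https://example.com/index.html", 40_000),
            new Resource("https://example.com/app.js", 150_000),
            new Resource("https://cdn.example.net/jquery.js", 90_000));
        System.out.println("resources=" + resourceCount(page)
            + " domains=" + domainCount(page)
            + " bytes=" + totalBytes(page)
            + " withinBudget=" + withinBudget(page, 100, 2_000_000));
    }
}
```

In practice the resource list would come from a browser's network export or a test tool; a check like `withinBudget` can then fail a build before a 20MB page reaches production.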

#2

Not every Architect makes good decisions

• Symptoms
  • HTML takes 60-120s to render
  • High GC Time

• Developer Assumptions
  • Bad GC Tuning
  • Probably bad DB performance as rendering was simple

• Resulted in: months of finger-pointing between Dev & DBA

Project: Online Room Reservation System

Developer-built monitoring

void roomreservationReport(int officeId) {
    long startTime = System.currentTimeMillis();
    Object data = loadDataForOffice(officeId);
    // Wall-clock time of the whole data load, not of a single SQL statement
    long dataLoadTime = System.currentTimeMillis() - startTime;

    generateReport(data, officeId);
}

Result: Avg. Data Load Time: 41s!

DB Tool says: Avg. SQL Query: <1ms!

#1: Loading too much data

24889! Calls to the DB API

High CPU & High Memory Usage to keep all data in Memory

#2: On individual connections

12444! individual connections

Individual SQL really fast <1ms

Classical N+1 Query Problem

#3: Putting all data in temp Hashtable

Lots of time spent in Hashtable.get

Called from their Entity Objects

• …You know what code is doing
• Challenge the developers
• Don’t use Hashtables as a workaround, use O/R mappers

• Explore Tools that “might seem” out of your league!
  • Built-In Database Analysis Tools
  • “Logging” options of Frameworks such as Hibernate, …
  • JMX, Perf Counters, … of your Application Servers
  • APM (Performance Tracing) Tools: Dynatrace Personal Ed.,…

Lessons Learned – Don’t Assume …

# SQL Executions

# of Same SQLs

Conn. Acquisition Time
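The two SQL metrics above can be collected with very little code. Below is a minimal sketch, assuming a hypothetical `SqlStats` counter that a data-access layer calls for each executed statement; the table and column names are invented. It shows how a single statement repeated many times within one request points at the N+1 pattern, even when each individual query takes under 1ms.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-request SQL counter; not part of any real driver or APM API. */
final class SqlStats {
    private final Map<String, Integer> executions = new LinkedHashMap<>();
    private int total;

    /** Call this wherever a statement is executed, e.g. from a data-access wrapper. */
    void record(String sql) {
        total++;
        executions.merge(sql, 1, Integer::sum);
    }

    /** Metric "# SQL Executions". */
    int totalExecutions() {
        return total;
    }

    /** Metric "# of Same SQLs": highest execution count of one SQL text. */
    int maxSameSql() {
        int max = 0;
        for (int n : executions.values()) {
            max = Math.max(max, n);
        }
        return max;
    }

    /** Heuristic: one statement repeated more than threshold times suggests N+1. */
    boolean looksLikeNPlusOne(int threshold) {
        return maxSameSql() > threshold;
    }

    public static void main(String[] args) {
        SqlStats stats = new SqlStats();
        // One query for the parent rows, then one query per row: the N+1 shape.
        stats.record("SELECT id FROM room WHERE office_id = ?");
        for (int room = 0; room < 50; room++) {
            stats.record("SELECT * FROM reservation WHERE room_id = ?");
        }
        System.out.println("total=" + stats.totalExecutions()
            + " maxSame=" + stats.maxSameSql()
            + " nPlusOne=" + stats.looksLikeNPlusOne(10));
    }
}
```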

Root Cause: Deployment Considerations

Log Service provides a Synchronized File across all JVMs

1M Log exceptions over 30 min

Production Deployment leads to Log SYNC Issues

Log message

Time in Sync

Two calls coming from custom-coded methods

Time Spent in Sync & Logging

# of Log Messages

# of Exceptions
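Counting log messages and logged exceptions needs no special tooling. As a minimal sketch, assuming the application logs through `java.util.logging`, a custom `Handler` can keep both counters; the class name `LogCounter` and the logger name are illustrative, and other logging frameworks would need their own equivalent.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical handler that counts log messages and logged exceptions. */
final class LogCounter extends Handler {
    private final AtomicLong messages = new AtomicLong();
    private final AtomicLong exceptions = new AtomicLong();

    @Override
    public void publish(LogRecord record) {
        messages.incrementAndGet();
        if (record.getThrown() != null) {
            exceptions.incrementAndGet();
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    /** Metric "# of Log Messages". */
    long messageCount() {
        return messages.get();
    }

    /** Metric "# of Exceptions" (only those passed to the logger). */
    long exceptionCount() {
        return exceptions.get();
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        log.setUseParentHandlers(false); // keep the demo output quiet
        LogCounter counter = new LogCounter();
        log.addHandler(counter);

        log.info("request started");
        log.log(Level.WARNING, "lookup failed", new IllegalStateException("demo"));

        System.out.println("messages=" + counter.messageCount()
            + " exceptions=" + counter.exceptionCount());
    }
}
```

Tracking these two counters per test run makes a jump such as 1M logged exceptions in 30 minutes visible long before a shared log file becomes a synchronization bottleneck.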

#3

Deployment Gone Bad

Test Environment

Production Environment

8x slower

3x more SQL

Test Environment Production Environment

That’s Normal: Having I/O for Web Request as main contributor

Hibernate, Classloading, XML – The Key Hotspots

I/O for Web Requests doesn’t even show up!

These calls all originate from thousands of calls to find item by code

Top Contributor Class.getInterfaces

Called from Hibernate’s FieldInterceptionHelper

Top Methods related to XML Processing

Classloading is triggered through CustomMonkey and the Xalan Parser

• Plan enough time for proper testing

• Anticipate changed user behavior during peak load

• Only test what really ends up in Production

Lessons Learned

Time Spent in API

# Calls to API

#4

Incorrect Sizing of Pools and Queues

Online Banking: Slow Balance Check

101s! To Check Balance!

600! SQL Executions

87% spent in IIS

#1 Time really spent in IIS?

Tip: Elapsed Time tells us WHEN a Method was executed!

Tip: Thread# gives us insight on Thread Queues / Switches

Finding: Thread 32 in IIS waited 87s to pass control to Thread 30 in ASP.NET

#2 What about these SQL Executions?

Finding: EVERY SQL statement is executed on ITS OWN Connection!

Tip: Look at “GetConnection”

#2 SQL Executions! continued …

#1: Same SQL is executed 67! times

#2: NO PREPARATION because everything executed on new Connection

Lessons Learned!

ASP.NET Worker Thread Pool Sizing!

DB Connection Pools

More Efficient SQL

Idle vs. Busy Threads

# SQLs / Request

# GetConnection

%CPU Starvation
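The pool-related metrics above (idle vs. busy, # GetConnection) can be illustrated with a toy pool. This is a minimal sketch in Java rather than the ASP.NET stack from the case study; `SimplePool` and `Conn` are hypothetical stand-ins, not a real connection pool API. It shows that a bounded pool creates its resources once and reuses them, while every acquisition is still counted.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical fixed-size pool; a stand-in for a real connection pool. */
final class SimplePool {

    /** Placeholder for an expensive resource such as a DB connection. */
    static final class Conn {
        final int id;

        Conn(int id) {
            this.id = id;
        }
    }

    private final int size;
    private final BlockingQueue<Conn> idle;
    private final AtomicInteger acquisitions = new AtomicInteger();

    SimplePool(int size) {
        this.size = size;
        this.idle = new ArrayBlockingQueue<>(size);
        for (int i = 1; i <= size; i++) {
            idle.add(new Conn(i)); // connections are created once, up front
        }
    }

    /** Blocks until a connection is free; that wait is the acquisition time. */
    Conn acquire() {
        acquisitions.incrementAndGet();
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
    }

    /** Returns a connection to the pool so the next caller can reuse it. */
    void release(Conn c) {
        idle.add(c);
    }

    int acquisitionCount() {
        return acquisitions.get();
    }

    int idleCount() {
        return idle.size();
    }

    int busyCount() {
        return size - idle.size();
    }

    public static void main(String[] args) {
        SimplePool pool = new SimplePool(2);
        for (int statement = 0; statement < 600; statement++) {
            Conn c = pool.acquire();
            // ... execute one statement on c ...
            pool.release(c);
        }
        System.out.println("acquisitions=" + pool.acquisitionCount()
            + " idle=" + pool.idleCount() + " busy=" + pool.busyCount());
    }
}
```

Acquiring once per request instead of once per statement would reduce the acquisition count further; when `busyCount` stays at the pool size, callers queue in `acquire`, which is the kind of waiting the thread findings above describe.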

#5

Do know what you Test

23s for One click

22s

$3-5M worth

Data grid

New Generation CRM: Angular.js / Coherence

7s for filter execution

Filter Value

Talk to architects, and trace argument values for performance-sensitive methods

# of unique invocations
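Tracing argument values can start as a small helper around the sensitive method. The sketch below is a hypothetical `ArgumentTracer`, not an existing library; the filter values and timings in `main` are invented for illustration. It records how often each argument value is passed and how much time it costs, which yields the number of unique invocations and the most expensive value.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracer for the argument values of one performance-sensitive method. */
final class ArgumentTracer {
    private final Map<String, Integer> calls = new HashMap<>();
    private final Map<String, Long> totalMillis = new HashMap<>();

    /** Record one invocation with its argument value and measured duration. */
    void trace(String argument, long millis) {
        calls.merge(argument, 1, Integer::sum);
        totalMillis.merge(argument, millis, Long::sum);
    }

    /** Total number of invocations seen so far. */
    int totalInvocations() {
        int sum = 0;
        for (int n : calls.values()) {
            sum += n;
        }
        return sum;
    }

    /** Metric "# of unique invocations": distinct argument values. */
    int uniqueInvocations() {
        return calls.size();
    }

    /** Argument value with the highest accumulated time, or null if nothing was traced. */
    String slowestArgument() {
        String slowest = null;
        long worst = -1;
        for (Map.Entry<String, Long> e : totalMillis.entrySet()) {
            if (e.getValue() > worst) {
                worst = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }

    public static void main(String[] args) {
        ArgumentTracer tracer = new ArgumentTracer();
        // Invented sample data: two cheap calls and one expensive filter value.
        tracer.trace("PL", 40);
        tracer.trace("PL", 35);
        tracer.trace("*", 7000);
        System.out.println("total=" + tracer.totalInvocations()
            + " unique=" + tracer.uniqueInvocations()
            + " slowest=" + tracer.slowestArgument());
    }
}
```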

Response Time

# Images

# Redirects

# and Size of Resources

# SQL Executions

# of SAME SQLs

# Items per Page

# AJAX per Page

Remember: New Metrics When Testing Apps

Time Spent in API

# Calls into API

# Functional Errors

# 3rd Party calls

# of Domains

Total Size

Resource (W3C) Timings: PLT, DOM Processing/Ready, Page Interactive

Online Performance Clinics

Every week @

bit.ly/onlineperfclinic

bit.ly/dttrial

Putting it into a Test Automation

Test Framework Results                Architectural Data
Build #   Test Case      Status       # SQL   # Excep   CPU
Build 17  testPurchase   OK           12      0         120ms
Build 17  testSearch     OK           3       1         68ms
Build 18  testPurchase   FAILED       12      5         60ms
Build 18  testSearch     OK           3       1         68ms
Build 19  testPurchase   OK           75      0         230ms
Build 19  testSearch     OK           3       1         68ms
Build 20  testPurchase   OK           12      0         120ms
Build 20  testSearch     OK           3       1         68ms

We identified a regression

Problem solved

Exceptions probably reason for failed tests

Problem fixed but now we have an architectural regression

Now we have the functional and architectural confidence
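The idea of pairing test status with architectural data can be sketched as a small build gate. The class `ArchitecturalGate`, its tolerance parameter and the comparison rules are assumptions for illustration, not the tooling used in the talk; the sample values for testPurchase (12, 75 and 5) are read from the slide's table as best it can be reconstructed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical build gate comparing architectural metrics against a baseline. */
final class ArchitecturalGate {

    /** Architectural data captured for one test case in one build. */
    static final class Metrics {
        final int sqlCount;
        final int exceptionCount;

        Metrics(int sqlCount, int exceptionCount) {
            this.sqlCount = sqlCount;
            this.exceptionCount = exceptionCount;
        }
    }

    /**
     * Returns one finding per regressed metric; an empty list means no regression.
     * sqlTolerance is the allowed relative growth in SQL executions (0.2 = 20%).
     */
    static List<String> findings(Metrics baseline, Metrics current, double sqlTolerance) {
        List<String> out = new ArrayList<>();
        if (current.exceptionCount > baseline.exceptionCount) {
            out.add("exceptions rose from " + baseline.exceptionCount
                + " to " + current.exceptionCount);
        }
        if (current.sqlCount > baseline.sqlCount * (1.0 + sqlTolerance)) {
            out.add("SQL executions rose from " + baseline.sqlCount
                + " to " + current.sqlCount);
        }
        return out;
    }

    public static void main(String[] args) {
        Metrics build17 = new Metrics(12, 0); // baseline for testPurchase
        Metrics build18 = new Metrics(12, 5); // functional failure, more exceptions
        Metrics build19 = new Metrics(75, 0); // test passes, but far more SQL
        Metrics build20 = new Metrics(12, 0); // back to the baseline

        System.out.println("Build 18: " + findings(build17, build18, 0.2));
        System.out.println("Build 19: " + findings(build17, build19, 0.2));
        System.out.println("Build 20: " + findings(build17, build20, 0.2));
    }
}
```

The point of the gate is Build 19: the functional test is green, yet the SQL count shows an architectural regression that a pass/fail status alone would miss.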

Let’s look behind the scenes
