Redefine Triage by Learning the Golden Nuggets of APM

May 11, 2015

Technology

CA Technologies

Successful use of APM doesn’t happen by accident or wishful thinking. You need to learn specific tasks and capabilities, grow competent with the technology, and become savvy with the philosophy and lifestyle of performance management. We have been validating analytic techniques for APM data and have found that using KPIs taken directly from your managed environment has a distinct advantage over a generic set of metrics. This ensures that your analytics are farming meaningful data and not getting distracted by excessive volumes of spurious metrics. It is a technique that you can apply today, as you begin planning for your upgrade tomorrow.

In a webcast on May 29th 2013, Mike Sydor, Senior Engineering Services Architect at CA Technologies and author of “APM Best Practices”, used this content to discuss how we identify and harness KPIs to make sense of your APM "big data", and how these techniques will help you prepare for your upgrade to the new features and functionality in the upcoming APM release and its tight integration with Advanced Behavior Analytics (ABA).

Listen to the webcast replay http://goo.gl/PZwTeu
Learn more at http://www.ca.com/apm
Transcript
Page 1: Redefine Triage by Learning the Golden Nuggets of APM
Page 2: Redefine Triage by Learning the Golden Nuggets of APM

2 © 2014 CA. ALL RIGHTS RESERVED.

Agenda

Why so many metrics with APM?
– “Big Data”?

What we are learning with CA-ABA (analytics)

How to find KPIs

What’s new for CA-APM 9.6 Release

Page 3: Redefine Triage by Learning the Golden Nuggets of APM

Typical APM Cluster

Dozens to hundreds of applications
– 2800 JVMs/CLRs

Up to 5M metrics, every 15 seconds

Large applications span multiple data centers
– 2-8 APM clusters, typical
– 30-70 EM Collectors for a nationwide portal application

12M to 28M metrics, every 15 seconds

… certainly sounds like big data!!!
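
To put those figures in perspective, the arithmetic below converts a per-interval metric count into values reported per day. It is only a back-of-the-envelope sketch: the function name is illustrative, and it assumes every metric reports a value in every 15-second interval.

```python
SECONDS_PER_DAY = 24 * 60 * 60
REPORTING_INTERVAL_S = 15  # APM reporting interval quoted on the slide


def data_points_per_day(metrics_per_interval: int) -> int:
    """Values reported per day, assuming every metric reports each interval."""
    intervals_per_day = SECONDS_PER_DAY // REPORTING_INTERVAL_S  # 5,760 intervals
    return metrics_per_interval * intervals_per_day


if __name__ == "__main__":
    for label, count in [
        ("single cluster (5M)", 5_000_000),
        ("multi-cluster, low (12M)", 12_000_000),
        ("multi-cluster, high (28M)", 28_000_000),
    ]:
        print(f"{label}: {data_points_per_day(count):,} values/day")
```

A single 5M-metric cluster works out to 28.8 billion values per day, which is why the volume alone "sounds like" big data.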

Page 4: Redefine Triage by Learning the Golden Nuggets of APM

What is Big Data???

APM information is “big”… but it is not “big data” without enrichment

[Diagram: 5M metrics that you don’t fully understand, OR 5M metrics that you don’t fully understand combined with enrichment sources]

Enrichment sources: Trouble Management, Version Control, Time of ____ Constraints, Air Traffic Advisories, Weather Forecast, AP News Updates, Marketing Campaigns

ENRICHMENT: Correlation, Trends, Insights, Anomalies

Page 5: Redefine Triage by Learning the Golden Nuggets of APM

Challenges for Big Data

Data Variety – different sources give different perspectives. Does your data have a significant perspective?

Validation – is the data source meaningful/predictive?

Consistency – are the values trustworthy?

Data Structure and Nomenclature – Mapping, Transformation

Temporal Impedance Mismatch
– APM: real-time with 15 second reporting interval
– Trouble Management: +15-30 minutes later
– Stock Ticker: +15-30 minutes later
– Air Traffic Advisories: +30-60 minutes later
– Version Control: days to weeks in advance
– Marketing Campaign Assessment: 2-4 weeks later
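
One way to picture the temporal mismatch listed above: an enrichment event that arrives late has to be mapped back to the APM interval it actually describes. The sketch below is an illustration under stated assumptions, not a CA APM feature. The function names, the `pad_min` safety margin, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def lookback_window(event_time, min_lag_min, max_lag_min, pad_min=5):
    """Return (start, end) of the APM window an event with a known lag refers to.

    pad_min is an assumed extra margin before the earliest plausible time.
    """
    start = event_time - timedelta(minutes=max_lag_min + pad_min)
    end = event_time - timedelta(minutes=min_lag_min)
    return start, end


def samples_in_window(samples, window):
    """Keep the (timestamp, value) samples that fall inside the window."""
    start, end = window
    return [(ts, value) for ts, value in samples if start <= ts <= end]


if __name__ == "__main__":
    # Hypothetical trouble ticket opened at 10:30, reported 15-30 minutes late.
    ticket_time = datetime(2014, 6, 2, 10, 30)
    window = lookback_window(ticket_time, min_lag_min=15, max_lag_min=30)

    # Hypothetical 15-second samples from 09:50 up to 10:30.
    first = datetime(2014, 6, 2, 9, 50)
    samples = [(first + timedelta(seconds=15 * i), i) for i in range(161)]

    hits = samples_in_window(samples, window)
    print(window[0].time(), window[1].time(), len(hits))
```

The same idea applies in the other direction for sources that lead the metrics, such as version control changes scheduled days or weeks in advance.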

Page 6: Redefine Triage by Learning the Golden Nuggets of APM

KPI Management Maturity

SGCM: Stalls, GC Settings, Concurrency, Memory Management Trends

APC: Availability, Performance, Capacity

EKB: Errors, Key Resource Performance, Business Transaction Survey

[Chart: VALUE versus KPI MATURITY, with stages labeled (Platform), (Application), (Transaction)]

Page 7: Redefine Triage by Learning the Golden Nuggets of APM

What We are Learning with CA-ABA

Page 8: Redefine Triage by Learning the Golden Nuggets of APM

ABA Logical Architecture

APM Cluster

5M Metrics

100k Metrics (via RegEx)

Anomaly Engine

Anomalies, Alerts

Why only 100k Metrics??? Why not 5M???

Page 9: Redefine Triage by Learning the Golden Nuggets of APM

RegEx == Regular Expression

analytics.metricfeed.process.3 = Custom Metric Host (Virtual)\\|Custom Metric Process (Virtual)\\|Custom Business Application Agent (Virtual)

analytics.metricfeed.metric.3 = By Business Service\\|[^|]+\\|[^|]+\\|[^|]+:.+

Page 10: Redefine Triage by Learning the Golden Nuggets of APM

RegEx is hard… but easy to validate
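
As a rough illustration of "easy to validate", the sketch below checks the metric expression from the previous slide against a handful of metric paths before it goes into a metric feed. Several things here are assumptions: the sample metric paths are invented, the doubled backslash is taken to be properties-file escaping (so the regex engine sees `\|`, a literal pipe), and full-string matching may not be how the product applies the expression.

```python
import re

# The slide's expression with properties-file escaping removed (assumption):
# "\\|" in the file becomes "\|", a literal pipe, for the regex engine.
METRIC_PATTERN = re.compile(r"By Business Service\|[^|]+\|[^|]+\|[^|]+:.+")

# Hypothetical metric paths mapped to whether we expect them to be selected.
EXPECTATIONS = {
    "By Business Service|Trading|Place Order|Submit:Average Response Time (ms)": True,
    # One path level too few, so it should be rejected.
    "By Business Service|Trading|Place Order:Average Response Time (ms)": False,
    # Wrong top-level node, so it should be rejected.
    "Frontends|Apps|Trading:Average Response Time (ms)": False,
}


def validate(pattern, expectations):
    """Return the metric paths whose match result differs from the expectation."""
    return [
        path
        for path, expected in expectations.items()
        if (pattern.fullmatch(path) is not None) != expected
    ]


if __name__ == "__main__":
    surprises = validate(METRIC_PATTERN, EXPECTATIONS)
    print("pattern OK" if not surprises else f"unexpected results: {surprises}")
```

Keeping a small table of "should match" and "should not match" paths next to each expression makes it cheap to re-check the expression whenever it is edited.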

Page 11: Redefine Triage by Learning the Golden Nuggets of APM

Metricfeed.3

[Chart: metricfeed.3 (Series1), vertical axis 0 to 200]

Broader collection of metrics, but only 87/500 == 17.4% are generally known as useful

Page 12: Redefine Triage by Learning the Golden Nuggets of APM

Suspects Identified via Baseline Technique

[Chart: Suspects via Baseline Techniques, Average RT only (Series1); categories: SiteMinder Backends JSP Frontends JMX Custom; vertical axis 0 to 18]

100% Useful metrics, ready for validation: 47/43625 == 0.1%

Page 13: Redefine Triage by Learning the Golden Nuggets of APM

Metric Count TypeView

Page 14: Redefine Triage by Learning the Golden Nuggets of APM

What is an Application?

Front-ends
– Browser? Webservice? Messaging?

Back-ends
– Databases, Webservices, Messaging, Mainframes, Trading_Partners

Muck-in-the-Middle
– Software quality, stability and scalability

We want to identify KPIs for each of these elements
– helps us build a useful dashboard for Operations
– helps expose what the resources are really doing
– helps us define acceptance criteria, to act proactively
– helps us to triage really effectively

Page 15: Redefine Triage by Learning the Golden Nuggets of APM

How to Find KPIs

Page 16: Redefine Triage by Learning the Golden Nuggets of APM

Capacity KPIs – “Tree Rings”

Page 17: Redefine Triage by Learning the Golden Nuggets of APM

Performance KPIs

High Volume

+

Significant Response Time
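
The "high volume plus significant response time" rule can be sketched as a simple filter. This is a minimal illustration, not the author's tooling: the `Component` type, the thresholds, and the sample numbers are all hypothetical, and ranking by volume times response time is just one reasonable way to order the survivors.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    calls_per_interval: int
    avg_response_ms: float


def performance_kpi_candidates(components, min_calls, min_response_ms):
    """Keep components that are both busy and slow enough to matter.

    Results are ordered by total time consumed (calls x average response time).
    """
    picked = [
        c
        for c in components
        if c.calls_per_interval >= min_calls and c.avg_response_ms >= min_response_ms
    ]
    return sorted(
        picked,
        key=lambda c: c.calls_per_interval * c.avg_response_ms,
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical components and numbers, for illustration only.
    observed = [
        Component("nightly-report", 2, 9000.0),  # slowest, but almost never called
        Component("checkout", 1200, 350.0),
        Component("health-ping", 5000, 2.0),     # busiest, but trivially fast
        Component("search", 800, 600.0),
    ]
    for c in performance_kpi_candidates(observed, min_calls=100, min_response_ms=100.0):
        print(c.name, c.calls_per_interval, c.avg_response_ms)
```

Note that a plain "top response time" ranking would put the rarely used report first, which is exactly the distraction the slide warns against.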

Page 18: Redefine Triage by Learning the Golden Nuggets of APM

Create a Simple Alert and Threshold (ConnectionStatus)

Page 19: Redefine Triage by Learning the Golden Nuggets of APM

Create a Simple Alert, Find Restart and threshold (MetricCount)

“UP” – but not actually doing anything!!!
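
The two availability alerts can be combined into one check, sketched below. Treat it as an assumption-laden illustration: the state names, the minimum metric count, and the idea that a restart shows up as a sharp drop in metric count are hypothetical choices made for this example, not documented product behavior.

```python
def agent_state(connected: bool, metric_count: int, min_metric_count: int) -> str:
    """Classify an agent using ConnectionStatus plus a Metric Count threshold."""
    if not connected:
        return "DOWN"
    if metric_count < min_metric_count:
        # Connected, but reporting too few live metrics to be doing real work.
        return "UP_BUT_IDLE"
    return "UP"


def find_restarts(metric_counts, drop_ratio=0.5):
    """Return indexes where the metric count falls sharply versus the prior sample.

    Assumes a restart appears as a sudden drop followed by a gradual climb.
    """
    return [
        i
        for i in range(1, len(metric_counts))
        if metric_counts[i] < metric_counts[i - 1] * drop_ratio
    ]


if __name__ == "__main__":
    print(agent_state(connected=True, metric_count=40, min_metric_count=500))
    print(find_restarts([900, 950, 960, 120, 400, 900]))
```

The threshold itself comes from observing the application under normal load, which is one more reason to establish these KPIs before production.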

Page 20: Redefine Triage by Learning the Golden Nuggets of APM

Understanding Your Environment

Identify the KPIs
– Availability
  - Agent ConnectionStatus
  - Number of Live Metrics (Metric Count)
– Performance: High Volume components with significant response time
  - NOT “Top 10 Response Time”
– Capacity: Highest Volume Components

Don’t Wait for Production!!!
– Make it part of your pre-production review
– Manage the application lifecycle by trending KPIs

Page 21: Redefine Triage by Learning the Golden Nuggets of APM

KPI Evolution

Good: Stalls, GC Settings, Concurrency, Memory Management (graph)

Better (additional): Availability – Connected Status, Availability – Metric Count, Suspect Performance, Suspect Capacity

Best (additional): Errors, Key Resource Performance, Business Transaction Survey

Platform: Coarse information... but not really APM

Application, Transactions, Resources: The APM Advantage

Page 22: Redefine Triage by Learning the Golden Nuggets of APM

What’s New in CA APM 9.6

Simplified, automated, and built on CA APM strengths.

Faster, Easier APM

• Intelligent Deep Transaction Trace is now dynamic and automated, and requires less developer involvement for deep dives into the apps supporting the transactions

• Simplified triage with easier drill-down in the Application Triage Map, including Socket Grouping

• Improved response times with software-based Transaction Impact Monitor (end-user experience)

• Expanding APM’s scope with Java 7 EM & Agents

Seamless Mainframe Awareness

• Increased insight by adding DB2 details to transaction traces

• Greater awareness with CA SYSVIEW MQ alerts & complete status in APM

• Driving further cross-enterprise depth with CTG traces to fully expand backend calls

• Other mainframe-based enhancements

Page 23: Redefine Triage by Learning the Golden Nuggets of APM

Preparing to Upgrade

HealthCheck the existing cluster prior to any upgrade

Good:
– Do a clean install of the APM Cluster, alongside the existing cluster version
– Manually duplicate management modules, domains.xml, etc.
– Bring down the old version, then bring up the new

Better:
– Install the new version in a separate, reduced-size environment
– Migrate a few applications to the new environment for validation
– Upgrade the primary environment after validation is achieved

Best:
– Install a new GOLD environment in production, separate from the original cluster
– Migrate agents, as schedules permit, until the original cluster can be decommissioned
– This provides an opportunity to introduce pre-production review and generally correct any bad deployment habits

Page 24: Redefine Triage by Learning the Golden Nuggets of APM

Resources

APM Community Site (https://communities.ca.com/web/ca-wily-global-user-community)
– Cookbook: APM HealthCheck
– Understanding Which Metrics Matter (KPI discussion)
– Cookbook: Application Audit
  - more details on the baseline techniques and process

APM Best Practices: Realizing Application Performance Management
– available on Amazon.com and Apress.com
  - Baselines, Test Plans, App Audits, Triage, Firefighting
  - Organizational Models, Service Catalogs

APM Web Page: ca.com/apm