VoIP Troubleshooting, Monitoring, and Metrics. Terry Slattery, Principal Consultant, CCIE #1026. Copyright 2009.

Troubleshooting VoIP in Converged Networks

Mar 22, 2023

Page 1: Troubleshooting VoIP in Converged Networks

Copyright 2009

VoIP Troubleshooting, Monitoring, and Metrics

Terry Slattery, Principal Consultant

CCIE #1026

Page 2: Troubleshooting VoIP in Converged Networks


Part 1: Troubleshooting

• Provide examples of common problems

• Identify sources of problems and their symptoms

• Remediation

• Techniques you can use in your network

Page 3: Troubleshooting VoIP in Converged Networks

The Network is the Foundation for VoIP

• VoIP depends upon the network
  1. Network hardware and links
  2. Network protocols (routing & switching)
  3. Transport protocols (TCP/UDP)
  4. VoIP protocols and operation
• Other features
  – QoS
  – Redundancy
• Use the VoIP operational model to aid troubleshooting and monitoring

[Diagram: network model: Network Hardware & Links (Routers & Switches); Routing & Switching Protocols (OSPF, STP); Communication Protocols (TCP/UDP/IP); Applications (VoIP). VoIP operational model: Connectivity and Registration; Call Setup; Call Operation; Misc Operation and Services.]

Page 4: Troubleshooting VoIP in Converged Networks

How VoIP Works

• Connectivity and Registration
  – Power requested by continuous Fast Link Pulse (FLP)
  – DHCP request & response (UDP)
  – Get config from TFTP server (UDP)
  – Register with call controller (TCP)

[Diagram: Remote Branch and Central Site; DHCP and TFTP servers; addresses 10.9.14.4 and 10.9.28.1; steps: Power, DHCP Request, Get Config, Register w/ Call Server.]

Page 5: Troubleshooting VoIP in Converged Networks

How VoIP Works (cont)

Call setup and call operation*
  1. Off-hook, dial tone, Phone 1
  2. Collect digits and call setup, Phone 1
  3. Ringback tone, Phone 1
  4. Call setup, Phone 2
  5. Ring Phone 2
  6. Off-hook, Phone 2
  7. Connect RTP stream

[Diagram: Phone 1 and Phone 2 with steps 1-7 marked.]

* Basic steps; a lot more happens than in this high-level description

Page 6: Troubleshooting VoIP in Converged Networks

Troubleshooting Methodology

• Gather data about the problem
• Subdivide the problem space
  – Associate failure symptoms with typical causes
    • No display; no IP address
    • One-way audio or poor audio
  – Topology: All users at one site?
  – Timeframe: Always at the same time? Coincident with other events?
  – Endpoints involved: Same PSTN gateway?

[Diagram: good and bad calls through a PSTN GW.]

Page 7: Troubleshooting VoIP in Converged Networks

Troubleshooting Methodology (cont)

• Isolate the origin of the problem
  – Collect problem data (description, scope, etc.)
  – Narrow the list of causes
  – Collect additional data to validate or disprove causes
  – Create and test a hypothesis
  – Repeat until the problem is found and corrected
• Narrow causes based on how VoIP works
  – Connectivity & Registration
  – Call Setup
  – Call Operation
  – Other Services

[Diagram: VoIP operational model: Connectivity and Registration; Call Setup; Call Operation; Misc Operation and Services.]

Page 8: Troubleshooting VoIP in Converged Networks


Troubleshooting Diagnostic Aids

Page 9: Troubleshooting VoIP in Converged Networks

Connectivity - Power

• PoE problems
  – Know power supply limits; upgrade when necessary
  – 802.3af may not negotiate with old PoE modules
  – Cisco 6500 power status

Console> show environment power
PS1 Capacity: 1153.32 Watts (27.46 Amps @ 42V)
PS2 Capacity: none
PS Configuration : PS1 and PS2 in Redundant Configuration.
Total Power Available: 1153.32 Watts (27.46 Amps @ 42V)
Total Power Available for Line Card Usage: 1153.32 Watts (27.46 Amps @ 42V)
Total Power Drawn From the System: 289.80 Watts (6.90 Amps @ 42V)
Remaining Power in the System: 863.52 Watts (20.56 Amps @ 42V)
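The power figures in this output can be checked automatically. The following is a minimal Python sketch, assuming the output has already been collected as text; the 80% alert level and the helper names are illustrative assumptions, not values from the slide.

```python
import re

# Excerpt of the "show environment power" output shown on the slide.
SAMPLE = """\
Total Power Available: 1153.32 Watts (27.46 Amps @ 42V)
Total Power Drawn From the System: 289.80 Watts (6.90 Amps @ 42V)
Remaining Power in the System: 863.52 Watts (20.56 Amps @ 42V)
"""

ALERT_THRESHOLD = 0.80  # hypothetical alert level, not from the slide


def _watts(label: str, output: str) -> float:
    """Return the wattage that follows 'label:' in the command output."""
    match = re.search(re.escape(label) + r":\s*([\d.]+)\s*Watts", output)
    if match is None:
        raise ValueError(f"field not found: {label}")
    return float(match.group(1))


def power_utilization(output: str) -> float:
    """Fraction of the available power that is currently drawn."""
    available = _watts("Total Power Available", output)
    drawn = _watts("Total Power Drawn From the System", output)
    return drawn / available


utilization = power_utilization(SAMPLE)
print(f"power utilization: {utilization:.1%}")
print("ALERT" if utilization > ALERT_THRESHOLD else "ok")
```

In practice the same check would more likely run against SNMP data in a monitoring tool; parsing CLI text is only the simplest way to show the arithmetic.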

Page 10: Troubleshooting VoIP in Converged Networks

Connectivity – Power (cont)

• Monitor power supplies, utilization, and know limits
  – Automated tools with thresholds and alerts
  – SNMP traps on power supply failure and all environmental data like fan failure
• Redundant power supply dies – some interfaces or modules don't get power
• UPS failures and bad batteries
  – Monitor UPS operation and battery health

Page 11: Troubleshooting VoIP in Converged Networks

Connectivity – VLAN

• Voice VLAN misconfigured
  – Phone comes up in the wrong VLAN
  – Static configuration on phone (eBay purchase)
  – Switch misconfigured
• No Voice VLAN
  – Phone connected to data port
  – Switch misconfigured (include voice vlan)

interface FastEthernet0/9
 switchport access vlan 100
 switchport mode access
 switchport voice vlan 411

Page 12: Troubleshooting VoIP in Converged Networks

Connectivity – DHCP

• IP address assignment, default gateway, additional boot info
  – Cisco: option 150, Avaya: option 176
• Local vs Central DHCP server
  – Short lease vs long lease
  – Administrative overhead
  – Tracking address utilization

[Diagram: DHCP request from the Remote Branch; DHCP servers at the Remote Branch and at the Central Site.]

Page 13: Troubleshooting VoIP in Converged Networks

Connectivity – DHCP Location Tradeoffs

• Central
  – Multi-day address lease – longer than typical downtime
  – Reduces network equipment configuration
  – Good if many small branches exist
  – Handling long connectivity downtime due to disaster
• Local
  – Short address lease
  – Manage DHCP config at each site
  – More appropriate at larger remote sites
  – Good if downtime is more extensive
  – Very remote offices with poor connection reliability

Page 14: Troubleshooting VoIP in Converged Networks

Local DHCP

• Administered on each router
  – A good NCCM product can help here
• Problem: no DHCP server in voice VLAN

ip dhcp pool VOICE
 network 10.9.28.0 255.255.255.0
 option 150 ip 10.9.14.4 10.9.20.2
 default-router 10.9.28.1
 dns-server 10.9.12.4 10.9.20.11
 domain-name your-domain.com

[Diagram: DHCP request answered at the Remote Branch (10.9.28.1); TFTP server 10.9.14.4 at the Central Site.]

Page 15: Troubleshooting VoIP in Converged Networks

Central DHCP

• Central server
  – Redundant servers are common
  – Commercial products: BlueCat and Infoblox
• Problem: phones don't get a DHCP address
  – Helper address forwards DHCP requests
  – ip helper-address 10.9.14.4

[Diagram: DHCP request from the Remote Branch forwarded by "ip helper-address 10.9.14.4" to the DHCP server at the Central Site.]

Page 16: Troubleshooting VoIP in Converged Networks

Connectivity - DHCP

• Design tip: Separate address space for VoIP!!!
• Avoid ACL entries for each voice VLAN
  – Access lists for classification
  – Access lists for separation of voice & data
  – Design for summarization (simpler ACLs)

ip access-list extended Voice-to-Voice
 permit udp 10.9.0.0 0.0.255.255 range 16384 32767 10.9.0.0 0.0.255.255 range 16384 32767

• Good security practice: firewall between voice and data subnets
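To see why a summarized voice address space keeps the ACL to one line, the matching rule can be sketched in Python. This is an illustration of wildcard-mask matching for the Voice-to-Voice entry above, not a model of everything an IOS ACL does; the function names and the test addresses are assumptions.

```python
import ipaddress

VOICE_BASE = "10.9.0.0"
VOICE_WILDCARD = "0.0.255.255"   # 1-bits are "don't care" bits
RTP_PORTS = range(16384, 32768)  # 16384..32767 inclusive, as in the ACL


def wildcard_match(addr: str, base: str, wildcard: str) -> bool:
    """True if addr equals base on every bit the wildcard mask does not ignore."""
    a = int(ipaddress.IPv4Address(addr))
    b = int(ipaddress.IPv4Address(base))
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (a & care) == (b & care)


def voice_to_voice_permits(src: str, sport: int, dst: str, dport: int) -> bool:
    """Evaluate the single permit line for a UDP packet."""
    return (
        wildcard_match(src, VOICE_BASE, VOICE_WILDCARD)
        and sport in RTP_PORTS
        and wildcard_match(dst, VOICE_BASE, VOICE_WILDCARD)
        and dport in RTP_PORTS
    )


# Two phones in different voice subnets of 10.9.0.0/16 match the one entry.
print(voice_to_voice_permits("10.9.28.15", 16384, "10.9.14.20", 32767))
# A host outside the voice summary does not.
print(voice_to_voice_permits("10.10.1.1", 20000, "10.9.14.20", 20000))
```

Because every voice subnet falls under one summary, adding a new voice VLAN inside 10.9.0.0/16 requires no ACL change.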

Page 17: Troubleshooting VoIP in Converged Networks

DHCP Summary

• Problems primarily from configuration mistakes
• No DHCP server in voice VLAN
• Central DHCP server reachability
  – Missing or incorrect helper address: “ip helper-address <addr>”

[Diagram: DHCP request from the Remote Branch; DHCP servers at the Remote Branch and at the Central Site.]

Page 18: Troubleshooting VoIP in Converged Networks

Connectivity - TFTP

• Download the phone config and OS
• Connectivity between phone and TFTP server
  – Co-located with central DHCP server is good
  – TFTP uses UDP
  – Firewall or ACL configuration
• TFTP timeout on long delay and lossy paths

[Diagram: Remote Branch (10.9.28.1) and TFTP server (10.9.14.4) at the Central Site.]

Page 19: Troubleshooting VoIP in Converged Networks

Connectivity – TFTP

• TFTP server failure
  – DHCP option 150 for Cisco; 176 for Avaya
  – Redundant server specification is good
• Bad TFTP file
  – Doesn't exist – often wrong phone MAC address
  – Bad format or contains typos
• Long system boot times, due to power outage
  – Example: 20 minutes to get all phones working
  – Network infrastructure boot time
  – DHCP/TFTP/Call servers booting, then overloaded
  – Download congestion!
  – Use load balancing

Page 20: Troubleshooting VoIP in Converged Networks

TFTP Summary

• Network connectivity
  – Ping from TFTP server to phone
  – Ping from TFTP server to phone's default gateway
  – Note: TFTP timeout due to long delay or lossy path
• Configuration errors
  – Bad TFTP file (typo of phone MAC address)
  – Bad DHCP option 150/176 address
  – Incorrect firewall or ACL configuration
• TFTP server failure or overload
  – Check availability and OS load image
• Failure domain: one phone, all phones, or phones in one subnet?

Page 21: Troubleshooting VoIP in Converged Networks

Registration

• Can’t connect to the Call Server
  – Routing problem between phone and call server
  – Incorrect firewall or ACL configuration
• Test with ping and traceroute from call server
• Which phones are affected?
• New site?

[Diagram: failure points: no route to phones; no route to call controller; firewall, ACL, or routing problem.]

Page 22: Troubleshooting VoIP in Converged Networks

Registration

• Can’t connect to the Call Server
  – Phone not configured in Call Server
  – MAC address wrong in Call Server
  – Default TFTP config file has wrong Call Controller address

[Diagram: failure points: phone MAC address wrong or not configured; wrong call controller address.]

Page 23: Troubleshooting VoIP in Converged Networks

Registration

• Can’t connect to the Call Server
  – Call server capacity (e.g., after power outages)
  – Call server is down
• Use redundant call servers on different subnets

[Diagram: failure points: overloaded call server; redundant server but the subnet is unreachable.]

Page 24: Troubleshooting VoIP in Converged Networks

Registration Summary

• How many and where are affected phones?
• Routing – check with network tools like ping
• Call server configuration of phone MAC addr
• Server capacity and redundancy

[Diagram: failure points: overloaded call server; redundant server but the subnet is unreachable; no route to phones; no route to call controller; firewall, ACL, or routing problem; phone MAC address wrong or not configured; wrong call controller address.]

Page 25: Troubleshooting VoIP in Converged Networks

Call Setup

• Incorrect destination call routing
  – Dial plan problems
    • Overlapping dial spaces
    • Incorrect dial search spaces
  – Troubleshoot with DNA (Dialed Number Analyzer)

7-digit dialing:
  939XXXX (Internal)
  939XXXX (Local)
  9.939XXXX (Local)
  9.393@ (Local or LD)

4-digit dialing:
  736-8[0-4]XX
  355-8[5-9]XX
  Then add: 736-85XX
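The 4-digit example above can be checked mechanically. The sketch below is a rough Python illustration, assuming the 4-digit extension is the part after the hyphen and that patterns use only literal digits, `X` for any digit, and `[a-b]` ranges; it is not the Dialed Number Analyzer, and the helper names are made up for this example.

```python
from itertools import combinations, product


def expand(pattern: str) -> set[str]:
    """Expand a digit pattern such as '8[0-4]XX' into every extension it matches."""
    choices = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "X":                      # any single digit
            choices.append("0123456789")
        elif ch == "[":                    # range such as [0-4]
            end = pattern.index("]", i)
            low, high = int(pattern[i + 1]), int(pattern[end - 1])
            choices.append("".join(str(d) for d in range(low, high + 1)))
            i = end
        else:                              # literal digit
            choices.append(ch)
        i += 1
    return {"".join(digits) for digits in product(*choices)}


def extension(did_pattern: str) -> str:
    """Assumption: the dialed 4-digit extension is the part after the hyphen."""
    return did_pattern.split("-", 1)[1]


def overlaps(patterns: list[str]) -> list[tuple[str, str, int]]:
    """Return each pair of patterns whose 4-digit extensions collide, with the count."""
    found = []
    for a, b in combinations(patterns, 2):
        common = expand(extension(a)) & expand(extension(b))
        if common:
            found.append((a, b, len(common)))
    return found


plan = ["736-8[0-4]XX", "355-8[5-9]XX", "736-85XX"]
for a, b, count in overlaps(plan):
    print(f"{a} overlaps {b} on {count} extensions")
```

With the slide's patterns, adding 736-85XX collides with the 85XX portion of 355-8[5-9]XX, which is the overlapping dial space the slide warns about.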

Page 26: Troubleshooting VoIP in Converged Networks

Call Setup

• Phones get calls for other locations
  – Numbers and hunt groups tied to phone, not line
  – Phone moved but call server not updated
• Spend time on a good dial plan!
  – 10-digit, multi-tenant plan
  – Map dial spaces onto this plan
  – Can still do 4-digit (or N-digit) dialing
  – Allows for growth, merger, acquisition
  – Much, much less expensive to maintain
  – Note: include planning to avoid toll fraud

Page 27: Troubleshooting VoIP in Converged Networks

Call Setup

• TCP is used between call server and endpoints
  – Routing problem between call controller & endpoints
  – Typically won't get dial tone or registration
  – Ping, traceroute, ACL checks, etc. (sound familiar?)
  – Endpoints include PSTN gateways and DSPs*

[Diagram: failure points: no route to endpoints; no route to call controller; firewall, ACL, or routing problem.]

* Digital Signal Processor

Page 28: Troubleshooting VoIP in Converged Networks

Call Setup

• DSP required to match codecs or for conference calls
• Troubleshooting
  – CUCM log: “no resources”
  – Monitor DSP pool utilization
    • Cat 6500: show port voice active
    • Command syntax and limits depend on hardware
• Solution: buy more hardware

[Diagram: Central Site, Remote Branch, and a DSP transcoder converting between G.711 and G.729.]

Page 29: Troubleshooting VoIP in Converged Networks

Call Operation - No-Way Audio

• Audio RTP data sent in UDP datagrams
• Endpoints don't have connectivity
  – Routing problem
  – Firewall or ACL blocking a path
  – Cisco Skinny payload carries IP addr (NAT must know to change the embedded address)
• Use ping & traceroute to check reachability

[Diagram: call steps 1-7 between Phone 1 and Phone 2; step 7 (the RTP stream) is blocked (X) in the ROTN*.]

* Rest Of The Network

Page 30: Troubleshooting VoIP in Converged Networks

Call Operation - One-Way Audio

• Check basic connectivity
  – Firewall or ACL blocking one path
  – Routing problem
• Two-way, then one-way
  – Change in routing or configuration
  – DSP crash (when transcoding or conference call)
  – Link congestion and no QoS or bad QoS
• Troubleshooting
  – What changed? (routing & configuration)
  – Who was affected?
  – Log analysis

Page 31: Troubleshooting VoIP in Converged Networks

Call Operation - One-Way Audio

• Example two-way, then one-way
  – 128Kbps serial link congestion and no QoS
• Routing change + big updates
  – Link flooded with routing update
  – Routing protocol uses up to 50% of link bandwidth
  – Incorrect bandwidth setting (default T1 on Cisco)
  – Routing prioritized above voice (default behavior)
  – Link congestion starved voice (took weeks to find)

[Diagram: big routing update crossing a 128Kbps link configured with "bandwidth 1544", toward the ROTN.]

Page 32: Troubleshooting VoIP in Converged Networks

Call Operation - Delay, Jitter, Packet Loss

• Causes:
  – Inconsistent or no QoS
  – Duplex mismatch or bad link
  – Routing problems (loss) or multipath (jitter)
  – Oversubscribed links (congestion & loss)
• Know when it's happening
  – Be able to detect the cause of each problem
  – Monitoring depends on vendor
    • RTCP stream (Avaya, Nortel)
    • Call stats on call server (Cisco)
• ITU specs: 150ms delay, 30ms jitter, 1% loss

[Audio samples: G.729 good; 60ms jitter; 10% packet loss.]

Page 33: Troubleshooting VoIP in Converged Networks

Call Operation - Delay

• ITU Spec: 150ms one-way delay
• Reduces interaction of a call
  – Wait for voice to travel to the other end of the call
  – Worst case is like a push-to-talk radio (Nextel?)
  – Roughly 10ms per 1000 miles (~30ms across the US)
• Causes:
  – Sub-optimum route path selection
    • New York to Atlanta via San Francisco
  – Long delay path, e.g., satellite circuit (250ms one-way)

Page 34: Troubleshooting VoIP in Converged Networks

Call Operation - Jitter

• Phones buffer packets to handle minor jitter
  – Packets with large jitter arrive too late and are dropped
  – Route flapping
  – Multipath load balancing

[Diagram: two paths between New York, NY and Beaverton, OR: one via Atlanta, Dallas, San Jose and one via Chicago & Denver; link speeds shown: 100M, 45M, 100M, 100M.]

Page 35: Troubleshooting VoIP in Converged Networks

Call Operation - Jitter

• ITU Spec: 30ms jitter
• Big packets delay voice on low speed links
• Use Link Fragmentation and Interleaving
  – Choose fragment size for delays of about 15 ms

              Packet Size (bytes)
Link Speed    64       128      256      512      1024      1500
64Kbps        8 ms     16 ms    32 ms    64 ms    128 ms    187 ms
128Kbps       4 ms     8 ms     16 ms    32 ms    64 ms     93 ms
256Kbps       2 ms     4 ms     8 ms     16 ms    32 ms     46 ms
512Kbps       1 ms     2 ms     4 ms     8 ms     16 ms     23 ms
768Kbps       0.6 ms   1.2 ms   2.5 ms   5.1 ms   10.2 ms   15 ms

[Diagram: 64Kbps link before and after fragmentation: a 1500-byte data packet ahead of 60-byte voice packets (before) versus 128-byte data fragments interleaved with 60-byte voice packets (after).]
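The table values come from the serialization delay of a packet on a link, which is simply its size in bits divided by the link speed. A short Python sketch of that arithmetic follows; the function names are illustrative, and computed values can differ slightly from the rounded figures in the table (notably in the 768Kbps row).

```python
def serialization_delay_ms(packet_bytes: int, link_bps: int) -> float:
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000


def fragment_size_bytes(link_bps: int, target_ms: float = 15.0) -> int:
    """Largest fragment whose serialization delay stays within target_ms."""
    return int(link_bps * target_ms / 1000 / 8)


for speed in (64_000, 128_000, 256_000, 512_000, 768_000):
    row = [f"{serialization_delay_ms(size, speed):6.1f}"
           for size in (64, 128, 256, 512, 1024, 1500)]
    print(f"{speed // 1000:>4} Kbps: {' '.join(row)} ms"
          f"   fragment for 15 ms: {fragment_size_bytes(speed)} bytes")
```

For a 64Kbps link the 15 ms target gives a fragment of 120 bytes, in line with the roughly 128-byte fragments pictured on the slide.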

Page 36: Troubleshooting VoIP in Converged Networks

Call Operation - Jitter

• Inconsistent or no QoS implemented
  – Series of big packets delay voice
  – Only occurs when a link is oversubscribed
  – Priority queue moves voice to the front of the queue
  – Caution: Priority queue can starve lower priority queues; use policing to limit its effect
  – Configuration details vary among products

[Diagram: G.711 voice (218-byte packets) on a 100Mbps link, before and after QoS. Labels: 40 pkts @ 1500 bytes (6ms of data); 33 pkts (5ms of data); 7 pkts (1ms of data); 19ms; 5ms jitter.]

Page 37: Troubleshooting VoIP in Converged Networks

Call Operation – Packet Loss

• ITU Spec: 1% packet loss (codecs handle 5%)
• Incorrect or no QoS configuration
  – Oversubscribed priority queue with policing
    • Designed for 4 concurrent calls, 20ms rate
      – G.729 on Frame Relay: 28.14 kbps *
      – G.711 on Ethernet: 91.56 kbps *
    • Facility expands and 8 concurrent calls occur
    • Policing on priority queue drops excess traffic
  – Monitor QoS queue drops
• VoIP traffic not properly classified
  – Dropped when congestion occurs

* google: “cisco codec bandwidth” for calculators
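The per-call figures and the effect of doubling the call count can be approximated with a few lines of Python. This sketch assumes 40 bytes of IP/UDP/RTP headers, a 160-byte G.711 payload at 20 ms, and 18 bytes of Ethernet overhead, which gives the 218-byte voice frame size that appears elsewhere in this deck. It yields 87.2 kbps; the slide's 91.56 kbps is about 5% higher, presumably an allowance for additional overhead, but that reading is an assumption.

```python
IP_UDP_RTP_BYTES = 40  # 20 IP + 8 UDP + 12 RTP


def call_bandwidth_kbps(payload_bytes: int, l2_overhead_bytes: int,
                        interval_ms: int = 20) -> float:
    """One-way bandwidth of a single call, in kbps."""
    packets_per_second = 1000 / interval_ms
    frame_bytes = payload_bytes + IP_UDP_RTP_BYTES + l2_overhead_bytes
    return frame_bytes * 8 * packets_per_second / 1000


def policed_drop_fraction(designed_calls: int, actual_calls: int) -> float:
    """Share of voice traffic dropped when a policer sized for designed_calls
    sees actual_calls of identical streams."""
    if actual_calls <= designed_calls:
        return 0.0
    return (actual_calls - designed_calls) / actual_calls


g711_ethernet = call_bandwidth_kbps(payload_bytes=160, l2_overhead_bytes=18)
print(f"G.711 on Ethernet: {g711_ethernet:.1f} kbps per call")
print(f"priority queue for 4 calls: {4 * g711_ethernet:.1f} kbps")
print(f"dropped with 8 calls: {policed_drop_fraction(4, 8):.0%}")
```

The point of the slide survives the rounding: a priority queue policed for 4 calls discards roughly half of the voice packets once 8 calls are active, and the loss is spread across all calls rather than confined to the extra ones.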

Page 38: Troubleshooting VoIP in Converged Networks

Call Operation – Packet Loss

• Example: Intermittent calling
  – Oversubscribed 100Mbps link (backup to 1G link)
  – No QoS on signaling or bearer data
  – Keepalives to call controller were regularly lost
  – Calls only placed when registered to call controller
• Symptoms
  – Intermittent calling possible
  – Loss of calling and poor call quality during network busy times
• Solution
  – QoS on the 100Mbps link
  – Better solution: QoS end-to-end

Page 39: Troubleshooting VoIP in Converged Networks

Call Operation – Packet Loss

• Duplex mismatch (very common)
  – Fixed configuration on one end of link
  – The fixed configuration end doesn't negotiate
  – Look for errors: FCS, Runts, Late Collisions
  – Use auto-negotiate for phones

interface FastEthernet 0/1
 duplex auto

• Bad cabling
  – Bad crimp
  – Cat 3 cable
  – Pinched cable
• Use 'duplex auto'

[Diagram: one end of the link is fixed ("int fa0/1", "duplex full"); the other end is "duplex auto": 1. Try negotiation; 2. No response; 3. Must set duplex half.]

Page 40: Troubleshooting VoIP in Converged Networks

Call Operation – Echo

• Listener vs talker echo (talker is more frequent)
• Due to signal crossover
• Audio feedback
  – Remote speaker output feeds back into the microphone
  – Sources: speakerphones, earpieces, cell phones
  – Increase echo cancellation timer
• DSP transcoding bugs
• Delay inherent in VoIP accentuates echo

Page 41: Troubleshooting VoIP in Converged Networks

Call Operation – Echo

• 4-wire to 2-wire hybrid - electrical coupling
  – Decrease amplitude of echo
    • Decrease output gain
    • Increase input attenuation
    • Make small changes (20% steps) and check quality
  – Increase echo canceling timer

[Diagram: analog phone connected through hybrids to the PSTN; echo couples between the TX and RX paths at the hybrid.]

Page 42: Troubleshooting VoIP in Converged Networks

Other Services – Music on Hold

• No Music on Hold
  – Music on Hold not configured
  – Configuration error – filename incorrect or changed
  – MoH source died
  – All phones affected

[Diagram: MoH source; no phone receives MoH.]

Page 43: Troubleshooting VoIP in Converged Networks

Other Services – Music on Hold

• Sporadic Music on Hold (most common)
  – Unicast overload on call server (syslog message)
    • Varies by phone and load (time of day)
  – Multicast configuration problem
    • Determine scope of the outage
    • Network routing problem

[Diagram: multicast MoH source: phones reached by multicast are OK, phones with no multicast get no MoH; unicast MoH: too many clients.]

Page 44: Troubleshooting VoIP in Converged Networks

Survivable Remote Site Telephony (SRST)

• Symptom: Phones can’t register with SRST router
• SRST not configured on phone & router
• More phones or directory numbers than SRST router supports
• Short DHCP lease (increase to 8 days)

[Diagram: WAN link down (X); PSTN connection remains.]

Page 45: Troubleshooting VoIP in Converged Networks

What about Video?

• What type:
  – Streaming or video conferencing?
  – Unicast or Multicast?
  – Primarily dynamic or static images (surveillance)
  – Audio is a VoIP data stream
• Like voice: delay, jitter, packet loss
• Bursty due to “I” frames that contain full image
• “P” and “B” frames are updates to the “I” frame

Page 46: Troubleshooting VoIP in Converged Networks

What about Video?

• Unlike voice
  – Dropouts are less important
  – Bursty, so don't mix with voice in priority queue
• QoS recommendations
  – Queue below voice due to burstiness
  – Interactive - DSCP 34 (AF41); Streaming - DSCP 32 (CS4)
  – Allocate video bandwidth + 20%
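The DSCP numbers in the recommendation follow from the standard code point layout: an Assured Forwarding value AFxy is 8x + 2y and a Class Selector CSx is 8x. A small Python sketch of that mapping is shown here as a convenience for reading packet captures, where the DSCP appears in the upper six bits of the IP ToS byte; the function names are illustrative.

```python
def af_dscp(af_class: int, drop_precedence: int) -> int:
    """DSCP value for Assured Forwarding class AF<class><drop>."""
    if not (1 <= af_class <= 4 and 1 <= drop_precedence <= 3):
        raise ValueError("AF classes are 1-4 with drop precedence 1-3")
    return 8 * af_class + 2 * drop_precedence


def cs_dscp(cs_class: int) -> int:
    """DSCP value for Class Selector CS<class>."""
    if not 0 <= cs_class <= 7:
        raise ValueError("CS classes are 0-7")
    return 8 * cs_class


def tos_byte(dscp: int) -> int:
    """ToS/Traffic Class byte as seen in a capture (DSCP in the top six bits)."""
    return dscp << 2


print("AF41:", af_dscp(4, 1), hex(tos_byte(af_dscp(4, 1))))
print("CS4: ", cs_dscp(4), hex(tos_byte(cs_dscp(4))))
```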

Page 47: Troubleshooting VoIP in Converged Networks

Summary: Troubleshooting

• Configuration mistakes are the major cause of problems
• Collect data; subdivide the problem
• Test hypothesis and repeat
• Use the Network and Operational Models to subdivide the problem and aid troubleshooting

[Diagram: network model: Network Hardware & Links (Routers & Switches); Routing & Switching Protocols (OSPF, STP); Communication Protocols (TCP/UDP/IP); Applications (VoIP). VoIP operational model: Connectivity and Registration; Call Setup; Call Operation; Misc Operation and Services.]

Page 48: Troubleshooting VoIP in Converged Networks


Part 2: Monitoring and Metrics

• Establish monitoring requirements

• What to monitor and why

• Available tools

• Identify useful metrics

Page 49: Troubleshooting VoIP in Converged Networks

Manual processes don't scale

• More than 20-50 devices is too big for manual monitoring
• Check system interdependencies
  – Root bridge depends on the switches in the STP domain
  – Duplex mismatch depends on connected device
  – Routing protocol consistency
  – VoIP call quality
  – QoS configurations

[Diagram: network with PSTN GW.]

Page 50: Troubleshooting VoIP in Converged Networks

What you should be doing

• Manual processes are low value
  – Fire-fighting network problems
  – Collecting raw data
  – Basic analysis of raw data
• Contributing value to the business
  – Planning network upgrades to support future business initiatives
  – Implementing technologies that support today's business
  – Troubleshooting and solving significant problems
  – Creating scalable processes and procedures

Page 51: Troubleshooting VoIP in Converged Networks

Monitoring Requirements

• Real-time
  – Events; Performance; Error detection
• Trending
  – Historical utilization and operational data
• Baseline
  – Categorizing normal behavior; Inventory
• Configuration management
  – Saving configs and checking against policies
• Latent problem detection
  – Combining data to find potential problems
• Diagnostic
  – Special tools for troubleshooting

Page 52: Troubleshooting VoIP in Converged Networks

Metrics

• Measurable
  – Link, CPU, memory utilization
  – QoS queue drops
  – Interface errors
• Actionable
  – Must be usable for identifying and fixing problems
• Update frequency
  – Nyquist sampling theorem: sample at 2X the frequency of the data
  – Dependent on the use
    • Trending and historical
    • Real-time & diagnostic

Page 53: Troubleshooting VoIP in Converged Networks

Realtime – Events

• Syslog & SNMP traps
  – Sent asynchronously by network gear
  – High volume (particularly firewalls)
  – UDP-based (unreliable delivery)
  – Informational through critical severity
• Log everything
  – Keep for historical reference
• Filters for different recipients
  – Network operations team
  – Unified communications team
  – Security team
• Sync device clocks with NTP
  – Correlate timestamps from multiple devices

Page 54: Troubleshooting VoIP in Converged Networks

Realtime – Event Processing

• Handling the volume
  – Filter out unimportant events
  – Tune filters over time
• Daily summary report

Summary of GNS Cisco syslog Messages on Wed Jan 17 23:59:00 EST 2007
Cisco Messages:
  437 DUAL-5-NBRCHANGE
  353 LINEPROTO-5-UPDOWN
  114 CRYPTO-6-IKMP_MODE_FAILURE
  ...
Messages sorted by frequency and source device:
  346 test1.com DUAL-5-NBRCHANGE
  114 test2.com CRYPTO-6-IKMP_MODE_FAILURE
  84 test3.com LINEPROTO-5-UPDOWN Tunnel11967 test4.com DUAL-5-NBRCHANGE
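A daily summary like the one above can be produced with very little code. The following Python sketch counts Cisco-style `%FACILITY-SEVERITY-MNEMONIC` tags by type and by source device. The input format (hostname as the first field of each line) and the sample lines are assumptions for illustration; real syslog files carry timestamps and other fields that would need their own parsing.

```python
import re
from collections import Counter

# Cisco-style message tag, e.g. %DUAL-5-NBRCHANGE
TAG = re.compile(r"%([A-Z0-9_]+-\d-[A-Z0-9_]+)")


def summarize(lines):
    """Count messages by tag and by (source device, tag)."""
    by_type = Counter()
    by_source = Counter()
    for line in lines:
        match = TAG.search(line)
        if match is None:
            continue  # not a Cisco-style message; ignored in this sketch
        host = line.split()[0]  # assumption: hostname is the first field
        by_type[match.group(1)] += 1
        by_source[(host, match.group(1))] += 1
    return by_type, by_source


# Hypothetical sample input.
sample = [
    "test1.com %DUAL-5-NBRCHANGE: neighbor change",
    "test1.com %DUAL-5-NBRCHANGE: neighbor change",
    "test3.com %LINEPROTO-5-UPDOWN: line protocol changed state",
    "test2.com %CRYPTO-6-IKMP_MODE_FAILURE: processing failed",
]

by_type, by_source = summarize(sample)
for tag, count in by_type.most_common():
    print(f"{count:5d} {tag}")
for (host, tag), count in by_source.most_common():
    print(f"{count:5d} {host} {tag}")
```

Sorting by frequency is what makes the report useful: a device that suddenly tops the list is usually worth a look even when each individual message is low severity.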

Page 55: Troubleshooting VoIP in Converged Networks


Realtime – Cisco Events

• Cisco search: “System Error Messages for Cisco Unified Communications Manager”

• CCM_CALLMANAGER-CALLMANAGER-3-CallManagerFailure: includes reason for failure

• CCM_CALLMANAGER-CALLMANAGER-3-SDLLinkOOS: Cluster communications link failure

• CCM_CALLMANAGER-CALLMANAGER-3-DeviceTransientConnection: incomplete device registration; large numbers of these indicate a problem

• CCM_CALLMANAGER-CALLMANAGER-4-MediaResourceListExhausted: media resource type not found (e.g. DSP, MusicOnHold, etc)

• CCM_CALLMANAGER-CALLMANAGER-3-TspError: phone registration problem

• CCM_CALLMANAGER-CALLMANAGER-6-Database: various internal database errors

Page 56: Troubleshooting VoIP in Converged Networks

Realtime – Example Events

• Critical network events
  – LINK-3-UPDOWN: backbone and important links
  – CDP-4-DUPLEX-MISMATCH: high utilization links
  – LINK-4-ERROR: excessive link errors
  – SYS-5-RESTART: device restarted
  – DUAL-3-SIA: EIGRP routing protocol problem
  – SYS-{1345}-SYS-LCPERR{1345}: Cat 6500 internal error
• Tune your own filters based on activity in your network
• Events are specific to each vendor

Page 57: Troubleshooting VoIP in Converged Networks

Realtime - Event Tools

• Splunk
  – Search log files for key patterns
  – Easy to use and tune for events on your network
• LogLogic
  – Often used by big companies for compliance
  – Handles many types of log files, including server logs
• SolarWinds (Kiwi Cat-tools)
  – Just acquired – may take a while to be integrated
• Syslog-ng
  – Open source
  – Flexible, can forward events to other systems
  – Good filtering capability

Page 58: Troubleshooting VoIP in Converged Networks

Realtime – Performance

• Thresholds – trigger alert when exceeded
  – Interface errors > 1E-6 (~1E-10 bit error rate for 1000B average packet size)
  – QoS priority queue drops (cisco-class-based-qos-mib)
  – 5-minute CPU over 80%
  – 15-minute WAN interface over 70% *
  – 15-minute LAN interface over 80% *
  – Router or switch memory utilization over 75%
• Correlate utilization peaks with VoIP problems

* Utilization depends on interface speed setting
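The relation between the packet error threshold and the bit error rate quoted above is a one-line calculation: for low error rates, the bit error rate is approximately the packet error ratio divided by the number of bits per packet. A Python sketch, with hypothetical counter values, might look like this:

```python
ERROR_RATIO_THRESHOLD = 1e-6  # interface error threshold from the slide


def approx_bit_error_rate(packet_error_ratio: float,
                          avg_packet_bytes: int = 1000) -> float:
    """Approximate BER, assuming at most one bad bit per errored packet."""
    return packet_error_ratio / (avg_packet_bytes * 8)


def error_ratio(error_count: int, packet_count: int) -> float:
    """Errors per packet over a polling interval."""
    return error_count / packet_count if packet_count else 0.0


def exceeds_threshold(error_count: int, packet_count: int) -> bool:
    return error_ratio(error_count, packet_count) > ERROR_RATIO_THRESHOLD


print(f"BER at threshold: {approx_bit_error_rate(ERROR_RATIO_THRESHOLD):.2e}")
# Hypothetical interface counters: 12 errors in 4 million packets.
print(exceeds_threshold(12, 4_000_000))
```

The counters should be deltas between two polls, not the lifetime totals on the interface, or old errors will mask or exaggerate the current rate.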

Page 59: Troubleshooting VoIP in Converged Networks

Realtime – VoIP Performance

• Delay, Jitter, Loss stat collection
  – Cisco: Call Detail Record & Call Management Record collection
  – Avaya: RTCP stream directed to collector
• ITU specs:
  – Delay: 150ms one-way
  – Jitter: 30ms
  – Loss: 1%
• Determine your thresholds
  – Military often uses much higher values
  – 1% packet loss is terrible for data
  – NY to SF is 30ms one-way

Page 60: Troubleshooting VoIP in Converged Networks

Realtime – Triggers

• Call completion failure codes
  – search cisco.com “Call Termination Cause Codes”
• Environmental failures other than events
  – High power supply utilization
  – Fan failure (should be an event, but uses UDP)
  – Temperature
  – UPS battery reserve, AC supply status, etc.
  – Change in STP root bridge
  – Redundant router (HSRP/VRRP) change

Page 61: Troubleshooting VoIP in Converged Networks

Event and Realtime VoIP Tools

• Cisco
  – CDR Analysis and Reporting tool (to report on Cisco call completion codes)
  – Microsoft Log Parser on Microsoft-based CUCM
  – Realtime Monitor in CUCM v6 and later
• Avaya
  – VoIP Monitor (Vmon) reports many stats
  – Control Network Analyzer is like a sniffer for diagnosis
  – Fault Performance Manager for fault and performance viewing
  – Trunk Group Analyzer to see trunk utilization

Page 62: Troubleshooting VoIP in Converged Networks


Trending

• Know “what's normal”

• Historical trend– Predict when and where to change capacity– Use for input to the design process

• “Top N” in percentage utilization change (month, quarter, 6-months, year)

• Use 95th Percentile per day, week, or month

Page 63: Troubleshooting VoIP in Converged Networks

Realtime – 95th Percentile Performance

• Used by many carriers for billing
• Averaging all samples distorts results
• Example: Peak 1.32Mbps; Average 0.27Mbps

Page 64: Troubleshooting VoIP in Converged Networks

Realtime – 95th Percentile Performance

• Sort all samples; discard top 5% of samples
• The highest remaining sample is the 95th percentile
• 0.75Mbps for this link
• 'Busy-hour'
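The sort-and-discard procedure described above is short enough to write out. This Python sketch follows the slide's method literally; the sample data is invented for illustration, and other percentile definitions (with interpolation, for example) give slightly different answers.

```python
def percentile_95(samples):
    """Sort, discard the top 5% of samples, return the highest one left."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    discard = len(ordered) * 5 // 100  # integer math avoids float rounding
    kept = ordered[:len(ordered) - discard]
    return kept[-1]


# Hypothetical utilization samples in Mbps: mostly quiet with a few bursts.
samples = [0.2] * 90 + [0.75] * 5 + [1.32] * 5

print("peak:   ", max(samples))
print("average:", round(sum(samples) / len(samples), 3))
print("95th:   ", percentile_95(samples))
```

The short bursts set the peak but barely move the average; the 95th percentile sits between the two and reflects the sustained busy-period load, which is why it is the figure to size links against.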

Page 65: Troubleshooting VoIP in Converged Networks


Trending

• Correlate with configurations to find latent problems

• Trends in call quality (CDR/CMR trending)

• UPS battery life and planning replacements

• CPU & Memory utilization trends, particularly in software-based routers

• QoS queue drops

Page 66: Troubleshooting VoIP in Converged Networks


Trending Example

• Memory leak – router crash every twelve days

Drill-down to 10.17.8.102 stats, monthly view

Page 67: Troubleshooting VoIP in Converged Networks

Trending – VoIP Resource Utilization

• DSP pool utilization (CISCO-DSP-MGMT-MIB)
  – cdspCardResourceUtilization
    • Indicates the percentage of current DSP resource utilization of the card
  – cdspCardLastHiWaterUtilization
    • Indicates the last high water mark of DSP resource utilization
  – Calculate total utilization across all cards
• Trunk channel utilization & CUCM monitoring
  – CISCO-CCM-MIB-V1SMI: ccmGatewayTrunkTable
  – Calculate utilization from total and in-use counts
• Metric
  – 70% for growing organization; 90% for no growth

Page 68: Troubleshooting VoIP in Converged Networks

Baseline

• Inventory
  – What do you have and what are its capabilities
  – Tracking end-of-life equipment and software
• Utilization
  – Know historical utilization levels
  – “How long has it been like this?”
• Call Quality
  – Validate initial installation
  – Historical basis for tracking future changes

Page 69: Troubleshooting VoIP in Converged Networks

Trending and Baseline Tools

• Netcordia NetMRI
  – Excellent network discovery and baseline
  – Analysis of collected data
• SolarWinds
  – Real-time network utilization graphs
  – Well-respected product
  – Excellent user community and online forum
• CA/Concord/Spectrum
  – High-end product
  – Takes a while to get installed and customized

Page 70: Troubleshooting VoIP in Converged Networks

Network Visibility – Flow Data

• NetFlow, sFlow, IPFIX
  – All similar functionality – flow statistics
  – Which systems are talking to which systems
    • Protocol (TCP/UDP/ICMP)
    • Port numbers
    • Data volume
    • Number of conversations
• Find top talkers, top pairs, top protocols
• Identify active holes between voice & data networks

Page 71: Troubleshooting VoIP in Converged Networks

Network Visibility – Flow Tools

• Fluke Networks: Netflow Tracker
  – “All the flows, all the time” (or something like that)
• Plixer: Scrutinizer
  – Inexpensive, reasonable starter
• SolarWinds
  – Reasonable price
  – Check their forums for customer feedback
• cFlowd
  – Open source

Page 72: Troubleshooting VoIP in Converged Networks

Configuration Management

• Greatest impact on network stability and faults
  – Majority of network problems are due to configuration mistakes
  – More than 40%; amount depends on the analyst
  – Impossible to get to five-nines without it
• What to track
  – Who made the change
  – What changed
  – When was it changed
  – Use a AAA server (RADIUS or TACACS+)
• Critical in VoIP networks

Page 73: Troubleshooting VoIP in Converged Networks

Configuration Management

• Basic requirements
  – Configuration archive
  – Check running vs saved configurations
  – Log configuration changes
  – Tools to view changes

Page 74: Troubleshooting VoIP in Converged Networks

Configuration Management

• Example: The Site That Lost Its VoIP
  – Major VoIP deployment
  – No automated tools in place
  – All routers and switches updated at the site
  – Two weeks later: power outage at the site
  – VoIP is down
  – Analysis: Configurations were not saved to NVRAM

Page 75: Troubleshooting VoIP in Converged Networks

Configuration Policy

• Policy definition process
  1. Policy defined
  2. Template created
  3. Per-device modifications made to template
  4. Install final configuration in the device
• Policy is infrequently reviewed afterwards
  – Configs diverge from policy as changes accumulate
  – Manual methods are tedious and error-prone

POLICY
 Hostname
 Internal DNS
 Internal NTP
 Router loopback

TEMPLATE
 hostname router
 ip name-server 10.1.1.12
 ntp server 10.1.1.12
 interface lo0
  ip address 10.2.X.Y

DEVICE CONFIG
 hostname b3-core-1
 ip name-server 10.1.1.12
 ntp server 10.1.1.12
 interface lo0
  ip address 10.2.1.1

Page 76: Troubleshooting VoIP in Converged Networks

Validating Configuration Policy

• Not just regulatory – check best practices
• Mechanism
  – Compare templates with device configs
  – Identify differences
  – Create an alert
• Value
  – Validate existing policies
  – Identify devices that don't match a new policy
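The compare-and-alert mechanism can be sketched in a few lines of Python. Here the template from the Configuration Policy slide (Page 75) is expressed as one regular expression per required line, with the per-device parts (hostname, last two loopback octets) left as wildcards. The regex encoding and the function name are assumptions for illustration; commercial tools handle ordering, interface context, and much more.

```python
import re

# One pattern per line the policy requires; variable parts are wildcards.
TEMPLATE = [
    r"hostname \S+",
    r"ip name-server 10\.1\.1\.12",
    r"ntp server 10\.1\.1\.12",
    r"interface lo0",
    r"ip address 10\.2\.\d{1,3}\.\d{1,3}",
]


def policy_violations(config_text: str, template=TEMPLATE) -> list[str]:
    """Return the template patterns that no line of the config satisfies."""
    lines = [line.strip() for line in config_text.splitlines() if line.strip()]
    return [pattern for pattern in template
            if not any(re.fullmatch(pattern, line) for line in lines)]


compliant = """\
hostname b3-core-1
ip name-server 10.1.1.12
ntp server 10.1.1.12
interface lo0
 ip address 10.2.1.1
"""

# Hypothetical drifted device: NTP points at a different server.
drifted = compliant.replace("ntp server 10.1.1.12", "ntp server 192.0.2.1")

print(policy_violations(compliant))
print(policy_violations(drifted))
```

Each non-empty result is a candidate alert; running the same check after a policy change identifies every device that does not yet match the new policy.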

Page 77: Troubleshooting VoIP in Converged Networks

Fixing Configuration Policy Exceptions

• Remediation
  – Some policy exceptions can be automatically fixed
    • Duplex mismatch
    • Bridge priority
    • Router ARP timer > switch CAM timer
  – Service impacting changes need manual application
• Without automated policy validation, configs become inconsistent
• QoS policies
  – Trusting QoS in the right places?
  – Correct QoS marking policies in place?

Page 78: Troubleshooting VoIP in Converged Networks

Latent Problems

• Problems that exist but don't have an impact (yet)
• Often a configuration error
• Latent problems need a triggering action
  – Router redundancy
  – STP root bridge selection
  – Config not saved

Page 79: Troubleshooting VoIP in Converged Networks

Latent Problems – No Redundancy

• HSRP & VRRP
  – No redundant router
  – First failure was not noticed

Page 80: Troubleshooting VoIP in Converged Networks

Latent Problems – Wrong Root Bridge

• Root Bridge
  – Must determine switches in spanning tree domain
  – Check bridge priority on all switches in the domain
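The reason every switch in the domain must be checked is that spanning tree elects the switch with the lowest bridge ID as root: lowest priority first, with the lowest MAC address as the tie-breaker. When all priorities are left at the default, the switch with the lowest MAC address (often the oldest one) wins. A Python sketch of that election follows, with hypothetical switch names and addresses:

```python
def bridge_id(switch: dict) -> tuple[int, int]:
    """Bridge ID as (priority, MAC as integer); lower wins the election."""
    return switch["priority"], int(switch["mac"].replace(":", ""), 16)


def elect_root(switches: list[dict]) -> str:
    """Name of the switch that spanning tree would elect as root."""
    return min(switches, key=bridge_id)["name"]


# Hypothetical domain with every switch at the default priority (32768).
domain = [
    {"name": "core-1", "priority": 32768, "mac": "00:1a:2b:00:00:01"},
    {"name": "core-2", "priority": 32768, "mac": "00:1a:2b:00:00:02"},
    {"name": "old-access", "priority": 32768, "mac": "00:01:42:00:00:99"},
]

intended_root = "core-1"
actual_root = elect_root(domain)
print(actual_root, "(latent problem)" if actual_root != intended_root else "(ok)")

# Lowering the priority on the intended root fixes the election.
domain[0]["priority"] = 4096
print(elect_root(domain))
```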

Page 81: Troubleshooting VoIP in Converged Networks

Configuration Management Tools

• Netcordia NetMRI
  – Config repository, analysis, differences, etc.
  – Easy to install and get running
• SolarWinds
  – Config repository
  – Relatively easy to install and run
• CA Product Suite
  – Good for very large organizations
• HPOV
  – Good for very large organizations

Page 82: Troubleshooting VoIP in Converged Networks

Diagnostic Monitoring

• Fast polling
  – Look for periodic trends
  – Interface: look for other protocol updates
  – CPU & Memory: peak usage
• Link utilization to determine proper sizing and burst for BW limiting
• MRTG

Page 83: Troubleshooting VoIP in Converged Networks

Summary: Monitoring and Metrics

• The network is the foundation for VoIP
• VoIP is a complex system – many interdependencies
• Monitor key parameters with automated tools
• Use the Network and Operational Models to subdivide the problem and aid troubleshooting

[Diagram: network model: Network Hardware & Links (Routers & Switches); Routing & Switching Protocols (OSPF, STP); Communication Protocols (TCP/UDP/IP); Applications (VoIP). VoIP operational model: Connectivity and Registration; Call Setup; Call Operation; Misc Operation and Services.]