VoIP Troubleshooting, Monitoring, and Metrics
Terry Slattery, Principal Consultant
CCIE #1026
Copyright 2009
Part 1: Troubleshooting
• Provide examples of common problems
• Identify sources of problems and their symptoms
• Remediation
• Techniques you can use in your network
The Network is the Foundation for VoIP
• VoIP depends upon the network
  1. Network hardware and links
  2. Network protocols (routing & switching)
  3. Transport protocols (TCP/UDP)
  4. VoIP protocols and operation
• Other features
  – QoS
  – Redundancy
• Use the VoIP operational model to aid troubleshooting and monitoring
[Figure: network model: Network Hardware & Links (Routers & Switches); Routing & Switching Protocols (OSPF, STP); Communication Protocols (TCP/UDP/IP); Applications (VoIP)]
[Figure: VoIP operational model: Connectivity and Registration; Call Setup; Call Operation; Misc Operation and Services]
How VoIP Works
• Connectivity and Registration
  – Power requested by continuous Fast Link Pulse (FLP)
  – DHCP request & response (UDP)
  – Get config from TFTP server (UDP)
  – Register with call controller (TCP)
[Figure: phone at Remote Branch (10.9.28.1) and DHCP/TFTP server at Central Site (10.9.14.4); steps: Power, DHCP Request, Get Config, Register w/ Call Server]
How VoIP Works (cont)
Call setup and call operation*
  1. Off-hook, dial tone, Phone 1
  2. Collect digits and call setup, Phone 1
  3. Ringback tone, Phone 1
  4. Call setup, Phone 2
  5. Ring, Phone 2
  6. Off-hook, Phone 2
  7. Connect RTP stream
[Figure: steps 1-7 between Phone 1 and Phone 2]
* Basic steps; a lot more happens than in this high-level description
Troubleshooting Methodology
• Gather data about the problem
• Subdivide the problem space
  – Associate failure symptoms with typical causes
    • No display; no IP address
    • One-way audio or poor audio
  – Topology: All users at one site?
  – Timeframe: Always at the same time? Coincident with other events?
  – Endpoints involved: Same PSTN gateway?
[Figure: Good vs. Bad; PSTN GW]
Troubleshooting Methodology (cont)
• Isolate the origin of the problem
  – Collect problem data (description, scope, etc.)
  – Narrow the list of causes
  – Collect additional data to validate or disprove causes
  – Create and test hypothesis
  – Repeat until problem is found and corrected
• Narrow causes based on how VoIP works
  – Connectivity & Registration
  – Call Setup
  – Call Operation
  – Other Services
[Figure: VoIP operational model: Connectivity and Registration; Call Setup; Call Operation; Misc Operation and Services]
Connectivity - Power
• PoE problems
  – Know power supply limits; upgrade when necessary
  – 802.3af may not negotiate with old PoE modules
  – Cisco 6500 power status
    Console> show environment power
    PS1 Capacity: 1153.32 Watts (27.46 Amps @ 42V)
    PS2 Capacity: none
    PS Configuration : PS1 and PS2 in Redundant Configuration.
    Total Power Available: 1153.32 Watts (27.46 Amps @ 42V)
    Total Power Available for Line Card Usage: 1153.32 Watts (27.46 Amps @ 42V)
    Total Power Drawn From the System: 289.80 Watts (6.90 Amps @ 42V)
    Remaining Power in the System: 863.52 Watts (20.56 Amps @ 42V)
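The power figures above can be turned into a quick headroom check. This is a minimal sketch in Python: the totals are copied from the output above, while the per-phone budget of 15.4 W (the 802.3af per-port maximum) and the function names are assumptions made for illustration. Real phones usually draw less, so treat the result as a conservative estimate.

```python
# Figures copied from the "show environment power" output above.
TOTAL_AVAILABLE_W = 1153.32
TOTAL_DRAWN_W = 289.80

# Assumption: budget each new phone at the 802.3af per-port maximum.
WATTS_PER_PHONE = 15.4


def remaining_power(available_w: float, drawn_w: float) -> float:
    """Return the unallocated power budget in watts."""
    return available_w - drawn_w


def phones_supported(remaining_w: float, per_phone_w: float = WATTS_PER_PHONE) -> int:
    """Return how many more phones fit in the remaining budget."""
    return int(remaining_w // per_phone_w)


remaining = remaining_power(TOTAL_AVAILABLE_W, TOTAL_DRAWN_W)
print(f"Remaining: {remaining:.2f} W")
print(f"Additional phones at {WATTS_PER_PHONE} W each: {phones_supported(remaining)}")
```

The same arithmetic is what an automated threshold would apply when alerting before a power supply runs out of headroom.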
Connectivity – Power (cont)
• Monitor power supplies, utilization, and know limits
  – Automated tools with thresholds and alerts
  – SNMP traps on power supply failure and all environmental data like fan failure
• Redundant power supply dies – some interfaces or modules don't get power
• UPS failures and bad batteries
  – Monitor UPS operation and battery health
Connectivity – VLAN
• Voice VLAN misconfigured
  – Phone comes up in the wrong VLAN
  – Static configuration on phone (eBay purchase)
  – Switch misconfigured
• No Voice VLAN
  – Phone connected to data port
  – Switch misconfigured (include voice vlan)
    interface FastEthernet0/9
     switchport access vlan 100
     switchport mode access
     switchport voice vlan 411
Connectivity – DHCP
• IP address assignment, default gateway, additional boot info
  – Cisco: option 150; Avaya: option 176
• Local vs. central DHCP server
  – Short lease vs. long lease
  – Administrative overhead
  – Tracking address utilization
[Figure: DHCP request from Remote Branch; DHCP servers at the branch and at the Central Site]
Connectivity – DHCP Location Tradeoffs
• Central
  – Multi-day address lease – longer than typical downtime
  – Reduces network equipment configuration
  – Good if many small branches exist
  – Handling long connectivity downtime due to disaster
• Local
  – Short address lease
  – Manage DHCP config at each site
  – More appropriate at larger remote sites
  – Good if downtime is more extensive
  – Very remote offices with poor connection reliability
Local DHCP
• Administered on each router
  – A good NCCM (network configuration and change management) product can help here
• Problem: no DHCP server in voice VLAN
    ip dhcp pool VOICE
     network 10.9.28.0 255.255.255.0
     option 150 ip 10.9.14.4 10.9.20.2
     default-router 10.9.28.1
     dns-server 10.9.12.4 10.9.20.11
     domain-name your-domain.com
[Figure: DHCP request served at the Remote Branch router (10.9.28.1); TFTP server at the Central Site (10.9.14.4)]
Central DHCP
• Central server
  – Redundant servers are common
  – Commercial products: BlueCat and Infoblox
• Problem: phones don't get a DHCP address
  – Helper address forwards DHCP requests
  – ip helper-address 10.9.14.4
[Figure: DHCP request from Remote Branch forwarded by "ip helper-address 10.9.14.4" to the Central Site DHCP server]
Connectivity - DHCP
• Design tip: Separate address space for VoIP!!!
• Avoid ACL entries for each voice VLAN
  – Access lists for classification
  – Access lists for separation of voice & data
  – Design for summarization (simpler ACLs)
    ip access-list extended Voice-to-Voice
     permit udp 10.9.0.0 0.0.255.255 range 16384 32767 10.9.0.0 0.0.255.255 range 16384 32767
• Good security practice: firewall between voice and data subnets
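The summarization tip can be checked mechanically. This is a minimal sketch using Python's standard ipaddress module: it confirms the wildcard mask for the 10.9.0.0/16 summary used in the ACL and flags any voice subnet that falls outside it. Only 10.9.28.0/24 comes from the slides; the other two subnets and the function name are hypothetical.

```python
import ipaddress

# Summary block used in the Voice-to-Voice ACL above.
VOICE_SUMMARY = ipaddress.ip_network("10.9.0.0/16")

# Hypothetical voice subnets; only 10.9.28.0/24 appears in the slides.
voice_subnets = ["10.9.28.0/24", "10.9.29.0/24", "10.20.5.0/24"]


def outside_summary(subnets, summary):
    """Return the subnets that the summary ACL entry would not match."""
    return [s for s in subnets if not ipaddress.ip_network(s).subnet_of(summary)]


print("ACL wildcard mask:", VOICE_SUMMARY.hostmask)
print("Needs its own ACL entry:", outside_summary(voice_subnets, VOICE_SUMMARY))
```

Any subnet reported by outside_summary would need an extra ACL line, which is the per-VLAN maintenance burden the design tip is meant to avoid.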
DHCP Summary
• Problems primarily from configuration mistakes
• No DHCP server in voice VLAN
• Central DHCP server reachability
  – Missing or incorrect helper address: “ip helper-address <addr>”
[Figure: DHCP request from Remote Branch; DHCP servers at the branch and at the Central Site]
Connectivity - TFTP
• Download the phone config and OS
• Connectivity between phone and TFTP server
  – Co-located with central DHCP server is good
  – TFTP uses UDP
  – Firewall or ACL configuration
• TFTP timeout on long delay and lossy paths
[Figure: Remote Branch (10.9.28.1) reaching the TFTP server at the Central Site (10.9.14.4)]
Connectivity – TFTP
• TFTP server failure
  – DHCP option 150 for Cisco; 176 for Avaya
  – Redundant server specification is good
• Bad TFTP file
  – Doesn't exist – often wrong phone MAC address
  – Bad format or contains typos
• Long system boot times, due to power outage
  – Example: 20 minutes to get all phones working
  – Network infrastructure boot time
  – DHCP/TFTP/Call servers booting, then overloaded
  – Download congestion! – Use load balancing
TFTP Summary
• Network connectivity
  – Ping from TFTP server to phone
  – Ping from TFTP server to phone's default gateway
  – Note: TFTP timeout due to long delay or lossy path
• Configuration errors
  – Bad TFTP file (typo of phone MAC address)
  – Bad DHCP option 150/176 address
  – Incorrect firewall or ACL configuration
• TFTP server failure or overload
  – Check availability and OS load image
• Failure domain: one phone, all phones, or phones in one subnet?
Registration
• Can’t connect to the Call Server
  – Routing problem between phone and call server
  – Incorrect firewall or ACL configuration
• Test with ping and traceroute from call server
• Which phones are affected?
• New site?
[Figure: failure points: no route to phones; no route to call controller; firewall, ACL, or routing problem]
Registration
• Can’t connect to the Call Server
  – Phone not configured in Call Server
  – MAC address wrong in Call Server
  – Default TFTP config file has wrong Call Controller address
[Figure: failure points: phone MAC address wrong or not configured; wrong call controller address]
Registration
• Can’t connect to the Call Server
  – Call server capacity (e.g., after power outages)
  – Call server is down
• Use redundant call servers on different subnets
[Figure: failure points: overloaded call server; redundant server but the subnet is unreachable]
Registration Summary
• How many and where are affected phones?
• Routing – check with network tools like ping
• Call server configuration of phone MAC address
• Server capacity and redundancy
[Figure: failure points: overloaded call server; redundant server but the subnet is unreachable; no route to phones; no route to call controller; firewall, ACL, or routing problem; phone MAC address wrong or not configured; wrong call controller address]
Call Setup
• Incorrect destination call routing
  – Dial plan problems
    • Overlapping dial spaces
    • Incorrect dial search spaces
  – Troubleshoot with DNA (Dialed Number Analyzer)
7-digit dialing:
    939XXXX (Internal)
    939XXXX (Local)
    9.939XXXX (Local)
    9.393@ (Local or LD)
4-digit dialing:
    736-8[0-4]XX
    355-8[5-9]XX
    Then add: 736-85XX
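The 4-digit example can be checked by brute force. This is a minimal sketch that reads the example as follows: once 736-85XX is added, extensions 85XX exist at both the 736 and 355 prefixes, so 4-digit dialing becomes ambiguous. That reading, the site labels, and the function names are assumptions for illustration; the code is not tied to any call server's pattern syntax beyond "X means any digit".

```python
import re
from itertools import combinations

# Last four digits of each pattern from the slide; X means any digit.
# The site labels are hypothetical names for illustration.
patterns = {
    "736 site": "8[0-4]XX",
    "355 site": "8[5-9]XX",
    "736 site (added)": "85XX",
}


def to_regex(pattern):
    """Translate a dial pattern into a compiled regular expression."""
    return re.compile(pattern.replace("X", r"\d"))


def overlaps(dial_patterns):
    """Return, per pair of patterns, the 4-digit extensions both match."""
    compiled = {name: to_regex(p) for name, p in dial_patterns.items()}
    extensions = [f"{n:04d}" for n in range(10000)]
    result = {}
    for a, b in combinations(compiled, 2):
        shared = [e for e in extensions
                  if compiled[a].fullmatch(e) and compiled[b].fullmatch(e)]
        if shared:
            result[(a, b)] = shared
    return result


for pair, shared in overlaps(patterns).items():
    print(pair, f"{len(shared)} extensions, {shared[0]}-{shared[-1]}")
```

Running a check like this before adding a new range is a cheap way to catch an overlapping dial space before users do.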
Call Setup
• Phones get calls for other locations
  – Numbers and hunt groups tied to phone, not line
  – Phone moved but call server not updated
• Spend time on a good dial plan!
  – 10-digit, multi-tenant plan
  – Map dial spaces onto this plan
  – Can still do 4-digit (or N-digit) dialing
  – Allows for growth, merger, acquisition
  – Much, much less expensive to maintain
  – Note: include planning to avoid toll fraud
Call Setup
• TCP is used between call server and endpoints
  – Routing problem between call controller & endpoints
  – Typically won't get dial tone or registration
  – Ping, traceroute, ACL checks, etc. (sound familiar?)
  – Endpoints include PSTN gateways and DSPs*
[Figure: failure points: no route to endpoints; no route to call controller; firewall, ACL, or routing problem]
* Digital Signal Processor
Call Setup
• DSP required to match codecs or for conference calls
• Troubleshooting
  – CUCM log: “no resources”
  – Monitor DSP pool utilization
    • Cat 6500: show port voice active
    • Command syntax and limits depend on hardware
• Solution: buy more hardware
[Figure: DSP transcoder converting between G.711 and G.729 on a call between Central Site and Remote Branch]
Call Operation - No-Way Audio
• Audio RTP data sent in UDP datagrams
• Endpoints don't have connectivity
  – Routing problem
  – Firewall or ACL blocking a path
  – Cisco Skinny payload carries IP address (NAT must know to change the embedded address)
• Use ping & traceroute to check reachability
[Figure: call steps 1-7 between Phone 1 and Phone 2, with step 7 blocked (X) in the ROTN*]
* Rest Of The Network
Call Operation - One-Way Audio
• Check basic connectivity
  – Firewall or ACL blocking one path
  – Routing problem
• Two-way, then one-way
  – Change in routing or configuration
  – DSP crash (when transcoding or conference call)
  – Link congestion and no QoS or bad QoS
• Troubleshooting
  – What changed? (routing & configuration)
  – Who was affected?
  – Log analysis
Call Operation - One-Way Audio
• Example: two-way, then one-way
  – 128Kbps serial link congestion and no QoS
• Routing change + big updates
  – Link flooded with routing update
  – Routing protocol uses up to 50% of link bandwidth
  – Incorrect bandwidth setting (default T1 on Cisco)
  – Routing prioritized above voice (default behavior)
  – Link congestion starved voice (took weeks to find)
[Figure: big routing update crossing a 128Kbps link configured with "bandwidth 1544" toward the ROTN]
Call Operation - Delay, Jitter, Packet Loss
• Causes:
  – Inconsistent or no QoS
  – Duplex mismatch or bad link
  – Routing problems (loss) or multipath (jitter)
  – Oversubscribed links (congestion & loss)
• Know when it's happening
  – Be able to detect the cause of each problem
  – Monitoring depends on vendor
    • RTCP stream (Avaya, Nortel)
    • Call stats on call server (Cisco)
• ITU specs: 150ms delay, 30ms jitter, 1% loss
[Figure: audio samples: G.729 good; 60ms jitter; 10% packet loss]
Call Operation - Delay
• ITU Spec: 150ms one-way delay
• Reduces interaction of a call
  – Wait for voice to travel to the other end of the call
  – Worst case is like a push-to-talk radio (Nextel?)
  – Roughly 10ms per 1000 miles (~30ms across the US)
• Causes:
  – Sub-optimum route path selection
    • New York to Atlanta via San Francisco
  – Long delay path, e.g., satellite circuit (250ms one-way)
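The rule of thumb above lends itself to a quick budget check. This is a minimal sketch that uses only the slide's figures (10ms per 1000 miles, 150ms one-way budget, 250ms satellite hop); the route mileages and function names are hypothetical. It covers propagation only, so codec, packetization, and jitter-buffer delay would need to be added on top.

```python
ITU_ONE_WAY_BUDGET_MS = 150.0   # one-way delay target from the slide
MS_PER_1000_MILES = 10.0        # rule of thumb from the slide


def propagation_delay_ms(route_miles: float) -> float:
    """Estimate one-way propagation delay from route mileage."""
    return route_miles / 1000.0 * MS_PER_1000_MILES


def within_budget(delay_ms: float, budget_ms: float = ITU_ONE_WAY_BUDGET_MS) -> bool:
    """Return True when the one-way delay fits the budget."""
    return delay_ms <= budget_ms


# Hypothetical route mileages, for illustration only.
routes = {"direct path": 900, "cross-country detour": 5500}
for name, miles in routes.items():
    delay = propagation_delay_ms(miles)
    print(f"{name}: {delay:.0f} ms, within budget: {within_budget(delay)}")

# Satellite figure from the slide.
print("satellite: 250 ms, within budget:", within_budget(250.0))
```

A detour by itself rarely breaks the budget, but it eats into the margin left for everything else on the path.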
Call Operation - Jitter
• Phones buffer packets to handle minor jitter
  – Packets with large jitter arrive too late and are dropped
  – Route flapping
  – Multipath load balancing
[Figure: two paths between New York, NY and Beaverton, OR: via Atlanta, Dallas, San Jose and via Chicago & Denver; link speeds 100M, 45M, 100M, 100M]
Call Operation - Jitter
• ITU Spec: 30ms jitter
• Big packets delay voice on low speed links
• Use Link Fragmentation and Interleaving
  – Choose fragment size for delays of about 15 ms
Serialization delay by link speed and packet size (bytes):
Link Speed   64       128      256      512      1024      1500
64Kbps       8 ms     16 ms    32 ms    64 ms    128 ms    187 ms
128Kbps      4 ms     8 ms     16 ms    32 ms    64 ms     93 ms
256Kbps      2 ms     4 ms     8 ms     16 ms    32 ms     46 ms
512Kbps      1 ms     2 ms     4 ms     8 ms     16 ms     23 ms
768Kbps      0.6 ms   1.2 ms   2.5 ms   5.1 ms   10.2 ms   15 ms
[Figure: 64Kbps link before and after fragmentation. Before: voice waits behind a 1500-byte data packet. After: 128-byte data fragments interleaved with 60-byte voice packets]
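The table values follow from serialization delay = packet bits / link rate. This is a minimal sketch that reproduces the arithmetic and derives a fragment size for the 15 ms target; the function names are hypothetical. The computed 768 kbps row comes out slightly higher than the values shown in the table, which appear to be rounded differently.

```python
def serialization_delay_ms(packet_bytes: int, link_kbps: float) -> float:
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 / link_kbps  # bits / (kilobits per second) = ms


def fragment_size_bytes(link_kbps: float, target_delay_ms: float = 15.0) -> int:
    """Largest fragment whose serialization delay stays within the target."""
    return int(link_kbps * target_delay_ms / 8)


sizes = (64, 128, 256, 512, 1024, 1500)
for speed in (64, 128, 256, 512, 768):
    row = [f"{serialization_delay_ms(size, speed):6.1f}" for size in sizes]
    print(f"{speed:>4} kbps:", " ".join(row), "ms")

print("Fragment size at 64 kbps:", fragment_size_bytes(64), "bytes")
```

On a 64 kbps link this gives a fragment of 120 bytes for a 15 ms target, in line with the 128-byte fragments in the figure.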
Call Operation - Jitter
• Inconsistent or no QoS implemented
  – Series of big packets delay voice
  – Only occurs when a link is oversubscribed
  – Priority queue moves voice to the front of the queue
  – Caution: Priority queue can starve lower priority queues; use policing to limit its effect
  – Configuration details vary among products
[Figure: before/after priority queuing on a 100Mbps link: G.711 voice packets (218 bytes, 19ms apart) and a burst of 40 data packets at 1500 bytes (6ms of data); labels: 33 pkts / 5ms of data, 7 pkts / 1ms of data, 5ms jitter]
Call Operation – Packet Loss
• ITU Spec: 1% packet loss (codecs handle 5%)
• Incorrect or no QoS configuration
  – Oversubscribed priority queue with policing
    • Designed for 4 concurrent calls, 20ms rate
      – G.729 on Frame Relay: 28.14 kbps *
      – G.711 on Ethernet: 91.56 kbps *
    • Facility expands and 8 concurrent calls occur
    • Policing on priority queue drops excess traffic
  – Monitor QoS queue drops
• VoIP traffic not properly classified
  – Dropped when congestion occurs
* google: “cisco codec bandwidth” for calculators
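The oversubscription example can be reproduced with per-call bandwidth arithmetic. This is a minimal sketch for G.711 at a 20ms packet rate: a 160-byte payload plus 40 bytes of IP/UDP/RTP headers. The 18 bytes of Ethernet framing and the 5% extra allowance are assumptions inferred because they reproduce the 91.56 kbps figure above; they are not stated in the source, and the function name is hypothetical.

```python
IP_UDP_RTP_BYTES = 40  # 20 IP + 8 UDP + 12 RTP


def call_bandwidth_kbps(payload_bytes, l2_overhead_bytes, packet_ms=20, extra_fraction=0.0):
    """Per-call, one-direction bandwidth including headers, in kbps."""
    packet_bytes = payload_bytes + IP_UDP_RTP_BYTES + l2_overhead_bytes
    packets_per_second = 1000 / packet_ms
    return packet_bytes * 8 * packets_per_second / 1000 * (1 + extra_fraction)


# G.711 at 20 ms: 160-byte payload; assumed 18 bytes of Ethernet framing.
g711 = call_bandwidth_kbps(160, 18)
# Same call with an assumed 5% allowance on top.
g711_padded = call_bandwidth_kbps(160, 18, extra_fraction=0.05)

# Policer sized for 4 calls, then offered 8, as in the example above.
policer_kbps = 4 * g711_padded
offered_kbps = 8 * g711_padded
print(f"G.711 per call: {g711:.2f} kbps ({g711_padded:.2f} kbps with allowance)")
print(f"Policer: {policer_kbps:.2f} kbps, offered: {offered_kbps:.2f} kbps")
print(f"Share dropped by policer: {1 - policer_kbps / offered_kbps:.0%}")
```

With twice the designed call count, roughly half of the voice traffic exceeds the policer, which degrades every call in the queue rather than only the newest ones.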
Call Operation – Packet Loss
• Example: Intermittent calling
  – Oversubscribed 100Mbps link (backup to 1G link)
  – No QoS on signaling or bearer data
  – Keepalives to call controller were regularly lost
  – Calls only placed when registered to call controller
• Symptoms
  – Intermittent calling possible
  – Loss of calling and poor call quality during network busy times
• Solution
  – QoS on the 100Mbps link
  – Better solution: QoS end-to-end
Call Operation – Packet Loss
• Duplex mismatch (very common)
  – Fixed configuration on one end of link
  – The fixed configuration end doesn't negotiate
  – Look for errors: FCS, Runts, Late Collisions
  – Use auto-negotiate for phones
    interface FastEthernet 0/1
     duplex auto
• Bad cabling
  – Bad crimp
  – Cat 3 cable
  – Pinched cable
• Use 'duplex auto'
[Figure: one end fixed at "duplex full" (int fa0/1), the other set to "duplex auto": 1. Try negotiation; 2. No response; 3. Must set duplex half]
Call Operation – Echo
• Listener vs. talker echo (talker is more frequent)
• Due to signal crossover
• Audio feedback
  – Remote speaker output feeds back into the microphone
  – Sources: speakerphones, earpieces, cell phones
  – Increase echo cancellation timer
• DSP transcoding bugs
• Delay inherent in VoIP accentuates echo
Call Operation – Echo
• 4-wire to 2-wire hybrid - electrical coupling
  – Decrease amplitude of echo
    • Decrease output gain
    • Increase input attenuation
    • Make small changes (20% steps) and check quality
  – Increase echo canceling timer
[Figure: echo at the hybrids between the PSTN (RX/TX) and an analog phone]
Other Services – Music on Hold
• No Music on Hold
  – Music on Hold not configured
  – Configuration error – filename incorrect or changed
  – MoH source died
  – All phones affected
[Figure: MoH source down; no MoH at any phone]
Other Services – Music on Hold
• Sporadic Music on Hold (most common)
  – Unicast overload on call server (syslog message)
    • Varies by phone and load (time of day)
  – Multicast configuration problem
    • Determine scope of the outage
    • Network routing problem
[Figure: unicast MoH source with too many clients; multicast MoH source where the segment without multicast gets no MoH]
Survivable Remote Site Telephony (SRST)
• Symptom: Phones can’t register with SRST router
• SRST not configured on phone & router
• More phones or directory numbers than SRST router supports
• Short DHCP lease (increase to 8 days)
[Figure: remote site with WAN link down (X) and a PSTN connection]
What about Video?
• What type:
  – Streaming or video conferencing?
  – Unicast or multicast?
  – Primarily dynamic or static images (surveillance)?
  – Audio is a VoIP data stream
• Like voice: delay, jitter, packet loss
• Bursty due to “I” frames that contain full image
• “P” and “B” frames are updates to the “I” frame
What about Video?
• Unlike voice
  – Dropouts are less important
  – Bursty, so don't mix with voice in priority queue
• QoS recommendations
  – Queue below voice due to burstiness
  – Interactive - DSCP 34 (AF41); Streaming - DSCP 32 (CS4)
  – Allocate video bandwidth + 20%
Summary: Troubleshooting
• Configuration mistakes are the major cause of problems
• Collect data; subdivide the problem
• Test hypothesis and repeat
• Use the Network and Operational Models to subdivide the problem and aid troubleshooting
[Figure: network model and VoIP operational model, as shown in "The Network is the Foundation for VoIP"]
Part 2: Monitoring and Metrics
• Establish monitoring requirements
• What to monitor and why
• Available tools
• Identify useful metrics
Manual processes don't scale
• More than 20-50 devices is too big for manual monitoring
• Check system interdependencies
  – Root bridge depends on the switches in the STP domain
  – Duplex mismatch depends on connected device
  – Routing protocol consistency
  – VoIP call quality
  – QoS configurations
[Figure: network with PSTN GW]
What you should be doing
• Manual processes are low value
  – Fire-fighting network problems
  – Collecting raw data
  – Basic analysis of raw data
• Contributing value to the business
  – Planning network upgrades to support future business initiatives
  – Implementing technologies that support today's business
  – Troubleshooting and solving significant problems
  – Creating scalable processes and procedures
Monitoring Requirements
• Real-time
  – Events; Performance; Error detection
• Trending
  – Historical utilization and operational data
• Baseline
  – Categorizing normal behavior; Inventory
• Configuration management
  – Saving configs and checking against policies
• Latent problem detection
  – Combining data to find potential problems
• Diagnostic
  – Special tools for troubleshooting
Metrics
• Measurable
  – Link, CPU, memory utilization
  – QoS queue drops
  – Interface errors
• Actionable
  – Must be usable for identifying and fixing problems
• Update frequency
  – Nyquist sampling theorem: sample at 2X the frequency of the data
  – Dependent on the use
    • Trending and historical
    • Real-time & diagnostic
Realtime – Events
• Syslog & SNMP traps
  – Sent asynchronously by network gear
  – High volume (particularly firewalls)
  – UDP-based (unreliable delivery)
  – Informational through critical severity
• Log everything
  – Keep for historical reference
• Filters for different recipients
  – Network operations team
  – Unified communications team
  – Security team
• Sync device clocks with NTP
  – Correlate timestamps from multiple devices
Realtime – Event Processing
• Handling the volume
  – Filter out unimportant events
  – Tune filters over time
• Daily summary report
    Summary of GNS Cisco syslog Messages on Wed Jan 17 23:59:00 EST 2007
    Cisco Messages:
      437 DUAL-5-NBRCHANGE
      353 LINEPROTO-5-UPDOWN
      114 CRYPTO-6-IKMP_MODE_FAILURE
      ...
    Messages sorted by frequency and source device:
      346 test1.com DUAL-5-NBRCHANGE
      114 test2.com CRYPTO-6-IKMP_MODE_FAILURE
      84 test3.com LINEPROTO-5-UPDOWN Tunnel119
      67 test4.com DUAL-5-NBRCHANGE
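A summary like the one above can be produced with a short script. This is a minimal sketch, not the tool that generated the report: it assumes each log line starts with the source host name followed by a space and contains a Cisco-style %FACILITY-SEVERITY-MNEMONIC tag. The sample lines and function name are hypothetical.

```python
import re
from collections import Counter

# Cisco-style mnemonic: %FACILITY-SEVERITY-MNEMONIC
MNEMONIC = re.compile(r"%([A-Z0-9_]+-\d-[A-Z0-9_]+)")


def summarize(lines):
    """Count messages per mnemonic and per (host, mnemonic) pair.

    Assumes each line starts with the source host name followed by a space.
    """
    by_type, by_host = Counter(), Counter()
    for line in lines:
        match = MNEMONIC.search(line)
        if not match:
            continue  # not a Cisco-style message; skip it
        host = line.split(" ", 1)[0]
        by_type[match.group(1)] += 1
        by_host[(host, match.group(1))] += 1
    return by_type, by_host


# Hypothetical sample lines, not taken from a real log.
sample = [
    "test1.com %DUAL-5-NBRCHANGE: neighbor change",
    "test1.com %DUAL-5-NBRCHANGE: neighbor change",
    "test3.com %LINEPROTO-5-UPDOWN: line protocol changed state",
]
by_type, by_host = summarize(sample)
for mnemonic, count in by_type.most_common():
    print(f"{count:5d} {mnemonic}")
for (host, mnemonic), count in by_host.most_common():
    print(f"{count:5d} {host} {mnemonic}")
```

Sorting by frequency puts the noisiest message types and devices at the top, which is where filter tuning usually starts.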
Realtime – Cisco Events
• Cisco search: “System Error Messages for Cisco Unified Communications Manager”
• CCM_CALLMANAGER-CALLMANAGER-3-CallManagerFailure: includes reason for failure
• CCM_CALLMANAGER-CALLMANAGER-3-SDLLinkOOS: Cluster communications link failure
• CCM_CALLMANAGER-CALLMANAGER-3-DeviceTransientConnection: incomplete device registration; large numbers of these indicate a problem
• CCM_CALLMANAGER-CALLMANAGER-4-MediaResourceListExhausted: media resource type not found (e.g. DSP, MusicOnHold, etc)
• CCM_CALLMANAGER-CALLMANAGER-3-TspError: phone registration problem
• CCM_CALLMANAGER-CALLMANAGER-6-Database: various internal database errors
Realtime – Example Events
• Critical network events
  – LINK-3-UPDOWN: backbone and important links
  – CDP-4-DUPLEX_MISMATCH: high utilization links
  – LINK-4-ERROR: excessive link errors
  – SYS-5-RESTART: device restarted
  – DUAL-3-SIA: EIGRP routing protocol problem
  – SYS-{1345}-SYS-LCPERR{1345}: Cat 6500 internal error
• Tune your own filters based on activity in your network
• Events are specific to each vendor
Realtime - Event Tools
• Splunk
  – Search log files for key patterns
  – Easy to use and tune for events on your network
• LogLogic
  – Often used by big companies for compliance
  – Handles many types of log files, including server logs
• SolarWinds (Kiwi CatTools)
  – Just acquired – may take a while to be integrated
• Syslog-ng
  – Open source
  – Flexible, can forward events to other systems
  – Good filtering capability
Realtime – Performance
• Thresholds – trigger alert when exceeded
  – Interface errors > 1E-6 (~1E-10 bit error rate for 1000B average packet size)
  – QoS priority queue drops (CISCO-CLASS-BASED-QOS-MIB)
  – 5-minute CPU over 80%
  – 15-minute WAN interface over 70% *
  – 15-minute LAN interface over 80% *
  – Router or switch memory utilization over 75%
• Correlate utilization peaks with VoIP problems
* Utilization depends on interface speed setting
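The thresholds above reduce to simple arithmetic on counter deltas. This is a minimal sketch with hypothetical sample values and function names: it shows the packet-error-to-bit-error conversion, and why utilization computed against a wrong interface speed setting (here a 128 kbps link left at the T1 default of 1544 kbps) hides a real problem.

```python
def error_ratio(error_delta: int, packet_delta: int) -> float:
    """Errored packets divided by total packets over one polling interval."""
    return error_delta / packet_delta if packet_delta else 0.0


def utilization_pct(octet_delta: int, interval_s: float, speed_bps: float) -> float:
    """Link utilization from an octet counter delta and the configured speed."""
    return octet_delta * 8 / (interval_s * speed_bps) * 100


def approx_bit_error_rate(packet_error_ratio: float, avg_packet_bytes: int = 1000) -> float:
    """Rough BER, assuming at most one bit error per errored packet."""
    return packet_error_ratio / (avg_packet_bytes * 8)


# Hypothetical 15-minute sample from a 128 kbps WAN link.
octets, errors, packets = 11_000_000, 1, 40_000
util_right = utilization_pct(octets, 900, 128_000)
util_wrong = utilization_pct(octets, 900, 1_544_000)  # speed left at T1 default

print(f"Utilization at 128 kbps: {util_right:.1f}% (alert: {util_right > 70})")
print(f"Utilization at 1544 kbps: {util_wrong:.1f}% (alert: {util_wrong > 70})")
print(f"Error ratio: {error_ratio(errors, packets):.1e} (alert: {error_ratio(errors, packets) > 1e-6})")
```

The same traffic reads as about 76% or about 6% depending on the configured speed, so the WAN threshold only works when the bandwidth setting matches the circuit.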
Realtime – VoIP Performance
• Delay, Jitter, Loss stat collection
  – Cisco: Call Detail Record & Call Management Record collection
  – Avaya: RTCP stream directed to collector
• ITU specs:
  – Delay: 150ms one-way
  – Jitter: 30ms
  – Loss: 1%
• Determine your thresholds
  – Military often uses much higher values
  – 1% packet loss is terrible for data
  – NY to SF is 30ms one-way
Realtime – Triggers
• Call completion failure codes
  – search cisco.com “Call Termination Cause Codes”
• Environmental failures other than events
  – High power supply utilization
  – Fan failure (should be an event, but uses UDP)
  – Temperature
  – UPS battery reserve, AC supply status, etc.
  – Change in STP root bridge
  – Redundant router (HSRP/VRRP) change
Event and Realtime VoIP Tools
• Cisco
  – CDR Analysis and Reporting tool (to report on Cisco call completion codes)
  – Microsoft Log Parser on Microsoft-based CUCM
  – Realtime Monitor in CUCM v6 and later
• Avaya
  – VoIP Monitor (Vmon) reports many stats
  – Control Network Analyzer is like a sniffer for diagnosis
  – Fault Performance Manager for fault and performance viewing
  – Trunk Group Analyzer to see trunk utilization
Trending
• Know “what's normal”
• Historical trend– Predict when and where to change capacity– Use for input to the design process
• “Top N” in percentage utilization change (month, quarter, 6-months, year)
• Use 95th Percentile per day, week, or month
Realtime – 95th Percentile Performance
• Used by many carriers for billing
• Averaging all samples distorts results
• Example: Peak 1.32Mbps; Average 0.27Mbps
Realtime – 95th Percentile Performance
• Sort all samples; discard top 5% of samples
• Highest remaining sample is the 95th percentile
• 0.75Mbps for this link
• 'Busy-hour'
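The procedure above is short enough to sketch directly. This is a minimal sketch with hypothetical utilization samples (not the data behind the 0.75Mbps figure) and a hypothetical function name: sort, discard the top 5% of samples, and take the highest one left.

```python
def percentile_95(samples):
    """Sort, discard the top 5% of samples, and return the highest one left."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    discard = len(ordered) // 20  # top 5% of the sample count
    return ordered[len(ordered) - discard - 1]


# Hypothetical utilization samples in Mbps: mostly quiet, a few short bursts.
samples = [0.2] * 90 + [0.7] * 6 + [1.3] * 4
average = sum(samples) / len(samples)

print(f"Peak: {max(samples)} Mbps")
print(f"Average: {average:.3f} Mbps")
print(f"95th percentile: {percentile_95(samples)} Mbps")
```

The average understates the busy periods and the peak overstates them; the 95th percentile sits between the two and tracks the load the link has to carry most of the time.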
Trending
• Correlate with configurations to find latent problems
• Trends in call quality (CDR/CMR trending)
• UPS battery life and planning replacements
• CPU & Memory utilization trends, particularly in software-based routers
• QoS queue drops
Trending Example
• Memory leak – router crash every twelve days
Drill-down to 10.17.8.102 stats, monthly view
Trending – VoIP Resource Utilization
• DSP pool utilization (CISCO-DSP-MGMT-MIB)
  – cdspCardResourceUtilization
    • Indicates the percentage of current DSP resource utilization of the card
  – cdspCardLastHiWaterUtilization
    • Indicates the last high water mark of DSP resource utilization
  – Calculate total utilization across all cards
• Trunk channel utilization & CUCM monitoring
  – CISCO-CCM-MIB-V1SMI: ccmGatewayTrunkTable
  – Calculate utilization from total and in-use counts
• Metric
  – 70% for growing organization; 90% for no growth
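The "total and in-use counts" calculation and the 70%/90% metric can be sketched as follows. The channel counts and function names are hypothetical, and the code assumes the counts have already been collected (for example by SNMP polling); it does not perform any polling itself.

```python
def total_utilization_pct(in_use, capacity):
    """Combined utilization across all trunks or cards, in percent."""
    return 100 * sum(in_use) / sum(capacity)


def over_threshold(utilization: float, growing: bool) -> bool:
    """Apply the slide's metric: 70% if the organization is growing, 90% if not."""
    return utilization >= (70 if growing else 90)


# Hypothetical channel counts per trunk.
in_use = [20, 18, 12]
capacity = [23, 23, 23]
util = total_utilization_pct(in_use, capacity)

print(f"Total utilization: {util:.1f}%")
print("Act now (growing organization):", over_threshold(util, growing=True))
print("Act now (no growth):", over_threshold(util, growing=False))
```

Summing counts before dividing weights each trunk by its size, which a plain average of per-trunk percentages would not do when capacities differ.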
Baseline
• Inventory
  – What do you have and what are its capabilities
  – Tracking end-of-life equipment and software
• Utilization
  – Know historical utilization levels
  – “How long has it been like this?”
• Call Quality
  – Validate initial installation
  – Historical basis for tracking future changes
Trending and Baseline Tools
• Netcordia NetMRI
  – Excellent network discovery and baseline
  – Analysis of collected data
• SolarWinds
  – Real-time network utilization graphs
  – Well-respected product
  – Excellent user community and online forum
• CA/Concord/Spectrum
  – High-end product
  – Takes a while to get installed and customized
Network Visibility – Flow Data
• NetFlow, sFlow, IPFIX
  – All similar functionality – flow statistics
  – Which systems are talking to which systems
    • Protocol (TCP/UDP/ICMP)
    • Port numbers
    • Data volume
    • Number of conversations
• Find top talkers, top pairs, top protocols
• Identify active holes between voice & data networks
Network Visibility – Flow Tools
• Fluke Networks: NetFlow Tracker
  – “All the flows, all the time” (or something like that)
• Plixer: Scrutinizer
  – Inexpensive, reasonable starter
• SolarWinds
  – Reasonable price
  – Check their forums for customer feedback
• cFlowd
  – Open source
Configuration Management
• Greatest impact on network stability and faults
  – Majority of network problems are due to configuration mistakes
  – More than 40%; amount depends on the analyst
  – Impossible to get to five-nines without it
• What to track
  – Who made the change
  – What changed
  – When was it changed
  – Use a AAA server (RADIUS or TACACS+)
• Critical in VoIP networks
Configuration Management
• Basic requirements
  – Configuration archive
  – Check running vs. saved configurations
  – Log configuration changes
  – Tools to view changes
Configuration Management
• Example: The Site That Lost Its VoIP
  – Major VoIP deployment
  – No automated tools in place
  – All routers and switches updated at the site
  – Two weeks later: power outage at the site
  – VoIP is down
  – Analysis: Configurations were not saved to NVRAM
Configuration Policy
• Policy definition process
  1. Policy defined
  2. Template created
  3. Per-device modifications made to template
  4. Install final configuration in the device
• Policy is infrequently reviewed afterwards
  – Configs diverge from policy as changes accumulate
  – Manual methods are tedious and error-prone
POLICY
    Hostname
    Internal DNS
    Internal NTP
    Router loopback
TEMPLATE
    hostname router
    ip name-server 10.1.1.12
    ntp server 10.1.1.12
    interface lo0
     ip address 10.2.X.Y
DEVICE CONFIG
    hostname b3-core-1
    ip name-server 10.1.1.12
    ntp server 10.1.1.12
    interface lo0
     ip address 10.2.1.1
Validating Configuration Policy
• Not just regulatory – check best practices
• Mechanism
  – Compare templates with device configs
  – Identify differences
  – Create an alert
• Value
  – Validate existing policies
  – Identify devices that don't match a new policy
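The compare, identify, and alert mechanism can be sketched with Python's standard difflib module. The expected lines reuse the b3-core-1 example from the Configuration Policy slide; the drifted device config (a changed NTP server) and the function name are hypothetical. Commercial tools do considerably more, such as per-device variable substitution and tolerance for line ordering.

```python
import difflib

# Template lines from the policy example, with per-device values filled in.
expected = """hostname b3-core-1
ip name-server 10.1.1.12
ntp server 10.1.1.12
interface lo0
ip address 10.2.1.1""".splitlines()

# Hypothetical device config that has drifted from the template.
actual = """hostname b3-core-1
ip name-server 10.1.1.12
ntp server 10.9.9.9
interface lo0
ip address 10.2.1.1""".splitlines()


def policy_differences(expected_lines, actual_lines):
    """Return unified-diff lines that were added or removed."""
    diff = difflib.unified_diff(expected_lines, actual_lines,
                                fromfile="template", tofile="device", lineterm="")
    return [d for d in diff
            if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))]


differences = policy_differences(expected, actual)
if differences:
    print("ALERT: config differs from policy")
    print("\n".join(differences))
```

The same comparison applied to a device's running and saved configurations would flag changes that were never written to NVRAM.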
Fixing Configuration Policy Exceptions
• Remediation
  – Some policy exceptions can be automatically fixed
    • Duplex mismatch
    • Bridge priority
    • Router ARP timer > switch CAM timer
  – Service-impacting changes need manual application
• Without automated policy validation, configs become inconsistent
• QoS policies
  – Trusting QoS in the right places?
  – Correct QoS marking policies in place?
Latent Problems
• Problems that exist but don't have an impact (yet)
• Often a configuration error
• Latent problems need a triggering action
  – Router redundancy
  – STP root bridge selection
  – Config not saved
Latent Problems – No Redundancy
• HSRP & VRRP
  – No redundant router
  – First failure was not noticed
Latent Problems – Wrong Root Bridge
• Root Bridge
  – Must determine switches in spanning tree domain
  – Check bridge priority on all switches in the domain
Configuration Management Tools
• Netcordia NetMRI
  – Config repository, analysis, differences, etc.
  – Easy to install and get running
• SolarWinds
  – Config repository
  – Relatively easy to install and run
• CA Product Suite
  – Good for very large organizations
• HPOV
  – Good for very large organizations
Diagnostic Monitoring
• Fast polling
  – Look for periodic trends
  – Interface: look for other protocol updates
  – CPU & Memory: peak usage
• Link utilization to determine proper sizing and burst for BW limiting
• MRTG
Summary: Monitoring and Metrics
• The network is the foundation for VoIP
• VoIP is a complex system – many interdependencies
• Monitor key parameters with automated tools
• Use the Network and Operational Models to subdivide the problem and aid troubleshooting
[Figure: network model and VoIP operational model, as shown in "The Network is the Foundation for VoIP"]