Unlocking Systems and Data: The Key to Network Management Innovation Charles Kalmanek Internet & Network Systems Research V.P. AT&T Labs-Research 2006 IEEE/IFIP Network Operations and Management Symposium

Unlocking Systems and Data: The Key to Network Management Innovation

Charles Kalmanek

Internet & Network Systems Research V.P.

AT&T Labs-Research

2006 IEEE/IFIP Network Operations and Management Symposium


Vision for IP Network Management

Approach

• Manage the entire network, not network elements

• Instrument the network, rely on direct correlation of real data

• Model interactions to predict the effects of actions in advance

• Automate as much as possible, audit results

Goal: A robust, global, multi-service IP/MPLS network

[Figure: measure/control loop between the Network and a Network-wide model (auditing, “what-if,” etc.). Labels on the measure side: Topology, Configuration, Workflow; Offered Traffic, Routing, Fault. Labels on the control side: Provisioning, Changes to the Network. Input to the model: Design goals, policies.]


Why It’s Hard

Scale & Diversity Challenges
• Large, distributed networks (100,000’s of NE’s)
• Complex, diverse building blocks
• Ongoing maintenance, spanning multiple time zones
• Fragile IP network control planes
• Complex software systems on top

Constant change
• Architectural change, new features & services, new protocols…
• Customers join, leave, change/upgrade service
• Network “events” – failures, migrations, upgrades, etc.

Measurement and data challenges
• Inadequate implementation of the basics
• Data often locked up in NM systems “smokestacks”
• Diverse data sources, with highly variable data quality
• Limited direct measurements of causality
• Inadequate ability to trace events across the network


Tier-1 Service Provider Network

Legend
• PoP: Point-of-Presence
• P: Backbone (core) Router
• PE: Provider Edge Router
• CE: Customer Edge Router

Rough stats:
• 100s of offices
• 100s of Ps, 1000s of PEs, 10000s of CEs
• 100,000s of transport facilities

(Enterprise customer networks rival ISP’s in size & complexity!)

[Figure: Tier-1 network diagram. Labels: Customer Network (CPE, CE), Access Network (LEC), customer-facing PE interfaces, PoPs containing PE and P routers, Metro, Intercity (OC-48 or OC-192, DWDM), DWDM systems.]


Unlocking Network Data

Measurement data is essential to running the network
• Marketing and customer acquisition
• Network and customer care
• Network engineering and capacity management
• Research to improve / evolve the network

If you don’t have the data, you can’t design, manage, secure, or improve the network

If you can’t evolve systems, you can’t evolve the network

Example 1: Fault/performance management

Example 2: Router Provisioning


Network Troubleshooting

Goals
• Automate the entire life cycle of event detection and repair for every performance impacting event
– Detect, Localize, Diagnose, Fix, Verify
• Drive short and long term network, operations & systems improvements
– Use forensics to reveal chronic events

Systems and Tools
• Active and passive performance monitoring
– Each data source has its unique value and limitations
• Maintenance and troubleshooting require correlation across multiple data sets
– Associations of customers to access circuits, router interfaces, network policies, network elements, monitoring systems, …


Example: Cross-Layer Troubleshooting

IP composite link: multiple SONET links combined together
• Example: 5 OC192s
• IP routing does not take bandwidth into account.
– On component failure: how to decide between mechanisms to take traffic off the link, as a function of remaining capacity?

[Figure: logical IP link between LA and NY offered 3 units of traffic; with only 1 unit of capacity remaining, the link is congested.]
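The question on this slide can be illustrated by comparing offered traffic with the capacity left on the surviving members. The following is a minimal sketch, not the mechanism used in the network described here: the `CompositeLink` model, the `choose_action` function, the 0.8 utilization threshold, and the two action names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CompositeLink:
    """A logical IP link bundled from several members (hypothetical model)."""
    member_capacity: float  # capacity of one member, in arbitrary units
    members_total: int      # members configured in the bundle
    members_up: int         # members currently in service

    def remaining_capacity(self) -> float:
        return self.member_capacity * self.members_up


def choose_action(link: CompositeLink, offered_traffic: float,
                  max_utilization: float = 0.8) -> str:
    """Return 'keep' if the surviving members can carry the traffic, else 'cost_out'."""
    usable = link.remaining_capacity() * max_utilization
    return "keep" if offered_traffic <= usable else "cost_out"


# Loosely modeled on the figure: 3 units of traffic, 1 unit of capacity left.
degraded = CompositeLink(member_capacity=1.0, members_total=5, members_up=1)
healthy = CompositeLink(member_capacity=1.0, members_total=5, members_up=5)
print(choose_action(degraded, offered_traffic=3.0))
print(choose_action(healthy, offered_traffic=3.0))
```

A real decision would also have to consider where the diverted traffic lands, which this sketch ignores.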


Example: Cross-Layer Troubleshooting (cont.)

Detect:
• Packet loss from active measurements for a set of PE pairs

Localize/Diagnose:
• Temporal correlation: PE-PE measurement alerts occurring at the same time as flapping on several composite link members
• Spatial correlation: paths where packet loss occurs contain flapping composite link components (PE-PE measurements mapped to paths via route monitoring)

Diagnose:
• Congestion due to composite link component flapping

Fix:
• Short term: “cost out” the link
• Permanent: repair failing components

Verify:
• Packet loss alerts disappear
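The localization step combines two tests: an alert and a flap must be close in time, and the flapping link must lie on the path of the PE pair that raised the alert. The sketch below shows one simple way to combine them. The data shapes, the 60-second window, and the names `suspect_links`, `alerts`, `flaps`, and `paths` are assumptions for illustration; the slides do not describe the actual algorithm.

```python
from typing import Dict, List, Set, Tuple

Alert = Tuple[float, str, str]  # (timestamp_s, src_pe, dst_pe)
Flap = Tuple[float, str]        # (timestamp_s, link_id)


def suspect_links(alerts: List[Alert], flaps: List[Flap],
                  paths: Dict[Tuple[str, str], Set[str]],
                  window_s: float = 60.0) -> Dict[str, int]:
    """Count, per link, the loss alerts it could explain in both time and space."""
    scores: Dict[str, int] = {}
    for a_time, src, dst in alerts:
        on_path = paths.get((src, dst), set())  # links on this PE-PE path
        matched = set()
        for f_time, link in flaps:
            temporal = abs(a_time - f_time) <= window_s
            spatial = link in on_path
            if temporal and spatial:
                matched.add(link)
        for link in matched:  # count each link once per alert
            scores[link] = scores.get(link, 0) + 1
    return scores


# Hypothetical inputs: paths would come from route monitoring.
paths = {("PE1", "PE2"): {"LA-NY"},
         ("PE1", "PE3"): {"LA-NY", "NY-DC"},
         ("PE4", "PE5"): {"CHI-DAL"}}
alerts = [(100.0, "PE1", "PE2"), (110.0, "PE1", "PE3"), (5000.0, "PE4", "PE5")]
flaps = [(95.0, "LA-NY"), (105.0, "LA-NY"), (120.0, "CHI-DAL")]
print(suspect_links(alerts, flaps, paths))
```

In this made-up data the CHI-DAL flap is close in time to two alerts but is not on their paths, so only LA-NY is scored.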


Example: Chronic Control Plane Outage

Detect
• Active performance monitoring shows high loss at a PE

Localize/Diagnose
• Correlation of performance alerts, fault data, routing updates, configuration, and workflow logs reveals recurring pattern
– OSPF sessions flap during customer provisioning on some PE platforms
• Diagnosis: BGP starves OSPF processing on this class of PEs

Fix
• Short-term: process changes to control provisioning on this class of PE
• Long-term: better OSPF and BGP process scheduler for PE

Verify
• High loss disappears at the PE


Data Distribution Problem

• Many, diverse data feeds required
• Labor-intensive and error-prone to create and maintain each feed
• Ad-hoc development to convert, copy, encrypt, & ingest the data
• Several groups with business critical functions need network data
• Stringent delivery requirements (security, timeliness, reliability)

Network data
• Network inventory
• Route monitors, BGP tables
• SNMP link utilization & faults
• Syslog info (status, health, events)
• Active path monitoring
• Netflow
• Other: workflow, VoIP, transport

Customer data
• Access: location, circuit ID, IP addresses, CE platform, LEC interface, layer 2 info (Frame Relay, Ethernet, DSL, Private Line,…), router info (hardware, software version)
• Trouble tickets
• Performance and SLA reports
• Service orders


Data Correlation Framework

Flexible data/systems architecture
• Pluggable data-source specific collectors
• Data distribution bus
• Common real time and archival data store
• Variety of network management applications on top

Evolving domain knowledge
• It’s an iterative process: exploratory data mining (EDM)
– Apply statistical tools, visualization, “hunches,” …
– Export results to “case manager” for analysis

Diagnosis engines
• Near real-time drill down, forensics
• Temporal and spatial event clustering
• Scalable statistical mechanisms to uncover correlations
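As a small illustration of temporal event clustering, events can be grouped whenever the gap to the previous event stays under a threshold. This is one simple gap-based technique chosen for the example, not necessarily what the diagnosis engines use; the function name `cluster_by_time` and the 5-second gap are assumptions.

```python
from typing import List


def cluster_by_time(timestamps: List[float], max_gap_s: float) -> List[List[float]]:
    """Group timestamps so consecutive events in a cluster are at most max_gap_s apart."""
    clusters: List[List[float]] = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap_s:
            clusters[-1].append(t)  # close enough to the previous event
        else:
            clusters.append([t])    # gap too large: start a new cluster
    return clusters


# Hypothetical event times in seconds.
events = [10.0, 12.5, 13.0, 300.0, 302.0, 900.0]
clusters = cluster_by_time(events, max_gap_s=5.0)
print(clusters)
```

Spatial clustering would add a second grouping key, such as the network element or path on which each event was observed.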


Data/Systems Architecture

[Figure: layered architecture diagram. Bottom: Network. Collectors above it: OA&M Topology I/F, Netflow Collector, L3 Control Plane Collector, Active Probe Collector, Syslog Collector, CDR Collector, SNMP Collector. Middle: Data Distribution Bus (DDB) and Data Store Component (DSC). Applications on top, each with a GUI: Real-time Network Mgt Applications, End-to-end Reporting Application, Planning Application, Surveillance Application. Top: Internal Portal, Customer Portal.]

Data Distribution Bus
• Publish/subscribe system handling all incoming data feeds
• Supports multiple transport options, normalizes data to “standard” formats
• Reliably delivers data to consumers

Data Store Component
• Efficient long-term storage of operational data
• Automatic generation of schema, loading scripts, access scripts, data aging allowing non-DBAs to manage warehouse

Network data is available to multiple applications allowing auditing, correlation, reporting, EDM, …
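The publish/subscribe idea can be sketched in a few lines: each feed has a normalizer that turns raw input into a common record shape, and every subscriber of that feed receives the normalized record. This is an in-process toy, not the DDB itself. The class and method names and the `router|severity|message` line format are hypothetical, and reliable delivery and multiple transports are left out.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Record = Dict[str, object]
Handler = Callable[[Record], None]


class DataDistributionBus:
    """In-process stand-in for a publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._normalizers: Dict[str, Callable[[str], Record]] = {}

    def register_feed(self, feed: str, normalizer: Callable[[str], Record]) -> None:
        self._normalizers[feed] = normalizer

    def subscribe(self, feed: str, handler: Handler) -> None:
        self._subscribers[feed].append(handler)

    def publish(self, feed: str, raw: str) -> int:
        """Normalize one raw item and hand it to every subscriber of the feed."""
        record = self._normalizers[feed](raw)
        for handler in self._subscribers[feed]:
            handler(record)
        return len(self._subscribers[feed])


def parse_syslog(raw: str) -> Record:
    # Hypothetical "router|severity|message" line format.
    router, severity, message = raw.split("|", 2)
    return {"router": router, "severity": int(severity), "message": message}


bus = DataDistributionBus()
bus.register_feed("syslog", parse_syslog)
surveillance: List[Record] = []  # stands in for a surveillance application
archive: List[Record] = []       # stands in for the data store
bus.subscribe("syslog", surveillance.append)
bus.subscribe("syslog", archive.append)
delivered = bus.publish("syslog", "pe1|3|OSPF neighbor down")
print(delivered, surveillance[0])
```

Adding a new consumer is a single `subscribe` call, with no change to the collector that publishes the feed.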


Router Provisioning

Goal: translate service intent to network reality
• Get hardware & circuits to the right place at the right time
• Access & update network inventory databases
• Configure routers to establish and verify the service

Challenges
• Huge diversity at network element layer (dependencies on hardware & software versions, physical configuration, vendor, etc.)
• Low level configuration languages, no abstraction layer, multiple ways of achieving the same thing
• Config generator must consider hardware limitations, service definition, customer order info, additional customer info, etc.
• Commercial tools offer limited customizability, only solve pieces of the problem
• Initial provisioning is only part of the life cycle problem (network-wide changes, firmware mgt, auditing, CE-PE coordination, change requests, …)


Configuration File Analysis

Detect/Fix Discords
• Non-compliance to architectural intent
– e.g., errors in route-maps for VPNs crossing routing domains
• Config time-bombs
– e.g., gaps in the ACL perimeter defense

Additional Benefits
• Assessment, Bootstrapping automation, Decision Support

Technology
• Parsers, Algorithms, Rules and Queries encoding domain expertise: e.g., ACL analysis

[Figure: auditing loop. Router configuration is polled and converted to a low level standard form (tables); queries over the tables and the customer/network database produce discords, which provisioning fixes.]
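The "parse into tables, then query" approach can be sketched with a toy parser and an in-memory SQLite table. The configuration text below uses IOS-style `interface` and `ip access-group` lines, but the table layout, the `customer-facing` description convention, and the rule itself (every customer-facing interface must have an inbound ACL) are assumptions invented for this example.

```python
import sqlite3
from typing import List, Optional, Tuple

CONFIG = """\
interface Serial0/0
 description customer-facing
 ip access-group 101 in
interface Serial0/1
 description customer-facing
interface Loopback0
 description management
"""


def parse_interfaces(config: str) -> List[Tuple[str, str, Optional[str]]]:
    """Flatten interface stanzas into (name, description, inbound_acl) rows."""
    rows = []
    current = None
    for line in config.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "interface":
            current = {"name": words[1], "description": "", "acl": None}
            rows.append(current)
        elif current is not None and words[0] == "description":
            current["description"] = " ".join(words[1:])
        elif (current is not None and words[:2] == ["ip", "access-group"]
              and words[-1] == "in"):
            current["acl"] = words[2]
    return [(r["name"], r["description"], r["acl"]) for r in rows]


def find_acl_gaps(config: str) -> List[str]:
    """Query the table form for customer-facing interfaces with no inbound ACL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE interface (name TEXT, description TEXT, inbound_acl TEXT)")
    db.executemany("INSERT INTO interface VALUES (?, ?, ?)", parse_interfaces(config))
    query = ("SELECT name FROM interface "
             "WHERE description = 'customer-facing' AND inbound_acl IS NULL "
             "ORDER BY name")
    gaps = [name for (name,) in db.execute(query)]
    db.close()
    return gaps


print(find_acl_gaps(CONFIG))
```

Once the configuration is in table form, each new audit rule is another query rather than another parser.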


Automated CPE Router Provisioning

• Technical Questionnaire (Service Level)
– E.g., Web form
• Device/service specific templates, with embedded variables and callouts to computations and databases
– E.g., callouts for ports, IP addresses, ACL clauses, …
• Detailed device configuration commands, bundled as a “configlet” (Network Element Level)
• Logic: allocations of ports, IP addresses, VRFs, …


Template-driven Config Generation

Executing templates in a given context (stored in a database) produces configs, similar to code generation

– Evolves easily to integrate new features, router models, access types, resiliency options

– Eliminates errors, reduces holds

– Ensures conformance to engineering guidelines

Example: BGP configuration

router bgp <BGP_1.CE_ASN>
 no synchronization
 bgp log-neighbor-changes
 network <WAN_IF_1.NETIP:computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)> mask 255.255.255.252
 network <WAN_IF_2.NETIP:computeIpMask_Netip(<WAN_IF_2.IF_IP>,255.255.255.252)> mask 255.255.255.252
 network <ROUTER.LOOPBACKIP> mask 255.255.255.255

• Context substitution: e.g., <BGP_1.CE_ASN>
• Functional substitution: e.g., the computeIpMask_Netip callout
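The two substitution kinds can be imitated with a pair of regular expressions. This is a sketch under stated assumptions, not the actual template engine: the slides do not define the template grammar or what `computeIpMask_Netip` returns, so the sketch guesses that it yields the network address of an IP under a mask. The `render` function, the context keys, and the sample addresses are all hypothetical.

```python
import ipaddress
import re
from typing import Callable, Dict

TEMPLATE = """\
router bgp <BGP_1.CE_ASN>
 no synchronization
 bgp log-neighbor-changes
 network <WAN_IF_1.NETIP:computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)> mask 255.255.255.252
 network <ROUTER.LOOPBACKIP> mask 255.255.255.255
"""


def compute_ip_mask_netip(ip: str, mask: str) -> str:
    """Assumed meaning of computeIpMask_Netip: network address of ip under mask."""
    return str(ipaddress.ip_network(f"{ip}/{mask}", strict=False).network_address)


FUNCTIONS: Dict[str, Callable[..., str]] = {"computeIpMask_Netip": compute_ip_mask_netip}

CONTEXT_VAR = re.compile(r"<([A-Z0-9_]+\.[A-Z0-9_]+)>")
FUNCTION_CALL = re.compile(r"<[A-Z0-9_.]+:(\w+)\(([^()]*)\)>")


def render(template: str, context: Dict[str, str]) -> str:
    # Context substitution first, so function arguments are already concrete.
    text = CONTEXT_VAR.sub(lambda m: context[m.group(1)], template)

    # Functional substitution: call the named function on its comma-separated args.
    def call(m: "re.Match[str]") -> str:
        args = [a.strip() for a in m.group(2).split(",")]
        return FUNCTIONS[m.group(1)](*args)

    return FUNCTION_CALL.sub(call, text)


# Hypothetical context, as might be read from a provisioning database.
context = {"BGP_1.CE_ASN": "65001",
           "WAN_IF_1.IF_IP": "192.0.2.5",
           "ROUTER.LOOPBACKIP": "198.51.100.1"}
config = render(TEMPLATE, context)
print(config)
```

Supporting a new feature then means editing the template or registering another function in `FUNCTIONS`, which matches the slide's point that templates evolve without new systems development.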


Conclusions

Unlocking data and fault/performance management systems enables innovation

• Exploratory data mining and data correlation are essential to forensics and network maintenance automation

• Approach: Flexible data distribution and data storage architecture

Unlocking provisioning systems enables innovation
• Bottom-up analysis is a useful tool for discord-detection, etc.
• Template driven approach allows network engineering to add new network features without new systems development

Challenges are legion…
• How to overcome proprietary data models, systems thwarting forensics?
• How to efficiently find needles in (massive) data haystacks?
• How to raise the level of provisioning abstraction?
• How to reduce the systems drag on network feature and architecture change?