
NOC Services Delivery: Discovery, Monitoring and Escalation Process

The NOC (Network Operations Center) discovers all the components of your infrastructure, from the underlying network infrastructure, storage, and blades right up to the virtual machines hosted on the hypervisor. All components are grouped logically and hierarchically, making it easy to navigate through your infrastructure. This helps your IT team understand inter-component relationships and perform root-cause and impact analysis when a problem occurs.

The discovered devices can then be marked as ‘managed devices’ as agreed with the local IT Team. The managed devices are then grouped by business unit, by location, or by a custom scheme defined by the local IT Team.

Mid-term changes, such as adding a new device or decommissioning an existing one, are made by following the change order process.

The devices are monitored from the deployed monitoring solution. The monitoring solution provides the flexibility to monitor availability, performance, services, ports, event logs, SNMP OIDs, traps, syslogs, synthetic transactions, DNS, and websites, as well as application-specific custom monitoring using APIs, for closet-to-cloud infrastructures.

Haladon will review the local IT Team’s infrastructure setup and propose the monitoring templates/parameters for each of the infrastructure components. Monitoring frequency and thresholds are agreed upon by the onboarding team and the local IT Team before they are deployed.

Mid-term review and revision of the monitoring templates/parameters will be taken up in conjunction with the local IT

Team based on the observations and inputs from the field.

There are two types of event categories:

• Discrete state changes, sent asynchronously by an agent installed on the target managed element.

• Threshold breaches, indicating that an element is no longer operating within “normal” parameters, sent asynchronously by the agent installed on the target managed element. “Normal” is based on a predefined, default, out-of-the-box threshold; a KAR-defined, customized setting; or a dynamic, measured baseline.
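The three ways of defining “normal” can be illustrated with a short Python sketch. This is not the monitoring solution's implementation: the function names, the 90% default, and the mean-plus-three-standard-deviations baseline rule are assumptions chosen only to make the idea concrete.

```python
from statistics import mean, stdev
from typing import Optional, Sequence

DEFAULT_CPU_THRESHOLD = 90.0  # hypothetical out-of-the-box default, in percent


def effective_threshold(
    history: Sequence[float],
    custom: Optional[float] = None,
    use_baseline: bool = False,
    sigmas: float = 3.0,
) -> float:
    """Pick the threshold: customized setting, dynamic baseline, or default."""
    if custom is not None:
        return custom
    if use_baseline and len(history) >= 2:
        # Assumed baseline rule: mean of past samples plus a multiple of their spread.
        return mean(history) + sigmas * stdev(history)
    return DEFAULT_CPU_THRESHOLD


def is_breach(value: float, threshold: float) -> bool:
    """A breach occurs when the measured value exceeds the threshold."""
    return value > threshold
```

With a measured history of roughly 40% CPU, a reading of 50% would breach the dynamic baseline even though it is well below the default threshold, which is the practical reason for offering a baseline option.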

Every event captured for a threshold breach by the SG or Vistara Agent is pushed to the Alert Browser in the Cloud Portal. The Alert Browser correlates the alert parameters and transforms them into a human-readable format, which helps 1st-level engineers quickly pinpoint the severity of the issue, validate the alert with remote command executions, create an incident ticket with a priority, and assign it to the domain experts for resolution, all seamlessly from the same alert.

Any threshold or parameter changes are seamlessly pushed to the Alert Browser from the monitoring solution.

1. Auto Correlation:

The NOC processes events using filtering techniques for correlation. When multiple, repetitive events are received for the same problem on the same element, it stores the event once and increments a counter indicating the number of times it has been received, rather than flooding the user's screen with redundant events.
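The store-once-and-count behavior described above can be sketched in a few lines of Python. The class and field names are hypothetical, and the sketch assumes that an (element, problem) pair is enough to identify a repeated event; a real correlation engine would likely use richer keys.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CorrelatedEvent:
    element: str
    problem: str
    count: int = 1  # how many times this event has been received


class EventStore:
    """Stores one record per (element, problem) pair and counts repeats."""

    def __init__(self) -> None:
        self._events: Dict[Tuple[str, str], CorrelatedEvent] = {}

    def receive(self, element: str, problem: str) -> CorrelatedEvent:
        key = (element, problem)
        event = self._events.get(key)
        if event is None:
            # First occurrence: store the event once.
            event = CorrelatedEvent(element, problem)
            self._events[key] = event
        else:
            # Repeat: bump the counter instead of adding a new row.
            event.count += 1
        return event

    def visible_events(self) -> List[CorrelatedEvent]:
        """What the operator sees: one row per distinct problem."""
        return list(self._events.values())
```

Three identical "link down" events from one switch therefore appear as a single row with a count of 3.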

2. Notification and Escalation Management

The ITIL Incident Management System is integrated with the notification and escalation management engine. It provides automatic notification and escalation at predefined intervals. Notifications are sent on creation, ownership assignment, progress, status change, and closure of an incident, and also act as an alerting mechanism in case of an SLA breach. An incident can be escalated hierarchically to an individual or a group based on predefined parameters, thresholds, or manual override conditions.

The engine can capture the escalation matrix for each location or site. Escalation management can be set up differently for business hours, after hours, and weekends. The escalation contacts can be defined in a hierarchy, with the option to call, email, or both. The escalation contact details, hierarchy, and hours of operation can be modified dynamically as needed by the local IT Team, and the NOC team will follow the escalation as defined in the escalation matrix. Below is a sample escalation matrix:

The local IT Team will be able to change the matrix or contact details from the proposed solution.
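As an illustration of how an escalation matrix with separate business-hours and after-hours contacts might be represented and consulted, here is a minimal Python sketch. The contact names, the 09:00 to 18:00 business hours, and all function names are placeholders invented for the example; they are not taken from the NOC's actual matrix.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List


@dataclass
class Contact:
    level: int
    name: str
    methods: List[str]  # "call", "email", or both


# Hypothetical matrix for one site; names are placeholders.
ESCALATION_MATRIX = {
    "business_hours": [
        Contact(1, "Site IT Lead", ["call", "email"]),
        Contact(2, "IT Manager", ["email"]),
    ],
    "after_hours": [
        Contact(1, "On-call Engineer", ["call"]),
        Contact(2, "IT Manager", ["call", "email"]),
    ],
}

BUSINESS_START = time(9, 0)   # assumed opening hour
BUSINESS_END = time(18, 0)    # assumed closing hour


def schedule_for(moment: datetime) -> str:
    """Weekends and times outside business hours use the after-hours list."""
    is_weekend = moment.weekday() >= 5  # Saturday=5, Sunday=6
    in_hours = BUSINESS_START <= moment.time() < BUSINESS_END
    return "business_hours" if in_hours and not is_weekend else "after_hours"


def contact_for(moment: datetime, level: int) -> Contact:
    """Return the contact at the given level (1-based), capped at the top level."""
    contacts = ESCALATION_MATRIX[schedule_for(moment)]
    index = min(level, len(contacts)) - 1
    return contacts[index]
```

Keeping the matrix as plain data, separate from the lookup logic, reflects the requirement that contacts and hours can be changed without touching the escalation procedure itself.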

Notifications are escalated to the client in the following cases:

• On receipt of alerts/tickets

• Subsequent updates on the alert/ticket status

• Status change

• When customer intervention is required

• Closure of the ticket

3. Ticketing Management

The service offers 24x7 Monitoring, Alerting & Incident Management services from its secure NOCs. The services include

monitoring alerts from the infrastructure, validation of the alerts and ticketing, Incident Reporting and ITIL compliant

SOP-based remediation.

Each incident is recorded so that it can be tracked, monitored, and updated throughout the life cycle. This information

can then be utilized for problem management, reporting, process optimization and planning purposes.

Classification and initial support: The classification process categorizes an incident to determine its priority, focusing on the potential business impact. Priority is assigned after validating that the event occurred in the network and assessing its potential business impact. An incident ticket is then assigned to the respective domain experts for quick resolution.
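The validate-then-prioritize step can be sketched as follows. This is an assumed mapping for illustration only: this document names P0 for major incidents, while the impact levels and the P1 to P3 labels are hypothetical.

```python
from enum import Enum
from typing import Optional


class Impact(Enum):
    CRITICAL = "critical"   # e.g. an outage affecting the whole business
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Assumed mapping; only P0 (major incident) is named in this document.
PRIORITY_BY_IMPACT = {
    Impact.CRITICAL: "P0",
    Impact.HIGH: "P1",
    Impact.MEDIUM: "P2",
    Impact.LOW: "P3",
}


def classify(validated: bool, impact: Impact) -> Optional[str]:
    """Return a priority for a validated event, or None for an unvalidated one."""
    if not validated:
        return None  # no priority is assigned until the event is confirmed
    return PRIORITY_BY_IMPACT[impact]
```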

Investigation and diagnosis: This process deals with the investigation of an incident and the gathering of diagnostic data. The goal of the process is to identify how the incident can be resolved as quickly and efficiently as possible. The technical resources handle each incident until resolution, escalating to domain experts when required to meet predefined SLAs.

Major incident procedure: The major incident procedure (P0) exists in the NOC Service Delivery Framework to handle those critical incidents that require a response above and beyond that provided by the normal incident process. Although these incidents still follow the normal incident lifecycle, triggering the major incident procedure provides the increased coordination, escalation, communication, and resources that these high-priority events require.

Vendor Tech Support: This process deals with contacting the vendor's technical support team if needed to resolve the issue. Although the first objective is to find a workaround and keep the business running as usual, domain experts will work in the background with the vendor to resolve the issue. The local IT Team must, however, hold valid vendor support contracts for the NOC to coordinate.

Resolution and recovery: This process covers the steps that are required to resolve an incident, often by utilizing the

change management process for implementation of remedial actions. Once the issue has been resolved, recovery

options, such as restarting services after an application failure, may be required.

Closure: This process ensures that the customer is satisfied with the resolution and handling of an incident prior to

closing the ticket.

4. Handling of service requests

Different types of service requests require handling in different ways. The NOC team will be able to process certain

requests, while other requests need to be passed to other processes, such as change management and new equipment

onboarding.

5. Monitors In Detail

Monitors are set up for each device/application in the managed infrastructure using standard Windows WMI or SNMP data collection. The ITOP Cloud Portal also enables the NOC to remotely and securely access the monitored devices in order to perform SOPs (Standard Operating Procedures) or advanced troubleshooting. Below are examples of some of the key monitoring parameters the NOC can monitor.

Key Monitoring Parameters for Network Devices:

SWITCHES, ROUTERS & FIREWALL

Device Availability: Up/Down
Device Health: CPU and Memory Utilization
Interface Status: Up/Down
Interface Performance: Utilization, In/Out Traffic Rate
Interface Errors: Error and Discard Rate, CRC and Collision Errors
Buffer Usage: Small, Medium, Large and Huge Buffer Utilization and Failures
VPN: IKE and IPsec Tunnel Availability
Hardware Monitoring: Disk, Memory Modules, Chassis Temperature, Fan, Power, and Voltage Status

WIRELESS NETWORKS

Access Point Availability
Access Point Client Statistics
Network Health: Load, Interference, Noise and Coverage Status

REAL-TIME NETWORK PERFORMANCE MONITORING (SYNTHETIC TRAFFIC)

HTTP: URL Response Time
Network Health: Load, Interference, Noise and Coverage Status
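To make the interface utilization parameter listed for switches, routers and firewalls concrete, the following Python sketch shows one common way to derive it from two SNMP octet-counter readings. It assumes 32-bit counters (as with the IF-MIB `ifInOctets` and `ifOutOctets` objects) and an interface speed in bits per second (as with `ifSpeed`); the function names are illustrative, and the monitoring solution may compute the value differently.

```python
COUNTER32_MAX = 2 ** 32  # 32-bit SNMP counters wrap back to zero at this value


def octet_delta(previous: int, current: int, modulus: int = COUNTER32_MAX) -> int:
    """Difference between two counter readings, allowing for a single wrap."""
    return (current - previous) % modulus


def utilization_percent(
    prev_octets: int, curr_octets: int, interval_s: float, if_speed_bps: float
) -> float:
    """Share of the link capacity used during the polling interval, in percent."""
    bits = octet_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)
```

For example, 1,250,000 octets transferred in 10 seconds on a 10 Mbit/s interface corresponds to 10% utilization.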

Key Monitoring Parameters for Windows and Linux Servers:

WINDOWS OPERATING SYSTEM

Device Health: CPU, Memory and Disk Utilization
Windows Services: Up/Down (Default: All services with start-up type “Automatic”)
Windows Event Logs: Critical Application, System Logs
Server Hardware Monitoring: Disk, Memory Modules, and Chassis Temperature

LINUX OPERATING SYSTEM

Device Health: CPU, Memory and Disk Utilization
Linux Interfaces: Up/Down
Logs: Critical Logs
Server Hardware Monitoring: Disk, Memory Modules, and Chassis Temperature

Key Monitoring Parameters for SQL Server Database:

WINDOWS OPERATING SYSTEM

Server Availability: Up/Down
Server Health: CPU, Memory and Disk Utilization
Windows Event Logs: Critical Application, System Logs
Server Hardware Monitoring: Disk, Memory Modules, Chassis Temperature

DEVICE HEALTH

Device/Network/Cluster Availability
Device Health: CPU, Memory and Disk Utilization

SERVER THROUGHPUT METRICS

Number of Logical Connections, Logins/sec, Logouts/sec

SERVER MEMORY METRICS

Connection Memory Size, Granted Workspace Memory, Lock Memory Size/Blocks Allocated, Owner Blocks Allocated, Maximum Workspace Memory, Memory Grants Outstanding, Optimizer Memory Alert, SQL Cache Memory Size, Total Server Memory Size

SQL SERVER LOCK METRICS

Lock Wait Time (ms), Lock Requests/sec, Lock Timeouts/sec, Deadlocks/sec

SQL SERVER RESOURCE UTILIZATION METRICS

Data File Size, Replication Transaction Rate, Average and Active Transactions, Transactions/sec, Queued Jobs, Failed Jobs and Job Success Rate, Open Connections Count

SQL SERVER CACHE METRICS

Cache Hit Ratio (MSSQL), Cache Objects, Cache Pages, and Cache Objects in Use

SERVER DISK METRICS

Average Disk Reads/Writes/Transfers in Bytes, Disk Queue Length, Disk Read/Write Queue, Data Space of DB

SQL SERVER LOG METRICS

Log File(s) Size, Log Flush Wait Time, Log Flush Waits/sec, Log Flushes/sec, Log Growth and Shrink Rate

Total Latch Wait Time, Number of Replication Pending Transactions, User Connections

WINDOWS SERVICES MONITORING

SQL Server, Agent Service, Integration Services, Reporting Services, Analysis Services, Full-Text Services

SQL SERVER PHYSICAL I/O PERFORMANCE

Address Windowing Extensions (AWE) Lookup Maps/sec, AWE Stolen Maps/sec, Buffer Cache Hit Ratio, Checkpoint Pages/sec, Lazy Writes/sec, Page Lookups, Reads and Writes/sec

SQL SERVER HIGH AVAILABILITY MONITORING

Monitors to track Replication Latency, Mirror Synchronization, Lag in Log Shipping, Cluster Availability, and Failover/Failback

6. Interactive Graphing

With the interactive graphing feature, the NOC offers device-level graphs that show its monitoring capabilities on servers and network devices. The graphs are time-lined, which eases the analysis of various performance indicators, as shown in the examples below:

Fig: CPU Monitor Graph

Fig: Disk Utilization Monitor Graph

Fig: Memory Utilization Monitor Graph

Fig: PING Monitor Graph

Fig: Network Device Interface Monitor Graph

Fig: Network Device Interface Health Monitor Graph

7. Security

To address these problems effectively within their organizations, enterprises need to ask the following key questions:

• How can we record and audit the actions of IT staff (enterprise staff or co-sourced service provider staff) when they

work on our critical IT assets? How can we increase accountability?

• How can we identify violations of access?

• How can we selectively grant 3rd party service providers access to our IT assets to maintain them during off-hours, while ensuring accountability?

• How can we prove to our auditors that we have been access compliant over the past several years?

• How can we “bottle” the workflows of our senior and expert IT staff when they troubleshoot and solve IT problems? How can we make this knowledge and these skills available to others in our IT team for training?

8. Solution

Haladon provides highly optimized solutions for the operational management of IT assets at remote sites. An "as-if-you-were-there" interface to servers, clients, and network devices is provided remotely, using just a browser from anywhere in the world, regardless of whether there is a firewall between the user and the managed devices.

The NOC brings together a multitude of capabilities to deliver secure, high performance, and cost-effective discovery,

monitoring, management and remediation operations for remote sites.

All data transfer and management control communications between the NOC browser and the services gateway are secured via 256-bit AES SSL encryption. Similarly, all data transfer and management control communications between the ITOP cloud portal and IT assets are secured via IPSec or SSL. This level of encryption makes it computationally infeasible for an intruder to decrypt the communications, even if they gain unlawful access and eavesdrop.

Security: Role Based Access For Authorization, Data, & Control Privacy

The ITOP cloud portal supports a robust mechanism for role-based access that authorizes users by their roles. Each role can be defined to grant or withhold a set of permissions or access levels for performing certain key actions on the target device, or for accessing certain key objects in the system such as incidents, monitors, and session recordings. Roles can be security-partitioned by team and made private to that team, irrespective of whether its expertise is local or remote to the site. In situations where multiple 3rd party service providers augment the IT administration team, these administrators can be included in the role.
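A role-based check of the kind described above can be sketched in Python as follows. The data model, the permission strings such as "console:launch", and the rule that a role only takes effect for users of the team that owns it are assumptions made for illustration; they do not describe the ITOP cloud portal's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Role:
    name: str
    team: str  # the team this role is private to
    permissions: FrozenSet[str] = field(default_factory=frozenset)


@dataclass
class User:
    name: str
    team: str
    roles: Set[Role] = field(default_factory=set)


def is_allowed(user: User, permission: str) -> bool:
    """A user may act only through a role that belongs to the user's own team."""
    return any(
        role.team == user.team and permission in role.permissions
        for role in user.roles
    )
```

In this sketch, assigning a NOC role to a user from another team grants nothing, which is one simple way to model roles that are private to a team.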

Fig: Role-based Access control with ITOP Cloud Portal Solution

Audit & Compliance: Session Recordings, Audit Trails, & Reports

All access and sessions (e.g. RDP, VNC, Telnet, SSH, serial consoles) performed on the target server or device from the

NOC are recorded and logged via a "flight recorder". Each access requires the user to authenticate with a ticket number,

as well as to enter comments on the reason for access. Every recorded session also maintains a record of the remote

console activity including meta-data related to the access (start time, end time, user name and type of remote console,

ticket number and comments).
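The metadata listed above maps naturally onto a small record type. The Python sketch below is illustrative only: the field and function names are hypothetical, and the check simply enforces the stated rule that access requires both a ticket number and a reason.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

CONSOLE_TYPES = {"RDP", "VNC", "Telnet", "SSH", "serial"}


@dataclass
class SessionRecord:
    user_name: str
    console_type: str
    ticket_number: str
    comments: str
    start_time: datetime
    end_time: Optional[datetime] = None  # filled in when the session closes


def open_session(
    user_name: str,
    console_type: str,
    ticket_number: str,
    comments: str,
    now: datetime,
) -> SessionRecord:
    """Refuse access unless a ticket number and a reason are supplied."""
    if console_type not in CONSOLE_TYPES:
        raise ValueError(f"unsupported console type: {console_type}")
    if not ticket_number.strip() or not comments.strip():
        raise PermissionError("ticket number and access reason are required")
    return SessionRecord(user_name, console_type, ticket_number, comments, now)
```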

Audit & Compliance: RDP & VNC Remote Console Session Recordings

If the NOC administrator uses RDP or VNC remote consoles to access the target device, the two-way data transfer is brokered by the ITOP cloud portal, which stores all in-transit data in native format. The entire “session” between the NOC administrator and the target device is recorded and stored as a “session recording”. The ITOP cloud portal can replay these session recordings, showing a screen-by-screen “as if you were there” playback of the entire session at multiple speeds (1X to 10X). The native recording data stream is stored in an optimized format, but can be converted and exported to SWF format, making it playable on external players at different resolutions and speeds.

Fig: Session Record Player

Audit & Compliance: SSH, Telnet, & Serial Console Recordings

Recordings from SSH, telnet and serial console sessions are captured and stored in a text format. The recordings are

then indexed offline to enable fast text search capabilities via the portal. Recordings can be searched for content as well

as for comments entered before initiation or after completion of the remote console session.

Text-based recordings can also be played back, downloaded as a text file, or searched for specific sub-strings. The text-based console player can replay the recording on the original remote console timeline, skip idle time, or play back at 1X-4X speeds.

Surveillance: Access Controlled Session

The “access controlled session” capability allows an organization to “schedule” tickets for third party maintenance services on critical devices. These tickets provide un-chaperoned, off-hours, time-window-enforced, fully auditable access to specific target devices for third party organizations (e.g. a co-sourced provider).

The ITOP cloud portal supports workflows that enable an enterprise to schedule 3rd party consultants and maintenance services during off-hours. For example, if skilled remote technical expertise is needed on a weekend to perform a scheduled upgrade of a VoIP device, business-critical system, or core router, this activity can be scheduled ahead of time with an “access controlled session” ticket on the ITOP cloud portal. The remote 3rd party administrator (a guest user) is given limited access to only the specific target devices and consoles (via RBAC policy) during a pre-specified maintenance window. The guest user can log in to the ITOP cloud portal during the scheduled window, launch consoles on the specified target devices, and complete the assigned tasks.

The guest user is blocked from logging in to the portal outside the specified maintenance window. All access by the guest user during the time window is recorded as a session recording and is available for audit.
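The time-window and device restrictions on a guest user can be expressed as two small checks. The Python sketch below uses hypothetical names and treats the window as including its start and excluding its end; it is an illustration of the rule, not the portal's implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet


@dataclass(frozen=True)
class AccessTicket:
    guest: str
    devices: FrozenSet[str]  # the only devices the guest may reach
    window_start: datetime
    window_end: datetime


def may_log_in(ticket: AccessTicket, guest: str, now: datetime) -> bool:
    """Login is allowed only for the named guest inside the maintenance window."""
    return guest == ticket.guest and ticket.window_start <= now < ticket.window_end


def may_open_console(
    ticket: AccessTicket, guest: str, device: str, now: datetime
) -> bool:
    """Console access additionally requires the device to be on the ticket."""
    return may_log_in(ticket, guest, now) and device in ticket.devices
```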

The benefits of this capability are far-reaching: an enterprise IT department employee does not need to “baby-sit” the guest user while he or she performs maintenance tasks during the weekend.

9. Certifications

NOC is ISO 27001 Certified

Operational focus, documentation and training on standard policy and procedures enabled the NOC to become ISO

27001 certified in 2008. The NOC continues to improve its operational procedures and training with on-staff Six Sigma

black belts and will maintain its certification with third-party ISO 27001 audits on an annual basis.

The NOC is committed to meeting the highest standards for information security, regulatory compliance and business

continuity. The ISO 27001 certification is a strong standard of baseline operations that every IT department should

require of external vendors. By requiring ISO 27001, IT departments know they have chosen an external vendor who is

focused on providing secure, compliant, and reliable IT services 24 x 7 x 365.

What is ISO 27001?

ISO 27001 is an Information Security Management System (ISMS) standard published in October 2005 by the

International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The

complete, correct nomenclature is: ISO/IEC 27001:2005 - Information technology - Security techniques - Information

security management systems - Requirements (ISMS). The objective of the standard is to provide a model for

establishing, implementing, operating, monitoring, reviewing, maintaining, and improving an Information Security

Management System.

The ISO 27001 standard requires an organization to be continually:

• Examining systematically the organization’s information security risks, taking account of the threats, vulnerabilities

and impacts;

• Designing and implementing a coherent and comprehensive suite of information security controls and/or other

forms of risk treatment (such as risk avoidance or risk transfer) to address those risks that it deems unacceptable;

and

• Adopting an overarching management process to ensure that the information security controls continue to meet the

organization’s information security needs on an on-going basis.

To provide a better understanding of the operational coverage of the ISO 27001 standard, below is a breakdown of the

10 basic control areas for organizations:

• Security policy - To provide management direction and support for information security.

• Organization of assets and resources - To help you manage information security within the organization.

• Asset classification and control - To help you identify your assets and appropriately protect them.

• Personnel security - To reduce the risks of human error, theft, fraud or misuse of facilities.

• Physical and environmental security - To prevent unauthorized access, damage and interference to business

premises and information.

• Communications and operations management - To ensure the correct and secure operation of information

processing facilities.

• Access control - To strictly control access to information.

• Systems development and maintenance - To ensure that security is at the forefront in the building and

maintenance of information systems.

• Business continuity management - To minimize interruptions to business activities and to protect critical business

processes from the effects of major failures or disasters.

• Compliance - To recognize and adhere to criminal and civil law, statutory, regulatory or contractual obligations, and any security requirements, thereby minimizing legal risks and costs.

10. Conclusion – NOC, The Right Solution For Session Recordings, Audit, Compliance, & Surveillance

Management of remote client sites by an IT administrator from a central NOC is fraught with challenges in both

capabilities and security for remote access. The ITOP cloud portal feature set on session recordings, audit, compliance,

and surveillance offers a comprehensive set of capabilities to address these challenges in a secure manner for an IT

organization.

Additionally, since IT infrastructure management systems are by definition "top of the food chain" for systems in an organization, they cannot afford weak security capabilities: the results can be catastrophic for an organization if security is compromised.

The ITOP cloud portal solution for remote management is built with strong security from the ground up using well-understood industry standards and best practices in security. The exhaustive security measures and robust management feature set ensure that deploying the solution will not only save the organization time and money, but also enhance the security of its existing infrastructure.
