NOC Services Delivery: Discovery, Monitoring and Escalation Process
The NOC (Network Operations Center) discovers all the components of your infrastructure, from the underlying
network infrastructure, storage, and blades right up to the virtual machines hosted on the hypervisor. All components
are grouped logically and hierarchically, making it easy to navigate through your infrastructure. This helps your IT
team understand inter-component relationships and perform root-cause and impact analysis when a problem occurs.
The discovered devices can then be marked as ‘managed devices’ as agreed with the local IT Team. The managed devices
are then grouped by business unit, by location, or by any custom grouping the local IT Team requires.
Mid-term change of devices, adding a new device or decommissioning the device can be done by following the change
order process.
Devices are monitored through the deployed monitoring solution. The monitoring solution provides the
flexibility to monitor availability, performance, services, ports, event logs, SNMP OIDs, traps, syslogs, synthetic
transactions, DNS, websites, and application-specific custom metrics using APIs, for closet-to-cloud
infrastructures.
Haladon will review the local IT Team’s infrastructure setup and propose monitoring templates/parameters for each
of the infrastructure components. Monitoring frequency and thresholds are agreed upon by the onboarding team and
the local IT Team before they are deployed.
Mid-term review and revision of the monitoring templates/parameters will be taken up in conjunction with the local IT
Team based on the observations and inputs from the field.
There are two categories of events:
• Discrete state changes, sent asynchronously by an agent installed on the target managed element.
• Threshold breaches, indicating that an element is no longer operating within “normal” parameters, also sent
asynchronously by the agent installed on the target managed element. Normally, the threshold is based on a predefined,
default, out-of-the-box value; a KAR-defined, customized setting; or a dynamic, measured baseline.
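As a rough illustration of the three threshold sources described above, the sketch below picks a threshold and checks a sample against it. The function names, the precedence order (customized setting, then dynamic baseline, then default), and the baseline rule of mean plus k standard deviations are all assumptions made for this example; the document does not specify how the monitoring solution computes its baseline.

```python
from statistics import mean, stdev


def resolve_threshold(default, custom=None, history=None, k=3.0):
    """Return the threshold to apply to a metric.

    Assumed precedence for this sketch: a customized setting wins, then a
    dynamic baseline computed from recent samples, then the default value.
    """
    if custom is not None:
        return custom
    if history is not None and len(history) >= 2:
        # Hypothetical baseline rule: mean of recent samples plus k standard deviations.
        return mean(history) + k * stdev(history)
    return default


def is_breach(value, threshold):
    """A sample breaches when it rises above the threshold."""
    return value > threshold
```

For example, `is_breach(95.0, resolve_threshold(90.0))` evaluates a CPU sample against an out-of-the-box value, while passing `history=[...]` switches the same check to the measured baseline.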
Every threshold-breach event captured by the SG or Vistara Agent is pushed to the Alert Browser in the Cloud
Portal. The Alert Browser correlates the alert parameters and transforms them into a human-readable format, which helps
1st-level engineers quickly pinpoint the severity of the issue, validate the alert with remote command executions,
create an incident ticket with a priority, and assign it to the domain experts for resolution, all from the same
alert.
Any threshold or parameter changes are seamlessly pushed to the Alert Browser from the monitoring solution.
1. Auto Correlation:
The NOC processes events using filtering techniques for correlation. When multiple, repetitive events are received
for the same problem on the same element, it stores the event once and increments a counter indicating the
number of times it has been received, rather than flooding the user's screen with redundant events.
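The store-once-and-count behaviour can be sketched as follows. This is a minimal illustration, not the product's implementation: the class and field names are hypothetical, and it assumes that "the same problem on the same element" can be expressed as an (element, problem) key.

```python
from dataclasses import dataclass


@dataclass
class StoredEvent:
    element: str
    problem: str
    message: str
    count: int = 1  # how many times this event has been received


class EventStore:
    """Keeps one entry per (element, problem) pair and counts repeats."""

    def __init__(self):
        self._events = {}

    def receive(self, element, problem, message):
        key = (element, problem)
        existing = self._events.get(key)
        if existing is not None:
            # Repeat of a known problem: bump the counter instead of storing again.
            existing.count += 1
            return existing
        event = StoredEvent(element, problem, message)
        self._events[key] = event
        return event

    def visible_events(self):
        """Events as an operator would see them: one row per distinct problem."""
        return list(self._events.values())
```

Receiving the same "link down" event three times for one switch leaves a single row with `count == 3`, while an event for a different element gets its own row.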
2. Notification and Escalation Management
The ITIL Incident Management System is integrated with the notification and escalation management engine.
It provides automatic notification and escalation at predefined intervals. Notifications are sent on creation,
ownership assignment, progress, status change, and closure of the incident, and also act as an alerting mechanism
in case of an SLA breach. An incident can be escalated hierarchically to an individual or a
group based on predefined parameters, thresholds, or manual override conditions.
The engine can capture the escalation matrix for each location or site. Escalation management can be set up differently
for business hours, after business hours, and weekends. The escalation contacts can be defined in a hierarchy, with
options to call, email, or both. The escalation contact details, hierarchy, and hours of operation can be
modified dynamically as needed by the local IT Team, and the NOC team will follow the escalation as defined in the
escalation matrix. Below is a sample escalation matrix:
The local IT Team will have the ability to change the matrix or contact details from the proposed solution.
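To make the structure of an escalation matrix concrete, here is a small sketch of how one could be represented and consulted. Everything in it is hypothetical: the contact roles, the 09:00 to 17:00 business hours, and the function names are placeholders chosen for illustration, not values from an actual site.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Contact:
    level: int       # position in the escalation hierarchy (1 = first contact)
    name: str
    methods: tuple   # ("call",), ("email",) or ("call", "email")


# Hypothetical matrix for a single site; real entries come from the local IT Team.
MATRIX = {
    "business_hours": [
        Contact(2, "IT Manager", ("call",)),
        Contact(1, "Site IT Lead", ("call", "email")),
    ],
    "after_hours": [
        Contact(1, "On-call Engineer", ("call",)),
        Contact(2, "IT Manager", ("call", "email")),
    ],
    "weekend": [
        Contact(1, "On-call Engineer", ("email",)),
    ],
}

BUSINESS_START = time(9, 0)   # assumed business hours for this sketch
BUSINESS_END = time(17, 0)


def window_for(moment):
    """Classify a timestamp as weekend, business hours, or after hours."""
    if moment.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return "weekend"
    if BUSINESS_START <= moment.time() < BUSINESS_END:
        return "business_hours"
    return "after_hours"


def contacts_for(moment, matrix=MATRIX):
    """Return the contacts to try, in escalation order, for a given time."""
    return sorted(matrix[window_for(moment)], key=lambda c: c.level)
```

Because the matrix is plain data, changing a contact or the hours of operation only means editing the table, which mirrors the requirement that the local IT Team can modify it as needed.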
Notifications are escalated to the client in the following cases:
• On receipt of alerts/tickets
• Subsequent updates on the alert/ticket status
• Status change
• When customer intervention is required
• Closure of the ticket
3. Ticketing Management
The service offers 24x7 Monitoring, Alerting & Incident Management services from its secure NOCs. The services include
monitoring alerts from the infrastructure, validation of the alerts and ticketing, Incident Reporting and ITIL compliant
SOP-based remediation.
Each incident is recorded so that it can be tracked, monitored, and updated throughout the life cycle. This information
can then be utilized for problem management, reporting, process optimization and planning purposes.
Classification and initial support: The classification process categorizes an incident to determine its priority,
focusing on the potential business impact. Priority is assigned after validating that the event actually occurred
in the network and assessing its potential business impact. An incident ticket is then assigned to the respective
domain experts for quick resolution.
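The classification step can be pictured as a small function that only produces a ticket for a validated event, derives the priority from business impact, and routes the ticket to a domain queue. The impact levels, the P1 to P3 labels, and the queue names below are assumptions for illustration; only P0 (the major incident level) is named in this document.

```python
from dataclasses import dataclass

# Hypothetical mapping; only P0 appears in the document itself.
IMPACT_TO_PRIORITY = {"critical": "P0", "high": "P1", "medium": "P2", "low": "P3"}

# Hypothetical domain-expert queues.
DOMAIN_QUEUES = {
    "network": "network-experts",
    "server": "server-experts",
    "database": "database-experts",
}


@dataclass
class Incident:
    summary: str
    priority: str
    assigned_to: str


def classify(summary, validated, impact, domain):
    """Create a prioritized, assigned incident, or None if the event was not validated."""
    if not validated:
        return None
    priority = IMPACT_TO_PRIORITY[impact]
    # Unknown domains fall back to a general triage queue (an assumption of this sketch).
    queue = DOMAIN_QUEUES.get(domain, "noc-triage")
    return Incident(summary, priority, queue)
```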
Investigation and diagnosis: This process deals with the investigation of an incident and the gathering of diagnostic
data. The goal of the process is to identify how the incident can be resolved as quickly and efficiently as possible. The
technical resources will be handling the incidents until resolution, escalating to domain experts when required to meet
pre-defined SLAs.
Major incident procedure: The major incident procedure (P0) exists in the NOC Service Delivery Framework to handle those
critical incidents that require a response above and beyond that provided by the normal incident process. Although
these incidents still follow the normal incident lifecycle, the triggering of the major incident procedure provides the
increased coordination, escalation, communication, and resources that these high-priority events require.
Vendor tech support: This process deals with contacting the vendor's technical support team if needed to resolve the
issue. Although the first objective is to find a workaround and restore business as usual, domain experts will work in
the background with the vendors to resolve the issue. The local IT Team must, however, hold valid support contracts
for the NOC to coordinate with vendors.
Resolution and recovery: This process covers the steps that are required to resolve an incident, often by utilizing the
change management process for implementation of remedial actions. Once the issue has been resolved, recovery
options, such as restarting services after an application failure, may be required.
Closure: This process ensures that the customer is satisfied with the resolution and handling of an incident prior to
closing the ticket.
4. Handling of service requests
Different types of service requests require handling in different ways. The NOC team will be able to process certain
requests, while other requests need to be passed to other processes, such as change management and new equipment
onboarding.
5. Monitors In Detail
Monitors are set up for each device/application in the managed infrastructure using standard Windows WMI or SNMP
data collection. The ITOP Cloud Portal also enables the NOC to remotely and securely access the monitored devices in
order to perform SOPs (Standard Operating Procedures) or advanced troubleshooting. Below are examples of some
of the key parameters the NOC can monitor.
Key Monitoring Parameters for Network Devices:
SWITCHES, ROUTERS & FIREWALL
Device Availability: Up/Down
Device Health: CPU and Memory Utilization
Interface Status: Up/Down
Interface Performance: Utilization, In/Out Traffic Rate
Interface Errors: Error and Discard Rate, CRC and Collision Errors
Buffer Usage: Small, Medium, Large and Huge Buffer Utilization and Failures
VPN: IKE and IPsec Tunnel Availability
Hardware Monitoring: Disk, Memory Modules, Chassis Temperature, Fan, Power, and Voltage Status
WIRELESS NETWORKS
Access Point Availability
Access Point Client Statistics
Network Health: Load, Interference, Noise and Coverage Status
REAL-TIME NETWORK PERFORMANCE MONITORING (SYNTHETIC TRAFFIC)
HTTP: URL Response Time
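As an example of how a parameter such as Interface Performance (utilization) is typically derived from SNMP data, the sketch below computes utilization from two samples of an octet counter (such as IF-MIB `ifInOctets`) and the interface speed (`ifSpeed`). It makes no SNMP calls itself; the function names are hypothetical and the counter values are assumed to have been collected already. It also assumes 32-bit counters that wrap at most once between samples.

```python
COUNTER32_MAX = 2**32  # 32-bit SNMP counters wrap back to zero at this value


def counter_delta(previous, current, modulus=COUNTER32_MAX):
    """Octets transferred between two samples, tolerating a single counter wrap."""
    return (current - previous) % modulus


def utilization_percent(prev_octets, curr_octets, interval_s, if_speed_bps):
    """Interface utilization over the polling interval, as a percentage of link speed."""
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)
```

For instance, 37,500,000 octets received in a 60-second polling interval on a 100 Mbps link corresponds to 5% utilization.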
Key Monitoring Parameters for Windows and Linux Servers:
WINDOWS OPERATING SYSTEM
Device Availability: Up/Down
Device Health: CPU, Memory and Disk Utilization
Windows Services: Up/Down (Default: All services with start-up type “Automatic”)
Windows Event Logs: Critical Application, System Logs
Server Hardware Monitoring: Disk, Memory Modules, and Chassis Temperature
LINUX OPERATING SYSTEM
Device Availability: Up/Down
Device Health: CPU, Memory and Disk Utilization
Linux Interfaces: Up/Down
Logs: Critical Logs
Server Hardware Monitoring: Disk, Memory Modules, Chassis Temperature
Key Monitoring Parameters for SQL Server Database:
WINDOWS OPERATING SYSTEM
Server Availability: Up/Down
Server Health: CPU, Memory and Disk Utilization
Windows Event Logs: Critical Application, System Logs
Server Hardware Monitoring: Disk, Memory Modules, Chassis Temperature
DEVICE HEALTH
Device/Network/Cluster Availability
Device Health: CPU, Memory and Disk Utilization
SERVER THROUGHPUT METRICS
Number of Logical Connections, Logins/sec, Logouts/sec
SERVER MEMORY METRICS
Connection Memory Size, Granted Workspace Memory, Lock Memory Size/Blocks Allocated, Owner Blocks Allocated,
Maximum Workspace Memory, Memory Grants Outstanding, Optimizer Memory Alert, SQL Cache Memory Size,
Total Server Memory Size
SQL SERVER LOCK METRICS
Lock Wait Time (ms), Lock Requests/sec, Lock Timeouts/sec, Deadlocks/sec
SQL SERVER RESOURCE UTILIZATION METRICS
Data File Size, Replication Transaction Rate, Average and Active Transactions, Transactions/sec, Queued Jobs,
Failed Jobs and Job Success Rate, Open Connections Count
SQL SERVER CACHE METRICS
Cache Hit Ratio (MSSQL), Cache Objects, Cache Pages, and Cache Objects in Use
SERVER DISK METRICS
Average Disk Reads/Writes/Transfers in Bytes, Disk Queue Length, Disk Read/Write Queue, Data Space of DB
SQL SERVER LOG METRICS
Log File(s) Size, Log Flush Wait Time, Log Flush Waits/sec, Log Flushes/sec, Log Growth and Shrink Rate
Total Latch Wait Time, Number of Replication Pending Transactions, User Connections
WINDOWS SERVICES MONITORING
SQL Server, Agent Service, Integration Services, Reporting Services, Analysis Services, Full Text Services
SQL SERVER PHYSICAL I/O PERFORMANCE
Address Windowing Extensions (AWE) Lookup Maps/sec, AWE Stolen Maps/sec, Buffer Cache Hit Ratio,
Checkpoint Pages/sec, Lazy Writes/sec, Page Lookups, Reads and Writes/sec
SQL SERVER HIGH AVAILABILITY MONITORING
Monitors to Track Replication Latency, Mirror Synchronization, Lag in Log Shipping, Cluster Availability
and Failover/Failback
6. Interactive Graphing
With the interactive graphing feature, the NOC offers device-level graphs that show the monitoring data collected for
servers and network devices.
The graphs are plotted on a timeline, which eases the analysis of various performance indicators, as shown in the
examples below:
Fig: CPU Monitor Graph
Fig: Disk Utilization Monitor Graph
Fig: Memory Utilization Monitor Graph
Fig: PING Monitor Graph
Fig: Network Device Interface Monitor Graph
Fig: Network Device Interface Health Monitor Graph
7. Security
To address these problems effectively within their organizations, enterprises need to ask the following key questions:
• How can we record and audit the actions of IT staff (enterprise staff or co-sourced service provider staff) when they
work on our critical IT assets? How can we increase accountability?
• How can we identify violations of access?
• How can we selectively grant access to our IT assets to 3rd party service providers to maintain them during
off-hours, while ensuring accountability?
• How can we prove to our auditors that we were access compliant over the past several years?
• How can we “bottle” the workflows of our senior and expert IT staff when they troubleshoot and solve IT
problems? How can we make this knowledge and skills available to others in our IT team for training?
8. Solution
Haladon provides highly optimized solutions for the operational management of IT assets at remote sites. An
"as-if-you-were-there" interface to servers, clients, and network devices is provided remotely, using just a browser
from anywhere in the world, regardless of whether there is a firewall between the user and the managed devices.
The NOC brings together a multitude of capabilities to deliver secure, high-performance, and cost-effective discovery,
monitoring, management and remediation operations for remote sites.
All data transfer and management control communications between the NOC browser and the services gateway are
secured via SSL with 256-bit AES encryption. Similarly, all data transfer and management control communications between
the ITOP cloud portal and IT assets are secured via IPSec or SSL. This level of encryption is intended to make it
impractical for an intruder to decrypt any of the communications, even if they gain unlawful access and eavesdrop.
Security: Role Based Access For Authorization, Data, & Control Privacy
The ITOP cloud portal supports a robust mechanism for role-based access that authorizes users by their roles.
Each role can be defined to include, or exclude, a set of permissions or access levels to perform certain key actions
on the target device, or to access certain key objects in the system such as incidents, monitors, and session
recordings. Roles can be partitioned by team for security and made private to that team,
irrespective of whether the expertise is local or remote to the site. In situations where multiple 3rd party service
providers augment the IT administration team, these administrators can be included in the role.
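The role model described above can be sketched in a few lines. This is an illustrative model only, not the portal's API: the class names, the permission strings, and the rule that a role may only be granted to members of its own team are assumptions used to show permissions bundled into roles and roles partitioned by team.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    name: str
    team: str               # the team this role is private to
    permissions: frozenset  # e.g. {"incident:view", "console:launch"} (hypothetical names)


@dataclass
class User:
    name: str
    team: str
    roles: list = field(default_factory=list)


def grant(user, role):
    """Attach a role to a user, enforcing the team partition."""
    if role.team != user.team:
        raise PermissionError(f"role {role.name!r} is private to team {role.team!r}")
    user.roles.append(role)


def is_allowed(user, permission):
    """A user may perform an action if any of their roles includes the permission."""
    return any(permission in role.permissions for role in user.roles)
```

A third-party administrator would be modelled as a `User` added to the relevant team and granted the same role, rather than being given permissions individually.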
Fig: Role-based Access control with ITOP Cloud Portal Solution
Audit & Compliance: Session Recordings, Audit Trails, & Reports
All access and sessions (e.g. RDP, VNC, Telnet, SSH, serial consoles) performed on the target server or device from the
NOC are recorded and logged via a "flight recorder". Each access requires the user to authenticate with a ticket number,
as well as to enter comments on the reason for access. Every recorded session also maintains a record of the remote
console activity including meta-data related to the access (start time, end time, user name and type of remote console,
ticket number and comments).
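A recorded session's metadata, and the rule that access requires a ticket number and a comment, might be modelled as below. The field names follow the metadata listed above; the function names and validation messages are hypothetical, and the actual recording of console activity is outside the scope of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

CONSOLE_TYPES = {"RDP", "VNC", "Telnet", "SSH", "serial"}


@dataclass
class SessionRecord:
    user_name: str
    console_type: str
    ticket_number: str
    comments: str
    start_time: datetime
    end_time: Optional[datetime] = None  # filled in when the session closes


def open_session(user_name, console_type, ticket_number, comments, now):
    """Start an audited session; refuse access without a ticket number and a reason."""
    if console_type not in CONSOLE_TYPES:
        raise ValueError(f"unsupported console type: {console_type}")
    if not ticket_number.strip():
        raise ValueError("a ticket number is required for access")
    if not comments.strip():
        raise ValueError("a comment on the reason for access is required")
    return SessionRecord(user_name, console_type, ticket_number, comments, now)


def close_session(record, now):
    """Stamp the end time so the audit trail covers the full session."""
    record.end_time = now
    return record
```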
Audit & Compliance: RDP & VNC Remote Console Session Recordings
If the NOC administrator uses RDP or VNC remote consoles to access the target device, the two-way data transfer
is brokered by the ITOP cloud portal, which stores all in-transit data in native format. The entire “session” between
the NOC administrator and the target device is recorded and stored as a “session recording”. The ITOP cloud portal can
replay these session recordings and show a screen-by-screen “as if you were there” playback of the entire
session at multiple speeds (1X to 10X). The native recording data stream is stored in an optimized format, but can be
converted and exported to SWF format, making it playable on external players at different resolutions
and speeds.
Fig: Session Record Player
Audit & Compliance: SSH, Telnet, & Serial Console Recordings
Recordings from SSH, telnet and serial console sessions are captured and stored in a text format. The recordings are
then indexed offline to enable fast text search capabilities via the portal. Recordings can be searched for content as well
as for comments entered before initiation or after completion of the remote console session.
Text-based recordings can also be played back, downloaded as a text file, or searched for specific sub-strings. The
text-based console player can replay the recording on the original remote console timeline, skip idle time, or play
back at 1X-4X speeds.
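The search and playback options for text recordings can be illustrated with a simple model in which a recording is a list of (timestamp in seconds, text) chunks. The representation and function names are assumptions for this sketch; the portal's actual storage format and indexing are not described in this document.

```python
def search(recording, needle):
    """Return the (timestamp, text) chunks whose text contains the sub-string."""
    return [(t, text) for t, text in recording if needle in text]


def playback_delays(recording, speed=1.0, max_idle=None):
    """Seconds to wait before showing each chunk during replay.

    speed    -- playback multiplier (1.0 keeps the original timeline)
    max_idle -- if set, gaps longer than this are shortened to it (idle skipping)
    """
    delays = []
    previous = None
    for timestamp, _ in recording:
        gap = 0.0 if previous is None else timestamp - previous
        if max_idle is not None:
            gap = min(gap, max_idle)
        delays.append(gap / speed)
        previous = timestamp
    return delays
```

With a recording that contains a one-minute pause, `playback_delays(recording, max_idle=5)` caps that pause at five seconds, and `speed=4.0` divides every remaining delay by four.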
Surveillance: Access Controlled Session
The “access controlled session” capability allows an organization to “schedule” tickets for third party maintenance
services on critical devices. These tickets provide un-chaperoned, off-hours, time-window-enforced, fully auditable
access to specific target devices for third party organizations (e.g. a co-sourced provider).
The ITOP cloud portal supports workflows that enable an enterprise to leverage and schedule 3rd party consultants and
maintenance services during off-hours. For example, if remote skilled technical expertise is needed on a weekend to
perform a scheduled upgrade of a VoIP device, business-critical system, or core router, this activity can be
scheduled ahead of time with an “access controlled session” ticket on the ITOP cloud portal. The remote 3rd party
administrator (a guest user) will be given limited access to only the specific target devices and consoles (via RBAC
policy) during a pre-specified maintenance window. The guest user can log in to the ITOP cloud portal during the
scheduled window, launch consoles on the specified target devices, and complete their tasks.
The guest user is blocked from logging in to the portal outside the specified maintenance window. All access by the
guest user during the time window is recorded as a session recording and is available for audit.
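The time-window and device restrictions on a guest user can be sketched as two checks against the scheduled ticket. The class, field, and function names are hypothetical, and the sketch treats the window as including its start time and excluding its end time, which is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessTicket:
    guest: str
    devices: frozenset      # target devices the guest may open consoles on
    window_start: datetime
    window_end: datetime


def can_login(ticket, guest, now):
    """The guest may log in only inside the scheduled maintenance window."""
    return guest == ticket.guest and ticket.window_start <= now < ticket.window_end


def can_launch_console(ticket, guest, device, now):
    """Console access additionally requires the device to be named on the ticket."""
    return can_login(ticket, guest, now) and device in ticket.devices
```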
The benefit of this capability is that an enterprise IT department employee does not need to
“baby-sit” or watch over the guest user while they perform maintenance tasks during the weekend.
9. Certifications
NOC is ISO 27001 Certified
Operational focus, documentation and training on standard policy and procedures enabled the NOC to become ISO
27001 certified in 2008. The NOC continues to improve its operational procedures and training with on-staff Six Sigma
black belts and will maintain its certification with third-party ISO 27001 audits on an annual basis.
The NOC is committed to meeting the highest standards for information security, regulatory compliance and business
continuity. The ISO 27001 certification is a strong standard of baseline operations that every IT department should
require of external vendors. By requiring ISO 27001, IT departments know they have chosen an external vendor who is
focused on providing secure, compliant, and reliable IT services 24 x 7 x 365.
What is ISO 27001?
ISO 27001 is an Information Security Management System (ISMS) standard published in October 2005 by the
International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The
complete, correct nomenclature is: ISO/IEC 27001:2005 - Information technology - Security techniques - Information
security management systems - Requirements. The objective of the standard is to provide a model for
establishing, implementing, operating, monitoring, reviewing, maintaining, and improving an Information Security
Management System.
The ISO 27001 standard requires that an organization is continually:
• Examining systematically the organization’s information security risks, taking account of the threats, vulnerabilities
and impacts;
• Designing and implementing a coherent and comprehensive suite of information security controls and/or other
forms of risk treatment (such as risk avoidance or risk transfer) to address those risks that it deems unacceptable;
and
• Adopting an overarching management process to ensure that the information security controls continue to meet the
organization’s information security needs on an on-going basis.
To provide a better understanding of the operational coverage of the ISO 27001 standard, below is a breakdown of the
10 basic control areas for organizations:
• Security policy - To provide management direction and support for information security.
• Organization of assets and resources - To help you manage information security within the organization.
• Asset classification and control - To help you identify your assets and appropriately protect them.
• Personnel security - To reduce the risks of human error, theft, fraud or misuse of facilities.
• Physical and environmental security - To prevent unauthorized access, damage and interference to business
premises and information.
• Communications and operations management - To ensure the correct and secure operation of information
processing facilities.
• Access control - To strictly control access to information.
• Systems development and maintenance - To ensure that security is at the forefront in the building and
maintenance of information systems.
• Business continuity management - To minimize interruptions to business activities and to protect critical business
processes from the effects of major failures or disasters.
• Compliance – Recognition of and adherence to criminal and civil law, statutory, regulatory or contractual
obligations, and any security requirement minimizes legal risks and costs.
10. Conclusion – NOC, The Right Solution For Session Recordings, Audit, Compliance, & Surveillance
Management of remote client sites by an IT administrator from a central NOC is fraught with challenges in both
capabilities and security for remote access. The ITOP cloud portal feature set on session recordings, audit, compliance,
and surveillance offers a comprehensive set of capabilities to address these challenges in a secure manner for an IT
organization.
Additionally, since IT infrastructure management systems are by definition "top of the food chain" for systems in an
organization, they cannot afford to have weak security capabilities, as the results of a security compromise can be
catastrophic to the organization.
The ITOP cloud portal solution for remote management is built with strong security from the ground up, using
well-understood industry standards and best practices in security. The exhaustive security measures and robust
management feature set ensure that deploying the solution will not only save the organization time and money, but also
enhance the security of its existing infrastructure.