Cloud Service Assurance Process

This section covers service assurance for the services that are provisioned through the cloud service fulfillment process. It describes the fundamental service assurance processes and the corresponding best practices that are typically employed to manage the services delivered to customers.

Service assurance is a combination of fault and performance management. Cloud service assurance requires fault and performance management of the cloud infrastructure, which comprises network, compute, and storage, in addition to the applications that run on the services platform itself. A fundamental requirement is to report each incident to the appropriate management system for technical remediation and, in near real time, to informational dashboards for end-user notification. It is also essential that the cloud service fulfillment processes be closely coupled with the cloud assurance processes, since cloud customers may use the infrastructure only for a defined period of time. For example, customers may use a cloud service as a test/dev environment only while they are developing or testing a service. Previously, customers stayed on the network for very long periods, so spinning services up and down with the appropriate assurance monitoring and management was not part of the service lifecycle.

Cloud service providers must resolve service-related problems quickly to minimize outages and revenue loss. Cloud assurance solutions give the customer and the operational team maximum visibility into service performance and cost-effective management of SLAs coupled with service impact analysis.

Cloud End-to-End Service Assurance Flow

Figure 1 below shows the typical end-to-end service assurance steps:

Figure 1: Cloud Assurance Flow – data collection to display

Unlike the cloud service fulfillment cycle, which starts with the customer and ends with a provisioning task on the service platform resources, service assurance events originate in the resources and affect services with varying levels of degradation. The customer is eventually notified of the service degradation unless it is fixed proactively before the customer perceives it.

The steps illustrated in Figure 1 are explained below:

The incident/fault and performance events are sent by the infrastructure: network devices, compute/server platforms, storage devices, and applications.
The domain managers collect these events from the infrastructure devices through the CLI, Simple Network Management Protocol (SNMP) polling, or traps sent by the devices. Typically, separate domain managers are used for the network, compute, and storage devices.
The domain managers receive the messages from the infrastructure devices and de-duplicate and filter the events.
The device events received by the domain managers contain device information only, not service information. The domain manager enriches the events by looking them up in the service catalog. In other words, the events are mapped to services to determine which services are affected by the events in the infrastructure. The events can also be mapped to customers to determine which customers are impacted.
The service impact is assessed and the information is shown on a dashboard for the operations personnel. This helps them prioritize the remediation efforts.
The service impact information is forwarded to other locations, including mobile devices.
The service impact information is sent to the service desk (SD).
The service impact information is sent to the service-level manager (SLM).
The SD proactively manages customers, informs them of service impacts, and keeps them up to date on the remediation efforts. Also, the service impact is checked against the SLA to determine any SLA violations and business impacts.
The events, SLA violations, trouble tickets, and so on are displayed on a portal for various consumers, such as customers, operations people, suppliers, and business managers.
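
The de-duplication, enrichment, and impact-assessment steps above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular domain manager: the event fields, the 1-to-5 severity scale, the device and service names, and the two in-memory lookup tables standing in for the service catalog are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceEvent:
    device: str    # device that raised the event
    kind: str      # event type, e.g. "link_down"
    severity: int  # assumed scale: 1 (informational) to 5 (critical)


# Hypothetical service catalog: which services depend on which device.
SERVICE_CATALOG = {
    "core-sw-1": ["web-hosting", "test-dev-env"],
    "stor-array-2": ["test-dev-env"],
}

# Hypothetical mapping of services to the customers that use them.
SERVICE_CUSTOMERS = {
    "web-hosting": ["acme"],
    "test-dev-env": ["acme", "globex"],
}


def deduplicate(events, min_severity=2):
    """Filter out low-severity events and repeats of the same (device, kind).

    Only the first occurrence of each (device, kind) pair is kept.
    """
    seen = set()
    unique = []
    for event in events:
        key = (event.device, event.kind)
        if event.severity < min_severity or key in seen:
            continue
        seen.add(key)
        unique.append(event)
    return unique


def enrich(event):
    """Attach the affected services and customers to a device event."""
    services = SERVICE_CATALOG.get(event.device, [])
    customers = sorted(
        {c for s in services for c in SERVICE_CUSTOMERS.get(s, [])}
    )
    return {"event": event, "services": services, "customers": customers}


def assess_impact(enriched_events):
    """Return service -> highest severity seen, most severe first."""
    impact = {}
    for item in enriched_events:
        for service in item["services"]:
            impact[service] = max(impact.get(service, 0), item["event"].severity)
    return dict(sorted(impact.items(), key=lambda kv: -kv[1]))


def run_pipeline(events):
    """De-duplicate and filter, enrich, then assess service impact."""
    enriched = [enrich(e) for e in deduplicate(events)]
    return enriched, assess_impact(enriched)


if __name__ == "__main__":
    sample = [
        DeviceEvent("core-sw-1", "link_down", 5),
        DeviceEvent("core-sw-1", "link_down", 5),  # duplicate trap
        DeviceEvent("stor-array-2", "latency_high", 3),
        DeviceEvent("core-sw-1", "fan_notice", 1),  # below the filter threshold
    ]
    enriched, impact = run_pipeline(sample)
    print(impact)
```

In this sketch the impact map is what a dashboard, service desk, or service-level manager would consume. A real deployment would query the service catalog rather than hold it in memory.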

Best Practices for Cloud Service Assurance using ITILv3 Principles

ITILv3 defines five IT service life cycle phases: service strategy, service design, service transition, service operation, and Continual Service Improvement (CSI). Applying these phases is a good way to establish service assurance processes for data center virtualization and cloud management. Figure 2 shows the cloud service assurance flow based on ITILv3:

Figure 2: Cloud service provisioning flow based on ITIL V3 principles

Mapping ITILv3 phases to ‘Infrastructure as a Service’ requirements

Figure 2 also shows some of the items that need to be considered for cloud assurance in each of the five phases of the cloud service life cycle. The following sections give more detail, organized by ITILv3 phase.

1. Service Strategy: In the cloud strategy phase, consider the following topics for cloud assurance:

Architecture Assessment
Business Requirements
Demand Management

2. Service Design:

The following items should be considered, taking input from the service strategy phase:

Service Catalog Management
Service Level Management
Availability Management
Capacity Management
Incident Management
Problem Management
Supplier Management
Information Security Management
Service Continuity Management

3. Service Transition:

Consider the following items in this phase:

Change Management
Configuring and setting up all the assurance systems
Service asset management in the CMS (CMDB)
Migration from current state to target state (people, processes, products, and partners)
Staging and Validation of all systems and processes for assurance

4. Service Operation: The cloud operate phase is where the service provider takes over the management of cloud operations from the equipment vendors, system integrators, and partners, and monitors and audits the service using the monitoring systems (FCAPS) to ensure that the SLAs are met. Consider the following items in the cloud operate phase:

Service Desk (function)
Incident Management
Problem Management
Event Management
Other IT day-to-day activities
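
As a rough illustration of checking a service against its SLA during the operate phase, the sketch below computes measured availability over a reporting period from a list of outage durations and compares it with a target. The function names, the 99.9% target, and the 30-day period used in the example are assumptions for illustration only. Real SLAs usually define their own measurement rules, such as exclusions for planned maintenance.

```python
def availability_percent(period_minutes, outage_minutes):
    """Measured availability over the period, as a percentage.

    outage_minutes is a list of outage durations in minutes. Total downtime
    is capped at the length of the period.
    """
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    downtime = min(sum(outage_minutes), period_minutes)
    return 100.0 * (period_minutes - downtime) / period_minutes


def check_sla(service, target_percent, period_minutes, outage_minutes):
    """Compare measured availability with the SLA target for one service."""
    measured = availability_percent(period_minutes, outage_minutes)
    return {
        "service": service,
        "measured_percent": round(measured, 3),
        "target_percent": target_percent,
        "violated": measured < target_percent,
    }


if __name__ == "__main__":
    # A 30-day period has 43,200 minutes, so a 99.9% target leaves
    # a downtime budget of 43.2 minutes.
    minutes_in_30_days = 30 * 24 * 60
    print(check_sla("web-hosting", 99.9, minutes_in_30_days, [30, 20]))
```

With two outages totaling 50 minutes, the downtime exceeds the 43.2-minute budget, so the check reports a violation. A single 30-minute outage would stay within the target.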

5. Cloud CSI Phase: Continual Service Improvement (also referred to as optimization) for cloud assurance involves improving operations by applying best practices to the processes, tools, and configurations. It includes the following activities:

Auditing the configurations against best practices and changing the configurations as appropriate
Fine tuning the tools and processes based on best practices
Adding new products and services, and performing assessments to ensure the new services can be incorporated into the current environment. If not, determine the changes required and go through the cloud life cycle, starting from cloud strategy.
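
The first activity above, auditing configurations against best practices, can be sketched as a comparison of each device's settings with a baseline. The baseline keys and values below are purely hypothetical placeholders, not recommendations from this document. An actual audit would use the provider's own standards and read configurations from the CMS (CMDB) or the devices themselves.

```python
# Hypothetical best-practice baseline; the settings and values are
# illustrative only.
BASELINE = {
    "snmp_version": "v3",
    "syslog_enabled": True,
    "ntp_enabled": True,
}


def audit_config(device_name, config, baseline=BASELINE):
    """Return one finding per setting that is missing or differs from the baseline."""
    findings = []
    for setting, expected in baseline.items():
        actual = config.get(setting)  # None when the setting is absent
        if actual != expected:
            findings.append(
                {
                    "device": device_name,
                    "setting": setting,
                    "expected": expected,
                    "actual": actual,
                }
            )
    return findings


if __name__ == "__main__":
    device_config = {"snmp_version": "v2c", "syslog_enabled": True}
    for finding in audit_config("core-sw-1", device_config):
        print(finding)
```

Each finding is a candidate configuration change, which would then go through change management rather than being applied directly.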