This section covers service assurance for the services that are provisioned
through the cloud service fulfillment process. It describes the fundamental
service assurance processes and the best practices that are typically used to
manage the services delivered to customers.
Service assurance is a combination of fault and performance management. Cloud
service assurance requires fault and performance management of the cloud
infrastructure, which comprises network, compute, and storage, in addition to
the applications that run on the services platform itself. Reporting an incident
to the appropriate management system for technical remediation, and to
informational dashboards in near real time for end-user notification, is a
fundamental requirement. It is also essential that the cloud service fulfillment
processes be closely coupled with the cloud assurance processes, since cloud
customers may use the infrastructure only for a defined period of time. For
example, customers may use a cloud service as a test/dev environment only while
they are developing and testing a service. Previously, customers stayed on the
network for very long periods, so spinning services up and down with the
appropriate assurance monitoring and management was not part of the service
lifecycle.
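The coupling between fulfillment and assurance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AssuranceRegistry` class, its method names, and the resource identifiers are hypothetical and do not come from any particular product. The idea is that monitoring is registered when a service is provisioned and removed when it is decommissioned, so a short-lived test/dev environment is monitored for exactly as long as it exists.

```python
from dataclasses import dataclass, field


@dataclass
class AssuranceRegistry:
    """Hypothetical registry of services currently under assurance monitoring."""
    monitored: dict = field(default_factory=dict)

    def on_provisioned(self, service_id: str, resources: list) -> None:
        # Called by the fulfillment process once a service has been spun up.
        self.monitored[service_id] = list(resources)

    def on_decommissioned(self, service_id: str) -> None:
        # Called when the service is torn down; stop monitoring its resources.
        self.monitored.pop(service_id, None)

    def is_monitored(self, service_id: str) -> bool:
        return service_id in self.monitored


registry = AssuranceRegistry()
registry.on_provisioned("testdev-42", ["vm-1", "vlan-210", "lun-7"])
monitored_during_use = registry.is_monitored("testdev-42")
registry.on_decommissioned("testdev-42")
monitored_after_teardown = registry.is_monitored("testdev-42")
```

In a real deployment these hooks would be triggered by the provisioning workflow rather than called by hand.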
Cloud service providers must resolve service-related problems quickly to
minimize outages and revenue loss. Cloud assurance solutions give the customer
and the operational team maximum visibility into service performance and
cost-effective management of SLAs coupled with service impact analysis.
Cloud End-to-End Service Assurance Flow
Figure 1 below shows the typical end-to-end service assurance steps:
Figure 1: Cloud Assurance Flow (data collection to display)
Unlike the cloud service fulfillment cycle, which starts with the customer and
ends with a provisioning task on the service platform resources, service
assurance events originate in the resources and affect services with varying
levels of degradation. Eventually the customer is notified of the service
degradation, unless it is fixed proactively before the customer perceives it.
The steps illustrated in Figure 1 are explained below:
- The incident/fault and performance events are sent by the infrastructure: network devices, compute/server platforms, storage devices, and applications.
- The domain managers collect these events from infrastructure devices through the CLI, Simple Network Management Protocol (SNMP) polling, or traps sent by the infrastructure devices. Typically separate domain managers are used for the network, compute, and storage devices.
- The domain managers receive the messages from the infrastructure devices and de-duplicate/filter the events.
- The device events received by the domain managers contain device information only, not service information. The domain manager enriches the events by looking them up in the service catalog. In other words, the events are mapped to services to determine which services are affected by the events in the infrastructure. The events can also be mapped to customers to determine which customers are impacted.
- The service impact is assessed and the information is shown on a dashboard for the operations personnel. This helps them prioritize the remediation efforts.
- The service impact information is forwarded to other locations, including mobile devices.
- The service impact information is sent to the service desk (SD).
- The service impact information is sent to the service-level manager (SLM).
- The SD proactively manages customers, informs them of service impacts, and keeps them up to date on the remediation efforts. Also, the service impact is checked against the SLA to determine any SLA violations and business impacts.
- The events, SLA violations, trouble tickets, and so on are displayed on a portal for various consumers, such as customers, operations people, suppliers, and business managers.
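The de-duplication, enrichment, and impact-assessment steps above can be sketched as follows. This is a simplified illustration, not how any specific domain manager works: the catalog contents, device names, event fields, and function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical service catalog: device -> services that depend on it.
SERVICE_CATALOG = {
    "switch-1": ["web-hosting", "testdev-42"],
    "san-3": ["web-hosting"],
}

# Hypothetical mapping of services to the customers that use them.
SERVICE_CUSTOMERS = {
    "web-hosting": ["acme"],
    "testdev-42": ["globex"],
}


def deduplicate(events):
    """Keep the first event for each (device, fault) pair, preserving order."""
    seen = set()
    unique = []
    for event in events:
        key = (event["device"], event["fault"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def enrich(event):
    """Attach the affected services and customers to a raw device event."""
    services = SERVICE_CATALOG.get(event["device"], [])
    customers = sorted(
        {c for s in services for c in SERVICE_CUSTOMERS.get(s, [])}
    )
    return {**event, "services": services, "customers": customers}


def service_impact(enriched_events):
    """Group faults by affected service, as a dashboard might display them."""
    impact = defaultdict(list)
    for event in enriched_events:
        for service in event["services"]:
            impact[service].append(event["fault"])
    return dict(impact)


raw_events = [
    {"device": "switch-1", "fault": "link-down"},
    {"device": "switch-1", "fault": "link-down"},  # duplicate trap
    {"device": "san-3", "fault": "disk-latency"},
]

enriched = [enrich(e) for e in deduplicate(raw_events)]
impact = service_impact(enriched)
```

Here the duplicate trap from `switch-1` is dropped, and `impact` shows that the `web-hosting` service is affected by both faults while `testdev-42` is affected only by the link failure.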
Best Practices for Cloud Service Assurance using ITILv3 Principles
ITILv3 defines the IT service life cycle phases: service strategy, service
design, service transition, service operation (the Operate phase), and Continual
Service Improvement (CSI). Applying these phases is a good way to establish
service assurance processes for data center virtualization and cloud
management. Figure 2 shows the cloud service assurance flow based on ITILv3:
Figure 2: Cloud service provisioning flow based on ITIL V3 principles
Mapping ITILv3 phases to ‘Infrastructure as a Service’ requirements
Figure 2 also shows some of the items that need to be considered in each of
the five phases of the cloud service life cycle for cloud assurance. More
details are provided in the following sections along the lines of ITIL V3
phases.
1. Service Strategy: In the cloud strategy phase, consider the following topics
for cloud assurance:
- Architecture Assessment
- Business Requirements
- Demand Management
2. Service Design: Consider the following items, taking input from the service
strategy phase:
- Service Catalog Management
- Service Level Management
- Availability Management
- Capacity Management
- Incident Management
- Problem Management
- Supplier Management
- Information Security Management
- Service Continuity Management
3. Service Transition: Consider the following items in this phase:
- Change Management
- Configuration and setting up all the Assurance Systems
- Service asset management in the CMS (CMDB)
- Migration from current state to target state (people, processes, products, and partners)
- Staging and Validation of all systems and processes for assurance
4. Service Operate: The cloud Operate phase is where the service provider takes
over the management of cloud operations from the equipment vendors, system
integrators, and partners, and monitors and audits the service using the
monitoring systems (FCAPS) to ensure that the SLAs are met. Consider the
following items in the cloud Operate phase:
- Service Desk (function)
- Incident Management
- Problem Management
- Event Management
- Other IT day-to-day activities
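As a rough illustration of checking whether an SLA is met, the sketch below computes availability over a reporting period and compares it with a target. It assumes the SLA is expressed as an availability percentage; the function names, the 30-day period, and the 99.9% target are illustrative assumptions, not values taken from any specific SLA.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Availability over a reporting period, as a percentage."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def sla_violated(period_minutes: float, downtime_minutes: float,
                 target_percent: float) -> bool:
    """True when the measured availability falls below the SLA target."""
    return availability_percent(period_minutes, downtime_minutes) < target_percent


# Illustrative numbers: a 30-day month and a 99.9% availability target.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes
violated_small_outage = sla_violated(MONTH_MINUTES, downtime_minutes=30,
                                     target_percent=99.9)
violated_large_outage = sla_violated(MONTH_MINUTES, downtime_minutes=60,
                                     target_percent=99.9)
```

With a 99.9% target, a 30-day month allows about 43.2 minutes of downtime, so the 30-minute outage stays within the SLA and the 60-minute outage violates it.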
5. Cloud CSI Phase: Continual Service Improvement (also referred to as
Optimization) for cloud assurance involves improving operations by applying
best practices to the processes, tools, and configurations. It includes the
following activities:
- Auditing the configurations against best practices and changing the configurations as appropriate
- Fine tuning the tools and processes based on best practices
- Adding new products and services, and performing assessments to ensure the new services can be incorporated into the current environment. If not, determine the changes required and go through the cloud life cycle, starting from cloud strategy.
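Auditing configurations against best practices can be pictured as a comparison against a baseline. The sketch below is a minimal example under stated assumptions: the baseline settings, their values, and the `audit` function are hypothetical and chosen only to show the shape of the check.

```python
# Hypothetical best-practice baseline for a monitored device's settings.
BASELINE = {
    "snmp_version": "v3",
    "polling_interval_s": 300,
    "syslog_enabled": True,
}


def audit(config: dict, baseline: dict) -> dict:
    """Return {setting: (actual, expected)} for every deviation from baseline."""
    return {
        key: (config.get(key), expected)
        for key, expected in baseline.items()
        if config.get(key) != expected
    }


device_config = {"snmp_version": "v2c", "polling_interval_s": 300}
findings = audit(device_config, BASELINE)
```

Each entry in `findings` names a setting that differs from the baseline, which gives the operations team a concrete list of configurations to change. A missing setting is reported with an actual value of `None`.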