This is the third part of our 3-part blog series on Network Operations Center (NOC) best practices. The first post was dedicated to NOC tools. The second provided some useful tips regarding NOC knowledge and skills. In this last part, we'll address processes.
What are the operational, structured processes that you should implement for effective and repeatable results? Here are our top ones.
Escalation
An escalation table ensures that all team members are clear on the proper protocol and channels for escalating issues. This table should also include all areas and skills covered by the NOC and the people who are trained to cover those areas.
For example, see the table below defining the escalation procedures for DB-related problems.
Time Frame | Escalate To | Method
0+15 mins | DB on call | SMS
0+30 mins | DB on call | Phone
0+60 mins | DB Group Leader | Phone
0+90 mins | UNIX & DB Project Manager | SMS
0+120 mins | UNIX & DB Director | SMS
A critical problem that was not solved within 30 minutes is escalated up the management ladder until someone responds or takes ownership. At every step of the process, it is recommended to involve all personnel up to the current level. So when an SMS is sent to the project manager, it is also sent to the DB on call and the Group Leader.
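The escalation rule above can be sketched in a few lines of Python. This is only an illustration, not a real paging integration: the `EscalationStep` class and the `contacts_to_notify` function are hypothetical names, and the sketch assumes that everyone already in the chain is contacted with the method of the current step, as in the SMS example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the incident was opened
    contact: str
    method: str


# The DB escalation table from this post, in order.
DB_ESCALATION = [
    EscalationStep(15, "DB on call", "SMS"),
    EscalationStep(30, "DB on call", "Phone"),
    EscalationStep(60, "DB Group Leader", "Phone"),
    EscalationStep(90, "UNIX & DB Project Manager", "SMS"),
    EscalationStep(120, "UNIX & DB Director", "SMS"),
]


def contacts_to_notify(elapsed_minutes, steps=DB_ESCALATION):
    """Return (contact, method) pairs for an unresolved incident.

    Everyone up to the current escalation level is included,
    using the method of the current level (an assumption).
    """
    reached = [s for s in steps if s.after_minutes <= elapsed_minutes]
    if not reached:
        return []  # too early to escalate
    current = reached[-1]
    contacts = []
    for step in reached:
        if step.contact not in contacts:  # skip repeated contacts
            contacts.append(step.contact)
    return [(contact, current.method) for contact in contacts]
```

For example, `contacts_to_notify(90)` lists the DB on call, the DB Group Leader and the UNIX & DB Project Manager, all by SMS.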
Prioritization
The process of prioritizing incidents is different in each NOC, and therefore should be clearly defined. Incidents should never be handled on a first come, first served basis. Instead, the shift manager should prioritize incidents and cases based on their importance and impact on the business. Issues that have a greater impact on the business should be handled first.
Understanding the prioritization of incidents in terms of their business impact should be part of the NOC training. The entire team should be familiar with the NOC "Top 10" projects, and have an understanding of what signifies a critical incident. It could be the temperature rising in the data center, a major network cable breaking or a service going down.
Common sense is useful here too. The shift leader should be able to determine that an incident that jeopardizes the entire data center has a higher priority than a request to verify why an individual server is down.
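As a rough sketch of impact-based ordering, the Python snippet below sorts a queue by business impact first and arrival time second. The `Impact` levels, the `Incident` fields and the `prioritize` function are assumptions made for illustration; each NOC would define its own scale.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    # Hypothetical impact scale; higher means greater business impact.
    SINGLE_SERVER = 1
    SERVICE = 2
    DATA_CENTER = 3


@dataclass
class Incident:
    summary: str
    impact: Impact
    reported_at: int  # minutes since the start of the shift


def prioritize(incidents):
    """Order incidents by impact (highest first), then by arrival time."""
    return sorted(incidents, key=lambda i: (-i.impact, i.reported_at))


queue = [
    Incident("Verify why server web-17 is down", Impact.SINGLE_SERVER, 0),
    Incident("Temperature rising in the data center", Impact.DATA_CENTER, 5),
    Incident("Billing service is down", Impact.SERVICE, 3),
]
ordered = prioritize(queue)
```

Although the single-server request arrived first, the data center incident ends up at the head of `ordered`.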
Incident handling
The process of handling incidents applies both to NOC operators and shift managers. Both roles should be familiar with the specific process of handling incidents with the greatest impact on users.
The incident handling process should cover issues such as:
- Full technical solution, if available.
- Escalation of issue to appropriate personnel.
- Notification of other users who may be directly or indirectly affected by issues.
- ‘Quick solution’ procedures or temporary workarounds for more complex problems that may take longer to completely resolve.
- Incident reporting. An incident report, completed once the incident is resolved, helps improve the service when the next incident occurs, and may also prevent the same incident from recurring.
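One way to keep track of the items above is a simple incident report record. The Python sketch below is hypothetical: the `IncidentReport` class, its field names and the `open_items` check are assumptions for illustration, not a prescribed report format.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    summary: str
    workaround: str = ""     # 'quick solution' applied, if any
    full_solution: str = ""  # full technical solution, if available
    escalated_to: list = field(default_factory=list)
    users_notified: list = field(default_factory=list)

    def open_items(self):
        """List the handling steps that are still missing from the report."""
        items = []
        if not self.full_solution:
            items.append("full technical solution")
        if not self.users_notified:
            items.append("notification of affected users")
        return items


report = IncidentReport(
    summary="Billing service is down",
    workaround="Failed over to the standby node",
    escalated_to=["DB on call"],
)
```

Here `report.open_items()` still lists the full solution and the user notification, which tells the shift manager the incident is not ready to be closed.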