Wednesday, November 16, 2011

How OSPF SPF Adaptive Timers are implemented in IOS and JUNOS


Cisco Systems and Juniper Networks have both achieved strong market penetration, and most operators and providers deploy platforms from both vendors. Network engineers, especially those working in operator environments, therefore need to know how each vendor’s platforms are architected, how their operating systems are structured, and how to configure them. That is still not enough for design engineers, who have to make sure their multi-vendor network converges without interoperability issues; they have to dig deeper and understand how each of the leading vendors implements the technologies and follows the RFCs.

Today we will start by explaining how Cisco Systems and Juniper Networks implement the OSPF SPF adaptive timers, also called SPF throttling (Cisco) or SPF hold-down (Juniper).
Before we dig into that, let’s talk a little bit about what OSPF SPF adaptive timers are designed to do for us, and then we’ll take a look at how each vendor implements the concept.

If we recall from our OSPF background, the OSPF SPF algorithm is designed to run upon the arrival of LSAs. So, if each LSA triggers a full or incremental SPF run, and LSAs are arriving fast, SPF can begin eating up the majority of your CPU.

The challenge in large-scale networks is to quickly react to network changes while at the same time not allowing SPF calculations to dominate the route processors. This is the goal of SPF delay, also called SPF hold-down or SPF throttling.

Rather than kick off an SPF calculation every time a new LSA/LSP arrives, SPF delay forces the router to wait a bit between SPF runs. If a large number of LSA/LSPs are being flooded, a delay between SPF runs means that more LSA/LSPs are added to the link state database during the hold-down period. Efficiency is then increased because when the hold-down period expires and SPF is run, more network changes are included in a single calculation.
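
To make the batching effect concrete, here is a minimal Python sketch. It is not vendor code: the function name batch_spf_runs and the arrival times are made up for illustration, and it assumes a fixed delay that starts at the first LSA of each burst. All times are in milliseconds.

```python
def batch_spf_runs(lsa_arrivals, spf_delay):
    """Group LSA arrival times (ms) into SPF runs.

    The first LSA that arrives while no SPF is pending starts the delay
    timer; every LSA arriving before the timer expires is handled by that
    same run. Returns a list of (spf_run_time, lsas_covered) tuples.
    """
    runs = []
    run_at = None   # time of the pending SPF run, if any
    covered = 0     # LSAs that the pending run will take into account
    for t in sorted(lsa_arrivals):
        if run_at is not None and t <= run_at:
            covered += 1          # still inside the delay window
            continue
        if run_at is not None:
            runs.append((run_at, covered))
        run_at = t + spf_delay    # new burst: start a new delay timer
        covered = 1
    if run_at is not None:
        runs.append((run_at, covered))
    return runs


# Hypothetical burst: six LSAs flooded within half a second.
burst = [0, 50, 100, 200, 350, 500]
print(len(batch_spf_runs(burst, spf_delay=0)))  # 6 (one SPF run per LSA)
print(batch_spf_runs(burst, spf_delay=200))     # [(200, 4), (550, 2)]
```

With no delay every LSA costs one SPF run; with a 200 ms delay the same burst is handled in two runs, at the price of reacting up to 200 ms later.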

But the efficiency you get from SPF delay has a cost: it increases your network convergence time. So, the challenge is to make the delay interval long enough when abnormal things happen while keeping it short when the network is stable, so that you get quick convergence. This leads to the concept of adaptive SPF timers.

Both Cisco and Juniper offer adaptive SPF timers, but with different approaches. In the coming sections, we explain the mechanism used by each vendor.

Adaptive SPF Timers in JUNOS


Juniper Networks uses a linear fast/slow algorithm for adaptive SPF timers. The first parameter is the SPF delay timer, which is the minimum delay between the detection of a topology change and the moment the SPF algorithm actually runs. This period is 200ms by default and is configurable between 50 and 8000ms with the delay statement under spf-options.

The second parameter is rapid-runs. If three (the default) SPF runs are triggered in quick succession, indicating instability in the network, the router enters “slow mode” and a third parameter, the hold-down timer, starts. Any subsequent SPF calculation is not run until the hold-down timer expires. The router remains in “slow mode” until the hold-down period has passed since the last SPF run, indicating that the network has converged, and then switches back to “fast mode”, where the system reverts to the configured values for the delay and rapid-runs statements.

The default values for SPF calculations in JUNOS can be seen below:

Default SPF timers values in JUNOS
r2@r2> show ospf overview | match SPF
Full SPF runs: 280
SPF delay: 0.200000sec, SPF holddown: 5 sec, SPF rapid runs: 3

Changing SPF Timers in JUNOS

The configuration stanza for JunOS shows how these settings may be changed.

spf-options {
    delay milliseconds;
    holddown milliseconds;
    rapid-runs number;
}

These default values can be changed with the following command:

[edit protocols ospf]
r1@r1# set spf-options delay milliseconds holddown milliseconds rapid-runs number

Now we are going to play with the timers and run the debugs, and examine the behavior. We will set the delay to 1 sec and the hold-down timer to 20 sec while keeping the rapid-runs as default.

spf-options {
delay 1000;
holddown 20000;
}


The log below shows, on lines 2, 6 and 10, that each SPF run is scheduled 1 second after the LSA update. Once three SPF runs have completed, the router moves into a slower mode of operation.

12:01:50.465905 OSPF full SPF refresh scheduled for topology default
12:01:50.466445 OSPF SPF scheduled for topology default in 1s
12:01:51.467761 Starting full SPF refresh for topology default
12:02:04.540150 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:02:04.541073 OSPF full SPF refresh scheduled for topology default
12:02:04.541581 OSPF SPF scheduled for topology default in 1s
12:02:05.543546 Starting full SPF refresh for topology default
12:02:12.886187 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:02:12.892787 OSPF full SPF refresh scheduled for topology default
12:02:12.893285 OSPF SPF scheduled for topology default in 1s
12:02:13.894226 Starting full SPF refresh for topology default

The next log shows that SPF started 20 sec after the previous SPF run (at t=12:02:13). The default number of SPF calculations that can occur in succession is 3, and the configurable range is 1 through 5. Each SPF calculation runs after the configured SPF delay. When the maximum number of SPF calculations has occurred, the hold-down timer begins; we previously configured it to 20 seconds. Any subsequent SPF calculation is not run until the hold-down timer expires. This is why the LSA update received on line 4 does not immediately trigger an SPF run.

12:02:20.739927 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:02:20.747717 OSPF full SPF refresh scheduled for topology default
12:02:20.756118 OSPF SPF scheduled for topology default in 13.140569s
12:02:26.990677 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:02:33.896073 Starting full SPF refresh for topology default

Next, the log shows the router once again enters the fast mode…

12:02:59.734614 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:02:59.753923 OSPF full SPF refresh scheduled for topology default
12:02:59.754409 OSPF SPF scheduled for topology default in 1s
12:03:00.755847 Starting full SPF refresh for topology default
12:03:07.494415 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:03:07.501625 OSPF full SPF refresh scheduled for topology default
12:03:07.502166 OSPF SPF scheduled for topology default in 1s
12:03:08.503663 Starting full SPF refresh for topology default
12:03:57.215931 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:03:57.223481 OSPF full SPF refresh scheduled for topology default
12:03:57.223998 OSPF SPF scheduled for topology default in 1s
12:03:58.225848 Starting full SPF refresh for topology default

We can also observe from the previous log that although 3 more SPF runs have taken place, the router does not move into slow mode again. This is because roughly 50 sec passed between the second SPF run (12:03:08) and the third (12:03:58) in the set of 3. If the 3 SPF runs happen within 3 x “delay value”, or in our case 3 seconds, the router will start to throttle the number of SPF runs and start the holddown timer countdown. If the SPF runs fall outside 3 x the configured delay value, the rapid-run counter is reset to 0 and no back-off algorithm is run. Note, however, that the first three runs in the earlier log (12:01:51 to 12:02:13) were spread over about 22 seconds and still triggered slow mode, so the window the router actually applies appears to be wider than 3 seconds.
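
The fast/slow behaviour seen in the logs so far can be approximated with a short Python sketch. It is only a model, not JUNOS code: the names junos_spf_starts, to_ms and to_clock are made up, and it rests on one assumption of mine, namely that SPF runs count as "in succession" as long as less than one hold-down period separates a new topology change from the previous SPF run. That assumption was chosen because it fits these logs. The trigger times are the "SPF scheduled" timestamps from the logs, truncated to milliseconds (12:02:26.990 is the LSUpdate arrival, since no scheduling line was logged for it).

```python
def to_ms(clock):
    """'HH:MM:SS.mmm' -> integer milliseconds since midnight."""
    h, m, rest = clock.split(":")
    s, ms = rest.split(".")
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def to_clock(ms):
    """Integer milliseconds since midnight -> 'HH:MM:SS.mmm'."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def junos_spf_starts(triggers, delay=200, holddown=5000, rapid_runs=3):
    """Return SPF start times (ms) for topology-change times (ms)."""
    starts = []
    streak = 0      # SPF runs counted toward rapid_runs
    slow = False    # True while the router is in "slow mode"
    for t in sorted(triggers):
        last = starts[-1] if starts else None
        if last is not None and t <= last:
            continue  # an SPF run is already scheduled and covers this change
        if last is not None and t - last >= holddown:
            # ASSUMPTION: one quiet hold-down period since the last SPF run
            # ends slow mode and resets the rapid-runs counter.
            slow, streak = False, 0
        if slow:
            # Wait for the hold-down timer (and at least the SPF delay).
            start = max(last + holddown, t + delay)
        else:
            start = t + delay
            streak += 1
            slow = streak >= rapid_runs
        starts.append(start)
    return starts


log_triggers = ["12:01:50.466", "12:02:04.541", "12:02:12.893",
                "12:02:20.756", "12:02:26.990", "12:02:59.754",
                "12:03:07.502", "12:03:57.223"]
starts = junos_spf_starts([to_ms(t) for t in log_triggers],
                          delay=1000, holddown=20000)
for s in starts:
    print(to_clock(s))
# 12:01:51.466
# 12:02:05.541
# 12:02:13.893
# 12:02:33.893
# 12:03:00.754
# 12:03:08.502
# 12:03:58.223
```

Each printed time is within a few milliseconds of the corresponding "Starting full SPF refresh" line, including the held-down run at 12:02:33 and the return to a 1 second delay afterwards.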

Now, shown in the next log snippet, the router will enter the slow mode and the holddown timer will start, because three SPF runs have occurred in succession.

12:04:03.364745 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:04:03.378123 OSPF full SPF refresh scheduled for topology default
12:04:03.378655 OSPF SPF scheduled for topology default in 1s
12:04:04.379888 Starting full SPF refresh for topology default
12:04:15.329694 OSPF rcvd LSUpdate 91.198.180.250 -> 224.0.0.5 (fxp1.100 IFL 95 area 0.0.0.0)
12:04:15.349992 OSPF full SPF refresh scheduled for topology default
12:04:15.350510 OSPF SPF scheduled for topology default in 1s
12:04:16.352016 Starting full SPF refresh for topology default

And finally, the following log shows that SPF again started 20 sec after the last SPF run (at t=12:04:16).


The figure below charts the debug output above, which can help you better understand the JunOS behaviour with the SPF timers.

Adaptive SPF Timers in IOS

Cisco Systems introduced an exponential backoff algorithm for the adaptive SPF timers, built on three configurable timers.
This exponential backoff limits the number of SPF computations during times of network instability by doubling the delay before each SPF run, up to a maximum hold delay, for as long as the instability lasts. When the period of instability ends, the delay is reset to its original value. Three timers are associated with SPF exponential backoff: Start Time, Initial-Hold Time, and Max-Hold Time.
IOS keeps an internal timer called the waiting-interval, and an SPF computation is delayed until it expires. When a topology change is received for the first time, the waiting-interval is set to the start timer, which is similar to the SPF delay in JUNOS, and the SPF computation is delayed by that value. When the SPF computation completes, a new waiting-interval starts with the value of the initial-hold timer and the router enters “slow mode”. If there is a topology change during this waiting-interval, the SPF computation runs when the initial-hold timer expires. At the completion of that SPF computation the waiting-interval is set to twice the value of the initial-hold timer, and the cycle repeats. So for example, if the start timer is 100ms and the initial-hold timer is 1000ms, the router delays the first SPF run by 100ms, the second by 1000ms, the third by 2000ms, the fourth by 4000ms, and so on.
The waiting-interval grows exponentially as 2^t*initial-hold until it reaches the max-hold value. After this, any topology change during the current waiting-interval causes the next SPF computation to run at the expiration of the max-hold time, and the next waiting-interval stays equal to the constant max-hold timer. This ensures that exponential growth is limited. If SPF has not run for twice the time specified by the max-hold timer, the router switches back to “fast mode”, in which the start delay timer is used and the waiting-interval is reset to its initial value.
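
The doubling rule can be written down in a few lines of Python. This is just the arithmetic from the paragraphs above, not IOS code; the function name ios_spf_delays is made up, and the max-hold value of 10000ms in the example is an assumed value for illustration. It also assumes that a topology change arrives during every waiting-interval, so the router never leaves slow mode.

```python
def ios_spf_delays(spf_start, spf_hold, spf_max_wait, runs):
    """Delay (ms) applied before each of `runs` consecutive SPF runs."""
    delays = []
    for n in range(runs):
        if n == 0:
            delays.append(spf_start)  # first change: start timer
        else:
            # initial-hold, then doubled each time, capped at max-hold
            delays.append(min(spf_hold * 2 ** (n - 1), spf_max_wait))
    return delays


print(ios_spf_delays(100, 1000, 10000, 6))
# [100, 1000, 2000, 4000, 8000, 10000]
```
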
The default values for SPF calculations in IOS can be seen below:

Default SPF timers values in IOS
R2#sh ip ospf | i SPF
Initial SPF schedule delay 5000 msecs
Minimum hold time between two consecutive SPFs 10000 msecs
Maximum wait time between two consecutive SPFs 10000 msecs

Changing SPF Timers in IOS

These default values can be changed with the following command:
R1(config)# router ospf 100
R1(config-router)# timers throttle spf spf-start spf-hold spf-max-wait

As we did above with JunOS, we will play with the SPF throttle timers, run the debugs, and examine the behavior. We will set the start delay timer to 1 sec, the initial-hold timer to 5 sec, and the max-hold timer to 50 sec.
The log below shows, on line 2, that SPF ran at t=12:21:30, which is 1 second after the LSA update, and, on line 5, that the next wait_interval is set to the initial-hold time of 5 sec.

12:21:29: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:21:30: OSPF: Begin SPF at 54881.208ms, process time 13316ms
12:21:30: spf_time 15:14:43.672, wait_interval 1000ms
12:21:30: OSPF: End SPF at 54889.600ms, Total elapsed time 84ms
12:21:30: Schedule time 15:14:44.672, Next wait_interval 5000ms

The next log shows that the waiting_interval doubles after each SPF run. Starting with a waiting_interval of 5 sec, which is the initial-hold timer, as shown on line 3, the waiting_interval on line 8 is 10 sec, then 20 sec on line 14 and 40 sec on line 22.
While the router is in slow mode, no SPF runs until the wait_interval elapses, no matter how many topology changes have been detected. This is why the LSA updates received on lines 11 and 12, and also on lines 17, 18, 19 and 20, do not immediately trigger an SPF run.

12:21:32: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:21:35: OSPF: Begin SPF at 54889.600ms, process time 13424ms
12:21:35: spf_time 15:14:44.672, wait_interval 5000ms
12:21:35: OSPF: End SPF at 54889.672ms, Total elapsed time 72ms
12:21:35: Schedule time 15:14:49.672, Next wait_interval 10000ms
12:21:41: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:21:45: OSPF: Begin SPF at 54899.672ms, process time 13516ms
12:21:45: spf_time 15:14:49.672, wait_interval 10000ms
12:21:45: OSPF: End SPF at 54899.720ms, Total elapsed time 48ms
12:21:45: Schedule time 15:14:59.720, Next wait_interval 20000ms
12:21:58: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:22:03: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:22:05: OSPF: Begin SPF at 54919.720ms, process time 13580ms
12:22:05: spf_time 15:14:59.720, wait_interval 20000ms
12:22:05: OSPF: End SPF at 54919.776ms, Total elapsed time 56ms
12:22:05: Schedule time 15:15:19.776, Next wait_interval 40000ms
12:22:22: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:22:27: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:22:32: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:22:39: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:22:45: OSPF: Begin SPF at 54959.776ms, process time 13684ms
12:22:45: spf_time 15:15:19.776, wait_interval 40000ms
12:22:46: OSPF: End SPF at 54959.884ms, Total elapsed time 108ms
12:22:46: Schedule time 15:15:59.884, Next wait_interval 50000ms
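
The two logs above can be reproduced with a small Python model of the backoff. As before, this is a sketch and not IOS code. The names ios_spf_runs, hms and clock are made up, all values are in whole seconds because that is the resolution of the log timestamps, and one branch is an assumption of mine: if a change arrives after the waiting-interval has expired but before twice max-hold has passed, the model runs SPF immediately and still doubles the interval. The logs above do not exercise that case.

```python
def hms(text):
    """'HH:MM:SS' -> seconds since midnight."""
    h, m, s = (int(x) for x in text.split(":"))
    return h * 3600 + m * 60 + s


def clock(sec):
    """Seconds since midnight -> 'HH:MM:SS'."""
    return f"{sec // 3600:02d}:{sec % 3600 // 60:02d}:{sec % 60:02d}"


def ios_spf_runs(changes, spf_start, spf_hold, spf_max_wait):
    """Return (spf_time, next_wait_interval) pairs, all in seconds."""
    runs = []
    wait = None  # waiting-interval that follows the most recent SPF run
    for t in sorted(changes):
        if runs and t <= runs[-1][0]:
            continue  # an SPF run is already pending and covers this change
        if not runs or t - runs[-1][0] >= 2 * spf_max_wait:
            # Fast mode: delay by the start timer, then arm initial-hold.
            start, wait = t + spf_start, spf_hold
        else:
            # Slow mode: run when the waiting-interval expires, then double
            # it up to max-hold. ASSUMPTION: a change that arrives after the
            # interval expired is served immediately (max picks t).
            start = max(t, runs[-1][0] + wait)
            wait = min(2 * wait, spf_max_wait)
        runs.append((start, wait))
    return runs


# "Detect change" timestamps from the two logs above.
detected = ["12:21:29", "12:21:32", "12:21:41", "12:21:58", "12:22:03",
            "12:22:22", "12:22:27", "12:22:32", "12:22:39"]
runs = ios_spf_runs([hms(t) for t in detected],
                    spf_start=1, spf_hold=5, spf_max_wait=50)
for start, next_wait in runs:
    print(clock(start), next_wait)
# 12:21:30 5
# 12:21:35 10
# 12:21:45 20
# 12:22:05 40
# 12:22:45 50
```

The printed times match the "Begin SPF" lines, and the second column matches the "Next wait_interval" values of 5000, 10000, 20000, 40000 and 50000 ms.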

The next log shows that the waiting_interval has reached the max-hold time (50 sec) and that the subsequent waiting_intervals stay equal to the constant max-hold timer, as on lines 3 and 8.

12:23:17: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:23:36: OSPF: Begin SPF at 55009.884ms, process time 13808ms
12:23:36: spf_time 15:15:59.884, wait_interval 50000ms
12:23:36: OSPF: End SPF at 55009.928ms, Total elapsed time 44ms
12:23:36: Schedule time 15:16:49.928, Next wait_interval 50000ms
12:24:26: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:24:26: OSPF: Begin SPF at 55059.928ms, process time 13872ms
12:24:26: spf_time 15:16:49.928, wait_interval 50000ms
12:24:26: OSPF: End SPF at 55059.968ms, Total elapsed time 40ms
12:24:26: Schedule time 15:17:39.968, Next wait_interval 50000ms

We can also observe from the previous log that although the LSA on line 6 arrived 60 sec after the last SPF run, which is more than the waiting_interval, the router does not move into fast mode again. This is because the router reverts to fast mode only when SPF has not run for twice the time specified by the max-hold timer.
Now, as shown in the next log snippet, the router returns to fast mode and uses the start timer again, because SPF has not run for more than 100 sec, which is twice the max-hold time.


12:26:22: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:26:23: OSPF: Begin SPF at 55177.420ms, process time 13932ms
12:26:23: spf_time 15:19:36.420, wait_interval 1000ms
12:26:23: OSPF: End SPF at 55177.488ms, Total elapsed time 68ms
12:26:23: Schedule time 15:19:37.488, Next wait_interval 5000ms
12:26:29: OSPF: Detect change in LSA type 1, LSID 2.2.2.2, from 2.2.2.2 area 0
12:26:28: OSPF: Begin SPF at 55177.488ms, process time 14016ms
12:26:28: spf_time 15:19:37.488, wait_interval 5000ms
12:26:28: OSPF: End SPF at 55184.660ms, Total elapsed time 108ms
12:26:28: Schedule time 15:19:42.660, Next wait_interval 10000ms
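
The condition for leaving slow mode is simple enough to check by hand, but here it is as a tiny Python sketch using the timestamps from the logs. The function names are made up, and treating exactly twice max-hold as "long enough" is my assumption.

```python
def hms(text):
    """'HH:MM:SS' -> seconds since midnight."""
    h, m, s = (int(x) for x in text.split(":"))
    return h * 3600 + m * 60 + s


def reverts_to_fast_mode(last_spf, change, spf_max_wait):
    """True when SPF has been idle for at least twice max-hold (seconds)."""
    return hms(change) - hms(last_spf) >= 2 * spf_max_wait


# Change at 12:24:26, last SPF at 12:23:36: idle for 50 sec.
print(reverts_to_fast_mode("12:23:36", "12:24:26", 50))  # False
# Change at 12:26:22, last SPF at 12:24:26: idle for 116 sec.
print(reverts_to_fast_mode("12:24:26", "12:26:22", 50))  # True
```
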

For more clarity, I have reflected the debugs in the following figure, so you can use both the debugs and the figure to examine the behavior.

