Generic Kernel Design
Kernels normally do not provide any immediately perceived or revenue-generating functionality. Instead, they perform housekeeping activities such as memory allocation and hardware management and other system-level tasks. Kernel threads are likely the most often run tasks in the entire system. Consequently, they have to be robust and run with minimal impact on other processes.
In the past, kernel architecture largely defined the operating structure of the entire system with respect to memory management and process scheduling. Hence, kernels were considered important differentiators among competing designs.
Historically, the disputes between the proponents and opponents of lightweight versus complex kernel architectures came to a practical end when most operating systems became functionally decoupled from their respective kernels. Once software distributions became available with alternate kernel configurations, researchers and commercial developers were free to experiment with different designs.
For example, the Carnegie Mellon Mach microkernel was originally intended to be a drop-in replacement for the kernel in BSD UNIX and was later used in various operating systems, including mkLinux and GNU FSF projects. Similarly, some software projects that started life as purely microkernel-based systems later adopted portions of monolithic designs.
Over time, the radical approach of having a small kernel and moving system functions into the user-space processes did not prevail. A key reason for this was the overhead associated with extra context switches between frequently executed system tasks running in separate memory spaces. Furthermore, the benefits associated with restartability of essentially all system processes proved to be of limited value, especially in embedded systems. With the system code being very well tested and limited to scheduling, memory management and a handful of device drivers, the potential errors in kernel subsystems are more likely to be related to hardware failures than to software bugs. This means, for example, that simply restarting a faulty disk driver is unlikely to help the routing engine stay up and running, as the problem with storage is likely related to a hardware failure (for example, uncorrectable fault in a mass storage device or system memory bank).
Another interesting point is that although both monolithic and lightweight kernels were widely studied by almost all operating system vendors, few have settled on purist implementations.
For example, Apple’s Mac OS X was originally based on a microkernel architecture, but now runs system processes, drivers and the operating environment in BSD-like subsystems. Microsoft NT and derivative operating systems also went through multiple changes, moving critical performance components such as graphical and I/O subsystems in and out of the system kernel to find the right balance of stability, performance and predictability. These changes make NT a hybrid operating system. On the other hand, freeware development communities such as FSF, FreeBSD and NetBSD have mostly adopted monolithic designs (for example, the Linux kernel) and have gradually introduced modularity into selected kernel sections (for example, device drivers).
So what difference does kernel architecture make to routing and control?
Monolithic Versus Microkernel Network Operating System Designs
In the network world, both monolithic and microkernel designs can be used with success.
However, the ever-growing requirements for a system kernel quickly turn any classic implementation into a compromise. Most notably, the capability to support a real-time forwarding plane along with stateful and stateless forwarding models and extensive state replication requires a mix of features not available from any existing monolithic or microkernel OS implementation.
This lack can be overcome in two ways.
First, a network OS can be constrained to a limited class of products by design. For instance, if the OS is not intended for mid- to low-level routing platforms, some requirements can be lifted. The same can be done for flow-based forwarding devices, such as security appliances. This artificial restriction allows the network operating systems to stay closer to their general-purpose siblings—at the cost of fracturing the product lineup. Different network element classes will now have to maintain their own operating systems, along with unique code bases and protocol stacks, which may negatively affect code maturity and customer experience.
Second, the network OS can evolve into a specialized design that combines the architecture and advantages of multiple classic implementations.
This custom kernel architecture is a more ambitious development goal because the network OS gets further away from the donor OS, but the end result can offer the benefits of feature consistency, code maturity, and operating experience.
This is the design path that Juniper selected for Junos OS.
Junos OS Kernel
According to the formal criteria, the Junos OS kernel is fully customizable (Figure 3). At the very top is a portion of code that can be considered a microkernel. It is responsible for real-time packet operations and memory management, as well as interrupts and CPU resources. One level below it is a more conventional kernel that contains a scheduler, memory manager and device drivers in a package that looks more like a monolithic design. Finally, there are user-level (POSIX) processes that actually serve the kernel and implement functions normally residing inside the kernels of classic monolithic router operating systems. Some of these processes can be compound or run on external CPUs (or packet forwarding engines). In Junos OS, examples include periodic hello management, kernel state replication, and protected system domains (PSDs).
The entire structure is strictly hierarchical, with no underlying layers dependent on the operations of the top layers. This high degree of virtualization allows the Junos OS kernel to be both fast and flexible.
However, even the most advanced kernel structure is not a revenue-generating asset of the network element. Uptime is the only measurable metric of system stability and quality. This is why the fundamental difference between the Junos OS kernel and competing designs lies in the focus on reliability.
Figure 3: Generic Junos OS 9.0 architectural structure
Coupled with Juniper’s industry-leading nonstop active routing and system upgrade implementation, kernel state replication acts as the cornerstone for continuous operation. In fact, the Junos OS redundancy scheme is designed to protect data plane stability and routing protocol adjacencies at the same time. With in-service software upgrade, networks powered by Junos OS are becoming immune to the downtime related to the introduction of new features or bug fixes, enabling them to approach true continuous operation. Continuous operation demands that the integrity of the control and forwarding planes remains intact in the event of failover or system upgrades, including minor and major release changes. Devices running Junos OS will not miss or delay any routing updates when either a failure or a planned upgrade event occurs.
This goal of continuous operation under all circumstances and during maintenance tasks is ambitious, and it reflects Juniper’s innovation and network expertise, which is unique among network vendors.
Process Scheduling in Junos OS
Innovation in Junos OS does not stop at the kernel level; rather, it extends to all aspects of system operation. As mentioned before, there are two tiers of schedulers in Junos OS, the topmost becoming active in systems with a software data plane to ensure the real-time handling of incoming packets. It operates in real time and ensures that quality of service (QoS) requirements are met in the forwarding path.
The second-tier (non-real-time) scheduler resides in the base Junos OS kernel and is similar to its FreeBSD counterpart. It is responsible for scheduling system and user processes in a system to enable preemptive multitasking.
In addition, a third-tier scheduler exists within some multithreaded user-level processes, where threads operate in a cooperative multitasking model. When a compound process gets the CPU share, it may treat it like a virtual CPU, with threads taking and leaving the processor according to their execution flow and the sequence of atomic operations. This approach allows closely coupled threads to run in a cooperative multitasking environment and avoid being entangled in extensive IPC and resource-locking activities (Figure 4).
Figure 4: Multilevel CPU scheduling in Junos OS
Another interesting aspect of multi-tiered scheduling is resource separation. Unlike first-generation designs, Junos OS systems with a software forwarding plane cannot freeze when overloaded with data packets, as the first-level scheduler will continue granting CPU cycles to the control plane.
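The resource separation described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the first-level scheduler reserves one tick in every ten for the control plane; the function name `run_top_level` and the parameter `control_every` are hypothetical and do not come from Junos OS.

```python
def run_top_level(ticks, packet_backlog, control_every=10):
    """Toy first-level scheduler (illustrative assumption, not Junos OS code).

    Every `control_every`-th tick is reserved for the control plane, so the
    control plane keeps running even when the packet backlog never drains.
    Returns (forwarding_ticks, control_ticks).
    """
    forwarding_ticks = 0
    control_ticks = 0
    for tick in range(ticks):
        reserved = tick % control_every == control_every - 1
        if reserved or packet_backlog == 0:
            # Reserved slot, or nothing to forward: run the control plane.
            control_ticks += 1
        else:
            # Forward one packet from the backlog.
            packet_backlog -= 1
            forwarding_ticks += 1
    return forwarding_ticks, control_ticks


# A backlog far larger than the tick budget models an overloaded data plane.
overloaded = run_top_level(ticks=100, packet_backlog=10**6)
idle = run_top_level(ticks=100, packet_backlog=0)
```

In this sketch the overloaded case still leaves the control plane its reserved share, while an idle data plane hands every tick to the control plane. A real scheduler would use priorities and timing budgets rather than a fixed tick pattern.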
Junos OS Routing Protocol Process
The routing protocol process daemon (RPD) is the most complex process in a Junos OS system. It not only contains much of the actual code for routing protocols, but also has its own scheduler and memory manager. The scheduler within RPD implements a cooperative multitasking model, in which each thread is responsible for releasing the CPU after an atomic operation has been completed. This design allows several closely related threads to coexist without the overhead of IPC and to scale without risk of unwanted interactions and mutual locks.
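The cooperative model can be sketched with Python generators, where each `yield` stands for a thread releasing the CPU after one atomic operation. This is a generic illustration of cooperative multitasking, not the RPD scheduler itself; the names `protocol_task` and `run_cooperative` are hypothetical.

```python
from collections import deque


def protocol_task(name, operations, log):
    """A cooperative thread: performs one atomic operation, then yields."""
    for op in range(operations):
        log.append(f"{name}:{op}")  # the atomic operation
        yield                       # voluntarily release the CPU


def run_cooperative(tasks):
    """Round-robin scheduler that only switches when a task yields."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)              # run until the task's next yield
        except StopIteration:
            continue                # task finished, drop it
        ready.append(task)          # otherwise queue it again


log = []
run_cooperative([
    protocol_task("ospf", 2, log),
    protocol_task("bgp", 3, log),
])
```

Because a switch can only happen at a `yield`, the tasks can share the `log` list without locks, which mirrors the point about avoiding IPC and mutual locking. The trade-off is that a task that never yields would starve all the others.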
The threads within RPD are highly modular and may also run externally as standalone POSIX processes—this is, for example, how many periodic protocol operations are performed. In the early days of RPD, each protocol was responsible for its own adjacency management and control. Now, most keepalive processing resides outside RPD, in the Bidirectional Forwarding Detection protocol (BFD) daemon and periodic packet management process daemon (PPMD), which are, in turn, distributed between the routing engine and the line cards. The unique capability of RPD to combine preemptive and cooperative multitasking powers the most scalable routing stack in the market.
Compound processes similar to RPD are known to be very effective but sometimes are criticized for the lack of protection between components. It has been said that a failing thread will cause the entire protocol stack to restart. Although this is a valid point, it is easy to compare the impact of this error against the performance of the alternative structure, where every routing protocol runs in a dedicated memory space.
Assume that the router serves business VPN customers, and the ultimate revenue-generating product is continuous reachability between remote sites. At the very top is a BGP process responsible for creating forwarding table entries. Those entries are ultimately programmed into a packet path ASIC for the actual header lookup and forwarding. If the BGP process hits a bug and restarts, forwarding table entries may become stale and would have to be flushed, thus disrupting customer traffic. But BGP relies on lower protocols in the stack for traffic engineering and topology information, and it will not be able to create the forwarding table without OSPF or RSVP. If any of these processes are restarted, BGP will also be affected (Figure 5). This case supports the benefits of running BGP, OSPF and RSVP in shared memory space, where the protocols can access common data without IPC overhead.
Figure 5: Hierarchical protocol stack operation
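The dependency argument can be made concrete with a small sketch that computes which protocols are affected when one of them restarts. The dependency table below is an illustrative assumption based on the example in the text (BGP relying on OSPF and RSVP); the RSVP-on-OSPF edge and the function name `affected_by_restart` are hypothetical additions.

```python
# Illustrative dependency table: each protocol lists what it relies on.
# The RSVP -> OSPF edge is an assumption made for this sketch.
DEPENDS_ON = {
    "BGP": ["OSPF", "RSVP"],
    "RSVP": ["OSPF"],
    "OSPF": [],
}


def affected_by_restart(restarted, depends_on=DEPENDS_ON):
    """Return every protocol that directly or indirectly relies on `restarted`."""
    affected = set()
    frontier = [restarted]
    while frontier:
        current = frontier.pop()
        for protocol, requirements in depends_on.items():
            if current in requirements and protocol not in affected:
                affected.add(protocol)
                frontier.append(protocol)
    return affected
```

Under these assumptions, restarting OSPF affects both RSVP and BGP, while restarting BGP affects no other protocol. In a strictly hierarchical stack, placing the lower layers in separate memory spaces therefore does little to shield the layers above them.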
In a reverse example, several routing protocols legitimately operate at the same level and do not depend on each other. One case would be unicast family BGP and Protocol Independent Multicast (PIM). Although both depend on reachability information about connected and IGP known networks, failures in one protocol can be safely ignored in the other. For instance, unicast forwarding to remote BGP known networks can continue even if multicast forwarding is disrupted by PIM failure. In this case, the multicast and unicast portions of the routing code are better off stored in different protected domains so they do not affect each other.
Looking deeper into the realm of exceptions, we find that they occur due to software and hardware failures alike. A faulty memory bank may yield the same effect as software that references a corrupt pointer; in both cases, the process will most likely be restarted by the system.
In general, the challenge in ensuring continuous operation is fourfold:
• First, the existing forwarding entries should not be affected. Restart of a process should not affect the traffic flowing through the router.
• Second, the existing forwarding entries should not become stale. Routers should not misdirect traffic in the event of a topology change (or lack thereof).
• Third, protocol operation should have low overhead and be well contained. Excessive CPU utilization and deadlocks are not allowed as they negatively affect node stability.
• Fourth, the routing protocol peers should not be affected. The network should remain stable.
Once again, we see that few software challenges can be met by structuring the software in one specific way.
Routing threads may operate using a cooperative, preemptive or hybrid task model, but failure recovery still calls for state restoration using external checkpoint facilities. If vital routing information were duplicated elsewhere and could be recovered promptly, the failure would be transparent to user traffic and protocol peers alike. Transparency through prompt recovery is the principal concept underlying any NSR design and the main idea behind the contemporary Juniper Networks RPD implementation.
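The idea of recovery from an external checkpoint can be sketched as follows. This is a deliberately simplified, single-machine illustration and not the Junos OS NSR mechanism; `CheckpointStore` and `RoutingProcess` are hypothetical names, and a real design would replicate state to a separate routing engine.

```python
import copy


class CheckpointStore:
    """Hypothetical external store that outlives the routing process."""

    def __init__(self):
        self._snapshot = {}

    def save(self, routes):
        self._snapshot = copy.deepcopy(routes)

    def load(self):
        return copy.deepcopy(self._snapshot)


class RoutingProcess:
    """Toy routing process that checkpoints every change it makes."""

    def __init__(self, store):
        self.store = store
        self.routes = store.load()    # restore state on start or restart

    def learn(self, prefix, next_hop):
        self.routes[prefix] = next_hop
        self.store.save(self.routes)  # checkpoint after each update


store = CheckpointStore()
first = RoutingProcess(store)
first.learn("10.0.0.0/8", "192.0.2.1")
first.learn("172.16.0.0/12", "192.0.2.2")

del first                             # simulate a process failure
restarted = RoutingProcess(store)     # new instance recovers from the store
```

The restarted instance begins with the same routes the failed one had learned, so in this sketch neither forwarding entries nor peers would need to be rebuilt from scratch. The sketch ignores the cost and consistency of checkpointing, which is where most of the real engineering effort lies.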
Instead of focusing on one technology or structure, Juniper Networks engineers evolve the Junos OS protocol stack according to a “survival of the fittest” principle, toward the goal of true nonstop operation, reliability and usability. State replication, checkpointing and IPC are all used to reduce the impact of software and hardware failures. The Junos OS control plane is designed to maintain speed, uptime and full state under the most unfavorable network situations.
Adapting to ever-changing real-world conditions and practical applications, the Junos OS routing architecture will continue to evolve to become even more advanced, with threads added or removed as dictated by the needs of best-in-class software design. Juniper Networks software is constantly adapted to the operating environment, and as you read this paragraph, new ideas and concepts are being integrated into Junos OS. Stay tuned.
Scalability
Junos OS can scale up and down to platforms of different sizes. This capability is paramount to the concept of “network OS” that can power a diverse range of network elements. The next section highlights the challenges and opportunities seen in this aspect of networking.
Scaling Down
Scaling down is the capacity of a network operating system to run on low-end hardware, thus creating a consistent user experience and ensuring the same level of equipment resilience and reliability across the entire network, from high-end to low-end routing platforms.
Achieving this goal involves multiple challenges for a system designer. Not only does the code have to be efficient on different hardware architectures, but low-end systems bring their own unique requirements, such as resource constraints, cost, and unique security and operations models. In addition, many low-end routers, firewalls and switches require at least some CPU assistance for packet forwarding or services, thus creating the need for a software forwarding path.
Taking an arbitrary second-generation router OS and executing it in a low-end system can be a challenging task, evidenced by the fact that no vendor except Juniper actually ships low-end and high-end systems running the same OS based on second-generation design principles or better.
But bringing a carrier-sized OS all the way down to the enterprise is also rewarding.
It brings immediate advantages to customers in the form of uniform management, compatibility and OPEX savings across the entire network. It also improves the original OS design. During the “fasting” exercise needed to fit the OS into low-end devices, the code is extensively reviewed, and code structure is optimized. Noncritical portions of code are removed or redesigned.
What’s more, the unique requirements of variable markets (for example, security, Ethernet and enterprise) help stress-test the software in a wide range of situations, thus hardening the overall design. Scaling limits are pushed across many boundaries when the software adopts new roles and applications.
Finally, low-end systems typically ship in much larger quantities than high-end systems. The increased number of systems in the field proportionally amplifies the chances of finding nonobvious bugs and decreases the average impact of a given bug on the installed base worldwide. All these factors together translate into a better product, for both carriers and enterprises.
It can be rightfully said that the requirement for scaling down has been a major source of inspiration for Junos OS developers since the introduction of the Juniper Networks J Series Services Routers. The quest for efficient infrastructure has helped with such innovative projects as the Junos OS SDK, and ultimately paved the way to the concept of one OS powering the entire network, a task that had never before been achieved in the history of networking.
Scaling Up
Empowerment of a multichassis, multiple-terabit router is associated with words such as upscale and high end, all of which apply to Junos OS. However, it is mostly the control plane capacity that challenges the limits of software in modern routers with all-silicon forwarding planes. For example, a 1.6-terabit router with 80 x 10 Gigabit Ethernet core-facing interfaces may place less stress on its control plane than a 320-megabit router with 8,000 slow-speed links and a large number of IGP and BGP adjacencies behind them.
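A back-of-the-envelope calculation shows why the slower router can be the harder control plane problem. The sketch assumes one protocol adjacency per interface and a 10-second hello interval; both are illustrative assumptions, and `hellos_per_second` is a hypothetical helper.

```python
HELLO_INTERVAL_S = 10  # assumed hello interval, in seconds


def hellos_per_second(adjacencies, interval_s=HELLO_INTERVAL_S):
    """Hello packets the control plane must send per second (one direction)."""
    return adjacencies / interval_s


core_router = hellos_per_second(80)     # 80 x 10 Gigabit Ethernet interfaces
edge_router = hellos_per_second(8000)   # 8,000 slow-speed links
load_ratio = edge_router / core_router  # relative keepalive load
```

Under these assumptions the low-bandwidth router handles a hundred times more keepalive traffic than the core router, before counting the routing updates that each adjacency generates. Forwarding capacity and control plane load scale along different axes.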
Scaling is dependent on many factors. One of the most important is proper level of modularity. As discussed in the previous sections, poor containment of intermodule interactions can cause exponential growth in supplementary operations and bring a system to gridlock.
Another factor is design goal and associated architectural decisions and degree of complexity. For instance, if a router was never intended to support 5,000 MPLS LSP circuits, this number may still be configurable, but will not operate reliably and predictably. The infrastructure changes required to fix this issue can be quite significant.
Realistic, multidimensional scaling is the routing equivalent of the Dhrystone benchmark. This scaling is how a routing system proves itself to be commercially attractive to customers. Whenever discussing scaling, it is always good to ask vendors to stand behind their announced scaling limits. For example, the capability to configure 100,000 logical interfaces on a router does not necessarily mean that such a configuration is viable, as issues may arise on different fronts: slow responses to user commands, software timeouts and protocol adjacency loss. Vendor commitment means that the advertised limits are routinely checked and tested and new feature development occurs according to relevant expectations.
Scaling is where Junos OS delivers.
Some of the biggest networks in the world are built around Junos OS scaling capacities, supporting thousands of BGP and IGP peers on the same device. Likewise, core routers powered by Junos OS can support tens of thousands of transit MPLS label-switched paths (LSPs). With its industry-leading slot density on the Juniper Networks T Series Core Routers, Junos OS has proven to be one of the most scalable network operating systems in existence.
Architecture and Infrastructure
This section addresses architecture and infrastructure concerns related to parallelism, flexibility and portability, and open architecture.
Parallelism
Advances in multicore CPU development and the capability to run several routing processors in a system constitute the basis for increased efficiency in a router control plane. However, finding the right balance of price and performance can also be very difficult.
Unlike the data mining and computational tasks of supercomputers, processing of network updates is not a static job. A block of topology changes cannot be prescheduled and then sliced across multiple CPUs. In routers and switches, network state changes asynchronously (as events happen), thus rendering time-based load sharing irrelevant.
Sometimes vendors try to solve this dilemma by statically sharing the load in functional, rather than temporal, domains. In other words, they claim that if the router OS can use separate routing processors for different tasks (for example, OSPF or BGP), it can also distribute the bulk of data processing across multiple CPUs.
To understand whether this is a valid assumption, let’s consider a typical CPU utilization capture (Figure 6). What is interesting here is that the different processes are not computationally active at the same time—OSPF and BGP do not compete for CPU cycles. Unless the router runs multiple same-level protocols simultaneously, the well-designed network protocol stack stays fairly orthogonal. Different protocols serve different needs and seldom converge at the same time.
Figure 6: Typical CPU times capture (from NEC 8800 product documentation)
For instance, an IGP topology change may trigger a Dijkstra algorithm computation; until it is complete, BGP nexthop updates do not make much sense. At the same time, all protected MPLS LSPs should fall on precomputed alternate paths and not cause major RSVP activities.
Thus, the gain from placing different processes of a single control plane onto physically separate CPUs may be limited, while the loss from the overhead functions such as synchronization and distributed memory unification may be significant.
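One way to check this reasoning against a real CPU capture is to measure how often more than one protocol is active in the same sampling interval. The sketch below uses made-up sample data and a hypothetical helper, `concurrent_fraction`; it is an illustration of the argument, not a measurement of any product.

```python
def concurrent_fraction(samples):
    """Fraction of busy sampling intervals in which two or more processes ran.

    `samples` is a list of dicts mapping process name to CPU utilization
    (0.0 to 1.0) for one sampling interval.
    """
    busy = [s for s in samples if any(load > 0 for load in s.values())]
    if not busy:
        return 0.0
    concurrent = [
        s for s in busy
        if sum(1 for load in s.values() if load > 0) > 1
    ]
    return len(concurrent) / len(busy)


# Made-up capture: OSPF runs its computation first, BGP reacts afterwards.
orthogonal = [
    {"ospf": 0.8, "bgp": 0.0},
    {"ospf": 0.6, "bgp": 0.0},
    {"ospf": 0.0, "bgp": 0.9},
    {"ospf": 0.0, "bgp": 0.7},
    {"ospf": 0.0, "bgp": 0.0},
]
overlap_share = concurrent_fraction(orthogonal)
```

A result near zero means the processes almost never compete for cycles, so giving each its own CPU would buy little while still paying the synchronization overhead. A high value would point the other way.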
Does this mean that the concept of parallelism is not applicable to the routing processors? Not at all.
Good coding practice and modern compilers can make excellent use of multicore and SMP hardware, while clustered routing engines are indispensable when building multichassis (single control and data plane spanning multiple chassis) or segmented (multiple control and data planes within a single physical chassis) network devices. Furthermore, high-end designs may allow for independent scaling of control and forwarding planes, as implemented in the highly acclaimed Juniper Networks JCS1200 Control System.
With immediate access to state-of-the-art processor technology, Juniper Networks engineers heavily employ parallelism in the Junos OS control plane design, targeting both elegance and functionality.
A functional solution is the one that speeds up the control plane without unwanted side effects such as limitations in forwarding capacity. When deployed in a JCS1200, Junos OS can power multiple control plane instances (system domains) at the same time without consuming revenue-generating slots in the router chassis. Moreover, the Junos OS architecture can run multiple routing systems (including third-party code) from a single rack of routing engines, allowing an arbitrary mix-and-match of control plane and data plane resources within a point of presence (POP). These unique capabilities translate into immediate CAPEX savings, because a massively parallel control plane can be built independent of the forwarding plane and will never confront a limited common resource (such as the number of physical routers or a number of slots in each chassis).
Elegance means the design should also bring other technical advantages: for instance, bypassing space and power requirements associated with the embedded chassis and thus enabling use of faster silicon and speeding up the control plane. Higher CPU speed and memory limits can substantially improve the convergence and scaling characteristics of the entire routing domain.
The goal of Juniper design philosophy is tangible benefits to our customers—without cutting corners.
Flexibility and Portability
A sign of a good OS design is the capability to adapt the common software platform to various needs. In the network world, this equates to the adoption of new hardware and markets under the same operating system.
The capability to extend the common operating system over several products brings the following important benefits to customers:
• Reduced OPEX from consistent UI experience and common management interface
• Same code for all protocols; no unique defects and interoperability issues
• Common schedule for software releases; a unified feature set in the control plane
• Accelerated technology introduction; once developed, the feature ships on many platforms
Technology companies are in constant search of innovation both internally and externally. New network products can be developed in-house or within partnerships or acquired. Ideally, a modern network OS should be able to absorb domestic (internal) hardware platforms as well as foreign (acquired) products, with the latter being gradually folded into the mainstream software line (Figure 7).
Figure 7: Product consolidation under a common operating system
The capability to absorb in-house and foreign innovations in this way is a function of both software engineering discipline and a flexible, well-designed OS that can be adapted to a wide range of applications.
On the contrary, the continuous emergence of internally developed platforms from the same vendor featuring different software trains and versions can signify the lack of a flexible and stable software foundation.
For example, when the same company develops a core router with one OS, an Ethernet switch with another, and a data center switch with a third, this likely means that in-house R&D groups considered and rejected readily available OS designs as impractical or unfit. Although partial integration may still exist through a unified command-line interface (CLI) and shared code and features, the main message is that the existing software designs were not flexible enough to be easily adapted to new markets and possibilities. As a result, customers end up with a fractured software lineup, having to learn and maintain loosely related or completely unrelated software trains and develop expertise in all of them—an operationally suboptimal approach.
In contrast to this trend, Juniper has never used a multitrain approach with Junos OS and has never initiated multiple operating system projects. Since its inception in 1996, Junos OS has been successfully ported to a number of Intel, MIPS, and PowerPC architectures and currently powers a broad spectrum of routing products ranging from the world’s fastest Juniper Networks T1600 Core Router to low-end routing devices, Ethernet switches, and security appliances. Juniper’s clear goal is to keep all products (both internally developed and acquired together with industry-leading companies and talent) under the same Junos OS umbrella.
Degrees of Modularity
Software modularity, as previously described, has focused on the case where tasks are split into multiple loosely coupled modules. This type of modularity is called “horizontal,” as it aims at limiting dependency and mutual impact between processes operating at the same peer level. Another interesting degree of modularity is known as “vertical modularity,” where modular layers are defined between parts of the operating system in the vertical direction.
Without vertical modularity, a network OS remains built for a specific hardware and services layout. When porting to a new target, much of this infrastructure has to be rewritten. For example, both software- and hardware-based routers can provide a stateful firewall service, but they require dramatically different implementations. Without a proper vertical modularity in place, these service implementations will not have much in common, which will ultimately translate into an inconsistent user experience.
Vertical modularity solves this problem, because most OS functions become abstracted from lower-level architecture and hardware capabilities. Interaction between upper and lower OS levels happens via well-known subroutine calls. Although vertical modularity itself is almost invisible to the end user, it eliminates much of the inconsistency between various OS implementations. This can be readily appreciated by network operations center (NOC) personnel who no longer deal with platform-specific singularities and code defects. Vertical modularity is an ongoing project, and the Junos OS team has always been very innovative in this area.
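Vertical modularity can be sketched as an abstract interface between an upper-layer service and the platform-specific layer beneath it. The class and method names below (`ForwardingLayer`, `install_rule`, `FirewallService`) are hypothetical and chosen only for illustration; they do not describe actual Junos OS interfaces.

```python
from abc import ABC, abstractmethod


class ForwardingLayer(ABC):
    """Well-known call interface between upper and lower OS levels."""

    @abstractmethod
    def install_rule(self, rule: str) -> str:
        """Program one rule into the platform's forwarding path."""


class SoftwareForwarding(ForwardingLayer):
    def install_rule(self, rule: str) -> str:
        return f"software table: {rule}"


class HardwareForwarding(ForwardingLayer):
    def install_rule(self, rule: str) -> str:
        return f"ASIC filter: {rule}"


class FirewallService:
    """Upper-layer service written once, independent of the platform."""

    def __init__(self, layer: ForwardingLayer):
        self.layer = layer

    def permit(self, flow: str) -> str:
        return self.layer.install_rule(f"permit {flow}")


on_software = FirewallService(SoftwareForwarding()).permit("tcp/22")
on_hardware = FirewallService(HardwareForwarding()).permit("tcp/22")
```

The firewall service code is identical on both platforms; only the layer passed to it differs. Porting to a new target then means writing one new lower-layer class rather than a new service implementation.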