Saturday, July 30, 2011

Network Operating System Evolution - Part III

Open Architecture

An interesting implication of vertical modularity is the capability to structure code well enough to document appropriate software interfaces and allow external pluggable code. While a high degree of modularity within a system allows easy porting to different and diverse hardware architectures, a well-defined and documented application programming interface (API) can be made available to third parties for development of their own applications.

In Junos OS, the high degree of modularity and documentation eventually took the form of the Partner Solution Development Platform (PSDP), which opened the API and tool chain specific to Juniper to customers and integrators worldwide. PSDP allows these customers and integrators to co-design the operating system, fitting it precisely to their needs, especially in support of advanced and confidential applications. The degree of development may vary from minor changes to software appearance to full-scale custom packet processing tailored to specific needs.

The Juniper Networks Software Developer’s Kit (SDK) highlights the achievements of Junos OS in network code design and engineering and reflects the innovation that is integral to Juniper’s corporate culture. This high level of synergy between original equipment manufacturer (OEM) vendors and operators promises to enable creation of new services and competitive business differentiators, thus removing the barriers to network transformation. Just as the open-source FreeBSD was the donor OS for Junos OS, with the Juniper Networks SDK, Junos OS is now a platform open to all independent developers.

Product Maintenance
Another important characteristic of products is maintainability. It covers the process of dealing with software defects and new features, the ability to improve existing code, and the introduction of new services and capabilities. It also makes a big difference in the number and quality of NOC personnel who are required to run a network. Maintainability is where a large portion of OPEX resides.

Self-Healing
Routers are complex devices that depend on thousands of electronic components and millions of lines of code to operate. This is why some portion of the router installed base will almost inevitably experience software or hardware defects over the product life span.

So far, we have been describing the recovery process, in which state replication and process restarts are the basis of continuous operation. In most cases, Junos OS will recover so efficiently that customers never notice the problem, unless they closely monitor the system logs. A failing process may restart instantly with all the correct state information, and the router operation will not be affected.

But even the best recovery process does not provide healing; the software or hardware component remains defective and may cause repeated failures if it experiences the same condition again. The root cause of the failure needs to be tracked and eliminated, either through a software fix or a hardware replacement.

Traditionally, this bug investigation begins with a technical assistance center (TAC) ticket opened by a customer and requires intensive interaction between the customer and vendor engineers. Once identified, the problem is usually resolved through a work-around, software upgrade or hardware replacement, all of which must be performed manually.

Since the early days of Junos OS, Juniper Networks routers have been designed to include the built-in instrumentation needed to diagnose and remedy problems quickly. Reflecting Juniper’s origins as a carrier-class routing company, every Junos OS system in existence comes with an extensive array of software and hardware gear dedicated to device monitoring and analysis. Juniper has been a pioneer in the industry with innovations such as persistent logging, automatic core file creation and development tools (such as GDB) embedded in Junos OS, all facilitating fast defect tracing and decision making. In the traditional support model, customers and Juniper Networks TAC (JTAC) engineers jointly use those tools to zero in on a possible issue and resolve it via configuration change or software fix. In many cases, this is enough to resolve a case in real time, as soon as the defect traces are made available to Juniper Networks Customer Support.

However, Juniper would never have become a market leader without a passion for innovation. We see routing systems with embedded intelligence and self-healing capabilities as the tools for ensuring survivability and improving the user experience. Going far beyond the automated hardware self-checking normally available from many vendors, Junos OS can not only collect data and analyze its own health, but can also report this state back to the customer and to the JTAC with the patent-pending Advanced Insight Service (AIS) technology. As a result, the router that experiences problems can get immediate vendor attention around the clock and without involving NOC personnel. A support case can be automatically created and resolved before operators are aware of the issue. If a code change is needed, it will go into the next maintenance or major Junos OS release and will be available through a seamless upgrade on the router. This cycle is the basis of self-healing Junos OS operation and paves the way to dramatic OPEX savings for existing networks.

The main difference between AIS and similar call-home systems is the degree of embedded intelligence. AIS-enabled Junos OS both monitors itself for apparent failures, such as a process crash or laser malfunction, and proactively watches for early signs of problems, such as degrading storage performance or an increasing number of unresolved packets in the forwarding path. Triggered and periodic health checks are continuously improved based on actual field cases encountered and resolved by the JTAC, thus integrating the collective human expertise into the running of Junos OS systems. Further, AIS is fully programmable with new health metrics and triggers that customers can add. Better yet, in its basic form, AIS comes with every Junos OS system for free.
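
The idea of health checks that can be extended with customer-defined metrics and triggers can be pictured with a small sketch. This is not the AIS interface, which the text does not describe; the class names, metric names and thresholds below are hypothetical and chosen only to mirror the examples in the paragraph above (degrading storage performance, unresolved packets in the forwarding path).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class HealthCheck:
    """One early-warning rule: a metric name plus a predicate on its value."""
    name: str
    metric: str                              # hypothetical metric identifier
    is_unhealthy: Callable[[float], bool]    # True when the sample looks bad


class HealthMonitor:
    """Holds built-in and user-added checks and evaluates them on a sample."""

    def __init__(self) -> None:
        self._checks: List[HealthCheck] = []

    def register(self, check: HealthCheck) -> None:
        self._checks.append(check)

    def evaluate(self, sample: Dict[str, float]) -> List[str]:
        """Return the names of checks that fire for this sample.

        Metrics missing from the sample are skipped rather than treated
        as failures.
        """
        return [
            c.name
            for c in self._checks
            if c.metric in sample and c.is_unhealthy(sample[c.metric])
        ]


monitor = HealthMonitor()
# Thresholds are illustrative assumptions, not vendor values.
monitor.register(HealthCheck("slow-storage", "storage_write_ms", lambda v: v > 50.0))
monitor.register(HealthCheck("unresolved-packets", "unresolved_pps", lambda v: v > 1000.0))
# A check added later by the operator, in the same form as the built-in ones.
monitor.register(HealthCheck("high-cpu", "re_cpu_percent", lambda v: v >= 95.0))

alerts = monitor.evaluate({"storage_write_ms": 72.0, "unresolved_pps": 12.0})
print(alerts)
```

Keeping each check as data (a metric name and a predicate) is what would let new rules derived from field cases be added without touching the evaluation loop.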

Troubleshooting
An often forgotten but very important aspect of functional separation is the capability to troubleshoot and analyze a production system. As the amount of code that constitutes a network operating system is often measured in hundreds of megabytes, software errors are bound to occur in even the most robust and well-regressed designs. Some errors may be discovered only after a huge number of protocol transactions have accumulated on a system with many years of continuous operation. Defects of this nature can rarely be predicted or uncovered even with extensive system testing.

After the error is triggered and the damage is contained by means of automatic software recovery, the next step is to collect the information necessary to find the problem and to fix it in the production code base. The speed and effectiveness of this process can be critical to the success of the entire network installation because most unresolved code defects are absolutely not acceptable in production networks.

This is where proper functional separation comes into play. When a software defect is seen, it is likely to become visible via an error message or a faulty process restart (if a process can no longer continue). Because uptime is paramount to the business of networking, routers are designed to restart the failing subsystem as quickly as possible, typically in a matter of milliseconds.

When this happens, the original error state is lost, and software engineers will not be able to poke around a live system for possible root causes of the glitch. Unless the defect is trivial and easily understood, code designers may take some time to recreate and understand the issue. Offsite reproduction can be challenging, because replicating the exact network conditions and sequence of events can be difficult, and sometimes impossible. In this case, the post-mortem memory image (core dump) of the failing process is indispensable because it contains the state of data structures and variables, which can be examined for integrity. It is not uncommon for Junos OS engineers to resolve a defect just by analyzing the process core dump.
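
The trade-off between a fast restart and the loss of the error state can be sketched as a toy supervisor that saves a post-mortem snapshot before restarting a failed worker. This is only an illustration of the principle, not how Junos OS produces core files; the function names and the dictionary used as process state are assumptions made for the example.

```python
import copy
import traceback
from typing import Callable, Dict, List


def supervise(
    worker: Callable[[Dict], None],
    make_state: Callable[[], Dict],
    max_restarts: int = 3,
) -> List[Dict]:
    """Run worker, restarting it on failure, and return the saved snapshots.

    Each restart begins from a fresh state, so the failing state would be
    lost unless it is copied into a snapshot first.
    """
    dumps: List[Dict] = []
    for attempt in range(max_restarts + 1):
        state = make_state()          # fresh state: the old one is discarded
        try:
            worker(state)
            return dumps              # worker finished without an error
        except Exception:
            dumps.append(
                {
                    "attempt": attempt,
                    "state": copy.deepcopy(state),     # data structures at failure
                    "trace": traceback.format_exc(),   # where the failure happened
                }
            )
    return dumps


calls = {"n": 0}


def flaky_worker(state: Dict) -> None:
    """Hypothetical worker that fails only on its first run."""
    calls["n"] += 1
    state["routes_processed"] = 41
    if calls["n"] == 1:
        state["last_input"] = "malformed update"
        raise RuntimeError("inconsistent data structure")


dumps = supervise(flaky_worker, dict)
print(len(dumps), dumps[0]["state"]["last_input"])
```

After the restart the worker runs normally, yet the snapshot still records what the state looked like at the moment of failure, which is the role the core dump plays in the paragraph above.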

The catch here is that in tightly coupled processes, the failure of one process may actually be triggered by an error in another process. For example, RSVP may accept a “poisoned” traffic engineering database from a link-state IGP process and subsequently fail. If the processes run in different memory spaces, RSVP will dump the core, and IGP will continue running with a faulty state. This situation not only hampers troubleshooting, but also potentially brings more damage to the system because the state remains inconsistent.

The issue of proper functional separation also has implications for software engineering managers. It is a common practice to organize development groups according to code structure, and interprocess software defects can become difficult to troubleshoot because of organizational boundaries. Improper code structure can easily translate into a TAC nightmare, where a defect is regularly seen in the customer network, but cannot be reliably reproduced in the lab or even assigned to the right software engineering group.

In Junos OS, the balance between the amount of restartable code and the core dump is tuned to improve troubleshooting and ensure quick problem resolution. Junos OS is intended to be a robust operating system and to deliver the maximum amount of information to engineering should an error occur. This design helps ensure that most software defects resulting in code restarts are resolved within a short time period, often as soon as the core file is delivered.

Quality and Reliability
System integrity is vital, and numerous engineering processes are devoted to ensuring it. The following section touches on the practice of designing quality software products.

System Integrity
If you were curious enough to read this paper up to this point, you should know that a great deal of work goes into the design of a modern operating system. Constant feature development and infrastructural changes mean that each new release has a significant amount of new code.

Now you might ask if the active development process can negatively affect system stability. With any legacy software design process, the answer would be a definite yes.

The phenomenon known to programmers as “feature bloating” is generally responsible for degrading code structure and clarity over time. As new code and bug fixes are introduced, the original design goals are lost, testing becomes too expensive, and the release process produces more and more “toxic builds” or otherwise unusable software with major problems.

This issue was recognized very early in the Junos OS development planning stage.

Back in 1996, automated system tests were not widely used, and most router vendors crafted their release methodology based on the number of changes they expected to make in the code. Typically, every new software release would come in mainstream and technology versions, with the former being a primary target for bug fixes, and the latter receiving new features. Defects were caught mainly in production networks after attempts to deploy new software, which resulted in a high number of bug fixes and occasional release deferrals.

To satisfy the needs of customers looking for a stable operational environment, “general deployment” status was used to mark safe-harbor software trains. It was typically awarded to mainstream code branches after they had run for a long enough time in early adopters’ networks.

As a general rule, customers had to choose between features and stability. Technology and early deployment releases were notoriously problematic and full of errors, and the network upgrade process was a trial-and-error operation in search of the code train with the “right” combination of features and bugs.

This approach allowed router vendors to avoid building extensive test organizations, but generally led to low overall product quality. General deployment software trains lingered for years with almost no new features, while technology builds could barely be deployed in production because of reliability problems. Multiple attempts to find the balance between the two made the situation even worse due to introduction of even more software trains with different stability and feature levels.

This practice was identified as improper in the fledgling Junos OS design process. Instead, a state-of-the-art test process and pioneering release methodology were born.

Each Junos OS build is gated by a full regression run that is fully automated and executes for several days on hundreds of test systems simulating thousands of test cases. These test cases check for feature functionality, scaling limits, previously known defects and resilience to negative input (such as faulty routing protocol neighbors). If a failure occurs in a critical test, the final product will not be shipped until the problem is fixed. This process allows Junos OS releases to occur on a predictable, periodic basis. In fact, many customers trust Junos OS to the point that they run the very first build of each version in production. Still, every Junos OS version is entitled to the so-called regression run (if requested by customers). A regressed release is a fully tested original build with all latest bug fixes applied.
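
The gating rule described above, where a failure in a critical test blocks shipment, reduces to a simple decision once the regression results are in. The sketch below is a minimal illustration under that reading; the record layout and the test names are hypothetical and do not come from Juniper's test system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one regression test case (hypothetical record layout)."""
    name: str
    critical: bool
    passed: bool


def release_blockers(results: Iterable[CaseResult]) -> List[str]:
    """Names of critical cases that failed; any entry blocks the release."""
    return [r.name for r in results if r.critical and not r.passed]


def can_ship(results: Iterable[CaseResult]) -> bool:
    """A build ships only when no critical case has failed."""
    return not release_blockers(results)


run = [
    CaseResult("bgp-scaling-limit", critical=True, passed=True),
    CaseResult("faulty-neighbor-input", critical=True, passed=False),
    CaseResult("cli-help-text", critical=False, passed=False),
]
print(release_blockers(run))
print(can_ship(run))
```

Here the non-critical failure is reported elsewhere but does not hold the build, while the single critical failure does.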

The Junos OS shipping process is based on several guiding principles:

• Every Junos OS release is gated by a systems test, and no releases with service-affecting issues are cleared for shipment.
• Regressed (maintenance) releases, by rule, deliver no new features. For example, no features were introduced between Junos OS 8.5R1 and 8.5R2.
• As a general rule, feature development happens only at the head of the Junos OS train. Experimental (engineering) branches may exist, but they are not intended for production.
• No feature backports are allowed (that is, features developed for rev 9.2 are not retrofitted into rev 8.5).
• No special or customer-specific builds are allowed. This restriction means Junos OS never receives modifications that are not applicable to the main code base or cannot pass the system test. Every change and feature request is carefully evaluated according to its value and impact worldwide; the collective expertise of all Juniper Networks customers benefits every Junos OS product.

This release process ensures the exceptional product quality customers have come to expect from Juniper over the years. Although initially met with reluctance by some customers accustomed to the randomly spaced, untested and special builds produced by other vendors, our release policy ensures that no production system receives unproven software. Customers have come to appreciate the stability in OS releases that Juniper’s approach provides.

With its controlled release paradigm, Juniper has set new standards for the entire networking industry. The same approach was used later by many other design organizations.

However, the Junos OS design and build structure remains largely unmatched. Unlike competitors’ build processes, our build process occurs simultaneously for all Juniper Networks platforms and uses the same software repository for all products. Each code module has exactly one implementation, in both shared (common) and private (platform-specific) cases. Platform-specific and shared features are merged during the build in a well-controlled and modular fashion, thus providing a continuous array of functionality, quality and experience across all Junos OS routing, switching and security products.

Release Process Summary
 Even the best intentions for any software development are inadequate unless they can prove themselves through meaningful and repeatable results. At Juniper, we firmly believe in a strong link between software quality and release discipline, which is why we have developed criteria for meeting—or failing—our own targets.

Here is a set of metrics for judging the quality of release discipline:

• Documented design process: The Juniper Networks software design process has met the stringent TL 9000 certification requirements.

 • Release schedule: Junos OS releases have been predictable and have generally occurred every three months. An inconsistent, unpredictable or repeatedly slipping release process generally indicates problems in a software organization.

 • Code branching: This is a trend where a single source tree branches out to support either multiple platforms or alternative builds on the same platform with unique software features and release schedules. Branching degrades system integrity and quality because the same functionality (for example, routing) is being independently maintained and developed in different software trains. Branching is often related to poor modularity and can also be linked to poor code quality. In an attempt to satisfy a product schedule and customer requirements, software engineers use branching to avoid features (and related defects) that are not critical to their main target or customer. As a result, the field ends up with several implementations of the same functionality on similar or even identical hardware platforms. Although Junos OS powers many platforms with vastly different capabilities, it is always built from one source tree with core and platform-specific sections. The interface between the two parts is highly modular and well documented, with no overlap in functionality. There is no branching in Junos OS code.

 • Code patching: To speed defect resolution, some vendors provide code patching or point bug-fix capability, so that selected defects can be patched on a running operating system. Although technically very easy to do, code patching significantly degrades production software with uncontrolled and unregressed code infusions. Production systems with code patches become unique in their software state, which makes them expensive to control and maintain. After some early experiments with code patching, Junos OS ceased this process in favor of a more comprehensive and coherent in-service software upgrade (ISSU) and nonstop routing implementation.

 • Customer-specific builds: The use of custom builds is typically the result of failures in a software design methodology and constitutes a form of code branching. If a feature or specific bug fix is of interest to a particular customer, it should be ported to the main development tree instead of accommodated through a separate build. Code branching almost inevitably has major implications for a product such as insufficient test coverage, feature inconsistency and delays. Junos OS is not delivered in customer-specific build forms.

• Features in minor (regressed) releases: Under Juniper’s release methodology, which has been adopted by many other companies, minor software releases are regressed builds that almost exclusively contain bug fixes. Sometimes the bug fix may also enable functionality that existed but was not made public in the original feature release. However, this should not be a common case. If a vendor consistently delivers new functionality along with bug fixes, this negatively affects the entire release process and methodology because subsequent regressed releases may have new caveats based on the new feature code they have received.

Final Product Quality and Stability

Good code quality in a network operating system means that the OS runs and delivers functionality without problems and caveats—that is, it provides revenue-generating functionality right out of the box with no supervision. Customers often measure software quality by the number of defects they experience in production per month or per year. In the most severe cases, they also record the downtime associated with software defects.

Generally, all software problems experienced by a router can be divided into three major categories:

• Regression defects are those introduced by the new code; a regression defect indicates that something is broken that was working before.
• Existing software defects are those previously present in the code that were either unnoticed or (up to a certain point) harmless until they significantly affected router operation.
• New feature fallouts are caveats in new code.

Juniper’s software release methodology was created to greatly reduce the number of software defects of all types, providing the foundation for the high quality of Junos OS. Regression defects are mostly caught very early in their lifetime at the forefront of the code development.

Existing software defects represent a more challenging case. They can be reported by JTAC personnel, the SE community or customers.

Some defects are, in fact, uncovered years after the original design. The fact that they were not found by the system test or by customers typically means that they are not severe or that they occur in rare circumstances, thus mitigating their possible impact. For instance, the wrong integer type (signed versus unsigned) may affect a 32-bit counter only when it crosses the 2G (2^31) boundary. Years of uptime may be needed to reveal this defect, and most customers will never see it.
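
The counter example can be made concrete with a few lines that read the same 32-bit value as signed and as unsigned. This is a generic illustration of the integer behavior, not code from Junos OS; the increment rate used to estimate the time to reach the boundary is an arbitrary assumption.

```python
import ctypes

TWO_G = 2**31  # the 2G boundary: 2,147,483,648


def read_as_signed(raw: int) -> int:
    """Interpret the low 32 bits as a signed integer (the defective reading)."""
    return ctypes.c_int32(raw).value


def read_as_unsigned(raw: int) -> int:
    """Interpret the low 32 bits as an unsigned integer (the intended reading)."""
    return ctypes.c_uint32(raw).value


def years_to_boundary(increments_per_second: float) -> float:
    """Uptime needed for a counter starting at zero to reach 2**31."""
    seconds = TWO_G / increments_per_second
    return seconds / (365 * 24 * 3600)


# One step below the boundary both readings agree.
print(read_as_signed(TWO_G - 1), read_as_unsigned(TWO_G - 1))
# At the boundary the signed reading turns negative.
print(read_as_signed(TWO_G), read_as_unsigned(TWO_G))
# Assumed rate of 10 increments per second: roughly 6.8 years of uptime.
print(round(years_to_boundary(10), 1))
```

Below the boundary the two interpretations are indistinguishable, which is why such a defect can pass testing and stay dormant for years.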

In any case, once a new defect class is found, it is scripted and added to the systest library of test cases. This guarantees that the same defect will not leak to the field again, as it will be filtered out early in the build process. This systest library, along with the Junos OS code itself, is among the “crown jewels” of Juniper intellectual property in the area of networking.
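
The practice of scripting each newly found defect class into a permanent test library can be sketched as follows. The class, the defect identifier and the dictionary describing a build are all hypothetical stand-ins; the real systest library is not described in the text.

```python
from typing import Callable, Dict, List

# A check receives a description of the build under test and returns True
# when the historical defect is absent. The dict shape is a placeholder.
Check = Callable[[Dict[str, str]], bool]


class RegressionLibrary:
    """Grows by one case per resolved defect class; cases are never removed."""

    def __init__(self) -> None:
        self._cases: Dict[str, Check] = {}

    def add_case(self, defect_id: str, check: Check) -> None:
        if defect_id in self._cases:
            raise ValueError(f"case for {defect_id} already exists")
        self._cases[defect_id] = check

    def failures(self, build: Dict[str, str]) -> List[str]:
        """Defect identifiers whose checks fail on this build."""
        return [d for d, check in self._cases.items() if not check(build)]


library = RegressionLibrary()
# Case derived from the signed-versus-unsigned counter defect class.
library.add_case("counter-signedness", lambda b: b.get("counter_type") == "uint32")

good_build = {"counter_type": "uint32"}
bad_build = {"counter_type": "int32"}
print(library.failures(good_build))
print(library.failures(bad_build))
```

Because every later build is run against the whole library, a reintroduced defect of a known class is caught before the build leaves engineering.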

As a result, although any significant feature may take several years of development, Juniper has an excellent track record for making sure things work right at the very first release, a record that is unmatched in the networking industry.

Conclusion

Designing a modern operating system is a difficult task that challenges developers with complex problems and choices. Any specific feature implementation is rarely perfect and often strikes a subtle balance among a broad range of reliability, performance and scaling metrics.

This balance is something that Junos OS developers work hard to deliver every day. The best way to appreciate Junos OS features and quality is to start using Junos OS in production, alongside any other product in a similar deployment scenario. At Juniper Networks, we go beyond what others consider the norm to ensure that our software leads the industry in performance, resilience and reliability.

What Makes Junos OS Different
