High Availability

Planning for High Availability

May 8, 2003, by Bob Zimmerman
High availability, once an exotic topic reserved for specialized systems, has become a general requirement simply through businesses' constant presence on the Web. The various aspects of a dependable solution demand careful planning, writes Bob Zimmerman, analyst at the Giga Group.

Like many great innovations throughout history, the Web caught most by surprise. Its developers let the genie out of the bottle almost accidentally, and its escape has brought untold opportunities to the world of information processing. However, like most genies, the Web also brought great hazards. In exchange for its potential, the Web has triggered a new wave of turmoil, forcing the basic infrastructure that drives the modern enterprise to evolve toward a new set of assumptions about high availability (HA). The transformation of the Web into an integral part of the enterprise infrastructure has changed the characteristics of enterprise IT:

All applications are now 24x7 - With the near-universal worldwide access provided by the Web, almost any Web-based application, especially a public-facing one, is by definition 24x7, even those with low duty cycles. This has challenged the fundamental assumption that there is a maintenance window for business applications, and increased the demand for high-availability solutions, formerly the domain of a select few large applications.

All applications are now mission-critical - When the world can access your application, its failures are exposed to a much wider community.

All previous assumptions about capacity planning are now obsolete - Another consequence of ubiquitous access is the unpredictability of loads, which challenges established techniques for application capacity planning.

Security risks are magnified - With the entire world at the front door, better locks become mandatory.

Dwight Eisenhower once said, "In preparing for battle, I have always found that plans are useless, but planning is indispensable." Whether or not an application rides on HA infrastructure, architects can improve application availability by applying HA design strategies. HA application design is, of course, a complex topic in itself; think of the core concepts below as a starter kit for designing HA applications, even on non-HA infrastructure. Decide when to use these strategies by (1) gathering concrete data on the business impact of application downtime, (2) analyzing the causes and likelihood of planned and unplanned downtime, (3) assessing which HA design strategies can affect which causes and (4) weighing the extra cost of HA design strategies against the benefits of reduced downtime.

The increasing complexity of systems management and escalating demands on enterprise availability have intensified the demand for high-availability support solutions. However, while HA services are essential for 24x7 mission-critical applications, there are significant cost issues to be evaluated. When extending service levels above 99.9 percent planned availability, the incremental cost increases exponentially, while the amount of downtime saved declines. Due to the high support costs and stringent configuration requirements, "five nines" application-level availability can yield a negative return on investment.
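The arithmetic behind the "nines" can be made concrete. The following Python sketch (illustrative, not from the article) converts an availability percentage into the hours of downtime per year it implies, showing how each extra nine saves a tenth as much downtime as the one before:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

def annual_downtime_hours(availability_pct):
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

# Each extra "nine" cuts downtime tenfold, while cost rises far faster.
for level in (99.0, 99.9, 99.99, 99.999):
    print(f"{level}% -> {annual_downtime_hours(level):.2f} h/yr")
```

At 99 percent, that is roughly 88 hours per year of downtime; at "five nines" it is about five minutes, which is why the last increments of availability are the most expensive to buy.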

HA System Configurations

HA infrastructure ensures that an application has constant availability of network, processors, disks, memory, etc., such that a failure of one of these components is transparent to the application. Risk analysis identifies important functions and assets critical to HA, then establishes the probability of a breakdown in them. Once the risk is established, objectives and strategies can be set to eliminate avoidable risks and minimize the impact of unavoidable ones. For most hardware, middleware and OS components, this means duplication and physical separation of IT systems, reducing single points of failure, and clustering and coupling applications across multiple systems.

Clustered server architectures provide the benefits of both high availability and performance scalability. Cluster packaging comes in many forms: (1) multiple stand-alone servers (with very high-speed cluster interconnects), (2) multiple servers in a box (this would include new high-density servers as a category), (3) multiple partitions within an SMP or (4) any combination of the above. A single-system view is an important component of a cluster high-availability environment. As nodes are added to a cluster, the requirement to manage distributed cluster resources as if managing a single server becomes a critical differentiator in the selection of a high-availability system.

Access to data and intelligent failover, including dynamic reconnect, are critical to application-level high availability. Key requirements for storage solutions include: improved IT service, including security, local performance options and remote data replication; 24x7 data availability; cluster server support for both individual servers and generic cluster access; the ability to connect any server to any storage system through storage networks; and rapid recovery and/or restart of applications.

There are other critical components in an HA system. For example:

Several server adapter card techniques can help a network manager increase network availability: load balancing; hot plug-ability; dual homing of server cards; and NOS optimization.

Uninterruptible power supply (UPS) systems planning should include investment in a global, shared solution with reliable switch gear and full bypass capability, rather than deploying many low-capacity (ostensibly inexpensive) UPS systems for individual racks or devices in a fragmented approach.

Vendors have introduced a variety of new high-availability features for enterprises that are considering building large-scale virtual private networks (VPNs). Competition in the maturing VPN gateway market will yield a stream of incremental high-availability features from all the major vendors. Internet VPNs for mission-critical applications, large branch networks and large remote access user populations can now be designed to take advantage of these new resiliency features, reducing the risks and costs associated with network congestion and downtime.

DBMS vendors have also been actively enhancing their products to fit in an HA world. For example:

Clustered databases: Increased numbers of nodes improve both scalability and availability.

Advanced manageability and scalability: DBMS vendors continue to emphasize enhanced self-tuning capabilities for databases and reduced operational complexity. Scalability may be the deciding factor for many enterprises that support large HA and mission-critical databases.

Integrated monitoring solutions: DBMS tools vendors offer highly integrated monitoring solutions that support heterogeneous databases across platforms, providing a holistic view of the entire environment and monitoring applications end to end.

Standby database and data replication technology: Enterprises continue to address the growing need for business continuity by deploying redundant databases at remote locations. DBMS vendors will offer improved scalability and high-performance standby database and data replication technology to support business continuity.

Key Strategies in High-Availability Application Design

HA infrastructure is not the only way to increase application availability. Application servers typically have HA features (J2EE servers, in particular), but a technically demanding application requires supplemental design strategies. HA application design is concerned with maintaining application operation in the midst of application failures, infrastructure failures and real-time maintenance. The strategies below can be used individually or in combination:

Redundancy: Each element of an HA application must have a backup that can take over if the primary fails. Load-balancing features share the load during normal operation and shift the load when a node fails. Alternatively, one or more hot standbys might take over if a primary fails, and the design must account for transactions that were in flight when the failure occurred.
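The redundancy-with-failover idea can be sketched in a few lines. The Python class below (a minimal illustration; node names and the round-robin policy are assumptions, not from the article) routes requests across redundant nodes and skips any node that has been marked failed:

```python
class ReplicatedService:
    """Illustrative sketch: route requests across redundant nodes,
    failing over past any node marked as down."""
    def __init__(self, nodes):
        self.nodes = list(nodes)   # e.g. ["app1", "app2"]
        self.down = set()
        self._next = 0

    def mark_failed(self, node):
        self.down.add(node)

    def pick_node(self):
        # Round-robin over nodes, skipping ones known to be down.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node not in self.down:
                return node
        raise RuntimeError("no healthy nodes remain")

svc = ReplicatedService(["app1", "app2"])
svc.mark_failed("app1")   # simulate a node failure
```

A real load balancer adds health checks and in-flight transaction handling on top of this basic routing decision.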

Recoverable state design: An application's handling of in-flight transactions is largely determined by its approach to state management. "Stateless execution" is often put forth as an HA design principle, but while it is true that an individual element is "more HA" if stateless, the application as a whole typically cannot be viewed as stateless -- users make a series of requests and later requests build on earlier ones. Thus, it is necessary to store state between exchanges, replicate the state (so that it is not subject to a single point of failure) and then re-establish state after recovery.
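A minimal sketch of state replication, assuming an in-memory store with two replicas (the design and names are hypothetical, chosen only to illustrate the principle): every write goes to all replicas, so a later request can re-establish its state even after one replica is lost.

```python
class ReplicatedSessionStore:
    """Sketch: keep conversational state in two replicas so it survives
    the loss of any single copy (no single point of failure)."""
    def __init__(self):
        self.replicas = [{}, {}]  # primary and backup copies

    def save(self, session_id, state):
        for replica in self.replicas:
            replica[session_id] = dict(state)  # write to every replica

    def recover(self, session_id):
        # Re-establish state from the first replica that still holds it.
        for replica in self.replicas:
            if session_id in replica:
                return replica[session_id]
        return None

store = ReplicatedSessionStore()
store.save("s1", {"cart": ["book"]})
store.replicas[0].clear()          # simulate loss of the primary replica
```

Production systems would use a replicated cache or database for this, but the contract is the same: state written between exchanges must be recoverable from more than one place.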

Failure detection: To initiate recovery of state, and for any failure scenario not handled transparently to the application, there must be "detect and retry" logic within the application. The server side of the application may be able to do this transparently (preferred), but the client side may have to do it. The application may have to "fail gracefully" by saving transaction information, notifying a user or administrator and performing cleanup upon application restart.
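The "detect and retry" plus "fail gracefully" pattern can be sketched as follows (a simplified illustration; the bounded retry count and the give-up hook are assumptions, not prescribed by the article):

```python
def call_with_retry(operation, attempts=3, on_give_up=None):
    """Sketch of detect-and-retry logic: retry a failing call a bounded
    number of times, then fail gracefully via a cleanup/notify hook."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:   # the class of failures we retry
            last_error = exc
    if on_give_up:
        on_give_up(last_error)           # e.g. save state, alert an admin
    raise last_error

# Simulated flaky back end: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"
```

When the server side handles retries transparently, the client never sees this logic; otherwise the client must carry it.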

Watchers and heartbeats: An HA application must be watched in real time to ensure it is still running. Two key design strategies are process watchers, which monitor execution of application processes on the watcher's machine, and heartbeats, where a network-based element responds to periodic "Are you still there?" messages.
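The heartbeat side of this can be illustrated with a small monitor that tracks when each node last answered and presumes a node dead once its last answer is older than a timeout (the 5-second timeout and node names are illustrative assumptions):

```python
import time

class HeartbeatMonitor:
    """Sketch of a heartbeat watcher: nodes answer periodic
    'are you still there?' pings; a node is presumed dead once its
    last answer is older than the timeout."""
    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.time()

    def is_alive(self, node, now=None):
        now = now if now is not None else time.time()
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout

mon = HeartbeatMonitor(timeout_seconds=5.0)
mon.record_heartbeat("db1", now=100.0)
```

Cluster software typically pairs this with a process watcher on each machine, so both hung processes and unreachable machines are detected.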

Operations management integration: Monitoring and management tools may adequately manage watcher and heartbeat functions, but operations integration can go much deeper. Applications may incorporate management APIs to raise alerts (e.g., SNMP traps), enable full monitoring and management (e.g., SNMP MIBs) and write errors to logs that are monitored by a management tool.
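The simplest of these integrations, writing machine-parsable errors to a monitored log, might look like the sketch below (the key=value record format and the in-memory handler standing in for a log file are assumptions for illustration; a real deployment might raise SNMP traps instead):

```python
import logging

records = []
handler = logging.Handler()
handler.emit = records.append   # stand-in for a log file a tool watches

log = logging.getLogger("ha-app-example")
log.addHandler(handler)
log.setLevel(logging.ERROR)
log.propagate = False           # keep the example self-contained

# A structured record a management tool can match and alert on.
log.error("component=payment status=down action=failover")
```

The point is the contract, not the transport: errors must land somewhere the operations tooling is already watching, in a format it can match against.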

Automatic restart: When a watcher or management tool detects a failure, restart must perform necessary application cleanup, reinitiate application processes, reconnect them as appropriate and reregister them with application naming services.
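The restart sequence can be sketched as an event-recording watcher (component names and the registry dictionary standing in for a naming service are hypothetical; a real watcher would spawn operating-system processes):

```python
class ProcessWatcher:
    """Sketch of an automatic-restart loop: when a watched process dies,
    clean up, restart it, and reregister it with the naming service."""
    def __init__(self):
        self.registry = {}    # stand-in for an application naming service
        self.events = []      # records the actions taken, for illustration

    def check_and_restart(self, name, is_running):
        if is_running:
            return False                         # nothing to do
        self.events.append(("cleanup", name))    # release leftover resources
        self.events.append(("start", name))      # reinitiate the process
        self.registry[name] = "running"          # reregister with naming
        return True

watcher = ProcessWatcher()
watcher.check_and_restart("order-service", is_running=False)
```

The ordering matters: cleanup before restart, and reregistration last, so other elements never resolve the name to a half-started process.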

Version migration: The highest levels of availability require eliminating planned downtime, which may involve upgrading application versions while the application is running. The two basic approaches for this are (1) parallel operation of multiple versions and (2) a "flash cut" to a hot standby (in-flight transactions complete on the old version; all new transactions go to the new version). Supplemental approaches include auto-update clients and version awareness within application interfaces (or within the infrastructure, as in .NET's version management). The biggest issue arises when a new version changes data structures -- without a downtime window in which to perform the conversion, the application must be written to handle data conversion on the fly.
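The "flash cut" approach reduces to a routing decision, sketched below (version labels and transaction IDs are illustrative; a real system would also drain the old version's in-flight set before retiring it):

```python
class FlashCutRouter:
    """Sketch of the 'flash cut' upgrade: transactions already in flight
    finish on the old version; all new transactions go to the new one."""
    def __init__(self, old="v1", new="v2"):
        self.old, self.new = old, new
        self.cut_over = False
        self.in_flight = set()

    def begin(self, txn_id):
        version = self.new if self.cut_over else self.old
        self.in_flight.add((txn_id, version))  # pinned for its lifetime
        return version

    def flash_cut(self):
        self.cut_over = True   # existing in-flight work stays on the old version

router = FlashCutRouter()
before = router.begin("t1")
router.flash_cut()
after = router.begin("t2")
```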

Connection management: The application must be designed to handle connection failures (e.g., network, DBMS) by recognizing connection timeouts and re-establishing connections to alternate providers, most likely found via an application naming service.
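A connection-failover helper might look like this sketch (the provider list stands in for what a naming service would return; names and the `connect` callback are assumptions for illustration):

```python
def connect_with_failover(providers, connect):
    """Sketch: on a connection timeout, move to the next alternate
    provider and re-establish. `connect` maps a provider name to a
    live connection, raising TimeoutError on failure."""
    last_error = None
    for provider in providers:
        try:
            return provider, connect(provider)
        except TimeoutError as exc:
            last_error = exc        # recognized timeout: try the next one
    raise last_error

# Simulated back ends: the primary times out, the standby answers.
def fake_connect(name):
    if name == "db-primary":
        raise TimeoutError("no route to primary")
    return f"session:{name}"
```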

Multi-threaded resource requests: For resource requests that have the possibility of a timeout, an HA application may spawn separate threads for making such requests. This allows the application to more effectively manage response to its users when it experiences a timeout due to a resource failure.
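A minimal sketch of this using a worker thread and a deadline (the degraded response string and the timings are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
import time

def fetch_with_deadline(task, deadline_seconds):
    """Sketch: run a resource request on its own thread so the caller
    can give up after a deadline and answer the user instead of hanging."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(task)
        try:
            return future.result(timeout=deadline_seconds)
        except FutureTimeout:
            return "resource unavailable, please retry"  # degraded answer

def slow_resource():
    time.sleep(0.5)    # simulates a hung back end
    return "data"
```

The application thread stays responsive to the user; the stalled request is abandoned (or cleaned up) separately.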

Transaction-aware design: Transaction management features (e.g., of application servers and DBMSes) will ensure transaction integrity, but only if the failure occurs within the context of transaction control boundaries. Some transactions can be submitted multiple times with no loss of integrity (e.g., an address update) while some cannot (e.g., an account withdrawal). Upon a request failure, the application should validate whether the transaction was properly applied and, if not, restart it (or perhaps notify an end user).
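One common way to make a non-idempotent operation safe to resubmit is to track applied transaction IDs, sketched below (the account model and ID scheme are hypothetical illustrations of the principle, not a prescription from the article):

```python
class Account:
    """Sketch: after a request failure, validate whether the transaction
    was already applied before resubmitting; withdrawals must not be
    applied twice."""
    def __init__(self, balance):
        self.balance = balance
        self.applied = set()    # transaction IDs already applied

    def withdraw(self, txn_id, amount):
        if txn_id in self.applied:     # already applied: do NOT repeat
            return self.balance
        self.balance -= amount
        self.applied.add(txn_id)
        return self.balance

acct = Account(100)
acct.withdraw("t42", 30)       # first attempt succeeds...
acct.withdraw("t42", 30)       # ...client retries after a lost reply
```

An address update needs no such guard, because applying it twice yields the same result.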

Indirection: The principle of indirection underlies many design principles, i.e., an application element should never know the physical address of another -- instead, elements should find each other by name. This allows elements to be moved and reconnected in a failure scenario without changing the application.
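Indirection boils down to a name-to-address lookup, sketched here (service names and addresses are invented for illustration):

```python
class NamingService:
    """Sketch of indirection: elements look each other up by name, so a
    failed element can be replaced at a new address without touching
    application code."""
    def __init__(self):
        self._addresses = {}

    def register(self, name, address):
        self._addresses[name] = address   # later registrations supersede

    def resolve(self, name):
        return self._addresses[name]

ns = NamingService()
ns.register("inventory", "10.0.0.5:7001")
ns.register("inventory", "10.0.0.9:7001")   # moved after a failover
```

Callers that resolve "inventory" by name are unaffected by the move; callers that had hard-coded the first address would have broken.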

HA design adds significant cost to an application delivery effort (and HA infrastructure will add additional costs discussed below). In addition, testing an HA application is more expensive because it is often difficult to re-create various failure scenarios. There are also impacts on performance and operations management, so the appropriate level of HA design for any given application is highly dependent on business considerations and the identifiable business risks and impacts of downtime for the application.

HA Cost Considerations

The duplicated hardware, software license fees, facilities, etc., are easily priced, but the ongoing support costs can grow dramatically. Maintaining the IT infrastructure is an ongoing process, made even more critical in HA mode, and companies must routinely re-evaluate their requirements:

1. Identify IT needs, goals and measurement metrics.
2. Review the company's current architecture and support.
3. Evaluate gaps between the actual performance and the goals.
4. Construct a plan for reducing the gaps.
5. Assess the costs required to attain goals and adjust goals if needed.
6. Determine whether to conduct an ROI or cost/benefit analysis.
7. If deemed necessary, perform an ROI study and define metrics before implementing new projects or purchasing support.

Since support costs are dependent on the services provided and the infrastructure configuration, absolute price points are irrelevant. HA support should be priced at a multiple of the cost of a "standard" offering. The "Cost of Availability" graphic provides an estimated cost multiple for each level of availability and applies the cost multiple to calculate a "prevention cost per hour of downtime." The cost multiple provided is an average derived from cost estimates supplied by major services providers; the actual multiple is subject to change and varies by provider. The cost multiple would be multiplied by the cost of a standard HA of 99 percent to determine the total cost of the higher-availability solution. For example, if you historically spent $100,000 per year to achieve 99 percent availability (88 hours total downtime), plan to spend (290.7 x (0.011 x $100,000)) or almost $320,000 for "five nines" availability (downtime ...)
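The figures above can be checked directly: 1 percent downtime at 99 percent availability is about 88 hours per year, and the quoted formula comes to just under $320,000. A quick Python verification (the 290.7 cost multiple and the 0.011 factor are taken from the article's formula as given):

```python
HOURS_PER_YEAR = 8760

standard_cost = 100_000                       # annual spend for 99% availability
downtime_99 = HOURS_PER_YEAR * 0.01           # ~87.6 h, the "88 hours" cited
five_nines_cost = 290.7 * (0.011 * standard_cost)

print(f"99% downtime: {downtime_99:.1f} h/yr")
print(f"five-nines budget: ${five_nines_cost:,.0f}")
```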

Summary

The high cost of downtime can be devastating to an enterprise. In the information-driven economy, downtime for any reason is unacceptable. In fact, availability and performance go hand in hand. Regardless of why -- application or database failure, system upgrade, operational error or just poor performance -- if a Web site or an application is slow in delivering requested information, it might as well be offline. The consequences -- lost data, lost customers, lost revenues -- can be devastating to an enterprise. Under these conditions, you have to maintain continuous uptime and predictable performance levels. And should an outage or disaster occur, quick recovery with minimal data loss is imperative. HA design and implementation can help avoid this.