In-Depth

The new world of clustering

Clustering isn't just for databases anymore; builders of scalable, reliable and high-performance applications must implement a strong clustering strategy.

Java and J2EE are now being used seriously not only for large-scale Internet applications, but also for enterprise-wide internal applications. For mission-critical apps that must be scalable, reliable and perform well, a true clustering strategy and infrastructure is a firm requirement.

Trying to understand middleware clustering, load balancing and failover for high-availability e-commerce sites requires absolute attention to technical detail. Yet reliability and scalability often seem to concern project managers only at the end of a project, rather than early on when business functionality is being considered.

I remember a fixed-income government securities system built on a distributed CORBA architecture. It included asynchronous callbacks for buy/sell market making, custom connection pooling and a reengineered legacy C-based trading API. This app was not engineered for "in-flight transaction failover" to a back-up site; at the time, building in true clustering was beyond the scope of the original statement of work. Custom clustering, as would have been required here, is an expensive proposition, and it is probably missing from many distributed transactional systems.

In a middleware environment, "clustering" implies the close cooperation of two or more replicated servers to ensure fast and continuous service to users. Scalability is achieved by load balancing across replicas; high availability is delivered by session state failover across replicas. These features should be part of the middleware infrastructure and should not have to be programmed into the application. It is important to point out that middleware clustering is separate and distinct from DBMS clustering. The reason for clustering is to reduce system bottlenecks and eliminate points of failure in the distributed application.

The amount of fault-tolerance that should be engineered into an application is usually identified during a project's business requirements analysis. Among the items that should be considered are:

  • Imputing the cost of failed transactions;
  • Defining service-level agreements or expectations;
  • Estimating lost productivity for users who need to re-key data because their session failed; and
  • Damage to brand and site reputation due to unreliable service.
A business analyst's answer to these questions gives architects insight into how complex the clustering mechanism needs to be.

Clustering and a clustering strategy are the only practical way to approach full fault tolerance in a highly distributed environment. A true clustering strategy can take an application from being 99% reliable to 99.9% reliable. To achieve this service level, however, the clustering functionality has to be as reliable as the operating system. By definition this rules out custom solutions, which are inherently fragile because they are based on one-off, ad hoc application programming fixes to the problem. In fact, the intricacies of clustering—namely replication, request routing, load balancing and failover—should be "transparent" to programmers. In other words, programmers should not be concerned with these implementation details; they should be designed for and configured by system architects and/or administrators.
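The difference between 99% and 99.9% reliability is easy to make concrete with a back-of-the-envelope calculation (the figures below are arithmetic, not measurements from any particular system):

```java
// Downtime implied by an availability figure -- a rough illustration of
// why the jump from 99% to 99.9% matters.
public class Availability {
    static final double HOURS_PER_YEAR = 24 * 365;

    // Hours of downtime per year for a given availability (e.g. 0.99).
    static double downtimeHoursPerYear(double availability) {
        return (1.0 - availability) * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.printf("99%%   -> %.1f hours down per year%n",
                downtimeHoursPerYear(0.99));   // 87.6 hours
        System.out.printf("99.9%% -> %.1f hours down per year%n",
                downtimeHoursPerYear(0.999));  // 8.8 hours
    }
}
```

At 99%, an application is unavailable for more than three and a half days a year; at 99.9%, well under half a day.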

Today, advanced J2EE architectures have clustering built into their infrastructure. From a platform perspective, I have already gone on record (see "J2EE bridges the legacy gap," Application Development Trends, June 2001, p. 53) as stating that J2EE is a preferred environment for true e-commerce. For example, J2EE is seen as more "enterprise-grade," while .NET is seen as being stronger on the front end, with tighter integration with the desktop. We are talking about enterprise-wide, high-availability, fault-tolerant computing here, and that does not mean persisting session state to cookies.

Architecting a middleware cluster
As previously stated, the main objective of a clustered architecture is to design a system with no bottlenecks or points of failure. Eliminating bottlenecks or system "hot spots," which can cause system failure, requires the distributed application to be designed and built with a considerable amount of location independence with respect to its services.

Another important design objective for the architecture is the ability to easily manage the cluster. Ideally, administrators need to have a single, logical image of the system: You cannot manage an environment you cannot monitor holistically. I recently worked on an e-commerce project where the client cited system management as the main reason for outsourcing the app. The hosting company apparently had an impressive portfolio of monitoring software and hardware, as well as system administration experts in the middleware area. Most hosting organizations differentiate themselves based on operating systems and DBMS environments, but not middleware (yet). Even the best clustering architectures are not operator-less—servers go down, servers go into inconsistent states, server configurations change and servers need to be added to the cluster. These events need to be monitored and, in many cases, require human intervention.

In some proprietary environments, clustering can be achieved through high-speed interconnections for communication between primary and back-up servers. Additionally, platform-specific clustering requires every node to run the same operating system. In modern J2EE cross-platform implementations, some vendors use highly optimized protocols based on new commodity technologies, such as IP multicast. IP multicast is used to broadcast regular, heartbeat-type messages that advertise the availability of individual server instances in a cluster. All servers in the cluster listen to heartbeat messages as a way to determine when a server has failed. Simple IP multicast communication requires that all subscribers to the multicast address reside on the same subnet. So, for WAN clustering, the IP multicast would need to be tunneled.
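The bookkeeping behind heartbeat-based failure detection can be sketched independently of the wire protocol (in practice the datagrams travel via something like java.net.MulticastSocket). This is an illustrative sketch, not vendor code; the class and method names are invented. The listener simply records when each peer was last heard from, and declares a peer failed after a silent interval:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the membership tracking behind heartbeat failure detection.
// Each incoming heartbeat refreshes a timestamp; a server that stays
// silent longer than the timeout is presumed dead.
public class ClusterMembership {
    private final long timeoutMillis;                 // allowed silence before a peer is presumed dead
    private final Map<String, Long> lastHeard = new HashMap<>();

    public ClusterMembership(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat datagram arrives from a server instance.
    public void heartbeat(String serverId, long nowMillis) {
        lastHeard.put(serverId, nowMillis);
    }

    // A server is considered alive only if heard from within the timeout.
    public boolean isAlive(String serverId, long nowMillis) {
        Long t = lastHeard.get(serverId);
        return t != null && nowMillis - t <= timeoutMillis;
    }

    public static void main(String[] args) {
        ClusterMembership members = new ClusterMembership(3000);
        members.heartbeat("server-1", 0);
        System.out.println(members.isAlive("server-1", 1000));  // true: within timeout
        System.out.println(members.isAlive("server-1", 5000));  // false: heartbeats missed
    }
}
```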

Alternative constructs for persisting session state
Clustering implies state, but state is not a prerequisite. A cluster can consist of servers that contain pools of stateless objects, although maintaining state in a distributed environment is without a doubt a more complex engineering feat. Stateful middleware cluster solutions can range in sophistication from using files and DBMSs, to in-memory transfer of state to back-up processes. As is the case with all IT solutions, the deployed solution is based on requirements, budget and time to market.

File-based replication—A basic approach to state maintenance is to serialize session state to a file (that is, file system persistence). File storage might be an appropriate construct for persisting transient object state for lightweight apps that do not need to scale but require failover and recovery. For failover, all I/O needs to be written to shared disks or multi-tailed disks in order for the back-up servers to resume operation in the case of a failure.

This approach scales like stateless services, and differs only in that it requires explicit disk read/writes. Read/write operations might need to be load balanced between different directories that are partitioned on different disks so they are not in I/O conflict, but rather work in parallel to enhance performance and reduce bottlenecks.
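In Java, file system persistence amounts to serializing the session's attributes to a file on the shared disk so a back-up server can deserialize them after a failure. A minimal sketch, with an invented class name and file layout:

```java
import java.io.*;
import java.util.HashMap;

// Minimal sketch of file-system session persistence: the primary
// serializes session state to a (shared) disk; after a failure, a
// back-up server reads the same file and resumes with that state.
public class FileSessionStore {

    static void save(File file, HashMap<String, Serializable> session) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(session);   // explicit disk write on every update
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, Serializable> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, Serializable>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("session", ".ser");
        HashMap<String, Serializable> session = new HashMap<>();
        session.put("cartTotal", 42);
        save(file, session);                            // primary writes to shared disk
        System.out.println(load(file).get("cartTotal")); // back-up restores the state
    }
}
```

Note that every attribute placed in the session must itself be serializable, which is the usual price of any persistence-based replication scheme.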

DBMS-oriented replication—Another approach is to maintain state in a database or other persistent storage (via JDBC), such as a queue (MQSeries). Inevitably, in most applications, all session state that needs to be persisted ends up in a DBMS. It is questionable whether transient objects or HTTP session state really needs to be there.

One advantage to a DBMS is that you can avoid concurrency conflicts by relying on underlying database locking. To enhance performance, use a write-through cache that keeps a current copy of the state in-memory and avoids subsequent reads from the DBMS. Databases are good at caching state in-memory and doing the minimal disk I/O necessary to provide transactional integrity. Application servers are better at caching stateful objects, and so such caching may be best applied to transient objects that are used by a single client.
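The write-through pattern is simple to sketch. In this illustration the backing store is modeled as a callback so the example is self-contained; a real implementation would issue a JDBC insert or update at that point:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of a write-through cache: every update goes to the backing
// store first (standing in for the DBMS), then to memory. Reads are
// served from memory, avoiding repeated trips to the database.
public class WriteThroughCache<K, V> {
    private final Map<K, V> memory = new HashMap<>();
    private final BiConsumer<K, V> store;   // persists the value (e.g. via JDBC)

    public WriteThroughCache(BiConsumer<K, V> store) {
        this.store = store;
    }

    public void put(K key, V value) {
        store.accept(key, value);   // write to the database first...
        memory.put(key, value);     // ...then keep the current copy in memory
    }

    public V get(K key) {
        return memory.get(key);     // no DBMS read on the hot path
    }

    public static void main(String[] args) {
        Map<String, Integer> database = new HashMap<>();  // stands in for the DBMS
        WriteThroughCache<String, Integer> cache = new WriteThroughCache<>(database::put);
        cache.put("session-1", 100);
        // Both the cache and the "database" now hold the current value.
        System.out.println(cache.get("session-1") + " / " + database.get("session-1"));
    }
}
```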

A significant drawback to database-oriented replication and failover for transient session state is that it does not perform as well as redundant in-memory storage via network communication (that is, networked server replicas).

Memory-based replication—Memory-based replication is included in enterprise-grade J2EE application servers such as BEA WebLogic and IBM WebSphere. Essentially, this approach relies on keeping a copy of the state of an object or a session in-memory on another machine. From an application perspective, there are two ways to implement in-memory replication: HTTP sessions, and RMI and/or EJB replication.

HTTP session clustering provides support for clients that access servlets and Java Server Pages (JSPs) by replicating the HTTP session state. This approach relies on using a process pair of servers to hold session state. This CPU-bound approach is more susceptible to failures but, in general, transient session state does not need to be protected from multiple-site, simultaneous failures; if it does, the replication strategy was flawed in the design phase. Secondary servers should be up and running at a remote site, so that, in the case of a primary site failure, all "in-flight traffic" is routed to the secondary site. Indeed, at the point when the secondary site essentially becomes the primary site, a third site should be up and running.
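At the standards level, the Servlet specification's hook for this is small: a Web application must declare itself distributable in its deployment descriptor before a container may replicate its sessions. A minimal web.xml fragment (the replication mechanism itself is then configured through vendor-specific descriptors):

```xml
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
  "http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
    <!-- Marks the app as safe to deploy across a cluster; the container
         may then replicate HttpSession state between server instances. -->
    <distributable/>
</web-app>
```

Declaring an application distributable also obliges the developer to keep session attributes serializable, which is worth enforcing in code review from day one.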

The major design decision with this approach is determining when and how the state of an object has changed. After that, the replication system needs to handle transporting the update delta from the primary server to the secondary server (replica). One disadvantage to letting the middleware be responsible for replication is that unnecessarily large updates may be performed more often than required. Alternatively, session replication can be built in by programmers, albeit, this represents a non-infrastructure, custom solution.
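The dirty-tracking idea behind delta replication can be sketched in a few lines. The class and method names here are illustrative, not any vendor's API: the session remembers which attributes changed since the last replication, and ships only those to the replica:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of update-delta tracking for session replication: rather than
// shipping the whole session after every request, record which
// attributes changed and replicate only those.
public class TrackedSession {
    private final Map<String, Object> attributes = new HashMap<>();
    private final Set<String> dirty = new HashSet<>();

    public void setAttribute(String name, Object value) {
        attributes.put(name, value);
        dirty.add(name);                          // remember what changed
    }

    // Collect only the changed attributes for the replica, then reset.
    public Map<String, Object> drainDelta() {
        Map<String, Object> delta = new HashMap<>();
        for (String name : dirty) {
            delta.put(name, attributes.get(name));
        }
        dirty.clear();
        return delta;
    }

    public static void main(String[] args) {
        TrackedSession session = new TrackedSession();
        session.setAttribute("user", "alice");
        session.setAttribute("cartTotal", 42);
        System.out.println(session.drainDelta().size());  // 2 attributes to replicate
        session.setAttribute("cartTotal", 43);
        System.out.println(session.drainDelta().size());  // only 1 this time
    }
}
```

The hard part the sketch glosses over is detecting mutation of an attribute that is changed in place rather than re-set, which is exactly why middleware-managed replication tends to err on the side of larger, more frequent updates.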

A replication strategy has to ensure that all servers in the cluster will not bottleneck on accessing a primary server and that the load is balanced as well as replicated for failover. Replication requires considerable network communication between servers, even though some of the communication is multiplexed over sockets—a finite commodity. Adding servers to the cluster increases this replication traffic, trading some aggregate performance for scalability and reliability.

RMI object and EJB failover and load balancing are handled by replica-aware stubs that can locate object instances throughout the cluster. These smart stubs make the location of the service transparent to the calling client. Hard-coding references to a service into a client is a static architecture; services need to exist in a cluster not only for load balancing and failover, but to allow for changes in the architecture. From an implementation perspective, this requires the use of a directory service (JNDI, for example) in conjunction with a replica-aware stub. A smart stub has a replica handler that determines the specific algorithms to be used for load balancing and failover. In the case of an invocation failure, the replica handler locates the secondary server in the replication group that can handle the request, and redirects the request to it.
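The control flow inside such a stub can be sketched as follows. This is an illustrative model, not the RMI or EJB plumbing itself: a round-robin cursor spreads calls across replicas, and a failed invocation is transparently redirected to the next replica in the group:

```java
import java.util.List;

// Sketch of the failover logic inside a replica-aware stub:
// round-robin load balancing across replicas, with redirection
// to the next replica when an invocation fails.
public class ReplicaAwareStub {
    interface Replica {
        String invoke(String request) throws Exception;
    }

    private final List<Replica> replicas;
    private int next = 0;                       // round-robin cursor

    ReplicaAwareStub(List<Replica> replicas) {
        this.replicas = replicas;
    }

    String invoke(String request) throws Exception {
        Exception last = null;
        // Try each replica at most once, starting at the round-robin position.
        for (int i = 0; i < replicas.size(); i++) {
            Replica replica = replicas.get(next);
            next = (next + 1) % replicas.size();
            try {
                return replica.invoke(request); // success: return the result
            } catch (Exception e) {
                last = e;                       // failure: redirect to the next replica
            }
        }
        throw last;                             // every replica in the group failed
    }

    public static void main(String[] args) throws Exception {
        Replica down = req -> { throw new Exception("server down"); };
        Replica up = req -> "handled:" + req;
        ReplicaAwareStub stub = new ReplicaAwareStub(List.of(down, up));
        System.out.println(stub.invoke("buy")); // fails over to the live replica
    }
}
```

In a real container the replica list would come from a JNDI lookup rather than a constructor argument, and the handler's balancing policy would be configurable by an administrator rather than hard-coded.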

In Figure 1, the client using a smart stub locates Object X on Server 1 at the Primary Site and then binds with it. Server 2 at the Secondary Site contains only the state of the object, which is replicated from Server 1 at the Primary Site. Server 2 does not create an instance of the object unless there is a failover. This mechanism minimizes in-memory storage requirements for the Secondary Site and also sends fewer data packets over the wire for faster replication response time. The pattern could be implemented using either an RMI object or stateful session EJB.

Figure 1: Object cluster design pattern
The object cluster design pattern shown above could be implemented using either a Remote Method Invocation object or Stateful session Enterprise JavaBeans.

Stateful session EJB replication is similar to HTTP session state replication. And like HTTP session state, stateful session EJBs, as described in the EJB 2.0 specification, are non-transactional. Replication occurs after each method invocation, as opposed to after a transaction commits. For some apps, this might not be considered durable enough and entity EJBs should be considered as an alternative.

Some final recommendations
After implementing a cluster and prior to production, perform a load test to simulate expected usage. In this way, a determination can be made about whether to change the mix of application servers, memory, CPUs or server hardware in the cluster. As a rule of thumb, if a process is more than 80% utilized, it will soon become a system bottleneck. If this happens, you should consider adding hardware or rebalancing the number of application server instances in the cluster.

One optimization that is often overlooked is object collocation. If possible, it is more efficient to use a replica collocated with the calling client than a replica on another server. This concept might be counterintuitive to systems administrators who assume that load balancing every method invocation is always benign. Using the local replica avoids unnecessary network communication with other servers in the cluster.

Finally, some J2EE application servers scale best when deployed with a certain optimum number of server instances per CPU. However, since these are Java servers, the server-to-CPU ratio also depends on the command line heap memory settings, as well as thread policy. Again, simulation, as with all capacity planning, is probably the most effective way to determine the optimal number and distribution of server instances.