Failover

Gateway clustering with AOS 10.

Cluster Failover

Cluster failover is a new feature in AOS 10 that permits APs servicing mixed or tunneled profiles to fail over between datacenters in the event that all the cluster nodes in the primary datacenter fail or become unreachable. Cluster failover is enabled by selecting a secondary Gateway cluster when defining a new mixed or tunnel profile. Unlike failover within a cluster, which is non-impacting to clients and applications, failover between clusters is not hitless.

When a secondary cluster is selected in a profile, APs servicing the profile tunnel client traffic to the primary cluster during normal operation. IPsec and GRE tunnels are established from the APs to cluster nodes in both the primary and secondary clusters. Failover to the secondary cluster is initiated once all the tunnels to the cluster nodes in the primary cluster go down and at least one cluster node in the secondary cluster is reachable. A primary and secondary cluster selection within a WLAN profile is depicted below.

Configuring for primary and secondary cluster.
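
The failover trigger can be summarized as a simple rule over tunnel state: every tunnel to the primary cluster is down and at least one secondary node remains reachable. The Python sketch below models that rule only; the TunnelState structure and the node names are illustrative assumptions, not an AOS 10 interface.

```python
from dataclasses import dataclass


@dataclass
class TunnelState:
    node: str      # Gateway cluster node the AP tunnels to
    is_up: bool    # combined IPsec/GRE tunnel status toward that node


def should_fail_over(primary: list[TunnelState], secondary: list[TunnelState]) -> bool:
    """Fail over only when every tunnel to the primary cluster is down and
    at least one secondary cluster node is still reachable."""
    all_primary_down = all(not t.is_up for t in primary)
    any_secondary_up = any(t.is_up for t in secondary)
    return all_primary_down and any_secondary_up


# Both primary nodes unreachable, one secondary node reachable -> fail over.
primary_tunnels = [TunnelState("gw1-dc1", False), TunnelState("gw2-dc1", False)]
secondary_tunnels = [TunnelState("gw1-dc2", True), TunnelState("gw2-dc2", False)]
print(should_fail_over(primary_tunnels, secondary_tunnels))  # True
```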

Detection of a primary cluster failure typically occurs within 60 seconds. When a primary cluster failure is detected, the profiles are disabled for a further 60 seconds to bounce the tunneled clients and permit broadcast domain changes when moving between datacenters. Once the profiles are re-enabled, the tunneled clients obtain new IP addressing and resume communications across the network through the secondary cluster. AP and client sessions are distributed between the secondary cluster nodes in the same way as on the primary cluster: each AP is assigned a DDG and S-DDG session based on each node's capacity and load, while each client is assigned a UDG and S-UDG session based on bucket map assignment.
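
To illustrate how a bucket map can spread clients across the nodes of whichever cluster is active, the sketch below uses an assumed 256-bucket map and a MAC-based hash. The actual AOS 10 bucket map and hash are internal to the Gateways, so treat this purely as a conceptual model.

```python
import hashlib

BUCKETS = 256  # assumed bucket count for illustration


def build_bucket_map(nodes: list[str]) -> list[str]:
    """Spread buckets evenly across the active cluster nodes (round robin)."""
    return [nodes[b % len(nodes)] for b in range(BUCKETS)]


def udg_for_client(mac: str, bucket_map: list[str]) -> str:
    """Hash the client MAC to a bucket, then return that bucket's node (UDG)."""
    digest = hashlib.sha256(mac.lower().encode()).digest()
    return bucket_map[digest[0] % BUCKETS]


# After failover, the same logic runs against the secondary cluster's nodes.
secondary_nodes = ["gw1-dc2", "gw2-dc2"]
bucket_map = build_bucket_map(secondary_nodes)
print(udg_for_client("aa:bb:cc:dd:ee:01", bucket_map))
```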

Failover between clusters can be enabled with or without preemption. When preemption is enabled, APs automatically fail back to the primary cluster when one or more nodes in the primary cluster become available. To prevent flapping, preemption includes a default 5-minute hold timer: the primary cluster must be up and operational for 5 minutes (non-configurable) before fail-back to the primary cluster can occur. As with failover from the primary to the secondary cluster, the profiles are disabled for 60 seconds to accommodate broadcast domain changes.
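
The hold-timer behavior amounts to checking that the primary cluster has been continuously reachable for the full 5 minutes before fail-back is allowed. The sketch below models that check; the function and timestamp handling are assumptions for illustration, not an AOS 10 API.

```python
from typing import Optional

HOLD_TIMER_S = 5 * 60   # non-configurable 5-minute hold timer


def can_fail_back(primary_up_since: Optional[float], now: float) -> bool:
    """Allow fail-back only after the primary cluster has been reachable
    for the full hold timer (both values are epoch-style timestamps)."""
    if primary_up_since is None:       # primary cluster is still down
        return False
    return (now - primary_up_since) >= HOLD_TIMER_S


# Primary recovered 4 minutes ago -> still inside the hold timer.
print(can_fail_back(primary_up_since=0.0, now=240.0))   # False
# Primary has been up for 6 minutes -> fail-back (and the 60 s bounce) may begin.
print(can_fail_back(primary_up_since=0.0, now=360.0))   # True
```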

When deploying cluster failover, careful planning is required to ensure that the Gateways in the secondary cluster have adequate client and device capacity to accommodate a failover. The capacity of the secondary cluster should be equal to or greater than the capacity of the primary cluster.
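
A rough planning check is simply that the secondary cluster's aggregate client and device capacity meets or exceeds the primary's. The short sketch below captures that comparison; the capacity figures in the example are made up.

```python
def secondary_has_headroom(primary_clients: int, primary_devices: int,
                           secondary_clients: int, secondary_devices: int) -> bool:
    """Secondary cluster capacity should equal or exceed the primary's."""
    return (secondary_clients >= primary_clients
            and secondary_devices >= primary_devices)


# Hypothetical aggregate capacities for the clusters in each datacenter.
print(secondary_has_headroom(primary_clients=8000, primary_devices=500,
                             secondary_clients=8000, secondary_devices=500))  # True
```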

In addition to capacity planning, VLAN assignments must also be considered. While the IP networks can be unique within each datacenter, any static or dynamically assigned VLANs must be present in both datacenters and configured in both clusters. This will ensure that tunneled clients are assigned the same static or dynamically assigned VLAN during a failover. If VLAN pools are implemented, the hashing algorithm will ensure that the tunneled clients are assigned the same VLAN in each cluster.
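
The sketch below models why identical VLAN pools in both clusters yield consistent assignments: a deterministic hash over the same sorted pool returns the same VLAN regardless of which datacenter terminates the tunnel. The MAC-based hash is an assumption for illustration; AOS 10's actual pool hashing is internal to the Gateways.

```python
import hashlib


def vlan_from_pool(mac: str, vlan_pool: list[int]) -> int:
    """Pick a VLAN deterministically from the pool based on the client MAC."""
    pool = sorted(vlan_pool)                      # identical order in both clusters
    digest = hashlib.sha256(mac.lower().encode()).digest()
    return pool[digest[0] % len(pool)]


pool = [101, 102, 103, 104]                       # same pool configured in both clusters
print(vlan_from_pool("aa:bb:cc:dd:ee:01", pool))  # same VLAN in DC1 and DC2
```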

Cluster failover can be implemented and leveraged in different ways. Your profiles can all be configured to prefer a cluster in the primary datacenter and only fail over to a cluster residing in the secondary datacenter during a primary datacenter outage. In this model, all traffic workload is anchored to the primary datacenter during normal operation. A primary-secondary datacenter failover model is depicted below.

Datacenter workload failover

Alternatively, your WLAN profiles in different configuration groups can distribute the primary and secondary cluster assignments between the datacenters. For example, half the APs in a campus can be configured to prefer the primary datacenter and fail over to the secondary datacenter, while the other half can be configured to prefer the secondary datacenter and fail over to the primary datacenter. With this model, the traffic workload is evenly distributed between both datacenters during normal operation. This approach is sometimes referred to as salt-and-peppering, as depicted below.

Datacenter workload distribution
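
A salt-and-pepper plan can be expressed as alternating primary and secondary cluster preferences across AP configuration groups. The sketch below shows one such assignment; the group and cluster names are placeholders, not values from any real deployment.

```python
def assign_clusters(groups: list[str]) -> dict[str, tuple[str, str]]:
    """Return {group: (primary_cluster, secondary_cluster)}."""
    plan: dict[str, tuple[str, str]] = {}
    for i, group in enumerate(sorted(groups)):
        # Even-indexed groups prefer DC1, odd-indexed groups prefer DC2.
        if i % 2 == 0:
            plan[group] = ("cluster-dc1", "cluster-dc2")
        else:
            plan[group] = ("cluster-dc2", "cluster-dc1")
    return plan


for group, (primary, secondary) in assign_clusters(
        ["bldg-1", "bldg-2", "bldg-3", "bldg-4"]).items():
    print(f"{group}: primary={primary}, secondary={secondary}")
```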

