Formation Process

Gateway clustering with AOS 10.

Cluster Formation

Cluster formation between Gateways is determined by the cluster configuration within each configuration group. When automatic cluster mode is enabled, Central orchestrates the cluster name and configuration for each cluster node:

  • Auto group – A cluster is orchestrated between active Gateways within the same configuration group.

  • Auto site – A cluster is orchestrated between active Gateways within the same configuration group and site.

When manual cluster mode is enabled, the admin defines the cluster name and cluster members. The admin configuration initiates the cluster formation between the active Gateways.

Handshake Process

The first step of cluster formation is a handshake in which messages are exchanged between all potential cluster members over the management VLAN, between the Gateways' system IP addresses. The handshake uses PAPI hello messages exchanged between nodes to verify reachability between all cluster members. Information relevant to clustering is carried in these hello messages, including platform type, MAC address, system IP address, and software version. After all members have exchanged hello messages, they establish IKEv2 IPsec tunnels with each other in a fully meshed configuration.
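The handshake bookkeeping above can be sketched as follows. This is a minimal illustration, not the actual PAPI wire format: the `hello` field names are assumptions, and `mesh_tunnel_pairs` only shows that a fully meshed cluster of n nodes requires n*(n-1)/2 tunnels.

```python
from itertools import combinations

# Illustrative representation of the information a hello message carries
# (field names are assumptions, not the real PAPI message format).
hello = {
    "platform": "7240XM",
    "mac": "00:0b:86:99:f2:f0",
    "system_ip": "10.2.120.101",
    "version": "10.6.0.0",
}

def mesh_tunnel_pairs(nodes):
    """After every node has exchanged hellos, each pair of nodes builds
    one IKEv2 IPsec tunnel, producing a full mesh of n*(n-1)/2 tunnels."""
    return list(combinations(sorted(nodes), 2))

# A four-node cluster requires six pairwise tunnels.
pairs = mesh_tunnel_pairs(["GW1", "GW2", "GW3", "GW4"])
```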

What follows is a depiction of cluster members engaging in the hello message exchange process as part of the handshake prior to cluster formation:

Handshake Process / Hello Messages

Cluster Leader Election

For each cluster, one Gateway is elected as the cluster leader. Depending on the persona of the Gateways, the cluster leader has multiple responsibilities, including:

  • Active and standby VLAN designated Gateway (VDG) assignment

  • Active and standby device designated Gateway (DDG) assignment

  • Active and standby user designated Gateway (UDG) assignment

  • Standby switch designated Gateway (S-SDG) assignment

The cluster election takes place after the initial handshake as a parallel thread to VLAN probing and the heartbeat process.

WLAN Gateways

The cluster leader is elected as the result of the hello message exchange which includes each platform’s information, priority, and MAC address. The leader election process considers the following (in order):

  1. Largest Platform

  2. Configured Priority

  3. Highest MAC Address

For homogeneous clusters, the Gateway with the highest configured priority or MAC address is elected as the cluster leader. For heterogeneous clusters, the largest Gateway with the highest configured priority or MAC address is elected as the cluster leader. The MAC address serves as the tiebreaker when equal-capacity nodes with the same priority are evaluated.
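The three-step ordering above can be sketched as a single sorted comparison. This is an illustrative model only: the `capacity` field is an assumed stand-in for platform size, and the node records are hypothetical.

```python
def elect_wlan_leader(nodes):
    """Elect the leader per the documented order: largest platform,
    then highest configured priority, then highest MAC address.
    `capacity` is an illustrative stand-in for platform size."""
    def mac_value(mac):
        return int(mac.replace(":", ""), 16)
    return max(nodes, key=lambda n: (n["capacity"], n["priority"], mac_value(n["mac"])))

# Homogeneous cluster with equal priorities: the MAC address breaks the tie.
cluster = [
    {"name": "DC-GW1", "capacity": 7240, "priority": 128, "mac": "00:0b:86:99:f2:10"},
    {"name": "DC-GW2", "capacity": 7240, "priority": 128, "mac": "00:0b:86:99:f2:f0"},
    {"name": "DC-GW3", "capacity": 7240, "priority": 128, "mac": "00:0b:86:99:f2:20"},
]
leader = elect_wlan_leader(cluster)  # DC-GW2 wins on highest MAC
```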

The following graphic depicts a cluster leader election for a four-node 7240XM homogeneous cluster. In this example, DC-GW2 has the highest MAC address and is elected as the cluster leader. All other nodes become members:

WLAN cluster leader election

Branch HA Gateways

When branch HA is configured on two branch Gateways, the leader can be either automatically elected or manually selected by the admin. When a preferred leader is manually selected, no automatic election occurs, and the selected node becomes the leader.

When no preferred leader is configured, the leader election process considers the following (in order):

  1. Number of Active WAN Uplinks (Uplink Tracking)

  2. Largest Platform

  3. Highest MAC Address

Most branch Gateway deployments implement a pair of Gateways of the same series and model, forming a homogeneous cluster. When uplink tracking is disabled, the branch Gateway with the highest MAC address is elected as the cluster leader. The MAC address serves as the tiebreaker between equal-capacity nodes.

When uplink tracking is enabled, the number of active WAN uplinks is evaluated and the Gateway with the highest number of active WAN uplinks is elected as the cluster leader. Inactive, virtual, and backup WAN uplinks are not considered.
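The branch HA selection logic above can be sketched in the same style. This is a simplified model under stated assumptions: the node records and field names are hypothetical, and `capacity` again stands in for platform size.

```python
def elect_branch_leader(nodes, uplink_tracking=True, preferred=None):
    """Branch HA leader selection sketch: a manually preferred leader wins
    outright with no election; otherwise active WAN uplinks (when tracking
    is enabled), platform size, and highest MAC decide, in that order."""
    if preferred is not None:
        return preferred
    def mac_value(mac):
        return int(mac.replace(":", ""), 16)
    def rank(n):
        uplinks = n["active_uplinks"] if uplink_tracking else 0
        return (uplinks, n["capacity"], mac_value(n["mac"]))
    return max(nodes, key=rank)["name"]

pair = [
    {"name": "BR-GW1", "capacity": 9004, "active_uplinks": 2, "mac": "00:0b:86:11:22:33"},
    {"name": "BR-GW2", "capacity": 9004, "active_uplinks": 1, "mac": "00:0b:86:aa:bb:cc"},
]
# With uplink tracking, BR-GW1 leads (more active uplinks);
# with tracking disabled, BR-GW2 leads on highest MAC.
```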

VLAN Probes

Gateways in a configuration group share the same VLAN configuration and port assignments. The management and user VLANs are common between the Gateways in a cluster and must therefore be extended between the Gateways by the respective core / aggregation layer switches. A missing or isolated VLAN on one or more Gateways can result in blackholed clients.

VLAN probes are used by Gateways in a cluster to detect isolated or missing VLANs on each cluster node. Each cluster node transmits unicast EtherType 0x88b5 frames out each VLAN destined to the other cluster nodes. For a cluster consisting of four nodes, each node may transmit a VLAN probe per VLAN to three peers. To prevent unnecessary or duplicate probes, each Gateway tracks probe requests and responses to each cluster peer for each VLAN. If a Gateway responds to a probe for a given VLAN from a peer, it marks the VLAN as successful and skips transmitting its own probe to that peer for that VLAN.
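The probe-suppression bookkeeping can be sketched as follows. This is an assumed model of the tracking logic, not Gateway code: `responded` records which probes a node has already answered, so the VLAN is marked successful and the reverse probe is skipped.

```python
def plan_probes(nodes, vlans, responded):
    """Return the (sender, peer, vlan) probes still required.
    `responded` holds (node, peer, vlan) tuples where `node` has already
    answered a probe from `peer` on `vlan`; `node` then treats the VLAN
    as successful and skips probing that peer for it."""
    return [
        (node, peer, vlan)
        for node in nodes
        for peer in nodes if peer != node
        for vlan in vlans
        if (node, peer, vlan) not in responded
    ]

nodes = ["GW1", "GW2", "GW3", "GW4"]
full = plan_probes(nodes, [100], set())                    # 4 nodes x 3 peers = 12 probes
pruned = plan_probes(nodes, [100], {("GW2", "GW1", 100)})  # GW2 skips probing GW1
```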

VLANs present on each node that receive a response are marked as successful, while VLANs that do not receive a response are marked as failed and are displayed as failed in Central. Prior to AOS 10.6, Gateways probe all configured VLANs, including VLAN 1. As there is no configuration option to exclude specific VLANs, VLAN 1 often shows as failed in Central.

In AOS 10.6 and above, VLAN probing is more intelligent: only VLANs with assigned clients are probed. The Gateway's management VLAN is always probed, as it is required for cluster establishment, but only user VLANs with active tunneled clients are probed. VLANs with no tunneled clients are no longer automatically probed, preventing unused VLANs from being displayed as failed in Central. Only user VLANs that have not been extended are displayed as failed.
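The 10.6 selection rule reduces to a small set expression. A minimal sketch, assuming a hypothetical per-VLAN tunneled-client count:

```python
def vlans_to_probe(configured_vlans, mgmt_vlan, clients_by_vlan):
    """AOS 10.6+ behavior sketch: always probe the management VLAN,
    plus only the user VLANs that currently carry tunneled clients.
    `clients_by_vlan` maps VLAN ID to its tunneled-client count."""
    return {mgmt_vlan} | {
        vlan for vlan in configured_vlans
        if clients_by_vlan.get(vlan, 0) > 0
    }

# VLANs 101 and 200 have no tunneled clients, so they are not probed
# and will not appear as failed in Central.
probed = vlans_to_probe({1, 100, 101, 200}, mgmt_vlan=1, clients_by_vlan={100: 5})
```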

VLANs that have failed probes are listed in the cluster details view in Central. This is demonstrated below, where VLANs 100 and 101 have not been extended to one Gateway node in a cluster and are both listed as failed for that node. Note that in this example the Gateways are running AOS 10.5, so VLAN 1 is also listed as failed for each node:

Cluster polling failed VLANs

Heartbeats

Cluster nodes exchange PAPI heartbeat messages to cluster peers at regular intervals in parallel to the leader election and VLAN probing messages. These heartbeat messages are bidirectional and serve as the primary detection mechanism for cluster node failures. A round trip delay (RTD) is computed for every request and response. Heartbeats are integral to the process the cluster leader uses to determine the role of each cluster node and detect node failures.

Failure detection and failover time are determined by the heartbeat threshold configured for the cluster. The recommended detection time for a port-channel is 2000 ms, while the default value of 900 ms is recommended for a single uplink. A node is declared failed when no response is received within the configured heartbeat threshold, which is configurable between 500 ms and 2000 ms.
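The threshold-based detection described above can be sketched as a simple timeout check. This is an illustrative model, not the cluster's actual heartbeat implementation; timestamps and peer names are hypothetical.

```python
def detect_failed_peers(last_response_ms, now_ms, threshold_ms=900):
    """Declare a peer failed when no heartbeat response has been seen
    within the configured threshold (default 900 ms; 2000 ms recommended
    for port-channel uplinks; valid range 500-2000 ms)."""
    if not 500 <= threshold_ms <= 2000:
        raise ValueError("heartbeat threshold must be between 500 and 2000 ms")
    return [
        peer for peer, last_ms in last_response_ms.items()
        if now_ms - last_ms > threshold_ms
    ]

# GW2 responded 600 ms ago (alive); GW3 responded 1100 ms ago (failed).
last_seen = {"GW2": 9_400, "GW3": 8_900}
failed = detect_failed_peers(last_seen, now_ms=10_000)
```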

Connectivity and Verification

The Gateway Cluster dashboard displays a list of Gateway clusters provisioned and managed by Central. This can be accessed in Central by selecting Devices > Gateways > Clusters then selecting a specific cluster name. This view can be accessed with a global context filter or by selecting a specific configuration group or site.

The Summary view for a cluster provides important cluster information, such as the leader, version, capacity, and the number of node failures that can be tolerated. The graphic below provides an example summary for a two-node 7220 cluster. Note that the summary view provides color-coded client capacity over time for each node, which is useful for determining client distribution during normal and peak times. In this example, each node's client capacity is below 40% for the past 3 hours:

Cluster summary and capacity

The Gateways view provides a list of cluster nodes, operational status, per-node capacity, model, and role information. The following graphic demonstrates the status view for the above production cluster. The view below shows that each cluster node is UP and that SJQAOS10-GW11 has been elected as the cluster leader. Note that the number of current active and standby client sessions for each node is also provided. Clients are distributed between the available nodes based on the published bucket map for the cluster:

Cluster gateway status

The Gateways view also provides additional heartbeat and VLAN probe information for each peer. You can view the peer details for each member of the cluster using the dropdown. This is demonstrated below, where the peer details for SJQAOS10-GW11 are shown. In this example, the peer Gateway has a member role and is connected. Note that all VLANs (including VLAN 1) have been correctly extended between the Gateways, therefore no VLANs have failed probes:

Cluster peer status


Last modified: February 28, 2024 (614bf13)