Designing the EdgeConnect SD-Branch WAN overlay requires consideration of three key elements: WAN Topology, WAN Monitoring, and WAN Policy. These elements work hand in hand to provide a highly secure overlay that delivers optimal performance.
Table of contents
- EdgeConnect SD-Branch WAN Overview
- WAN Topologies
- WAN Monitoring
- WAN Policies
The following section reviews the three types of overlay topologies available to organizations using EdgeConnect SD-Branch. Any of the topologies can be used in combination with another topology.
The Aruba EdgeConnect SD-Branch solution supports a hub-and-spoke topology in which SD-WAN overlay tunnels are established between headend gateways (hubs) and BGWs (spokes). The gateways at the headend sites provide routing and forwarding for hub-to-spoke and spoke-to-spoke traffic.
This is the default deployment model, since most organizations' applications are centralized in a single data center and branch sites commonly exchange little or no data with one another, with any inter-branch traffic carrying lower priority.
The figure below illustrates a hub-and-spoke topology with spoke-to-spoke traffic passing through the hub location.
Deployments require one headend site with one or more installed gateways that terminate VPN tunnels initiated from the BGWs installed at the branch sites. The number of gateways deployed in each headend site depends on overall deployment size and redundancy needs. Smaller deployments consist of one gateway installed at a headend site to service all the BGWs installed at branch sites.
Larger SD-Branch deployments can incorporate additional hub sites, providing redundancy in case of a primary hub failure. A typical large deployment includes a primary and secondary headend.
More complex topologies using additional hub sites also are supported. For example, a deployment might include a cloud-based data center hosting a specific application or service using virtual gateways.
Aruba supports mesh topologies between on-premises hubs (physical gateways) and/or cloud hubs (virtual gateways). This enables hub sites to communicate directly with one another and is generally used for communications between regional hubs or between multiple cloud providers. This includes traffic coming from a branch site. For example, a branch site could have traffic destined to "AWS Cloud DC" and have a preference for the "On-Premises DC". In this case the "On-Premises DC" would forward the traffic to the "AWS Cloud DC" using the hub mesh tunnels.
The branch mesh topology configuration enables branch gateways to establish secure overlay tunnels with other branch gateways in the same group or a different group. When a branch mesh topology is configured between two or more branch gateways, a branch mesh link is established to transport traffic securely between them. Branch mesh can be used for distributed enterprises or for organizations with branch sites that have significant inter-branch traffic that should not be hairpinned through a hub site.
When using branch mesh, a hub site must be designated in the SD-WAN fabric to enable overlay route orchestration (ORO) between sites so branch gateways can exchange routes. However, this does not stop branch sites from communicating directly with other branch sites: the hub is used as a backup path in case branch sites cannot communicate directly.
The system IP (system-ip) is a critical configuration element for each gateway operating as a VPNC or BGW. In the three topologies illustrated above, each gateway uses one VLAN interface as its system IP. By default, the Aruba gateway uses this interface to communicate with network services such as RADIUS, syslog, TACACS+, and SNMP. The VLAN interface selected for the system IP on each gateway must have an IPv4 address assigned for the gateway to be fully functional. A gateway cannot initialize fully unless the assigned VLAN interface is active and operational. Central does not allow a gateway to obtain addressing dynamically from Internet service providers using DHCP or PPPoE as the system IP.
Gateway pools can be used to allocate system IP addresses automatically to a dedicated VLAN interface, which is then designated as the system IP address. Each pool includes a unique name along with starting and ending IPv4 addresses. The range of addresses defined for each pool cannot overlap. Aruba recommends configuring one gateway pool for each group, since IP addresses are configured and applied to VLAN interfaces on a per-group basis. The gateway pool must include enough IPv4 addresses to support all the Aruba gateways assigned to the group. Although a group can support multiple gateway pools, specific IP addressing should not be applied dynamically.
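The pool mechanics described above can be sketched as follows. This is an illustrative model only (the function names and data shapes are not an Aruba Central API): each group has a non-overlapping IPv4 range, and gateways in that group draw sequential system IPs from it.

```python
import ipaddress

def allocate_system_ips(pools, gateways):
    """Allocate one system IP per gateway from its group's pool.

    `pools` maps group name -> (start, end) IPv4 range; ranges must not overlap.
    Hypothetical helper for illustration, not an Aruba Central API.
    """
    # Verify that no two pool ranges overlap, as the design requires.
    ranges = []
    for start, end in pools.values():
        lo, hi = int(ipaddress.IPv4Address(start)), int(ipaddress.IPv4Address(end))
        for other_lo, other_hi in ranges:
            if lo <= other_hi and other_lo <= hi:
                raise ValueError("pool ranges overlap")
        ranges.append((lo, hi))

    # Hand out addresses sequentially per group.
    assignments = {}
    cursors = {g: int(ipaddress.IPv4Address(p[0])) for g, p in pools.items()}
    for gw, group in gateways:
        nxt = cursors[group]
        if nxt > int(ipaddress.IPv4Address(pools[group][1])):
            raise ValueError(f"pool for group {group} exhausted")
        assignments[gw] = str(ipaddress.IPv4Address(nxt))
        cursors[group] = nxt + 1
    return assignments

pools = {"branch-east": ("10.255.0.1", "10.255.0.50")}
print(allocate_system_ips(pools, [("bgw-1", "branch-east"), ("bgw-2", "branch-east")]))
# {'bgw-1': '10.255.0.1', 'bgw-2': '10.255.0.2'}
```

The overlap check models the rule that pool ranges cannot overlap, and the exhaustion check models the requirement that each pool hold enough addresses for all gateways in its group.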
The Aruba EdgeConnect SD-Branch solution relies on control-plane communication between gateways and Central, which enables the SD-WAN Orchestrator to negotiate tunnels and establish routes. At least two paths of communication are recommended between the gateways and Aruba Central. Aruba EdgeConnect SD-Branch actively monitors uplink availability to ensure connectivity.
This section presents design considerations for active and passive monitoring and WAN policy.
The gateway actively sends UDP or ICMP probes to determine that connectivity to underlay and overlay paths is available.
The gateway also actively monitors the WAN to identify the best path for applications, using one of three operations:
Default Gateway Monitoring - Aruba gateways monitor the state of every WAN circuit by probing their default gateways. A default gateway must be configured on every WAN interface to be considered an uplink. Note that the default gateway does not need to respond to ICMP messages: as long as the WAN Health Check IP/FQDN responds to the probes, uplinks are considered valid.
VPNC Reachability - Gateways send probes to all SD-WAN overlay destinations (through all uplinks) to measure health and state, as well as latency, jitter, and loss. Probes are sent every 2 seconds in batches of five. If packet loss is detected, the gateway switches to aggressive mode and sends 25 probes every 2 seconds to calculate packet loss accurately. UDP probes are managed by the BGW’s data path and marked as DSCP 48 to receive priority over other traffic for a more timely response.
WAN Health Check - Gateways send probes by default to the Aruba Path Quality Monitor (PQM), maintained by the Aruba Cloud Operations team. The PQM service is a set of distributed nodes that respond to ICMP/UDP probes. When using the PQM service, admins should set the PQM to UDP mode to measure latency, jitter, and packet loss (ICMP mode does not measure jitter). Admins can specify other health check locations by entering custom IP/FQDN destinations. Failure to reach the health check responder over an uplink results in failing over underlay traffic to a backup uplink. Overlay traffic failover is determined by probes destined to the relevant VPNCs.
SaaS Express Optimization - Branch gateways resolve a specific application using the application's FQDN to query the DNS servers configured on WAN uplinks (or learned through DHCP from the ISP) to determine the best uplink to use for a SaaS application. The probes provide a good measure of how the overlay communications are working, as well as the quality of the last mile for each WAN circuit. Without this monitoring, the gateway would not be able to provide SaaS Express optimization. SaaS Express optimization is available only on branch gateways.
Note that business-critical SaaS applications may require a more dedicated method of enriching the user experience. Problems beyond the control of the enterprise network administrator, such as ISP-SaaS peering problems or DNS issues, may adversely affect critical business services.
Note: Active monitoring is always turned on for default gateway monitoring and VPNC reachability.
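The probe behavior above (5 probes every 2 seconds, switching to 25 when loss is detected) can be sketched as a small model. The metric calculations here are illustrative (a simple jitter estimate from consecutive RTT deltas), not the gateway's internal implementation:

```python
import statistics

def path_stats(rtts_ms, sent, received):
    """Summarize one probe batch into the health metrics used for path selection.
    Illustrative only; the real gateway computes these in the data path."""
    latency = statistics.mean(rtts_ms)
    # Jitter modeled as the mean absolute difference between consecutive RTTs.
    jitter = (statistics.mean(abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:]))
              if len(rtts_ms) > 1 else 0.0)
    loss = 100.0 * (sent - received) / sent
    return {"latency_ms": latency, "jitter_ms": jitter, "loss_pct": loss}

def next_batch_size(loss_pct):
    # Normal mode sends 5 probes per 2-second interval; aggressive mode sends 25
    # once any loss is detected, to measure packet loss more accurately.
    return 25 if loss_pct > 0 else 5

print(path_stats([10, 12, 11], sent=5, received=5))
print(next_batch_size(0), next_batch_size(4.0))  # 5 25
```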
The gateway passively monitors the bandwidth usage of the physical interfaces associated with each uplink. Usage is compared with the WAN speed configured on the interface to calculate utilization. For example, if a gigabit interface carries 600 Mbps of traffic, the circuit is at 60% utilization. The uplink utilization and DPS policies factor in the amount of traffic on every interface when making path decisions.
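The utilization arithmetic is simply measured traffic divided by the configured WAN speed:

```python
def uplink_utilization_pct(traffic_mbps, configured_speed_mbps):
    """Passive utilization: measured traffic vs. the WAN speed configured
    on the interface, expressed as a percentage."""
    return 100.0 * traffic_mbps / configured_speed_mbps

# The example from the text: 600 Mbps on a gigabit circuit.
print(uplink_utilization_pct(600, 1000))  # 60.0
```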
The gateway monitors the TCP sessions for round-trip time and packet loss on traffic coming and going from clients to SaaS providers. This information is used to calculate a quality of experience (QoE) score for each application. The Central dashboard shows the bandwidth usage, QoE, loss, and latency for each application.
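Central's actual QoE scoring formula is not published; the sketch below only illustrates the idea of converting passive round-trip time and loss measurements into a per-application score by penalizing each metric against an application budget (the budgets and 0-10 scale here are assumptions):

```python
def qoe_score(rtt_ms, loss_pct, rtt_budget_ms=150, loss_budget_pct=2.0):
    """Toy QoE score on a 0-10 scale from passive TCP measurements.
    Illustrative only: penalize RTT and loss relative to per-app budgets."""
    rtt_penalty = min(rtt_ms / rtt_budget_ms, 1.0) * 5.0
    loss_penalty = min(loss_pct / loss_budget_pct, 1.0) * 5.0
    return round(10.0 - rtt_penalty - loss_penalty, 1)

print(qoe_score(0, 0))      # 10.0 - perfect path
print(qoe_score(75, 0))     # 7.5  - some delay, no loss
print(qoe_score(150, 2.0))  # 0.0  - both budgets fully consumed
```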
Aruba's SD-Branch has several WAN policies that help shape the traffic as it traverses the WAN transports at each location. The policies are configured with the following features:
- Policy-Based Routing (PBR): PBR routes traffic across private or public WAN uplinks based on application and user role if the network destination is not found in the routing table.
- Quality of Service (QoS): Role and application-based 802.1p CoS and DSCP marking on LAN ingress enables an organization to schedule traffic using a four-class queueing model on the outbound WAN interface. Strict priority queues support real-time applications and deficit round robin (DRR) queues with bandwidth percentages support business-critical applications leaving the gateway.
- Dynamic Path Steering (DPS): When multiple WAN links exist, DPS helps choose the best available path for an application based on characteristics such as throughput, latency, jitter, packet loss, and uplink utilization. (Only available on branch gateways)
- Forward Error Correction (FEC): FEC enables the network to recover easily from packet loss that may be caused by a variety of network layer conditions, such as queue overflows or constrained bandwidth links. FEC is applied on a DPS policy and is needed most when there is loss on the WAN. (Only available on branch gateways)
- SaaS Express Optimization: Specific applications can be monitored and steered to the best path available. The gateway observes SaaS traffic as it traverses the gateway firewall to gather latency, loss, and jitter measurements.
In most deployments, gateways follow the route table when making routing decisions, referred to as destination-based routing. If traffic must be forwarded to a specific overlay tunnel or Internet uplink, PBR enables admins to override the route table for both underlay and overlay traffic. PBR allows admins to use multiple paths by setting the same priority in the next-hop list, which is recommended for fault tolerance. If more than one active path is available, the gateway selects the path using a combination of DPS and load-balancing. A typical use for PBR is to force all traffic to a specific VPN Concentrator or a cloud firewall service. The figure below shows the traffic path when a PBR policy is defined on the LAN ingress.
The most common uses for which PBR policies are implemented include:
- All employee Internet traffic must be routed to the hub-site location to receive additional policy checks.
- Traffic from a specific subset of clients must be forwarded to a specific WAN path.
- Integration with third-party SaaS or unified threat management providers, such as Check Point, Palo Alto Networks, or Zscaler, requires steering certain traffic through a cloud-based security provider.
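The PBR override described above can be sketched as a lookup: if a rule matches the flow, its equal-priority next hops become the candidates (with DPS/load-balancing picking among them); otherwise the route table decides. The rule and flow shapes here are illustrative, not the gateway's configuration model:

```python
def resolve_path(flow, pbr_rules, route_table):
    """Destination-based routing with a PBR override, as a sketch.
    `pbr_rules` entries: {"match": predicate, "next_hops": [(priority, hop), ...]}.
    All next hops sharing the best (lowest) priority are returned as candidates;
    DPS and load-balancing then choose one of them."""
    for rule in pbr_rules:
        if rule["match"](flow):
            best = min(p for p, _ in rule["next_hops"])
            return [h for p, h in rule["next_hops"] if p == best]
    # No PBR match: fall back to the route table (default route otherwise).
    return [route_table.get(flow["dst"], "default-uplink")]

# Hypothetical rule: force guest web traffic to a pair of VPNCs with equal
# priority, the fault-tolerant configuration recommended in the text.
rules = [{"match": lambda f: f["app"] == "guest-web",
          "next_hops": [(1, "vpnc-dc1"), (1, "vpnc-dc2")]}]
print(resolve_path({"app": "guest-web", "dst": "0.0.0.0/0"}, rules, {}))
# ['vpnc-dc1', 'vpnc-dc2']
```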
The branch gateways can redirect selected traffic through a cloud-based security platform such as Zscaler or Check Point. The integration between cloud security providers and the SD-WAN fabric is discovered automatically, and tunnels and routes are orchestrated based on business and topological requirements.
For information on cloud security providers, see the Aruba SD-Branch and Zscaler Internet Access Integration Guide.
Note: Cloud security integration applies only to branch gateways, since VPNCs generally send traffic from trusted sources.
Quality of service (QoS) refers to the ability of a network to provide higher levels of service by identifying, marking, and prioritizing traffic. Applying the proper QoS policy is important when the network is congested and limited bandwidth is available. Real-time traffic, such as Teams or other video conferencing, and business-critical applications have specific latency requirements. When a network is congested, applications can be affected in several ways, including bit rate, throughput, path availability, delay, jitter, and loss. Delay, jitter, and loss can be improved by applying the correct QoS policy on the egress interfaces of network devices so applications with higher priority are delivered before applications with lower priority.
Two main strategies can be considered when creating a QoS scheduling policy:
The first strategy identifies applications that are important to the business and gives them a higher level of service using the QoS scheduling techniques described in this section. Remaining applications stay in the best-effort queue to minimize upfront configuration time and lower the daily effort needed to troubleshoot a more complex QoS policy. If new applications become important in the future, add them to the list of business-critical applications. Updating the QoS level can be repeated as needed without requiring a comprehensive policy change for all applications on the network. This strategy is normally used by organizations that do not have a corporate-wide QoS policy or that are troubleshooting application performance problems across the WAN.
The second strategy creates a comprehensive QoS policy that identifies all traffic flows and applications using the Aruba deep packet inspection (DPI) engine. The engine can identify more than 3,000 applications using well-known signatures and protocols. The applications are placed in pre-defined categories in the DPI engine for convenience, but it may be necessary to create custom groupings if the categories do not align with specific organizational needs. This strategy is best suited for organizations that want to use an existing QoS policy with the SD-Branch solution.
The first step in enabling a QoS policy is identifying and marking applications as they pass through a network device. Aruba recommends marking applications with class of service (CoS) for queueing.
Differentiated service code point (DSCP) also can be used for marking. However, DSCP values are not always honored. Check with service providers to ensure that markings are honored.
Applications should be marked using Access Control List (ACL) matching rules. It is recommended to match specific applications using a combination of aliases and TCP/UDP ports or a list of service applications when creating a matching ACL. The combination of both aliases and ports enables administrators to identify applications and mark them more accurately. For less specific applications that still may need some level of prioritization, admins can use other ACL matching methods such as Application Categories, TCP/UDP Ports, or Subnets.
When marking applications, it is important to categorize similar applications so they can be marked the same. For example, Zoom, GoToMeeting, and Teams are all chat/video/voice collaboration tools, so it makes sense to place them in the same category for marking.
After the ACL is defined, it can be applied in two places: on the LAN side of the gateway or within a user role. When client traffic is tunneled to the gateway, it is important to apply the ACL to user roles, since tunneled packets are usually encapsulated in GRE and a QoS policy applied on the interface cannot accurately remark the inner packet.
All applications are marked at the ingress of the gateway. If applications are not identified, they are placed in the default queue, with a best-effort level of service. East-west traffic that remains in the location is identified and marked when it passes through the gateway between the VLANs.
After applications are marked, they are placed in a queue correlated with their marking to determine priority. Aruba gateways support four QoS queues: one strict priority queue and three Deficit Round Robin (DRR) queues. The strict priority queue is always serviced first; the other queues are not serviced until the priority queue is empty. DRR is a scheduling algorithm that allocates a percentage of the available bandwidth to each DRR queue for forwarding. Network administrators can define the bandwidth percentage allocated to each DRR queue.
Real-time applications and network management traffic such as OSPF hello packets, etc., should always be placed in a strict priority queue. Business-critical applications should be serviced by one or two of the DRR queues to provide a higher level of service during congested times.
The last queue should be used as the default queue where all unmarked and low priority marked traffic is placed. This queue provides a lower level of service.
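The strict-priority-plus-DRR model above can be sketched as follows. This is a simplified toy scheduler (fixed packet size, one scheduling pass) to show the order of service, not the gateway's actual per-interface implementation:

```python
from collections import deque

PKT = 1500  # fixed packet size for the sketch, in bytes

def schedule_round(priority_q, drr_queues, weights_pct):
    """Drain the strict priority queue first, then serve the DRR queues in
    proportion to their configured bandwidth percentages using deficit counters.
    Illustrative model only."""
    sent = list(priority_q)   # priority traffic always forwards first
    priority_q.clear()
    deficit = {name: 0 for name in drr_queues}
    while any(drr_queues.values()):
        for name, q in drr_queues.items():
            # Each pass, a queue accrues credit proportional to its weight.
            deficit[name] += PKT * weights_pct[name] / 100
            while q and deficit[name] >= PKT:
                sent.append(q.popleft())
                deficit[name] -= PKT
    return sent

prio = deque(["voip-1", "ospf-hello"])
drr = {"critical": deque(["erp-1", "erp-2"]), "default": deque(["web-1"])}
order = schedule_round(prio, drr, {"critical": 60, "default": 40})
print(order[:2])  # ['voip-1', 'ospf-hello'] - priority traffic leads
```

The 60/40 weights mean the "critical" queue earns credit faster, so its packets interleave ahead of the default queue over successive passes.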
The figure below provides an example of marking and queuing.
Using the active and passive monitoring details above, DPS intelligently selects the best uplink for traffic. DPS ensures that applications are sent over the path most appropriate for their service level agreements (SLAs). For example, if a gateway has two paths, Uplink 1 and Uplink 2, and a Cloud SaaS application matches the DPS policy active monitoring criteria, DPS determines which uplink has the best SLA at the moment by comparing latency, jitter, and packet loss statistics from the WAN health check or VPNC reachability probes. In this example, DPS would use only the WAN health check information, since the SaaS application is not hosted at the VPNC site and the relevant path is the Internet. DPS policy uses only the relevant path statistics for each application to determine the uplink for sending traffic.
The network administrator can define SLAs, priority uplinks, and FEC thresholds for a DPS policy. The admin can set the SLA for an application based on traffic categorization, aliases, or IP/Subnet matching criteria. Admins can then use one of the built-in SLAs or adjust the latency, jitter, packet loss, and uplink utilization parameters. The FEC loss threshold is available to delay steering the application based on how much packet loss FEC can handle. For example, under normal circumstances, VoIP would be steered with more than 1% loss, but if FEC is enabled to protect it, steering can be delayed until the loss is 5%.
When configuring an SLA for a DPS policy, it is important to set the SLA thresholds to a point just before applications could register a negative user experience.
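The steering decision described above, including the FEC loss threshold delaying loss-based steering, can be sketched as a simple check. The threshold values here mirror the VoIP example in the text; the dictionary shapes are illustrative, not configuration syntax:

```python
def should_steer(stats, sla, fec_enabled=False, fec_loss_threshold_pct=5.0):
    """Decide whether DPS should steer an application off its current uplink.
    Illustrative thresholds. With FEC protecting the flow, loss-based steering
    is delayed until loss exceeds the FEC loss threshold rather than the SLA's
    own loss limit."""
    if stats["latency_ms"] > sla["latency_ms"] or stats["jitter_ms"] > sla["jitter_ms"]:
        return True
    loss_limit = fec_loss_threshold_pct if fec_enabled else sla["loss_pct"]
    return stats["loss_pct"] > loss_limit

# VoIP example from the text: normally steer above 1% loss, but with FEC
# enabled, tolerate up to 5% loss before steering.
voip_sla = {"latency_ms": 150, "jitter_ms": 30, "loss_pct": 1.0}
stats = {"latency_ms": 80, "jitter_ms": 10, "loss_pct": 3.0}
print(should_steer(stats, voip_sla))                    # True: 3% loss breaks the SLA
print(should_steer(stats, voip_sla, fec_enabled=True))  # False: FEC absorbs up to 5% loss
```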
The following figure shows the traffic path when a DPS policy is matched on the WAN egress.
Note: The gateway’s routing table or PBR rules determine the next hop and the DPS policy selects an uplink.
After FEC is enabled in a DPS policy, all packets corresponding to the DPS policy are sent to the FEC engine for encoding. FEC enables admins to add a parity packet for every ‘N’ number of packets in each block, where ‘N’ equals 2, 4, or 8 packets. The number of parity packets to be sent per policy depends on the type of resiliency needed for the applications.
FEC parity packets are added to traffic only between a BGW and a VPNC. Even if FEC is enabled on a policy where traffic is destined to the Internet, the FEC parity encoding is not added. Packets sent to a VPNC or BGW are sent to the FEC engine as soon as they are decrypted from the IPsec tunnel.
In the FEC engine, the number of packets received for every ‘N’ packets is checked. If there is no loss, the FEC parity packet is discarded. If a packet is lost or contains an error, the FEC parity packet is used to reconstruct the packet. If more than one packet for a given block is lost or erroneous, the FEC engine cannot reconstruct them.
Note: There is a short wait time of 2 ms between packets in an FEC block to minimize the delay.
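The single-loss recovery property described above matches a classic XOR parity scheme; Aruba does not publish its exact encoding, so the sketch below is only an assumed illustration of how one parity packet per block of N can rebuild exactly one lost packet:

```python
def xor_parity(block):
    """Compute a parity packet as the byte-wise XOR of all packets in a block.
    Assumes equal-length payloads for simplicity."""
    parity = bytearray(len(block[0]))
    for pkt in block:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Rebuild a single missing packet (marked None) from survivors + parity.
    Mirrors the constraint in the text: more than one loss per block is
    unrecoverable."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        return None  # nothing lost, or too many lost to reconstruct
    survivors = [p for p in received if p is not None]
    return xor_parity(survivors + [parity])

block = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # N = 4
parity = xor_parity(block)
damaged = [b"AAAA", None, b"CCCC", b"DDDD"]
print(recover(damaged, parity))  # b'BBBB'
```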
DPS policy should be configured to ensure that all traffic traversing the WAN hits the proper SLAs to ensure a smooth user experience. SLAs should be set so that similar applications can be grouped for easier configuration management. The following categories, FEC ratios, and SLAs are recommended.
| Application type | FEC Ratio | DPS SLA | Path Priority |
|---|---|---|---|
| SaaS applications | n/a | SLA per app recommendations (SaaS Express to pick best exit) | Primary – ALL_INET, Secondary – LTE* |
| Real-time apps (voice/video) | | 5-8% loss, 150 ms delay, 30 ms jitter | Primary – ALL_INET, Secondary – LTE |
| Other real-time business apps (telemetry) | 1:8 | 5-8% loss, 150 ms delay, 50 ms jitter, BW utilization: 75% | Primary – ALL_INET, Secondary – LTE |
| Business applications | Disabled | 150 ms delay, 50 ms jitter, BW utilization: 75% | Primary – ALL_INET, Secondary – LTE |
| Internet applications (local-breakout or exit through cloud security) | n/a | 2% packet-loss, BW utilization: 75% | Primary – ALL_INET |
As more businesses deploy SD-Branch to take advantage of inexpensive broadband Internet services, and as they adopt Software-as-a-Service (SaaS) applications such as Office 365, Box, Slack, and Zendesk, operations teams must ensure that users at a branch site can connect seamlessly and securely to cloud-hosted applications with the best possible performance. Cloud applications are hosted in multiple geographic locations, so different paths provide different levels of service.
SaaS Express is designed to optimize application performance by probing and steering SaaS applications to the path with the best connectivity. Probing is performed on every available path using the application's FQDN to query the DNS servers configured on the uplink interfaces (or learned via DHCP from the ISP) every 15 minutes. SaaS Express uses the FQDN as the match criteria for a proxy DNS request to the uplink DNS server to ensure that applications do not use non-local DNS servers that could forward traffic to another region and lower performance. The gateways then send HTTP probes to the application every 10 seconds to measure loss, latency, and jitter for that particular application. Traffic steering depends on where the SaaS application exists. Most SaaS applications are broken out locally. If the applications are hosted at a hub site, the gateway follows the route table.
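The per-uplink probe comparison can be sketched as below. The selection scoring is an assumption for illustration (SLA-failing uplinks excluded, then lowest latency wins); the gateway's real selection logic is internal:

```python
def best_saas_exit(per_uplink_probes, sla):
    """Pick the uplink whose HTTP probes to the SaaS FQDN best fit the SLA.
    Illustrative: uplinks failing the SLA are excluded, and ties go to the
    lowest measured latency."""
    candidates = {
        up: m for up, m in per_uplink_probes.items()
        if m["loss_pct"] <= sla["loss_pct"] and m["latency_ms"] <= sla["latency_ms"]
    }
    if not candidates:
        return None  # no compliant exit; fall back to routing policy
    return min(candidates, key=lambda up: candidates[up]["latency_ms"])

probes = {
    "inet-1": {"latency_ms": 40, "loss_pct": 0.0},
    "inet-2": {"latency_ms": 25, "loss_pct": 3.0},  # lower latency but lossy
}
print(best_saas_exit(probes, {"latency_ms": 150, "loss_pct": 1.0}))  # inet-1
```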
Unlike Dynamic Path Steering, SaaS Express uses the loss, latency, and jitter measured at the uplinks' exit points to determine the best path. SaaS Express measures the full round-trip performance of a SaaS application by probing the application's FQDN. SaaS Express policies take precedence over DPS policies due to the difference in monitoring. Admins should use SaaS Express for SaaS applications of special interest, and DPS when an SLA must be organized and set for groups of applications.
Note: In full tunnel situations or when the Internet traffic is sent through a cloud security service, exceptions must be introduced in the routing policies to prevent sending SaaS traffic in the overlay.
The gateway supports a set of applications and application categories in the DPI library. The built-in application profiles include a set of SaaS applications such as Adobe, Dropbox, Amazon, Google, Salesforce, Slack, Webex, etc. If a SaaS application is not available in the list, the network administrator can configure it.
Each SaaS application profile includes the following elements:
- Name: Name of the SaaS application
- FQDN: A list of domain URLs bound to the SaaS application
- Exit profile: Traffic steering policy to determine the optimal path exit
- SLA: Threshold profile for measuring path quality and performance
- Health check probe URI: URI to use for probes to determine the best available path
Note: For more information on the SaaS Express feature, see the SaaS Express Feature Guide.
The load-balancing algorithm determines how sessions are distributed among the active WAN uplinks. The algorithm applies only when route preferences are equal. For DPS and SaaS Express, load-balancing is activated only if the SLAs are the same.
Gateways support the following load-balancing algorithms:
- Round robin: Outbound traffic is distributed sequentially between each active WAN uplink. This is the simplest algorithm to configure and implement but it may result in uneven traffic distribution over time.
- Session count: Outbound traffic is distributed between active WAN uplinks based on the number of sessions managed by each link. This algorithm attempts to ensure that the session count on each active WAN uplink is within 5% of the other active WAN uplinks.
- Uplink utilization: Traffic is distributed between active WAN uplinks based on each uplink’s utilization percentage. Uplink utilization considers the link speed to calculate the utilization for a given link and allows the definition of a maximum bandwidth percentage threshold. When the bandwidth threshold percentage is exceeded, that WAN uplink is no longer considered available.
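The recommended uplink utilization algorithm can be sketched as follows. The data shapes and 90% default threshold are illustrative assumptions; the mechanics match the description above: links over the bandwidth threshold drop out of consideration, and utilization is computed against each link's speed:

```python
def pick_uplink(uplinks, max_util_pct=90):
    """Uplink-utilization load balancing, as a sketch: uplinks over the
    bandwidth threshold are removed from consideration, and the session goes
    to the least-utilized remaining link (utilization = traffic / link speed)."""
    usable = {}
    for name, u in uplinks.items():
        util = 100.0 * u["traffic_mbps"] / u["speed_mbps"]
        if util < max_util_pct:
            usable[name] = util
    return min(usable, key=usable.get) if usable else None

uplinks = {
    "mpls": {"traffic_mbps": 80,  "speed_mbps": 100},   # 80% utilized
    "inet": {"traffic_mbps": 300, "speed_mbps": 1000},  # 30% utilized
}
print(pick_uplink(uplinks))  # inet
```

Note how accounting for link speed sends the session to the gigabit Internet circuit even though it carries more absolute traffic than the MPLS circuit, which is why this algorithm is recommended over round robin.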
The following figure illustrates the different load-balancing algorithms.
Note: Aruba recommends the uplink utilization algorithm because it accounts for the service speed when making path selections.
When a path is selected for sessions destined for the corporate network through a VPN tunnel, the reverse traffic must take the same WAN path to prevent connectivity problems caused by asymmetric routing. Reverse-path pinning allows the headend gateway to choose the same WAN path for each active session to and from the branch. This is important because the branch gateway selects paths based on performance and SLAs. Reverse-path pinning is performed for corporate sessions originating from the branch destined to the data center, as well as sessions originating from the data center toward the branches.
When traffic originates from the data center, the headend gateway chooses the path based on equal-cost, multi-path algorithms. As soon as the traffic returns from the branch, the BGW steers the 5-tuple session to the correct path based on the DPS policy. When the headend gateway sees the return traffic, the session is updated to use the chosen path for the duration of the flow.
- The headend gateway selects an available WAN path using equal-cost multi-path routing.
- If the WAN path matches the preferred path defined in the BGW's DPS policy, no additional steering is required.
- If the WAN path does not match the preferred path defined in the DPS policy, the branch gateway sends the return session over the preferred path. After receiving traffic from the new path, the VPNC steers the outbound session to the preferred path to maintain symmetry.
The figure below shows traffic from a branch location over the private WAN overlay tunnel and the reverse path pinning feature on the VPNC that returns the traffic on the same path to enforce symmetry.