Aruba ESP Data Center Storage and Lossless Ethernet
The Aruba Edge Services Platform (ESP) Data Center supports Data Center Bridging (DCB) protocols that create lossless Ethernet fabrics to support storage area networks, big data analytics, and artificial intelligence (AI) applications.
Storage Over Ethernet Challenges
Traditional IEEE 802.3 Ethernet relies on higher-layer protocols, such as TCP, to provide reliable data delivery. Data transmitted over an Ethernet network can be lost between source and destination hosts, which incurs a performance penalty for applications sensitive to data loss.
Storage performance is particularly sensitive to packet loss. TCP can guarantee data delivery at the transport layer by sequencing data segments and performing retransmission when loss occurs, but the need to perform TCP retransmissions for storage significantly reduces the performance of applications depending on that storage.
Advances in storage technology, such as SSD flash memory and the Non-Volatile Memory express (NVMe) protocol, facilitate read/write storage that exceeds the performance of traditional storage networking protocols, such as FibreChannel. The performance bottleneck in a storage area network (SAN) has moved from the storage media to the network.
Remote Direct Memory Access (RDMA) was developed to provide high-performance storage communication between two networked hosts using the proprietary InfiniBand (IB) storage network. IB guarantees medium access and no packet loss, and requires a special host bus adapter (HBA) for communication. The IB HBA receives and writes data directly to host memory using dedicated hardware, bypassing both traditional protocol decapsulation and the host’s primary CPU. This reduces latency, improves performance, and frees CPU cycles for other application processes.
Ethernet solutions offer high-speed networking interfaces, making them attractive options for storage communication if the reliability issue can be solved. RDMA over Converged Ethernet (RoCE) is a protocol developed by the InfiniBand Trade Association (IBTA) to extend RDMA reliability and performance benefits over a low-cost Ethernet network. A converged network adapter (CNA) performs the task of writing received data directly to memory and enables Ethernet as the underlying communication protocol. A lossless data communication path to support RoCE is created by modifying both Ethernet host and switch behavior.
RoCE version 1 (RoCEv1) encapsulates IB Layer 3 addressing and RDMA data directly into an Ethernet frame. Ethernet replaces the IB Layer 1 and 2 functions, and a dedicated EtherType value identifies RoCE as the Ethernet payload.
RoCE version 2 (RoCEv2) replaces IB Layer 3 addressing with IP and encapsulates the IB Layer 4 header and RDMA data within UDP. This strategy makes RoCEv2 routable over IPv4 and IPv6 networks. RoCEv2 is the most common implementation of RoCE.
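To make the encapsulation order concrete, the minimal Python sketch below assembles a simplified RoCEv2 header stack (Ethernet / IPv4 / UDP / IB transport payload). The UDP destination port 4791 is the registered RoCEv2 port; the addresses, source port, and payload are placeholders, and checksums are omitted, so this is an illustration rather than a wire-accurate implementation.

```python
import struct

def rocev2_frame(src_mac, dst_mac, src_ip, dst_ip, ib_payload):
    """Sketch of the RoCEv2 stack: Ethernet / IPv4 / UDP (dst 4791) / IB transport + RDMA data."""
    eth = dst_mac + src_mac + struct.pack("!H", 0x0800)              # EtherType: IPv4
    udp = struct.pack("!HHHH", 49152, 4791, 8 + len(ib_payload), 0)  # dst 4791 = RoCEv2
    ipv4 = struct.pack("!BBHHHBBH4s4s",
                       0x45, 0, 20 + 8 + len(ib_payload),            # version/IHL, ToS, total length
                       0, 0, 64, 17, 0,                              # ident, flags/frag, TTL, proto 17 = UDP, checksum
                       src_ip, dst_ip)
    return eth + ipv4 + udp + ib_payload

frame = rocev2_frame(b"\x02" * 6, b"\x02" * 6,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
                     b"\x00" * 12)                                   # placeholder IB transport header
print(len(frame), "bytes")
```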
The lossless Ethernet optimizations implemented in CX switches improve data center performance for applications using both RoCE and non-RoCE protocols such as standard iSCSI. In addition to storage communication, RoCE enhances the performance of database operations, big data analytics, and generative AI.
Non-Volatile Memory express (NVMe) is an intra-device data transfer protocol that leverages multi-lane data paths and direct communication to the CPU provided by PCIe to move large amounts of data at a high rate with low latency. NVMe is designed specifically for solid state drives (SSDs) as a replacement for the decades-old Serial Advanced Technology Attachment (SATA) protocol. NVMe over Fabrics (NVMe-oF) extends NVMe to work between networked hosts. NVMe-oF works over multiple protocols, including RoCE.
The primary challenge of running RDMA over Ethernet is overcoming link congestion, the most common cause of dropped frames in a modern Ethernet network. Link congestion occurs when frames are received on a switch faster than they can be transmitted on an outgoing port. Link congestion has two common causes. First, the receive and transmit ports on a switch operate at different speeds, so the higher-speed port can receive data faster than the lower-speed port can transmit it. Second, a switch receives a large number of frames on multiple interfaces destined to the same outgoing interface. In both cases, the switch can queue surplus frames in a memory buffer until the outgoing port is able to transmit them. If buffer memory becomes full, additional incoming frames are dropped for as long as the buffer remains full, resulting in TCP retransmissions and poor application performance.
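The tail-drop behavior described above can be modeled in a few lines of Python. This is a deliberately simplified sketch (a single shared buffer measured in frames and a fixed per-interval drain rate), not a representation of actual CX buffer management:

```python
def simulate_tail_drop(arrivals_per_tick, drain_per_tick, buffer_frames):
    """Toy model: frames queue in a shared buffer; arrivals beyond the free space are dropped."""
    queued, dropped = 0, 0
    for arrivals in arrivals_per_tick:
        queued = max(0, queued - drain_per_tick)   # outgoing port transmits what it can
        accepted = min(arrivals, buffer_frames - queued)
        dropped += arrivals - accepted             # surplus frames are tail-dropped
        queued += accepted
    return dropped

# A 100G ingress port bursting toward a 25G egress port (rates scaled to frames per tick).
print(simulate_tail_drop([100] * 10, 25, 300))     # drops begin once the buffer fills
```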
Building Reliable Ethernet
A lossless Ethernet fabric can be created by connecting a contiguous set of switches and hosts that employ a set of strategies to prevent frame drops for a particular application.
Three primary Quality of Service (QoS) strategies manage competing demands for buffer memory and switch port bandwidth: dedicated switch buffers for an application, flow-control, and guaranteed media access for an application. Combining these three strategies enables a lossless Ethernet fabric for storage and other applications.
The following table displays the key DCB protocols supported by Aruba data center switches.
Data Center Bridging Component | Description |
---|---|
PFC: Priority Based Flow Control | Establishes queues that do not drop packets by preventing buffer exhaustion. |
ETS: Enhanced Transmission Selection | Defines bandwidth reservations for traffic classes so that lossless and lossy traffic can coexist on the same link. |
DCBx: Data Center Bridging Exchange Protocol | Exchanges PFC and ETS information between devices on a link using Link Layer Discovery Protocol (LLDP) to simplify configuration. |
In addition to the protocols above, Aruba CX switches support IP Explicit Congestion Notification (ECN). IP ECN is a Layer 3 flow-control method that allows any switch in the communication path to notify a traffic receiver of the presence of congestion. After receiving a congestion notification, the receiving host sends a direct, IP-based congestion notification to the traffic source to slow its data transmission rate.
Enhancements in RoCE have produced two different versions. RoCEv1 relies on the base DCB protocols in the table above and is not supported over a routed IP network. RoCEv2 enables IP routing of RoCE traffic, includes IP ECN support, and is the protocol version most often referenced by the term “RoCE.”
Priority Flow Control
Ethernet pause frames introduced link-level flow control (LLFC) to Ethernet networks in the IEEE 802.3x specification. When necessary, a traffic receiver can request a directly connected traffic source to pause transmission for a short period of time, allowing the receiver to process queued frames and avoid buffer exhaustion. The traffic source can resume transmitting frames after the requested pause period expires. The receiver also can inform the source that a pause is no longer needed, so the source can resume transmitting frames before the original pause period expires.
Priority Flow Control (PFC) works in conjunction with quality of service (QoS) queues to enhance Ethernet pause frame function. PFC can pause traffic on a per-application basis by associating applications with a priority value. When PFC pauses traffic associated with an individual priority value, traffic assigned to other priorities is unaffected and can continue to transmit.
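For reference, the sketch below constructs a simplified PFC pause frame in Python to show how per-priority pausing differs from a basic 802.3x pause: the MAC Control opcode 0x0101 carries a priority-enable vector and eight independent pause timers, so only the flagged priorities are paused. Padding and the frame check sequence are omitted, and the quanta value is arbitrary.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")          # MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808

def pfc_pause_frame(src_mac, paused_priorities, quanta=0xFFFF):
    """Build a simplified PFC frame: opcode 0x0101, priority-enable vector,
    and one 16-bit pause timer per priority (padding and FCS omitted)."""
    enable_vector, timers = 0, []
    for prio in range(8):
        paused = prio in paused_priorities
        enable_vector |= (1 << prio) if paused else 0
        timers.append(quanta if paused else 0)     # pause time in 512-bit-time quanta
    payload = struct.pack("!HH8H", 0x0101, enable_vector, *timers)
    return PAUSE_DST + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE) + payload

# Pause only priority 4 (e.g., storage traffic); other priorities keep transmitting.
print(pfc_pause_frame(b"\x02" * 6, {4}).hex())
```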
On a link, both the CX switch and attached device must locally assign a priority to application traffic and indicate that priority to its peer on the link. Traffic priority can be signaled using either 802.1p Priority Code Point (PCP) values or Differentiated Services Code Point (DSCP) values.
PCP Priority Marking
The IEEE 802.1Qbb standard uses 802.1p PCP values in an 802.1Q header to assign application traffic priority. The three-bit PCP field allows for eight Class of Service (CoS) priority values (0-7). PCP-based PFC requires the use of trunk links with VLAN tags to add an 802.1Q header to a frame.
The diagram below illustrates the PCP bits used to specify 802.1p CoS priorities in the 802.1Q header of an Ethernet frame.
By default, there is a one-to-one mapping of CoS priorities to local priorities on the switch used for frame queueing.
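The field positions and the default mapping can be expressed compactly. In the sketch below, the PCP value occupies the top three bits of the 16-bit Tag Control Information (TCI) field, and the default one-to-one CoS-to-local-priority mapping is shown as a simple dictionary; the example TCI value is arbitrary.

```python
def pcp_from_tci(tci: int) -> int:
    """PCP is the top 3 bits of the 16-bit TCI (followed by DEI and the 12-bit VLAN ID)."""
    return (tci >> 13) & 0x7

# Default one-to-one mapping of CoS priorities to local priorities used for queueing.
cos_to_local_priority = {cos: cos for cos in range(8)}

tci = 0x8064                                  # PCP 4, DEI 0, VLAN 100
cos = pcp_from_tci(tci)
print(cos, cos_to_local_priority[cos])        # -> 4 4
```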
DSCP Priority Marking
Lossless behavior between two data center hosts requires that both hosts and all switches in the data path have a consistent PFC configuration. If a routed-only interface is in the data path, application priority can be preserved by specifying a priority using the DSCP bits in the IP header. DSCP bits also can be used to mark application traffic priority on both 802.1Q tagged and untagged switch access ports.
The diagram below illustrates the DSCP bits located in the legacy Type-of-Service (ToS) field of the IP header.
The six-bit DSCP field allows for 64 DiffServ priority values. By default, DiffServ values are mapped in sequential groups of eight to each of the eight individual local-priority values.
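The default grouping amounts to an integer division by eight: DSCP 0-7 maps to local-priority 0, DSCP 8-15 to local-priority 1, and so on. A minimal sketch of that arithmetic:

```python
def default_local_priority(dscp: int) -> int:
    """Default mapping: DSCP values in sequential groups of eight share a local priority."""
    assert 0 <= dscp <= 63
    return dscp // 8

print(default_local_priority(26))   # AF31 (DSCP 26) -> local-priority 3
print(default_local_priority(46))   # EF   (DSCP 46) -> local-priority 5
```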
CX switches support a mix of CoS and DSCP priority values by allowing each interface to specify which QoS marking method is trusted. When a mix of strategies is present on different switch ports, traffic may require re-marking between Layer 2 CoS priority values and Layer 3 DSCP values.
Responding to the growth of routed spine-and-leaf data center architectures and VXLAN overlays, an increasing number of hosts and storage devices support DSCP-based priority marking. This enables consistent QoS markings across a routed domain without the need to translate between Layer 2 CoS values and Layer 3 DSCP values on network switches.
In addition to CoS and DSCP values, CX switches can apply a classifier policy to ingress traffic to assign priorities (PCP, DSCP, and local) based on header field values in the packet.
When a frame is encapsulated for VXLAN transport, the DSCP priority of the encapsulated traffic is carried in the outer VXLAN packet's IP header to ensure proper queueing across the transport network.
PFC Operations
CX data center switches support a special shared QoS buffer pool dedicated for lossless traffic. The CX 8325, 10000, and 9300 models support up to three lossless pools. Typically, only one lossless queue is defined for storage traffic. Each lossless pool is assigned a size, headroom capacity, and associated local priority values. The buffers assigned to a lossless pool are allocated from the total available buffer memory on the device, which is assigned to a single lossy pool by default. The CX 8100 and 8360 support a single, fixed lossless pool for smaller data centers.
Received frames are assigned a local priority value based on a mapping of PCP and DSCP values to local priority values. A frame is placed into the special lossless buffer pool when its local priority value is associated with a lossless queue. When a port’s allocation of the shared lossless buffer pool nears exhaustion, packet drops are avoided by notifying the directly-connected sender to stop transmitting frames with the queue’s associated priority value for a short period of time. The headroom pool stores packets that arrive at the interface after a pause in transmission was requested for the associated priority.
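The per-priority pause decision can be modeled as a pair of thresholds on the shared lossless pool plus a headroom allowance for frames already in flight when the pause is requested. The sketch below is a conceptual model only; the thresholds, units, and buffer sizes are illustrative rather than CX defaults.

```python
class LosslessQueue:
    """Toy model of a lossless queue: request a pause near exhaustion, absorb
    in-flight frames in headroom, and resume once the queue drains below a lower threshold."""
    def __init__(self, shared_limit, headroom, xoff_ratio=0.9, xon_ratio=0.7):
        self.shared_limit, self.headroom = shared_limit, headroom
        self.xoff = int(shared_limit * xoff_ratio)     # pause-request threshold
        self.xon = int(shared_limit * xon_ratio)       # resume threshold
        self.depth, self.paused = 0, False

    def enqueue(self, frames):
        capacity = self.shared_limit + (self.headroom if self.paused else 0)
        accepted = min(frames, max(0, capacity - self.depth))
        self.depth += accepted
        if not self.paused and self.depth >= self.xoff:
            self.paused = True                         # send a PFC pause for this priority
        return frames - accepted                       # drops (0 when headroom is sized correctly)

    def dequeue(self, frames):
        self.depth = max(0, self.depth - frames)
        if self.paused and self.depth <= self.xon:
            self.paused = False                        # send a resume (zero-quanta pause)

q = LosslessQueue(shared_limit=100, headroom=30)
print(q.enqueue(95), q.paused)    # -> 0 True   XOFF threshold crossed, pause requested
print(q.enqueue(30), q.paused)    # -> 0 True   in-flight frames land in headroom, no drops
q.dequeue(70); print(q.paused)    # -> False    depth fell below XON, transmission resumes
```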
PFC support is included on the CX 8325, 9300, 10000, 8360, and 8100. However, traffic arriving on a CX 10000 with a QoS priority associated with a lossless queue will not be sent to the AMD Pensando Distributed Processing Unit (DPU) for policy enforcement or enhanced monitoring.
The diagram below illustrates the queuing relationship between a sender and a CX switch receiver with two queues defined using CoS priority values. All priorities are mapped to the default lossy queue or to a single lossless queue. Using two queues on the CX platform provides the best queue depth and burst absorption.
A PFC pause notification briefly stops transmissions related to a single application by its association with a priority queue number.
Storage is the most common application for lossless Ethernet. Applying the diagram above to a storage scenario, all storage traffic is assigned a PCP value of 4, which is mapped to local-priority 4. When storage traffic is received on the CX switch, it is placed in the lossless QoS queue dedicated for storage. Traffic assigned to the lossy queue does not impact buffer availability for the lossless storage traffic. When the lossless storage queue on the CX switch reaches a threshold nearing exhaustion, a pause frame is sent to inform the sender to pause only storage traffic. All other traffic from the sender continues to be forwarded and is placed in the shared lossy queue on the CX switch, if buffers are available.
Link-Level Flow Control (LLFC)
PFC is the preferred flow-control strategy, but it requires data center hosts to support marking traffic priority appropriately. PFC is built into specialized HBAs and is required for RoCE compliance.
LLFC can enable lossless Ethernet when implemented in combination with other QoS components for prioritization, queueing, and transmission. Many virtual and physical storage appliances do not support PFC or other DCB protocols, but LLFC is widely supported on most standard Ethernet network interface cards (NICs). Implementing LLFC extends the benefits of lossless data transport to hosts that do not support PFC and for non-RoCE protocols.
All traffic received on a switch port using LLFC is treated as lossless. It is recommended to minimize sending lossy traffic from a host connected to a link using LLFC.
When a CX switch sends an LLFC pause frame to an attached device, it pauses all traffic from that source instead of from a single targeted application. The pause in transmission gives the switch time to transmit frames in its lossless queues and prevents frame drops.
Application traffic priority is typically not signaled from a source limited to link-level flow control. In place of the source marking traffic priority, a classifier policy is implemented on the CX ingress port to identify application traffic that should be placed in a lossless queue by matching defined TCP/IP characteristics. When an interface also trusts DSCP or CoS priority values, the trusted QoS markings are honored and take precedence over a custom policy prioritization.
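Conceptually, the ingress classification works like the sketch below: a trusted QoS marking is honored first, and otherwise the policy matches TCP/IP characteristics to choose a local priority. The port number and priority values here are illustrative examples, not a prescribed policy.

```python
def classify_local_priority(pkt, trusted_dscp=None):
    """Assign a local priority: honor a trusted DSCP marking when present,
    otherwise match TCP/IP characteristics in a classifier-style policy."""
    if trusted_dscp is not None:
        return trusted_dscp // 8                       # trusted marking takes precedence
    if pkt.get("proto") == "tcp" and pkt.get("dport") == 3260:
        return 4                                       # e.g., iSCSI placed in a lossless queue
    return 0                                           # everything else stays in the lossy queue

print(classify_local_priority({"proto": "tcp", "dport": 3260}))   # -> 4
print(classify_local_priority({"proto": "tcp", "dport": 443}))    # -> 0
```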
Enhanced Transmission Selection (ETS)
ETS allocates a portion of the available transmission time on a link to an application using its association with a priority queue number. This helps to ensure buffer availability by guaranteeing that the application traffic has sufficient bandwidth to transmit queued frames. This behavior reduces the probability of congestion and dropped frames.
Allocation of bandwidth is divided among traffic classes. ETS is implemented on CX switches using QoS scheduling profiles, where locally defined queues are treated as a traffic class. Traffic is associated with a queue by associating it with a local priority value. Traffic can be mapped to local priorities based on DSCP priorities, CoS priorities, or TCP/IP characteristics using a classifier policy.
Aruba CX 8325, 10000, and 9300 switches apply a deficit weighted round robin (DWRR) strategy to calculate a queue’s bandwidth allocation by applying a weight to each queue in a scheduling profile. The following example shows the resulting percentage of bandwidth associated with a queue for the collective set of weights.
Queue Number | Weight | Guaranteed Bandwidth |
---|---|---|
Queue 0 (Lossy) | 8 | 40% |
Queue 1 (Lossless) | 10 | 50% |
Queue 2 (Lossless) | 2 | 10% |
In the example above, storage traffic can be assigned to queue 1, which guarantees storage traffic the ability to consume up to 50% of the link’s bandwidth. When a class of traffic is not consuming its full allocation, other classes are permitted to use it. This enables the link to operate at full capacity, while also providing a guaranteed allocation to each traffic class. When a link is saturated, each class can consume only the bandwidth allocated to it based on the assigned weights.
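The guaranteed percentages follow directly from each queue's share of the total weight (weight divided by the sum of all weights). A quick check of the example values:

```python
weights = {"queue0_lossy": 8, "queue1_lossless": 10, "queue2_lossless": 2}
total = sum(weights.values())

for queue, weight in weights.items():
    print(f"{queue}: {weight / total:.0%} guaranteed bandwidth")
# queue0_lossy: 40%, queue1_lossless: 50%, queue2_lossless: 10%
```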
Multiple scheduling profiles can be defined, but an individual port is assigned a single profile that governs its transmission schedule.
The following diagram illustrates traffic arriving on a switch, being placed in a queue, and the reserved bandwidth per queue of the outgoing port. Scheduling enforcement occurs when the outgoing port is saturated and the ingress rate for each traffic class meets or exceeds the reserved bandwidth configured on the outgoing port.
When the outgoing port is not oversubscribed, per-queue transmission rates can differ from the configured weights, and the unused bandwidth allocation of one class may be consumed by another class. For example, if the port is transmitting at 85% of its capacity, where 60% is from queue 0, 20% is from queue 1, and 5% is from queue 2, the switch does not need to enforce the scheduling algorithm. The lossy traffic in queue 0 is allowed to consume the unused capacity assigned to other traffic classes and transmit at a higher rate than the schedule specifies.
Data Center Bridging Exchange (DCBx)
DCBx-capable hosts dynamically set PFC and ETS values advertised by CX switches. This ensures a consistent configuration between data center hosts and attached switches. DCBx also informs compute and storage hosts of application-to-priority mappings, which ensures that traffic requiring lossless queuing is marked appropriately. Lossless Ethernet configuration on connected hosts becomes a plug-and-play operation by removing the administrative burden of configuring PFC, ETS, and application priority mapping on individual hosts.
DCBx is a link-level communication protocol that employs Link Layer Discovery Protocol (LLDP) to share settings. PFC, ETS, and application priority settings are advertised from the switch using specific LLDP Type-Length-Value (TLV) data records. CX switches set the willing bit to 0 in all TLVs to indicate that they will not change their configuration to match a peer's configuration. CX switches support both IEEE and Converged Enhanced Ethernet (CEE) DCBx versions.
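The effect of the willing bit can be summarized with a small decision function: a peer that advertises willing = 1 adopts the settings of a peer advertising willing = 0 (the CX switch), while two unwilling peers each keep their local configuration. This is a simplified model of the DCBx parameter exchange, not an implementation of the full state machine.

```python
def resolve_dcbx(local_config, local_willing, peer_config, peer_willing):
    """Return the operational configuration for the local device after a DCBx exchange."""
    if local_willing and not peer_willing:
        return peer_config           # adopt the unwilling peer's advertised settings
    return local_config              # otherwise keep the locally administered settings

host_pfc = {"lossless_priorities": []}         # unconfigured host advertising willing = 1
switch_pfc = {"lossless_priorities": [4]}      # CX switch advertising willing = 0
print(resolve_dcbx(host_pfc, True, switch_pfc, False))   # host adopts priority 4 as lossless
```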
IP Explicit Congestion Notification (ECN)
IP ECN is a flow-control mechanism that reduces traffic transmission rates between hosts when a network switch or router in the path signals that congestion is present. IP ECN can be used between hosts separated by multiple network devices and on different routed segments. It is required for RoCEv2 compliance.
Hosts that support IP ECN mark packets as ECN-capable by setting one of the two ECN bits (the formerly reserved bits of the ToS field) in the IP header to 1. When a switch or router in the communication path experiences congestion, it sets the remaining ECN bit to 1, producing the Congestion Experienced (CE) codepoint, which informs the traffic receiver that congestion is present in the communication path.
When the traffic receiver is notified of congestion, it signals this to the source by sending an IP unicast message. The source responds by reducing its data transmission rate for a brief period of time.
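The ECN field occupies the two least-significant bits of the IP ToS/Traffic Class byte. The sketch below lists the standard codepoints (per RFC 3168) and shows the congestion-experienced re-marking performed by a device in the path; the DSCP value in the example is arbitrary.

```python
NOT_ECT = 0b00   # sender does not support ECN
ECT_1   = 0b01   # ECN-capable transport
ECT_0   = 0b10   # ECN-capable transport
CE      = 0b11   # congestion experienced, set by a switch or router in the path

def mark_congestion(tos: int) -> int:
    """If the packet is ECN-capable, set both ECN bits (CE) instead of dropping it."""
    if (tos & 0b11) in (ECT_0, ECT_1):
        return (tos & ~0b11) | CE
    return tos                        # non-ECN traffic is left unchanged (and may be dropped)

tos = (26 << 2) | ECT_0               # DSCP 26 with ECT(0) set by the sender
print(f"{mark_congestion(tos):08b}")  # -> 01101011 (DSCP preserved, ECN bits now 11)
```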
IP ECN smooths traffic flows under most conditions, reducing the need for PFC to trigger a full pause; PFC remains valuable as a fast-acting mechanism to address microbursts.
IP ECN also can be implemented to improve the performance of non-RoCE protocols, such as iSCSI.
Storage Positioning
Storage in a data center is typically deployed as a SAN, part of hyper-converged infrastructure (HCI), or as disaggregated HCI (dHCI).
SANs comprise one or more dedicated storage appliances that are connected to servers over a network. A proprietary network using storage based protocols, such as FibreChannel, can be used to connect servers to storage. However, IP-based solutions over an Ethernet network provide a high-bandwidth, low-cost option, with accelerating adoption levels. Common IP-based SAN protocols include iSCSI and RoCE.
HCI converges the storage and compute capabilities of off-the-shelf x86 infrastructure, providing a cloud-like resource management experience. Each x86 host in the HCI environment provides both distributed storage and compute services. The local storage on an HCI cluster member can be used by any other member of the cluster. This provides a simple scaling model, where adding an x86 node adds both storage and compute to the cluster.
HPE's dHCI solution divides compute and storage resources into separate physical host pools to allow flexible scaling of one resource at a time. In the traditional HCI model, both storage and compute must be increased together when adding an x86 node. This can be costly if only an increase in one resource is required. For example, if additional compute is required and storage is already adequately provisioned in an HCI solution, significant additional storage is still added to the cluster regardless of need. dHCI supports scaling compute and storage individually, while using x86 hardware for both compute and storage services.
All the above storage models improve performance when using lossless Ethernet.
Parallel Storage Network
Traditionally, a storage network is deployed in parallel with a data network using proprietary network hardware and protocols to support the reliability needs of storage protocols such as FibreChannel and InfiniBand. TCP/IP-based storage models enabled the migration to lower-cost Ethernet-based network infrastructure, and using a parallel set of storage Ethernet switches became a common method of avoiding competition between storage and data hosts for network bandwidth.
When implementing a dedicated storage network over Ethernet, congestion resulting in dropped frames can still occur, so deploying the full suite of Layer 2 DCB protocols (DCBx, PFC, and ETS) is recommended to maximize storage networking performance.
The diagram below illustrates a dedicated Ethernet storage network deployed in parallel to a data network. Lossless Ethernet protocols are recommended even when using a dedicated storage network.
Converged Data/Storage Network
High speed top-of-rack (ToR) switches with high port density facilitate the convergence of storage and data networks onto the same physical Ethernet switch infrastructure. An organization can maximize its budgetary resources by investing in a single network capable of handling data and storage needs.
A converged storage and data network requires queueing and transmission prioritization to ensure that network resources are allocated appropriately for high-speed storage performance. IP ECN provides additional flow-control options to smooth traffic flow and improve performance. DCBx is beneficial to automate PFC and ETS host configuration.
The diagram below illustrates protocols and positioning to achieve lossless Ethernet in a two-tier data center model.
A spine and leaf network architecture allows linear scaling to reduce oversubscription and competition for network resources. This is achieved by adding spine switches to increase east-west network capacity. Spine and leaf networks use Layer 3 protocols between data center racks, which requires mapping 802.1p priorities to DSCP values to ensure consistent QoS prioritization of traffic across the network infrastructure.
iSCSI
iSCSI is one of the most prevalent general purpose SAN solutions. Standard iSCSI is TCP-based and supports routed IP connectivity, but initiators and targets are typically deployed on the same Layer 2 network. Lossless Ethernet is not a requirement for iSCSI, but it can improve overall performance. Many iSCSI storage arrays using 10 Gbps or faster network cards support PFC and ETS.
When PFC is not supported, LLFC can be used to achieve a lossless Ethernet fabric. Separate switching infrastructure can be implemented to avoid competition between storage and compute traffic, but lossless Ethernet enables the deployment of a single converged network to reduce both capital and operating expenditures.
The following diagram illustrates the components of a converged data and iSCSI storage network.
High Availability
Applications using lossless Ethernet are typically critical to an organization’s operations. To maintain application availability and business continuity, redundant links from Top-of-Rack (ToR) switches provide attached hosts continued connectivity in case of a link failure. Use a data center network design that provides redundant communication paths and sufficient bandwidth for the application. The Data Center Connectivity Design guide details network design options.
CX Switch Lossless Ethernet Support
The following illustration summarizes lossless Ethernet and storage protocol support on HPE Aruba CX data center switches, along with the feature requirements for common storage protocols.
HPE Storage Validation for CX Switches
Single Point Of Connectivity Knowledge (SPOCK) is a database that compiles validated compatibility for HPE Storage components, including CX switches. HPE Aruba Networking CX 8325 and CX 9300 series switches have been SPOCK validated and approved by the HPE Storage Networking Stream.