Technology provides great utility in almost every aspect of our lives, but it can be frustrating to work with at times. As technology gets better, our expectations for how applications should perform keep climbing. What expectations, you ask? I’m referring to the impatience you feel when a page doesn’t load in half a second, when a button click doesn’t give you near-instantaneous feedback, or when your favorite website is down due to network issues. Frustrating, right? To keep up with these rising performance expectations, we need innovations in how we transport data across the internet. Luckily, great companies are making strides to satisfy our demands.
In this blog, we’ll take a look at how the standard TCP protocol works at a high level, discuss Scalable Reliable Datagram (SRD) and the benefits it brings over traditional TCP networking, and see how AWS’s Elastic Network Adapter (ENA) Express leverages these performance improvements.
Transmission Control Protocol
Networking on the internet is commonly described using the Open Systems Interconnection (OSI) model, which is made up of seven layers: the physical layer (L1), the data link layer (L2), the network layer (L3), the transport layer (L4), the session layer (L5), the presentation layer (L6), and the application layer (L7). The transport layer is responsible for end-to-end data transmission between devices. The best-known example of a transport-layer protocol is TCP, which is built on top of the IP protocol in the network layer; the pair is commonly referred to as TCP/IP. On the sending device, the transport layer breaks data into smaller segments before handing them to the network layer; on the receiving device, it reassembles those segments into the original piece of data. The transport layer also handles flow control, which dictates optimal data transfer speeds, and error handling, which ensures that missing or dropped segments are retransmitted.
Transmission Control Protocol (TCP) is one of the basic communication standards that define the rules of the internet, used to transport packets (pieces of data) across it. To put it simply, it ensures that the packets you send when you click that button make it to their destination. TCP requires a virtual connection to be established between a client and a server before transmitting packets, and it guarantees that packets are delivered in the order they were transmitted. This is an important concept to understand for the rest of the blog post. Other protocols, like the User Datagram Protocol (UDP), don’t require an established connection between the client and server or guarantee ordering, which makes them faster and cheaper than TCP, albeit less reliable, but we won’t dive deep into those in this article.
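The two TCP properties above are easy to see with Python’s standard socket library. This is a minimal local sketch (loopback address, OS-assigned port), not anything AWS-specific: the connection must be established before any data moves, and the byte stream arrives in the order it was sent.

```python
import socket
import threading

def run_server(server_sock, results):
    conn, _ = server_sock.accept()        # the TCP handshake completes here
    chunks = []
    while True:
        data = conn.recv(1024)
        if not data:                      # empty read means the peer closed
            break
        chunks.append(data)
    conn.close()
    # TCP hands the application an ordered byte stream
    results.append(b"".join(chunks))

server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
server_sock.listen(1)
port = server_sock.getsockname()[1]

results = []
t = threading.Thread(target=run_server, args=(server_sock, results))
t.start()

# Client: a connection is required before transmitting anything
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
for segment in [b"A", b"B", b"C", b"D", b"E", b"F", b"G"]:
    client.sendall(segment)
client.close()
t.join()

print(results[0])  # the segments reassemble, in order, on the receiving end
```

Even if the kernel coalesces or splits the segments in transit, the receiver always observes the same ordered stream, which is exactly the guarantee the rest of this post keeps coming back to.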
To make developers’ lives easier, the TCP protocol sends every packet of a connection down the same path and reorders packets on the receiving end to ensure the integrity of the message (A->B->C->D->E->F->G).
Reordering packets sequentially on the receiving end reduces the amount of work an application built on TCP has to do, but because each TCP connection relies on a single path, it can’t leverage the benefits of a multipath network like the one in the image above. With a single path, if the connection encounters an issue, say a slow router, you can’t route around that troublesome router even though there’s plenty of additional capacity elsewhere in the network. TCP has powered the internet for years and has provided mostly reliable message delivery, but as you can see, it isn’t perfect. For the intense demands of high performance computing workloads, TCP doesn’t provide the ideal solution. There are ways to improve the protocol to take full advantage of a wide network, which we’ll touch on shortly, but first, let’s look at an area that would greatly benefit from improved networking speeds.
High Performance Computing (HPC) and Network Latency
The evolution of technology produces faster, smarter, and more reliable applications, sometimes to a breathtaking degree (just look at ChatGPT). Modern applications depend on processing and analyzing massive amounts of data to become smarter, and the cost of doing so is significant. Large data processing tasks typically leverage a fleet of servers that perform calculations and share the results with each other. This server-to-server communication incurs network latency with every exchange, which, repeated millions of times, can pose a serious performance bottleneck. That latency can be greatly reduced by using the Scalable Reliable Datagram protocol. There are many more potential bottlenecks in HPC workloads, but in this article we’ll focus on the networking aspects.
Scalable Reliable Datagram (SRD) – How it Works
Amazon Labs took a fresh look at the network and developed the Scalable Reliable Datagram (SRD) protocol, which consistently provides cheap, low-latency, scalable, and reliable message delivery. The key feature SRD brings to the table is transmitting packets via network multipathing. Multipathing allows server A to deliberately use different paths within the network to transmit packets to their destination quickly. The picture below depicts server A sending packets 1 and 2 simultaneously. Naturally, this means packets can arrive at their destination out of order, unlike with TCP.
Path #1: A -> B -> C -> D -> E -> F -> G
Path #2: A -> H -> I -> J -> E -> K -> G
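The core idea can be sketched in a few lines of Python. This is a toy model, not the real SRD implementation: the sender sprays sequence-numbered packets across the two paths above, the paths have different (made-up) delays so packets arrive out of order, and the receiving end restores the original order using the sequence numbers.

```python
# Toy multipath model: two paths from the figure, with hypothetical delays.
message = ["pkt0", "pkt1", "pkt2", "pkt3", "pkt4"]
paths = {"A-B-C-D-E-F-G": 3.0, "A-H-I-J-E-K-G": 1.0}  # assumed path delays

# Sender: tag each packet with a sequence number and alternate between paths.
in_flight = []
for seq, payload in enumerate(message):
    _path, delay = list(paths.items())[seq % len(paths)]
    arrival_time = delay + seq * 0.1      # later sends arrive slightly later
    in_flight.append((arrival_time, seq, payload))

# Network: delivery order is arrival time, not send order.
arrivals = sorted(in_flight)
arrival_order = [payload for _, _, payload in arrivals]

# Receiver: reassemble the message by sequence number.
received = [payload for _, _, payload in sorted(arrivals, key=lambda p: p[1])]
```

In this toy run the fast path delivers `pkt1` and `pkt3` before `pkt0` ever lands, yet the reassembled message is intact. In the real system this reordering is handled below the application (for ENA Express, on the Nitro hardware), so applications never see the shuffle.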
Another benefit of SRD is that it detects dropped or missing packets much more quickly than TCP: TCP has to work across many different networking environments, from the open internet to the cloud, whereas SRD is purpose-built and optimized for the AWS network. SRD handles packet retransmissions in microseconds instead of milliseconds and can retransmit packets over a different path within the network to avoid a congested route. And while standard TCP processing consumes resources on the guest operating system, SRD runs on dedicated resources on AWS’s Nitro Controller, so it avoids impacting an application’s performance.
With all of these improvements, SRD delivers significantly lower latency and higher throughput than traditional networking protocols. As Peter DeSantis said in his re:Invent 2022 keynote presentation, developing SRD on the AWS network was a bit like inventing the wheel: AWS can now apply this performance upgrade across many different areas, essentially acting as a backbone and supercharger for its services. AWS’s Elastic Network Adapter (ENA) is one of the services that saw a significant performance upgrade from leveraging SRD, so let’s take a look at ENA in detail.
Elastic Network Adapter Express (ENA Express)
AWS’s Elastic Network Adapter is the standard network driver used with AWS’s EC2 instances and leveraged by traditional communication protocols such as TCP. ENA works directly with the AWS Nitro Controller to offload tasks from the server so users can devote more of the underlying EC2 resources to their workloads. During re:Invent 2022, Peter DeSantis introduced an upgrade to ENA: Elastic Network Adapter Express. ENA Express seamlessly layers SRD underneath protocols like TCP and UDP without requiring any modifications to your application code. Enabling ENA Express on your EC2 instances takes only a single CLI command or console toggle: enable it on your ENA interface and you’ll start seeing lower latency and higher throughput. Using ENA Express is completely free, so there’s really no reason not to use it. To enable it in the console, go to the EC2 service, select Network Interfaces in the left-hand column, and then select the network interface you want to enable ENA Express on.
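If you prefer to script the toggle rather than click through the console, the same change can be made with the AWS SDK. Below is a hedged sketch using boto3’s `modify_network_interface_attribute` call; the interface ID is a placeholder, and you should double-check the parameter shape against the current boto3 documentation before relying on it.

```python
def ena_express_request(eni_id, enable_udp=False):
    """Build the request body for EC2 ModifyNetworkInterfaceAttribute.

    The EnaSrdSpecification structure mirrors the EC2 API for ENA Express;
    enable_udp additionally turns on SRD for UDP traffic.
    """
    return {
        "NetworkInterfaceId": eni_id,
        "EnaSrdSpecification": {
            "EnaSrdEnabled": True,
            "EnaSrdUdpSpecification": {"EnaSrdUdpEnabled": enable_udp},
        },
    }

# Placeholder ENI ID for illustration only.
params = ena_express_request("eni-0123456789abcdef0")

# With the AWS SDK installed and credentials configured, the actual call is:
# import boto3
# ec2 = boto3.client("ec2")
# ec2.modify_network_interface_attribute(**params)
```

The equivalent single CLI command uses the same `EnaSrdSpecification` structure via `aws ec2 modify-network-interface-attribute`.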
After you select your network interface, click Actions; at the bottom of the list you should see an option to Manage ENA Express. Clicking it opens a simple dialog asking whether you want to enable ENA Express on this interface.
Yep, it’s that easy.
Enabling ENA Express on your ENA interface increases the maximum single-flow throughput from 5 Gbps to 25 Gbps, a fivefold (400%) increase. ENA Express can also improve your P99.9 latency by up to 85% for high-throughput workloads. When ENA Express is enabled on your EC2 instances, it detects compatibility between them and establishes an SRD connection. Keep in mind that this only works if both EC2 instances have ENA Express enabled and live in the same availability zone and subnet! Once the connection is established, your network traffic takes advantage of all the benefits of SRD.
As of December 2022, ENA Express is only supported on c6gn instances running in the same availability zone, but support for more EC2 instance types is coming in the near future. ENA Express brings loads of benefits to HPC and ML workloads at no additional cost. With 5x higher single-flow throughput and significant latency improvements, it opens up many doors for the next generation of HPC and ML applications.