Introduction
Tier-1 telco O2 Telefónica Germany recently announced a major bet on the public cloud with its new 5G Cloud network, built entirely on AWS in collaboration with Nokia.1 O2 Telefónica plans to migrate 40% of its customer base to this new 5G core network on AWS during the 2025-2026 timeframe. This is the first time a leading telco has chosen to move its existing network and customers to a 5G core network running on a hyperscaler's public cloud, although others have done so for greenfield projects. For example, DISH Network has opted to construct its entire 5G infrastructure on the public cloud, at an estimated cost of as little as $10 billion, about 50% less than using traditional on-premises infrastructure. And in 2021, AT&T announced plans for a multi-year project, still underway, to run its mobility network on Microsoft's Azure for Operators cloud to deliver 5G services at scale. These telcos believe the public cloud can offer unparalleled agility, scalability, and cost savings, upending the conventional wisdom that telecom giants would stick with on-premises data centers and private cloud infrastructure for their core networks. Other telcos are certain to follow.
The cost advantages of replacing physical network infrastructure with virtualized software are undeniable given 5G’s vastly greater infrastructure requirements compared to LTE. Telcos already understand this from their experience implementing network functions virtualization (NFV) in their own data centers and private clouds. Public cloud services offer significantly greater levels of scalability and agility, providing real-time access to virtually unlimited on-demand resources at a much lower price point than on-premises NFV. Additionally, telcos will generally see an immediate boost to their bottom line by moving to the public cloud, as this shift transitions them from a CapEx to an OpEx model for their infrastructure purchases. Instead of buying infrastructure upfront and treating it as an asset, they pay for cloud service subscription fees treated as recurring expenses.
The Source of Hidden Costs in Cloud-Native 5G Networks
However, running 5G network functions in a virtualized environment, whether as virtual network functions (VNFs) in VMs or as container network functions (CNFs), can incur significant performance overhead and waste resources, leading to unnecessary bandwidth and server upgrades. These upgrades can increase cloud service costs by at least 80% beyond what they otherwise should be. The source of this performance overhead and inefficient resource usage is the massive and growing amount of packet delay variation (PDV), or jitter, impacting today's networks (a simple way to measure it is sketched after the list below). In cloud-native 5G networks, jitter comes from four main sources that compound each other:
- Application Behavior: 5G networks typically support applications like real-time streaming, IoT, AR, VR, autonomous vehicles, remote surgery, and generative AI. These applications generally transmit data in random, often massive bursts with variable payload sizes, leading to irregular transmission and processing times, i.e., jitter. For IoT and other mobile applications, the effects are multiplied as devices move around and more devices join the network.
- Application Architecture: Like many hosted applications, the CNFs that comprise cloud-native 5G networks are implemented using a distributed microservices architecture. This architecture breaks applications into multiple reusable components, enhancing portability and flexibility, and moves some data and processing to the edge, improving response times and reducing bandwidth usage. However, it also increases the number of network hops between the cloud and the edge needed to complete an application's processing across its distributed components. These additional hops amplify the jitter caused by application behavior.
- Cloud and Edge Environments: In virtualized environments, scheduling conflicts due to competition between hosted applications for virtual and physical CPU, memory, storage, and network resources create random delays that intensify jitter originating from application behavior. Additionally, cloud network overlays like VXLAN and GRE introduce random packet encapsulation/decapsulation delays as packets move between virtual and physical subnets, creating still more jitter.
- Inherent Factors in 5G's Architecture: 5G's smaller cells, poorer propagation characteristics, and clear-path requirements mean that any obstacle can deflect, refract, or diffract signals, resulting in variation in packet delivery times. Furthermore, 5G networks rely on GTP (the GPRS Tunneling Protocol), which encapsulates application traffic, including TCP and UDP packets, for transport across the network infrastructure. As packets travel between client devices and the core cloud network, GTP encapsulation and decapsulation introduce further variability in packet delivery times. These random delays come on top of the network-overlay encapsulation/decapsulation delays that occur as packets move between virtual and physical subnets in the cloud environment hosting the 5G network.
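To make the discussion concrete, the following minimal sketch shows one standard way to quantify the jitter described above: the interarrival jitter estimator defined in RFC 3550, which tracks how much the spacing between packets changes between sender and receiver. The timestamps and function name are purely illustrative; this is not code from any 5G network function.

```python
# Minimal sketch (illustrative data, not from any production network): quantifying
# packet delay variation with the interarrival jitter estimator defined in RFC 3550.

def interarrival_jitter(send_times_ms, recv_times_ms):
    """Return the running RFC 3550 jitter estimate, in the same units as the inputs."""
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times_ms, recv_times_ms):
        transit = received - sent              # one-way transit time for this packet
        if prev_transit is not None:
            d = abs(transit - prev_transit)    # change in spacing between consecutive packets
            jitter += (d - jitter) / 16.0      # exponential smoothing, per RFC 3550 section 6.4.1
        prev_transit = transit
    return jitter

if __name__ == "__main__":
    # Hypothetical stream: packets sent every 20 ms, arriving with bursty, variable delay.
    send_times = [0, 20, 40, 60, 80, 100]      # ms
    recv_times = [5, 27, 44, 71, 84, 119]      # ms
    print(f"estimated jitter: {interarrival_jitter(send_times, recv_times):.2f} ms")
```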
Jitter’s Broader Impact
Jitter has a far more serious knock-on effect on network and application performance than the added latency caused by the random delays outlined above. TCP, the transport protocol widely used by applications that require guaranteed packet delivery, and by the public cloud services such as AWS and Microsoft Azure that host them, consistently treats jitter as a sign of congestion. To prevent data loss, TCP responds by retransmitting packets and throttling traffic, even when plenty of bandwidth is available. Even modest amounts of jitter can cause throughput to collapse and applications to stall, despite the network being far from saturated, adversely impacting not only TCP but also UDP and other non-TCP traffic sharing the network.
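The mechanics are visible in TCP's own timer math. Under RFC 6298, the retransmission timeout is derived from a smoothed RTT plus four times the measured RTT variance, so jitter directly reshapes the timer even when average delay and available bandwidth are unchanged; delay spikes and reordering that defeat the timer or trigger duplicate ACKs are then handled exactly as if packets had been lost, with retransmissions and congestion-window cuts. The sketch below, using invented RTT samples, shows how much the variance term moves the timeout for a path whose average delay is roughly the same.

```python
# Minimal sketch (illustrative RTT samples): how RTT variance, i.e. jitter, feeds
# TCP's retransmission timeout under RFC 6298. A real stack also floors the RTO
# (1 s in the RFC, ~200 ms in Linux) and reacts to timeouts by shrinking cwnd.

def rto_from_samples(rtt_samples_ms, alpha=1/8, beta=1/4, k=4):
    """Return the RTO (ms) after feeding the RTT samples through RFC 6298."""
    srtt = rttvar = None
    for rtt in rtt_samples_ms:
        if srtt is None:
            srtt, rttvar = rtt, rtt / 2                      # first measurement
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt)
            srtt = (1 - alpha) * srtt + alpha * rtt
    return srtt + k * rttvar

if __name__ == "__main__":
    steady  = [50] * 20                                      # calm 50 ms path
    jittery = [20, 90, 35, 110, 25, 95, 30, 100] * 2         # similar average, high variance
    print(f"steady path RTO:  {rto_from_samples(steady):6.1f} ms")
    print(f"jittery path RTO: {rto_from_samples(jittery):6.1f} ms")
```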
Throughput collapse in response to jitter is triggered at the network transport layer (layer 4 of the OSI stack) by TCP's congestion control algorithms (CCAs). These algorithms cannot determine whether jitter is due to actual congestion or to other factors such as application behavior, virtualization, or wireless network issues. However, the standard approaches network administrators turn to, including those that AI networking solutions employ to improve network performance, generally don't operate at the transport layer. When they do, they do little or nothing to address jitter-induced throughput collapse, and sometimes make it worse:
- Jitter Buffers – Jitter buffers work at the application layer (layer 7) by reordering packets and realigning packet timing to adjust for jitter before packets are passed to an application (see the playout-buffer sketch after this list). While this may work for some applications, packet reordering and realignment create random delays that can ruin performance for real-time applications and generate still more jitter.
- Bandwidth Upgrades – A bandwidth upgrade is a physical layer (layer 1) fix that only works in the short run, because it doesn't address the underlying problem of jitter-induced throughput collapse and the bandwidth it wastes. Traffic grows to fill the added capacity, the incidence of jitter-induced throughput collapse and stalled applications rises in tandem, and yet another round of upgrades follows.
- SD-WAN – There's a widespread assumption that SD-WAN can optimize performance merely by choosing the best available path among broadband, LTE, 5G, MPLS, Wi-Fi or any other available link. The problem is that SD-WAN makes path decisions based on measurements at the edge but has no control beyond it. If every available path is jittery, switching among them accomplishes little.
- QoS Techniques – Often implemented in conjunction with SD-WAN, these include packet prioritization; traffic shaping to smooth out traffic bursts and control the rate of data transmission for selected applications and users; and resource reservation to set aside bandwidth for high-priority applications and users. But these techniques only trade performance between traffic classes, and QoS does nothing to alter TCP's behavior in response to jitter. In some cases, implementing QoS adds jitter, because techniques such as packet prioritization create variable delays for lower-priority traffic.
- Beamforming, MIMO, and Network Slicing – 5G networks can use beamforming and MIMO (Multiple Input, Multiple Output) to improve signal quality and reduce the multipath interference that adds jitter. Additionally, 5G supports network slicing to create multiple virtual networks on top of the physical network infrastructure. Each slice can be configured to meet the needs of a specific application, providing some insulation from the impact of other, more jittery applications. But these technologies only mitigate the impact of jitter; they have no effect on the behavior of TCP's CCAs.
- TCP Optimization – Focuses on the CCAs at layer 4 by increasing the size of the congestion window, using selective ACKs, adjusting timeouts, and the like. Unfortunately, the performance improvements are limited, generally in the range of 10-15%, because these solutions, like all the others, don't address the fundamental problem of how TCP's CCAs consistently respond to jitter.
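As a concrete illustration of the first item above, the sketch below implements a simplified fixed-delay playout (jitter) buffer of the kind applied at layer 7: packets are held, reordered by sequence number, and released on a fixed schedule, which smooths delivery but adds a constant delay and drops anything arriving too late. The packet timings and parameters are illustrative and not drawn from any particular product.

```python
# Minimal sketch (illustrative timings): a fixed-delay playout (jitter) buffer at
# the application layer. Packets are reordered and released on a fixed schedule;
# the smoothing costs a constant added delay plus drops of late arrivals.

def playout(packets, playout_delay_ms=60, interval_ms=20):
    """packets: iterable of (sequence_number, arrival_time_ms).
    Returns (played, dropped), where played is a list of (seq, playout_time_ms)."""
    played, dropped = [], []
    for seq, arrival in packets:
        deadline = seq * interval_ms + playout_delay_ms   # when this packet must be played
        if arrival > deadline:
            dropped.append(seq)                           # arrived too late to be useful
        else:
            played.append((seq, deadline))                # buffered until its slot, then released
    played.sort(key=lambda p: p[1])                       # release strictly in playout order
    return played, dropped

if __name__ == "__main__":
    # Hypothetical 20 ms media stream with jittery, partly out-of-order arrivals.
    arrivals = [(0, 5), (1, 48), (3, 66), (2, 95), (4, 90), (5, 170)]
    played, dropped = playout(arrivals)
    print("played :", played)    # every packet now carries up to 60 ms of added buffer delay
    print("dropped:", dropped)   # packet 5 missed its playout deadline
```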
Clearly, jitter-induced throughput collapse is not an easy problem to overcome. MIT researchers recently cited TCP's CCAs as having a significant and growing impact on network performance because of their response to jitter, but offered no practical solution.2
Jitter-induced throughput collapse can only be resolved by modifying or replacing TCP's congestion control algorithms to remove the bottleneck they create. However, to be acceptable and to scale in a production environment, a viable solution can't require changes to the TCP stack itself or to any client or server applications. It must also co-exist with ADCs, SD-WANs, VPNs, and other network infrastructure already in place.
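Any transport-layer fix therefore comes down to distinguishing transient delay variation from genuine queue buildup before the congestion window is cut. The sketch below is a generic, deliberately simplified illustration of that distinction only; it is not a description of WarpEngine or any other vendor's algorithm, and the thresholds and RTT samples are invented for readability. It backs off only when elevated delay is sustained across a measurement window or confirmed by loss, rather than on every spike.

```python
# Generic, simplified illustration only (not any vendor's algorithm): deciding whether
# delay growth looks like sustained queue buildup (congestion) or transient jitter.
# Thresholds and RTT samples are invented for readability.

def should_back_off(rtt_window_ms, base_rtt_ms, loss_seen,
                    growth_factor=1.5, sustained_fraction=0.7):
    """Treat delay as congestion only when it stays elevated across most of the
    window or is confirmed by packet loss; isolated spikes (jitter) are ignored."""
    elevated = sum(1 for r in rtt_window_ms if r > growth_factor * base_rtt_ms)
    sustained = elevated / len(rtt_window_ms) >= sustained_fraction
    return sustained or loss_seen

if __name__ == "__main__":
    base_rtt = 40                                          # ms, uncongested path RTT
    jittery   = [42, 120, 45, 41, 130, 44, 43, 48]         # spikes, not sustained
    congested = [85, 92, 110, 105, 98, 120, 115, 101]      # queue steadily building
    print("jittery path   -> back off?", should_back_off(jittery,   base_rtt, loss_seen=False))
    print("congested path -> back off?", should_back_off(congested, base_rtt, loss_seen=False))
```

A loss-based CCA such as Reno or CUBIC, by contrast, reacts to the symptoms of jitter (timeouts and duplicate ACKs) without making this distinction, which is exactly the bottleneck described above.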
There’s Only One Proven and Cost-Effective Solution
Only Badu Networks' patented WarpEngine™ carrier-grade optimization technology meets the key requirements outlined above for eliminating jitter-induced throughput collapse. WarpEngine's single-ended transparent proxy architecture means no modifications to client or server applications or network stacks are required. It works with existing network infrastructure, so there's no rip-and-replace. WarpEngine determines in real-time whether jitter is due to congestion, and prevents throughput from collapsing and applications from stalling when it's not. As a result, bandwidth that would otherwise be wasted is recaptured, eliminating unnecessary and expensive upgrades. WarpEngine builds on this with other performance and security enhancing features that benefit not only TCP, but also GTP, UDP and other traffic. These capabilities enable WarpEngine to deliver massive network throughput improvements ranging from 2-10x or more for some of the world's largest mobile network operators, cloud service providers, government agencies and businesses of all sizes.3 It achieves these results with existing network infrastructure at a fraction of the cost of upgrades.
WarpEngine can be deployed at core locations as well as at the network edge, either as a hardware appliance or as software installed on a customer's or partner's server. It can be implemented in a carrier's core network, or in front of hundreds or thousands of servers in a corporate or cloud data center. WarpEngine can be deployed at cell tower base stations, or with access points supporting public or private Wi-Fi networks of any scale. AI networking vendors can integrate it into their solutions without any engineering effort. They can offer WarpEngine to their enterprise customers to deploy on-prem with their Wi-Fi access points, or at the edge of their networks between the router and firewall, for dramatic WAN, broadband or FWA throughput improvements.
WarpVM™, the VM form factor of WarpEngine, is designed specifically for cloud and virtualized edge environments where core network functionality, as well as AI and other applications, is deployed. WarpVM installs in minutes in AWS, Azure, VMware, or KVM environments, and works equally well for VM- or container-based applications such as CNFs. WarpVM has also been certified by Nutanix™ for use with its multicloud platform, achieving performance results similar to those cited above.4
Telcos can install WarpVM in the cloud and edge environments hosting the VNFs and CNFs that comprise their 5G networks to boost performance for a competitive edge, and to avoid the cost of unnecessary network and server upgrades as their subscriber bases grow. For example, a telco like O2 Telefónica Germany running on AWS could use WarpVM to improve cloud network throughput by 3X for at least 80% less than the cost of an AWS Direct Connect upgrade delivering the same result. The savings usually end up being much greater, because WarpVM doesn't require the additional Direct Connect ports, and the new servers for each port, that a standard AWS Direct Connect upgrade entails. Telco customers can also deploy WarpVM in the cloud environments they use for other applications to achieve many of the same benefits.
Conclusion
As telcos move their core 5G networks to the cloud, and AI, IoT, AR, VR and similar applications rely on those networks to drive innovation, jitter's massive impact on network and application performance will only grow. WarpEngine, implemented in the form of WarpVM for cloud-native networks, offers the only optimization solution that tackles TCP's reaction to jitter head-on at the transport layer, while incorporating other performance-enhancing features that benefit not only TCP, but also GTP, UDP and other traffic. By deploying WarpVM, telcos can ensure their cloud-native networks always operate at full potential.
If your organization is considering deploying a 5G or other cloud-native network, it would be a major miss to overlook WarpVM. To learn more and request a free trial, click the button below.
Notes
1. O2 Telefonica Press Release 05/08/2024: https://www.telefonica.de/news/press-releases-telefonica-germany/2024/05/first-5g-core-network-in-the-cloud-for-an-existing-operator-o2-telefonica-sets-new-impulses-in-the-core-network-together-with-nokia-and-aws.html
2. Starvation in End-to-End Congestion Control, August 2022: https://people.csail.mit.edu/venkatar/cc-starvation.pdf
3. Badu Networks Performance Case Studies: https://www.badunetworks.com/wp-content/uploads/2022/11/Performance-Case-Studies.pdf
4. Nutanix Technology Partners: https://www.nutanix.com/partners/technology-alliances/badu-networks