Debugging Azure Networking for Elastic Cloud Serverless

Learn how Elastic SREs uncovered and resolved unexpected packet loss in Azure Kubernetes Service (AKS), impacting Elastic Cloud Serverless performance.


Summary of Findings

Elastic's Site Reliability Engineering team (SRE) observed unstable throughput and packet loss in Elastic Cloud Serverless running on Azure Kubernetes Service (AKS). After investigation, we identified the primary contributing factors to be RX ring buffer overflows and kernel input queue saturation on SR-IOV interfaces. To address this, we increased RX buffer sizes and adjusted the netdev backlog, which significantly improved network stability.

Setting the Scene

Elastic Cloud Serverless is a fully managed solution that allows you to deploy and use Elastic for your use cases without managing the underlying infrastructure. Built on Kubernetes, it represents a shift in how you interact with Elasticsearch. Instead of managing clusters, nodes, data tiers, and scaling, you create serverless projects that are fully managed and automatically scaled by Elastic. This abstraction of infrastructure decisions allows you to focus solely on gaining value and insight from your data.

Elastic Cloud Serverless is generally available (GA) on AWS and GCP, and is currently in Technical Preview on Azure. As part of preparing Elastic Cloud Serverless for GA on Azure, we have been conducting extensive performance and scalability tests to ensure that our users get a consistent and reliable experience.

In this post, we’ll take you behind the scenes of a deep technical investigation into a surprising performance issue that affected Serverless Elasticsearch in our Azure Kubernetes clusters. At first, the network seemed like the least likely place to look, especially with a high-speed 100 Gb/s interface on the host backing it. But as we dug deeper, with help from the Microsoft Azure team, that’s exactly where the problem led us.

Unexpected Results!

While the high-level architectures and system design patterns of the major cloud providers are often similar, the implementations differ, and these differences can have dramatic impacts on a system’s performance characteristics.

One of the most significant differences between the different cloud providers is that the underlying hypervisor software and server hardware of the Virtual Machines can vary significantly, even between instance families of the same provider.

There is no way to fully abstract the hardware away from an application like Elasticsearch. Fundamentally, its performance is dictated by the CPU, memory, disks, and network interfaces on the physical server. In preparation for the Elastic Cloud Serverless GA on Azure, our Elasticsearch Performance team kicked off large-scale load testing against Serverless Elasticsearch projects running on Azure Kubernetes Service (AKS), using ARM-based VMs (we’re big fans!). Throughout this process, we relied heavily on Elastic tools to analyse system behaviour, identify bottlenecks, and validate performance under load.

To perform these scale and load tests, the Elasticsearch Performance team use Rally, an open-source benchmarking tool designed to measure the performance of Elasticsearch clusters. The workload (or in Rally nomenclature, ‘Track’) used for these tests was the GitHub Archive Track. Rally collects and sends test telemetry using the official Python client to a separate Elasticsearch cluster running Elastic Observability, which allows for monitoring and analysis during these scale and load tests in real time via Kibana.

When we looked at the results, we observed that the indexing rate (the number of docs/s) for the Serverless projects was not only much lower than we had expected for the given hardware, but the throughput was also quite unstable. There were peaks and valleys, interspersed with frequent errors, whereas we were instead expecting a stable indexing rate for the duration of the test.

These tests are designed to push the system to its limits, and in doing so, they surfaced unexpected behavior in the form of unstable indexing throughput and intermittent errors. This was precisely the kind of problem we'd hoped to uncover prior to going GA — giving us the opportunity to work closely with Azure.

A Kibana visualisation of Rally telemetry, showing fluctuating Elasticsearch indexing rates alongside spikes in 5xx and 4xx HTTP error responses.

Debugging!

Debugging performance issues can feel a little bit like trying to find a ‘Butterfly in a Hurricane’, so it’s crucial to take a methodical approach to analysing application and system performance.

Using methodologies helps you to be more consistent and thorough in your debugging, and avoids missing things. We started with the Utilisation Saturation and Errors (USE) Method, looking at both the client and server side to identify any obvious bottlenecks in the system.

Elastic's Site Reliability Engineers (SREs) maintain a suite of custom Elastic Observability dashboards designed to visualise data collected from various Elastic Integrations. These dashboards provide deep visibility into the health and performance of Elastic Cloud infrastructure and systems.

For this investigation, we leveraged a custom dashboard built using metrics and log data from the System and Linux Integrations:

One of many Elastic Observability dashboards built and maintained by the SRE team.

Following the USE Method, these dashboards highlight resource utilisation, saturation, and errors across our systems. With their help, we quickly identified that the AKS nodes hosting the Elasticsearch pods under test were dropping thousands of packets per second.

A Kibana visualisation of Elastic Agent's System Integration, showing the rate of packet drops per second for AKS nodes.

Dropping packets forces reliable protocols, such as TCP, to retransmit the missing packets. These retransmissions can introduce significant delays, which kills the throughput of any system where a client request is only issued once the previous request completes (known as a Closed System).
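If you want to confirm that retransmissions are actually happening, the kernel’s TCP counters are a quick place to look. A minimal sketch, assuming iproute2 is available on the node (counter availability varies by kernel):

# Cumulative TCP retransmission counters since boot
nstat -az TcpRetransSegs TcpExtTCPLostRetransmit

# Per-connection retransmit counts for established sockets
ss -ti state established | grep -o 'retrans:[0-9/]*'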

To investigate further, we jumped onto one of the AKS nodes exhibiting the packet loss to check the basics. First off, we wanted to identify what type of packet drops or errors we were seeing: were they limited to specific pods, or affecting the host as a whole?

root@aks-k8s-node-1:~# ip -s link show
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    373507935420 134292481      0       0       0      15
    TX:    bytes   packets errors dropped carrier collsns
    644247778936 303191014      0       0       0       0
3: enP42266s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    386782548951 307000571      0       0 5321081       0
    TX:    bytes   packets errors dropped carrier collsns
    655758630548 477594747      0       0       0       0
    altname enP42266p0s2
15: lxc0ca0ec41ecd2@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0

In this output you can see that the enP42266s1 interface is showing a significant number of packets in the missed column. That’s interesting, sure, but what does missed actually represent? And what is enP42266s1?

To understand, let’s look at roughly what happens when a packet arrives at the NIC:

  1. A packet arrives at the NIC from the network.
  2. The NIC uses DMA (Direct Memory Access) to place the packet into a receive ring buffer allocated in memory by the kernel and mapped for use by the NIC. Since our NICs support multiple hardware queues, each queue has its own dedicated ring buffer, IRQ, and NAPI context.
  3. The NIC raises a hardware interrupt (IRQ) to notify the CPU that a packet is ready.
  4. The CPU runs the NIC driver’s IRQ handler. The driver schedules a NAPI (New API) poll to defer packet processing to a softirq context, a mechanism in the Linux kernel that defers work outside of the hard IRQ context for better batching, CPU efficiency, and scalability.
  5. The NAPI poll function is executed in a softirq context (NET_RX_SOFTIRQ) and retrieves packets from the ring buffer. This polling continues until either the driver’s packet budget is exhausted (net.core.netdev_budget) or the time limit is hit (net.core.netdev_budget_usecs).
  6. Each packet is wrapped in an sk_buff (socket buffer) structure, which includes metadata such as protocol headers, timestamps, and interface identifiers.
  7. If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via enqueue_to_backlog). The maximum size of this backlog is controlled by the net.core.netdev_max_backlog sysctl.
  8. Packets are then handed off to the kernel’s networking stack for routing, filtering, and protocol-specific processing (e.g. TCP, UDP).
  9. Finally, packets reach the appropriate socket receive buffer, where they are available for consumption by the user-space application.

Visualised, it looks something like this:

Image © 2018 Leandro Moreira. Used under the BSD 3-Clause License. Source: GitHub repository.
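Several of the tunables referenced in the steps above can be inspected directly on a node. A quick sketch (defaults vary by distribution and kernel version):

# NAPI packet budget per softirq cycle and its time limit (step 5)
sysctl net.core.netdev_budget net.core.netdev_budget_usecs

# Maximum length of the per-CPU backlog queue (step 7)
sysctl net.core.netdev_max_backlog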

The missed counter is incremented whenever the NIC tries to DMA a packet into a fully occupied ring buffer. The NIC essentially "misses" the chance to deliver the packet to the VM’s memory. However, what’s most interesting is that this counter seldom increments for VMs. This is because Virtual NICs are usually implemented as software via the hypervisor, which typically has much more flexible memory management compared to physical NICs and can reduce the chance of ring buffer overflow.
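Driver-level statistics can help corroborate the interface counters shown by ip -s link. Counter names are driver-specific, so treat the pattern below as an illustration; on mlx5 devices, for example, rx_out_of_buffer typically tracks packets the NIC could not place into a full RX ring:

# Dump NIC/driver statistics for the SR-IOV VF and filter for likely culprits
ethtool -S enP42266s1 | grep -Ei 'miss|out_of_buffer|drop'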

We mentioned earlier that we’re building Azure Elasticsearch Serverless on top of Azure’s AKS service, which is important to note because all of our AKS nodes use an Azure feature called Accelerated Networking. In this setup, network traffic is delivered directly to the VM’s network interface, bypassing the hypervisor. This is enabled by single root I/O virtualization (SR-IOV), which offers much lower latency and higher throughput than traditional VM networking. Each node is physically connected to a 100 Gb/s network interface, although the SR-IOV Virtual Function (VF) exposed to the VM typically provides only a fraction of that total bandwidth.

Despite the VM only having a fraction of the 100 Gb/s bandwidth, microbursts are still very possible. These physical interfaces are so fast that they can transmit and receive multiple packets in just nanoseconds, far faster than most buffers or processing queues can absorb. At these timescales, even a short-lived burst of traffic can overwhelm the receiver, leading to dropped packets and unpredictable latency.

Direct access to the SR-IOV interface means that our VMs are responsible for handling the hardware interrupts triggered by the NIC in a timely manner; if there's any delay in handling the hardware interrupt (e.g. waiting to be scheduled onto a CPU by the hypervisor), then network packets can be missed!
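A quick way to sanity-check how well the guest is keeping up with the VF's interrupts is to look at their distribution across CPUs, along with NET_RX softirq activity. A rough sketch (interrupt names depend on the driver and queue count):

# Per-CPU interrupt counts for the Mellanox VF queues
grep -i mlx /proc/interrupts

# Per-CPU NET_RX softirq counts
grep NET_RX /proc/softirqs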

Firstly - NIC-level Tuning

Since we'd confirmed that our VMs were using SR-IOV, we established that the enP42266s1 and eth0 interfaces were a bonded pair acting as a single interface. Knowing this, we reasoned that we should be able to adjust the ring buffer values directly using ethtool.

root@aks-k8s-node-1:~# ethtool -g enP42266s1
Ring parameters for enP42266s1:
Pre-set maximums:
RX:		8192
RX Mini:	n/a
RX Jumbo:	n/a
TX:		8192
Current hardware settings:
RX:		1024
RX Mini:	n/a
RX Jumbo:	n/a
TX:		1024

In the output above, we were using only 1/8th of the available ring buffer descriptors. These values were set by the OS defaults, which generally aim to balance performance and resource usage. Set too low, they risk packet drops under load; set too high, they can lead to unnecessary memory consumption. We knew that the VMs were backed by a virtual function carved out of the directly attached 100 Gb/s network interface, which is fast enough to deliver microbursts that could easily overwhelm small buffers. To better absorb those short, high-intensity bursts of traffic, we increased the NIC’s RX ring buffer size from 1024 to 8192. Using a privileged DaemonSet, we rolled out the change across all of our AKS nodes by installing a udev rule to automatically increase the buffer size:

# Match Mellanox ConnectX network cards and run ethtool to update the ring buffer settings
ENV{INTERFACE}=="en*", ENV{ID_NET_DRIVER}=="mlx5_core", RUN+="/sbin/ethtool -G %k rx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE} tx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE}"
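For a one-off test on a single node, the same change can be applied and verified by hand before rolling it out via the udev rule (interface name taken from the node above):

# Increase the RX/TX ring buffer descriptors on the SR-IOV VF
ethtool -G enP42266s1 rx 8192 tx 8192

# Confirm the new settings took effect
ethtool -g enP42266s1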

A Kibana visualisation of Elastic Agent's System Integration, showing packet loss reduced by ~99% after increasing the NIC's RX ring buffer values.

As soon as the change had been applied to all AKS nodes, we stopped ‘missing’ RX packets! Fantastic! As a result of this simple change, we observed a significant improvement in our indexing throughput and stability.

A Kibana visualisation of Rally telemetry, showing stable and improved Elasticsearch indexing rates after increasing the RX ring buffer size.

Job done, right? Not quite...

Further improvements - Kernel-level Tuning

Eagle-eyed readers may have noticed two things:

  1. In the previous screenshot, despite adjusting the physical RX ring buffer values, we still observed a small number of dropped packets on the TX side.
  2. In the original ip -s link show output, one of the ‘logical’ interfaces used by the Elasticsearch pod was showing dropped packets on both the TX and RX sides.
15: lxc0ca0ec41ecd2@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0

So, we continued to dig. We’d eliminated ~99% of the packet loss, and the remaining loss rate wasn’t as significant as what we’d started with, but we still wanted to understand why it was occurring even after adjusting the RX ring buffer size of the NIC.

So what does dropped represent, and what is this lxc0ca0ec41ecd2 interface? dropped is similar to missed, but only occurs when packets are deliberately dropped by the kernel or network interface. Crucially though, it doesn’t tell you why a packet was dropped. As for the lxc0ca0ec41ecd2 interface, we use the Azure CNI Powered by Cilium to provide the network functionality for our AKS clusters. Any pod spun up on an AKS node gets a ‘logical’ interface, which is a virtual ethernet (veth) pair that connects the pod’s network namespace with the host’s network namespace. It was here that we were dropping packets.
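Since every pod gets its own veth pair, a quick way to see whether other pods on the node were affected is to scope the interface statistics to veth devices only; a minimal check:

# Show packet statistics for veth interfaces only (the lxc* pod interfaces are veth pairs)
ip -s link show type veth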

In our experience, packet drops at this layer are unusual, so we started digging deeper into the cause. There are numerous ways you can debug why a packet is being dropped, but one of the easiest is to use perf to attach to the skb:kfree_skb tracepoint. The "socket buffer" (skb) is the primary data structure used to represent network packets in the Linux kernel. When a packet is dropped, its corresponding socket buffer is usually freed, triggering the kfree_skb tracepoint. Using perf to attach to this event allowed us to capture stack traces and analyze the cause of the drops.

# perf record -g -a -e skb:kfree_skb

We left this to run for ~10 minutes or so to capture as many drops as possible, and then, ‘heavily inspired’ by this GitHub Gist by Ivan Babrou, we converted the stack traces into easier-to-read flamegraphs:

# perf script | sed -e 's/skb:kfree_skb:.*reason:\(.*\)/\n\tfffff \1 (unknown)/' -e 's/^\(\w\+\)\s\+/kernel /' > stacks.txt
cat stacks.txt | stackcollapse-perf.pl --all | perl -pe 's/.*?;//' | sed -e 's/.*irq_exit_rcu_\[k\];/irq_exit_rcu_[k];/' | flamegraph.pl --colors=java --hash --title=aks-k8s-node-1 --width=1440 --minwidth=0.005 > aks-k8s-node-1.svg
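The stackcollapse-perf.pl and flamegraph.pl scripts used in that pipeline come from Brendan Gregg's FlameGraph repository; a minimal setup sketch, assuming the scripts are run from a local checkout:

# Fetch the FlameGraph scripts and put them on the PATH
git clone https://github.com/brendangregg/FlameGraph.git
export PATH="$PWD/FlameGraph:$PATH"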

A Flamegraph showing the stack trace ancestry of the packet drops.

The flamegraph here shows how often different functions appeared in the stack traces for packet drops. Each box represents a function call, and wider boxes mean the function appears more frequently in the traces. The stack’s ancestry builds upward, from earlier calls at the bottom to later calls at the top.

Firstly, we quickly discovered that, unfortunately, the skb_drop_reason enum was only added in kernel 5.17 (Azure’s node image at the time was using 5.15). This meant that there was no single human-readable message telling us why the packets were being dropped; instead, all we got was NOT_SPECIFIED. To work out why packets were being dropped, we needed to do a little sleuthing through the stack traces to determine which code paths were being taken when a packet was dropped.
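As an aside, on kernels 5.17 and newer the tracepoint exposes the drop reason directly, so much of this sleuthing can be skipped. A sketch using bpftrace, assuming it is installed on the node (reasons are reported as numeric values of the skb_drop_reason enum):

# Count packet drops by reason (requires kernel >= 5.17 for the reason field)
bpftrace -e 'tracepoint:skb:kfree_skb { @drops[args->reason] = count(); }'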

In the flamegraph above you can see that many of the stack traces include veth driver function calls (e.g. veth_xmit), and many end abruptly with a call to the enqueue_to_backlog function. When many stacks end at the same function (like enqueue_to_backlog), it suggests that function is a common point where packets are being dropped. If you go back to the earlier explanation of what happens when a packet arrives at the NIC, you’ll notice that in step 7 we explained:

7. If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via enqueue_to_backlog). The maximum size of this backlog is controlled by the net.core.netdev_max_backlog sysctl.

Using the same privileged DaemonSet method as for the RX ring buffer adjustment, we set the net.core.netdev_max_backlog kernel parameter from 1000 to 32768:

/usr/sbin/sysctl -w net.core.netdev_max_backlog=32768

This value was based on the fact that we knew the hosts were using a 100 Gb/s SR-IOV NIC, even if the VM was allowed only a fraction of the total bandwidth. We acknowledge that it’s worth revisiting this value in the future to see if it can be better optimised to avoid wasting memory, but at the time “perfect was the enemy of good”.
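To confirm that the backlog queue was actually overflowing (and, after the change, that it had stopped), the per-CPU drop counter in /proc/net/softnet_stat can be compared before and after. A quick check; the second column counts packets dropped because the backlog was full, in hexadecimal, one row per CPU:

# Print the per-CPU backlog drop counter (hexadecimal)
awk '{ print "cpu" NR-1, "dropped(hex)=" $2 }' /proc/net/softnet_stat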

We re-ran the load tests and compared the three sets of results we’d collected thus far.

A Kibana visualisation of Rally results, comparing impact to median throughput after each configuration change.

Tuning Step              Packet Loss       Median indexing throughput
Baseline                 High              ~18,000 docs/s
+RX Buffer               ~99% reduction    ~26,000 docs/s (+ ~40% from baseline)
+Backlog & +RX Buffer    Near zero         ~29,000 docs/s (+ ~60% from baseline)

Here you can see the P50 of throughput in docs/s over the course of the hours-long load tests. Compared to the baseline, we saw a roughly 40% increase in throughput by adjusting only the RX ring buffer values, and a ~50-60% increase with both the RX ring buffer and backlog changes. Hooray!

A great result and one more step on our journey towards better Serverless Elasticsearch performance.

Working with Azure

It’s great that we were able to quickly identify and mitigate the majority of our packet loss issues, but since we were using AKS with AKS node images, it made sense to engage with Azure to understand why the defaults weren’t working for our workload.

We walked Azure through our investigation, mitigations, and results, and asked for additional validation of our mitigations. Azure Engineering confirmed that the host NICs were not discarding packets, meaning everything arriving at the host level was passed through to the hypervisor. Further investigation confirmed that no loss or discards were occurring in the Azure network fabric or internal to the hypervisor, which shifted the focus to the guest OS and why its kernel was slow to read packets off the enP* SR-IOV interfaces.

Given the complexity of our load testing scenario, which involved configuring multiple systems and tools, including Elastic Observability, we also developed a simplified reproduction of the packet loss issue. This simplified test was created specifically to share with Azure for targeted analysis, and complemented the broader monitoring and analysis enabled by Elastic Observability and Rally.

With this reproduction, Azure was able to confirm the increasing missed and dropped packet counters we had observed, and confirmed the larger RX ring buffer and netdev_max_backlog values as the recommended mitigations.

Conclusion

While cloud providers offer various abstractions to manage your resources, the underlying hardware ultimately determines your application's performance and stability. High-performance hardware often requires tuning at the operating system level, well beyond the default settings most environments ship with. In managed platforms like AKS, where Azure controls both the node images and the infrastructure, it is easy to overlook the impact of low-level configurations such as network device ring buffer sizes or sysctls like net.core.netdev_max_backlog.

Our experience shows that even with the convenience of a managed Kubernetes service, performance issues can still emerge if these hardware parameters are not tuned appropriately. It was tempting to assume that high-speed 100 Gb/s network interfaces, directly attached to the VM using SR-IOV, would eliminate any chance of network-related bottlenecks. In reality, that assumption didn’t hold up.

Engaging early with Azure was essential, as they provided deeper visibility into the underlying infrastructure and worked with us to tune low-level, performance-critical settings. Combined with thorough load and scale testing and robust observability using tools like Elastic Observability, this collaboration helped us detect and rectify the issue early in order to deliver a consistent, reliable, and high-performing experience for our users.
