Netflix operates a highly efficient cloud computing infrastructure that supports a wide array of applications essential for our SVOD (Subscription Video on Demand), live streaming, and gaming services. Hosted on Amazon AWS, our infrastructure spans multiple geographic regions worldwide. This global distribution allows our applications to deliver content more effectively by serving traffic closer to our customers. Like any distributed system, our applications occasionally require data synchronization between regions to maintain seamless service delivery.
The following diagram shows a simplified cloud network topology for cross-region traffic.
Our Cloud Network Engineering on-call team received a request to address a network issue affecting an application with cross-region traffic. Initially, it appeared that the application was experiencing timeouts, likely due to suboptimal network performance. As we all know, the longer the network path, the more devices the packets traverse, and the greater the likelihood of problems. For this incident, the client application was located in an internal subnet in the US region while the server application was located in an external subnet in a European region. It was therefore natural to blame the network, since packets have to travel long distances through the internet.
As network engineers, our initial reaction when the network is blamed is usually, "No, it can't be the network," and our task is to prove it. Given that there were no recent changes to the network infrastructure and no reported AWS issues impacting other applications, the on-call engineer suspected a noisy neighbor issue and sought assistance from the Host Network Engineering team.
In this context, a noisy neighbor issue occurs when a container shares a host with other network-intensive containers. These noisy neighbors consume excessive network resources, causing other containers on the same host to suffer from degraded network performance. Even though each container has bandwidth limits, oversubscription can still lead to such issues.
Upon investigating other containers on the same host — most of which were part of the same application — we quickly eliminated the possibility of noisy neighbors. The network throughput for both the problematic container and all the others was significantly below the configured bandwidth limits. We attempted to resolve the issue by removing these bandwidth limits, allowing the application to use as much bandwidth as necessary. However, the problem persisted.
We observed some TCP packets in the network marked with the RST flag, a flag indicating that a connection should be immediately terminated. Although the frequency of these packets was not alarmingly high, the presence of any RST packets still raised suspicion about the network. To determine whether this was indeed a network-induced issue, we performed a tcpdump on the client. In the packet capture file, we spotted one TCP stream that was closed after exactly 30 seconds.
SYN at 18:47:06
After the 3-way handshake (SYN, SYN-ACK, ACK), the traffic started flowing normally. Nothing strange until FIN at 18:47:36 (30 seconds later).
The packet capture results clearly indicated that it was the client application that initiated the connection termination by sending a FIN packet. Following this, the server continued to send data; however, since the client had already decided to close the connection, it responded with RST packets to all subsequent data from the server.
To make sure the client wasn't closing the connection due to packet loss, we also performed a packet capture on the server side to verify that all packets sent by the server were received. This task was complicated by the fact that the packets passed through a NAT gateway (NGW), which meant that on the server side, the client's IP and port appeared as those of the NGW, differing from those seen on the client side. Consequently, to accurately match TCP streams, we needed to identify the TCP stream on the client side, locate the raw TCP sequence number, and then use this number as a filter on the server side to find the corresponding TCP stream.
With packet capture results from both the client and server sides, we confirmed that all packets sent by the server were correctly received before the client sent a FIN.
Now, from the network point of view, the story is clear. The client initiated the connection requesting data from the server. The server kept sending data to the client with no problem. However, at a certain point, despite the server still having data to send, the client chose to stop receiving it. This led us to suspect that the issue might be related to the client application itself.
In order to fully understand the problem, we now need to understand how the application works. As shown in the diagram below, the application runs in the us-east-1 region. It reads data from cross-region servers and writes the data to consumers within the same region. The client runs as containers, whereas the servers are EC2 instances.
Notably, the cross-region read was problematic while the write path was smooth. Most importantly, there is a 30-second application-level timeout for reading the data. The application (client) errors out if it fails to read an initial batch of data from the servers within 30 seconds. When we increased this timeout to 60 seconds, everything worked as expected. This explains why the client initiated a FIN — it lost patience waiting for the server to transfer the data.
Could it be that the server was updated to send data more slowly? Could it be that the client application was updated to receive data more slowly? Could it be that the data volume became too large to be sent out completely within 30 seconds? Unfortunately, we received negative answers to all three questions from the application owner. The server had been running without changes for over a year, there were no significant updates in the latest rollout of the client, and the data volume had remained consistent.
If neither the network nor the application had changed recently, then what did? In fact, we discovered that the issue coincided with a recent Linux kernel upgrade from version 6.5.13 to 6.6.10. To test this hypothesis, we rolled back the kernel upgrade, and it did restore normal operation to the application.
Honestly speaking, at that time I didn't believe it was a kernel bug, because I assumed the TCP implementation in the kernel should be solid and stable (spoiler alert: how wrong was I!). But we were also out of ideas from every other angle.
There were about 14k commits between the good and bad kernel versions. Engineers on the team methodically and diligently bisected between the two versions. When the bisecting narrowed down to a few commits, a change with "tcp" in its commit message caught our attention. The final bisecting confirmed that this commit was our culprit.
Interestingly, while reviewing the email history related to this commit, we found that another user had reported a Python test failure following the same kernel upgrade. Although their fix was not directly applicable to our situation, it suggested that a simpler test might also reproduce our problem. Using strace, we saw that the application configured the following socket options when communicating with the server:
[pid 1699] setsockopt(917, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
[pid 1699] setsockopt(917, SOL_TCP, TCP_NODELAY, [1], 4) = 0
We then developed a minimal client-server C application that transfers a file from the server to the client, with the client configuring the same set of socket options. During testing, we used a 10M file, which represents the volume of data typically transferred within 30 seconds before the client issues a FIN. On the old kernel, this cross-region transfer completed in 22 seconds, whereas on the new kernel it took 39 seconds to finish.
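The sketch below shows roughly what the client side of that reproduction looks like. It only mirrors the socket options captured by strace; the server address and port are placeholders, error handling is trimmed, and the timing is a simple wall-clock measurement, so treat it as a minimal sketch rather than the exact internal tool.

/* Minimal sketch of the reproduction client. Placeholder address/port; error handling trimmed. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int main(void)
{
    int zero = 0, one = 1;
    int sndbuf = 131072, rcvbuf = 65536;   /* same values seen in the strace output */

    int fd = socket(AF_INET6, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Mirror the options captured by strace */
    setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &zero, sizeof(zero));
    setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one));
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    struct sockaddr_in6 addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin6_family = AF_INET6;
    addr.sin6_port = htons(9000);                 /* placeholder port */
    inet_pton(AF_INET6, "::1", &addr.sin6_addr);  /* placeholder server address */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Read the file payload from the server and time the transfer */
    char buf[16384];
    ssize_t n, total = 0;
    time_t start = time(NULL);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;
    printf("received %zd bytes in %ld seconds\n", total, (long)(time(NULL) - start));

    close(fd);
    return 0;
}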
With the help of this minimal reproduction setup, we were finally able to pinpoint the root cause of the problem. To understand the root cause, it helps to have a grasp of the TCP receive window.
TCP Receive Window
Simply put, the TCP receive window is how the receiver tells the sender "This is how many bytes you can send me without me ACKing any of them". Assuming the sender is the server and the receiver is the client, then we have:
The Window Size
Now that we know the TCP receive window size can affect throughput, the question is: how is the window size calculated? As an application writer, you can't set the window size directly; however, you can decide how much memory you want to use for buffering received data. This is configured using the SO_RCVBUF socket option we saw in the strace output above. Note, though, that the value of this option describes how much application data can be queued in the receive buffer. In man 7 socket, there is:
SO_RCVBUF
Sets or gets the maximum socket receive buffer in bytes.
The kernel doubles this value (to allow space for
bookkeeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2). The
default value is set by the
/proc/sys/net/core/rmem_default file, and the maximum
allowed value is set by the /proc/sys/net/core/rmem_max
file. The minimum (doubled) value for this option is 256.
This means that when the user supplies a value X, the kernel stores 2X in the variable sk->sk_rcvbuf. In other words, the kernel assumes that the bookkeeping overhead is as large as the actual data (i.e. 50% of sk_rcvbuf).
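As a quick illustration of that doubling (a minimal sketch, assuming a Linux host), setting SO_RCVBUF to 65536 and reading it back reports 131072:

/* Illustrative sketch: Linux doubles the SO_RCVBUF value supplied by the user. */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int requested = 65536;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));

    int actual = 0;
    socklen_t len = sizeof(actual);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);

    /* On Linux this prints 131072, i.e. 2 * 65536 (what ends up in sk->sk_rcvbuf) */
    printf("requested %d, kernel reports %d\n", requested, actual);
    return 0;
}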
sysctl_tcp_adv_win_scale
However, the assumption above may not hold, because the actual overhead really depends on a number of factors such as the Maximum Transmission Unit (MTU). Therefore, the kernel provides sysctl_tcp_adv_win_scale, which you can use to tell the kernel what the actual overhead is. (I believe 99% of people also don't know how to set this parameter correctly, and I'm definitely one of them. You're the kernel; if you don't know the overhead, how can you expect me to know?)
According to the sysctl doc,
tcp_adv_win_scale - INTEGER
Obsolete since linux-6.6. Count buffering overhead as bytes/2^tcp_adv_win_scale (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), if it is <= 0.
Possible values are [-31, 31], inclusive.
Default: 1
For 99% of people, the default value of 1 is in effect, which in turn means the overhead is calculated as rcvbuf/2^tcp_adv_win_scale = 1/2 * rcvbuf. This matches the assumption made when setting the SO_RCVBUF value.
Let's recap. Assume you set SO_RCVBUF to 65536, which is the value set by the application as shown in the setsockopt syscall. Then we have:
- SO_RCVBUF = 65536
- rcvbuf = 2 * 65536 = 131072
- overhead = rcvbuf / 2 = 131072 / 2 = 65536
- receive window size = rcvbuf - overhead = 131072 - 65536 = 65536
(Note: this calculation is simplified. The real calculation is more complex.)
In short, the receive window size before the kernel upgrade was 65536. With this window size, the application was able to transfer 10M of data within 30 seconds.
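The same arithmetic, written as a small C sketch of the pre-6.6 default behavior (illustrative only; this is not the kernel's actual code):

/* Illustrative arithmetic for the pre-6.6 window calculation; not actual kernel code. */
#include <stdio.h>

int main(void)
{
    int so_rcvbuf = 65536;                        /* value passed to setsockopt */
    int rcvbuf = 2 * so_rcvbuf;                   /* kernel doubles it: 131072 */
    int tcp_adv_win_scale = 1;                    /* the default */
    int overhead = rcvbuf >> tcp_adv_win_scale;   /* 65536 */
    int window = rcvbuf - overhead;               /* 65536 */
    printf("old-kernel receive window ~= %d bytes\n", window);
    return 0;
}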
The Change
This commit obsoleted sysctl_tcp_adv_win_scale and introduced a scaling_ratio that can calculate the overhead, and hence the window size, more accurately, which is the right thing to do. With the change, the window size is now rcvbuf * scaling_ratio.
So how is scaling_ratio calculated? It is calculated using skb->len/skb->truesize, where skb->len is the length of the TCP data in an skb and truesize is the total size of the skb. This is certainly a more accurate ratio based on real data rather than a hardcoded 50%. Now, here is the next question: during the TCP handshake, before any data is transferred, how do we decide the initial scaling_ratio? The answer is that a magic and conservative ratio was chosen, with a value of roughly 0.25.
Now we have:
- SO_RCVBUF = 65536
- rcvbuf = 2 * 65536 = 131072
- receive window size = rcvbuf * 0.25 = 131072 * 0.25 = 32768
In short, the receive window size was halved after the kernel upgrade. Hence the throughput was cut in half, causing the data transfer time to double.
Naturally, you may ask: I understand that the initial window size is small, but why doesn't the window grow once we have a more accurate ratio of the payload later (i.e. skb->len/skb->truesize)? With some debugging, we eventually found that the scaling_ratio does get updated to a more accurate skb->len/skb->truesize, which in our case is around 0.66. However, another variable, window_clamp, is not updated accordingly. window_clamp is the maximum receive window allowed to be advertised, and it is also initialized to 0.25 * rcvbuf using the initial scaling_ratio. As a result, the receive window size is capped at this value and can't grow any bigger.
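Here is a small C sketch of that capping effect, again illustrative arithmetic rather than the kernel's actual code path; the 0.66 ratio is simply the value we observed during debugging:

/* Illustrative arithmetic for the post-6.6 behavior; not actual kernel code. */
#include <stdio.h>

int main(void)
{
    int rcvbuf = 131072;                                        /* 2 * SO_RCVBUF */

    /* At connection setup, the conservative initial scaling_ratio is used
     * for both the advertised window and window_clamp. */
    double initial_scaling_ratio = 0.25;
    int window_clamp = (int)(rcvbuf * initial_scaling_ratio);   /* 32768 */
    int window = (int)(rcvbuf * initial_scaling_ratio);         /* 32768 */

    /* Later, scaling_ratio is refined from real skb->len / skb->truesize
     * (about 0.66 in our case), but window_clamp is not raised, so the
     * advertised window stays capped at the initial value. */
    double measured_scaling_ratio = 0.66;
    int candidate = (int)(rcvbuf * measured_scaling_ratio);     /* ~86500 */
    window = candidate < window_clamp ? candidate : window_clamp;

    printf("advertised window stays at %d bytes (clamp = %d)\n",
           window, window_clamp);
    return 0;
}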
In theory, the fix is to update window_clamp along with scaling_ratio. However, in order to have a simple fix that doesn't introduce other unexpected behaviors, our final fix was to increase the initial scaling_ratio from 25% to 50%. This makes the receive window size backward compatible with the original default sysctl_tcp_adv_win_scale.
Meanwhile, note that the problem is not caused solely by the changed kernel behavior but also by the fact that the application sets SO_RCVBUF and has a 30-second application-level timeout. In fact, the application is Kafka Connect, and both settings are the default configurations (receive.buffer.bytes=64k and request.timeout.ms=30s). We also created a Kafka ticket to change receive.buffer.bytes to -1 to allow Linux to auto-tune the receive window.
This was a very interesting debugging exercise that covered many layers of Netflix's stack and infrastructure. While it technically wasn't the "network" to blame, this time the culprit turned out to be the software components that make up the network (i.e. the TCP implementation in the kernel).
If tackling such technical challenges excites you, consider joining our Cloud Infrastructure Engineering teams. Explore opportunities by visiting Netflix Jobs and searching for Cloud Engineering positions.
Special thanks to our stunning colleagues Alok Tiagi, Artem Tkachuk, Ethan Adams, Jorge Rodriguez, Nick Mahilani, Tycho Andersen and Vinay Rayini for investigating and mitigating this issue. We would also like to thank Linux kernel network expert Eric Dumazet for reviewing and applying the patch.