W00t. So an application's latency can depend on the TCP receive buffer (if it's too big relative to the average volume of data received, the socket structure ends up holding mostly metadata and next to no actual data, so the kernel tries to merge packets like a garbage collector and latency explodes...) and on the number of sockets open on the same port (on different IP addresses), since the list of TCP sockets open on a system is stored in a hash table that has to be consulted on every SYN before replying with SYN ACK or RST. Except that the hash function used only takes the port into account... So a large number of sockets on the same port wrecks performance, because the hash table ends up with the same complexity as a linked list... Here is a very good example that could be used in a computer science course on algorithmic complexity, instead of the usual murky, or at least very hand-wavy, examples.
« A customer reported an unusual problem with our CloudFlare CDN: our servers were responding to some HTTP requests slowly. Extremely slowly. 30 seconds slowly. This happened very rarely and wasn't easily reproducible. To make things worse all our usual monitoring hadn't caught the problem.
We ran thousands of HTTP queries against one server over a couple of hours. Almost all the requests finished in milliseconds, but, as you can clearly see, 5 requests out of thousands took as long as 1000ms to finish. When debugging network problems the delays of 1s, 30s are very characteristic. They may indicate packet loss since the SYN packets are usually retransmitted at times 1s, 3s, 7s, 15s, 31s.
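Those characteristic delays fall out of TCP's exponential backoff: assuming an initial retransmission timeout of 1s that doubles on each retry (the exact default depends on the kernel), the retransmitted SYNs land at exactly the cumulative times quoted above. A minimal sketch, not from the CloudFlare post:

/* Cumulative SYN retransmission times, assuming an initial retransmission
 * timeout of 1s that doubles on every retry (an assumption; the real default
 * depends on kernel version). Prints 1, 3, 7, 15, 31 -- the characteristic
 * delays mentioned above. */
#include <stdio.h>

int main(void)
{
    double rto = 1.0;        /* assumed initial RTO in seconds */
    double elapsed = 0.0;    /* time since the first (lost) SYN */

    for (int retry = 1; retry <= 5; retry++) {
        elapsed += rto;      /* the retry fires when the timer expires */
        printf("retry %d at t = %.0fs\n", retry, elapsed);
        rto *= 2;            /* exponential backoff */
    }
    return 0;
}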
[...]
The first "ping" went from an external test machine to the router and showed a flat latency of about 20ms.
[...]
The second "ping" session was launched from our external test machine against one of our Linux servers behind the router [...] The "ping" output shows the max RTT being 1.8s. The gigantic latency spikes are also clearly visible on the graph.
The first experiment showed that the network between the external testing server and a router is not malfunctioning. But the second test, against a server just behind this router, revealed awful spikes. This indicates the problem is somewhere between the router and the server inside our datacenter.
[...]
As you see from this tcpdump output, one particular ICMP packet was indeed received from the network at time 0, but for some reason the operating system waited 1.3s before answering it. On Linux network packets are handled promptly when the interrupt occurs; this delayed ICMP response indicates some serious kernel trouble.
[...]
To understand what's going on we had to look at the internals of operating system packet processing. Nowadays there are a plethora of debugging tools for Linux and, for no particular reason, we chose System Tap (stap). With the help of a flame graph we identified a function of interest: net_rx_action [...] The net_rx_action function is responsible for handling packets in Soft IRQ mode.
[...]
During a 30s run, we hit the net_rx_action function 3.6 million times. Out of these runs most finished in under 1ms, but there were some outliers. Most importantly one run took an astonishing 23ms.
Having a 23ms stall in low level packet handling is disastrous. It's totally possible to run out of buffer space and start dropping packets if a couple of such events get accumulated. No wonder the ICMP packets weren't handled in time!
[...]
We repeated the procedure a couple more times. That is:
We made a flame graph.
By trial and error we figured out which descendant of net_rx_action caused the latency spike.
This procedure was pretty effective, and after a couple of runs we identified the culprit: the tcp_collapse function. [...] Over 300 seconds there were just about 1,500 executions of the tcp_collapse function. Out of these executions half finished in under 3ms, but the max time was 21ms.
[...]
The tcp_collapse function is interesting. It turns out to be deeply intermixed with how the BSD sockets API works.
The naive answer would go something along the lines of: the TCP receive buffer setting indicates the maximum number of bytes a read() syscall could retrieve without blocking.
While this is the intention, this is not exactly how it works. In fact, the receive buffer size value on a socket is a hint to the operating system of how much total memory it could use to handle the received data. Most importantly, this includes not only the payload bytes that could be delivered to the application, but also the metadata around it.
Under normal circumstances, a TCP socket structure contains a doubly-linked list of packets—the sk_buff structures. Each packet contains not only the data, but also the sk_buff metadata (sk_buff is said to take 240 bytes). The metadata size does count against the receive buffer size counter. In a pessimistic case—when the packets are very short—it is possible the receive buffer memory is almost entirely used by the metadata.
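To make the pessimistic case concrete, here is a rough sketch (not from the CloudFlare post) using the roughly 240-byte sk_buff figure quoted above. The real per-packet accounting is more involved (it charges the full allocated buffer size) and the overhead varies with kernel version and architecture, so the numbers are only illustrative:

/* Rough illustration of how much of the receive-buffer accounting can be
 * eaten by per-packet metadata when payloads are small. The 240-byte sk_buff
 * size is the figure quoted above; the real accounting charges the whole
 * allocated buffer, so this is a simplification. */
#include <stdio.h>

#define SKB_OVERHEAD 240  /* bytes of metadata charged per packet (quoted figure) */

int main(void)
{
    int payloads[] = { 1, 10, 100, 536, 1460 };  /* payload bytes per packet */

    for (int i = 0; i < 5; i++) {
        int charged = payloads[i] + SKB_OVERHEAD;  /* counts against the receive buffer */
        printf("payload %4d B -> %5.1f%% of the charged memory is metadata\n",
               payloads[i], 100.0 * SKB_OVERHEAD / charged);
    }
    return 0;
}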
Using a large chunk of receive buffer space for the metadata is not really what the programmer wants. To counter that, when the socket is under memory pressure complex logic is run with the intention of freeing some space. One of the operations is tcp_collapse and it will merge adjacent TCP packets into one larger sk_buff. This behavior is pretty much a garbage collection (GC)—and as everyone knows, when the garbage collection kicks in, the latency must spike.
[...]
There are two ways to control the TCP socket receive buffer on Linux:
You can set setsockopt(SO_RCVBUF) explicitly.
Or you can leave it to the operating system and allow it to auto-tune it, using the tcp_rmem sysctl as a hint.
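A minimal sketch of the first option, not from the CloudFlare post. One detail worth knowing: per socket(7), Linux doubles the requested SO_RCVBUF value to leave room for bookkeeping overhead and getsockopt() reports the doubled value, and manually setting SO_RCVBUF generally opts that socket out of receive-buffer autotuning.

/* Minimal sketch: pinning a socket's receive buffer with setsockopt(SO_RCVBUF).
 * Linux doubles the requested value for bookkeeping overhead (see socket(7)),
 * so getsockopt() reports roughly twice what was asked for. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int requested = 256 * 1024;                  /* ask for a 256 KiB receive buffer */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));

    int actual;
    socklen_t len = sizeof(actual);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    printf("requested %d bytes, kernel reports %d bytes\n", requested, actual);

    close(fd);
    return 0;
}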
[...]
This setting tells Linux to autotune socket receive buffers, and allocate between 4KiB and 32MiB, with a default start buffer of 5MiB.
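The triplet behind that autotuning lives in /proc/sys/net/ipv4/tcp_rmem as "min default max" in bytes. A small sketch (not from the post) that reads it back; it prints whatever the local kernel is configured with, not necessarily the values described above:

/* Reads the tcp_rmem autotuning triplet (min, default, max, in bytes) that the
 * quoted paragraph refers to. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_rmem", "r");
    if (!f) {
        perror("tcp_rmem");
        return 1;
    }

    long min, def, max;
    if (fscanf(f, "%ld %ld %ld", &min, &def, &max) == 3)
        printf("rcvbuf autotuning: min=%ld default=%ld max=%ld bytes\n",
               min, def, max);
    fclose(f);
    return 0;
}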
Since the receive buffer sizes are fairly large, garbage collection could take a long time. To test this we reduced the max rmem size to 2MiB and repeated the latency measurements: [...] Now, these numbers are so much better. With the changed settings the tcp_collapse never took more than 3ms! [...] With the rmem changes the max latency of observed net_rx_action times dropped from 23ms to just 3ms.
[...]
Setting the rmem sysctl to only 2MiB is not recommended as it could affect the performance of high throughput, high latency connections. On the other hand reducing rmem definitely helps to alleviate the observed latency issue. We settled on a 4MiB max rmem value, which offers a compromise of reasonable GC times and shouldn't affect the throughput on the TCP layer. »
And the follow-up at https://blog.cloudflare.com/revenge-listening-sockets/:
« After adjusting the previously discussed rmem sysctl we continued monitoring our systems' latency. Among other things we measured ping times to our edge servers. While the worst case improved and we didn't see 1000ms+ pings anymore, the line still wasn't flat. [...] As you can see most pings finished below 1ms. But out of 21,600 measurements about 20 had high latency of up to 100ms. Not ideal, is it?
[...]
The latency occurred within our datacenter and the packets weren't lost. This suggested a kernel issue again. Linux responds to ICMP pings from its soft interrupt handling code. A delay in handling ping indicates a delay in Soft IRQ handling which is really bad and can affect all packets delivered to a machine. Using the system tap script we were able to measure the time distribution of the main soft IRQ function net_rx_action [...] While most of the calls to net_rx_action were handled in under 81us (average), the slow outliers were really bad. Three calls took a whopping 32ms!
[...]
With some back and forth with flame graphs and the histogram-kernel.stp script we went deeper to look for the culprit. We found that tcp_v4_rcv had a similarly poor latency distribution. More specifically the problem lies between lines 1637 and 1642 in the tcp_v4_rcv function in the tcp_ipv4.c file [...] The numbers shown above indicate that the function usually terminated quickly, in under 2us, but sometimes it hit a slow path and took 1-2ms to finish.
The __inet_lookup_skb function is inlined, which makes it tricky to measure accurately. Fortunately the function is simple: all it does is call __inet_lookup_established and __inet_lookup_listener. It's the latter function that was causing the trouble.
Let's discuss how __inet_lookup works. This function tries to find an appropriate connection struct sock structure for a packet. This is done in the __inet_lookup_established call. If that fails, the __inet_lookup will attempt to find a bound socket in listening state that could potentially handle the packet. For example, if the packet is a SYN and the listening socket exists we should respond with SYN+ACK. If there is no bound listening socket we should send an RST instead. The __inet_lookup_listener function finds the bound socket in the LHTABLE hash table. It does so by using the destination port as a hash and picks an appropriate bucket in the hash table. Then it iterates over it linearly to find the matching listening socket.
To understand the problem we traced the slow packets, with another crafted system tap script. It hooks onto __inet_lookup_listener and prints out the details of only the slow packets [...] With this data we went deeper and matched these log lines to specific packets captured with tcpdump. I'll spare you the details, but these are inbound SYN and RST packets whose destination port modulo 32 is equal to 21.
[...]
As mentioned above, Linux maintains a listening hash table containing the listening TCP sockets - the LHTABLE. It has a fixed size of 32 buckets.
To recap:
All the SYN and RST packets trigger a lookup in LHTABLE. Since the connection entry doesn't exist the __inet_lookup_established call fails and __inet_lookup_listener will be called.
LHTABLE is small - it has only 32 buckets.
LHTABLE is hashed by destination port only.
[...]
At CloudFlare we are using a custom DNS server called rrdns. Among many other requirements, the server is designed to withstand DDoS attacks. [...] In fact, our DNS architecture is designed to spread the load among 16k IP addresses.
When an IP address is under attack, and the server is not keeping up with incoming packets, the kernel receive queue on a UDP socket will overflow. We monitor that by looking at the netstat counters [...] It was more than the DNS server could handle, the receive queues built up and eventually overflowed. Fortunately, because we are binding to specific IP addresses, overflowing some UDP receive queues won't affect any other IP addresses.
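The isolation described here follows from each address having its own socket and therefore its own kernel receive queue. A minimal sketch of that per-address UDP binding, not from the post and unrelated to the actual rrdns code; the addresses are purely illustrative, and the binds only succeed for addresses configured on the local machine (plus root privileges for port 53):

/* Minimal sketch of per-address UDP binding: each IP gets its own socket,
 * hence its own receive queue, so a flood aimed at one address only overflows
 * that socket's queue. The post describes ~16k such sockets in production;
 * three example addresses are used here. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    const char *addrs[] = { "192.0.2.1", "192.0.2.2", "192.0.2.3" };  /* example IPs */

    for (int i = 0; i < 3; i++) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(53);                     /* DNS */
        inet_pton(AF_INET, addrs[i], &sa.sin_addr);  /* bind to one specific IP */

        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
            perror(addrs[i]);  /* fails unless the address is local and we can use port 53 */
        else
            printf("listening on %s:53/udp with its own receive queue\n", addrs[i]);
    }
    return 0;
}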
[...]
But what does that have to do with the LHTABLE? Well, in our setup we bound to specific IP addresses for both UDP and TCP. While having 16k listening sockets in UDP is okay, it turns out it is not fine for TCP.
[...]
Due to our DNS setup we had 16k TCP sockets bound to different IP addresses on port 53. Since the port number is fixed, all these sockets ended in exactly one LHTABLE bucket. This particular bucket was number 21 (53 % 32 = 21). When an RST or SYN packet hit it, the __inet_lookup_listener call had to traverse all 16k socket entries. This wasn't fast, in fact it took 2ms to finish.
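A self-contained toy model of that pathology (emphatically not the kernel's code): a 32-bucket table keyed by destination port only, filled with 16k listeners bound to port 53 on different addresses, all of which land in bucket 21 and must be walked linearly on every lookup:

/* Toy model (not the kernel's code) of the LHTABLE pathology described above:
 * a 32-bucket table hashed by destination port only. 16k listening sockets on
 * port 53 therefore all land in bucket 53 % 32 = 21, and every SYN or RST to
 * port 53 walks that one bucket linearly. */
#include <stdio.h>
#include <stdlib.h>

#define LHTABLE_SIZE 32
#define NUM_LISTENERS 16384

struct listener {
    unsigned int ip;        /* the specific address the socket is bound to */
    unsigned short port;
    struct listener *next;  /* bucket chain */
};

static struct listener *lhtable[LHTABLE_SIZE];

static void add_listener(unsigned int ip, unsigned short port)
{
    struct listener *l = malloc(sizeof(*l));
    l->ip = ip;
    l->port = port;
    l->next = lhtable[port % LHTABLE_SIZE];  /* port-only hash */
    lhtable[port % LHTABLE_SIZE] = l;
}

/* Returns how many entries had to be examined to find (or reject) a match. */
static int lookup_cost(unsigned int dst_ip, unsigned short dst_port)
{
    int examined = 0;
    for (struct listener *l = lhtable[dst_port % LHTABLE_SIZE]; l; l = l->next) {
        examined++;
        if (l->port == dst_port && l->ip == dst_ip)
            break;  /* found the bound listener */
    }
    return examined;
}

int main(void)
{
    /* 16k listeners on port 53, each bound to a different address. */
    for (unsigned int i = 0; i < NUM_LISTENERS; i++)
        add_listener(0x0A000000u + i, 53);

    printf("bucket for port 53: %d\n", 53 % LHTABLE_SIZE);
    /* Worst case: the first-added address now sits at the tail of the chain. */
    printf("entries examined for a SYN to that address: %d\n",
           lookup_cost(0x0A000000u, 53));
    return 0;
}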
To solve the problem we deployed two changes:
For TCP connections our DNS server now binds to ANY_IP address (aka: 0.0.0.0:53, *:53). We call this "bind to star". While binding to specific IP addresses is still necessary for UDP, there is little benefit in doing that for the TCP traffic. For TCP we can bind to star safely, without compromising our DDoS defenses.
We increased the LHTABLE size in our kernels. We are not the first to do that: Bill Sommerfeld from Google suggested that back in 2011.
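The first change boils down to replacing 16k per-address TCP listeners with a single wildcard one, so the port 53 bucket holds one entry. A minimal sketch of such a "bind to star" listener, not from the post and unrelated to the actual rrdns implementation (binding to port 53 requires root):

/* Minimal sketch of the "bind to star" change for TCP: one wildcard listener
 * on 0.0.0.0:53 instead of 16k per-address listeners, so the LHTABLE bucket
 * for port 53 holds a single entry. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);  /* 0.0.0.0 -- "bind to star" */
    sa.sin_port = htons(53);

    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 || listen(fd, 128) < 0) {
        perror("bind/listen");
        return 1;
    }
    printf("one TCP listener on *:53 -> one LHTABLE entry for port 53\n");
    return 0;
}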
With these changes deployed, the ping times within our datacenter are finally flat, as they should always have been. »