mTCP
mTCP is a user-level TCP stack from KAIST, presented in a paper at NSDI '14. Its design goal is to raise the concurrency of short-flow processing in high-performance networking, and it does so by moving the protocol stack into user space.
The high-performance server world has debated C10K/C10M problems for years and has long complained about the inefficiency of the kernel stack, so it is worth spending some time on projects like mTCP.
Design motivation
- 91% of flows are no larger than 32KB.
- A CPU usage breakdown of a web server shows that 83% of CPU time is spent in the kernel, with TCP/IP accounting for 34%.
Take this number with a grain of salt: spending CPU time in kernel mode is not inherently a problem, since the kernel is the one receiving and sending packets. In my view, the share taken by interrupts, context switches, or lock contention would be more convincing evidence.
- The mTCP design holds that Linux TCP has the following performance bottlenecks:
- Shared resources
- Broken locality
- Per packet processing
Performance problems of Linux TCP
Shared resources
Shared listening queue
When a server offers a service on a single port, there is only one socket in the LISTEN state handling TCP connection-establishment requests.
Even with NIC multi-queue + RSS to raise parallelism, this single listening queue remains and becomes the performance bottleneck.
This criticism does hit the mark. As far as I know, the Linux kernel has also been working on this, e.g., letting multiple listeners share the same port and trimming the locking around the listening queue.
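Not part of mTCP, just a minimal sketch of the kernel-side mitigation mentioned above: with SO_REUSEPORT (Linux 3.9+), every worker creates its own listening socket on the same port, so accept() no longer funnels through a single shared listen queue. Error handling is omitted.

```c
/* Kernel-side mitigation (not mTCP): per-worker listening sockets that
 * share one port via SO_REUSEPORT. */
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

static int create_reuseport_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    struct sockaddr_in addr;

    /* Each worker sets SO_REUSEPORT and binds the same address:port; the
     * kernel then spreads incoming connections across all such sockets. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 4096);
    return fd;
}
```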
Shared file descriptor space
In Linux TCP, every socket looks to the system like a file, so both the number of concurrent sockets and lookup efficiency run into bottlenecks.
Broken locality
Inside a CPU, locality is a key concern for any high-performance design. The locality problem mTCP points at here is that the core servicing the interrupt for a packet may not be the core that later executes read()/write() on the corresponding socket, so the per-core L1/L2 caches are not used effectively.
As far as I know, RSS + RPS already address this to some extent; the remaining piece is keeping the application thread on the core its flows are steered to.
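Again not mTCP code, just a generic illustration of flow-level core affinity: pin the worker thread to the core that RSS/RPS steers its flows to, so interrupt-side processing and read()/write() hit the same L1/L2 cache.

```c
/* Generic illustration (not mTCP): keep packet processing and read()/write()
 * for a flow on the same core by pinning the worker thread to the core its
 * RSS/RPS queue is mapped to. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* Restrict the calling thread to run only on `core`. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```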
Per-packet, per-system-call processing
The complaint here is Linux TCP's system-call pattern: one call per packet or per operation. mTCP's main remedy is batching, i.e., grouping the processing of many packets (and the resulting events) together.
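For comparison only (this is not what mTCP does internally): the stock kernel already offers some batching via recvmmsg(), which can return a whole burst of packets in one system call. mTCP applies the same amortization idea across its entire RX/TX path, not just at the syscall boundary.

```c
/* Kernel-side batched receive, shown only for contrast with mTCP. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH  64
#define PKTLEN 2048

static int read_burst(int fd, char bufs[BATCH][PKTLEN])
{
    struct mmsghdr msgs[BATCH];
    struct iovec   iovs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len  = PKTLEN;
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* One system call, up to BATCH packets returned. */
    return recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
}
```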
mTCP design
The figure below, in which the mTCP designers summarize mTCP's advantages, is quite interesting in itself.
More details from the paper that are worth a closer look:
As a user-level protocol stack, mTCP is essentially a multi-threaded program: one mTCP thread per CPU core. In addition, the mTCP architecture only allows a single application to bind to and access a given NIC.
Our user-level TCP implementation runs as a thread on each CPU core within the same application process. The mTCP thread directly transmits and receives packets to and from the NIC using our custom packet I/O library. Existing user-level packet libraries only allow one application to access an NIC port. Thus, mTCP can only support one application per NIC port. However, we believe this can be addressed in the future using virtualized network interfaces (more details in Section 3.3).
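A rough sketch of this per-core model, based on my recollection of the mTCP example applications (epserver/epwget); treat the exact names and signatures as approximate. Each application thread pins itself to a core and creates its own mTCP context, which spawns that core's mTCP thread.

```c
/* Per-core setup in the style of the mTCP examples (names approximate). */
#include <mtcp_api.h>
#include <mtcp_epoll.h>
#include <stddef.h>

static void *app_worker(void *arg)
{
    int core = (int)(long)arg;

    /* mtcp_init("mtcp.conf") is assumed to have been called once in main(). */
    mtcp_core_affinitize(core);               /* pin this app thread to `core`   */
    mctx_t mctx = mtcp_create_context(core);  /* spawns the per-core mTCP thread */
    if (!mctx)
        return NULL;

    int ep = mtcp_epoll_create(mctx, 4096);   /* per-core event queue */
    /* ... mtcp_socket()/mtcp_bind()/mtcp_listen(), then the event loop
     * sketched further below ... */
    (void)ep;
    return NULL;
}
```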
mTCP creates one mTCP thread per CPU core. That thread dispatches data received from the NIC into each flow's TCP receive buffer, and pushes the data buffered in each flow's send buffer out to the NIC. This introduces context switches between the application thread/process and the mTCP thread, and such a switch is usually more expensive than a system call, so batching is needed to amortize the cost.
The application uses mTCP library functions that communicate with the mTCP thread via shared buffers. The access to the shared buffers is granted only through the library functions, which allows safe sharing of the internal TCP data. When a library function needs to modify the shared data, it simply places a request (e.g., write() request) to a job queue. This way, multiple requests from different flows can be piled to the job queue at each loop, which are processed in batch when the mTCP thread regains the CPU. Flow events from the mTCP thread (e.g., new connections, new data arrival, etc.) are delivered in a similar way.
This, however, requires additional overhead of managing concurrent data structures and context switch between the application and the mTCP thread. Such cost is unfortunately not negligible, typically much larger than the system call overhead [29]. One measurement on a recent Intel CPU shows that a thread context switch takes 19 times the duration of a null system call.
The first core batching flow in mTCP: the mTCP thread processes a batch of packets from the NIC queues in one go and dispatches them into the receive buffers of their respective TCP flows. While handling each packet it also generates the corresponding event into a temporary event queue; for example, once a flow receives data, a read event is generated for it.
When the mTCP thread reads a batch of packets from the NIC’s RX queue, mTCP passes them to the TCP packet processing logic which follows the standard TCP specification.
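A hypothetical skeleton of one iteration of the per-core mTCP thread, as I understand it from the description above. All types and helpers here are stand-ins, not mTCP's real internals.

```c
/* Hypothetical per-core mTCP thread iteration: drain batched application
 * requests, pull a burst of packets from the NIC, run TCP processing, and
 * queue the resulting events for the application. */
#include <stddef.h>

#define RX_BURST 64

struct pkt     { char data[2048]; size_t len; };
struct request { int sockid; int type; };   /* e.g. a queued write()          */
struct event   { int sockid; int type; };   /* e.g. "new data arrived"        */

/* Stubs standing in for the real per-core machinery. */
static size_t jobq_drain(struct request *reqs, size_t max)   { (void)reqs; (void)max; return 0; }
static void   apply_request(const struct request *r)         { (void)r; }
static size_t nic_rx_burst(struct pkt *pkts, size_t max)     { (void)pkts; (void)max; return 0; }
static void   tcp_process_rx(const struct pkt *p, struct event *evq, size_t *n) { (void)p; (void)evq; (void)n; }
static void   flush_events_to_app(const struct event *evq, size_t n) { (void)evq; (void)n; }

static void mtcp_thread_iteration(void)
{
    struct request reqs[RX_BURST];
    struct pkt     pkts[RX_BURST];
    struct event   evq[RX_BURST];
    size_t         nev = 0;

    /* 1. Drain the write()/connect()/close() requests the application piled
     *    onto this core's job queue since the last iteration.               */
    size_t nreq = jobq_drain(reqs, RX_BURST);
    for (size_t i = 0; i < nreq; i++)
        apply_request(&reqs[i]);

    /* 2. Read a batch of packets from this core's NIC RX queue and run the
     *    standard TCP logic on each, filling per-flow receive buffers and
     *    producing events (new connection, new data, ...) along the way.    */
    size_t npkt = nic_rx_burst(pkts, RX_BURST);
    for (size_t i = 0; i < npkt; i++)
        tcp_process_rx(&pkts[i], evq, &nev);

    /* 3. Hand all accumulated events to the application's event queue in one
     *    go, so a single wakeup amortizes the thread context switch.         */
    flush_events_to_app(evq, nev);
}
```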
The second core batching flow in mTCP: after processing a batch of packets, the mTCP thread flushes the events accumulated in the temporary event queue into the application's event queue in one shot. A context switch still happens here, but because a large number of events are handled per switch, the amortized cost is quite low.
After processing a batch of received packets, mTCP flushes the queued events to the application event queue and wakes up the application by signaling it. When the application wakes up, it processes multiple events in a single event loop, and writes responses from multiple flows without a context switch.
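The application-side counterpart, modeled loosely on the mTCP example servers; the epoll-style names and signatures are from memory and may not match the current mTCP headers exactly. The point is that one wakeup consumes a whole batch of events.

```c
/* Application event loop in the style of the mTCP examples (approximate). */
#include <mtcp_api.h>
#include <mtcp_epoll.h>

#define MAX_EVENTS 4096

static void event_loop(mctx_t mctx, int ep, int listen_sock)
{
    struct mtcp_epoll_event events[MAX_EVENTS];
    char buf[8192];

    for (;;) {
        /* One wakeup can deliver many flow events at once. */
        int n = mtcp_epoll_wait(mctx, ep, events, MAX_EVENTS, -1);

        for (int i = 0; i < n; i++) {
            int sock = events[i].data.sockid;

            if (sock == listen_sock) {
                /* Batched connection events: accept until drained. */
                int c;
                while ((c = mtcp_accept(mctx, listen_sock, NULL, NULL)) >= 0) {
                    struct mtcp_epoll_event ev = {
                        .events = MTCP_EPOLLIN, .data.sockid = c };
                    mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, c, &ev);
                }
            } else if (events[i].events & MTCP_EPOLLIN) {
                /* mtcp_read()/mtcp_write() only touch the shared buffers;
                 * the mTCP thread does the actual packet I/O later.        */
                int r = mtcp_read(mctx, sock, buf, sizeof(buf));
                if (r > 0)
                    mtcp_write(mctx, sock, buf, r);
            }
        }
    }
}
```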
As the application walks through the events in its event queue, each TCP socket writes its data into its own send buffer. The mTCP thread later collects data from these send buffers and places the resulting packets into the NIC's TX queue.
Note: the paper does not spell out the logic for collecting data from the different flows' send buffers.
a. Does it collect everything at once? Presumably not; if a large flow wrote a lot of data in one go, it could not all be pushed into the NIC's TX queue at once.
b. Does it instead walk the sockets' send buffers in turn and move data into the NIC TX queue subject to cwnd/rwnd and similar limits? That would mirror the receive path: when dispatching packets into each flow's receive buffer, the mTCP thread has to perform the usual TCP correctness checks (rwnd, TCP flags, PAWS, and so on), and when collecting from the send buffers it likewise has to respect cwnd/rwnd. In essence this moves the per-flow send and receive processing that Linux TCP does into a single thread. The question then is whether this is really more efficient, and whether it hurts fairness (large flows vs. small flows, flows with retransmissions vs. without). I need to check the mtcp code before filling this in.
c. Does it really scan every socket's send buffer? Shouldn't there be a queue, as on the receive side, so that CPU cycles are not wasted scanning sockets whose send buffers are empty?
Each socket’s write() call writes data to its send buffer, and enqueues its tcb to the write queue. Later, mTCP collects the tcbs that have data to send, and puts them into a send list. Finally, a batch of outgoing packets from the list will be sent by a packet I/O system call, transmitting them to the NIC’s TX queue.
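The quote above already answers (a) and (c) to some extent: there is a write queue of tcbs, so only flows with pending data are visited. A hypothetical sketch of that collection step (my reading, not mTCP's actual code):

```c
/* Hypothetical TX collection step: visit only tcbs that were enqueued on the
 * write queue, clip each to its cwnd/rwnd, and add its packets to the
 * batched TX burst. */
#include <stddef.h>

struct tcb {
    size_t unsent;              /* bytes waiting in this flow's send buffer */
    size_t cwnd, rwnd;          /* congestion / receive window limits       */
    struct tcb *next;           /* singly linked write queue                */
};

/* Stub standing in for "segment up to `limit` bytes and append the packets
 * to the NIC TX burst"; returns how many bytes were actually queued.       */
static size_t enqueue_tx_packets(struct tcb *t, size_t limit) { (void)t; return limit; }

static void collect_send_list(struct tcb **write_queue)
{
    struct tcb *t;

    while ((t = *write_queue) != NULL) {
        *write_queue = t->next;

        /* Window limits decide how much of the send buffer goes out now. */
        size_t limit = t->cwnd < t->rwnd ? t->cwnd : t->rwnd;
        if (limit > t->unsent)
            limit = t->unsent;

        t->unsent -= enqueue_tx_packets(t, limit);
        /* A flow with leftover data would be re-enqueued for the next round
         * (omitted here). Because only enqueued tcbs are visited, sockets
         * with empty send buffers are never scanned, cf. question (c).     */
    }
}
```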
mTCP's high scalability shows up in the following ways:
- One mTCP thread runs pinned on each CPU core, and one application thread is pinned to each core as well.
- RSS spreads flows across the CPU cores.
- Data structures are kept per-core wherever possible; where sharing is unavoidable, lock-free data structures are used (a minimal sketch appears below).
To minimize inter-core contention between the mTCP threads, we localize all resources (e.g., flow pool, socket buffers, etc.) in each core, in addition to using RSS for flow-level core affinity. Moreover, we completely eliminate locks by using lock-free data structures between the application and mTCP.
The paper goes into more depth on this under the subsections "Thread mapping and flow-level core affinity" and "Multi-core and cache-friendly data structures".
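As an illustration of the last point, here is a minimal single-producer/single-consumer ring of the kind that lets the application thread and the mTCP thread on the same core share a job/event queue without locks. It is my own sketch, not mTCP's implementation.

```c
/* Minimal lock-free SPSC ring: the application thread pushes, the mTCP
 * thread pops, no locks involved. Illustration only. */
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024          /* power of two */

struct spsc_ring {
    void *slot[RING_SIZE];
    _Atomic unsigned head;      /* written only by the producer */
    _Atomic unsigned tail;      /* written only by the consumer */
};

static bool ring_push(struct spsc_ring *r, void *item)   /* app thread */
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;                            /* full */
    r->slot[head & (RING_SIZE - 1)] = item;
    /* Release makes the slot contents visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static void *ring_pop(struct spsc_ring *r)                /* mTCP thread */
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return NULL;                             /* empty */
    void *item = r->slot[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}
```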
Some performance details considered in mTCP
- CPU cache line alignment
- Splitting the most frequently accessed data structures into a hot, frequently accessed part and everything else (see the sketch below)
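A sketch of what those two bullets can look like in practice (my own layout, not mTCP's): the fields touched on every packet live in one cache-line-aligned hot structure, and rarely used fields are pushed into a separately allocated cold part.

```c
/* Illustrative hot/cold split of a TCP control block (not mTCP's layout). */
#include <stdint.h>

#define CACHE_LINE 64

struct tcb_cold {                       /* touched rarely (setup, teardown) */
    uint32_t local_ip, remote_ip;
    uint16_t local_port, remote_port;
    uint64_t created_ts;
};

struct tcb_hot {                        /* touched on every packet */
    uint32_t snd_nxt, snd_una;
    uint32_t rcv_nxt;
    uint32_t cwnd, rwnd;
    uint8_t  state;
    struct tcb_cold *cold;              /* cold part kept out of the hot line */
} __attribute__((aligned(CACHE_LINE)));
```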
Per-core memory pools for TCP control blocks and socket buffers
a large portion of connection setup cost is from allocating memory space for TCP control blocks and socket buffers. When many threads concurrently call malloc() or free(), the memory manager in the kernel can be easily contended. To avoid this problem, we pre-allocate large memory pools and manage them at user level to satisfy memory (de)allocation requests locally in the same thread.
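A sketch of the idea in this quote (hypothetical code, sizes are illustrative): each core pre-allocates its pool of control blocks once, so connection setup/teardown becomes a local freelist pop/push with no malloc()/free() on the fast path and no cross-core contention in the allocator.

```c
/* Hypothetical per-core memory pool: one arena allocated at startup,
 * then a simple LIFO freelist used only by this core's threads. */
#include <stdlib.h>

#define TCB_SIZE   256                  /* illustrative control-block size */
#define POOL_COUNT 65536                /* per-core, sized for the flow target */

struct tcb_pool {
    void *arena;                        /* one big allocation, done at startup */
    void *free_list[POOL_COUNT];        /* LIFO stack of free blocks */
    size_t nfree;
};

static int pool_init(struct tcb_pool *p)
{
    p->arena = calloc(POOL_COUNT, TCB_SIZE);
    if (!p->arena)
        return -1;
    for (size_t i = 0; i < POOL_COUNT; i++)
        p->free_list[i] = (char *)p->arena + i * TCB_SIZE;
    p->nfree = POOL_COUNT;
    return 0;
}

/* Both calls are used only by this core, so no locking is needed. */
static void *pool_alloc(struct tcb_pool *p) { return p->nfree ? p->free_list[--p->nfree] : NULL; }
static void  pool_free(struct tcb_pool *p, void *tcb) { p->free_list[p->nfree++] = tcb; }
```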
Huge pages to reduce TLB misses
- Short-flow optimization: control packets (such as SYN and SYN/ACK) get a list of their own, and pure ACKs are kept on a separate list as well. When the mTCP TX manager collects data to send, the control list and the ACK list are considered first.
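A hypothetical sketch of that prioritization (not mTCP's code): the per-core TX manager drains the control list, then the ACK list, before any data, so the handshakes of short flows are not stuck behind a large flow's backlog.

```c
/* Hypothetical TX priority order: control packets, then pure ACKs, then data. */
#include <stddef.h>

struct pkt_list { struct pkt_list *next; /* packet metadata elided */ };

struct tx_manager {
    struct pkt_list *control_list;      /* SYN, SYN/ACK, FIN, RST    */
    struct pkt_list *ack_list;          /* pure ACKs                 */
    struct pkt_list *data_list;         /* payload-carrying segments */
};

/* Stub: would move up to `budget` packets from `list` into the NIC TX burst
 * and return how many were taken; here it pretends nothing was pending.     */
static size_t tx_burst(struct pkt_list **list, size_t budget) { (void)list; (void)budget; return 0; }

static void tx_collect(struct tx_manager *tx, size_t budget)
{
    /* Serve control packets first, then ACKs, then data, within the budget. */
    budget -= tx_burst(&tx->control_list, budget);
    budget -= tx_burst(&tx->ack_list, budget);
    tx_burst(&tx->data_list, budget);
}
```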