# Fast checksum computation

## Checksums are a simple mechanism to detect errors in data transmissions. In this article I write about how to compute the Internet checksum and how to do it fast.

Posted by Diego Pino García on June 14, 2018

An Internet packet generally includes two checksums: a TCP/UDP checksum and an IP checksum. In both cases, the checksum value is calculated using the same algorithm. For instance, IP header checksum is computed as follows:

• Set the packet’s IP header checksum to zero.
• Fetch the IP header octets in groups of 16-bit and calculate the accumulated sum.
• In case there’s an overflow while adding, sum the carry-bit to the total sum.
• Once the sum is done, set the checksum to the one’s complement of the accumulated sum.

Here is an implementation of such algorithm in Lua:

A pseudo-header is a 12-byte data structured composed of:

• IP source and destination addresses (8 bytes).
• Protocol (TCP=0x6 or UDP=0x11) (2 bytes).
• TCP length, calculated from IP header’s total length minus TCP or UDP header size (2 bytes).

You may wonder what’s the purpose of the pseudo-header? David P Reed, often considered the father of UDP, provides a great explanation in this thread: Purpose of pseudo header in TCP checksum. Basically, the original goal of the pseudo-header was to take into account IP addresses as part of the TCP checksum, since they’re relevant fields in an end-to-end communication. Back in those days, the original plan for TCP safe communications was to leave source and destination addresses clear but encrypt the rest of the TCP fields. That would avoid man-in-the middle attacks. However NAT, which is essentially a man-in-the-middle, happened thrashing away this original plan. In summary, the pseudo-header exists today for legacy reasons.

Lastly, is worth mentioning that UDP checksum is optional in IPv4, so you might see it set as zero many times. However, the field is mandatory in IPv6.

Verifying a packet’s checksum

Verifying a packet’s checksum is easy. On receiving a packet, the receiver sums all the relevant octets, including the checksum field. The result must be zero if the packet is correct, since the sum of a number and its one’s complement is always zero.

From a developer’s perspective there are several tools for verifying the correctness of a packet’s checksum. Perhaps my preferred tool is Wireshark, which features an option to check the validity of TCP checksums (Edit->Preferences->Protocols[TCP]. Mark Validate the TCP checksum if possible). When this option is enabled, packets with a wrong checksum are highlighted in a black background.

Seeing packets with wrong checksums is common when capturing packets with tcpdump and open them in Wireshark. The reason why checksums are not correct is that TCP checksumming is generally offloaded to the NIC, since it’s a relatively expensive operation (nowadays, NICs count with specialized hardware to do that operation fast). Since tcpdump captures outgoing packets before they hit the NIC, the checksum value hasn’t been calculated yet and likely contains garbage. It’s possible to check whether checksum offloading is enabled in a NIC by typing:

Another option for verifying checksum values is using `tshark`:

Lastly, in case you’d like to fix wrong checksums in a pcap file is possible to do that with `tcprewrite`:

Fast checksum computation

Since TCP checksum computation involves a large chunk of data improving its performance is important. There are, in fact, several RFCs dedicated exclusively to discuss this topic. RFC 1071 (Computing the Internet Checksum) includes a detailed explanation of the algorithm and also explores different techniques for speeding up checksumming. In addition, it features reference implementations in several hardware architectures such as Motorola 68020, Cray and IBM 370.

Perhaps the fastest way to recompute a checksum of a modified packet is to incrementally update the checksum as the packet gets modified. Take for instance the case of NAT which modifies origin and destination ports and addresses. Those operations affect both the TCP and IP checksums. In the case of the IP checksum, if the source address gets modified we can recompute the new IP checksum as:

Or more generically using the following formula:

This technique is covered in RFC 1071 and further polished over two other RFCs: RFC 1141 and RFC 1624 (Incremental Updating of the Internet Checksum).

If we decide to recompute the checksum, there are several techniques to do it fast. On its canonical form, the algorithm says octets are summed as 16-bit words. If there’s carry after an addition, the carry should be added to the accumulated sum. Truth is it’s not necessary to add octets as 16-bit words. Due to the associative property of addition, it is possible to do parallel addition using larger word sizes such as 32-bit or 64-bit words. In those cases the variable that stores the accumulative sum has to be bigger too. Once the sum is computed a final step folds the sum to a 16-bit word (adding carry if any).

Here’s an implementation in C using 32-bit words:

Using larger word sizes increases speed as it reduces the total number of operations. What about using this technique on 64-bit integers? It definitely would be possible, but it requires to handle carry in the body loop. In the algorithm above, 32-bit words are summed to a 64-bit word. Carry, if any, is stored in the higher part of sum, which later gets summed in the folding step.

Using SIMD instructions should allow us to sum larger sizes of data in parallel. For instance using AVX2’s VPADD (vector-packed addition) instruction it should be possible to sum 16x16-bit words in parallel. The issue here once again is handling the possible generated carry to the accumulated sum. So instead of a 16x16-bit vector a 8x32 vector is used instead. From a functional point of view this is equivalent to sum using 128-bit words.

Snabb features implementations of checksum computation, generic and using SIMD instructions. In the latter case there are versions for SSE2 and AVX2 instruction sets. Snabb’s philosophy is to do everything in software and rely as much less as possible in offloaded NIC functions. Thus checksum computation is something done in code. Snabb’s implementation using AVX2 instructions is available at src/arch/avx2.c (Luke pushed a very interesting implementation in machine code as well. See PR#899).

Going back to RFC 1071, many of the reference implementations do additions in the main loop taking into account the carry bit. For instance, in the Motorola 68020 implementation that is done using the ADDXL instruction. In X86 there’s an equivalent add-with-carry instruction (ADC). Basically this instruction performs a sum of two operands plus the carry-flag.

Another technique described in RFC 1071, and also used in the reference implementations, is loop unrolling. Instead of summing one word per loop, we could sum 2, 4 or 8 words instead. A loop that sums 64-bit words in strides of 8 means actually avoiding loops for packet sizes lower than 512 bytes. Unrolling a loop requires adding waterfall code after the loop to handle the edge-cases that control the bounds of the loop.

As an exercise to teach myself more DynASM and X86-64 assembly, I decided to rewrite the generic checksum algorithm and see if performance improved. The first implementation followed the canonical algorithm, summing words as 16-bit values. Performance was much better than the generic Lua implementation posted at the beginning of this article, but it wasn’t better than Snabb’s C implementation, which does loop unrolling.

After this initially disappointing result, I decided to apply some of the optimization techniques commented before. Summing octets as 32-bit words definitely improved performance. The advantage of writing the algorithm in assembly is that I could make use of the ADC instruction. That allowed me to use 64-bit words. Performance improved once again. Finally I tried out several loop unrolling. With a loop unrolling of 4 strides the algorithm proved to be better than the SSE2 algorithm for several packet sizes: 64 bytes, 570 bytes and 1520 bytes. However, it doesn’t beat the AVX2 implementation in the large packet case, but it shown better performance for small and medium sizes.

And here’s the final implementation:

Benchmark results for several data sizes:

Data size: 44 bytes

Algorithm Time per csum Time per byte
Generic 87.77 ns 1.99 ns
SSE2 86.06 ns 1.96 ns
AVX2 83.20 ns 1.89 ns
New 52.10 ns 1.18 ns

Data size: 550 bytes

Algorithm Time per csum Time per byte
Generic 1058.04 ns 1.92 ns
SSE2 510.40 ns 0.93 ns
AVX2 318.42 ns 0.58 ns
New 270.79 ns 0.49 ns

Data size: 1500 bytes

Algorithm Time per csum Time per byte
Generic 2910.10 ns 1.94 ns
SSE2 991.04 ns 0.66 ns
AVX2 664.98 ns 0.44 ns
New 743.88 ns 0.50 ns

All in all, it has been a fun exercise. I have learned quite a lot about the Internet checksum algorithm. I also learned how loop unrolling can help improving performance in a dramatic way (more than I initially expected). I found very interesting as well how changing the context of a problem, in this case the target programming language, forces to think about the problem in a different way but it also enables the possibility of doing more optimizations that were not possible before.