In a previous post I explained how to build a kernel with XDP (eXpress Data Path) support. Enabling that feature is mandatory in order to use XDP, a new Linux kernel component that greatly improves packet processing performance.

In the last few years, we have seen a rise of programming toolkits and techniques aimed at overcoming the limitations of the Linux kernel when it comes to high-performance packet processing. One of the most popular techniques is kernel bypass, which means skipping the kernel’s networking layer and doing all packet processing from user-space. Kernel bypass also involves managing the NIC from user-space, in other words, relying on a user-space driver to handle the NIC.

By giving full control of the NIC to a user-space program, we reduce the overhead introduced by the kernel (context switching, networking layer processing, interrupts, etc), which becomes significant when working at speeds of 10 Gbps or higher. Kernel bypass, combined with other features (batch packet processing) and performance tuning adjustments (NUMA awareness, CPU isolation, etc), forms the basis of high-performance user-space networking. Perhaps the poster child of this new approach to packet processing is Intel’s DPDK (Data Plane Development Kit), although other well-known toolkits and techniques are Cisco’s VPP (Vector Packet Processing), Netmap and, of course, Snabb.

User-space networking has several disadvantages:

  • An OS kernel is an abstraction layer for hardware resources. Since user-space programs need to manage their resources directly, they also need to manage the hardware. That often means programming their own drivers.
  • As kernel-space is completely skipped, so is all the networking functionality provided by the kernel. User-space programs need to reimplement functionality that might already be provided by the kernel or the OS.
  • Programs work as sandboxes, which severely limits their ability to interact, and be integrated, with other parts of the OS.

Essentially, user-space networking achieves high-speed performance by moving packet processing out of the kernel’s realm into user-space. XDP does in fact the opposite: it moves user-space networking programs (filters, mappers, routing, etc) into the kernel’s realm. XDP allows us to execute our network function as soon as a packet hits the NIC, before it starts moving up through the kernel’s networking subsystem, which results in a significant increase in packet-processing speed. But how does the kernel make it possible for a user to execute their programs within the kernel’s realm? Before answering this question we need to take a look at BPF.

BPF and eBPF

Despite its somewhat misleading name, BPF (Berkeley Packet Filter) is in fact a virtual machine model. This VM was originally designed for packet filtering, hence its name.

One of the most prominent users of BPF is the tool tcpdump. When capturing packets with tcpdump, a user can define a packet-filtering expression; only packets that match that expression will actually be captured. For instance, the expression “tcp dst port 80” captures all TCP packets whose destination port equals 80. This expression can be reduced by a compiler to BPF bytecode.

$ sudo tcpdump -d "tcp dst port 80"
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 6
(002) ldb      [20]
(003) jeq      #0x6             jt 4    jf 15
(004) ldh      [56]
(005) jeq      #0x50            jt 14   jf 15
(006) jeq      #0x800           jt 7    jf 15
(007) ldb      [23]
(008) jeq      #0x6             jt 9    jf 15
(009) ldh      [20]
(010) jset     #0x1fff          jt 15   jf 11
(011) ldxb     4*([14]&0xf)
(012) ldh      [x + 16]
(013) jeq      #0x50            jt 14   jf 15
(014) ret      #262144
(015) ret      #0

Basically what the program above does is:

  • Instruction (000): loads the 16-bit word at offset 12 of the packet into the accumulator. Offset 12 holds the packet’s ethertype.
  • Instruction (001): compares the value of the accumulator to 0x86dd, which is the ethertype value for IPv6. If the result is true, the program counter jumps to instruction (002); if not, it jumps to (006).
  • Instruction (006): compares the value to 0x800 (the ethertype value of IPv4). If true, jump to (007); if not, to (015).

And so forth, until the packet-filtering program returns a result. The result is essentially a boolean: returning a non-zero value (instruction (014)) means the packet matched, whereas returning a zero value (instruction (015)) means the packet didn’t match.
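Classic BPF programs like this can be attached directly to a socket. As a quick illustration (not part of the original samples), here is a minimal C sketch that attaches the filter above to a raw socket using the SO_ATTACH_FILTER socket option. The sock_filter array is the same program shown above in numeric form; it should match the output of tcpdump -dd "tcp dst port 80", which prints the filter as a C array. Error handling is kept to a minimum and root privileges are required.

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void)
{
    /* "tcp dst port 80", numeric form of the bytecode listed above */
    struct sock_filter code[] = {
        { 0x28, 0, 0, 0x0000000c }, { 0x15, 0,  4, 0x000086dd },
        { 0x30, 0, 0, 0x00000014 }, { 0x15, 0, 11, 0x00000006 },
        { 0x28, 0, 0, 0x00000038 }, { 0x15, 8,  9, 0x00000050 },
        { 0x15, 0, 8, 0x00000800 }, { 0x30, 0,  0, 0x00000017 },
        { 0x15, 0, 6, 0x00000006 }, { 0x28, 0,  0, 0x00000014 },
        { 0x45, 4, 0, 0x00001fff }, { 0xb1, 0,  0, 0x0000000e },
        { 0x48, 0, 0, 0x00000010 }, { 0x15, 0,  1, 0x00000050 },
        { 0x06, 0, 0, 0x00040000 }, { 0x06, 0,  0, 0x00000000 },
    };
    struct sock_fprog prog = {
        .len = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };

    /* Raw socket that sees every packet on the machine (needs root). */
    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    /* From now on, only packets matching the filter are delivered. */
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt(SO_ATTACH_FILTER)");
        return 1;
    }

    char buf[2048];
    ssize_t n = recv(sock, buf, sizeof(buf), 0);
    printf("received %zd bytes of a packet with TCP dst port 80\n", n);
    return 0;
}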

The BPF VM and its bytecode were introduced by Steve McCanne and Van Jacobson in late 1992, in their paper The BSD Packet Filter: A New Architecture for User-level Packet Capture, and presented for the first time at the USENIX Winter ’93 conference.

Since BPF is a VM, it defines an environment where programs are executed. Besides a bytecode, it also defines a packet-based memory model (load instructions implicitly operate on the packet being processed), registers (A and X; Accumulator and Index register), a scratch memory store and an implicit Program Counter. Interestingly, BPF’s bytecode was modeled after the Motorola 6502 ISA. As Steve McCanne recalls in his Sharkfest ’11 keynote, he was familiar with 6502 assembly from his junior high-school days programming on an Apple II, and that influenced him when he designed the BPF bytecode.

The Linux kernel has featured BPF support since v2.5, mainly added by Jay Schulist. There were no major changes in the BPF code until 2011, when Eric Dumazet turned the BPF interpreter into a JIT (source: A JIT for packet filters). Instead of interpreting BPF bytecode, the kernel was now able to translate BPF programs directly into native code for a target architecture: x86, ARM, MIPS, etc.

Later on, in 2014, Alexei Starovoitov introduced a new BPF JIT. This new JIT was actually a new architecture based on BPF, known as eBPF. Both VMs co-existed for some time, but nowadays packet filtering is implemented on top of eBPF. In fact, a lot of documentation now refers to eBPF simply as BPF, while the classic BPF is known as cBPF.

eBPF extends the classic BPF virtual machine in several ways:

  • Takes advantage of modern 64-bit architectures. eBPF uses 64-bit registers and increases the number of available registers from 2 (Accumulator and X register) to 10. eBPF also extends the number of opcodes (BPF_MOV, BPF_JNE, BPF_CALL…).
  • Decoupled from the networking subsystem. BPF was bound to a packet-based data model. Since it was used for packet filtering, its code lived within the networking subsystem. However, the eBPF VM is no longer bound to a data model and it can be used for any purpose. It’s now possible to attach an eBPF program to a tracepoint or to a kprobe. This opens up the door of eBPF to instrumentation, performance analysis and many more uses within other kernel subsystems. The eBPF code now lives in its own directory: kernel/bpf.
  • Global data stores called Maps. Maps are key-value stores that allow data to be exchanged between user-space and kernel-space. eBPF provides several types of Maps (a minimal user-space sketch follows this list).
  • Helper functions, such as packet rewriting, checksum calculation or packet cloning. Unlike user-space programming, these functions are executed inside the kernel. User-space programs, in turn, interact with eBPF (loading programs, reading and writing Maps) through the bpf() system call.
  • Tail-calls. eBPF programs are limited to 4096 instructions. The tail-call feature allows an eBPF program to pass control to a new eBPF program, overcoming this limitation (a chain of up to 32 tail calls is allowed).
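To make the Maps point more concrete, here is a minimal sketch (not taken from the kernel tree) of what a Map looks like from the user-space side: it creates a hash map through the bpf() system call, writes an element and reads it back. Real programs normally use the wrappers provided by libbpf or by the kernel samples instead of invoking the syscall directly.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Thin wrapper over the bpf() syscall; glibc provides no wrapper for it. */
static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    /* Create a hash map: long keys, long values, 16 entries max. */
    union bpf_attr create = {
        .map_type    = BPF_MAP_TYPE_HASH,
        .key_size    = sizeof(long),
        .value_size  = sizeof(long),
        .max_entries = 16,
    };
    int map_fd = sys_bpf(BPF_MAP_CREATE, &create, sizeof(create));
    if (map_fd < 0) {
        perror("BPF_MAP_CREATE");   /* requires root privileges */
        return 1;
    }

    long key = 42, value = 1234, readback = 0;

    /* Insert (or overwrite) an element. */
    union bpf_attr update = {
        .map_fd = map_fd,
        .key    = (__u64)(unsigned long)&key,
        .value  = (__u64)(unsigned long)&value,
        .flags  = BPF_ANY,
    };
    if (sys_bpf(BPF_MAP_UPDATE_ELEM, &update, sizeof(update)))
        perror("BPF_MAP_UPDATE_ELEM");

    /* Read the element back. */
    union bpf_attr lookup = {
        .map_fd = map_fd,
        .key    = (__u64)(unsigned long)&key,
        .value  = (__u64)(unsigned long)&readback,
    };
    if (sys_bpf(BPF_MAP_LOOKUP_ELEM, &lookup, sizeof(lookup)))
        perror("BPF_MAP_LOOKUP_ELEM");

    printf("key %ld -> value %ld\n", key, readback);
    return 0;
}

The same map, when declared inside an eBPF program, can be read and written from kernel-space with helper functions, which is exactly what the example below does.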

eBPF: an example

The Linux kernel sources include several eBPF examples. They’re available at samples/bpf/. To compile these examples simply type:

$ sudo make samples/bpf/

Instead of coding a new eBPF example, I’m going to reuse one of the samples available in samples/bpf/. I will go through some parts of the code and explain how it works. The example I chose was the tracex4 program.

Generally, all the examples at samples/bpf/ consist of 2 files. In this case:

  • tracex4_kern.c: the eBPF program.
  • tracex4_user.c: the main program.

We need to compile tracex4_kern.c to eBPF bytecode. At this moment, gcc lacks a backend for eBPF. Luckily, clang can emit eBPF bytecode, and the Makefile uses clang to compile tracex4_kern.c into an object file.
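The exact invocation depends on the kernel version (the real Makefile also adds the required kernel header include paths), but it boils down to something along these lines:

$ clang -O2 -target bpf -c tracex4_kern.c -o tracex4_kern.o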

I mentioned earlier that one of the most interesting features of eBPF is Maps. Maps are key/value stores that allow user-space and kernel-space programs to exchange data. tracex4_kern defines one map:

struct pair {
    u64 val;
    u64 ip;
};  

struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct pair),
    .max_entries = 1000000,
};

BPF_MAP_TYPE_HASH is one of the many Map types offered by eBPF. In this case, it’s simply a hash. You may also have noticed the SEC("maps") declaration. SEC is a macro that places a symbol into a new section of the resulting binary. The tracex4_kern example actually defines two more sections:

SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{   
    long ptr = PT_REGS_PARM2(ctx);

    bpf_map_delete_elem(&my_map, &ptr); 
    return 0;
}
    
SEC("kretprobe/kmem_cache_alloc_node") 
int bpf_prog2(struct pt_regs *ctx)
{
    long ptr = PT_REGS_RC(ctx);
    long ip = 0;

    // get ip address of kmem_cache_alloc_node() caller
    BPF_KRETPROBE_READ_RET_IP(ip, ctx);

    struct pair v = {
        .val = bpf_ktime_get_ns(),
        .ip = ip,
    };
    
    bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
    return 0;
}   

These two functions allow us to delete an entry from the map when kmem_cache_free is called (kprobe/kmem_cache_free) and to add a new entry when kmem_cache_alloc_node returns (kretprobe/kmem_cache_alloc_node). All the function calls in capital letters are actually macros defined in bpf_helpers.h.
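For reference, these macros are small. SEC simply places the annotated function or variable in a named ELF section, and the PT_REGS_* macros map function arguments and return values onto the right CPU registers for the target architecture. On x86_64 they look roughly like this (paraphrased from bpf_helpers.h; exact definitions vary between kernel versions):

/* Place a symbol in a dedicated ELF section of the object file. */
#define SEC(NAME) __attribute__((section(NAME), used))

/* x86_64: second function argument and function return value. */
#define PT_REGS_PARM2(x) ((x)->si)
#define PT_REGS_RC(x)    ((x)->ax)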

If I dump the sections of the object file, I should be able to see these new sections defined:

$ objdump -h tracex4_kern.o

tracex4_kern.o:     file format elf64-little

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000000  0000000000000000  0000000000000000  00000040  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 kprobe/kmem_cache_free 00000048  0000000000000000  0000000000000000  00000040  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 kretprobe/kmem_cache_alloc_node 000000c0  0000000000000000  0000000000000000  00000088  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 maps          0000001c  0000000000000000  0000000000000000  00000148  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  4 license       00000004  0000000000000000  0000000000000000  00000164  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  5 version       00000004  0000000000000000  0000000000000000  00000168  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  6 .eh_frame     00000050  0000000000000000  0000000000000000  00000170  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

Then there is tracex4_user.c, the main program. Basically, the program listens for kmem_cache_alloc_node events. When such an event happens, the corresponding eBPF code is executed: it stores the allocation timestamp and the caller’s instruction pointer (IP) in the map, keyed by the object’s address. The main program prints the map’s contents in a loop. Example:

$ sudo ./tracex4
obj 0xffff8d6430f60a00 is  2sec old was allocated at ip ffffffff9891ad90
obj 0xffff8d6062ca5e00 is 23sec old was allocated at ip ffffffff98090e8f
obj 0xffff8d5f80161780 is  6sec old was allocated at ip ffffffff98090e8f

How are the user-space program and the eBPF program connected? On initialization, tracex4_user.c loads the tracex4_kern.o object file using the load_bpf_file function.

int main(int ac, char **argv)
{
    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
    char filename[256];
    int i;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
        return 1;
    }

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    for (i = 0; ; i++) {
        print_old_objects(map_fd[1]);
        sleep(1);
    }

    return 0;
}

When load_bpf_file is executed, the probes defined in the eBPF file are added to /sys/kernel/debug/tracing/kprobe_events. We are now listening to those events, and our program can do something when they happen.

$ sudo cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/kmem_cache_free kmem_cache_free
r:kprobes/kmem_cache_alloc_node kmem_cache_alloc_node
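The other half of the main loop is print_old_objects, which reads back the Map populated from kernel-space. Without reproducing the sample verbatim, the idea is to walk the hash map and print every object older than some threshold. Here is a simplified sketch, assuming the libbpf helpers bpf_map_get_next_key and bpf_map_lookup_elem, a one-second threshold and CLOCK_MONOTONIC as the time source (header names and details vary between kernel versions):

#include <stdio.h>
#include <time.h>
#include <linux/types.h>
#include <bpf/bpf.h>   /* bpf_map_get_next_key(), bpf_map_lookup_elem() */

/* User-space mirror of the struct stored in the map by tracex4_kern.c. */
struct pair {
    __u64 val;
    __u64 ip;
};

static void print_old_objects(int fd)
{
    __u64 key = -1, next_key;
    struct pair v;
    struct timespec ts;

    /* Current monotonic time in ns, comparable to bpf_ktime_get_ns(). */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    __u64 now = ts.tv_sec * 1000000000ull + ts.tv_nsec;

    /* Walk every entry of the hash map. */
    while (bpf_map_get_next_key(fd, &key, &next_key) == 0) {
        key = next_key;
        if (bpf_map_lookup_elem(fd, &next_key, &v))
            continue;
        /* Only report objects allocated more than one second ago. */
        if (now - v.val < 1000000000ull)
            continue;
        printf("obj 0x%llx is %2llusec old was allocated at ip %llx\n",
               next_key, (now - v.val) / 1000000000ull, v.ip);
    }
}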

All the other programs in samples/bpf/ follow a similar structure. There are always two files:

  • XXX_kern.c: the eBPF program.
  • XXX_user.c: the main program.

The eBPF program defines Maps and functions that are hooked, via their section names, to kernel events. When the kernel emits a certain type of event (a tracepoint, for instance), our hooks are executed. Maps are used to exchange data between the kernel program and the user-space program.

Wrapping up

In this article I have covered BPF and eBPF from a high-level view. I’m aware there are a lot of resources and information about eBPF nowadays, but I felt I needed to explain it in my own words. Please check out the list of recommended readings for further information.

In the next article I will cover XDP and its relation to eBPF.

Recommended readings: