Introduction to eBPF in Red Hat Enterprise Linux 7


The recent release of Red Hat Enterprise Linux 7.6 enables extended Berkeley Packet Filter (eBPF) in-kernel virtual machine which can be used for system tracing. In this blog we introduce the basic concept of this technology and few example use cases. We also present some of the existing tooling built on top of eBPF.

Before starting with eBPF it’s worth noting that traditional Berkeley Packet Filter available via setsockopt(SO_ATTACH_FILTER) is still available unmodified.

eBPF enables programmers to write code which gets executed in kernel space in a more secure and restricted environment. Yet this environment enables them to create tools which otherwise would require writing a new kernel module.

The eBPF in Red Hat Enterprise Linux 7.6 is provided as Tech Preview and thus doesn’t come with full support and is not suitable for deployment in production. It is provided with the primary goal to gain wider exposure, and potentially move to full support in the future.

eBPF in Red Hat Enterprise Linux 7.6 is enabled only for tracing purposes, which allows attaching eBPF programs to probes, tracepoints and perf events. Other use cases such as eBPF socket filters or eXpress DataPath (XDP) are not enabled at this stage.

Design of eBPF

eBPF introduces a new syscall, bpf(2). This syscall is used for all eBPF operations like loading programs, attaching them to certain events, creating eBPF maps and access the map contents from tools. We’ll talk about eBPF maps later.

eBPF programs are written in a special assembly language. When an application uses the bpf(2) syscall to load the program into the kernel the eBPF verifier inspects the code for safe execution. The special assembly language allows execution of the programs in a sandboxed virtual machine, with access only to a limited set of resources and data, the verifier is designed to ensure the program safely terminates. This means that an eBPF program should be safe to run, even in production, as it is designed not to cause any unwanted side-effects.

After loading these programs into the kernel, they are just-in-time compiled into machine native code.

eBPF maps provide a generic key-value store, which can be accessed from both in-kernel loaded eBPF programs as well as the userland applications via bpf(2) syscall. Together maps can be used to pass data from kernel to userland and vice versa.

A limited set of kernel functions may be called from eBPF programs. These functions provide read-write access to eBPF maps, may be used to retrieve current processor id and task pointer and also provide access to perf events.

Note that all eBPF programs need to be always compiled against the kernel-headers of the specific kernel where these will run.

Quick start guide to eBPF with bcc-tools

eBPF was enabled in Red Hat Enterprise Linux 7.6 Beta release onwards so the first step is to ensure we are running a Linux kernel newer than 3.10.0-940.el7 with eBPF support:

Developing tools based on eBPF can require deep knowledge of the kernel. Fortunately many of these tools are already created and ready to use. These tools can be used on their own and can also serve as a reference for creation of new eBPF programs.

eBPF programs can be attached to any function in the kernel with access to its arguments, thus only user with CAP_SYS_ADMIN capability can use the bpf(2) syscall. Therefore I am running all examples below as root user.

Development of tracing tools using eBPF can be simplified by using Berkeley Compiler Collection, BCC. Many useful pre-created tools are also shipping as part of the bcc-tools package. For more details about BCC and the tools it provides follow the project’s GitHub page.

To start using bcc-tools, install the respective package:

These tools require no configuration and are ready to be used. Let’s say we want to trace all kill signals being sent by processes running on my machine. There’s a tool killsnoop(8) for this purpose:

This is the output of the killsnoop(8) command after I sent a signal to the process with pid 18315 using kill(1) command.

Using eBPF with systemtap

There are few other interesting tools which internally use eBPF. SystemTap release 3.2 includes BPF backend which can use eBPF to run stap scripts rather than using kernel modules as traditional SystemTap.

To use this backend install systemtap using yum:

RHEL-7.6 Beta comes with systemtap-3.3:

Systemtap uses debuginfo to understand function arguments thus we also need to install kernel-debuginfo package from respective channels:

Let’s say I’d like to create a similar tool to killsnoop we’ve presented above. This tool is just tracing kill(2) syscall, so we can trace respective kernel function for the same outcome:

This output was printed while I was running these commands in the other shell:

There are more details on stapbpf can be found in stapbpf(8) man page or Aaron Merey’s post “Introducing stapbpf – SystemTap’s new BPF backend.”

Debugging of eBPF

Let’s verify that running these tools really uses eBPF for execution. RHEL-7.6 comes with bpftool which can be used to list and dump eBPF programs loaded in the running kernel.

We can see that two eBPF programs are loaded on the machine while I was running the killsnoop tool. This is because killsnoop (kprobe) trace both call entry and return, thus two BPF programs are attach. The one named syscall__kill is used to trace the entry call of kill(2) syscall handler, and the other named do_ret_sys_kill trace returns from this handlers (in-order to record functions return value).

Another option to list processes using eBPF is to run the bpflist(8) tool from bcc-tools:

We can use bpftool to dump and disassemble one of these programs:

This is the eBPF assembly used by killsnoop program to collect the data it needs.

Creating eBPF tools

Let’s take a deep dive and look at how the killsnoop(8) tool is implemented. We can see that the tool itself is actually a Python script:

A closer look at this script, reveals that the Python script contains some C code, as quoted text in the variable bpf_text.

The bpf_text string contains the C code which is compiled into the BPF assembly and passed to the kernel. The BPF_HASH() macro tells bcc framework to create a BPF map of type hash named ‘infotmp’. This map is used to pass data between sys_kill kprobe and respective kretprobe which executes on exit of sys_kill() function. The map is indexed by sender’s process id (pid) thus the key type is u32. Value type of val_t is used to pass necessary data to the return probe.

BPF_PERF_OUTPUT() creates a perf event buffer which is later used to pass data to userland. We can see that the final event contains all important information about the signal consisting of pid’s of sending and receiving process, signal number itself, return value of the kill() syscall and the name of the sender. In the sys_kill() kprobe we don’t yet know the result of the call which we can only see in the return from this function.

The code for syscall kprobe and kretprobe is defined further in functions syscall__kill() and do_ret_sys_kill(). The kprobe function stores the target process pid, number of the syscall and the name of the sender in infotmp map. The kretprobe looks up this data in infotmp map using current process pid as the index, the code accompanies them with the return value of the call and submits these data to the perf event buffer.

The rest of the script is rather trivial:

Here we initiate the bcc BPF framework and tell it to attach the syscall__kill() function from the C code to kprobe defined for syscall kill(2) and do_rt_sys_kill() to attach to the return probe of the same syscall.

The data structure layout is defined to match the struct data_t in the C code.

Finally poll for perf events and print each received perf event.

Writing eBPF tools requires deeper understanding of Linux kernel internals to identify what kprobes need to be employed to collect the data. The BCC framework provides multiple ways to collect, store, sort and analyse the data so the amount of data passed to userspace is minimal. Once these tools are created eBPF can provide an effective way to collect any data from running system and offer a new way to monitor status of all kernel subsystems.

References