Changwoo Writes Code

concurrent, de-centralized, but still organized world under the hood

sched_ext: a BPF-extensible scheduler class (Part 1)

This blog post introduces sched_ext, a recently proposed extensible scheduler class. sched_ext allows you to write and run your custom process scheduler optimized for your target workloads and hardware architectures using BPF programs. In the rest of the post, I recap the concept of a scheduler at the high level and go over the default schedulers in Linux kernel. Then, I introduce why an extensible scheduler framework, sched_ext, is necessary, how to use it, and provide a BPF scheduler example.

What is a process scheduler? #

A process scheduler is essential in any OS, including the Linux kernel. A scheduler multiplexes CPU time into multiple tasks -- processes and threads. It essentially makes two decisions:

With a multi-core CPU, which is a norm today, a scheduler also makes one more decision:

The above three decisions often conflict each other. For example, a scheduler may assign a long time slice to reduce the context-switching overhead and take advantage of the warmed-up processor cache. However, a time slice that is too long may degrade the responsiveness of a system, hurting the interactive performance.

Also, a scheduler may want to fully utilize multiple CPU cores, so it can assign a task to any idle CPU core. Such work-conserving scheduling may increase overall CPU utilization, but higher CPU utilization does not always result in higher performance. When a task is too frequently migrated from one CPU to another CPU, the cache of the newly assigned CPU should be warmed up again so there is little chance to reuse the cache when the task is scheduled next, resulting in lower performance. The cache re-warming cost will be higher when a system has multiple cache domains, such as NUMA (Non-Uniform Memory Architecture) and AMD CCX (Core Complex) architectures.

How does the Linux scheduler make scheduling decisions? #

Completely Fair Scheduler (CFS) scheduler #

You may wonder how the Linux scheduler makes scheduling decisions to meet such conflicting goals. The Completely Fair Scheduler (CFS) has been the default scheduler for the last 15+ years since merged into the 2.6.23 kernel (October 2007). The core idea of CFS is simple (of course, the devil is in the details), but it works surprisingly well in many workloads. CFS manages the virtual total execution time of a task -- vruntime. The actual execution time of a task is adjusted based on the task's static priority (i.e., nice value). CFS pursues the complete fairness of tasks' vruntime to equally distribute the CPU time to each task based on its priority. It always chooses a task with the smallest vruntime value for complete fair use of CPU time. CFS internally manages a run queue -- a queue of runnable tasks -- using a red-black tree indexed by each task's vruntime. The red-black tree offers O(logN) time complexity in picking the next task where N is the number of runnable tasks. CFS handles IO-intensive tasks pretty well. The most IO-intensive tasks sleep most of the time, waiting for IO events, so their vruntime values tend to be small. Hence, whenever an IO event occurs and the IO-intensive task is wakened up and becomes runnable, CFS quickly chooses that task since it has a small vruntime value.

However, CFS scheduler does not enforce any scheduling deadlines and has many corner cases where a latency-critical task is not scheduled on time -- such scheduling delay of latency-critical tasks results in high tail latency. For instance, Wine -- a Windows-API emulation layer on Linux -- has a wineserver process. The wineserver process is a single-threaded event router which relays Window events to other processes. It is super latency-critical, so scheduling delay of wineserver triggers cascading delays of dependent tasks. However, CFS does not always schedule wineserver immediately. Since wineserver consumes a fair amount of CPU time as a central event rounter, its vruntime value would be pretty high so CFS won't immediately pick the wineserver task all the time.

Earliest Eligible Virtual Deadline First (EEVDF) scheduler #

Due to such limitations, the CFS scheduler was recently retired after serving 15+ years. The Earliest Eligible Virtual Deadline First (EEVDF) scheduler was introduced as a new default scheduler in the Linux kernel 6.6 release (November 2023).

Like CFS, EEVDF pursues the fair use of CPU time among tasks. For this, EEVDF introduces two notions: eligibility and lag of a task. EEVDF manages how much CPU time should be allocated to a task adjusted by the task's static (nice) priority. Task's lag is the gap between the ideal CPU time and the actual CPU time allocated to the task. The positive lag means less CPU time has been allocated. On the other hand, the negative lag means too much CPU time is allocated to the task, so scheduling such a task with a negative lag value violates the fair use of CPU time. With the EEVDF terms, tasks with only positive lag values are eligible for the scheduling. EEVDF scheduler does not schedule an ineligible task, which has a negative lag value, for fairness.

Another essential concept in EEVDF is virtual deadline. A virtual deadline is the time to finish executing the task's requested time slice. It is calculated by adding the task's allocated time slice, and its heuristically estimated eligible time, which is when a task may start to run. Conceptually, a task may start at the eligible time and finish its time slice at the virtual deadline. EEVDF assumes the task's time slice is given by setting a latency nice value. The EEVDF scheduler always chooses an eligible task with the earliest virtual deadline.

EEVDF scheduler naturally handles latency-critical tasks unlike CFS. A latency-critical task has a small latency nice value, so it has a shorter time slice. Therefore, its virtual deadline is earlier than non-latency critical tasks with longer slices. Hence, suppose two tasks have the same nice value with different latency nice values. The latency-critical task will be scheduled more frequently with a shorter slice.

One size does not fit all, so does a scheduler #

The upsteam schedulers, like CFS and EEVDF, aim to generically work well for most workloads (e.g., databases, games, micro-services) on most hardware platforms (e.g., x86, ARM, NUMA, non-NUMA, CCX, etc.). It is not uncommon that one scheduler optimization works well in some workloads and shows pathological degradations in other workloads and hardware combinations. To avoid such performance regression bugs, the kernel community sets a high bar in changing the scheduler code.

However, if you always run a particular set of workloads on a specific hardware, would it be best to use such generic schedulers? In the real world, there are many such cases. For example, Linux-based gaming devices, like the Steam Deck, usually run Windows-based games. Hyperscalars, like Google and Meta, standardize their servers to a few types (e.g., OCP) for maintenance and run a certain set of workloads in a cluster. Similarly, many other companies (e.g., Airbnb, eBay) run their own workloads on their own mini-datacenters.

Adopting a custom scheduler for specific workload/hardware combinations would be more beneficial than using generic schedulers. However, implementing and maintaining custom schedulers is very painful. That is not only because replacing or customizing the in-kernel scheduler is challenging. But also, it is unlikely that a custom scheduling policy is upstreamed so that the custom scheduler will impose a heavy maintenance burden in the long term.

Here comes the sched_ext framework #

sched_ext was proposed to address the problems mentioned above. It allows users to write a custom scheduling policy using BPF without modifying the kernel code. You don't need to struggle to maintain the out-of-tree custom scheduler. In addition, BPF provides a safe kernel programming environment. In particular, the BPF verifier ensures that your custom scheduler has neither a memory bug nor an infinite loop. Also, if your custom scheduler misbehaves -- like failing to schedule a task for too long (say 30 seconds), the kernel portion of sched_ext kills your custom scheduler and falls back to the default kernel scheduler (CFS or EEVDF). Last but not least, you can update the BPF scheduler without reinstalling the kernel and rebooting a server. It is a charm in a large data center with hundreds of thousands of servers. Rebooting that many servers in a rolling manner easily takes several weeks.

Obviously, sched_ext is a safe and nice framework that allows for the rapid experimentation and deployment of various scheduling policies. In the long term, I believe providing such an environment will also positively affect the default scheduler's improvement. Who knows that some scheduling policies battle-tested with sched_ext have turned out beneficial to generic workloads (with some tweaks) so they can be re-implemented on the default scheduler?

How to build and run sched_ext #

Enough talking. Now, let's get our hands dirty. As of this writing, the fifth version of the sched_ext patch set has been proposed. It still needs to be merged into the mainline kernel. Hence, you should apply the patch set to the kernel source or use the sched_ext kernel code on GitHub. Either way, the following kernel configuration should be enabled to turn on sched_ext and enable BPF. Make sure the following bits are included in your .config.

# essential configs to enable sched_ext
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
CONFIG_DEBUG_INFO_BTF=y

# useful debug features for sched_ext
CONFIG_DEBUG_INFO=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_INFO_DWARF5=y
CONFIG_DEBUG_INFO_BTF_MODULES=y

Once the kernel configuration is done, you just follow your usual kernel compilation and installation procedure -- one example is here.

After the kernel is compiled, it's time to compile the BPF schedulers. The sched_ext scheduler sources provide several BPF schedulers under tools/sched_ext. BPF code should be compiled with clang/LLVM, so the following options should be added to the make command. Note the clang version should be 16 or higher to compile BPF code correctly.

$> cd tools/sched_ext
$> clang --version # check if clang version is 16 or higher
$> make CC=clang LLVM=1 -j

After rebooting with the sched_ext kernel, let's run any of the sched_ext BPF schedulers.

$> cd tools/sched_ext/build/bin
$> sudo ./scx_simple # a simple vruntime-based scheduler, like CFS
local=2 global=0
local=885 global=5
local=895 global=12
local=906 global=20
...

Yeah! Now the simple vruntime-based scheduler, scx_simple, is up and running. The log message shows how many tasks are scheduled from the local or global dispatch queue (DSQ). For now, please consider DSQ as a run queue in sched_ext holding runnable tasks. I will explain the concept of the dispatch queue in more detail in my next blog post. Stay tuned!

While simple, scx_simple performs better than the CFS scheduler under some workloads. You can check the status of sched_ext from sysfs:

$> sudo cat /sys/kernel/debug/sched/ext
ops : simple
enabled : 1
switching_all : 1
switched_all : 1
enable_state : enabled
nr_rejected : 0

If the BPF scheduler prints out some messages (using, for example, bpf_printk()), you can check the message at /sys/kernel/debug/tracing/trace_pipe as in any other BPF programs.

By the way, if you are lucky enough to use Arch Linux or Ubuntu Linux (24.04 NobleNumbat), you can install a sched_ext-enabled kernel and the necessary tools from package repositories. Please find the detailed instructions to install the sched_ext packages in Arch or Ubuntu Linux here.

What's inside scx_simple? #

Let's take a look the source code of scx_simple. The scx_simple scheduler has two files: scx_simple.c and scx_simple.bpf.c.

scx_simple.c opens and load the BPF scheduler (scx_simple__open() and scx_simple__load()). Then it enables the operations provided by the BPF scheduler (bpf_map__attach_struct_ops(skel->maps.simple_ops)). Once the BPF scheduler operations are enabled -- the BPF scheduling policy is enabled, it periodically reads and prints the status.

int main(int argc, char **argv)
{
/* ... */
skel = scx_simple__open();
SCX_BUG_ON(!skel, "Failed to open skel");

/* ... */
SCX_BUG_ON(scx_simple__load(skel), "Failed to load skel");

link = bpf_map__attach_struct_ops(skel->maps.simple_ops);
SCX_BUG_ON(!link, "Failed to attach struct_ops");

while (!exit_req && !uei_exited(&skel->bss->uei)) {
__u64 stats[2];
read_stats(skel, stats);
printf("local=%llu global=%llu\n", stats[0], stats[1]);
fflush(stdout);
sleep(1);
}

bpf_link__destroy(link);
uei_print(&skel->bss->uei);
scx_simple__destroy(skel);
return 0;
}

sched_ext defines a struct of callback functions, struct sched_ext_ops defined at include/linux/sched/ext.h. A BPF scheduler, scx_simple.bpf.c, implements those callback functions, which will be called in various situations by the in-kernel portion of sched_ext.

SEC(".struct_ops.link")
struct sched_ext_ops simple_ops = {
.enqueue = (void *)simple_enqueue,
.dispatch = (void *)simple_dispatch,
.running = (void *)simple_running,
.stopping = (void *)simple_stopping,
.enable = (void *)simple_enable,
.init = (void *)simple_init,
.exit = (void *)simple_exit,
.name = "simple",
};

When the BPF scheduler is initialized, an init() callback will be called. simple_init () changes all tasks (governed by the default CFS/EEVDF scheduler) to be under the control of scx_simple (scx_bpf_switch_all()) if a command line option is specified (!switch_partial). Then it creates a centralized task queue (scx_bpf_create_dsq(SHARED_DSQ, -1)).

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
if (!switch_partial)
scx_bpf_switch_all();

return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

When a task starts to be controlled by sched_ext (for instance, after fork(), clone(), scx_bpf_switch_all()), an enable() callback will be called. It is the best time to initialize per-task information. struct sched_ext_entity is defined to store sched_ext-specific scheduling information, such as time slice (slice) and task's virtual time (dsq_vtime), and it is added to struct task_struct (p->scx). simple_enable() simply initializes the task's virtual time to the current time (p->scx.dsq_vtime = vtime_now).

void BPF_STRUCT_OPS(simple_enable, struct task_struct *p,
struct scx_enable_args *args)
{
p->scx.dsq_vtime = vtime_now;
}

When a task p becomes ready to run, the enqueue() callback is called. simple_enqueue() dispatches a task to either the local or global dispatch queue (DSQ) (SCX_DSQ_LOCAL vs. SHARED_DSQ inscx_bpf_dispatch()) for execution depending on its flag (enq_flag) set by the underlying sched_ext framework. The local dispatch queue (SCX_DSQ_LOCAL) is a kind of staging area waiting for execution, so the underlying sched_ext framework fetches a task from the per-CPU local DSQ and schedules that task to run on a CPU. simple_enqueue() manages tasks on the local DSQ in a FIFO order (scx_bpf_dispatch()) but tasks on the global DSQ in a vtime order (scx_bpf_dispatch_vtime()). When dispatching a task to a DSQ, the time slice of a task is set as an argument of scx_bpf_dispatch(). scx_simple sets the time slice to the default value, SCX_SLICE_DFL (20 msec).

void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
/*
* If scx_select_cpu_dfl() is setting %SCX_ENQ_LOCAL, it indicates that
* running @p on its CPU directly shouldn't affect fairness. Just queue
* it on the local FIFO.
*/

if (enq_flags & SCX_ENQ_LOCAL) {
stat_inc(0); /* count local queueing */
scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
return;
}

stat_inc(1); /* count global queueing */

if (fifo_sched) {
scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
} else {
u64 vtime = p->scx.dsq_vtime;

/*
* Limit the amount of budget that an idling task can accumulate
* to one slice.
*/

if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
vtime = vtime_now - SCX_SLICE_DFL;

scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
enq_flags);
}
}

When a task p gets scheduled out, the stopping() callback is called. simple_stopping() accumulates the consumed time slice so far (SCX_SLICE_DFL - p->scx.slice) to the task's vtime (p->scx.dsq_vtime) based on its nice priority (p->scx.weight).

void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
{
/*
* Scale the execution time by the inverse of the weight and charge.
*
* Note that the default yield implementation yields by setting
* @p->scx.slice to zero and the following would treat the yielding task
* as if it has consumed all its slice. If this penalizes yielding tasks
* too much, determine the execution time by taking explicit timestamps
* instead of depending on @p->scx.slice.
*/

p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
}

When a per-CPU local DSQ becomes empty, there is no task to run on a CPU. In this case, the dispatch() callback is called. simple_dispatch() moves a task from the global DSQ to a local DSQ (scx_bpf_consume(SHARED_DSQ)) for execution. Since the global DSQ is ordered by vtime (scx_bpf_dispatch_vtime(p, SHARED_DSQ, ...)), scx_simple picks a task with the smallest vtime, which consumed the least amount of CPU time so far, and it moves the task to the local DSQ for the execution.

void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
scx_bpf_consume(SHARED_DSQ);
}

What's next? #

So far, I have introduced the concept of process scheduler, and the key idea of the default scheduler in Linux (CFS and EEVDF). Then, I shared the motivation of sched_ext framework and how to run on your machine. Finally, I skimmed through how a BPF scheduler is organized. In the next blog post, I will dive into what's going on under the hood. Please stay tuned!