Changwoo Writes Code

concurrent, de-centralized, but still organized world under the hood

sched_ext: scheduler architecture and interfaces (Part 2)

This is the second blog post about sched_ext, a BPF-based extensible scheduler class. Just in case you didn’t read the first one, you can find it here. In this blog post, I briefly update what has been happening in sched_ext, then introduce the scheduler architecture and the sched_ext API. After reading this, you should have a good understanding of the sched_ext architecture and be ready to read the source code of any sched_ext scheduler. Let’s get started!

Now, it is happening! #

After a long debate about the sched_ext patch set, Linus finally approved its inclusion in the 6.11 kernel. As of this writing, the V7 patch set has been posted to the mailing list. In addition, the developer and user communities around sched_ext are expanding: more developers have been attending the weekly office hours. If you are interested in scheduler development, please join our Slack channel. Also, many enthusiasts (especially from the CachyOS community) have been actively testing and benchmarking the sched_ext schedulers and reporting bugs.

Where to start developing a sched_ext scheduler? #

As discussed in the first blog post, sched_ext is an extensible scheduler class, so more than one sched_ext-based scheduler exists (e.g., scx_lavd, scx_rusty, and scx_rustland). To play with sched_ext schedulers, you should use a sched_ext-enabled kernel. There are three ways to get a sched_ext-enabled kernel:

Now, you are ready to hack. Congrats!

Understanding the big picture of Linux schedulers #

Before getting your hands dirty, let’s step back and see the big picture of the Linux scheduler first.

The following diagram shows the architecture of the Linux scheduler. As the figure illustrates, the Linux scheduler conceptually consists of two layers. The core kernel scheduler (core.c) defines the common behavior of all scheduling policies; for example, it defines what to do upon a timer interrupt (or scheduler tick). At a high level, it is analogous to an abstract base class for all schedulers, in C++ terms.

Concrete scheduling policies are defined on top of the core kernel scheduler. For example, Linux provides real-time scheduling policies (rt.c) such as FIFO (First-In, First-Out, SCHED_FIFO) and RR (Round Robin, SCHED_RR). The fair scheduling policy (SCHED_NORMAL, EEVDF: Earliest Eligible Virtual Deadline First) governs non-realtime regular tasks.

This two-level architecture allows one to write a new scheduling policy while reusing the logic of the core kernel scheduler. In Linux, such an architecture is called a scheduler class.

                                                  || User-space part of   ||
                                                  || your scheduler       ||
                                                  || (e.g.,      ||
~~~~~~~~~~~~~~~~ <kernel space vs. user space boundary > ~~~~~~~~~~~~~~~~~~~
                                                  || Your BPF scheduler   ||
                                                  || (e.g., main.bpf.c)   ||
+-----------------------+ +---------------------+ +========================+
| Default scheduler     | | Real-time scheduler | || Sched_ext framework  ||
| (EEVDF)               | | (FIFO, RR)          | ||                      ||
| (kernel/sched/fair.c) | | (kernel/sched/rt.c) | || (kernel/sched/ext.c) ||
+-----------------------+ +---------------------+ +========================+
|   Core kernel scheduler                                                  |
|   (kernel/sched/core.c)                                                  |
+--------------------------------------------------------------------------+

The sched_ext framework inside the kernel (ext.c) is defined on top of the core kernel scheduler like any other scheduler class (e.g., fair.c and rt.c). That implies all scheduler classes communicate with the core kernel scheduler via the same interface; we will discuss that interface shortly.

The sched_ext framework is a vessel that ships a BPF scheduler implementing a user-defined policy. It adds two more layers on top of the in-kernel framework: a BPF scheduler layer and a user-space process that interacts with the BPF scheduler. Writing a new sched_ext scheduler means writing a new BPF scheduler and its user-space counterpart.

Sharing time is different from sharing space #

You may notice that the idea of a scheduler class is somewhat similar to the idea of a VFS (Virtual File System) layer for file systems. Both define the common behavior of concrete policies (e.g., a new scheduling policy or file system layout) and the common interfaces for them.

Yet, there is a key distinction. VFS manages spatially partitioned disk (or drive) space, so multiple file systems can exist simultaneously, as long as each manages a disjoint portion of that space.

 SCHED_FIFO => SCHED_RR => SCHED_NORMAL (either EEVDF or sched_ext)

In contrast, scheduling is about how to slice and use CPU time, so at any given moment, only one scheduling policy can make scheduling decisions. In that sense, the scheduler classes are hierarchical in using CPU time. Real-time schedulers always take CPU time first (SCHED_FIFO => SCHED_RR). Only when no real-time task wants more CPU time do the schedulers for “normal” tasks (SCHED_NORMAL) get their turn. Since EEVDF and sched_ext are both defined under SCHED_NORMAL, they cannot be active at the same time. When a sched_ext-based scheduler becomes active, it takes over all normal-class tasks, replacing EEVDF.*

(*Note that sched_ext provides a special mode for scheduling only a subset of normal-class tasks, mainly for testing purposes. I won’t discuss it in this post for brevity.)

Interface matters #

A sched_ext scheduler consists of four layers, as discussed from bottom to top: 1) core kernel scheduler, 2) sched_ext framework, 3) BPF scheduler, and 4) BPF’s user-space counterpart. These layers interact with each other through (relatively) well-defined interfaces. We now discuss those interfaces to understand how a scheduler works in action.

|| User-space part of   ||
|| your scheduler       ||
|| (e.g.,      ||
   \\//   ^^^^
   \\//   ^^^^ <== Interface 4: BPF scheduler <=> user-space counter part
   \\//   ^^^^
|| Your BPF scheduler   ||
|| (e.g., main.bpf.c)   ||
   ^^^^   \\// <== Interface 3: BPF scheduler => sched_ext framework
   ^^^^   \\//
   ^^^^  <======== Interface 2: sched_ext framework => BPF scheduler
|| Sched_ext framework  ||
||                      ||
|| (kernel/sched/ext.c) ||
          ^^^^ <== Interface 1: core kernel scheduler => scheduler class
| Core kernel scheduler  |
|  (kernel/sched/core.c) |

Interface 1: core kernel scheduler ⇒ scheduler class #

The core kernel scheduler defines the common underlying behavior for scheduling and defines an interface for concrete scheduler classes (e.g., SCHED_FIFO, SCHED_NORMAL). It defines an ops structure (struct sched_class) – a struct of function pointers – that needs to be filled by a concrete scheduler class to implement a concrete scheduling policy.

In the Linux kernel, a task (thread or process) is represented by struct task_struct, and a runnable (i.e., non-sleeping, ready-to-run) task should be in one of the run queues (struct rq) associated with each CPU. Scheduling is, in essence, the problem of managing runnable tasks through run queues.

Whenever a scheduling policy needs to take a policy-specific action, the core kernel scheduler calls an operation defined in struct sched_class. For example, when the core kernel scheduler needs to select a task to be scheduled, it calls the sched_class.pick_next_task(rq) callback of a concrete scheduling policy. When a task becomes runnable, the core kernel scheduler calls sched_class.enqueue_task(rq, p, flags) so the concrete scheduling policy enqueues task p to run queue rq. When a task’s runtime state needs to be updated, the core kernel scheduler calls sched_class.update_curr(rq).

/* kernel/sched/sched.h */
struct sched_class {
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)(struct rq *rq);
    bool (*yield_to_task)(struct rq *rq, struct task_struct *p);

    void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);

    struct task_struct *(*pick_next_task)(struct rq *rq);

    void (*put_prev_task)(struct rq *rq, struct task_struct *p);
    void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

    int (*balance)(struct rq *rq, struct task_struct *prev,
                   struct rq_flags *rf);
    int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);

    struct task_struct *(*pick_task)(struct rq *rq);

    void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

    void (*task_woken)(struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                             struct affinity_context *ctx);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);

    struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);

    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork)(struct task_struct *p);
    void (*task_dead)(struct task_struct *p);

    void (*switching_to)(struct rq *this_rq, struct task_struct *task);
    void (*switched_from)(struct rq *this_rq, struct task_struct *task);
    void (*switched_to)(struct rq *this_rq, struct task_struct *task);
    void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
                          int newprio);
    void (*prio_changed)(struct rq *this_rq, struct task_struct *task,
                         int oldprio);

    unsigned int (*get_rr_interval)(struct rq *rq,
                                    struct task_struct *task);

    void (*update_curr)(struct rq *rq);
    /* ... */
};

The sched_ext kernel framework implements the functions required by struct sched_class. For example, when the core kernel scheduler enqueues a task, selects the next task to run, or updates a task’s runtime state, it eventually calls enqueue_task_scx(), pick_next_task_scx(), or update_curr_scx(), respectively, through struct sched_class.

/* kernel/sched/ext.c */
DEFINE_SCHED_CLASS(ext) = {
    .enqueue_task   = enqueue_task_scx,
    .dequeue_task   = dequeue_task_scx,
    .yield_task     = yield_task_scx,
    .yield_to_task  = yield_to_task_scx,

    .wakeup_preempt = wakeup_preempt_scx,

    .pick_next_task = pick_next_task_scx,

    .put_prev_task  = put_prev_task_scx,
    .set_next_task  = set_next_task_scx,

    .balance        = balance_scx,
    .select_task_rq = select_task_rq_scx,
    .set_cpus_allowed = set_cpus_allowed_scx,

    .rq_online      = rq_online_scx,
    .rq_offline     = rq_offline_scx,

    .pick_task      = pick_task_scx,

    .task_tick      = task_tick_scx,

    .switching_to   = switching_to_scx,
    .switched_from  = switched_from_scx,
    .switched_to    = switched_to_scx,
    .reweight_task  = reweight_task_scx,
    .prio_changed   = prio_changed_scx,

    .update_curr    = update_curr_scx,

    .uclamp_enabled = 1,
    /* ... */
};

The sched_ext framework provides the common implementation for BPF schedulers. For example, when a task’s state needs to be updated, update_curr_scx() in the sched_ext decrements the time slice of the currently running task.

static void update_curr_scx(struct rq *rq)
{
    struct task_struct *curr = rq->curr;
    u64 now = rq_clock_task(rq);
    u64 delta_exec;

    delta_exec = now - curr->se.exec_start;
    /* ... */
    curr->scx.slice -= min(curr->scx.slice, delta_exec);
    /* ... */
}

Interface 2: sched_ext framework ⇒ BPF scheduler #

Let’s think about another case: enqueuing a runnable task to a run queue. The core kernel scheduler will call sched_class.enqueue_task(), which is actually enqueue_task_scx() in the sched_ext framework. How should the enqueue_task_scx() function be designed? The proper data structure and how tasks are organized in the queue depend on the BPF scheduler’s scheduling policy. For example, a FIFO queue would be the best data structure for a FIFO policy. For a deadline-based scheduling policy, it would be nice to maintain runnable tasks ordered by their deadlines; any ordered map (e.g., a red-black tree) would be appropriate.

Now let’s briefly examine how sched_ext implements enqueue_task_scx(). As the following code snippet shows, enqueue_task_scx() eventually calls scx_ops.enqueue(args), which is an enqueue implementation of a BPF scheduler.

/* kernel/sched/ext.c */

static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
{
    int sticky_cpu = p->scx.sticky_cpu;

    enq_flags |= rq->scx.extra_enq_flags;
    /* ... */
    do_enqueue_task(rq, p, enq_flags, sticky_cpu);
}

static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
                            int sticky_cpu)
{
    /* ... */
    SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags);
    /* ... */
}

#define SCX_CALL_OP_TASK(mask, op, task, args...)   \
do {                                                \
    /* ... */                                       \
    SCX_CALL_OP(mask, op, task, ##args);            \
    /* ... */                                       \
} while (0)

#define SCX_CALL_OP(mask, op, args...)              \
do {                                                \
    /* ... */                                       \
    scx_ops.op(args);                               \
    /* ... */                                       \
} while (0)

The sched_ext framework defines an operation structure – struct sched_ext_ops, where scx_ops.enqueue() is defined. The following code snippet is a simplified version of struct sched_ext_ops. The full version with API description can be found in the source code.

/* kernel/sched/ext.c */
struct sched_ext_ops {
    s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);

    void (*enqueue)(struct task_struct *p, u64 enq_flags);
    void (*dispatch)(s32 cpu, struct task_struct *prev);

    void (*tick)(struct task_struct *p);

    void (*runnable)(struct task_struct *p, u64 enq_flags);
    void (*running)(struct task_struct *p);
    void (*stopping)(struct task_struct *p, bool runnable);
    void (*quiescent)(struct task_struct *p, u64 deq_flags);

    /* ... */

    void (*update_idle)(s32 cpu, bool idle);

    /* ... */

    s32 (*init_task)(struct task_struct *p, struct scx_init_task_args *args);
    void (*exit_task)(struct task_struct *p, struct scx_exit_task_args *args);

    /* ... */

    s32 (*init)(void);
    void (*exit)(struct scx_exit_info *info);

    /* ... */
};

static struct sched_ext_ops scx_ops;

There seem to be a lot of callbacks in a BPF scheduler, but most of them are pretty straightforward.

You get control when a BPF scheduler is loaded and unloaded: the sched_ext_ops.init() and sched_ext_ops.exit() callbacks are your tools to initialize and de-initialize scheduler-wide data structures. This is where you can create and destroy per-CPU structures and global run queues managed by the BPF scheduler.

When a new task is created, sched_ext_ops.init_task() is called. As you can expect, sched_ext_ops.exit_task() is called when a task is terminated. You can manage per-task data structures there.

During its lifetime, a task first becomes runnable (sched_ext_ops.runnable()), then it is picked by the scheduler and runs (sched_ext_ops.running()). When its time slice is exhausted, the task is stopped and scheduled out (sched_ext_ops.stopping()). The task becomes non-runnable (sched_ext_ops.quiescent()) when, for example, it waits for an IO event.

When a task wakes up, for example, when an IO event is delivered, the BPF scheduler should first decide which CPU the task should run on (sched_ext_ops.select_cpu()). Then, the task will be enqueued to a BPF-managed run queue (sched_ext_ops.enqueue()).

Each CPU keeps executing the tasks assigned to it. When a CPU has nothing to run, it gets a task from the BPF-managed run queue (sched_ext_ops.dispatch()). If even the BPF-managed run queue has no task to run, the CPU goes idle to save power until there is something to run. The transitions into and out of the idle state can be tracked with sched_ext_ops.update_idle().

Upon every scheduler tick (i.e., every 1/HZ seconds), sched_ext_ops.tick() is called, so the BPF-scheduler can periodically do chores (e.g., power management, preemption, etc).

That’s (almost) all! Writing a new BPF scheduler is nothing but implementing those struct sched_ext_ops functions. Note that the sched_ext framework provides a default implementation for each callback, so when the BPF scheduler does not provide one, the default implementation is executed. That’s straightforward, huh?

Interface 3: BPF scheduler ⇒ sched_ext framework #

A BPF scheduler needs to talk to the sched_ext kernel framework. For instance, after choosing a task to run, it must somehow notify the sched_ext framework. There are two channels from a BPF scheduler to the sched_ext framework: 1) BPF helper functions and 2) dispatch queues (DSQs). DSQs themselves are created and manipulated through the BPF helper functions. Here is a list of some essential BPF helper functions. A full list of BPF helper functions is here, and their implementations with full API descriptions are here.

/* tools/sched_ext/include/scx/common.bpf.h */
s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;

void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id,
                      u64 slice, u64 enq_flags) __ksym;
void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice,
                            u64 vtime, u64 enq_flags) __ksym;

bool scx_bpf_consume(u64 dsq_id) __ksym;
bool scx_bpf_consume_task(unsigned long it, struct task_struct *p) __ksym;

s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
                           u64 wake_flags, bool *is_idle) __ksym;

void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;

The dispatch queue (DSQ) is the core construct between a BPF scheduler and the sched_ext kernel framework, so many BPF helper functions revolve around DSQs. A DSQ is a queue holding runnable tasks. Tasks in a DSQ can be ordered in arrival order (FIFO) or in vtime (virtual time) order, where vtime is an integer associated with a task. The sched_ext framework also creates and maintains its own internal DSQs: 1) one system-wide (the global DSQ) and 2) one per CPU (the local DSQs). Both kinds of internal DSQs are FIFO-ordered. Each CPU in the sched_ext framework runs the tasks on its local DSQ in FIFO order.

A BPF scheduler can also create DSQs (either FIFO or vtime DSQ) to manage runnable tasks in its own way (scx_bpf_create_dsq()). sched_ext_ops.init() is a typical place to create custom DSQs.

A task can be enqueued to a DSQ in FIFO order (scx_bpf_dispatch()) or in vtime order (scx_bpf_dispatch_vtime()), typically in sched_ext_ops.enqueue().

A task in a DSQ can be consumed from the tip (scx_bpf_consume()), or a specific task can be consumed (scx_bpf_consume_task()), typically in sched_ext_ops.dispatch(). Here, consuming a task means removing it from a custom DSQ and moving it to the internal local DSQ, so the CPU can run it from its local DSQ.

In addition, the number of tasks in a DSQ can be queried (scx_bpf_dsq_nr_queued()). In the end, a DSQ should be destroyed (scx_bpf_destroy_dsq()).

Besides the DSQ-related BPF helpers, two other helper functions are notable: scx_bpf_select_cpu_dfl() and scx_bpf_kick_cpu(). scx_bpf_select_cpu_dfl() finds an appropriate CPU to run a task using the default CPU selection policy; it can be used in sched_ext_ops.select_cpu(). scx_bpf_kick_cpu() sends an IPI (Inter-Processor Interrupt) to another CPU, either to wake up an idle CPU (SCX_KICK_IDLE flag) or to preempt out the currently running task (SCX_KICK_PREEMPT flag).

Interface 4: BPF scheduler ⟺ user-space counterpart #

Since the BPF scheduler is a regular BPF program, you can use any BPF tricks you like. Its user-space counterpart (typically written in C or Rust) can use any libbpf API. Most importantly, BPF maps can be accessed and manipulated from a user-space program (e.g., via bpf_map_lookup_elem()).

What’s next? #

The best way to digest the sched_ext concept is to read the existing BPF scheduler code. I strongly encourage you to read the source code of one of scx_lavd, scx_rusty, or scx_rustland. Then, you will have a solid foundation to write a new BPF scheduler or improve an existing one. In the next blog post, I will discuss how to find out what to improve as an author of the scx_lavd scheduler. I hope you enjoyed this blog post. Happy scheduling until the next blog post!