LinuxDays 2024

Kernel CPU schedulers specialized to an application
13.10, 12:00–12:50 (Europe/Prague), 107
Language: English

We'll explore ways to design single-purpose task schedulers, specialized for exactly one application and one hardware platform, to optimize performance. This approach is made affordable by a recent patch to the Linux kernel introducing Sched-Ext, an additional scheduler class that loads user-defined task schedulers in the form of BPF programs ("Ext" here stands for "extensible").


The CPU scheduler is part of the operating system's kernel; it's in charge of choosing the CPU on which application threads execute, when they start running, and for how long. On Linux, most applications use the Fair scheduler, also called EEVDF after the name of its algorithm, which recently replaced the CFS algorithm. The Fair scheduler must accommodate the execution needs of all software, from interactive desktop apps to enterprise databases, and all hardware platforms, from low-power mobile devices to multi-socket server machines. We can call it a "general-purpose scheduler": it runs well enough everywhere, and necessarily makes compromises so that no platform and no app is left behind.

In this presentation, we'll explore a different paradigm: the single-purpose scheduler, specialized for exactly one application and one hardware platform, to optimize performance. This approach is made affordable by a recent patch to the Linux kernel introducing Sched-Ext, an additional scheduler class that loads user-defined task schedulers in the form of BPF programs ("Ext" here stands for "extensible"). Given an app, such as a web browser, a video game, or a database, and a computer system (it could be our laptop or a server machine), we'll design our own task scheduler following a four-step method:

  1. Instrumentation. First, we need to characterize the workload we intend to optimize, and devise a tracing strategy that is both informative about its behavior and low-overhead, so as not to skew our data collection.
  2. Analysis of resource usage. Once we have data, we need to identify which resource is actually limiting our workload's performance. For example, if we find that most time is spent on lock contention, it won't help to improve the storage access pattern, since the app isn't I/O bound.
  3. Algorithm design. Now that we know which resources are critical to the workload, we have to think of ways in which task scheduling can allocate them more efficiently. If we see some threads getting interrupted often before finishing their work, and spending a long time getting back to where they left off at the end of the previous timeslice, we may want to reduce preemption while they're running.
  4. Tuning. Our initial algorithm design won't be perfect on the first try, so we'll leave some parameters in the implementation and iterate our experiments until we find the best values for those parameters.
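As a toy illustration of steps 1 and 2: suppose the tracer we set up in step 1 emits (task, state, seconds) samples; aggregating time per state then tells us which resource dominates. The sample data, task names, and state labels below are made up for illustration:

```python
from collections import Counter

def time_per_state(samples):
    """Sum traced time by state (e.g. 'running', 'lock_wait', 'io_wait')."""
    totals = Counter()
    for _task, state, seconds in samples:
        totals[state] += seconds
    return totals

# Hypothetical trace of a two-worker app; not real measurements.
samples = [
    ("worker-0", "running",   1.2),
    ("worker-0", "lock_wait", 3.9),
    ("worker-1", "running",   1.0),
    ("worker-1", "lock_wait", 4.1),
    ("worker-1", "io_wait",   0.3),
]

totals = time_per_state(samples)
bottleneck = max(totals, key=totals.get)
print(bottleneck)  # → lock_wait: this app is limited by lock contention, not I/O
```

With this picture in hand, step 3 would focus on lock-related scheduling decisions rather than, say, I/O batching.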
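The preemption reasoning in step 3 and the tuning loop of step 4 can be sketched with a toy single-CPU model (this is a simulation, not Sched-Ext code; all numbers and function names are illustrative):

```python
def run_workload(num_tasks, work_per_task, timeslice, preemption_cost):
    """Round-robin one CPU over identical tasks. Every time a task is
    preempted before finishing, it pays `preemption_cost` extra work
    (modeling lost cache warmth). Returns total time to finish all tasks."""
    remaining = [work_per_task] * num_tasks
    total_time = 0
    while any(r > 0 for r in remaining):
        for i, r in enumerate(remaining):
            if r <= 0:
                continue
            ran = min(r, timeslice)
            total_time += ran
            remaining[i] = r - ran
            if remaining[i] > 0:          # preempted mid-work
                remaining[i] += preemption_cost
    return total_time

def tune_timeslice(candidates, **workload):
    """Step 4 in miniature: try each candidate value, keep the best one."""
    return min(candidates, key=lambda ts: run_workload(timeslice=ts, **workload))

params = dict(num_tasks=8, work_per_task=100, preemption_cost=5)
for ts in (10, 50, 100):
    print(ts, run_workload(timeslice=ts, **params))
# → 10 1520
# → 50 880
# → 100 800   (long enough that no task is ever preempted)
print(tune_timeslice((10, 50, 100), **params))  # → 100
```

The model captures the intuition from step 3: when resuming a preempted thread is expensive, fewer and longer timeslices win, and the sweep in `tune_timeslice` is the kind of iteration step 4 automates against the real workload.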

Difficulty

Advanced

I joined SUSE in 2016 and have since been working on Linux kernel performance, task scheduling and power management.