Stupid RCU Tricks: How Read-Intensive is The Kernel’s Use of RCU?

RCU is a specialized synchronization mechanism, and is typically used where there are far more readers (rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), and so on) than there are updaters (synchronize_rcu(), call_rcu(), rcu_assign_pointer(), and so on). But does the Linux kernel really make heavier use of RCU’s read-side primitives than of its update-side primitives?

One way to determine this would be to use something like ftrace to record all the calls to these functions. This works, but trace messages can be lost, especially when applied to frequently invoked functions. Also, dumping out the trace buffer can perturb the system. Another approach is to modify the kernel source code to count these function invocations in a cache-friendly manner, then come up with some way to dump this to userspace. This works, but I am lazy. Yet another approach is to ask the tracing folks for advice.

This last is what I actually did, and because the tracing person I happened to ask happened to be Andrii Nakryiko, I learned quite a bit about BPF in general and the bpftrace command in particular. If you don’t happen to have Andrii on hand, you can do quite well with Appendix A and Appendix B of Brendan Gregg’s “BPF Performance Tools”. You will of course need to install bpftrace itself, which is reasonably straightforward on many Linux distributions.

Linux-Kernel RCU Read Intensity

Those of you who have used sed and awk have a bit of a running start because you can invoke bpftrace with a -e argument and a series of tracepoint/program pairs, where a program is bpftrace code enclosed in curly braces. This code is compiled, verified, and loaded into the running kernel. When the code finishes executing, the results are printed right there for you on stdout. For example:

bpftrace -e 'kprobe:__rcu_read_lock { @rcu_reader = count(); } kprobe:rcu_gp_fqs_loop { @gp = count(); } interval:s:10 { exit(); }'

This command uses the kprobe facility to attach a program to the __rcu_read_lock() function and to attach a very similar program to the rcu_gp_fqs_loop() function, which happens to be invoked exactly once per RCU grace period. Both programs count the number of calls, with @rcu_reader and @gp being the bpftrace “variables” that accumulate the counts, and the count() function doing the counting in a cache-friendly manner. The final interval:s:10 in effect attaches a program to a timer, so that this last program will execute every 10 seconds (“s:10”). Except that this program invokes the exit() function, which terminates the bpftrace session at the end of the very first 10-second time interval. Upon termination, bpftrace outputs the following on an idle system:

Attaching 3 probes...

@gp: 977
@rcu_reader: 6435368

In other words, there were about a thousand grace periods and more than six million RCU readers during that 10-second time period, for a read-to-grace-period ratio of more than six thousand. This certainly qualifies as read-intensive.

But what if the system is busy? Much depends on exactly how busy the system is, as well as exactly how it is busy, but let’s use that old standby, the kernel build (but using the nice command to avoid delaying bpftrace). Let’s also put the bpftrace script into a creatively named file rcu1.bpf like so:

kprobe:__rcu_read_lock
{
        @rcu_reader = count();
}

kprobe:rcu_gp_fqs_loop
{
        @gp = count();
}

interval:s:10
{
        exit();
}

This allows the command bpftrace rcu1.bpf to produce the following output:

Attaching 3 probes...

@gp: 274
@rcu_reader: 78211260

Where the idle system had about one thousand grace periods over the course of ten seconds, the busy system had only 274. On the other hand, the busy system had 78 million RCU read-side critical sections, more than ten times that of the idle system. The busy system had more than one quarter million RCU read-side critical sections per grace period, which is seriously read-intensive.

RCU works hard to make the same grace-period computation cover multiple requests. Because synchronize_rcu() invokes call_rcu(), we can use the number of call_rcu() invocations as a rough proxy for the number of updates, that is, the number of requests for a grace period. (The more invocations of synchronize_rcu_expedited() and kfree_rcu(), the rougher this proxy will be.)

We can make the bpftrace script more concise by assigning the same action to a group of tracepoints, as in the rcu2.bpf file shown here:

kprobe:__rcu_read_lock, kprobe:call_rcu, kprobe:rcu_gp_fqs_loop { @[func] = count(); } interval:s:10 { exit(); }

With this file in place, bpftrace rcu2.bpf produces the following output in the midst of a kernel build:

Attaching 4 probes...

@[rcu_gp_fqs_loop]: 128
@[call_rcu]: 195721
@[__rcu_read_lock]: 21985946

These results look quite different from the earlier kernel-build results, confirming any suspicions you might harbor about the suitability of kernel builds as a repeatable benchmark. Nevertheless, there are more than 170K RCU read-side critical sections per grace period, which is still seriously read-intensive. Furthermore, there are also more than 1.5K call_rcu() invocations per RCU grace period, which means that RCU is able to amortize the overhead of a given grace period down to almost nothing per grace-period request.

Linux-Kernel RCU Grace-Period Latency

The following bpftrace program makes a histogram of grace-period latencies, that is, the time from the call to rcu_gp_init() to the return from rcu_gp_cleanup():

kprobe:rcu_gp_init { @start = nsecs; } kretprobe:rcu_gp_cleanup { if (@start) { @gplat = hist((nsecs - @start)/1000000); } } interval:s:10 { printf("Internal grace-period latency, milliseconds:\n"); exit(); }

The kretprobe attaches the program to the return from rcu_gp_cleanup(). The hist() function computes a log-scale histogram. The check of the @start variable avoids a beginning-of-time value for this variable in the common case where this script starts in the middle of a grace period. (Try it without that check!)

The output is as follows:

Attaching 3 probes...
Internal grace-period latency, milliseconds:


@gplat: 
[2, 4)               259 |@@@@@@@@@@@@@@@@@@@@@@                              |
[4, 8)               591 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)              137 |@@@@@@@@@@@@                                        |
[16, 32)               3 |                                                    |
[32, 64)               5 |                                                    |

@start: 95694642573968

Most of the grace periods completed within four to eight milliseconds, with most of the remainder taking either two to four milliseconds or eight to sixteen milliseconds, but with a few stragglers taking up to 64 milliseconds. The final @start line shows that bpftrace simply dumps out all the variables. You can use the delete(@start) function to prevent printing of @start, but please note that the next invocation of rcu_gp_init() will re-create it.

It is nice to know the internal latency of an RCU grace period, but most in-kernel users will be more concerned about the latency of the synchronize_rcu() function, which will need to wait for the current grace period to complete and also for callback invocation. We can measure this function’s latency with the following bpftrace script:

kprobe:synchronize_rcu { @start[tid] = nsecs; } kretprobe:synchronize_rcu { if (@start[tid]) { @srlat = hist((nsecs - @start[tid])/1000000); delete(@start[tid]); } } interval:s:10 { printf("synchronize_rcu() latency, milliseconds:\n"); exit(); }

The tid variable contains the ID of the currently running task, which allows this script to associate a given return from synchronize_rcu() with the corresponding call by using tid as an index to the @start variable.

As you would expect, the resulting histogram is weighted towards somewhat longer latencies, though without the stragglers:

Attaching 3 probes...
synchronize_rcu() latency, milliseconds:


@srlat: 
[4, 8)                 9 |@@@@@@@@@@@@@@@                                     |
[8, 16)               31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)              31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@start[4075307]: 96560784497352

In addition, we again see a leftover value for @start. The delete statement gets rid of old entries, but any new call to synchronize_rcu() will create more of them.

Linux-Kernel Expedited RCU Grace-Period Latency

Linux kernels will sometimes execute synchronize_rcu_expedited() to obtain a faster grace period, and the following command will further cause synchronize_rcu() to act like synchronize_rcu_expedited():

echo 1 >  /sys/kernel/rcu_expedited

Doing this on a dual-socket system with 80 hardware threads might be ill-advised, but you only live once!

Ill-advised or not, the following bpftrace script measures synchronize_rcu_expedited() latency, but in microseconds rather than milliseconds:

kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat = hist((nsecs - @start[tid])/1000);
                delete(@start[tid]);
        }
}

interval:s:10 {
        printf("synchronize_rcu() latency, microseconds:\n");
        exit();
}

The output of this script run concurrently with a kernel build is as follows:

Attaching 3 probes...
synchronize_rcu() latency, microseconds:


@srelat: 
[128, 256)            57 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)            14 |@@@@@@@@@@@@                                        |
[512, 1K)              1 |                                                    |
[1K, 2K)               2 |@                                                   |
[2K, 4K)               7 |@@@@@@                                              |
[4K, 8K)               2 |@                                                   |
[8K, 16K)              3 |@@                                                  |

@start[4140285]: 97489845318700

Most synchronize_rcu_expedited() invocations complete within a few hundred microseconds, but with a few stragglers around ten milliseconds.

But what about linear histograms? This is what the lhist() function is for, with added minimum, maximum, and bucket-size arguments:

kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat = lhist((nsecs - @start[tid])/1000, 0, 1000, 100);
                delete(@start[tid]);
        }
}

interval:s:10 {
        printf("synchronize_rcu() latency, microseconds:\n");
        exit();
}

Running this with the usual kernel build in the background:

Attaching 3 probes...
synchronize_rcu() latency, microseconds:


@srelat: 
[100, 200)            26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 300)            13 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[300, 400)             5 |@@@@@@@@@@                                          |
[400, 500)             1 |@@                                                  |
[500, 600)             0 |                                                    |
[600, 700)             2 |@@@@                                                |
[700, 800)             0 |                                                    |
[800, 900)             1 |@@                                                  |
[900, 1000)            1 |@@                                                  |
[1000, ...)           18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |

@start[4184562]: 98032023641157

The final bucket is overflow, containing measurements that exceeded the one-millisecond limit.

Summary

The bpftrace command can be used to quickly and easily script compiled in-kernel programs that can measure and monitor a wide variety of things. This post focused on a few aspects of RCU, but quite a bit more material may be found in Brendan Gregg’s “BPF Performance Tools” book.


Stupid RCU Tricks: Is RCU Watching?

It is just as easy to ask why RCU wouldn’t be watching all the time. After all, you never know when you might need to synchronize!

Unfortunately, the concept of an eternally watchful RCU is problematic in the context of the Linux kernel due to energy-efficiency considerations. The problem is that for RCU to be watching an idle CPU, RCU would need that CPU to execute instructions. And making an idle CPU unnecessarily execute instructions (for a rather broad definition of the word “unnecessary”) will terminally annoy a great many people in the battery-powered embedded world. And for good reason: Making RCU avoid watching idle CPUs back in the day resulted in 30-40% increases in battery lifetime.

In this, CPUs are not all that different from people. When someone is deep in thought, interrupting them can cause them to lose something like 20 minutes of work. Similarly, when a CPU is deeply idle, asking it to execute instructions will consume not only the energy required for those instructions, but also much more energy to work its way out of that deep idle state, and then to return back to that deep idle state.

And this is why CPUs going idle tell RCU to stop watching them. This allows RCU to ignore them completely, where the most important part of ignoring them is refraining from asking them to execute instructions.

In some kernel configurations, RCU also ignores portions of the kernel’s entry/exit code, that is, the last bits of kernel execution before switching back to userspace execution and the first bits of kernel execution after switching away from userspace execution. This happens in kernels built with CONFIG_NO_HZ_FULL=y, but only on CPUs mentioned in the CPU list passed to the nohz_full kernel parameter. This enables carefully configured HPC applications and CPU-bound real-time applications to get near-bare-metal performance from such CPUs, while still having the entire Linux kernel at their beck and call. Because RCU is not watching such applications, RCU can permit the scheduling-clock interrupt to be turned off entirely, thus avoiding disturbing these types of performance-critical applications.

But if RCU is not watching a given CPU, that CPU invoking rcu_read_lock() has no effect, which can come as quite a surprise to the corresponding RCU read-side critical section, which naively expected to be able to safely traverse an RCU-protected data structure. This can be a nasty trap for the unwary, which is why kernels built with CONFIG_PROVE_LOCKING=y (lockdep) complain bitterly when rcu_read_lock() is invoked on a CPU that RCU is not watching.

But suppose that you have code that is invoked both from deep within the idle loop and from normal tasks. What can you do?

First, you can invoke rcu_is_watching(), which will return true if RCU is watching. And, as you might expect, lockdep uses this function to figure out when it should complain bitterly. So you invoke rcu_is_watching() and it helpfully returns false, indicating that you cannot invoke rcu_read_lock() and friends. What now?

You could do what the v5.18 Linux kernel’s kernel_text_address() does, which can be abbreviated as follows:

no_rcu = !rcu_is_watching();
if (no_rcu)
        rcu_nmi_enter(); // Make RCU watch!!!
do_rcu_traversals();
if (no_rcu)
        rcu_nmi_exit(); // Return RCU to its prior watchfulness state.

If your code is not so performance-critical, you can do what the arm64 implementation of cpu_suspend() does:

RCU_NONIDLE(__cpu_suspend_exit());

This macro forces RCU to watch while it executes its argument as follows:

#define RCU_NONIDLE(a) \
        do { \
                rcu_irq_enter_irqson(); \
                do { a; } while (0); \
                rcu_irq_exit_irqson(); \
        } while (0)

The rcu_irq_enter_irqson() and rcu_irq_exit_irqson() functions are essentially wrappers around the aforementioned rcu_nmi_enter() and rcu_nmi_exit().

Although RCU_NONIDLE() is easier to use and more compact than the more explicit approach used by kernel_text_address(), it is still a bit annoying to have to decorate your code in this manner. And this is why Peter Zijlstra has been reworking the various idle loops to cause RCU to be watching a much greater fraction of their code. This might well be an ongoing process as the idle loops continue gaining functionality, but Peter’s good work thus far at least makes RCU watch the idle governors and a much larger fraction of the idle loop’s trace events. When combined with the kernel entry/exit work by Peter, Thomas Gleixner, Mark Rutland, and many others, it is hoped that the functions not watched by RCU will all eventually be decorated with something like noinstr, for example:

static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
{
        return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
}

We don’t need to worry about exactly what this function does. For this blog entry, it is enough to know that its noinstr tag prevents tracing this function, making it more convenient for RCU to not be watching it.

What exactly are you prohibited from doing while RCU is not watching you?

As noted before, RCU readers are a no-go. If you try invoking rcu_read_lock(), rcu_read_unlock(), rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), or rcu_read_unlock_sched() while rcu_is_watching() would return false, lockdep will complain.

On the other hand, using SRCU (srcu_read_lock() and srcu_read_unlock()) is just fine, as is RCU Tasks Trace (rcu_read_lock_trace() and rcu_read_unlock_trace()). RCU Tasks Rude does not have explicit read-side markers, but anything that disables preemption acts as an RCU Tasks Rude reader no matter what rcu_is_watching() would return at the time. There is some reason to hope that RCU Tasks Rude

RCU Tasks is an odd special case. Like RCU Tasks Rude, RCU Tasks has implicit read-side markers, which are any region of non-idle-task kernel code that does not do a voluntary context switch (the idle tasks being handled by RCU Tasks Rude). Except that in kernels built with CONFIG_PREEMPTION=n and without any of RCU’s test suite, the RCU Tasks API maps to plain old RCU. This means that code not watched by RCU is ignored by RCU Tasks in such kernels. Given that RCU Tasks ignores the idle tasks, this affects only user entry/exit code in kernels built with CONFIG_NO_HZ_FULL=y, and even then, only on CPUs mentioned in the list of CPUs given to the nohz_full kernel boot parameter, but nevertheless a trap for the unwary.

In post-v5.18 mainline, you can build your kernel with CONFIG_FORCE_TASKS_RCU=y, in which case RCU Tasks will always be built into your kernel, avoiding this trap.

In summary, energy-efficiency, battery-lifetime, and application-performance/latency concerns force RCU to avert its gaze from idle CPUs, and, in kernels built with CONFIG_NO_HZ_FULL=y, also from CPUs on the low-level kernel entry/exit code paths. Fortunately, the recent changes have allowed RCU to be watching more code, but this being the kernel, corner cases will always be with us. This corner-case code from which RCU averts its gaze requires the special handling described in this post.


Parallel Programming: December 2021 Update

It is past time for another release of Is Parallel Programming Hard, And, If So, What Can You Do About It?. But first, what is the difference between an edition and a release?

The main difference is the level of validation. For example, during the several months leading up to the second edition, I read the entire book, fixing issues as I found them. So where an edition is a completed work, a release is primarily for the benefit of people who would like to see a recent snapshot of this book, but without having to install and run LaTeX and its dependencies.

Having cleared that up, here are the big-animal changes since the second edition:

  1. A lot of detailed work to make the new ebook-sized PDF usable, along with other formatting improvements, courtesy of Akira Yokosawa, SeongJae Park, and Balbir Singh.
  2. The nq build now places quick quizzes at the end of each chapter, courtesy of Akira Yokosawa.
  3. There are a few new sections in the “Locking” chapter, the “Putting It All Together” chapter, the “Advanced Synchronization: Memory Ordering” chapter, and the “Important Questions” appendix.
  4. There is now more validation information associated with much of the sample code throughout the book.
  5. The “RCU Usage” section has received a much-needed upgrade.
  6. There is now an index and a list of acronyms, courtesy of Akira Yokosawa.
  7. A new install_latex_package.sh script, courtesy of SeongJae Park.
  8. Greatly improved error handling, including scripts that check for common LaTeX formatting mistakes, courtesy of Akira Yokosawa.

Yang Lu and Zhouyi Zhou are translating the Second Edition to Chinese, and would greatly appreciate additional help.

Everyone mentioned above contributed a great many wordsmithing fixes, as did Chin En Lin, Elad Lahav, Zhouyi Zhou, and GitHub user “rootbeer”. A grateful “thank you” to everyone who contributed!


Stupid RCU Tricks: Removing CONFIG_RCU_FAST_NO_HZ

The CONFIG_RCU_FAST_NO_HZ Kconfig option was added many years ago to improve energy efficiency for systems having significant numbers of short bursts of idle time. Prior to the addition of CONFIG_RCU_FAST_NO_HZ, RCU would insist on keeping a given idle CPU’s scheduling-clock tick enabled until all of that CPU’s RCU callbacks had been invoked. On certain types of battery-powered embedded systems, these few additional scheduling-clock ticks would consume up to 40% of the battery lifetime. The people working on such systems were not amused, and were not shy about letting me know of their dissatisfaction with RCU’s life choices. Please note that “letting me know” did not take the form of flaming me on LKML. Instead, they called me on the telephone and yelled at me.

Given that history, why on earth would I even be thinking about removing CONFIG_RCU_FAST_NO_HZ, let alone queuing a patch series intended for the v5.17 merge window???

The reason is that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU. With this combination, CONFIG_RCU_FAST_NO_HZ=y is doing nothing but placing a never-taken branch in the fastpath to and from idle. Such systems should therefore run slightly faster and with slightly better battery lifetime if their kernels were instead built with CONFIG_RCU_FAST_NO_HZ=n, which would get rid of that never-taken branch.
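
For concreteness, designating every CPU of (say) an eight-CPU system as an rcu_nocbs CPU is a matter of building the kernel with CONFIG_RCU_NOCB_CPU=y and adding something like the following to the kernel boot command line (the exact CPU list is of course system-specific):

rcu_nocbs=0-7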

But given that battery-powered embedded folks badly wanted CONFIG_RCU_FAST_NO_HZ=y, and given that they are no longer getting any benefit from it, why on earth haven’t they noticed?

They have not noticed because rcu_nocbs CPUs do not invoke their own RCU callbacks. This work is instead delegated to a set of per-CPU rcuoc kthreads, with a smaller set of rcuog kthreads managing those callbacks and requesting grace periods as needed. By default, these rcuoc and rcuog kthreads are not bound, which allows the scheduler (and, for that matter, the system administrator) to take both performance and energy efficiency into account and to run those kthreads wherever is appropriate at any given time. In contrast, non-rcu_nocbs CPUs will always run their own callbacks, even if that means powering up an inconveniently placed portion of the system at an inconvenient time. This includes CONFIG_RCU_FAST_NO_HZ=y kernels, whose only advantage is that they power up inconveniently placed portions of systems at inconvenient times only 25% as often as would a non-rcu_nocbs CPU in a CONFIG_RCU_FAST_NO_HZ=n kernel.

In short, the rcu_nocbs CPUs’ practice of letting the scheduler decide where to run the callbacks is especially helpful on asymmetric systems (AKA big.LITTLE systems), as shown by data collected by Dietmar Eggeman and Robin Randhawa. This point is emphasized by the aforementioned fact that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU.

So if no one is getting any benefit from building their kernels with CONFIG_RCU_FAST_NO_HZ=y, why keep that added complexity in the Linux kernel? Why indeed, and hence the patch series intended for the v5.17 merge window.

So if you know of someone who is getting significant benefit from CONFIG_RCU_FAST_NO_HZ=y who could not get that benefit from booting with rcu_nocbs CPUs, this would be a most excellent time to let me know!


Stupid RCU Tricks: Creating Branches For the -rcu Tree

Several people have expressed interest in how I go about creating the topic branches in the -rcu tree. So when I created branches earlier this week, I actually kept track.

But why bother with topic branches? In my case, the reason is to allow anyone reviewing RCU patches to focus on a particular topic and to be able to keep that topic “in cache” while reviewing all of that topic’s patches. Even more important, it makes it easy for reviewers to completely ignore patches that are not of interest to them.

Preparation

If you intend to follow along by actually executing the commands (highly recommended), first run “git checkout dev.2021.11.30b” in your clone of the -rcu tree.

Before creating branches, you need some essential tools. First, you absolutely must be able to see what you are doing. For this, there is the gitk tool, or, for those forswearing GUIs, the tig tool. If you cannot see what you are doing, your picture of your repository will diverge from that of git, and git always wins. If you can tolerate GUIs at all, install gitk. It will change your life.

Once you have gitk or tig installed, use it to review the commits that you are looking to organize into branches. This allows checking for bugs and also helps identify potential topics. Pro tip: Expand the tool’s window to the full height of the screen. Another pro tip: Restrict gitk to the commits of interest, in my case by running “gitk v5.16-rc1..&”.

From here on out, I will just say “gitk”. If you are smart enough to use a non-GUI tool like tig, you are smart enough to translate. ;–)

Sorting Commits Into Topic Branches

The next step is to run this script, placing the output in a file:

#!/bin/bash
# Emit one numbered "pick" line per commit, with the newest commit numbered 1,
# then sort the lines oldest-first, which is the order "git rebase" expects.

git log --pretty=format:"%h %s" $1 | \
        awk '{ print NR, "pick " $0 }' | \
        sort -k1nr

I have this script in my bin directory under the name git-rebase-prep.sh, so I simply type:

git-rebase-prep.sh v5.16-rc1..dev.2021.11.30b > /tmp/rebase.txt

The first five lines of this file are as follows:

126 pick 12637ec9a1505 tools/memory-model:  Document locking corner cases
125 pick 5665cde49ec08 tools/memory-model: Make judgelitmus.sh note timeouts
124 pick 2e8007e79af5b tools/memory-model: Make cmplitmushist.sh note timeouts
123 pick 7a31810649915 tools/memory-model: Make judgelitmus.sh identify bad macros
122 pick 4e11469f67f2e tools/memory-model: Make judgelitmus.sh detect hard deadlocks

The first column is the commit number, with lower numbers meaning newer commits (that is, numbered in “git log” order), but the commits are ordered with the older commits first (that is, in the order that “git rebase” expects them). The second column is the commit’s SHA-1 ID, and the remaining columns are the commit-log subject line.

Edit this file in your choice of editor, but in a window that is the full height of your monitor. Adjust your editor so that you have access to two different regions of this file, for example, in vim, type control-W followed by the letter “s”. The upper pane will eventually have topic-branch names, each followed by the commits in that topic branch.

This time was a bit of a special case because this past merge window’s change to the CONFIG_PREEMPT_DYNAMIC Kconfig option broke all non-preemptible CONFIG_SMP=n torture-test scenarios. (The default choice of CONFIG_PREEMPT_DYNAMIC=y forces CONFIG_PREEMPTION=y, which breaks tests of Tiny RCU and Tiny SRCU.) Therefore, the first “topic branch” is a single commit that adds CONFIG_PREEMPT_DYNAMIC=n to all affected torture-test scenarios. All RCU-related topic branches are then placed on top of this commit, thus allowing each topic branch to be torture-tested separately.

This means that the first few lines of the “/tmp/rebase.txt” file are as follows:

Underneath:
26 pick 37cf3c8c1ebda rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios

@@@

And this line has been removed from the remainder of this file. The “@@@” is a marker separating the topic-branched commits from those still waiting to be classified.

From my scan of the commits, I tentatively created the following topic branches, so that the first few lines of the “/tmp/rebase.txt” file are now as follows:

Underneath:
26 pick 37cf3c8c1ebda rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios

clocksource.2021.11.30c
doc.2021.11.30c
exp.2021.11.30c
fastnohz.2021.11.30c
fixes.2021.11.30c
nocb.2021.11.30c
nolibc.2021.11.30c
rcutorture.2021.11.30c
srcu.2021.11.30c
tasks.2021.11.30c
torture.2021.11.30c
torturescript.2021.11.30c
Merge above.
kcsan.2021.11.30c (merge)
lkmm.2021.11.30c (merge, then merge lkmm-dev)
clocksource.2021.11.30c (merge)
@@@

In other words, we will cherry-pick the first commit, rebase 11 branches on top of it (clocksource.2021.11.30c through torturescript.2021.11.30c, that is, the commits subject to some form of torture testing), and then merge the last ten of these branches together. The kcsan commits will be rebased on top of v5.16-rc1, as will the lkmm commits (but with the lkmm-dev commits rebased on top of the lkmm commits). Each of the kcsan, lkmm, and lkmm-dev branches will be merged separately on top of the ten-way merge point, and finally the clocksource.2021.11.30c branch will be merged on top of all of that. There are always a few commits not ready for the next merge window or that are to be upstreamed elsewhere, and these will be rebased on top of the clocksource.2021.11.30c merge point.

Please feel free to run “gitk v5.16-rc1..dev.2021.11.30c” in order to get a firm picture of the desired state.

The next step is to go through the remaining commits and assign them to branches. This process brought home the fact that there are no kcsan commits this time around (but there were plenty last time and likely will be plenty more next time), so I removed the “kcsan.2021.11.30c (merge)” line from /tmp/rebase.txt. In addition, there were not all that many commits combined for the rcutorture.2021.11.30c and torture.2021.11.30c topic branches, so I merged them all into torture.2021.11.30c. This is why the numbers in the first column of each commit line in /tmp/rebase.txt are important: In some cases, it is necessary to rearrange the topic branches, and it is sometimes important to get the commits in the correct order.

At this point, the /tmp/rebase.txt file looks (almost) like this.

Time to start actually creating the topic branches!

Create Topic Branches

We need two gitk instances, one for the old state (“gitk v5.16-rc1..dev.2021.11.30b”) and another for the new state which we will launch shortly:

git checkout v5.16-rc1
git cherry-pick 37cf3c8c1ebda
gitk v5.16-rc1..&

The first of the above commands put us at the desired v5.16-rc1 base commit for the topic branches, which just so happens to be where dev.2021.11.30b is based. Important safety tip: Don’t try creating branches while also moving the pile of commits to some other base commit at the same time. There are just too many things that can go wrong when you try that.

The second of the above commands rebases (or “cherry-picks”) the aforementioned CONFIG_PREEMPT_DYNAMIC-compensation commit. The final command launches the other gitk instance that will show us the evolving state of the topic branches.

Note that my rebased CONFIG_PREEMPT_DYNAMIC-compensation commit happened to have the SHA-1 commit ID 8c0abfd6d2f6b0221194241ac2908751a2a0385f. If you are following along, you will need to change the git rebase argument for --onto to the corresponding SHA-1 commit ID in your tree. If you instead use my ID, everything will work, except that you will be rebasing on my rebased CONFIG_PREEMPT_DYNAMIC-compensation commit instead of your own. Rebasing some on mine and some on yours will probably actually work, but I leave the decision as to whether to carry out that experiment to your better judgment.

I suggest placing the old-state gitk at the left of your screen, the rebase.txt and your command window in the middle, and the new-state gitk at the right of your screen. Being able to see everything all the time is key to successful creation of topic branches.

The next set of commands rebases the clocksource-related commits on top of the rebased CONFIG_PREEMPT_DYNAMIC-compensation commit:

git branch clocksource.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a clocksource.2021.11.30c

The -i argument to the “git rebase” command specifies interactive mode, which will place you in an editor with a very long series of commits. In vim, type “c}” to delete all of those commits and enter vim insertion mode. Then copy and paste the two lines following clocksource.2021.11.30c from the rebase.txt file. Hit the “escape” key to exit vim insertion mode and then strip the commit numbers from all of the lines, for example, by using the “:1,.s/^[0-9]* //” vim command. Write out the file and exit the editor (in vim, your choice of ZZ, :x, :wq, and probably others as well), at which point git will commence rebasing.

Don’t forget to hit the <F5> key in the new gitk window after the completion of each rebase command.

The next several branches are constructed in a similar manner:

git branch doc.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a doc.2021.11.30c
git branch exp.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a exp.2021.11.30c
git branch fastnohz.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a fastnohz.2021.11.30c
git branch fixes.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a fixes.2021.11.30c
git branch nocb.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a nocb.2021.11.30c
git branch nolibc.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a nolibc.2021.11.30c

For each “git rebase” command, copy and paste the corresponding section of the rebase.txt file. Don’t forget to hit <F5> in the new-state gitk window after running each “git rebase” command.

Remember that “almost” I mentioned above about the rebase.txt file? We now get to one of the deviations. At this point, I belatedly realized that there was only one SRCU commit (441a467cd9793 “srcu: Prevent redundant __srcu_read_unlock() wakeup”) and that it would be better to add that single commit to the fixes.2021.11.30c topic branch:

git checkout fixes.2021.11.30c
git cherry-pick 441a467cd9793

Note that I ignored commit ordering in this case. It is safe to do so because this is the only commit in the whole pile that touches kernel/rcu/srcutiny.c, it touches no other file, and there are no other types of dependencies against the other commits. Sometimes you get lucky!

Next, the rest of the RCU-related branches:

git branch tasks.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a tasks.2021.11.30c
git branch torture.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a torture.2021.11.30c
git branch torturescript.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto 8c0abfd6d2f6b0221194241ac2908751a2a0385f \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a torturescript.2021.11.30c

Once again, copy and paste the commits from the corresponding section of the rebase.txt file and once again do not forget to hit <F5> after each “git rebase” command.

It is now time to do a merge! Unfortunately, I messed up the copy-and-paste. Twice. And then forgot that “git reset --hard” takes the current branch with it. Again, more than once. Then I accidentally included clocksource.2021.11.30c in the merge.

Perhaps building branches late in the day was a mistake, but here are the mistaken commands:

git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git checkout -B torturescript.2021.11.30c 90b21bcfb2846625c4f2e4a0bf5969543cef8ba7
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git checkout -B torturescript.2021.11.30c 90b21bcfb2846625c4f2e4a0bf5969543cef8ba7
git checkout 8c0abfd6d2f6b0221194241ac2908751a2a0385f
git merge clocksource.2021.11.30c doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c \
        fixes.2021.11.30c nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c \
        torture.2021.11.30c torturescript.2021.11.30c
git reset --hard 8c0abfd6d2f6b0221194241ac2908751a2a0385f

If nothing else, this sorry story shows that you can recover from mistakes. And making sure to hit <F5> in the new-state gitk means that all the commits are right there, enabling you to piece things back together.

But I suggest that you instead execute the following single command:

git checkout 8c0abfd6d2f6b0221194241ac2908751a2a0385f

Then hit <F5> in the new-state gitk window and verify that each topic branch has its branch name in place, that none of those branch names are in bold face, and that none of the little per-commit circles are yellow, except for the v5.16-rc1 commit. If all of that is in place, it is time to do the merge:

git merge doc.2021.11.30c exp.2021.11.30c fastnohz.2021.11.30c fixes.2021.11.30c \
        nocb.2021.11.30c nolibc.2021.11.30c tasks.2021.11.30c torture.2021.11.30c \
        torturescript.2021.11.30c

This will put you into your editor, where you can fill out a commit log for the merge. This is what I did, documenting the topics:

Merge branches 'doc.2021.11.30c', 'exp.2021.11.30c', 'fastnohz.2021.11.30c', 'fixes.2021.11.30c', 'nocb.2021.11.30c', 'nolibc.2021.11.30c', 'tasks.2021.11.30c', 'torture.2021.11.30c' and 'torturescript.2021.11.30c' into HEAD
    
doc.2021.11.30c: Documentation updates.
exp.2021.11.30c: Expedited-grace-period fixes.
fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ.
fixes.2021.11.30c: Miscellaneous fixes.
nocb.2021.11.30c: No-CB CPU updates.
nolibc.2021.11.30c: Tiny in-kernel library updates.
tasks.2021.11.30c: RCU-tasks updates, including update-side scalability.
torture.2021.11.30c: Torture-test in-kernel module updates.
torturescript.2021.11.30c: Torture-test scripting updates.

The text starting with “Merge branches” that extends up to but not including the blank line is a single long line, just so you know.

Yet again, be sure to hit <F5> in the new-state gitk to see the results of the merge. Copy the SHA-1 commit ID of this merge point somewhere, keeping in mind that clicking on a commit in gitk places that commit’s SHA-1 ID in your clipboard. Alternatively, you can use bash variables: “mergepoint=`git rev-parse HEAD`”.

The next task is to rebase the various Linux-kernel memory model (LKMM) commits, as always substituting the commits from the corresponding portions of the rebase.txt file:

git branch lkmm.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto v5.16-rc1 d069e38e66dbfeca541d7a2f9790ef8e8d3c911a lkmm.2021.11.30c
git branch lkmm-dev.2021.11.30c d069e38e66dbfeca541d7a2f9790ef8e8d3c911a
git rebase --onto c438b7d860b4c1acb4ebff6d8d946d593ca5fe1e v5.16-rc1 lkmm-dev.2021.11.30c

Don’t forget <F5>!

The next step is to check out the previous merge point (you will need to substitute the SHA-1 ID recorded above) and do the remaining merges:

git checkout 32e5555b62e6a5f7304b13470cd1881a49a38d18
git merge lkmm.2021.11.30c
git merge lkmm-dev.201.11.30c
git merge lkmm-dev.2021.11.30c
git merge clocksource.2021.11.30c

Finally, rebase the five remaining commits from the “On top:” section of the rebase.txt file:

git branch dev.2021.11.30c a1f5859d3cf6dee7647d39ec926774f05c8d5593
git rebase -i --onto ce410b77746063e1c03f6e5bf85d68cca3e1c7ea \
        d069e38e66dbfeca541d7a2f9790ef8e8d3c911a dev.2021.11.30c

Alternatively, you might instead choose to check out the new branch and then cherry-pick the five commits as follows:

git checkout -b dev.2021.11.30c
git cherry-pick 2a1d7ed8553da ca507a89ffbb9 0ebafffe561ac e824a258c37d4 74f780ac279b0

This checkout-and-cherry-pick operation is completely equivalent to the prior branch-and-rebase operation.

Either way, as always, don’t forget to hit <F5> in your new-state gitk.

Check Your Work!

And finally, the most important thing is to check your work:

git diff dev.2021.11.30b

If these diffs are non-empty, there was some mistake somewhere in the process.

Possible Complications

This topic-branch exercise went quite smoothly and took about an hour to complete. Sometimes life is harder:

  1. There might be rebase and/or merge conflicts. I normally take this as a hint that my assignment of commits to topic branches was suboptimal. The commit numbers assigned by the git-rebase-prep.sh script are quite helpful when rejuggling commits. Yes, sometimes the resulting conflicts must be resolved manually, but that is a blog post for another day.
  2. One of the topic branches might depend on another. In this case, I rebase the dependent topic branch on top of the topic branch that it depends on. Looking at it another way, sometimes the topic branches are arranged in series instead of in parallel.
  3. A commit might depend on a number of topics, and the rest of the topics might depend on that commit. This is rare, but it does happen. The trick in this case is to create and merge the topic branches that the commit depends on, cherry-pick that commit on top of the resulting merge point, and then create and merge the remaining topic branches on top of that commit.

If you made it all the way down here, thank you for your time and attention, and I hope that this material was helpful. If you remember nothing else from this blog post, let it be gitk (or tig, as the case may be). Being able to see what you are doing allows you to learn to git much faster and allows you to safely take on much more complicated git-related tasks.


What Memory Model Should the Rust Language Use?

UNDER CONSTRUCTION

This blog post discusses a few alternative Rust-language memory models. I hope that this discussion is of value to the Rust community, but in the end, it is their language, so it is also their choice of memory model.

This discussion takes the Rust fearless-concurrency viewpoint, tempered by discussions I have observed and participated in while creating this blog series. Different members of that community of course have different viewpoints, and thus might reasonably advocate for different choices. Those who know me will understand that these viewpoints differ significantly from my own. However, my viewpoint is dictated by my long-standing privilege of living at the edge of what is possible in terms of performance, scalability, real-time response, energy efficiency, robustness, and much else besides. Where I live, significant levels of fear are not just wise, but also necessary for survival. To borrow an old saying from aviation, there are old pilots, and there are bold pilots, but there are no old bold pilots.

Nevertheless, I expect that my more than three decades of experience with concurrency, my work on the C/C++ memory model (memory_order_consume notwithstanding), and my role as lead maintainer of the Linux-kernel memory model (LKMM) will help provide a good starting point for the more work-a-day situation that I believe that the Rust community wishes to support.

But Doesn’t Rust Already Have a Memory Model?

Kinda sorta?

Some in academia have assumed that Rust’s memory model will be based on that of C/C++, for example, see here. And the Rustonomicon agrees, though the author of that text does not appear to be very happy about it:

Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not due to this model being particularly excellent or easy to understand. Indeed, this model is quite complex and known to have several flaws. Rather, it is a pragmatic concession to the fact that everyone is pretty bad at modeling atomics. At very least, we can benefit from existing tooling and research around the C/C++ memory model. (You’ll often see this model referred to as “C/C++11” or just “C11”. C just copies the C++ memory model; and C++11 was the first version of the model but it has received some bugfixes since then.)

Trying to fully explain the model in this book is fairly hopeless. It’s defined in terms of madness-inducing causality graphs that require a full book to properly understand in a practical way. If you want all the nitty-gritty details, you should check out the C++ specification. Still, we’ll try to cover the basics and some of the problems Rust developers face.

The C++ memory model is fundamentally about trying to bridge the gap between the semantics we want, the optimizations compilers want, and the inconsistent chaos our hardware wants. We would like to just write programs and have them do exactly what we said but, you know, fast. Wouldn’t that be great?

I would argue that compiler optimizations are the source of much more inconsistent chaos than hardware would ever dream of providing, but that argument has been going on for many years and I doubt that we will be able to settle it here.

Leaving that argument aside, entertaining though it might be, the wording of this passage of the Rustonomicon certainly invites alternatives to the C/C++ memory model. Let’s see what we can do.

Where to Start?

Let’s first eliminate a number of candidate memory models. Rust’s presumed portability goals rule out any of the hardware memory models. Rust’s growing connection to the deep embedded world rules out the memory models of dynamic languages such as Java and Javascript. The growing number of memory models derived from the C/C++ memory model are represented by the actual C/C++ memory model, so we will not consider those variants separately.

This leaves the aforementioned C/C++ memory model and of course LKMM, the latter inspired by the Rust community’s ambitions within the Linux kernel. Because this blog series focuses on the Linux kernel, this post will start with LKMM, move to the C/C++ memory model, and end with a specific recommendation.

Linux-Kernel Memory Model (LKMM)

This section will consider aspects of LKMM, starting with the most fear-inducing aspects and moving to the more work-a-day aspects.

Control dependencies are not the most fearless portion of LKMM, in fact, in my role as lead maintainer of LKMM, I need people using control dependencies not to merely feel fearful, but instead to feel downright terrified. Control dependencies are fragile and easy for control-dependency-oblivious compiler optimizations to destroy. Therefore, for the time being, control dependencies should be excluded from any Rust memory model. Perhaps the day will come when we have a good way to communicate the intent of control dependencies to compilers, but today is not that day.
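
To make the fragility concrete, here is a minimal Linux-kernel-style sketch of a control dependency, with the variables flag and data invented purely for this illustration:

/* Sketch only: flag and data are invented for this illustration. */
static int flag, data;

static void transition(void)
{
        if (READ_ONCE(flag)) {
                /* This store is ordered after the load of flag only by the
                 * control dependency, that is, only because the store cannot
                 * execute until the "if" condition has been evaluated. */
                WRITE_ONCE(data, 1);
        }
}

A compiler that can prove that the same store appears on both legs of the “if”, or that the condition is always true, is within its rights to hoist the WRITE_ONCE() out of the conditional, silently destroying the ordering, which is exactly why users of control dependencies should be terrified.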

Address and data dependencies carried by pointers are not free of risk, but they do provide dependency-oblivious compilers significantly less scope for destructive optimizations. They should thus require much less fear than do control dependencies, little though that might be saying. Nevertheless, address and data dependencies are mainly used in conjunction with RCU, and we do not yet have a known good way of expressing all RCU use cases within the confines of Rust’s ownership model. Therefore, address and data dependencies should also be excluded from Rust’s memory model until such time as significant RCU use cases are handled or some other clear and present use case militates for address and data dependencies. It would be increasingly reasonable to permit control and data dependencies within Rust unsafe mode should use of RCU within Rust increase, keeping in mind that things like epoch-based reclamation (EBR) are particular classes of implementations of RCU.
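
For comparison, the usual Linux-kernel form of an address dependency looks something like the following minimal sketch, again with invented names (struct foo and gp are not real kernel symbols):

struct foo {
        int a;
};

static struct foo __rcu *gp;    /* invented RCU-protected global pointer */

static int reader(void)
{
        struct foo *p;
        int ret = -1;

        rcu_read_lock();
        p = rcu_dereference(gp);        /* dependency is carried by the value of p */
        if (p)
                ret = p->a;             /* dereference ordered after the load of gp */
        rcu_read_unlock();
        return ret;
}

Because the ordering is carried by the pointer value itself, a dependency-oblivious compiler has far fewer opportunities for mischief than in the control-dependency case, but as noted above, the main consumer of this pattern is RCU, whose use cases do not yet map cleanly onto Rust’s ownership model.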

At first glance, it seems entirely reasonable to countenance use of READ_ONCE() and WRITE_ONCE() within prospective Linux-kernel Rust code, but this post is discussing Rust in general, not just Rust in the Linux kernel. And the fact is that address, data, and control dependencies are the only things that prevent the Linux kernel’s use of READ_ONCE() and WRITE_ONCE() from resulting in out-of-thin-air (OOTA) behavior. Except that these operations (along with the Linux kernel’s unordered atomic read-modify-write (RMW) operations) are implemented so as to prevent the compiler from undertaking code-motion optimizations that might otherwise reorder such operations. Furthermore, all of the underlying hardware memory models of which I am aware preserve dependency orderings. Therefore, one might expect that these unordered operations might reasonably be part of a Rust memory model.

Unfortunately, one important benefit of a memory model is the tooling that analyzes concurrent code fragments, and if this tooling is to exclude OOTA behaviors, it is absolutely necessary for that tooling to understand dependencies. Except that we have already excluded such dependencies from the Rust memory model.

Therefore, the Rust memory model should restrict its support for Linux-kernel atomic operations to those that provide ordering. These would be the value-returning non-relaxed read-modify-write (RMW) atomic operations along with the _acquire() and _release() variants of non-value-returning RMW atomic operations. It might also make sense to allow combinations of unordered RMW operations with combination memory barriers, for example, atomic_inc() followed by smp_mb__after_atomic(), but it would make even more sense to wrapper these in combination as a single Rust-accessible primitive. This combined Rust primitive would no longer be unordered, and could thus be included as an ordered unit in the Rust memory model. Alternatively, unordered atomics might be relegated to Rust’s unsafe mode.
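
For example, a wrapper along the following lines would present Rust with a single fully ordered operation rather than an unordered RMW plus a separate barrier. The name atomic_inc_mb() is invented here for illustration and is not an existing kernel API:

#include <linux/atomic.h>

static inline void atomic_inc_mb(atomic_t *v)
{
        atomic_inc(v);                  /* unordered non-value-returning RMW... */
        smp_mb__after_atomic();         /* ...made fully ordered by this barrier */
}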

It is hard to imagine a useful Rust memory model that excludes locking.

Thus, starting from LKMM, we arrive at a model that supports ordered atomic operations and locking, possibly including unordered atomics in unsafe mode.

C/C++ Memory Model

This section will instead start from the C/C++ memory model.

Because memory_order_relaxed accesses can yield OOTA results, they seem inconsistent with Rust’s fearless-concurrency goals. These cannot actually occur in practice with any implementation that I am aware of, but fearless concurrency requires accurate tools, and this accuracy requires excluding even the theoretical possibility of OOTA results. Another reasonable approach would permit use of memory_order_relaxed only within Rust unsafe code.
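
To see what is at stake, consider the classic out-of-thin-air litmus test, written here as a small C11 program. This is the standard textbook example rather than anything from the Rust or C++ standards themselves:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static atomic_int x, y;
static int r1, r2;

static int t1(void *unused)
{
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        atomic_store_explicit(&y, r1, memory_order_relaxed);
        return 0;
}

static int t2(void *unused)
{
        r2 = atomic_load_explicit(&y, memory_order_relaxed);
        atomic_store_explicit(&x, r2, memory_order_relaxed);
        return 0;
}

int main(void)
{
        thrd_t a, b;

        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        /* No real implementation produces r1 == r2 == 42, but the formal
         * C/C++ model does not cleanly exclude that outcome, and that is
         * the OOTA problem that trips up automated analysis tooling. */
        printf("r1 = %d, r2 = %d\n", r1, r2);
        return 0;
}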

The memory_order_consume ordering is primarily useful in conjunction with RCU. In addition, in all implementations that I am aware of, memory_order_consume is simply promoted to memory_order_acquire. It therefore seems unnecessary to include memory_order_consume within a Rust memory model. As before, a reasonable alternative would permit its use only within Rust unsafe code.

In contrast, memory_order_acquire, memory_order_release, and memory_order_acq_rel, are all easily analyzed and have been heavily used in practice for decades. Although the celebrated memory_order_seq_cst (sequentially consistent) memory ordering can be difficult to analyze, its strong ordering is absolutely required by some use cases, including a sadly large fraction of concurrent algorithms published by academics and industrial researchers. In addition, there is a decades-long tradition of proofs and tooling handling sequential consistency, difficulties aside. Use of all four of these orderings should therefore be permitted from safe Rust code.

With the C/C++ memory model as it was with LKMM, it is hard to imagine a useful Rust memory model that excludes locking.

Thus, starting from the C/C++ memory model, we also arrive at a model that supports ordered atomic operations and locking, possibly along with unordered atomic operations in unsafe mode.

Recommendation

The fact that starting with LKMM got us to roughly the same place as did starting with the C/C++ memory model should give us considerable confidence in that destination. Why the “roughly”? Because there are subtle differences, as can be seen by comparing the C and C++ standards to LKMM.

One reaction to this situation would be to go through these memory models one corner case at a time, and for each corner case making a choice for Rust, now and forever. Perhaps a better approach is to pick one and adapt as needed based on real situations as they arise. At this time, there are far more projects living within the C/C++ memory model than within LKMM. Therefore, despite my role as lead maintainer of LKMM, it is my painful duty to recommend the following, based on the C/C++ memory model:

  1. Locking may be used from both safe and unsafe modes.
  2. Atomic operations using memory_order_acquire, memory_order_release, memory_order_acq_rel, and memory_order_seq_cst may be used from both safe and unsafe modes.
  3. Atomic operations using memory_order_relaxed and memory_order_consume may be only from unsafe mode.
  4. All atomic operations in Rust code should be marked, that is, Rust should avoid following the C/C++ practice of interpreting an unmarked mention of an atomic variable as a memory_order_seq_cst access to that variable (see the sketch just after this list). Requiring marking allows accesses to concurrently accessed shared variables to be identified at a glance, and also makes the choice of any default memory_order value much less pressing.
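
To illustrate the C/C++ practice that this last item recommends against, in C11 an unmarked mention of an atomic variable is an implicit memory_order_seq_cst access, as in this small sketch:

#include <stdatomic.h>

static atomic_int counter;

void example(void)
{
        counter++;      /* unmarked: silently a seq_cst atomic RMW in C11 */

        /* The marked equivalent makes both the atomicity and the chosen
         * ordering visible at the call site: */
        atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
}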

This provides explicit support for the operations and orderings that have a long history of heavy use and for which analysis tools have long been available. It also provides provisional support of operations and orderings that, while also having a long history of heavy use, are beyond the current state of the art for accurate and complete analysis.

Other Options

One could argue that memory_order_relaxed also has many simple use cases, for but one example, constructing distributed counters. However, given the difficulty verifying its full range of use cases, unsafe mode seems the safest bet at the moment. Should any of the current efforts to more straightforwardly verify memory_order_relaxed accesses bear fruit, then perhaps memory_order_relaxed might be permitted in Rust safe mode.
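
For example, a split counter needs nothing stronger than memory_order_relaxed because each thread updates only its own slot and readers tolerate an approximate sum. The following is a sketch with invented names and a fixed slot count:

#include <stdatomic.h>

#define NCOUNTERS 64            /* invented: one slot per thread or CPU */

static atomic_long counters[NCOUNTERS];

static inline void count_inc(unsigned int slot)
{
        atomic_fetch_add_explicit(&counters[slot % NCOUNTERS], 1,
                                  memory_order_relaxed);
}

static long count_sum(void)     /* approximate while updates are in flight */
{
        long sum = 0;

        for (int i = 0; i < NCOUNTERS; i++)
                sum += atomic_load_explicit(&counters[i], memory_order_relaxed);
        return sum;
}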

Finally, one could argue that memory_order_consume should be excluded entirely, rather than simply being relegated to unsafe mode. However, some Rust libraries already use RCU internally, and the ability to flag RCU pointer traversals could prove helpful should Rust someday fully support address and data dependencies.

In contrast, as noted earlier, the other four memory_order enum members are heavily used and routinely analyzed. It is therefore reasonable to permit use of memory_order_acquire, memory_order_release, memory_order_acq_rel, and memory_order_seq_cst within Rust safe code.

What Does This Mean for the Linux Kernel?

As it turns out, not much.

Please keep in mind that the Linux kernel currently interoperates with the full range of memory models used by arbitrary userspace code written in arbitrary languages when running on a wide range of hardware memory models. Any adjustments required are currently handled by architecture-specific code, for example, in the system-call layer or in exception entry/exit code. Strong ordering is also provided deeper within the Linux kernel, with but one example being the context-switch code, which must provide full ordering when processes migrate from one CPU to another.

It turns out that Rust code in Linux has wrappers around any C code that it might invoke, and it is through these wrappers that Rust code will make use of the non-Rust portions of the Linux kernel. These wrappers will therefore contain calls to whichever memory-ordering primitives might be required at any particular point in Rust’s memory-model evolution.

But Does Rust Really Need to Define Its Memory Model Right Now?

This is of course not my decision.

But it is only fair to point out that the longer the Rust community waits, the more the current “kinda sorta” choice of the full C/C++ memory model will become the actual long-term choice, including those portions of the C/C++ memory model that have proven troublesome, and that are therefore likely to be subject to change.

My recommendation is therefore to adopt the untroublesome portions of the C/C++ memory model in safe mode, and the rest in unsafe mode. This approach allows people to write workaday concurrent algorithms in Rust and to be confident that the resulting code will still work in the future. It also allows those needing to live more dangerously to confine their code to unsafe blocks so as to attract an appropriate level of skepticism and attention.

But again, this decision rests not with me, but with the Rust community.

History

November 4, 2021: Unmarked Rust-language accesses are never atomic operations plus some wordsmithing.


Stupid RCU Tricks: Waiting for Grace Periods From NMI Handlers

Suppose that you had a state machine implemented by NMI handlers, and that some of the transitions in this state machine need to wait for an RCU grace period to elapse. How could these state transitions be implemented?

Before we start, let’s dispense with a couple of silly options. First, we clearly are not going to invoke synchronize_rcu() from within an NMI handler. This function can sleep, and we cannot sleep even within non-threaded interrupt handlers, let alone within NMI handlers. Second, we are not going to invoke call_rcu() from within an NMI handler. This function disables interrupts to exclude concurrent invocations on a single CPU, and disabling interrupts does nothing to keep NMI handlers from running. Worse yet, when running on rcu_nocbs CPUs (which offload callback invocations to rcuo kthreads), call_rcu() acquires a lock. Therefore, invoking call_rcu() from within an NMI handler would at best result in deadlock, and might also result in corruption of RCU’s lists of callbacks.

So what can we do?

One approach would be to use a self-spawning RCU callback, that is, a callback that invokes call_rcu() on itself. (Yes, this is perfectly legal, just as it is with timer handlers. See the function rcu_torture_fwd_prog_cb() in kernel/rcu/rcutorture.c of v5.14 of the Linux kernel for an example.) This callback could also increment a counter that could be checked by the NMI handler. When the NMI handler wished to defer a state transition until the end of a future RCU grace period, it could transition to an additional “waiting for RCU” state while recording the value of that counter. Later, when the counter had advanced sufficiently, a subsequent NMI handler could complete the transition to the desired state.

Of course, it is not sufficient to wait for the counter to advance just once. The reason is that the initial NMI might have occurred just before the RCU callback executed, and the next NMI might happen just afterwards. Yes, the counter has advanced, but almost no time has elapsed, much less a full RCU grace period. It is instead necessary to wait for the counter to advance by two, and also to have the needed memory barriers in place.
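
For concreteness, here is a minimal sketch of this counter-based approach in kernel C, with hypothetical names, a single state machine assumed, and the aforementioned memory barriers glossed over:

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

static atomic_t nmi_gp_counter = ATOMIC_INIT(0);
static struct rcu_head nmi_gp_rh;
static int nmi_gp_snap;  /* Snapshot taken when deferring a transition. */

static void nmi_gp_cb(struct rcu_head *rhp)
{
        atomic_inc_return(&nmi_gp_counter);  /* Value-returning, hence fully ordered. */
        call_rcu(rhp, nmi_gp_cb);            /* Self-spawn for the next grace period. */
}

static void nmi_gp_init(void)  /* Invoked once, from process context. */
{
        call_rcu(&nmi_gp_rh, nmi_gp_cb);
}

/* From the NMI handler: record the counter and enter the "waiting for RCU" state. */
static void nmi_defer_transition(void)
{
        nmi_gp_snap = atomic_read(&nmi_gp_counter);
}

/* From later NMI handlers: an advance of two implies at least one full grace period. */
static bool nmi_grace_period_elapsed(void)
{
        return atomic_read(&nmi_gp_counter) - nmi_gp_snap >= 2;
}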

But there is a better way that makes use of RCU’s existing counters and memory barriers. RCU provides these four functions for this purpose, two of which are usable from NMI handlers:

  1. get_state_synchronize_rcu(), which was first used in v4.10 in 2015, returns a “cookie” that can be checked later. SRCU provides a similar get_state_synchronize_srcu() API.
  2. start_poll_synchronize_rcu() returns a cookie as well, but also ensures that the needed RCU grace period gets scheduled. Unfortunately, this last requires locks to be acquired, which precludes its use in an NMI handler. SRCU provides a similar start_poll_synchronize_srcu() API, which was first used in v5.14 in 2021.
  3. poll_state_synchronize_rcu() takes a cookie from the first two functions and returns true if the corresponding RCU grace period has elapsed. SRCU provides a similar poll_state_synchronize_srcu() API, which was first used in v5.14 in 2021.
  4. cond_synchronize_rcu(), which was first used in v4.10 in 2015, also takes a cookie from the first two functions, but waits (if needed) until the corresponding RCU grace period has elapsed. Unfortunately, the possibility of waiting precludes this function’s use in an NMI handler.

So the first NMI handler can invoke get_state_synchronize_rcu() to obtain a cookie, then transition to the additional state. Later NMI handlers residing in this additional state could pass that cookie to poll_state_synchronize_rcu(), completing the transition if this function returns true. On busy systems, RCU grace periods are being initiated by many other things, so that there is almost always a grace period in progress, but if this must work on quiet systems, the aforementioned self-spawning callback could be used to force the issue when needed.
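
For concreteness, here is a minimal kernel-C sketch of the cookie-based approach, again with hypothetical names and a single state machine assumed:

#include <linux/rcupdate.h>
#include <linux/types.h>

static unsigned long nmi_gp_cookie;

/* First NMI: record a cookie and enter the "waiting for RCU" state. */
static void nmi_begin_wait(void)
{
        nmi_gp_cookie = get_state_synchronize_rcu();
}

/* Later NMIs: complete the deferred transition once the grace period has elapsed. */
static bool nmi_try_complete(void)
{
        if (!poll_state_synchronize_rcu(nmi_gp_cookie))
                return false;  /* Grace period still in progress. */
        /* ... complete the transition to the desired state ... */
        return true;
}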

Of course, RCU has made internal use of grace-period polling for a very long time, starting prior to the beginning of the Linux-kernel git repository in 2005.

In short, NMI handlers can now work with both RCU and SRCU grace periods without the need to invent counters or to worry about memory-ordering issues!


Verification Challenges

You would like to do some formal verification of C code? Or you would like a challenge for your formal-verification tool? Either way, here you go!

  1. Stupid RCU Tricks: rcutorture Catches an RCU Bug, also known as Verification Challenge 1.
  2. Verification Challenge 2: RCU NO_HZ_FULL_SYSIDLE.
  3. Verification Challenge 3: cbmc.
  4. Verification Challenge 4: Tiny RCU.
  5. Verification Challenge 5: Uses of RCU.
  6. Verification Challenge 6: Linux-Kernel Tree RCU.
  7. Verification Challenge 7: Heavy Modifications to Linux-Kernel Tree RCU.

TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel

These recommendations assume that the initial Linux-kernel targets for Rust developers are device drivers that do not have unusual performance and scalability requirements, meaning that wrappering of small C-language functions is tolerable. (Please note that most device drivers fit into this category.) They also assume that the main goal is to reduce memory-safety bugs, although other bugs might be addressed as well. Or, Murphy being Murphy, created as well. But that is a risk in all software development, not just Rust in the Linux kernel.

Those interested in getting Rust into Linux-kernel device drivers sooner rather than later should look at the short-term recommendations, while those interested in extending Rust’s (and, for that matter, C’s) concurrency capabilities might be more interested in the long-term recommendations.

Short-Term Recommendations

The goal here is to allow the Rust language considerable memory-model flexibility while also providing Rust considerable freedom in what it might be used for within the Linux kernel. The recommendations are as follows:

  1. Provide wrappers around the existing Linux kernel’s locking, MMIO, I/O-barrier, and I/O-access primitives. If I were doing this work, I would add wrappers incrementally as they were actually needed, but I freely admit that there are benefits to providing a full set of wrappers from the get-go.
  2. Either wrapper READ_ONCE() or be careful to take Alpha’s, Itanium’s, and ARMv8’s requirements into account as needed. The situation with WRITE_ONCE() appears more straightforward. The same considerations apply to the atomic_read() and atomic_set() family of primitives.
  3. Atomic read-modify-write primitives should be made available to Rust programs via wrappers around C code. But who knows? Maybe it is once again time to try out intrinsics. Again, if I were doing the work, I would add wrappers/implementations incrementally.
  4. Although there has been significant discussion surrounding how sequence locking and RCU might be handled by Rust code, more work appears to be needed. For one thing, it is not clear that people proposing solutions are aware of the wide range of Linux-kernel use cases for these primitives. In the meantime, I recommend keeping direct use of sequence locking and RCU in C code, and providing Rust wrappers for the resulting higher-level APIs. In this case, it might be wise to make good use of compiler directives in order to limit Rust’s ability to apply code-motion optimizations. However, if you need sequence-locking or RCU Rust-language wrappers, there are a number that have been proposed. (Except that you should first take another look or three at wrappering higher-level APIs. Yes, APIs are hard, but there are huge benefits to proper APIs!)
  5. Control dependencies should be confined to C code. If needed, higher-level APIs whose C-language implementations require control dependencies can be wrappered for Rust use. But my best guess is that it will be some time before Rust code needs control dependencies.

Taking this approach should avoid memory-model-mismatch issues between Rust and C code in the Linux kernel. And discussions indicate that much (and maybe all) of the wrappering work called out in the first three items above has already been done.
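
For concreteness, here is a rough sketch (file and function names hypothetical, not taken verbatim from the Rust-for-Linux tree) of the kind of small out-of-line C helper that such wrappers boil down to, which Rust bindings can then call:

/* rust_kernel_helpers.c: out-of-line C helpers callable from Rust bindings. */
#include <linux/compiler.h>
#include <linux/spinlock.h>

unsigned long rust_helper_read_once_ulong(const unsigned long *p)
{
        return READ_ONCE(*p);  /* Volatile load: no compiler refetching or fusing. */
}

void rust_helper_write_once_ulong(unsigned long *p, unsigned long v)
{
        WRITE_ONCE(*p, v);
}

void rust_helper_spin_lock(spinlock_t *lock)
{
        spin_lock(lock);  /* Inlined in C callers, but an out-of-line call from Rust absent LTO. */
}

void rust_helper_spin_unlock(spinlock_t *lock)
{
        spin_unlock(lock);
}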

Of course, situations that do not fit this set of recommendations can be addressed on a case-by-case basis. I would of course be happy to help. Specifically, in my role as lead Linux-kernel RCU maintainer:

  1. My first reaction to submission of Rust wrappers for the RCU API will be to ask hard questions about the possibility of higher-level APIs.
  2. If higher-level APIs are infeasible, I will look carefully at which RCU use cases are supported by the proposed wrappering. I am working to improve documentation of the range of RCU use cases in order to help submitters to do this as well. (This work will likely be helpful elsewhere as well.)
  3. Again, if higher-level APIs are infeasible, I will look carefully at how the Rust wrappers are helping to find bugs. This will clearly require me to continue learning Rust, as doing so requires a detailed view of Rust’s ownership mechanisms and of how things like reference-counting use cases work around limitations in these mechanisms.

This procedure should help buy the time required for me to learn more about the Rust language and for the Rust community to learn more about RCU and its many use cases.

Long-Term Recommendations

This section takes a more utopian view. What would a perfect Rust sequence-locking implementation/wrapper look like? Here are some off-the-cuff desiderata, none of which are met by the current C-code Linux-kernel implementation:

  1. Use of quantities computed from sequence-lock-protected variables in a failed reader should result in a warning. But please note that reliably associating variables with sequence locks may not be easy. One approach suggested in response to this series is to supply a closure containing the read-side critical section, thus restricting such leakage to (unsafe?) side effects.
  2. Improper access to variables not protected by the sequence lock should result in a warning. But please note that it is quite difficult to define “improper” in this context, let alone detect it in real code.
  3. Data races in failed sequence-lock readers should not cause failures. But please note that this is extremely difficult in general if the data races involve non-volatile C-language accesses in the reader. For example, the compiler would be within its rights to refetch the value after the old value had been checked. (This is why the sequence-locking post suggests marked accesses to sequence-lock-protected variables.)
  4. Data races involving sequence-locking updaters are detected, for example, via KCSAN.

As noted elsewhere, use of a wrapper around the existing C-language implementation allows C and Rust to use the same sequence lock. This might or might not prove to be important.
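
For reference, here is a minimal kernel-C sketch, with hypothetical names, of the current sequence-locking pattern, including the marked accesses called out in the third item above:

#include <linux/compiler.h>
#include <linux/seqlock.h>

static DEFINE_SEQLOCK(foo_seq);
static int foo_x, foo_y;

/* Reader: retry until a consistent snapshot is obtained. */
static void foo_read(int *x, int *y)
{
        unsigned int seq;

        do {
                seq = read_seqbegin(&foo_seq);
                *x = READ_ONCE(foo_x);  /* Marked accesses prevent compiler refetching. */
                *y = READ_ONCE(foo_y);
        } while (read_seqretry(&foo_seq, seq));
}

/* Updater: exclude other updaters and force concurrent readers to retry. */
static void foo_update(int x, int y)
{
        write_seqlock(&foo_seq);
        WRITE_ONCE(foo_x, x);
        WRITE_ONCE(foo_y, y);
        write_sequnlock(&foo_seq);
}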

Similarly, what would a perfect Rust RCU wrapper look like? Again, here are some off-the-cuff desiderata, similarly unmet by existing C-code Linux-kernel implementations:

  1. Make call_rcu() and friends cause the specified object to make an end-of-grace-period transition in other threads’ ownership from readers to unowned. Again, not an immediate transition at the time of call_rcu() invocation, but rather a deferred transition that takes place at the end of some future grace period.
  2. Some Rust notion of type safety would be useful in slab caches flagged as SLAB_TYPESAFE_BY_RCU.
  3. Some Rust notion of existence guarantee would be useful for RCU readers.
  4. Detect pointers to RCU-protected objects being improperly leaked from RCU read-side critical sections, where proper leakage is possible via locks and reference counters. Perhaps an emphasis on closures is at least part of the answer.
  5. Detect mishandling of dependency-carrying pointers returned by rcu_dereference() and friends. Or perhaps introduce some marking for such pointers so that the compiler will avoid breaking the ordering. See the address/data dependency post for a list of C++ working papers moving towards this goal.

There has recently been some work attempting to make the C compiler understand control dependencies. Perhaps Rust could make something useful happen in this area, perhaps by providing markings that allow the compiler to associate the reads with the corresponding control-dependent writes.

Perhaps Rust can better understand more of the ownership schemes that are used within the Linux kernel.
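
One such scheme is the classic RCU publish/read/retire pattern, sketched below in kernel C with hypothetical names, and which is roughly the pattern that the above desiderata attempt to capture in Rust’s type system:

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
        int a;
        struct rcu_head rh;
};

static struct foo __rcu *global_foo;
static DEFINE_SPINLOCK(foo_lock);

/* Reader: the read-side critical section provides an existence guarantee. */
static int foo_read_a(void)
{
        struct foo *p;
        int ret = -1;

        rcu_read_lock();
        p = rcu_dereference(global_foo);  /* Dependency-carrying pointer. */
        if (p)
                ret = p->a;  /* Must not leak p past rcu_read_unlock(). */
        rcu_read_unlock();
        return ret;
}

/* Updater: publish the new version, then hand the old one off to RCU. */
static void foo_update(struct foo *newp)
{
        struct foo *oldp;

        spin_lock(&foo_lock);
        oldp = rcu_dereference_protected(global_foo, lockdep_is_held(&foo_lock));
        rcu_assign_pointer(global_foo, newp);
        spin_unlock(&foo_lock);
        if (oldp)
                kfree_rcu(oldp, rh);  /* Readers own oldp until the grace period ends. */
}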

Rationale

Please note that middle-term approaches are likely to be useful, for example, those that simply provide wrappers around the C-language RCU and sequence-locking implementations. However, such approaches also carry risks, especially during that time when the intersection of the set of people deeply understanding Rust with the set of people deeply understanding RCU and sequence locking is the empty set. For example, one risk that became apparent during the effort to add RCU to the C++ standard is that people new to RCU and sequence locking will latch onto the first use case that they encounter and focus solely on that use case. This has also been a recurring issue for concurrency experts who encounter sequence locking and RCU for the first time. This of course means that I need to do a better job of documenting RCU, its use cases, and the relationships between them.

This is not to say that per-use-case work is pointless. In fact, such work can be extremely valuable, if nothing else, in helping to build understanding of the overall problem. It is also quite possible that current RCU and sequence-locking use cases will resemble those of the future in the same way that the venerable “while” loop resembles the wide panoply of iterator-like constructs found not only in modern languages but also, written in C, in the Linux kernel. In other words, perhaps Rust will focus on specific fearless-concurrency-friendly RCU/sequence-locking use cases. Except that the Linux kernel still contains “while” loops, as does a great deal of other software, which suggests that the low-level RCU and sequence-locking APIs will always be required. There will also be questions as to whether a given new-age use case is there to help developers using RCU and sequence locking on the one hand or whether its main purpose is instead to work around a shortcoming in Rust on the other. “This should be good clean fun!” 😉

Experience indicates that it will take significant time to sort out all of these issues. We should therefore make this time available by proceeding initially as described in the short-term recommendations.

Criteria for judging proposals include safety, performance, scalability, build time, development cost, maintenance effort, and so on, not necessarily in that order.

History

October 18, 2021: Add “Rationale” section.
October 20, 2021: Add a few expansions and clarifications.
October 21, 2021: RCU-specific recommendations.


Rusting the Linux Kernel: Summary and Conclusions

We have taken a quick trip through history, through a number of the differences between the Linux kernel and the C/C++ memory models, sequence locks, RCU, ownership, zombie pointers, and KCSAN. I give a big “thank you” to everyone who has contributed to this discussion, both publicly and privately. It has been an excellent learning experience for me, and I hope that it has also been helpful to all of you.

To date, Can Rust Code Own Sequence Locks? has proven the most popular by far. Porting Linux-kernel code using sequence locking to Rust turns out to be trickier than one might expect, in part due to the inherently data-racy nature of this synchronization primitive.

So what are those advocating use of Rust within the Linux kernel to do?

The first thing is to keep in mind that the Linux kernel comprises tens of millions of lines of code. One disadvantage of this situation is that the Linux kernel is not going to be ported to much of anything quickly or easily. One corresponding advantage is that it allows Rust-for-Linux developers to carefully pick their battles, focusing first on those portions of the Linux kernel where conversion to Rust might do the most good. Given that order-of-magnitude performance wins are unlikely, the focus is likely to be on proof of concept and on reduced bug rates. Given Rust’s heavy focus on bugs stemming from undefined behavior, and given the Linux kernel’s use of the -fno-strict-aliasing and -fno-strict-overflow compiler command-line options, bug-reduction choices will need to be made quite carefully. In addition, order-of-magnitude bug-rate differences across the source base provide a high noise floor that makes it more difficult to measure small bug-reduction rates. But perhaps enabling the movement of code into mainline and out of staging can be another useful goal. Or perhaps there is an especially buggy driver out there somewhere that could make good use of some Rust code.

Secondly, careful placement of interfaces between Rust and C code is necessary, especially if there is truth to the rumors that Rust does not inline C functions without the assistance of LTO. In addition, devilish details of Linux-kernel synchronization primitives and dependency handling may further constrain interface boundaries.

Finally, keep in mind that the Linux kernel is not written in standard C, but rather in a dialect that relies on gcc extensions, code-style standards, and external tools. Mastering these extensions, standards, and tools will of course require substantial time and effort, but on the other hand this situation provides precedent for Rust developers to also rely on extensions, style standards, and tools.

But what about the question that motivated this blog in the first place? What memory model should Linux-kernel Rust code use?

For Rust outside of the Linux kernel, the current state of compiler backends and the path of least resistance likely leads to something resembling the C/C++ memory model. Use of Rust in the Linux kernel is therefore not a substitute for continued participation in the C/C++ standards committees, much though some might wish otherwise.

Rust non-unsafe code does not depend much on the underlying memory model, at least assuming that unsafe Rust code and C code avoid undermining the data-race-free assumption of non-unsafe code. The need to preserve this assumption in fact inspired the blog post discussing KCSAN.

In contrast, Rust unsafe code must pay close attention to the underlying memory model, and in the case of the Linux kernel, the only reasonable choice is of course the Linux-kernel memory model. That said, any useful memory-ordering tool will also need to pay attention to safe Rust code in order to correctly evaluate outcomes. However, there is substantial flexibility, depending on exactly where the interfaces are placed:

  1. As noted above, forbidding use of Rust unsafe code within the Linux kernel would render this memory-model question moot. Data-race freedom for the win!
  2. Allowing Rust code to access shared variables only via atomic operations with acquire (for loads), release (for stores) or stronger ordering would allow Rust unsafe code to use that subset of the Linux-kernel memory model that mirrors the C/C++ memory model.
  3. Restricting use of sequence locks, RCU, control dependencies, lockless atomics, and non-standard locking (e.g., stores to shared variables within read-side reader-writer-locking critical sections) to C code would allow Rust unsafe code to use that subset of the Linux-kernel memory model that closely (but sadly, not exactly) mirrors the C/C++ memory model.
  4. Case-by-case relaxation of the restrictions called out in the preceding pair of items pulls in the corresponding portions of the Linux-kernel memory model. For example, adding sequence locking, but only in cases where readers access only a limited co-located set of objects, might permit some of the simpler Rust sequence-locking implementations to be used (see for example the comments to the sequence-locking post).
  5. Insisting that Rust be able to do everything that Linux-kernel C code currently does pulls in the entirety of the Linux-kernel memory model, including those parts that have motivated many of my years of C/C++ standards-committee fun and excitement.

With the first three options, a Rust compiler adhering to the C/C++ memory model will work just fine for Linux-kernel Rust code. In contrast, the last two options would require code-style restrictions (preferably automated), just as is already the case for current Linux-kernel C code. See the recommendations post for more detail.

In short, choose wisely and be very careful what you wish for! 😉

History

October 7, 2021: Added atomic-operation-only option to memory-model spectrum.
October 8, 2021: Fix typo noted by Miguel Ojeda.
October 12, 2021: Self-review.
October 13, 2021: Add reference to recommendations post.
