Stupid RCU Tricks: The design of rcutorture

This installment of the rcutorture series takes a high-level look at its design. At the highest level, rcutorture is a stress test with a few unit-test components thrown in for good measure. It also includes scripts to handle both single-system and distributed testing. All of this code is of course paying homage to the many moods of Mr. Murphy.

The Many Moods of Mr. Murphy

As I have progressed through my career, I seem to have progressively miffed Mr. Murphy.

I completed my first professional (but pro bono) project in the mid-1970s. It had one user. Any million-year bugs it might have contained took the full million years to appear. This meant that Murphy was actually a pretty nice guy. Sure, whatever could happen would. Eventually. Maybe in geologic time.

In the 1980s, I completed a number of contract-programming projects that might have had installed bases of at many as 100 units. A million-year bug could be expected to appear about once per 10,000 years. In the 1990s, I worked on Sequent’s DYNIX/ptx proprietary-UNIX operating system, which had an installed base of perhaps 6,000 systems. A million-year bug could be expected to appear not quite once per two centuries.

Shortly after the year 2000, I started working on the Linux kernel. There are at best rough estimates of the Linux kernel’s installed based, and as of 2017, there were an estimated 20 billion systems of one sort of another running the Linux kernel, including smartphones, automobiles, household appliances, and much more. A million-year bug could be expected to appear more than once per hour across this huge installed base. In other words, over a period of about 40 years, Murphy has transitioned from being a pretty nice guy to being a total jerk!

Worse yet, should the Linux kernel capture even a modest fraction of the Internet-of-things market, a million-year bug could be expected to appear every few minutes across the installed base. Which might well result in Murphy becoming nothing less than a homicidal maniac.

Fortunately, there are some validation strategies that might help keep Murphy on the straight and narrow.

If You Cannot Beat Him, Join Him!

Given that everything that can happen eventually will, the task at hand is to try to make it happen in the comparative comfort and safety of the lab. This means aiding and abetting Mr. Murphy, at least within the lab environment. And this is the whole point of rcutorture, whose tricks include the following:

Temporal fuzzing.
Exercising race conditions.
Anticipating abuse.

Of course, none of these tricks are new, but it does not hurt to review them.

Temporal Fuzzing

But why not go for the full effect and apply straight-up fuzzing? The answer to this question may be found in RCU’s core API:

void rcu_read_lock(void);
void rcu_read_unlock(void);
void synchronize_rcu(void);
void call_rcu(struct rcu_head *head, rcu_callback_t func);

For the first three functions, there is nothing to fuzz, unless you are trying to test your compiler. For the last function, fuzzing of pointers—and most especially pointers to functions—is reserved for the truly brave and for those wishing to test their kernel’s exception handling.

But it does make sense to fuzz the timing of calls to these functions, and that is exactly what rcutorture does. RCU readers and updaters are invoked at random times, with readers and updaters cooperating to detect any too-short grace periods, memory misordering, and so on. Much of the fuzzing is randomly generated at run time, but there are also module parameters that insert delays in specific locations. This strategy is straightforward, but can also be powerful, for example, careful choice of delays and other configuration settings decreased the mean time between failure (MTBF) of a memorable heisenbug from hundreds of hours to less than five hours. This had the beneficial effect of de-heisening this bug.

Exercising Race Conditions

Many of the most troublesome bugs involve rare operations, and one way to join forces with Murphy is to make rare operations less rare during validation. And rcutorture takes this approach often, including for the following operations:

CPU hotplug.
Transitions to and from idle, including transitions to and from the whole system being idle.
Long RCU readers.
Readers from interrupt handlers.
Complex readers, for example, those overlapping with irq-disable regions.
Delayed grace periods, for example, allowing a CPU to go offline and come back online during grace-period initialization.
Racing call_rcu() invocations against rcu_barrier().
Periodic forced migrations to other CPUs.
Substantial testing of less-popular grace-period mechanisms.
Processes running on the hypervisor to preempt code running in rcutorture guest OSes.
Process exit.
”Near misses“ where the RCU grace-period guarantee is almost violated.
Moving CPUs to and from rcu_nocbs callback-offloaded mode.

This exercising of race conditions might be reminiscent of the Netflix Chaos Monkey.

Anticipating Abuse

There are things that RCU users are not supposed to do. Just as users of the fork() system call are not supposed to code up forkbombs, RCU users are not supposed to code up endless blasts of call_rcu() invocations (see Documentation/RCU/checklist.rst item 8). Nevertheless, rcutorture does engage in (carefully limited forms of) call_rcu() abuse in order to find stress-related RCU bugs. This abuse is enabled by default and may be controlled by the rcutorture.fwd_progress module parameter and friends.

In addition, rcutorture inserts the occasional long-term delay in preemptible RCU readers and exercises code paths that must avoid deadlocks involving the scheduler and RCU.

Meta-Murphy, AKA Test the Test

Of course, one danger of joining Murphy is that things can go wrong in test code just as easily as they can go wrong in the code under test.

For this reason, rcutorture provides the rcutorture.object_debug module parameter that verifies that the code checking for double call_rcu() invocations is working properly. In addition, the rcutorture.stall_cpu module parameter and friends may be used to force RCU CPU stall warning messages of various types.

The rcutorture tests of more fundamental RCU properties may be enabled by using the rcutorture.torture_type module parameter. For example, rcutorture.torture_type=busted selects a broken RCU implementation, which may also be selected using the BUSTED scenario. Either way, rcutorture had jolly well better complain about too-short grace periods. In addition, rcutorture.torture_type=busted_srcud forces rcutorture to run compound readers against SRCU, which does not support this notion. In this case also, rcutorture had better complain about too-short grace periods for these compound readers. The rcutorture.leakpointer module parameter tests the CONFIG_RCU_STRICT_GRACE_PERIOD Kconfig option’s ability to detect pointers leaked from RCU read-side critical sections. Finally, the rcutorture tests of RCU priority boosting can themselves be tested by using the BUSTED-BOOST scenario, which must then complain about priority-boosting failures.

Additional unscheduled tests of rcutorture testing are of course provided by bugs in RCU itself. Perhaps these are rare examples of Murphy working against himself, but they normally do not feel that way at the time!

Enlisting Darwin

Those who are willing to consider the possibility that natural selection applies to non-living objects might do well to consider validation such as that provided by rcutorture to be a selection function. Now, some developers might object to the thought that their carefully created changes are random mutations, but the sad fact is that long experience has often supported that view.

With this in mind, a good validation suite will select against bugs, resulting in robust software, right?

Wrong.

You see, bugs are a form of software. An undesirable form, perhaps, but a form nevertheless. Bugs will therefore adapt to any fixed validation suite and accumulate in your software, degrading its robustness. This means that any bugs located by end users must also be considered bugs against the validation suite, which after all failed to find those bugs. Modifying the validation suite to successfully find those bugs is therefore important, as is independent efforts to make the validation suite more capable. The hope is that modifying the test suite will make it more difficult for bugs to adapt to it.

But even that is insufficient. Blindly adding tests and test cases will eventually bloat your test suite to the point where it is no longer feasible to run all of it. It is therefore also necessary to review test cases and work out how to make them find bugs faster with less hardware, whether by merging tests, running more tests concurrently, or by more vigorously enlisting Mr. Murphy’s assistance. It might also be necessary to eliminate test cases that are no longer relevant, for example, now that RCU no longer has a synchronize_rcu_bh(), there is no point in testing it.

In short, the price of robust software is eternal test development.