How costly are atomic operations like AtomicU64::store()
in Rust?
Setting the stage #
This article came into existence after reading Mara Bos’ book “Rust Atomics and Locks” just days before stumbling across this excellent talk by Scott Meyers: YouTube link.
Every software engineer above entry level will sooner or later encounter atomic variables. Probably all of us have been told at some point that they are expensive. I was wondering how expensive atomic operations actually are. For me, getting a feeling for what to expect performance-wise is extremely important when designing systems. So maybe this article will also help you to develop a similar understanding, or, even better yet, motivate you to look into this yourself.
Non-atomic baseline #
We will run a few operations in a loop later on, so we need to get looping right first.
The nop
instruction is used as our payload, to tell the CPU to do nothing over and over again.
use std::arch::asm;
use std::time::{Duration, Instant};

static NUM_ITER: u64 = 1_000_000_000;

fn main() {
    let dur = set_nop();
    println!("set_nop: {:?}", dur);
}

#[inline(never)]
fn set_nop() -> Duration {
    let start = Instant::now();
    for _ in 0..NUM_ITER {
        unsafe { asm!("nop") }
    }
    start.elapsed()
}
set_nop: 241.168845ms
Note the #[inline(never)] attribute on our set_nop function.
That kindly asks the compiler to resist its urge to inline the function.
That does not always work, though.
Why don’t we want it to be inlined? So that we can easily use cargo-asm to show the assembly of this single function in isolation.
Let’s have a look at that. (I cleaned up the irrelevant parts.)
;cargo asm --release --bin operations "operations::set_nop"
operations::set_nop:
sub rsp, 24
call std::time::Instant::now
... ; save first instant
mov eax, 1000000000
.LBB5_1:
nop
dec rax
jne .LBB5_1
call std::time::Instant::elapsed
... ; prepare result
ret
Let’s go over this real quick.
Between the Instant::now and Instant::elapsed calls, we have the interesting part:
- load 1 billion into eax
- define a label we will use to jump to
- execute our nop
- decrement rax
- jump back to the label if we did not reach zero
This is probably the most basic loop. … Or is it? What happens if we remove the nop?
#[inline(never)]
fn set_nop() -> Duration{
let start = Instant::now();
for _ in 0..NUM_ITER {
// nothing
}
start.elapsed()
}
set_nop: 90ns
Yeah ok, as expected, the compiler is smart enough to remove the whole thing completely.
Luckily, there is a way to prevent it from doing so: std::hint::black_box.
#[inline(never)]
fn set_nop() -> Duration{
let start = Instant::now();
for _ in 0..NUM_ITER {
black_box(())
}
start.elapsed()
}
This will tell the compiler that there might be terrible things happening with whatever is passed to black_box, and that it can’t assume anything about it.
Let’s have a look at the resulting asm, this time only the loop.
mov eax, 1000000000
.LBB5_1:
dec rax
jne .LBB5_1
Nice. We can’t go simpler than this.
Diversion: Debug mode #
Just out of curiosity, let’s have a look at what happens when we build the same code without the --release flag.
It takes much longer: set_nop: 6.629613973s.
What does the assembly show? (Note that you need to add --dev to cargo asm, as --release is the default.)
(The debug-mode assembly listing is omitted here; the line numbers below refer to it.)
There’s a lot going on. Just a few highlights:
- The loop is actually an iterator over a range (L19)
- We jump away to .LBB28_3 on each iteration (L23 to L30)
- black_box is actually invoked as a function (L32)
- We jump back to the beginning of the loop again (L33 to L17)
While this is way more complex and a lot more work, the CPU still manages to take “only” 28 times more time.
This shows us that while debug mode provides some nice features, to reason about what will be going on in the actual code and how it will perform, we need to run it with --release.
Different means to modify variables #
The set_nop
function has “set” in the name but so far does not set anything.
What ways do we have to set a variable from within a function?
There are many, but the ones we will go with are:
- a mutable reference
- std::cell::RefCell
- an atomic variable (That’s what we are here for!)
To quote the Rust docs about RefCell:
Borrows for
RefCell<T>s
are tracked at runtime,
So we expect there to be a cost. Let’s define these functions:
fn set_mut_ref(x: &mut u64) -> Duration
fn set_ref_cell(x: &RefCell<u64>) -> Duration
fn set_atomic(x: &AtomicU64) -> Duration
Note that set_mut_ref is the only one that requires a mutable reference to its argument; the other two get by with only an immutable reference.
While RefCell’s main purpose is interior mutability, why does AtomicU64 also provide it?
If we look at the borrowing rules, we can’t have multiple mutable borrows of the same variable at the same time.
“At the same time” is hard for the compiler to reason about when multiple threads are involved (which is the whole point of using AtomicU64 in the first place), so we can’t hand out mutable borrows to multiple threads. The only reasonable way to design AtomicU64 is therefore to allow for interior mutability.
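A minimal sketch of what that buys us (illustration only, not part of the benchmark code): the store goes through a shared reference, so the same variable can be handed to many threads, while a &mut u64 could only ever be held in one place at a time.

use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::thread;

// Interior mutability: store() takes &self, so a shared reference is enough.
static SHARED: AtomicU64 = AtomicU64::new(0);

fn main() {
    thread::scope(|s| {
        for i in 0..4u64 {
            // Every thread gets the same &AtomicU64 and may write through it.
            s.spawn(move || SHARED.store(i, Relaxed));
        }
    });
    // Sharing a &mut u64 like this would not compile: multiple mutable
    // borrows of the same value would alias.
    println!("last value: {}", SHARED.load(Relaxed));
}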
The implementation of these functions is super simple, using the black_box
just to make sure we won’t get tricked by the compiler.
#[inline(never)]
fn set_mut_ref(x: &mut u64) -> Duration {
    let start = Instant::now();
    for _ in 0..NUM_ITER {
        black_box(*x = 1);
    }
    start.elapsed()
}

#[inline(never)]
fn set_ref_cell(x: &RefCell<u64>) -> Duration {
    let start = Instant::now();
    for _ in 0..NUM_ITER {
        black_box(*x.borrow_mut() = 1);
    }
    start.elapsed()
}

#[inline(never)]
fn set_atomic(x: &AtomicU64) -> Duration {
    let start = Instant::now();
    for _ in 0..NUM_ITER {
        black_box(x.store(1, Relaxed));
    }
    start.elapsed()
}
Nothing special so far, on to the results.
set_nop: 258.450691ms
set_mut_ref: 253.218202ms
set_ref_cell: 770.481632ms
set_atomic: 256.575872ms
The RefCell is, as expected, slower than using a mutable reference directly. The atomic variable, however, is as fast as the mutable reference!
But when looking at the assembly, we see something interesting.
;=====mut ref=======
mov qword ptr [rbx], 1
mov eax, 1000000000
.LBB6_1:
dec rax
jne .LBB6_1
;=====atomic=======
mov eax, 1000000000
.LBB8_1:
mov qword ptr [rip + operations::Z], 1
dec rax
jne .LBB8_1
That sneaky compiler… In the function taking the mutable reference, it writes the 1 into the result once, outside the loop.
Another interesting observation: the store in the atomic variant is actually done inside the loop, yet it does not slow down the code compared to doing nothing at all in the loop.
I think it is safe to assume that the non-atomic store would be just as fast as the atomic one if it were executed inside the loop.
Not so fast…
The Rust compiler is nearly too smart for the boring benchmarks I throw at it. Look at the asm output you get when you have an atomic variable whose value is stored here and read from somewhere else in the code:
➜ cargo asm --bin operations "operations::set_atomic"
...
.LBB8_1:
mov qword ptr [rip + operations::Z.0], 1
dec rax
jne .LBB8_1
...
There clearly is a store operation in the loop. Now observe what happens to the store loop if I remove the load operation in a completely different part of the code:
.LBB7_1:
dec rax
jne .LBB7_1
It behaves like the nop loop earlier: the compiler realized the variable is never read from.
I walked into this trap when measuring performance later in this article.
Always double-check and cross-reference your benchmarks!
To make matters worse (or better?), without using black_box, the whole loop disappears!
A first look at the results #
I think at this point it is worth talking about the results. For me, there are already a few things to be learned:
- The performance overhead of using a RefCell is way lower than I thought. Consider that we are using it inside the loop, so more or less a worst-case usage pattern performance-wise. In real applications, the overhead might often be negligible.
- Computers are fast. Doing about 4 billion iterations per second is insane. Per core, mind you. But this also provides an upper bound: we can’t go faster than this in a loop construct without unrolling or other shenanigans.
- When storing to an atomic variable from one thread, there is no noticeable difference compared to storing to a non-atomic one. We will see later that there can be high costs when more than one thread is involved.
Incrementing variables #
Instead of blindly overwriting a value, what happens if we increment it?
#[inline(never)]
fn inc_mut_ref(x: &mut u64) -> Duration {
    let start = Instant::now();
    for _ in 0..NUM_ITER {
        black_box(*x += 1);
    }
    start.elapsed()
}

#[inline(never)]
fn inc_ref_cell(x: &RefCell<u64>) -> Duration {
    let start = Instant::now();
    for _ in 0..NUM_ITER {
        black_box(*x.borrow_mut() += 1);
    }
    start.elapsed()
}

#[inline(never)]
fn inc_atomic(x: &AtomicU64) -> Duration {
    let start = Instant::now();
    for _ in 0..NUM_ITER {
        black_box(x.fetch_add(1, Relaxed));
    }
    start.elapsed()
}
inc_mut_ref: 260.048853ms
inc_ref_cell: 1.658529935s
inc_atomic: 4.364615524s
That’s interesting. I did expect there to be an increase for the atomic but not for the ref_cell. So, what’s going on there?
RefCell internals #
Again, some assembly. I will write some pseudo-code comments and remove the time measurement instructions.
operations::set_ref_cell:
... ; the RefCell is in [rbx]
mov rcx, qword ptr [rbx]   ; copy the RefCell borrow counter into rcx
mov eax, 1000000000        ; our loop counter
.LBB7_1:
test rcx, rcx              ; fancy way to see if rcx is 0
jne .LBB7_4                ; handle "already borrowed"
mov qword ptr [rbx], -1    ; set -1 as the RefCell borrow counter
mov qword ptr [rbx + 8], 1 ; write the 1 in our loop
mov rcx, qword ptr [rbx]   ; load the RefCell borrow counter again
inc rcx                    ; ... increment its value
mov qword ptr [rbx], rcx   ; ... and save it
dec rax                    ; decrement the loop counter
jne .LBB7_1                ; loop back if we are not done
... ; result handling
.LBB7_4:
...
call qword ptr [rip + core::cell::panic_already_borrowed]
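The .LBB7_4 branch is the runtime borrow check kicking in. It never fires in this benchmark, but here is a minimal sketch (not part of the measured code) of the situation it guards against:

use std::cell::RefCell;

fn main() {
    let cell = RefCell::new(0u64);
    let _first = cell.borrow_mut(); // borrow counter is now -1 ("mutably borrowed")
    // A second mutable borrow while the first is still alive panics at
    // runtime with a BorrowMutError - the path the asm above jumps to.
    let _second = cell.borrow_mut();
}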
So what is different in the increment variant (apart from the names and labels, obviously)?
22c22
< mov qword ptr [rbx + 8], 1
---
> inc qword ptr [rbx + 8]
As expected, we increment the pointed-to value instead of setting it. However, unexpectedly, this is the only change. Where does the increase in runtime come from? Well, honestly, I don’t know.
Measuring is hard… After running the test a few more times, the increased time just… disappeared? Meh…
inc_mut_ref: 240.035115ms
inc_ref_cell: 784.511452ms
inc_atomic: 4.185777832s
This was already averaged across a billion iterations earlier. WTF…
Atomic internals #
After changing from just storing some value into the atomic to actually atomically incrementing it, the code takes 17 times as much time. So obviously something changed. Let’s do a diff.
13,14c13,14
< .LBB8_1:
< mov qword ptr [rip + operations::Z], 1
---
> .LBB11_1:
> lock inc qword ptr [rip + operations::Z]
Since we were previously storing 1 into the variable in complete disregard of its current contents, little synchronization was required.
Now that we read, increment and store the variable, we need to make sure that we don’t have lost updates.
In this simple case, this can be achieved with a lock prefix on the inc instruction.
Not all instructions can be combined with lock, though.
With this, the CPU ensures that the read-modify-write is performed atomically and takes care of cache coherence for us.
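To make the read-modify-write nature explicit, here is a conceptual sketch of what fetch_add has to guarantee (the hardware uses the single locked instruction shown above, not this compare-exchange loop):

use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

// Conceptually: read the current value, add one, and only commit the result
// if nobody else wrote to the variable in between - otherwise retry.
fn increment_via_cas(x: &AtomicU64) -> u64 {
    let mut current = x.load(Relaxed);
    loop {
        match x.compare_exchange_weak(current, current + 1, Relaxed, Relaxed) {
            Ok(prev) => return prev,         // our update won
            Err(actual) => current = actual, // somebody interfered, retry
        }
    }
}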
Multiple threads #
Until now, we have used the atomic variables only in one thread; that’s a huge waste. Let’s use them in a context they were intended for: multithreading.
Why not just use a mut ref? Rust disallows that. If you need a refresher on why, read up on it here and also here.
So here’s the code we use for this experiment.
static NUM_ITER: u64 = 100_000_000; // note that we use "only" 100M instead of the 1B as before
static A: AtomicU64 = AtomicU64::new(42);
static STOP: AtomicBool = AtomicBool::new(false); // signals the contender threads to exit
// ...
for _ in 0..20 {
    for r in [Relaxed, Acquire, SeqCst] {
        for num_threads in 0..24 {
            let mut handles = vec![];
            for p in 0..num_threads {
                let t = thread::spawn(move || {
                    while !STOP.load(Acquire) {
                        A.store(p as u64, Relaxed); // see note below
                        for _ in 0..100_000 {
                            black_box(A.load(Relaxed));
                        }
                    }
                });
                handles.push(t);
            }
            thread::sleep(Duration::from_millis(50));
            let mut x = 0;
            let dur = load_atomic(&mut x, r);
            println!("load,{r:?},{num_threads},{:?} - x: {x}", dur.as_millis());
            // write to STOP and join the threads
        }
    }
    wtr.flush().unwrap(); // wtr is defined in the omitted setup above
}
}
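For completeness, here is a sketch of what the teardown behind the “write to STOP and join the threads” comment might look like (assuming the STOP flag shown above):

STOP.store(true, Release);   // ask the contender threads to exit their while loop
for h in handles {
    h.join().unwrap();       // wait until they are actually gone
}
STOP.store(false, Release);  // reset the flag for the next configuration

The Release stores pair with the Acquire load in the contender threads’ while loop.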
… and the payload function…
#[inline(never)]
fn load_atomic(x: &mut u64, r: Ordering) -> Duration {
    let start = Instant::now();
    for _ in 0..NUM_ITER {
        black_box({
            *x = A.load(r);
        });
    }
    start.elapsed()
}
We have
- 20 iterations, each with
  - all three relevant orderings
  - iterations with 0 to 23 “annoying” contender-threads that will
    - from time to time write some value in A (see the note in the next section)
    - read A a lot of times
    - eventually stop
  - a short sleep to give the threads time to spin up
  - measuring the load, writing to a variable to prevent the compiler from cheating
  - logging and teardown
As expected, we can see x take the index of the thread that wrote there last.
load,Relaxed,0,22 - x: 42
load,Relaxed,1,23 - x: 0
load,Relaxed,2,23 - x: 1
load,Relaxed,3,24 - x: 0
load,Relaxed,4,24 - x: 2
....
Time to chart these results, construct measurement code for the other scenarios and burn up a whole bunch of CPU cycles.
Concurrent reads #
The first test is reading in a loop as described above and spawns a number of threads that also read from that variable.
We can see the curve increase sharply between 10 and 15 threads. Since the machine has 12 physical cores, this is somewhat explainable, but I did not expect it when designing the benchmark. Given that we can scale “normal” CPU-bound workloads linearly across this threshold, this is an important finding! The explanation probably has something to do with CPU caches being shared between the threads, but I don’t dare to go into details.
Another observation is that the selected ordering parameter does not really influence performance. Given that we only load, this is very much expected.
This experiment gets a lot more interesting, and a lot slower, when we also write to the variable.
Store with concurrent reads #
Now we switch the measured loop to writing values, while keeping the contending threads as readers. This is a common pattern, where one thread updates an atomic variable and a bunch of readers receive the value.
The graph shows the Relaxed-Relaxed and Acquire-Release orderings. Why not SeqCst will be discussed in a second.
Can we already learn something from this? One thing is that we do see a steep increase when going over the number of physical cores. However, there is some weird behavior before that: synchronizing with 10 threads is cheaper than with 5 threads? I am not going to guess why that is happening; my learning here is that you have to measure in each application, because of such weirdness.
It is worth noting that with this simple test, we can already see that 24 readers are cheaper than one reader and one writer. My intuition tells me this does make sense, but I did not expect the ramp-up to be this large.
Now back to the SeqCst that was left out before.
The reason for that omission is that it dwarfs the other two ordering strategies by a huge margin.
With the concurrent reads previously, there was no need to evict caches in the CPU. With the writes, however, this changed dramatically.
We can also see that if the compiler can’t be sure there will be no readers, storing with SeqCst from a single thread is already significantly more expensive than using the other orderings.
The problem described in this previous section bit me here when building the first version of this benchmark, and I had to go back and double-check every previous measurement.
It is quite interesting to see that adding a sleep in the contending thread, effectively rendering it inactive, already triggers the slowdown, just because the compiler can’t be smart about it.
The observed slow-down shows that it is crucially important to know when to use SeqCst and when you don’t need it!
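As a reference point for this “one writer, many readers” pattern, here is a minimal sketch of the cheaper ordering pair (illustrative only; the writer releases, the readers acquire, and SeqCst is only needed when you must establish a single total order across several atomic variables):

use std::sync::atomic::{AtomicU64, Ordering::{Acquire, Release}};

static VALUE: AtomicU64 = AtomicU64::new(0);

// Writer side: publish a new value.
fn publish(v: u64) {
    VALUE.store(v, Release);
}

// Reader side: everything the writer did before its Release store is
// visible after this Acquire load observes the stored value.
fn read_latest() -> u64 {
    VALUE.load(Acquire)
}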
If we look at the same data with a log scale, the difference becomes apparent:
SeqCst
is about one order of magnitude slower in this scenario, when compared to Relaxed
or Acquire+Release
operations.
You think this was slow?
Get ready for write-contention…
Store with contending stores #
So far, CPU cache coherence was not that much of a problem for the writer thread. Now that we add more contending writers, this is no longer the case, and each write will probably require communication between the caches, causing evictions all the time.
Again, SeqCst
is far slower than the other two.
The log scaled graph makes this a bit more manageable.
Here we see a fascinating thing: For the Acquire+Release
operations, the maximum is actually at 6 threads.
Once again I don’t dare to try to explain why we see this, but that’s another reminder to benchmark your code early on!
As before, the SeqCst
ordering is about an order of magnitude more expensive.
So far, we disregarded the previous value of the variable when writing. As we saw previously, atomic increments are more expensive than that. Let’s put a number on that “more” in the next section.
Increment with contending reads #
In the initial micro benchmark, we saw that 1B atomic increments take about 4.36 seconds if there are no other threads messing with our memory. So for the 100M iterations, we would expect 0.436 seconds. I am very glad that this is actually the case when measured here!
Do note though, that for the sake of my time (and the electricity bill), I ran these benchmarks with 10M iterations and scaled the results accordingly. Luckily, this time we won’t need the log scale:
In contrast to previous experiments, the three orderings are roughly the same and diverge starting only from ~15 threads.
This does make sense: the increment is a read and a write in one operation, and there is one writer.
In these experiments, this is somewhat similar to SeqCst in stores with concurrent readers.
The graph shape and values are even quite similar.
Increment with contending stores #
This one will be the slowest of them all:
Our measured thread is incrementing a variable and all other threads write to it, regardless of its contents.
Since the stores are faster than the increment, they will be performed more often, if we don’t enforce SeqCst
ordering.
If the increment did a CAS, this means that the stores would more often interfere with the update.
Looking at the chart, I think this confirms the above: The SeqCst
is for the first time faster than the other two orderings!
This means that by going slower, we actually go faster…
Also note the scale compared to earlier experiments: with this setup, we breached the 20 seconds per experiment. That’s a x860 increase from before. We will compare across these experiments soon.
Increment with contending increments #
As observed previously, increments behave similarly to SeqCst in these experiments.
So we would expect measured increments with contending increments to behave similarly to SeqCst in the previous experiment.
And in fact, that’s what we observe! I think this is a very good example that performance measurements are sometimes weird (spikes at 5) and counter-intuitive (going faster by going slower). Therefore, no matter how well you plan your system, it is paramount to measure.
Comparing results #
Next, we can compare the results across different experiments. In the following, you can see the runtime for the individual experiments for 1, 11 and 23 contender threads.
These graphs show the absurd increase in time required with more complex operations. Do note though that these graphs use a log scale; otherwise it would be really hard to show an increase of x860.
With the 100M iterations we run with, the time for a single atomic operation ranges from 1/4th of a nanosecond to 1/4th of a microsecond.
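As a rough sanity check of that range (using the approximate total runtimes seen above, numbers rounded):

~25 ms / 100_000_000 iterations ≈ 0.25 ns per operation (cheapest case)
~25 s  / 100_000_000 iterations ≈ 250 ns per operation (most expensive case)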
Measuring CPU time near atomic operations #
In a previous article, we had a look at how we can measure time.
Rust’s Instant::now() uses the OS, and that internally uses RDTSCP.
As we saw in the quote from here, there may be some implications with this when working with atomics:
The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible.
Let’s put that to a test.
Previously we saw that measuring a time difference of a nop
with 10M iterations costs us about 500 to 700 milliseconds.
Further, we know the minimum time we could possibly measure consecutively is about 20ns.
To design an experiment for this, we need to stay well away from that 20ns.
Let’s use the store2
experiment from above with 11 CPUs.
This takes 44ms on average for 10M iterations and includes contending store operations.
We modify the store_atomic
function as follows for a test-run.
This does not do any atomic operations yet!
#[inline(never)]
fn store_atomic(r: Ordering) -> Duration {
    let mut timings = vec![0; NUM_ITER as usize];
    let start = Instant::now();
    for i in 0..NUM_ITER {
        black_box({
            let this_start = Instant::now();
            // not doing anything yet
            timings[i as usize] = this_start.elapsed().as_nanos();
        });
    }
    start.elapsed()
}
As expected, this yields something between 500 and 700 ms (in this case 593.362922ms). If we combine both, writing to the atomic (<50ms) and measuring time (<700ms), we will see something less than a second, right? Right?
Let’s try it out!
#[inline(never)]
fn store_atomic(r: Ordering) -> Duration {
    let mut timings = vec![0; NUM_ITER as usize];
    let start = Instant::now();
    for i in 0..NUM_ITER {
        black_box({
            let this_start = Instant::now();
            A.store(i, r); // now we store it
            timings[i as usize] = this_start.elapsed().as_nanos();
        });
    }
    start.elapsed()
}
...
[13] individual (Relaxed,Relaxed) with 10 readers: 3.337682106s
[14] individual (Relaxed,Relaxed) with 10 readers: 2.723738439s
[15] individual (Relaxed,Relaxed) with 10 readers: 4.834771545s
[16] individual (Relaxed,Relaxed) with 10 readers: 2.887653046s
[17] individual (Relaxed,Relaxed) with 10 readers: 2.596006156s
[18] individual (Relaxed,Relaxed) with 10 readers: 17.257207151s
[19] individual (Relaxed,Relaxed) with 10 readers: 4.840911815s
...
That’s certainly not less than a second. First observation: this fluctuates a lot. Second observation: we need to know whether measuring the time or storing to the atomic takes more time! For this, we can reuse the same mechanism as in the “timing” article: comparing what we measure between two instants across all operations vs. summing up each individual observation.
#[inline(never)]
fn store_atomic(r: Ordering) -> Duration {
    let mut timings = vec![0; NUM_ITER as usize];
    let start = Instant::now();
    for i in 0..NUM_ITER {
        black_box({
            let this_start = Instant::now();
            A.store(i, r); // with and without this
            timings[i as usize] = this_start.elapsed().as_nanos();
        });
    }
    let total = start.elapsed();
    let total_ns = total.as_nanos();
    let total_ns_sum = timings.iter().sum::<u128>();
    let diff = total_ns - total_ns_sum;
    println!("total_ns:{total_ns}, total_ns_sum:{total_ns_sum}, diff:{diff}, ratio: {}",
        total_ns as f64 / total_ns_sum as f64);
    total
}
Here, total_ns
is the obs
from last post, the observed total time.
// without a store
total_ns:557461022, total_ns_sum:240712474, diff:316748548, ratio: 2.315879242718431
total_ns:563480572, total_ns_sum:241284087, diff:322196485, ratio: 2.3353407968425204
...
// with store
total_ns:4003255641, total_ns_sum:3440019803, diff:563235838, ratio: 1.163730405711272
total_ns:4637864974, total_ns_sum:4312597522, diff:325267452, ratio: 1.0754226311035755
...
The output suggests that the increased time is spent storing to the atomic variable, because the ratio converges towards 1.0. The conclusion for me is that an innocuous time measurement can severely mess up your performance when used “near” atomics.
Learnings #
These were a lot of numbers and graphs, let’s summarize what we can learn from them.
- The Rust compiler tries its best to get rid of as much unnecessary code as possible. Don’t try to outsmart it.
- Modern computers are fast; 4bn store operations per second per core are amazing. We could probably squeeze more into that loop if we really wanted.
- Exchanging hundreds of millions of values through an atomic variable between threads is possible. This means that for “request counters”, which often range in the thousands per second, this is not even remotely a concern.
- Only reading or only storing values is a lot faster than reading and writing (e.g. for an increment).
- The ordering parameter for Rust’s atomic operations is important to understand. Only use SeqCst when you need it!
  - The performance penalty may or may not show depending on the thread configuration.
  - Obviously, correctness has to be upheld.
- These experiments used one single atomic variable. If you use multiple unrelated ones, you can expect higher performance (if you get around false sharing). If you use multiple related ones, expect lower performance.
- Measuring time “near” atomic variables may lead to devastating and undesired side effects.
Next steps #
With atomics out of the way, we can move on to more complex constructs: Locks, channels, async tasks, etc. If I don’t forget to do so, these will be linked here when available.
Check out the series overview!

Cover photo by T I M E L O R D on Unsplash