Measuring time, how long can it take?
Here is a list of falsehoods we may believe about time. It ranges from obvious to absurd, but it gets the point across.
Luckily, for many microbenchmarks, we only need to measure short durations and can discard the measurements should any of the obvious edge cases (change of timezone) occur. The more nuanced problems still persist.
Let’s kick this off with the most trivial measurement: Getting time twice.
Using Instant::now #
use std::time::Instant;

let first = Instant::now();
let second = Instant::now();
let duration = second - first;
println!("duration: {duration:?}");
This outputs 80ns most of the time.
Of course, measuring once is not enough, so let’s put that in a loop (using statrs for the statistics).
let mut observations = Vec::new();
for _ in 0..10_000_000 {
    let first = Instant::now();
    let second = Instant::now();
    let duration = second - first;
    observations.push(duration.as_nanos() as f64);
}
// ... some boring data crunching
println!("Avg: {avg:.2}ns, min {min}ns, max {max}ns, sd {sd:.2}ns");
This gives us:
Avg: 22.24ns, min 20ns, max 18150ns, sd 28.64ns
With only 22ns between two Instant::now() calls, we have a lower bound on what we can measure with this function.
As Grace Hopper once explained, a nanosecond is not that far: light travels only about 30cm in that time.
This time resolution should be enough for most of our applications.
But… according to the docs it uses a syscall, and these may be expensive, right? Right?
Let’s talk to the CPU directly.
Reading RDTSCP #
Unfortunately, talking to the CPU requires magic incantations, known as assembly.
I can’t write assembly.
Fortunately, Rust allows us to do exactly what I want to do without touching actual assembly code: core::arch::x86_64::__rdtscp. That scary looking function accesses one of the Time Stamp Counters of the CPU using RDTSCP (see here) and sidesteps many of the problems we see with other timers. The only thing we need to ensure for our purpose is that the two measurements come from the same CPU core (the P-part of the instruction). This gives us the number of cycles the core has counted since boot. This count can differ between cores, hence the check that we are on the same core for both measurements.
use core::arch::x86_64::__rdtscp;

fn read_cpu_timer() {
    let mut aux = 0u32;
    let mut aux2 = 0u32;
    // __rdtscp writes the current core ID into its out-parameter
    // and returns the timestamp counter.
    let a = unsafe { __rdtscp(&mut aux) };
    let b = unsafe { __rdtscp(&mut aux2) };
    // If the core IDs differ, the thread migrated between the two reads.
    if aux != aux2 {
        panic!("We were moved!")
    }
    println!("diff: {}", b - a);
}
When running this, I see two values appearing over and over again: 38 and 76. Of course there are some outliers (114 here and there), but these two make up the majority.
Let’s run that in a loop and get some stats:
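A minimal sketch of that loop, using a hypothetical cycle_diff helper that works like read_cpu_timer but returns the difference instead of printing it:

use core::arch::x86_64::__rdtscp;

// Hypothetical helper: measure two back-to-back RDTSCP reads on the same core.
fn cycle_diff() -> u64 {
    let (mut aux, mut aux2) = (0u32, 0u32);
    let a = unsafe { __rdtscp(&mut aux) };
    let b = unsafe { __rdtscp(&mut aux2) };
    assert_eq!(aux, aux2, "We were moved!");
    b - a
}

let observations: Vec<f64> = (0..10_000_000).map(|_| cycle_diff() as f64).collect();

Crunching those numbers like before gives us: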
Avg: 59.08op, med: 76op, min: 38op, max: 25232op, sd: 34.19op
But how much time are 76 cycles?
The CPU manufacturer’s website states a base clock of 3.8GHz.
The CPU governor reports more or less a maximum of 3804371 at /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq under full load, so I guess it’s reasonable to assume the CPU does not make use of its boost budget in this experiment.
The 3.8GHz translate to about 0.26ns per cycle (1 / 3.8GHz ≈ 0.26ns).
This means that light can travel around 8cm in that time.
That won’t even leave the front of the PC case.
So, how precise can we measure?
[This](https://www.agner.org/optimize/instruction_tables.pdf) awesome document denotes the following values for RDTSCP:

- Ops: 37
- Reciprocal throughput: 67

Ops: the number of macro-operations issued from the instruction decoder to the schedulers.
Reciprocal throughput: the average number of clock cycles from when the execution of an instruction begins until a subsequent, independent instruction of the same kind can begin to execute.
I have no true understanding of either, so here’s my guess: each operation takes 37 cycles, and 67 cycles pass from the start of one to the start of the next.
The above measurements are suspiciously close to the documented values, so I guess that is as precise as we can get.
There is still a bit of an offset between the CPU’s “promise” to complete in 37 cycles and the observed 38. Let’s dig into this real quick:
cargo asm --release --bin timing "timing::read_cpu_timer"
...
timing::read_cpu_timer:
.cfi_startproc
rdtscp ; read timer
mov r8d, ecx ; save cpu id
mov rdi, rdx ; one part of our timestamp
mov rsi, rax ; the other part of our timestamp
rdtscp ; read timer again
cmp r8d, ecx ; compare current and previous cpu ids
jne .LBB12_2 ; jump to abort if we moved
shl rdi, 32 ;\
or rsi, rdi ; \
shl rdx, 32 ; +-> maths
or rax, rdx ; /
sub rax, rsi ;/
ret
.LBB12_2:
push rax
.cfi_def_cfa_offset 16
call qword ptr [rip + std::process::abort@GOTPCREL]
Of course, we need to save our result between invocations of RDTSCP. These moves happen in 1/3 of a cycle each (modern cores can execute several independent movs per cycle across their execution ports), and three of them sit between the two reads, so this explains the 1 cycle difference between 37 and 38.
Since measuring time in operations has a bit of an artificial feeling to it, let’s convert the stats above to nanoseconds.
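A minimal sketch of the conversion, assuming the fixed 3.8GHz clock from above (i.e. 3.8 cycles per nanosecond; the function name is mine):

// Convert a cycle count to nanoseconds, assuming a fixed 3.8GHz clock.
const CYCLES_PER_NS: f64 = 3.8;

fn cycles_to_ns(cycles: u64) -> f64 {
    cycles as f64 / CYCLES_PER_NS
}

The median of 76 cycles becomes 76 / 3.8 = 20ns. Applied to the stats above, this gives: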
Avg: 15.55ns, med: 20.00ns, min: 10.00ns, max: 6640.00ns, sd: 9.00ns
If we ask the processor for the time difference directly, we can measure about 10 or 20 nanoseconds.
That’s still only 6m that light could travel and is more than enough precision for me.
Given that the Instant::now() call has only slightly worse resolution and way better ergonomics, I will use that one in future experiments.
Keep in mind, though, that the CPU could execute many ADD operations for the same cost as asking for the time.
And that does not even take into account the reordering that is prevented by RDTSCP.
Multi-Threaded time reading #
Another scenario to look into is measuring time when using multiple threads. To be clear, in this section each thread measures its own time differences; we do not compare timestamps across threads. The internet states that:
The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible.
If we don’t communicate between threads, we should be fine, right? We can confirm this with a simple setup: we slap the whole thing from above into a loop that spawns an increasing number of threads.
use std::thread::{self, JoinHandle};

fn main() {
    for fun in [simple_time, cpu_time] {
        for num_threads in 1..=24 {
            // Spawn num_threads threads, each running the measurement 10 times.
            let handles = (0..num_threads)
                .map(|i| thread::spawn(move || (0..10).for_each(|_| fun(i))))
                .collect::<Vec<JoinHandle<()>>>();
            handles.into_iter().for_each(|j| j.join().unwrap());
        }
    }
}
Further, we need a variant of the RDTSCP version that does not abort if the CPU id changed. It now retries until it stays on the same core for the whole measurement.
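A minimal sketch of that retrying variant (the name is an assumption, the original code is not shown):

use core::arch::x86_64::__rdtscp;

// Retry until both reads report the same core ID.
fn read_cpu_timer_retrying() -> u64 {
    let mut aux = 0u32;
    let mut aux2 = 0u32;
    loop {
        let a = unsafe { __rdtscp(&mut aux) };
        let b = unsafe { __rdtscp(&mut aux2) };
        if aux == aux2 {
            return b - a;
        }
        // The thread migrated between the two reads; measure again.
    }
}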
Now the only thing left to do is to have simple_time and cpu_time also measure the total time and report their individual measurements along with their type (I or C) and thread id.
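As a hedged sketch (names and the exact reporting are assumptions based on the output below), simple_time might look like this; cpu_time would be the analogous RDTSCP version:

use std::time::Instant;

fn simple_time(thread_id: i32) {
    let start = Instant::now();
    let mut observations = Vec::new();
    for _ in 0..10_000_000 {
        let first = Instant::now();
        let second = Instant::now();
        observations.push((second - first).as_nanos() as f64);
    }
    let obs = start.elapsed(); // total wall-clock time, reported as obs
    // ... crunch observations into sum/avg/med/min/max/sd and print them
    // with the "[I-{thread_id}]" prefix (cpu_time would use "[C-...]").
}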
This gives us plenty of output.
[I-13] obs: 625.878651ms, sum: 231925090, avg: 23.19ns, med: 20ns, min 20ns, max 23651ns, sd 23.22ns
[I-19] obs: 621.997408ms, sum: 230249356, avg: 23.02ns, med: 20ns, min 20ns, max 57460ns, sd 30.97ns
[I-21] obs: 633.422257ms, sum: 234348479, avg: 23.43ns, med: 20ns, min 20ns, max 32251ns, sd 25.46ns
...
[C- 9] obs: 492.09669ms, sum: 187603120, avg: 18.76ns, med: 20.00ns, min: 10.00ns, max: 2591300.00ns, sd: 1356.47ns
[C-21] obs: 476.078693ms, sum: 178186400, avg: 17.82ns, med: 20.00ns, min: 10.00ns, max: 128380.00ns, sd: 133.59ns
[C-12] obs: 729.991761ms, sum: 234324810, avg: 23.43ns, med: 20.00ns, min: 10.00ns, max: 10038080.00ns, sd: 4981.55ns
The obs field here was added to compare the actual wall-clock time taken to the sum of the measured time differences. Let me explain: we measure the “reciprocal throughput” here (as we learned earlier), that is, from one invocation to the next; then we start the loop again without measuring anything. The other part of the time, from the second measurement of one loop iteration to the first measurement of the next iteration, is not measured. This means that obs will always be at least twice as large as sum. Further, obs also includes saving the result in the vec, decrementing the loop counter, and the jump back to the start of the loop. And indeed, the actual observed time obs is about 1.3 times that expected minimum of 2 × sum (625.9ms vs 2 × 231.9ms in the first line above).
Another interesting piece of data here is that the RDTSCP variant has some crazy outliers. In the last line of the listing above, one measurement was 500K times as slow as the average. Whether that is due to scheduling or an excessive number of retries in one measurement, I can’t tell. The Instant-based measurements “only” have outliers of 20k in these sample measurements.
The most important result we can gather is that measuring from multiple threads does not negatively impact the measurements in an obvious way. Spoiler alert: we will see in the next article that this is not as easy as it seems.
Conclusion #
Rust can measure with 20ns to 30ns resolution using either RDTSCP or Instant::now.
Both are prone to insane outliers, so measuring something only once most likely won’t produce reliable results.
Merely measuring time is not affected by multithreading as long as threads don’t communicate.
From the docs we learned, though, that RDTSCP comes with some hefty costs when there are loads and stores across threads.
This (and much more) will be investigated in the next article.
Cover photo by Karsten Füllhaas on Unsplash