How does the Linux kernel load pages into memory when you request something to be mapped using mmap?
Recently I had to investigate why our database product was slow to start in a customer’s environment.
The component acting up was using mmap to read parts of an index into memory.
While doing some research on it, I got nerd sniped.
mmap Basics #
I will quickly summarize the basics of mmap and provide an example for readers not familiar with mmap.
If you want to skip this, head to mmap behavior
What is mmap? #
The docs describe it as follows:
mmap() creates a new mapping in the virtual address space of the calling process.
There are many ways this can be used and many things that can be mapped.
The one relevant for this article is the following:
You can use an mmap call to map the contents of a file into the virtual address space.
Let's walk through an example to see what this looks like in code:
- First, we need a file descriptor; we can get one by opening a file
int fd = open("test.bin", O_RDONLY);
if(fd==-1){/*...*/}
- Then we call mmap, passing in information about where and how the mapping should be created.
long map_len = 100;
long read_pos = 50;
char *map = mmap(0, map_len, PROT_READ, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {/*...*/}
- Using the mapping, we can now read any position within the memory region.
printf("read value 0x%x\n", map[read_pos]);
- When done, unmap the mapping and close the file (or let the OS do that for you…)
Nothing too spectacular so far, so why shouldn't we just use a regular read call to read the data into a buffer?
Mmap allows us to do some interesting things; one of them is having a mapping that is larger than main memory (RAM).
Consider the following example.
int fd = open("test.bin", O_RDONLY);
if(fd==-1){/*...*/}
long map_len = 300L*1024*1024*1024; // 300GB
long read_pos = 100L*1024*1024*1024; // offset of ~100GB
char *map = mmap(0, map_len, PROT_READ, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {/*...*/}
printf("read value 0x%x\n", map[read_pos]);
We can read any byte within that 300GB memory chunk (here we read one at an offset of ~100GB) and don't need to care about what is loaded where: the OS takes care of loading the required data from disk into memory whenever we access it. This is awesome! Mmap is even more awesome than that: the OS also takes care of unloading memory again, so we won't run out of RAM.
However, as with all tech, there are some caveats.
Alternative options #
If we were to replicate the same behavior without using mmap, we would have a few options:
- Install 300GB of RAM. Easy fix. That would probably be enough to run two instances of slack at the same time!
- Implement custom paging. This would require us to write a boatload of code to decide which pages to load, free up pages when we no longer need them, prevent access to unloaded pages, and so on. And this is only the read path; if we were writing too, this would get an order of magnitude more complex. Most databases have to do this, though.
- Don’t use a backing array but use the underlying file APIs directly.
Then a
seek
within the file and a call toread
would do the trick. This works great when reading parts of the file sequentially. However, when you read a portion of the file twice, you either end up building option 2 or run into severe performance issues.
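To make option 3 concrete, here is a minimal sketch using Rust's standard file APIs (Rust, because that is what the rest of the article switches to; a C version using lseek and read would look equivalent):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Read `buf.len()` bytes starting at `offset` by seeking and reading explicitly.
// Every access costs at least one syscall, and any caching beyond the OS page
// cache is our own responsibility.
fn read_at_offset(path: &str, offset: u64, buf: &mut [u8]) -> std::io::Result<()> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    file.read_exact(buf)
}
```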
There are a lot of things the OS takes care of in this regard; we will discover them as we follow the investigation. Each of the options (including using mmap) has drawbacks that range from obvious to subtle and from minor annoyance to absolute showstopper.
Extended example #
We will have two example workloads later on. One of them reads from the mapped buffer sequentially and the other reads in a pseudo-random manner. Since I feel more comfortable using Rust than C, let's rewrite it in Rust.
// Imports assumed for this snippet: the memmap2 crate for the mapping and anyhow for error context.
use anyhow::{Context, Result};
use memmap2::{Mmap, MmapOptions};
use std::fs::OpenOptions;

fn do_mmap() -> Result<()> {
    let file = OpenOptions::new()
        .read(true)
        .open("10g.bin")
        .context("Could not open file")?;
    let mmap = unsafe {
        MmapOptions::new()
            .len(BUFFER_SIZE)
            .map(&file)
            .context("Could not map file")?
    };
    read_sequential(&mmap);
    Ok(())
}
Now we can define the first read workload (sequential):
fn read_sequential(mmap: &Mmap) {
let mut sum = 0u64;
for i in 0..BUFFER_SIZE {
sum += mmap[i] as u64;
}
println!("Sum is {sum}");
}
We will use a file size of 10GB (BUFFER_SIZE = 10 * 1024 * 1024 * 1024).
This function reads the whole buffer once, sequentially. To prevent the compiler from optimizing away the whole thing, it prints the sum. There are other ways to achieve this, but this is by far the simplest one.
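One of those other ways (not used in the measurements here) is std::hint::black_box, which marks a value as observed so the optimizer cannot drop the loop that produced it:

```rust
// Alternative to printing the sum: tell the optimizer the value is used.
std::hint::black_box(sum);
```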
We also need to support unpredictable reads. While they are at best “pseudo-random”, I did make sure that they are spread across the file nicely. This resembles the original workload more closely as you will see in a few sections.
fn read_random(mmap: &Mmap) {
let prime = 5706217411;
let mut sum = 0u64;
for i in 0..BUFFER_SIZE {
let pos = (i*prime)%BUFFER_SIZE;
sum += mmap[pos] as u64;
}
println!("Sum is {sum}");
}
Here 5706217411 is a prime number that does not divide BUFFER_SIZE, hence (i*prime)%BUFFER_SIZE won't hit the same byte twice.
Since we don't need true randomness but only something that can't be predicted by the kernel, this is good enough.
Understanding mmap behavior #
Now that we have a test program (we will use fio later too), let’s investigate what is going on in the system. We will use a whole host of tools for this, so if you want to follow along, get your hands on them now:
- vmtouch
- sysstat (for iostat and sar)
- bpftrace
Let’s start from the most basic investigation: how much data is kept in buff/cache
before we start anything?
➜ free -h
total used free shared buff/cache available
Mem: 125Gi 40Gi 83Gi 2.5Gi 4.9Gi 84Gi
Swap: 0B 0B 0B
^
|
here ----------------------------------------------------------|
Nearly 5GB are already in buff/cache before we even run our code!
➜ ./target/release/reader
...
[2024-07-27T16:24:07Z INFO reader] mmap round SEQUENTIAL took 5.976189549s
...
[2024-07-27T16:24:10Z INFO reader] mmap round SEQUENTIAL took 2.691732412s
...
[2024-07-27T16:24:13Z INFO reader] mmap round SEQUENTIAL took 2.658959719s
One of these is not like the others: the first iteration took longer. This is because the data initially had to be loaded from disk into the page cache. We will make the difference more severe later on. Further, reading 10GB in 2.7 seconds is already quite fast.
And what about our free output now?
➜ free -h
total used free shared buff/cache available
Mem: 125Gi 40Gi 73Gi 2.5Gi 14Gi 85Gi
Swap: 0B 0B 0B
^
|
here ----------------------------------------------------------|
It's 14GB! That's ~10GB more than earlier. This is exactly what we would expect.
So how do we get rid of it again?
Evicting memory #
You can instruct the kernel to drop the relevant caches (here, the page cache) by writing 1 to the correct file in procfs:
sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'
A quick check with free confirms that we were successful.
➜ free -h
total used free shared buff/cache available
Mem: 125Gi 40Gi 88Gi 2.5Gi 4.4Gi 84Gi
Swap: 0B 0B 0B
^
|
here ----------------------------------------------------------|
Don’t run this on a production machine without thinking, though.
While we are at it, let's briefly discuss whether we should manually evict memory (a discussion I had with a customer earlier).
This is memory owned by the kernel, managed by the kernel, reclaimed by the kernel.
When applications need more "actual" memory, the kernel evicts buffer/cache pages from physical memory first.
So your application will not trigger out-of-memory kills just because 80% of memory is used by buffer/cache.
It would usually be wrong to assume we know better than the kernel when and how to reclaim memory.
So if you want to limit buffer/cache usage of the whole system, think again.
There are, however, some scenarios where it can make sense, e.g. bursty workloads that suddenly and quickly need to allocate a boatload of memory.
Another scenario that we will completely ignore in this article is writing to mapped memory, causing dirty pages. For that, check on the vm_dirty_* settings.
However, of course you can evict memory (it’s Linux after all, it does not judge us for our bad decisions).
With free we can see the overall system summary. But can we see that for our test file only? Maybe even which parts are mapped? This is where vmtouch comes into play.
Observing virtual memory with vmtouch #
With vmtouch, we have a small but powerful tool to see what parts of a file are loaded in memory. Again, let's have a look at the 10GB test file 10g.bin before the test code ran.
➜ vmtouch -v 10g.bin
10g.bin
[ ] 0/2621440
Files: 1
Directories: 0
Resident Pages: 0/2621440 0/10G 0%
Elapsed: 0.15446 seconds
As the large empty space in the brackets indicates, nothing is loaded in memory yet. Now we let the test program run for 1 second (which should load about 1/5.9th of the file) and observe the file again.
➜ timeout 1s ./target/release/reader
[2024-07-27T16:48:23Z INFO reader] Testee running with pid 3402768
➜ vmtouch -v 10g.bin
10g.bin
[OOOOOOOOOOo ] 464736/2621440
Files: 1
Directories: 0
Resident Pages: 464736/2621440 1G/10G 17.7%
Elapsed: 0.21727 seconds
Cool, vmtouch shows us exactly which pages are loaded.
There are far more pages (2621440) than "O"s (60), so vmtouch has to map several pages to a single character. If any page of such a bucket is mapped, you will see an o; if all pages of the bucket are mapped, you will see an O.
With vmtouch, we can do another cool thing: we can kick a file out of memory.
➜ vmtouch -e 10g.bin
Files: 1
Directories: 0
Evicted Pages: 2621440 (10G)
Elapsed: 0.26097 seconds
That way we don’t need to drop all the pages of the whole OS. With a simple loop, we can now use the tool to see loaded memory live as it fills up:
Note that the bar stays at 100% once the code starts looping over the already-loaded data; the gif ends after the first full pass and then loops on its own.
What does the loading behavior look like for a random/spread read pattern?
To actually see something, I had to slow down reading significantly; in this case a 100msec pause was introduced.
After 6 seconds, the whole bar is full of o.
There is kind of a problem though: we see it filled with o and not O. Further, not even 1% of the pages have been loaded!
To investigate this, we first need to have a look at an uncomfortable truth: random reads, even from something already loaded in memory, are waaaay slower than sequential reads.
As we will see in the next section, the difference is striking.
We need to fix this first before we can investigate mmap further.
Interlude: slow reads from memory #
Reading the same number of bytes already loaded in virtual memory, but in a pseudo-randomly spaced manner, takes 577 seconds compared to the 1.9 seconds we saw earlier.
| time (s) | virtual memory | loading from disk |
|---|---|---|
| sequential | 1.9 | 11 |
| random | 577 | ∞ |
Random reads from memory were more than 200 times slower than sequential reads. Keep in mind, the virtual memory was already loaded, so this excludes reads from disk!
I am quite confident this is due to cache misses, since each read is on a separate page, 5706217411 bytes apart.
Let's verify this assumption before blindly changing the code.
First, we replace our memory mapped buffer with a huge 10GB array on the heap: let mut map = vec![0u8; BUFFER_SIZE]; This rules out any unexpected side effects from using a mapped buffer.
Then we measure again to see if the problem persists.
And yes, it does: 1.3s for the sequential read and 81s for the random one.
| time (s) | heap | virtual memory | loading from disk |
|---|---|---|---|
| sequential | 1.3 | 1.9 | 11 |
| random | 81 | 565 | ∞ |
While reading from the heap is faster than from virtual memory, we still see the same pattern: the spread/random reads are a lot slower. So let's investigate whether the assumption of cache misses holds.
Using perf to investigate cache misses #
Let's use some science and come up with a null hypothesis H0: cache misses do not differ between the two read patterns.
If we can falsify this one, we can investigate H1: cache misses are more frequent for the random access pattern.
With perf, this can be verified.
For this use case, perf provides the ready-made mem profile.
We run the following once for each access pattern.
perf mem record -a ./target/release/mem_only_reader
perf mem report
perf mem -t load report --sort=mem
Note that perf does not give us exact numbers of how often something occurred, but rather how often it observed (sampled) something happening.
Further, the runtime (and work) of the two variants differed significantly, mainly because I did not want to wait half an hour for the random variant to finish.
The following output is for sequential operation.
Overhead Samples Memory access
27.10% 885 RAM hit
24.29% 233983 N/A
18.61% 179296 L1 hit
18.15% 12924 L2 hit
11.73% 1140 core, same node Any cache hit
0.11% 1012 LFB/MAB hit
0.00% 38 Uncached hit
And this one for random access.
Overhead Samples Memory access
67.01% 32644 RAM hit
18.22% 84602 core, same node Any cache hit
6.38% 1206079 N/A
5.14% 69827 L2 hit
3.24% 612569 L1 hit
0.01% 1702 LFB/MAB hit
0.00% 105 Uncached hit
What to make of this? Well, H0 can be rejected for sure: there is a difference in cache behavior. And we can accept H1, since there are clearly more RAM accesses compared to L1 and L2 hits for the random access pattern.
How do we fix this? We can’t trick physics after all.
A simple cheat #
We can "fix" this problem by not doing fully spread/random reads. We could decide to access a random page but then read within that page (or maybe the subsequent one) sequentially. This of course is not the same as before, but we do not want to measure RAM speed, we want to investigate mmap behavior. See the "Note" below for why this is still fair.
Enough with science, let's engineer this: how much do we need to read per page to achieve similar results? Note that we won't be able to match the plain sequential version exactly, since we need to do a bit more work. This also makes it harder for the compiler to optimize our code.
With the following, we try to find out the optimal size to read within (and slightly beyond) a page.
// Assumes `use std::time::Instant;` is in scope.
fn read_random(map: &[u8]) {
let prime = 5706217411;
for i in 0..14 {
let start_time = Instant::now();
let mut sum = 0u64;
let block_size = 1<<i;
for i in 0..BUFFER_SIZE / block_size {
for k in 0..block_size {
let pos = (i * prime + k) % BUFFER_SIZE;
sum += map[pos] as u64;
}
}
let dur = start_time.elapsed();
println!("block_size: {block_size}, dur: {dur:?} sum: {sum} ");
}
}
Nah, let's use a log scale for x.
This is better.
We can see that around 1024 bytes, the returns are diminishing.
This means we can read that many bytes of a random page, achieve better performance, and still stay within one memory page!
fn read_random(map: &[u8]) {
let prime = 5706217411;
let mut sum = 0u64;
let block_size = 1024;
for i in 0..BUFFER_SIZE / block_size {
for k in 0..block_size {
let pos = (k + i * prime) % (BUFFER_SIZE);
sum += map[pos] as u64;
}
}
println!("sum: {sum} ");
}
Now we still have a bit of a problem: in one variant we do 10 billion modulo operations and in the other one we don’t. By cheating a tiny bit, we can get the runtime from 6.3 seconds down to 3.2 seconds:
fn read_random(map: &[u8]) {
let prime = 5706217411;
let mut sum = 0u64;
let block_size = 1024;
for i in 0..BUFFER_SIZE / block_size {
let start = (i * prime) % (BUFFER_SIZE-block_size);
for k in 0..block_size {
let pos = start+k;
sum += map[pos] as u64;
}
}
println!("sum: {sum} ");
}
This way we don’t read the last 1024 bytes of a 10GB array (duh) but trim the 10 billion modulo operations down to 10 million.
Checking one more time with perf shows that we were successful:
Overhead Samples Memory access
28.24% 534468 N/A
28.08% 1219 RAM hit
16.79% 2450 core, same node Any cache hit
15.87% 300309 L1 hit
10.98% 8218 L2 hit
0.04% 720 LFB/MAB hit
0.00% 48 Uncached hit
With the new “random” reads, we can finally continue chasing after the mmap stuff.
Using cgroups to restrict memory #
Back to the actual topic of the article: memory mapped files.
Until now, the program had unrestricted memory and IO. This changes now, since our customer's cloud environment has fewer resources than a developer machine.
We will start by restricting memory first (mostly since this gives us a cool gif).
Using systemd-run, we can get a user slice restricted by a custom ad-hoc cgroup.
systemd-run --scope -p MemoryMax=1G --user zsh
In the new shell that just spawned, we can run the test
➜ ./target/release/mem_only_reader
[2024-07-28T07:34:51Z INFO mem_only_reader] Testee running with pid 100585
[1] 100585 killed ./target/release/mem_only_reader
We got killed? Oh right, that is the code that uses the 10GB array on the heap, which obviously won't fit into the 1GB this cgroup allows. Let's run the mmap version, but with the file already loaded in virtual memory.
➜ ./target/release/reader
[2024-07-28T07:39:50Z INFO reader] Testee running with pid 105427
[2024-07-28T07:39:52Z INFO reader] mmap round SEQUENTIAL took 2.06146459s
[2024-07-28T07:39:54Z INFO reader] mmap round SEQUENTIAL took 1.488209825s
...
Awesome, we see the same performance as outside the cgroup!
Now we kick the file out of virtual memory. If the OS does not have the file cached, the process running in the cgroup needs to request it, but has to pay for it with its memory quota.
As you can see, the process never loads the whole file into memory, not even after one full pass. This also means that the speed-up after the first iteration we observed earlier no longer occurs. This is detrimental to performance!
One side note though: if the main cgroup, with its unrestricted power, loads the whole file into memory, the cgroup-ed process does benefit from the pages in the page cache and runs with the expected performance (1.4 seconds). It's only the cgroup-ed process itself that is not allowed to load them into memory.
As you would expect, being constrained by a cgroup is bad news for the random access version… Remember, this has the same effect as having a system with too little memory (I could instead pull out some memory sticks or run with a 300GB file, but a cgroup is way more convenient). Loading random pages takes a lot longer than loading pages sequentially, even when reading the whole page.
| limits | file | sequential | random |
|---|---|---|---|
| none | heap | 1.4 s | 3.6 s |
| none | resident | 1.5 s | 3.9 s |
| none | unloaded | 11.1 s | 28.1 s |
| constrained | heap | killed | killed |
| constrained | resident | 1.5 s | 4.0 s |
| constrained | unloaded | 15.5 s | 1012.8 s |
What can we learn from this so far?
- Loading data from disk is expensive. Loading sequentially is roughly three times faster than doing it in random order if we don't need to evict pages due to memory constraints.
- Sequential reads are faster than random reads, for memory mapped data as well as for data that lives on the heap.
- If you don't have enough memory to keep the whole mapped file in memory, sequential reads are slower but still feasible. Whether using mmap here makes sense compared to a different IO approach is a different story, though. Loading random pages with mmap in a resource-constrained environment is about 67 times slower than reading them sequentially and about 36 times slower than in an unconstrained environment.
- There seems to be something going on in the background (we will demystify this later): sequential reads cause ~650MB/s of disk IO for 15.5 seconds. That's 10GB (the size of the mapping we are reading!). However, random reads cause 321MB/s for 1012 seconds. That's 328GB! About 32 times as much as we would need to read (spoiler: this will be important later again).
We will use cgroups later to restrict IO, but for now we need to investigate another thing first. Reading a 10GB file in 11 seconds is still kind of slow. Since loading pages requested through a memory mapping is synchronous, we could probably get more performance using multiple threads. Let's use fio to test that.
Using fio to compare mmap and io_uring #
With fio, we can compare IO performance of different modes.
Let's start by replicating our current setup.
fio --filename=test.bin --size=10G --rw=read --ioengine=mmap --numjobs=1 --group_reporting --name=iops-test-job
This does:
- Use a file instead of a whole device: --filename=test.bin of size --size=10G
- Read sequentially: --rw=read
- Using mmap: --ioengine=mmap
- In one thread: --numjobs=1
- and a bunch of boring params
The output is quite verbose, but the only line we are interested in for now is the following:
read: IOPS=242k, BW=943MiB/s (989MB/s)(10.0GiB/10854msec)
This means fio read the 10GB in 10.8 sec, just as we did!
But can we observe these 989MB/s fio claims to read from the outside? (This will be important in a second.)
We can use iostat for this:
iostat -d 1 -h
When fio is running, we can see about 1GB/sec being read.
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
4140.00 1.0G 0.0k 0.0k 1.0G 0.0k 0.0k dm-0
8280.00 1.0G 0.0k 0.0k 1.0G 0.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
In this case dm-0 is a device-mapper device on top of nvme0n1, hence the same reads show up on both devices.
What happens if we increase the number of threads to get around the synchronous limitations mentioned earlier?
fio --filename=test.bin --size=10G --rw=read --ioengine=mmap --numjobs=24 --group_reporting --name=iops-test-job
...
read: IOPS=3993k, BW=15.2GiB/s (16.4GB/s)(240GiB/15756msec)
Wow! …and what?
Fio claims a higher read bandwidth of 16.4GB/s but takes even longer than before: 15756msec?
There's something wrong…
Since the memory mapped buffer can be read by all threads, the first one to touch a page causes it to be loaded into memory, and the other threads then benefit from this. So most of the reads happen from memory.
This means we can't rely on fio for reporting the actual bandwidth.
Now the iostat thing comes in handy: using it, we can see that the actual bandwidth was only about ~740MB/s.
This aligns with the x1.5 increase in total time.
But the device should be able to handle more IO.
So let's find out first what the maximum throughput is, using io_uring as the engine.
fio --filename=test.bin --direct=1 --size=10G --rw=read --bs=256k --ioengine=io_uring --iodepth=64 --numjobs=24 --group_reporting --name=iops-test-job
In addition to the previous arguments, we make some changes:
- We use io_uring as the IO engine, an asynchronous IO interface for somewhat modern Linux kernels (>5.1)
- We use direct IO with --direct=1 to circumvent the OS's buffers
- We use a block size of 256KB (--bs=256k), which is outright cheating, since we instruct each read to fetch a whole chunk of data in one go
- By specifying --iodepth=64 we allow 64 asynchronous reads to be in flight at any time
And apparently this worked, bumping our bandwidth up to 4.8GB/s:
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
19684.00 4.8G 0.0k 0.0k 4.8G 0.0k 0.0k dm-0
39375.00 4.8G 0.0k 0.0k 4.8G 0.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
This is async IO, so one IO thread should be enough; let's see what a --numjobs=1 yields:
read: IOPS=20.4k, BW=5100MiB/s (5347MB/s)(10.0GiB/2008msec)
So with >5GB/s for io_uring, I guess we need to ramp up our mmap game.
What is different so far between using async IO with io_uring and our mmap approach:
- mmap is sequential, waiting for each page read (we ignore readahead for now but get back to it later on) to complete before reading the next page
- Reads with mmap have a fixed page size that we cannot really influence
- We can't use fio to increase threading with mmap, otherwise the reads will overlap and we will read from memory
Fio randread with mmap #
So far we tried sequential reads with fio; maybe we'll have better luck with random reads?
fio --filename=test.bin --size=10G --rw=randread --ioengine=mmap --numjobs=1 --group_reporting --name=iops-test-job
...
read: IOPS=11.5k, BW=45.0MiB/s (47.2MB/s)(10.0GiB/227388msec)
Oof, nope, that’s not what we want. Maybe for multiple random-read workers?
fio --filename=test.bin --size=10G --rw=randread --ioengine=mmap --numjobs=24 --group_reporting --name=iops-test-job
...
read: IOPS=3068k, BW=11.7GiB/s (12.6GB/s)(240GiB/20506msec)
Argh, fio "lies" to us again about that.
However, it manages to load the file in 19628msec, so I won't complain.
We do see one interesting thing in iostat though:
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
205934.00 804.4M 0.0k 0.0k 804.4M 0.0k 0.0k dm-0
205931.00 804.4M 0.0k 0.0k 804.4M 0.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
Notice the tps being way higher than before? Is the SSD actually serving 205934 IOPS?
Looking at sar, we see similar numbers (the majflt/s column):
pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s pgprom/s pgdem/s
913544.00 208.00 428142.00 228387.00 3933.00 0.00 0.00 0.00 0.00 0.00
According to the manufacturer's page ([neutrality is disputed]), the device should be able to handle 1 million read IOPS.
Hard to believe, but OK, let's try.
With io_uring, a queue depth of 64, and 24 threads, I got 736,236 IOPS.
Mind you, this is with LUKS, an actual filesystem, no tweaking, and not even cooling with liquid nitrogen.
I guess the stated 1M is not that far off.
It’s amazing how much speed you can get out of actual hardware when comparing that with offerings from your average cloud vendor. The insane amount of IOPS will become important later on again.
Multithreaded, nonoverlapping reads #
Since we can't tell fio to read distinct sections of the file in several threads (rather than the whole file, which leads to the nasty caching again), we have to adjust our tool from earlier.
Using rayon, changing to a multi-threaded read becomes as trivial as changing from into_iter to into_par_iter:
// Assumes `use rayon::prelude::*;` and a `num_threads` value (e.g. 24) defined elsewhere.
fn read_sequential(mmap: &Mmap) {
let chunk_size = BUFFER_SIZE/num_threads;
let sum = (0..num_threads)
.into_par_iter()
.map(|t| {
let mut sum = 0u64;
for i in t*chunk_size..(t+1)*chunk_size{
sum += mmap[i] as u64;
}
sum
})
.sum::<u64>();
println!("Sum is {sum}");
}
When running this sequential code, we achieve the same throughput as io_uring: 5GB/s!
Further, we finish loading in 2.0 seconds.
That's even faster than the spread reads from data that was already in memory!
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
40214.00 4.9G 0.0k 0.0k 4.9G 0.0k 0.0k dm-0
40216.00 4.9G 0.0k 0.0k 4.9G 0.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
Note however that we used 40,214 IOPS with mmap while io_uring only required 19,684, so roughly twice as many.
Both were capped by throughput (4.9GB/s).
Now what about multi-threaded random reads using mmap? The code is slightly more complex and cuts a few corners but if each thread reads one page less than it should, I am fine with it.
fn read_random(mmap: &Mmap) {
let prime = 33757333;
let block_size = 4096;
let chunk_size = BUFFER_SIZE/num_threads;
let sum = (0..num_threads)
.into_par_iter()
.map(|t| {
let mut sum = 0u64;
for i in t*chunk_size/block_size..(t+1)*chunk_size/ block_size{
let start = (i * prime) % (chunk_size-block_size);
for k in 0..block_size {
let pos = start+k;
sum += mmap[pos] as u64;
}
}
sum
})
.sum::<u64>();
println!("sum: {sum} ");
}
With a runtime of only 2.9 seconds, this is also looking good bandwidth-wise:
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
38747.00 4.7G 156.0k 0.0k 4.7G 156.0k 0.0k dm-0
38726.00 4.7G 156.0k 0.0k 4.7G 156.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
As we will find out in the next section(s), these random reads may not be random enough though.
Comparing read sizes for random and sequential mmap #
Remember the 200K IOPS from fio earlier? Allow me to put them next to the ones of the non-overlapping reads:
tps kB_read/s
205934.00 804.4M // fio with mmap randread
vs
tps kB_read/s
38747.00 4.7G
If we calculate the average size of these reads, we get 804.4M / 205934 = 3.9KB and 4.7G / 38747 = 127KB per read.
Going on an Adventure (into the Kernel) #
I want to find out where the buffer size for reading memory mapped pages is specified in the kernel.
For normal reads, which simply provide a buffer and a length argument, tracing is easy with bpftrace.
However, there is no syscall when reading from a memory region.
So we need to resort to reading the kernel source; then we will be able to find the correct function to trace with bpftrace.
So what syscall are we after, mmap?
In include/linux/syscalls.h there are some functions defined related to mmap:
- sys_mmap_pgoff (this one is obsolete)
- sys_old_mmap (this one is old)
- ksys_mmap_pgoff: that's the one!
ksys_mmap_pgoff() is the one to follow; click here if you want to follow along the call chain. We will ignore hugetlb and anonymous mappings.
The following is the happy path for creating a mapping:
So now that a mapping is created, how is any data loaded? Until now, we only saw the creation process.
Following the populate calls from vm_mmap_pgoff, we eventually reach the paging logic.
A comment in handle_mm_fault refers to filemap_fault and __folio_lock_or_retry, so we better check that one out.
However, filemap_fault can be overridden per filesystem.
Luckily, that's not the case for ext4, as it points to the "original" filemap_fault:
static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
};
Again, we follow the call graph, this time in filemap_fault.
To keep the complexity of the graph under control, I am a bit sloppy with it in two places:
- return from filemap: it does all the other things after the dashed arrow first, but the most important part is that it returns immediately after triggering a readahead
- magic: obviously there's a bit more involved than a magic box. We will investigate that in a second though.
The ondemand_readahead here gets the information about how much to read:
static void ondemand_readahead(struct readahead_control *ractl,
		struct folio *folio, unsigned long req_size)
There's req_size but also ractl->ra->ra_pages that are interesting. Let's find out what these are set to, ra_pages first, since req_size is not constant.
static void blk_apply_bdi_limits(struct backing_dev_info *bdi,
struct queue_limits *lim)
{
/*
* For read-ahead of large files to be effective, we need to read ahead
* at least twice the optimal I/O size.
*/
bdi->ra_pages = max(lim->io_opt * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
bdi->io_pages = lim->max_sectors >> PAGE_SECTORS_SHIFT;
}
Mhm, looks like io_opt is device dependent; according to cat /sys/block/dm-0/queue/optimal_io_size it's 0 on my machine (?).
VM_READAHEAD_PAGES is (SZ_128K / PAGE_SIZE), which makes 32.
Now this explains why we saw twice as many IOPS for the same bandwidth compared to the 256K runs: mmap readahead works with 128K by default.
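If you just want to check the effective readahead value on a running system instead of reading kernel code, the same number is exposed per backing device in sysfs; a minimal sketch (assuming the usual /sys/class/bdi layout) that prints it:

```rust
use std::fs;

// Print read_ahead_kb for every backing device the kernel exposes.
// The default of 128 KB corresponds to the 32 pages derived above.
fn main() -> std::io::Result<()> {
    for entry in fs::read_dir("/sys/class/bdi")? {
        let path = entry?.path().join("read_ahead_kb");
        if let Ok(val) = fs::read_to_string(&path) {
            println!("{}: {} KB", path.display(), val.trim());
        }
    }
    Ok(())
}
```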
The ondemand_readahead function is only invoked if the kernel assumes that it can gain anything from doing a readahead. If we tell it that there is no point in doing so, it is not called.
The do_(a)sync_mmap_readahead functions take care of that by checking for VM_RAND_READ.
As we will see later, we are able to influence this.
Further, a set VM_SEQ_READ forces a readahead in do_sync_mmap_readahead.
There is also a mechanism in place for the kernel to detect whether the readahead helps, and it will stop doing automatic readaheads if no good comes from it.
Use bpftrace to trace kernel activity #
The following snippet traces the filemap_fault kernel function and prints a few interesting arguments when our test file (inode=107086783) is hit. I know there is a way to have this matching done from the outside, but this works well enough.
#!/bin/env bpftrace
#include <linux/fs.h>
#include <linux/mm.h>
kfunc:vmlinux:filemap_fault {
$file = args->vmf->vma->vm_file;
$inode = $file->f_inode->i_ino;
if($inode == 107086783){
$vm_flags = args->vmf->vma->vm_flags;
$fault_flags = args->vmf->flags;
$ra = $file->f_ra;
$pages = $ra.ra_pages;
printf("filemap_fault %x %x %d\n", $vm_flags, $fault_flags, $pages);
}
}
Time to compare the output:
| tool | access | vm_flags | fault_flags | num_pages |
|---|---|---|---|---|
| reader | sequential | 80000D1 | 1254 | 32 |
| reader | random | 80000D1 | 1254 | 32 |
| fio | seq | 80080D1 | 1254, 1a54 | 64 |
| fio | rand | 80100D1 | 1254, 1a54 | 32 |
Let's start with the fault_flags:
0x1254 = 0b0001001001010100
FAULT_FLAG_ALLOW_RETRY
FAULT_FLAG_KILLABLE
FAULT_FLAG_USER
FAULT_FLAG_INTERRUPTIBLE
FAULT_FLAG_VMA_LOCK
0x1a54 = 0b0001101001010100
- same as above, plus FAULT_FLAG_ORIG_PTE_VALID. Here, PTE stands for Page Table Entry. This does not really help, tbh.
The vm_flags are a bit more interesting though:
fio rand:
0x080100D1
0x00000001 VM_READ
0x00000010 VM_MAYREAD
0x00000040 VM_MAYEXEC
0x00000080 VM_MAYSHARE
0x00010000 VM_RAND_READ <---
0x08000000 VM_SOFTDIRTY
fio seq:
0x080080D1
0x00000001 VM_READ
0x00000010 VM_MAYREAD
0x00000040 VM_MAYEXEC
0x00000080 VM_MAYSHARE
0x00008000 VM_SEQ_READ <---
0x08000000 VM_SOFTDIRTY
reader seq = reader rand:
0x080000D1
0x00000001 VM_READ
0x00000010 VM_MAYREAD
0x00000040 VM_MAYEXEC
0x00000080 VM_MAYSHARE
0x08000000 VM_SOFTDIRTY
In contrast to my code, when fio causes the pages to be loaded, it magically has the correct read type set!
This is unfair…
Let’s fix this.
Linux has a madvise syscall; searching the fio sources for it, we can see that fio uses exactly that. Let's see if we can do that too.
mmap.advise(Advice::Sequential).unwrap();
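The line above covers the sequential reader; for the random workload the counterpart would presumably be Advice::Random. A minimal sketch of both, assuming the Mmap type comes from the memmap2 crate:

```rust
use memmap2::{Advice, Mmap};

// Hint the expected access pattern to the kernel (madvise under the hood).
fn advise_for_workload(mmap: &Mmap, sequential: bool) -> std::io::Result<()> {
    if sequential {
        mmap.advise(Advice::Sequential) // MADV_SEQUENTIAL sets VM_SEQ_READ, encourages readahead
    } else {
        mmap.advise(Advice::Random) // MADV_RANDOM sets VM_RAND_READ, disables readahead
    }
}
```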
A quick glance with bpftrace confirms that we have the same flags!
Further, the pte flag from above now also appears. I have no clue what that means, but I am sure it's a good thing…
Using the addition, we can now produce behavior similar to fio (the new code is called smart_reader):
==> out_trace_all_rand_fio.log <==
Attaching 6 probes...
filemap_fault fio 107086783 134283473 pages: 32
filemap_fault ret fio 107086783 ret: 1028
filemap_fault fio 107086783 134283473 pages: 32
filemap_fault ret fio 107086783 ret: 1028
filemap_fault fio 107086783 134283473 pages: 32
filemap_fault ret fio 107086783 ret: 1028
filemap_fault fio 107086783 134283473 pages: 32
filemap_fault ret fio 107086783 ret: 1028
filemap_fault fio 107086783 134283473 pages: 32
==> out_trace_all_rand_reader.log <==
Attaching 6 probes...
filemap_fault reader 107086783 134217937 pages: 32
page_cache_ra_order 17413618531 reader 0 32 0 107086783
filemap_fault ret reader 107086783 ret: 1028
filemap_fault reader 107086783 134217937 pages: 32
page_cache_ra_order 17422315767 reader 0 32 1393103 107086783
filemap_fault ret reader 107086783 ret: 1028
filemap_fault reader 107086783 134217937 pages: 32
page_cache_ra_order 17422646179 reader 0 32 164783 107086783
filemap_fault ret reader 107086783 ret: 1028
==> out_trace_all_rand_smart_reader.log <==
Attaching 6 probes...
filemap_fault smart_reader 107086783 134283473 pages: 32
filemap_fault ret smart_reader 107086783 ret: 1028
filemap_fault smart_reader 107086783 134283473 pages: 32
filemap_fault ret smart_reader 107086783 ret: 1028
filemap_fault smart_reader 107086783 134283473 pages: 32
filemap_fault ret smart_reader 107086783 ret: 1028
filemap_fault smart_reader 107086783 134283473 pages: 32
filemap_fault ret smart_reader 107086783 ret: 1028
filemap_fault smart_reader 107086783 134283473 pages: 32
We can see that it no longer does readaheads!
On the sequential side, we were not that successful: it still reads only 32 pages.
I am not entirely sure how fio manages to read 64 pages, but I can live with that for now.
We do have another trick up our sleeves for sequential reads: we can pre-populate the whole buffer with MAP_POPULATE. However, this article is already too long.
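For completeness, a minimal sketch of what that could look like, assuming the memmap2 crate, whose populate() option requests MAP_POPULATE so the whole mapping is faulted in up front:

```rust
use memmap2::{Mmap, MmapOptions};
use std::fs::File;

fn map_populated(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // populate() asks the kernel to pre-fault the whole mapping, so the first
    // sequential pass no longer pays a page fault per 4K/128K chunk.
    unsafe { MmapOptions::new().populate().map(&file) }
}
```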
The actual investigation #
Our software product has an index that is used for uniqueness checks on IDs, amongst other things. These IDs are user-controlled data and in most cases random UUIDs. The index is a glorified HashMap stored in a read-only file that is accessed via a memory mapping. We strongly recommend making sure that the index fits into memory, to avoid resource congestion on lookups.
We don't load the index into memory initially but only when parts of it are accessed. (Spoiler: this will become important later.) Since we hash random input data, the access patterns on the memory backing the index are also random, with small portions being sequential when a hash bucket is traversed.
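Purely to illustrate that access pattern (this is not the actual on-disk format), a lookup in such a mapped hash index could look roughly like this: the hash selects a more or less random bucket, which is then scanned sequentially:

```rust
use memmap2::Mmap;

const BUCKET_SIZE: usize = 4096; // hypothetical fixed-size buckets, one page each
const ID_LEN: usize = 16;        // hypothetical 128-bit IDs (UUIDs)

// Hash picks a random page; the bucket on that page is scanned sequentially.
fn contains(index: &Mmap, num_buckets: usize, id: &[u8; ID_LEN], hash: u64) -> bool {
    let bucket_start = (hash as usize % num_buckets) * BUCKET_SIZE;
    index[bucket_start..bucket_start + BUCKET_SIZE]
        .chunks_exact(ID_LEN)
        .any(|entry| entry == id.as_slice())
}
```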
One of our customers contacted support with the observation that application startup was slow. Like, really slow. 15 minutes slow.
We quickly identified the above-mentioned index to be the culprit. Now the question arises: Why is loading the index slow on this customer’s environment, and how can we make it fast?
Let’s summarize what we deal with:
| metric | value |
|---|---|
| Number of files | 128 |
| Total size | 28GB |
| Loading time | 15min |
| Max read bandwidth | 200MB/sec |
| Max IOPS | 5K |
| Observed bandwidth (actual workload) | 32MB/sec |
| Observed bandwidth (synthetic) | 20MB/sec |
Benchmarks on the customer hardware #
Since our database uses mixed workloads, let's use the synthetic fio randread benchmark from above.
This gives us 20MB/sec, as we would expect with the knowledge from above (no readahead, 4K page size).
A simple run with the read benchmark from fio confirms that the readahead works as expected: we get 200MB/s with ~800 IOPS.
Given that fio manages to convince the kernel to read 64 pages instead of the default 32 per readahead, we would expect 781 IOPS, so the numbers match up.
Although our own reader can only read 32 pages at a time, the resulting 1.6K IOPS are still quite a bit below the 5K IOPS limit, so we should be good.
Replicating the cloud environment with cgroups #
Earlier, I described the cloud environment of the customer that triggered this whole endeavor. Since I can’t run random benchmarks on their machines, let’s try to replicate it using cgroups.
Earlier we used systemd-run to constrain memory.
According to the docs, we can also constrain IOPS and bandwidth.
For this to work, you need to [set up delegation](https://wiki.archlinux.org/title/Cgroups#User_delegation).
Then, we can spawn a resource-constrained session as before:
systemd-run --scope -p IOReadBandwidthMax="/ 200M" -p IOReadIOPSMax="/ 5000" --user zsh
If we run in this constrained environment, we get the expected values. Time to fix the problem.
The trivial fix #
If you followed along until here, you probably know the easiest fix already (a rough sketch follows after the list):
- Use madvise to tell the kernel we will read the file sequentially
- Read the whole file into memory sequentially
- Read files in parallel to overcome the synchronous limits when running in environments with more bandwidth and IOPS
- Use madvise again with a neutral (default) advice just to be sure
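Here is that rough sketch of the steps, assuming the memmap2 and rayon crates (the real fix lives in our Java code and looks different):

```rust
use memmap2::{Advice, Mmap, MmapOptions};
use rayon::prelude::*;
use std::fs::File;

const PAGE_SIZE: usize = 4096;

// Warm up one index file: advise sequential access, touch every page once,
// then reset the advice to the default.
fn warm_up(path: &str) -> std::io::Result<()> {
    let file = File::open(path)?;
    let mmap: Mmap = unsafe { MmapOptions::new().map(&file)? };
    mmap.advise(Advice::Sequential)?;
    let mut checksum = 0u64;
    for offset in (0..mmap.len()).step_by(PAGE_SIZE) {
        checksum += mmap[offset] as u64; // touching one byte faults in the whole page
    }
    std::hint::black_box(checksum); // keep the loop from being optimized away
    mmap.advise(Advice::Normal)?;
    Ok(())
}

// Warm up all index files in parallel to use the available bandwidth and IOPS.
fn warm_up_all(paths: &[String]) {
    paths.par_iter().for_each(|p| {
        if let Err(e) = warm_up(p) {
            eprintln!("warm-up of {p} failed: {e}");
        }
    });
}
```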
However, we run on Java, so forget about madvise, open options, etc… It turns out that a simple sequential read is already enough. With about 10 lines of code, the loading time was reduced from 15 minutes to about two minutes. This is still quite slow, but frankly, a bandwidth of only 200MB/s is not much to work with in the first place.
Conclusion #
What did we learn on this journey?
- mmap can be used to work with data in virtual memory that is larger than physical memory
- With cgroups, we can simulate resource-constrained environments, e.g. by limiting CPU, memory, or IOPS
- Reading from random places in memory is much slower than reading continuous regions. This is true for memory mapped files as well as for plain old heap space.
- When using mmap:
  - The kernel will try to read ahead so that reads are larger than the default page size of 4K
  - If it detects that this is not helping, it will stop at some point
  - We can use madvise to enforce the behavior we want
- mmap does provide many convenience features, but for best performance, either experimentation or research is needed.
Closing remarks #
Thanks for reading!
I can now enjoy catharsis from closing a hundred browser tabs.
Cover photo by Nik Shuliahin on Unsplash