How does the Linux kernel load pages into memory when you request something to be mapped using mmap?
Recently I had to investigate why our database product was slow to start in a customer’s environment.
The component acting up was using mmap to read parts of an index into memory.
While doing some research on it, I got nerd sniped.
mmap Basics #
I will quickly summarize the basics of mmap and provide an example for readers not familiar with mmap.
If you want to skip this, head to mmap behavior
What is mmap? #
The docs describe it as follows:
mmap() creates a new mapping in the virtual address space of the calling process.
There are many ways this can be used and many things that can be mapped.
The one relevant for this article is the following:
You can use an mmap call to map the contents of a file into the virtual address space.
Let's walk through an example to see what this looks like in code:
- First, we need a file descriptor; we can get one by opening a file
int fd = open("test.bin", O_RDONLY);
if(fd==-1){/*...*/}
- Then we call mmap, passing in information about where and how the mapping should be created.
long map_len = 100;
long read_pos = 50;
char *map = mmap(0, map_len, PROT_READ, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {/*...*/}
- Using the mapping, we can now read any position within the memory region.
printf("read value 0x%x\n", map[read_pos]);
- When done, unmap the mapping and close the file (or let the OS do that for you…)
Nothing too spectacular so far, so why shouldn't we just use a regular read call to read the data into a buffer?
Mmap allows us to do some interesting things; one of them is having a mapping that is larger than main memory (RAM).
Consider the following example.
int fd = open("test.bin", O_RDONLY);
if(fd==-1){/*...*/}
long map_len = 300L*1024*1024*1024; // 300GB
long read_pos = 100L*1024*1024*1024; // offset of ~100GB
char *map = mmap(0, map_len, PROT_READ, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {/*...*/}
printf("read value 0x%x\n", map[read_pos]);
We can read any byte within that 300GB memory chunk (here we read one at an offset of ~100GB) and don't need to care about what is loaded where: the OS takes care of loading the required data from disk into memory whenever we access it. This is awesome! Mmap is even more awesome than that: the OS also takes care of unloading memory again, so we won't run out of RAM.
However, as with all tech, there are some caveats.
Alternative options #
If we were to replicate the same behavior without using mmap, we would have a few options:
- Install 300GB of RAM. Easy fix. That would probably be enough to run two instances of slack at the same time!
- Implement custom paging. This would require us to write a boatload of code to decide which pages to load, free up pages when we no longer need them, prevent access to unloaded pages, and so on. And this is only the read path; if we were writing too, this would get an order of magnitude more complex. Most databases have to do this, though.
- Don’t use a backing array but use the underlying file APIs directly.
Then a
seek
within the file and a call toread
would do the trick. This works great when reading parts of the file sequentially. However, when you read a portion of the file twice, you either end up building option 2 or run into severe performance issues.
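To make option 3 concrete, here is a minimal sketch using Rust's standard file APIs (Rust, because that is what the rest of the article switches to; a C version using lseek and read would look equivalent):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Read `buf.len()` bytes starting at `offset` by seeking and reading explicitly.
// Every access costs at least one syscall, and any caching beyond the OS page
// cache is our own responsibility.
fn read_at_offset(path: &str, offset: u64, buf: &mut [u8]) -> std::io::Result<()> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    file.read_exact(buf)
}
```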
There are a lot of things the OS takes care of in this regard; we will discover them as we follow the investigation. Each of the options (including using mmap) has drawbacks that range from obvious to subtle and from minor annoyance to absolute showstopper.
Extended example #
We will have two example workloads later on. One of them reads from the mapped buffer sequentially and the other reads in a pseudo-random manner. Since I feel more comfortable using Rust than C, let's rewrite it in Rust.
// Imports assumed for this snippet: the memmap2 crate for the mapping and anyhow for error context.
use anyhow::{Context, Result};
use memmap2::{Mmap, MmapOptions};
use std::fs::OpenOptions;

fn do_mmap() -> Result<()> {
    let file = OpenOptions::new()
        .read(true)
        .open("10g.bin")
        .context("Could not open file")?;
    let mmap = unsafe {
        MmapOptions::new()
            .len(BUFFER_SIZE)
            .map(&file)
            .context("Could not map file")?
    };
    read_sequential(&mmap);
    Ok(())
}
Now we can define the first read workload (sequential):
fn read_sequential(mmap: &Mmap) {
let mut sum = 0u64;
for i in 0..BUFFER_SIZE {
sum += mmap[i] as u64;
}
println!("Sum is {sum}");
}
We will use a file size of 10GB (BUFFER_SIZE = 10 * 1024 * 1024 * 1024).
This function reads the whole buffer once, sequentially. To prevent the compiler from optimizing away the whole thing, it prints the sum. There are other ways to achieve this, but this is by far the simplest one.
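One of those other ways (not used in the measurements here) is std::hint::black_box, which marks a value as observed so the optimizer cannot drop the loop that produced it:

```rust
// Alternative to printing the sum: tell the optimizer the value is used.
std::hint::black_box(sum);
```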
We also need to support unpredictable reads. While they are at best “pseudo-random”, I did make sure that they are spread across the file nicely. This resembles the original workload more closely as you will see in a few sections.
fn read_random(mmap: &Mmap) {
let prime = 5706217411;
let mut sum = 0u64;
for i in 0..BUFFER_SIZE {
let pos = (i*prime)%BUFFER_SIZE;
sum += mmap[pos] as u64;
}
println!("Sum is {sum}");
}
Here 5706217411 is a prime number that does not divide BUFFER_SIZE, hence (i*prime)%BUFFER_SIZE won't hit the same byte twice.
Since we don't need true randomness but only something that can't be predicted by the kernel, this is good enough.
Understanding mmap behavior #
Now that we have a test program (we will use fio later too), let’s investigate what is going on in the system. We will use a whole host of tools for this, so if you want to follow along, get your hands on them now:
- vmtouch
- sysstat (for iostat and sar)
- bpftrace
Let’s start from the most basic investigation: how much data is kept in buff/cache
before we start anything?
➜ free -h
total used free shared buff/cache available
Mem: 125Gi 40Gi 83Gi 2.5Gi 4.9Gi 84Gi
Swap: 0B 0B 0B
^
|
here ----------------------------------------------------------|
Nearly 5GB are already in buff/cache before we even run our code!
➜ ./target/release/reader
...
[2024-07-27T16:24:07Z INFO reader] mmap round SEQUENTIAL took 5.976189549s
...
[2024-07-27T16:24:10Z INFO reader] mmap round SEQUENTIAL took 2.691732412s
...
[2024-07-27T16:24:13Z INFO reader] mmap round SEQUENTIAL took 2.658959719s
One of these is not like the others: the first iteration took longer. This is because the data initially had to be loaded from disk into the page cache. We will make the difference more severe later on. Further, reading 10GB in 2.7 seconds is already quite fast.
And what about our free output now?
➜ free -h
total used free shared buff/cache available
Mem: 125Gi 40Gi 73Gi 2.5Gi 14Gi 85Gi
Swap: 0B 0B 0B
^
|
here ----------------------------------------------------------|
It's 14GB! That's ~10GB more than earlier. This is exactly what we would expect.
So how do we get rid of it again?
Evicting memory #
You can instruct the kernel to drop the relevant caches (here, the page cache) by writing 1 to the correct file in procfs:
sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'
A quick check with free confirms that we were successful.
➜ free -h
total used free shared buff/cache available
Mem: 125Gi 40Gi 88Gi 2.5Gi 4.4Gi 84Gi
Swap: 0B 0B 0B
^
|
here ----------------------------------------------------------|
Don’t run this on a production machine without thinking, though.
While we are at it, let's briefly discuss whether we should manually evict memory (a discussion I had with a customer earlier).
This is memory owned by the kernel, managed by the kernel, reclaimed by the kernel.
When applications need more "actual" memory, the kernel evicts buffer/cache pages from physical memory first.
So your application will not trigger out-of-memory kills just because 80% of memory is used by buffer/cache.
It would usually be wrong to assume we know better than the kernel when and how to reclaim memory.
So if you want to limit buffer/cache usage of the whole system, think again.
There are, however, some scenarios where it can make sense, e.g. bursty workloads that suddenly and quickly need to allocate a boatload of memory.
Another scenario that we will completely ignore in this article is writing to mapped memory, causing dirty pages. For that, check on the vm_dirty_* settings.
However, of course you can evict memory (it’s Linux after all, it does not judge us for our bad decisions).
With free we can see the overall system summary. But can we see that for our test file only? Maybe even which parts are mapped? This is where vmtouch comes into play.
Observing virtual memory with vmtouch #
With vmtouch, we have a small but powerful tool to see what parts of a file are loaded in memory. Again, let's have a look at the 10GB test file 10g.bin before the test code ran.
➜ vmtouch -v 10g.bin
10g.bin
[ ] 0/2621440
Files: 1
Directories: 0
Resident Pages: 0/2621440 0/10G 0%
Elapsed: 0.15446 seconds
As the large empty space in the brackets indicates, nothing is loaded in memory yet. Now we let the test program run for 1 second (which should load about 1/5.9th of the file) and observe the file again.
➜ timeout 1s ./target/release/reader
[2024-07-27T16:48:23Z INFO reader] Testee running with pid 3402768
➜ vmtouch -v 10g.bin
10g.bin
[OOOOOOOOOOo ] 464736/2621440
Files: 1
Directories: 0
Resident Pages: 464736/2621440 1G/10G 17.7%
Elapsed: 0.21727 seconds
Cool, vmtouch shows us exactly which pages are loaded.
There are far more pages (2621440) than "O"s (60), so vmtouch has to map several pages to a single character. If any page of such a bucket is mapped, you will see an o; if all pages of the bucket are mapped, you will see an O.
With vmtouch, we can do another cool thing: we can kick a file out of memory.
➜ vmtouch -e 10g.bin
Files: 1
Directories: 0
Evicted Pages: 2621440 (10G)
Elapsed: 0.26097 seconds
That way we don’t need to drop all the pages of the whole OS. With a simple loop, we can now use the tool to see loaded memory live as it fills up:
Note that the bar stays at 100% once the code starts looping over the already-loaded data; the gif ends after the first full pass and then loops on its own.
What does the loading behavior look like for a random/spread read pattern?
To actually see something, I had to slow down reading significantly; in this case a 100msec pause was introduced.
After 6 seconds, the whole bar is full of o.
There is kind of a problem though: we see it filled with o and not O. Further, not even 1% of the pages have been loaded!
To investigate this, we first need to have a look at an uncomfortable truth: random reads, even from something already loaded in memory, are waaaay slower than sequential reads.
As we will see in the next section, the difference is striking.
We need to fix this first before we can investigate mmap further.
Interlude: slow reads from memory #
Reading the same number of bytes already loaded in virtual memory, but in a pseudo-randomly spaced manner, takes 577 seconds compared to the 1.9 seconds we saw earlier.
| time (s) | virtual memory | loading from disk |
|---|---|---|
| sequential | 1.9 | 11 |
| random | 577 | ∞ |
Random reads from memory were more than 200 times slower than sequential reads. Keep in mind, the virtual memory was already loaded, so this excludes reads from disk!
I am quite confident this is due to cache misses, since each read is on a separate page, 5706217411 bytes apart.
Let's verify this assumption before blindly changing the code.
First, we replace our memory mapped buffer with a huge 10GB array on the heap: let mut map = vec![0u8; BUFFER_SIZE]; This rules out any unexpected side effects from using a mapped buffer.
Then we measure again to see if the problem persists.
And yes, it does: 1.3s for the sequential read and 81s for the random one.
| time (s) | heap | virtual memory | loading from disk |
|---|---|---|---|
| sequential | 1.3 | 1.9 | 11 |
| random | 81 | 565 | ∞ |
While reading from the heap is faster than from virtual memory, we still see the same pattern: the spread/random reads are a lot slower. So let's investigate whether the assumption of cache misses holds.
Using perf to investigate cache misses #
Let's use some science and come up with a null hypothesis H0: cache misses do not differ between the two read patterns.
If we can falsify this one, we can investigate H1: cache misses are more frequent for the random access pattern.
With perf, this can be verified.
For this use case, perf provides the ready-made mem profile.
We run the following once for each access pattern.
perf mem record -a ./target/release/mem_only_reader
perf mem report
perf mem -t load report --sort=mem
Note that perf does not give us exact numbers of how often something occurred, but rather how often it observed (sampled) something happening.
Further, the runtime (and work) of the two variants differed significantly, mainly because I did not want to wait half an hour for the random variant to finish.
The following output is for sequential operation.
Overhead Samples Memory access
27.10% 885 RAM hit
24.29% 233983 N/A
18.61% 179296 L1 hit
18.15% 12924 L2 hit
11.73% 1140 core, same node Any cache hit
0.11% 1012 LFB/MAB hit
0.00% 38 Uncached hit
And this one for random access.
Overhead Samples Memory access
67.01% 32644 RAM hit
18.22% 84602 core, same node Any cache hit
6.38% 1206079 N/A
5.14% 69827 L2 hit
3.24% 612569 L1 hit
0.01% 1702 LFB/MAB hit
0.00% 105 Uncached hit
What to make of this? Well, H0 can be rejected for sure: there is a difference in cache behavior. And we can accept H1, since there are clearly more RAM accesses compared to L1 and L2 hits for the random access pattern.
How do we fix this? We can’t trick physics after all.
A simple cheat #
We can "fix" this problem by not doing fully spread/random reads. We could decide to access a random page but then read within that page (or maybe the subsequent one) sequentially. This of course is not the same as before, but we do not want to measure RAM speed, we want to investigate mmap behavior. See the "Note" below for why this is still fair.
Enough with science, let's engineer this: how much do we need to read per page to achieve similar results? Note that we won't be able to match the plain sequential version exactly, since we need to do a bit more work. This also makes it harder for the compiler to optimize our code.
With the following, we try to find out the optimal size to read within (and slightly beyond) a page.
// Assumes `use std::time::Instant;` is in scope.
fn read_random(map: &[u8]) {
let prime = 5706217411;
for i in 0..14 {
let start_time = Instant::now();
let mut sum = 0u64;
let block_size = 1<<i;
for i in 0..BUFFER_SIZE / block_size {
for k in 0..block_size {
let pos = (i * prime + k) % BUFFER_SIZE;
sum += map[pos] as u64;
}
}
let dur = start_time.elapsed();
println!("block_size: {block_size}, dur: {dur:?} sum: {sum} ");
}
}
Nah, let's use a log scale for x.
This is better.
We can see that around 1024 bytes, the returns are diminishing.
This means we can read that many bytes of a random page, achieve better performance, and still stay within one memory page!
fn read_random(map: &[u8]) {
let prime = 5706217411;
let mut sum = 0u64;
let block_size = 1024;
for i in 0..BUFFER_SIZE / block_size {
for k in 0..block_size {
let pos = (k + i * prime) % (BUFFER_SIZE);
sum += map[pos] as u64;
}
}
println!("sum: {sum} ");
}
Now we still have a bit of a problem: in one variant we do 10 billion modulo operations and in the other one we don’t. By cheating a tiny bit, we can get the runtime from 6.3 seconds down to 3.2 seconds:
fn read_random(map: &[u8]) {
let prime = 5706217411;
let mut sum = 0u64;
let block_size = 1024;
for i in 0..BUFFER_SIZE / block_size {
let start = (i * prime) % (BUFFER_SIZE-block_size);
for k in 0..block_size {
let pos = start+k;
sum += map[pos] as u64;
}
}
println!("sum: {sum} ");
}
This way we don’t read the last 1024 bytes of a 10GB array (duh) but trim the 10 billion modulo operations down to 10 million.
Checking one more time with perf shows that we were successful:
Overhead Samples Memory access
28.24% 534468 N/A
28.08% 1219 RAM hit
16.79% 2450 core, same node Any cache hit
15.87% 300309 L1 hit
10.98% 8218 L2 hit
0.04% 720 LFB/MAB hit
0.00% 48 Uncached hit
With the new “random” reads, we can finally continue chasing after the mmap stuff.
Using cgroups to restrict memory #
Back to the actual topic of the article: memory mapped files.
Until now, the program had unrestricted memory and IO. This changes now, since our customer's cloud environment has fewer resources than a developer machine.
We will start by restricting memory first (mostly since this gives us a cool gif).
Using systemd-run, we can get a user slice restricted by a custom ad-hoc cgroup.
systemd-run --scope -p MemoryMax=1G --user zsh
In the new shell that just spawned, we can run the test
➜ ./target/release/mem_only_reader
[2024-07-28T07:34:51Z INFO mem_only_reader] Testee running with pid 100585
[1] 100585 killed ./target/release/mem_only_reader
We got killed? Oh right, that is the code that uses the 10GB array on the heap, which obviously won't fit into the 1GB this cgroup allows. Let's run the mmap version, but with the file already loaded in virtual memory.
➜ ./target/release/reader
[2024-07-28T07:39:50Z INFO reader] Testee running with pid 105427
[2024-07-28T07:39:52Z INFO reader] mmap round SEQUENTIAL took 2.06146459s
[2024-07-28T07:39:54Z INFO reader] mmap round SEQUENTIAL took 1.488209825s
...
Awesome, we see the same performance as outside the cgroup!
Now we kick the file out of virtual memory. If the OS does not have the file cached, the process running in the cgroup needs to request it, but has to pay for it with its memory quota.
As you can see, the process never loads the whole file into memory, not even after one full pass. This also means that the speed-up after the first iteration we observed earlier no longer occurs. This is detrimental to performance!
One side note though: if the main cgroup, with its unrestricted power, loads the whole file into memory, the cgroup-ed process does benefit from the pages in the page cache and runs with the expected performance (1.4 seconds). It's only the cgroup-ed process itself that is not allowed to load them into memory.
As you would expect, being constrained by a cgroup is bad news for the random access version… Remember, this has the same effect as having a system with too little memory (I could instead pull out some memory sticks or run with a 300GB file, but a cgroup is way more convenient). Loading random pages takes a lot longer than loading pages sequentially, even when reading the whole page.
| limits | file | sequential | random |
|---|---|---|---|
| none | heap | 1.4 s | 3.6 s |
| none | resident | 1.5 s | 3.9 s |
| none | unloaded | 11.1 s | 28.1 s |
| constrained | heap | killed | killed |
| constrained | resident | 1.5 s | 4.0 s |
| constrained | unloaded | 15.5 s | 1012.8 s |
What can we learn from this so far?
- Loading data from disk is expensive. Loading sequentially is roughly three times faster than doing it in random order if we don't need to evict pages due to memory constraints.
- Sequential reads are faster than random reads, for memory mapped data as well as for data that lives on the heap.
- If you don't have enough memory to keep the whole mapped file in memory, sequential reads are slower but still feasible. Whether using mmap here makes sense compared to a different IO approach is a different story, though. Loading random pages with mmap in a resource-constrained environment is about 67 times slower than reading them sequentially and about 36 times slower than in an unconstrained environment.
- There seems to be something going on in the background (we will demystify this later): sequential reads cause ~650MB/s of disk IO for 15.5 seconds. That's 10GB (the size of the mapping we are reading!). However, random reads cause 321MB/s for 1012 seconds. That's 328GB! About 32 times as much as we would need to read (spoiler: this will be important later again).
We will use cgroups later to restrict IO, but for now we need to investigate another thing first. Reading a 10GB file in 11 seconds is still kind of slow. Since loading pages requested through a memory mapping is synchronous, we could probably get more performance using multiple threads. Let's use fio to test that.
Using fio to compare mmap and io_uring #
With fio, we can compare IO performance of different modes.
Let's start by replicating our current setup.
fio --filename=test.bin --size=10G --rw=read --ioengine=mmap --numjobs=1 --group_reporting --name=iops-test-job
This does:
- Use a file instead of a whole device: --filename=test.bin of size --size=10G
- Read sequentially: --rw=read
- Using mmap: --ioengine=mmap
- In one thread: --numjobs=1
- and a bunch of boring params
The output is quite verbose, but the only line we are interested in for now is the following:
read: IOPS=242k, BW=943MiB/s (989MB/s)(10.0GiB/10854msec)
This means fio read the 10GB in 10.8 sec, just as we did!
But can we observe these 989MB/s fio claims to read from the outside? (This will be important in a second.)
We can use iostat for this:
iostat -d 1 -h
When fio is running, we can see about 1GB/sec being read.
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
4140.00 1.0G 0.0k 0.0k 1.0G 0.0k 0.0k dm-0
8280.00 1.0G 0.0k 0.0k 1.0G 0.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
In this case dm-0 is a device-mapper device on top of nvme0n1, hence the same reads show up on both devices.
What happens if we increase the number of threads to get around the synchronous limitations mentioned earlier?
fio --filename=test.bin --size=10G --rw=read --ioengine=mmap --numjobs=24 --group_reporting --name=iops-test-job
...
read: IOPS=3993k, BW=15.2GiB/s (16.4GB/s)(240GiB/15756msec)
Wow! …and what?
Fio claims a higher read bandwidth of 16.4GB/s but takes even longer than before: 15756msec?
There's something wrong…
Since the memory mapped buffer can be read by all threads, the first one to touch a page causes it to be loaded into memory, and the other threads then benefit from this. So most of the reads happen from memory.
This means we can't rely on fio for reporting the actual bandwidth.
Now the iostat thing comes in handy: using it, we can see that the actual bandwidth was only about ~740MB/s.
This aligns with the x1.5 increase in total time.
But the device should be able to handle more IO.
So let's find out first what the maximum throughput is, using io_uring as the engine.
fio --filename=test.bin --direct=1 --size=10G --rw=read --bs=256k --ioengine=io_uring --iodepth=64 --numjobs=24 --group_reporting --name=iops-test-job
In addition to the previous arguments, we make some changes:
- We use io_uring as the IO engine, an asynchronous IO interface for somewhat modern Linux kernels (>5.1)
- We use direct IO with --direct=1 to circumvent the OS's buffers
- We use a block size of 256KB (--bs=256k), which is outright cheating, since we instruct each read to fetch a whole chunk of data in one go
- By specifying --iodepth=64 we allow 64 asynchronous reads to be in flight at any time
And apparently this worked, bumping our bandwidth up to 4.8GB/s:
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
19684.00 4.8G 0.0k 0.0k 4.8G 0.0k 0.0k dm-0
39375.00 4.8G 0.0k 0.0k 4.8G 0.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
This is async IO, so one IO thread should be enough; let's see what a --numjobs=1 yields:
read: IOPS=20.4k, BW=5100MiB/s (5347MB/s)(10.0GiB/2008msec)
So with >5GB/s for io_uring, I guess we need to ramp up our mmap game.
What is different so far between using async IO with io_uring and our mmap approach:
- mmap is sequential, waiting for each page read (we ignore readahead for now but get back to it later on) to complete before reading the next page
- Reads with mmap have a fixed page size that we cannot really influence
- We can't use fio to increase threading with mmap, otherwise the reads will overlap and we will read from memory
Fio randread with mmap #
So far we tried sequential reads with fio; maybe we'll have better luck with random reads?
fio --filename=test.bin --size=10G --rw=randread --ioengine=mmap --numjobs=1 --group_reporting --name=iops-test-job
...
read: IOPS=11.5k, BW=45.0MiB/s (47.2MB/s)(10.0GiB/227388msec)
Oof, nope, that’s not what we want. Maybe for multiple random-read workers?
fio --filename=test.bin --size=10G --rw=randread --ioengine=mmap --numjobs=24 --group_reporting --name=iops-test-job
...
read: IOPS=3068k, BW=11.7GiB/s (12.6GB/s)(240GiB/20506msec)
Argh, fio "lies" to us again about that.
However, it manages to load the file in 19628msec, so I won't complain.
We do see one interesting thing in iostat though:
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
205934.00 804.4M 0.0k 0.0k 804.4M 0.0k 0.0k dm-0
205931.00 804.4M 0.0k 0.0k 804.4M 0.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
Notice the tps being way higher than before? Is the SSD actually serving 205934 IOPS?
Looking at sar, we see similar numbers (the majflt/s column):
pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s pgprom/s pgdem/s
913544.00 208.00 428142.00 228387.00 3933.00 0.00 0.00 0.00 0.00 0.00
According to the manufacturer's page ([neutrality is disputed]), the device should be able to handle 1 million read IOPS.
Hard to believe, but OK, let's try.
With io_uring, a queue depth of 64, and 24 threads, I got 736,236 IOPS.
Mind you, this is with LUKS, an actual filesystem, no tweaking, and not even cooling with liquid nitrogen.
I guess the stated 1M is not that far off.
It’s amazing how much speed you can get out of actual hardware when comparing that with offerings from your average cloud vendor. The insane amount of IOPS will become important later on again.
Multithreaded, nonoverlapping reads #
Since we can't tell fio to read distinct sections of the file in several threads (rather than the whole file, which leads to the nasty caching again), we have to adjust our tool from earlier.
Using rayon, changing to a multi-threaded read becomes as trivial as changing from into_iter to into_par_iter:
// Assumes `use rayon::prelude::*;` and a `num_threads` value (e.g. 24) defined elsewhere.
fn read_sequential(mmap: &Mmap) {
let chunk_size = BUFFER_SIZE/num_threads;
let sum = (0..num_threads)
.into_par_iter()
.map(|t| {
let mut sum = 0u64;
for i in t*chunk_size..(t+1)*chunk_size{
sum += mmap[i] as u64;
}
sum
})
.sum::<u64>();
println!("Sum is {sum}");
}
When running this sequential code, we achieve the same throughput as io_uring: 5GB/s!
Further, we finish loading in 2.0 seconds.
That's even faster than the spread reads from data that was already in memory!
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
40214.00 4.9G 0.0k 0.0k 4.9G 0.0k 0.0k dm-0
40216.00 4.9G 0.0k 0.0k 4.9G 0.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
Note however that we used 40,214 IOPS with mmap while io_uring only required 19,684, so roughly twice as many.
Both were capped by throughput (4.9GB/s).
Now what about multi-threaded random reads using mmap? The code is slightly more complex and cuts a few corners but if each thread reads one page less than it should, I am fine with it.
fn read_random(mmap: &Mmap) {
let prime = 33757333;
let block_size = 4096;
let chunk_size = BUFFER_SIZE/num_threads;
let sum = (0..num_threads)
.into_par_iter()
.map(|t| {
let mut sum = 0u64;
for i in t*chunk_size/block_size..(t+1)*chunk_size/ block_size{
let start = (i * prime) % (chunk_size-block_size);
for k in 0..block_size {
let pos = start+k;
sum += mmap[pos] as u64;
}
}
sum
})
.sum::<u64>();
println!("sum: {sum} ");
}
With a runtime of only 2.9 seconds, this is also looking good bandwidth-wise:
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device
38747.00 4.7G 156.0k 0.0k 4.7G 156.0k 0.0k dm-0
38726.00 4.7G 156.0k 0.0k 4.7G 156.0k 0.0k nvme0n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k nvme1n1
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sda
0.00 0.0k 0.0k 0.0k 0.0k 0.0k 0.0k sdb
As we will find out in the next section(s), these random reads may not be random enough though.
Comparing read sizes for random and sequential mmap #
Remember the 200K IOPS from fio earlier? Allow me to put them next to the ones of the non-overlapping reads:
tps kB_read/s
205934.00 804.4M // fio with mmap randread
vs
tps kB_read/s
38747.00 4.7G
If we calculate the average size of these reads, we get 804.4M / 205934 = 3.9KB and 4.7G / 38747 = 127KB per read.
Going on an Adventure (into the Kernel) #
I want to find out where the buffer size for reading memory mapped pages is specified in the kernel.
For normal reads, which simply provide a buffer and a length argument, tracing is easy with bpftrace.
However, there is no syscall when reading from a memory region.
So we need to resort to reading the kernel source; then we will be able to find the correct function to trace with bpftrace.
So what syscall are we after, mmap?
In include/linux/syscalls.h there are some functions defined related to mmap:
- sys_mmap_pgoff (this one is obsolete)
- sys_old_mmap (this one is old)
- ksys_mmap_pgoff: that's the one!
ksys_mmap_pgoff() is the one to follow; click here if you want to follow along the call chain. We will ignore hugetlb and anonymous mappings.
The following is the happy path for creating a mapping:
So now that a mapping is created, how is any data loaded? Until now, we only saw the creation process.
Following the populate calls from vm_mmap_pgoff, we eventually reach the paging logic.
A comment in handle_mm_fault refers to filemap_fault and __folio_lock_or_retry, so we better check that one out.
However, filemap_fault can be overridden per filesystem.
Luckily, that's not the case for ext4, as it points to the "original" filemap_fault:
static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
};
Again, we follow the call graph, this time in filemap_fault.
To keep the complexity of the graph under control, I am a bit sloppy with it in two places:
- return from filemap: it does all the other things after the dashed arrow first, but the most important part is that it returns immediately after triggering a readahead
- magic: obviously there's a bit more involved than a magic box. We will investigate that in a second though.
The ondemand_readahead here gets the information about how much to read:
static void ondemand_readahead(struct readahead_control *ractl,
		struct folio *folio, unsigned long req_size)
There's req_size but also ractl->ra->ra_pages that are interesting. Let's find out what these are set to, ra_pages first, since req_size is not constant.
static void blk_apply_bdi_limits(struct backing_dev_info *bdi,
struct queue_limits *lim)
{
/*
* For read-ahead of large files to be effective, we need to read ahead
* at least twice the optimal I/O size.
*/
bdi->ra_pages = max(lim->io_opt * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
bdi->io_pages = lim->max_sectors >> PAGE_SECTORS_SHIFT;
}
Mhm, looks like io_opt is device dependent; according to cat /sys/block/dm-0/queue/optimal_io_size it's 0 on my machine (?).
VM_READAHEAD_PAGES is (SZ_128K / PAGE_SIZE), which makes 32.
Now this explains why we saw twice as many IOPS for the same bandwidth compared to the 256K runs: mmap readahead works with 128K by default.
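If you just want to check the effective readahead value on a running system instead of reading kernel code, the same number is exposed per backing device in sysfs; a minimal sketch (assuming the usual /sys/class/bdi layout) that prints it:

```rust
use std::fs;

// Print read_ahead_kb for every backing device the kernel exposes.
// The default of 128 KB corresponds to the 32 pages derived above.
fn main() -> std::io::Result<()> {
    for entry in fs::read_dir("/sys/class/bdi")? {
        let path = entry?.path().join("read_ahead_kb");
        if let Ok(val) = fs::read_to_string(&path) {
            println!("{}: {} KB", path.display(), val.trim());
        }
    }
    Ok(())
}
```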
The ondemand_readahead function is only invoked if the kernel assumes that it can gain anything from doing a readahead. If we tell it that there is no point in doing so, it is not called.
The do_(a)sync_mmap_readahead functions take care of that by checking for VM_RAND_READ.
As we will see later, we are able to influence this.
Further, a set VM_SEQ_READ forces a readahead in do_sync_mmap_readahead.
There is also a mechanism in place for the kernel to detect whether the readahead helps, and it will stop doing automatic readaheads if no good comes from it.
Use bpftrace to trace kernel activity #
The following snippet traces the filemap_fault kernel function and prints a few interesting arguments when our test file (inode=107086783) is hit. I know there is a way to have this matching done from the outside, but this works well enough.
#!/bin/env bpftrace
#include <linux/fs.h>
#include <linux/mm.h>
kfunc:vmlinux:filemap_fault {
$file = args->vmf->vma->vm_file;
$inode = $file->f_inode->i_ino;
if($inode == 107086783){
$vm_flags = args->vmf->vma->vm_flags;
$fault_flags = args->vmf->flags;
$ra = $file->f_ra;
$pages = $ra.ra_pages;
printf("filemap_fault %x %x %d\n", $vm_flags, $fault_flags, $pages);
}
}
Time to compare the output:
| tool | access | vm_flags | fault_flags | num_pages |
|---|---|---|---|---|
| reader | sequential | 80000D1 | 1254 | 32 |
| reader | random | 80000D1 | 1254 | 32 |
| fio | seq | 80080D1 | 1254, 1a54 | 64 |
| fio | rand | 80100D1 | 1254, 1a54 | 32 |
Let's start with the fault_flags:
0x1254 = 0b0001001001010100
FAULT_FLAG_ALLOW_RETRY
FAULT_FLAG_KILLABLE
FAULT_FLAG_USER
FAULT_FLAG_INTERRUPTIBLE
FAULT_FLAG_VMA_LOCK
0x1a54 = 0b0001101001010100
- same as above, plus FAULT_FLAG_ORIG_PTE_VALID. Here, PTE stands for Page Table Entry. This does not really help, tbh.
The vm_flags are a bit more interesting though:
fio rand:
0x080100D1
0x00000001 VM_READ
0x00000010 VM_MAYREAD
0x00000040 VM_MAYEXEC
0x00000080 VM_MAYSHARE
0x00010000 VM_RAND_READ <---
0x08000000 VM_SOFTDIRTY
fio seq:
0x080080D1
0x00000001 VM_READ
0x00000010 VM_MAYREAD
0x00000040 VM_MAYEXEC
0x00000080 VM_MAYSHARE
0x00008000 VM_SEQ_READ <---
0x08000000 VM_SOFTDIRTY
reader seq = reader rand:
0x080000D1
0x00000001 VM_READ
0x00000010 VM_MAYREAD
0x00000040 VM_MAYEXEC
0x00000080 VM_MAYSHARE
0x08000000 VM_SOFTDIRTY
In contrast to my code, when fio causes the pages to be loaded, it magically has the correct read type set!
This is unfair…
Let’s fix this.
Linux has a madvise syscall; searching the fio sources for it, we can see that fio uses exactly that. Let's see if we can do that too.
mmap.advise(Advice::Sequential).unwrap();
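The line above covers the sequential reader; for the random workload the counterpart would presumably be Advice::Random. A minimal sketch of both, assuming the Mmap type comes from the memmap2 crate:

```rust
use memmap2::{Advice, Mmap};

// Hint the expected access pattern to the kernel (madvise under the hood).
fn advise_for_workload(mmap: &Mmap, sequential: bool) -> std::io::Result<()> {
    if sequential {
        mmap.advise(Advice::Sequential) // MADV_SEQUENTIAL sets VM_SEQ_READ, encourages readahead
    } else {
        mmap.advise(Advice::Random) // MADV_RANDOM sets VM_RAND_READ, disables readahead
    }
}
```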
A quick glance with bpftrace confirms that we have the same flags!
Further, the pte flag from above now also appears. I have no clue what that means, but I am sure it's a good thing…
Using the addition, we can now produce behavior similar to fio (the new code is called smart_reader):
==> out_trace_all_rand_fio.log <==
Attaching 6 probes...
filemap_fault fio 107086783 134283473 pages: 32
filemap_fault ret fio 107086783 ret: 1028
filemap_fault fio 107086783 134283473 pages: 32
filemap_fault ret fio 107086783 ret: 1028
filemap_fault fio 107086783 134283473 pages: 32
filemap_fault ret fio 107086783 ret: 1028
filemap_fault fio 107086783 134283473 pages: 32
filemap_fault ret fio 107086783 ret: 1028
filemap_fault fio 107086783 134283473 pages: 32
==> out_trace_all_rand_reader.log <==
Attaching 6 probes...
filemap_fault reader 107086783 134217937 pages: 32
page_cache_ra_order 17413618531 reader 0 32 0 107086783
filemap_fault ret reader 107086783 ret: 1028
filemap_fault reader 107086783 134217937 pages: 32
page_cache_ra_order 17422315767 reader 0 32 1393103 107086783
filemap_fault ret reader 107086783 ret: 1028
filemap_fault reader 107086783 134217937 pages: 32
page_cache_ra_order 17422646179 reader 0 32 164783 107086783
filemap_fault ret reader 107086783 ret: 1028
==> out_trace_all_rand_smart_reader.log <==
Attaching 6 probes...
filemap_fault smart_reader 107086783 134283473 pages: 32
filemap_fault ret smart_reader 107086783 ret: 1028
filemap_fault smart_reader 107086783 134283473 pages: 32
filemap_fault ret smart_reader 107086783 ret: 1028
filemap_fault smart_reader 107086783 134283473 pages: 32
filemap_fault ret smart_reader 107086783 ret: 1028
filemap_fault smart_reader 107086783 134283473 pages: 32
filemap_fault ret smart_reader 107086783 ret: 1028
filemap_fault smart_reader 107086783 134283473 pages: 32
We can see that it no longer does readaheads!
On the sequential side, we were not that successful: it still reads only 32 pages.
I am not entirely sure how fio manages to read 64 pages, but I can live with that for now.
We do have another trick up our sleeves for sequential reads: we can pre-populate the whole buffer with MAP_POPULATE. However, this article is already too long.
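For completeness, a minimal sketch of what that could look like, assuming the memmap2 crate, whose populate() option requests MAP_POPULATE so the whole mapping is faulted in up front:

```rust
use memmap2::{Mmap, MmapOptions};
use std::fs::File;

fn map_populated(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // populate() asks the kernel to pre-fault the whole mapping, so the first
    // sequential pass no longer pays a page fault per 4K/128K chunk.
    unsafe { MmapOptions::new().populate().map(&file) }
}
```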
The actual investigation #
Our software product has an index that is used for uniqueness checks on IDs, amongst other things. These IDs are user-controlled data and in most cases random UUIDs. The index is a glorified HashMap stored in a read-only file that is accessed via a memory mapping. We strongly recommend making sure that the index fits into memory, to avoid resource congestion on lookups.
We don't load the index into memory initially but only when parts of it are accessed. (Spoiler: this will become important later.) Since we hash random input data, the access patterns on the memory backing the index are also random, with small portions being sequential when a hash bucket is traversed.
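Purely to illustrate that access pattern (this is not the actual on-disk format), a lookup in such a mapped hash index could look roughly like this: the hash selects a more or less random bucket, which is then scanned sequentially:

```rust
use memmap2::Mmap;

const BUCKET_SIZE: usize = 4096; // hypothetical fixed-size buckets, one page each
const ID_LEN: usize = 16;        // hypothetical 128-bit IDs (UUIDs)

// Hash picks a random page; the bucket on that page is scanned sequentially.
fn contains(index: &Mmap, num_buckets: usize, id: &[u8; ID_LEN], hash: u64) -> bool {
    let bucket_start = (hash as usize % num_buckets) * BUCKET_SIZE;
    index[bucket_start..bucket_start + BUCKET_SIZE]
        .chunks_exact(ID_LEN)
        .any(|entry| entry == id.as_slice())
}
```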
One of our customers contacted support with the observation that application startup was slow. Like, really slow. 15 minutes slow.
We quickly identified the above-mentioned index to be the culprit. Now the question arises: Why is loading the index slow on this customer’s environment, and how can we make it fast?
Let’s summarize what we deal with:
| metric | value |
|---|---|
| Number of files | 128 |
| Total size | 28GB |
| Loading time | 15min |
| Max read bandwidth | 200MB/sec |
| Max IOPS | 5K |
| Observed bandwidth (actual workload) | 32MB/sec |
| Observed bandwidth (synthetic) | 20MB/sec |
Benchmarks on the customer hardware #
Since our database uses mixed workloads, let's use the synthetic fio randread benchmark from above.
This gives us 20MB/sec, as we would expect with the knowledge from above (no readahead, 4K page size).
A simple run with the read benchmark from fio confirms that the readahead works as expected: we get 200MB/s with ~800 IOPS.
Given that fio manages to convince the kernel to read 64 pages instead of the default 32 per readahead, we would expect 781 IOPS, so the numbers match up.
Although our own reader can only read 32 pages at a time, the resulting 1.6K IOPS are still quite a bit below the 5K IOPS limit, so we should be good.
Replicating the cloud environment with cgroups #
Earlier, I described the cloud environment of the customer that triggered this whole endeavor. Since I can’t run random benchmarks on their machines, let’s try to replicate it using cgroups.
Earlier we used systemd-run to constrain memory.
According to the docs, we can also constrain IOPS and bandwidth.
For this to work, you need to [set up delegation](https://wiki.archlinux.org/title/Cgroups#User_delegation).
Then, we can spawn a resource-constrained session as before:
systemd-run --scope -p IOReadBandwidthMax="/ 200M" -p IOReadIOPSMax="/ 5000" --user zsh
If we run in this constrained environment, we get the expected values. Time to fix the problem.
The trivial fix #
If you followed along until here, you probably know the easiest fix already (a rough sketch follows after the list):
- Use madvise to tell the kernel we will read the file sequentially
- Read the whole file into memory sequentially
- Read files in parallel to overcome the synchronous limits when running in environments with more bandwidth and IOPS
- Use madvise again with a neutral (default) advice just to be sure
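Here is that rough sketch of the steps, assuming the memmap2 and rayon crates (the real fix lives in our Java code and looks different):

```rust
use memmap2::{Advice, Mmap, MmapOptions};
use rayon::prelude::*;
use std::fs::File;

const PAGE_SIZE: usize = 4096;

// Warm up one index file: advise sequential access, touch every page once,
// then reset the advice to the default.
fn warm_up(path: &str) -> std::io::Result<()> {
    let file = File::open(path)?;
    let mmap: Mmap = unsafe { MmapOptions::new().map(&file)? };
    mmap.advise(Advice::Sequential)?;
    let mut checksum = 0u64;
    for offset in (0..mmap.len()).step_by(PAGE_SIZE) {
        checksum += mmap[offset] as u64; // touching one byte faults in the whole page
    }
    std::hint::black_box(checksum); // keep the loop from being optimized away
    mmap.advise(Advice::Normal)?;
    Ok(())
}

// Warm up all index files in parallel to use the available bandwidth and IOPS.
fn warm_up_all(paths: &[String]) {
    paths.par_iter().for_each(|p| {
        if let Err(e) = warm_up(p) {
            eprintln!("warm-up of {p} failed: {e}");
        }
    });
}
```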
However, we run on Java, so forget about madvise, open options, etc… It turns out that a simple sequential read is already enough. With about 10 lines of code, the loading time was reduced from 15 minutes to about two minutes. This is still quite slow, but frankly, a bandwidth of only 200MB/s is not much to work with in the first place.
Conclusion #
What did we learn on this journey?
- mmap can be used to work with data in virtual memory that is larger than physical memory
- With cgroups, we can simulate resource-constrained environments, e.g. by limiting CPU, memory, or IOPS
- Reading from random places in memory is much slower than reading continuous regions. This is true for memory mapped files as well as for plain old heap space.
- When using mmap:
  - The kernel will try to read ahead so that reads are larger than the default page size of 4K
  - If it detects that this is not helping, it will stop at some point
  - We can use madvise to enforce the behavior we want
- mmap does provide many convenience features, but for best performance, either experimentation or research is needed.
Closing remarks #
Thanks for reading!
I can now enjoy catharsis from closing a hundred browser tabs.
Cover photo by Nik Shuliahin on Unsplash