This article requires some prerequisite knowledge about the Linux kernel's memory management subsystem.
Recently, while investigating an OOM (out-of-memory) problem, Twitter engineers found that the slab cache was growing steadily while the page cache was shrinking steadily. A closer look showed that the biggest consumer of the slab cache was the dentry cache, and that those dentry caches were charged to a single memory control group (cgroup). It appeared that the Linux kernel's memory reclaimer was reclaiming the page cache but not reclaiming the dentry cache at all.
We tried to drop the slab cache by running:
# echo 2 > /proc/sys/vm/drop_caches
but surprisingly it didn’t drop the cache.
By debugging the problem, we found that there was a rare race condition in the shrinker code.
To fully understand what prevented the slab cache from being dropped, we need to take a closer look at Linux kernel internals, particularly the memory reclaim path of the memory management subsystem.
The memory resource of a machine is not infinite. There may be more demand than supply for memory if we run multiple programs simultaneously or run a very big program. Once available memory runs low, the memory reclaimer (kswapd for background reclaim, or direct reclaim) tries to evict "old" memory to free enough space to satisfy allocations. The reclaimer scans a set of lists (most of which are least-recently-used lists, aka LRU lists) to evict memory that has not been used recently. The lists include the anonymous memory lists, the file memory lists, and the slab caches (Figure 1).
The reclaimer scans all the above lists, except the unevictable list. However, sometimes the reclaimer only scans a few of them. For example, the reclaimer does not scan the anonymous memory list if the swap partition is not available.
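For reference, the kernel enumerates these page LRU lists in a small fixed set. The sketch below is a simplified version of the enum from include/linux/mmzone.h in v5.10-era kernels (the derived initializer values are elided); note that slab caches are not on these page LRUs and are instead reclaimed through shrinkers, described next.

/*
 * Simplified sketch of the page LRU list types (include/linux/mmzone.h,
 * v5.10-era kernels). The reclaimer scans each of these except
 * LRU_UNEVICTABLE; slab caches are handled separately via shrinkers.
 */
enum lru_list {
        LRU_INACTIVE_ANON,      /* anonymous memory, not recently used */
        LRU_ACTIVE_ANON,        /* anonymous memory, recently used */
        LRU_INACTIVE_FILE,      /* file-backed (page cache), not recently used */
        LRU_ACTIVE_FILE,        /* file-backed (page cache), recently used */
        LRU_UNEVICTABLE,        /* mlocked and other unevictable pages */
        NR_LRU_LISTS
};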
When the reclaimer tries to reclaim slab caches, it needs to call shrinkers. Slab caches are memory objects of various sizes managed by different subsystems; examples include the inode cache and the dentry cache. Some slab caches are reclaimable. Since these caches hold data for specific subsystems and carry subsystem-specific state, they are reclaimed by dedicated shrinkers that understand how the caches are managed.
Filesystem metadata caches have a dedicated shrinker for each super block's metadata. The reclaimable filesystem metadata caches sit on memory-cgroup-aware and non-uniform memory access (NUMA) aware LRU lists. The shrinker scans these lists to reclaim some of the caches when memory pressure hits.
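As a rough illustration of how such a shrinker is wired up, here is a minimal sketch modeled on the per-super-block shrinker in v5.10-era kernels. The names example_count_objects, example_scan_objects, and example_shrinker are illustrative, and the callback bodies are placeholders rather than the real fs/super.c implementation.

#include <linux/shrinker.h>

/* Called first: report how many objects are freeable for the
 * (memcg, NUMA node) pair described by sc->memcg and sc->nid,
 * e.g. via list_lru_shrink_count(&sb->s_dentry_lru, sc). */
static unsigned long example_count_objects(struct shrinker *shrink,
                                           struct shrink_control *sc)
{
        return 0;       /* 0 means "nothing to do" for this call */
}

/* Called second: free up to sc->nr_to_scan objects, e.g. via
 * prune_dcache_sb()/prune_icache_sb(), and return how many were freed. */
static unsigned long example_scan_objects(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        return SHRINK_STOP;
}

static struct shrinker example_shrinker = {
        .count_objects  = example_count_objects,
        .scan_objects   = example_scan_objects,
        .seeks          = DEFAULT_SEEKS,
        /* maintain one LRU per NUMA node and per memory cgroup */
        .flags          = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
};

/* registered with register_shrinker(&example_shrinker) */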
For modern data centers, cgroups are usually used to manage resources and to achieve better resource isolation. The memory cgroup is used to manage memory resources.
Without memory cgroups, the aforementioned LRU lists are attached to NUMA nodes.
When memory cgroups are present, the LRU lists exist at a per-memory-cgroup, per-node level. Each memory cgroup has its own dedicated LRU lists for the memory reclaimer (Figure 3).
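This per-node, per-memcg layout is reflected in the kernel's list_lru data structures, which the dentry and inode caches use for their LRUs. The sketch below is a simplified subset of include/linux/list_lru.h from v5.10-era kernels, with some fields omitted.

/* Simplified sketch of the list_lru layout (v5.10-era kernels). */
struct list_lru_one {
        struct list_head        list;
        long                    nr_items;       /* length of this list */
};

struct list_lru_memcg {
        struct rcu_head         rcu;
        /* one list per memory cgroup, indexed by the memcg's kmemcg_id */
        struct list_lru_one     *lru[];
};

struct list_lru_node {
        spinlock_t              lock;
        struct list_lru_one     lru;            /* list for the root cgroup */
        struct list_lru_memcg __rcu *memcg_lrus; /* per-memcg lists on this node */
        long                    nr_items;
};

struct list_lru {
        struct list_lru_node    *node;          /* one entry per NUMA node */
        struct list_head        list;
        int                     shrinker_id;
        bool                    memcg_aware;
};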
When reclaiming memory, the memory reclaimer traverses all memory cgroups to scan the LRU lists.
For a typical production deployment, there may be around 40 shrinkers, including filesystem metadata shrinkers for each super block.
When there are many memory cgroups and many shrinkers, serious scalability problems may occur if every shrinker is called for every memory cgroup, even when a shrinker has no reclaimable objects for that cgroup. For example, with 40 shrinkers and, say, 1,000 memory cgroups, a single reclaim pass could make 40,000 shrinker calls per NUMA node, most of which would find nothing to reclaim.
The upstream kernel introduced an optimization called shrinker_maps: a per-memory-cgroup bitmap that indicates which shrinkers have reclaimable objects, so that only those shrinkers are called. A shrinker's bit is set when a new object is added to the corresponding LRU, and cleared when the LRU is found empty. This optimization eliminated the aforementioned scalability issue. When a memory cgroup is offlined (for example, when its directory is removed), its LRUs are spliced onto its parent memory cgroup's LRUs ("reparenting") and the parent's bitmap is set accordingly.
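To see why a wrongly cleared bit makes a shrinker silently disappear from reclaim, here is a heavily simplified sketch of the bitmap-driven slab shrinking loop, modeled on shrink_slab_memcg() in mm/vmscan.c from v5.10-era kernels; locking, the cgroup online check, and the re-scan performed after clearing a bit are all omitted.

/* Heavily simplified sketch: only shrinkers whose bit is set in the
 * memcg's shrinker_map are called for this (memcg, node) pair. */
static unsigned long shrink_slab_memcg_sketch(gfp_t gfp_mask, int nid,
                                              struct mem_cgroup *memcg,
                                              int priority)
{
        struct memcg_shrinker_map *map;
        unsigned long freed = 0;
        int i;

        map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);

        for_each_set_bit(i, map->map, shrinker_nr_max) {
                struct shrink_control sc = {
                        .gfp_mask = gfp_mask,
                        .nid = nid,
                        .memcg = memcg,
                };
                struct shrinker *shrinker = idr_find(&shrinker_idr, i);
                unsigned long ret = do_shrink_slab(&sc, shrinker, priority);

                if (ret == SHRINK_EMPTY)
                        /* No objects seen: clear the bit so this shrinker is
                         * skipped until a new object sets it again. */
                        clear_bit(i, map->map);
                else
                        freed += ret;
        }
        return freed;
}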
The following command traverses all LRU lists, from all memory cgroups, to shrink all reclaimable objects.
# echo 2 > /proc/sys/vm/drop_caches
By checking memory-related statistics exported by the kernel (for example, /proc/meminfo and /proc/slabinfo) and checking slab usage with slabtop, we could see that dentry caches consumed around 2GB of memory under one memory cgroup. Since a dentry object is typically 192 bytes, that works out to roughly 10 million dentry objects (2GB / 192 bytes ≈ 10 million). Dropping caches should be able to shrink most of them.
However, there are a couple of reasons why dropping caches might not drop slab caches. For example, the LRU lists might be short, or the objects might be referenced or locked while being scanned and therefore temporarily unreclaimable. The next step was to get some insight into what was going on with the shrinkers.
The Linux kernel provides plenty of ways to observe how it runs, and the built-in tracepoints are among the most commonly used. There are two tracepoints for the shrinkers: vmscan:mm_shrink_slab_start and vmscan:mm_shrink_slab_end.
There are a couple of different tools to enable tracepoints and collect tracing logs, for example directly manipulating the files under debugfs, perf, or trace-cmd. We prefer trace-cmd, a package that supports plenty of commands for kernel tracing and is available in almost all distros. For the usage of trace-cmd, please refer to its manpage.
The following command starts tracing for the shrinker tracing points:
# trace-cmd start -e vmscan:mm_shrink_slab_start -e vmscan:mm_shrink_slab_end
Then, this command shows the tracing log:
# trace-cmd show
For example, the tracing result looks like the following:
kswapd1-474 [066] .... 508610.111172: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 000000000043230d: nid: 1 objects to shrink 0 gfp_flags GFP_KERNEL cache items 38084993 delta 18596 total_scan 18596 priority 12
kswapd1-474 [066] .... 508610.139951: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 000000000043230d: nid: 1 unused scan count 0 new scan count 164 total_scan 164 last shrinker return val 18432
Analyzing the tracing log showed that the dentry caches under the cgroups that consumed the most dentry caches were not scanned at all; the shrinker was not even called for them. That definitely looks odd!
As mentioned above, the only thing that prevents a shrinker from being called is a cleared bit in the shrinker_map bitmap, and that bit should only be cleared when the corresponding LRU list is empty. Could the bit have been cleared even though there were still objects on the LRU list? We needed to inspect the shrinker-related kernel data structures to find out what was going on.
Kernel data structures can be inspected with a couple of tools, for example the well-known crash utility. Here we used drgn, which supports scripting in Python and live kernel debugging.
With the Python script below, we can easily inspect the length of the LRU list and shrinker_map bitmap for a specific memory cgroup:
from drgn import container_of
from drgn.helpers.linux.cgroup import css_for_each_child
from drgn.helpers.linux.list import list_for_each_entry

# Find the memory cgroup named "user.slice" (run inside the drgn CLI, where `prog` is predefined)
for user in css_for_each_child(prog['root_mem_cgroup'].css):
    if user.cgroup.kn.name.string_().decode() == "user.slice":
        break
memcg = container_of(user, 'struct mem_cgroup', 'css')

# Sum the dentry LRU length over all super blocks; slot 9 is the kmemcg_id of this memcg
nr_items = 0
for sb in list_for_each_entry('struct super_block', prog['super_blocks'].address_of_(), 's_list'):
    nr_items += sb.s_dentry_lru.node.memcg_lrus.lru[9].nr_items

bitmap = memcg.nodeinfo[0].shrinker_map.map[0]  # first word of the per-node shrinker_map bitmap
The nr_items shows the length of the dentry LRU lists and bitmap shows the shrinker_map bitmap.
The result shows that there were around 10 million dentry cache objects on the LRU list, and that most of them were reclaimable. So the LRU list was definitely not empty, but in the shrinker_map bitmap the corresponding bit was cleared.
It really seems weird that the bit was cleared even though there were abundant reclaimable caches. It smells like a kernel bug.
Reading the kernel code shows that the kernel clears the bit if and only if the corresponding LRU list is empty. The kernel sets the bit when a new object is added to the LRU, and when a child memory cgroup's LRU lists are reparented to its parent on offline.
Since our use case creates and removes memory cgroups very frequently, reparenting happens very often. It seems like there might be some problems with reparenting.
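Reparenting is handled by memcg_drain_list_lru_node() in mm/list_lru.c. The sketch below is a simplified version of the pre-fix logic in v5.10-era kernels (the same function the patch below modifies, with the IRQ-safety comment condensed); it shows how fragile the bit-setting condition is.

/* Simplified sketch of memcg_drain_list_lru_node() before the fix:
 * splice the offlined child's list onto the parent's and set the
 * parent's shrinker bit, but only if the parent's list looked empty. */
static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
                                      int src_idx, struct mem_cgroup *dst_memcg)
{
        struct list_lru_node *nlru = &lru->node[nid];
        int dst_idx = dst_memcg->kmemcg_id;
        struct list_lru_one *src, *dst;
        bool set;

        /* list_lru_{add,del} may run under an IRQ-safe lock */
        spin_lock_irq(&nlru->lock);

        src = list_lru_from_memcg_idx(nlru, src_idx);   /* child's list */
        dst = list_lru_from_memcg_idx(nlru, dst_idx);   /* parent's list */

        list_splice_init(&src->list, &dst->list);

        /* Fragile: if an earlier reparent raced with a shrinker run,
         * dst->nr_items can be non-zero (even negative) while the shrinker
         * bit is already cleared, so "set" stays false and the bit is
         * never set again. */
        set = (!dst->nr_items && src->nr_items);
        dst->nr_items += src->nr_items;
        if (set)
                memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
        src->nr_items = 0;

        spin_unlock_irq(&nlru->lock);
}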
Further code inspection led to the possible race condition below:
CPU A                                      | CPU B
-------------------------------------------+----------------------------------------------
reparent:                                  |
    dst->nr_items == 0                     |
                                           | shrinker:
                                           |     total_objects == 0
    add src->nr_items to dst               |
    set_bit                                |
                                           |     return SHRINK_EMPTY
                                           |     clear_bit
child memcg offline:                       |
    replace child's kmemcg_id with         |
    parent's (in memcg_offline_kmem())     |
                                           | list_lru_del() between shrinker runs:
                                           |     see parent's kmemcg_id
                                           |     dec dst->nr_items
reparent again:                            |
    dst->nr_items may go negative          |
    due to concurrent list_lru_del()       |
                                           | the second run of the shrinker:
                                           |     read nr_items without any
                                           |     synchronization, so it may see an
                                           |     intermediate negative nr_items, then
                                           |     total_objects may return 0 coincidentally
                                           |     keep the bit cleared
    dst->nr_items != 0                     |
    skip set_bit                           |
    add src->nr_items to dst               |

After this point, dst->nr_items may never reach zero again, so reparenting will never set the shrinker_map bit anymore.
Once the race condition was located, the fix was straightforward: rather than deciding whether to set the bit based on dst->nr_items, which may be transiently negative, set the bit whenever a non-empty child list is spliced into the parent. The fix only takes a few lines:
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 8de5e37..1e61161 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
         struct list_lru_node *nlru = &lru->node[nid];
         int dst_idx = dst_memcg->kmemcg_id;
         struct list_lru_one *src, *dst;
-        bool set;
 
         /*
          * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,11 +545,12 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
         dst = list_lru_from_memcg_idx(nlru, dst_idx);
 
         list_splice_init(&src->list, &dst->list);
-        set = (!dst->nr_items && src->nr_items);
-        dst->nr_items += src->nr_items;
-        if (set)
+
+        if (src->nr_items) {
+                dst->nr_items += src->nr_items;
                 memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
-        src->nr_items = 0;
+                src->nr_items = 0;
+        }
 
         spin_unlock_irq(&nlru->lock);
 }
We contributed the patch to the upstream kernel, and it has been merged into the mainline kernel tree. See the commit on git.kernel.org.
To investigate the issue, we used the drill-down analysis method and the USE method (Utilization, Saturation, and Errors). Our process was the following: check the memory statistics, trace the shrinkers, inspect the kernel data structures, and finally review the kernel code to pin down the race.
We also identified a couple of contributing factors with this approach and these tools.
The tools we used ranged from kernel statistics interfaces (/proc/meminfo, /proc/slabinfo, slabtop), to tracing tools (the shrinker tracepoints with trace-cmd), to live kernel debugging with drgn.
Memory reclamation is one of the most complicated parts of the Linux kernel. It is full of heuristic algorithms, complex corner cases, complicated data structures, and convoluted interactions with other subsystems. Memory reclamation is a core part of the Linux kernel that other subsystems rely on. However, bugs or suboptimal behavior in memory reclamation may take a long time to be discovered, the fixes may be quite subtle, and validation may take substantial effort to guarantee there are no regressions.