Somebody should really document this stuff...: Response to "Optimizing Linux Memory Management..."

I read a fantastic article last week by some engineers from LinkedIn. It was fantastic because it really made me think about how we want the kernel to work. But it also contains inaccuracies that trouble me. Here are a few thoughts on the article.

The "zone reclaim" feature at the root of the authors' troubles is not enabled everywhere, not even on all NUMA systems. Hardware vendors are essentially responsible for whether this feature is on or not.
The "Linux is quite bad at cleaning up this garbage properly" comment really stings. It's the opposite of what I believe and have been advising folks on for years. Linux is actually fantastic at managing its garbage. You can argue that the kernel's default behavior is not obvious, and that we should use zone_reclaim_mode=0 in all but the most extreme NUMA environments. But the fact is that the kernel was working both as designed and as documented.
Do not read too much into this article, especially if you are trying to understand how the Linux VM or the kernel works. The authors misread the "global spinlock on the zone" source code, and the interpretation in the article is dead wrong.
Memory pressure is caused when someone needs a particular kind of memory. Usually, that memory is simply any free memory. But, not all memory is the same from the kernel's perspective: you can see pressure when there is lots of free memory of other kinds. A few examples of these special needs would be DMA-capable memory, physical contiguity for large pages, "low" memory, and NUMA locality. The authors made the fundamental mistake of assuming that having any free memory means that there is no pressure.
There is no such thing as "NUMA memory balancing" in the kernel that the authors were running. The authors are observant in noticing that direct page scans and thp_splits occur at the same times, but they are wholly incorrect in assuming that these constitute any intentional rebalancing. "Transparent HugePages do not play nice with NUMA systems" is also a dangerously broad thing to say, and it is not supported even by the data that the authors present.
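To make the "free memory does not mean no pressure" point concrete, here is a rough Python sketch that parses text in the /proc/buddyinfo format and compares the total free pages in each zone with the free pages that sit in blocks large enough for a huge page. The sample text is made up for illustration, the helper names are mine, and the choice of order 9 assumes 4 KiB base pages and 2 MiB huge pages (as on x86-64).

```python
# Sketch only: helper names and SAMPLE data are illustrative assumptions.
# Each /proc/buddyinfo line lists, per node and zone, the number of free
# blocks at order 0, 1, 2, ... (a block of order k is 2**k contiguous pages).

HUGE_PAGE_ORDER = 9  # assumption: 4 KiB base pages, 2 MiB huge pages

SAMPLE = """\
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal   9000   4000    500     20      3      0      0      0      0      0      0
Node 1, zone   Normal    200    100     50     40     30     20     10      8      6      4      2
"""


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node <n>, zone <name> <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zone = parts[3]
        result[(node, zone)] = [int(count) for count in parts[4:]]
    return result


def free_pages_at_or_above(counts, min_order):
    """Free base pages held in blocks of at least min_order."""
    return sum(n << order for order, n in enumerate(counts) if order >= min_order)


def report(text):
    """Print total free pages vs. pages usable for a huge-page allocation."""
    for (node, zone), counts in sorted(parse_buddyinfo(text).items()):
        total = free_pages_at_or_above(counts, 0)
        big = free_pages_at_or_above(counts, HUGE_PAGE_ORDER)
        print(f"node {node} zone {zone:>7}: {total:6d} pages free, "
              f"{big:6d} in blocks of order >= {HUGE_PAGE_ORDER}")


if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            report(f.read())
    except OSError:
        # Not on Linux, or /proc is unavailable: fall back to the made-up sample.
        report(SAMPLE)
```

In the made-up sample, node 0's Normal zone has thousands of free pages but none in a block big enough for a huge page, so a huge-page allocation on that node sees pressure even though plenty of memory looks "free".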

The thing that most troubles me is how difficult it was for these fellow software engineers (with access to the source code and documentation for their kernel) to figure out what the kernel was doing. How do we get end users to make the leap from "I see latency spikes in my custom database" to "I should set zone_reclaim_mode=0"?
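For anyone who wants to check what their own system is doing, here is a small Python sketch that reads /proc/sys/vm/zone_reclaim_mode and decodes the value. The bit meanings follow the kernel's sysctl documentation for vm.zone_reclaim_mode; the function names are my own, and the file may simply not exist on kernels built without NUMA support, which the sketch treats as "unknown".

```python
from pathlib import Path

# Sketch only: function names are illustrative, not part of any kernel tool.
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")

# Bit values as described in the kernel's vm sysctl documentation.
FLAGS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def describe_zone_reclaim_mode(value):
    """Turn the integer bitmask into a list of human-readable settings."""
    if value == 0:
        return ["zone reclaim off"]
    return [desc for bit, desc in FLAGS.items() if value & bit]


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the current value, or None if it cannot be read or parsed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    if mode is None:
        print("zone_reclaim_mode is not available on this system")
    else:
        print(f"zone_reclaim_mode={mode}: " + ", ".join(describe_zone_reclaim_mode(mode)))
```

If this reports anything other than 0 and you are seeing the kind of latency spikes described in the article, running `sysctl -w vm.zone_reclaim_mode=0` as root is the change being discussed here.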

The LinkedIn folks also pointed out that very few Google searches end up pointing to the (wonderful) linux-mm wiki. Does everybody know that it's there? Does anybody actually use it?

Monday, October 21, 2013