Why We Should Disable THP and How to Do It

In production environments we have encountered many cases of performance fluctuation caused by operating system features, and THP is the most frequent offender. This article therefore covers why THP causes problems, the typical symptoms, how to analyze them, and our configuration recommendations, including how to disable THP.

Introduction to THP (Transparent Huge Page)

The world is not black and white. THP is an important kernel feature that continues to evolve. Its purpose is to reduce page faults and improve the hit rate of the TLB (Translation Lookaside Buffer, which the memory management unit uses to speed up virtual-to-physical address translation) by letting each page table entry map a larger block of memory. In line with the principles of memory hierarchy design, THP improves performance when a program's memory access locality is good. When locality is poor, however, THP not only loses its advantage but can become a source of system instability. Unfortunately, the memory accesses of database workloads are usually scattered.

Review of Linux Memory Management

Before discussing the negative effects caused by THP, let’s review how the Linux operating system manages physical memory.

The kernel uses a different memory layout for each architecture. User space is mapped through multi-level page tables to save the space needed for mapping management, while kernel space uses linear mapping for simplicity and efficiency. At kernel startup, physical pages are added to the buddy system; they are then allocated when users request memory and returned when it is released. To accommodate slow devices and varied workloads, Linux divides memory use into anonymous pages and file pages (Page Cache), and uses the Page Cache to cache files, which live on slow devices. Through the swap cache and swappiness, users decide the proportion in which the two kinds of pages are reclaimed when memory is insufficient, according to their workload characteristics.

To respond to memory allocation requests as quickly as possible and to keep the system running when memory is tight, Linux defines three watermarks (high, low, and min). When free physical memory falls below the low watermark but stays above the min watermark, an allocation request wakes the kswapd kernel thread, which reclaims memory asynchronously until free memory rises back above the high watermark. If asynchronous reclamation cannot keep up with the allocation rate, direct memory reclamation is triggered synchronously: every thread requesting memory takes part in reclamation and gets its memory only after the watermark has been raised. If the pages being reclaimed are clean, the blocking caused by synchronous reclamation is relatively short; otherwise it can be very long (tens or hundreds of milliseconds, or even seconds, depending on the speed of the backing device).
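The watermarks described above can be read per zone from /proc/zoneinfo. The following is a minimal sketch, assuming the usual layout of that file ("pages free", "min", "low", "high" lines under each "Node N, zone NAME" header); the function name zone_watermarks and the state labels are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Print free pages and the min/low/high watermarks for every zone,
# plus a rough label for where free memory sits relative to them.
# Usage: zone_watermarks [file]   (file defaults to /proc/zoneinfo)
zone_watermarks() {
    awk '
        # Zone header, for example: "Node 0, zone   Normal"
        $1 == "Node" && $3 == "zone" { node = $2; sub(/,/, "", node); zone = $4 }
        $1 == "pages" && $2 == "free" { free = $3 + 0 }
        $1 == "min"  { min = $2 + 0 }
        $1 == "low"  { low = $2 + 0 }
        $1 == "high" {
            high = $2 + 0
            if      (free < min)  state = "below_min"
            else if (free < low)  state = "below_low"
            else if (free < high) state = "below_high"
            else                  state = "ok"
            printf "node%s %s free=%d min=%d low=%d high=%d %s\n",
                   node, zone, free, min, low, high, state
        }
    ' "${1:-/proc/zoneinfo}"
}
```

Calling zone_watermarks without arguments reads the live system. A zone that stays in the below_low state is one where kswapd is expected to be reclaiming in the background.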

Watermarks are not the only trigger. When a large contiguous block is requested and there is enough free physical memory but it is severely fragmented, the kernel may also perform direct memory reclamation as part of memory defragmentation (compaction), depending on the fragmentation index discussed later. Direct memory reclamation and direct memory defragmentation are therefore the main sources of latency a process may encounter when allocating memory. Under workloads with poor memory access locality, THP becomes the culprit behind both.

The Most Typical Symptom - Skyrocketing Sys CPU Usage

At multiple customer sites we have found that the most typical symptom of performance fluctuation caused by THP allocation is a sharp rise in Sys CPU usage. This symptom is relatively easy to analyze. An on-CPU flame graph captured with perf shows that all of our service's threads in the R state are doing memory defragmentation, and that the page fault handler is do_huge_pmd_anonymous_page. This indicates that no contiguous 2 MB block of physical memory is available, so direct memory defragmentation is triggered. Direct defragmentation is time-consuming, which is why sys utilization rises.
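As a cheaper cross-check alongside the flame graph, the THP and compaction counters in /proc/vmstat can be sampled. This is a minimal sketch, assuming a kernel built with THP and compaction support so that these counters are present; the function name thp_compaction_counters is hypothetical. A compact_stall value that keeps growing while Sys CPU is high suggests that threads are stalling in direct defragmentation.

```shell
#!/bin/sh
# Print the THP page-fault and compaction counters, one "name value" per line.
# Usage: thp_compaction_counters [file]   (file defaults to /proc/vmstat)
thp_compaction_counters() {
    awk '
        $1 == "thp_fault_alloc" || $1 == "thp_fault_fallback" ||
        $1 == "compact_stall"   || $1 == "compact_fail" ||
        $1 == "compact_success" { print $1, $2 }
    ' "${1:-/proc/vmstat}"
}
```

The counters are cumulative since boot, so run the function twice a few seconds apart and compare the values rather than reading a single sample.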

Indirect Symptom - Skyrocketing Sys Load

Real systems are often complex. When allocating THP or other high-order memory, the system does not always perform only direct memory defragmentation, which would leave the clear fingerprint described above; it often mixes in other behaviors such as direct memory reclamation. The involvement of direct memory reclamation makes things more complicated and confusing. For example, why does the system keep doing direct memory reclamation when the free physical memory in the Normal zone is above the high watermark? A closer look at the kernel's slow allocation path shows that it consists mainly of the following steps:

Asynchronous memory defragmentation
Direct memory reclamation
Direct memory defragmentation
OOM reclamation

After each step, the kernel tries the allocation again. If it succeeds, the page is returned immediately and the remaining steps are skipped. For each order of the buddy system, the kernel provides a fragmentation index that indicates whether an allocation failure is due to insufficient memory or to fragmentation. The kernel compares this index with /proc/sys/vm/extfrag_threshold. When the index approaches 1000, the failure is mainly due to fragmentation and the kernel tends to do memory defragmentation. When it approaches 0, the failure is mainly due to insufficient memory and the kernel tends to do memory reclamation. This explains why frequent direct memory reclamation can occur even when free memory is above the high watermark. Because THP consumes high-order memory and thereby accelerates memory fragmentation, it causes performance fluctuations.
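To make the fragmentation index concrete, the sketch below recomputes it from the free-block counts of one zone, as listed on a /proc/buddyinfo line. It follows the calculation in the kernel's __fragmentation_index() as we understand it, using the 0 to 1000 scale; treat it as an approximation for building intuition, not as a substitute for the kernel's own value. The function name frag_index is hypothetical.

```shell
#!/bin/sh
# Approximate the fragmentation index (scale 0..1000) for a requested order.
# Usage: frag_index ORDER N0 N1 N2 ...
#   N_i is the number of free blocks of order i in the zone.
# Prints -1000 when a suitable free block already exists (the request
# would not fail), and 0 when the zone has no free blocks at all.
frag_index() {
    order=$1
    shift
    echo "$*" | awk -v order="$order" '{
        requested = 2 ^ order               # pages needed for the request
        for (i = 1; i <= NF; i++) {
            o = i - 1                       # order of this column
            free_blocks += $i
            free_pages  += $i * 2 ^ o
            if (o >= order) suitable += $i
        }
        if (free_blocks == 0) { print 0; exit }
        if (suitable > 0)     { print -1000; exit }
        inner = int(free_pages * 1000 / requested)
        print 1000 - int((1000 + inner) / free_blocks)
    }'
}
```

For example, a zone whose only free memory is eight order-0 pages has enough pages for an order-3 request but no contiguous block, and frag_index 3 8 0 0 0 prints 750.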

For this symptom, the judgment method is as follows:

Run sar -B and observe pgscand/s, the number of pages scanned per second by direct memory reclamation. If it stays above 0 for a sustained period, continue with the following steps;
Run cat /sys/kernel/debug/extfrag/extfrag_index to observe the memory fragmentation index, focusing on orders >= 3. A value approaching 1.000 indicates severe fragmentation, and a value approaching 0 indicates insufficient memory;
Run cat /proc/buddyinfo and cat /proc/pagetypeinfo to check the state of memory fragmentation; the meaning of the fields is described at https://man.imzye.com/Linux/procps/. Again, focus on the number of free blocks of order >= 3. pagetypeinfo is more detailed than buddyinfo because it groups blocks by migration type (the buddy system uses migration types to resist fragmentation). Note that when the free blocks of the Unmovable migration type are all concentrated in orders < 3, kernel slab fragmentation is severe, and other tools are needed to find the specific cause;
For kernels that support BPF, such as CentOS 7.6, you can also run drsnoop and compactsnoop, two tools we developed, to quantify the latency. See their documentation for usage and interpretation;
(Optional) Use ftrace to capture the mm_page_alloc_extfrag event and observe pages being stolen from fallback migration types because of memory fragmentation.
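The first and third checks above can be scripted roughly as follows. This is a minimal sketch with hypothetical function names. It assumes that the direct-reclaim scan counters in /proc/vmstat start with pgscan_direct (older kernels such as 3.10 split them per zone) and that each /proc/buddyinfo line lists the free-block counts for order 0 upward after the zone name.

```shell
#!/bin/sh
# Total pages scanned by direct reclaim since boot.
# Usage: direct_scan_pages [file]   (file defaults to /proc/vmstat)
direct_scan_pages() {
    awk '
        # Sum pgscan_direct and any per-zone pgscan_direct_* counters,
        # but skip pgscan_direct_throttle, which counts throttling events.
        $1 ~ /^pgscan_direct/ && $1 != "pgscan_direct_throttle" { sum += $2 }
        END { printf "%.0f\n", sum }
    ' "${1:-/proc/vmstat}"
}

# Number of free blocks of order >= MIN_ORDER in every zone.
# Usage: high_order_free [min_order] [file]
#   min_order defaults to 3, file defaults to /proc/buddyinfo
high_order_free() {
    awk -v min_order="${1:-3}" '
        $1 == "Node" {
            blocks = 0
            # Field 5 holds the order-0 count, field 6 order 1, and so on.
            for (i = 5; i <= NF; i++)
                if (i - 5 >= min_order) blocks += $i
            node = $2; sub(/,/, "", node)
            printf "node%s %s free_blocks_order>=%d: %d\n", node, $4, min_order, blocks
        }
    ' "${2:-/proc/buddyinfo}"
}
```

Sampling direct_scan_pages twice and dividing the difference by the interval gives a rate comparable to pgscand/s. A zone for which high_order_free reports a count near zero while plenty of memory is free is a fragmentation suspect.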

Atypical Symptom - Abnormal RES Usage

We encountered a case in which our service occupied tens of GB of physical memory right after startup on an AARCH64 server. The /proc/<pid>/smaps file showed that most of this memory was used by THP. The CentOS 7 kernel built for AARCH64 uses a 64K PAGE SIZE, which makes each huge page far larger than on X86_64 (512 MB rather than 2 MB), so the memory usage was many times higher than on the X86_64 platform. While locating the issue, we also fixed a jemalloc bug in which THP was not completely disabled: fix opt.thp:never still use THP with base_map.
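The share of a process's resident memory that is backed by THP can be totalled from its smaps file. This is a minimal sketch; the function name thp_usage_kb is hypothetical, and it relies only on the AnonHugePages and Rss fields that smaps reports for each mapping.

```shell
#!/bin/sh
# Sum the AnonHugePages and Rss fields over all mappings of a process.
# Usage: thp_usage_kb <smaps-file>   (for example /proc/<pid>/smaps)
thp_usage_kb() {
    awk '
        $1 == "AnonHugePages:" { thp += $2 }
        $1 == "Rss:"           { rss += $2 }
        END { printf "AnonHugePages=%.0f kB Rss=%.0f kB\n", thp, rss }
    ' "$1"
}
```

Reading another process's smaps file usually requires root or the same user. If AnonHugePages accounts for most of Rss shortly after startup, THP is the likely cause of the unexpected RES usage.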

Conclusion

For programs or workloads that have not been optimized for memory access locality, setting THP and THP defrag to always does more harm than good for long-running services. Moreover, the kernel has offered the defer option for THP defrag only since version 4.6. On the 3.10 kernel of CentOS 7, which we commonly use, enabling THP and THP defrag can therefore cause performance fluctuations if the workload's memory accesses are scattered.

Reference

https://mp.weixin.qq.com/s/2--YMLx3kubjwPncenzDeQ

Appendix

View the current THP mode:

cat /sys/kernel/mm/transparent_hugepage/enabled

If the output shows always as the selected value, that is [always], execute:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

This disables THP.
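The sysfs files list every available mode and mark the active one with square brackets, for example "[always] madvise never". To check the result in a script, the bracketed word can be extracted. This is a minimal sketch; the function name thp_active_mode is hypothetical.

```shell
#!/bin/sh
# Print the active mode (the bracketed word) of a THP sysfs control file.
# Usage: thp_active_mode [file]
#   file defaults to /sys/kernel/mm/transparent_hugepage/enabled
thp_active_mode() {
    # The "+" in the bracket expression covers modes such as defer+madvise.
    sed -n 's/.*\[\([a-z+]*\)\].*/\1/p' \
        "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}
```

After the two echo commands, both thp_active_mode and thp_active_mode /sys/kernel/mm/transparent_hugepage/defrag should print never.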

Note that these settings are lost when the server restarts. To make them persistent, put the two commands in a .service file and let systemd manage it.
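One possible shape for such a unit is sketched below as a shell function that writes the file. The unit name disable-thp.service and the function name write_thp_unit are assumptions made for illustration; adjust the unit to your environment, for example by ordering it before your database service.

```shell
#!/bin/sh
# Write a oneshot systemd unit that disables THP at boot.
# Usage: write_thp_unit <unit-file-path>
write_thp_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=Disable Transparent Huge Pages (THP)

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=multi-user.target
EOF
}
```

For example, run write_thp_unit /etc/systemd/system/disable-thp.service as root, then systemctl daemon-reload, systemctl enable disable-thp.service, and systemctl start disable-thp.service.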
