Issue

  • How to monitor memory usage statistics and tune the memory management subsystem if needed?
  • Memory tuning guidelines for Red Hat Enterprise Linux.

Environment

  • Red Hat Enterprise Linux 3
  • Red Hat Enterprise Linux 4
  • Red Hat Enterprise Linux 5
  • Red Hat Enterprise Linux 6

Resolution

LowMem Starvation
  • Memory usage on 32-bit systems can become problematic under some workloads, especially for I/O-intensive applications such as:

    • Oracle Database or Application Server

    • Java

    • On the x86 architecture, physical memory from 16MB to 896MB is known as "low memory" (ZONE_NORMAL) and is permanently mapped into kernel space. Many kernel resources must live in the low memory zone; in fact, many kernel operations can only take place there, which makes low memory the most performance-critical zone. If you run many resource-intensive applications and/or use a large amount of physical memory, "low memory" can run low, since more kernel structures must be allocated in this area. Under heavy I/O workloads the kernel may become starved for LowMem even though there is an abundance of available HighMem. Because the kernel tries to keep as much data in cache as possible, this can lead to OOM kills or complete system hangs.

    • On 64-bit systems all memory is allocated in ZONE_NORMAL, so LowMem starvation does not affect them. Moving to 64-bit is a permanent fix for LowMem starvation.
Diagnosing
  • The amount of LowMem can be checked in /proc/meminfo.  If LowFree falls below 50MB it may be cause for concern.  However, this does not always indicate a problem, as the kernel will try to use the entire LowMem zone and may be able to reclaim some of the cache.

        MemTotal:       502784 kB
        MemFree:         29128 kB
        HighTotal:      162088 kB
        HighFree:        22860 kB
        LowTotal:       340696 kB
        LowFree:          6268 kB
    
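The check above can be scripted. A minimal sketch, using the sample figures shown here (on a live 32-bit system, read /proc/meminfo instead of the here-doc; the 50 MB threshold is the rule of thumb from this article, not a kernel limit):

```shell
#!/bin/sh
# Flag possible LowMem pressure when LowFree drops below ~50 MB.
THRESHOLD_KB=51200   # 50 MB expressed in kB

# Sample values from the article; on a real system use:
#   lowfree_kb=$(awk '/^LowFree:/ {print $2}' /proc/meminfo)
lowfree_kb=$(awk '/^LowFree:/ {print $2}' <<'EOF'
MemTotal:       502784 kB
MemFree:         29128 kB
HighTotal:      162088 kB
HighFree:        22860 kB
LowTotal:       340696 kB
LowFree:          6268 kB
EOF
)

if [ "$lowfree_kb" -lt "$THRESHOLD_KB" ]; then
    echo "WARNING: LowFree is ${lowfree_kb} kB (below ${THRESHOLD_KB} kB)"
fi
```

Remember that a low LowFree value alone is not proof of a problem, since reclaimable cache may still be available.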
  • OOM-KILLER: the kernel should print sysrq-M information to /var/log/messages and the console.  You may see the Normal zone reporting all_unreclaimable? yes, meaning the kernel could not reclaim any memory in this zone.

kernel: DMA free:12544kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:2814 all_unreclaimable? yes
kernel: Normal free:888kB min:928kB low:1856kB high:2784kB active:4152kB inactive:3724kB present:901120kB pages_scanned:9900 all_unreclaimable? yes
kernel: HighMem free:8731264kB min:512kB low:1024kB high:1536kB active:784164kB inactive:38796kB present:10354684kB pages_scanned:0 all_unreclaimable? no
  • In this case we can also see that the largest contiguous block of memory in the LowMem range is 32kB, so if the kernel requires anything larger than that the allocation may fail.
kernel: DMA: 4*4kB 4*8kB 3*16kB 3*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12544kB
kernel: Normal: 80*4kB 1*8kB 29*16kB 3*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 888kB
kernel: HighMem: 660*4kB 160*8kB 61*16kB 55*32kB 496*64kB 2799*128kB 2385*256kB 1356*512kB 643*1024kB 261*2048kB 1425*4096kB = 8731264kB
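Each of these lines is just a list of count*size terms, so summing them reproduces the kernel's reported total. A small sketch, using the Normal-zone line above:

```shell
#!/bin/sh
# Sum the per-order buckets of a buddy-allocator line (count*sizekB terms).
line="80*4kB 1*8kB 29*16kB 3*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB"

set -f                       # terms contain '*'; disable filename globbing
total_kb=0
for term in $line; do
    count=${term%%\**}       # text before the '*'
    size=${term#*\*}         # text after the '*' ...
    size=${size%kB}          # ... minus the kB suffix
    total_kb=$(( total_kb + count * size ))
done
set +f
echo "${total_kb} kB"        # 888 kB, matching the kernel's total
```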
  • SYSTEM HANGS: A core file captured at the time of the hang can often provide evidence for LowMem starvation:

        crash> kmem -i
                     PAGES        TOTAL      PERCENTAGE
        TOTAL MEM  1021613       3.9 GB   
             FREE   159502     623.1 MB   15% of TOTAL MEM
             USED   862111       3.3 GB   84% of TOTAL MEM
           SHARED   198019     773.5 MB   19% of TOTAL MEM
          BUFFERS    43212     168.8 MB    4% of TOTAL MEM
           CACHED   103623     404.8 MB   10% of TOTAL MEM
             SLAB    63170     246.8 MB    6% of TOTAL MEM
    
        TOTAL HIGH   802802       3.1 GB   78% of TOTAL MEM
         FREE HIGH   155824     608.7 MB   19% of TOTAL HIGH
         TOTAL LOW   218811     854.7 MB   21% of TOTAL MEM
          FREE LOW     3678      14.4 MB    1% of TOTAL LOW
    
        TOTAL SWAP  1048554         4 GB   
         SWAP USED       52       208 KB    0% of TOTAL SWAP
         SWAP FREE  1048502         4 GB   99% of TOTAL SWAP
    
  • Installing hangwatch can also be useful if the M flag is enabled for sysrq.

Tuning
Sysctl
  • RHEL 4    

    • Attempt to protect 100MB of LowMem from userspace allocations (defaults to 0)

              vm.lower_zone_protection=100
      
  • RHEL 5

    • The pagecache value represents a percentage of physical RAM. When the filesystem cache exceeds this percentage, cache pages are added only to the inactive list, so under memory-reclaim conditions the kernel is more likely to reclaim pages from the cache instead of swapping out anonymous pages.

              vm.pagecache=100
      
  • RHEL 6

    • Takes highmem into account along with lowmem when calculating dirty_ratio and dirty_background_ratio. This makes page reclaim faster.

              vm.highmem_is_dirtyable=1
      
  • RHEL 5 and 6

    • zone_reclaim_mode determines the approach used to reclaim memory when a zone runs out of memory. If it is set to zero then no zone reclaim occurs; allocations will be satisfied from other zones/nodes in the system. The value is a bitmap, consisting of:

              1       = Zone reclaim on
              2       = Zone reclaim writes dirty pages out
              4       = Zone reclaim swaps pages
      
    • Attempt to protect approximately 1/9 (98MB) of LowMem from userspace allocations (defaults to 1/32, or 27.5MB)

              vm.lowmem_reserve_ratio=256     256     9
      
    • Note: This parameter takes three whitespace-separated values, so the line above must be entered in full, exactly as shown.

  • RHEL 4, 5 and 6

    • Increase the tendency to swap pages out to disk.  The value ranges from 0 to 100 (default 60).  Setting it below 10 is not recommended.

              vm.swappiness=80
      
    • Try to keep at least 19MB of memory free (default varies).  Adjust this to something higher than what is currently in use.

              vm.min_free_kbytes=19000
      
    • Decrease the amount of time before a page is considered old enough to be flushed to disk by the pdflush daemon (default 2999).  Expressed in hundredths of a second.

              vm.dirty_expire_centisecs=2000
      
    • Shorten the interval at which the pdflush daemon wakes up to write dirty data to disk (default 499).  Expressed in hundredths of a second.

              vm.dirty_writeback_centisecs=400
      
    • Decrease the tendency of the kernel to reclaim the memory used for caching directory and inode objects (default 100; do not increase this beyond 100, as it can cause excessive reclaim).

              vm.vfs_cache_pressure=50
      
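Any of the tunables above can be tested at runtime with sysctl -w, then made persistent in /etc/sysctl.conf. A sketch (requires root; the values are the examples from this section, not recommendations for every system):

```shell
# Apply at runtime (takes effect immediately, lost on reboot):
#   sysctl -w vm.swappiness=80
#
# Persist across reboots by adding the lines below to /etc/sysctl.conf,
# then reloading with:
#   sysctl -p

# /etc/sysctl.conf fragment (example values from this section):
vm.swappiness = 80
vm.min_free_kbytes = 19000
vm.vfs_cache_pressure = 50
```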
Overcommit Memory
  • Overcommitting memory allows the kernel to potentially allocate more memory than the system actually has.  This is perfectly safe, and in fact the default behavior, as the Linux VM handles the management of memory.  However, to tune it, consider the following information from the proc(5) man page:
       /proc/sys/vm/overcommit_memory
              This file contains the kernel virtual memory accounting mode.  Values are:

                     0: heuristic overcommit (this is the default)
                     1: always overcommit, never check
                     2: always check, never overcommit

              In mode 0, calls of mmap(2)  with  MAP_NORESERVE  are  not  checked,  and  the
              default  check  is  very  weak, leading to the risk of getting a process "OOM-
              killed".  Under Linux 2.4 any non-zero  value  implies  mode  1.   In  mode  2
              (available  since Linux 2.6), the total virtual address space on the system is
              limited to (SS + RAM*(r/100)), where SS is the size of the swap space, and RAM
              is  the  size  of  the  physical  memory,  and  r  is the contents of the file
              /proc/sys/vm/overcommit_ratio.
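The mode-2 formula can be checked with simple arithmetic. A sketch using hypothetical figures (4 GB of swap, 4 GB of RAM, and the kernel's default overcommit_ratio of 50):

```shell
#!/bin/sh
# Mode 2 limits total virtual address space to SS + RAM*(r/100).
# Hypothetical figures for illustration only:
swap_kb=4194304     # SS:  4 GB swap
ram_kb=4194304      # RAM: 4 GB physical memory
ratio=50            # r:   /proc/sys/vm/overcommit_ratio

commit_limit_kb=$(( swap_kb + ram_kb * ratio / 100 ))
echo "CommitLimit: ${commit_limit_kb} kB"   # 6291456 kB (6 GB)
```

On a live system the kernel reports this figure as CommitLimit in /proc/meminfo.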
HugePages
  • Enabling an application to use HugePages benefits the VM, as it allows the application to lock data into memory and prevent it from being swapped out.  Some advantages of such a configuration:

    • Increased performance through increased TLB hits

    • Pages are locked in memory and are never swapped out which guarantees that shared memory like SGA remains in RAM

    • Contiguous pages are preallocated and cannot be used for anything else but for System V shared memory (e.g. SGA)

    • Less bookkeeping work for the kernel for that part of virtual memory due to larger page sizes

  • HugePages are only useful for applications that are aware of them (i.e., do not recommend them as a way to solve all memory issues).  They are only used for shared memory allocations, so be sure not to allocate too many pages.  By default the size of one HugePage is 2MB.  For Oracle systems, allocate enough HugePages to hold the entire SGA in memory.

  • To enable HugePages use a sysctl setting to define how many pages should be allocated:    

    • RHEL 3

      vm.hugetlb_pool=1024
      
    • RHEL 4 onwards

      vm.nr_hugepages=1024
      
    • The application user must also have its memlock limit (in kB) increased in /etc/security/limits.conf so it can lock that many pages into memory:

      oracle - memlock 2097152
      
    • On RHEL 4, 5 or 6 this user must be logged out and back in (i.e., the application restarted) before the settings will be applied.  On RHEL 3 the system must be rebooted.
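The arithmetic behind the example values above (1024 pages, memlock 2097152) can be sketched for a hypothetical 2 GB SGA:

```shell
#!/bin/sh
# Size the HugePage pool for a hypothetical 2 GB shared-memory segment.
sga_kb=2097152          # 2 GB SGA, illustration only
hugepage_kb=2048        # default HugePage size (2 MB) on x86

# Round up so the whole segment fits in HugePages.
nr_hugepages=$(( (sga_kb + hugepage_kb - 1) / hugepage_kb ))
memlock_kb=$(( nr_hugepages * hugepage_kb ))

echo "vm.nr_hugepages=${nr_hugepages}"      # 1024
echo "oracle - memlock ${memlock_kb}"       # 2097152
```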


Diagnostic Steps

Apart from /proc/meminfo, the file /proc/zoneinfo can also be used to monitor memory usage statistics. In that file, note the following fields under each zone. For example, in the Normal zone of Node 0:

Node 0, zone   Normal
  pages free     1451395
        min      4000
        low      5000
        high     6000

pages_low - When the number of free pages reaches pages_low, the buddy allocator wakes kswapd to start freeing pages.

pages_min - When pages_min is reached, the allocator will do the kswapd work in a synchronous fashion, sometimes referred to as the direct-reclaim path.

pages_high - Once kswapd has been woken to start freeing pages it will not consider the zone to be “balanced” until pages_high pages are free. Once the watermark has been reached, kswapd will go back to sleep.

By observing the values of pages_low and pages_min over a period of time, we can get a fair estimate of the memory usage pattern in that particular zone of that node.
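Extracting those watermarks can be scripted. A minimal sketch that parses the sample zone shown above (on a live system, feed it /proc/zoneinfo instead of the here-doc):

```shell
#!/bin/sh
# Pull free/min/low/high for one zone out of /proc/zoneinfo-style output.
watermarks=$(awk '
    /zone +Normal/                          { inzone = 1; next }
    inzone && $1 == "pages" && $2 == "free" { free = $3 }
    inzone && $1 == "min"                   { min  = $2 }
    inzone && $1 == "low"                   { low  = $2 }
    inzone && $1 == "high"                  { high = $2; inzone = 0 }
    END { printf "free=%d min=%d low=%d high=%d", free, min, low, high }
' <<'EOF'
Node 0, zone   Normal
  pages free     1451395
        min      4000
        low      5000
        high     6000
EOF
)
echo "$watermarks"
```

Sampling this output periodically (for example from cron) gives the usage trend described above.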


