Transparent huge pages disabled but compact_stall is not null - memory-management

We noticed large JVM pauses during garbage collection where the user and system times are much smaller than the total time: [Times: user=3.99 sys=0.55, real=34.29 secs]. We suspected memory management, so we checked the transparent hugepage configuration, which we read as showing both THP and defrag disabled:
/sys/kernel/mm/redhat_transparent_hugepage/enabled:always [never]
/sys/kernel/mm/redhat_transparent_hugepage/defrag:[always] never
/sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag:[yes] no
However, looking at THP and related counters, we see a lot of compaction stalls:
egrep 'trans|thp|compact_' /proc/vmstat
nr_anon_transparent_hugepages 0
compact_blocks_moved 113682
compact_pages_moved 3535156
compact_pagemigrate_failed 0
compact_stall 1944
compact_fail 186
compact_success 1758
thp_fault_alloc 6
thp_fault_fallback 0
thp_collapse_alloc 15
thp_collapse_alloc_failed 0
thp_split 17
So the question is: why are the THP and compaction stall/fail counters not 0 if THP is disabled, and how can we disable compaction so it does not interfere with our JVM (which we believe is the reason for the long GC pauses)?
This is happening on RHEL 6.2, kernel 2.6.32-279.5.2.el6.x86_64, JVM 6u21 32-bit. Thanks!

To really get rid of THP you must make sure that not only the THP daemon is disabled, but also the THP defrag component. defrag runs independently of THP, while the khugepaged/defrag setting only controls whether the THP daemon may run defrag as well.
That means: even if your applications don't get the (potential) benefit of THP, the defragmentation process that makes your system stall is still active.
It is recommended to use the distribution-independent path for controlling the THP & defrag settings:
/sys/kernel/mm/transparent_hugepage/ (which may be a symlink to /sys/kernel/mm/redhat_transparent_hugepage)
This results in:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
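To double-check which value is actually active, a small sketch (hypothetical helper names, assuming the bracketed-token sysfs format shown above) can parse the files under either path:

```python
import re

def selected(text):
    """Return the bracketed (active) token of a sysfs choice line,
    e.g. 'always madvise [never]' -> 'never'."""
    m = re.search(r"\[(\w+)\]", text)
    return m.group(1) if m else None

def thp_status(base="/sys/kernel/mm/transparent_hugepage"):
    """Report the active value of the THP knobs; the path may be the
    distribution-specific one (e.g. redhat_transparent_hugepage)."""
    status = {}
    for name in ("enabled", "defrag"):
        try:
            with open(f"{base}/{name}") as f:
                status[name] = selected(f.read())
        except FileNotFoundError:
            status[name] = None  # sysfs layout differs on this distro
    return status
```

Both files should report "never" once the two echo commands above have taken effect.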
If you are running a Java application and want to know whether THP/defrag is causing JVM pauses or stalls, it may be worth having a look at your GC log. With -XX:+PrintGCDetails enabled, you may observe "real" times that are significantly longer than the sys/user times.
In my case the following one-liner was sufficient:
grep "sys=0" gc.log | grep "user=0" | grep -P "real=[1-9]"
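A slightly more flexible version of that check can be scripted; the sketch below (hypothetical helper names, assuming the -XX:+PrintGCDetails timing format shown above) flags entries whose wall-clock time dwarfs the CPU time:

```python
import re

# Parse "[Times: user=3.99 sys=0.55, real=34.29 secs]" entries and flag
# pauses where real time far exceeds user+sys, a typical signature of the
# process being stalled outside the JVM (e.g. by compaction).
TIMES = re.compile(r"user=([\d.]+) sys=([\d.]+), real=([\d.]+)")

def suspicious_pauses(lines, factor=5.0, min_real=1.0):
    """Return (user, sys, real) tuples where real >= min_real seconds and
    real exceeds factor * (user + sys)."""
    hits = []
    for line in lines:
        m = TIMES.search(line)
        if not m:
            continue
        user, sys_t, real = map(float, m.groups())
        if real >= min_real and real > factor * (user + sys_t):
            hits.append((user, sys_t, real))
    return hits
```

The pause from the question (user=3.99, sys=0.55, real=34.29) is flagged, since 34.29 is well over 5x the 4.54 seconds of CPU time.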
The earliest description of the negative effects of THP is, as far as I know, this blog post by Greg Rahn: http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

Related

What can be done to lower UE4Editor startup time?

Status: the problem has lessened, but compared to other users' reports it persists.
I have moved to UE 4.27.0 and the startup time dropped from 11 minutes (v4.26.2) to 6 minutes (the RAM usage dropped too!). But that still doesn't compare to the "almost instant" startup other people report...
It is not compiling anything, not even shaders; this is about the 6th time I have run it for this project.
Should I try disabling plugins? I'm new to UE and don't want to complicate my workflow. Though, for example, I have nothing VR-related to test, so that could safely be disabled initially.
HD READ SPEED? NO
I tested moving the whole UE4Editor engine directory (100 GB) to a 3x SSD stripe, but the UE4Editor startup time remained the same. The HDD it currently lives on is also fast, though not as fast as the 3x SSD.
CPU USAGE? MAYBE; could using all 4 cores solve it?
UE4Editor startup uses A SINGLE CORE ONLY. I can confirm with htop and the system monitor: only a single core is used at 100% at a time, and the load hops between the 4 cores.
I tested the -USEALLAVAILABLECORES command line parameter after the project path for UE4Editor, but nothing changed. I read that the option is ignored on some machines, so maybe if I patch its handling it could work on mine?
GPU? PROBABLY NOT
A report about a weak integrated graphics card says it doesn't interfere with the startup time.
LOG for UE4Editor v4.27.0 with the new biggest intervals ("..." means omitted log lines for readability; "!(interval in seconds)" just marks the gap, with no lines omitted around it):
[2021.09.15-23.38.20:677][ 0]LogHAL: Linux SourceCodeAccessSettings: NullSourceCodeAccessor
!22s
[2021.09.15-23.38.42:780][ 0]LogTcpMessaging: Initializing TcpMessaging bridge
[2021.09.15-23.38.42:782][ 0]LogUdpMessaging: Initializing bridge on interface 0.0.0.0:0 to multicast group 230.0.0.1:6666.
!16s
[2021.09.15-23.38.58:158][ 0]LogPython: Using Python 3.7.7
...
[2021.09.15-23.39.01:817][ 0]LogImageWrapper: Warning: PNG Warning: Duplicate iCCP chunk
!75s
[2021.09.15-23.40.16:951][ 0]SourceControl: Source control is disabled
...
[2021.09.15-23.40.26:867][ 0]LogAndroidPermission: UAndroidPermissionCallbackProxy::GetInstance
!16s
[2021.09.15-23.40.42:325][ 0]LogAudioCaptureCore: Display: No Audio Capture implementations found. Audio input will be silent.
...
[2021.09.15-23.41.08:207][ 0]LogInit: Transaction tracking system initialized
!9s
[2021.09.15-23.41.17:513][ 0]BlueprintLog: New page: Editor Load
!23s
[2021.09.15-23.41.40:396][ 0]LocalizationService: Localization service is disabled
...
[2021.09.15-23.41.45:457][ 0]MemoryProfiler: OnSessionChanged
!13s
[2021.09.15-23.41.58:497][ 0]LogCook: Display: CookSettings for Memory: MemoryMaxUsedVirtual 0MiB, MemoryMaxUsedPhysical 16384MiB, MemoryMinFreeVirtual 0MiB, MemoryMinFreePhysical 1024MiB
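The "!Ns" gap annotations above can be computed automatically; a hedged sketch, assuming the timestamp format shown in the excerpt:

```python
import re
from datetime import datetime

# Parse UE4 log timestamps like [2021.09.15-23.38.20:677] and report the
# largest gaps between consecutive timestamped lines.
STAMP = re.compile(r"\[(\d{4}\.\d{2}\.\d{2}-\d{2}\.\d{2}\.\d{2}):(\d{3})\]")

def log_gaps(lines, min_gap=5.0):
    """Return (gap_seconds, line_before, line_after) for every gap of at
    least min_gap seconds between consecutive timestamped lines."""
    prev = None
    gaps = []
    for line in lines:
        m = STAMP.search(line)
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%Y.%m.%d-%H.%M.%S")
        ts = t.timestamp() + int(m.group(2)) / 1000.0
        if prev is not None and ts - prev[0] >= min_gap:
            gaps.append((round(ts - prev[0], 1), prev[1], line.strip()))
        prev = (ts, line.strip())
    return gaps
```

Run over the full log (not just the excerpt), this reproduces the annotated 22s, 16s, 75s, ... intervals without hand-editing.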
SPECS:
I'm using Ubuntu 20.04.
My CPU has 4 cores at 3.6GHz.
GeForce GT 710 1GB.
Related question but for older UE4: https://answers.unrealengine.com/questions/987852/view.html
Unreal Engine needs a high-end PC with a lot of RAM, fast SSDs, a good CPU and a mid-range graphics card. First of all, there are always some shaders that need to be compiled by the engine, and a lot of assets to be loaded at startup. Since you're on Linux, you are probably using a self-compiled Unreal Engine build... not the best thing to do for a newbie, because this may cause several problems with load time, startup, compiling and a lot of other things. If this is your first time using Unreal, try it on Windows; everything is easier there.

how metadata journal stalls write system call?

I am using CentOS 7.1.1503 with kernel 3.10.0-229.el7.x86_64, an ext4 file system in ordered journal mode, and delalloc enabled.
When my app writes logs to a file continuously (about 6 MB/s), the write system call occasionally stalls for 100-700 ms. When I disable ext4's journal feature, set the journal mode to writeback, or disable delayed allocation, the stalls disappear. When I make Linux's writeback more frequent and the dirty page expiry time shorter, the problem is reduced.
I printed the process's stack when the stall happens, and got this:
[<ffffffff812e31f4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0195854>] ext4_da_get_block_prep+0x1a4/0x4b0 [ext4]
[<ffffffff811fbe17>] __block_write_begin+0x1a7/0x490
[<ffffffffa019b71c>] ext4_da_write_begin+0x15c/0x340 [ext4]
[<ffffffff8115685e>] generic_file_buffered_write+0x11e/0x290
[<ffffffff811589c5>] __generic_file_aio_write+0x1d5/0x3e0
[<ffffffff81158c2d>] generic_file_aio_write+0x5d/0xc0
[<ffffffffa0190b75>] ext4_file_write+0xb5/0x460 [ext4]
[<ffffffff811c64cd>] do_sync_write+0x8d/0xd0
[<ffffffff811c6c6d>] vfs_write+0xbd/0x1e0
[<ffffffff811c76b8>] SyS_write+0x58/0xb0
[<ffffffff81614a29>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
I read the Linux kernel source and found that call_rwsem_down_read_failed calls rwsem_down_read_failed, which keeps waiting for the rw_semaphore.
I think the reason is that metadata journal flushing must wait for the related dirty pages to be flushed; when flushing the dirty pages takes a long time the journal is blocked, and since the journal commit holds this inode's rw_semaphore, write system calls to this inode stall.
I really hope I can find evidence to prove it.
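One way to start collecting that evidence is to time each write() and record the outliers; a rough sketch with arbitrary illustrative thresholds (not tuned values):

```python
import os
import time

# Time every append-write on a log file and record the ones that stall.
# A write that normally completes in microseconds but occasionally takes
# 100+ ms is consistent with blocking on the inode's rw_semaphore during
# a journal commit.
def timed_writes(path, chunk=64 * 1024, count=100, threshold_ms=100.0):
    stalls = []
    buf = b"x" * chunk
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for i in range(count):
            t0 = time.monotonic()
            os.write(fd, buf)
            elapsed_ms = (time.monotonic() - t0) * 1000.0
            if elapsed_ms >= threshold_ms:
                stalls.append((i, elapsed_ms))  # candidate journal stall
    finally:
        os.close(fd)
    return stalls
```

While a stall is in progress, dumping the writer's kernel stack from another shell (cat /proc/&lt;pid&gt;/stack, as in the trace above) shows whether it is parked in call_rwsem_down_read_failed.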

Creating many Sockets in ZMQ - too many files error

I am trying to create sockets with the inproc:// transport class from the same context in C.
I can create 2036 sockets; when I try to create more, zmq_socket() returns NULL and zmq_errno reports 24, 'Too many open files'.
How can I create more than 2036 sockets? Especially as inproc forces me to use only one context.
There are several things I don't understand:
- the sockets are eventually inproc-only; why do they take up file descriptors?
- increasing ZMQ_MAX_SOCKETS does not help; the system file limit appears to be the limiting factor
- I am unable to increase the file limit with ulimit on my Mac; no workaround has helped.
// the code is actually in cython and can be found here:
https://github.com/DavoudTaghawiNejad/ABsinthCE
Use zmq_ctx_set():
zmq_ctx_set (context, ZMQ_MAX_SOCKETS, 256);
You can change these using sysctl (tried on Yosemite and El Capitan), but the problem is knowing what to change. Here is a post on this topic: Increasing the maximum number of tcp/ip connections in linux
That's for Linux, and the Mac is based on BSD 4.x, but man pages for sysctl on BSD are available online.
Note: sysctl is a private interface on iOS.
The solution is multi-fold and complex:
inproc does not force you to have a common Context() instance, but it is handy to have one, as the signalling/messaging then goes without any data transfers: just zero-copy pointer manipulation on in-RAM blocks of memory, which is extremely fast.
I started to assemble ZeroMQ-related facts about having some 70,000 ~ 200,000 file descriptors available for "sockets", as supported by O/S kernel settings, but your published aims are higher. Much higher.
Given that your git-published multi-agent ABCE Project paper refers to nanosecond shaving and an HPC-domain grade solution to handle (cit., emphasis added:)
the whopping number of 1,073,545,225 agents, many more than fit into the memory of even the most sophisticated supercomputer, a few hundred thousand file descriptors are not worth spending much time on.
Your Project faces multiple troubles at the same time.
Let's peel the problem layers off, step by step:
File Descriptors (FD) -- Linux O/S level -- System-wide Limits:
To raise the system-wide limit, edit the /etc/sysctl.conf file:
# vi /etc/sysctl.conf
Append a config directive as follows:
fs.file-max = 100000
Save and close the file.
Users need to log out and log back in again for the changes to take effect, or just type the following command:
# sysctl -p
Verify your settings with command:
# cat /proc/sys/fs/file-max
(Max) User-specific File Descriptors (FD) Limits:
Each user additionally has a set of (soft-limit, hard-limit) values:
# su - ABsinthCE
$ ulimit -Hn
$ ulimit -Sn
However, you can restrict the ABsinthCE user (or any other) to specific limits by editing the /etc/security/limits.conf file:
# vi /etc/security/limits.conf
where you set the respective soft and hard limits for the ABsinthCE user as needed:
ABsinthCE soft nofile 123456
ABsinthCE hard nofile 234567
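If raising limits system-wide is not possible (as on the asker's Mac), the process itself can often raise its own soft limit up to the current hard limit. A hedged sketch using Python's stdlib resource module (relevant here since the project's code is Cython, so this applies in-process; the hard limit itself can only be raised by root / limits.conf as shown above):

```python
import resource

def raise_nofile_soft_limit(target=None):
    """Raise this process's soft RLIMIT_NOFILE toward the hard limit and
    return the resulting (soft, hard) pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = hard if target is None else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        pass  # e.g. macOS caps NOFILE below the nominal hard limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

On macOS in particular, the kernel's kern.maxfilesperproc cap may reject values the nominal hard limit would suggest, hence the defensive except clause.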
All that is not for free: each file descriptor takes up some kernel memory, so at some point you may, and will, exhaust it. A few hundred thousand file descriptors are no trouble for server deployments where event-based (epoll on Linux) server architectures are used. But simply forget about trying to grow this anywhere near the said 1,073,545,225 level.
Today, one can have a private HPC machine (not a Cloud illusion) with ~50-500 TB of RAM.
But still, the multi-agent Project application architecture ought to be re-defined so as not to fail on extreme resource allocations (incurred just due to a forgiving syntax simplicity).
Professional multi-agent simulators are, precisely because of extreme scaling, very, VERY CONSERVATIVE about per-agent instance resource locking.
So the best results are to be expected (both performance-wise and latency-wise) when using direct memory-mapped operations. The ZeroMQ inproc:// transport class is fine and does not require a Context() instance to allocate an IO thread (as there is no data pump at all if using just the inproc:// transport class), which is very efficient for a fast prototyping phase. The same approach becomes risky when growing the scale much higher, towards the levels expected in production.
Latency shaving and accelerated-time simulator throughput scaling are the next set of targets, both for raising the static scale of the multi-agent based simulations and for increasing the simulator performance.
For serious nanoseconds hunting, follow the insights on HPC from the excellent Bloomberg guru, John Lakos.
Either pre-allocate (a common best practice in the RTOS domain) and do not allocate at all at runtime, or follow John's fabulous testing-supported insights presented at ACCU 2017.

ext4 commit= mount option and dirty_writeback_centisecs

I'm trying to understand the way bytes go from write() to the physical disk platter, to tune my picture server's performance.
The thing I don't understand is the difference between these two: the commit= mount option and dirty_writeback_centisecs. They look like they concern the same process of writing changes to the storage device, but are still different.
It is not clear to me which one fires first on my bytes' way to the disk.
Yeah, I just ran into this investigating mount options for an SDCard Ubuntu install on an ARM Chromebook. Here's what I can tell you...
Here's how to see the dirty and writeback amounts:
user@chrubuntu:~$ cat /proc/meminfo | grep "Dirty" -A1
Dirty: 14232 kB
Writeback: 4608 kB
(edit: these Dirty and Writeback figures are rather high; I had a compile running when I ran this.)
So data waiting to be written out is "dirty". Dirty data can still be eliminated (if, say, a temporary file is created, used, and deleted before it reaches writeback, it never has to be written out). As dirty data is moved into writeback, the kernel tries to combine smaller dirty requests into single larger I/O requests; this is one reason why dirty_expire_centisecs is usually not set too low. Dirty data is usually put into writeback when a) enough data is cached to reach vm.dirty_background_ratio, or b) the data becomes vm.dirty_expire_centisecs centiseconds old (the default of 3000 means 30 seconds). Per vm.dirty_writeback_centisecs, a writeback daemon runs by default every 500 centiseconds (5 seconds) to actually flush out anything in writeback.
fsync will flush out an individual file (force it from dirty into writeback and wait until it has been flushed out of writeback), and sync does that with everything. As far as I know, it does this ASAP, bypassing any attempt to balance disk reads and writes; it stalls the device doing 100% writes until the sync completes.
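A minimal illustration of the per-file path just described, using only the stdlib (hypothetical helper name):

```python
import os

# write() leaves the data dirty in the page cache; fsync() forces it
# through writeback to the device before returning.
def durable_append(path, data: bytes):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)   # data is now dirty in the page cache
        os.fsync(fd)         # block until it is flushed out of writeback
    finally:
        os.close(fd)
```

Without the fsync() call, the data would sit dirty until dirty_expire_centisecs, dirty_background_ratio, or the filesystem's commit interval pushed it out.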
The default ext4 mount option commit=5 effectively forces a sync of that filesystem every 5 seconds. This is intended to ensure that writes are not unduly delayed if there's heavy read activity (ideally losing a maximum of 5 seconds of data if power is cut or whatever). What I found with an Ubuntu install on an SDCard (in a Chromebook) is that this actually just leads to massive filesystem stalls roughly every 5 seconds if you're writing much to the card; ChromeOS uses commit=600 and I applied that on the Ubuntu side to good effect.
dirty_writeback_centisecs configures the Linux kernel daemons related to virtual memory (hence the vm prefix), which are in charge of writing back from RAM to all the storage devices. So if you configure dirty_writeback_centisecs and you have 25 different storage devices mounted on your system, all 25 will share the same writeback interval.
The commit interval, in contrast, applies per storage device (actually per filesystem) and is related to the sync process rather than the virtual memory daemons.
So you can see it as:
dirty_writeback_centisecs
periodic writeback from RAM to all filesystems
commit
each filesystem flushes its own data from RAM

redis bgsave failed because fork Cannot allocate memory

all:
here is my server's memory info from 'free -m':
             total       used       free     shared    buffers     cached
Mem:         64433      49259      15174          0          3         31
-/+ buffers/cache:      49224      15209
Swap:         8197        184       8012
My redis-server is using 46G of memory, so there is almost 15G of free memory left.
As far as I know, fork is copy-on-write; it should not fail when there is 15G of free memory, which is enough to malloc the necessary kernel structures.
Besides, when redis-server used 42G of memory, bgsave worked and fork succeeded.
Is there any vm parameter I can tune to make fork succeed?
More specifically, from the Redis FAQ
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
Redis doesn't need as much memory as the OS thinks it does to write to disk, so the OS may pre-emptively fail the fork.
Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then apply the setting:
On FreeBSD:
sudo /etc/rc.d/sysctl reload
On Linux:
sudo sysctl -p /etc/sysctl.conf
From proc(5) man pages:
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4, any non-zero value implies mode 1. In mode 2 (available since Linux 2.6), the total virtual address space on the system is limited to (SS + RAM*(r/100)), where SS is the size of the swap space, RAM is the size of the physical memory, and r is the contents of the file /proc/sys/vm/overcommit_ratio.
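As a worked example of the mode-2 formula against the asker's numbers (reading the free output above as roughly 64 GB RAM and 8 GB swap):

```python
# Mode-2 formula from the man page: CommitLimit = SS + RAM * (r / 100)
def commit_limit_gb(swap_gb, ram_gb, overcommit_ratio):
    return swap_gb + ram_gb * (overcommit_ratio / 100.0)

# With 8 GB swap, 64 GB RAM, and the default overcommit_ratio of 50,
# mode 2 would allow about 40 GB of commit: less than the 46 GB the
# redis-server already uses, so mode 2 would be even stricter than the
# default mode 0 here. Mode 1 is the appropriate choice.
```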
Redis's fork-based snapshotting can effectively double physical memory usage and easily OOM in cases like yours. Relying on Linux virtual memory for snapshotting is problematic, because Linux has no visibility into Redis data structures.
Recently a new Redis-compatible project, Dragonfly, was released. Among other things, it solves the OOM problem entirely. (Disclosure: I am the author of this project.)
