Creating many Sockets in ZMQ - too many files error - zeromq

I am trying to create sockets with inproc:// transport class from the same context in C.
I can create 2036 sockets, when I try to create more zmq_socket() returns NULL and the zmq_errno says 24 'Too many open files'.
How can I create more than 2036 sockets? Especially as inproc forces me to use only one context.
There are several things I don't understand:
- the sockets are eventually turned to inproc, why does it take up files?
- Increasing ZMQ_MAX_SOCKETS does not help, the system file limit appears to be the limiting factor
- I am unable to increase the file limit with ulimit on my Mac, no workaround helped.
// the code is actually in cython and can be found here:
https://github.com/DavoudTaghawiNejad/ABsinthCE

Use zmq_ctx_set():
zmq_ctx_set (context, ZMQ_MAX_SOCKETS, 256);

You can change these using sysctl ( tried on Yosemite and El Capitan ), but the problem is what to change. Here is a post on this topic: Increasing the maximum number of tcp/ip connections in linux
That's on Linux, and the Mac is based on BSD 4.x, but man pages for sysctl on BSD are available online.
Note: sysctl is a private interface on iOS.

Solution is multi-fold complex:
inproc does not force you to have a common Context() instance, but it is handy to have one, as the signalling / messaging goes without any data-transfers, just by Zero-copy, pointer manipulations for in-RAM blocks of memory, which is extremely fast.
I started to assemble ZeroMQ-related facts about having some 70.000 ~ 200.000 file-descriptors available for "sockets", as supported by O/S kernel settings, but your published aims are higher. Much higher.
Given your git-published Multi-agent ABCE Project paper refers to nanosecond shaving, HPC-domain grade solution to have ( cit. / emphasis added: )
the whopping number of 1.073.545.225, many more agents than fit into the memory of even the most sophisticated supercomputer, some small hundreds of thousands of file-descriptors are not much worth spending time with.
Your Project faces multiple troubles at the same time.
Let's peel the problem layers off, step by step:
File Descriptors (FD) -- Linux O/S level -- System-wide Limits:
To see the actual as-is state:
edit /etc/sysctl.conf file
# vi /etc/sysctl.conf
Append a config directive as follows:
fs.file-max = 100000
Save and close the file.
Users need to log out and log back in again to changes take effect or just type the following command:
# sysctl -p
Verify your settings with command:
# cat /proc/sys/fs/file-max
( Max ) User-specific File Descriptors (FD) Limits:
Each user has additionally a set of ( soft-limit, hard-limit ):
# su - ABsinthCE
$ ulimit -Hn
$ ulimit -Sn
However, you can limit your ABsinthCE user ( or any other ) to any specific limits by editing /etc/security/limits.conf file, enter:
# vi /etc/security/limits.conf
Where you set ABsinthCE user the respective soft- and hard-limit as needed:
ABsinthCE soft nofile 123456
ABsinthCE hard nofile 234567
All that is not for free - each file descriptor takes up some kernel memory, so at some point you may and you will exhaust it. A few hundred thousands file descriptors are not trouble for server deployments, where event-based ( epoll on Linux ) server architectures are used. But simply straight forget to try to grow this anywhere near the said 1.073.545.225 level.
Today,one can have a private HPC machine ( not a Cloud illusion ) with ~ 50-500 TB RAM.
But still, the multi-agent Project application architecture ought be re-defined, not to fail on extreme resources allocations ( just due to a forgiving syntax simplicity ).
Professional Multi-Agent simulators are right due to extreme scaling very, VERY CONSERVATIVE on per-Agent instance resource-locking.
So the best results are to be expected ( both performance-wise and latency-wise ) when using direct memory-mapped operations. ZeroMQ inproc:// transport-class is fine and does not require a Context() instance to allocate IO-thread ( as there is no data-pump at all, if using just inproc:// transport-class ), which is very efficient for a fast prototyping phase. The same approach will become risky for growing the scales much higher towards the levels expected in production.
Latency-shaving and accelerated-time simulator operations throughput scaling is the next set of targets, for raising both the Multi-Agent based simulations static scales and increasing the simulator performance.
For a serious nanoseconds huntingfollow the excellent Bloomberg's guru, John Lakos, insights on HPC.
Either pre-allocate ( as a common Best Practice in RTOS domain) and do not allocate at all, or follow John's fabulous testing-supported insights presented on ACCU 2017.

Related

What can be done to lower UE4Editor startup time?

Status: the problem lowered, but compared to other users reports it persists.
I have moved to UE4.27.0 and the startup time lowered from 11 (v4.26.2) to 6 minutes! (the RAM usage lowered too!) But doesnt compare to the speed other ppl report "almost instantly"...
It is not compiling anything, not even shaders, it is like the 6th time I run it for one project.
Should I try to disable plugins? but Im new with UE and dont want to difficult my usage. Tho, for ex., I have nothing VR related to test so it could really be initially disabled.
HD READ SPEED? NO
I have tested moving UE4Editor whole engine path (100GB) to a 3xSSD(Stripes), but the UE4Editor startup time remained the same. My HD were it is too, is fast but not so fast as the 3xSSD.
CPU USAGE? MAY BE if it could use 4 cores could solve it?
UE4Editor startup uses A SINGLE CORE ONLY, i can confirm with htop and system monitor, it is possible to see only a single core being used 100% and it changes between the 4 cores, so only one is used at 100% per time.
I tested this command line parameter -USEALLAVAILABLECORES after the project URL for UE4Editor, but nothing changed. I read that option is ignored in some machines, so may be if I patch it's usage it could work on mine?
GPU? no?
a report about an integrated graphics card (weak one) says it doesnt interfere with the startup time.
LOG for UE4Editor v4.27.0 with the new biggest intervals ("..." means ommited log lines to make it easier to read; "!(interval in seconds)" is just to easy reading it (no ommitted lines here)):
[2021.09.15-23.38.20:677][ 0]LogHAL: Linux SourceCodeAccessSettings: NullSourceCodeAccessor
!22s
[2021.09.15-23.38.42:780][ 0]LogTcpMessaging: Initializing TcpMessaging bridge
[2021.09.15-23.38.42:782][ 0]LogUdpMessaging: Initializing bridge on interface 0.0.0.0:0 to multicast group 230.0.0.1:6666.
!16s
[2021.09.15-23.38.58:158][ 0]LogPython: Using Python 3.7.7
...
[2021.09.15-23.39.01:817][ 0]LogImageWrapper: Warning: PNG Warning: Duplicate iCCP chunk
!75s
[2021.09.15-23.40.16:951][ 0]SourceControl: Source control is disabled
...
[2021.09.15-23.40.26:867][ 0]LogAndroidPermission: UAndroidPermissionCallbackProxy::GetInstance
!16s
[2021.09.15-23.40.42:325][ 0]LogAudioCaptureCore: Display: No Audio Capture implementations found. Audio input will be silent.
...
[2021.09.15-23.41.08:207][ 0]LogInit: Transaction tracking system initialized
!9s
[2021.09.15-23.41.17:513][ 0]BlueprintLog: New page: Editor Load
!23s
[2021.09.15-23.41.40:396][ 0]LocalizationService: Localization service is disabled
...
[2021.09.15-23.41.45:457][ 0]MemoryProfiler: OnSessionChanged
!13s
[2021.09.15-23.41.58:497][ 0]LogCook: Display: CookSettings for Memory: MemoryMaxUsedVirtual 0MiB, MemoryMaxUsedPhysical 16384MiB, MemoryMinFreeVirtual 0MiB, MemoryMinFreePhysical 1024MiB
SPECS:
I'm using ubuntu 20.04.
My CPU is 4 cores 3.6GHz.
GeForce GT 710 1GB.
Related question but for older UE4: https://answers.unrealengine.com/questions/987852/view.html
Unreal Engine needs a high-end pc with a lot of RAM, fast SSD's, a good CPU and a medium graphic card. First of all there are always some shaders that needs to be compiled from the engine, and a lot of assets to be loaded in the startup time. As I can see you're on Linux you are probably using a self-compiled Unreal Engine version.... not the best thing to do for a newbie, because this may cause several problems on load time, startup, compiling and a lot of other stuff. If it's the first times you're using Unreal, try using it on Windows, it's all easier.

Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode?

I want to play around with cache sizes in my gem5 simulator to see how it affects performance of programs, and possibly tune programs at runtime.
As a sanity check, I tried to check that the command lines arguments I used were working , and so I tried the various methods proposed at: https://superuser.com/questions/55776/finding-l2-cache-size-in-linux/1298808#1298808
cat /sys/devices/system/cpu/cpu0/cache/index2/size
getconf LEVEL2_CACHE_SIZE
But I observed that:
the file /sys/devices/system/cpu/cpu0/cache/index2/size does not exist
getconf is empty
Why is that?
I am certain however that the caches are being, since I've benchmarked simple programs, and the cycle counts increase when I decrease the caches.
For example, my base command is:
M5_PATH='/data/git/linux-kernel-module-cheat/gem5/gem5-system' '/data/git/linux-kernel-module-cheat/gem5/gem5/build/ARM/gem5.opt' '/data/git/linux-kernel-module-cheat/gem5/gem5/configs/example/fs.py' --command-line='earlyprintk=pl011,0x1c090000 console=ttyAMA0 lpj=19988480 rw loglevel=8 mem=512MB root=/dev/sda nokaslr norandmaps printk.devkmsg=on printk.time=y' --disk-image='/data/git/linux-kernel-module-cheat/buildroot/output.arm-gem5~/images/rootfs.ext2' --dtb-file='/data/git/linux-kernel-module-cheat/gem5/gem5/system/arm/dt/armv7_gem5_v1_1cpu.dtb' --kernel='/data/git/linux-kernel-module-cheat/buildroot/output.arm-gem5~/build/linux-custom/vmlinux' --machine-type=VExpress_GEM5_V1 --num-cpus=1 --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024 --cpu-type=HPI
With those tiny caches, running the following:
m5 resetstats && dhrystone 10000 && m5 dumpstats
takes 175M cycles, and only 16M cycles if I use the exact same command but with huge caches of size 1024MB.
I observe a similar behavior for x86.
I'm using this testing infrastructure: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/05d8a324f74849f03404eb847f8da748e2e4502c#gem5-change-system-parameters which implies:
gem5 commit: fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4
Linux kernel v4.15 with configuration: https://github.com/cirosantilli/linux-kernel-module-cheat/blob/05d8a324f74849f03404eb847f8da748e2e4502c/kernel_config_arm-gem5
Related thread on the mailing list: http://gem5-users.gem5.narkive.com/4xVBlf3c/verify-cache-configuration
For comparison, QEMU v2.11.0 x86 did show the cache sizes, but not the ARM one.
Maybe for ARM we would need to modify the bootloaders to tell that to kernel? But I don't know how those things work very well:
https://github.com/gem5/gem5/blob/fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4/system/arm/simple_bootloader/simple.S
https://github.com/gem5/gem5/blob/fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4/system/arm/aarch64_bootloader/boot.S
I have been told that:
gem5 doesn't implement the cache size discovery registers.
The problem is that it is really hard to configure them in the general case, and they might not even be able to represent the hierarchy in gem5.

How to set intel_idle.max_cstate=0 to disable c-states?

I would like to disable c-states on my computer.
I disabled c-state on BIOS but I don't obtain any result. However, I found an explanation :
"Most newer Linux distributions, on systems with Intel processors, use the “intel_idle” driver (probably compiled into your kernel and not a separate module) to use C-states. This driver uses knowledge of the various CPUs to control C-states without input from system firmware (BIOS). This driver will mostly ignore any other BIOS setting and kernel parameters"
I found two solutions to solve this problem but I don't know how to apply:
1) " so if you want control over C-states, you should use kernel parameter “intel_idle.max_cstate=0” to disable this driver."
I don't know neither how I can check the value (of intel_idle.max_cstate ) and neither how I can change its value.
2) "To dynamically control C-states, open the file /dev/cpu_dma_latency and write the maximum allowable latency to it. This will prevent C-states with transition latencies higher than the specified value from being used, as long as the file /dev/cpu_dma_latency is kept open. Writing a maximum allowable latency of 0 will keep the processors in C0"
I can't read the file cpu_dma_latency.
Thanks for your help.
Computer:
Intel Xeon CPU E5-2620
Gnome 2.28.2
Linux 2.6.32-358
To alter the value at boot time, you can modify the GRUB configuration or edit it on the fly -- the method to modify that varies by distribution. This is the Ubuntu documentation to change kernel parameters either for a single boot, or permanently. For a RHEL-derived distribution, I don't see docs that are quite as clear, but you directly modify /boot/grub/grub.conf to include the parameter on the "kernel" lines for each bootable stanza.
For the second part of the question, many device files are read-only or write-only. You could use a small perl script like this (untested and not very clean, but should work) to keep the file open:
#!/usr/bin/perl
use FileHandle;
my $fd = open (">/dev/cpu_dma_latency");
print $fd "0";
print "Press CTRL-C to end.\n";
while (1) {
sleep 5;
}
Redhat has a C snippet in a KB article here as well and more description of the parameter.

redis bgsave failed because fork Cannot allocate memory

all:
here is my server memory info with 'free -m'
total used free shared buffers cached
Mem: 64433 49259 15174 0 3 31
-/+ buffers/cache: 49224 15209
Swap: 8197 184 8012
my redis-server has used 46G memory, there is almost 15G memory left free
As my knowledge,fork is copy on write, it should not failed when there has 15G free memory,which is enough to malloc necessary kernel structures .
besides, when redis-server used 42G memory, bgsave is ok and fork is ok too.
Is there any vm parameter I can tune to make fork return success ?
More specifically, from the Redis FAQ
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 says Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
Redis doesn't need as much memory as the OS thinks it does to write to disk, so may pre-emptively fail the fork.
Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then restart sysctl with:
On FreeBSD:
sudo /etc/rc.d/sysctl reload
On Linux:
sudo sysctl -p /etc/sysctl.conf
From proc(5) man pages:
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4
any non-zero value implies mode 1. In mode 2 (available since Linux 2.6), the total virtual address space on the system is limited to (SS + RAM*(r/100)), where SS is the size
of the swap space, and RAM is the size of the physical memory, and r is the contents of the file /proc/sys/vm/overcommit_ratio.
Redis's fork-based snapshotting method can effectively double physical memory usage and easily OOM in cases like yours. Reliance on linux virtual memory for doing snapshotting is problematic, because Linux has no visibility into Redis data structures.
Recently a new redis-compatible project Dragonfly has been released. Among other things, it solves the OOM problem entirely. (disclosure - I am the author of this project).

what is the size of windows semaphore object?

How to find size of a semaphore object in windows?
I tried using sizeof() but we cannot give name of the sempahore object as an argument to sizeof. It has to be the handle. sizeof(HANDLE) gives us the size of handle and not semaphore.
This what is known as an "opaque handle.". There is no way to know how big it really is, what it contains or how any of the functions work internally. This gives Microsoft the ability to completely rewrite the implementation with each new version of Windows if they want to without worrying about breaking existing code. It's a similar concept to having a public and private interface to a class. Since we are not working on the Windows kernel, we only get to see the public interface.
Update:
It might be possible to get a rough idea of how big they are by creating a bunch and monitoring what happens to your memory usage in Process Explorer. However, since there is a good chance that they live in the kernel and not in user space, it might not show up at all. In any case, there are no guarantees about any other version of Windows, past or future, including patches/service packs.
It's something "hidden" from you. You can't say how big it is. And it's a kernel object, so it probably doesn't even live in your address space. It's like asking "how big is the Process Table?", or "how many MB is Windows wasting?".
I'll add that I have made a small test on my Windows 7 32 bits machine: 100000 kernel semaphores (with name X{number} with 0 <= number < 100000)) : 4 mb of kernel memory and 8 mb of user space (both measured with Task Manager). It's about 40 bytes/semaphore in kernel space and 80 bytes/semaphore in user space! (this in Win32... In 64 bits it'll probably double)

Resources